Newest 'gradient-descent' Questions

6 votes

1 answer

150 views

Why do “good” loss functions in ML need both Lipschitz continuity and smoothness?

I’m trying to understand the common assumptions in machine-learning optimization theory, where a “well-behaved” loss function is often required to be both L-Lipschitz and β-smooth (i.e., have β-...

Antonios Sarikas

881

asked Nov 26 at 17:39

2 votes

0 answers

25 views

What causes the degradation problem - the higher training error in much deeper networks?

In the paper "Deep Residual Learning for Image Recognition", it's been mentioned that "When deeper networks are able to start converging, a degradation problem has been exposed: with ...

Vignesh N

21

asked Oct 11 at 12:28

0 votes

0 answers

41 views

Why does LightGBM use the factor (1-a)/b in GOSS?

LightGBM is a specific implementation of gradient boosted decision trees. One notable difference is how samples used for calculating variance gain in split points are picked. In the algorithm, ...

yanis-falaki

1

asked Jul 21 at 7:15

2 votes

1 answer

59 views

Running SGD multiple times and picking the best result: keywords / name for this practice?

When fitting neural networks, I often run stochastic gradient descent multiple times and take the run with the lowest training loss. I'm trying to look up research literature on this practice, but I'm ...

Jacob Maibach

147

asked Jun 10 at 20:26

4 votes

1 answer

89 views

Stochastic Gradient Descent for Multilayer Networks

I was going through the algorithm for Stochastic Gradient Descent in multilayer networks from the book Machine Learning by Tom Mitchell, and it shows the formulae for the weight update rule. However, I don't ...

Machine123

49

asked May 10 at 14:13

10 votes

3 answers

2k views

Is Backpropagation faulty?

Consider a neural network with 2 or more layers. After we update the weights in layer 1, the input to layer 2 ($a^{(1)}$) has changed, so ∂z/∂w is no longer correct, as z has changed to z* and z* $\...

Yaron

109

asked Apr 27 at 0:44

1 vote

0 answers

85 views

Cost function behaves erratically [closed]

Trying to learn basic machine learning, I wrote my own code for logistic regression where I minimize the usual log likelihood using gradient descent. This is the plot of the error function through a ...

user470820

11

asked Mar 27 at 14:14

2 votes

0 answers

55 views

Estimator bias implies Gradient Bias

Say I have a biased estimator, for example estimating $\log \mathbb{E}[f_\theta(x)]$ using Monte Carlo. Does this imply that $\nabla_\theta \log \mathbb{E}[f_\theta(x)]$ is also biased if estimated ...

Alberto

1,561

asked Mar 9 at 21:35
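
The bias premise in the excerpt above can be checked exactly on a toy case. The sketch below is not from the question: it assumes a hypothetical two-point variable standing in for $f_\theta(x)$ (values 1 and 3, equally likely) and enumerates every possible sample instead of drawing random ones. By Jensen's inequality, the plug-in estimate (the log of the sample mean) should underestimate $\log \mathbb{E}[f_\theta(x)]$ on average, with the gap shrinking as the sample size grows. It does not address the gradient part of the question.

```python
import itertools
import math

# Hypothetical toy distribution (an assumption for illustration):
# f takes the values 1 and 3 with probability 1/2 each.
values = [1.0, 3.0]
true_log_mean = math.log(sum(values) / len(values))  # log E[f]

def expected_plugin_estimate(n):
    """Exact expectation of log(sample mean) over all equally likely samples of size n."""
    samples = list(itertools.product(values, repeat=n))
    return sum(math.log(sum(s) / n) for s in samples) / len(samples)

# Bias = E[log(sample mean)] - log E[f]; negative means underestimation.
bias_n2 = expected_plugin_estimate(2) - true_log_mean
bias_n8 = expected_plugin_estimate(8) - true_log_mean
print(f"bias with n=2: {bias_n2:.4f}")
print(f"bias with n=8: {bias_n8:.4f}")
```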

5 votes

1 answer

117 views

For a linear problem $Ax=b$, is gradient descent a lot faster than least squares (any approach)?

Context: There are many methods to solve least squares, but most of them involve $k n^3$ flops. Using gradient descent, one runs $A x_i$ and uses the error to update $x_{i+1} = x_i - c \times \mathrm{g}...

uranus

51

asked Mar 4 at 16:49
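
The update described in the excerpt above can be sketched in a few lines. The excerpt is truncated, so the following is an assumption rather than the asker's exact scheme: it uses the standard gradient of $\frac{1}{2}\lVert Ax-b\rVert_2^2$, namely $A^T(Ax-b)$, a fixed step size `c`, and a hypothetical 2x2 system chosen only for illustration.

```python
# Hypothetical 2x2 system (an assumption for illustration); its solution is x = [1, 2].
A = [[2.0, 0.0],
     [0.0, 1.0]]
b = [2.0, 2.0]

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def gradient_descent(A, b, c=0.2, steps=200):
    """Minimise 0.5 * ||A x - b||^2 with a fixed step size c."""
    At = transpose(A)
    x = [0.0] * len(A[0])
    for _ in range(steps):
        residual = [p - q for p, q in zip(matvec(A, x), b)]  # A x_i - b
        grad = matvec(At, residual)                           # A^T (A x_i - b)
        x = [x_j - c * g_j for x_j, g_j in zip(x, grad)]     # x_{i+1} = x_i - c * grad
    return x

x_hat = gradient_descent(A, b)
print(x_hat)
```

Each iteration costs two matrix-vector products, so the per-step cost is proportional to the number of entries of $A$. How many steps are needed depends on the step size and on the conditioning of $A^T A$.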

0 votes

0 answers

53 views

Nesterov Accelerated Gradient Descent Stalling with High Regularization in Extreme Learning Machine

I'm implementing Nesterov Accelerated Gradient Descent (NAG) on an Extreme Learning Machine (ELM) with one hidden layer. My loss function is the Mean Squared Error (MSE) with $L^2$ regularization. The ...

Paolo Pedinotti

1

asked Mar 2 at 13:48

3 votes

1 answer

80 views

Do deep learning frameworks "look ahead" when calculating gradient in Nesterov optimization?

The whole point behind Nesterov optimization is to calculate the gradient not at the current parameter values $\theta_t$, but at $\theta_t + \beta m$, where $\beta$ is the momentum coefficient and $m$ ...

Antonios Sarikas

881

asked Feb 22 at 22:20
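
The look-ahead step described in the excerpt above can be written out explicitly. This is a minimal sketch of the textbook formulation only: the toy objective $J(\theta) = \frac{1}{2}\theta^2$, the learning rate, and the function names are assumptions for illustration, and it makes no claim about how any particular framework implements Nesterov momentum internally.

```python
def grad(theta):
    """Gradient of the hypothetical toy objective J(theta) = 0.5 * theta**2."""
    return theta

def nesterov(theta, lr=0.1, beta=0.9, steps=300):
    m = 0.0  # momentum (velocity) term
    for _ in range(steps):
        lookahead = theta + beta * m         # theta_t + beta * m: where the gradient is taken
        m = beta * m - lr * grad(lookahead)  # momentum update uses the look-ahead gradient
        theta = theta + m                    # parameter update
    return theta

theta_final = nesterov(5.0)
print(theta_final)
```

Replacing `grad(lookahead)` with `grad(theta)` would turn this into classical (heavy-ball) momentum, which is the distinction the question is about.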

0 votes

0 answers

55 views

If the main benefit of BatchNorm is loss landscape smoothing, why do we use z-score normalisation instead of min-max?

According to recent papers, the main reason BatchNorm works is that it smooths the loss landscape. So if the main benefit is loss landscape smoothing, why do we need mean subtraction at all? ...

FadiBenz

31

asked Jan 31 at 10:00

1 vote

0 answers

62 views

Do weights update less towards the start of a neural network?

That is, because the error is coming from the end of the neural network (i.e., at the output layer) and trickles back via backpropagation to the start of the neural network, does that mean that the ...

Null Six

11

asked Jan 22 at 18:10

14 votes

6 answers

3k views

Why are so many problems linear and how would one solve nonlinear problems?

I am taking a deep learning in Python class this semester and we are doing linear algebra. Last lecture we "invented" linear regression with gradient descent (did least squares the lecture ...

Lukas

141

asked Jan 4 at 19:41

1 vote

0 answers

45 views

Batch Normalization and the effect of scaled weights on the gradients

I have been reading the following paper: https://arxiv.org/pdf/1706.05350, and I am having a hard time with some claims and derivations made in the paper. First of all, the main thing I am interested ...

kklaw

554

asked Dec 26, 2024 at 11:21

1 vote

0 answers

58 views

Why do machine learning courses on regression mostly focus on gradient descent although we have the closed form estimator $(X'X)^{-1}X'Y$? [duplicate]

In many online machine learning courses and videos (such as Andrew Ng's Coursera course), when it comes to regression (for example regressing $Y$ on features $X$), although we have the closed form ...

ExcitedSnail

3,090

asked Dec 13, 2024 at 14:08

3 votes

1 answer

243 views

Custom Loss function Overfits to the Wrong Output but MSE Doesn't

I have a simple function that I want to approximate with a neural network: N(1) = -1 N(2) = -1 N(3) = 1 N(4) = -1 Instead of using the MSE or cross-entropy losses, ...

Andrew Baker

33

asked Nov 14, 2024 at 2:05

0 votes

0 answers

48 views

Is there a compact set of sufficient and necessary criteria a function can have that guarantees that gradient descent finds global minima?

We obviously need the function in question to be differentiable for this notion to make sense. Now, convexity is a sufficient condition. (strong) Quasi convexity is a weaker, but I think still ...

hmmmmmmm

511

asked Oct 28, 2024 at 9:46

1 vote

0 answers

49 views

Using XGBoost as a submodel for a hybrid ML/process-based model

I am trying to fit a Chapman-Richards growth curve: $$ B = A*(1-e^{-kt}) $$ Where B is the biomass of a forest, A is the asymptote, k is the growth rate, and t is forest age. I expect the growth rate ...

Ana Catarina Vitorino

43

asked Oct 25, 2024 at 14:17

1 vote

0 answers

80 views

The gradient method based attack does not seem to make sense for neural networks because the training error is non-convex

There are several gradient-based attack methods. Let $J$ be the training error, then for instance the projected gradient attack is, $$ \widetilde{x} = \Pi( x + \epsilon \nabla_x J(\theta, x, y) ) $$ ...

Your neighbor Todorovich

707

asked Oct 11, 2024 at 14:53

1 vote

0 answers

40 views

(MacKay) How can regularization constants be optimized on-line in a tractable way?

In Chapter 44, "Supervised Learning in Multilayer Networks" (page 531), of his book Information Theory, Inference, and Learning Algorithms, MacKay claims that an advantage of ...

Mashe Burnedead

111

asked Oct 9, 2024 at 22:40

0 votes

0 answers

138 views

Should the target be standardized in gradient descent?

Suppose that we have a general loss function that depends on some parameters $w$ (e.g. neural network weights): $$L_w =\frac{1}{N} \sum_i \ell(\hat{y}_i, y_i)$$ Is it beneficial to standardize the ...

Antonios Sarikas

881

asked Aug 2, 2024 at 23:09

1 vote

1 answer

90 views

Adding a baseline parameter in the derivation of the Gradient Bandit Algorithm

In the derivation of the Gradient Bandit Algorithm in Chapter 2.8 of the Reinforcement Learning book by Sutton & Barto they introduce a baseline term $B_t$ and I can't seem to figure ...

Rafay Khan

121

asked Jul 16, 2024 at 11:02

1 vote

2 answers

136 views

ADALINE simple implementation with 2 features bug

I am reading the book Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka. While plotting the decision boundary (a line in this case, since the number of features considered = 2) I can't ...

tripma

21

asked Jul 2, 2024 at 4:23

2 votes

1 answer

89 views

Do convergence rates for (convex) gradient descent apply when domain is (convex) subset of reals?

I have a convex multi-variate optimization problem where each variable lies on the domain $[x, \infty)$ for some positive number $x$. I know the problem has a unique finite solution in the domain, ...

BaileyA

133

asked Jun 28, 2024 at 12:59

1 vote

0 answers

81 views

SVRG vs full gradient descent

Stochastic gradient descent allows us to avoid the computation of full gradients at the expense of introducing a noise floor to convergence. To decrease this noise floor, SGD requires a decrease in ...

hegash

111

asked Jun 7, 2024 at 20:06

1 vote

1 answer

131 views

Problem with high learning rate in model training

Fig. 6.5 of these lecture notes illustrates the effect of different learning rates on the training loss: I don't understand why it seems a high enough non-divergent learning rate can converge to ...

Sam

413

asked Apr 27, 2024 at 15:00

0 votes

1 answer

128 views

Question on the Partial Derivative of the Cross-Entropy Loss in SGD for Neural Networks

I'm currently learning about neural networks and stumbled upon a confusion related to the use of Stochastic Gradient Descent (SGD) in training. Specifically, I'm puzzled about the computation of the ...

John Title

asked Mar 30, 2024 at 11:27

0 votes

2 answers

447 views

Why do we maximize likelihood (sum of logs) and not simply maximize sum of probabilities? [duplicate]

In logistic regression we find the maximum likelihood estimator - $\max \prod_{i} p(y_i \mid x_i)$. Which in practice means maximizing the sum of log likelihoods. This makes sense, I understand MLE. ...

Amnon Attali

9

asked Mar 8, 2024 at 16:06

2 votes

1 answer

103 views

Computing gradient over all examples in gradient descent

I am studying Gradient Descent and Stochastic Gradient Descent, and the text says that one of the advantages of SGD over GD is that GD can be computationally expensive for large datasets. In ...

WalaWizon

103

asked Mar 1, 2024 at 12:52

1 vote

0 answers

78 views

Group Lasso optimization

I read that, for the group lasso, to solve the zero subgradient equations, one approach involves keeping all block vectors fixed, denoted as $\{\hat\theta_k, k \ne j\}$, and then solving for $ \hat \...

Jenny

261

asked Jan 19, 2024 at 20:35

2 votes

0 answers

74 views

Solving a system of equalities using a neural network

Assume $P$ is a set of pairs $(x, y)$, where both $x$ and $y$ are in $\mathbb{R}^n$. Assume $P'$ is a subset of $P$. I want to train a neural network $N: \mathbb{R}^n \to \mathbb{R}^m$ such that, for ...

Mahyar

21

asked Jan 9, 2024 at 17:38

0 votes

0 answers

78 views

How does the chain-rule look for the gradient of a loss function?

Suppose we are computing the gradient of the loss function, $L$, of a Word2Vec model, for the context word-embedding, $w_i$, and the target word-embedding, $t$, where the loss function, $L$, looks like: $...

ZenPyro

1

asked Jan 4, 2024 at 23:18

0 votes

1 answer

79 views

What loss function should I use to fit a distribution of points with a function with latent variables?

For simplicity I am going to start with a toy example. Let's suppose we have a set of $n$ points $\vec{Y}$ in the 2d space, distributed with the shape of the letter M. ...

Iván

101

asked Dec 25, 2023 at 7:52

1 vote

0 answers

85 views

Neural ODE and Adjoint method

I am trying to understand the paper NeuralODE which is very interesting. I get the general idea and the proof they give about the dynamics of the adjoint is fairly simple. They define a network $f$, ...

Julien Séveno-Piltant

111

asked Dec 18, 2023 at 12:19

0 votes

2 answers

327 views

A Simple Toy ML problem that surprisingly fails to converge (or even "try"!)

This is a much simplified network from a real problem that, to me, has a surprising INability to learn a simple task via backprop, i.e., it can't overfit or learn at all. This simple version has come at ...

Josh.F

107

asked Dec 10, 2023 at 18:22

3 votes

1 answer

75 views

Implement Nesterov's acceleration for SVM

I am trying to implement Nesterov's accelerated gradient descent for SVM. The objective function I need to minimize is $$\frac{1}{2}\lVert Au-Bv\rVert_2^2$$ with constraints $\sum_{i}u_i=\sum_{j}v_j=1$...

struggleinmath

31

asked Nov 26, 2023 at 21:49

1 vote

0 answers

51 views

Residuals in Online Gradient Descent

Does anyone know if online linear learning assumes white-noise residuals? My thought process is that serial correlation can arise due to the fact that the fit at time t uses the information ...

sebHan1234

31

asked Nov 7, 2023 at 16:09

2 votes

0 answers

271 views

How to include parameter constraints in the Levenberg-Marquardt algorithm?

No resource I found online says what to do if the constraints aren’t satisfied. If my updated parameters are given by: $$\theta_{k+1} = \theta_k - (JJ^T + \lambda I)^{-1}Jr, $$ where J is the ...

THATS MY QUANT MY QUANTITATIVE

131

asked Nov 3, 2023 at 10:28

12 votes

2 answers

2k views

Why go through the trouble of expectation maximization and not use gradient descent?

In expectation maximization first a lower bound of the likelihood is found and then a 2 step iterative algorithm kicks in where first we try to find the weights (the probability that a data point ...

figs_and_nuts

2,787

asked Oct 14, 2023 at 10:56

2 votes

0 answers

142 views

Do common implementations of mini-batch gradient descent violate the i.i.d assumption needed for unbiased estimation?

When we perform mini-batch GD, we estimate the true gradient: $$\nabla L = \frac{1}{N} \sum_i \nabla L_i$$ with: $$\nabla_B L = \frac{1}{B} \sum_{i \in B} \nabla L_i$$ where $B$ is the batch size. ...

Antonios Sarikas

881

asked Oct 11, 2023 at 11:15
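
The two estimators in the excerpt above can be compared exactly on a small case. The sketch below is an illustration under stated assumptions, not the asker's setup: the per-example gradients are hypothetical scalars, and a batch is taken to be a uniformly random subset of fixed size drawn without replacement. It only checks that a single such batch gives an unbiased estimate of $\nabla L$; it says nothing about dependence between successive batches within an epoch.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical per-example gradients (scalars for simplicity; an assumption).
per_example_grads = [Fraction(g) for g in (3, -1, 4, 1, -5, 9)]
N = len(per_example_grads)
batch_size = 2

# Full gradient: (1/N) * sum_i grad L_i
full_grad = sum(per_example_grads) / N

# Every subset of size `batch_size` is an equally likely batch, so the
# expectation of the batch gradient is the plain average over all subsets.
batch_grads = [sum(batch) / batch_size
               for batch in combinations(per_example_grads, batch_size)]
mean_batch_grad = sum(batch_grads) / len(batch_grads)

print(full_grad, mean_batch_grad)
```

Exact rational arithmetic (`Fraction`) is used so that the comparison between the two quantities is not blurred by floating-point rounding.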

1 vote

0 answers

154 views

Is my custom loss function differentiable?

Consider the following loss function. loss = ( ( torch.where(d > threshold, torch.sqrt(d), 0) * t ) + ( torch.where(d <= threshold, (1 - d), 0) * (1 - t) ) ) ...

Adel

305

asked Oct 2, 2023 at 12:17

0 votes

0 answers

195 views

Convergence in Logistic Regression

Hey, I'm taking a deeper dive into logistic regression, specifically the following loss function with L2 regularization: $$l(w)=\frac{1}{n}\sum_{i=1}^{n} \log(1+\exp(-y_i \cdot x_i^Tw))+\frac{\lambda}{2}||w||^...

zzz

1

asked Sep 20, 2023 at 13:53

1 vote

2 answers

142 views

How does feature scaling improve convergence in gradient descent?

Based on answers online, it appears that feature scaling helps to (1) ensure balanced step sizes and (2) make the cost function more symmetrical. Why would an imbalanced step size be an issue, since a ...

Michael

11

asked Sep 2, 2023 at 10:14

1 vote

0 answers

61 views

Why do I get spikes during training with vanilla gradient descent? [closed]

I developed my own NN toolbox, and it seems to work fine. But I am not sure why I get these spikes in my loss during training: I am training for a classification task of 2 inputs and 2 classes, ...

z_tjona

119

asked Aug 18, 2023 at 3:21

4 votes

2 answers

186 views

Is optimization without function evaluations (and with only gradient evaluations) possible?

I am implementing an unconstrained optimization algorithm using gradient descent. I am evaluating a cost function at a given point, evaluating the gradient at this point, and selecting the next ...

Joaquin Rapela

43

asked Aug 17, 2023 at 12:08

1 vote

0 answers

386 views

Understanding Backpropagation with Softmax and Quadratic Error

I'm trying to understand how to compute the derivative of the Softmax activation function in order to compute the gradient of the quadratic cost function w.r.t. the weights of the last layer. ...

Dario Ranieri

11

asked Jul 18, 2023 at 14:36

2 votes

0 answers

92 views

How do we know if we are in a local or global minimum when using gradient descent?

I have been learning about Real Analysis and recently took a stats module on regressions, etc. We have come across gradient descent and I was curious about, if we have a complicated loss function,...

ILE2091

45

asked Jul 17, 2023 at 21:52

0 votes

0 answers

16 views

What's the point of using gradient descent for linear regression if you can calculate the coefficients directly using the least squares method? [duplicate]

Gradient descent involves significant computational effort, whereas the method of least squares enables direct and accurate calculation. Does gradient descent offer any advantages over least squares ...

Dawid

33

asked Jun 25, 2023 at 23:50

1 vote

0 answers

69 views

Imputation method for missing values that are irrelevant

I have a data set $\mathbf X$, with around 20 predictors, which is a matrix of parameters of a surrogate model. For each observation $\mathbf i$ of $\mathbf X$, the surrogate model was trained to ...

Florent H

153

asked Jun 21, 2023 at 18:42
