Questions tagged [gradient-descent]
Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. For stochastic gradient descent there is also the [sgd] tag.
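As a rough illustration of the update described above, here is a minimal sketch in Python. The function name `gradient_descent`, the parameters `step_size` and `n_steps`, and the quadratic test function are assumptions chosen for this example; they are not part of the tag description.

```python
import numpy as np

def gradient_descent(grad, x0, step_size=0.1, n_steps=100):
    """Repeatedly step against the gradient: x <- x - step_size * grad(x).

    `grad` is a callable returning the gradient at x (hypothetical helper).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - step_size * grad(x)  # step proportional to the negative gradient
    return x

# Toy objective (an assumption for illustration): f(x) = 0.5 * ||x - target||^2,
# whose gradient is simply x - target and whose minimum is at x = target.
target = np.array([3.0, -1.0])
x_min = gradient_descent(lambda x: x - target, x0=[0.0, 0.0])
print(x_min)
```

With a fixed step size this only finds a local minimum, and it can diverge if the step size is too large for the function's curvature.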
998 questions
6
votes
1
answer
150
views
Why do “good” loss functions in ML need both Lipschitz continuity and smoothness?
I’m trying to understand the common assumptions in machine-learning optimization theory, where a “well-behaved” loss function is often required to be both L-Lipschitz and β-smooth (i.e., have β-...
2
votes
0
answers
25
views
What causes the degradation problem - the higher training error in much deeper networks?
In the paper "Deep Residual Learning for Image Recognition", it is mentioned that
"When deeper networks are able to start converging, a degradation problem has been exposed: with ...
0
votes
0
answers
41
views
Why does LightGBM use the factor (1-a)/b in GOSS?
LightGBM is a specific implementation of gradient boosted decision trees. One notable difference is how samples used for calculating variance gain in split points are picked.
In the algorithm, ...
2
votes
1
answer
59
views
Running SGD multiple times and picking the best result: keywords / name for this practice?
When fitting neural networks, I often run stochastic gradient descent multiple times and take the run with the lowest training loss. I'm trying to look up research literature on this practice, but I'm ...
4
votes
1
answer
89
views
Stochastic Gradient Descent for Multilayer Networks
I was going through the algorithm for stochastic gradient descent in multilayer networks from the book Machine Learning by Tom Mitchell, and it shows the formulae for the weight update rule. However, I don't ...
10
votes
3
answers
2k
views
Is Backpropagation faulty?
Consider a neural network with 2 or more layers. After we update the weights in layer 1, the input to layer 2 ($a^{(1)}$) has changed, so ∂z/∂w is no longer correct, as z has changed to z* and z* $\...
1
vote
0
answers
85
views
cost function behaves erratically [closed]
Trying to learn basic machine learning, I wrote my own code for logistic regression where I minimize the usual log likelihood using gradient descent. This is the plot of the error function through a ...
2
votes
0
answers
55
views
Estimator bias implies Gradient Bias
Say I have a biased estimator, for example estimating $\log \mathbb{E}[f_\theta(x)]$ using Monte Carlo.
Does this imply that $\nabla_\theta \log \mathbb{E}[f_\theta(x)]$ is also biased if estimated ...
5
votes
1
answer
117
views
For a linear problem $Ax=b$, is gradient descent a lot faster than least squares (any approach)?
Context
There are many methods to solve least squares, but most of them involve $k n^3$ flops.
Using gradient descent, one runs $A x_i$ and uses the error to update $x_{i+1} = x_i - c \times \mathrm{g}...
0
votes
0
answers
53
views
Nesterov Accelerated Gradient Descent Stalling with High Regularization in Extreme Learning Machine
I'm implementing Nesterov Accelerated Gradient Descent (NAG) on an Extreme Learning Machine (ELM) with one hidden layer. My loss function is the Mean Squared Error (MSE) with $L^2$ regularization.
The ...
3
votes
1
answer
80
views
Do deep learning frameworks "look ahead" when calculating gradient in Nesterov optimization?
The whole point behind Nesterov optimization is to calculate the gradient not at the current parameter values $\theta_t$, but at $\theta_t + \beta m$, where $\beta$ is the momentum coefficient and $m$ ...
0
votes
0
answers
55
views
If the main benefit of BatchNorm is loss landscape smoothing, why do we use z-score normalisation instead of min-max?
According to recent papers, the main reason BatchNorm works is that it smooths the loss landscape. So if the main benefit is loss landscape smoothing, why do we need mean subtraction at all? ...
1
vote
0
answers
62
views
Do weights update less towards the start of a neural network?
That is, because the error is coming from the end of the neural network (i.e., at the output layer) and trickles back via backpropagation to the start of the neural network, does that mean that the ...
14
votes
6
answers
3k
views
Why are so many problems linear and how would one solve nonlinear problems?
I am taking a deep learning in Python class this semester and we are doing linear algebra.
Last lecture we "invented" linear regression with gradient descent (did least squares the lecture ...
1
vote
0
answers
45
views
Batch Normalization and the effect of scaled weights on the gradients
I have been reading the following paper: https://arxiv.org/pdf/1706.05350, and I am having a hard time with some claims and derivations made in the paper.
First of all, the main thing I am interested ...
1
vote
0
answers
58
views
Why do machine learning courses on regression mostly focus on gradient descent although we have the closed form estimator $(X'X)^{-1}X'Y$? [duplicate]
In many online machine learning courses and videos (such as Andrew Ng's Coursera course), when it comes to regression (for example regressing $Y$ on features $X$), although we have the closed form ...
3
votes
1
answer
243
views
Custom Loss function Overfits to the Wrong Output but MSE Doesn't
I have a simple function that I want to approximate with a neural network:
N(1) = -1
N(2) = -1
N(3) = 1
N(4) = -1
Instead of using the MSE or cross-entropy losses, ...
0
votes
0
answers
48
views
Is there a compact set of sufficient and necessary criteria a function can have that guarantees that gradient descent finds *global* minima?
We obviously need the function in question to be differentiable for this notion to make sense.
Now, convexity is a sufficient condition. (Strong) quasi-convexity is a weaker, but I think still ...
1
vote
0
answers
49
views
Using XGBoost as a submodel for a hybrid ML/process-based model
I am trying to fit a Chapman-Richards growth curve:
$$
B = A*(1-e^{-kt})
$$
Where B is the biomass of a forest, A is the asymptote, k is the growth rate, and t is forest age. I expect the growth rate ...
1
vote
0
answers
80
views
The gradient-based attack method does not seem to make sense for neural networks because the training error is non-convex
There are several gradient-based attack methods. Let $J$ be the training error; then, for instance, the projected gradient attack is
$$
\widetilde{x} = \Pi( x + \epsilon \nabla_x J(\theta, x, y) )
$$
...
1
vote
0
answers
40
views
(MacKay) How can regularization constants be optimized on-line in a tractable way?
In Chapter 44, "Supervised Learning in Multilayer Networks" (page 531), of his book Information Theory, Inference, and Learning Algorithms, MacKay claims that an advantage of ...
0
votes
0
answers
138
views
Should the target be standardized in gradient descent?
Suppose that we have a general loss function that depends on some parameters $w$ (e.g. neural network weights):
$$L_w =\frac{1}{N} \sum_i \ell(\hat{y}_i, y_i)$$
Is it beneficial to standardize the ...
1
vote
1
answer
90
views
Adding a baseline parameter in the derivation of the Gradient Bandit Algorithm
In the derivation of the Gradient Bandit Algorithm in Chapter 2.8 of the Reinforcement Learning book by Sutton & Barto, they introduce a baseline term $B_t$ and I can't seem to figure ...
1
vote
2
answers
136
views
ADALINE simple implementation with 2 features bug
I am reading the book Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka.
While plotting the decision boundary (a line in this case, since the number of features considered = 2) I can't ...
2
votes
1
answer
89
views
Do convergence rates for (convex) gradient descent apply when domain is (convex) subset of reals?
I have a convex multivariate optimization problem where each variable lies in the domain $[x, \infty)$ for some positive number $x$. I know the problem has a unique finite solution in the domain, ...
1
vote
0
answers
81
views
SVRG vs full gradient descent
Stochastic gradient descent allows us to avoid the computation of full gradients at the expense of introducing a noise floor to convergence. To decrease this noise floor, SGD requires a decrease in ...
1
vote
1
answer
131
views
Problem with high learning rate in model training
Fig. 6.5 of these lecture notes illustrates the effect on the training loss of using different learning rates:
I don't understand why it seems a high enough non-divergent learning rate can converge to ...
0
votes
1
answer
128
views
Question on the Partial Derivative of the Cross-Entropy Loss in SGD for Neural Networks
I'm currently learning about neural networks and stumbled upon a confusion related to the use of Stochastic Gradient Descent (SGD) in training. Specifically, I'm puzzled about the computation of the ...
0
votes
2
answers
447
views
Why do we maximize likelihood (sum of logs) and not simply maximize sum of probabilities? [duplicate]
In logistic regression we find the maximum likelihood estimator, $\max \prod_{i} p(y_i \mid x_i)$, which in practice means maximizing the sum of log-likelihoods. This makes sense, I understand MLE.
...
2
votes
1
answer
103
views
Computing gradient over all examples in gradient descent
I am studying gradient descent and stochastic gradient descent, and the text says that one of the advantages of SGD over GD is that GD can be computationally expensive for large datasets. In ...
1
vote
0
answers
78
views
Group Lasso optimization
I read that, for the group lasso, to solve the zero subgradient equations, one approach involves keeping all block vectors fixed, denoted as $\{\hat\theta_k, k \ne j\}$, and then solving for $ \hat \...
2
votes
0
answers
74
views
Solving a system of equalities using a neural network
Assume $P$ is a set of pairs $(x, y)$,
where both $x$ and $y$ are in $\mathbb{R}^n$.
Assume $P'$ is a subset of $P$.
I want to train a neural network
$N: \mathbb{R}^n \to \mathbb{R}^m$
such that, for ...
0
votes
0
answers
78
views
How does the chain-rule look for the gradient of a loss function?
Consider computing the gradient of the loss function, $L$, of a Word2Vec model for the context word embedding, $w_i$, and the target word embedding, $t$, where $L$ looks like:
$...
0
votes
1
answer
79
views
What loss function should I use to fit a distribution of points with a function with latent variables?
For simplicity I am going to start with a toy example.
Let's suppose we have a set of $n$ points $\vec{Y}$ in 2D space, distributed in the shape of the letter M. ...
1
vote
0
answers
85
views
Neural ODE and Adjoint method
I am trying to understand the Neural ODE paper, which is very interesting.
I get the general idea, and the proof they give about the dynamics of the adjoint is fairly simple.
They define a network $f$, ...
0
votes
2
answers
327
views
A Simple Toy ML problem that surprisingly fails to converge (or even "try"!)
This is a much simplified network from a real problem that, to me, has a surprising INability to learn a simple task via backprop, ie, it can't overfit or learn at all. This simple version has come at ...
3
votes
1
answer
75
views
Implement Nesterov's acceleration for SVM
I am trying to implement Nesterov's accelerated gradient descent for SVM. The objective function I need to minimize is $$\frac{1}{2}\lVert Au-Bv\rVert_2^2$$ with constraints $\sum_{i}u_i=\sum_{j}v_j=1$...
1
vote
0
answers
51
views
Residuals in Online Gradient Descent
Does anyone know if online linear learning assumes that the residuals are white noise?
My thought process is that serial correlation can arise due to the fact that the fit at time t uses the information ...
2
votes
0
answers
271
views
How to include parameter constraints in the Levenberg-Marquardt algorithm?
Every resource I found online doesn’t say what to do if the constraints aren’t satisfied. If my updated parameters are given by:
$$\theta_{k+1} = \theta_k - (JJ^T + \lambda I)^{-1}Jr, $$ where J is the ...
12
votes
2
answers
2k
views
Why go through the trouble of expectation maximization and not use gradient descent?
In expectation maximization, first a lower bound of the likelihood is found, and then a two-step iterative algorithm kicks in where first we try to find the weights (the probability that a data point ...
2
votes
0
answers
142
views
Do common implementations of mini-batch gradient descent violate the i.i.d assumption needed for unbiased estimation?
When we perform mini-batch GD, we estimate the true gradient:
$$\nabla L = \frac{1}{N} \sum_i \nabla L_i$$
with:
$$\nabla_B L = \frac{1}{B} \sum_{i \in B} \nabla L_i$$
where $B$ is the batch size. ...
1
vote
0
answers
154
views
Is my custom loss function differentiable?
Consider the following loss function.
loss = ( ( torch.where(d > threshold, torch.sqrt(d), 0) * t ) + ( torch.where(d <= threshold, (1 - d), 0) * (1 - t) ) )
...
0
votes
0
answers
195
views
Convergence in Logistic Regression
I'm taking a deeper dive into logistic regression, specifically the following loss function with L2 regularization:
$$l(w)=\frac{1}{n}\sum_{i=1}^{n} \log(1+\exp(-y_i \cdot x_i^Tw))+\frac{\lambda}{2}||w||^...
1
vote
2
answers
142
views
How does feature scaling improve convergence in gradient descent?
Based on answers online, it appears that feature scaling helps to (1) ensure balanced step sizes and (2) make the cost function more symmetrical.
Why would an imbalanced step size be an issue, since a ...
1
vote
0
answers
61
views
Why do I get spikes during training with vanilla gradient descent? [closed]
I developed my own NN toolbox, and it seems to work fine.
But I am not sure why I get these spikes in my loss during training:
I am training for a classification task with 2 inputs and 2 classes, ...
4
votes
2
answers
186
views
Is optimization without function evaluations (and with only gradient evaluations) possible?
I am implementing an unconstrained optimization algorithm using gradient descent. I am evaluating a cost function at a given point, evaluating the gradient at this point, and selecting the next ...
1
vote
0
answers
386
views
Understanding Backpropagation with Softmax and Quadratic Error
I'm trying to understand how to compute the derivative of the Softmax activation function in order to compute the gradient of the quadratic cost function w.r.t. the weights of the last layer.
...
2
votes
0
answers
92
views
How do we know if we are in a local or global minimum when using gradient descent?
I have been learning about real analysis and recently took a stats module on regression. We came across gradient descent, and I was curious: if we have a complicated loss function,...
0
votes
0
answers
16
views
What's the point of using gradient descent for linear regression if you can calculate the coefficients directly using the least squares method? [duplicate]
Gradient descent involves significant computational effort, whereas the method of least squares enables direct and accurate calculation. Does gradient descent offer any advantages over least squares ...
1
vote
0
answers
69
views
Imputation method for missing values that are irrelevant
I have a data set $\mathbf X$, with around 20 predictors, which is a matrix of parameters of a surrogate model. For each observation $\mathbf i$ of $\mathbf X$, the surrogate model was trained to ...