Questions tagged [gradient-descent]
Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. For stochastic gradient descent there is also the [sgd] tag.
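As a rough illustration of the description above, here is a minimal sketch of the update loop. The function name `gradient_descent`, its parameters, and the test function $f(x) = (x - 3)^2$ are assumptions chosen for this example and do not come from any particular question under this tag.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0.

    grad: callable returning the gradient at a point (hypothetical helper).
    """
    x = x0
    for _ in range(steps):
        # Step proportional to the negative of the gradient at the current point.
        x = x - learning_rate * grad(x)
    return x


# Assumed example: f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```

With this step size each iteration shrinks the distance to the minimizer by a constant factor, so the iterate approaches 3. A step size that is too large for the function at hand would make the same loop diverge.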
998 questions
2
votes
0
answers
144
views
Validation loss falls but train loss remains constant? [closed]
My validation loss (left) falls to near 0, while my training loss (right) remains basically unchanged (gradient step is on the abscissa). This is the opposite of the typical error in which train loss ...
8
votes
2
answers
3k
views
matrix-calculus - Understanding numerator/denominator layouts
Also see this question for more external references!
Consider the following machine-learning model:
Here, $J = \frac{1}{m} \sum_{i = 1}^{m} L(\hat{y}^{(i)}, y^{(i)})$, and $m$ is the number of ...
1
vote
1
answer
89
views
Simple Gradient Descent Project plausibility
I am currently in a numerical analysis class at my university and wanted to tackle a project applying gradient descent. Fair warning: I am new to machine learning, but my professor believed in me, so ...
0
votes
1
answer
247
views
How to show that the gradient of the smoothed surrogate loss function leads to perceptron update?
This is about the contents of section 1.2.1 and 1.2.1.1 of the book "Neural Networks and Deep Learning: A Textbook". The link to the sections is here. The question arises from the following ...
1
vote
0
answers
314
views
Do gradients in mini-batch SGD always have to be unbiased in order to prove convergence?
I am currently reading this paper [1] and [2].
The authors state that:
Our analytical results include almost all of the unbiased compression
techniques.
And also:
(i) gradient compression must be ...
1
vote
0
answers
88
views
How should I set $\vec{\mu}$ in NAdam optimization?
In Dozat 2016 they introduce a sequence of hyperparameters $\mu_0, \cdots, \mu_T$ where $T$ is the total number of iterations. Naturally $T$ is dependent on the convergence of the parameters, so it ...
2
votes
0
answers
88
views
Vanishing partial derivative of least squares w.r.t. Verhulst growth parameter
The Verhulst growth model can be given as
$$P(t) = \frac{k}{1+ \left( \frac{k-P_0}{P_0} \right)\exp(-rt)}$$
where $P(t)$ is the population size at time $t$, $k$ is the carrying capacity, $P_0$ is the ...
7
votes
2
answers
1k
views
In GD-optimisation, if the gradient of the error function is w.r.t. the weights, isn't the target value dropped since it's a lone constant?
Suppose we have the absolute difference as an error function:
$\mathit{loss}(w) = |m_x(w) - t|$
where $m_x$ is simply some model with input $x$ and weight setting $w$, and $t$ is the target value.
In ...
0
votes
0
answers
376
views
Gradient descent - no way to find the global optimum if the model is stuck at local optima?
When I was learning about gradient descent a few minutes ago, I looked at the equation
This is supposed to find the slope at a point on the cost function $J(\theta_0, \theta_1)$, and then go in the opposite ...
3
votes
2
answers
2k
views
Gradient descent - why the partial derivative?
I'm quite new to AI/ML, and I was learning about gradient descent. I saw this equation that explained the gradient descent algorithm:
I quite understood everything except the reason this equation ...
0
votes
1
answer
195
views
Gradient descent to solve regressions with large features
We know that the closed-form solution for linear regression is $\beta = (X'X)^{-1}X'Y$.
$X$ is an $N \times M$ matrix, where $N$ is the number of observations and $M$ is the number of features.
However, in ...
2
votes
1
answer
468
views
Proving that momentum gradient descent converges for function $f(x) = x^2$
It is well-known that using vanilla gradient descent on $f(x) = x^2$ can lead to ping-ponging and non-convergence. I would like to show that convergence can occur for momentum gradient descent.
We ...
1
vote
1
answer
646
views
Why does stochastic gradient descent lead us to a minimum at all?
Why do we think that stochastic gradient descent is going to find a minimum at all? I mean, on each iteration SGD moves in the direction that reduces only the current batch's error (SGD doesn't care about ...
3
votes
1
answer
965
views
The reason (and intuition) behind why stochastic gradient descent can get stuck on a local minimum
Suppose you want to find $k$ that minimises your cost function $J(k)$. We may want to apply batch gradient descent or stochastic gradient descent. Let's deliberately initialise $k$ with the same ...
2
votes
0
answers
50
views
Existing limitations of solutions to the Vanishing Gradient Problem
In a feedforward neural network, the main causes for the VGP are saturation of activation functions and poor initialisation of weights. From what I have read, using non-saturating activation functions,...
8
votes
1
answer
5k
views
Is it possible to learn with batch size = 1?
Due to an OOM error, I can only set the batch size to 2 or 1.
Is it possible to learn with such a low batch size?
Thanks!
2
votes
0
answers
78
views
Quantitatively define "small gradient" when checking convergence
When checking if a gradient descent (GD) has reached a minimum, it's a common practice to check the gradient of the cost function at the final iterate (also one might check if the Hessian is positive ...
3
votes
1
answer
193
views
Projected Gradient Descent for Quadratic Programming Problem
I am trying to find
$$ \min_W \|Y-XW \|_F^2$$ $$s.t. \forall ij, W_{ij}\geq0 $$
where X is input data and Y is the output data we try to fit to. This is a convex optimization problem that can be ...
1
vote
0
answers
51
views
Vanishing gradient problem and choice of cost function
I am reading chapter 6.2.7 vanishing gradient problem in the book Ovidiu Calin - Deep Learning Architectures - A Mathematical Approach. On page 187 the author mentioned one of the causes, i.e. the ...
3
votes
1
answer
904
views
Neural Networks: How to get the gradient vector for the xOr problem?
I'm reading about neural networks, but the material I find is sometimes very abstract or just copies of something. Well, when considering the $xOr$ problem, I have a network in the following structure
...
1
vote
2
answers
3k
views
Why are non-linear activation functions required in multilayer perceptron classification? [duplicate]
Solution: for some reason, I had forgotten that the non-linear activation function is applied at every layer of the neural network, not just at the output layer. Hopefully to others reading my ...
2
votes
0
answers
1k
views
Mathematical formalism of Gradient Boosting Decision Trees (GBDT) algorithms
I'm trying to better figure out some formalism behind the Gradient Boosting Decision Trees (GBDT) algorithms.
Given a dataset $\mathcal{D}$ and a loss function $L : \mathbb{R}^2 \rightarrow \mathbb{R}$...
6
votes
1
answer
632
views
Which approaches exist for optimization in machine learning?
From this blog post:
For any Optimization problem with respect to Machine Learning, there can be either a numerical approach or an analytical approach. The numerical problems are Deterministic, ...
3
votes
1
answer
953
views
Misconception about ReLu
I have already gone through the post and this post, but they didn't clear my doubt. Let us say I have a deep neural network like this (with about 50 layers):
Now, my question is:
If I'm using an ...
2
votes
2
answers
344
views
How does gradient descent help SVM learn a linearly separable hyperplane?
So I see the Perceptron Algorithm applied to learning an SVM, where $\theta$ is the normal vector to the linearly separating hyperplane. How does the update
$$\theta^{t+1}\leftarrow\theta^t+\alpha ...
1
vote
1
answer
235
views
For the MSE equation, does the order of $y$ and $\hat{y}$ in the residual $(y-\hat{y})$ matter?
So the equation for MSE is $\frac{1}{2N}\sum(y-\hat{y})^2$. If you switch the order as in $\frac{1}{2N}\sum(\hat{y} - y)^2$, does that affect anything? The only thing I think it potentially affects is ...
1
vote
0
answers
3k
views
When are very small learning rates useful?
I just wondered if there are cases where small or very small learning rates in gradient descent based optimization are useful?
A large learning rate allows the model to explore a much larger portion ...
0
votes
0
answers
935
views
Why does the loss of a neural net flat-line and then suddenly drop?
The loss graph for my neural net looks like this:
Blue is validation data loss and green is the training data loss.
As you can see, the loss remains almost flat for the first 600 epochs and then it ...
0
votes
0
answers
110
views
Gradient Descent Algorithm for Interdependent parameters
Suppose I have $n$ data points ($X_i$,$y_i$) where $X_i$ is a vector and $y_i$ is a scalar, $1 \le i \le n$. By defining $\hat{\boldsymbol{Y}} = \boldsymbol{\Theta} \boldsymbol{X} + \boldsymbol{b}$ ...
0
votes
0
answers
116
views
Why do all activation functions have positive slope?
I am wondering why all the common activation functions tend to increase with x (or stay flat like ReLU). I have not come across any that are inversely proportional to x, or that have some other shape. ...
0
votes
1
answer
291
views
What are the variations of Expectation Maximization?
To explain my question better, I will use this analogy:
In the case of the Gradient-Descent method, we have multiple variations/expansions for the main algorithm, like stochastic gradient descent (SGD)...
1
vote
0
answers
37
views
Initial weights Feed Forward NN
I am trying to understand the purpose of Xavier's initialization of the weights in an ANN. I get that the main reason is that we don't like our linear combinations in the units to be very large as the ...
0
votes
0
answers
19
views
How to deal with the loss exploding for LSTM regression task [duplicate]
I am training an LSTM for regression problems, but the loss function randomly shoots up as in the picture below:
I tried multiple things to prevent this, adjusting the learning rate, adjusting the ...
2
votes
0
answers
166
views
How to Efficiently Find All Local Maxima in a Large Parameter Space
I am working in 8-D parameter space, where every parameter is on the interval [0, 1]. The number of local maxima in this space and how they are positioned relative to one another is way more ...
5
votes
2
answers
2k
views
Bayesian Optimization vs. gradient descent
I don't know if this is the right place to ask this question. If you think this question is better asked in another StackExchange, please point me to that.
This question is about the sampling ...
0
votes
0
answers
131
views
Main idea behind reparametrization trick (distribution to function)
If I got the idea correctly, one of the main concepts behind the reparametrization trick, first presented in Kingma, D. P., & Welling, M. (2013), Auto-encoding variational bayes (ArXiv Preprint ...
2
votes
0
answers
249
views
LSTM backpropagation gradient regarding vanishing and exploding gradients problem
I was looking around for a good explanation as to why LSTMs are better able to handle vanishing and exploding gradients compared to vanilla RNNs. I know it is due to the cell memory $c_t$ acting as a ...
2
votes
1
answer
820
views
Constrained optimization with gradient descent
Suppose I want to maximize the likelihood $L(\theta_1, \theta_2)$ subject to some constraint, for example $\theta_1 + \theta_2 = 1$, and no other constraints.
Can I just replace $\theta_2$ by $1 - \theta_1$ in ...
2
votes
0
answers
141
views
Problem of Jensen-Shannon
In a GAN we want to minimize the Jensen-Shannon distance, and we use gradient descent. When can't we use this approach? What attribute might the training data and the distribution of the generating network ...
0
votes
0
answers
54
views
Almost Perfect Accuracy in Both Training and Validation Sets, but Nothing Showed Up in All But One of the Classes' Saliency Map
My convolutional neural network (with 5 layers: first 3 are Conv2D, last 2 are FC's) to classify four different classes of protein images resulted in very high accuracies and low losses in both ...
2
votes
1
answer
419
views
Gradient descent and Backpropagation
I think I understood the principles of gradient descent and backpropagation. But I think, so far, I'm not sure how they work together.
Gradient descent itself is "just" an optimization ...
0
votes
1
answer
82
views
When do Adaptive Optimization Algorithms modify their parameters?
When do "Ada" optimizers (e.g. Adagrad, Adam, etc...) "adapt" their parameters? Is it at the end of each mini-batch or epoch?
3
votes
1
answer
851
views
Do gradient boosted trees actually use regression trees for classification, and if so, what does the gradient update?
I have often read that gradient boosting algorithms fit sequential models to the overall model's residuals, but I can't make sense of this for classification problems (for instance, what is the "...
1
vote
1
answer
457
views
What is the Purpose of calculating SSE, MSE (or other metrics) if linear regression (OLS) minimizes the sum of squared errors?
Ordinary Least Squares regression is defined as minimizing the sum of squared errors. So after doing this regression (OLS), what is the purpose of optimizing SSE (or MSE, RMSE etc.) if linear ...
3
votes
1
answer
2k
views
When to use weight decay for ADAM optimiser?
If you use weight decay for gradient descent (ADAM specifically) do you need to use regularisation for loss function?
I believe the answer is yes since the gradient descent involves the ...
1
vote
2
answers
346
views
Why do we need gradient in gradient descent?
In the gradient descent algorithm, the update rule for the parameter vector is as follows:
From this formula, I think that the update rule only depends on the sign of
the gradient. So why don't we just use ...
4
votes
2
answers
142
views
Why does it appear impossible to fit Gaussians to arbitrary probability density functions $p$?
I want to fit a Gaussian $q$ to a pdf $p$ by minimizing the energy $E = -\int q(x) \log p(x) dx$. This should result in a "delta function" Gaussian with $\sigma \rightarrow 0$ and $\mu \...
4
votes
1
answer
4k
views
Is it normal for training loss to plateau before decreasing?
Is this training loss graph normal - where it flattens for quite a while before dropping? This is something that I am seeing when I train my neural net every time.
Because whenever I read papers the ...
2
votes
1
answer
133
views
Convergence of Steepest Descent
Why does Steepest Descent converge? I know that we take the objective $f$ and move in the direction $-\nabla f$ with step size $\alpha_k$, but it seems the step size could be negative and it does ...
3
votes
1
answer
174
views
gradient descent in neural network
Given that almost all the activation functions in neural networks are increasing, by the gradient descent rule, all parameters should be updated in the same direction (negative direction). Then how ...