
Questions tagged [gradient-descent]

Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. For stochastic gradient descent there is also the [sgd] tag.
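The update described above can be sketched in a few lines. This is only a minimal illustration: the function name `gradient_descent`, the fixed `learning_rate`, the step count, and the example objective $f(x) = (x - 3)^2$ are all assumptions chosen for the sketch, not part of the tag description.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0.

    grad is a function returning the gradient at a point (assumed exact here).
    """
    x = x0
    for _ in range(steps):
        # Step proportional to the negative of the gradient at the current point.
        x = x - learning_rate * grad(x)
    return x


# Hypothetical objective: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

With these settings each step shrinks the distance to the minimizer $x = 3$ by a constant factor, so `x_min` ends up very close to 3.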

2 votes
0 answers
144 views

My validation loss (left) falls to near 0, while my training loss (right) remains basically unchanged (gradient step is on the abscissa). This is the opposite of the typical error in which train loss ...
Rylan Schaeffer
8 votes
2 answers
3k views

Also see this question for more external references! Consider the following machine-learning model: Here, $J = \frac{1}{m} \sum_{i = 1}^{m} L(\hat{y}^{(i)}, y^{(i)})$, and $m$ is the number of ...
x.projekt
  • 240
1 vote
1 answer
89 views

I am currently in a numerical analysis class at my university and wanted to tackle a project applying gradient descent. Fair warning: I am new to machine learning, but my professor believed in me, so ...
Mobius.Drip
0 votes
1 answer
247 views

This is about the contents of sections 1.2.1 and 1.2.1.1 of the book "Neural Networks and Deep Learning: A Textbook". The link to the sections is here. The question arises from the following ...
zzzhhh
  • 333
1 vote
0 answers
314 views

I am currently reading papers [1] and [2]. The authors state that: Our analytical results include almost all of the unbiased compression techniques. And also: (i) gradient compression must be ...
Complicated
1 vote
0 answers
88 views

In Dozat 2016 they introduce a sequence of hyperparameters $\mu_0, \cdots, \mu_T$ where $T$ is the total number of iterations. Naturally $T$ is dependent on the convergence of the parameters, so it ...
Galen
  • 10.1k
2 votes
0 answers
88 views

The Verhulst growth model can be given as $$P(t) = \frac{k}{1+ \left( \frac{k-P_0}{P_0} \right)\exp(-rt)}$$ where $P(t)$ is the population size at time $t$, $k$ is the carrying capacity, $P_0$ is the ...
Galen
  • 10.1k
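The Verhulst formula quoted in this excerpt translates directly into code. The sketch below is only an illustration: the function name `verhulst` and the parameter values used in the checks are assumptions, and the part of the question that is cut off is not addressed.

```python
import math


def verhulst(t, k, p0, r):
    """Population size P(t) = k / (1 + ((k - p0) / p0) * exp(-r * t)).

    k is the carrying capacity, p0 the initial population, r the growth rate.
    """
    return k / (1.0 + ((k - p0) / p0) * math.exp(-r * t))


# Hypothetical parameter values, chosen only for illustration.
start = verhulst(0.0, k=100.0, p0=10.0, r=0.5)    # equals p0 at t = 0
late = verhulst(100.0, k=100.0, p0=10.0, r=0.5)   # approaches k for large t
```

At $t = 0$ the expression reduces to $P_0$, and as $t$ grows the exponential term vanishes, so $P(t)$ approaches $k$.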
7 votes
2 answers
1k views

Suppose we have the absolute difference as an error function: $\mathit{loss}(w) = |m_x(w) - t|$ where $m_x$ is simply some model with input $x$ and weight setting $w$, and $t$ is the target value. In ...
b0neval
  • 679
0 votes
0 answers
376 views

When I was learning about gradient descent a few minutes ago, I looked at the equation This is supposed to find the slope of a point on the cost function J(θ0, θ0), and then go in the opposite ...
Adith Raghav
3 votes
2 answers
2k views

I'm quite new to AI/ML, and I was learning about gradient descent. I saw this equation that explained the gradient descent algorithm: I quite understood everything except the reason this equation ...
Adith Raghav
0 votes
1 answer
195 views

We know that the closed-form solution for linear regression is $\beta = (X'X)^{-1}X'Y$. $X$ is an $N \times M$ matrix, where $N$ is the number of observations and $M$ is the number of features. However, in ...
vpy
  • 73
2 votes
1 answer
468 views

It is well-known that using vanilla gradient descent on $f(x) = x^2$ can lead to ping-ponging and non-convergence. I would like to show that convergence can occur for momentum gradient descent. We ...
Yardel
  • 21
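The ping-ponging mentioned in this excerpt is easy to reproduce numerically. The sketch below assumes a learning rate of 1.0, a momentum coefficient of 0.5, and the classical "heavy ball" form of momentum; none of these choices come from the question itself, whose exact setup is cut off. It is a numerical illustration, not the proof the question asks for.

```python
def vanilla_gd(x0, learning_rate, steps):
    """Plain gradient descent on f(x) = x^2 (gradient 2x)."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * 2.0 * x
    return x


def momentum_gd(x0, learning_rate, beta, steps):
    """Heavy-ball momentum on f(x) = x^2: v <- beta*v - lr*grad, x <- x + v."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - learning_rate * 2.0 * x
        x = x + v
    return x


# With learning rate 1.0 the vanilla update is x <- -x, so the iterate
# flips sign forever and never approaches the minimum at 0.
x_vanilla = vanilla_gd(1.0, learning_rate=1.0, steps=51)

# With the same learning rate and beta = 0.5 (an assumed value), the
# iterates spiral in towards 0.
x_momentum = momentum_gd(1.0, learning_rate=1.0, beta=0.5, steps=60)
```

For this quadratic, the momentum recursion is linear, with characteristic polynomial $r^2 - (\beta - 1) r + \beta$ at learning rate 1, whose roots have modulus $\sqrt{\beta} < 1$ when $\beta = 0.5$; that is why the second run contracts while the first does not.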
1 vote
1 answer
646 views

Why do we think that stochastic gradient descent is going to find a minimum at all? I mean, on each iteration SGD moves in the direction that reduces only the current batch's error (SGD doesn't care about ...
mathgeek
  • 551
3 votes
1 answer
965 views

Suppose you want to find $k$ that minimises your cost function $J(k)$. We may want to apply batch gradient descent or stochastic gradient descent. Let's deliberately initialise $k$ with the same ...
mathgeek
  • 551
2 votes
0 answers
50 views

In a feedforward neural network, the main causes for the VGP are saturation of activation functions and poor initialisation of weights. From what I have read, using non-saturating activation functions,...
siegfried
  • 330
8 votes
1 answer
5k views

Due to out-of-memory (OOM) errors, I can only set the batch size to 2 or 1. Is it possible to learn with such a small batch size? Thanks!
Johnny Tam
2 votes
0 answers
78 views

When checking if a gradient descent (GD) has reached a minimum, it's a common practice to check the gradient of the cost function at the final iterate (also one might check if the Hessian is positive ...
CWC
  • 290
3 votes
1 answer
193 views

I am trying to find $$\min_W \|Y - XW\|_F^2 \quad \text{s.t.} \quad W_{ij} \geq 0 \;\; \forall i, j$$ where $X$ is the input data and $Y$ is the output data we try to fit. This is a convex optimization problem that can be ...
CWC
  • 290
1 vote
0 answers
51 views

I am reading Section 6.2.7, on the vanishing gradient problem, in Ovidiu Calin's book Deep Learning Architectures: A Mathematical Approach. On page 187 the author mentions one of the causes, i.e. the ...
siegfried
  • 330
3 votes
1 answer
904 views

I'm reading about neural networks, but the material I find is sometimes very abstract or just copies of something. Well, when considering the XOR problem, I have a network with the following structure ...
David
  • 110
1 vote
2 answers
3k views

Solution: for some reason, I had forgotten that the non-linear activation function is applied at every layer of the neural network, not just at the output layer. Hopefully to others reading my ...
User
  • 13
2 votes
0 answers
1k views

I'm trying to better figure out some formalism behind the Gradient Boosting Decision Trees (GBDT) algorithms. Given a dataset $\mathcal{D}$ and a loss function $L : \mathbb{R}^2 \rightarrow \mathbb{R}$...
James Arten
6 votes
1 answer
632 views

From this blog post: For any Optimization problem with respect to Machine Learning, there can be either a numerical approach or an analytical approach. The numerical problems are Deterministic, ...
Saucy Goat
3 votes
1 answer
953 views

I have already gone through the post and this post, but they didn't resolve my doubt. Let us say I have a deep neural network like this (with more layers, about 50): Now, my question is: If I'm using an ...
Bits
  • 221
2 votes
2 answers
344 views

So I see the Perceptron Algorithm applied to learning an SVM, where $\theta$ is the normal vector to the linearly separating hyperplane. How does the update $$\theta^{t+1}\leftarrow\theta^t+\alpha ...
user8714896
1 vote
1 answer
235 views

So the equation for MSE is $\frac{1}{2N}\sum(y-\hat{y})^2$. If you switch the order, as in $\frac{1}{2N}\sum(\hat{y} - y)^2$, does that affect anything? The only thing I think it potentially affects is ...
user8714896
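A quick numerical check of the two orderings in this excerpt, as a sketch only: the function names and the sample data are hypothetical, and the rest of the question is cut off, so this covers only the value of the loss and its gradient with respect to the predictions.

```python
def mse(y, y_hat):
    """(1 / 2N) * sum (y - y_hat)^2."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / (2 * len(y))


def mse_swapped(y, y_hat):
    """(1 / 2N) * sum (y_hat - y)^2, i.e. the same loss with the order switched."""
    return sum((b - a) ** 2 for a, b in zip(y, y_hat)) / (2 * len(y))


def grad_wrt_predictions(y, y_hat):
    """Gradient of either form with respect to each prediction: (y_hat_i - y_i) / N."""
    n = len(y)
    return [(b - a) / n for a, b in zip(y, y_hat)]


# Hypothetical targets and predictions.
targets = [1.0, 2.0, 3.5]
predictions = [0.5, 2.5, 3.0]

loss_a = mse(targets, predictions)
loss_b = mse_swapped(targets, predictions)
```

Because $(y - \hat{y})^2 = (\hat{y} - y)^2$ term by term, the two losses are identical, and differentiating either form with respect to $\hat{y}_i$ gives $(\hat{y}_i - y_i)/N$.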
1 vote
0 answers
3k views

I just wondered if there are cases where small or very small learning rates in gradient descent based optimization are useful? A large learning rate allows the model to explore a much larger portion ...
Gilfoyle
  • 681
0 votes
0 answers
935 views

The loss graph for my neural net looks like this: Blue is validation data loss and green is the training data loss. As you can see, the loss remains almost flat for the first 600 epochs and then it ...
Dylan Kerler
0 votes
0 answers
110 views

Suppose I have $n$ data points ($X_i$,$y_i$) where $X_i$ is a vector and $y_i$ is a scalar, $1 \le i \le n$. By defining $\hat{\boldsymbol{Y}} = \boldsymbol{\Theta} \boldsymbol{X} + \boldsymbol{b}$ ...
Amin Kaveh
0 votes
0 answers
116 views

I am wondering why all the common activation functions tend to increase with x (or stay flat like ReLU). I have not come across any that are inversely proportional to x, or that have some other shape. ...
Raisin
  • 101
0 votes
1 answer
291 views

To explain my question better, I will use this analogy: In the case of the Gradient-Descent method, we have multiple variations/expansions for the main algorithm, like stochastic gradient descent (SGD)...
Amin Kaveh
1 vote
0 answers
37 views

I am trying to understand the purpose of Xavier's initialization of the weights in an ANN. I get that the main reason is that we don't like our linear combinations in the units to be very large as the ...
J3lackkyy
  • 745
0 votes
0 answers
19 views

I am training an LSTM for regression problems, but the loss function randomly shoots up as in the picture below: I tried multiple things to prevent this, adjusting the learning rate, adjusting the ...
uom-tracy
2 votes
0 answers
166 views

I am working in 8-D parameter space, where every parameter is on the interval [0, 1]. The number of local maxima in this space and how they are positioned relative to one another is way more ...
E Tam
  • 299
5 votes
2 answers
2k views

I don't know if this is the right place to ask this question. If you think this question is better asked in another StackExchange, please point me to that. This question is about the sampling ...
Truong
  • 211
0 votes
0 answers
131 views

If I got the idea correctly, one of the main concepts behind the reparametrization trick, first presented in Kingma, D. P., & Welling, M. (2013), Auto-encoding variational bayes (ArXiv Preprint ...
TheQuantumMan
2 votes
0 answers
249 views

I was looking around for a good explanation as to why LSTMs are better able to handle vanishing and exploding gradients compared to vanilla RNNs. I know it is due to the cell memory $c_t$ acting as a ...
somefellow
2 votes
1 answer
820 views

Suppose I want to maximize the likelihood $L(\theta_1, \theta_2)$ subject to a constraint, for example $\theta_1 + \theta_2 = 1$, and no other constraints. Can I just replace $\theta_2$ by $1 - \theta_1$ in ...
wut
  • 177
2 votes
0 answers
141 views

In a GAN we want to minimize the Jensen-Shannon divergence, and we use gradient descent. When can't we use this approach? What attribute might the training data and the distribution of the generating network ...
mohammad B
0 votes
0 answers
54 views

My convolutional neural network (with 5 layers: the first 3 are Conv2D, the last 2 are fully connected) to classify four different classes of protein images resulted in very high accuracies and low losses in both ...
Daniel Duncan
2 votes
1 answer
419 views

I think I understood the principles of gradient descent and backpropagation. But I think, so far, I'm not sure how they work together. Gradient descent itself is "just" an optimization ...
Ben
  • 3,533
0 votes
1 answer
82 views

When do "Ada" optimizers (e.g. Adagrad, Adam, etc.) "adapt" their parameters? Is it at the end of each mini-batch or epoch?
Marsellus Wallace
3 votes
1 answer
851 views

I have often read that gradient boosting algorithms fit sequential models to the overall model's residuals, but I can't make sense of this for classification problems (for instance, what is the "...
Josh
  • 308
1 vote
1 answer
457 views

Ordinary Least Squares regression is defined as minimizing the sum of squared errors. So after doing this regression (OLS), what is the purpose of optimizing SSE (or MSE, RMSE, etc.) if linear ...
yonasboson
3 votes
1 answer
2k views

If you use weight decay for gradient descent (Adam specifically), do you need to use regularisation in the loss function? I believe the answer is yes since the gradient descent involves the ...
Robert Lewis
1 vote
2 answers
346 views

In the gradient descent algorithm, the update rule for the parameter vector is as follows: From this formula, I think that the update rule only depends on the sign of the gradient. So why don't we just use ...
i_love_thu_ha
4 votes
2 answers
142 views

I want to fit a Gaussian $q$ to a pdf $p$ by minimizing the energy $E = -\int q(x) \log p(x) dx$. This should result in a "delta function" Gaussian with $\sigma \rightarrow 0$ and $\mu \...
actinidia
  • 145
4 votes
1 answer
4k views

Is this training loss graph normal - where it flattens for quite a while before dropping? This is something that I am seeing when I train my neural net every time. Because whenever I read papers the ...
Dylan Kerler
2 votes
1 answer
133 views

Why does Steepest Descent converge? I know that we take the objective $f$ and move in the direction $-\nabla f$ with step size $\alpha_k$, but it seems the step size can be negative, and it does ...
Davi Américo
3 votes
1 answer
174 views

Given that almost all the activation functions in neural networks are increasing, by the gradient descent rule, all parameters should be updated in the same direction (negative direction). Then how ...
XXX
  • 205
