Newest 'gradient-descent' Questions - Page 4

2 votes

0 answers

144 views

Validation loss falls but train loss remains constant? [closed]

My validation loss (left) falls to near 0, while my training loss (right) remains basically unchanged (gradient step is on the abscissa). This is the opposite of the typical error in which train loss ...

Rylan Schaeffer

1,088

asked Nov 29, 2021 at 18:11

8 votes

2 answers

3k views

matrix-calculus - Understanding numerator/denominator layouts

Also see this question for more external references! Consider the following machine-learning model: Here, $J = \frac{1}{m} \sum_{i = 1}^{m} L(\hat{y}^{(i)}, y^{(i)})$, and $m$ is the number of ...

x.projekt

240

asked Nov 27, 2021 at 14:14

1 vote

1 answer

89 views

Simple Gradient Descent Project plausibility

I am currently in a numerical analysis class at my university and wanted to tackle a project applying gradient descent. Fair warning: I am new to machine learning, but my professor believed in me, so ...

Mobius.Drip

113

asked Nov 21, 2021 at 10:06

0 votes

1 answer

247 views

How to show that the gradient of the smoothed surrogate loss function leads to perceptron update?

This is about the contents of section 1.2.1 and 1.2.1.1 of the book "Neural Networks and Deep Learning: A Textbook". The link to the sections is here. The question arises from the following ...

zzzhhh

333

asked Nov 20, 2021 at 14:29

1 vote

0 answers

314 views

Do gradients in mini-batch SGD always have to be unbiased in order to prove convergence?

I am currently reading papers [1] and [2]. The authors state that: Our analytical results include almost all of the unbiased compression techniques. And also: (i) gradient compression must be ...

Complicated

43

asked Nov 19, 2021 at 1:56

1 vote

0 answers

88 views

How should I set $\vec{\mu}$ in NAdam optimization?

In Dozat 2016 they introduce a sequence of hyperparameters $\mu_0, \cdots, \mu_T$ where $T$ is the total number of iterations. Naturally $T$ is dependent on the convergence of the parameters, so it ...

Galen

10.1k

asked Nov 12, 2021 at 17:42

2 votes

0 answers

88 views

Vanishing partial derivative of least squares w.r.t. Verhulst growth parameter

The Verhulst growth model can be given as $$P(t) = \frac{k}{1+ \left( \frac{k-P_0}{P_0} \right)\exp(-rt)}$$ where $P(t)$ is the population size at time $t$, $k$ is the carrying capacity, $P_0$ is the ...

Galen

10.1k

asked Nov 11, 2021 at 19:42

7 votes

2 answers

1k views

In GD-optimisation, if the gradient of the error function is w.r.t. the weights, isn't the target value dropped since it's a lone constant?

Suppose we have the absolute difference as an error function: $\mathit{loss}(w) = |m_x(w) - t|$ where $m_x$ is simply some model with input $x$ and weight setting $w$, and $t$ is the target value. In ...

b0neval

679

asked Nov 11, 2021 at 0:45
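
The excerpt above is truncated, so the following is only a minimal sketch of the quantity it asks about, assuming a hypothetical one-weight model $m_x(w) = w \cdot x$ (the names `model`, `abs_loss` and `abs_loss_grad` are illustrative, not from the question). By the chain rule, the target $t$ does not appear as an additive term in the gradient, but it still decides the sign:

```python
def model(w, x):
    """Hypothetical one-weight model m_x(w) = w * x (an assumption for illustration)."""
    return w * x


def abs_loss(w, x, t):
    """Absolute-difference loss |m_x(w) - t| from the question."""
    return abs(model(w, x) - t)


def abs_loss_grad(w, x, t):
    """Chain rule: d/dw |m_x(w) - t| = sign(m_x(w) - t) * dm/dw."""
    residual = model(w, x) - t
    sign = (residual > 0) - (residual < 0)  # -1, 0 or +1
    return sign * x  # dm/dw = x for this particular model


x, w = 2.0, 1.0
grad_low_target = abs_loss_grad(w, x, t=0.0)   # model output 2.0 lies above the target
grad_high_target = abs_loss_grad(w, x, t=5.0)  # model output 2.0 lies below the target
```

Same weight and input, but the two targets give gradients of opposite sign, so the target still steers the update direction.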

0 votes

0 answers

376 views

Gradient descent - no way to find the global optimum if the model is stuck at local optima?

When I was learning about gradient descent a few minutes ago, I looked at the equation This is supposed to find the slope of a point on the cost function J(θ0, θ1), and then go in the opposite ...

Adith Raghav

149

asked Nov 7, 2021 at 14:20

3 votes

2 answers

2k views

Gradient descent - why the partial derivative?

I'm quite new to AI/ML, and I was learning about gradient descent. I saw this equation that explained the gradient descent algorithm: I quite understood everything except the reason this equation ...

Adith Raghav

149

asked Nov 6, 2021 at 16:18

0 votes

1 answer

195 views

Gradient descent to solve regressions with large features

We know that the closed-form solution for linear regression is $\beta = (X'X)^{-1}X'Y$. $X$ is a $N\times M$ matrix, where N is the number of observations and M is the number of features. However, in ...

vpy

73

asked Nov 4, 2021 at 2:13
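
As a rough illustration of the two routes this excerpt contrasts, here is a sketch on synthetic data, assuming NumPy is available; the sizes, the step size and the names `beta_closed` and `beta_gd` are illustrative assumptions, not taken from the question:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 3                       # observations, features (illustrative sizes)
X = rng.normal(size=(N, M))
beta_true = np.array([1.5, -2.0, 0.5])
Y = X @ beta_true + 0.01 * rng.normal(size=N)

# Closed form: solve (X'X) beta = X'Y instead of forming the inverse explicitly.
beta_closed = np.linalg.solve(X.T @ X, X.T @ Y)

# Gradient descent on the mean squared error (1/N) * ||Y - X beta||^2.
beta_gd = np.zeros(M)
step = 0.1                          # hand-picked for this well-conditioned X
for _ in range(1000):
    grad = (2.0 / N) * X.T @ (X @ beta_gd - Y)
    beta_gd -= step * grad
```

On this small, well-conditioned problem both routes agree closely. The closed form needs to build and solve an $M \times M$ system, while each gradient step only needs matrix-vector products with $X$.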

2 votes

1 answer

468 views

Proving that momentum gradient descent converges for function $f(x) = x^2$

It is well-known that using vanilla gradient descent on $f(x) = x^2$ can lead to ping-ponging and non-convergence. I would like to show that convergence can occur for momentum gradient descent. We ...

Yardel

21

asked Nov 2, 2021 at 20:56
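
The question's exact momentum formulation is cut off above, so this sketch assumes the common heavy-ball form $v \leftarrow \beta v - \eta f'(x)$, $x \leftarrow x + v$, with hand-picked values $\eta = 1$ and $\beta = 0.5$. It is a numerical illustration only, not a proof:

```python
def grad(x):
    """Derivative of f(x) = x^2."""
    return 2.0 * x


def vanilla_gd(x0, lr, steps):
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x


def momentum_gd(x0, lr, beta, steps):
    """Heavy-ball momentum (assumed form): v <- beta*v - lr*grad(x); x <- x + v."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(x)
        x = x + v
    return x


# With lr = 1 the vanilla update is x <- x - 2x = -x, which flips sign forever.
x_vanilla = vanilla_gd(1.0, lr=1.0, steps=101)
# Same learning rate with momentum: the iterates shrink towards the minimiser 0.
x_momentum = momentum_gd(1.0, lr=1.0, beta=0.5, steps=101)
```

For these values the momentum iterates follow $x_{k+1} = -0.5\,x_k - 0.5\,x_{k-1}$, whose characteristic roots have modulus $\sqrt{0.5} < 1$, which is why the sequence contracts.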

1 vote

1 answer

646 views

Why does stochastic gradient descent lead us to a minimum at all?

Why do we think that stochastic gradient descent is going to find a minimum at all? I mean on each iteration SGD moves in the direction that reduces only the current batch's error (SGD doesn't care about ...

mathgeek

551

asked Oct 6, 2021 at 2:30

3 votes

1 answer

965 views

The reason (and intuition) behind why stochastic gradient descent can get stuck on a local minimum

Suppose you want to find $k$ that minimises your cost function $J(k)$. We may want to apply batch gradient descent or stochastic gradient descent. Let's deliberately initialise $k$ with the same ...

mathgeek

551

asked Oct 4, 2021 at 23:07

2 votes

0 answers

50 views

Existing limitations of solutions to the Vanishing Gradient Problem

In a feedforward neural network, the main causes for the VGP are saturation of activation functions and poor initialisation of weights. From what I have read, using non-saturating activation functions,...

siegfried

330

asked Sep 22, 2021 at 13:08

8 votes

1 answer

5k views

Is it possible to learn with batch size = 1?

Due to an OOM error, I can only set the batch size to be 2 or 1. Is it possible to learn with such a low batch size? Thanks!

Johnny Tam

373

asked Sep 15, 2021 at 1:04

2 votes

0 answers

78 views

Quantitatively define "small gradient" when checking convergence

When checking if a gradient descent (GD) has reached a minimum, it's a common practice to check the gradient of the cost function at the final iterate (also one might check if the Hessian is positive ...

CWC

290

asked Sep 13, 2021 at 14:43

3 votes

1 answer

193 views

Projected Gradient Descent for Quadratic Programming Problem

I am trying to find $$ \min_W \|Y-XW \|_F^2$$ $$s.t. \forall ij, W_{ij}\geq0 $$ where X is input data and Y is the output data we try to fit to. This is a convex optimization problem that can be ...

CWC

290

asked Sep 9, 2021 at 15:44
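
A minimal sketch of projected gradient descent for the problem stated in this excerpt, assuming NumPy and synthetic noiseless data (the shapes, iteration count and `W_true` are illustrative assumptions). Each iteration takes a gradient step and then projects onto the feasible set by clipping negative entries to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
W_true = np.abs(rng.normal(size=(5, 3)))    # non-negative ground truth (synthetic)
Y = X @ W_true                               # noiseless targets


def objective(W):
    """Squared Frobenius norm ||Y - XW||_F^2."""
    return np.linalg.norm(Y - X @ W, "fro") ** 2


# Step size 1/L, where L = 2 * sigma_max(X)^2 is the Lipschitz constant of the gradient.
step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)

W = np.zeros((5, 3))
for _ in range(2000):
    grad = 2.0 * X.T @ (X @ W - Y)           # gradient of the objective w.r.t. W
    W = np.maximum(W - step * grad, 0.0)     # projection onto {W : W_ij >= 0}
```

Because the constraint set is a simple orthant, the projection is an elementwise `max` with zero, which keeps every iterate feasible.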

1 vote

0 answers

51 views

Vanishing gradient problem and choice of cost function

I am reading chapter 6.2.7 vanishing gradient problem in the book Ovidiu Calin - Deep Learning Architectures - A Mathematical Approach. On page 187 the author mentioned one of the causes, i.e. the ...

siegfried

330

asked Sep 6, 2021 at 5:21

3 votes

1 answer

904 views

Neural Networks: How to get the gradient vector for the xOr problem?

I'm reading about neural networks, but the material I find is sometimes very abstract or just copies of something. Well, when considering the $xOr$ problem, I have a network in the following structure ...

David

110

asked Aug 28, 2021 at 15:34

1 vote

2 answers

3k views

Why are non-linear activation functions required in multilayer perceptron classification? [duplicate]

Solution: for some reason, I had forgotten that the non-linear activation function is applied at every layer of the neural network, not just at the output layer. Hopefully to others reading my ...

User

13

asked Aug 27, 2021 at 19:06

2 votes

0 answers

1k views

Mathematical formalism of Gradient Boosting Decision Trees (GBDT) algorithms

I'm trying to better figure out some formalism behind the Gradient Boosting Decision Trees (GBDT) algorithms. Given a dataset $\mathcal{D}$ and a loss function $L : \mathbb{R}^2 \rightarrow \mathbb{R}$...

James Arten

669

asked Aug 25, 2021 at 12:48

6 votes

1 answer

632 views

Which approaches exist for optimization in machine learning?

From this blog post: For any Optimization problem with respect to Machine Learning, there can be either a numerical approach or an analytical approach. The numerical problems are Deterministic, ...

Saucy Goat

189

asked Aug 17, 2021 at 11:23

3 votes

1 answer

953 views

Misconception about ReLU

I have already gone through this post and this post, but they didn't clear my doubt. Say I have a deep neural network like this (with many layers, about 50): Now, my question is: If I'm using an ...

Bits

221

asked Jul 26, 2021 at 18:30

2 votes

2 answers

344 views

How does gradient descent help SVM learn a linearly separable hyperplane?

So I see the Perceptron Algorithm applied to learning an SVM, where $\theta$ is the normal vector to the linearly separating hyperplane. How does the update $$\theta^{t+1}\leftarrow\theta^t+\alpha ...

user8714896

740

asked Jul 25, 2021 at 2:23

1 vote

1 answer

235 views

For MSE equation does order of $y$ and $\hat{y}$ in the residual $(y-\hat{y})$ matter?

So the equation for MSE is $\frac{1}{2N}\sum(y-\hat{y})^2$. If you switch the order as in $\frac{1}{2N}\sum(\hat{y} - y)^2$ does that affect anything? The only thing I think it potentially affects is ...

user8714896

740

asked Jul 22, 2021 at 3:22
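
A small check of the two orderings from this excerpt, on made-up numbers (the function names and data are illustrative assumptions). Squaring removes the sign of the residual, and the chain rule gives the same gradient with respect to $\hat{y}$ either way:

```python
def half_mse(y, y_hat):
    """(1 / 2N) * sum (y - y_hat)^2."""
    n = len(y)
    return sum((t - p) ** 2 for t, p in zip(y, y_hat)) / (2 * n)


def half_mse_flipped(y, y_hat):
    """(1 / 2N) * sum (y_hat - y)^2."""
    n = len(y)
    return sum((p - t) ** 2 for t, p in zip(y, y_hat)) / (2 * n)


def grad_wrt_prediction(y, y_hat):
    """From (y - y_hat)^2: the inner derivative is -1, giving -(y - y_hat) / N."""
    n = len(y)
    return [-(t - p) / n for t, p in zip(y, y_hat)]


def grad_wrt_prediction_flipped(y, y_hat):
    """From (y_hat - y)^2: the inner derivative is +1, giving (y_hat - y) / N."""
    n = len(y)
    return [(p - t) / n for t, p in zip(y, y_hat)]


y = [1.0, 2.0, 3.5]
y_hat = [0.5, 2.5, 3.0]
same_loss = half_mse(y, y_hat) == half_mse_flipped(y, y_hat)
same_grad = grad_wrt_prediction(y, y_hat) == grad_wrt_prediction_flipped(y, y_hat)
```

The only place the ordering shows up is in the intermediate sign of the residual, which the inner derivative cancels.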

1 vote

0 answers

3k views

When are very small learning rates useful?

I just wondered if there are cases where small or very small learning rates in gradient descent based optimization are useful? A large learning rate allows the model to explore a much larger portion ...

Gilfoyle

681

asked Jul 15, 2021 at 19:48

0 votes

0 answers

935 views

Why does the loss of a neural net flat-line and then suddenly drop?

The loss graph for my neural net looks like this: Blue is validation data loss and green is the training data loss. As you can see, the loss remains almost flat for the first 600 epochs and then it ...

Dylan Kerler

351

asked Jul 15, 2021 at 14:42

0 votes

0 answers

110 views

Gradient Descent Algorithm for Interdependent parameters

Suppose I have $n$ data points ($X_i$,$y_i$) where $X_i$ is a vector and $y_i$ is a scalar, $1 \le i \le n$. By defining $\hat{\boldsymbol{Y}} = \boldsymbol{\Theta} \boldsymbol{X} + \boldsymbol{b}$ ...

Amin Kaveh

65

asked Jul 15, 2021 at 13:05

0 votes

0 answers

116 views

Why do all activation functions have positive slope?

I am wondering why all the common activation functions tend to increase with x (or stay flat like ReLU). I have not come across any that are inversely proportional to x, or that have some other shape. ...

Raisin

101

asked Jul 11, 2021 at 1:13

0 votes

1 answer

291 views

What are the variations of Expectation Maximization?

To explain my question better, I will use this analogy: In the case of the Gradient-Descent method, we have multiple variations/expansions for the main algorithm, like stochastic gradient descent (SGD)...

Amin Kaveh

65

asked Jul 9, 2021 at 13:50

1 vote

0 answers

37 views

Initial weights Feed Forward NN

I am trying to understand the purpose of Xavier's initialization of the weights in an ANN. I get that the main reason is that we don't like our linear combinations in the units to be very large as the ...

J3lackkyy

745

asked Jul 3, 2021 at 16:06

0 votes

0 answers

19 views

How to deal with the loss exploding for LSTM regression task [duplicate]

I am training an LSTM for regression problems, but the loss function randomly shoots up as in the picture below: I tried multiple things to prevent this, adjusting the learning rate, adjusting the ...

uom-tracy

21

asked Jun 29, 2021 at 21:00

2 votes

0 answers

166 views

How to Efficiently Find All Local Maxima in a Large Parameter Space

I am working in an 8-D parameter space, where every parameter is on the interval [0, 1]. The number of local maxima in this space and how they are positioned relative to one another is way more ...

E Tam

299

asked Jun 28, 2021 at 5:51

5 votes

2 answers

2k views

Bayesian Optimization vs. gradient descent

I don't know if this is the right place to ask this question. If you think this question is better asked in another StackExchange, please point me to that. This question is about the sampling ...

Truong

211

asked Jun 24, 2021 at 17:33

0 votes

0 answers

131 views

Main idea behind reparametrization trick (distribution to function)

If I got the idea correctly, one of the main concepts behind the reparametrization trick, first presented in Kingma, D. P., & Welling, M. (2013), Auto-encoding variational bayes (ArXiv Preprint ...

TheQuantumMan

201

asked Jun 19, 2021 at 23:24

2 votes

0 answers

249 views

LSTM backpropagation gradient regarding vanishing and exploding gradients problem

I was looking around for a good explanation as to why LSTMs are better able to handle vanishing and exploding gradients compared to vanilla RNNs. I know it is due to the cell memory $c_t$ acting as a ...

somefellow

21

asked Jun 18, 2021 at 20:47

2 votes

1 answer

820 views

Constrained optimization with gradient descent

Suppose I want to maximize the likelihood $L(\theta_1, \theta_2)$ subject to some constraint, for example $\theta_1 + \theta_2 = 1$, and no other constraints. Can I just replace $\theta_2$ by $1 - \theta_1$ in ...

wut

177

asked Jun 18, 2021 at 0:10
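
To make the substitution idea in this excerpt concrete, here is a sketch with a hypothetical likelihood $L(\theta_1, \theta_2) = \theta_1^a \theta_2^b$ (the exponents, step size and starting point are assumptions for illustration, and this particular likelihood also needs $0 < \theta_1 < 1$). Replacing $\theta_2$ by $1 - \theta_1$ leaves an unconstrained one-dimensional problem, solved here by gradient ascent on the log-likelihood:

```python
import math

a, b = 3.0, 7.0  # hypothetical exponents, not from the question


def log_lik(theta1):
    """log L after substituting theta2 = 1 - theta1."""
    theta2 = 1.0 - theta1
    return a * math.log(theta1) + b * math.log(theta2)


def log_lik_grad(theta1):
    """Derivative of the substituted log-likelihood w.r.t. theta1."""
    return a / theta1 - b / (1.0 - theta1)


theta1 = 0.5
for _ in range(2000):
    theta1 += 0.01 * log_lik_grad(theta1)   # gradient ascent step
theta2 = 1.0 - theta1                       # recover theta2 from the constraint
```

For this likelihood the iterate settles near the analytic maximiser $a / (a + b) = 0.3$, and the equality constraint holds by construction at every step.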

2 votes

0 answers

141 views

Problem of Jensen-Shannon

In a GAN we want to minimize the Jensen-Shannon distance, and we use gradient descent. When can't we use this approach? What attribute might the training data and the distribution of the generating network ...

mohammad B

21

asked Jun 15, 2021 at 12:53

0 votes

0 answers

54 views

Almost Perfect Accuracy in Both Training and Validation Sets, but Nothing Showed Up in All But One of the Classes' Saliency Map

My convolutional neural network (with 5 layers: first 3 are Conv2D, last 2 are FC's) to classify four different classes of protein images resulted in very high accuracies and low losses in both ...

Daniel Duncan

1

asked Jun 7, 2021 at 4:48

2 votes

1 answer

419 views

Gradient descent and Backpropagation

I think I understood the principles of gradient descent and backpropagation. But I think, so far, I'm not sure how they work together. Gradient descent itself is "just" an optimization ...

Ben

3,533

asked Jun 4, 2021 at 19:50

0 votes

1 answer

82 views

When do Adaptive Optimization Algorithms modify their parameters?

When do "Ada" optimizers (e.g. Adagrad, Adam, etc...) "adapt" their parameters? Is it at the end of each mini-batch or epoch?

Marsellus Wallace

155

asked Jun 3, 2021 at 16:09

3 votes

1 answer

851 views

Do gradient boosted trees actually use regression trees for classification, and if so, what does the gradient update?

I have often read that gradient boosting algorithms fit sequential models to the overall model's residuals, but I can't make sense of this for classification problems (for instance, what is the "...

Josh

308

asked May 28, 2021 at 17:15

1 vote

1 answer

457 views

What is the purpose of calculating SSE, MSE (or other metrics) if linear regression (OLS) minimizes the sum of squared errors?

Ordinary Least Squares regression is defined as minimizing the sum of squared errors. So after doing this regression (OLS), what is the purpose of optimizing SSE (or MSE, RMSE etc.) if linear ...

yonasboson

11

asked May 27, 2021 at 13:55

3 votes

1 answer

2k views

When to use weight decay for ADAM optimiser?

If you use weight decay for gradient descent (ADAM specifically), do you need to use regularisation for the loss function? I believe the answer is yes since the gradient descent involves the ...

Robert Lewis

31

asked May 21, 2021 at 13:12

1 vote

2 answers

346 views

Why do we need gradient in gradient descent?

In the gradient descent algorithm, the update rule of the parameter vector is as follows: From this formula, I think that the update rule only depends on the sign of the gradient. So why don't we just use ...

i_love_thu_ha

567

asked May 19, 2021 at 7:16

4 votes

2 answers

142 views

Why does it appear impossible to fit Gaussians to arbitrary probability density functions $p$?

I want to fit a Gaussian $q$ to a pdf $p$ by minimizing the energy $E = -\int q(x) \log p(x) dx$. This should result in a "delta function" Gaussian with $\sigma \rightarrow 0$ and $\mu \...

actinidia

145

asked May 14, 2021 at 18:36

4 votes

1 answer

4k views

Is it normal for training loss to plateau before decreasing?

Is this training loss graph normal - where it flattens for quite a while before dropping? This is something that I am seeing when I train my neural net every time. Because whenever I read papers the ...

Dylan Kerler

351

asked May 8, 2021 at 8:27

2 votes

1 answer

133 views

Convergence of Steepest Descent

Why does Steepest Descent converge? I know that we take the objective $f$ and move along the direction $-\nabla f$ with step size $\alpha_k$, but the step size seems able to be negative and it does ...

Davi Américo

1,270

asked May 7, 2021 at 22:30

3 votes

1 answer

174 views

Gradient descent in neural networks

Given that almost all the activation functions in neural networks are increasing, by the gradient descent rule, all parameters should be updated in the same direction (negative direction). Then how ...

XXX

205

asked May 1, 2021 at 12:55
