
Questions tagged [gradient-descent]

Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. For stochastic gradient descent there is also the [sgd] tag.

6 votes · 1 answer · 150 views
I’m trying to understand the common assumptions in machine-learning optimization theory, where a “well-behaved” loss function is often required to be both L-Lipschitz and β-smooth (i.e., have β-...
asked by Antonios Sarikas
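For reference, the two conditions the question refers to are standard; a minimal statement in my own wording, not the asker's:

```latex
% L-Lipschitz: function values change at a bounded rate
|f(x) - f(y)| \le L \|x - y\| \quad \forall x, y
% beta-smooth: the gradient itself is Lipschitz with constant beta
\|\nabla f(x) - \nabla f(y)\| \le \beta \|x - y\| \quad \forall x, y
```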
2 votes · 0 answers · 25 views
In the paper "Deep Residual Learning for Image Recognition", it is mentioned that "When deeper networks are able to start converging, a degradation problem has been exposed: with ...
asked by Vignesh N
0 votes · 0 answers · 41 views
LightGBM is a specific implementation of gradient-boosted decision trees. One notable difference is how the samples used for calculating variance gain at split points are picked. In the algorithm, ...
asked by yanis-falaki
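The sampling scheme the excerpt alludes to is GOSS (Gradient-based One-Side Sampling) from the LightGBM paper; a rough NumPy sketch of the idea, with made-up rates a and b:

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Keep the top a-fraction of rows by |gradient|, subsample a
    b-fraction of the rest, and up-weight the sampled small-gradient
    rows by (1 - a) / b so variance-gain estimates stay unbiased."""
    rng = np.random.default_rng(seed)
    g = np.abs(np.asarray(gradients))
    n = len(g)
    order = np.argsort(-g)
    top = order[: int(a * n)]                      # large-gradient rows, all kept
    rest = order[int(a * n):]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    weights = np.ones(n)
    weights[sampled] = (1 - a) / b                 # compensate for subsampling
    keep = np.concatenate([top, sampled])
    return keep, weights[keep]
```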
2 votes · 1 answer · 59 views
When fitting neural networks, I often run stochastic gradient descent multiple times and take the run with the lowest training loss. I'm trying to look up research literature on this practice, but I'm ...
asked by Jacob Maibach
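The practice described is simply random restarts; a minimal sketch, assuming a hypothetical train_sgd(seed) that returns a (model, final_train_loss) pair:

```python
def best_of_n_restarts(train_sgd, n_runs=5):
    """Run SGD from several random initializations and keep the run
    with the lowest final training loss."""
    runs = [train_sgd(seed=s) for s in range(n_runs)]
    return min(runs, key=lambda run: run[1])  # run = (model, train_loss)
```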
4 votes · 1 answer · 89 views
I was going through the algorithm for stochastic gradient descent in a multilayer network from the book Machine Learning by Tom Mitchell, and it shows the formula for the weight-update rule. However, I don't ...
asked by Machine123
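For context, the update rule in question, for a sigmoid multilayer network trained by stochastic gradient descent (my paraphrase of the standard backprop rule, not a quote from Mitchell), is:

```latex
\Delta w_{ji} = \eta\, \delta_j\, x_{ji}, \qquad
\delta_j =
\begin{cases}
o_j (1 - o_j)(t_j - o_j) & \text{output unit } j,\\[4pt]
o_j (1 - o_j) \sum_{k \in \mathrm{downstream}(j)} \delta_k\, w_{kj} & \text{hidden unit } j.
\end{cases}
```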
10 votes · 3 answers · 2k views
Consider a neural network with 2 or more layers. After we update the weights in layer 1, the input to layer 2 ($a^{(1)}$) has changed, so $\partial z / \partial w$ is no longer correct, as $z$ has changed to $z^*$ and ...
asked by Yaron
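The usual resolution is that backpropagation computes every layer's gradient with respect to the current weights first, and only then applies all updates; a toy two-layer NumPy sketch of that ordering:

```python
import numpy as np

def sgd_step(W1, W2, x, y, lr=0.1):
    # Forward pass with the *current* weights.
    a1 = np.tanh(W1 @ x)
    y_hat = W2 @ a1
    # Backward pass: both gradients are taken at the same (old) W1, W2.
    err = y_hat - y                                  # dL/dy_hat for squared loss
    gW2 = np.outer(err, a1)
    gW1 = np.outer((W2.T @ err) * (1 - a1**2), x)
    # Only now are the weights changed, simultaneously.
    return W1 - lr * gW1, W2 - lr * gW2
```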
1 vote · 0 answers · 85 views
Trying to learn basic machine learning, I wrote my own code for logistic regression where I minimize the usual negative log-likelihood using gradient descent. This is the plot of the error function through a ...
asked by user470820
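For comparison, a minimal NumPy version of that setup (batch gradient descent on the mean negative log-likelihood; all names here are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_steps=1000):
    """X: (n, d) design matrix, y: (n,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / len(y)   # gradient of the mean NLL
        w -= lr * grad
    return w
```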
2 votes · 0 answers · 55 views
Say I have a biased estimator, for example estimating $\log \mathbb{E}[f_\theta(x)]$ using Monte Carlo. Does this imply that $\nabla_\theta \log \mathbb{E}[f_\theta(x)]$ is also biased if estimated ...
asked by Alberto
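The bias in the plain Monte Carlo estimator comes from Jensen's inequality, since $\log$ is concave:

```latex
\mathbb{E}\!\left[\log \frac{1}{n}\sum_{i=1}^{n} f_\theta(x_i)\right]
\;\le\;
\log \mathbb{E}[f_\theta(x)],
```

so the log of a sample mean underestimates the true value on average; whether the gradient inherits a bias is exactly the asker's question.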
5 votes · 1 answer · 117 views
Context: there are many methods to solve least squares, but most of them involve $k n^3$ flops. Using gradient descent, one computes $A x_i$ and uses the error to update $x_{i+1} = x_i - c \times \mathrm{g}...
asked by uranus
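A minimal sketch of that iteration for $\min_x \frac{1}{2}\|Ax - b\|_2^2$, whose gradient is $A^\top(Ax - b)$; the step size c below is a made-up constant, and in practice must be small enough relative to the largest singular value of $A$:

```python
import numpy as np

def lstsq_gd(A, b, c=1e-3, n_steps=5000):
    """Gradient descent on f(x) = 0.5 * ||A x - b||^2.
    Each step costs two matrix-vector products instead of an O(n^3) solve."""
    x = np.zeros(A.shape[1])
    for _ in range(n_steps):
        g = A.T @ (A @ x - b)   # gradient of f at the current x
        x -= c * g
    return x
```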
0 votes · 0 answers · 53 views
I'm implementing Nesterov Accelerated Gradient Descent (NAG) on an Extreme Learning Machine (ELM) with one hidden layer. My loss function is the Mean Squared Error (MSE) with $L^2$ regularization. The ...
asked by Paolo Pedinotti
3 votes · 1 answer · 80 views
The whole point behind Nesterov optimization is to calculate the gradient not at the current parameter values $\theta_t$, but at $\theta_t + \beta m$, where $\beta$ is the momentum coefficient and $m$ ...
asked by Antonios Sarikas
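In update form, the lookahead variant the excerpt describes looks like the following sketch (sign conventions vary across texts):

```python
import numpy as np

def nag(grad, theta0, lr=0.01, beta=0.9, n_steps=100):
    """Nesterov momentum: evaluate the gradient at the lookahead
    point theta + beta * m, not at theta itself."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad(theta + beta * m)   # lookahead gradient
        m = beta * m - lr * g
        theta = theta + m
    return theta
```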
0 votes · 0 answers · 55 views
According to recent papers, the main reason why BatchNorm works is that it smooths the loss landscape. So if the main benefit is loss-landscape smoothing, why do we need mean subtraction at all? ...
asked by FadiBenz
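For reference, the mean subtraction in question is the first step of the standard BatchNorm transform:

```latex
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta,
```

where $\mu_B$ and $\sigma_B^2$ are the mini-batch mean and variance.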
1 vote · 0 answers · 62 views
That is, because the error is coming from the end of the neural network (i.e. at the output layer) and trickles back via backpropagation to the start of the neural network, does that mean that the ...
asked by Null Six
14 votes · 6 answers · 3k views
I am taking a deep learning in Python class this semester and we are doing linear algebra. Last lecture we "invented" linear regression with gradient descent (did least squares the lecture ...
asked by Lukas
1 vote · 0 answers · 45 views
I have been reading the following paper: https://arxiv.org/pdf/1706.05350, and I am having a hard time with some claims and derivations made in the paper. First of all, the main thing I am interested ...
asked by kklaw
1 vote · 0 answers · 58 views
In many online machine learning courses and videos (such as Andrew Ng's Coursera course), when it comes to regression (for example, regressing $Y$ on features $X$), although we have the closed form ...
asked by ExcitedSnail
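The closed form alluded to is the ordinary least squares solution; the usual reason to use gradient descent anyway is the cost of forming and solving with $X^\top X$, which is cubic in the number of features:

```latex
\hat{\beta} = (X^\top X)^{-1} X^\top Y
```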
3 votes · 1 answer · 243 views
I have a simple function that I want to approximate with a neural network: N(1) = -1, N(2) = -1, N(3) = 1, N(4) = -1. Instead of using the MSE or cross-entropy losses, ...
asked by Andrew Baker
0 votes · 0 answers · 48 views
We obviously need the function in question to be differentiable for this notion to make sense. Now, convexity is a sufficient condition. (Strong) quasi-convexity is a weaker one, but I think still ...
asked by hmmmmmmm
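For reference, the weaker condition mentioned, quasi-convexity, only constrains sublevel sets:

```latex
f(\lambda x + (1 - \lambda) y) \;\le\; \max\{f(x), f(y)\}
\qquad \forall\, x, y,\ \lambda \in [0, 1],
```

equivalently, every sublevel set $\{x : f(x) \le c\}$ is convex.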
1 vote · 0 answers · 49 views
I am trying to fit a Chapman-Richards growth curve: $$B = A(1 - e^{-kt})$$ where $B$ is the biomass of a forest, $A$ is the asymptote, $k$ is the growth rate, and $t$ is forest age. I expect the growth rate ...
asked by Ana Catarina Vitorino
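A minimal nonlinear least-squares fit of that curve with SciPy; the data and starting values below are synthetic stand-ins:

```python
import numpy as np
from scipy.optimize import curve_fit

def chapman_richards(t, A, k):
    return A * (1 - np.exp(-k * t))

# Synthetic forest ages (years) and noisy biomass observations.
t = np.linspace(1, 80, 40)
B = chapman_richards(t, A=300, k=0.05) + np.random.default_rng(0).normal(0, 5, t.size)

# p0 gives rough starting guesses for (A, k).
(A_hat, k_hat), _ = curve_fit(chapman_richards, t, B, p0=[250, 0.1])
```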
1 vote · 0 answers · 80 views
There are several gradient-based attack methods. Let $J$ be the training error; then, for instance, the projected gradient attack is $$\widetilde{x} = \Pi\big(x + \epsilon \nabla_x J(\theta, x, y)\big)$$ ...
asked by Your neighbor Todorovich
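A sketch of one such step in PyTorch, here the common signed-gradient variant with projection onto an $\ell_\infty$ ball (a variant for illustration, not necessarily the exact update in the question):

```python
import torch

def pgd_step(model, loss_fn, x, y, x_orig, eps=8/255, alpha=2/255):
    """One projected-gradient-ascent step on the loss w.r.t. the input."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    x_adv = x + alpha * x.grad.sign()                          # ascend the loss
    x_adv = torch.min(torch.max(x_adv, x_orig - eps), x_orig + eps)  # project onto the eps-ball
    return torch.clamp(x_adv, 0.0, 1.0).detach()               # keep a valid image
```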
1 vote · 0 answers · 40 views
In his book Information Theory, Inference, and Learning Algorithms, Chapter 44, "Supervised Learning in Multilayer Networks" (page 531), MacKay claims that an advantage of ...
asked by Mashe Burnedead
0 votes · 0 answers · 138 views
Suppose that we have a general loss function that depends on some parameters $w$ (e.g. neural network weights): $$L_w = \frac{1}{N} \sum_i \ell(\hat{y}_i, y_i)$$ Is it beneficial to standardize the ...
asked by Antonios Sarikas
1 vote · 1 answer · 90 views
In the derivation of the Gradient Bandit Algorithm in Chapter 2.8 of the Reinforcement Learning book by Sutton & Barto, they introduce a baseline term $B_t$ and I can't seem to figure ...
asked by Rafay Khan
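For context, the update being derived (with $\bar{R}_t$ the average-reward baseline, as in Sutton & Barto's Section 2.8) can be written compactly as:

```latex
H_{t+1}(a) = H_t(a) + \alpha\,(R_t - \bar{R}_t)\,\big(\mathbb{1}[a = A_t] - \pi_t(a)\big),
```

and the point of the derivation is that any baseline $B_t$ that does not depend on the selected action leaves the expected update unchanged.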
1 vote · 2 answers · 136 views
I am reading the book Machine Learning with PyTorch and Scikit-Learn by Sebastian Raschka. While plotting the decision boundary (a line in this case, since the number of features considered = 2), I can't ...
asked by tripma
2 votes · 1 answer · 89 views
I have a convex multivariate optimization problem where each variable lies in the domain $[x, \infty)$ for some positive number $x$. I know the problem has a unique finite solution in the domain, ...
asked by BaileyA
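With a box constraint like this, projected gradient descent just clips each coordinate back into the feasible set after every step; a minimal sketch:

```python
import numpy as np

def projected_gd(grad, theta0, lower, lr=0.01, n_steps=1000):
    """Gradient descent with projection onto {theta : theta_i >= lower}."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - lr * grad(theta)
        theta = np.maximum(theta, lower)   # Euclidean projection onto the box
    return theta
```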
1 vote · 0 answers · 81 views
Stochastic gradient descent allows us to avoid the computation of full gradients at the expense of introducing a noise floor to convergence. To decrease this noise floor, SGD requires a decrease in ...
asked by hegash
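The classical conditions on the step sizes $\eta_t$ under which SGD converges despite the gradient noise (the Robbins-Monro conditions) are:

```latex
\sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty,
```

satisfied, for example, by $\eta_t = \eta_0 / t$.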
1 vote · 1 answer · 131 views
Fig. 6.5 of these lecture notes illustrates the effect of different learning rates on the training loss. I don't understand why it seems that a high enough non-divergent learning rate can converge to ...
asked by Sam
0 votes · 1 answer · 128 views
I'm currently learning about neural networks and stumbled upon a confusion related to the use of Stochastic Gradient Descent (SGD) in training. Specifically, I'm puzzled about the computation of the ...
0 votes · 2 answers · 447 views
In logistic regression we find the maximum likelihood estimator, $\max \prod_{i} p(y_i \mid x_i)$, which in practice means maximizing the sum of log-likelihoods. This makes sense; I understand MLE. ...
asked by Amnon Attali
2 votes · 1 answer · 103 views
I am studying gradient descent and stochastic gradient descent, and the text says that one of the advantages of SGD over GD is that GD can be computationally expensive for large datasets. In ...
asked by WalaWizon
1 vote · 0 answers · 78 views
I read that, for the group lasso, to solve the zero-subgradient equations, one approach involves keeping all block vectors fixed, denoted as $\{\hat\theta_k, k \ne j\}$, and then solving for $\hat\theta_j$ ...
asked by Jenny
2 votes · 0 answers · 74 views
Assume $P$ is a set of pairs $(x, y)$, where both $x$ and $y$ are in $\mathbb{R}^n$. Assume $P'$ is a subset of $P$. I want to train a neural network $N: \mathbb{R}^n \to \mathbb{R}^m$ such that, for ...
asked by Mahyar
0 votes · 0 answers · 78 views
When computing the gradient of the loss function $L$ of a Word2Vec model with respect to the context word embedding $w_i$ and the target word embedding $t$, where the loss function $L$ looks like ...
asked by ZenPyro
0 votes · 1 answer · 79 views
For simplicity I am going to start with a toy example. Let's suppose we have a set of $n$ points $\vec{Y}$ in 2D space, distributed in the shape of the letter M. ...
asked by Iván
1 vote · 0 answers · 85 views
I am trying to understand the NeuralODE paper, which is very interesting. I get the general idea, and the proof they give about the dynamics of the adjoint is fairly simple. They define a network $f$, ...
asked by Julien Séveno-Piltant
0 votes · 2 answers · 327 views
This is a much simplified network from a real problem that, to me, has a surprising inability to learn a simple task via backprop, i.e., it can't overfit or learn at all. This simple version has come at ...
asked by Josh.F
3 votes · 1 answer · 75 views
I am trying to implement Nesterov's accelerated gradient descent for an SVM. The objective function I need to minimize is $$\frac{1}{2}\lVert Au - Bv \rVert_2^2$$ with constraints $\sum_{i} u_i = \sum_{j} v_j = 1$ ...
asked by struggleinmath
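Since the constraints place $u$ and $v$ on probability simplices, a projected variant of NAG needs a Euclidean projection onto the simplex after each update; a sketch of the standard sort-based projection (in the style of Duchi et al., 2008):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                 # sort descending
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1 - css) / idx > 0)[0][-1]
    tau = (css[rho] - 1) / (rho + 1)     # threshold to subtract
    return np.maximum(v - tau, 0.0)
```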
1 vote · 0 answers · 51 views
Does anyone know if online linear learning assumes white-noise residuals? My thought process is that serial correlation can arise due to the fact that the fit at time t uses the information ...
asked by sebHan1234
2 votes · 0 answers · 271 views
None of the resources I found online say what to do if the constraints aren't satisfied. If my updated parameter vector is given by $$\theta_{k+1} = \theta_k - (JJ^T + \lambda I)^{-1} J r,$$ where $J$ is the ...
asked by THATS MY QUANT MY QUANTITATIVE
12 votes · 2 answers · 2k views
In expectation maximization, first a lower bound of the likelihood is found, and then a two-step iterative algorithm kicks in where first we try to find the weights (the probability that a data point ...
asked by figs_and_nuts
2 votes · 0 answers · 142 views
When we perform mini-batch GD, we estimate the true gradient $$\nabla L = \frac{1}{N} \sum_i \nabla L_i$$ with $$\nabla_B L = \frac{1}{|B|} \sum_{i \in B} \nabla L_i,$$ where $B$ is the mini-batch. ...
asked by Antonios Sarikas
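If the batch indices are drawn uniformly at random, this estimator is unbiased, and its variance shrinks linearly in the batch size:

```latex
\mathbb{E}[\nabla_B L] = \nabla L, \qquad
\operatorname{Cov}(\nabla_B L) \approx \frac{1}{|B|}\,\Sigma,
```

where $\Sigma$ is the covariance of a single-sample gradient $\nabla L_i$ (the relation is exact when sampling with replacement).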
1 vote · 0 answers · 154 views
Consider the following loss function: loss = torch.where(d > threshold, torch.sqrt(d), 0) * t + torch.where(d <= threshold, (1 - d), 0) * (1 - t) ...
asked by Adel
0 votes · 0 answers · 195 views
Hey, I'm taking a deeper dive into logistic regression, specifically the following loss function with L2 regularization: $$l(w) = \frac{1}{n}\sum_{i=1}^{n} \log\big(1 + \exp(-y_i \cdot x_i^T w)\big) + \frac{\lambda}{2}\lVert w \rVert^2$$ ...
asked by zzz
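The gradient used in gradient descent for this objective (assuming the squared $\ell_2$ penalty reconstructed above) is:

```latex
\nabla l(w) = -\frac{1}{n}\sum_{i=1}^{n}
\frac{y_i\, x_i}{1 + \exp(y_i \cdot x_i^T w)} + \lambda w.
```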
1 vote · 2 answers · 142 views
Based on answers online, it appears that feature scaling helps to (1) ensure balanced step sizes and (2) make the cost function more symmetrical. Why would an imbalanced step size be an issue, since a ...
asked by Michael
1 vote · 0 answers · 61 views
I developed my own NN toolbox, and it seems to work fine. But I am not sure why I get these spikes in my loss during training. I am training on a classification task with 2 inputs and 2 classes, ...
asked by z_tjona
4 votes · 2 answers · 186 views
I am implementing an unconstrained optimization algorithm using gradient descent. I am evaluating a cost function at a given point, evaluating the gradient at this point, and selecting the next ...
asked by Joaquin Rapela
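A common way to select the next point along the negative gradient is backtracking (Armijo) line search; a minimal sketch, assuming NumPy arrays for the iterate and gradient:

```python
def backtracking_step(f, grad_f, x, beta=0.5, c=1e-4, t0=1.0):
    """Shrink the step until the Armijo sufficient-decrease condition holds."""
    g = grad_f(x)
    t = t0
    while f(x - t * g) > f(x) - c * t * (g @ g):   # sufficient decrease?
        t *= beta                                   # shrink the step
    return x - t * g
```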
1 vote · 0 answers · 386 views
I'm trying to understand how to compute the derivative of the Softmax activation function in order to compute the gradient of the quadratic cost function w.r.t. the weights of the last layer. ...
asked by Dario Ranieri
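For reference, with $s_i = e^{z_i} / \sum_k e^{z_k}$, the softmax Jacobian has the compact form:

```latex
\frac{\partial s_i}{\partial z_j} = s_i\,(\delta_{ij} - s_j),
```

where $\delta_{ij}$ is the Kronecker delta; chaining this with $\partial z / \partial W$ gives the last-layer weight gradient.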
2 votes · 0 answers · 92 views
I have been learning about real analysis and recently completed a stats module on regression. We have come across gradient descent, and I was curious: if we have a complicated loss function, ...
asked by ILE2091
0 votes · 0 answers · 16 views
Gradient descent involves significant computational effort, whereas the method of least squares enables direct and accurate calculation. Does gradient descent offer any advantages over least squares ...
asked by Dawid
1 vote · 0 answers · 69 views
I have a data set $\mathbf{X}$, with around 20 predictors, which is a matrix of parameters of a surrogate model. For each observation $i$ of $\mathbf{X}$, the surrogate model was trained to ...
asked by Florent H

Page 1 of 20