Questions tagged [backpropagation]
Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks used in conjunction with an optimization method such as gradient descent.
503 questions
1
vote
1
answer
58
views
Bayes-by-backprop - meaning of partial derivative
The Google Deepmind paper "Weight Uncertainty in Neural Networks" features the following algorithm:
Note that the $\frac{\partial f(w,\theta)}{\partial w}$ term of the gradients for the mean and standard ...
3
votes
2
answers
125
views
Question on RNNs lookback window when unrolling
I will use the answer here as an example: https://stats.stackexchange.com/a/370732/78063
It says "which means that you choose a number of time steps $N$, and unroll your network so that it ...
4
votes
1
answer
89
views
Weight Gradient Dimensions in LSTM Backpropagation
In an LSTM (regression), the output gate is defined as:
$$o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o \right),$$
where: $W_o \in \mathbb{R}^{m \times d}$ is the input weight matrix,
$U_o \in \mathbb{...
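For the entry above, a quick dimension check, assuming $x_t \in \mathbb{R}^{d}$, $h_{t-1} \in \mathbb{R}^{m}$, and using my own symbol $\delta_o \in \mathbb{R}^{m}$ for the gradient of the loss with respect to the gate's pre-activation: each weight gradient is an outer product and has the same shape as the weight it belongs to,
$$\frac{\partial L}{\partial W_o} = \delta_o\, x_t^{\top} \in \mathbb{R}^{m \times d}, \qquad \frac{\partial L}{\partial U_o} = \delta_o\, h_{t-1}^{\top} \in \mathbb{R}^{m \times m}, \qquad \frac{\partial L}{\partial b_o} = \delta_o \in \mathbb{R}^{m}.$$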
0
votes
0
answers
44
views
Confusion on same-sign gradients problem of Sigmoid function
I'm trying to wrap my head around the problem of same-sign gradients when using the sigmoid activation function in a deep neural network. The problem emerges from the fact that sigmoid can only be ...
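For the entry above, a one-line sketch of why strictly positive inputs (such as sigmoid outputs) force same-sign weight gradients: for a single neuron with pre-activation $z = w^{\top} x + b$ and upstream error $\delta = \partial L/\partial z$ (my notation),
$$\frac{\partial L}{\partial w_i} = \delta\, x_i, \qquad x_i > 0 \;\Rightarrow\; \operatorname{sign}\!\left(\frac{\partial L}{\partial w_i}\right) = \operatorname{sign}(\delta) \ \text{for every } i,$$
so within one update all of that neuron's weights can only move in the same direction, which is the zig-zagging behaviour usually cited.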
4
votes
1
answer
66
views
How to prove that Q of the attention mechanism represents the 'search intent'?
It is said that $Q$ represents the "search intent" and $K$ represents the "available information" in the attention mechanism.
$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^...
4
votes
2
answers
205
views
Avoiding tensors when differentiating with respect to weight matrices in backpropagation
Consider a neural network consisting of only a single affine transformation with no non-linearity. Use the following notation:
$\textbf{Inputs}: x \in \mathbb{R}^n$
$\textbf{Weights}: W \in \mathbb{R}...
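For the entry above, assuming $W \in \mathbb{R}^{m \times n}$ and $y = Wx$ with a scalar loss $L$ (my completion of the truncated setup), the usual way to avoid forming any higher-order tensor is to write the result as an outer product and a matrix-vector product directly:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}\, x^{\top} \in \mathbb{R}^{m \times n}, \qquad \frac{\partial L}{\partial x} = W^{\top}\, \frac{\partial L}{\partial y} \in \mathbb{R}^{n},$$
where $\partial L/\partial y \in \mathbb{R}^{m}$; the $m \times n \times m$ Jacobian of $y$ with respect to $W$ never needs to be materialised.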
3
votes
1
answer
124
views
Check through calculations whether the gradients will explode or vanish
I'm reviewing old exam questions and came across this one:
Consider a regular MLP (multi-layer perceptron) architecture with 10 fully connected layers with ReLU activation function. The input to the ...
4
votes
1
answer
413
views
Questions on backpropagation in a neural net
I understand how to apply backpropagation symbolically and calculate the formulas with pen and paper. When it comes to actually using these derivations on data, I have 2 questions:
Suppose certain ...
77
votes
7
answers
86k
views
Why is tanh almost always better than sigmoid as an activation function?
In Andrew Ng's Neural Networks and Deep Learning course on Coursera, he says that using $\tanh$ is almost always preferable to using the sigmoid.
The reason he gives is that the outputs using $\tanh$ ...
0
votes
0
answers
89
views
Analytically solving backpropagation through time for a simple gated RNN
Consider the following simple gated RNN:
\begin{aligned}
c_{t} &= \sigma\bigl(W_{c}\,x_{t} + W_{z}\,z_{t-1}\bigr)
\\[6pt]
z_{t} &= c_{t} \,\odot\, z_{t-1} \;\;+\;\;
(1 - c_{t}) \,\odot\,\...
19
votes
4
answers
7k
views
Simulated annealing for deep learning: Why is gradient-free statistical learning not in the mainstream?
In order to define what deep learning is, the learning portion is often listed with backpropagation as a requirement, without alternatives in the mainstream software libraries and in the literature. ...
1
vote
0
answers
62
views
Do weights update less towards the start of a neural network?
That is, because the error is coming from the end of the neural network (i.e., at the output layer) and trickles back via backpropagation to the start of the neural network, does that mean that the ...
28
votes
4
answers
15k
views
Can $\sin(x)$ be used as activation in deep learning?
$\sin(x)$ seems to be zero-centered, which is a desirable property for activation functions. Even the gradient won't vanish at any point. I am not sure if the oscillating nature of the function or its ...
122
votes
6
answers
56k
views
Is it possible to train a neural network without backpropagation?
Many neural network books and tutorials spend a lot of time on the backpropagation algorithm, which is essentially a tool to compute the gradient.
Let's assume we are building a model with ~10K ...
2
votes
1
answer
735
views
Batch Normalization derivatives
I'm following the derivative calculation in the Batch Norm paper:
Something doesn't seem right. In the 3rd equation, shouldn't we lose the 2nd term, as the sum is equal to 0 ($\mu_B$ is the mean of the $...
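A small identity that the question above hinges on, with $\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$ as the mini-batch mean (standard Batch Norm notation):
$$\sum_{i=1}^{m} \left(x_i - \mu_B\right) = \sum_{i=1}^{m} x_i - m\,\mu_B = m\,\mu_B - m\,\mu_B = 0,$$
so any term in the gradient expressions that carries this full sum as a factor does indeed vanish.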
0
votes
0
answers
54
views
Understanding Backpropagation in Convolutional layer
I need help understanding backpropagation in the convolutional layer.
From what I know so far, the forward phase is as follows:
where the tensor $A_{3\times3\times1}$ refers to the feature map in ...
0
votes
0
answers
55
views
"Inflating" learning rates in diminishing gradient areas for NN training
In neural net training nowadays, tanh and sigmoid activation functions in hidden layers are avoided, as they tend to "saturate" easily; that is, if the x value plugged into tanh/sigmoid is ...
66
votes
5
answers
167k
views
Backpropagation with Softmax / Cross Entropy
I'm trying to understand how backpropagation works for a softmax/cross-entropy output layer.
The cross entropy error function is
$$E(t,o)=-\sum_j t_j \log o_j$$
with $t$ and $o$ as the target and ...
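For the entry above, combining the softmax output $o_j = e^{z_j} / \sum_k e^{z_k}$ with the cross entropy error given there leads to a well-known simplification for the gradient with respect to the logits $z_j$, assuming the targets sum to one (e.g. one-hot):
$$\frac{\partial E}{\partial z_j} = \sum_k \frac{\partial E}{\partial o_k}\,\frac{\partial o_k}{\partial z_j} = \sum_k \left(-\frac{t_k}{o_k}\right) o_k\left(\delta_{kj} - o_j\right) = o_j \sum_k t_k - t_j = o_j - t_j.$$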
20
votes
1
answer
2k
views
Why isn't (symmetric) log(1+x) used as neural network activation function?
Specifically, I mean
$$
f(x)=
\begin{cases}
-\log(1-x) & x \le 0 \\
\phantom{-}\log(1+x) & x \gt 0 \\
\end{cases}
$$
which is shown in red in the plot:
It behaves similarly to widely used $\...
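A minimal numpy sketch of the piecewise function above (the function names symlog and symlog_grad are mine). Its derivative is $1/(1+|x|)$, which decays only polynomially and never saturates to zero the way sigmoid or tanh derivatives do:
import numpy as np

def symlog(x):
    # f(x) = -log(1 - x) for x <= 0 and log(1 + x) for x > 0,
    # which collapses to the closed form sign(x) * log(1 + |x|)
    return np.sign(x) * np.log1p(np.abs(x))

def symlog_grad(x):
    # f'(x) = 1 / (1 - x) for x <= 0 and 1 / (1 + x) for x > 0, i.e. 1 / (1 + |x|)
    return 1.0 / (1.0 + np.abs(x))

x = np.linspace(-5.0, 5.0, 11)
print(symlog(x))
print(symlog_grad(x))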
2
votes
1
answer
1k
views
I am not able to understand how the elementwise multiplication comes into the picture in backpropagation in neural networks
I have understood the backpropagation algorithm along with the chain rule well enough that I can derive it on my own, but I don't understand where the elementwise multiplication comes from and how ...
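For the entry above, a one-step sketch of where the elementwise (Hadamard) product comes from, using my own notation $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$, $a^{(l)} = \sigma(z^{(l)})$, and $\delta^{(l)} = \partial L/\partial z^{(l)}$: the Jacobian of an elementwise activation is diagonal, and multiplying by a diagonal matrix collapses into an elementwise product,
$$\delta^{(l)} = \operatorname{diag}\!\left(\sigma'\!\left(z^{(l)}\right)\right) W^{(l+1)\top}\, \delta^{(l+1)} = \left(W^{(l+1)\top}\, \delta^{(l+1)}\right) \odot \sigma'\!\left(z^{(l)}\right).$$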
1
vote
0
answers
45
views
Batch Normalization and the effect of scaled weights on the gradients
I have been reading the following paper: https://arxiv.org/pdf/1706.05350, and I am having a hard time with some claims and derivations made in the paper.
First of all, the main thing I am interested ...
61
votes
8
answers
69k
views
Danger of setting all initial weights to zero in Backpropagation
Why is it dangerous to initialize weights with zeros? Is there any simple example that demonstrates it?
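A minimal numpy sketch of the symmetry problem behind the question above (layer sizes, the squared-error loss, and all variable names are illustrative): with all-zero weights, every hidden unit computes the same value and receives an identical gradient, so the units never differentiate from one another.
import numpy as np

# tiny 3 -> 4 -> 1 network with sigmoid activations and all weights initialised to zero
W1, b1 = np.zeros((4, 3)), np.zeros(4)
W2, b2 = np.zeros((1, 4)), np.zeros(1)

sig = lambda z: 1.0 / (1.0 + np.exp(-z))
x, y = np.array([0.5, -1.0, 2.0]), 1.0

h = sig(W1 @ x + b1)                        # every hidden activation is identical (0.5)
o = sig(W2 @ h + b2)

# backpropagation for the squared error 0.5 * (o - y)^2
delta_o = (o - y) * o * (1.0 - o)
dW2 = np.outer(delta_o, h)                  # all entries identical: hidden units are interchangeable
delta_h = (W2.T @ delta_o) * h * (1.0 - h)
dW1 = np.outer(delta_h, x)                  # exactly zero: no signal reaches the first layer at all

print(dW2)
print(dW1)
With small random initial weights the per-unit gradients differ, which is what breaks this symmetry.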
8
votes
1
answer
4k
views
Why is ReLU so popular despite being NOT zero-centered
Activation functions should be zero-centered (Reference) and that's why tanh is preferred over sigmoid.
But ReLU is not zero centered and still is often the first choice. I know it solves the issue of ...
6
votes
1
answer
272
views
Does training time increase more if I add a layer at the beginning of a neural network or at the end?
Let's consider a fixed NN architecture, dataset and hardware. We add a layer, either at the beginning or at the end of the NN. In which case will the training time increase more? Intuitively, I ...
0
votes
1
answer
216
views
Is it valid to calculate a transformer neural network loss with respect to one element of a sequence input?
Suppose one sample of my training data consists of a sequence with $n$ elements. My task is to do binary classification on one element in the sequence, and my labels are such that for each sequence in ...
1
vote
0
answers
113
views
Why doesn't Kaiming/He weight Initialization seek a 50/50 compromise for forward and backward pass?
Sorry, please let me know if I'm off, but it seems that He initialization aims to either maintain a constant variance through the forward pass or through the backward pass.
It seems the idea is that, ...
8
votes
3
answers
16k
views
Deriving the Backpropagation Matrix formulas for a Neural Network - Matrix dimensions don't work out
I'm trying to really internalize the way backpropagation works. I made up different networks of increasing complexity and wrote out the formulas for them.
However, I have some difficulties with the matrix ...
4
votes
1
answer
866
views
How does a gradient pass through argmax in classification?
I just realized I have not given this issue much thought. In a classification task, there is an argmax happening after the softmax to get the most likely class. So how does backpropagation go through ...
21
votes
3
answers
29k
views
MNIST digit recognition: what is the best we can get with a fully connected NN only? (no CNN)
To fully understand how it works internally, I'm re-writing a neural network from scratch in Python + numpy only. (As it's for learning purposes, performance is not an issue).
Before moving to ...
1
vote
0
answers
48
views
Calculate gradient with chain rule using additions [closed]
I am taking Karpathy's course, specifically I am on the first video. There is a step in the development of micrograd that I don't fully understand. Specifically in this section, when he talks about ...
0
votes
1
answer
582
views
Does gradient clipping in a RNN help the network learn the long term dependencies?
So this was asked in one of the exams, and I think that gradient clipping does help in learning long-term dependencies in an RNN, but the answer provided to us was "Gradient clipping cannot help with ...
1
vote
1
answer
2k
views
Single input - multiple outputs with different loss functions in Keras: how is the gradient computed?
I've implemented a neural network with single input - multiple outputs using Keras API. The general structure of the network is like in this figure:
Because each branch does a different task, I ...
4
votes
1
answer
651
views
Reparameterization of Poisson Distribution
In deep learning, especially in generative models, we sometimes need to add some random noise to the input of the model. To make the sampling of random noise learnable (or differentiable), we need to ...
13
votes
3
answers
29k
views
Mean Absolute Error (MAE) derivative
$$MAE=|y_\textrm{pred} - y_\textrm{true}|$$
$$\dfrac{\mathrm dMAE}{\mathrm dy_\textrm{pred}} = ?$$
I'm trying to understand how MAE works as a loss function in neural networks using backpropagation. I ...
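For the entry above, treating the MAE of a single prediction as a function of $y_\textrm{pred}$, the derivative is just the sign of the residual; at $y_\textrm{pred} = y_\textrm{true}$ it is undefined, and frameworks typically use a subgradient (commonly 0) there:
$$\dfrac{\mathrm dMAE}{\mathrm dy_\textrm{pred}} = \operatorname{sign}\!\left(y_\textrm{pred} - y_\textrm{true}\right) = \begin{cases} +1 & y_\textrm{pred} > y_\textrm{true} \\ -1 & y_\textrm{pred} < y_\textrm{true} \end{cases}$$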
1
vote
2
answers
447
views
Why is gradient vanishing/exploding bad?
I don't understand why gradient vanishing/exploding is a bad thing.
If the gradient of a parameter is small under gradient descent and backpropagation, it is the power of mathematical rules (the chain rule) that ...
2
votes
0
answers
67
views
pyimagesearch.com backpropagation algorithm: what exactly is the difference with the last two layers?
In one part of this tutorial, you could find the following lines:
for layer in np.arange(len(A) - 2, 0, -1):
Here, we start looping over the layers, backwards, to ...
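A hedged, self-contained sketch of the usual structure around that loop (not necessarily the tutorial's exact code; the sigmoid network and the names W, A, D are illustrative): the delta of the output layer is computed directly from the loss before the loop, which is why the loop starts at len(A) - 2 and only handles the hidden layers.
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def sigmoid_deriv(a): return a * (1.0 - a)                 # derivative written in terms of the activation

rng = np.random.default_rng(0)
W = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]     # 2 -> 3 -> 1 network
X = rng.normal(size=(4, 2))                                # batch of 4 inputs
y = rng.integers(0, 2, size=(4, 1)).astype(float)

# forward pass, storing every layer's activation in A
A = [X]
for w in W:
    A.append(sigmoid(A[-1] @ w))

# backward pass: the last layer's delta comes straight from the loss ...
error = A[-1] - y
D = [error * sigmoid_deriv(A[-1])]

# ... and the loop only propagates it back through the hidden layers
for layer in np.arange(len(A) - 2, 0, -1):
    delta = D[-1] @ W[layer].T                  # push the delta back through the weights
    delta = delta * sigmoid_deriv(A[layer])     # elementwise activation derivative
    D.append(delta)

D = D[::-1]                                     # D[i] now pairs with activation A[i + 1]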
20
votes
2
answers
13k
views
Why is the second derivative required for Newton's method in backpropagation?
I am troubled by why Newton's method isn't used more widely for backpropagation, instead of, or in addition to, gradient descent.
I have seen this same question, and the widely accepted answer ...
49
votes
1
answer
35k
views
Why are non zero-centered activation functions a problem in backpropagation?
I read here the following:
Sigmoid outputs are not zero-centered. This is undesirable since neurons in later layers of processing in a Neural Network (more on
this soon) would be receiving data ...
2
votes
1
answer
117
views
Calculating derivative for the final layer of a neural network
I'm first learning about backpropagation in neural networks. We're doing stochastic gradient descent.
The lecture provides incomplete detail on computing the derivatives for the final layer.
We have ...
1
vote
0
answers
386
views
Understanding Backpropagation with Softmax and Quadratic Error
I'm trying to understand how to compute the derivative of the Softmax activation function in order to compute the gradient of the quadratic cost function w.r.t. the weights of the last layer.
...
26
votes
1
answer
15k
views
What are the practical uses of Neural ODEs?
"Neural Ordinary Differential Equations", by Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt and David Duvenaud, was awarded the best-paper award in NeurIPS in 2018
There, authors propose the ...
5
votes
1
answer
326
views
Matrix Derivation for Neural Network Formula
I am trying to learn the inner workings of neural networks, but I have some problems with the matrix derivation for backpropagation. On the assumption that the formula for calculating one node in a neural ...
1
vote
0
answers
84
views
Backpropagation in LSTM network [closed]
As we have the vanishing gradient problem in vanilla RNNs and LSTM is the solution, according to some sources LSTM has vanishing gradients too, but it doesn't cause any problem in the context of an LSTM network because ...
0
votes
1
answer
318
views
How to backpropagate transposed convolution?
I'm currently learning Convolutional Neural Networks and am stuck on trying to figure out how to compute gradients in a layer that uses transposed convolution. Also, how do I calculate the gradients ...
1
vote
1
answer
2k
views
Backpropagation with binary cross entropy loss formula
I will classify using a neural network algorithm. I use 2 outputs, $Y_1=1$ (positive) and $Y_2=0$ (negative). The architecture is as follows:
The loss that I use is binary cross-entropy with the following ...
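For the entry above, if the positive-class output is a sigmoid unit $o = \sigma(z)$ and the loss is binary cross-entropy (my symbols $z$, $o$, $y$), the gradient with respect to the pre-activation simplifies to the familiar residual form:
$$L = -\left[y \log o + (1 - y)\log(1 - o)\right], \qquad \frac{\partial L}{\partial z} = \frac{\partial L}{\partial o}\,\sigma'(z) = \frac{o - y}{o\,(1 - o)}\cdot o\,(1 - o) = o - y.$$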
4
votes
2
answers
1k
views
What do they mean by "batch normalization allows weights to be initialized less carefully"?
In Towards Data Science - Manish Chablani - Batch Normalization, it is stated that:
Makes weights easier to initialize — Weight initialization can be
difficult, and it’s even more difficult when ...
37
votes
6
answers
44k
views
Backpropagation vs Genetic Algorithm for Neural Network training
I've read a few papers discussing pros and cons of each method, some arguing that GA doesn't give any improvement in finding the optimal solution while others show that it is more effective. It seems ...
2
votes
1
answer
478
views
Derivative error with respect to bias in binary cross entropy
I will do research using a NN with 1 hidden layer, calculating the loss using binary cross-entropy and using sigmoid for the activation function. I found the derivative formula in Sadowski, 2016 (link: ...
12
votes
2
answers
6k
views
Transfer learning: How and why retrain only final layers of a network?
In this video, at 5:29, Prof. Andrew Ng says regarding transfer learning:
Depending on how much data you have, you might just retrain the new layers of the network, or maybe you could retrain even ...
3
votes
1
answer
356
views
Does VAE backprop start from the decoder all the way to encoder?
In neural networks that start with input layer, run through hidden layers, and ultimately reach the output layer, we start back-propagation from weights closer to output layer and go backward towards ...