Questions tagged [backpropagation]
Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks used in conjunction with an optimization method such as gradient descent.
503 questions
1
vote
1
answer
58
views
Bayes-by-backprop - meaning of partial derivative
The Google Deepmind paper "Weight Uncertainty in Neural Networks" features the following algorithm:
Note that the $\frac{\partial f(w,\theta)}{\partial w}$ term of the gradients for the mean and standard ...
3
votes
2
answers
125
views
Question on RNNs lookback window when unrolling
I will use the answer here as an example: https://stats.stackexchange.com/a/370732/78063
It says "which means that you choose a number of time steps $N$, and unroll your network so that it ...
4
votes
1
answer
89
views
Weight Gradient Dimensions in LSTM Backpropagation
In an LSTM (regression), the output gate is defined as:
$$o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o \right),$$
where: $W_o \in \mathbb{R}^{m \times d}$ is the input weight matrix,
$U_o \in \mathbb{...
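For the entry above, a quick dimension check, assuming $x_t \in \mathbb{R}^{d}$, $h_{t-1} \in \mathbb{R}^{m}$, and using my own symbol $\delta_o \in \mathbb{R}^{m}$ for the gradient of the loss with respect to the gate's pre-activation: each weight gradient is an outer product and has the same shape as the weight it belongs to,
$$\frac{\partial L}{\partial W_o} = \delta_o\, x_t^{\top} \in \mathbb{R}^{m \times d}, \qquad \frac{\partial L}{\partial U_o} = \delta_o\, h_{t-1}^{\top} \in \mathbb{R}^{m \times m}, \qquad \frac{\partial L}{\partial b_o} = \delta_o \in \mathbb{R}^{m}.$$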
0
votes
0
answers
44
views
Confusion on same-sign gradients problem of Sigmoid function
I'm trying to wrap my head around the problem of same-sign gradients when using the sigmoid activation function in a deep neural network. The problem emerges from the fact that sigmoid can only be ...
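For the entry above, a one-line sketch of why strictly positive inputs (such as sigmoid outputs) force same-sign weight gradients: for a single neuron with pre-activation $z = w^{\top} x + b$ and upstream error $\delta = \partial L/\partial z$ (my notation),
$$\frac{\partial L}{\partial w_i} = \delta\, x_i, \qquad x_i > 0 \;\Rightarrow\; \operatorname{sign}\!\left(\frac{\partial L}{\partial w_i}\right) = \operatorname{sign}(\delta) \ \text{for every } i,$$
so within one update all of that neuron's weights can only move in the same direction, which is the zig-zagging behaviour usually cited.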
4
votes
1
answer
66
views
How to prove that Q of the attention mechanism represents the 'search intent'?
It is said that $Q$ represents the "search intent" and $K$ represents the "available information" in the attention mechanism.
$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^...
4
votes
2
answers
205
views
Avoiding tensors when differentiating with respect to weight matrices in backpropagation
Consider a neural network consisting of only a single affine transformation with no non-linearity. Use the following notation:
$\textbf{Inputs}: x \in \mathbb{R}^n$
$\textbf{Weights}: W \in \mathbb{R}...
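For the entry above, assuming $W \in \mathbb{R}^{m \times n}$ and $y = Wx$ with a scalar loss $L$ (my completion of the truncated setup), the usual way to avoid forming any higher-order tensor is to write the result as an outer product and a matrix-vector product directly:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}\, x^{\top} \in \mathbb{R}^{m \times n}, \qquad \frac{\partial L}{\partial x} = W^{\top}\, \frac{\partial L}{\partial y} \in \mathbb{R}^{n},$$
where $\partial L/\partial y \in \mathbb{R}^{m}$; the $m \times n \times m$ Jacobian of $y$ with respect to $W$ never needs to be materialised.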
3
votes
1
answer
124
views
Check through calculations whether the gradients will explode or vanish
I'm reviewing old exam questions and came across this one:
Consider a regular MLP (multi-layer perceptron) architecture with 10 fully connected layers with ReLU activation function. The input to the ...
4
votes
1
answer
413
views
Questions on backpropagation in a neural net
I understand how to apply backpropagation symbolically and calculate the formulas with pen and paper. When it comes to actually using these derivations on data, I have 2 questions:
Suppose certain ...
77
votes
7
answers
86k
views
Why is tanh almost always better than sigmoid as an activation function?
In Andrew Ng's Neural Networks and Deep Learning course on Coursera, he says that using $\tanh$ is almost always preferable to using the sigmoid.
The reason he gives is that the outputs using $\tanh$ ...
0
votes
0
answers
89
views
Analytically solving backpropagation through time for a simple gated RNN
Consider the following simple gated RNN:
\begin{aligned}
c_{t} &= \sigma\bigl(W_{c}\,x_{t} + W_{z}\,z_{t-1}\bigr)
\\[6pt]
z_{t} &= c_{t} \,\odot\, z_{t-1} \;\;+\;\;
(1 - c_{t}) \,\odot\,\...
19
votes
4
answers
7k
views
Simulated annealing for deep learning: Why is gradient-free statistical learning not in the mainstream?
In order to define what deep learning is, the learning portion is often listed with backpropagation as a requirement, without alternatives in the mainstream software libraries and in the literature. ...
1
vote
0
answers
62
views
Do weights update less towards the start of a neural network?
That is, because the error is coming from the end of the neural network (i.e., at the output layer) and trickles back via backpropagation to the start of the neural network, does that mean that the ...
28
votes
4
answers
15k
views
Can $\sin(x)$ be used as activation in deep learning?
$\sin(x)$ seems to be zero-centered, which is a desirable property for activation functions. Even the gradient won't vanish at any point. I am not sure if the oscillating nature of the function or its ...
122
votes
6
answers
56k
views
Is it possible to train a neural network without backpropagation?
Many neural network books and tutorials spend a lot of time on the backpropagation algorithm, which is essentially a tool to compute the gradient.
Let's assume we are building a model with ~10K ...
2
votes
1
answer
735
views
Batch Normalization derivatives
I'm following the derivative calculation in the Batch Norm paper:
Something doesn't seem right. In the 3rd equation, shouldn't we lose the 2nd term, as the sum is equal to 0 ($\mu_B$ is the mean of the $...
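A small identity that the question above hinges on, with $\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$ as the mini-batch mean (standard Batch Norm notation):
$$\sum_{i=1}^{m} \left(x_i - \mu_B\right) = \sum_{i=1}^{m} x_i - m\,\mu_B = m\,\mu_B - m\,\mu_B = 0,$$
so any term in the gradient expressions that carries this full sum as a factor does indeed vanish.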
0
votes
0
answers
54
views
Understanding Backpropagation in Convolutional layer
I need help understanding backpropagation in the convolutional layer.
From what I know so far, the forward phase is as follows:
where the tensor $A_{3\times3\times1}$ refers to the feature map in ...
0
votes
0
answers
55
views
"Inflating" learning rates in diminishing gradient areas for NN training
In neural net training nowadays, tanh and sigmoid activation functions in hidden layers are avoided, as they tend to "saturate" easily; that is, if the x value plugged into tanh/sigmoid is ...
66
votes
5
answers
167k
views
Backpropagation with Softmax / Cross Entropy
I'm trying to understand how backpropagation works for a softmax/cross-entropy output layer.
The cross entropy error function is
$$E(t,o)=-\sum_j t_j \log o_j$$
with $t$ and $o$ as the target and ...
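For the entry above, combining the softmax output $o_j = e^{z_j} / \sum_k e^{z_k}$ with the cross entropy error given there leads to a well-known simplification for the gradient with respect to the logits $z_j$, assuming the targets sum to one (e.g. one-hot):
$$\frac{\partial E}{\partial z_j} = \sum_k \frac{\partial E}{\partial o_k}\,\frac{\partial o_k}{\partial z_j} = \sum_k \left(-\frac{t_k}{o_k}\right) o_k\left(\delta_{kj} - o_j\right) = o_j \sum_k t_k - t_j = o_j - t_j.$$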
20
votes
1
answer
2k
views
Why isn't (symmetric) log(1+x) used as neural network activation function?
Specifically, I mean
$$
f(x)=
\begin{cases}
-\log(1-x) & x \le 0 \\
\phantom{-}\log(1+x) & x \gt 0 \\
\end{cases}
$$
which is shown in red in the plot:
It behaves similarly to widely used $\...
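A minimal numpy sketch of the piecewise function above (the function names symlog and symlog_grad are mine). Its derivative is $1/(1+|x|)$, which decays only polynomially and never saturates to zero the way sigmoid or tanh derivatives do:
import numpy as np

def symlog(x):
    # f(x) = -log(1 - x) for x <= 0 and log(1 + x) for x > 0,
    # which collapses to the closed form sign(x) * log(1 + |x|)
    return np.sign(x) * np.log1p(np.abs(x))

def symlog_grad(x):
    # f'(x) = 1 / (1 - x) for x <= 0 and 1 / (1 + x) for x > 0, i.e. 1 / (1 + |x|)
    return 1.0 / (1.0 + np.abs(x))

x = np.linspace(-5.0, 5.0, 11)
print(symlog(x))
print(symlog_grad(x))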
2
votes
1
answer
1k
views
I am not able to understand how the elementwise multiplication comes into the picture in backpropagation in neural networks
I have understood the backpropagation algorithm along with the chain rule well enough that I can derive it on my own, but I don't understand where the elementwise multiplication comes from and how ...
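For the entry above, a one-step sketch of where the elementwise (Hadamard) product comes from, using my own notation $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$, $a^{(l)} = \sigma(z^{(l)})$, and $\delta^{(l)} = \partial L/\partial z^{(l)}$: the Jacobian of an elementwise activation is diagonal, and multiplying by a diagonal matrix collapses into an elementwise product,
$$\delta^{(l)} = \operatorname{diag}\!\left(\sigma'\!\left(z^{(l)}\right)\right) W^{(l+1)\top}\, \delta^{(l+1)} = \left(W^{(l+1)\top}\, \delta^{(l+1)}\right) \odot \sigma'\!\left(z^{(l)}\right).$$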
1
vote
0
answers
45
views
Batch Normalization and the effect of scaled weights on the gradients
I have been reading the following paper: https://arxiv.org/pdf/1706.05350, and I am having a hard time with some claims and derivations made in the paper.
First of all, the main thing I am interested ...
61
votes
8
answers
69k
views
Danger of setting all initial weights to zero in Backpropagation
Why is it dangerous to initialize weights with zeros? Is there any simple example that demonstrates it?
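A minimal numpy sketch of the symmetry problem behind the question above (layer sizes, the squared-error loss, and all variable names are illustrative): with all-zero weights, every hidden unit computes the same value and receives an identical gradient, so the units never differentiate from one another.
import numpy as np

# tiny 3 -> 4 -> 1 network with sigmoid activations and all weights initialised to zero
W1, b1 = np.zeros((4, 3)), np.zeros(4)
W2, b2 = np.zeros((1, 4)), np.zeros(1)

sig = lambda z: 1.0 / (1.0 + np.exp(-z))
x, y = np.array([0.5, -1.0, 2.0]), 1.0

h = sig(W1 @ x + b1)                        # every hidden activation is identical (0.5)
o = sig(W2 @ h + b2)

# backpropagation for the squared error 0.5 * (o - y)^2
delta_o = (o - y) * o * (1.0 - o)
dW2 = np.outer(delta_o, h)                  # all entries identical: hidden units are interchangeable
delta_h = (W2.T @ delta_o) * h * (1.0 - h)
dW1 = np.outer(delta_h, x)                  # exactly zero: no signal reaches the first layer at all

print(dW2)
print(dW1)
With small random initial weights the per-unit gradients differ, which is what breaks this symmetry.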
8
votes
1
answer
4k
views
Why is ReLU so popular despite being NOT zero-centered
Activation functions should be zero-centered (Reference) and that's why tanh is preferred over sigmoid.
But ReLU is not zero centered and still is often the first choice. I know it solves the issue of ...
6
votes
1
answer
272
views
Does training time increase more if I add a layer at the beginning of a neural network or at the end?
Let's consider a fixed NN architecture, dataset and hardware. We add a layer, either at the beginning or at the end of the NN. In which case will the training time increase more? Intuitively, I ...
0
votes
1
answer
216
views
Is it valid to calculate a transformer neural network loss with respect to one element of a sequence input?
Suppose one sample of my training data consists of a sequence with $n$ elements. My task is to do binary classification on one element in the sequence, and my labels are such that for each sequence in ...
1
vote
0
answers
113
views
Why doesn't Kaiming/He weight Initialization seek a 50/50 compromise for forward and backward pass?
Sorry, please let me know if I'm off, but it seems that He initialization aims to either maintain a constant variance through the forward pass or through the backward pass.
It seems the idea is that, ...
8
votes
3
answers
16k
views
Deriving the Backpropagation Matrix formulas for a Neural Network - Matrix dimensions don't work out
I'm trying to really internalize the way backpropagation works. I made up different networks of increasing complexity and wrote out the formulas for them.
However, I have some difficulties with the matrix ...
4
votes
1
answer
866
views
How does a gradient pass through argmax in classification?
I just realized I have not given this issue much thought. In a classification task, there is an argmax happening after the softmax to get the most likely class. So how does backpropagation go through ...
21
votes
3
answers
29k
views
MNIST digit recognition: what is the best we can get with a fully connected NN only? (no CNN)
To fully understand how it works internally, I'm re-writing a neural network from scratch in Python + numpy only. (As it's for learning purposes, performance is not an issue).
Before moving to ...
1
vote
0
answers
48
views
Calculate gradient with chain rule using additions [closed]
I am taking Karpathy's course, specifically I am on the first video. There is a step in the development of micrograd that I don't fully understand. Specifically in this section, when he talks about ...
0
votes
1
answer
582
views
Does gradient clipping in a RNN help the network learn the long term dependencies?
So this was asked in one of the exams, and I think that gradient clipping does help in learning long-term dependencies in an RNN, but the answer provided to us was "Gradient clipping cannot help with ...
1
vote
1
answer
2k
views
Single input - multiple outputs with different loss functions in Keras: how is the gradient computed?
I've implemented a neural network with single input - multiple outputs using Keras API. The general structure of the network is like in this figure:
Because each branch does a different task, I ...
4
votes
1
answer
651
views
Reparameterization of Poisson Distribution
In deep learning, especially in generative models, we sometimes need to add some random noise to the input of the model. To make the sampling of random noise learnable (or differentiable), we need to ...
13
votes
3
answers
29k
views
Mean Absolute Error (MAE) derivative
$$MAE=|y_\textrm{pred} - y_\textrm{true}|$$
$$\dfrac{\mathrm dMAE}{\mathrm dy_\textrm{pred}} = ?$$
I'm trying to understand how MAE works as a loss function in neural networks using backpropagation. I ...
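For the entry above, treating the MAE of a single prediction as a function of $y_\textrm{pred}$, the derivative is just the sign of the residual; at $y_\textrm{pred} = y_\textrm{true}$ it is undefined, and frameworks typically use a subgradient (commonly 0) there:
$$\dfrac{\mathrm dMAE}{\mathrm dy_\textrm{pred}} = \operatorname{sign}\!\left(y_\textrm{pred} - y_\textrm{true}\right) = \begin{cases} +1 & y_\textrm{pred} > y_\textrm{true} \\ -1 & y_\textrm{pred} < y_\textrm{true} \end{cases}$$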
1
vote
2
answers
447
views
Why is gradient vanishing/exploding bad?
I don't understand why gradient vanishing/exploding is a bad thing.
If the gradient of a parameter is small under gradient descent and backpropagation, it is the power of mathematical rules (the chain rule) that ...
2
votes
0
answers
67
views
pyimagesearch.com backpropagation algorithm: what exactly is the difference with the last two layers?
In one part of this tutorial, you could find the following lines:
for layer in np.arange(len(A) - 2, 0, -1):
Here, we start looping over the layers, backwards, to ...
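A hedged, self-contained sketch of the usual structure around that loop (not necessarily the tutorial's exact code; the sigmoid network and the names W, A, D are illustrative): the delta of the output layer is computed directly from the loss before the loop, which is why the loop starts at len(A) - 2 and only handles the hidden layers.
import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def sigmoid_deriv(a): return a * (1.0 - a)                 # derivative written in terms of the activation

rng = np.random.default_rng(0)
W = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]     # 2 -> 3 -> 1 network
X = rng.normal(size=(4, 2))                                # batch of 4 inputs
y = rng.integers(0, 2, size=(4, 1)).astype(float)

# forward pass, storing every layer's activation in A
A = [X]
for w in W:
    A.append(sigmoid(A[-1] @ w))

# backward pass: the last layer's delta comes straight from the loss ...
error = A[-1] - y
D = [error * sigmoid_deriv(A[-1])]

# ... and the loop only propagates it back through the hidden layers
for layer in np.arange(len(A) - 2, 0, -1):
    delta = D[-1] @ W[layer].T                  # push the delta back through the weights
    delta = delta * sigmoid_deriv(A[layer])     # elementwise activation derivative
    D.append(delta)

D = D[::-1]                                     # D[i] now pairs with activation A[i + 1]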
20
votes
2
answers
13k
views
Why is the second derivative required for Newton's method in backpropagation?
I am troubled by why Newton's method isn't used more widely for backpropagation, instead of, or in addition to, gradient descent.
I have seen this same question, and the widely accepted answer ...
49
votes
1
answer
35k
views
Why are non zero-centered activation functions a problem in backpropagation?
I read here the following:
Sigmoid outputs are not zero-centered. This is undesirable since neurons in later layers of processing in a Neural Network (more on
this soon) would be receiving data ...
2
votes
1
answer
117
views
Calculating derivative for the final layer of a neural network
I'm first learning about backpropagation in neural networks. We're doing stochastic gradient descent.
The lecture provides incomplete detail on computing the derivatives for the final layer.
We have ...
1
vote
0
answers
386
views
Understanding Backpropagation with Softmax and Quadratic Error
I'm trying to understand how to compute the derivative of the Softmax activation function in order to compute the gradient of the quadratic cost function w.r.t. the weights of the last layer.
...
26
votes
1
answer
15k
views
What are the practical uses of Neural ODEs?
"Neural Ordinary Differential Equations", by Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt and David Duvenaud, was awarded the best-paper award in NeurIPS in 2018
There, authors propose the ...
5
votes
1
answer
326
views
Matrix Derivation for Neural Network Formula
I am trying to learn the inner workings of neural networks, but I have some problems with the matrix derivation for backpropagation. On the assumption that the formula for calculating one node in a neural ...
1
vote
0
answers
84
views
Backpropagation in LSTM network [closed]
As we have the vanishing gradient problem in vanilla RNNs and LSTM is the solution, according to some sources LSTM has vanishing gradients too, but it doesn't cause any problem in the context of an LSTM network because ...
0
votes
1
answer
318
views
How to backpropagate transposed convolution?
I'm currently learning Convolutional Neural Networks and am stuck on trying to figure out how to compute gradients in a layer that uses transposed convolution. Also, how do I calculate the gradients ...
1
vote
1
answer
2k
views
Backpropagation with binary cross entropy loss formula
I will classify using a neural network algorithm. I use 2 outputs, $Y_1=1$ (positive) and $Y_2=0$ (negative). The architecture is as follows:
The loss that I use is binary cross-entropy with the following ...
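For the entry above, if the positive-class output is a sigmoid unit $o = \sigma(z)$ and the loss is binary cross-entropy (my symbols $z$, $o$, $y$), the gradient with respect to the pre-activation simplifies to the familiar residual form:
$$L = -\left[y \log o + (1 - y)\log(1 - o)\right], \qquad \frac{\partial L}{\partial z} = \frac{\partial L}{\partial o}\,\sigma'(z) = \frac{o - y}{o\,(1 - o)}\cdot o\,(1 - o) = o - y.$$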
4
votes
2
answers
1k
views
What do they mean by "batch normalization allows weights to be initialized less carefully"?
In Towards Data Science - Manish Chablani - Batch Normalization, it is stated that:
Makes weights easier to initialize — Weight initialization can be
difficult, and it’s even more difficult when ...
37
votes
6
answers
44k
views
Backpropagation vs Genetic Algorithm for Neural Network training
I've read a few papers discussing pros and cons of each method, some arguing that GA doesn't give any improvement in finding the optimal solution while others show that it is more effective. It seems ...
2
votes
1
answer
478
views
Derivative error with respect to bias in binary cross entropy
I will do research using a NN with 1 hidden layer, calculating the loss using binary cross-entropy and using sigmoid for the activation function. I found the derivative formula in Sadowski, 2016 (link: ...
12
votes
2
answers
6k
views
Transfer learning: How and why retrain only final layers of a network?
In this video, at 5:29, Prof. Andrew Ng says regarding transfer learning:
Depending on how much data you have, you might just retrain the new layers of the network, or maybe you could retrain even ...
3
votes
1
answer
356
views
Does VAE backprop start from the decoder all the way to encoder?
In neural networks that start with input layer, run through hidden layers, and ultimately reach the output layer, we start back-propagation from weights closer to output layer and go backward towards ...