Questions tagged [backpropagation]
Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks used in conjunction with an optimization method such as gradient descent.
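A minimal sketch of the idea on a hypothetical 2-3-1 network with sigmoid activations and a squared-error loss (all names and shapes below are illustrative, not tied to any question on this page): the forward pass caches every layer's activations, the backward pass converts the output error into per-layer deltas via the chain rule, and gradient descent applies the resulting gradients.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(a):
    # derivative of the sigmoid, written in terms of its output a = sigmoid(x)
    return a * (1.0 - a)

# Illustrative 2 -> 3 -> 1 network trained on a single example.
rng = np.random.default_rng(0)
W = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
x = np.array([[0.5, -1.0]])
y = np.array([[1.0]])

for _ in range(100):
    # Forward pass: A[l] holds the activations of layer l (A[0] is the input).
    A = [x]
    for Wl in W:
        A.append(sigmoid(A[-1].dot(Wl)))

    # Backward pass ("backward propagation of errors"): the output delta comes from the
    # loss derivative; each earlier delta is obtained from the next one via the chain rule.
    deltas = [(A[-1] - y) * sigmoid_deriv(A[-1])]
    for l in range(len(W) - 1, 0, -1):
        deltas.insert(0, deltas[0].dot(W[l].T) * sigmoid_deriv(A[l]))

    # Gradient-descent step: the gradient for W[l] is the outer product of the
    # incoming activation and that layer's delta.
    W = [Wl - 0.1 * A[l].T.dot(deltas[l]) for l, Wl in enumerate(W)]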
503 questions
1 vote · 1 answer · 58 views
Bayes-by-backprop - meaning of partial derivative
The Google DeepMind paper "Weight Uncertainty in Neural Networks" features the following algorithm:
Note that the $\frac{\partial f(w,\theta)}{\partial w}$ term of the gradients for the mean and standard ...
0 votes · 0 answers · 44 views
Confusion on same-sign gradients problem of Sigmoid function
I'm trying to wrap my head around the problem of same-sign gradients when using the sigmoid activation function in a deep neural network. The problem emerges from the fact that the sigmoid can only be ...
4 votes · 1 answer · 89 views
Weight Gradient Dimensions in LSTM Backpropagation
In an LSTM (regression), the output gate is defined as:
$$o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o \right),$$
where: $W_o \in \mathbb{R}^{m \times d}$ is the input weight matrix,
$U_o \in \mathbb{...
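A short dimension check may help here (assuming $x_t \in \mathbb{R}^d$ and $h_{t-1} \in \mathbb{R}^m$, which the truncated snippet does not state explicitly): writing $a_t = W_o x_t + U_o h_{t-1} + b_o$ for the gate's pre-activation, each weight gradient has the same shape as the weight it belongs to, accumulated over time steps,
$$
\frac{\partial L}{\partial W_o} = \sum_t \frac{\partial L}{\partial a_t}\, x_t^{\top} \in \mathbb{R}^{m \times d},
\qquad
\frac{\partial L}{\partial U_o} = \sum_t \frac{\partial L}{\partial a_t}\, h_{t-1}^{\top} \in \mathbb{R}^{m \times m},
\qquad
\frac{\partial L}{\partial a_t} \in \mathbb{R}^{m}.
$$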
3 votes · 2 answers · 125 views
Question on RNNs lookback window when unrolling
I will use the answer here as an example: https://stats.stackexchange.com/a/370732/78063
It says "which means that you choose a number of time steps $N$, and unroll your network so that it ...
4 votes · 1 answer · 66 views
How to prove that Q of the attention mechanism represents the 'search intent'?
It is said that $Q$ represents "search intent" and $K$ represents the "available information" in the attention mechanism.
$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^...
0 votes · 0 answers · 54 views
Understanding Backpropagation in Convolutional layer
I need help understanding backpropagation in the convolutional layer.
From what I know so far, the forward phase is as follows:
where the tensor $A_{3\times3\times1}$ refers to the feature map in ...
0 votes · 0 answers · 36 views
Score Matching Algorithm
I've been reading about score matching and I have a very basic question about how one would (naively) implement the algorithm via gradient descent.
Say I have some sort of neural network that ...
3 votes · 1 answer · 124 views
Check through calculations whether the gradients will explode or vanish
I'm reviewing old exam questions and came across this one:
Consider a regular MLP (multi-layer perceptron) architecture with 10 fully connected layers with ReLU activation function. The input to the ...
0 votes · 0 answers · 89 views
Analytically solving backpropagation through time for a simple gated RNN
Consider the following simple gated RNN:
\begin{aligned}
c_{t} &= \sigma\bigl(W_{c}\,x_{t} + W_{z}\,z_{t-1}\bigr) \\
z_{t} &= c_{t} \odot z_{t-1} + (1 - c_{t}) \odot \...
1 vote · 0 answers · 62 views
Do weights update less towards the start of a neural network?
That is, because the error is coming from the end of the neural network (i.e., at the output layer) and trickles back via backpropagation to the start of the neural network, does that mean that the ...
1 vote · 0 answers · 45 views
Batch Normalization and the effect of scaled weights on the gradients
I have been reading the following paper: https://arxiv.org/pdf/1706.05350, and I am having a hard time with some claims and derivations made in the paper.
First of all, the main thing I am interested ...
0 votes · 0 answers · 55 views
"Inflating" learning rates in diminishing gradient areas for NN training
In neural net training nowadays, tanh and sigmoid activation functions in hidden layers are avoided as they tend to "saturate" easily, meaning that if the $x$ value plugged into tanh/sigmoid is ...
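For context on "saturation" (a standard fact about the sigmoid, not taken from the question itself): the local derivative is bounded and decays exponentially away from zero,
$$
\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \le \tfrac{1}{4},
\qquad
\sigma'(x) \approx e^{-|x|} \;\text{ for large } |x|,
$$
so once $|x|$ is large the backpropagated error through that unit is nearly zero.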
4 votes · 1 answer · 413 views
Questions on backpropagation in a neural net
I understand how to symbolically apply backpropagation and calculate the formulas with pen and paper. When it comes to actually using these derivations on data, I have 2 questions:
Suppose certain ...
4 votes · 2 answers · 205 views
Avoiding tensors when differentiating with respect to weight matrices in backpropagation
Consider a neural network consisting of only a single affine transformation with no non-linearity. Use the following notation:
$\textbf{Inputs}: x \in \mathbb{R}^n$
$\textbf{Weights}: W \in \mathbb{R}...
1 vote · 0 answers · 48 views
Calculate gradient with chain rule using additions [closed]
I am taking Karpathy's course; specifically, I am on the first video. There is a step in the development of micrograd that I don't fully understand. Specifically, in this section, when he talks about ...
1 vote · 0 answers · 113 views
Why doesn't Kaiming/He weight initialization seek a 50/50 compromise for the forward and backward pass?
Sorry, please let me know if I'm off, but it seems that He initialization aims to either maintain a constant variance through the forward pass or through the backward pass.
It seems the idea is that, ...
2 votes · 0 answers · 67 views
pyimagesearch.com - backpropagation algorithm. What exactly is the difference with the last two layers?
In one part of this tutorial, you can find the following lines:
for layer in np.arange(len(A) - 2, 0, -1):
Here, we start looping over the layers, backwards, to ...
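For readers puzzled by the same indexing, the usual reason the last layer sits outside such a loop (a generic statement about backpropagation, not a claim about the tutorial's exact code) is that its delta is the only one computed directly from the loss, while every other layer's delta is obtained recursively:
$$
\delta^{(L)} = \nabla_{a^{(L)}} C \odot \sigma'\!\bigl(z^{(L)}\bigr),
\qquad
\delta^{(l)} = \bigl(W^{(l+1)\top} \delta^{(l+1)}\bigr) \odot \sigma'\!\bigl(z^{(l)}\bigr) \;\text{ for } l = L-1, \dots, 1,
$$
which is why the loop starts at the second-to-last index and never visits the input layer.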
1 vote · 0 answers · 84 views
Backpropagation in LSTM network [closed]
As we have the vanishing gradient problem in a vanilla RNN and LSTM is the solution, according to some sources the LSTM has vanishing gradients too, but it doesn't cause any problem in the context of an LSTM network because ...
2 votes · 1 answer · 117 views
Calculating derivative for the final layer of a neural network
I'm first learning about backpropagation in neural networks. We're doing stochastic gradient descent.
The lecture provides incomplete detail on computing the derivatives for the final layer.
We have ...
6 votes · 1 answer · 272 views
Does training time increase more if I add a layer at the beginning of a neural network or at the end?
Let's consider a fixed NN architecture, dataset and hardware. We add a layer, either at the beginning or at the end of the NN. In which case the training time will increase more? Intuitively, I ...
1 vote · 0 answers · 40 views
Is the semicolon (;) notation used to indicate that operations are performed concurrently in the backpropagation algorithm by Bengio?
I am trying to understand the backpropagation algorithm in a multi-layer perceptron environment.
Algorithm 6.4 Backward computation for the deep neural network of algorithm 6.3, which uses, in ...
1 vote · 0 answers · 75 views
Does the skipgram model use backpropagation?
I just started to get interested in natural language processing and I was trying to understand the skipgram model from word2vec. I was reading this interesting website. However, in the mentioned ...
0 votes · 0 answers · 58 views
CNN kernels similarity
I know some theory about deep neural networks, CNNs and backpropagation in general. I am fascinated by the power of these technologies. I also try to understand the math aspects.
For example the fact ...
1 vote · 0 answers · 386 views
Understanding Backpropagation with Softmax and Quadratic Error
I'm trying to understand how to compute the derivative of the Softmax activation function in order to compute the gradient of the quadratic cost function w.r.t. the weights of the last layer.
...
4 votes · 1 answer · 651 views
Reparameterization of Poisson Distribution
In deep learning, especially generative models, sometimes we need to add some random noise to the input of the model. To make the sampling of random noise learnable (or differentiable), we need to ...
0 votes · 1 answer · 318 views
How to backpropagate transposed convolution?
I'm currently learning Convolutional Neural Networks and am stuck on trying to figure out how to compute gradients in a layer that uses transposed convolution. Also, how do I calculate the gradients ...
0 votes · 0 answers · 37 views
Accumulate the gradient on all conditions before or after backpropagating
I have been working on coding a simple Neural Network, and I have come across a question that I would like to discuss with you. I am trying to approximate two functions with the Neural Network: $f_1(x)...
1 vote · 0 answers · 152 views
Closed form expression for the gradient of a fully connected neural network with respect to its parameters
A real-valued feedforward/fully connected neural network with activation function $\sigma : x\in \mathbb R \mapsto \max \{0,x\}\equiv \text{ReLU}(x)$ can formally be seen as a function $f_\theta :\...
2 votes · 0 answers · 80 views
Backpropagation of position-wise feedforward neural network
I have read the paper entitled "Attention Is All You Need" by Vaswani et al. (2017). This paper uses the so-called position-wise feedforward neural network, where the input of this network is a ...
0 votes · 0 answers · 112 views
How do you find the gradients of weights and biases in a neural network during backpropagation?
I have been trying to create a neural network from scratch.
I have been trying to calculate the gradients of the weights and biases of the neural network by watching videos and reading papers, but ...
0 votes · 0 answers · 116 views
My loss has a non-differentiable point
I had to design a loss function $\max(0,x)$. It's not differentiable at $x=0$. In order to train it with gradient descent, what should I do?
I have learned that a subgradient can be used instead, so does it ...
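A minimal sketch of the subgradient idea, assuming the loss (or part of it) is an elementwise $\max(0,x)$ (names below are placeholders, not from the question): at $x=0$ any value in $[0,1]$ is a valid subgradient, and in practice one simply fixes a choice such as 0, which is what the indicator-based gradient below does.

import numpy as np

def max0_subgrad(x):
    # Subgradient of max(0, x): 1 where x > 0, 0 where x < 0,
    # and the (arbitrary but valid) choice 0 at the kink x == 0.
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(max0_subgrad(x))  # [0. 0. 1.]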
2 votes · 1 answer · 735 views
Batch Normalization derivatives
I'm following the derivative calculation in the Batch Norm paper:
Something doesn't seem right. In the 3rd equation, shouldn't we lose the 2nd term, as the sum is equal to 0 ($\mu_B$ is the mean of the $...
0 votes · 0 answers · 69 views
How to calculate the gradient manually in backpropagation if the neural network architecture consists of linear, batch norm, leaky ReLU, linear?
What are the steps to manually calculate the backpropagation gradient with the architecture that I mentioned? I'm confused because the backprop architectures I find on Google are different from the ...
2 votes · 1 answer · 266 views
How to express backpropagation dE/dV in matrix form
I'm new to NNs and my math is not that good. I am trying to do manual calculations using an NN model. I already know how to calculate the feedforward and backward passes one by one using the formulas, but when I try ...
1 vote · 0 answers · 28 views
Log base in Cross Entropy Loss [duplicate]
What is the base for the logarithm used in the cross entropy loss (while doing multiclass classification's backpropagation)? Is it e, 2, or 10?
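A short worked note (a general change-of-base fact, not a statement about any specific library's internals): the base only rescales the loss and every gradient by the same constant, since
$$
\log_b(x) = \frac{\ln x}{\ln b}
\quad\Longrightarrow\quad
\frac{\partial}{\partial \theta}\log_b p_\theta = \frac{1}{\ln b}\,\frac{\partial}{\partial \theta}\ln p_\theta,
$$
so switching between $e$, 2, or 10 amounts to multiplying the learning rate by a constant; the natural log ($e$) is the conventional choice.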
4 votes · 1 answer · 866 views
How does a gradient pass through argmax in classification?
I just realized I have not given this issue much thought. In a classification task, there is an argmax happening after the softmax to get the most likely class. So how does backpropagation go through ...
1 vote · 0 answers · 163 views
Deriving vectorized backpropagation
I'm trying to derive vectorized backpropagation from mostly first principles, but I'm having trouble marrying how this paper explains backpropagation with the derivative of a loss function with ...
0 votes · 1 answer · 162 views
Neural network, what is the error value for a layer?
I am new to neural networks. I am studying backpropagation and saw different references. For a layer $k$, some references state that the error $\delta_j^k$ for neuron $j$ at the $k$th layer is
$$
\...
1 vote · 1 answer · 176 views
What is the derivative of a set or a string? [closed]
Neural networks operate on numbers, and it's well known what the derivatives of numeric functions are, as well as what the derivatives of matrix functions are.
What about functions that operate on maps ...
20 votes · 1 answer · 2k views
Why isn't (symmetric) log(1+x) used as a neural network activation function?
Specifically, I mean
$$
f(x)=
\begin{cases}
-\log(1-x) & x \le 0 \\
\hphantom{-}\log(1+x) & x > 0
\end{cases}
$$
Which is red in the plot:
It behaves similarly to widely used $\...
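A small worked derivative of the piecewise definition above makes the comparison with saturating activations concrete:
$$
f'(x) = \frac{1}{1 + |x|},
$$
so the local gradient decays only polynomially in $|x|$ (and equals 1 at $x = 0$), rather than exponentially as for tanh or the sigmoid.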
0 votes · 0 answers · 77 views
RNN's Output Layer: How does it learn from its previous iterations if each Activation Vector is processed in parallel?
First of all, are the activation vectors processed in parallel? If so:
That doesn't make sense, since each previous activation vector feeds into the RNN as input. So if you're processing all activation ...
1 vote · 1 answer · 2k views
Single input - multiple outputs with different loss functions in Keras: how is the gradient computed?
I've implemented a neural network with a single input and multiple outputs using the Keras API. The general structure of the network is like in this figure:
Because each branch does a different task, I ...
1 vote · 1 answer · 2k views
Backpropagation with binary cross entropy loss formula
I will classify using a neural network algorithm. I use 2 outputs, Y1=1 (positive) and Y2=0 (negative). The architecture is as follows:
The loss that I use is binary cross entropy with the following ...
1 vote · 1 answer · 96 views
Can there be 3 initial weights for 2 inputs in a backpropagation network?
I am fairly new to machine learning and neural networks. I was given a scenario where a 2-input, single-unit backpropagation neural network has 3 initial weights. The inputs are x and y. The ...
2 votes · 1 answer · 478 views
Derivative error with respect to bias in binary cross entropy
I will do research using an NN with 1 hidden layer, calculating the loss using binary cross entropy and using the sigmoid activation function. I found the derivative formula in Sadowski, 2016 (link: ...
2 votes · 0 answers · 373 views
Why is Gradient Accumulation not used frequently in training large models compared to using bigger batch sizes?
I was going through various implementations of models and they set a large batch size of around 256; running this in Google Colab is very memory intensive, so I decided to use a batch ...
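A minimal sketch of gradient accumulation under assumed placeholder names (a toy model and synthetic data, not the question's actual code), showing how several small backward passes can stand in for one large batch before a single optimizer step:

import torch
from torch import nn

# Hypothetical toy setup: a linear model and 16 micro-batches of 32 examples each.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(16)]

accum_steps = 8  # 8 micro-batches of 32 ~ an effective batch size of 256
optimizer.zero_grad()
for step, (xb, yb) in enumerate(data):
    loss = loss_fn(model(xb), yb) / accum_steps  # scale so the accumulated gradient averages over the big batch
    loss.backward()                              # .grad buffers accumulate across backward() calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()                         # one parameter update per effective batch
        optimizer.zero_grad()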
0 votes · 0 answers · 244 views
How is Dropout implemented in the Training Algorithm? [duplicate]
I've searched for an answer to this, and read a number of scientific articles on the subject, but I can't find a practical explanation of how Dropout actually drops nodes in an algorithm.
I've read ...
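A minimal sketch of the usual "inverted dropout" mechanics on one activation matrix (illustrative NumPy only, not any particular library's implementation): a random binary mask zeroes units during training, the survivors are rescaled, and the same mask multiplies the gradients flowing back through that layer.

import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 5))   # hypothetical hidden activations: batch of 4, 5 units
p_keep = 0.8                  # keep probability (i.e., drop rate 0.2)

mask = (rng.random(h.shape) < p_keep) / p_keep  # ~20% zeros, survivors scaled by 1/p_keep
h_train = h * mask            # training-time forward pass: dropped units output 0
# In backprop, upstream gradients are multiplied by the same mask,
# so dropped units receive no gradient on this pass.
h_test = h                    # test time: no mask and no rescaling needed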
1 vote · 0 answers · 46 views
Freezing parts of neural net depending on classification label
I have a classification model that is trained to predict the probability of approve/reject on an object, and we use the output of the final sigmoid as our final score to rank these objects.
These ...
0 votes · 0 answers · 25 views
Differences between Maximum Likelihood Estimation and Backpropagation? [duplicate]
Reading the definition of MLE, it sounds like it is: "Given a likelihood function, estimate the most likely parameters."
When I read that, it sounds like it has the same goal as what ...
1 vote · 2 answers · 447 views
Why is gradient vanishing/exploding bad?
I don't know why gradient vanishing/exploding is a bad thing. If the gradient of a parameter is small under gradient descent and backpropagation, it is the power of mathematical rules (the chain rule) that ...