
Questions tagged [backpropagation]

Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks used in conjunction with an optimization method such as gradient descent.

1 vote
1 answer
58 views

The Google DeepMind paper "Weight Uncertainty in Neural Networks" features the following algorithm: Note that the $\frac{\partial f(w,\theta)}{\partial w}$ term of the gradients for the mean and standard ...
user494234
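
A minimal numpy sketch of the reparameterised weight sample this question refers to, assuming the paper's parameterisation $w = \mu + \log(1+e^{\rho})\cdot\epsilon$; the update expressions below follow the paper's algorithm as I read it, and the upstream gradients are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.zeros(5)          # variational mean
rho = np.full(5, -3.0)    # parameterises sigma = log(1 + exp(rho)) > 0

eps = rng.standard_normal(5)          # epsilon ~ N(0, I)
sigma = np.log1p(np.exp(rho))
w = mu + sigma * eps                  # reparameterised weight sample

# Suppose backprop through f(w, theta) has produced df_dw, df_dmu, df_drho.
df_dw, df_dmu, df_drho = rng.standard_normal((3, 5))  # placeholder upstream gradients

grad_mu = df_dw + df_dmu                                   # mean gradient
grad_rho = df_dw * (eps / (1.0 + np.exp(-rho))) + df_drho  # std (rho) gradient
```
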
3 votes
2 answers
125 views

I will use the answer here as an example: https://stats.stackexchange.com/a/370732/78063 It says "which means that you choose a number of time steps $N$, and unroll your network so that it ...
Baron Yugovich
4 votes
1 answer
89 views

In an LSTM (regression), the output gate is defined as: $$o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o \right),$$ where: $W_o \in \mathbb{R}^{m \times d}$ is the input weight matrix, $U_o \in \mathbb{...
Marie • 135
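
For concreteness, a minimal numpy sketch of the output-gate computation defined above, assuming hidden size $m$ and input size $d$ (sizes here are arbitrary; the shapes follow the question's notation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, d = 4, 3                         # hidden size, input size (toy values)
rng = np.random.default_rng(0)
W_o = rng.standard_normal((m, d))   # input weight matrix
U_o = rng.standard_normal((m, m))   # recurrent weight matrix
b_o = np.zeros(m)

x_t = rng.standard_normal(d)
h_prev = np.zeros(m)                # h_{t-1}

o_t = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)   # output gate, entries in (0, 1)
```
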
0 votes
0 answers
44 views

I'm trying to wrap my head around the problem of same-sign gradients when using sigmoid activation function in a deep neural network. The problem emerges from the fact that sigmoid can only be ...
John • 1
4 votes
1 answer
66 views

It is said that $Q$ represents the "search intent" and $K$ represents the "available information" in the attention mechanism. $\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^...
Gihan • 77
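
A minimal numpy sketch of scaled dot-product attention, the formula the excerpt starts to quote (the scaling by $\sqrt{d_k}$ and the shapes follow the standard definition):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # queries ("search intent") against keys ("available information")
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 8))    # 2 queries, d_k = 8
K = rng.standard_normal((5, 8))    # 5 keys
V = rng.standard_normal((5, 16))   # 5 values, d_v = 16
out = attention(Q, K, V)           # shape (2, 16)
```
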
4 votes
2 answers
205 views

Consider a neural network consisting of only a single affine transformation with no non-linearity. Use the following notation: $\textbf{Inputs}: x \in \mathbb{R}^n$ $\textbf{Weights}: W \in \mathbb{R}...
kuzzooroo • 181
3 votes
1 answer
124 views

I'm reviewing old exam questions and came across this one: Consider a regular MLP (multi-layer perceptron) architecture with 10 fully connected layers with ReLU activation function. The input to the ...
Aleksander Wojsz
4 votes
1 answer
413 views

I understand how to symbolically apply back propagation, calculate the formulas with pen and paper. When it comes to actually using these derivations on data, I have 2 questions: Suppose certain ...
Baron Yugovich
77 votes
7 answers
86k views

In Andrew Ng's Neural Networks and Deep Learning course on Coursera he says that using $\tanh$ is almost always preferable to using the sigmoid. The reason he gives is that the outputs using $\tanh$ ...
Tom Hale • 2,671
0 votes
0 answers
89 views

Consider the following simple gated RNN: \begin{aligned} c_{t} &= \sigma\bigl(W_{c}\,x_{t} + W_{z}\,z_{t-1}\bigr) \\[6pt] z_{t} &= c_{t} \,\odot\, z_{t-1} \;\;+\;\; (1 - c_{t}) \,\odot\,\...
kuzzooroo • 181
19 votes
4 answers
7k views

In order to define what deep learning is, the learning portion is often listed with backpropagation as a requirement without alternatives in the mainstream software libraries and in the literature. ...
patagonicus • 2,789
1 vote
0 answers
62 views

That is, because the error is coming from the end of the neural network (ie at the output layer) and trickles back via backpropagation to the start of the neural network, does that mean that the ...
Null Six
28 votes
4 answers
15k views

$\sin(x)$ seems to be zero-centered, which is a desirable property for activation functions. Even the gradient won't vanish at any point. I am not sure if the oscillating nature of the function or its ...
Biswadip Mandal
122 votes
6 answers
56k views

Many neural network books and tutorials spend a lot of time on the backpropagation algorithm, which is essentially a tool to compute the gradient. Let's assume we are building a model with ~10K ...
HXD • 37.8k
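
One concrete way to see why the algorithm gets so much attention: estimating the same gradient by central finite differences for a model with $P \approx 10{,}000$ parameters costs about $2P$ loss evaluations per update, whereas reverse-mode backpropagation costs one forward pass plus one backward pass of comparable cost:

$$\frac{\partial L}{\partial \theta_i} \approx \frac{L(\theta + \varepsilon e_i) - L(\theta - \varepsilon e_i)}{2\varepsilon}, \quad i = 1,\dots,P \;\;\Rightarrow\;\; 2P \approx 20{,}000 \text{ evaluations per step, versus a handful of passes with backprop.}$$
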
2 votes
1 answer
735 views

I'm following the derivative calculation of the Batch Norm paper: Something doesn't seem right. In the 3rd equation shouldn't we lose the 2nd term as the sum is equal to 0 ($\mu_B$ is the mean of the $...
Maverick Meerkat
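
The identity the question appeals to is that deviations from the batch mean sum to zero, which is why a term containing that sum can be dropped:

$$\sum_{i=1}^{m}\bigl(x_i - \mu_B\bigr) = \sum_{i=1}^{m} x_i - m\,\mu_B = 0, \qquad \mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i.$$
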
0 votes
0 answers
54 views

I need help understanding backpropagation in the convolutional layer. From what I know so far, the forward phase is as follows: where, the tensor $A_{3\times3\times1}$ refers to the feature map in ...
Theethawat Trakunweerayut
0 votes
0 answers
55 views

In neural net training, nowadays tanh and sigmoid activation functions in hidden layers are avoided as they tend to "saturate" easily. Meaning, if the x value plugged into tanh/sigmoid is ...
omrii • 101
66 votes
5 answers
167k views

I'm trying to understand how backpropagation works for a softmax/cross-entropy output layer. The cross entropy error function is $$E(t,o)=-\sum_j t_j \log o_j$$ with $t$ and $o$ as the target and ...
micha • 763
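
A minimal numpy sketch of the standard result for this setup: with softmax outputs $o = \mathrm{softmax}(z)$ and a one-hot target $t$, the gradient of the cross-entropy w.r.t. the pre-softmax logits simplifies to $o - t$ (checked here against a finite-difference estimate):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, t):
    return -np.sum(t * np.log(softmax(z)))

z = np.array([1.0, -0.5, 2.0])          # logits
t = np.array([0.0, 0.0, 1.0])           # one-hot target

grad_analytic = softmax(z) - t          # dE/dz = o - t

# central finite-difference check
eps = 1e-6
grad_numeric = np.array([
    (cross_entropy(z + eps * e, t) - cross_entropy(z - eps * e, t)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(grad_analytic, grad_numeric, atol=1e-6)
```
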
20 votes
1 answer
2k views

Specifically, I mean $$ f(x)= \begin{cases} -\log(1-x) & x \le 0 \\ \space \space \space \log(1+x) & x \gt 0 \\ \end{cases} $$ Which is red in the plot: It behaves similarly to widely used $\...
yuri kilochek
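
A small numpy sketch of the proposed activation and its derivative; both branches differentiate to $1/(1+|x|)$, so the gradient is nonzero everywhere and decays only polynomially:

```python
import numpy as np

def sym_log(x):
    # Equivalent to the piecewise definition: -log(1 - x) for x <= 0, log(1 + x) for x > 0
    return np.sign(x) * np.log1p(np.abs(x))

def sym_log_grad(x):
    # both branches reduce to 1 / (1 + |x|)
    return 1.0 / (1.0 + np.abs(x))

x = np.linspace(-10.0, 10.0, 5)
print(sym_log(x), sym_log_grad(x))
```
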
2 votes
1 answer
1k views

I have understood the backpropagation algorithm along with the chain rule well enough that I can derive it on my own, but I don't understand where the elementwise multiplication came from and how does ...
Circuit_Breaker0.7
1 vote
0 answers
45 views

I have been reading the following paper: https://arxiv.org/pdf/1706.05350, and I am having a hard time with some claims and derivations made in the paper. First of all, the main thing I am interested ...
kklaw • 554
61 votes
8 answers
69k views

Why is it dangerous to initialize weights with zeros? Is there any simple example that demonstrates it?
user8078 • 713
8 votes
1 answer
4k views

Activation functions should be zero-centered (Reference) and that's why tanh is preferred over sigmoid. But ReLU is not zero centered and still is often the first choice. I know it solves the issue of ...
snowdenassange
6 votes
1 answer
272 views

Let's consider a fixed NN architecture, dataset and hardware. We add a layer, either at the beginning or at the end of the NN. In which case the training time will increase more? Intuitively, I ...
DeltaIV • 18.6k
0 votes
1 answer
216 views

Suppose one sample of my training data consists of a sequence with $n$ elements. My task is to do binary classification on one element in the sequence, and my labels are such that for each sequence in ...
Jude Wells
1 vote
0 answers
113 views

Sorry, please let me know if I'm off, but it seems that He initialization aims to either maintain a constant variance through the forward pass or through the backward pass. It seems the idea is that, ...
riley • 11
8 votes
3 answers
16k views

I try to really internalize the way backpropagation works. I made up different networks with increasing complexity and wrote the formulas to it. However, I have some difficulties with the matrix ...
Yves Boutellier
4 votes
1 answer
866 views

I just realized I have not given this issue much thought. In a classification task, there is an argmax happening after the softmax to get the most likely class. So how does backpropagation go through ...
Sam • 413
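
A short sketch of the usual resolution: the argmax is applied only at prediction time and is never differentiated; during training the loss is taken on the softmax probabilities themselves, so backpropagation flows through softmax and cross-entropy, not through argmax:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])
target = np.array([1.0, 0.0, 0.0])       # one-hot label

# training: loss and gradient use the probabilities, not the argmax
probs = softmax(logits)
loss = -np.sum(target * np.log(probs))   # cross-entropy
dlogits = probs - target                 # gradient w.r.t. logits; argmax never appears

# inference: argmax is just a read-out of the trained scores
predicted_class = int(np.argmax(probs))
```
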
21 votes
3 answers
29k views

To fully understand how it works internally, I'm re-writing a neural network from scratch in Python + numpy only. (As it's for learning purposes, performance is not an issue). Before moving to ...
Basj • 632
1 vote
0 answers
48 views

I am taking Karpathy's course, specifically I am on the first video. There is a step in the development of micrograd that I don't fully understand. Specifically in this section, when he talks about ...
Guillermo Álvarez
0 votes
1 answer
582 views

So this was asked in one of the exams and I think that gradient clipping does help in learning long term dependencies in RNN but the answer provided to us was "Gradient clipping cannot help with ...
thisisbhavin
1 vote
1 answer
2k views

I've implemented a neural network with single input - multiple outputs using Keras API. The general structure of the network is like in this figure: Because each branch does a different task, I ...
Elise Le
4 votes
1 answer
651 views

In deep learning, especially generative models, sometimes we need to add some random noise to the input of model. To make the sampling of random noise learnable (or differentiable), we need to ...
Lorin60 • 91
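
A minimal numpy sketch of the reparameterisation trick the excerpt alludes to: instead of sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ directly (which is not differentiable w.r.t. $\mu$ and $\sigma$), sample $\epsilon \sim \mathcal{N}(0,1)$ and compute $z = \mu + \sigma\,\epsilon$, so gradients can flow into $\mu$ and $\sigma$; the upstream gradient below is a placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.5, -1.0])            # learnable mean (toy values)
log_sigma = np.array([-0.5, 0.2])     # learnable log std, keeps sigma > 0
sigma = np.exp(log_sigma)

eps = rng.standard_normal(2)          # noise is sampled outside the computation graph
z = mu + sigma * eps                  # differentiable in mu and sigma

# for some downstream gradient dL/dz, the chain rule gives:
dL_dz = np.array([0.3, -0.7])         # placeholder upstream gradient
dL_dmu = dL_dz * 1.0                  # dz/dmu = 1
dL_dlog_sigma = dL_dz * eps * sigma   # dz/dsigma = eps, dsigma/dlog_sigma = sigma
```
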
13 votes
3 answers
29k views

$$MAE=|y_\textrm{pred} - y_\textrm{true}|$$ $$\dfrac{\mathrm dMAE}{\mathrm dy_\textrm{pred}} = ?$$ I'm trying to understand how MAE works as a loss function in neural networks using backpropagation. I ...
tea_pea • 746
1 vote
2 answers
447 views

I don't understand why vanishing/exploding gradients are a bad thing. If the gradient of a parameter computed by backpropagation for gradient descent is small, it is the power of mathematical rules (the chain rule) that ...
benjaminchanming
2 votes
0 answers
67 views

In one part of this tutorial, you could find the following lines: for layer in np.arange(len(A) - 2, 0, -1): Here, we start looping over the layers, backwards, to ...
Anthony • 41
20 votes
2 answers
13k views

I am troubled by why Newton's method isn't used for backpropagation more widely, instead of, or in addition to, gradient descent. I have seen this same question, and the widely accepted answer ...
Gulzar • 633
49 votes
1 answer
35k views

I read here the following: Sigmoid outputs are not zero-centered. This is undesirable since neurons in later layers of processing in a Neural Network (more on this soon) would be receiving data ...
Amelio Vazquez-Reina
2 votes
1 answer
117 views

I'm first learning about backpropagation in neural networks. We're doing stochastic gradient descent. The lecture provides incomplete detail on computing the derivatives for the final layer. We have ...
Ben G • 153
1 vote
0 answers
386 views

I'm trying to understand how to compute the derivative of the Softmax activation function in order to compute the gradient of the quadratic cost function w.r.t. the weights of the last layer. ...
Dario Ranieri
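
The Jacobian the question needs is the standard one for the softmax $o_i = e^{z_i}/\sum_k e^{z_k}$; chaining it with the quadratic cost then gives the gradient w.r.t. the last layer's pre-activations:

$$\frac{\partial o_i}{\partial z_j} = o_i\bigl(\delta_{ij} - o_j\bigr), \qquad \frac{\partial C}{\partial z_j} = \sum_i (o_i - t_i)\,o_i\bigl(\delta_{ij} - o_j\bigr) \quad\text{for}\quad C = \tfrac{1}{2}\sum_i (o_i - t_i)^2.$$
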
26 votes
1 answer
15k views

"Neural Ordinary Differential Equations", by Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt and David Duvenaud, was awarded the best-paper award in NeurIPS in 2018 There, authors propose the ...
Firebug • 20.5k
5 votes
1 answer
326 views

I am learning the inner workings of neural networks, but I have a problem with the matrix derivation for backpropagation. On the assumption that the formula for calculating one node in a neural ...
Hoang Nam • 151
1 vote
0 answers
84 views

We have the vanishing gradient problem in vanilla RNNs, and LSTM is the usual solution. According to some sources, LSTMs have vanishing gradients too, but they don't cause any problem in the context of an LSTM network because ...
Kasra • 11
0 votes
1 answer
318 views

I'm currently learning Convolutional Neural Networks and am stuck on trying to figure out how to compute gradients in a layer that uses transposed convolution. Also, how do I calculate the gradients ...
Jakob • 1
1 vote
1 answer
2k views

I will classify using a neural network algorithm. I use 2 outputs, $Y_1=1$ (positive) and $Y_2=0$ (negative). The architecture is as follows: the loss that I use is binary cross-entropy with the following ...
Andryan • 47
4 votes
2 answers
1k views

In Towards Data Science - Manish Chablani - Batch Normalization, it is stated that: Makes weights easier to initialize — Weight initialization can be difficult, and it’s even more difficult when ...
Mas A • 273
37 votes
6 answers
44k views

I've read a few papers discussing pros and cons of each method, some arguing that GA doesn't give any improvement in finding the optimal solution while others show that it is more effective. It seems ...
sashkello • 2,274
2 votes
1 answer
478 views

I will do research using a NN with 1 hidden layer. I calculate the loss using binary cross-entropy, with sigmoid as the activation function. I found the derivative formula from Sadowski, 2016 (link: ...
Andryan • 47
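
For context, the standard result for this combination: with a sigmoid output $\hat{y} = \sigma(z)$ and binary cross-entropy loss, the derivative with respect to the pre-activation collapses to a very simple expression:

$$L = -\bigl[y\log\hat{y} + (1-y)\log(1-\hat{y})\bigr], \qquad \frac{\partial L}{\partial z} = \hat{y} - y.$$
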
12 votes
2 answers
6k views

In this video, at 5:29, Prof. Andrew Ng says regarding transfer learning: Depending on how much data you have, you might just retrain the new layers of the network, or maybe you could retrain even ...
Tom Hale • 2,671
3 votes
1 answer
356 views

In neural networks that start with the input layer, run through the hidden layers, and ultimately reach the output layer, we start back-propagation from the weights closer to the output layer and go backward towards ...
Curious • 481
