
Questions tagged [backpropagation]

Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks used in conjunction with an optimization method such as gradient descent.

1 vote
1 answer
58 views

The Google DeepMind paper "Weight Uncertainty in Neural Networks" features the following algorithm: Note that the $\frac{\partial f(w,\theta)}{\partial w}$ term of the gradients for the mean and standard ...
user494234
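The shared $\frac{\partial f(w,\theta)}{\partial w}$ factor comes from the reparameterization $w = \mu + \text{softplus}(\rho)\,\varepsilon$ used in that paper: by the chain rule, $\partial w/\partial \mu = 1$ and $\partial w/\partial \rho = \varepsilon\,\sigma(\rho)$. Below is a minimal NumPy sketch of just that chain-rule step, with a made-up scalar objective $f(w) = w^2$ standing in for the real loss (the paper's full update also adds direct $\partial f/\partial \mu$ and $\partial f/\partial \rho$ terms):

```python
import numpy as np

def softplus(rho):
    return np.log1p(np.exp(rho))

def sigmoid(rho):
    return 1.0 / (1.0 + np.exp(-rho))

mu, rho = 0.3, -1.2
eps = np.random.randn()

w = mu + softplus(rho) * eps           # reparameterized weight sample
df_dw = 2.0 * w                        # toy f(w) = w**2, so df/dw = 2w

grad_mu = df_dw * 1.0                  # dw/dmu = 1
grad_rho = df_dw * eps * sigmoid(rho)  # dw/drho = eps * d softplus(rho)/drho
print(grad_mu, grad_rho)
```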
0 votes
0 answers
44 views

I'm trying to wrap my head around the problem of same-sign gradients when using the sigmoid activation function in a deep neural network. The problem emerges from the fact that sigmoid can only be ...
John • 1
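A small numeric illustration of that same-sign issue (my own toy numbers, not from the question): because sigmoid activations are strictly positive, the gradients of one neuron's incoming weights, $\partial L/\partial w_i = \delta\, a_i$, all inherit the sign of the single upstream error $\delta$.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
a_prev = sigmoid(rng.normal(size=5))   # activations from a sigmoid layer: all in (0, 1)
delta = -0.7                           # upstream error for one neuron in the next layer

grad_w = delta * a_prev                # dL/dw_i = delta * a_i
print(a_prev)                          # all positive
print(grad_w)                          # all negative: same sign as delta
```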
4 votes
1 answer
89 views

In an LSTM (regression), the output gate is defined as: $$o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o \right),$$ where: $W_o \in \mathbb{R}^{m \times d}$ is the input weight matrix, $U_o \in \mathbb{...
Marie • 135
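A minimal NumPy sketch of the output gate exactly as written in the question above, with illustrative sizes m (hidden) and d (input):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

m, d = 4, 3                       # hidden size m, input size d (illustrative only)
rng = np.random.default_rng(0)
W_o = rng.normal(size=(m, d))     # input weights,  W_o in R^{m x d}
U_o = rng.normal(size=(m, m))     # recurrent weights, U_o in R^{m x m}
b_o = np.zeros(m)

x_t = rng.normal(size=d)
h_prev = np.zeros(m)

o_t = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)   # o_t = sigma(W_o x_t + U_o h_{t-1} + b_o)
print(o_t.shape)                                # (m,)
```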
3 votes
2 answers
125 views

I will use the answer here as an example: https://stats.stackexchange.com/a/370732/78063 It says "which means that you choose a number of time steps $N$, and unroll your network so that it ...
Baron Yugovich
4 votes
1 answer
66 views

It is said that $Q$ represents the "search intent" and $K$ represents the "available information" in the attention mechanism. $\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^...
Gihan • 77
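For reference, a minimal NumPy sketch of the scaled dot-product attention in that question, $\text{softmax}(QK^\top/\sqrt{d_k})V$, with made-up shapes; the comments map Q and K onto the "search intent" / "available information" reading:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 8, 8                  # sequence length and head dimensions (illustrative)
Q = rng.normal(size=(n, d_k))          # "search intent": what each position is looking for
K = rng.normal(size=(n, d_k))          # "available information": what each position offers
V = rng.normal(size=(n, d_v))          # the content that actually gets mixed together

scores = Q @ K.T / np.sqrt(d_k)        # compatibility of every query with every key
weights = softmax(scores, axis=-1)     # each row sums to 1
out = weights @ V
print(out.shape)                       # (n, d_v)
```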
0 votes
0 answers
54 views

I need help understanding backpropagation in the convolutional layer. From what I know so far, the forward phase is as follows: where the tensor $A_{3\times3\times1}$ refers to the feature map in ...
Theethawat Trakunweerayut
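A toy, loop-based sketch of backprop through a single-channel 3×3 convolution (my own example: valid cross-correlation, stride 1, no padding): once the forward pass is written as an explicit sum, the kernel and input gradients are just the chain rule applied over the same windows.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 5))      # input
W = rng.normal(size=(3, 3))      # kernel
H = X.shape[0] - W.shape[0] + 1  # output size (3 here), i.e. a 3x3x1 feature map

# Forward: Y[i, j] = sum_{u, v} X[i+u, j+v] * W[u, v]
Y = np.zeros((H, H))
for i in range(H):
    for j in range(H):
        Y[i, j] = np.sum(X[i:i+3, j:j+3] * W)

dY = rng.normal(size=(H, H))     # pretend upstream gradient dL/dY

# Backward: each Y[i, j] touches X[i:i+3, j:j+3] and all of W,
# so the chain rule just redistributes dY over the same windows.
dW = np.zeros_like(W)
dX = np.zeros_like(X)
for i in range(H):
    for j in range(H):
        dW += dY[i, j] * X[i:i+3, j:j+3]
        dX[i:i+3, j:j+3] += dY[i, j] * W

print(dW.shape, dX.shape)
```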
0 votes
0 answers
36 views

I've been reading about score matching and I have a very basic question about how one would (naively) implement the algorithm via gradient descent. Say I have some sort of neural network that ...
Vasting • 155
3 votes
1 answer
124 views

I'm reviewing old exam questions and came across this one: Consider a regular MLP (multi-layer perceptron) architecture with 10 fully connected layers with ReLU activation function. The input to the ...
Aleksander Wojsz
0 votes
0 answers
89 views

Consider the following simple gated RNN: \begin{aligned} c_{t} &= \sigma\bigl(W_{c}\,x_{t} + W_{z}\,z_{t-1}\bigr) \\[6pt] z_{t} &= c_{t} \,\odot\, z_{t-1} \;\;+\;\; (1 - c_{t}) \,\odot\,\...
kuzzooroo • 181
1 vote
0 answers
62 views

That is, because the error is coming from the end of the neural network (ie at the output layer) and trickles back via backpropagation to the start of the neural network, does that mean that the ...
Null Six
1 vote
0 answers
45 views

I have been reading the following paper: https://arxiv.org/pdf/1706.05350, and I am having a hard time with some claims and derivations made in the paper. First of all, the main thing I am interested ...
kklaw • 554
0 votes
0 answers
55 views

In neural net training, nowadays tanh and sigmoid activation functions in hidden layers are avoided as they tend to "saturate" easily. Meaning, if the x value plugged into tanh/sigmoid is ...
omrii • 101
4 votes
1 answer
413 views

I understand how to symbolically apply backpropagation and calculate the formulas with pen and paper. When it comes to actually using these derivations on data, I have 2 questions: Suppose certain ...
Baron Yugovich
4 votes
2 answers
205 views

Consider a neural network consisting of only a single affine transformation with no non-linearity. Use the following notation: $\textbf{Inputs}: x \in \mathbb{R}^n$ $\textbf{Weights}: W \in \mathbb{R}...
kuzzooroo • 181
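For the affine-only network above, the standard results are $\partial L/\partial W = g\,x^\top$, $\partial L/\partial b = g$, and $\partial L/\partial x = W^\top g$, where $g = \partial L/\partial y$ is the upstream gradient. A short NumPy sketch with a finite-difference check:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
x = rng.normal(size=n)
W = rng.normal(size=(m, n))
b = rng.normal(size=m)

y = W @ x + b
L = 0.5 * np.sum(y ** 2)          # toy loss, so dL/dy = y
g = y

dW = np.outer(g, x)               # dL/dW = g x^T
db = g                            # dL/db = g
dx = W.T @ g                      # dL/dx = W^T g

# Finite-difference check on one weight entry
eps = 1e-6
W2 = W.copy(); W2[0, 0] += eps
L2 = 0.5 * np.sum((W2 @ x + b) ** 2)
print(dW[0, 0], (L2 - L) / eps)   # should agree closely
```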
1 vote
0 answers
48 views

I am taking Karpathy's course; I am on the first video. There is a step in the development of micrograd that I don't fully understand, specifically in this section, when he talks about ...
Guillermo Álvarez
1 vote
0 answers
113 views

Sorry, please let me know if I'm off, but it seems that He initialization aims to either maintain a constant variance through the forward pass or through the backward pass. It seems the idea is that, ...
riley • 11
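A small empirical sketch of the forward-pass reading of He initialization (my own toy setup): drawing weights with variance 2/fan_in keeps the second moment of ReLU activations roughly constant across layers; using 2/fan_out instead targets the backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 512
a = rng.standard_normal((1000, fan_in))       # unit-variance input batch

for _ in range(10):                           # 10 ReLU layers
    W = rng.standard_normal((fan_in, fan_in)) * np.sqrt(2.0 / fan_in)  # He: Var(w) = 2/fan_in
    a = np.maximum(0.0, a @ W)
    print(round(float((a ** 2).mean()), 3))   # stays roughly near 1 instead of decaying to 0
```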
2 votes
0 answers
67 views

In one part of this tutorial, you can find the following line: for layer in np.arange(len(A) - 2, 0, -1): Here, we start looping over the layers, backwards, to ...
Anthony • 41
1 vote
0 answers
84 views

We have the vanishing gradient problem in a vanilla RNN, and LSTM is the solution. According to some sources, LSTM has vanishing gradients too, but they don't cause any problem in the context of an LSTM network because ...
Kasra • 11
2 votes
1 answer
117 views

I'm first learning about backpropagation in neural networks. We're doing stochastic gradient descent. The lecture provides incomplete detail on computing the derivatives for the final layer. We have ...
Ben G • 153
6 votes
1 answer
272 views

Let's consider a fixed NN architecture, dataset and hardware. We add a layer, either at the beginning or at the end of the NN. In which case the training time will increase more? Intuitively, I ...
DeltaIV • 18.6k
1 vote
0 answers
40 views

I am trying to understand the backpropagation algorithm in a multi-layer perceptron environment. Algorithm 6.4 Backward computation for the deep neural network of algorithm 6.3, which uses, in ...
Revolucion for Monica
1 vote
0 answers
75 views

I just started to get interested in natural language processing and I was trying to understand the skipgram model from word2vec. I was reading this interesting website. However, in the mentioned ...
edamondo • 111
0 votes
0 answers
58 views

I know some theory about deep neural networks, CNNs, and backpropagation in general. I am fascinated by the power of these technologies. I am also trying to understand the math aspects. For example, the fact ...
Luca • 1
1 vote
0 answers
386 views

I'm trying to understand how to compute the derivative of the Softmax activation function in order to compute the gradient of the quadratic cost function w.r.t. the weights of the last layer. ...
Dario Ranieri
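For the softmax-derivative question above: the Jacobian is $\partial s_i/\partial z_j = s_i(\delta_{ij} - s_j)$, and with the quadratic cost $C = \tfrac{1}{2}\|s - t\|^2$ the gradient w.r.t. the logits is $J^\top (s - t)$. A sketch with a numerical check:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0])    # logits of the last layer
t = np.array([0.0, 0.0, 1.0])     # target
s = softmax(z)

# Jacobian of softmax: J[i, j] = s_i * (delta_ij - s_j)
J = np.diag(s) - np.outer(s, s)
grad_z = J.T @ (s - t)            # dC/dz for C = 0.5 * ||s - t||^2

# Finite-difference check on the first logit
eps = 1e-6
zp = z.copy(); zp[0] += eps
num = (0.5 * np.sum((softmax(zp) - t) ** 2) - 0.5 * np.sum((s - t) ** 2)) / eps
print(grad_z[0], num)
```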
4 votes
1 answer
651 views

In deep learning, especially generative models, sometimes we need to add some random noise to the input of model. To make the sampling of random noise learnable (or differentiable), we need to ...
Lorin60 • 91
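The usual Gaussian case of that reparameterization: instead of sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ directly (which is not differentiable w.r.t. $\mu$ and $\sigma$), sample $\varepsilon \sim \mathcal{N}(0, 1)$ and set $z = \mu + \sigma\varepsilon$, so the noise is constant w.r.t. the parameters and gradients flow to $\mu$ and $\sigma$. A minimal sketch (toy loss, not from any specific model):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.4

eps = rng.standard_normal()       # noise sampled outside the computation graph
z = mu + sigma * eps              # differentiable in mu and sigma

# Toy loss L(z) = z^2, so dL/dz = 2z and the chain rule gives:
dL_dz = 2.0 * z
dL_dmu = dL_dz * 1.0              # dz/dmu = 1
dL_dsigma = dL_dz * eps           # dz/dsigma = eps
print(dL_dmu, dL_dsigma)
```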
0 votes
1 answer
318 views

I'm currently learning Convolutional Neural Networks and am stuck on trying to figure out how to compute gradients in a layer that uses transposed convolution. Also, how do I calculate the gradients ...
Jakob • 1
0 votes
0 answers
37 views

I have been working on coding a simple Neural Network, and I have come across a question that I would like to discuss with you. I am trying to approximate two functions with the Neural Network: $f_1(x)...
user362981
1 vote
0 answers
152 views

A real-valued feedforward/fully connected neural network with activation function $\sigma : x\in \mathbb R \mapsto \max \{0,x\}\equiv \text{ReLU}(x)$ can formally be seen as a function $f_\theta :\...
Stratos supports the strike
2 votes
0 answers
80 views

I have read a paper entitled "Attention is all you need" by Vaswani et al. (2017). This paper uses the so-called position-wise feedforward neural network, where the input of this network is a ...
poglhar • 51
0 votes
0 answers
112 views

I have been trying to create a neural network from scratch. I have been trying to calculate the gradients of the weights and biases of the neural network by watching videos and reading papers, but ...
ManOnTheMoon
0 votes
0 answers
116 views

I had to design a loss function $\max(0, x)$. It's not differentiable at $x=0$. In order to train it with gradient descent, what should I do? I have learned that a subgradient can be used instead, so does it ...
firstforst
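For the $\max(0, x)$ question above: a subgradient does work; at $x = 0$ any value in $[0, 1]$ is a valid subgradient, and the common convention (the same one used for ReLU) is to pick 0. A tiny sketch:

```python
import numpy as np

def hinge_like(x):
    return np.maximum(0.0, x)

def hinge_like_grad(x):
    # Derivative is 0 for x < 0 and 1 for x > 0; at x == 0 we pick the subgradient 0.
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(hinge_like(x))       # [0. 0. 3.]
print(hinge_like_grad(x))  # [0. 0. 1.]
```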
2 votes
1 answer
735 views

I'm following the derivative calculation in the Batch Norm paper: Something doesn't seem right. In the third equation, shouldn't we lose the second term, as the sum is equal to 0 ($\mu_B$ is the mean of the $...
Maverick Meerkat
0 votes
0 answers
69 views

What are the steps to manually calculate the backpropagation gradient with the architecture that I mentioned? I'm confused because the architecture I find on Google regarding backprop is different from the ...
Octa Dion
2 votes
1 answer
266 views

I'm new to NNs and my math is not that good. I'm trying to do the calculations of an NN model by hand. I already know how to calculate the feedforward and backward passes one by one using the formulas, but when I try ...
user382927
1 vote
0 answers
28 views

What is the base for the logarithm used in the cross entropy loss (while doing multiclass classification's backpropagation)? Is it e, 2, or 10?
Sachin • 111
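In practice the natural logarithm (base e) is used, and the choice only rescales the loss and its gradients by the constant factor $1/\ln b$, which the learning rate can absorb. A quick numeric check:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])       # predicted class probabilities
y = np.array([1, 0, 0])             # one-hot target

ce_e = -np.sum(y * np.log(p))       # natural log (the framework default)
ce_2 = -np.sum(y * np.log2(p))      # base 2 just rescales the same quantity
print(ce_e, ce_2 * np.log(2))       # identical: base change is a constant rescaling
```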
4 votes
1 answer
866 views

I just realized I have not given this issue much thought. In a classification task, there is an argmax happening after the softmax to get the most likely class. So how does backpropagation go through ...
Sam • 413
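Backpropagation never goes through the argmax: during training the loss is cross-entropy on the softmax probabilities (whose gradient w.r.t. the logits is $p - y$), and the argmax is applied only at inference, where no gradient is needed. A sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, -0.5, 1.0])    # logits
y = np.array([0.0, 0.0, 1.0])     # one-hot label

p = softmax(z)
loss = -np.sum(y * np.log(p))     # cross-entropy; argmax is never used here
grad_z = p - y                    # gradient of the loss w.r.t. the logits

pred = np.argmax(p)               # argmax only at inference, no gradient needed
print(loss, grad_z, pred)
```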
1 vote
0 answers
163 views

I'm trying to derive vectorized backpropagation from mostly first principles, but I'm having trouble marrying how this paper explains backpropagation with the derivative of a loss function with ...
Nick Righi
0 votes
1 answer
162 views

I am new to neural networks. I am studying backpropagation and saw different references. For a layer $k$, some references state that the error $\delta_j^k$ for neuron $j$ at the $k$th layer is $$ \...
M a m a D • 157
1 vote
1 answer
176 views

Neural networks operate on numbers, and it's well known what the derivatives of numeric functions are, as well as what the derivatives of matrix functions are. What about functions that operate on maps ...
Wilhelm • 111
20 votes
1 answer
2k views

Specifically, I mean $$ f(x)= \begin{cases} -\log(1-x) & x \le 0 \\ \log(1+x) & x \gt 0 \\ \end{cases} $$ which is shown in red in the plot. It behaves similarly to the widely used $\...
yuri kilochek
0 votes
0 answers
77 views

First of all, are the activation vectors processed in parallel? If so, that doesn't make sense, since each previous activation vector feeds into the RNN as input. So if you're processing all activation ...
Katsu • 1,051
1 vote
1 answer
2k views

I've implemented a neural network with single input - multiple outputs using Keras API. The general structure of the network is like in this figure: Because each branch does a different task, I ...
Elise Le
1 vote
1 answer
2k views

I will classify using a neural network algorithm. I use 2 outputs, $Y_1=1$ (positive) and $Y_2=0$ (negative). The architecture is as follows: the loss that I use is binary cross-entropy with the following ...
Andryan • 47
1 vote
1 answer
96 views

I am fairly new to machine learning and neural networks. I was given a scenario where a 2-input, single-unit backpropagation neural network has 3 initial weights. The inputs are x and y. The ...
Tragend • 11
2 votes
1 answer
478 views

I will do research using a NN with 1 hidden layer, calculating the loss using binary cross-entropy and using the sigmoid activation function. I found the derivative formula in Sadowski, 2016 (link: ...
Andryan • 47
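For a sigmoid output with binary cross-entropy, the derivative w.r.t. the pre-activation collapses to $\hat{y} - y$, which is the identity such derivations typically arrive at. A numeric check (my own sketch, not Sadowski's notation):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def bce(z, y):
    yhat = sigmoid(z)
    return -(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

z, y = 0.8, 1.0
analytic = sigmoid(z) - y          # dL/dz = yhat - y

eps = 1e-6
numeric = (bce(z + eps, y) - bce(z, y)) / eps
print(analytic, numeric)           # should match closely
```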
2 votes
0 answers
373 views

I was going through various implementations of models, and they set a large batch size of around 256. Running this in Google Colab is very memory intensive, so I decided to use a batch ...
mutli-arm-bandit
0 votes
0 answers
244 views

I've searched for an answer to this, and read a number of scientific articles on the subject, but I can't find a practical explanation of how Dropout actually drops nodes in an algorithm. I've read ...
Connor • 677
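A practical sketch of how dropout "drops" nodes (the common inverted-dropout variant, not from any particular article): sample a binary mask each forward pass, zero out and rescale the activations, and reuse the same mask in the backward pass so dropped units receive no gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5

a = rng.normal(size=6)                                   # activations of a hidden layer
mask = (rng.random(a.shape) >= p_drop) / (1 - p_drop)    # inverted dropout mask

a_dropped = a * mask                 # forward: dropped units are exactly zero
upstream = rng.normal(size=6)        # pretend upstream gradient dL/da_dropped
grad_a = upstream * mask             # backward: dropped units get zero gradient

print(mask)
print(a_dropped)
print(grad_a)
```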
1 vote
0 answers
46 views

I have a classification model that is trained to predict the probability of approve/reject on an object, and we use the output of the final sigmoid as our final score to rank these objects. These ...
mobius • 11
0 votes
0 answers
25 views

Reading the definition of MLE, it sounds like it is: "Given a likelihood function, estimate the most likely parameters." When I read that, it sounds like it has the same goal as what ...
Katsu • 1,051
1 vote
2 answers
447 views

I don't know why vanishing/exploding gradients are a bad thing. If the gradient of a parameter is small under gradient descent and backpropagation, it is the power of the rules of mathematics (the chain rule) that ...
benjaminchanming
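A toy illustration of why this is a problem: by the chain rule, the gradient reaching early layers is a product of many per-layer factors, so it shrinks or grows exponentially with depth; either the early layers stop learning, or the updates blow up and training destabilizes.

```python
import numpy as np

depth = 50
shrink = np.prod(np.full(depth, 0.9))   # per-layer factor slightly below 1
blowup = np.prod(np.full(depth, 1.1))   # per-layer factor slightly above 1

print(shrink)   # ~0.005: early layers receive almost no learning signal
print(blowup)   # ~117:   updates become huge and training diverges
```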
