Questions tagged [backpropagation]
Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks used in conjunction with an optimization method such as gradient descent.
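A minimal sketch of the idea on a hypothetical 2-3-1 network with sigmoid activations and a squared-error loss (all names and shapes below are illustrative, not tied to any question on this page): the forward pass caches every layer's activations, the backward pass converts the output error into per-layer deltas via the chain rule, and gradient descent applies the resulting gradients.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(a):
    # derivative of the sigmoid, written in terms of its output a = sigmoid(x)
    return a * (1.0 - a)

# Illustrative 2 -> 3 -> 1 network trained on a single example.
rng = np.random.default_rng(0)
W = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
x = np.array([[0.5, -1.0]])
y = np.array([[1.0]])

for _ in range(100):
    # Forward pass: A[l] holds the activations of layer l (A[0] is the input).
    A = [x]
    for Wl in W:
        A.append(sigmoid(A[-1].dot(Wl)))

    # Backward pass ("backward propagation of errors"): the output delta comes from the
    # loss derivative; each earlier delta is obtained from the next one via the chain rule.
    deltas = [(A[-1] - y) * sigmoid_deriv(A[-1])]
    for l in range(len(W) - 1, 0, -1):
        deltas.insert(0, deltas[0].dot(W[l].T) * sigmoid_deriv(A[l]))

    # Gradient-descent step: the gradient for W[l] is the outer product of the
    # incoming activation and that layer's delta.
    W = [Wl - 0.1 * A[l].T.dot(deltas[l]) for l, Wl in enumerate(W)]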
503 questions
1 vote · 1 answer · 58 views
Bayes-by-backprop - meaning of partial derivative
The Google DeepMind paper "Weight Uncertainty in Neural Networks" features the following algorithm:
Note that the $\frac{\partial f(w,\theta)}{\partial w}$ term of the gradients for the mean and standard ...
0 votes · 0 answers · 44 views
Confusion on same-sign gradients problem of Sigmoid function
I'm trying to wrap my head around the problem of same-sign gradients when using the sigmoid activation function in a deep neural network. The problem emerges from the fact that the sigmoid can only be ...
4 votes · 1 answer · 89 views
Weight Gradient Dimensions in LSTM Backpropagation
In an LSTM (regression), the output gate is defined as:
$$o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o \right),$$
where: $W_o \in \mathbb{R}^{m \times d}$ is the input weight matrix,
$U_o \in \mathbb{...
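A short dimension check may help here (assuming $x_t \in \mathbb{R}^d$ and $h_{t-1} \in \mathbb{R}^m$, which the truncated snippet does not state explicitly): writing $a_t = W_o x_t + U_o h_{t-1} + b_o$ for the gate's pre-activation, each weight gradient has the same shape as the weight it belongs to, accumulated over time steps,
$$
\frac{\partial L}{\partial W_o} = \sum_t \frac{\partial L}{\partial a_t}\, x_t^{\top} \in \mathbb{R}^{m \times d},
\qquad
\frac{\partial L}{\partial U_o} = \sum_t \frac{\partial L}{\partial a_t}\, h_{t-1}^{\top} \in \mathbb{R}^{m \times m},
\qquad
\frac{\partial L}{\partial a_t} \in \mathbb{R}^{m}.
$$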
3 votes · 2 answers · 125 views
Question on RNNs lookback window when unrolling
I will use the answer here as an example: https://stats.stackexchange.com/a/370732/78063
It says "which means that you choose a number of time steps $N$, and unroll your network so that it ...
4 votes · 1 answer · 66 views
How to prove that Q of the attention mechanism represents the 'search intent'?
It is said that $Q$ represents "search intent" and $K$ represents the "available information" in the attention mechanism.
$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^...
0 votes · 0 answers · 54 views
Understanding Backpropagation in Convolutional layer
I need help understanding backpropagation in the convolutional layer.
From what I know so far, the forward phase is as follows:
where the tensor $A_{3\times3\times1}$ refers to the feature map in ...
0 votes · 0 answers · 36 views
Score Matching Algorithm
I've been reading about score matching and I have a very basic question about how one would (naively) implement the algorithm via gradient descent.
Say I have some sort of neural network that ...
3 votes · 1 answer · 124 views
Check through calculations whether the gradients will explode or vanish
I'm reviewing old exam questions and came across this one:
Consider a regular MLP (multi-layer perceptron) architecture with 10 fully connected layers with ReLU activation function. The input to the ...
0 votes · 0 answers · 89 views
Analytically solving backpropagation through time for a simple gated RNN
Consider the following simple gated RNN:
\begin{aligned}
c_{t} &= \sigma\bigl(W_{c}\,x_{t} + W_{z}\,z_{t-1}\bigr) \\
z_{t} &= c_{t} \odot z_{t-1} + (1 - c_{t}) \odot \...
1 vote · 0 answers · 62 views
Do weights update less towards the start of a neural network?
That is, because the error is coming from the end of the neural network (i.e., at the output layer) and trickles back via backpropagation to the start of the neural network, does that mean that the ...
1 vote · 0 answers · 45 views
Batch Normalization and the effect of scaled weights on the gradients
I have been reading the following paper: https://arxiv.org/pdf/1706.05350, and I am having a hard time with some claims and derivations made in the paper.
First of all, the main thing I am interested ...
0 votes · 0 answers · 55 views
"Inflating" learning rates in diminishing gradient areas for NN training
In neural net training nowadays, tanh and sigmoid activation functions in hidden layers are avoided as they tend to "saturate" easily, meaning that if the $x$ value plugged into tanh/sigmoid is ...
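For context on "saturation" (a standard fact about the sigmoid, not taken from the question itself): the local derivative is bounded and decays exponentially away from zero,
$$
\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \le \tfrac{1}{4},
\qquad
\sigma'(x) \approx e^{-|x|} \;\text{ for large } |x|,
$$
so once $|x|$ is large the backpropagated error through that unit is nearly zero.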
4 votes · 1 answer · 413 views
Questions on backpropagation in a neural net
I understand how to symbolically apply backpropagation and calculate the formulas with pen and paper. When it comes to actually using these derivations on data, I have 2 questions:
Suppose certain ...
4 votes · 2 answers · 205 views
Avoiding tensors when differentiating with respect to weight matrices in backpropagation
Consider a neural network consisting of only a single affine transformation with no non-linearity. Use the following notation:
$\textbf{Inputs}: x \in \mathbb{R}^n$
$\textbf{Weights}: W \in \mathbb{R}...
1 vote · 0 answers · 48 views
Calculate gradient with chain rule using additions [closed]
I am taking Karpathy's course; specifically, I am on the first video. There is a step in the development of micrograd that I don't fully understand. Specifically, in this section, when he talks about ...
1 vote · 0 answers · 113 views
Why doesn't Kaiming/He weight initialization seek a 50/50 compromise for the forward and backward pass?
Sorry, please let me know if I'm off, but it seems that He initialization aims to either maintain a constant variance through the forward pass or through the backward pass.
It seems the idea is that, ...
2 votes · 0 answers · 67 views
pyimagesearch.com - backpropagation algorithm. What exactly is the difference with the last two layers?
In one part of this tutorial, you can find the following lines:
for layer in np.arange(len(A) - 2, 0, -1):
Here, we start looping over the layers, backwards, to ...
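For readers puzzled by the same indexing, the usual reason the last layer sits outside such a loop (a generic statement about backpropagation, not a claim about the tutorial's exact code) is that its delta is the only one computed directly from the loss, while every other layer's delta is obtained recursively:
$$
\delta^{(L)} = \nabla_{a^{(L)}} C \odot \sigma'\!\bigl(z^{(L)}\bigr),
\qquad
\delta^{(l)} = \bigl(W^{(l+1)\top} \delta^{(l+1)}\bigr) \odot \sigma'\!\bigl(z^{(l)}\bigr) \;\text{ for } l = L-1, \dots, 1,
$$
which is why the loop starts at the second-to-last index and never visits the input layer.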
1 vote · 0 answers · 84 views
Backpropagation in LSTM network [closed]
As we have the vanishing gradient problem in a vanilla RNN and LSTM is the solution, according to some sources the LSTM has vanishing gradients too, but it doesn't cause any problem in the context of an LSTM network because ...
2 votes · 1 answer · 117 views
Calculating derivative for the final layer of a neural network
I'm first learning about backpropagation in neural networks. We're doing stochastic gradient descent.
The lecture provides incomplete detail on computing the derivatives for the final layer.
We have ...
6 votes · 1 answer · 272 views
Does training time increase more if I add a layer at the beginning of a neural network or at the end?
Let's consider a fixed NN architecture, dataset and hardware. We add a layer, either at the beginning or at the end of the NN. In which case the training time will increase more? Intuitively, I ...
1 vote · 0 answers · 40 views
Is the semicolon (;) notation used to indicate that operations are performed concurrently in the backpropagation algorithm by Bengio?
I am trying to understand the backpropagation algorithm in a multi-layer perceptron environment.
Algorithm 6.4 Backward computation for the deep neural network of algorithm 6.3, which uses, in ...
1 vote · 0 answers · 75 views
Does the skipgram model use backpropagation?
I just started to get interested in natural language processing and I was trying to understand the skipgram model from word2vec. I was reading this interesting website. However, in the mentioned ...
0 votes · 0 answers · 58 views
CNN kernels similarity
I know some theory about deep neural networks, CNNs and backpropagation in general. I am fascinated by the power of these technologies. I also try to understand the math aspects.
For example the fact ...
1 vote · 0 answers · 386 views
Understanding Backpropagation with Softmax and Quadratic Error
I'm trying to understand how to compute the derivative of the Softmax activation function in order to compute the gradient of the quadratic cost function w.r.t. the weights of the last layer.
...
4 votes · 1 answer · 651 views
Reparameterization of Poisson Distribution
In deep learning, especially generative models, sometimes we need to add some random noise to the input of the model. To make the sampling of random noise learnable (or differentiable), we need to ...
0 votes · 1 answer · 318 views
How to backpropagate transposed convolution?
I'm currently learning Convolutional Neural Networks and am stuck on trying to figure out how to compute gradients in a layer that uses transposed convolution. Also, how do I calculate the gradients ...
0 votes · 0 answers · 37 views
Accumulate the gradient on all conditions before or after backpropagating
I have been working on coding a simple Neural Network, and I have come across a question that I would like to discuss with you. I am trying to approximate two functions with the Neural Network: $f_1(x)...
1 vote · 0 answers · 152 views
Closed form expression for the gradient of a fully connected neural network with respect to its parameters
A real-valued feedforward/fully connected neural network with activation function $\sigma : x\in \mathbb R \mapsto \max \{0,x\}\equiv \text{ReLU}(x)$ can formally be seen as a function $f_\theta :\...
2 votes · 0 answers · 80 views
Backpropagation of position-wise feedforward neural network
I have read the paper entitled "Attention Is All You Need" by Vaswani et al. (2017). This paper uses the so-called position-wise feedforward neural network, where the input of this network is a ...
0 votes · 0 answers · 112 views
How do you find the gradients of weights and biases in a neural network during backpropagation?
I have been trying to create a neural network from scratch.
I have been trying to calculate the gradients of the weights and biases of the neural network by watching videos and reading papers, but ...
0 votes · 0 answers · 116 views
My loss has a non-differentiable point
I had to design a loss function $\max(0,x)$. It's not differentiable at $x=0$. In order to train it with gradient descent, what should I do?
I have learned that a subgradient can be used instead, so does it ...
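A minimal sketch of the subgradient idea, assuming the loss (or part of it) is an elementwise $\max(0,x)$ (names below are placeholders, not from the question): at $x=0$ any value in $[0,1]$ is a valid subgradient, and in practice one simply fixes a choice such as 0, which is what the indicator-based gradient below does.

import numpy as np

def max0_subgrad(x):
    # Subgradient of max(0, x): 1 where x > 0, 0 where x < 0,
    # and the (arbitrary but valid) choice 0 at the kink x == 0.
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(max0_subgrad(x))  # [0. 0. 1.]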
2 votes · 1 answer · 735 views
Batch Normalization derivatives
I'm following the derivative calculation in the Batch Norm paper:
Something doesn't seem right. In the 3rd equation, shouldn't we lose the 2nd term, as the sum is equal to 0 ($\mu_B$ is the mean of the $...
0 votes · 0 answers · 69 views
How to calculate the gradient manually in backpropagation if the neural network architecture consists of linear, batch norm, leaky ReLU, linear?
What are the steps to manually calculate the backpropagation gradient with the architecture that I mentioned? I'm confused because the backprop architectures I find on Google are different from the ...
2 votes · 1 answer · 266 views
How to express backpropagation dE/dV in matrix form
I'm new to NNs and my math is not that good. I am trying to do manual calculations using an NN model. I already know how to calculate the feedforward and backward passes one by one using the formulas, but when I try ...
1 vote · 0 answers · 28 views
Log base in Cross Entropy Loss [duplicate]
What is the base for the logarithm used in the cross entropy loss (while doing multiclass classification's backpropagation)? Is it e, 2, or 10?
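A short worked note (a general change-of-base fact, not a statement about any specific library's internals): the base only rescales the loss and every gradient by the same constant, since
$$
\log_b(x) = \frac{\ln x}{\ln b}
\quad\Longrightarrow\quad
\frac{\partial}{\partial \theta}\log_b p_\theta = \frac{1}{\ln b}\,\frac{\partial}{\partial \theta}\ln p_\theta,
$$
so switching between $e$, 2, or 10 amounts to multiplying the learning rate by a constant; the natural log ($e$) is the conventional choice.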
4 votes · 1 answer · 866 views
How does a gradient pass through argmax in classification?
I just realized I have not given this issue much thought. In a classification task, there is an argmax happening after the softmax to get the most likely class. So how does backpropagation go through ...
1 vote · 0 answers · 163 views
Deriving vectorized backpropagation
I'm trying to derive vectorized backpropagation from mostly first principles, but I'm having trouble marrying how this paper explains backpropagation with the derivative of a loss function with ...
0 votes · 1 answer · 162 views
Neural network, what is the error value for a layer?
I am new to neural networks. I am studying backpropagation and saw different references. For a layer $k$, some references state that the error $\delta_j^k$ for neuron $j$ at the $k$th layer is
$$
\...
1 vote · 1 answer · 176 views
What is the derivative of a set or a string? [closed]
Neural networks operate on numbers, and it's well known what the derivatives of numeric functions are, as well as what the derivatives of matrix functions are.
What about functions that operate on maps ...
20 votes · 1 answer · 2k views
Why isn't (symmetric) log(1+x) used as a neural network activation function?
Specifically, I mean
$$
f(x)=
\begin{cases}
-\log(1-x) & x \le 0 \\
\hphantom{-}\log(1+x) & x > 0
\end{cases}
$$
Which is red in the plot:
It behaves similarly to widely used $\...
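A small worked derivative of the piecewise definition above makes the comparison with saturating activations concrete:
$$
f'(x) = \frac{1}{1 + |x|},
$$
so the local gradient decays only polynomially in $|x|$ (and equals 1 at $x = 0$), rather than exponentially as for tanh or the sigmoid.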
0 votes · 0 answers · 77 views
RNN's Output Layer: How does it learn from its previous iterations if each Activation Vector is processed in parallel?
First of all, are the activation vectors processed in parallel? If so:
That doesn't make sense, since each previous activation vector feeds into the RNN as input. So if you're processing all activation ...
1 vote · 1 answer · 2k views
Single input - multiple outputs with different loss functions in Keras: how is the gradient computed?
I've implemented a neural network with a single input and multiple outputs using the Keras API. The general structure of the network is like in this figure:
Because each branch does a different task, I ...
1 vote · 1 answer · 2k views
Backpropagation with binary cross entropy loss formula
I will classify using a neural network algorithm. I use 2 outputs, Y1=1 (positive) and Y2=0 (negative). The architecture is as follows:
The loss that I use is binary cross entropy with the following ...
1 vote · 1 answer · 96 views
Can there be 3 initial weights for 2 inputs in a backpropagation network?
I am fairly new to machine learning and neural networks. I was given a scenario where a 2-input, single-unit backpropagation neural network has 3 initial weights. The inputs are x and y. The ...
2 votes · 1 answer · 478 views
Derivative error with respect to bias in binary cross entropy
I will do research using an NN with 1 hidden layer, calculating the loss using binary cross entropy and using the sigmoid activation function. I found the derivative formula in Sadowski, 2016 (link: ...
2 votes · 0 answers · 373 views
Why is Gradient Accumulation not used frequently in training large models compared to using bigger batch sizes?
I was going through various implementations of models and they set a large batch size of around 256; running this in Google Colab is very memory intensive, so I decided to use a batch ...
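A minimal sketch of gradient accumulation under assumed placeholder names (a toy model and synthetic data, not the question's actual code), showing how several small backward passes can stand in for one large batch before a single optimizer step:

import torch
from torch import nn

# Hypothetical toy setup: a linear model and 16 micro-batches of 32 examples each.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
data = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(16)]

accum_steps = 8  # 8 micro-batches of 32 ~ an effective batch size of 256
optimizer.zero_grad()
for step, (xb, yb) in enumerate(data):
    loss = loss_fn(model(xb), yb) / accum_steps  # scale so the accumulated gradient averages over the big batch
    loss.backward()                              # .grad buffers accumulate across backward() calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()                         # one parameter update per effective batch
        optimizer.zero_grad()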
0 votes · 0 answers · 244 views
How is Dropout implemented in the Training Algorithm? [duplicate]
I've searched for an answer to this, and read a number of scientific articles on the subject, but I can't find a practical explanation of how Dropout actually drops nodes in an algorithm.
I've read ...
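A minimal sketch of the usual "inverted dropout" mechanics on one activation matrix (illustrative NumPy only, not any particular library's implementation): a random binary mask zeroes units during training, the survivors are rescaled, and the same mask multiplies the gradients flowing back through that layer.

import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 5))   # hypothetical hidden activations: batch of 4, 5 units
p_keep = 0.8                  # keep probability (i.e., drop rate 0.2)

mask = (rng.random(h.shape) < p_keep) / p_keep  # ~20% zeros, survivors scaled by 1/p_keep
h_train = h * mask            # training-time forward pass: dropped units output 0
# In backprop, upstream gradients are multiplied by the same mask,
# so dropped units receive no gradient on this pass.
h_test = h                    # test time: no mask and no rescaling needed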
1 vote · 0 answers · 46 views
Freezing parts of neural net depending on classification label
I have a classification model that is trained to predict the probability of approve/reject on an object, and we use the output of the final sigmoid as our final score to rank these objects.
These ...
0 votes · 0 answers · 25 views
Differences between Maximum Likelihood Estimation and Backpropagation? [duplicate]
Reading the definition of MLE, it sounds like it is: "Given a likelihood function, estimate the most likely parameters."
When I read that, it sounds like it has the same goal as what ...
1 vote · 2 answers · 447 views
Why is gradient vanishing/exploding bad?
I don't know why gradient vanishing/exploding is a bad thing. If the gradient of a parameter is small under gradient descent and backpropagation, it is the power of mathematical rules (the chain rule) that ...