
I was going through the algorithm for stochastic gradient descent in a multilayer network from the book Machine Learning by Tom Mitchell, which gives the formulas for the weight-update rule. However, I don't understand the various subscripts $i$, $j$, $k$. For example, what does $\Delta w_{ji}$ mean in the final formula? Even though it's stated, it's confusing.

[Image: Stochastic Gradient Descent for Multilayer Networks]
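For readers without the book at hand, the relevant lines, as best I can transcribe them from Table 4.2, are:

$$\delta_k \leftarrow o_k\,(1 - o_k)\,(t_k - o_k) \qquad \text{for each output unit } k,$$
$$\delta_h \leftarrow o_h\,(1 - o_h) \sum_{k \in \text{outputs}} w_{kh}\,\delta_k \qquad \text{for each hidden unit } h,$$
$$w_{ji} \leftarrow w_{ji} + \Delta w_{ji}, \qquad \text{where } \Delta w_{ji} = \eta\,\delta_j\,x_{ji}.$$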

  • Mitchell's book is one of the best books in ML, a masterpiece of ML education. The indices are explained very well: $i$, $j$ and $k$ index the units that the weights connect, with $k$ used for the output units. The $\Delta w$'s follow the standard delta rule. Commented May 11 at 1:32

1 Answer


First of all, if anyone wants to have a look at the wider context, the book is accessible on Tom Mitchell's website: https://www.cs.cmu.edu/~tom/mlbook.html

I think you're right to be confused. There are, I believe, notational mistakes, and I'm surprised that the preceding pages do not include a proper explanation of the forward-propagation equations or of the variables used.

Problems I noticed:

  • It says $x$ is the input vector, but then refers to it with a double index: $x_{ji}$.
  • It uses the variable $\delta$ for both the hidden and output layers, i.e. $\delta_k$, $\delta_h$.
  • It does the same for $w$. According to T4.4, $w_{kh}$ refers to the weights between the hidden and output layers, so its first index $k$ ranges from $1$ to $n_{out}$. But if it's the same $w$ in $w_{ji}$ on the next line, then $j$ must have the same range. That doesn't make sense for $x_{ji}$ (which was probably meant to be $x_i$), although I didn't check. In any case, it looks like $\delta_j$ also ranges from $1$ to $n_{hidden}$ for the hidden layer.

I can probably continue...

So, I suggest you look at another source. As a starting point, the Wikipedia entry is much better. For a more detailed explanation with multiple layers, this one looks good. Although its notation is more complicated, it is correct as far as I can see.
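If it helps to see the indices made concrete, here is a minimal NumPy sketch of one stochastic-gradient step for a single-hidden-layer sigmoid network, following my reading of Table 4.2. The function, the variable names (`w_hidden`, `w_out`), and the learning rate `eta` are my own choices, not the book's:

```python
import numpy as np

def sgd_update(x, t, w_hidden, w_out, eta=0.05):
    """One stochastic-gradient step for a single-hidden-layer sigmoid network.

    x        : input vector, shape (n_in,)               -- Mitchell's x_i
    t        : target vector, shape (n_out,)              -- t_k
    w_hidden : hidden-layer weights, (n_hidden, n_in)     -- w_ji (j = hidden unit, i = input)
    w_out    : output-layer weights, (n_out, n_hidden)    -- w_kh (k = output unit, h = hidden unit)
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # Forward pass: o_h is the output of hidden unit h, o_k of output unit k.
    o_h = sigmoid(w_hidden @ x)      # shape (n_hidden,)
    o_k = sigmoid(w_out @ o_h)       # shape (n_out,)

    # Error terms (the "delta rule"):
    # delta_k = o_k (1 - o_k) (t_k - o_k)            for each output unit k
    delta_k = o_k * (1.0 - o_k) * (t - o_k)
    # delta_h = o_h (1 - o_h) sum_k w_kh delta_k     for each hidden unit h
    delta_h = o_h * (1.0 - o_h) * (w_out.T @ delta_k)

    # Weight updates: Delta w_ji = eta * delta_j * x_ji, where x_ji is the
    # i-th input feeding unit j (the raw input for hidden units, o_h for output units).
    w_out    = w_out    + eta * np.outer(delta_k, o_h)   # Delta w_kh = eta * delta_k * o_h
    w_hidden = w_hidden + eta * np.outer(delta_h, x)     # Delta w_ji = eta * delta_h * x_i
    return w_hidden, w_out
```

Read this way, the $j$ in $\Delta w_{ji} = \eta\,\delta_j\,x_{ji}$ ranges over whichever layer's units you are currently updating, and $x_{ji}$ is the $i$-th input into unit $j$: the raw input vector for hidden units, and the hidden outputs $o_h$ for output units.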

  • Indices and their dimensions are actually consistent. $\delta$ and $w$ are used for both the hidden and output layers. Commented May 11 at 1:28
  • I don't think I'd call that consistency. The usual way of dealing with it is to use superscripts indicating which layer the variables belong to; e.g., what is $\delta_1$? Commented May 11 at 21:03
  • I understood your approach. The confusion might be that Mitchell uses a single hidden layer, the subscripts run over units, and the layers are distinguished by the subscript letters used. There is no inconsistency in his explanation, albeit it is not the most generic; given that the book was written in the '90s and GPUs were not available, a single hidden layer made for a good pedagogical introduction. He has hinted at how to generalise in section 4.5.2.2, though. The book is still a gem. Commented May 12 at 14:25
  • Even in the single-hidden-layer case, using the same variable for both the output and hidden layers is confusing. Besides, I'm still not sure what $x_{ji}$ is, given that $x$ was a vector. Commented May 12 at 20:58
  • It's still a vector. Mitchell uses dummy indices: $x_{ji}$ is the $i$-th input into unit $j$ of the fully-connected network; recall the Einstein summation convention. Commented May 13 at 17:14
