
I was going through the algorithm for stochastic gradient descent in a multilayer network from the book Machine Learning by Tom Mitchell, which gives the formulas for the weight-update rule. However, I don't understand the various subscripts $i$, $j$, $k$. For example, what does $\Delta w_{ji}$ mean in the final formula? Even though it's stated, it's confusing.

[Image: Stochastic Gradient Descent for Multilayer Networks]
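For readers without the book at hand, the relevant lines, as best I can transcribe them from Table 4.2, are:

$$\delta_k \leftarrow o_k\,(1 - o_k)\,(t_k - o_k) \qquad \text{for each output unit } k,$$
$$\delta_h \leftarrow o_h\,(1 - o_h) \sum_{k \in \text{outputs}} w_{kh}\,\delta_k \qquad \text{for each hidden unit } h,$$
$$w_{ji} \leftarrow w_{ji} + \Delta w_{ji}, \qquad \text{where } \Delta w_{ji} = \eta\,\delta_j\,x_{ji}.$$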

  • Mitchell's book is one of the best books in ML, a masterpiece of ML education. The indices are explained very well: $i$, $j$ and $k$ index the units that the weights connect, with $k$ used for the output units. The $\Delta w$'s follow the standard delta rule. Commented May 11 at 1:32

1 Answer


First of all, if anyone wants to have a look at the wider context, the book is accessible on Tom Mitchell's website: https://www.cs.cmu.edu/~tom/mlbook.html

I think you're right to be confused. There are, I believe, notational mistakes, and I'm surprised that the preceding pages do not include a proper explanation of the forward-propagation equations or of the variables used.

Problems I noticed:

  • It says $x$ is the input vector, but then refers to it with a double index: $x_{ji}$.
  • It uses the variable $\delta$ for both the hidden and output layers, i.e. $\delta_k$, $\delta_h$.
  • It does the same for $w$. According to T4.4, $w_{kh}$ refers to the weights between the hidden and output layers, so its first index $k$ ranges from $1$ to $n_{out}$. But if it's the same $w$ in $w_{ji}$ on the next line, then $j$ must have the same range. That doesn't make sense for $x_{ji}$ (which was probably meant to be $x_i$), although I didn't check. In any case, it looks like $\delta_j$ also ranges from $1$ to $n_{hidden}$ for the hidden layer.

I can probably continue...

So, I suggest you look at another source. As a starting point, the Wikipedia entry is much better. For a more detailed explanation with multiple layers, this one looks good. Although its notation is more complicated, it is correct as far as I can see.
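If it helps to see the indices made concrete, here is a minimal NumPy sketch of one stochastic-gradient step for a single-hidden-layer sigmoid network, following my reading of Table 4.2. The function, the variable names (`w_hidden`, `w_out`), and the learning rate `eta` are my own choices, not the book's:

```python
import numpy as np

def sgd_update(x, t, w_hidden, w_out, eta=0.05):
    """One stochastic-gradient step for a single-hidden-layer sigmoid network.

    x        : input vector, shape (n_in,)               -- Mitchell's x_i
    t        : target vector, shape (n_out,)              -- t_k
    w_hidden : hidden-layer weights, (n_hidden, n_in)     -- w_ji (j = hidden unit, i = input)
    w_out    : output-layer weights, (n_out, n_hidden)    -- w_kh (k = output unit, h = hidden unit)
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # Forward pass: o_h is the output of hidden unit h, o_k of output unit k.
    o_h = sigmoid(w_hidden @ x)      # shape (n_hidden,)
    o_k = sigmoid(w_out @ o_h)       # shape (n_out,)

    # Error terms (the "delta rule"):
    # delta_k = o_k (1 - o_k) (t_k - o_k)            for each output unit k
    delta_k = o_k * (1.0 - o_k) * (t - o_k)
    # delta_h = o_h (1 - o_h) sum_k w_kh delta_k     for each hidden unit h
    delta_h = o_h * (1.0 - o_h) * (w_out.T @ delta_k)

    # Weight updates: Delta w_ji = eta * delta_j * x_ji, where x_ji is the
    # i-th input feeding unit j (the raw input for hidden units, o_h for output units).
    w_out    = w_out    + eta * np.outer(delta_k, o_h)   # Delta w_kh = eta * delta_k * o_h
    w_hidden = w_hidden + eta * np.outer(delta_h, x)     # Delta w_ji = eta * delta_h * x_i
    return w_hidden, w_out
```

Read this way, the $j$ in $\Delta w_{ji} = \eta\,\delta_j\,x_{ji}$ ranges over whichever layer's units you are currently updating, and $x_{ji}$ is the $i$-th input into unit $j$: the raw input vector for hidden units, and the hidden outputs $o_h$ for output units.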

  • Indices and their dimensions are actually consistent. $\delta$ and $w$ are used for both the hidden and output layers. Commented May 11 at 1:28
  • I don't think I'd call that consistency. The usual way of dealing with it is to use superscripts indicating which layer the variables belong to; e.g., what is $\delta_1$? Commented May 11 at 21:03
  • I understood your approach. The confusion might be that Mitchell uses a single hidden layer, the subscripts run over units, and the layers are distinguished by the subscript letters used. There is no inconsistency in his explanation, albeit it is not the most generic; given that the book was written in the '90s and GPUs were not available, a single hidden layer made for a good pedagogical introduction. He has hinted at how to generalise in section 4.5.2.2, though. The book is still a gem. Commented May 12 at 14:25
  • Even in the single-hidden-layer case, using the same variable for both the output and hidden layers is confusing. Besides, I'm still not sure what $x_{ji}$ is, given that $x$ was a vector. Commented May 12 at 20:58
  • It's still a vector. Mitchell uses dummy indices: $x_{ji}$ is the $i$-th input into unit $j$ of the fully-connected network; recall the Einstein summation convention. Commented May 13 at 17:14
