I'm working through Dive Into Deep Learning right now and am struggling with the following question:
We can explore the connection between exponential families and the softmax in some more depth.

1. Compute the second derivative of the cross-entropy loss for the softmax.
2. Compute the variance of the distribution given by $\mathrm{softmax}(\mathbf{o})$ and show that it matches the second derivative computed above.
For part 1, I've already calculated the second derivative (with respect to $o_j$) as $\mathrm{softmax}(\mathbf{o})_j(1-\mathrm{softmax}(\mathbf{o})_j)$, where $\mathbf{o} = \mathbf{Wx}+\mathbf{b}$ is the vector of logits, $\mathbf{x}$ is the input vector, and $\mathbf{W}, \mathbf{b}$ are the weights and biases.
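To convince myself of this, I put together a quick numerical check (just a sketch with made-up logits, using central finite differences rather than the book's framework) that the diagonal second derivative of the cross-entropy loss matches $\mathrm{softmax}(\mathbf{o})_j(1-\mathrm{softmax}(\mathbf{o})_j)$:

```python
import numpy as np

def softmax(o):
    # numerically stable softmax
    e = np.exp(o - o.max())
    return e / e.sum()

def cross_entropy(o, y):
    # cross-entropy loss -log softmax(o)_y for the true class index y
    return -np.log(softmax(o)[y])

# made-up logits and true class, only for this check
o = np.array([0.5, -1.2, 2.0])
y = 0
eps = 1e-4

for j in range(len(o)):
    op, om = o.copy(), o.copy()
    op[j] += eps
    om[j] -= eps
    # central finite-difference estimate of the second derivative w.r.t. o_j
    d2 = (cross_entropy(op, y) - 2 * cross_entropy(o, y) + cross_entropy(om, y)) / eps**2
    p_j = softmax(o)[j]
    print(j, d2, p_j * (1 - p_j))  # the two numbers should agree closely
```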
For part 2, I am unsure where to begin. I know $\operatorname{Var}[X] = E[X^2] - E[X]^2$, and $E[X] = \sum_x x f(x)$ or $\int x f(x)\,dx$, where $f(x)$ is the pmf or pdf. But in this case I'm not sure what the random variable $X$ should be, other than the output of the softmax function itself, and I'm not sure how to determine its distribution.
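Just to make the formulas I'm working from concrete, this is how I would compute $E[X]$ and $\operatorname{Var}[X]$ for a discrete distribution whose probabilities come from the softmax; the placeholder values I plug in for $x$ (the class indices) are exactly the part I'm unsure about:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

o = np.array([0.5, -1.2, 2.0])   # made-up logits
p = softmax(o)                   # probabilities f(x) given by the softmax

# placeholder values of the random variable -- this choice is exactly
# what I'm unsure about; here I just use the class indices 0, 1, 2
x = np.arange(len(o))

E_X  = np.sum(x * p)        # E[X]   = sum_x x f(x)
E_X2 = np.sum(x**2 * p)     # E[X^2] = sum_x x^2 f(x)
var  = E_X2 - E_X**2        # Var[X] = E[X^2] - E[X]^2
print(E_X, var)
```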