I am having trouble understanding how the result of categorical cross entropy loss can be used to calculate the gradient for all of the weights.
The output of the cross entropy function is the negative sum of the log likelihoods, each weighted by the corresponding entry of the one-hot encoded vector of the actual (desired) output of the neural net. Once I have this value, I do not understand what I am supposed to do with it.
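Written out, with p being the softmax output and y the one-hot target, what I am computing is

loss = -(y[0]*log(p[0]) + y[1]*log(p[1]) + ... + y[9]*log(p[9]))

and since y has a single 1 in it, this collapses to -log(p[correct class]).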
To calculate the gradient of cross entropy loss, many sources on the internet have told me to use this formula:
// actualOutput is the one-hot encoded vector of desired outputs for the current input
for (int i = 0; i < softmaxOutput.size(); i++)
{
    softmaxOutput[i] -= actualOutput[i];
}
Then I am supposed to take this result and pipe it into the softmax derivative function. This confused me, because it does not involve the result of the loss function at all during backpropagation. Worse, when I implement it this way it does not work. Am I looking at this problem the right way?
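To make the flow I am describing concrete, this is roughly what my backward pass does right now (softmaxDerivative and backward are just stand-ins for my own functions):

// softmaxOutput: the softmax layer's output for one image
// actualOutput:  the one-hot encoded target for that image
vector<double> grad = softmaxOutput;
for (size_t i = 0; i < grad.size(); i++)
{
    grad[i] -= actualOutput[i];  // softmax output minus one-hot target
}
// then, as the sources seem to suggest, I pass this through the softmax derivative
// and hand the result to backpropagation (placeholder names for my own functions)
vector<double> delta = softmaxDerivative(grad);
backward(delta);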
Here is some of my code for the cross entropy loss, written specifically for MNIST:
#include <cmath>
#include <vector>

using std::vector;

class mnist_entropy_loss
{
private:
    double *pred = new double[10];
    // Row i of dist is the one-hot encoded target vector for digit i.
    vector<vector<double>> dist = {{1, 0, 0, 0, 0, 0, 0, 0, 0, 0},
                                   {0, 1, 0, 0, 0, 0, 0, 0, 0, 0},
                                   {0, 0, 1, 0, 0, 0, 0, 0, 0, 0},
                                   {0, 0, 0, 1, 0, 0, 0, 0, 0, 0},
                                   {0, 0, 0, 0, 1, 0, 0, 0, 0, 0},
                                   {0, 0, 0, 0, 0, 1, 0, 0, 0, 0},
                                   {0, 0, 0, 0, 0, 0, 1, 0, 0, 0},
                                   {0, 0, 0, 0, 0, 0, 0, 1, 0, 0},
                                   {0, 0, 0, 0, 0, 0, 0, 0, 1, 0},
                                   {0, 0, 0, 0, 0, 0, 0, 0, 0, 1}};

public:
    mnist_entropy_loss();
    double calculateLoss(double*, char);
    double* calculateGradient(double*, char);
};
double mnist_entropy_loss::calculateLoss(double *input, char label)
{
    // input holds the 10 softmax outputs; label is the digit (0-9) of the current image.
    vector<double> actual = this->dist[label];
    double loss = 0;
    for (int i = 0; i < 10; i++)
    {
        loss += -(actual[i] * std::log(input[i]));
    }
    return loss;
}
double *mnist_entropy_loss::calculateGradient(double *input, char label)
{
    vector<double> actual = this->dist[label];
    for (int i = 0; i < 10; i++)
    {
        // I find it confusing that the loss function output does not seem to have anything
        // to do with this.
        input[i] -= actual[i];
    }
    return input;
}
I am piping the result of calculateGradient into the backward function, roughly as shown below. I am not sure I am approaching this correctly. Many of the articles I have read and videos I have watched show the same derivatives and formulas, but imply different uses for them, and I am confused about the flow of the data at this point in the network.
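For reference, this is roughly how I use the class during training (backward is a stand-in for my network's backpropagation function, and softmaxOut/label are the softmax output and digit label of the current image):

mnist_entropy_loss lossFn;
double loss = lossFn.calculateLoss(softmaxOut, label);       // the loss value I do not know what to do with
double *grad = lossFn.calculateGradient(softmaxOut, label);  // softmax output minus one-hot target
backward(grad);                                              // piped straight into backpropagation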
My network runs properly without cross entropy, so I know that's not the problem. Am I handling the data correctly here? Do you know of any good resources for this?