Questions tagged [cross-entropy]
A measure of the difference between two probability distributions for a given random variable or set of events.
268 questions
2
votes
1
answer
276
views
Maximum Likelihood, Cross-Entropy, and Conditional Empirical Distributions for Conditional Models
I came across this article: “MSE is Cross Entropy at Heart: Maximum Likelihood Estimation Explained”, which states:
"When training a neural network, we are trying to find the parameters of a ...
3
votes
1
answer
107
views
What is the meaning of the Variance of Cross-Entropy Loss?
Usually we use the average cross-entropy loss over all test examples as a metric; can we also use the variance of the cross-entropy loss as a metric?
0
votes
0
answers
50
views
Optimal Importance Sampling
Suppose we want to estimate
$$r = \mathbb{E}_{x\sim p(x)} [f(x)]$$ via importance sampling, i.e.
$$r = \mathbb{E}_{x\sim q(x)} \left[\frac{f(x)p(x)}{q(x)}\right]$$
Now Wikipedia says that ...
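The estimator in the excerpt can be checked numerically. Below is a minimal sketch with my own assumed choices of $p$, $q$, and $f$ (a standard normal target, a wider normal proposal, and $f(x)=x^2$), none of which come from the question itself:
```python
# Minimal importance-sampling sketch: r = E_{x~p}[f(x)] estimated with draws from q.
# The choices of p, q, and f below are illustrative assumptions, not from the question.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x ** 2

def p_pdf(x):  # target density: standard normal N(0, 1)
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

def q_pdf(x):  # proposal density: wider normal N(0, 2^2)
    return np.exp(-0.5 * (x / 2) ** 2) / (2 * np.sqrt(2 * np.pi))

n = 100_000
x_q = rng.normal(0.0, 2.0, size=n)        # samples from the proposal q
w = p_pdf(x_q) / q_pdf(x_q)               # importance weights p(x)/q(x)
r_hat = np.mean(f(x_q) * w)               # E_q[f(x) p(x)/q(x)] approximates E_p[f(x)]

print(r_hat)                              # close to E_p[x^2] = 1 for these choices
```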
4
votes
2
answers
665
views
Relationship between AUC and Cross-entropy
I understand that AUC measures the model's ability to rank the subjects (see Why is ROC AUC equivalent to the probability that two randomly-selected samples are correctly ranked?).
In contrast, binary ...
2
votes
1
answer
205
views
Why is it ok NOT to use soft labels in classification? [duplicate]
I have, in some sense, the opposite question to Is it okay to use cross entropy loss function with soft labels?, namely: why is it OK NOT to use soft labels in classification?
Let's say you have a ...
0
votes
0
answers
66
views
How do we derive the standard cross entropy loss from negative log likelihood in a supervised (conditional) learning setting?
I know that when optimizing neural networks (supervised), minimizing the cross-entropy loss is equivalent to minimizing the negative log-likelihood, which is equivalent to MLE, but I can't get all the math together.
I am trying to ...
1
vote
0
answers
29
views
Why are we using KL divergence over cross entropy? [duplicate]
I read this question
Why do we use Kullback-Leibler divergence rather than cross entropy in the t-SNE objective function?
and I cannot fully understand the answer.
If we're using KL divergence for the ...
4
votes
1
answer
530
views
Link between Cross-entropy and MLE
There are numerous material that show the relationship between MLE and cross-entropy.
Typically, these are the steps taken to show the relationship for an i.i.d. data-generating process $D = (X,Y)$:
$$
...
2
votes
1
answer
188
views
Which form of cross-entropy loss is correct?
For classification problems with more than two classes, I've seen these two forms of cross-entropy loss:
$-\sum_k y_k \log(a_k)$
$-\sum_k \left[ y_k \log(a_k) + (1-y_k) \log(1-a_k) \right]$
Here $y_k$ are the true ...
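As a numeric illustration of the difference between the two forms (my own toy numbers, not from the question): the first form only looks at the true class, while the second also penalizes probability mass placed on the other classes.
```python
# Toy comparison of the two losses in the question; the example values are assumptions.
import numpy as np

y = np.array([0.0, 1.0, 0.0])        # one-hot true label (class 2 of 3)
a = np.array([0.2, 0.7, 0.1])        # predicted probabilities (softmax-style, sum to 1)

categorical_ce = -np.sum(y * np.log(a))                              # first form
summed_binary_ce = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))  # second form

print(categorical_ce)    # -log(0.7): only the true-class probability matters
print(summed_binary_ce)  # adds -log(1 - a_k) terms for the wrong classes
```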
2
votes
0
answers
212
views
Why is cross-entropy increasing with accuracy? [closed]
I'm implementing softmax regression and I'm struggling to understand why the value of the cross-entropy $H(y, p)=-\sum_{i=1}^C y_i \log(p_i)$ is increasing, ...
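For reference, here is a minimal sketch of the quantity in the excerpt, a softmax followed by $H(y,p)=-\sum_i y_i \log(p_i)$; the logits and label are made-up values, not the asker's data:
```python
# Softmax probabilities followed by the cross-entropy from the excerpt (toy values).
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])
y = np.array([1.0, 0.0, 0.0])  # one-hot label for class 0

p = softmax(logits)
ce = -np.sum(y * np.log(p))    # reduces to -log(p[0]) for a one-hot y

print(p, ce)
```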
2
votes
1
answer
1k
views
Understanding shannon entropy and computation with scipy.stats.entropy
I am trying to understand Shannon entropy better. By definition, the Shannon entropy is calculated as H = -sum(pk * log(pk)).
I am using the scipy.stats.entropy function and I am running the ...
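A minimal check of that definition against scipy.stats.entropy (the example distribution is mine; note that scipy uses the natural logarithm unless base is given, and normalizes pk):
```python
# H = -sum(pk * log(pk)) by hand versus scipy.stats.entropy; pk is an assumed example.
import numpy as np
from scipy.stats import entropy

pk = np.array([0.5, 0.25, 0.25])

h_manual = -np.sum(pk * np.log(pk))
h_scipy = entropy(pk)              # natural log by default; pk is normalized internally

print(h_manual, h_scipy)           # both ≈ 1.04 nats
print(entropy(pk, base=2))         # ≈ 1.5 bits with base-2 logs
```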
0
votes
1
answer
334
views
How (or can) you formulate the Fisher information matrix in terms of a loss function, specifically cross-entropy loss?
I recently saw the following formulation of the Fisher information matrix in a paper on Transformer pruning:
$$
\mathcal{I} := \frac{1}{|D|} \sum_{(x,y) \in D} \left( \frac{\partial \mathcal{L}(x,y;1)}...
0
votes
0
answers
97
views
Concentration Inequality for cross-entropy
I am currently trying to estimate the cross-entropy between two distributions with densities $p$ and $q$.
$$
\ell = -\mathbb{E}_{x\sim p(x) }[\log q(x)]
$$
I am using a Monte-Carlo estimate:
$$
\hat{\...
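A minimal sketch of such a Monte-Carlo estimate, assuming for illustration that $p$ and $q$ are one-dimensional Gaussians (the question's truncated estimator presumably averages $-\log q(x_i)$ over samples $x_i \sim p$):
```python
# Monte-Carlo estimate of the cross-entropy -E_{x~p}[log q(x)] by averaging -log q(x_i)
# over samples x_i ~ p. The Gaussian choices for p and q are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

p = norm(loc=0.0, scale=1.0)   # sampling distribution p
q = norm(loc=0.5, scale=1.5)   # distribution q evaluated inside the log

n = 100_000
x = p.rvs(size=n, random_state=rng)
ce_hat = -np.mean(q.logpdf(x))

print(ce_hat)
```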
0
votes
0
answers
133
views
Minimizing cross entropy over a restricted domain?
Suppose $f(x;q)$ is the true distribution. The support of the random variable $X$ is $\Omega$. Suppose I am interested in a particular subset $\Xi \subset \Omega$. I would like to minimize the ...
2
votes
1
answer
605
views
What exactly is the problem with overconfident predictions?
Say I have a neural network that classifies images by training to minimise cross-entropy loss with one-hot encoded training labels. It is often seen that such neural networks are 'overconfident', with ...
0
votes
1
answer
98
views
Is CE(X, Y) equivalent to H(X) + H(Y)?
From my understanding mutual information can be defined in the following ways:
[1]:
$I(X;Y)=H(X)+H(Y)-H(X,Y)$ where $H(X), H(Y)$ are marginal entropies and $H(X,Y)$ is the joint entropy.
[2]:
$I(X;Y)=...
1
vote
0
answers
63
views
Decision boundary for Cross entropy loss and Least square loss
We can see the source in this paper.
My question is: why does the cross-entropy loss give a sloped decision boundary while the least-squares loss gives a horizontal one?
Can somebody explain?
0
votes
1
answer
143
views
Derivation of cross entropy loss in machine learning
Given a dataset $\mathcal{D} = \{ (x_1, y_1),\cdots, (x_n, y_n)\}$, let's say we want to approximate the conditional probability $p(y|x)$, and we parameterize it as $p_{\theta}(y|x)$. So, for a ...
2
votes
1
answer
530
views
can we use binary cross entropy with labels -1 and 1?
Binary cross entropy is written as follows:
\begin{equation}
\mathcal{L} = -y\log\left(\hat{y}\right)-(1-y)\log\left(1-\hat{y}\right)
\end{equation}
In every reference that I read, when using binary ...
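As a side note on the mechanics (a minimal sketch with made-up predictions): the formula above assumes $y \in \{0,1\}$, so with $\{-1,1\}$ labels one common option is to remap them via $y \mapsto (y+1)/2$ before applying it.
```python
# Binary cross entropy with {-1, 1} labels remapped to {0, 1}; the values are toy examples.
import numpy as np

def bce(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_pm1 = np.array([-1.0, 1.0, 1.0, -1.0])   # labels encoded as -1 / 1
y_01 = (y_pm1 + 1) / 2                     # remapped to 0 / 1
y_hat = np.array([0.1, 0.8, 0.6, 0.3])     # predicted probabilities

print(bce(y_01, y_hat))  # well defined; plugging -1 straight into the formula is not
```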
4
votes
1
answer
923
views
Meaning of non-{0,1} labels in binary cross entropy?
Binary cross entropy is normally used in situations where the "true" result or label is one of two values (hence "binary"), typically encoded as 0 and 1.
However, the documentation ...
1
vote
1
answer
216
views
Calculating KL divergence with entropy and cross entropy for VAEs
When looking at implementations of VAEs online, specifically the KL divergence loss, the formula used is:
$$ KL\hspace{1mm} Loss = -\frac{1}{2}(1+\log{\sigma^2}-\mu^2-\sigma^2) $$
or some variation ...
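For a concrete check (with parameter values I picked, not from the question), the closed form above can be compared against a Monte-Carlo estimate written as cross-entropy minus entropy, $KL(p\Vert q) = H(p,q) - H(p)$, for a single latent dimension with $p = \mathcal{N}(\mu,\sigma^2)$ and $q = \mathcal{N}(0,1)$:
```python
# Closed-form KL(N(mu, sigma^2) || N(0, 1)) versus a Monte-Carlo estimate expressed as
# cross-entropy minus entropy. mu and sigma are arbitrary example values.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 0.8, 0.5

kl_closed = -0.5 * (1 + np.log(sigma**2) - mu**2 - sigma**2)

p = norm(loc=mu, scale=sigma)
q = norm(loc=0.0, scale=1.0)
x = p.rvs(size=200_000, random_state=rng)
kl_mc = -np.mean(q.logpdf(x)) - (-np.mean(p.logpdf(x)))   # H(p, q) - H(p)

print(kl_closed, kl_mc)    # agree up to Monte-Carlo noise
```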
1
vote
1
answer
140
views
Very balanced dataset and a multiclass classification problem, no context behind the inputs. Which evaluation metric to use?
I have constructed a simple neural network model for a classification problem with 10 target classes, where an input (with some number of features) is to be classified into exactly one of the 10 classes.
...
1
vote
0
answers
28
views
Log base in Cross Entropy Loss [duplicate]
What is the base of the logarithm used in the cross-entropy loss (when backpropagating in multiclass classification)? Is it e, 2, or 10?
2
votes
1
answer
1k
views
Why is the cross entropy of the same probability distribution not 0?
From what I've been reading, if there is no underlying difference between the 2 probability distributions we would have perfect entropy.
I'm putting an example below. Can anybody explain why the ...
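A short numeric illustration (with a distribution I made up): when $p = q$ the cross entropy equals the entropy $H(p)$, which is zero only for a degenerate one-hot $p$; it is the KL divergence $D_{KL}(p\Vert p)$ that is zero.
```python
# Cross entropy of a distribution with itself equals its entropy, not zero (toy example).
import numpy as np

p = np.array([0.5, 0.25, 0.25])

h_pp = -np.sum(p * np.log(p))      # H(p, p) = H(p) ≈ 1.04 nats
kl_pp = np.sum(p * np.log(p / p))  # D_KL(p || p) = 0

print(h_pp, kl_pp)
```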
1
vote
0
answers
139
views
How does the cross entropy loss function interact with the final layer of a neural network?
I am having trouble understanding how the result of categorical cross entropy loss can be used to calculate the gradient for all of the weights.
The output of the cross-entropy function is the sum of all ...
2
votes
0
answers
299
views
Understanding intuitive difference between KL divergence and Cross entropy
I know there are related questions already asked, for example this one.
I also know the following:
KL divergence $D_{KL}(P\Vert Q)$ is given as:
$$\begin{align}
D_{KL}(P\Vert Q) & = -\sum_xP(x)\...
2
votes
1
answer
478
views
Derivative error with respect to bias in binary cross entropy
I am doing research using a NN with 1 hidden layer, calculating the loss with binary cross-entropy and using sigmoid as the activation function. I found the derivative formula in Sadowski, 2016 (link: ...
5
votes
1
answer
332
views
Does logistic regression try to predict the true conditional P(Y|X)?
Consider a binary classification dataset (X, Y), generated according to some unknown distribution $P(X, Y)$. I have a question about models which output probabilities by minimizing the cross-entropy ...
4
votes
2
answers
599
views
Understanding StatQuest video: why cross entropy is used over Sum Squared Error
I was watching the cross-entropy video from StatQuest. While explaining why to use cross-entropy over SSE in a multi-output scenario with softmax output activation, Josh gives this graph of both losses:
He ...
2
votes
0
answers
27
views
Relate cross-entropy formal definition to the cross-entropy loss [duplicate]
Cross entropy for a random variable $x \sim p$ and a distribution $q$ is defined as:
$$H(p,q) = -\sum_{x\in\mathcal{X}} p(x)\log q(x) = -\mathbb{E}_{x\sim p}[\log q(x)]$$
$\mathcal{X}$ is all possible values ...
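A minimal numeric sketch of that connection (my own example values): when $p$ is the one-hot empirical distribution of a single labeled example, the definition collapses to the familiar per-example loss $-\log q(\text{true class})$.
```python
# H(p, q) = -sum_x p(x) log q(x) with a one-hot p reduces to -log q(true class).
# The probabilities below are assumed example values.
import numpy as np

q = np.array([0.2, 0.7, 0.1])   # model distribution over 3 classes
p = np.array([0.0, 1.0, 0.0])   # one-hot "empirical" distribution for one example

h_pq = -np.sum(p[p > 0] * np.log(q[p > 0]))   # only the nonzero-p term contributes
per_example_loss = -np.log(q[1])

print(h_pq, per_example_loss)                 # both equal -log(0.7)
```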
0
votes
0
answers
138
views
Add Bias to classification after training
I have a dataset with classes [a, b] where during training I have made sure that the dataset is equally balanced. I have trained the network using cross-entropy loss with equal importance.
I am able ...
1
vote
0
answers
79
views
Genetic Algorithm as engine for Variational Inference?
I'm curious if anyone has used, heard of, or otherwise considered using Genetic Algorithms as an engine for Variational Inference (VI)?
My understanding of VI is that it's an optimization algorithm, ...
2
votes
2
answers
242
views
Surprisal in rankings
I'm looking for some metric of surprisal when comparing ranked lists - things along the lines of (eg) the rankings in a marathon race, or the times in the race.
Intuitively, in a race with 100 people, ...
4
votes
0
answers
616
views
Cross-entropy vs dot product
Performance of classification algorithms is quantified by comparing the predicted probability distribution of the labels $q$ to the true probability $p$, which is commonly a vector of zeros for all ...
1
vote
1
answer
287
views
How do machine learning algorithms handle classification labels?
I am working on a domain adaptation problem, where the default is a classification problem. I have worked exclusively with regression problems until now, so I am kind of thrown for a loop when it ...
1
vote
0
answers
276
views
How to explain the high accuracy and F1 score on the test set with a huge binary crossentropy loss?
I'll provide a little introduction based on my example. I have a small collection of RGB (but 'gray-looking') brain MRI photos, divided into 2 classes: healthy and tumor. My data split looks like ...
2
votes
1
answer
263
views
Why is it called the cross-entropy of q relative to p, not p relative to q?
I'm looking into the definition of cross entropy from Wikipedia. https://en.wikipedia.org/wiki/Cross_entropy
Cross entropy is not symmetric, so I think for sure it shouldn't be called cross entropy ...
2
votes
1
answer
902
views
Calculating the variance of softmax
I'm working through Dive Into Deep Learning right now and am struggling with the following question:
We can explore the connection between exponential families and the
softmax in some more depth.
...
2
votes
1
answer
442
views
Cross entropy of a random variable or a probability distribution function? [duplicate]
I'm looking into the Wikipedia page on cross entropy. https://en.wikipedia.org/wiki/Cross_entropy
$$H(p,q)=-\sum_{x\in \mathcal{X}} p(x)\log q(x)$$
It can be written as $$H(p,q) = H(p) + D_{KL} (p||q)$...
0
votes
0
answers
55
views
What performance do we get when the same data is combined with different datasets?
Suppose that we have a dataset of a special kind of cat. We are going to train a model on a combination of the cat and the car! Suppose that in this model we get a performance (precision, recall, or ...) X ...
3
votes
0
answers
778
views
Is there an empirical rule for selecting the value of label smoothing?
I am wondering if there is any empirical rule for selecting the value of label smoothing when training a neural network. Let's define smoothed prediction targets in relation to a value $\epsilon$ to ...
1
vote
0
answers
737
views
Final Layer and Inference with CE vs BCE
I have read a similar question here: 1 neuron BCE loss VS 2 neurons CE loss that suggests there is no difference between softmax cross entropy loss and binary cross entropy loss, when choosing between ...
0
votes
0
answers
448
views
Why don't most works on Cityscapes use weighted cross-entropy?
Weighted Cross-Entropy (WCE) helps to handle an imbalanced dataset, and Cityscapes is quite imbalanced, as seen below:
If we check the best benchmarks on this dataset, most of the works use bare CE as a ...
3
votes
1
answer
2k
views
Impact of L1 and L2 regularisation with cross-entropy loss
When we are dealing with Mean Square Error (MSE) loss function in optimization problems, we often add $L_1$ or $L_2$ penalty terms (or a combination of both) to the MSE loss function while training. ...
0
votes
0
answers
330
views
How can I get the Binary Cross Entropy from the Cross Entropy function for GANs
I got the definition of log-likelihood from Goodfellow's Deep Learning book:
\begin{equation}
\label{eq:loglikelihood}
\theta_{ML} = \operatorname{argmax}_{\theta}\sum_{i=1}^{m} \log p_{model}(x_i; \theta).
\end{...
2
votes
1
answer
335
views
XGBoost Objective Derivation Problem
This is the loss function of XGBoost.
This is the second-order approximation of the loss function.
Note:
\begin{equation} L^{(t)} \text{: cross entropy loss function.} \end{equation}
\begin{equation}...
2
votes
1
answer
180
views
Likelihood and cross-entropy: continuous case
I think it's pretty clear to me that average log-likelihood is equivalent to negative cross-entropy for discrete distributions, as shown here:
$$\frac{1}{N}\log\mathcal{L}(\theta) = \frac{1}{N}\log \...
16
votes
2
answers
2k
views
Disadvantages of using a regression loss function in multi-class classification
Given $k > 2$ classes, consider the following loss function
$$
\sum_i||y^{(i)} - \hat y^{(i)}||^2
$$
Here $y^{(i)} \in \{0,1\}^k$ is the $i^{th}$ one-hot encoded true label and $\hat y^{(i)} \in [0,...
1
vote
1
answer
4k
views
Confused with binary cross-entropy vs categorical cross-entropy
I have a dataset with 10 categorical input features and one categorical output feature with classes 0 and 1. X_train is a 3D array, so I have done label encoding on the dataset beforehand.
I have ...