
Questions tagged [cross-entropy]

A measure of the difference between two probability distributions for a given random variable or set of events.

2 votes
1 answer
276 views

I came across this article: “MSE is Cross Entropy at Heart: Maximum Likelihood Estimation Explained” which states: "When training a neural network, we are trying to find the parameters of a ...
asked by spie227
3 votes
1 answer
107 views

Usually we use the average of the cross-entropy loss over all test examples as an index; can we also use the variance of the cross-entropy loss as an index?
asked by Bayesian Hat
0 votes
0 answers
50 views

Suppose we want to estimate $$r = \mathbb{E}_{x\sim p(x)} [f(x)]$$ via importance sampling, i.e. $$r = \mathbb{E}_{x\sim q(x)} \left[\frac{f(x)p(x)}{q(x)}\right]$$ Now Wikipedia says that ...
asked by Lazy Guy
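For context, a minimal NumPy/SciPy sketch of the importance-sampling estimator in the excerpt; the target density, the proposal, and the integrand f below are illustrative placeholders, not taken from the question:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    f = lambda x: x ** 2                  # integrand; E_p[f(x)] = 1 for p = N(0, 1)
    p = stats.norm(0, 1)                  # target distribution p(x)
    q = stats.norm(0, 2)                  # proposal distribution q(x)

    x = q.rvs(size=100_000, random_state=rng)
    w = p.pdf(x) / q.pdf(x)               # importance weights p(x)/q(x)
    r_hat = np.mean(f(x) * w)             # estimates E_q[f(x) p(x)/q(x)] = E_p[f(x)]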
4 votes
2 answers
665 views

I understand that AUC measures the model's ability to rank the subjects (see Why is ROC AUC equivalent to the probability that two randomly-selected samples are correctly ranked?). In contrast, binary ...
asked by iRum
2 votes
1 answer
205 views

I have, in some sense, the opposite question to Is it okay to use cross entropy loss function with soft labels?, namely: why is it OK NOT to use soft labels in classification? Let's say you have a ...
asked by YuseqYaseq
0 votes
0 answers
66 views

I know that when optimizing (supervised) neural networks, minimizing the cross-entropy loss is equivalent to minimizing the negative log-likelihood, which is equivalent to MLE, but I can't get all the math together. I am trying to ...
asked by Meem12
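For a single training example with one-hot label $y$ and predicted class probabilities $\hat{y}_k = p_\theta(k \mid x)$, the link in the excerpt reduces to one line:

$$-\sum_k y_k \log \hat{y}_k = -\log \hat{y}_c = -\log p_\theta(y = c \mid x),$$

where $c$ is the true class, so minimizing the cross-entropy over the training set is the same as maximizing the (log-)likelihood of the observed labels.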
1 vote
0 answers
29 views

I read this question Why do we use Kullback-Leibler divergence rather than cross entropy in the t-SNE objective function? and I cannot fully understand the answer. If we're using KL divergence for the ...
asked by COTHE
4 votes
1 answer
530 views

There is plenty of material showing the relationship between MLE and cross-entropy. Typically, these are the steps taken to show the relationship for an i.i.d. data-generating process $D = (X,Y)$: $$ ...
asked by spie227
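For reference, the usual chain of steps for an i.i.d. sample $\{(x_i, y_i)\}_{i=1}^{N}$ is:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^{N} p_\theta(y_i \mid x_i) = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i) = \arg\min_\theta \left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i) \right),$$

and the final expression is the empirical cross-entropy between the observed labels and the model's predictions.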
2 votes
1 answer
188 views

For classification problems with more than two classes, I've seen these two forms of cross-entropy loss: $-\sum_k y_k \log(a_k)$ and $-\sum_k \left[ y_k \log(a_k) + (1-y_k) \log(1-a_k) \right]$. Here $y_k$ are the true ...
asked by theQman
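The first form is the usual categorical (softmax) cross-entropy; the second sums a per-class binary cross-entropy, as used in multi-label setups. A small NumPy sketch with illustrative numbers:

    import numpy as np

    y = np.array([0.0, 1.0, 0.0])           # one-hot true label
    a = np.array([0.2, 0.7, 0.1])           # predicted probabilities (softmax output)

    categorical_ce = -np.sum(y * np.log(a))                             # first form
    summed_binary_ce = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)) # second form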
2 votes
0 answers
212 views

I'm making an implementation of softmax regression and I'm struggling to understand the nature of the problem of an increasing cross-entropy value: $H(y, p)=-\sum_{i=1}^C y_i \log(p_i)$, ...
asked by JoshJohnson
2 votes
1 answer
1k views

I am trying to understand the Shannon entropy better. By definition, the Shannon entropy is calculated as H = -sum(pk * log(pk)). I am using the scipy.stats.entropy function and I am running the ...
asked by GGChe
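For reference, scipy.stats.entropy normalizes its input to sum to one, uses the natural logarithm unless a base is given, and returns the KL divergence (not the cross-entropy) when a second distribution is passed:

    import numpy as np
    from scipy.stats import entropy

    pk = np.array([0.5, 0.25, 0.25])
    qk = np.array([0.4, 0.4, 0.2])

    print(entropy(pk))           # Shannon entropy -sum(pk * log(pk)), in nats
    print(entropy(pk, base=2))   # same quantity in bits
    print(entropy(pk, qk))       # relative entropy D_KL(pk || qk)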
0 votes
1 answer
334 views

I recently saw the following formulation of the Fisher information matrix in a paper on Transformer pruning: $$ \mathcal{I} := \frac{1}{|D|} \sum_{(x,y) \in D} \left( \frac{\partial \mathcal{L}(x,y;1)}...
asked by premed
0 votes
0 answers
97 views

I am currently trying to estimate the cross-entropy between two distributions with densities $p$ and $q$. $$ \ell = -\mathbb{E}_{x\sim p(x) }[\log q(x)] $$ I am using a Monte-Carlo estimate: $$ \hat{\...
asked by Nick Bishop
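A minimal sketch of such a Monte-Carlo estimate, assuming we can draw samples from $p$ and evaluate the log-density of $q$; the two Gaussians here are purely illustrative:

    import numpy as np
    from scipy import stats

    p = stats.norm(0.0, 1.0)                      # sampling distribution p(x)
    q = stats.norm(0.5, 1.5)                      # distribution whose log-density we score

    x = p.rvs(size=50_000, random_state=0)
    cross_entropy_hat = -np.mean(q.logpdf(x))     # estimates -E_{x~p}[log q(x)]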
0 votes
0 answers
133 views

Suppose $f(x;q)$ is the true distribution. The support of the random variable $X$ is $\Omega$. Suppose I am interested in a particular subset $\Xi \subset \Omega$. I would like to minimize the ...
asked by entropy
2 votes
1 answer
605 views

Say I have a neural network that classifies images by training to minimise cross-entropy loss with one-hot encoded training labels. It is often seen that such neural networks are 'overconfident', with ...
asked by Danny Duberstein
0 votes
1 answer
98 views

From my understanding mutual information can be defined in the following ways: [1]: $I(X;Y)=H(X)+H(Y)-H(X,Y)$ where $H(X), H(Y)$ are marginal entropies and $H(X,Y)$ is the joint entropy. [2]: $I(X;Y)=...
asked by Rui
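For reference, the standard equivalent forms of mutual information are:

$$I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = D_{KL}\big(p(x,y) \,\Vert\, p(x)\,p(y)\big).$$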
1 vote
0 answers
63 views

We can see the source in this paper. My question is why the cross-entropy loss has a sloped boundary line while the least-squares loss has a horizontal boundary. Can somebody explain?
asked by batuman
0 votes
1 answer
143 views

Given a dataset $\mathcal{D} = \{ (x_1, y_1),\cdots, (x_n, y_n)\}$, let's say we want to approximate the conditional probability $p(y|x)$, and we parameterize it as $p_{\theta}(y|x)$. So, for a ...
asked by UESTCfresh
2 votes
1 answer
530 views

Binary cross entropy is written as follows: \begin{equation} \mathcal{L} = -y\log\left(\hat{y}\right)-(1-y)\log\left(1-\hat{y}\right) \end{equation} In every reference that I read, when using binary ...
asked by andryan86
4 votes
1 answer
923 views

Binary cross entropy is normally used in situations where the "true" result or label is one of two values (hence "binary"), typically encoded as 0 and 1. However, the documentation ...
asked by R.M.
1 vote
1 answer
216 views

When looking at implementations of VAEs online, specifically the KL divergence loss, the formula used is: $$ KL\hspace{1mm} Loss = -\frac{1}{2}(1+\log{\sigma^2}-\mu^2-\sigma^2) $$ or some variation ...
asked by pyrrosk
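That expression is the per-dimension closed form of the KL divergence between the encoder's diagonal Gaussian and the standard-normal prior:

$$D_{KL}\big(\mathcal{N}(\mu, \sigma^2) \,\Vert\, \mathcal{N}(0, 1)\big) = \frac{1}{2}\left(\mu^2 + \sigma^2 - \log \sigma^2 - 1\right),$$

which is exactly the quoted loss after distributing the minus sign.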
1 vote
1 answer
140 views

I have constructed a simple neural network model, for a classification problem, with 10 target classes where an input (with some number of features) is to be classified to only one of the 10 classes. ...
asked by creamedcheese83
1 vote
0 answers
28 views

What is the base of the logarithm used in the cross-entropy loss (when backpropagating in multiclass classification)? Is it e, 2, or 10?
asked by Sachin
2 votes
1 answer
1k views

From what I've been reading, if there is no underlying difference between the two probability distributions we would have perfect entropy. I'm putting an example below. Can anybody explain why the ...
asked by julian lagier
1 vote
0 answers
139 views

I am having trouble understanding how the result of the categorical cross-entropy loss can be used to calculate the gradient for all of the weights. The output of the cross-entropy function is the sum of all ...
asked by Nick
2 votes
0 answers
299 views

I know there are related questions already asked, for example this one. I also know the following: KL divergence $D_{KL}(P\Vert Q)$ is given as: $$\begin{align} D_{KL}(P\Vert Q) & = -\sum_xP(x)\...
asked by Mahesha999
2 votes
1 answer
478 views

I am doing research using an NN with one hidden layer, computing the loss with binary cross-entropy and using a sigmoid activation function. I found the derivative formula in Sadowski, 2016 (link: ...
asked by Andryan
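For reference, with sigmoid output $\hat{y} = \sigma(z)$ and binary cross-entropy $\mathcal{L}$, the derivative with respect to the pre-activation collapses to a simple form:

$$\frac{\partial \mathcal{L}}{\partial z} = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right) \hat{y}(1-\hat{y}) = \hat{y} - y.$$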
5 votes
1 answer
332 views

Consider a binary classification dataset (X, Y), generated according to some unknown distribution $P(X, Y)$. I have a question about models which output probabilities by minimizing the cross-entropy ...
asked by usual me
4 votes
2 answers
599 views

I was watching the cross-entropy video from StatQuest. While explaining why to use cross-entropy over SSE in a multi-output scenario with a softmax output activation, Josh gives this graph of both losses: He ...
asked by Mahesha999
2 votes
0 answers
27 views

Cross entropy for a random variable $x \sim p$ and a distribution $q$ is defined as: $$H(p,q) = -\sum_{x\in\mathcal{X}} p(x)\log q(x) = -\mathbb{E}_{x\sim p}[\log q(x)]$$ $\mathcal{X}$ is all possible values ...
asked by rando
0 votes
0 answers
138 views

I have a dataset with classes [a, b] where during training I have made sure that the dataset is equally balanced. I have trained the network using cross-entropy loss with equal importance. I am able ...
asked by JakobVinkas
1 vote
0 answers
79 views

I'm curious if anyone has used, heard of, or otherwise considered using Genetic Algorithms as an engine for Variational Inference (VI)? My understanding of VI is that it's an optimization algorithm, ...
asked by jbuddy_13
2 votes
2 answers
242 views

I'm looking for some metric of surprisal when comparing ranked lists - things along the lines of (eg) the rankings in a marathon race, or the times in the race. Intuitively, in a race with 100 people, ...
asked by Alex I
4 votes
0 answers
616 views

Performance of classification algorithms is quantified by comparing the predicted probability distribution of the labels $q$ to the true probability $p$, which is commonly a vector of zeros for all ...
asked by Aleksejs Fomins
1 vote
1 answer
287 views

I am working on a domain adaptation problem, where the default is a classification problem. I have worked exclusively with regression problems until now, so I am kind of thrown for a loop when it ...
asked by Scott
1 vote
0 answers
276 views

I'll provide a little of introduction based on my example. I have a small collection of RGB (but 'gray-looking') brain MRI photos, divided into 2 classes: healthy and tumor. My data split looks like ...
asked by Karolina Świergała
2 votes
1 answer
263 views

I'm looking into the definition of cross entropy from wikipedia. https://en.wikipedia.org/wiki/Cross_entropy Cross entropy is not symmetric, so I think for sure it shouldn't be called cross entropy ...
asked by user900476
2 votes
1 answer
902 views

I'm working through Dive Into Deep Learning right now and am struggling with the following question: We can explore the connection between exponential families and the softmax in some more depth. ...
asked by jimac82
2 votes
1 answer
442 views

I'm looking into the wikipedia page of cross entropy. https://en.wikipedia.org/wiki/Cross_entropy $$H(p,q)=-\sum_{x\in \mathcal{X}} p(x)\log q(x)$$ It can be written as $$H(p,q) = H(p) + D_{KL} (p||q)$...
asked by user900476
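The decomposition follows directly from the definitions by adding and subtracting $\sum_x p(x)\log p(x)$:

$$H(p,q) = -\sum_{x} p(x)\log q(x) = -\sum_{x} p(x)\log p(x) + \sum_{x} p(x)\log\frac{p(x)}{q(x)} = H(p) + D_{KL}(p \Vert q).$$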
0 votes
0 answers
55 views

Suppose that we have a dataset of a special kind of cat. We are going to train a model on a combination of the cat and the car! Suppose that with this model we get a performance (precision, recall, or ...) X ...
asked by Mahdi Amrollahi
3 votes
0 answers
778 views

I am wondering if there is any empirical rule for selecting the value of label smoothing when training a neural network. Let's define smoothed prediction targets in relation to a value $\epsilon$ to ...
asked by thiaamak
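For concreteness, the usual smoothed target for a $K$-class problem with one-hot label $y$ is

$$\tilde{y}_k = (1-\epsilon)\, y_k + \frac{\epsilon}{K},$$

and values around $\epsilon = 0.1$ are commonly reported in practice, though whether a principled rule exists is exactly what the excerpt asks.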
1 vote
0 answers
737 views

I have read a similar question here: 1 neuron BCE loss VS 2 neurons CE loss that suggests there is no difference between softmax cross entropy loss and binary cross entropy loss, when choosing between ...
asked by Anonymous
0 votes
0 answers
448 views

Weighted Cross-Entropy (WCE) helps to handle an imbalanced dataset, and Cityscapes is quite imbalanced, as seen below: If we check the best benchmarks on this dataset, most of the works use bare CE as a ...
asked by Rafael Toledo
3 votes
1 answer
2k views

When we are dealing with Mean Square Error (MSE) loss function in optimization problems, we often add $L_1$ or $L_2$ penalty terms (or a combination of both) to the MSE loss function while training. ...
asked by Aravind G.
0 votes
0 answers
330 views

I got the definition of log-likelihood from Goodfellow's Deep Learning book: $$\theta_{ML} = \underset{\theta}{\arg\max} \sum_{i=1}^{m} \log p_{model}(x_i; \theta)$$ ...
asked by Lucas Lima de Sousa
2 votes
1 answer
335 views

This is the loss function of XGBoost. This is the Second-order approximation of the loss function. Note: \begin{equation} L^{(t)} \text{: cross entropy loss function.} \end{equation} \begin{equation}...
asked by ChrisChu
2 votes
1 answer
180 views

I think it's pretty clear to me that average log-likelihood is equivalent to negative cross-entropy for discrete distributions, as shown here: $$\frac{1}{N}\log\mathcal{L}(\theta) = \frac{1}{N}\log \...
asked by Alex Zakharov
16 votes
2 answers
2k views

Given $k > 2$ classes, consider the following loss function $$ \sum_i||y^{(i)} - \hat y^{(i)}||^2 $$ Here $y^{(i)} \in \{0,1\}^k$ is the $i^{th}$ one-hot encoded true label and $\hat y^{(i)} \in [0,...
asked by helperFunction
1 vote
1 answer
4k views

I have a dataset with 10 input categorical features and one output categorical feature with classes 0 and 1. X_train is a 3D array, so I have done label encoding beforehand on the dataset. I have ...
asked by be_real
