The Exact Relationship
For discrete distributions $P$ and $Q$ over the same sample space:

$$H(P, Q) = H(P) + D_{\mathrm{KL}}(P \| Q)$$

where $H(P, Q)$ is the cross-entropy, $H(P)$ is the entropy of $P$, and $D_{\mathrm{KL}}(P \| Q)$ is the KL divergence of $P$ from $Q$.
This identity is the single most important fact about the relationship between these two quantities. Everything else follows from it.
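The identity can be checked numerically with a minimal pure-Python sketch (the three quantities are defined formally below; the distributions here are arbitrary illustrative values):

```python
import math

def entropy(p):
    """H(P) = -sum P(x) log P(x), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum P(x) log Q(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(P || Q) = sum P(x) log(P(x)/Q(x)), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.7, 0.2, 0.1]   # illustrative distributions
Q = [0.5, 0.3, 0.2]

# H(P, Q) = H(P) + D_KL(P || Q)
assert abs(cross_entropy(P, Q) - (entropy(P) + kl(P, Q))) < 1e-12
```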
Definitions
Cross-Entropy
For discrete distributions $P$ and $Q$:

$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$

Cross-entropy measures the expected number of bits (or nats) needed to encode samples from $P$ using the code optimized for $Q$. It combines two sources of cost: the inherent uncertainty in $P$ (the entropy $H(P)$) and the mismatch between $Q$ and $P$ (the KL divergence).
KL Divergence
For discrete distributions $P$ and $Q$ where $Q(x) > 0$ whenever $P(x) > 0$:

$$D_{\mathrm{KL}}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

KL divergence measures only the excess cost of using $Q$ instead of $P$. It is always non-negative ($D_{\mathrm{KL}}(P \| Q) \geq 0$ by Gibbs' inequality) and equals zero if and only if $P = Q$ almost everywhere.
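Gibbs' inequality can be spot-checked over many random distribution pairs (a sketch; the helper names are our own):

```python
import math
import random

def kl(p, q):
    """D_KL(P || Q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_dist(n, rng):
    """A random probability vector of length n."""
    w = [rng.random() + 1e-9 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
for _ in range(1000):
    p = random_dist(5, rng)
    q = random_dist(5, rng)
    assert kl(p, q) >= 0.0      # Gibbs' inequality: never negative
assert kl(p, p) < 1e-12         # zero when the distributions coincide
```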
Why Minimizing Cross-Entropy Equals Minimizing KL
In supervised classification, $P$ is the empirical data distribution (one-hot labels) and $Q$ is the model's predicted distribution. You optimize $Q$.

Since $H(P)$ does not depend on $Q$:

$$\arg\min_{Q} H(P, Q) = \arg\min_{Q} D_{\mathrm{KL}}(P \| Q)$$

The entropy $H(P)$ is a constant with respect to the model parameters. Minimizing cross-entropy and minimizing KL divergence produce the same optimal $Q$. Cross-entropy is preferred in practice because it avoids computing $P(x) \log P(x)$ terms (which are constant and involve $0 \log 0$ for one-hot labels).
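The constant-offset argument can be verified directly: for a fixed $P$, cross-entropy and KL divergence differ by exactly $H(P)$ no matter which $Q$ is tried (a sketch with illustrative values):

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.6, 0.3, 0.1]   # fixed target distribution (illustrative)
for Q in ([0.5, 0.3, 0.2], [0.8, 0.1, 0.1], [0.34, 0.33, 0.33]):
    # The two objectives differ by the Q-independent constant H(P),
    # so they rank every candidate Q identically.
    assert abs(cross_entropy(P, Q) - kl(P, Q) - entropy(P)) < 1e-12
```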
Comparison Table
| Property | Cross-Entropy | KL Divergence |
|---|---|---|
| Formula | $-\sum_x P(x) \log Q(x)$ | $\sum_x P(x) \log \frac{P(x)}{Q(x)}$ |
| Measures | Total encoding cost under $Q$ | Excess cost beyond optimal coding |
| Minimum value | $H(P)$ (achieved when $Q = P$) | $0$ (achieved when $Q = P$) |
| Symmetry | Not symmetric: $H(P, Q) \neq H(Q, P)$ | Not symmetric: $D_{\mathrm{KL}}(P \| Q) \neq D_{\mathrm{KL}}(Q \| P)$ |
| Is a metric? | No (not symmetric, no triangle inequality) | No (not symmetric, no triangle inequality) |
| Relationship | $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \| Q)$ | $D_{\mathrm{KL}}(P \| Q) = H(P, Q) - H(P)$ |
| Primary use | Classification loss function | Distribution comparison, variational inference |
| Requires computing $H(P)$? | No | Yes |
| Gradient w.r.t. $Q$ | $-\sum_x P(x) \nabla \log Q(x)$ | Same as cross-entropy when $P$ is fixed |
The Asymmetry of KL Divergence
KL divergence is not symmetric: $D_{\mathrm{KL}}(P \| Q) \neq D_{\mathrm{KL}}(Q \| P)$. The two directions have different behaviors and different names.
Forward KL: $D_{\mathrm{KL}}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$
The expectation is over $P$. This penalizes $Q$ heavily when $P(x) > 0$ but $Q(x) \approx 0$: the model assigns near-zero probability to an event that actually occurs. Forward KL produces mean-seeking behavior. If $P$ is multimodal, $Q$ will try to cover all modes, even at the cost of placing probability mass between them.
Reverse KL: $D_{\mathrm{KL}}(Q \| P) = \mathbb{E}_{x \sim Q}\left[\log \frac{Q(x)}{P(x)}\right]$
The expectation is over $Q$. This penalizes $Q$ when $Q(x) > 0$ but $P(x) \approx 0$: the model places probability where no data exists. Reverse KL produces mode-seeking behavior. If $P$ is multimodal, $Q$ will collapse to a single mode rather than spread mass across all of them.
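Both behaviors can be reproduced with a small grid-search sketch: fitting a single discretized Gaussian to a bimodal target under each KL direction (the grid, peak positions, and search ranges are all illustrative assumptions):

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as aligned lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

xs = [i * 0.1 for i in range(101)]  # grid on [0, 10]

def gaussian(mu, sigma):
    """Discretized Gaussian; tiny floor avoids log(0) after float underflow."""
    return normalize([math.exp(-0.5 * ((x - mu) / sigma) ** 2) + 1e-300 for x in xs])

# Bimodal target: narrow peaks at x=2 and x=8 (small floor keeps logs finite)
P = normalize([math.exp(-0.5 * ((x - 2) / 0.3) ** 2)
               + math.exp(-0.5 * ((x - 8) / 0.3) ** 2) + 1e-12 for x in xs])

# Fit one Gaussian by brute-force grid search under each KL direction
candidates = [(i * 0.25, 0.2 + j * 0.2) for i in range(41) for j in range(25)]
fwd_mu, fwd_sigma = min(candidates, key=lambda c: kl(P, gaussian(*c)))  # D(P||Q)
rev_mu, rev_sigma = min(candidates, key=lambda c: kl(gaussian(*c), P))  # D(Q||P)

print("forward KL fit:", fwd_mu, fwd_sigma)  # wide, centered between the modes
print("reverse KL fit:", rev_mu, rev_sigma)  # narrow, sitting on a single mode
```

The forward fit spreads out to cover both peaks; the reverse fit collapses onto one of them.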
Why This Matters for Variational Inference
In variational inference, you approximate an intractable posterior $p(z \mid x)$ with a tractable family $q(z)$.
- ELBO maximization minimizes $D_{\mathrm{KL}}(q \| p)$ (reverse KL). The approximation is mode-seeking: it concentrates on one mode of the posterior and underestimates uncertainty. This is what standard VI does.
- Expectation propagation minimizes $D_{\mathrm{KL}}(p \| q)$ (forward KL). The approximation is mean-seeking: it covers all modes but may overestimate uncertainty and place mass in low-probability regions.
The choice of KL direction determines the qualitative behavior of the approximation. Neither is universally better.
Where Each Is Used
Training a neural classifier
Use cross-entropy as the loss function. For a sample with true class $c$ and predicted probabilities $q$:

$$\mathcal{L} = -\log q_c$$

This is cross-entropy with a one-hot $P$. It is equivalent to minimizing $D_{\mathrm{KL}}(P \| Q)$, but cross-entropy is simpler to compute because the $0 \log 0$ terms vanish (they are $0$ by convention).
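A minimal sketch of this equivalence (pure Python; the logits are made-up values):

```python
import math

def softmax(logits):
    m = max(logits)                       # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, -1.0, 0.5]   # illustrative raw model outputs for 3 classes
true_class = 0
q = softmax(logits)

# One-hot cross-entropy reduces to the negative log-probability of the true class
loss = -math.log(q[true_class])

# Same value as the full cross-entropy with a one-hot P (0*log 0 terms vanish)
P = [1.0, 0.0, 0.0]
full = -sum(p * math.log(qi) for p, qi in zip(P, q) if p > 0)
assert abs(loss - full) < 1e-15
```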
Knowledge distillation
Use KL divergence explicitly. Given teacher distribution $P_{\text{teacher}}$ and student distribution $Q_{\text{student}}$:

$$\mathcal{L} = D_{\mathrm{KL}}(P_{\text{teacher}} \| Q_{\text{student}})$$

Here $P_{\text{teacher}}$ is not one-hot: it is the softened output of the teacher model. The $P(x) \log P(x)$ terms are not constant across samples (different inputs produce different teacher distributions), so KL divergence and cross-entropy report different loss values. In practice, many implementations use cross-entropy for convenience and accept the offset, which does not depend on the student's parameters and so does not affect gradients.
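A sketch of a distillation-style loss on a single example (pure Python; the logits and temperature are illustrative, not a prescribed recipe):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; T > 1 softens the distribution."""
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher_logits = [4.0, 1.5, 0.5]   # illustrative values
student_logits = [3.0, 2.0, 0.0]
T = 2.0                            # distillation temperature (assumed)

p_teacher = softmax(teacher_logits, T)   # softened target: not one-hot
q_student = softmax(student_logits, T)

distill_loss = kl(p_teacher, q_student)

# Cross-entropy differs from the KL loss by H(p_teacher), which varies per input
cross_ent = -sum(pi * math.log(qi) for pi, qi in zip(p_teacher, q_student))
h_teacher = -sum(pi * math.log(pi) for pi in p_teacher)
assert abs(cross_ent - (h_teacher + distill_loss)) < 1e-12
```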
Variational autoencoders
The ELBO contains a KL divergence term that regularizes the approximate posterior:

$$\mathrm{ELBO} = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{\mathrm{KL}}(q(z \mid x) \,\|\, p(z))$$

The KL term measures how far the encoder's approximate posterior $q(z \mid x)$ deviates from the prior $p(z)$. For Gaussian $q$ and $p$, this has a closed-form expression. Cross-entropy would not be used here because the entropy of $q(z \mid x)$ changes during training, so cross-entropy and KL divergence are no longer equivalent objectives in this setting.
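For scalar Gaussians $q = \mathcal{N}(\mu, \sigma^2)$ and $p = \mathcal{N}(0, 1)$, the closed form is $\frac{1}{2}(\sigma^2 + \mu^2 - 1 - \log \sigma^2)$. A sketch that cross-checks it by numerical quadrature (grid and parameters are illustrative):

```python
import math

def kl_gauss_standard(mu, sigma):
    """Closed-form D_KL(N(mu, sigma^2) || N(0, 1)) for a scalar Gaussian."""
    return 0.5 * (sigma ** 2 + mu ** 2 - 1.0 - math.log(sigma ** 2))

# Zero exactly when the approximate posterior equals the prior
assert kl_gauss_standard(0.0, 1.0) == 0.0

# Cross-check against direct numerical integration on a fine grid
mu, sigma = 1.0, 0.5
dx = 0.001
total = 0.0
for i in range(-8000, 8001):
    x = i * dx
    qx = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    px = math.exp(-0.5 * x ** 2) / math.sqrt(2 * math.pi)
    total += qx * math.log(qx / px) * dx

assert abs(total - kl_gauss_standard(mu, sigma)) < 1e-3
```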
Common Confusions
Cross-entropy is not symmetric either
Both cross-entropy and KL divergence are asymmetric: $H(P, Q) \neq H(Q, P)$ in general. The asymmetry of KL divergence gets more attention because the two directions (forward and reverse) produce qualitatively different approximations. But cross-entropy is equally asymmetric, and swapping the arguments changes the loss function.
KL divergence is not a distance
Despite being called a 'divergence,' KL does not satisfy the properties of a distance metric. It is not symmetric and does not satisfy the triangle inequality. The Jensen-Shannon divergence $\mathrm{JSD}(P \| Q) = \frac{1}{2} D_{\mathrm{KL}}(P \| M) + \frac{1}{2} D_{\mathrm{KL}}(Q \| M)$ with $M = \frac{1}{2}(P + Q)$ is symmetric, and its square root is a true metric.
The log base matters for units, not for optimization
Using $\log_2$ gives cross-entropy in bits. Using $\ln$ gives nats. Using $\log_{10}$ gives hartleys. The choice affects the numerical value but not the optimizer's behavior: the minimum is at the same $Q$ regardless of log base. PyTorch uses $\ln$ by default.
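The base change is just a constant multiplicative factor, which a short sketch makes concrete (illustrative distributions):

```python
import math

def cross_entropy(p, q, base=math.e):
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]   # illustrative distributions
Q = [0.9, 0.1]

nats = cross_entropy(P, Q)            # natural log (PyTorch's convention)
bits = cross_entropy(P, Q, base=2)    # log base 2

# The two differ only by the constant factor ln(2): same minimizer either way
assert abs(bits * math.log(2) - nats) < 1e-12
```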
Cross-entropy loss in PyTorch includes the softmax
`torch.nn.CrossEntropyLoss` applies log-softmax internally before computing the negative log-likelihood. If you apply softmax to your logits and then pass them to `CrossEntropyLoss`, you are applying softmax twice. This is a common bug that produces silently degraded training.
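The effect can be seen without PyTorch: pushing probabilities (values already in $[0, 1]$) back through softmax flattens the distribution, which weakens the loss signal (illustrative logits):

```python
import math

def softmax(z):
    m = max(z)                        # shift for numerical stability
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

logits = [5.0, 1.0, -2.0]   # illustrative raw logits
once = softmax(logits)       # what the loss should see internally
twice = softmax(once)        # the bug: probabilities pushed through softmax again

print(once)    # strongly peaked on class 0
print(twice)   # much flatter: the inputs now all lie in [0, 1]

assert max(once) > 0.9
assert max(twice) < 0.6
```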
References
- Cover, T.M. and Thomas, J.A. Elements of Information Theory. 2nd ed. Wiley, 2006. Chapter 2 defines entropy, cross-entropy, and KL divergence. Theorem 2.6.3 proves non-negativity of KL.
- Bishop, C.M. Pattern Recognition and Machine Learning. Springer, 2006. Section 1.6.1 derives the relationship between KL divergence and cross-entropy.
- Murphy, K.P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012. Chapter 2.8 covers KL divergence and its use in variational inference.
- Blei, D.M., Kucukelbir, A., and McAuliffe, J.D. "Variational Inference: A Review for Statisticians." JASA, 2017. Section 2 explains the ELBO and the role of KL divergence direction.
- Hinton, G., Vinyals, O., and Dean, J. "Distilling the Knowledge in a Neural Network." 2015. Section 2 uses KL divergence for knowledge distillation.
- Kingma, D.P. and Welling, M. "Auto-Encoding Variational Bayes." ICLR 2014. Section 2.2 derives the KL term in the VAE objective.
- Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. Section 3.13 covers KL divergence and cross-entropy in the context of loss functions.