

KL Divergence vs. Cross-Entropy

Cross-entropy and KL divergence differ by an additive term: $H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)$. When the true distribution $P$ is fixed (as in supervised classification), $H(P)$ is constant, so minimizing cross-entropy is equivalent to minimizing KL divergence. The two quantities differ in interpretation and typical usage.

The Exact Relationship

For discrete distributions $P$ and $Q$ over the same sample space:

$$H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)$$

where $H(P, Q)$ is the cross-entropy, $H(P)$ is the entropy of $P$, and $D_{\text{KL}}(P \| Q)$ is the KL divergence from $P$ to $Q$.

This identity is the single most important fact about the relationship between these two quantities. Everything else follows from it.
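The identity is easy to check numerically. A minimal sketch with NumPy (the two distributions below are arbitrary examples):

```python
import numpy as np

# Two arbitrary example distributions over 4 outcomes
P = np.array([0.1, 0.4, 0.3, 0.2])
Q = np.array([0.25, 0.25, 0.25, 0.25])

entropy_P = -np.sum(P * np.log(P))       # H(P), in nats
cross_entropy = -np.sum(P * np.log(Q))   # H(P, Q)
kl = np.sum(P * np.log(P / Q))           # D_KL(P || Q)

# H(P, Q) = H(P) + D_KL(P || Q), up to floating-point error
assert np.isclose(cross_entropy, entropy_P + kl)
```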

Definitions

Definition

Cross-Entropy

For discrete distributions $P$ and $Q$:

$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$

Cross-entropy measures the expected number of bits (or nats) needed to encode samples from $P$ using the code optimized for $Q$. It combines two sources of cost: the inherent uncertainty in $P$ (the entropy $H(P)$) and the mismatch between $P$ and $Q$ (the KL divergence).

Definition

KL Divergence

For discrete distributions $P$ and $Q$ where $Q(x) > 0$ whenever $P(x) > 0$:

$$D_{\text{KL}}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

KL divergence measures only the excess cost of using $Q$ instead of $P$. It is always non-negative ($D_{\text{KL}}(P \| Q) \geq 0$ by Gibbs' inequality) and equals zero if and only if $P = Q$ almost everywhere.

Why Minimizing Cross-Entropy Equals Minimizing KL

In supervised classification, $P$ is the empirical data distribution (one-hot labels) and $Q_\theta$ is the model's predicted distribution. You optimize $\theta$.

Since $H(P)$ does not depend on $\theta$:

$$\arg\min_\theta H(P, Q_\theta) = \arg\min_\theta \left[ H(P) + D_{\text{KL}}(P \| Q_\theta) \right] = \arg\min_\theta D_{\text{KL}}(P \| Q_\theta)$$

The entropy $H(P)$ is a constant with respect to the model parameters. Minimizing cross-entropy and minimizing KL divergence produce the same optimal $\theta$. Cross-entropy is preferred in practice because it avoids computing $P \log P$ terms (which are constant and involve $\log 0$ for one-hot labels).
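The shared argmin can be seen in a small grid search. A sketch with NumPy, using a soft (non-one-hot) $P$ so that $H(P) > 0$ and the constant offset is visible; the one-parameter model family here is an arbitrary choice for illustration:

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])        # fixed "true" distribution

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A one-parameter family Q_theta: softmax over theta-scaled logits
logits = np.array([2.0, 1.0, 0.0])
thetas = np.linspace(0.0, 3.0, 301)

ce = np.array([-np.sum(P * np.log(softmax(t * logits))) for t in thetas])
kl = np.array([np.sum(P * np.log(P / softmax(t * logits))) for t in thetas])

# The two loss curves differ by the constant H(P) and share their argmin
entropy_P = -np.sum(P * np.log(P))
assert np.allclose(ce - kl, entropy_P)
assert ce.argmin() == kl.argmin()
```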

Comparison Table

| Property | Cross-Entropy $H(P, Q)$ | KL Divergence $D_{\text{KL}}(P \| Q)$ |
| --- | --- | --- |
| Formula | $-\sum P(x) \log Q(x)$ | $\sum P(x) \log \frac{P(x)}{Q(x)}$ |
| Measures | Total encoding cost under $Q$ | Excess cost beyond optimal coding |
| Minimum value | $H(P)$ (achieved when $Q = P$) | $0$ (achieved when $Q = P$) |
| Symmetry | Not symmetric: $H(P, Q) \neq H(Q, P)$ | Not symmetric: $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$ |
| Is a metric? | No (not symmetric, no triangle inequality) | No (not symmetric, no triangle inequality) |
| Relationship | $H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)$ | $D_{\text{KL}}(P \| Q) = H(P, Q) - H(P)$ |
| Primary use | Classification loss function | Distribution comparison, variational inference |
| Requires $P \log P$? | No | Yes |
| Gradient w.r.t. $\theta$ | $\nabla_\theta H(P, Q_\theta) = \nabla_\theta D_{\text{KL}}(P \| Q_\theta)$ | Same as cross-entropy when $P$ is fixed |

The Asymmetry of KL Divergence

KL divergence is not symmetric: $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$. The two directions have different behaviors and different names.

Forward KL: $D_{\text{KL}}(P \| Q)$

$$D_{\text{KL}}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

The expectation is over $P$. This penalizes heavily when $P(x) > 0$ but $Q(x) \approx 0$: the model assigns near-zero probability to an event that actually occurs. Forward KL produces mean-seeking behavior. If $P$ is multimodal, $Q$ will try to cover all modes, even at the cost of placing probability mass between them.

Reverse KL: $D_{\text{KL}}(Q \| P)$

$$D_{\text{KL}}(Q \| P) = \sum_x Q(x) \log \frac{Q(x)}{P(x)}$$

The expectation is over $Q$. This penalizes when $Q(x) > 0$ but $P(x) \approx 0$: the model places probability where no data exists. Reverse KL produces mode-seeking behavior. If $P$ is multimodal, $Q$ will collapse to a single mode rather than spread mass across all of them.
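The mean-seeking vs. mode-seeking contrast shows up when fitting a single Gaussian to a bimodal target by grid search over its mean. A sketch on a discrete grid (the mixture, grid, and fixed width $\sigma = 1$ are arbitrary choices for illustration):

```python
import numpy as np

x = np.linspace(-8, 8, 1601)

def normal(mu, sigma):
    p = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return p / p.sum()                   # normalize on the grid

# Bimodal target: equal narrow modes at -3 and +3
P = 0.5 * normal(-3, 0.5) + 0.5 * normal(3, 0.5)

mus = np.linspace(-4, 4, 401)
fwd = [np.sum(P * np.log(P / normal(m, 1))) for m in mus]
rev = [np.sum(normal(m, 1) * np.log(normal(m, 1) / P)) for m in mus]

mu_fwd = mus[np.argmin(fwd)]             # mean-seeking: lands near 0
mu_rev = mus[np.argmin(rev)]             # mode-seeking: lands near -3 or +3
```

Forward KL picks the mean between the modes (covering both), while reverse KL commits to one mode to avoid placing mass where $P \approx 0$.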

Why This Matters for Variational Inference

In variational inference, you approximate an intractable posterior $p(\theta \mid x)$ with a tractable family $q_\phi(\theta)$. Standard variational inference minimizes the reverse KL, $D_{\text{KL}}(q_\phi \| p)$, because the expectation is taken under the tractable $q_\phi$ rather than the intractable posterior.

The choice of KL direction determines the qualitative behavior of the approximation: reverse KL tends to produce mode-seeking, under-dispersed approximations. Neither direction is universally better.

Where Each Is Used

Example

Training a neural classifier

Use cross-entropy as the loss function. For a sample with true class $c$ and predicted probabilities $q_1, \ldots, q_K$:

$$\mathcal{L} = -\log q_c$$

This is cross-entropy with a one-hot $P$. It is equivalent to minimizing $D_{\text{KL}}(P \| Q)$, but cross-entropy is simpler to compute because the $P \log P$ terms vanish (they are $0 \cdot \log 0 = 0$ by convention).
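A quick check that the full cross-entropy sum reduces to $-\log q_c$ for a one-hot label, and that it coincides with the KL divergence since $H(P) = 0$ (the predicted probabilities below are made up):

```python
import numpy as np

q = np.array([0.1, 0.7, 0.2])     # model's predicted probabilities
c = 1                             # true class index

# One-hot P: all mass on class c
P = np.zeros_like(q)
P[c] = 1.0

full = -np.sum(P * np.log(q))     # only the c-th term survives
shortcut = -np.log(q[c])
assert np.isclose(full, shortcut)

# Restricting to the support of P, KL equals the cross-entropy: H(P) = 0
kl = np.sum(P[P > 0] * np.log(P[P > 0] / q[P > 0]))
assert np.isclose(kl, full)
```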

Example

Knowledge distillation

Use KL divergence explicitly. Given teacher distribution $P_T$ and student distribution $P_S$:

$$\mathcal{L}_{\text{KD}} = D_{\text{KL}}(P_T \| P_S)$$

Here $P_T$ is not one-hot: it is the softened output of the teacher model. The $P_T \log P_T$ terms are not constant across samples (different inputs produce different teacher distributions), so the two losses report different values. In practice, many implementations use cross-entropy for convenience and accept the offset, which does not affect gradients with respect to the student.
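A sketch of why the offset varies across samples: two hypothetical inputs produce two different teacher distributions, hence two different $H(P_T)$ values, while the identity $H(P_T, P_S) = H(P_T) + D_{\text{KL}}(P_T \| P_S)$ holds for each (the logits and temperature are made up):

```python
import numpy as np

def softmax(z, temp=1.0):
    e = np.exp(z / temp - np.max(z / temp))
    return e / e.sum()

temp = 4.0                                    # assumed distillation temperature
teacher_logits = [np.array([5.0, 2.0, 1.0]),  # hypothetical input A
                  np.array([3.0, 2.9, 0.5])]  # hypothetical input B
student = softmax(np.array([4.0, 2.5, 1.0]), temp)

entropies = []
for tl in teacher_logits:
    p_t = softmax(tl, temp)
    ce = -np.sum(p_t * np.log(student))
    kl = np.sum(p_t * np.log(p_t / student))
    h_t = -np.sum(p_t * np.log(p_t))
    assert np.isclose(ce, kl + h_t)   # identity holds per sample...
    entropies.append(h_t)

# ...but H(P_T) differs between samples, so CE and KL values diverge
assert not np.isclose(entropies[0], entropies[1])
```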

Example

Variational autoencoders

The ELBO contains a KL divergence term that regularizes the approximate posterior:

$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z))$$

The KL term measures how far the encoder's approximate posterior $q_\phi(z|x)$ deviates from the prior $p(z)$. For Gaussian $q$ and $p$, this has a closed-form expression. Cross-entropy would not be used here because the comparison is between two full distributions, not between a fixed empirical label distribution and a model.
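For a one-dimensional Gaussian posterior and a standard normal prior, the closed form is $D_{\text{KL}}(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0, 1)) = \frac{1}{2}(\mu^2 + \sigma^2 - \log \sigma^2 - 1)$. A sketch verifying it against numerical integration (the $\mu$, $\sigma$ values are arbitrary):

```python
import numpy as np

mu, sigma = 0.8, 0.6   # arbitrary example parameters

# Closed-form KL( N(mu, sigma^2) || N(0, 1) ), one dimension
closed = 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

# Numerical check by quadrature on a fine grid
x = np.linspace(-10.0, 10.0, 100001)
dx = x[1] - x[0]
q = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
numeric = np.sum(q * np.log(q / p)) * dx

assert np.isclose(closed, numeric, atol=1e-6)
```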

Common Confusions

Watch Out

Cross-entropy is not symmetric either

Both cross-entropy and KL divergence are asymmetric: $H(P, Q) \neq H(Q, P)$ in general. The asymmetry of KL divergence gets more attention because the two directions (forward and reverse) produce qualitatively different approximations. But cross-entropy is equally asymmetric, and swapping the arguments changes the loss function.

Watch Out

KL divergence is not a distance

Despite being called a "divergence," KL does not satisfy the properties of a distance metric. It is not symmetric and does not satisfy the triangle inequality. The Jensen-Shannon divergence $\text{JSD}(P \| Q) = \frac{1}{2}D_{\text{KL}}(P \| M) + \frac{1}{2}D_{\text{KL}}(Q \| M)$ with $M = \frac{1}{2}(P + Q)$ is symmetric, and its square root is a true metric.
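A small sketch contrasting the two: KL changes when its arguments are swapped, while JSD is symmetric by construction (the distributions are arbitrary examples):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

P = np.array([0.1, 0.4, 0.5])
Q = np.array([0.3, 0.3, 0.4])
M = 0.5 * (P + Q)

jsd_pq = 0.5 * kl(P, M) + 0.5 * kl(Q, M)
jsd_qp = 0.5 * kl(Q, M) + 0.5 * kl(P, M)

assert np.isclose(jsd_pq, jsd_qp)          # symmetric by construction
assert not np.isclose(kl(P, Q), kl(Q, P))  # KL itself is not
```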

Watch Out

The log base matters for units, not for optimization

Using $\log_2$ gives cross-entropy in bits. Using $\ln$ gives nats. Using $\log_{10}$ gives hartleys. The choice affects the numerical value but not the optimizer's behavior: the minimum is at the same $\theta$ regardless of log base. PyTorch uses $\ln$ by default.
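The unit conversion is a fixed multiplicative factor of $\ln 2$, which is why the argmin cannot move. A quick check (the distributions are arbitrary examples):

```python
import numpy as np

P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.4, 0.4, 0.2])

ce_nats = -np.sum(P * np.log(Q))    # natural log: nats
ce_bits = -np.sum(P * np.log2(Q))   # base 2: bits

# Same quantity, rescaled by the constant ln(2)
assert np.isclose(ce_bits, ce_nats / np.log(2))
```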

Watch Out

Cross-entropy loss in PyTorch includes the softmax

torch.nn.CrossEntropyLoss applies log-softmax internally before computing the negative log-likelihood. If you apply softmax to your logits and then pass them to CrossEntropyLoss, you are applying softmax twice. This is a common bug that produces silently degraded training.
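What CrossEntropyLoss computes can be sketched in NumPy without PyTorch: log-softmax over the logits, then the negative log-likelihood of the target class. Applying softmax first still produces a finite loss, which is why the bug is silent (the logits here are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy_from_logits(logits, target):
    # log-softmax + NLL, mirroring what CrossEntropyLoss does internally
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target]

logits = np.array([3.0, 1.0, 0.2])
target = 0  # the class with the largest logit

correct = cross_entropy_from_logits(logits, target)
buggy = cross_entropy_from_logits(softmax(logits), target)  # softmax applied twice

# The buggy loss is finite but compressed: softmax outputs lie in [0, 1],
# so the "logits" lose their dynamic range and gradients shrink.
assert correct < buggy < np.log(3)
```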

References

  1. Cover, T.M. and Thomas, J.A. Elements of Information Theory. 2nd ed. Wiley, 2006. Chapter 2 defines entropy, cross-entropy, and KL divergence. Theorem 2.6.3 proves non-negativity of KL.
  2. Bishop, C.M. Pattern Recognition and Machine Learning. Springer, 2006. Section 1.6.1 derives the relationship between KL divergence and cross-entropy.
  3. Murphy, K.P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012. Chapter 2.8 covers KL divergence and its use in variational inference.
  4. Blei, D.M., Kucukelbir, A., and McAuliffe, J.D. "Variational Inference: A Review for Statisticians." JASA, 2017. Section 2 explains the ELBO and the role of KL divergence direction.
  5. Hinton, G., Vinyals, O., and Dean, J. "Distilling the Knowledge in a Neural Network." 2015. Section 2 uses KL divergence for knowledge distillation.
  6. Kingma, D.P. and Welling, M. "Auto-Encoding Variational Bayes." ICLR 2014. Section 2.2 derives the KL term in the VAE objective.
  7. Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. Section 3.13 covers KL divergence and cross-entropy in the context of loss functions.