

KL Divergence vs. Cross-Entropy

Cross-entropy and KL divergence differ by an additive term: $H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)$. When the true distribution $P$ is fixed (as in supervised classification), $H(P)$ is constant, so minimizing cross-entropy is equivalent to minimizing KL divergence. The two quantities differ in interpretation and typical usage.

The Exact Relationship

For discrete distributions $P$ and $Q$ over the same sample space:

$$H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)$$

where $H(P, Q)$ is the cross-entropy, $H(P)$ is the entropy of $P$, and $D_{\text{KL}}(P \| Q)$ is the KL divergence from $P$ to $Q$.

This identity is the single most important fact about the relationship between these two quantities. Everything else follows from it.
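The identity is easy to check numerically. A minimal sketch with NumPy (the two distributions below are arbitrary examples):

```python
import numpy as np

# Two arbitrary example distributions over 4 outcomes
P = np.array([0.1, 0.4, 0.3, 0.2])
Q = np.array([0.25, 0.25, 0.25, 0.25])

entropy_P = -np.sum(P * np.log(P))       # H(P), in nats
cross_entropy = -np.sum(P * np.log(Q))   # H(P, Q)
kl = np.sum(P * np.log(P / Q))           # D_KL(P || Q)

# H(P, Q) = H(P) + D_KL(P || Q), up to floating-point error
assert np.isclose(cross_entropy, entropy_P + kl)
```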

Definitions

Definition

Cross-Entropy

For discrete distributions $P$ and $Q$:

$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$

Cross-entropy measures the expected number of bits (or nats) needed to encode samples from $P$ using the code optimized for $Q$. It combines two sources of cost: the inherent uncertainty in $P$ (the entropy $H(P)$) and the mismatch between $P$ and $Q$ (the KL divergence).

Definition

KL Divergence

For discrete distributions $P$ and $Q$ where $Q(x) > 0$ whenever $P(x) > 0$:

$$D_{\text{KL}}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

KL divergence measures only the excess cost of using $Q$ instead of $P$. It is always non-negative ($D_{\text{KL}}(P \| Q) \geq 0$ by Gibbs' inequality) and equals zero if and only if $P = Q$ almost everywhere.

Why Minimizing Cross-Entropy Equals Minimizing KL

In supervised classification, $P$ is the empirical data distribution (one-hot labels) and $Q_\theta$ is the model's predicted distribution. You optimize $\theta$.

Since $H(P)$ does not depend on $\theta$:

$$\arg\min_\theta H(P, Q_\theta) = \arg\min_\theta \left[ H(P) + D_{\text{KL}}(P \| Q_\theta) \right] = \arg\min_\theta D_{\text{KL}}(P \| Q_\theta)$$

The entropy $H(P)$ is a constant with respect to the model parameters. Minimizing cross-entropy and minimizing KL divergence produce the same optimal $\theta$. Cross-entropy is preferred in practice because it avoids computing $P \log P$ terms (which are constant and involve $\log 0$ for one-hot labels).
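The shared argmin can be seen in a small grid search. A sketch with NumPy, using a soft (non-one-hot) $P$ so that $H(P) > 0$ and the constant offset is visible; the one-parameter model family here is an arbitrary choice for illustration:

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])        # fixed "true" distribution

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A one-parameter family Q_theta: softmax over theta-scaled logits
logits = np.array([2.0, 1.0, 0.0])
thetas = np.linspace(0.0, 3.0, 301)

ce = np.array([-np.sum(P * np.log(softmax(t * logits))) for t in thetas])
kl = np.array([np.sum(P * np.log(P / softmax(t * logits))) for t in thetas])

# The two loss curves differ by the constant H(P) and share their argmin
entropy_P = -np.sum(P * np.log(P))
assert np.allclose(ce - kl, entropy_P)
assert ce.argmin() == kl.argmin()
```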

Comparison Table

| Property | Cross-Entropy $H(P, Q)$ | KL Divergence $D_{\text{KL}}(P \| Q)$ |
| --- | --- | --- |
| Formula | $-\sum P(x) \log Q(x)$ | $\sum P(x) \log \frac{P(x)}{Q(x)}$ |
| Measures | Total encoding cost under $Q$ | Excess cost beyond optimal coding |
| Minimum value | $H(P)$ (achieved when $Q = P$) | $0$ (achieved when $Q = P$) |
| Symmetry | Not symmetric: $H(P, Q) \neq H(Q, P)$ | Not symmetric: $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$ |
| Is a metric? | No (not symmetric, no triangle inequality) | No (not symmetric, no triangle inequality) |
| Relationship | $H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)$ | $D_{\text{KL}}(P \| Q) = H(P, Q) - H(P)$ |
| Primary use | Classification loss function | Distribution comparison, variational inference |
| Requires $P \log P$? | No | Yes |
| Gradient w.r.t. $\theta$ | $\nabla_\theta H(P, Q_\theta) = \nabla_\theta D_{\text{KL}}(P \| Q_\theta)$ | Same as cross-entropy when $P$ is fixed |

The Asymmetry of KL Divergence

KL divergence is not symmetric: $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$. The two directions have different behaviors and different names.

Forward KL: $D_{\text{KL}}(P \| Q)$

$$D_{\text{KL}}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

The expectation is over $P$. This penalizes heavily when $P(x) > 0$ but $Q(x) \approx 0$: the model assigns near-zero probability to an event that actually occurs. Forward KL produces mean-seeking behavior. If $P$ is multimodal, $Q$ will try to cover all modes, even at the cost of placing probability mass between them.

Reverse KL: $D_{\text{KL}}(Q \| P)$

$$D_{\text{KL}}(Q \| P) = \sum_x Q(x) \log \frac{Q(x)}{P(x)}$$

The expectation is over $Q$. This penalizes when $Q(x) > 0$ but $P(x) \approx 0$: the model places probability where no data exists. Reverse KL produces mode-seeking behavior. If $P$ is multimodal, $Q$ will collapse to a single mode rather than spread mass across all of them.
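The mean-seeking vs. mode-seeking contrast shows up when fitting a single Gaussian to a bimodal target by grid search over its mean. A sketch on a discrete grid (the mixture, grid, and fixed width $\sigma = 1$ are arbitrary choices for illustration):

```python
import numpy as np

x = np.linspace(-8, 8, 1601)

def normal(mu, sigma):
    p = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return p / p.sum()                   # normalize on the grid

# Bimodal target: equal narrow modes at -3 and +3
P = 0.5 * normal(-3, 0.5) + 0.5 * normal(3, 0.5)

mus = np.linspace(-4, 4, 401)
fwd = [np.sum(P * np.log(P / normal(m, 1))) for m in mus]
rev = [np.sum(normal(m, 1) * np.log(normal(m, 1) / P)) for m in mus]

mu_fwd = mus[np.argmin(fwd)]             # mean-seeking: lands near 0
mu_rev = mus[np.argmin(rev)]             # mode-seeking: lands near -3 or +3
```

Forward KL picks the mean between the modes (covering both), while reverse KL commits to one mode to avoid placing mass where $P \approx 0$.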

Why This Matters for Variational Inference

In variational inference, you approximate an intractable posterior $p(\theta \mid x)$ with a tractable family $q_\phi(\theta)$. Standard variational inference minimizes the reverse KL, $D_{\text{KL}}(q_\phi \| p)$, because the expectation is taken under the tractable $q_\phi$ rather than the intractable posterior.

The choice of KL direction determines the qualitative behavior of the approximation: reverse KL tends to produce mode-seeking, under-dispersed approximations. Neither direction is universally better.

Where Each Is Used

Example

Training a neural classifier

Use cross-entropy as the loss function. For a sample with true class $c$ and predicted probabilities $q_1, \ldots, q_K$:

$$\mathcal{L} = -\log q_c$$

This is cross-entropy with a one-hot $P$. It is equivalent to minimizing $D_{\text{KL}}(P \| Q)$, but cross-entropy is simpler to compute because the $P \log P$ terms vanish (they are $0 \cdot \log 0 = 0$ by convention).
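A quick check that the full cross-entropy sum reduces to $-\log q_c$ for a one-hot label, and that it coincides with the KL divergence since $H(P) = 0$ (the predicted probabilities below are made up):

```python
import numpy as np

q = np.array([0.1, 0.7, 0.2])     # model's predicted probabilities
c = 1                             # true class index

# One-hot P: all mass on class c
P = np.zeros_like(q)
P[c] = 1.0

full = -np.sum(P * np.log(q))     # only the c-th term survives
shortcut = -np.log(q[c])
assert np.isclose(full, shortcut)

# Restricting to the support of P, KL equals the cross-entropy: H(P) = 0
kl = np.sum(P[P > 0] * np.log(P[P > 0] / q[P > 0]))
assert np.isclose(kl, full)
```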

Example

Knowledge distillation

Use KL divergence explicitly. Given teacher distribution $P_T$ and student distribution $P_S$:

$$\mathcal{L}_{\text{KD}} = D_{\text{KL}}(P_T \| P_S)$$

Here $P_T$ is not one-hot: it is the softened output of the teacher model. The $P_T \log P_T$ terms are not constant across samples (different inputs produce different teacher distributions), so the two losses report different values. In practice, many implementations use cross-entropy for convenience and accept the offset, which does not affect gradients with respect to the student.
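A sketch of why the offset varies across samples: two hypothetical inputs produce two different teacher distributions, hence two different $H(P_T)$ values, while the identity $H(P_T, P_S) = H(P_T) + D_{\text{KL}}(P_T \| P_S)$ holds for each (the logits and temperature are made up):

```python
import numpy as np

def softmax(z, temp=1.0):
    e = np.exp(z / temp - np.max(z / temp))
    return e / e.sum()

temp = 4.0                                    # assumed distillation temperature
teacher_logits = [np.array([5.0, 2.0, 1.0]),  # hypothetical input A
                  np.array([3.0, 2.9, 0.5])]  # hypothetical input B
student = softmax(np.array([4.0, 2.5, 1.0]), temp)

entropies = []
for tl in teacher_logits:
    p_t = softmax(tl, temp)
    ce = -np.sum(p_t * np.log(student))
    kl = np.sum(p_t * np.log(p_t / student))
    h_t = -np.sum(p_t * np.log(p_t))
    assert np.isclose(ce, kl + h_t)   # identity holds per sample...
    entropies.append(h_t)

# ...but H(P_T) differs between samples, so CE and KL values diverge
assert not np.isclose(entropies[0], entropies[1])
```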

Example

Variational autoencoders

The ELBO contains a KL divergence term that regularizes the approximate posterior:

$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z))$$

The KL term measures how far the encoder's approximate posterior $q_\phi(z|x)$ deviates from the prior $p(z)$. For Gaussian $q$ and $p$, this has a closed-form expression. Cross-entropy would not be used here because the comparison is between two full distributions, not between a fixed empirical label distribution and a model.
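For a one-dimensional Gaussian posterior and a standard normal prior, the closed form is $D_{\text{KL}}(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0, 1)) = \frac{1}{2}(\mu^2 + \sigma^2 - \log \sigma^2 - 1)$. A sketch verifying it against numerical integration (the $\mu$, $\sigma$ values are arbitrary):

```python
import numpy as np

mu, sigma = 0.8, 0.6   # arbitrary example parameters

# Closed-form KL( N(mu, sigma^2) || N(0, 1) ), one dimension
closed = 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

# Numerical check by quadrature on a fine grid
x = np.linspace(-10.0, 10.0, 100001)
dx = x[1] - x[0]
q = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
numeric = np.sum(q * np.log(q / p)) * dx

assert np.isclose(closed, numeric, atol=1e-6)
```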

Common Confusions

Watch Out

Cross-entropy is not symmetric either

Both cross-entropy and KL divergence are asymmetric: $H(P, Q) \neq H(Q, P)$ in general. The asymmetry of KL divergence gets more attention because the two directions (forward and reverse) produce qualitatively different approximations. But cross-entropy is equally asymmetric, and swapping the arguments changes the loss function.

Watch Out

KL divergence is not a distance

Despite being called a "divergence," KL does not satisfy the properties of a distance metric. It is not symmetric and does not satisfy the triangle inequality. The Jensen-Shannon divergence $\text{JSD}(P \| Q) = \frac{1}{2}D_{\text{KL}}(P \| M) + \frac{1}{2}D_{\text{KL}}(Q \| M)$ with $M = \frac{1}{2}(P + Q)$ is symmetric, and its square root is a true metric.
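A small sketch contrasting the two: KL changes when its arguments are swapped, while JSD is symmetric by construction (the distributions are arbitrary examples):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

P = np.array([0.1, 0.4, 0.5])
Q = np.array([0.3, 0.3, 0.4])
M = 0.5 * (P + Q)

jsd_pq = 0.5 * kl(P, M) + 0.5 * kl(Q, M)
jsd_qp = 0.5 * kl(Q, M) + 0.5 * kl(P, M)

assert np.isclose(jsd_pq, jsd_qp)          # symmetric by construction
assert not np.isclose(kl(P, Q), kl(Q, P))  # KL itself is not
```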

Watch Out

The log base matters for units, not for optimization

Using $\log_2$ gives cross-entropy in bits. Using $\ln$ gives nats. Using $\log_{10}$ gives hartleys. The choice affects the numerical value but not the optimizer's behavior: the minimum is at the same $\theta$ regardless of log base. PyTorch uses $\ln$ by default.
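The unit conversion is a fixed multiplicative factor of $\ln 2$, which is why the argmin cannot move. A quick check (the distributions are arbitrary examples):

```python
import numpy as np

P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.4, 0.4, 0.2])

ce_nats = -np.sum(P * np.log(Q))    # natural log: nats
ce_bits = -np.sum(P * np.log2(Q))   # base 2: bits

# Same quantity, rescaled by the constant ln(2)
assert np.isclose(ce_bits, ce_nats / np.log(2))
```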

Watch Out

Cross-entropy loss in PyTorch includes the softmax

torch.nn.CrossEntropyLoss applies log-softmax internally before computing the negative log-likelihood. If you apply softmax to your logits and then pass them to CrossEntropyLoss, you are applying softmax twice. This is a common bug that produces silently degraded training.
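What CrossEntropyLoss computes can be sketched in NumPy without PyTorch: log-softmax over the logits, then the negative log-likelihood of the target class. Applying softmax first still produces a finite loss, which is why the bug is silent (the logits here are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy_from_logits(logits, target):
    # log-softmax + NLL, mirroring what CrossEntropyLoss does internally
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target]

logits = np.array([3.0, 1.0, 0.2])
target = 0  # the class with the largest logit

correct = cross_entropy_from_logits(logits, target)
buggy = cross_entropy_from_logits(softmax(logits), target)  # softmax applied twice

# The buggy loss is finite but compressed: softmax outputs lie in [0, 1],
# so the "logits" lose their dynamic range and gradients shrink.
assert correct < buggy < np.log(3)
```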

References

  1. Cover, T.M. and Thomas, J.A. Elements of Information Theory. 2nd ed. Wiley, 2006. Chapter 2 defines entropy, cross-entropy, and KL divergence. Theorem 2.6.3 proves non-negativity of KL.
  2. Bishop, C.M. Pattern Recognition and Machine Learning. Springer, 2006. Section 1.6.1 derives the relationship between KL divergence and cross-entropy.
  3. Murphy, K.P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012. Chapter 2.8 covers KL divergence and its use in variational inference.
  4. Blei, D.M., Kucukelbir, A., and McAuliffe, J.D. "Variational Inference: A Review for Statisticians." JASA, 2017. Section 2 explains the ELBO and the role of KL divergence direction.
  5. Hinton, G., Vinyals, O., and Dean, J. "Distilling the Knowledge in a Neural Network." 2015. Section 2 uses KL divergence for knowledge distillation.
  6. Kingma, D.P. and Welling, M. "Auto-Encoding Variational Bayes." ICLR 2014. Section 2.2 derives the KL term in the VAE objective.
  7. Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. Section 3.13 covers KL divergence and cross-entropy in the context of loss functions.