Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Foundations

KL Divergence

Kullback-Leibler divergence measures how one probability distribution differs from another. It is asymmetric, always non-negative, and central to variational inference, maximum likelihood estimation, and RLHF.

Core · Tier 1 · Stable · ~45 min

Why This Matters

[Figure: two densities $p(x)$ and $q(x)$ on $x \in [-3, 3]$, with $D_{\text{KL}}(p \| q) = 0.42$ nats and $D_{\text{KL}}(q \| p) = 0.58$ nats.] KL divergence is not symmetric: $D_{\text{KL}}(p \| q)$ penalizes $q$ for missing $p$'s mass; $D_{\text{KL}}(q \| p)$ penalizes $q$ for excess mass.

KL divergence appears everywhere in machine learning. Maximum likelihood estimation minimizes KL divergence from the data distribution to the model. Variational inference minimizes KL divergence from the approximate posterior to the true posterior. RLHF uses a KL penalty to keep the fine-tuned policy close to the base model. Understanding the asymmetry of KL divergence explains why these different objectives produce different behaviors.

Mental Model

$D_{\text{KL}}(P \| Q)$ measures the expected number of extra bits needed to encode samples from $P$ using a code optimized for $Q$. It is zero only when $P = Q$. It is not symmetric: the cost of encoding $P$-samples with a $Q$-code differs from encoding $Q$-samples with a $P$-code.

Core Definitions

Definition

KL Divergence

For discrete distributions $P$ and $Q$ on the same sample space:

$$D_{\text{KL}}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

For continuous distributions with densities $p$ and $q$:

$$D_{\text{KL}}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$

Convention: $0 \log(0/q) = 0$ and $p \log(p/0) = +\infty$ when $p > 0$. If $Q$ does not dominate $P$ (there exists $x$ with $P(x) > 0$ but $Q(x) = 0$), then $D_{\text{KL}}(P \| Q) = +\infty$.
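The discrete definition and its conventions translate directly into code. A minimal sketch (the function name is illustrative):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as lists of probabilities.

    Uses the conventions 0 * log(0/q) = 0 and p * log(p/0) = +inf for p > 0.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue          # 0 * log(0/q) = 0 by convention
        if qi == 0.0:
            return math.inf   # P not dominated by Q
        total += pi * math.log(pi / qi)
    return total

# Identical distributions give zero divergence.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
# Missing support in Q gives +inf.
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))  # inf
```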

Definition

Cross-Entropy

The cross-entropy between $P$ and $Q$ is:

$$H(P, Q) = -\sum_{x} P(x) \log Q(x) = H(P) + D_{\text{KL}}(P \| Q)$$

where $H(P) = -\sum_x P(x) \log P(x)$ is the Shannon entropy. Since $H(P)$ is fixed with respect to $Q$, minimizing cross-entropy over $Q$ is equivalent to minimizing KL divergence.
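The identity $H(P, Q) = H(P) + D_{\text{KL}}(P \| Q)$ can be verified numerically for any discrete pair. A small sketch with arbitrary three-outcome distributions:

```python
import math

def entropy(p):
    """Shannon entropy H(P) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(P || Q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # arbitrary example distributions
q = [0.5, 0.3, 0.2]
# The two sides of H(P, Q) = H(P) + D_KL(P || Q) agree.
print(cross_entropy(p, q), entropy(p) + kl(p, q))
```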

Main Theorems

Theorem

Gibbs Inequality (Non-negativity of KL Divergence)

Statement

For any two probability distributions $P$ and $Q$:

$$D_{\text{KL}}(P \| Q) \geq 0$$

with equality if and only if $P = Q$ almost everywhere.

Intuition

The logarithm is concave. Applying Jensen's inequality to $\log(Q/P)$ under expectation with respect to $P$ gives $\mathbb{E}_P[\log(Q/P)] \leq \log(\mathbb{E}_P[Q/P]) = \log(1) = 0$. Negating both sides yields $D_{\text{KL}} \geq 0$.

Proof Sketch

By Jensen's inequality applied to the concave function $\log$:

$$-D_{\text{KL}}(P \| Q) = \mathbb{E}_P\left[\log \frac{Q(X)}{P(X)}\right] \leq \log \mathbb{E}_P\left[\frac{Q(X)}{P(X)}\right] = \log \sum_x Q(x) = 0$$

Equality in Jensen's inequality holds iff $Q(X)/P(X)$ is constant $P$-a.s., which forces $P = Q$.
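Gibbs' inequality is easy to probe empirically: for randomly drawn pairs of distributions the divergence never goes negative, and it vanishes when the two arguments coincide. A quick check (the seed and dimension are arbitrary):

```python
import math
import random

def kl(p, q):
    """D_KL(P || Q) for discrete distributions with full support in q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_dist(k, rng):
    """A random probability vector of length k with strictly positive entries."""
    w = [rng.random() + 1e-9 for _ in range(k)]
    s = sum(w)
    return [wi / s for wi in w]

rng = random.Random(0)
# Non-negative for every random pair; exactly zero when P == Q.
for _ in range(1000):
    p, q = random_dist(5, rng), random_dist(5, rng)
    assert kl(p, q) >= 0.0
    assert kl(p, p) == 0.0
print("Gibbs inequality held on all trials")
```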

Why It Matters

Non-negativity of KL divergence is the foundation for many results. It implies that cross-entropy is always at least the entropy: $H(P, Q) \geq H(P)$. It guarantees that MLE is consistent (the minimizer of KL from the true distribution is the true distribution itself). It ensures the ELBO is a valid lower bound on the log-evidence.

Failure Mode

KL divergence is not a metric. It is not symmetric: $D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$ in general. It does not satisfy the triangle inequality. This means you cannot use it as a distance in metric-space arguments. If you need a true metric, consider the total variation distance or the Wasserstein distance.

Forward vs. Reverse KL

The two directions of KL divergence produce qualitatively different behavior when fitting an approximate distribution $Q$ to a target $P$.

Forward KL: $D_{\text{KL}}(P \| Q)$

Minimizing $D_{\text{KL}}(P \| Q)$ over $Q$ is mode-covering (also called "M-projection" or mean-seeking). The objective penalizes $Q(x) \approx 0$ wherever $P(x) > 0$, because $P(x) \log(P(x)/Q(x)) \to \infty$ as $Q(x) \to 0$. So $Q$ must spread mass to cover all modes of $P$, even if this means placing mass where $P$ is small.

This is exactly what MLE does: given data from $P$, minimize $H(P_{\text{data}}, Q_\theta) = H(P_{\text{data}}) + D_{\text{KL}}(P_{\text{data}} \| Q_\theta)$.

Reverse KL: $D_{\text{KL}}(Q \| P)$

Minimizing $D_{\text{KL}}(Q \| P)$ over $Q$ is mode-seeking (also called "I-projection"). The objective penalizes $Q(x) > 0$ wherever $P(x) \approx 0$, because $Q(x) \log(Q(x)/P(x)) \to \infty$ as $P(x) \to 0$. So $Q$ avoids placing mass where $P$ is small, and tends to concentrate on a single mode of $P$.

This is what variational inference does: minimize $D_{\text{KL}}(q_\phi(z) \| p(z \mid x))$ over the variational family $q_\phi$.
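The mode-covering vs. mode-seeking contrast can be reproduced on a grid: fit a single Gaussian $Q$ to a bimodal target $P$ by brute-force search over the mean, once for each direction of KL. A sketch (the grid, bandwidths, and search range are arbitrary choices):

```python
import math

def gauss(x, mu, sigma):
    """Gaussian density N(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Target P: equal mixture of N(-2, 0.5^2) and N(+2, 0.5^2), discretized on a grid.
xs = [i * 0.05 for i in range(-120, 121)]
p = [0.5 * gauss(x, -2, 0.5) + 0.5 * gauss(x, 2, 0.5) for x in xs]
Z = sum(p)
p = [pi / Z for pi in p]   # normalize on the grid

def fit(direction):
    """Brute-force search for the mean of Q = N(mu, 1) minimizing the chosen KL."""
    best_mu, best_kl = None, float("inf")
    for mu10 in range(-30, 31):        # search mu over [-3, 3] in steps of 0.1
        mu = mu10 / 10
        q = [gauss(x, mu, 1.0) for x in xs]
        Zq = sum(q)
        q = [qi / Zq for qi in q]
        if direction == "forward":     # D_KL(P || Q): mode-covering
            d = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
        else:                          # D_KL(Q || P): mode-seeking
            d = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)
        if d < best_kl:
            best_mu, best_kl = mu, d
    return best_mu

print("forward KL picks mu =", fit("forward"))   # near 0: spreads over both modes
print("reverse KL picks mu =", fit("reverse"))   # near -2 or +2: locks onto one mode
```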

Connection to Variational Inference

The evidence lower bound (ELBO) arises from the decomposition:

$$\log p(x) = \mathcal{L}(q) + D_{\text{KL}}(q(z) \| p(z \mid x))$$

where $\mathcal{L}(q) = \mathbb{E}_{q}[\log p(x, z) - \log q(z)]$ is the ELBO. Since $D_{\text{KL}} \geq 0$, we have $\mathcal{L}(q) \leq \log p(x)$. Maximizing the ELBO over $q$ is equivalent to minimizing $D_{\text{KL}}(q \| p(\cdot \mid x))$.
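The decomposition can be confirmed exactly on a toy discrete latent-variable model; the joint probabilities below are made up for illustration:

```python
import math

# Tiny discrete model: latent z in {0, 1}, one fixed observation x.
# Hypothetical joint probabilities p(x, z) for the observed x.
p_xz = {0: 0.12, 1: 0.28}                 # p(x, z=0), p(x, z=1)
p_x = sum(p_xz.values())                  # evidence p(x)
post = {z: p_xz[z] / p_x for z in p_xz}   # true posterior p(z | x)

q = {0: 0.5, 1: 0.5}                      # some variational distribution q(z)

# ELBO = E_q[log p(x, z) - log q(z)]
elbo = sum(q[z] * (math.log(p_xz[z]) - math.log(q[z])) for z in q)
# KL(q || p(z | x))
kl = sum(q[z] * math.log(q[z] / post[z]) for z in q)

# log p(x) = ELBO + D_KL(q || p(z|x)), so the ELBO is a lower bound.
print(math.log(p_x), elbo + kl)
```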

Connection to RLHF

In RLHF, the optimization objective includes a KL penalty:

$$\max_\pi \; \mathbb{E}_{x \sim \pi}[r(x)] - \beta \, D_{\text{KL}}(\pi \| \pi_{\text{ref}})$$

where $\pi_{\text{ref}}$ is the supervised fine-tuned (base) policy and $r$ is the reward model. The KL term prevents the fine-tuned policy from deviating too far from the base model, which would lead to reward hacking (exploiting errors in $r$ rather than genuinely improving quality). The coefficient $\beta$ controls the tradeoff.
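A standard consequence (not derived here) is that this penalized objective has the closed-form maximizer $\pi^*(x) \propto \pi_{\text{ref}}(x) \exp(r(x)/\beta)$. The sketch below checks this numerically on a made-up three-action problem: no small redistribution of probability mass improves on $\pi^*$.

```python
import math

def objective(pi, pi_ref, r, beta):
    """E_pi[r] - beta * D_KL(pi || pi_ref) over a discrete action space."""
    reward = sum(p * ri for p, ri in zip(pi, r))
    kl = sum(p * math.log(p / pr) for p, pr in zip(pi, pi_ref) if p > 0)
    return reward - beta * kl

pi_ref = [0.5, 0.3, 0.2]   # made-up reference policy
r = [1.0, 0.0, 2.0]        # made-up rewards
beta = 0.5

# Closed-form maximizer: pi* proportional to pi_ref * exp(r / beta).
w = [pr * math.exp(ri / beta) for pr, ri in zip(pi_ref, r)]
pi_star = [wi / sum(w) for wi in w]

# Verify: shifting a little mass between any two actions only hurts.
base = objective(pi_star, pi_ref, r, beta)
for i, j in [(0, 1), (0, 2), (1, 2)]:
    for eps in (1e-3, -1e-3):
        pert = list(pi_star)
        pert[i] += eps
        pert[j] -= eps
        assert objective(pert, pi_ref, r, beta) <= base + 1e-9
print("pi* =", [round(x, 3) for x in pi_star])
```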

Canonical Examples

Example

KL between two Gaussians

For $P = \mathcal{N}(\mu_1, \sigma_1^2)$ and $Q = \mathcal{N}(\mu_2, \sigma_2^2)$:

$$D_{\text{KL}}(P \| Q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

Note: this is zero only when $\mu_1 = \mu_2$ and $\sigma_1 = \sigma_2$. The asymmetry is visible: swapping $P$ and $Q$ gives a different expression.
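The closed form can be cross-checked against a direct numerical integration of the definition. A sketch (the integration range and step count are arbitrary):

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form D_KL(N(mu1, s1^2) || N(mu2, s2^2)), in nats."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

def kl_numeric(mu1, s1, mu2, s2, lo=-20.0, hi=20.0, n=40000):
    """Midpoint-rule approximation of the integral definition."""
    def pdf(x, mu, s):
        return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p = pdf(x, mu1, s1)
        q = pdf(x, mu2, s2)
        total += p * math.log(p / q) * dx
    return total

print(kl_gauss(0, 1, 1, 2))    # closed form
print(kl_numeric(0, 1, 1, 2))  # agrees to several decimal places
```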

Example

KL between Bernoulli distributions

For $P = \text{Bern}(p)$ and $Q = \text{Bern}(q)$:

$$D_{\text{KL}}(P \| Q) = p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}$$

If $p = 0.9$ and $q = 0.5$: $D_{\text{KL}} = 0.9 \log(1.8) + 0.1 \log(0.2) \approx 0.368$ nats. If $p = 0.5$ and $q = 0.9$: $D_{\text{KL}} = 0.5 \log(5/9) + 0.5 \log(5) \approx 0.511$ nats. The different values confirm the asymmetry.
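These two computations are a three-line check; a minimal sketch:

```python
import math

def kl_bern(p, q):
    """D_KL(Bern(p) || Bern(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

print(round(kl_bern(0.9, 0.5), 3))  # 0.368
print(round(kl_bern(0.5, 0.9), 3))  # 0.511 -- asymmetric
```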

Common Confusions

Watch Out

KL divergence is not a distance

$D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)$ and the triangle inequality fails. Calling it a "distance" is common but misleading. It is a divergence: a measure of discrepancy, not a metric.

Watch Out

Small KL bounds total variation, but not conversely

$D_{\text{KL}}(P \| Q) = 0$ means $P = Q$ a.e., which is a strong statement. Pinsker's inequality, $\text{TV}(P, Q) \leq \sqrt{D_{\text{KL}}(P \| Q)/2}$, shows that small KL does force small total variation, though the bound is often loose. The converse fails: $P$ and $Q$ can be close in total variation while $D_{\text{KL}}(P \| Q)$ is large or even infinite, e.g. when $Q$ assigns zero mass to a low-probability event of $P$.
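Pinsker's inequality can be illustrated with the Bernoulli pair from the earlier example; a small sketch:

```python
import math

def kl(p, q):
    """D_KL(P || Q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    """Total variation distance for discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.9, 0.1]   # Bern(0.9)
q = [0.5, 0.5]   # Bern(0.5)
# Pinsker: TV(P, Q) <= sqrt(D_KL(P || Q) / 2).
print(tv(p, q), "<=", math.sqrt(kl(p, q) / 2))
```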

Watch Out

Minimizing forward KL is not the same as minimizing reverse KL

MLE minimizes the forward KL $D_{\text{KL}}(P_{\text{data}} \| Q_\theta)$, producing mode-covering fits. Variational inference minimizes the reverse KL $D_{\text{KL}}(q_\phi \| p(\cdot \mid x))$, producing mode-seeking fits. When $P$ is multimodal and $Q$ is unimodal, forward KL places $Q$ between modes; reverse KL concentrates $Q$ on one mode. These are genuinely different objectives with different solutions.

Summary

  • $D_{\text{KL}}(P \| Q) = \mathbb{E}_P[\log(P/Q)] \geq 0$, with equality iff $P = Q$
  • Not symmetric, not a metric
  • Forward KL ($D(P \| Q)$ minimized over $Q$): mode-covering, equivalent to MLE
  • Reverse KL ($D(Q \| P)$ minimized over $Q$): mode-seeking, used in variational inference
  • Cross-entropy = entropy + KL divergence
  • KL penalty in RLHF constrains policy deviation from the base model

Exercises

Exercise · Core

Problem

Compute $D_{\text{KL}}(P \| Q)$ and $D_{\text{KL}}(Q \| P)$ for $P = \text{Bern}(0.5)$ and $Q = \text{Bern}(0.1)$. Verify that they differ.

Exercise · Advanced

Problem

Derive the closed-form KL divergence between two multivariate Gaussians $P = \mathcal{N}(\mu_1, \Sigma_1)$ and $Q = \mathcal{N}(\mu_2, \Sigma_2)$ in $\mathbb{R}^d$.


References

Canonical:

  • Cover & Thomas, Elements of Information Theory (2006), Chapter 2
  • Kullback & Leibler, "On Information and Sufficiency" (1951)

Current:

  • Murphy, Probabilistic Machine Learning: Advanced Topics (2023), Chapter 6
  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1.6 and 10.1
  • MacKay, Information Theory, Inference, and Learning Algorithms (2003), Chapter 2
  • Csiszár & Körner, Information Theory (2nd ed., 2011), Chapter 2

Next Topics

  • Variational inference: using reverse KL to approximate intractable posteriors
  • Mutual information: symmetric quantity built from KL divergence

Last reviewed: April 2026
