Foundations
KL Divergence
Kullback–Leibler (KL) divergence $D_{\mathrm{KL}}(P \,\|\, Q)$ measures how one probability distribution differs from another. It is asymmetric, always non-negative, and central to variational inference, maximum likelihood estimation, and RLHF.
Why This Matters
KL divergence appears everywhere in machine learning. Maximum likelihood estimation minimizes KL divergence from the data distribution to the model. Variational inference minimizes KL divergence from the approximate posterior to the true posterior. RLHF uses a KL penalty to keep the fine-tuned policy close to the base model. Understanding the asymmetry of KL divergence explains why these different objectives produce different behaviors.
Mental Model
$D_{\mathrm{KL}}(P \,\|\, Q)$ measures the expected number of extra bits needed to encode samples from $P$ using a code optimized for $Q$. It is zero only when $P = Q$. It is not symmetric: the cost of encoding $P$-samples with a $Q$-code differs from the cost of encoding $Q$-samples with a $P$-code.
Core Definitions
KL Divergence
For discrete distributions $P$ and $Q$ on the same sample space $\mathcal{X}$:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$$

For continuous distributions with densities $p$ and $q$:

$$D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$

Convention: $0 \log \frac{0}{q(x)} = 0$, and $p(x) \log \frac{p(x)}{0} = \infty$ when $p(x) > 0$. If $Q$ does not dominate $P$ (there exists $x$ with $P(x) > 0$ but $Q(x) = 0$), then $D_{\mathrm{KL}}(P \,\|\, Q) = \infty$.
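The discrete definition and its conventions translate directly into code. A minimal sketch in Python (the function name `kl_divergence` is illustrative, not from any particular library):

```python
import math

def kl_divergence(p, q):
    """D(P || Q) in nats for discrete distributions given as sequences.

    Follows the conventions above: terms with p_i = 0 contribute 0,
    and the divergence is infinite when Q fails to dominate P.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # convention: 0 * log(0 / q) = 0
        if qi == 0:
            return math.inf   # p_i > 0 but q_i = 0: Q does not dominate P
        total += pi * math.log(pi / qi)
    return total

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # about 0.511 nats
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))  # inf
```

Using `math.log` gives nats; replace it with `math.log2` for bits.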
Cross-Entropy
The cross-entropy between $P$ and $Q$ is:

$$H(P, Q) = -\sum_{x} P(x) \log Q(x) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)$$

where $H(P) = -\sum_x P(x) \log P(x)$ is the Shannon entropy. Since $H(P)$ is fixed with respect to $Q$, minimizing cross-entropy over $Q$ is equivalent to minimizing KL divergence.
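The identity relating cross-entropy, entropy, and KL divergence is easy to check numerically. A quick sketch (the distributions are arbitrary illustrative values):

```python
import math

def entropy(p):
    """Shannon entropy H(P) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D(P || Q), assuming Q dominates P."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

# H(P, Q) = H(P) + D(P || Q), so cross-entropy is never below entropy
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12
```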
Main Theorems
Gibbs Inequality (Non-negativity of KL Divergence)
Statement
For any two probability distributions $P$ and $Q$:

$$D_{\mathrm{KL}}(P \,\|\, Q) \ge 0,$$

with equality if and only if $P = Q$ almost everywhere.
Intuition
The logarithm is concave. Applying Jensen's inequality to $\log \frac{Q(x)}{P(x)}$ under expectation with respect to $P$ gives $\mathbb{E}_P\!\left[\log \frac{Q}{P}\right] \le \log \mathbb{E}_P\!\left[\frac{Q}{P}\right] = \log 1 = 0$. Negating both sides yields $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$.
Proof Sketch
By Jensen's inequality applied to the concave function $\log$:

$$-D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x) \log \frac{Q(x)}{P(x)} \le \log \sum_x P(x)\, \frac{Q(x)}{P(x)} = \log \sum_x Q(x) = \log 1 = 0.$$

Equality in Jensen's inequality holds iff $\frac{Q(x)}{P(x)}$ is constant $P$-a.s., which forces $P = Q$.
Why It Matters
Non-negativity of KL divergence is the foundation for many results. It implies that cross-entropy is always at least the entropy: $H(P, Q) \ge H(P)$. It guarantees that MLE is consistent (the minimizer of KL from the true distribution is the true distribution itself). It ensures the ELBO is a valid lower bound on the log-evidence.
Failure Mode
KL divergence is not a metric. It is not symmetric: $D_{\mathrm{KL}}(P \,\|\, Q) \ne D_{\mathrm{KL}}(Q \,\|\, P)$ in general. It does not satisfy the triangle inequality. This means you cannot use it as a distance in metric-space arguments. If you need a true metric, consider the total variation distance or Wasserstein distance.
Forward vs. Reverse KL
The two directions of KL divergence produce qualitatively different behavior when fitting an approximate distribution $Q$ to a target $P$.
Forward KL: $D_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right]$

Minimizing $D_{\mathrm{KL}}(P \,\|\, Q)$ over $Q$ is mode-covering (also called "M-projection" or mean-seeking). The objective penalizes $Q$ wherever $P(x) > 0$ but $Q(x) \approx 0$, because $\log \frac{P(x)}{Q(x)} \to \infty$ as $Q(x) \to 0$. So $Q$ must spread mass to cover all modes of $P$, even if this means placing mass where $P$ is small.

This is exactly what MLE does: given data from $P$, minimizing the negative log-likelihood $-\mathbb{E}_{x \sim P}[\log Q_\theta(x)]$ is equivalent to minimizing $D_{\mathrm{KL}}(P \,\|\, Q_\theta)$, since the two differ only by the constant $H(P)$.
Reverse KL: $D_{\mathrm{KL}}(Q \,\|\, P) = \mathbb{E}_{x \sim Q}\!\left[\log \frac{Q(x)}{P(x)}\right]$

Minimizing $D_{\mathrm{KL}}(Q \,\|\, P)$ over $Q$ is mode-seeking (also called "I-projection"). The objective penalizes $Q$ wherever $Q(x) > 0$ but $P(x) \approx 0$, because $\log \frac{Q(x)}{P(x)} \to \infty$ as $P(x) \to 0$. So $Q$ avoids placing mass where $P$ is small, and tends to concentrate on a single mode of $P$.

This is what variational inference does: minimize $D_{\mathrm{KL}}(q(z) \,\|\, p(z \mid x))$ over the variational family $\mathcal{Q}$.
Connection to Variational Inference
The evidence lower bound (ELBO) arises from the decomposition:

$$\log p(x) = \mathcal{L}(q) + D_{\mathrm{KL}}(q(z) \,\|\, p(z \mid x)),$$

where $\mathcal{L}(q) = \mathbb{E}_{q(z)}[\log p(x, z) - \log q(z)]$ is the ELBO. Since $D_{\mathrm{KL}}(q(z) \,\|\, p(z \mid x)) \ge 0$, we have $\log p(x) \ge \mathcal{L}(q)$. Maximizing the ELBO over $q$ is equivalent to minimizing $D_{\mathrm{KL}}(q(z) \,\|\, p(z \mid x))$.
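The decomposition can be verified exactly on a toy model with a single binary latent variable. A sketch (the joint probabilities below are made-up illustrative numbers):

```python
import math

# Toy model: latent z in {0, 1}, one fixed observation x.
# p_joint[z] = p(x, z) for that x (illustrative values).
p_joint = {0: 0.1, 1: 0.3}
p_x = sum(p_joint.values())                              # evidence p(x)
posterior = {z: pz / p_x for z, pz in p_joint.items()}   # p(z | x)

q = {0: 0.5, 1: 0.5}                                     # arbitrary variational q(z)

elbo = sum(q[z] * (math.log(p_joint[z]) - math.log(q[z])) for z in q)
kl = sum(q[z] * (math.log(q[z]) - math.log(posterior[z])) for z in q)

# log p(x) = ELBO + KL(q || posterior); KL >= 0 makes the ELBO a lower bound
assert abs(math.log(p_x) - (elbo + kl)) < 1e-12
assert elbo <= math.log(p_x)
```

The identity holds for any choice of `q`; the gap between `elbo` and `log p(x)` is exactly the KL term.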
Connection to RLHF
In RLHF, the optimization objective includes a KL penalty:

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[r(x, y)\right] - \beta \, D_{\mathrm{KL}}\!\left(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\right),$$

where $\pi_{\mathrm{ref}}$ is the supervised fine-tuned (base) policy and $r$ is the reward model. The KL term prevents the fine-tuned policy $\pi$ from deviating too far from the base model, which would lead to reward hacking (exploiting errors in $r$ rather than genuinely improving quality). The coefficient $\beta$ controls the tradeoff.
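As a sketch of how the penalty can enter training (the function and variable names are hypothetical, and real implementations differ in how they estimate the KL term):

```python
def kl_penalized_reward(reward, logprobs_pi, logprobs_ref, beta=0.1):
    """Reward for one sampled response, minus the KL penalty.

    The sum of per-token log-ratios log pi(y_t) - log pi_ref(y_t) is a
    single-sample estimate of the sequence-level KL divergence, since
    the response was sampled from pi.
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logprobs_pi, logprobs_ref))
    return reward - beta * kl_estimate

# If the policy has not moved from the reference, the penalty vanishes:
print(kl_penalized_reward(1.0, [-1.2, -0.7], [-1.2, -0.7]))  # 1.0
```

Larger `beta` pulls the policy harder toward the reference; smaller `beta` lets reward maximization dominate.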
Canonical Examples
KL between two Gaussians
For $P = \mathcal{N}(\mu_1, \sigma_1^2)$ and $Q = \mathcal{N}(\mu_2, \sigma_2^2)$:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$

Note: this is zero only when $\mu_1 = \mu_2$ and $\sigma_1 = \sigma_2$. The asymmetry is visible: swapping $P$ and $Q$ gives a different expression.
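The closed form can be checked against direct numerical integration. A sketch, assuming a truncated integration range wide enough that the Gaussian tails are negligible:

```python
import math

def kl_gaussian(mu1, s1, mu2, s2):
    """Closed-form D(N(mu1, s1^2) || N(mu2, s2^2))."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def norm_pdf(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def kl_numeric(mu1, s1, mu2, s2, lo=-20.0, hi=20.0, n=4001):
    """Riemann-sum approximation of the integral form of D(P || Q)."""
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        p, q = norm_pdf(x, mu1, s1), norm_pdf(x, mu2, s2)
        total += p * math.log(p / q) * dx
    return total

closed = kl_gaussian(0.0, 1.0, 1.0, 2.0)    # about 0.443 nats
numeric = kl_numeric(0.0, 1.0, 1.0, 2.0)    # should agree closely
```

The two directions disagree here: `kl_gaussian(0, 1, 1, 2)` and `kl_gaussian(1, 2, 0, 1)` give different values, as the asymmetry predicts.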
KL between Bernoulli distributions
For $P = \mathrm{Bernoulli}(p)$ and $Q = \mathrm{Bernoulli}(q)$:

$$D_{\mathrm{KL}}(P \,\|\, Q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$

For example, with $p = 0.5$ and $q = 0.9$: $D_{\mathrm{KL}}(P \,\|\, Q) \approx 0.511$ nats, while $D_{\mathrm{KL}}(Q \,\|\, P) \approx 0.368$ nats. The different values confirm asymmetry.
Common Confusions
KL divergence is not a distance
$D_{\mathrm{KL}}(P \,\|\, Q) \ne D_{\mathrm{KL}}(Q \,\|\, P)$ in general, and the triangle inequality fails. Calling it a "distance" is common but misleading. It is a divergence: a measure of discrepancy, not a metric.
Small KL and small total variation are not interchangeable
$D_{\mathrm{KL}}(P \,\|\, Q) = 0$ means $P = Q$ a.e., which is a strong statement. But $P$ and $Q$ being close in total variation does not imply $D_{\mathrm{KL}}(P \,\|\, Q)$ is small; it can even be infinite when $Q$ fails to dominate $P$. In the other direction, Pinsker's inequality gives $\mathrm{TV}(P, Q) \le \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(P \,\|\, Q)}$, so small KL does control total variation, but this bound is often loose.
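Pinsker's inequality and its looseness are easy to see numerically. A sketch for a pair of two-point distributions (illustrative values):

```python
import math

def kl(p, q):
    """D(P || Q) in nats, assuming Q dominates P."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    """TV(P, Q) = (1/2) * sum_x |P(x) - Q(x)|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.5, 0.5]
q = [0.9, 0.1]

tv = total_variation(p, q)           # 0.4
bound = math.sqrt(0.5 * kl(p, q))    # about 0.505: valid, but not tight
assert tv <= bound
```

Here the Pinsker bound (about 0.505) comfortably exceeds the actual total variation (0.4), illustrating the slack.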
Minimizing forward KL is not the same as minimizing reverse KL
MLE minimizes forward KL $D_{\mathrm{KL}}(P \,\|\, Q_\theta)$, producing mode-covering fits. Variational inference minimizes reverse KL $D_{\mathrm{KL}}(q \,\|\, p)$, producing mode-seeking fits. When $P$ is multimodal and $Q$ is unimodal, forward KL places $Q$ between modes; reverse KL concentrates $Q$ on one mode. These are genuinely different objectives with different solutions.
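The contrast can be demonstrated with a brute-force sketch: fit a single Gaussian to a bimodal mixture by grid search under each direction of KL. The target mixture, the grids, and the integration range are all illustrative choices:

```python
import math

def norm_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def target(x):
    """Bimodal target P: equal mixture of N(-3, 1) and N(3, 1)."""
    return 0.5 * norm_pdf(x, -3.0, 1.0) + 0.5 * norm_pdf(x, 3.0, 1.0)

def kl_numeric(f, g, lo=-12.0, hi=12.0, n=241):
    """Riemann-sum approximation of D(f || g) on a truncated range."""
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        fx = f(x)
        total += fx * math.log(fx / g(x)) * dx
    return total

def fit_gaussian(direction):
    """Grid-search the Gaussian Q = N(mu, sigma^2) minimizing one KL direction."""
    best = (math.inf, 0.0, 1.0)
    for mi in range(-10, 11):        # mu in [-5, 5], step 0.5
        for si in range(2, 21):      # sigma in [0.5, 5], step 0.25
            mu, sigma = mi * 0.5, si * 0.25
            q = lambda x, m=mu, s=sigma: norm_pdf(x, m, s)
            d = (kl_numeric(target, q) if direction == "forward"
                 else kl_numeric(q, target))
            if d < best[0]:
                best = (d, mu, sigma)
    return best[1], best[2]

mu_f, sigma_f = fit_gaussian("forward")   # mode-covering: mu near 0, wide sigma
mu_r, sigma_r = fit_gaussian("reverse")   # mode-seeking: mu near one mode, sigma near 1
```

The forward fit lands near the mixture's mean with a large variance (moment matching), straddling both modes; the reverse fit locks onto a single mode with a variance close to that component's.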
Summary
- $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$, with equality iff $P = Q$
- Not symmetric, not a metric
- Forward KL ($D_{\mathrm{KL}}(P \,\|\, Q)$ minimized over $Q$): mode-covering, equivalent to MLE
- Reverse KL ($D_{\mathrm{KL}}(Q \,\|\, P)$ minimized over $Q$): mode-seeking, used in variational inference
- Cross-entropy = entropy + KL divergence: $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)$
- KL penalty in RLHF constrains policy deviation from the base model
Exercises
Problem
Compute $D_{\mathrm{KL}}(P \,\|\, Q)$ and $D_{\mathrm{KL}}(Q \,\|\, P)$ for two Bernoulli distributions with different parameters, e.g. $P = \mathrm{Bernoulli}(0.5)$ and $Q = \mathrm{Bernoulli}(0.9)$. Verify that they differ.
Problem
Derive the closed-form KL divergence between two multivariate Gaussians $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$ in $\mathbb{R}^d$.
References
Canonical:
- Cover & Thomas, Elements of Information Theory (2006), Chapter 2
- Kullback & Leibler, "On Information and Sufficiency" (1951)
Current:
- Murphy, Probabilistic Machine Learning: Advanced Topics (2023), Chapter 6
- Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1.6 and 10.1
- MacKay, Information Theory, Inference, and Learning Algorithms (2003), Chapter 2
- Csiszár & Körner, Information Theory (2nd ed., 2011), Chapter 2
Next Topics
- Variational inference: using reverse KL to approximate intractable posteriors
- Mutual information: symmetric quantity built from KL divergence
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Information Theory Foundations (Layer 0B)