Mathematical Infrastructure
Information Theory Foundations
The core of information theory for ML: entropy, cross-entropy, KL divergence, mutual information, data processing inequality, and the chain rules that connect them. The language of variational inference, generalization bounds, and representation learning.
Why This Matters
Information theory is the mathematical language of uncertainty, compression, and communication. In machine learning, it appears everywhere:
- Cross-entropy loss is the standard training objective for classification --- it is an information-theoretic quantity
- KL divergence is the objective in variational inference, the penalty in PAC-Bayes bounds, and the measure of distributional difference throughout ML theory
- Mutual information quantifies how much one random variable tells you about another --- central to representation learning, feature selection, and the information bottleneck
- Data processing inequality constrains what any algorithm can extract from data --- it is the foundation of minimax lower bounds
If you do not know information theory, large parts of ML theory will be opaque. The good news: the core concepts are few, and they interact through a small number of compact identities.
Mental Model
Think of entropy as measuring the "surprise" or "uncertainty" in a random variable. A fair coin has maximum entropy (1 bit); a biased coin has less (less uncertainty). Cross-entropy measures the expected surprise when you use the wrong distribution $q$ to encode data from the true distribution $p$. KL divergence is the extra surprise from using $q$ instead of $p$ --- it is the cost of being wrong.
Mutual information measures how much knowing $Y$ reduces your uncertainty about $X$. If $X$ and $Y$ are independent, mutual information is zero. If $Y$ determines $X$ completely, mutual information equals the entropy of $X$.
Core Definitions
Entropy
The entropy of a discrete random variable $X$ with distribution $p$ over alphabet $\mathcal{X}$ is:
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$
with the convention $0 \log 0 = 0$. When $\log$ is base 2, entropy is in bits; when natural log, in nats.
Properties:
- $H(X) \geq 0$ (entropy is non-negative)
- $H(X) = 0$ if and only if $X$ is deterministic
- $H(X) \leq \log |\mathcal{X}|$ with equality iff $p$ is uniform
- Entropy is concave in $p$
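These definitions are easy to check numerically. A minimal sketch in plain Python (the `entropy` helper is illustrative, not a library function):

```python
import math

def entropy(p, base=2):
    """Shannon entropy of a distribution given as a list of probabilities,
    using the convention 0 * log 0 = 0."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit (the maximum for 2 outcomes)
print(entropy([0.9, 0.1]))   # biased coin: ~0.469 bits, less uncertainty
print(entropy([1.0, 0.0]))   # deterministic: 0.0
print(entropy([1/8] * 8))    # uniform over 8 symbols: log2(8) = 3.0 bits
```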
Cross-Entropy
The cross-entropy of distribution $p$ relative to distribution $q$ is:
$$H(p, q) = -\sum_x p(x) \log q(x)$$
This measures the expected number of bits needed to encode data from $p$ using a code optimized for $q$. In ML, $p$ is the true data distribution and $q = p_\theta$ is the model distribution. The cross-entropy loss is:
$$\mathcal{L}(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i)$$
which is the empirical cross-entropy between the data distribution and the model.
Key identity: $H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)$. Since $H(p)$ is constant (does not depend on $q$), minimizing cross-entropy is equivalent to minimizing KL divergence.
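The key identity can be verified numerically on any pair of distributions. A minimal sketch (helper functions are illustrative; base-2 logs throughout):

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.4, 0.4, 0.2]   # "model" distribution

# Key identity: H(p, q) = H(p) + KL(p || q)
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12
```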
KL Divergence
The Kullback-Leibler divergence from $q$ to $p$ is:
$$D_{\mathrm{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$
KL divergence measures the information lost when $q$ is used to approximate $p$. It is defined only when $p \ll q$ (absolute continuity). If $p(x) > 0$ and $q(x) = 0$ for some $x$, then $D_{\mathrm{KL}}(p \| q) = \infty$.
Critical: KL divergence is not a metric.
- Not symmetric: $D_{\mathrm{KL}}(p \| q) \neq D_{\mathrm{KL}}(q \| p)$ in general
- No triangle inequality: there exist $p, q, r$ where $D_{\mathrm{KL}}(p \| r) > D_{\mathrm{KL}}(p \| q) + D_{\mathrm{KL}}(q \| r)$
Despite not being a metric, KL divergence is the most natural measure of distributional difference for many statistical problems because of its connection to likelihood ratios and sufficient statistics.
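The asymmetry is easy to see numerically. A quick sketch (the `kl` helper is illustrative, base-2 logs):

```python
import math

def kl(p, q):
    """KL(p || q) in bits for discrete distributions given as lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
print(kl(p, q))  # ~0.531 bits
print(kl(q, p))  # ~0.737 bits -- a different value: KL is not symmetric
```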
Mutual Information
The mutual information between random variables $X$ and $Y$ is:
$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\, p(y)\big)$$
Equivalent formulations:
- $I(X; Y) = H(X) - H(X \mid Y)$ (reduction in uncertainty about $X$ from knowing $Y$)
- $I(X; Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$
Properties:
- $I(X; Y) \geq 0$ with equality iff $X \perp Y$ (independence)
- $I(X; Y) = I(Y; X)$ (symmetric, unlike KL divergence)
- $I(X; X) = H(X)$ (a variable has maximum mutual information with itself)
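These properties can be checked directly from a joint distribution table. A sketch in plain Python (the `mutual_information` helper is illustrative):

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint distribution given as a dict {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), pxy in joint.items():
        px[x] = px.get(x, 0) + pxy
        py[y] = py.get(y, 0) + pxy
    return sum(pxy * math.log2(pxy / (px[x] * py[y]))
               for (x, y), pxy in joint.items() if pxy > 0)

# Independent fair bits: I(X;Y) = 0
indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(mutual_information(indep))   # 0.0

# Y = X (perfect dependence): I(X;Y) = H(X) = 1 bit
copy = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(copy))    # 1.0
```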
Conditional Entropy
The conditional entropy of $X$ given $Y$ is:
$$H(X \mid Y) = -\sum_{x, y} p(x, y) \log p(x \mid y)$$
This measures the remaining uncertainty about $X$ after observing $Y$.
- $H(X \mid Y) \leq H(X)$ (conditioning reduces entropy)
- $H(X \mid Y) = 0$ iff $X$ is a deterministic function of $Y$
- $H(X \mid Y) = H(X)$ iff $X \perp Y$
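A small numerical check of these properties (sketch; the `cond_entropy` helper is illustrative):

```python
import math

def cond_entropy(joint):
    """H(X|Y) in bits from a joint distribution given as a dict {(x, y): prob}."""
    py = {}
    for (x, y), pxy in joint.items():
        py[y] = py.get(y, 0) + pxy
    return -sum(pxy * math.log2(pxy / py[y])
                for (x, y), pxy in joint.items() if pxy > 0)

# Noisy observation: Y is a fair bit X flipped with probability 0.1
joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
print(cond_entropy(joint))  # ~0.469 < H(X) = 1: conditioning reduces entropy

# Y = X: no remaining uncertainty
print(cond_entropy({(0, 0): 0.5, (1, 1): 0.5}))  # 0.0
```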
Main Theorems
Gibbs Inequality (Non-Negativity of KL)
Statement
For any probability distributions $p$ and $q$:
$$D_{\mathrm{KL}}(p \| q) \geq 0$$
with equality if and only if $p = q$ almost everywhere.
Equivalently: the cross-entropy is at least the entropy: $H(p, q) \geq H(p)$.
Intuition
Encoding data from $p$ using a code optimized for $q$ can never be more efficient than using the optimal code for $p$. Any mismatch between the true distribution and the coding distribution wastes bits. The extra bits wasted are exactly $D_{\mathrm{KL}}(p \| q)$.
Proof Sketch
By Jensen's inequality applied to the convex function $-\log$:
$$D_{\mathrm{KL}}(p \| q) = \mathbb{E}_p\!\left[-\log \frac{q(x)}{p(x)}\right] \geq -\log \mathbb{E}_p\!\left[\frac{q(x)}{p(x)}\right] \geq -\log 1 = 0.$$
Equality holds iff $q(x)/p(x)$ is constant $p$-a.s., which requires $p = q$.
Why It Matters
Gibbs inequality is the most fundamental result in information theory. It justifies using cross-entropy as a loss function: minimizing $H(p, q)$ over $q$ is equivalent to minimizing $D_{\mathrm{KL}}(p \| q)$, which finds the best approximation to $p$. Since KL divergence is non-negative, the minimum is achieved at $q = p$.
In ML: training a classifier by minimizing cross-entropy loss is justified because it drives the model distribution $p_\theta(y \mid x)$ toward the true conditional distribution $p(y \mid x)$.
Failure Mode
$D_{\mathrm{KL}}(p \| q)$ can be infinite if $p$ has support where $q$ does not. In variational inference, which minimizes the reverse KL $D_{\mathrm{KL}}(q \| p)$, this means the variational distribution $q$ must keep its mass inside the support of the posterior $p(z \mid x)$, or the objective becomes infinite. This is one reason why "forward KL" and "reverse KL" behave very differently.
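A sketch of the support-mismatch failure on discrete distributions (the `kl` helper returns infinity explicitly rather than dividing by zero):

```python
import math

def kl(p, q):
    """KL(p || q) in bits; infinite when p has mass where q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0 and qi == 0:
            return math.inf  # p is not absolutely continuous w.r.t. q
        if pi > 0:
            total += pi * math.log2(pi / qi)
    return total

p = [0.5, 0.5, 0.0]
q = [0.5, 0.0, 0.5]   # q assigns zero mass where p has mass
print(kl(p, q))  # inf
print(kl(q, p))  # inf -- here the support mismatch goes both ways
```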
Data Processing Inequality
Statement
If $X \to Y \to Z$ is a Markov chain (i.e., $Z$ is conditionally independent of $X$ given $Y$), then:
$$I(X; Z) \leq I(X; Y)$$
Processing the data $Y$ through any channel to produce $Z$ cannot increase the information about $X$. No computation on $Y$ can create information about $X$ that $Y$ does not already contain.
Intuition
You cannot improve your estimate of $X$ by processing the observation $Y$ --- you can only lose information. If you compress data, denoise it, or transform it through any deterministic or stochastic function, you cannot gain information about the source.
Think of a game of telephone: each step can only lose information, never gain it. The original message is most informative; each retelling can only degrade the signal.
Proof Sketch
By the chain rule for mutual information, expanding $I(X; Y, Z)$ two ways:
$$I(X; Y, Z) = I(X; Y) + I(X; Z \mid Y) = I(X; Z) + I(X; Y \mid Z).$$
Since $X \to Y \to Z$ is Markov: $I(X; Z \mid Y) = 0$ (given $Y$, $Z$ tells you nothing new about $X$). Therefore:
$$I(X; Z) = I(X; Y) - I(X; Y \mid Z) \leq I(X; Y)$$
since $I(X; Y \mid Z) \geq 0$.
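The inequality can be checked exactly on a small discrete chain. The sketch below pushes a fair bit through two binary symmetric channels (helper names are illustrative):

```python
import math

def mi(joint):
    """I(X;Y) in bits from a joint distribution dict {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def bsc(eps):
    """Binary symmetric channel: transition probabilities p(out | in)."""
    return {(a, b): (1 - eps if a == b else eps) for a in (0, 1) for b in (0, 1)}

ch1, ch2 = bsc(0.1), bsc(0.2)
# Markov chain X -> Y -> Z with X a fair bit: p(x, y) = p(x) p(y|x)
pxy = {(x, y): 0.5 * ch1[(x, y)] for x in (0, 1) for y in (0, 1)}
# Marginalize out Y to get p(x, z)
pxz = {(x, z): sum(pxy[(x, y)] * ch2[(y, z)] for y in (0, 1))
       for x in (0, 1) for z in (0, 1)}

print(mi(pxy))  # I(X;Y) ~0.531 bits
print(mi(pxz))  # I(X;Z) ~0.173 bits -- processing lost information
assert mi(pxz) <= mi(pxy)
```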
Why It Matters
The data processing inequality is the theoretical foundation for minimax lower bounds in statistics and ML. If you want to estimate a parameter $\theta$ from data $X$, any estimator $\hat{\theta} = g(X)$ forms a Markov chain $\theta \to X \to \hat{\theta}$ and therefore satisfies $I(\theta; \hat{\theta}) \leq I(\theta; X)$.
Combined with Fano's inequality, this gives lower bounds on estimation error: there is a minimum sample size needed to estimate $\theta$ to accuracy $\epsilon$, regardless of the algorithm used.
In representation learning, DPI constrains what information a learned representation can contain: for the chain $X \to Z \to \hat{Y}$, $I(X; \hat{Y}) \leq I(X; Z)$, where $X$ is the input, $Z$ the representation, and $\hat{Y}$ the prediction.
Failure Mode
The inequality is tight when $Z$ is a sufficient statistic of $Y$ for $X$, meaning $Z$ preserves all the information that $Y$ has about $X$. In practice, finding sufficient statistics is often impossible, and most processing steps are lossy.
Chain Rule for Entropy
Statement
For random variables $X_1, \dots, X_n$:
$$H(X_1, \dots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1})$$
For $n = 2$: $H(X, Y) = H(X) + H(Y \mid X)$.
Chain rule for KL divergence:
$$D_{\mathrm{KL}}\big(p(x, y) \,\|\, q(x, y)\big) = D_{\mathrm{KL}}\big(p(x) \,\|\, q(x)\big) + \mathbb{E}_{x \sim p}\big[D_{\mathrm{KL}}\big(p(y \mid x) \,\|\, q(y \mid x)\big)\big]$$
Intuition
The joint uncertainty of $(X_1, \dots, X_n)$ decomposes into the uncertainty of $X_1$, plus the additional uncertainty of $X_2$ given $X_1$, plus the additional uncertainty of $X_3$ given $(X_1, X_2)$, and so on. Each term measures how much new uncertainty each variable adds beyond what is already known.
If the variables are independent: $H(X_1, \dots, X_n) = \sum_{i=1}^{n} H(X_i)$ (uncertainty is additive). If they are dependent, the conditional entropies are smaller (each variable is partly predicted by the others), so the joint entropy is less than the sum of marginals.
Proof Sketch
For $n = 2$: expand the definition:
$$H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y) = -\sum_{x, y} p(x, y) \big[\log p(x) + \log p(y \mid x)\big] = H(X) + H(Y \mid X).$$
The general case follows by induction, using $p(x_1, \dots, x_n) = p(x_1, \dots, x_{n-1})\, p(x_n \mid x_1, \dots, x_{n-1})$.
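The $n = 2$ case can be verified numerically from any joint table (illustrative sketch in plain Python):

```python
import math

def H(dist):
    """Entropy in bits of a distribution given as a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal p(x)
px = {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0) + p

# H(Y|X) computed directly from the conditionals p(y|x) = p(x,y) / p(x)
h_y_given_x = -sum(p * math.log2(p / px[x])
                   for (x, y), p in joint.items() if p > 0)

# Chain rule: H(X, Y) = H(X) + H(Y|X)
assert abs(H(joint) - (H(px) + h_y_given_x)) < 1e-12
```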
Why It Matters
The chain rule is used constantly in ML theory. Autoregressive models (GPT, language models) use it directly: the joint distribution of a sequence is factored as $p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$, and the cross-entropy loss decomposes accordingly.
The chain rule for KL divergence is used in variational inference to decompose the ELBO objective and in PAC-Bayes bounds to decompose the complexity term across layers of a neural network.
Failure Mode
The chain rule applies to discrete random variables directly. For continuous random variables, entropy must be replaced by differential entropy $h(X) = -\int p(x) \log p(x)\, dx$, which can be negative and is not invariant under changes of variables. The chain rules still hold for differential entropy and KL divergence, but care is needed with absolute continuity conditions.
Key Identities and Relationships
The core information-theoretic quantities are connected by:
- $H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)$ (cross-entropy = entropy + KL)
- $I(X; Y) = H(X) - H(X \mid Y)$ (mutual info = reduction in entropy)
- $I(X; Y) = D_{\mathrm{KL}}\big(p(x, y) \,\|\, p(x)\, p(y)\big)$ (mutual info = KL from joint to product of marginals)
- $H(X, Y) = H(X) + H(Y) - I(X; Y)$ (joint entropy via mutual information)
Canonical Examples
Cross-entropy loss for binary classification
For binary classification with true label $y \in \{0, 1\}$ and predicted probability $\hat{p} \in (0, 1)$:
$$\ell(y, \hat{p}) = -\big[y \log \hat{p} + (1 - y) \log(1 - \hat{p})\big]$$
This is the cross-entropy $H(p, q)$ between the label distribution and the model's Bernoulli$(\hat{p})$ distribution. Minimizing its expectation over $\hat{p}$ for each $x$ yields $\hat{p} = \Pr(y = 1 \mid x)$ --- the true conditional probability. The minimum value is $H(Y \mid X)$, the conditional entropy (irreducible noise).
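A brute-force check that the minimizer is the true conditional probability (sketch; the grid search stands in for taking the derivative):

```python
import math

def bce(y_prob, p_hat):
    """Expected binary cross-entropy (bits) when the true P(y=1|x) is y_prob."""
    return -(y_prob * math.log2(p_hat) + (1 - y_prob) * math.log2(1 - p_hat))

true_p = 0.3
# Scan candidate predictions: the expected loss is minimized at p_hat = true_p...
grid = [i / 1000 for i in range(1, 1000)]
best = min(grid, key=lambda q: bce(true_p, q))
print(best)               # 0.3
# ...and the minimum value is the Bernoulli entropy H(0.3), the irreducible noise
print(bce(true_p, best))  # ~0.881 bits
```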
KL divergence between two Gaussians
For $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$:
$$D_{\mathrm{KL}}(p \| q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2}$$
This equals zero iff $\mu_1 = \mu_2$ and $\sigma_1 = \sigma_2$. Notice the asymmetry: swapping $p$ and $q$ gives a different expression. This formula appears constantly in variational autoencoders, where the latent prior is typically $\mathcal{N}(0, I)$.
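The closed form is a one-liner. A sketch (in nats, matching the natural-log convention of the formula):

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    """KL( N(mu1, s1^2) || N(mu2, s2^2) ) in nats, closed form."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

print(kl_gauss(0, 1, 0, 1))   # 0.0: identical Gaussians
print(kl_gauss(1, 2, 0, 1))   # ~1.307 nats
print(kl_gauss(0, 1, 1, 2))   # ~0.443 nats -- swapped arguments, different value
```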
Common Confusions
KL divergence is not symmetric
$D_{\mathrm{KL}}(p \| q) \neq D_{\mathrm{KL}}(q \| p)$ in general. The "forward KL" $D_{\mathrm{KL}}(p \| q)$ penalizes $q$ for having low probability where $p$ has high probability (it is "zero-avoiding" in $q$). The "reverse KL" $D_{\mathrm{KL}}(q \| p)$ penalizes $q$ for having high probability where $p$ has low probability (it is "zero-forcing" in $q$). Variational inference minimizes the reverse KL, which tends to underestimate the variance of the posterior.
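The mode-covering vs. mode-seeking behavior shows up even on a four-point alphabet. The sketch below compares a mode-covering candidate and a single-mode candidate against a bimodal target (the distributions are chosen purely for illustration; natural logs):

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p     = [0.45, 0.05, 0.05, 0.45]   # bimodal "true" distribution
broad = [0.25, 0.25, 0.25, 0.25]   # covers both modes
peak  = [0.85, 0.05, 0.05, 0.05]   # locks onto one mode

# Forward KL(p || q) prefers the mode-covering candidate:
assert kl(p, broad) < kl(p, peak)
# Reverse KL(q || p) prefers the single-mode candidate:
assert kl(peak, p) < kl(broad, p)
```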
Minimizing cross-entropy is the same as minimizing KL divergence
Since $H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)$ and $H(p)$ is constant with respect to $q$, $\arg\min_q H(p, q) = \arg\min_q D_{\mathrm{KL}}(p \| q)$. Students sometimes treat cross-entropy and KL divergence as different objectives. They are the same objective up to a constant when optimizing over $q$.
Differential entropy can be negative
For continuous random variables, $h(X)$ can be negative. For example, the uniform distribution on $[0, 1/2]$ has $h(X) = \log_2(1/2) = -1$ bit. This is because differential entropy is not a limit of discrete entropy; it is a different quantity. KL divergence and mutual information remain non-negative for continuous variables.
Differential entropy is not coordinate-invariant
Discrete entropy is invariant under relabeling. Differential entropy is not. For an invertible linear map $Y = AX$:
$$h(Y) = h(X) + \log |\det A|.$$
More generally, for a smooth bijection $Y = f(X)$ with Jacobian $J_f$:
$$h(Y) = h(X) + \mathbb{E}\big[\log |\det J_f(X)|\big].$$
So "entropy in bits" is not a property of the random variable alone once we leave the discrete setting. KL divergence and mutual information are coordinate-invariant because the Jacobian terms cancel between numerator and denominator.
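For a concrete case, the uniform distribution makes the change-of-variables formula exact: a Uniform$[0, L]$ variable has differential entropy $\log_2 L$ bits. A sketch (`h_uniform` is an illustrative helper):

```python
import math

def h_uniform(length):
    """Differential entropy (bits) of Uniform[0, length]: log2(length)."""
    return math.log2(length)

a = 4.0
hx = h_uniform(1.0)      # Uniform[0, 1]: 0.0 bits
hy = h_uniform(1.0 * a)  # Y = a * X is Uniform[0, 4]
# h(aX) = h(X) + log2|a|: entropy shifts under a change of variables
assert abs(hy - (hx + math.log2(abs(a)))) < 1e-12
# And h_uniform(0.5) = -1.0: differential entropy can be negative
print(hx, hy, h_uniform(0.5))
```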
Summary
- Entropy: $H(X) = -\sum_x p(x) \log p(x)$, measures uncertainty
- Cross-entropy: $H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)$
- KL divergence: $D_{\mathrm{KL}}(p \| q) \geq 0$ (Gibbs inequality), not symmetric, not a metric
- Mutual information: $I(X; Y) = H(X) - H(X \mid Y) \geq 0$
- Data processing inequality: $X \to Y \to Z$ implies $I(X; Z) \leq I(X; Y)$
- Chain rule: $H(X_1, \dots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \dots, X_{i-1})$
- Minimizing cross-entropy = minimizing KL divergence (same up to constant)
- Forward vs. reverse KL: different behaviors, different applications
Exercises
Problem
Compute the entropy of a Bernoulli random variable $X \sim \mathrm{Bernoulli}(p)$ as a function of $p$. At what value of $p$ is the entropy maximized, and what is the maximum value (in bits)?
Problem
Show that $I(X; Y) = 0$ if and only if $X$ and $Y$ are independent. Use the non-negativity of KL divergence.
Problem
Prove the data processing inequality: if $X \to Y \to Z$ is a Markov chain, then $I(X; Z) \leq I(X; Y)$.
Problem
The ELBO (Evidence Lower Bound) in variational inference is
$$\mathrm{ELBO}(q) = \mathbb{E}_{q(z)}\big[\log p(x, z) - \log q(z)\big]$$
where $q(z)$ approximates the posterior $p(z \mid x)$. Show that
$$\log p(x) = \mathrm{ELBO}(q) + D_{\mathrm{KL}}\big(q(z) \,\|\, p(z \mid x)\big)$$
and explain why maximizing the ELBO is equivalent to minimizing the reverse KL divergence.
References
Canonical:
- Cover & Thomas, Elements of Information Theory (2nd ed., 2006), Chapters 2, 4, 8
- MacKay, Information Theory, Inference, and Learning Algorithms (2003), Chapters 1-4
Current:
- Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 6
- Polyanskiy & Wu, Information Theory: From Coding to Learning (2024), Chapters 1-3
Next Topics
Building on information theory:
- Variational autoencoders: the ELBO objective and KL regularization
- Fano's inequality: turning information theory into minimax lower bounds
- Maximum likelihood estimation: the connection between MLE and cross-entropy minimization
Last reviewed: April 2026
Builds on This
- Bits, Nats, Perplexity, and BPB (Layer 3)
- Cross-Entropy Loss Deep Dive (Layer 1)
- Information Bottleneck (Layer 3)
- Kelly Criterion (Layer 2)
- KL Divergence (Layer 1)
- Perplexity and Language Model Evaluation (Layer 3)
- Representation Learning Theory (Layer 3)
- Token Prediction and Language Modeling (Layer 3)
- Tokenization and Information Theory (Layer 4)