
Mathematical Infrastructure

Information Theory Foundations

The core of information theory for ML: entropy, cross-entropy, KL divergence, mutual information, data processing inequality, and the chain rules that connect them. The language of variational inference, generalization bounds, and representation learning.


Why This Matters

Information theory is the mathematical language of uncertainty, compression, and communication. In machine learning, it appears everywhere:

  • Cross-entropy loss is the standard training objective for classification --- it is an information-theoretic quantity
  • KL divergence is the objective in variational inference, the penalty in PAC-Bayes bounds, and the measure of distributional difference throughout ML theory
  • Mutual information quantifies how much one random variable tells you about another --- central to representation learning, feature selection, and the information bottleneck
  • Data processing inequality constrains what any algorithm can extract from data --- it is the foundation of minimax lower bounds

If you do not know information theory, large parts of ML theory will be opaque. The good news: the core concepts are few, and they interact through a small number of compact identities.

Mental Model

Think of entropy as measuring the "surprise" or "uncertainty" in a random variable. A fair coin has maximum entropy (1 bit); a biased coin has less (less uncertainty). Cross-entropy measures the expected surprise when you use the wrong distribution q to encode data from the true distribution p. KL divergence is the extra surprise from using q instead of p --- it is the cost of being wrong.

Mutual information measures how much knowing X reduces your uncertainty about Y. If X and Y are independent, mutual information is zero. If X determines Y completely, mutual information equals the entropy of Y.

Core Definitions

Definition

Entropy

The entropy of a discrete random variable X with distribution p over alphabet \mathcal{X} is:

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = -\mathbb{E}_p[\log p(X)]

with the convention 0 \log 0 = 0. When \log is base 2, entropy is in bits; when natural log, in nats.

Properties:

  • H(X) \geq 0 (entropy is non-negative)
  • H(X) = 0 if and only if X is deterministic
  • H(X) \leq \log|\mathcal{X}|, with equality iff X is uniform
  • Entropy is concave in p
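These properties are easy to check numerically. Below is a minimal NumPy sketch; the helper name `entropy` and the example distributions are ours, not from the text:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution p, using the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                      # drop zero-probability outcomes
    return float(-np.sum(nz * np.log(nz)) / np.log(base))

fair = entropy([0.5, 0.5])             # fair coin: maximum for a binary variable (1 bit)
biased = entropy([0.9, 0.1])           # biased coin: less uncertainty
point_mass = entropy([1.0, 0.0])       # deterministic: zero entropy
uniform4 = entropy([0.25] * 4)         # uniform over 4 symbols: log2(4) = 2 bits
```

The uniform case saturates the H(X) \leq \log|\mathcal{X}| bound, and the point mass hits the H(X) = 0 lower bound.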
Definition

Cross-Entropy

The cross-entropy of distribution q relative to distribution p is:

H(p, q) = -\sum_{x} p(x) \log q(x) = -\mathbb{E}_p[\log q(X)]

This measures the expected number of bits needed to encode data from p using a code optimized for q. In ML, p is the true data distribution and q is the model distribution. The cross-entropy loss is:

\mathcal{L}_{\text{CE}} = -\frac{1}{n}\sum_{i=1}^n \log q(y_i \mid x_i; \theta)

which is the empirical cross-entropy between the data distribution and the model.

Key identity: H(p, q) = H(p) + D_{\text{KL}}(p \| q). Since H(p) is constant (does not depend on q), minimizing cross-entropy over q is equivalent to minimizing KL divergence.
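The key identity can be verified directly. A minimal NumPy sketch (the helper names and the two example distributions are ours):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))            # nats

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return float(-np.sum(p[m] * np.log(q[m])))

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
# H(p, q) = H(p) + D_KL(p || q): the gap should be zero up to float error
gap = cross_entropy(p, q) - (entropy(p) + kl(p, q))
```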

Definition

KL Divergence

The Kullback-Leibler divergence from q to p is:

D_{\text{KL}}(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_p\!\left[\log \frac{p(X)}{q(X)}\right]

KL divergence measures the information lost when q is used to approximate p. It is finite only when q(x) = 0 \implies p(x) = 0 (absolute continuity): if p(x) > 0 and q(x) = 0, then D_{\text{KL}}(p \| q) = +\infty.

Critical: KL divergence is not a metric.

  • Not symmetric: D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p) in general
  • No triangle inequality: there exist p, q, r with D_{\text{KL}}(p \| r) > D_{\text{KL}}(p \| q) + D_{\text{KL}}(q \| r)

Despite not being a metric, KL divergence is the most natural measure of distributional difference for many statistical problems because of its connection to likelihood ratios and sufficient statistics.
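Both the asymmetry and the infinite-divergence case are easy to exhibit. A sketch with made-up three-point distributions (all names and values here are ours):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats; returns inf when p puts mass where q has none."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    if np.any(q[m] == 0):
        return float("inf")
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

p = [0.9, 0.05, 0.05]
q = [1/3, 1/3, 1/3]
forward = kl(p, q)                                # D_KL(p || q)
reverse = kl(q, p)                                # D_KL(q || p): a different number
blow_up = kl([0.5, 0.5, 0.0], [1.0, 0.0, 0.0])    # p > 0 where q = 0: infinite
```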

Definition

Mutual Information

The mutual information between random variables X and Y is:

I(X; Y) = D_{\text{KL}}(p(x, y) \| p(x)p(y)) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}

Equivalent formulations:

  • I(X; Y) = H(X) - H(X \mid Y) (reduction in uncertainty about X from knowing Y)
  • I(X; Y) = H(Y) - H(Y \mid X)
  • I(X; Y) = H(X) + H(Y) - H(X, Y)

Properties:

  • I(X; Y) \geq 0, with equality iff X \perp Y (independence)
  • I(X; Y) = I(Y; X) (symmetric, unlike KL divergence)
  • I(X; X) = H(X) (a variable's mutual information with itself is its full entropy)
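The equivalent formulations can be checked against each other on a small joint distribution. A NumPy sketch (the 2x2 joint is an example of ours):

```python
import numpy as np

def H(p):
    """Entropy (nats) of a distribution given as an array of probabilities, any shape."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

# Joint distribution of (X, Y) on a 2x2 alphabet; rows index x, columns index y.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

mi_from_entropies = H(px) + H(py) - H(pxy)                        # H(X)+H(Y)-H(X,Y)
mi_from_kl = float(np.sum(pxy * np.log(pxy / np.outer(px, py))))  # D_KL(joint || product)
```

The two routes agree, and replacing `pxy` by the product of its marginals drives the mutual information to zero.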
Definition

Conditional Entropy

The conditional entropy of X given Y is:

H(X \mid Y) = -\sum_{x, y} p(x, y) \log p(x \mid y) = \mathbb{E}_Y[H(X \mid Y = y)]

This measures the remaining uncertainty about X after observing Y.

  • H(X \mid Y) \leq H(X) (conditioning reduces entropy on average)
  • H(X \mid Y) = 0 iff X is a deterministic function of Y
  • H(X \mid Y) = H(X) iff X \perp Y
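A small check that conditioning cannot increase entropy on average, computed via the standard identity H(X \mid Y) = H(X, Y) - H(Y). The joint distribution below is an example of ours:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

pxy = np.array([[0.3, 0.2],      # joint of (X, Y); rows index x, columns index y
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

# H(X | Y) = H(X, Y) - H(Y): the uncertainty left in X after observing Y
H_X_given_Y = H(pxy) - H(py)
```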

Main Theorems

Theorem

Gibbs Inequality (Non-Negativity of KL)

Statement

For any probability distributions p and q:

D_{\text{KL}}(p \| q) \geq 0

with equality if and only if p = q almost everywhere.

Equivalently: the cross-entropy is at least the entropy:

H(p, q) \geq H(p)

Intuition

Encoding data from p using a code optimized for q can never be more efficient than using the optimal code for p. Any mismatch between the true distribution and the coding distribution wastes bits. The extra bits wasted are exactly D_{\text{KL}}(p \| q).

Proof Sketch

By Jensen's inequality applied to the convex function -\log:

D_{\text{KL}}(p \| q) = \mathbb{E}_p\!\left[-\log \frac{q(X)}{p(X)}\right] \geq -\log \mathbb{E}_p\!\left[\frac{q(X)}{p(X)}\right] = -\log \sum_x q(x) = -\log 1 = 0

Equality holds iff q(x)/p(x) is constant p-a.s., which requires p = q.

Why It Matters

Gibbs inequality is the most fundamental result in information theory. It justifies using cross-entropy as a loss function: minimizing H(p, q) over q is equivalent to minimizing D_{\text{KL}}(p \| q), which finds the best approximation to p. Since KL divergence is non-negative, the minimum is achieved at q = p.

In ML: training a classifier by minimizing cross-entropy loss is justified because it drives the model distribution toward the true conditional distribution p(y \mid x).

Failure Mode

KL divergence can be infinite if p has support where q does not. In variational inference, this means the variational family q must cover the support of the posterior p, or the objective becomes undefined. This is one reason why the "forward KL" D_{\text{KL}}(p \| q) and the "reverse KL" D_{\text{KL}}(q \| p) behave very differently.

Theorem

Data Processing Inequality

Statement

If X \to Y \to Z is a Markov chain (i.e., Z is conditionally independent of X given Y), then:

I(X; Z) \leq I(X; Y)

Processing the data Y through any channel to produce Z cannot increase the information about X. No computation on Y can create information about X that Y does not already contain.

Intuition

You cannot improve your estimate of X by processing the observation Y --- you can only lose information. If you compress data, denoise it, or transform it through any deterministic or stochastic function, you cannot gain information about the source.

Think of a game of telephone: each step can only lose information, never gain it. The original message X is most informative; each retelling Y, Z, \ldots can only degrade the signal.

Proof Sketch

By the chain rule for mutual information:

I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y)

Since X \to Y \to Z is Markov, I(X; Z \mid Y) = 0 (given Y, Z tells you nothing new about X). Therefore:

I(X; Y) = I(X; Z) + I(X; Y \mid Z) \geq I(X; Z)

since I(X; Y \mid Z) \geq 0.

Why It Matters

The data processing inequality is the theoretical foundation for minimax lower bounds in statistics and ML. If you want to estimate a parameter \theta from data X_1, \ldots, X_n, any estimator \hat{\theta}(X_1, \ldots, X_n) satisfies I(\theta; \hat{\theta}) \leq I(\theta; X_1, \ldots, X_n).

Combined with Fano's inequality, this gives lower bounds on estimation error: there is a minimum sample size n needed to estimate \theta to accuracy \epsilon, regardless of the algorithm used.

In representation learning, the DPI constrains what information a learned representation can contain: I(X; Z) \leq I(X; Y), where X is the input, Y the representation, and Z the prediction.

Failure Mode

The inequality is tight when Z is a sufficient statistic for X given Y, meaning Z preserves all the information that Y has about X. In practice, finding sufficient statistics is often impossible, and most processing steps are lossy.
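The inequality can be checked exactly on a small Markov chain by composing two channels and comparing mutual informations. The source distribution and channel matrices below are examples of ours:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

def mutual_info(pxy):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) from a joint distribution matrix."""
    return H(pxy.sum(axis=1)) + H(pxy.sum(axis=0)) - H(pxy)

px = np.array([0.5, 0.5])                 # source distribution of X
A = np.array([[0.9, 0.1],                 # channel X -> Y; row i is p(y | x=i)
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],                 # channel Y -> Z; row j is p(z | y=j)
              [0.3, 0.7]])

pxy = px[:, None] * A                     # joint p(x, y)
pxz = px[:, None] * (A @ B)               # joint p(x, z) under the chain X -> Y -> Z
slack = mutual_info(pxy) - mutual_info(pxz)   # DPI: this is non-negative
```

Because the second channel is noisy, the information about X strictly decreases here.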

Theorem

Chain Rule for Entropy

Statement

For random variables X_1, \ldots, X_n:

H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1})

For n = 2: H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y).

Chain rule for KL divergence: D_{\text{KL}}(p(x, y) \| q(x, y)) = D_{\text{KL}}(p(x) \| q(x)) + \mathbb{E}_{p(x)}[D_{\text{KL}}(p(y \mid x) \| q(y \mid x))]

Intuition

The joint uncertainty of (X_1, \ldots, X_n) decomposes into the uncertainty of X_1, plus the additional uncertainty of X_2 given X_1, plus the additional uncertainty of X_3 given (X_1, X_2), and so on. Each term measures how much new uncertainty each variable adds beyond what is already known.

If the variables are independent, H(X_1, \ldots, X_n) = \sum_i H(X_i) (uncertainty is additive). If they are dependent, the conditional entropies are smaller (each variable is partly predicted by the others), so the joint entropy is less than the sum of the marginal entropies.

Proof Sketch

For n = 2, expand the definition:

H(X, Y) = -\sum_{x,y} p(x,y) \log p(x,y) = -\sum_{x,y} p(x,y) [\log p(x) + \log p(y \mid x)]

= -\sum_x p(x) \log p(x) - \sum_{x,y} p(x,y) \log p(y \mid x) = H(X) + H(Y \mid X)

The general case follows by induction, using p(x_1, \ldots, x_n) = p(x_n \mid x_1, \ldots, x_{n-1}) \cdot p(x_1, \ldots, x_{n-1}).
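The n = 2 case can be verified numerically: H(Y \mid X) is the p(x)-weighted average of the row entropies of the joint. The joint distribution below is an example of ours:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

pxy = np.array([[0.25, 0.25],    # joint of (X, Y); rows index x, columns index y
                [0.10, 0.40]])
px = pxy.sum(axis=1)

# H(Y | X) = sum_x p(x) * H(Y | X = x), with p(y | x) obtained by normalizing each row
H_Y_given_X = sum(px[i] * H(pxy[i] / px[i]) for i in range(len(px)))
gap = H(pxy) - (H(px) + H_Y_given_X)     # chain rule says this vanishes
```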

Why It Matters

The chain rule is used constantly in ML theory. Autoregressive models (GPT, language models) use it directly: the joint distribution of a sequence is factored as p(x_1, \ldots, x_n) = \prod_i p(x_i \mid x_{<i}), and the cross-entropy loss decomposes accordingly.

The chain rule for KL divergence is used in variational inference to decompose the ELBO objective and in PAC-Bayes bounds to decompose the complexity term across layers of a neural network.

Failure Mode

The chain rule applies to discrete random variables directly. For continuous random variables, entropy must be replaced by differential entropy h(X) = -\int p(x) \log p(x) \, dx, which can be negative and is not invariant under changes of variables. The chain rules still hold for differential entropy and KL divergence, but care is needed with absolute continuity conditions.

Key Identities and Relationships

The core information-theoretic quantities are connected by:

  • H(p, q) = H(p) + D_{\text{KL}}(p \| q) (cross-entropy = entropy + KL)
  • I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) (mutual information = reduction in entropy)
  • I(X; Y) = D_{\text{KL}}(p(x,y) \| p(x)p(y)) (mutual information = KL between the joint and the product of marginals)
  • H(X, Y) = H(X) + H(Y) - I(X; Y) (joint entropy via mutual information)

Canonical Examples

Example

Cross-entropy loss for binary classification

For binary classification with true label y \in \{0, 1\} and predicted probability \hat{p} for the event Y = 1 given x:

\mathcal{L}_{\text{CE}} = -[y \log \hat{p} + (1 - y) \log(1 - \hat{p})]

This is the cross-entropy H(\text{Bernoulli}(y), \text{Bernoulli}(\hat{p})). Minimizing its expectation over \hat{p} for each x yields \hat{p}^* = P(Y = 1 \mid x) --- the true conditional probability. The minimum expected loss is H(Y \mid X), the conditional entropy (irreducible noise).
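A quick numerical sketch of this claim: scan candidate predictions and find where the expected loss is smallest. The value of `p_true` and the grid are assumptions of ours:

```python
import numpy as np

p_true = 0.3                                  # assumed true P(Y = 1 | x)
grid = np.linspace(0.01, 0.99, 99)            # candidate predicted probabilities
# Expected cross-entropy loss at each candidate prediction
risk = -(p_true * np.log(grid) + (1 - p_true) * np.log(1 - grid))
best = grid[np.argmin(risk)]                  # minimized at the true probability
irreducible = risk.min()                      # the binary entropy of p_true (in nats)
```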

Example

KL divergence between two Gaussians

For p = \mathcal{N}(\mu_1, \sigma_1^2) and q = \mathcal{N}(\mu_2, \sigma_2^2):

D_{\text{KL}}(p \| q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}

This equals zero iff \mu_1 = \mu_2 and \sigma_1 = \sigma_2. Notice the asymmetry: swapping p and q gives a different expression. This formula appears constantly in variational autoencoders, where the latent prior is typically \mathcal{N}(0, 1).
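The closed form can be cross-checked by integrating p \log(p/q) on a fine grid. The specific parameter values and grid are assumptions of ours:

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form D_KL(N(mu1, s1^2) || N(mu2, s2^2)) in nats."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
x = np.linspace(-12.0, 12.0, 200001)          # wide, fine grid; tails are negligible
p = np.exp(-(x - mu1) ** 2 / (2 * s1**2)) / (s1 * np.sqrt(2 * np.pi))
q = np.exp(-(x - mu2) ** 2 / (2 * s2**2)) / (s2 * np.sqrt(2 * np.pi))
numeric = float(np.sum(p * np.log(p / q)) * (x[1] - x[0]))   # Riemann sum of p*log(p/q)
closed = float(kl_gauss(mu1, s1, mu2, s2))
```

Swapping the two Gaussians in `kl_gauss` gives a visibly different number, illustrating the asymmetry.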

Common Confusions

Watch Out

KL divergence is not symmetric

D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p) in general. The "forward KL" D_{\text{KL}}(p \| q) penalizes q for having low probability where p has high probability (it is "zero-avoiding" in q). The "reverse KL" D_{\text{KL}}(q \| p) penalizes q for having high probability where p has low probability (it is "zero-forcing" in q). Variational inference minimizes the reverse KL, which tends to underestimate the variance of the posterior.

Watch Out

Minimizing cross-entropy is the same as minimizing KL divergence

Since H(p, q) = H(p) + D_{\text{KL}}(p \| q) and H(p) is constant with respect to q, \arg\min_q H(p, q) = \arg\min_q D_{\text{KL}}(p \| q). Students sometimes treat cross-entropy and KL divergence as different objectives; when optimizing over q, they are the same objective up to an additive constant.

Watch Out

Differential entropy can be negative

For continuous random variables, h(X) = -\int p(x) \log p(x) \, dx can be negative. For example, X \sim \text{Uniform}(0, 1/2) has h(X) = -\log 2 < 0. This is because differential entropy is not a limit of discrete entropy; it is a different quantity. KL divergence and mutual information remain non-negative for continuous variables.

Watch Out

Differential entropy is not coordinate-invariant

Discrete entropy H(X) is invariant under relabeling. Differential entropy h(X) is not. For an invertible linear map Y = AX:

h(Y) = h(X) + \log|\det A|

More generally, for a smooth bijection Y = T(X) with Jacobian J_T:

h(Y) = h(X) + \mathbb{E}\left[\log\left|\det J_T(X)\right|\right]

So "entropy in bits" is not a property of the random variable alone once we leave the discrete setting. KL divergence and mutual information are coordinate-invariant because the Jacobian terms cancel between numerator and denominator.
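Both the scaling law and the possibility of negative values are visible in the Gaussian closed form h = \frac{1}{2}\log(2\pi e \sigma^2). The helper name and the particular scale factor are ours:

```python
import numpy as np

def h_gauss(sigma):
    """Differential entropy of N(mu, sigma^2) in nats: 0.5 * log(2*pi*e*sigma^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

sigma, a = 1.0, 3.0
# Scaling X by a scales sigma by |a|, so h shifts by exactly log|a|:
shift = h_gauss(a * sigma) - h_gauss(sigma)
# And a sharply peaked density has negative differential entropy:
narrow = h_gauss(0.05)
```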

Summary

  • Entropy: H(X) = -\mathbb{E}[\log p(X)] \geq 0, measures uncertainty
  • Cross-entropy: H(p, q) = -\mathbb{E}_p[\log q(X)] = H(p) + D_{\text{KL}}(p \| q)
  • KL divergence: D_{\text{KL}}(p \| q) \geq 0 (Gibbs inequality), not symmetric, not a metric
  • Mutual information: I(X; Y) = H(X) - H(X \mid Y) = D_{\text{KL}}(p_{XY} \| p_X p_Y) \geq 0
  • Data processing inequality: X \to Y \to Z implies I(X; Z) \leq I(X; Y)
  • Chain rule: H(X, Y) = H(X) + H(Y \mid X)
  • Minimizing cross-entropy = minimizing KL divergence (same up to an additive constant)
  • Forward vs. reverse KL: different behaviors, different applications

Exercises

ExerciseCore

Problem

Compute the entropy of a Bernoulli random variable X \sim \text{Bernoulli}(p) as a function of p. At what value of p is the entropy maximized, and what is the maximum value (in bits)?

ExerciseCore

Problem

Show that I(X; Y) = 0 if and only if X and Y are independent. Use the non-negativity of KL divergence.

ExerciseAdvanced

Problem

Prove the data processing inequality: if X \to Y \to Z is a Markov chain, then I(X; Z) \leq I(X; Y).

ExerciseAdvanced

Problem

The ELBO (Evidence Lower Bound) in variational inference is \text{ELBO}(q) = \mathbb{E}_q[\log p(x, z)] - \mathbb{E}_q[\log q(z)], where q(z) approximates the posterior p(z \mid x). Show that \log p(x) = \text{ELBO}(q) + D_{\text{KL}}(q(z) \| p(z \mid x)), and explain why maximizing the ELBO is equivalent to minimizing the reverse KL divergence.

References

Canonical:

  • Cover & Thomas, Elements of Information Theory (2nd ed., 2006), Chapters 2, 4, 8
  • MacKay, Information Theory, Inference, and Learning Algorithms (2003), Chapters 1-4

Current:

  • Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 6
  • Polyanskiy & Wu, Information Theory: From Coding to Learning (2024), Chapters 1-3


Last reviewed: April 2026
