
Mathematical Infrastructure

Information Theory Foundations

The core of information theory for ML: entropy, cross-entropy, KL divergence, mutual information, data processing inequality, and the chain rules that connect them. The language of variational inference, generalization bounds, and representation learning.


Why This Matters

Information theory is the mathematical language of uncertainty, compression, and communication. In machine learning, it appears everywhere:

  • Cross-entropy loss is the standard training objective for classification --- it is an information-theoretic quantity
  • KL divergence is the objective in variational inference, the penalty in PAC-Bayes bounds, and the measure of distributional difference throughout ML theory
  • Mutual information quantifies how much one random variable tells you about another --- central to representation learning, feature selection, and the information bottleneck
  • Data processing inequality constrains what any algorithm can extract from data --- it is the foundation of minimax lower bounds

If you do not know information theory, large parts of ML theory will be opaque. The good news: the core concepts are few, and they interact through a small number of compact identities.

Mental Model

Think of entropy as measuring the "surprise" or "uncertainty" in a random variable. A fair coin has maximum entropy (1 bit); a biased coin has less (less uncertainty). Cross-entropy measures the expected surprise when you use the wrong distribution q to encode data from the true distribution p. KL divergence is the extra surprise from using q instead of p --- it is the cost of being wrong.

Mutual information measures how much knowing X reduces your uncertainty about Y. If X and Y are independent, mutual information is zero. If X determines Y completely, mutual information equals the entropy of Y.

Core Definitions

Definition

Entropy

The entropy of a discrete random variable X with distribution p over alphabet \mathcal{X} is:

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = -\mathbb{E}_p[\log p(X)]

with the convention 0 \log 0 = 0. When \log is base 2, entropy is in bits; when natural log, in nats.

Properties:

  • H(X) \geq 0 (entropy is non-negative)
  • H(X) = 0 if and only if X is deterministic
  • H(X) \leq \log|\mathcal{X}|, with equality iff X is uniform
  • Entropy is concave in p
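These properties are easy to check numerically. Below is a minimal NumPy sketch; the helper name `entropy` and the example distributions are ours, not from the text:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution p, using the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                      # drop zero-probability outcomes
    return float(-np.sum(nz * np.log(nz)) / np.log(base))

fair = entropy([0.5, 0.5])             # fair coin: maximum for a binary variable (1 bit)
biased = entropy([0.9, 0.1])           # biased coin: less uncertainty
point_mass = entropy([1.0, 0.0])       # deterministic: zero entropy
uniform4 = entropy([0.25] * 4)         # uniform over 4 symbols: log2(4) = 2 bits
```

The uniform case saturates the H(X) \leq \log|\mathcal{X}| bound, and the point mass hits the H(X) = 0 lower bound.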
Definition

Cross-Entropy

The cross-entropy of distribution q relative to distribution p is:

H(p, q) = -\sum_{x} p(x) \log q(x) = -\mathbb{E}_p[\log q(X)]

This measures the expected number of bits needed to encode data from p using a code optimized for q. In ML, p is the true data distribution and q is the model distribution. The cross-entropy loss is:

\mathcal{L}_{\text{CE}} = -\frac{1}{n}\sum_{i=1}^n \log q(y_i \mid x_i; \theta)

which is the empirical cross-entropy between the data distribution and the model.

Key identity: H(p, q) = H(p) + D_{\text{KL}}(p \| q). Since H(p) is constant (does not depend on q), minimizing cross-entropy over q is equivalent to minimizing KL divergence.
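The key identity can be verified directly. A minimal NumPy sketch (the helper names and the two example distributions are ours):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))            # nats

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return float(-np.sum(p[m] * np.log(q[m])))

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
# H(p, q) = H(p) + D_KL(p || q): the gap should be zero up to float error
gap = cross_entropy(p, q) - (entropy(p) + kl(p, q))
```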

Definition

KL Divergence

The Kullback-Leibler divergence from q to p is:

D_{\text{KL}}(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_p\!\left[\log \frac{p(X)}{q(X)}\right]

KL divergence measures the information lost when q is used to approximate p. It is finite only when q(x) = 0 \implies p(x) = 0 (absolute continuity): if p(x) > 0 and q(x) = 0, then D_{\text{KL}}(p \| q) = +\infty.

Critical: KL divergence is not a metric.

  • Not symmetric: D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p) in general
  • No triangle inequality: there exist p, q, r with D_{\text{KL}}(p \| r) > D_{\text{KL}}(p \| q) + D_{\text{KL}}(q \| r)

Despite not being a metric, KL divergence is the most natural measure of distributional difference for many statistical problems because of its connection to likelihood ratios and sufficient statistics.
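Both the asymmetry and the infinite-divergence case are easy to exhibit. A sketch with made-up three-point distributions (all names and values here are ours):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats; returns inf when p puts mass where q has none."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = p > 0
    if np.any(q[m] == 0):
        return float("inf")
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

p = [0.9, 0.05, 0.05]
q = [1/3, 1/3, 1/3]
forward = kl(p, q)                                # D_KL(p || q)
reverse = kl(q, p)                                # D_KL(q || p): a different number
blow_up = kl([0.5, 0.5, 0.0], [1.0, 0.0, 0.0])    # p > 0 where q = 0: infinite
```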

Definition

Mutual Information

The mutual information between random variables X and Y is:

I(X; Y) = D_{\text{KL}}(p(x, y) \| p(x)p(y)) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}

Equivalent formulations:

  • I(X; Y) = H(X) - H(X \mid Y) (reduction in uncertainty about X from knowing Y)
  • I(X; Y) = H(Y) - H(Y \mid X)
  • I(X; Y) = H(X) + H(Y) - H(X, Y)

Properties:

  • I(X; Y) \geq 0, with equality iff X \perp Y (independence)
  • I(X; Y) = I(Y; X) (symmetric, unlike KL divergence)
  • I(X; X) = H(X) (a variable's mutual information with itself is its full entropy)
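The equivalent formulations can be checked against each other on a small joint distribution. A NumPy sketch (the 2x2 joint is an example of ours):

```python
import numpy as np

def H(p):
    """Entropy (nats) of a distribution given as an array of probabilities, any shape."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

# Joint distribution of (X, Y) on a 2x2 alphabet; rows index x, columns index y.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

mi_from_entropies = H(px) + H(py) - H(pxy)                        # H(X)+H(Y)-H(X,Y)
mi_from_kl = float(np.sum(pxy * np.log(pxy / np.outer(px, py))))  # D_KL(joint || product)
```

The two routes agree, and replacing `pxy` by the product of its marginals drives the mutual information to zero.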
Definition

Conditional Entropy

The conditional entropy of X given Y is:

H(X \mid Y) = -\sum_{x, y} p(x, y) \log p(x \mid y) = \mathbb{E}_Y[H(X \mid Y = y)]

This measures the remaining uncertainty about X after observing Y.

  • H(X \mid Y) \leq H(X) (conditioning reduces entropy on average)
  • H(X \mid Y) = 0 iff X is a deterministic function of Y
  • H(X \mid Y) = H(X) iff X \perp Y
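A small check that conditioning cannot increase entropy on average, computed via the standard identity H(X \mid Y) = H(X, Y) - H(Y). The joint distribution below is an example of ours:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

pxy = np.array([[0.3, 0.2],      # joint of (X, Y); rows index x, columns index y
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

# H(X | Y) = H(X, Y) - H(Y): the uncertainty left in X after observing Y
H_X_given_Y = H(pxy) - H(py)
```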

Main Theorems

Theorem

Gibbs Inequality (Non-Negativity of KL)

Statement

For any probability distributions p and q:

D_{\text{KL}}(p \| q) \geq 0

with equality if and only if p = q almost everywhere.

Equivalently: the cross-entropy is at least the entropy:

H(p, q) \geq H(p)

Intuition

Encoding data from p using a code optimized for q can never be more efficient than using the optimal code for p. Any mismatch between the true distribution and the coding distribution wastes bits. The extra bits wasted are exactly D_{\text{KL}}(p \| q).

Proof Sketch

By Jensen's inequality applied to the convex function -\log:

D_{\text{KL}}(p \| q) = \mathbb{E}_p\!\left[-\log \frac{q(X)}{p(X)}\right] \geq -\log \mathbb{E}_p\!\left[\frac{q(X)}{p(X)}\right] = -\log \sum_x q(x) = -\log 1 = 0

Equality holds iff q(x)/p(x) is constant p-a.s., which requires p = q.

Why It Matters

Gibbs inequality is the most fundamental result in information theory. It justifies using cross-entropy as a loss function: minimizing H(p, q) over q is equivalent to minimizing D_{\text{KL}}(p \| q), which finds the best approximation to p. Since KL divergence is non-negative, the minimum is achieved at q = p.

In ML: training a classifier by minimizing cross-entropy loss is justified because it drives the model distribution toward the true conditional distribution p(y \mid x).

Failure Mode

KL divergence can be infinite if p has support where q does not. In variational inference, this means the variational family q must cover the support of the posterior p, or the objective becomes undefined. This is one reason why the "forward KL" D_{\text{KL}}(p \| q) and the "reverse KL" D_{\text{KL}}(q \| p) behave very differently.

Theorem

Data Processing Inequality

Statement

If X \to Y \to Z is a Markov chain (i.e., Z is conditionally independent of X given Y), then:

I(X; Z) \leq I(X; Y)

Processing the data Y through any channel to produce Z cannot increase the information about X. No computation on Y can create information about X that Y does not already contain.

Intuition

You cannot improve your estimate of X by processing the observation Y --- you can only lose information. If you compress data, denoise it, or transform it through any deterministic or stochastic function, you cannot gain information about the source.

Think of a game of telephone: each step can only lose information, never gain it. The original message X is most informative; each retelling Y, Z, \ldots can only degrade the signal.

Proof Sketch

By the chain rule for mutual information:

I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y)

Since X \to Y \to Z is Markov, I(X; Z \mid Y) = 0 (given Y, Z tells you nothing new about X). Therefore:

I(X; Y) = I(X; Z) + I(X; Y \mid Z) \geq I(X; Z)

since I(X; Y \mid Z) \geq 0.

Why It Matters

The data processing inequality is the theoretical foundation for minimax lower bounds in statistics and ML. If you want to estimate a parameter \theta from data X_1, \ldots, X_n, any estimator \hat{\theta}(X_1, \ldots, X_n) satisfies I(\theta; \hat{\theta}) \leq I(\theta; X_1, \ldots, X_n).

Combined with Fano's inequality, this gives lower bounds on estimation error: there is a minimum sample size n needed to estimate \theta to accuracy \epsilon, regardless of the algorithm used.

In representation learning, the DPI constrains what information a learned representation can contain: I(X; Z) \leq I(X; Y), where X is the input, Y the representation, and Z the prediction.

Failure Mode

The inequality is tight when Z is a sufficient statistic for X given Y, meaning Z preserves all the information that Y has about X. In practice, finding sufficient statistics is often impossible, and most processing steps are lossy.
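The inequality can be checked exactly on a small Markov chain by composing two channels and comparing mutual informations. The source distribution and channel matrices below are examples of ours:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

def mutual_info(pxy):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) from a joint distribution matrix."""
    return H(pxy.sum(axis=1)) + H(pxy.sum(axis=0)) - H(pxy)

px = np.array([0.5, 0.5])                 # source distribution of X
A = np.array([[0.9, 0.1],                 # channel X -> Y; row i is p(y | x=i)
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],                 # channel Y -> Z; row j is p(z | y=j)
              [0.3, 0.7]])

pxy = px[:, None] * A                     # joint p(x, y)
pxz = px[:, None] * (A @ B)               # joint p(x, z) under the chain X -> Y -> Z
slack = mutual_info(pxy) - mutual_info(pxz)   # DPI: this is non-negative
```

Because the second channel is noisy, the information about X strictly decreases here.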

Theorem

Chain Rule for Entropy

Statement

For random variables X_1, \ldots, X_n:

H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1})

For n = 2: H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y).

Chain rule for KL divergence: D_{\text{KL}}(p(x, y) \| q(x, y)) = D_{\text{KL}}(p(x) \| q(x)) + \mathbb{E}_{p(x)}[D_{\text{KL}}(p(y \mid x) \| q(y \mid x))]

Intuition

The joint uncertainty of (X_1, \ldots, X_n) decomposes into the uncertainty of X_1, plus the additional uncertainty of X_2 given X_1, plus the additional uncertainty of X_3 given (X_1, X_2), and so on. Each term measures how much new uncertainty each variable adds beyond what is already known.

If the variables are independent, H(X_1, \ldots, X_n) = \sum_i H(X_i) (uncertainty is additive). If they are dependent, the conditional entropies are smaller (each variable is partly predicted by the others), so the joint entropy is less than the sum of the marginal entropies.

Proof Sketch

For n = 2, expand the definition:

H(X, Y) = -\sum_{x,y} p(x,y) \log p(x,y) = -\sum_{x,y} p(x,y) [\log p(x) + \log p(y \mid x)]

= -\sum_x p(x) \log p(x) - \sum_{x,y} p(x,y) \log p(y \mid x) = H(X) + H(Y \mid X)

The general case follows by induction, using p(x_1, \ldots, x_n) = p(x_n \mid x_1, \ldots, x_{n-1}) \cdot p(x_1, \ldots, x_{n-1}).
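The n = 2 case can be verified numerically: H(Y \mid X) is the p(x)-weighted average of the row entropies of the joint. The joint distribution below is an example of ours:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

pxy = np.array([[0.25, 0.25],    # joint of (X, Y); rows index x, columns index y
                [0.10, 0.40]])
px = pxy.sum(axis=1)

# H(Y | X) = sum_x p(x) * H(Y | X = x), with p(y | x) obtained by normalizing each row
H_Y_given_X = sum(px[i] * H(pxy[i] / px[i]) for i in range(len(px)))
gap = H(pxy) - (H(px) + H_Y_given_X)     # chain rule says this vanishes
```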

Why It Matters

The chain rule is used constantly in ML theory. Autoregressive models (GPT, language models) use it directly: the joint distribution of a sequence is factored as p(x_1, \ldots, x_n) = \prod_i p(x_i \mid x_{<i}), and the cross-entropy loss decomposes accordingly.

The chain rule for KL divergence is used in variational inference to decompose the ELBO objective and in PAC-Bayes bounds to decompose the complexity term across layers of a neural network.

Failure Mode

The chain rule applies to discrete random variables directly. For continuous random variables, entropy must be replaced by differential entropy h(X) = -\int p(x) \log p(x) \, dx, which can be negative and is not invariant under changes of variables. The chain rules still hold for differential entropy and KL divergence, but care is needed with absolute continuity conditions.

Key Identities and Relationships

The core information-theoretic quantities are connected by:

  • H(p, q) = H(p) + D_{\text{KL}}(p \| q) (cross-entropy = entropy + KL)
  • I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) (mutual information = reduction in entropy)
  • I(X; Y) = D_{\text{KL}}(p(x,y) \| p(x)p(y)) (mutual information = KL between the joint and the product of marginals)
  • H(X, Y) = H(X) + H(Y) - I(X; Y) (joint entropy via mutual information)

Canonical Examples

Example

Cross-entropy loss for binary classification

For binary classification with true label y \in \{0, 1\} and predicted probability \hat{p} for the event Y = 1 given x:

\mathcal{L}_{\text{CE}} = -[y \log \hat{p} + (1 - y) \log(1 - \hat{p})]

This is the cross-entropy H(\text{Bernoulli}(y), \text{Bernoulli}(\hat{p})). Minimizing its expectation over \hat{p} for each x yields \hat{p}^* = P(Y = 1 \mid x) --- the true conditional probability. The minimum expected loss is H(Y \mid X), the conditional entropy (irreducible noise).
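A quick numerical sketch of this claim: scan candidate predictions and find where the expected loss is smallest. The value of `p_true` and the grid are assumptions of ours:

```python
import numpy as np

p_true = 0.3                                  # assumed true P(Y = 1 | x)
grid = np.linspace(0.01, 0.99, 99)            # candidate predicted probabilities
# Expected cross-entropy loss at each candidate prediction
risk = -(p_true * np.log(grid) + (1 - p_true) * np.log(1 - grid))
best = grid[np.argmin(risk)]                  # minimized at the true probability
irreducible = risk.min()                      # the binary entropy of p_true (in nats)
```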

Example

KL divergence between two Gaussians

For p = \mathcal{N}(\mu_1, \sigma_1^2) and q = \mathcal{N}(\mu_2, \sigma_2^2):

D_{\text{KL}}(p \| q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}

This equals zero iff \mu_1 = \mu_2 and \sigma_1 = \sigma_2. Notice the asymmetry: swapping p and q gives a different expression. This formula appears constantly in variational autoencoders, where the latent prior is typically \mathcal{N}(0, 1).
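The closed form can be cross-checked by integrating p \log(p/q) on a fine grid. The specific parameter values and grid are assumptions of ours:

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form D_KL(N(mu1, s1^2) || N(mu2, s2^2)) in nats."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
x = np.linspace(-12.0, 12.0, 200001)          # wide, fine grid; tails are negligible
p = np.exp(-(x - mu1) ** 2 / (2 * s1**2)) / (s1 * np.sqrt(2 * np.pi))
q = np.exp(-(x - mu2) ** 2 / (2 * s2**2)) / (s2 * np.sqrt(2 * np.pi))
numeric = float(np.sum(p * np.log(p / q)) * (x[1] - x[0]))   # Riemann sum of p*log(p/q)
closed = float(kl_gauss(mu1, s1, mu2, s2))
```

Swapping the two Gaussians in `kl_gauss` gives a visibly different number, illustrating the asymmetry.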

Common Confusions

Watch Out

KL divergence is not symmetric

D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p) in general. The "forward KL" D_{\text{KL}}(p \| q) penalizes q for having low probability where p has high probability (it is "zero-avoiding" in q). The "reverse KL" D_{\text{KL}}(q \| p) penalizes q for having high probability where p has low probability (it is "zero-forcing" in q). Variational inference minimizes the reverse KL, which tends to underestimate the variance of the posterior.

Watch Out

Minimizing cross-entropy is the same as minimizing KL divergence

Since H(p, q) = H(p) + D_{\text{KL}}(p \| q) and H(p) is constant with respect to q, \arg\min_q H(p, q) = \arg\min_q D_{\text{KL}}(p \| q). Students sometimes treat cross-entropy and KL divergence as different objectives; when optimizing over q, they are the same objective up to an additive constant.

Watch Out

Differential entropy can be negative

For continuous random variables, h(X) = -\int p(x) \log p(x) \, dx can be negative. For example, X \sim \text{Uniform}(0, 1/2) has h(X) = -\log 2 < 0. This is because differential entropy is not a limit of discrete entropy; it is a different quantity. KL divergence and mutual information remain non-negative for continuous variables.

Watch Out

Differential entropy is not coordinate-invariant

Discrete entropy H(X) is invariant under relabeling. Differential entropy h(X) is not. For an invertible linear map Y = AX:

h(Y) = h(X) + \log|\det A|

More generally, for a smooth bijection Y = T(X) with Jacobian J_T:

h(Y) = h(X) + \mathbb{E}\left[\log\left|\det J_T(X)\right|\right]

So "entropy in bits" is not a property of the random variable alone once we leave the discrete setting. KL divergence and mutual information are coordinate-invariant because the Jacobian terms cancel between numerator and denominator.
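Both the scaling law and the possibility of negative values are visible in the Gaussian closed form h = \frac{1}{2}\log(2\pi e \sigma^2). The helper name and the particular scale factor are ours:

```python
import numpy as np

def h_gauss(sigma):
    """Differential entropy of N(mu, sigma^2) in nats: 0.5 * log(2*pi*e*sigma^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

sigma, a = 1.0, 3.0
# Scaling X by a scales sigma by |a|, so h shifts by exactly log|a|:
shift = h_gauss(a * sigma) - h_gauss(sigma)
# And a sharply peaked density has negative differential entropy:
narrow = h_gauss(0.05)
```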

Summary

  • Entropy: H(X) = -\mathbb{E}[\log p(X)] \geq 0, measures uncertainty
  • Cross-entropy: H(p, q) = -\mathbb{E}_p[\log q(X)] = H(p) + D_{\text{KL}}(p \| q)
  • KL divergence: D_{\text{KL}}(p \| q) \geq 0 (Gibbs inequality), not symmetric, not a metric
  • Mutual information: I(X; Y) = H(X) - H(X \mid Y) = D_{\text{KL}}(p_{XY} \| p_X p_Y) \geq 0
  • Data processing inequality: X \to Y \to Z implies I(X; Z) \leq I(X; Y)
  • Chain rule: H(X, Y) = H(X) + H(Y \mid X)
  • Minimizing cross-entropy = minimizing KL divergence (same up to an additive constant)
  • Forward vs. reverse KL: different behaviors, different applications

Exercises

ExerciseCore

Problem

Compute the entropy of a Bernoulli random variable X \sim \text{Bernoulli}(p) as a function of p. At what value of p is the entropy maximized, and what is the maximum value (in bits)?

ExerciseCore

Problem

Show that I(X; Y) = 0 if and only if X and Y are independent. Use the non-negativity of KL divergence.

ExerciseAdvanced

Problem

Prove the data processing inequality: if X \to Y \to Z is a Markov chain, then I(X; Z) \leq I(X; Y).

ExerciseAdvanced

Problem

The ELBO (Evidence Lower Bound) in variational inference is \text{ELBO}(q) = \mathbb{E}_q[\log p(x, z)] - \mathbb{E}_q[\log q(z)], where q(z) approximates the posterior p(z \mid x). Show that \log p(x) = \text{ELBO}(q) + D_{\text{KL}}(q(z) \| p(z \mid x)), and explain why maximizing the ELBO is equivalent to minimizing the reverse KL divergence.

References

Canonical:

  • Cover & Thomas, Elements of Information Theory (2nd ed., 2006), Chapters 2, 4, 8
  • MacKay, Information Theory, Inference, and Learning Algorithms (2003), Chapters 1-4

Current:

  • Murphy, Probabilistic Machine Learning: An Introduction (2022), Chapter 6
  • Polyanskiy & Wu, Information Theory: From Coding to Learning (2024), Chapters 1-3


Last reviewed: April 2026
