
Mathematical Infrastructure

Radon-Nikodym and Conditional Expectation

The Radon-Nikodym theorem: what 'density' really means. Absolute continuity, the Radon-Nikodym derivative, conditional expectation as a projection, tower property, and why this undergirds likelihood ratios, importance sampling, and KL divergence.


Why This Matters

The word "density" appears on nearly every page of a statistics or ML textbook. But what is a density? It is not just "the derivative of the CDF." Rigorously, a density is a Radon-Nikodym derivative --- the derivative of one measure with respect to another. This single concept unifies:

  • Likelihood ratios: $\frac{dP_\theta}{dP_{\theta_0}}$ is literally a Radon-Nikodym derivative
  • Importance sampling: reweighting samples by $\frac{dP}{dQ}$
  • KL divergence: $D_{\text{KL}}(P \| Q) = \int \log\frac{dP}{dQ}\,dP$
  • Bayesian posteriors: the posterior density is the prior density times the likelihood, normalized
  • Conditional expectation: $\mathbb{E}[Y \mid \mathcal{G}]$ is defined via the Radon-Nikodym theorem

If you skip this topic, you will use "density" as a vague synonym for "PDF on $\mathbb{R}$." You will not understand why likelihood ratios require absolute continuity, why importance sampling can fail catastrophically, or what conditional expectation actually is beyond the formula $\int y\, p(y|x)\,dy$.

Mental Model

Think of two measures $\nu$ and $\mu$ on the same space. If $\nu$ is "compatible" with $\mu$ --- meaning that whenever $\mu$ says a set has zero size, $\nu$ agrees --- then $\nu$ can be expressed as a "weighted version" of $\mu$. The weight function is the Radon-Nikodym derivative $d\nu/d\mu$. It tells you: at each point, how much more (or less) does $\nu$ care about this region compared to $\mu$?

A PDF $f(x)$ is precisely this: it tells you how much the probability measure $P$ weighs each region compared to Lebesgue measure $\lambda$. The formula $P(A) = \int_A f(x)\,dx$ is just $P(A) = \int_A \frac{dP}{d\lambda}\,d\lambda$.
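A quick numerical sanity check of this identity (a minimal sketch; the function names are illustrative, not from any library): treat the standard normal PDF as $dP/d\lambda$ and confirm that integrating it over a set recovers the measure of that set.

```python
import math

def phi(x):
    """Standard normal PDF: the Radon-Nikodym derivative dP/d(lambda)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def P(a, b, n=100_000):
    """P([a, b]) computed as the integral of dP/d(lambda) over [a, b] (midpoint rule)."""
    h = (b - a) / n
    return sum(phi(a + (i + 0.5) * h) for i in range(n)) * h

# Integrating the density over a set recovers the measure of the set:
print(P(0.0, 1.0))          # ≈ 0.3413
print(Phi(1.0) - Phi(0.0))  # same value, via the CDF
```

The same pattern works for any reference measure: only the integral changes, not the role of the density.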

Formal Setup

Definition

Absolute Continuity

Let $\mu$ and $\nu$ be measures on $(\Omega, \mathcal{F})$. We say $\nu$ is absolutely continuous with respect to $\mu$, written $\nu \ll \mu$, if:

$$\mu(A) = 0 \implies \nu(A) = 0 \quad \text{for all } A \in \mathcal{F}$$

Equivalently: every $\mu$-null set is also $\nu$-null. If $\nu$ assigns positive measure to some set that $\mu$ considers negligible, then $\nu$ is not absolutely continuous with respect to $\mu$.

Examples:

  • Any probability distribution with a PDF is absolutely continuous with respect to Lebesgue measure
  • A discrete distribution (point masses) is not absolutely continuous with respect to Lebesgue measure
  • Two Gaussians $\mathcal{N}(\mu_1, \sigma^2)$ and $\mathcal{N}(\mu_2, \sigma^2)$ are mutually absolutely continuous
Definition

Singular Measures

Two measures $\nu$ and $\mu$ are mutually singular, written $\nu \perp \mu$, if there exists a set $A$ such that $\nu(A) = 0$ and $\mu(A^c) = 0$. They "live on disjoint sets."

Example: Lebesgue measure and the counting measure on $\mathbb{Z}$ are mutually singular. Any discrete distribution is singular with respect to any continuous distribution.

Main Theorems

Theorem

Radon-Nikodym Theorem

Statement

If $\nu \ll \mu$ and both are $\sigma$-finite, then there exists a measurable function $f: \Omega \to [0, \infty)$ such that for every $A \in \mathcal{F}$:

$$\nu(A) = \int_A f \, d\mu$$

The function $f$ is called the Radon-Nikodym derivative of $\nu$ with respect to $\mu$, written $f = \frac{d\nu}{d\mu}$. It is unique $\mu$-almost everywhere.

Intuition

The Radon-Nikodym derivative is the "local ratio" of two measures. At each point $\omega$, $\frac{d\nu}{d\mu}(\omega)$ tells you how much more mass $\nu$ puts near $\omega$ compared to $\mu$. If $\nu$ concentrates more mass somewhere, $d\nu/d\mu$ is large there. If $\nu$ puts less mass, $d\nu/d\mu$ is small.

When $\mu$ is Lebesgue measure and $\nu$ is a probability measure with a density, then $\frac{d\nu}{d\mu} = f$ is exactly the PDF. The Radon-Nikodym theorem says: densities exist whenever absolute continuity holds, and not just on $\mathbb{R}$ --- on any measurable space.

Proof Sketch

(Hilbert space proof for finite measures): Consider the measure $\rho = \mu + \nu$. On $L^2(\Omega, \rho)$, the map $g \mapsto \int g \, d\nu$ is a bounded linear functional (by Cauchy-Schwarz, since $\nu \leq \rho$). By the Riesz representation theorem, there exists $h \in L^2(\rho)$ with $\int g \, d\nu = \int gh \, d\rho$ for all $g \in L^2(\rho)$.

One shows $0 \leq h \leq 1$ $\rho$-a.e. (by testing with indicator functions). Then set $f = h/(1 - h)$ on $\{h < 1\}$. Absolute continuity of $\nu$ with respect to $\mu$ ensures $\{h = 1\}$ has $\mu$-measure zero. Then $\nu(A) = \int_A f \, d\mu$.

Why It Matters

The Radon-Nikodym theorem is the rigorous foundation for:

  1. PDFs: $f(x) = dP/d\lambda$ where $\lambda$ is Lebesgue measure. The "density" is not a property of $P$ alone --- it is a relationship between $P$ and a reference measure.

  2. Likelihood ratios: $\frac{dP_\theta}{dP_{\theta_0}}(\omega)$ is the likelihood ratio, which is meaningful only when $P_\theta \ll P_{\theta_0}$. If the two models assign positive probability to disjoint regions, the likelihood ratio does not exist.

  3. Importance sampling: $\mathbb{E}_P[f(X)] = \mathbb{E}_Q[f(X) \cdot \frac{dP}{dQ}(X)]$. This requires $P \ll Q$; if $Q$ assigns zero probability to a region where $P$ is positive, you will never sample there and the estimator is biased.

  4. KL divergence: $D_{\text{KL}}(P \| Q) = \int \log\frac{dP}{dQ}\,dP$. This requires $P \ll Q$; if not, $D_{\text{KL}} = +\infty$.
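Point 3 can be checked with a short simulation. This is a minimal sketch, assuming target $P = \mathcal{N}(1, 1)$ and proposal $Q = \mathcal{N}(0, 1)$ (illustrative choices, not from the text): a short calculation with the two Gaussian densities gives the weight $\frac{dP}{dQ}(x) = \exp(x - \tfrac{1}{2})$.

```python
import math
import random

random.seed(0)

# Target P = N(1, 1), proposal Q = N(0, 1). Both have Lebesgue densities,
# so P << Q and the weight dP/dQ(x) = exp(x - 1/2) exists everywhere.
def weight(x):
    return math.exp(x - 0.5)

def f(x):
    return x * x  # estimate E_P[X^2] = Var(X) + E[X]^2 = 1 + 1 = 2

n = 200_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]   # samples from Q
est = sum(f(x) * weight(x) for x in xs) / n       # Monte Carlo E_Q[f * dP/dQ]
print(est)  # ≈ 2.0
```

If instead $Q$ put zero mass where $P$ is positive, no reweighting could repair the estimator: the missing region is simply never sampled.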

Failure Mode

Without absolute continuity, the Radon-Nikodym derivative does not exist. If $\nu$ is a point mass at $x_0$ and $\mu$ is Lebesgue measure, then $\nu(\{x_0\}) = 1$ but $\mu(\{x_0\}) = 0$, so $\nu$ is not absolutely continuous with respect to $\mu$. There is no function $f$ such that $\nu(A) = \int_A f\,d\mu$. This is why you cannot write a "PDF" for a discrete distribution with respect to Lebesgue measure.

Conditional Expectation

Definition

Conditional Expectation (Measure-Theoretic)

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, $Y$ an integrable random variable, and $\mathcal{G} \subseteq \mathcal{F}$ a sub-sigma-algebra. The conditional expectation $\mathbb{E}[Y \mid \mathcal{G}]$ is the $\mathcal{G}$-measurable random variable $Z$ satisfying:

$$\int_A Z \, d\mathbb{P} = \int_A Y \, d\mathbb{P} \quad \text{for all } A \in \mathcal{G}$$

It exists (by the Radon-Nikodym theorem applied to the signed measure $A \mapsto \int_A Y\,d\mathbb{P}$ restricted to $\mathcal{G}$) and is unique $\mathbb{P}$-almost surely.

Why is this the right definition? The condition says: $\mathbb{E}[Y \mid \mathcal{G}]$ is the "best guess" of $Y$ given only the information in $\mathcal{G}$, in the sense that it has the same integral as $Y$ over every $\mathcal{G}$-measurable set. It is a projection of $Y$ onto the space of $\mathcal{G}$-measurable functions.

When $\mathcal{G} = \sigma(X)$ (the sigma-algebra generated by a random variable $X$), we write $\mathbb{E}[Y \mid X]$, which is a function of $X$. In the special case where $X$ and $Y$ are jointly continuous with density, this reduces to the familiar formula $\mathbb{E}[Y \mid X = x] = \int y\, f_{Y|X}(y|x)\,dy$.
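The defining property can be verified directly in a toy model. This sketch assumes $Y = X + \varepsilon$ with $X \sim \text{Bernoulli}(1/2)$ and $\varepsilon \sim \text{Uniform}(0, 1)$ independent (an illustrative setup, not from the text), so a version of $\mathbb{E}[Y \mid X]$ is $x \mapsto x + \tfrac{1}{2}$. The check: $Z$ and $Y$ have the same integral over each generating event $\{X = a\}$.

```python
import random

random.seed(1)

# Closed-form version of E[Y | X] under the assumed model Y = X + eps.
def cond_exp(x):
    return x + 0.5

n = 100_000
pairs = []
for _ in range(n):
    x = random.randint(0, 1)
    pairs.append((x, x + random.random()))

# sigma(X) is generated by {X = 0} and {X = 1}; compare the two integrals
# int_A Z dP and int_A Y dP as Monte Carlo averages over each event A.
for a in (0, 1):
    lhs = sum(cond_exp(x) for x, y in pairs if x == a) / n  # int_A Z dP
    rhs = sum(y for x, y in pairs if x == a) / n            # int_A Y dP
    print(a, lhs, rhs)  # the two integrals agree up to Monte Carlo error
```

Note that the check never needs a conditional density: only integrals over $\mathcal{G}$-measurable sets, exactly as in the abstract definition.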

Theorem

Existence of Conditional Expectation

Statement

For any integrable random variable $Y$ and sub-sigma-algebra $\mathcal{G} \subseteq \mathcal{F}$, there exists a $\mathcal{G}$-measurable random variable $Z$ satisfying $\int_A Z\,d\mathbb{P} = \int_A Y\,d\mathbb{P}$ for all $A \in \mathcal{G}$. This $Z = \mathbb{E}[Y \mid \mathcal{G}]$ is unique $\mathbb{P}$-a.s.

Intuition

Think of $L^2(\Omega, \mathcal{F}, \mathbb{P})$ as a Hilbert space. The $\mathcal{G}$-measurable functions form a closed subspace. $\mathbb{E}[Y \mid \mathcal{G}]$ is the orthogonal projection of $Y$ onto this subspace. The projection is the element of the subspace closest to $Y$ in $L^2$ norm, which is the best $\mathcal{G}$-measurable predictor of $Y$ in the mean-squared-error sense.
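The projection picture can be illustrated numerically. This sketch assumes $Y = 2X + \varepsilon$ with $X, \varepsilon$ independent standard normals (an illustrative model), so $\mathbb{E}[Y \mid X] = 2X$; any other function of $X$ should have strictly larger mean-squared error.

```python
import random

random.seed(2)

# Simulate Y = 2X + eps; the regression function E[Y | X] is x -> 2x.
n = 100_000
data = []
for _ in range(n):
    x = random.gauss(0, 1)
    data.append((x, 2 * x + random.gauss(0, 1)))

def mse(g):
    """Empirical E[(Y - g(X))^2] for a candidate sigma(X)-measurable predictor g."""
    return sum((y - g(x)) ** 2 for x, y in data) / n

print(mse(lambda x: 2 * x))        # the projection: MSE ≈ Var(eps) = 1
print(mse(lambda x: 1.7 * x))      # any other function of X does worse
print(mse(lambda x: 2 * x + 0.3))  # worse too
```

The minimizer over all (square-integrable) functions of $X$ is the conditional expectation; the residual MSE is exactly the variance left unexplained by $\sigma(X)$.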

Proof Sketch

Define $\nu(A) = \int_A Y\,d\mathbb{P}$ for $A \in \mathcal{G}$. This is a signed measure on $(\Omega, \mathcal{G})$ that is absolutely continuous with respect to $\mathbb{P}|_{\mathcal{G}}$ (since $\mathbb{P}(A) = 0$ implies $\int_A Y\,d\mathbb{P} = 0$ when $Y$ is integrable). By the Radon-Nikodym theorem for signed measures, $\nu$ has a density $Z$ with respect to $\mathbb{P}|_{\mathcal{G}}$. This $Z$ is $\mathcal{G}$-measurable and satisfies the defining property.

Why It Matters

Conditional expectation is the central object in:

  • Bayesian statistics: the posterior mean is $\mathbb{E}[\theta \mid \text{data}]$
  • Martingale theory: a martingale satisfies $\mathbb{E}[X_{t+1} \mid \mathcal{F}_t] = X_t$
  • Dynamic programming: the Bellman equation involves $\mathbb{E}[V_{t+1} \mid s_t, a_t]$
  • Regression: $\mathbb{E}[Y \mid X]$ is the regression function, the optimal predictor of $Y$ given $X$ under squared loss

Failure Mode

The naive formula $\mathbb{E}[Y \mid X = x] = \int y\, f(y|x)\,dy$ requires $f(x) > 0$ and a well-defined conditional density. For general random variables (not jointly continuous), this formula does not work. The measure-theoretic definition handles all cases but is less intuitive. When working with conditional expectations in proofs, always use the abstract property ($\int_A \mathbb{E}[Y \mid \mathcal{G}]\,d\mathbb{P} = \int_A Y\,d\mathbb{P}$) rather than the density formula.

Properties of Conditional Expectation

The following properties are used constantly in probability and ML theory. Let $\mathcal{G}, \mathcal{H}$ be sub-sigma-algebras with $\mathcal{H} \subseteq \mathcal{G} \subseteq \mathcal{F}$.

Tower property (law of iterated expectations):

$$\mathbb{E}[\mathbb{E}[Y \mid \mathcal{G}] \mid \mathcal{H}] = \mathbb{E}[Y \mid \mathcal{H}]$$

Coarse information washes out finer conditioning. The special case $\mathcal{H} = \{\emptyset, \Omega\}$ gives $\mathbb{E}[\mathbb{E}[Y \mid \mathcal{G}]] = \mathbb{E}[Y]$.

Linearity: $\mathbb{E}[aY + bZ \mid \mathcal{G}] = a\mathbb{E}[Y \mid \mathcal{G}] + b\mathbb{E}[Z \mid \mathcal{G}]$.

Pull-out property: If $X$ is $\mathcal{G}$-measurable and $XY$ is integrable, then $\mathbb{E}[XY \mid \mathcal{G}] = X \cdot \mathbb{E}[Y \mid \mathcal{G}]$.

Jensen's inequality for conditional expectation: If $\varphi$ is convex, then $\varphi(\mathbb{E}[Y \mid \mathcal{G}]) \leq \mathbb{E}[\varphi(Y) \mid \mathcal{G}]$.
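The tower property can be checked by simulation. This sketch assumes $Y = X_1 + X_2 + \varepsilon$ with independent fair coins $X_1, X_2$ and $\varepsilon \sim \text{Uniform}(-1, 1)$ (an illustrative model), with $\mathcal{H} = \sigma(X_1) \subseteq \mathcal{G} = \sigma(X_1, X_2)$. Here $\mathbb{E}[Y \mid \mathcal{G}] = X_1 + X_2$, and conditioning that on $X_1$ should agree with conditioning $Y$ on $X_1$ directly.

```python
import random

random.seed(3)

# H = sigma(X1) is contained in G = sigma(X1, X2).
n = 200_000
samples = []
for _ in range(n):
    x1, x2 = random.randint(0, 1), random.randint(0, 1)
    samples.append((x1, x2, x1 + x2 + random.uniform(-1, 1)))

# E[Y | G] averages out only eps:      (x1, x2) -> x1 + x2
# E[Y | H] averages out eps and X2:    x1 -> x1 + 0.5
results = {}
for a in (0, 1):                        # condition on the event {X1 = a}
    sel = [(x2, y) for x1, x2, y in samples if x1 == a]
    inner = sum(x2 + a for x2, _ in sel) / len(sel)  # E[ E[Y|G] | X1 = a ]
    direct = sum(y for _, y in sel) / len(sel)       # E[ Y      | X1 = a ]
    results[a] = (inner, direct)
    print(a, inner, direct)             # both ≈ a + 0.5
```

Iterating through the finer sigma-algebra $\mathcal{G}$ changes nothing: the coarse conditioning wins, exactly as the tower property states.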

Why "Density" Is Not Just a PDF on $\mathbb{R}$

A common source of confusion: students think "density" always means a function $f: \mathbb{R} \to [0, \infty)$ that integrates to 1 with respect to Lebesgue measure. But a density is a Radon-Nikodym derivative, and the reference measure can be anything:

  • PDF on $\mathbb{R}$: $dP/d\lambda$ where $\lambda$ is Lebesgue measure
  • PMF on $\mathbb{Z}$: $dP/d\mu$ where $\mu$ is counting measure. The PMF $p(k) = P(\{k\})$ is the Radon-Nikodym derivative of $P$ with respect to counting measure
  • Likelihood ratio: $dP_\theta/dP_{\theta_0}$ is a density of one probability measure with respect to another
  • Change of variables: if $Y = g(X)$, the density of $Y$ with respect to Lebesgue measure involves the Jacobian, but this is just the chain rule for Radon-Nikodym derivatives

The unified view: a "density" is always $d\nu/d\mu$ for some pair of measures. The Radon-Nikodym theorem says this exists if and only if $\nu \ll \mu$.
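A concrete discrete instance of the unified view (a sketch with illustrative distributions, not from the text): take $P = \text{Poisson}(2)$ and $Q = \text{Poisson}(1)$ on $\mathbb{Z}_{\geq 0}$. The ratio of PMFs is the density $dP/dQ$, and "integrating" it against $Q$ (a sum, since the reference measure is counting measure weighted by $Q$) returns total mass 1.

```python
import math

def pois_pmf(k, lam):
    """Poisson PMF: the Radon-Nikodym derivative of Poisson(lam) w.r.t. counting measure."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def dPdQ(k):
    """Density of P = Poisson(2) with respect to Q = Poisson(1): ratio of PMFs."""
    return pois_pmf(k, 2.0) / pois_pmf(k, 1.0)   # simplifies to e^{-1} * 2^k

# E_Q[dP/dQ] = 1: the weights "integrate" the density back to total mass one.
total = sum(dPdQ(k) * pois_pmf(k, 1.0) for k in range(60))
print(total)  # ≈ 1.0
```

The same three lines would work for any pair of distributions on the integers, because the PMF ratio is always the density of one with respect to the other wherever both are positive.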

Canonical Examples

Example

Gaussian likelihood ratio

Let $P_0 = \mathcal{N}(0, 1)$ and $P_1 = \mathcal{N}(\mu, 1)$. Since both are absolutely continuous with respect to Lebesgue measure, they are mutually absolutely continuous ($P_0 \ll P_1$ and $P_1 \ll P_0$). The likelihood ratio is:

$$\frac{dP_1}{dP_0}(x) = \frac{f_1(x)}{f_0(x)} = \exp\!\left(\mu x - \frac{\mu^2}{2}\right)$$

By the Neyman-Pearson lemma, thresholding this ratio gives the most powerful test of the simple hypotheses $H_0: X \sim P_0$ vs $H_1: X \sim P_1$. In importance sampling, if you draw $X \sim P_0$ and want $\mathbb{E}_{P_1}[g(X)]$, you compute $\frac{1}{n}\sum_i g(X_i) \cdot \frac{dP_1}{dP_0}(X_i)$.
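The closed form above can be verified against the raw PDF ratio, and then used for importance sampling as described (a minimal sketch; $\mu = 0.7$ and the function names are illustrative):

```python
import math
import random

random.seed(4)

mu = 0.7

def f0(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)        # N(0,1) PDF

def f1(x):
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)  # N(mu,1) PDF

def lr(x):
    """Closed-form likelihood ratio dP1/dP0(x) = exp(mu*x - mu^2/2)."""
    return math.exp(mu * x - mu * mu / 2)

x = 1.3
print(f1(x) / f0(x), lr(x))  # identical up to floating point

# Importance sampling: draw from P0, estimate E_{P1}[X] = mu.
n = 200_000
est = 0.0
for _ in range(n):
    z = random.gauss(0, 1)
    est += z * lr(z)
est /= n
print(est)  # ≈ 0.7
```

The reweighting shifts the mean of the samples from 0 to $\mu$ without ever drawing from $P_1$.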

Example

Conditional expectation of a Gaussian given a linear observation

Let $(X, Y)$ be jointly Gaussian with $\mathbb{E}[X] = \mathbb{E}[Y] = 0$, $\text{Var}(X) = \sigma_X^2$, $\text{Var}(Y) = \sigma_Y^2$, and $\text{Cov}(X, Y) = \rho \sigma_X \sigma_Y$. Then:

$$\mathbb{E}[Y \mid X] = \rho \frac{\sigma_Y}{\sigma_X} X$$

This is a linear function of $X$. The conditional variance is $\text{Var}(Y \mid X) = \sigma_Y^2(1 - \rho^2)$, which does not depend on $X$. For Gaussians, the conditional expectation is always linear, and the conditional variance is always constant. This is the foundation of linear regression.
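A simulation check of the linear formula (a sketch with illustrative parameter values $\sigma_X = 2$, $\sigma_Y = 1.5$, $\rho = 0.6$): sample the joint distribution, then compare the empirical mean of $Y$ over a thin slab around $x_0$ with $\rho \frac{\sigma_Y}{\sigma_X} x_0$.

```python
import math
import random

random.seed(5)

# Jointly Gaussian pair with the stated moments: sample X, then
# Y = rho*(sy/sx)*X + sqrt(1 - rho^2)*sy*Z with Z an independent standard normal.
sx, sy, rho = 2.0, 1.5, 0.6
n = 200_000
data = []
for _ in range(n):
    x = random.gauss(0, sx)
    y = rho * (sy / sx) * x + math.sqrt(1 - rho ** 2) * sy * random.gauss(0, 1)
    data.append((x, y))

# Empirical E[Y | X near x0] vs the linear formula rho*(sy/sx)*x0.
x0 = 1.0
slab = [y for x, y in data if abs(x - x0) < 0.1]
cond_mean = sum(slab) / len(slab)
print(cond_mean)  # ≈ rho * sy / sx * x0 = 0.45
```

Repeating this at several values of $x_0$ traces out the regression line; the spread within each slab stays roughly constant, reflecting the constant conditional variance.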

Common Confusions

Watch Out

Density is not an intrinsic property of a distribution

The density depends on the reference measure. The standard normal has density $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ with respect to Lebesgue measure, but density 1 with respect to itself. A Bernoulli distribution has no density with respect to Lebesgue measure (it is singular), but has a perfectly good density (its PMF) with respect to counting measure. The question "what is the density?" is incomplete without specifying "with respect to what?"

Watch Out

Conditional expectation is a random variable, not a number

$\mathbb{E}[Y \mid X]$ is a function of $X$, hence a random variable. Only when you condition on a specific value $X = x$ do you get a number $\mathbb{E}[Y \mid X = x]$. A common mistake is to treat $\mathbb{E}[Y \mid \mathcal{G}]$ as if it were a fixed number. It is not --- it depends on the outcome $\omega$ through the information in $\mathcal{G}$.

Watch Out

Tower property requires the inclusion $\mathcal{H} \subseteq \mathcal{G}$

The tower property $\mathbb{E}[\mathbb{E}[Y \mid \mathcal{G}] \mid \mathcal{H}] = \mathbb{E}[Y \mid \mathcal{H}]$ requires $\mathcal{H} \subseteq \mathcal{G}$. You are conditioning on coarser information in the outer expectation. If $\mathcal{H}$ and $\mathcal{G}$ are unrelated sigma-algebras, the tower property does not apply. A common error is to apply it when the nesting condition fails.

Summary

  • Absolute continuity $\nu \ll \mu$ means $\nu$ cannot assign positive mass where $\mu$ assigns zero
  • Radon-Nikodym: if $\nu \ll \mu$, then $\nu(A) = \int_A \frac{d\nu}{d\mu}\,d\mu$
  • A "density" is always a Radon-Nikodym derivative with respect to some reference measure
  • Conditional expectation $\mathbb{E}[Y \mid \mathcal{G}]$ is defined as the Radon-Nikodym derivative of $A \mapsto \int_A Y\,d\mathbb{P}$ on $\mathcal{G}$
  • Tower property: $\mathbb{E}[\mathbb{E}[Y \mid \mathcal{G}] \mid \mathcal{H}] = \mathbb{E}[Y \mid \mathcal{H}]$ when $\mathcal{H} \subseteq \mathcal{G}$
  • Likelihood ratios, importance sampling weights, and KL divergence are all functions of Radon-Nikodym derivatives
  • Conditional expectation is a random variable (a function of the conditioning information), not a fixed number

Exercises

ExerciseCore

Problem

Let $P = \mathcal{N}(1, 1)$ and $Q = \mathcal{N}(0, 1)$. Compute the Radon-Nikodym derivative $dP/dQ$ and verify that $\mathbb{E}_Q[dP/dQ] = 1$.

ExerciseAdvanced

Problem

Use the tower property to prove that if $\hat{\theta}$ is an unbiased estimator of $\theta$ (i.e., $\mathbb{E}[\hat{\theta}] = \theta$) and $T$ is a sufficient statistic, then $\tilde{\theta} = \mathbb{E}[\hat{\theta} \mid T]$ is also unbiased and has variance at most that of $\hat{\theta}$ (Rao-Blackwell theorem).

ExerciseResearch

Problem

Give an example where $P \ll Q$ but the importance sampling estimator $\frac{1}{n}\sum_{i=1}^n f(X_i)\frac{dP}{dQ}(X_i)$ with $X_i \sim Q$ has infinite variance, even though $\mathbb{E}_P[f(X)]$ is finite. What property of $dP/dQ$ causes this, and what does it imply for practical importance sampling?

References

Canonical:

  • Billingsley, Probability and Measure (3rd ed., 1995), Chapter 32
  • Durrett, Probability: Theory and Examples (5th ed., 2019), Sections 4.1, 5.1
  • Williams, Probability with Martingales (1991), Chapters 6, 9

Next Topics

Building on the Radon-Nikodym theorem:

  • Maximum likelihood estimation: the likelihood function is a Radon-Nikodym derivative
  • Importance sampling: reweighting by $dP/dQ$ to estimate expectations under $P$ using samples from $Q$
  • Concentration inequalities: the first application of measure-theoretic tools to bounding tail probabilities

Last reviewed: April 2026
