
Mathematical Infrastructure

Radon-Nikodym and Conditional Expectation

The Radon-Nikodym theorem: what 'density' really means. Absolute continuity, the Radon-Nikodym derivative, conditional expectation as a projection, tower property, and why this undergirds likelihood ratios, importance sampling, and KL divergence.


Why This Matters

The word "density" appears on nearly every page of a statistics or ML textbook. But what is a density? It is not just "the derivative of the CDF." Rigorously, a density is a Radon-Nikodym derivative --- the derivative of one measure with respect to another. This single concept unifies:

  • Likelihood ratios: $\frac{dP_\theta}{dP_{\theta_0}}$ is literally a Radon-Nikodym derivative
  • Importance sampling: reweighting samples by $\frac{dP}{dQ}$
  • KL divergence: $D_{\text{KL}}(P \| Q) = \int \log\frac{dP}{dQ}\,dP$
  • Bayesian posteriors: the posterior density is the prior density times the likelihood, normalized
  • Conditional expectation: $\mathbb{E}[Y \mid \mathcal{G}]$ is defined via the Radon-Nikodym theorem

If you skip this topic, you will use "density" as a vague synonym for "PDF on $\mathbb{R}$." You will not understand why likelihood ratios require absolute continuity, why importance sampling can fail catastrophically, or what conditional expectation actually is beyond the formula $\int y\, p(y|x)\,dy$.

Mental Model

Think of two measures $\nu$ and $\mu$ on the same space. If $\nu$ is "compatible" with $\mu$ --- meaning that whenever $\mu$ says a set has zero size, $\nu$ agrees --- then $\nu$ can be expressed as a "weighted version" of $\mu$. The weight function is the Radon-Nikodym derivative $d\nu/d\mu$. It tells you: at each point, how much more (or less) does $\nu$ care about this region compared to $\mu$?

A PDF $f(x)$ is precisely this: it tells you how much the probability measure $P$ weighs each region compared to Lebesgue measure $\lambda$. The formula $P(A) = \int_A f(x)\,dx$ is just $P(A) = \int_A \frac{dP}{d\lambda}\,d\lambda$.
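A quick numerical sanity check of this identity (a minimal sketch; the function names are illustrative, not from any library): treat the standard normal PDF as $dP/d\lambda$ and confirm that integrating it over a set recovers the measure of that set.

```python
import math

def phi(x):
    """Standard normal PDF: the Radon-Nikodym derivative dP/d(lambda)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def P(a, b, n=100_000):
    """P([a, b]) computed as the integral of dP/d(lambda) over [a, b] (midpoint rule)."""
    h = (b - a) / n
    return sum(phi(a + (i + 0.5) * h) for i in range(n)) * h

# Integrating the density over a set recovers the measure of the set:
print(P(0.0, 1.0))          # ≈ 0.3413
print(Phi(1.0) - Phi(0.0))  # same value, via the CDF
```

The same pattern works for any reference measure: only the integral changes, not the role of the density.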

Formal Setup

Definition

Absolute Continuity

Let $\mu$ and $\nu$ be measures on $(\Omega, \mathcal{F})$. We say $\nu$ is absolutely continuous with respect to $\mu$, written $\nu \ll \mu$, if:

$$\mu(A) = 0 \implies \nu(A) = 0 \quad \text{for all } A \in \mathcal{F}$$

Equivalently: every $\mu$-null set is also $\nu$-null. If $\nu$ assigns positive measure to some set that $\mu$ considers negligible, then $\nu$ is not absolutely continuous with respect to $\mu$.

Examples:

  • Any probability distribution with a PDF is absolutely continuous with respect to Lebesgue measure
  • A discrete distribution (point masses) is not absolutely continuous with respect to Lebesgue measure
  • Two Gaussians $\mathcal{N}(\mu_1, \sigma^2)$ and $\mathcal{N}(\mu_2, \sigma^2)$ are mutually absolutely continuous
Definition

Singular Measures

Two measures $\nu$ and $\mu$ are mutually singular, written $\nu \perp \mu$, if there exists a set $A$ such that $\nu(A) = 0$ and $\mu(A^c) = 0$. They "live on disjoint sets."

Example: Lebesgue measure and the counting measure on $\mathbb{Z}$ are mutually singular. Any discrete distribution is singular with respect to any continuous distribution.

Main Theorems

Theorem

Radon-Nikodym Theorem

Statement

If $\nu \ll \mu$ and both are $\sigma$-finite, then there exists a measurable function $f: \Omega \to [0, \infty)$ such that for every $A \in \mathcal{F}$:

$$\nu(A) = \int_A f \, d\mu$$

The function $f$ is called the Radon-Nikodym derivative of $\nu$ with respect to $\mu$, written $f = \frac{d\nu}{d\mu}$. It is unique $\mu$-almost everywhere.

Intuition

The Radon-Nikodym derivative is the "local ratio" of two measures. At each point $\omega$, $\frac{d\nu}{d\mu}(\omega)$ tells you how much more mass $\nu$ puts near $\omega$ compared to $\mu$. If $\nu$ concentrates more mass somewhere, $d\nu/d\mu$ is large there. If $\nu$ puts less mass, $d\nu/d\mu$ is small.

When $\mu$ is Lebesgue measure and $\nu$ is a probability measure with a density, then $\frac{d\nu}{d\mu} = f$ is exactly the PDF. The Radon-Nikodym theorem says: densities exist whenever absolute continuity holds, and not just on $\mathbb{R}$ --- on any measurable space.

Proof Sketch

(Hilbert space proof for finite measures): Consider the measure $\rho = \mu + \nu$. On $L^2(\Omega, \rho)$, the map $g \mapsto \int g \, d\nu$ is a bounded linear functional (by Cauchy-Schwarz, since $\nu \leq \rho$). By the Riesz representation theorem, there exists $h \in L^2(\rho)$ with $\int g \, d\nu = \int gh \, d\rho$ for all $g \in L^2(\rho)$.

One shows $0 \leq h \leq 1$ $\rho$-a.e. (by testing with indicator functions). Then set $f = h/(1 - h)$ on $\{h < 1\}$. Absolute continuity of $\nu$ with respect to $\mu$ ensures $\{h = 1\}$ has $\mu$-measure zero. Then $\nu(A) = \int_A f \, d\mu$.

Why It Matters

The Radon-Nikodym theorem is the rigorous foundation for:

  1. PDFs: $f(x) = dP/d\lambda$ where $\lambda$ is Lebesgue measure. The "density" is not a property of $P$ alone --- it is a relationship between $P$ and a reference measure.

  2. Likelihood ratios: $\frac{dP_\theta}{dP_{\theta_0}}(\omega)$ is the likelihood ratio, which is meaningful only when $P_\theta \ll P_{\theta_0}$. If the two models assign positive probability to disjoint regions, the likelihood ratio does not exist.

  3. Importance sampling: $\mathbb{E}_P[f(X)] = \mathbb{E}_Q[f(X) \cdot \frac{dP}{dQ}(X)]$. This requires $P \ll Q$; if $Q$ assigns zero probability to a region where $P$ is positive, you will never sample there and the estimator is biased.

  4. KL divergence: $D_{\text{KL}}(P \| Q) = \int \log\frac{dP}{dQ}\,dP$. This requires $P \ll Q$; if not, $D_{\text{KL}} = +\infty$.
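Point 3 can be checked with a short simulation. This is a minimal sketch, assuming target $P = \mathcal{N}(1, 1)$ and proposal $Q = \mathcal{N}(0, 1)$ (illustrative choices, not from the text): a short calculation with the two Gaussian densities gives the weight $\frac{dP}{dQ}(x) = \exp(x - \tfrac{1}{2})$.

```python
import math
import random

random.seed(0)

# Target P = N(1, 1), proposal Q = N(0, 1). Both have Lebesgue densities,
# so P << Q and the weight dP/dQ(x) = exp(x - 1/2) exists everywhere.
def weight(x):
    return math.exp(x - 0.5)

def f(x):
    return x * x  # estimate E_P[X^2] = Var(X) + E[X]^2 = 1 + 1 = 2

n = 200_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]   # samples from Q
est = sum(f(x) * weight(x) for x in xs) / n       # Monte Carlo E_Q[f * dP/dQ]
print(est)  # ≈ 2.0
```

If instead $Q$ put zero mass where $P$ is positive, no reweighting could repair the estimator: the missing region is simply never sampled.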

Failure Mode

Without absolute continuity, the Radon-Nikodym derivative does not exist. If $\nu$ is a point mass at $x_0$ and $\mu$ is Lebesgue measure, then $\nu(\{x_0\}) = 1$ but $\mu(\{x_0\}) = 0$, so $\nu$ is not absolutely continuous with respect to $\mu$. There is no function $f$ such that $\nu(A) = \int_A f\,d\mu$. This is why you cannot write a "PDF" for a discrete distribution with respect to Lebesgue measure.

Conditional Expectation

Definition

Conditional Expectation (Measure-Theoretic)

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, $Y$ an integrable random variable, and $\mathcal{G} \subseteq \mathcal{F}$ a sub-sigma-algebra. The conditional expectation $\mathbb{E}[Y \mid \mathcal{G}]$ is the $\mathcal{G}$-measurable random variable $Z$ satisfying:

$$\int_A Z \, d\mathbb{P} = \int_A Y \, d\mathbb{P} \quad \text{for all } A \in \mathcal{G}$$

It exists (by the Radon-Nikodym theorem applied to the signed measure $A \mapsto \int_A Y\,d\mathbb{P}$ restricted to $\mathcal{G}$) and is unique $\mathbb{P}$-almost surely.

Why is this the right definition? The condition says: $\mathbb{E}[Y \mid \mathcal{G}]$ is the "best guess" of $Y$ given only the information in $\mathcal{G}$, in the sense that it has the same integral as $Y$ over every $\mathcal{G}$-measurable set. It is a projection of $Y$ onto the space of $\mathcal{G}$-measurable functions.

When $\mathcal{G} = \sigma(X)$ (the sigma-algebra generated by a random variable $X$), we write $\mathbb{E}[Y \mid X]$, which is a function of $X$. In the special case where $X$ and $Y$ are jointly continuous with density, this reduces to the familiar formula $\mathbb{E}[Y \mid X = x] = \int y\, f_{Y|X}(y|x)\,dy$.
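The defining property can be verified directly in a toy model. This sketch assumes $Y = X + \varepsilon$ with $X \sim \text{Bernoulli}(1/2)$ and $\varepsilon \sim \text{Uniform}(0, 1)$ independent (an illustrative setup, not from the text), so a version of $\mathbb{E}[Y \mid X]$ is $x \mapsto x + \tfrac{1}{2}$. The check: $Z$ and $Y$ have the same integral over each generating event $\{X = a\}$.

```python
import random

random.seed(1)

# Closed-form version of E[Y | X] under the assumed model Y = X + eps.
def cond_exp(x):
    return x + 0.5

n = 100_000
pairs = []
for _ in range(n):
    x = random.randint(0, 1)
    pairs.append((x, x + random.random()))

# sigma(X) is generated by {X = 0} and {X = 1}; compare the two integrals
# int_A Z dP and int_A Y dP as Monte Carlo averages over each event A.
for a in (0, 1):
    lhs = sum(cond_exp(x) for x, y in pairs if x == a) / n  # int_A Z dP
    rhs = sum(y for x, y in pairs if x == a) / n            # int_A Y dP
    print(a, lhs, rhs)  # the two integrals agree up to Monte Carlo error
```

Note that the check never needs a conditional density: only integrals over $\mathcal{G}$-measurable sets, exactly as in the abstract definition.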

Theorem

Existence of Conditional Expectation

Statement

For any integrable random variable $Y$ and sub-sigma-algebra $\mathcal{G} \subseteq \mathcal{F}$, there exists a $\mathcal{G}$-measurable random variable $Z$ satisfying $\int_A Z\,d\mathbb{P} = \int_A Y\,d\mathbb{P}$ for all $A \in \mathcal{G}$. This $Z = \mathbb{E}[Y \mid \mathcal{G}]$ is unique $\mathbb{P}$-a.s.

Intuition

Think of $L^2(\Omega, \mathcal{F}, \mathbb{P})$ as a Hilbert space. The $\mathcal{G}$-measurable functions form a closed subspace. $\mathbb{E}[Y \mid \mathcal{G}]$ is the orthogonal projection of $Y$ onto this subspace. The projection is the element of the subspace closest to $Y$ in $L^2$ norm, which is the best $\mathcal{G}$-measurable predictor of $Y$ in the mean-squared-error sense.
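The projection picture can be illustrated numerically. This sketch assumes $Y = 2X + \varepsilon$ with $X, \varepsilon$ independent standard normals (an illustrative model), so $\mathbb{E}[Y \mid X] = 2X$; any other function of $X$ should have strictly larger mean-squared error.

```python
import random

random.seed(2)

# Simulate Y = 2X + eps; the regression function E[Y | X] is x -> 2x.
n = 100_000
data = []
for _ in range(n):
    x = random.gauss(0, 1)
    data.append((x, 2 * x + random.gauss(0, 1)))

def mse(g):
    """Empirical E[(Y - g(X))^2] for a candidate sigma(X)-measurable predictor g."""
    return sum((y - g(x)) ** 2 for x, y in data) / n

print(mse(lambda x: 2 * x))        # the projection: MSE ≈ Var(eps) = 1
print(mse(lambda x: 1.7 * x))      # any other function of X does worse
print(mse(lambda x: 2 * x + 0.3))  # worse too
```

The minimizer over all (square-integrable) functions of $X$ is the conditional expectation; the residual MSE is exactly the variance left unexplained by $\sigma(X)$.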

Proof Sketch

Define $\nu(A) = \int_A Y\,d\mathbb{P}$ for $A \in \mathcal{G}$. This is a signed measure on $(\Omega, \mathcal{G})$ that is absolutely continuous with respect to $\mathbb{P}|_{\mathcal{G}}$ (since $\mathbb{P}(A) = 0$ implies $\int_A Y\,d\mathbb{P} = 0$ when $Y$ is integrable). By the Radon-Nikodym theorem for signed measures, $\nu$ has a density $Z$ with respect to $\mathbb{P}|_{\mathcal{G}}$. This $Z$ is $\mathcal{G}$-measurable and satisfies the defining property.

Why It Matters

Conditional expectation is the central object in:

  • Bayesian statistics: the posterior mean is $\mathbb{E}[\theta \mid \text{data}]$
  • Martingale theory: a martingale satisfies $\mathbb{E}[X_{t+1} \mid \mathcal{F}_t] = X_t$
  • Dynamic programming: the Bellman equation involves $\mathbb{E}[V_{t+1} \mid s_t, a_t]$
  • Regression: $\mathbb{E}[Y \mid X]$ is the regression function, the optimal predictor of $Y$ given $X$ under squared loss

Failure Mode

The naive formula $\mathbb{E}[Y \mid X = x] = \int y\, f(y|x)\,dy$ requires $f(x) > 0$ and a well-defined conditional density. For general random variables (not jointly continuous), this formula does not work. The measure-theoretic definition handles all cases but is less intuitive. When working with conditional expectations in proofs, always use the abstract property ($\int_A \mathbb{E}[Y \mid \mathcal{G}]\,d\mathbb{P} = \int_A Y\,d\mathbb{P}$) rather than the density formula.

Properties of Conditional Expectation

The following properties are used constantly in probability and ML theory. Let $\mathcal{G}, \mathcal{H}$ be sub-sigma-algebras with $\mathcal{H} \subseteq \mathcal{G} \subseteq \mathcal{F}$.

Tower property (law of iterated expectations):

$$\mathbb{E}[\mathbb{E}[Y \mid \mathcal{G}] \mid \mathcal{H}] = \mathbb{E}[Y \mid \mathcal{H}]$$

Coarse information washes out finer conditioning. The special case $\mathcal{H} = \{\emptyset, \Omega\}$ gives $\mathbb{E}[\mathbb{E}[Y \mid \mathcal{G}]] = \mathbb{E}[Y]$.

Linearity: $\mathbb{E}[aY + bZ \mid \mathcal{G}] = a\mathbb{E}[Y \mid \mathcal{G}] + b\mathbb{E}[Z \mid \mathcal{G}]$.

Pull-out property: If $X$ is $\mathcal{G}$-measurable and $XY$ is integrable, then $\mathbb{E}[XY \mid \mathcal{G}] = X \cdot \mathbb{E}[Y \mid \mathcal{G}]$.

Jensen's inequality for conditional expectation: If $\varphi$ is convex, then $\varphi(\mathbb{E}[Y \mid \mathcal{G}]) \leq \mathbb{E}[\varphi(Y) \mid \mathcal{G}]$.
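The tower property can be checked by simulation. This sketch assumes $Y = X_1 + X_2 + \varepsilon$ with independent fair coins $X_1, X_2$ and $\varepsilon \sim \text{Uniform}(-1, 1)$ (an illustrative model), with $\mathcal{H} = \sigma(X_1) \subseteq \mathcal{G} = \sigma(X_1, X_2)$. Here $\mathbb{E}[Y \mid \mathcal{G}] = X_1 + X_2$, and conditioning that on $X_1$ should agree with conditioning $Y$ on $X_1$ directly.

```python
import random

random.seed(3)

# H = sigma(X1) is contained in G = sigma(X1, X2).
n = 200_000
samples = []
for _ in range(n):
    x1, x2 = random.randint(0, 1), random.randint(0, 1)
    samples.append((x1, x2, x1 + x2 + random.uniform(-1, 1)))

# E[Y | G] averages out only eps:      (x1, x2) -> x1 + x2
# E[Y | H] averages out eps and X2:    x1 -> x1 + 0.5
results = {}
for a in (0, 1):                        # condition on the event {X1 = a}
    sel = [(x2, y) for x1, x2, y in samples if x1 == a]
    inner = sum(x2 + a for x2, _ in sel) / len(sel)  # E[ E[Y|G] | X1 = a ]
    direct = sum(y for _, y in sel) / len(sel)       # E[ Y      | X1 = a ]
    results[a] = (inner, direct)
    print(a, inner, direct)             # both ≈ a + 0.5
```

Iterating through the finer sigma-algebra $\mathcal{G}$ changes nothing: the coarse conditioning wins, exactly as the tower property states.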

Why "Density" Is Not Just a PDF on $\mathbb{R}$

A common source of confusion: students think "density" always means a function $f: \mathbb{R} \to [0, \infty)$ that integrates to 1 with respect to Lebesgue measure. But a density is a Radon-Nikodym derivative, and the reference measure can be anything:

  • PDF on $\mathbb{R}$: $dP/d\lambda$ where $\lambda$ is Lebesgue measure
  • PMF on $\mathbb{Z}$: $dP/d\mu$ where $\mu$ is counting measure. The PMF $p(k) = P(\{k\})$ is the Radon-Nikodym derivative of $P$ with respect to counting measure
  • Likelihood ratio: $dP_\theta/dP_{\theta_0}$ is a density of one probability measure with respect to another
  • Change of variables: if $Y = g(X)$, the density of $Y$ with respect to Lebesgue measure involves the Jacobian, but this is just the chain rule for Radon-Nikodym derivatives

The unified view: a "density" is always $d\nu/d\mu$ for some pair of measures. The Radon-Nikodym theorem says this exists if and only if $\nu \ll \mu$.
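A concrete discrete instance of the unified view (a sketch with illustrative distributions, not from the text): take $P = \text{Poisson}(2)$ and $Q = \text{Poisson}(1)$ on $\mathbb{Z}_{\geq 0}$. The ratio of PMFs is the density $dP/dQ$, and "integrating" it against $Q$ (a sum, since the reference measure is counting measure weighted by $Q$) returns total mass 1.

```python
import math

def pois_pmf(k, lam):
    """Poisson PMF: the Radon-Nikodym derivative of Poisson(lam) w.r.t. counting measure."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def dPdQ(k):
    """Density of P = Poisson(2) with respect to Q = Poisson(1): ratio of PMFs."""
    return pois_pmf(k, 2.0) / pois_pmf(k, 1.0)   # simplifies to e^{-1} * 2^k

# E_Q[dP/dQ] = 1: the weights "integrate" the density back to total mass one.
total = sum(dPdQ(k) * pois_pmf(k, 1.0) for k in range(60))
print(total)  # ≈ 1.0
```

The same three lines would work for any pair of distributions on the integers, because the PMF ratio is always the density of one with respect to the other wherever both are positive.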

Canonical Examples

Example

Gaussian likelihood ratio

Let $P_0 = \mathcal{N}(0, 1)$ and $P_1 = \mathcal{N}(\mu, 1)$. Since both are absolutely continuous with respect to Lebesgue measure, they are mutually absolutely continuous ($P_0 \ll P_1$ and $P_1 \ll P_0$). The likelihood ratio is:

$$\frac{dP_1}{dP_0}(x) = \frac{f_1(x)}{f_0(x)} = \exp\!\left(\mu x - \frac{\mu^2}{2}\right)$$

By the Neyman-Pearson lemma, thresholding this ratio gives the most powerful test of the simple hypotheses $H_0: X \sim P_0$ vs $H_1: X \sim P_1$. In importance sampling, if you draw $X \sim P_0$ and want $\mathbb{E}_{P_1}[g(X)]$, you compute $\frac{1}{n}\sum_i g(X_i) \cdot \frac{dP_1}{dP_0}(X_i)$.
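The closed form above can be verified against the raw PDF ratio, and then used for importance sampling as described (a minimal sketch; $\mu = 0.7$ and the function names are illustrative):

```python
import math
import random

random.seed(4)

mu = 0.7

def f0(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)        # N(0,1) PDF

def f1(x):
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)  # N(mu,1) PDF

def lr(x):
    """Closed-form likelihood ratio dP1/dP0(x) = exp(mu*x - mu^2/2)."""
    return math.exp(mu * x - mu * mu / 2)

x = 1.3
print(f1(x) / f0(x), lr(x))  # identical up to floating point

# Importance sampling: draw from P0, estimate E_{P1}[X] = mu.
n = 200_000
est = 0.0
for _ in range(n):
    z = random.gauss(0, 1)
    est += z * lr(z)
est /= n
print(est)  # ≈ 0.7
```

The reweighting shifts the mean of the samples from 0 to $\mu$ without ever drawing from $P_1$.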

Example

Conditional expectation of a Gaussian given a linear observation

Let $(X, Y)$ be jointly Gaussian with $\mathbb{E}[X] = \mathbb{E}[Y] = 0$, $\text{Var}(X) = \sigma_X^2$, $\text{Var}(Y) = \sigma_Y^2$, and $\text{Cov}(X, Y) = \rho \sigma_X \sigma_Y$. Then:

$$\mathbb{E}[Y \mid X] = \rho \frac{\sigma_Y}{\sigma_X} X$$

This is a linear function of $X$. The conditional variance is $\text{Var}(Y \mid X) = \sigma_Y^2(1 - \rho^2)$, which does not depend on $X$. For Gaussians, the conditional expectation is always linear, and the conditional variance is always constant. This is the foundation of linear regression.
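A simulation check of the linear formula (a sketch with illustrative parameter values $\sigma_X = 2$, $\sigma_Y = 1.5$, $\rho = 0.6$): sample the joint distribution, then compare the empirical mean of $Y$ over a thin slab around $x_0$ with $\rho \frac{\sigma_Y}{\sigma_X} x_0$.

```python
import math
import random

random.seed(5)

# Jointly Gaussian pair with the stated moments: sample X, then
# Y = rho*(sy/sx)*X + sqrt(1 - rho^2)*sy*Z with Z an independent standard normal.
sx, sy, rho = 2.0, 1.5, 0.6
n = 200_000
data = []
for _ in range(n):
    x = random.gauss(0, sx)
    y = rho * (sy / sx) * x + math.sqrt(1 - rho ** 2) * sy * random.gauss(0, 1)
    data.append((x, y))

# Empirical E[Y | X near x0] vs the linear formula rho*(sy/sx)*x0.
x0 = 1.0
slab = [y for x, y in data if abs(x - x0) < 0.1]
cond_mean = sum(slab) / len(slab)
print(cond_mean)  # ≈ rho * sy / sx * x0 = 0.45
```

Repeating this at several values of $x_0$ traces out the regression line; the spread within each slab stays roughly constant, reflecting the constant conditional variance.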

Common Confusions

Watch Out

Density is not an intrinsic property of a distribution

The density depends on the reference measure. The standard normal has density $\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$ with respect to Lebesgue measure, but density 1 with respect to itself. A Bernoulli distribution has no density with respect to Lebesgue measure (it is singular), but has a perfectly good density (its PMF) with respect to counting measure. The question "what is the density?" is incomplete without specifying "with respect to what?"

Watch Out

Conditional expectation is a random variable, not a number

$\mathbb{E}[Y \mid X]$ is a function of $X$, hence a random variable. Only when you condition on a specific value $X = x$ do you get a number $\mathbb{E}[Y \mid X = x]$. A common mistake is to treat $\mathbb{E}[Y \mid \mathcal{G}]$ as if it were a fixed number. It is not --- it depends on the outcome $\omega$ through the information in $\mathcal{G}$.

Watch Out

Tower property requires the inclusion $\mathcal{H} \subseteq \mathcal{G}$

The tower property $\mathbb{E}[\mathbb{E}[Y \mid \mathcal{G}] \mid \mathcal{H}] = \mathbb{E}[Y \mid \mathcal{H}]$ requires $\mathcal{H} \subseteq \mathcal{G}$. You are conditioning on coarser information in the outer expectation. If $\mathcal{H}$ and $\mathcal{G}$ are unrelated sigma-algebras, the tower property does not apply. A common error is to apply it when the nesting condition fails.

Summary

  • Absolute continuity $\nu \ll \mu$ means $\nu$ cannot assign positive mass where $\mu$ assigns zero
  • Radon-Nikodym: if $\nu \ll \mu$, then $\nu(A) = \int_A \frac{d\nu}{d\mu}\,d\mu$
  • A "density" is always a Radon-Nikodym derivative with respect to some reference measure
  • Conditional expectation $\mathbb{E}[Y \mid \mathcal{G}]$ is defined as the Radon-Nikodym derivative of $A \mapsto \int_A Y\,d\mathbb{P}$ on $\mathcal{G}$
  • Tower property: $\mathbb{E}[\mathbb{E}[Y \mid \mathcal{G}] \mid \mathcal{H}] = \mathbb{E}[Y \mid \mathcal{H}]$ when $\mathcal{H} \subseteq \mathcal{G}$
  • Likelihood ratios, importance sampling weights, and KL divergence are all functions of Radon-Nikodym derivatives
  • Conditional expectation is a random variable (a function of the conditioning information), not a fixed number

Exercises

ExerciseCore

Problem

Let $P = \mathcal{N}(1, 1)$ and $Q = \mathcal{N}(0, 1)$. Compute the Radon-Nikodym derivative $dP/dQ$ and verify that $\mathbb{E}_Q[dP/dQ] = 1$.

ExerciseAdvanced

Problem

Use the tower property to prove that if $\hat{\theta}$ is an unbiased estimator of $\theta$ (i.e., $\mathbb{E}[\hat{\theta}] = \theta$) and $T$ is a sufficient statistic, then $\tilde{\theta} = \mathbb{E}[\hat{\theta} \mid T]$ is also unbiased and has variance at most that of $\hat{\theta}$ (Rao-Blackwell theorem).

ExerciseResearch

Problem

Give an example where $P \ll Q$ but the importance sampling estimator $\frac{1}{n}\sum_{i=1}^n f(X_i)\frac{dP}{dQ}(X_i)$ with $X_i \sim Q$ has infinite variance, even though $\mathbb{E}_P[f(X)]$ is finite. What property of $dP/dQ$ causes this, and what does it imply for practical importance sampling?

References

Canonical:

  • Billingsley, Probability and Measure (3rd ed., 1995), Chapter 32
  • Durrett, Probability: Theory and Examples (5th ed., 2019), Sections 4.1, 5.1
  • Williams, Probability with Martingales (1991), Chapters 6, 9

Next Topics

Building on the Radon-Nikodym theorem:

  • Maximum likelihood estimation: the likelihood function is a Radon-Nikodym derivative
  • Importance sampling: reweighting by $dP/dQ$ to estimate expectations under $P$ using samples from $Q$
  • Concentration inequalities: the first application of measure-theoretic tools to bounding tail probabilities

Last reviewed: April 2026
