Mathematical Infrastructure
Measure-Theoretic Probability
The rigorous foundations of probability: sigma-algebras, measures, measurable functions as random variables, Lebesgue integration, and the convergence theorems that make modern probability and statistics possible.
Why This Matters
Every rigorous result in probability, statistics, and machine learning theory rests on measure-theoretic foundations. When you write $\mathbb{E}[X]$ or $P(A)$, you are using the Lebesgue integral and probability measures, whether you know it or not. These are the foundations beneath expectation, variance, and moments.
Why can't you just use "naive" probability (counting outcomes) or Riemann integration? Three reasons:
- Conditional expectation on continuous random variables requires measure theory. The expression $P(A \mid X = x)$ does not make sense as a ratio $P(A \cap \{X = x\}) / P(X = x)$ because $P(X = x) = 0$ for continuous $X$. You need the Radon-Nikodym theorem.
- Convergence of integrals requires Lebesgue's dominated convergence theorem (DCT). When you interchange limits and expectations (which happens in every consistency proof, every asymptotic argument, every convergence theorem), you are implicitly using DCT. The Riemann integral does not support this interchange in general.
- Not all subsets are measurable. The Banach-Tarski paradox shows that without restricting to a sigma-algebra, you can "construct" sets with no meaningful volume. Measure theory tells you which sets you are allowed to assign probabilities to.
If you skip measure theory, you will be able to use results in ML theory but not understand why the proofs are valid or where they might break.
Mental Model
Think of measure theory as providing three layers of infrastructure:
- What can you measure? The sigma-algebra $\mathcal{F}$ tells you which events you are allowed to assign probabilities to. Not all subsets of $\mathbb{R}$ are measurable (Vitali sets), so you must specify the collection of "nice" sets upfront.
- How do you measure? The measure $\mu$ (or probability $P$) assigns a non-negative number to each measurable set, satisfying countable additivity: $\mu\big(\bigcup_{i=1}^\infty A_i\big) = \sum_{i=1}^\infty \mu(A_i)$ for disjoint $A_i$.
- How do you integrate? The Lebesgue integral generalizes the Riemann integral to handle limits, conditional expectations, and abstract spaces. Expectation is just integration: $\mathbb{E}[X] = \int_\Omega X \, dP$.
Formal Setup
Sigma-Algebra
A sigma-algebra (or $\sigma$-algebra) $\mathcal{F}$ on a set $\Omega$ is a collection of subsets of $\Omega$ satisfying:
- $\Omega \in \mathcal{F}$ (the whole space is measurable)
- If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$ (closed under complements)
- If $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^\infty A_i \in \mathcal{F}$ (closed under countable unions)
Why we need this: Without restricting to a sigma-algebra, we could construct non-measurable sets (Vitali set) that cannot consistently be assigned a probability. The sigma-algebra is the collection of events for which probability is well-defined.
The Borel sigma-algebra $\mathcal{B}(\mathbb{R})$ is the sigma-algebra generated by all open sets in $\mathbb{R}$. It contains all intervals, all open sets, all closed sets, and all countable operations on these. Every set you will encounter in applied probability is Borel-measurable.
Measure
A measure $\mu$ on $(\Omega, \mathcal{F})$ is a function $\mu : \mathcal{F} \to [0, \infty]$ satisfying:
- $\mu(\emptyset) = 0$
- Countable additivity: If $A_1, A_2, \ldots \in \mathcal{F}$ are pairwise disjoint, then $\mu\big(\bigcup_{i=1}^\infty A_i\big) = \sum_{i=1}^\infty \mu(A_i)$
The triple $(\Omega, \mathcal{F}, \mu)$ is called a measure space.
Probability Measure
A probability measure $P$ is a measure with $P(\Omega) = 1$. The triple $(\Omega, \mathcal{F}, P)$ is a probability space.
This is the formal foundation: $\Omega$ is the sample space (set of all outcomes), $\mathcal{F}$ is the set of events, and $P$ assigns probabilities. The three axioms of probability (Kolmogorov) are exactly the axioms of a probability measure.
Measurable Function / Random Variable
A function $X : \Omega \to \mathbb{R}$ is measurable if for every Borel set $B \in \mathcal{B}(\mathbb{R})$:
$X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\} \in \mathcal{F}$
A random variable is simply a measurable function from a probability space to $\mathbb{R}$. The measurability condition ensures that events like $\{X \le t\}$ and $\{a < X < b\}$ are in $\mathcal{F}$, so you can assign probabilities to them.
The Lebesgue Integral
Lebesgue Integral
The Lebesgue integral of a measurable function $f$ with respect to measure $\mu$ is constructed in three steps:
Step 1 (Simple functions): A simple function $s = \sum_{i=1}^n a_i \mathbf{1}_{A_i}$ (finite linear combination of indicator functions) has integral $\int s \, d\mu = \sum_{i=1}^n a_i \, \mu(A_i)$.
Step 2 (Non-negative functions): For $f \ge 0$ measurable: $\int f \, d\mu = \sup\big\{\int s \, d\mu : s \text{ simple}, \ 0 \le s \le f\big\}$.
Step 3 (General functions): Write $f = f^+ - f^-$ where $f^+ = \max(f, 0)$ and $f^- = \max(-f, 0)$. Then $\int f \, d\mu = \int f^+ \, d\mu - \int f^- \, d\mu$ (provided at least one is finite).
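To make the three-step construction concrete, here is a small numeric sketch (the function `lebesgue_integral`, the grid size, and the example $f(x) = x^2$ are illustrative choices, not from the source). It integrates the standard approximating simple function $s_n = \min(\lfloor 2^n f \rfloor / 2^n, \, n)$ on $[0,1]$, estimating the Lebesgue measure of each level set by the fraction of a fine grid of domain points that lands in it:

```python
import math

def lebesgue_integral(f, n, grid_size=200_000):
    """Integral of the simple function s_n = min(floor(2^n f)/2^n, n) on [0, 1].

    The measure of each level set {s_n = a} is estimated by the fraction of
    grid midpoints that s_n maps to the value a.
    """
    counts = {}
    for i in range(grid_size):
        x = (i + 0.5) / grid_size
        level = min(math.floor(f(x) * 2**n) / 2**n, n)  # value of s_n at x
        counts[level] = counts.get(level, 0) + 1
    # integral of a simple function: sum of value * measure(level set)
    return sum(a * c / grid_size for a, c in counts.items())

f = lambda x: x * x
# the approximations increase toward the true integral 1/3
approximations = [lebesgue_integral(f, n) for n in (1, 2, 4, 8)]
print(approximations)
```

The dyadic refinement makes $s_n$ increase pointwise to $f$, so the printed values increase toward $1/3$ from below, exactly as the Step 2 supremum suggests.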
Lebesgue vs. Riemann. The Riemann integral partitions the domain into small intervals and approximates the function on each. The Lebesgue integral partitions the range and measures how much of the domain maps into each part. This is why Lebesgue can integrate functions that Riemann cannot (like $\mathbf{1}_{\mathbb{Q}}$, the indicator of the rationals).
For probability: $\mathbb{E}[X] = \int_\Omega X \, dP$. This is just the Lebesgue integral of the random variable $X$ against the probability measure $P$.
Main Theorems
The three convergence theorems are the workhorses of measure-theoretic probability. They tell you when you can interchange limits and integrals.
Monotone Convergence Theorem (MCT)
Statement
If $0 \le f_1 \le f_2 \le \cdots$ are measurable functions with $f_n \uparrow f$ pointwise, then:
$\lim_{n \to \infty} \int f_n \, d\mu = \int f \, d\mu$
Equivalently: $\lim_n \int f_n \, d\mu = \int \lim_n f_n \, d\mu$.
Intuition
For non-negative, non-decreasing sequences, the limit of the integrals equals the integral of the limit. No additional conditions are needed beyond monotonicity and non-negativity. Intuitively: if the functions grow monotonically toward $f$, the areas under them grow monotonically toward the area under $f$.
Proof Sketch
One direction ($\le$) is immediate: since $f_n \le f$, we have $\int f_n \, d\mu \le \int f \, d\mu$, so $\lim_n \int f_n \, d\mu \le \int f \, d\mu$.
The other direction ($\ge$) uses the definition of the Lebesgue integral as a supremum over simple functions. For any simple $s \le f$ and any $c \in (0, 1)$, the sets $E_n = \{f_n \ge c\,s\}$ increase to $\Omega$. Then $\int f_n \, d\mu \ge c \int_{E_n} s \, d\mu$. Taking $n \to \infty$: $\lim_n \int f_n \, d\mu \ge c \int s \, d\mu$. Since $c < 1$ and $s \le f$ are arbitrary, $\lim_n \int f_n \, d\mu \ge \int f \, d\mu$.
Why It Matters
MCT is the foundation for all other convergence theorems. It is used to prove Fatou's lemma, which in turn is used to prove DCT. In probability, MCT justifies interchanging expectations with monotone limits: if $0 \le X_n \uparrow X$, then $\mathbb{E}[X_n] \uparrow \mathbb{E}[X]$. This is used constantly when working with truncated random variables, constructing the Lebesgue integral itself, and proving properties of conditional expectation.
Failure Mode
MCT requires non-negativity and monotonicity. Without monotonicity, the conclusion can fail: $f_n = \mathbf{1}_{[n, n+1]}$ on $\mathbb{R}$ has $\int f_n \, d\mu = 1$ for all $n$ but $f_n \to 0$ pointwise, so $\lim_n \int f_n \, d\mu = 1 \ne 0 = \int \lim_n f_n \, d\mu$. (These $f_n$ are not monotone.) Without non-negativity, consider $f_n = -\frac{1}{n} \mathbf{1}_{[0, n]}$ on $\mathbb{R}$: the sequence increases pointwise to $0$, yet $\int f_n \, d\mu = -1$ for all $n$, so the integrals do not converge to $0$.
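A numeric sketch of the non-monotone counterexample (helper names and grid parameters are illustrative): each sliding bump $f_n = \mathbf{1}_{[n, n+1)}$ has integral 1, but at any fixed point the sequence is eventually 0.

```python
def f(n, x):
    """The sliding bump f_n = indicator of [n, n+1)."""
    return 1.0 if n <= x < n + 1 else 0.0

def integral(n, xmax=10.0, steps=100_000):
    """Midpoint approximation of the integral of f_n over [0, xmax]."""
    dx = xmax / steps
    return sum(f(n, (i + 0.5) * dx) for i in range(steps)) * dx

print([round(integral(n), 3) for n in range(5)])  # every bump has area 1
print([f(n, 2.5) for n in range(5)])              # at a fixed x the values are eventually 0
```

The integrals stay at 1 while the pointwise values die out, so limit and integral cannot be interchanged here.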
Fatou's Lemma
Statement
If $f_n \ge 0$ are measurable, then:
$\int \liminf_{n} f_n \, d\mu \le \liminf_{n} \int f_n \, d\mu$
In probability notation: $\mathbb{E}[\liminf_n X_n] \le \liminf_n \mathbb{E}[X_n]$ for $X_n \ge 0$.
Intuition
"The integral of the limit is at most the limit of the integrals." Fatou says: mass can disappear in the limit (escape to infinity or concentrate on a null set), so the limit might have less integral than you expect. But mass cannot spontaneously appear, so the integral of the limit is a lower bound.
Proof Sketch
Define $g_n = \inf_{k \ge n} f_k$. Then $g_n \le f_n$ and $g_n$ is non-decreasing with $g_n \uparrow \liminf_n f_n$. By MCT: $\int g_n \, d\mu \to \int \liminf_n f_n \, d\mu$. Since $g_n \le f_n$: $\int g_n \, d\mu \le \int f_n \, d\mu$. Therefore $\int \liminf_n f_n \, d\mu = \lim_n \int g_n \, d\mu \le \liminf_n \int f_n \, d\mu$.
Why It Matters
Fatou's lemma is the key tool when you do not have the conditions for DCT (no dominating function). It gives you at least a one-sided inequality, which is often enough. In probability, Fatou is used to prove the lower semicontinuity of variance, the consistency of risk functionals, and many other one-sided limit results.
Failure Mode
The inequality can be strict. Classic example: $f_n = n \, \mathbf{1}_{(0, 1/n)}$ on $[0, 1]$. Then $\int f_n \, d\mu = 1$ for all $n$, but $f_n \to 0$ pointwise, so $\int \liminf_n f_n \, d\mu = 0 < 1 = \liminf_n \int f_n \, d\mu$. Mass "escapes" to the spike at zero.
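A quick numeric view of the spike example (the helper names and grid size are illustrative): the integral of each $f_n = n \, \mathbf{1}_{(0, 1/n)}$ stays at 1, while the pointwise values at any fixed $x > 0$ eventually vanish.

```python
def f(n, x):
    """The spike f_n = n on (0, 1/n), 0 elsewhere."""
    return n if 0 < x < 1 / n else 0

def integral(n, steps=100_000):
    """Midpoint approximation of the integral of f_n over [0, 1]."""
    dx = 1 / steps
    return sum(f(n, (i + 0.5) * dx) for i in range(steps)) * dx

print([round(integral(n), 2) for n in range(1, 6)])  # each spike has area 1
print([f(n, 0.3) for n in range(1, 11)])             # at x = 0.3 the spike eventually misses
```

The limit function is identically 0, so its integral is 0, strictly below the liminf of the integrals, exactly as Fatou allows.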
Dominated Convergence Theorem (DCT)
Statement
If $f_n \to f$ pointwise (or $\mu$-almost everywhere), and there exists an integrable function $g$ with $|f_n| \le g$ for all $n$, then:
$\lim_{n \to \infty} \int f_n \, d\mu = \int f \, d\mu$
Equivalently: $\lim_n \int f_n \, d\mu = \int \lim_n f_n \, d\mu$.
In probability notation: if $|X_n| \le Y$ with $\mathbb{E}[Y] < \infty$ and $X_n \to X$ a.s., then $\mathbb{E}[X_n] \to \mathbb{E}[X]$.
Intuition
DCT says: if the functions converge pointwise and are all bounded by a single integrable function (the "dominator"), then you can swap the limit and the integral. The dominator prevents mass from escaping to infinity. It acts as an "envelope" that keeps all the $f_n$ under control.
The dominator condition is the key: without it, the limit can lose mass (as in the Fatou counterexample above). With it, Fatou applied to $g + f_n$ and $g - f_n$ gives both directions of the inequality.
Proof Sketch
Apply Fatou's lemma to $g + f_n \ge 0$: $\int g \, d\mu + \int f \, d\mu \le \int g \, d\mu + \liminf_n \int f_n \, d\mu$. So $\int f \, d\mu \le \liminf_n \int f_n \, d\mu$.
Apply Fatou to $g - f_n \ge 0$: $\int g \, d\mu - \int f \, d\mu \le \int g \, d\mu - \limsup_n \int f_n \, d\mu$. So $\limsup_n \int f_n \, d\mu \le \int f \, d\mu$.
Combined: $\limsup_n \int f_n \, d\mu \le \int f \, d\mu \le \liminf_n \int f_n \, d\mu$, forcing $\int f_n \, d\mu \to \int f \, d\mu$.
Why It Matters
DCT is the most-used convergence theorem in probability and statistics. Every time you:
- Differentiate under the integral sign ($\frac{d}{d\theta} \int f(x, \theta) \, dx = \int \frac{\partial f}{\partial \theta}(x, \theta) \, dx$)
- Pass limits through expectations in consistency proofs
- Justify asymptotic expansions of integrals
- Prove continuity of distribution functions
...you are (implicitly or explicitly) applying DCT. The dominator condition is usually verified by finding a uniform bound on $|f_n|$.
In ML theory, DCT appears in: proving consistency of MLE, justifying score function estimators, and any argument that exchanges $\lim$ or $\nabla_\theta$ with $\mathbb{E}$.
Failure Mode
You must find an integrable dominator $g$. If no such $g$ exists, DCT does not apply, and the interchange can fail. A common mistake: claiming DCT applies because "$f_n$ is bounded" without checking that the bound is integrable on the full space (e.g., bounded functions on $\mathbb{R}$ are not necessarily integrable on $\mathbb{R}$ because the domain has infinite measure).
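A sketch of exactly this mistake (names illustrative): $f_n = \frac{1}{n}\mathbf{1}_{[0,n]}$ is uniformly bounded by 1 and converges pointwise to 0, yet every integral equals 1. The smallest candidate dominator, $g(x) = \sup_n f_n(x)$, decays like $1/x$ and so is not integrable on $[1, \infty)$.

```python
def f(n, x):
    """The spreading function f_n = 1/n on [0, n], 0 elsewhere."""
    return 1 / n if 0 <= x <= n else 0.0

# exact integrals: height 1/n times length n
integrals = [round(n * (1 / n), 10) for n in range(1, 6)]

def envelope(x, nmax=10_000):
    """Smallest possible dominator g(x) = sup_n f_n(x); behaves like 1/x."""
    return max(f(n, x) for n in range(1, nmax + 1))

print(integrals)                                  # all 1.0, yet f_n -> 0 pointwise
print([envelope(x) for x in (1.0, 10.0, 100.0)])  # decays like 1/x: not integrable on [1, inf)
```

Since no integrable dominator exists, DCT gives no license to interchange, and indeed $\lim_n \int f_n = 1 \ne 0 = \int \lim_n f_n$.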
On finite measure spaces (in particular probability spaces, where $P(\Omega) = 1$), a uniform bound $|X_n| \le M$ automatically gives an integrable dominator ($g \equiv M$), so the bounded convergence theorem is a special case of DCT.
The Borel-Cantelli Lemmas
The Borel-Cantelli lemmas relate the summability of event probabilities to whether those events occur infinitely often (i.o.). Throughout, define $\{A_n \text{ i.o.}\} = \limsup_n A_n = \bigcap_{n=1}^\infty \bigcup_{k \ge n} A_k$, the set of outcomes that lie in infinitely many $A_n$.
First Borel-Cantelli Lemma
Statement
If $\sum_{n=1}^\infty P(A_n) < \infty$, then $P(A_n \text{ i.o.}) = 0$. No independence assumption is needed.
Intuition
If the probabilities are summable, the expected number of events that occur, $\mathbb{E}\big[\sum_n \mathbf{1}_{A_n}\big] = \sum_n P(A_n)$, is finite. A random variable with finite mean is finite almost surely, so only finitely many $A_n$ occur with probability 1.
Proof Sketch
By monotonicity, $P(A_n \text{ i.o.}) \le P\big(\bigcup_{k \ge n} A_k\big) \le \sum_{k \ge n} P(A_k)$ for every $n$. The right side is the tail of a convergent series, so it tends to 0 as $n \to \infty$.
Why It Matters
This is the standard tool for proving almost-sure convergence. Many strong-law and concentration results rely on showing $\sum_n P(|X_n - X| > \epsilon) < \infty$ for every $\epsilon > 0$, then invoking the first Borel-Cantelli lemma to conclude $X_n \to X$ a.s.
Failure Mode
The lemma only gives one direction. Summability is sufficient but not necessary: there are sequences with $\sum_n P(A_n) = \infty$ for which $P(A_n \text{ i.o.}) = 0$ (when the events are strongly dependent). The converse requires independence.
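A simulation sketch of the first lemma (the choice $P(A_n) = 1/n^2$ and all names are illustrative): since $\sum_n 1/n^2 < \infty$, on almost every sample path only finitely many $A_n = \{U_n < 1/n^2\}$ occur, so the index of the last occurrence is small.

```python
import random

random.seed(0)

def last_occurrence(N=100_000):
    """Largest index n <= N at which the event A_n = {U_n < 1/n^2} occurs."""
    last = 0
    for n in range(1, N + 1):
        if random.random() < 1 / n**2:
            last = n
    return last

# across independent runs, the last occurrence stays near the start of the sequence
print([last_occurrence() for _ in range(5)])
```

Raising $N$ further does not change the picture: the tail $\sum_{n > N} 1/n^2$ is tiny, so later occurrences are vanishingly rare.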
Second Borel-Cantelli Lemma
Statement
If $A_1, A_2, \ldots$ are pairwise independent and $\sum_{n=1}^\infty P(A_n) = \infty$, then $P(A_n \text{ i.o.}) = 1$.
Intuition
Under independence, divergence of the probability sum forces the events to keep occurring. The probability that all later events fail factors as a product that converges to 0.
Proof Sketch
It suffices to show $P\big(\bigcap_{k=n}^m A_k^c\big) \to 0$ as $m \to \infty$ for each $n$. Assuming full independence, $P\big(\bigcap_{k=n}^m A_k^c\big) = \prod_{k=n}^m (1 - P(A_k)) \le \exp\big(-\sum_{k=n}^m P(A_k)\big) \to 0$. Therefore $P\big(\bigcup_{k \ge n} A_k\big) = 1$ for every $n$, which gives $P(A_n \text{ i.o.}) = 1$. (The pairwise-independent case replaces this product argument with a second-moment/Chebyshev argument.)
Why It Matters
This lemma turns a divergent probability sum into an almost-sure "infinitely often" conclusion. It is the standard way to show that rare events must eventually recur under independence, and it provides many counterexamples separating convergence in probability from almost-sure convergence.
Failure Mode
Independence is required. Without it, the conclusion can fail: take $A_n = A$ for all $n$ with $0 < P(A) < 1$. Then $\sum_n P(A_n) = \infty$, but $P(A_n \text{ i.o.}) = P(A) < 1$. Pairwise independence is sufficient; mutual independence is not required.
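A companion simulation for the second lemma (again with illustrative choices): with independent $A_n = \{U_n < 1/n\}$ the probability sum diverges, and occurrences keep appearing in every decade $(N, 10N]$, where the expected count is $\sum_{n=N+1}^{10N} 1/n \approx \ln 10 \approx 2.3$.

```python
import random

random.seed(1)

def occurrences(N):
    """Number of events A_n = {U_n < 1/n} occurring for n in (N, 10N]."""
    return sum(1 for n in range(N + 1, 10 * N + 1) if random.random() < 1 / n)

# each decade keeps producing occurrences (expected count about ln 10 per decade)
print([occurrences(10**k) for k in range(5)])
```

Contrast with the first-lemma simulation: there the occurrences stop; here they recur forever, which is exactly the i.o. conclusion.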
The two Borel-Cantelli lemmas are not symmetric
The first lemma (summable probabilities imply $P(A_n \text{ i.o.}) = 0$) is unconditional: no independence assumption is required. The second lemma (divergent sum implies $P(A_n \text{ i.o.}) = 1$) requires pairwise independence of the events. The asymmetry is real. Without independence, the second conclusion can fail even when $\sum_n P(A_n) = \infty$. The canonical counterexample is $A_n = A$ for all $n$: the sum diverges but $P(A_n \text{ i.o.}) = P(A)$, which can be any value in $(0, 1)$.
Why Measure Theory is Necessary for ML Theory
Three concrete examples where measure theory is unavoidable:
1. Conditional expectation. For continuous random variables, the "intuitive" definition $\mathbb{E}[Y \mid X = x] = \int y \, f_{Y \mid X}(y \mid x) \, dy$ works in simple cases but fails for general conditioning (what if $X$ is a random function?). The measure-theoretic definition: $\mathbb{E}[Y \mid \mathcal{G}]$ is the $\mathcal{G}$-measurable function satisfying $\int_A \mathbb{E}[Y \mid \mathcal{G}] \, dP = \int_A Y \, dP$ for all $A \in \mathcal{G}$. This exists by the Radon-Nikodym theorem and is unique a.s.
2. Radon-Nikodym derivatives. The likelihood ratio $\frac{dP_\theta}{dP_{\theta_0}}$, which appears in MLE, hypothesis testing, and importance sampling, is a Radon-Nikodym derivative. It exists if and only if $P_\theta$ is absolutely continuous with respect to $P_{\theta_0}$. This concept is purely measure-theoretic.
3. Martingales. The theory of martingales (used in online learning, sequential analysis, and stochastic optimization) requires filtrations (increasing sequences of sigma-algebras), which are a measure-theoretic construction. Without sigma-algebras, you cannot formalize "information available at time $t$."
Canonical Examples
Lebesgue measure on [0,1]
The standard probability space for continuous uniform random variables is $([0,1], \mathcal{B}([0,1]), \lambda)$ where $\lambda$ is Lebesgue measure: $\lambda([a, b]) = b - a$. A uniform random variable on $[0,1]$ is just the identity function $U(\omega) = \omega$. The CDF is $F(u) = P(U \le u) = \lambda([0, u]) = u$.
Every probability distribution on can be realized as a function of a uniform random variable (inverse CDF transform). So this single probability space is, in a sense, universal.
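A sketch of the inverse CDF transform (the rate and sample size are illustrative choices): pushing a uniform sample through $F^{-1}(u) = -\ln(1 - u)/\text{rate}$ yields Exponential(rate) draws, realized on this single space $([0,1], \mathcal{B}, \lambda)$.

```python
import math
import random

random.seed(42)
rate = 2.0

# F^{-1}(u) = -log(1 - u) / rate is the inverse CDF of Exponential(rate)
samples = [-math.log(1 - random.random()) / rate for _ in range(100_000)]

mean = sum(samples) / len(samples)
print(round(mean, 3))  # close to the true mean 1/rate = 0.5
```

The same recipe works for any distribution with a computable inverse CDF, which is the sense in which the uniform space is universal.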
Why not all subsets of [0,1] are measurable
The Vitali set construction shows that one cannot consistently assign a "length" to every subset of $[0,1]$ while preserving translation invariance and countable additivity. The construction uses the axiom of choice to produce a set $V$ such that $[0,1]$ is covered by a countable disjoint union of translates of $V$. If $\lambda(V)$ existed, countable additivity and translation invariance would force the sum $\sum_{n=1}^\infty \lambda(V)$ to be finite and positive, which is impossible (a sum of identical non-negative terms is either $0$ or $\infty$).
This is why we restrict to the Borel sigma-algebra: it is rich enough to contain all sets we ever need in practice, while excluding pathological sets that break countable additivity.
DCT in action: differentiating under the integral
Let $f(x, \theta)$ be a function where we want to compute $\frac{d}{d\theta} \int f(x, \theta) \, dx$.
If $\big|\frac{\partial f}{\partial \theta}(x, \theta')\big| \le g(x)$ for all $\theta'$ near $\theta$ with $\int g(x) \, dx < \infty$, then DCT (applied to the difference quotients) gives:
$\frac{d}{d\theta} \int f(x, \theta) \, dx = \int \frac{\partial f}{\partial \theta}(x, \theta) \, dx$
This interchange is used constantly: in computing score functions for MLE, in the "log-derivative trick" for policy gradients, and in variational inference. Every valid use requires checking the dominator condition.
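A numeric sanity check of the interchange (the example $f(x, \theta) = e^{-\theta x}$ and all names are illustrative): near $\theta = 1$, $|\partial f / \partial \theta| = x e^{-\theta x} \le x e^{-x/2}$, which is integrable on $[0, \infty)$, so DCT applies and both sides should equal $-1/\theta^2$.

```python
import math

def integral(g, xmax=30.0, steps=200_000):
    """Midpoint approximation of the integral of g over [0, xmax]."""
    dx = xmax / steps
    return sum(g((i + 0.5) * dx) for i in range(steps)) * dx

theta, h = 1.0, 1e-4
# left side: numerical d/dtheta of the integral, via a central difference
lhs = (integral(lambda x: math.exp(-(theta + h) * x))
       - integral(lambda x: math.exp(-(theta - h) * x))) / (2 * h)
# right side: integral of the theta-derivative, -x * exp(-theta x)
rhs = integral(lambda x: -x * math.exp(-theta * x))
print(round(lhs, 4), round(rhs, 4))  # both close to -1/theta**2 = -1.0
```

The agreement is exactly what the dominator condition licenses; with a non-integrable derivative bound, the two sides can genuinely differ.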
Common Confusions
Probability zero does not mean impossible
In measure theory, $P(A) = 0$ means the event $A$ has zero probability, but it does not mean $A$ is empty. For a continuous uniform random variable on $[0,1]$, every single point has probability zero: $P(U = x) = 0$ for all $x \in [0,1]$. Yet one of these "impossible" events must occur. This is perfectly consistent: countable additivity requires $P\big(\bigcup_n A_n\big) = \sum_n P(A_n)$ only for countable collections, but $[0,1]$ is uncountable, so the union of all singletons is not a countable union.
Almost surely is not surely
"$X_n \to X$ almost surely" means $P(\{\omega : X_n(\omega) \to X(\omega)\}) = 1$. There may be a null set (probability-zero set) where convergence fails. This is different from "$X_n(\omega) \to X(\omega)$ everywhere." In measure theory, we routinely ignore null sets because they do not affect integrals or probabilities. But you must be careful: a countable union of null sets is still null, but an uncountable union might not be.
Riemann-integrable functions are Lebesgue-integrable, but not vice versa
Every Riemann-integrable function on $[a, b]$ is also Lebesgue-integrable, and the integrals agree. But the Lebesgue integral handles more functions (like $\mathbf{1}_{\mathbb{Q}}$, which is Lebesgue-integrable with integral 0 but not Riemann-integrable). The Lebesgue integral also has better convergence theorems (MCT, DCT), which is the real reason we use it.
Borel vs. Lebesgue sigma-algebra
The Borel sigma-algebra $\mathcal{B}$ is generated by open sets. The Lebesgue sigma-algebra $\mathcal{L}$ is the completion of $\mathcal{B}$ with respect to Lebesgue measure (adding all subsets of null sets). $\mathcal{L}$ is strictly larger: it contains non-Borel sets. For probability, the Borel sigma-algebra is almost always sufficient, and most textbooks work exclusively with $\mathcal{B}$.
Summary
- A probability space consists of a sample space, sigma-algebra, and probability measure
- Sigma-algebras restrict which events can have probabilities (to avoid paradoxes)
- Random variables are measurable functions; measurability ensures $\{X \in B\} \in \mathcal{F}$ for every Borel set $B$
- Expectation is the Lebesgue integral: $\mathbb{E}[X] = \int_\Omega X \, dP$
- MCT: for $0 \le f_n \uparrow f$, the integral of the limit equals the limit of the integrals
- Fatou: $\int \liminf_n f_n \, d\mu \le \liminf_n \int f_n \, d\mu$ (mass can disappear, not appear)
- DCT: if $|f_n| \le g$ (integrable) and $f_n \to f$, then $\int f_n \, d\mu \to \int f \, d\mu$
- DCT is the tool for interchanging limits and expectations
- Measure theory is necessary for: conditional expectation, Radon-Nikodym, martingales, and any rigorous asymptotic argument
Exercises
Problem
Let $f_n \to f$ pointwise on $[0, 1]$ with Lebesgue measure. Compute $\lim_n \int_0^1 f_n \, d\lambda$ and verify that it equals $\int_0^1 f \, d\lambda$. Does DCT apply here?
Problem
Let $X_n$ be a sequence of random variables with $\mathbb{E}[|X_n|] < \infty$ and $X_n \to X$ almost surely. Can you conclude that $\mathbb{E}[X_n] \to \mathbb{E}[X]$? State precisely what additional condition you need.
Problem
Prove that if $X_n \to X$ in $L^1$ (i.e., $\mathbb{E}[|X_n - X|] \to 0$), then $\mathbb{E}[X_n] \to \mathbb{E}[X]$. Conversely, give an example where $\mathbb{E}[X_n] \to \mathbb{E}[X]$ but $X_n$ does not converge to $X$ in $L^1$.
Problem
Use the monotone convergence theorem to prove that for any non-negative random variable $X$: $\mathbb{E}[X] = \int_0^\infty P(X > t) \, dt$. This is the "layer cake" representation of the expectation.
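The identity can be sanity-checked numerically before proving it (the Exponential(1) choice and all names are illustrative): for $X \sim \text{Exp}(1)$, $\mathbb{E}[X] = 1$ and $P(X > t) = e^{-t}$, so integrating the tail should also give 1.

```python
import math

def tail(t):
    """P(X > t) for X ~ Exponential(1)."""
    return math.exp(-t)

dt, tmax = 1e-4, 30.0
# layer cake: E[X] equals the integral of the tail probability over t >= 0
layer_cake = sum(tail((i + 0.5) * dt) for i in range(int(tmax / dt))) * dt
print(round(layer_cake, 4))  # close to E[X] = 1.0
```

The check is not a proof, but it makes the geometry vivid: the expectation is the area under the tail function.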
Advanced: Frostman's Lemma and Capacity
Frostman's lemma connects measure theory to geometric set theory and potential theory. It characterizes the "size" of a set via the measures it can support.
Frostman's lemma. A Borel set $E \subseteq \mathbb{R}^n$ satisfies $\mathcal{H}^s(E) > 0$ if and only if there exists a non-zero Borel measure $\mu$ supported on $E$ such that $\mu(B(x, r)) \le C r^s$ for all $x \in \mathbb{R}^n$ and all $r > 0$. Consequently, the Hausdorff dimension of $E$ is the supremum of the exponents $s$ for which such a measure exists.
The measure $\mu$ is called a Frostman measure. The energy integral $I_t(\mu) = \iint |x - y|^{-t} \, d\mu(x) \, d\mu(y)$ is finite for such measures when $t < s$.
This result matters in probability because it connects Hausdorff dimension (a geometric measure of set complexity) to the existence of measures with controlled local growth. It appears in the theory of random sets, Brownian motion (the range of a Brownian motion in $\mathbb{R}^n$, $n \ge 2$, has Hausdorff dimension $2$, proved using Frostman-type energy arguments), and fractal geometry.
References
Canonical:
- Billingsley, Probability and Measure (3rd ed., 1995), Chapters 1-5
- Durrett, Probability: Theory and Examples (5th ed., 2019), Chapters 1-2
- Rudin, Real and Complex Analysis (3rd ed., 1987), Chapters 1-3
Potential theory and capacity:
- Mattila, Geometry of Sets and Measures in Euclidean Spaces (1995), Chapter 8. The definitive reference for Frostman's lemma and energy integrals.
- Kahane, Some Random Series of Functions (2nd ed., 1985), Chapter 10
Current:
- Tao, An Introduction to Measure Theory (2011)
- Schilling, Measures, Integrals and Martingales (2nd ed., 2017)
Next Topics
Building on measure-theoretic foundations:
- Concentration inequalities: the first application of measure-theoretic tools to learning theory
- Common probability distributions: the standard distributions, now understood as measures on the Borel sigma-algebra
Last reviewed: April 2026
Builds on This
- Cramér-Wold Theorem (Layer 1)
- Martingale Theory (Layer 0B)
- Radon-Nikodym and Conditional Expectation (Layer 0B)
- Stochastic Calculus for ML (Layer 3)