
Mathematical Infrastructure

Measure-Theoretic Probability

The rigorous foundations of probability: sigma-algebras, measures, measurable functions as random variables, Lebesgue integration, and the convergence theorems that make modern probability and statistics possible.


Why This Matters

[Figure: Venn diagram of a sample space Ω containing events A, B, A∩B, A∪B, and Aᶜ. A σ-algebra ℱ must satisfy: Ω ∈ ℱ (the full space is measurable); A ∈ ℱ ⇒ Aᶜ ∈ ℱ (closed under complement); A₁, A₂, … ∈ ℱ ⇒ ∪ Aᵢ ∈ ℱ (closed under countable union). These imply closure under countable intersection (∩ Aᵢ) and set difference (A ∖ B). Caption: A sigma-algebra is a collection of measurable events, closed under complement and countable union.]

Every rigorous result in probability, statistics, and machine learning theory rests on measure-theoretic foundations. When you write $\mathbb{E}[f(X)]$ or $\Pr[X \in A]$, you are using the Lebesgue integral and probability measures, whether you know it or not. These are the foundations beneath expectation, variance, and moments.

Why can't you just use "naive" probability (counting outcomes) or Riemann integration? Three reasons:

  1. Conditional expectation on continuous random variables requires measure theory. The expression $\mathbb{E}[Y \mid X = x]$ does not make sense as a ratio $\Pr[Y, X=x]/\Pr[X=x]$ because $\Pr[X = x] = 0$ for continuous $X$. You need the Radon-Nikodym theorem.

  2. Convergence of integrals requires Lebesgue's dominated convergence theorem (DCT). When you interchange limits and expectations (which happens in every consistency proof, every asymptotic argument, every convergence theorem), you are implicitly using DCT. The Riemann integral does not support this interchange in general.

  3. Not all subsets are measurable. The Banach-Tarski paradox shows that without restricting to a sigma-algebra, you can "construct" sets with no meaningful volume. Measure theory tells you which sets you are allowed to assign probabilities to.

If you skip measure theory, you will be able to use results in ML theory but not understand why the proofs are valid or where they might break.

Mental Model

Think of measure theory as providing three layers of infrastructure:

  1. What can you measure?: The sigma-algebra $\mathcal{F}$ tells you which events you are allowed to assign probabilities to. Not all subsets of $\mathbb{R}$ are measurable (Vitali sets), so you must specify the collection of "nice" sets upfront.

  2. How do you measure?: The measure $\mu$ (or probability $\mathbb{P}$) assigns a non-negative number to each measurable set, satisfying countable additivity: $\mu(\bigcup_n A_n) = \sum_n \mu(A_n)$ for disjoint $A_n$.

  3. How do you integrate?: The Lebesgue integral $\int f \, d\mu$ generalizes the Riemann integral to handle limits, conditional expectations, and abstract spaces. Expectation is just integration: $\mathbb{E}[X] = \int X \, d\mathbb{P}$.

Formal Setup

Definition

Sigma-Algebra

A sigma-algebra (or $\sigma$-algebra) on a set $\Omega$ is a collection $\mathcal{F}$ of subsets of $\Omega$ satisfying:

  1. $\Omega \in \mathcal{F}$ (the whole space is measurable)
  2. If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$ (closed under complements)
  3. If $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{n=1}^\infty A_n \in \mathcal{F}$ (closed under countable unions)

Why we need this: Without restricting to a sigma-algebra, we could construct non-measurable sets (Vitali set) that cannot consistently be assigned a probability. The sigma-algebra is the collection of events for which probability is well-defined.

The Borel sigma-algebra $\mathcal{B}(\mathbb{R})$ is the sigma-algebra generated by all open sets in $\mathbb{R}$. It contains all intervals, all open sets, all closed sets, and all countable operations on these. Every set you will encounter in applied probability is Borel-measurable.

Definition

Measure

A measure on $(\Omega, \mathcal{F})$ is a function $\mu: \mathcal{F} \to [0, \infty]$ satisfying:

  1. $\mu(\emptyset) = 0$
  2. Countable additivity: If $A_1, A_2, \ldots \in \mathcal{F}$ are pairwise disjoint, then $\mu(\bigcup_n A_n) = \sum_n \mu(A_n)$

The triple $(\Omega, \mathcal{F}, \mu)$ is called a measure space.

Definition

Probability Measure

A probability measure is a measure $\mathbb{P}$ with $\mathbb{P}(\Omega) = 1$. The triple $(\Omega, \mathcal{F}, \mathbb{P})$ is a probability space.

This is the formal foundation: $\Omega$ is the sample space (set of all outcomes), $\mathcal{F}$ is the set of events, and $\mathbb{P}$ assigns probabilities. The three axioms of probability (Kolmogorov) are exactly the axioms of a probability measure.

Definition

Measurable Function / Random Variable

A function $X: (\Omega, \mathcal{F}) \to (\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is measurable if for every Borel set $B \in \mathcal{B}(\mathbb{R})$:

$X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\} \in \mathcal{F}$

A random variable is simply a measurable function from a probability space to $\mathbb{R}$. The measurability condition ensures that events like $\{X \leq t\}$ and $\{X \in [a, b]\}$ are in $\mathcal{F}$, so you can assign probabilities to them.

The Lebesgue Integral

Definition

Lebesgue Integral

The Lebesgue integral of a measurable function $f$ with respect to a measure $\mu$ is constructed in three steps:

Step 1 (Simple functions): A simple function $s = \sum_{k=1}^n a_k \mathbf{1}_{A_k}$ (finite linear combination of indicator functions) has integral $\int s \, d\mu = \sum_k a_k \mu(A_k)$.

Step 2 (Non-negative functions): For $f \geq 0$ measurable: $\int f \, d\mu = \sup\{\int s \, d\mu : 0 \leq s \leq f, \; s \text{ simple}\}$.

Step 3 (General functions): Write $f = f^+ - f^-$ where $f^+ = \max(f, 0)$ and $f^- = \max(-f, 0)$. Then $\int f \, d\mu = \int f^+ \, d\mu - \int f^- \, d\mu$ (provided at least one is finite).

Lebesgue vs. Riemann. The Riemann integral partitions the domain into small intervals and approximates the function on each. The Lebesgue integral partitions the range and measures how much of the domain maps into each part. This is why Lebesgue can integrate functions that Riemann cannot (like $\mathbf{1}_{\mathbb{Q}}$, the indicator of the rationals).

For probability: $\mathbb{E}[X] = \int_\Omega X(\omega) \, d\mathbb{P}(\omega)$. This is just the Lebesgue integral of the random variable $X$ against the probability measure $\mathbb{P}$.
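The range-partition idea can be sanity-checked numerically. Below is a minimal sketch (assuming NumPy; the grid size and number of levels are illustrative choices): it approximates $\int f \, d\mu$ for $f(x) = x^2$ on $[0,1]$ with Lebesgue measure by slicing the range into levels and measuring the preimage of each level, i.e. a simple-function approximation from below.

```python
import numpy as np

def lebesgue_integral(f, a=0.0, b=1.0, n_levels=1000, n_grid=100_000):
    """Approximate the Lebesgue integral of a bounded f >= 0 on [a, b]
    (Lebesgue measure) by partitioning the RANGE into levels and
    measuring the preimage {x : f(x) > level} of each one."""
    x = np.linspace(a, b, n_grid)
    y = f(x)
    levels = np.linspace(0.0, y.max(), n_levels)
    dx = (b - a) / n_grid
    total = 0.0
    for k in range(n_levels - 1):
        measure = np.count_nonzero(y > levels[k]) * dx   # mu{f > level_k}
        total += measure * (levels[k + 1] - levels[k])   # thin horizontal slab
    return total

approx = lebesgue_integral(lambda t: t ** 2)
print(approx)   # close to 1/3
```

Summing measure-of-preimage times level-width is exactly the "partition the range" picture from the definition, and it also previews the layer-cake representation $\int f \, d\mu = \int_0^\infty \mu\{f > t\} \, dt$.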

Main Theorems

The three convergence theorems are the workhorses of measure-theoretic probability. They tell you when you can interchange limits and integrals.

Theorem

Monotone Convergence Theorem (MCT)

Statement

If $f_n \geq 0$ are measurable functions with $f_1 \leq f_2 \leq \cdots$ pointwise, and $f = \lim_{n \to \infty} f_n$, then:

$\int f \, d\mu = \lim_{n \to \infty} \int f_n \, d\mu$

Equivalently: $\int \lim_n f_n \, d\mu = \lim_n \int f_n \, d\mu$.

Intuition

For non-negative, non-decreasing sequences, the limit of the integrals equals the integral of the limit. No additional conditions are needed beyond monotonicity and non-negativity. Intuitively: if the functions grow monotonically toward ff, the areas under them grow monotonically toward the area under ff.

Proof Sketch

One direction ($\leq$) is immediate: since $f_n \leq f$, $\int f_n \, d\mu \leq \int f \, d\mu$, so $\lim_n \int f_n \, d\mu \leq \int f \, d\mu$.

The other direction ($\geq$) uses the definition of the Lebesgue integral as a supremum over simple functions. For any simple $s \leq f$ and any $\alpha < 1$, the sets $E_n = \{f_n \geq \alpha s\}$ increase to $\Omega$. Then $\int f_n \, d\mu \geq \int_{E_n} f_n \, d\mu \geq \alpha \int_{E_n} s \, d\mu$. Taking $n \to \infty$: $\lim_n \int f_n \, d\mu \geq \alpha \int s \, d\mu$. Since $\alpha < 1$ and $s \leq f$ are arbitrary, $\lim_n \int f_n \, d\mu \geq \int f \, d\mu$.

Why It Matters

MCT is the foundation for all the other convergence theorems. It is used to prove Fatou's lemma, which in turn is used to prove DCT. In probability, MCT justifies interchanging expectations with monotone limits: if $0 \leq X_n \uparrow X$, then $\mathbb{E}[X_n] \uparrow \mathbb{E}[X]$. This is used constantly when working with truncated random variables, constructing the Lebesgue integral itself, and proving properties of conditional expectation.

Failure Mode

MCT requires non-negativity and monotonicity. Without monotonicity, the conclusion can fail: $f_n = \mathbf{1}_{[n, n+1]}$ on $\mathbb{R}$ has $\int f_n \, d\mu = 1$ for all $n$ but $f_n \to 0$ pointwise, so $\int \lim_n f_n \, d\mu = 0 \neq 1$. (These are not monotone.) The decreasing analogue also fails without an integrability assumption: $f_n = \mathbf{1}_{[n, \infty)}$ decreases pointwise to $0$, yet $\int f_n \, d\mu = \infty$ for every $n$.
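MCT in action can be illustrated numerically. A minimal sketch (assuming NumPy; the grid and truncation levels are illustrative choices): $f(x) = 1/\sqrt{x}$ on $(0,1]$ is unbounded but integrable with integral 2, and the truncations $f_n = \min(f, n)$ increase pointwise to $f$, so MCT says their integrals must increase to 2.

```python
import numpy as np

# f(x) = 1/sqrt(x) on (0, 1] is unbounded but integrable, with integral 2.
# The truncations f_n = min(f, n) increase pointwise to f, so MCT forces
# int f_n -> int f = 2.  (Closed form here: int f_n = 2 - 1/n.)
m = 2_000_000
x = (np.arange(m) + 0.5) / m              # midpoint grid on (0, 1)
f = 1.0 / np.sqrt(x)

vals = []
for n in [1, 10, 100, 1000]:
    f_n = np.minimum(f, n)
    vals.append(f_n.mean())               # midpoint-rule integral over [0, 1]
    print(n, vals[-1])
```

Truncation at level $n$ is exactly the kind of monotone approximation used when constructing the integral and when working with truncated random variables.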

Lemma

Fatou's Lemma

Statement

If $f_n \geq 0$ are measurable, then:

$\int \liminf_{n \to \infty} f_n \, d\mu \leq \liminf_{n \to \infty} \int f_n \, d\mu$

In probability notation: $\mathbb{E}[\liminf X_n] \leq \liminf \mathbb{E}[X_n]$.

Intuition

"The integral of the limit is at most the limit of the integrals." Fatou says: mass can disappear in the limit (escape to infinity or concentrate on a null set), so the limit might have less integral than you expect. But mass cannot spontaneously appear, so the integral of the limit is a lower bound.

Proof Sketch

Define $g_n = \inf_{k \geq n} f_k$. Then $g_n \leq f_n$ and $g_n$ is non-decreasing with $g_n \uparrow \liminf_n f_n$. By MCT: $\int \liminf_n f_n \, d\mu = \lim_n \int g_n \, d\mu$. Since $g_n \leq f_n$: $\int g_n \, d\mu \leq \int f_n \, d\mu$. Therefore $\lim_n \int g_n \, d\mu \leq \liminf_n \int f_n \, d\mu$.

Why It Matters

Fatou's lemma is the key tool when you do not have the conditions for DCT (no dominating function). It gives you at least a one-sided inequality, which is often enough. In probability, Fatou is used to prove the lower semicontinuity of variance, the consistency of risk functionals, and many other one-sided limit results.

Failure Mode

The inequality can be strict. Classic example: $f_n = n \cdot \mathbf{1}_{[0, 1/n]}$ on $[0, 1]$. Then $\int f_n \, d\mu = 1$ for all $n$, but $f_n \to 0$ pointwise on $(0, 1]$ (the single point $0$ is a null set), so $\int \liminf_n f_n \, d\mu = 0 < 1 = \liminf_n \int f_n \, d\mu$. Mass "escapes" to the spike at zero.
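The spike family is easy to check numerically. A small sketch (assuming NumPy; the grid size is an illustrative choice): every $f_n$ integrates to 1 on the grid, while the pointwise limit on $(0,1]$ is identically 0.

```python
import numpy as np

# Fatou's strict-inequality example: f_n = n * 1_{[0, 1/n]} on [0, 1].
# Each f_n has integral 1, but f_n(x) -> 0 for every x > 0, so
# int liminf f_n = 0 < 1 = liminf int f_n: the mass escapes into the spike.
m = 1_000_000
x = (np.arange(m) + 0.5) / m              # midpoint grid on (0, 1)

vals = []
for n in [10, 100, 1000]:
    f_n = n * (x <= 1.0 / n)
    vals.append(f_n.mean())               # midpoint-rule integral: exactly 1
    print(n, vals[-1])
```

No single integrable function dominates all the spikes, which is exactly why DCT (next) cannot be applied to this sequence.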

Theorem

Dominated Convergence Theorem (DCT)

Statement

If $f_n \to f$ pointwise (or $\mu$-almost everywhere), and there exists an integrable function $g$ with $|f_n| \leq g$ for all $n$, then:

$\lim_{n \to \infty} \int f_n \, d\mu = \int f \, d\mu$

Equivalently: $\int \lim_n f_n \, d\mu = \lim_n \int f_n \, d\mu$.

In probability notation: if $|X_n| \leq Y$ with $\mathbb{E}[Y] < \infty$ and $X_n \to X$ a.s., then $\mathbb{E}[X_n] \to \mathbb{E}[X]$.

Intuition

DCT says: if the functions converge pointwise and are all bounded by a single integrable function $g$ (the "dominator"), then you can swap the limit and the integral. The dominator prevents mass from escaping to infinity. It acts as an "envelope" that keeps all the $f_n$ under control.

The dominator condition is the key: without it, the limit can lose mass (as in the Fatou counterexample above). With it, Fatou applied to $g + f_n \geq 0$ and $g - f_n \geq 0$ gives both directions of the inequality.

Proof Sketch

Apply Fatou's lemma to $g + f_n \geq 0$: $\int g \, d\mu + \int f \, d\mu \leq \liminf_n \int (g + f_n) \, d\mu = \int g \, d\mu + \liminf_n \int f_n \, d\mu$. So $\int f \, d\mu \leq \liminf_n \int f_n \, d\mu$.

Apply Fatou to $g - f_n \geq 0$: $\int g \, d\mu - \int f \, d\mu \leq \liminf_n \int (g - f_n) \, d\mu = \int g \, d\mu - \limsup_n \int f_n \, d\mu$. So $\int f \, d\mu \geq \limsup_n \int f_n \, d\mu$.

Combined: $\limsup_n \int f_n \, d\mu \leq \int f \, d\mu \leq \liminf_n \int f_n \, d\mu$, forcing $\lim_n \int f_n \, d\mu = \int f \, d\mu$.

Why It Matters

DCT is the most-used convergence theorem in probability and statistics. Every time you:

  • Differentiate under the integral sign ($\partial_\theta \mathbb{E}[f(X, \theta)]$)
  • Pass limits through expectations in consistency proofs
  • Justify asymptotic expansions of integrals
  • Prove continuity of distribution functions

...you are (implicitly or explicitly) applying DCT. The dominator condition is usually verified by finding a uniform bound on $|f_n|$.

In ML theory, DCT appears in: proving consistency of MLE, justifying score function estimators, and any argument that exchanges $\lim_{n \to \infty}$ with $\mathbb{E}[\cdot]$.

Failure Mode

You must find an integrable dominator $g$. If no such $g$ exists, DCT does not apply, and the interchange can fail. A common mistake: claiming DCT applies because "$f_n$ is bounded" without checking that the bound is integrable on the full space (e.g., bounded functions on $\mathbb{R}$ are not necessarily integrable on $\mathbb{R}$ because the domain has infinite measure).

On probability spaces ($\mathbb{P}(\Omega) = 1$), a uniform bound $|f_n| \leq M$ makes the constant $M$ an integrable dominator, so the bounded convergence theorem is a special case of DCT.
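A quick Monte Carlo illustration of DCT on a probability space (a sketch assuming NumPy; the sequence and sample size are illustrative choices): $X_n = (1 + U/n)^n$ converges pointwise to $e^U$ and satisfies $0 \leq X_n \leq e$ for every $n$, so the constant $e$ is an integrable dominator and $\mathbb{E}[X_n] \to \mathbb{E}[e^U] = e - 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, 1_000_000)

# X_n = (1 + U/n)^n -> e^U pointwise, with 0 <= X_n <= e for every n.
# The constant e is integrable on a probability space, so DCT gives
# E[X_n] -> E[e^U] = e - 1, about 1.718.
means = []
for n in [1, 10, 100, 1000]:
    means.append(np.mean((1.0 + u / n) ** n))
    print(n, means[-1])
print("target:", np.e - 1)
```

Here the sequence happens to be monotone as well, so MCT would also apply; DCT is the tool that still works when monotonicity fails but a dominator is available.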

The Borel-Cantelli Lemmas

The Borel-Cantelli lemmas relate the summability of event probabilities to whether those events occur infinitely often (i.o.). Throughout, define $\{A_n \text{ i.o.}\} = \bigcap_{N=1}^\infty \bigcup_{n \geq N} A_n = \limsup_n A_n$, the set of outcomes that lie in infinitely many $A_n$.

Lemma

First Borel-Cantelli Lemma

Statement

If $\sum_{n=1}^\infty \Pr[A_n] < \infty$, then $\Pr[A_n \text{ i.o.}] = 0$. No independence assumption is needed.

Intuition

If the probabilities are summable, the expected number of events that occur, $\mathbb{E}[\sum_n \mathbf{1}_{A_n}] = \sum_n \Pr[A_n]$, is finite. A random variable with finite mean is finite almost surely, so only finitely many $A_n$ occur with probability 1.

Proof Sketch

By monotonicity, $\Pr[A_n \text{ i.o.}] \leq \Pr[\bigcup_{n \geq N} A_n] \leq \sum_{n \geq N} \Pr[A_n]$ for every $N$. The right side is the tail of a convergent series, so it tends to 0 as $N \to \infty$.

Why It Matters

This is the standard tool for proving almost-sure convergence. Many strong-law and concentration results rely on showing $\sum_n \Pr[|X_n - X| > \epsilon] < \infty$, then invoking the first Borel-Cantelli lemma to conclude $X_n \to X$ a.s.
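A simulation sketch of the first lemma (assuming NumPy; the choice $\Pr[A_n] = 1/n^2$ and the run counts are illustrative): with summable probabilities, each run sees only a handful of events occur, never anything close to infinitely many.

```python
import numpy as np

rng = np.random.default_rng(0)
N, runs = 5000, 1000
p = 1.0 / np.arange(1, N + 1) ** 2     # sum_n P(A_n) = pi^2/6 < infinity

# Independent events A_n with P(A_n) = 1/n^2.  Borel-Cantelli I predicts
# that almost surely only finitely many occur; the count per run has
# mean sum_n p_n, close to pi^2/6.
counts = (rng.random((runs, N)) < p).sum(axis=1)
print("mean count:", counts.mean())
print("max count over runs:", counts.max())
```

The mean number of occurrences matches $\sum_n \Pr[A_n] \approx \pi^2/6 \approx 1.64$, and even the worst run sees only a few events.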

Failure Mode

The lemma only gives one direction. Summability is sufficient but not necessary: there are sequences with $\sum_n \Pr[A_n] = \infty$ for which $\Pr[A_n \text{ i.o.}] = 0$ (when the events are strongly dependent). The converse requires independence.

Lemma

Second Borel-Cantelli Lemma

Statement

If $\{A_n\}$ are pairwise independent and $\sum_{n=1}^\infty \Pr[A_n] = \infty$, then $\Pr[A_n \text{ i.o.}] = 1$.

Intuition

Under independence, divergence of the probability sum forces the events to keep occurring. The probability that all later events fail factors as a product that converges to 0.

Proof Sketch

It suffices to show $\Pr[\bigcap_{n=N}^M A_n^c] \to 0$ as $M \to \infty$ for each $N$. By independence, $\Pr[\bigcap_{n=N}^M A_n^c] = \prod_{n=N}^M (1 - \Pr[A_n]) \leq \exp(-\sum_{n=N}^M \Pr[A_n]) \to 0$. Therefore $\Pr[\bigcup_{n \geq N} A_n] = 1$ for every $N$, which gives $\Pr[A_n \text{ i.o.}] = 1$. (This product factorization uses mutual independence; under only pairwise independence, a second-moment argument is used instead.)

Why It Matters

This lemma turns a divergent probability sum into an almost-sure "infinitely often" conclusion. It is the standard way to show that rare events must eventually recur under independence, and it provides many counterexamples separating convergence in probability from almost-sure convergence.

Failure Mode

Independence is required. Without it, the conclusion can fail: take $A_n = A$ for all $n$ with $0 < \Pr[A] < 1$. Then $\sum_n \Pr[A_n] = \infty$, but $\Pr[A_n \text{ i.o.}] = \Pr[A] < 1$. Pairwise independence is sufficient; mutual independence is not required.
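A companion sketch for the second lemma (assuming NumPy; $\Pr[A_n] = 1/n$ and the horizon are illustrative): the probability sum diverges (harmonic series), and in simulation the events keep recurring, with the number seen up to $N$ growing like $\log N$.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
p = 1.0 / np.arange(1, N + 1)          # sum_n P(A_n) diverges (harmonic)

# Independent A_n with P(A_n) = 1/n: Borel-Cantelli II says the events
# occur infinitely often almost surely.  Up to N, roughly
# log N + 0.577 events occur (about 12 for N = 1e5).
hits = rng.random(N) < p
print("events occurred:", int(hits.sum()))
print("latest occurrence at index:", int(np.flatnonzero(hits).max()) + 1)
```

Contrast with the $1/n^2$ simulation above: there the occurrences stop early; here late occurrences keep appearing no matter how far out you look.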

Watch Out

The two Borel-Cantelli lemmas are not symmetric

The first lemma (summable probabilities imply $\Pr[\text{i.o.}] = 0$) is unconditional: no independence assumption is required. The second lemma (divergent sum implies $\Pr[\text{i.o.}] = 1$) requires pairwise independence of the events. The asymmetry is real. Without independence, the second conclusion can fail even when $\sum_n \Pr[A_n] = \infty$. The canonical counterexample is $A_n = A$ for all $n$: the sum diverges but $\Pr[A_n \text{ i.o.}] = \Pr[A]$, which can be any value in $[0, 1]$.

Why Measure Theory is Necessary for ML Theory

Three concrete examples where measure theory is unavoidable:

1. Conditional expectation. For continuous random variables, the "intuitive" definition $\mathbb{E}[Y \mid X = x] = \int y \, p(y \mid x) \, dy$ works in simple cases but fails for general $X$ (what if $X$ is a random function?). The measure-theoretic definition: $\mathbb{E}[Y \mid \mathcal{G}]$ is the $\mathcal{G}$-measurable function $Z$ satisfying $\int_A Z \, d\mathbb{P} = \int_A Y \, d\mathbb{P}$ for all $A \in \mathcal{G}$. This exists by the Radon-Nikodym theorem and is unique a.s.

2. Radon-Nikodym derivatives. The likelihood ratio $\frac{dP_\theta}{dP_{\theta_0}}$, which appears in MLE, hypothesis testing, and importance sampling, is a Radon-Nikodym derivative. It exists if and only if $P_\theta$ is absolutely continuous with respect to $P_{\theta_0}$. This concept is purely measure-theoretic.

3. Martingales. The theory of martingales (used in online learning, sequential analysis, and stochastic optimization) requires filtrations (increasing sequences of sigma-algebras), which are a measure-theoretic construction. Without sigma-algebras, you cannot formalize "information available at time $t$."

Canonical Examples

Example

Lebesgue measure on [0,1]

The standard probability space for continuous uniform random variables is $([0,1], \mathcal{B}([0,1]), \lambda)$ where $\lambda$ is Lebesgue measure: $\lambda([a,b]) = b - a$. A uniform random variable $U$ on $[0,1]$ is just the identity function $U(\omega) = \omega$. The CDF is $\Pr[U \leq t] = \lambda([0, t]) = t$.

Every probability distribution on $\mathbb{R}$ can be realized as a function of a uniform random variable (inverse CDF transform). So this single probability space is, in a sense, universal.
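The universality claim is constructive and easy to demonstrate. A sketch (assuming NumPy; Exponential with rate 2 is an illustrative target distribution): push a uniform sample through the inverse CDF.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, 1_000_000)   # U on ([0,1], B([0,1]), lambda)

# Inverse-CDF transform: X = F^{-1}(U) has CDF F.  For Exponential(rate 2),
# F(x) = 1 - exp(-2x), so F^{-1}(u) = -log(1 - u) / 2.
x = -np.log(1.0 - u) / 2.0
print("sample mean:", x.mean())            # Exp(2) has mean 1/2
print("P[X <= 1]:", (x <= 1.0).mean())     # F(1) = 1 - exp(-2)
```

The sample mean and the empirical CDF at 1 match the Exponential(2) values $1/2$ and $1 - e^{-2} \approx 0.8647$, even though the underlying probability space is just $[0,1]$ with Lebesgue measure.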

Example

Why not all subsets of [0,1] are measurable

The Vitali set construction shows that one cannot consistently assign a "length" to every subset of $[0,1]$ while preserving translation invariance and countable additivity. The construction uses the axiom of choice to produce a set $V$ such that $[0,1]$ is a countable disjoint union of translates of $V$. If $\lambda(V)$ existed, then $1 = \lambda([0,1]) = \sum_n \lambda(V)$, which is impossible (the sum is either 0 or $\infty$).

This is why we restrict to the Borel sigma-algebra: it is rich enough to contain all sets we ever need in practice, while excluding pathological sets that break countable additivity.

Example

DCT in action: differentiating under the integral

Let $f(x, \theta)$ be a function where we want to compute $\frac{d}{d\theta} \mathbb{E}[f(X, \theta)] = \frac{d}{d\theta} \int f(x, \theta) \, dP(x)$.

If $|\partial_\theta f(x, \theta)| \leq g(x)$ for all $\theta$ near $\theta_0$ with $\mathbb{E}[g(X)] < \infty$, then DCT gives:

$\frac{d}{d\theta} \mathbb{E}[f(X, \theta)] \Big|_{\theta_0} = \mathbb{E}\!\left[\frac{\partial f}{\partial \theta}(X, \theta_0)\right]$

This interchange is used constantly: in computing score functions for MLE, in the "log-derivative trick" for policy gradients, and in variational inference. Every valid use requires checking the dominator condition.
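A numerical check of this interchange (a sketch assuming NumPy; $f(x, \theta) = e^{\theta x}$ with $X \sim N(0,1)$ is an illustrative choice where the dominator condition holds near $\theta_0$): compare a finite-difference derivative of $\mathbb{E}[f(X, \theta)]$ with the swapped estimator $\mathbb{E}[\partial_\theta f(X, \theta)]$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(2_000_000)
theta, h = 0.7, 1e-4

# f(x, theta) = exp(theta * x) with X ~ N(0, 1).  Near theta = 0.7,
# |d_theta f| = |x| exp(theta * x) is dominated by |x| exp(c |x|),
# integrable against the Gaussian density, so DCT justifies the swap.
finite_diff = (np.exp((theta + h) * x).mean()
               - np.exp((theta - h) * x).mean()) / (2.0 * h)
swap = (x * np.exp(theta * x)).mean()      # E[d_theta f(X, theta)]
exact = theta * np.exp(theta ** 2 / 2.0)   # since E[f] = exp(theta^2 / 2)
print(finite_diff, swap, exact)
```

All three numbers agree to Monte Carlo accuracy; without an integrable dominator the finite-difference and swapped estimates could diverge from each other.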

Common Confusions

Watch Out

Probability zero does not mean impossible

In measure theory, $\mathbb{P}(A) = 0$ means the event $A$ has zero probability, but it does not mean $A$ is empty. For a continuous uniform random variable on $[0,1]$, every single point has probability zero: $\Pr[U = x] = 0$ for all $x$. Yet one of these "impossible" events must occur. This is perfectly consistent: countable additivity requires that $\Pr[\bigcup_{i=1}^\infty \{x_i\}] = 0$ for any countable collection, but $[0,1]$ is uncountable, so the union of all singletons is not a countable union.

Watch Out

Almost surely is not surely

"XnXX_n \to X almost surely" means Pr[{ω:Xn(ω)X(ω)}]=1\Pr[\{\omega : X_n(\omega) \to X(\omega)\}] = 1. There may be a null set (probability-zero set) where convergence fails. This is different from "XnXX_n \to X everywhere." In measure theory, we routinely ignore null sets because they do not affect integrals or probabilities. But you must be careful: a countable union of null sets is still null, but an uncountable union might not be.

Watch Out

Riemann-integrable functions are Lebesgue-integrable, but not vice versa

Every Riemann-integrable function on $[a, b]$ is also Lebesgue-integrable, and the integrals agree. But the Lebesgue integral handles more functions (like $\mathbf{1}_{\mathbb{Q}}$, which is Lebesgue-integrable with integral 0 but not Riemann-integrable). The Lebesgue integral also has better convergence theorems (MCT, DCT), which is the real reason we use it.

Watch Out

Borel vs. Lebesgue sigma-algebra

The Borel sigma-algebra $\mathcal{B}(\mathbb{R})$ is generated by open sets. The Lebesgue sigma-algebra $\mathcal{L}(\mathbb{R})$ is the completion of $\mathcal{B}$ with respect to Lebesgue measure (adding all subsets of null sets). $\mathcal{L}$ is strictly larger: it contains non-Borel sets. For probability, the Borel sigma-algebra is almost always sufficient, and most textbooks work exclusively with $\mathcal{B}$.

Summary

  • A probability space $(\Omega, \mathcal{F}, \mathbb{P})$ consists of a sample space, sigma-algebra, and probability measure
  • Sigma-algebras restrict which events can have probabilities (to avoid paradoxes)
  • Random variables are measurable functions; measurability ensures $\{X \leq t\} \in \mathcal{F}$
  • Expectation is the Lebesgue integral: $\mathbb{E}[X] = \int X \, d\mathbb{P}$
  • MCT: for $0 \leq f_n \uparrow f$, the integral of the limit equals the limit of the integrals
  • Fatou: $\int \liminf f_n \, d\mu \leq \liminf \int f_n \, d\mu$ (mass can disappear, not appear)
  • DCT: if $|f_n| \leq g$ ($g$ integrable) and $f_n \to f$, then $\int f_n \, d\mu \to \int f \, d\mu$
  • DCT is the tool for interchanging limits and expectations
  • Measure theory is necessary for: conditional expectation, Radon-Nikodym, martingales, and any rigorous asymptotic argument

Exercises

Exercise (Core)

Problem

Let $f_n(x) = n \cdot x^n$ on $[0, 1]$ with Lebesgue measure. Compute $\lim_{n \to \infty} \int_0^1 f_n(x) \, dx$ and compare it with $\int_0^1 \lim_{n \to \infty} f_n(x) \, dx$. Does DCT apply here?

Exercise (Core)

Problem

Let $X_n$ be a sequence of random variables with $\mathbb{E}[X_n] = 1$ and $X_n \to X$ almost surely. Can you conclude that $\mathbb{E}[X] = 1$? State precisely what additional condition you need.

Exercise (Advanced)

Problem

Prove that if $X_n \to X$ in $L^1$ (i.e., $\mathbb{E}[|X_n - X|] \to 0$), then $\mathbb{E}[X_n] \to \mathbb{E}[X]$. Conversely, give an example where $\mathbb{E}[X_n] \to \mathbb{E}[X]$ but $X_n$ does not converge to $X$ in $L^1$.

Exercise (Advanced)

Problem

Use the monotone convergence theorem to prove that for any non-negative random variable $X$: $\mathbb{E}[X] = \int_0^\infty \Pr[X > t] \, dt$. This is the "layer cake" representation of the expectation.

Advanced: Frostman's Lemma and Capacity

Frostman's lemma connects measure theory to geometric set theory and potential theory. It characterizes the "size" of a set via the measures it can support.

Frostman's lemma. A Borel set $E \subseteq \mathbb{R}^d$ has Hausdorff dimension $\geq s$ if and only if there exists a non-zero Borel measure $\mu$ supported on $E$ such that $\mu(B(x, r)) \leq C r^s$ for all $x$ and all $r > 0$.

The measure $\mu$ is called a Frostman measure. The energy integral $I_s(\mu) = \iint |x - y|^{-s} \, d\mu(x) \, d\mu(y)$ is finite for such measures when $s < \dim_H(E)$.

This result matters in probability because it connects Hausdorff dimension (a geometric measure of set complexity) to the existence of measures with controlled local growth. It appears in the theory of random sets, Brownian motion (the range of a Brownian motion in $\mathbb{R}^d$ has Hausdorff dimension $\min(2, d)$, proved using Frostman-type energy arguments), and fractal geometry.

References

Canonical:

  • Billingsley, Probability and Measure (3rd ed., 1995), Chapters 1-5
  • Durrett, Probability: Theory and Examples (5th ed., 2019), Chapters 1-2
  • Rudin, Real and Complex Analysis (3rd ed., 1987), Chapters 1-3

Potential theory and capacity:

  • Mattila, Geometry of Sets and Measures in Euclidean Spaces (1995), Chapter 8. The definitive reference for Frostman's lemma and energy integrals.
  • Kahane, Some Random Series of Functions (2nd ed., 1985), Chapter 10

Current:

  • Tao, An Introduction to Measure Theory (2011)
  • Schilling, Measures, Integrals and Martingales (2nd ed., 2017)


Last reviewed: April 2026
