
Mathematical Infrastructure

Modes of Convergence of Random Variables

The four standard senses in which a sequence of random variables can converge: almost surely, in probability, in Lp, and in distribution. Their hierarchy, the strict counterexamples that separate them, and the supporting tools (Slutsky, continuous mapping, Skorokhod).


Why This Matters

When a learning theorem says "the empirical risk converges to the true risk," that statement has four possible meanings, each with different strength and different proof techniques. The law of large numbers gives almost-sure convergence (strong) or in-probability convergence (weak). The central limit theorem gives convergence in distribution. Stochastic gradient descent's convergence guarantees come in all four flavors depending on the assumptions you accept.

You cannot read asymptotic statistics, stochastic approximation, or any modern generalization bound without keeping these four modes straight. This page is the reference: definitions, the strict implication hierarchy, the counterexamples that separate them, and the standard tools used to move between them.

The Four Modes

Throughout, $(X_n)_{n \geq 1}$ is a sequence of random variables on a common probability space $(\Omega, \mathcal{F}, P)$, and $X$ is a random variable on the same space (except for convergence in distribution, which only needs the $X_n$ and $X$ to share a state space).

Definition

Almost Sure Convergence

$X_n \to X$ almost surely if $P\!\left(\{\omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\}\right) = 1$. Equivalently, the set of $\omega$ where the pointwise limit fails has probability zero.

Definition

Convergence in Probability

$X_n \to X$ in probability if for every $\varepsilon > 0$, $\lim_{n \to \infty} P(|X_n - X| > \varepsilon) = 0$. The probability of being far from $X$ shrinks to zero, but no single trajectory $\omega \mapsto (X_n(\omega))_n$ is required to converge.

Definition

Convergence in Lp

For $p \geq 1$, $X_n \to X$ in $L^p$ if $\mathbb{E}|X_n|^p < \infty$, $\mathbb{E}|X|^p < \infty$, and $\lim_{n \to \infty} \mathbb{E}|X_n - X|^p = 0$. The case $p = 2$ is mean-square convergence; the case $p = 1$ is convergence in mean.

Definition

Convergence in Distribution

$X_n \to X$ in distribution (or weakly) if $\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$ at every point $x$ where $F_X$ is continuous, where $F_Y(x) = P(Y \leq x)$. For $\mathbb{R}^d$-valued variables: $\mathbb{E}[g(X_n)] \to \mathbb{E}[g(X)]$ for every bounded continuous $g$.

The first three modes use $|X_n - X|$ and so require the sequence and limit to be on the same probability space. Convergence in distribution is a statement about laws (distributions), not about $\omega$-by-$\omega$ behavior; the variables can live on entirely different probability spaces.
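Convergence in distribution can be watched numerically through the CDF definition. The sketch below (a Monte Carlo illustration; the Exp(1) base distribution, sample sizes, and grid are arbitrary choices) standardizes sample means of exponential variables and measures the largest gap between their empirical CDF and the standard normal CDF on a grid. The CLT says this gap should shrink as $n$ grows.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def standardized_mean_samples(n, reps=20_000):
    """Draw `reps` realizations of sqrt(n) * (mean of n Exp(1) draws - 1)."""
    x = rng.exponential(1.0, size=(reps, n))
    return sqrt(n) * (x.mean(axis=1) - 1.0)

def normal_cdf(t):
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

grid = np.linspace(-3.0, 3.0, 61)

def kolmogorov_gap(samples):
    """Largest gap between the empirical CDF and the N(0,1) CDF over the grid."""
    emp = (samples[None, :] <= grid[:, None]).mean(axis=1)
    return float(np.abs(emp - np.array([normal_cdf(t) for t in grid])).max())

gap_small = kolmogorov_gap(standardized_mean_samples(2))    # visibly non-normal
gap_large = kolmogorov_gap(standardized_mean_samples(500))  # close to normal
```

Only the marginal laws enter this comparison, which is exactly why convergence in distribution is the weakest mode: the joint behavior of the samples across $n$ never appears.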

The Hierarchy

Theorem

Implications Among Modes of Convergence

Statement

The following implications hold and are strict (each non-implication has a counterexample below):

  • $X_n \xrightarrow{a.s.} X \implies X_n \xrightarrow{P} X$
  • $X_n \xrightarrow{L^p} X \implies X_n \xrightarrow{P} X$ for $p \geq 1$
  • $X_n \xrightarrow{P} X \implies X_n \xrightarrow{d} X$
  • $X_n \xrightarrow{L^q} X \implies X_n \xrightarrow{L^p} X$ for $q \geq p \geq 1$

The reverse implications all fail in general. The exception: $X_n \xrightarrow{d} c$ for a constant $c$ implies $X_n \xrightarrow{P} c$, because convergence in distribution to a point mass collapses to convergence in probability.

Intuition

Picture the hierarchy as a diagram: a.s. and $L^p$ both sit above convergence in probability, which sits above convergence in distribution. Almost sure and $L^p$ are incomparable: a.s. constrains the pointwise behavior; $L^p$ constrains the average size. A sequence can be a.s. convergent without bounded moments, and $L^p$ convergent without pointwise convergence. Convergence in distribution is the weakest because it ignores joint behavior across $n$ entirely.

Proof Sketch

a.s. $\Rightarrow$ in probability: If $X_n \to X$ a.s., the set $A_\varepsilon = \{|X_n - X| > \varepsilon \text{ infinitely often}\}$ has probability zero. The events $B_n = \bigcup_{m \geq n} \{|X_m - X| > \varepsilon\}$ decrease to $A_\varepsilon$, so continuity from above of $P$ gives $P(B_n) \to 0$, and $P(|X_n - X| > \varepsilon) \leq P(B_n)$.

$L^p$ $\Rightarrow$ in probability: Markov's inequality gives $P(|X_n - X| > \varepsilon) \leq \mathbb{E}|X_n - X|^p / \varepsilon^p \to 0$.

In probability $\Rightarrow$ in distribution: For $g$ bounded and continuous, split $\mathbb{E}[g(X_n)] - \mathbb{E}[g(X)]$ into the contribution where $|X_n - X| \leq \delta$ (small by uniform continuity of $g$ on a bounded set) and where $|X_n - X| > \delta$ (small by convergence in probability and boundedness of $g$).

$L^q$ $\Rightarrow$ $L^p$ for $q \geq p$: Jensen's inequality applied to the convex function $t \mapsto t^{q/p}$ gives $\left(\mathbb{E}|X_n - X|^p\right)^{q/p} \leq \mathbb{E}|X_n - X|^q$ (Lyapunov's inequality).

Why It Matters

The hierarchy tells you which mode to prove and which to use. To prove the strongest claim (a.s. convergence), use Borel-Cantelli or martingale arguments. To use a convergence guarantee in a CLT-style asymptotic argument, you only need convergence in distribution. Many generalization bounds give convergence in probability with explicit rate; upgrading to a.s. requires summability of the rate (first Borel-Cantelli).

Failure Mode

A common slip is treating "$X_n \to X$ in distribution" as if it implied $X_n - X \to 0$ in any sense. It does not. Take $X \sim \mathcal{N}(0, 1)$ and $X_n = -X$ for all $n$. Then $X_n \xrightarrow{d} X$ trivially (each $X_n$ already has the same distribution as $X$), but $|X_n - X| = 2|X|$ never shrinks. Convergence in distribution is about marginal laws, not about pathwise closeness.
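This failure mode is easy to see numerically. The sketch below (a Monte Carlo illustration; the sample size and grid are arbitrary) draws samples of $X$, sets $X_n = -X$, and checks that the two empirical CDFs agree while the pathwise distance stays near $2\,\mathbb{E}|X| = 2\sqrt{2/\pi} \approx 1.60$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200_000)   # samples of X ~ N(0, 1)
x_n = -x                           # X_n = -X: identical marginal distribution

# Marginal laws agree: the two empirical CDFs are close at every grid point.
grid = np.linspace(-2.0, 2.0, 41)
cdf_gap = float(np.abs(
    (x[None, :] <= grid[:, None]).mean(axis=1)
    - (x_n[None, :] <= grid[:, None]).mean(axis=1)
).max())

# But the pathwise distance |X_n - X| = 2|X| does not shrink at all.
mean_pathwise_gap = float(np.abs(x_n - x).mean())  # ≈ 2 * sqrt(2/pi) ≈ 1.60
```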

The Strict Counterexamples

Memorizing one counterexample for each non-implication is the cheapest way to keep the hierarchy straight.

In probability does not imply almost surely (the typewriter sequence). Take $\Omega = [0, 1]$ with Lebesgue measure. Index a sequence of intervals by $(k, j)$ where $k \geq 0$ and $1 \leq j \leq 2^k$: let $I_{k, j} = [(j-1)/2^k, j/2^k]$. Order them lexicographically as $X_1 = \mathbf{1}_{[0, 1]}$, $X_2 = \mathbf{1}_{[0, 1/2]}$, $X_3 = \mathbf{1}_{[1/2, 1]}$, $X_4 = \mathbf{1}_{[0, 1/4]}$, and so on. Then $P(X_n = 1) =$ length of the indicator's support $\to 0$, so $X_n \xrightarrow{P} 0$. But for every $\omega \in [0, 1]$, $X_n(\omega) = 1$ infinitely often (every $\omega$ falls inside infinitely many dyadic intervals), so $X_n(\omega) \not\to 0$. The pointwise limit fails everywhere.
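The typewriter sequence can be enumerated directly. The sketch below (self-contained; the choice of $\omega = 0.3$ and of ten dyadic levels is arbitrary) records the support length of each indicator, which plays the role of $P(X_n = 1)$, while tracking how often the fixed point $\omega$ is hit.

```python
def typewriter_intervals(levels):
    """Yield the intervals [j/2^k, (j+1)/2^k] in typewriter order, k = 0..levels-1."""
    for k in range(levels):
        for j in range(2 ** k):
            yield (j / 2 ** k, (j + 1) / 2 ** k)

omega = 0.3                 # any non-dyadic point of [0, 1] behaves the same way
lengths = []                # P(X_n = 1) for successive n
hits = 0                    # number of times X_n(omega) = 1
for (a, b) in typewriter_intervals(10):
    lengths.append(b - a)
    hits += int(a <= omega <= b)

# lengths[-1] = 2**-9: P(X_n = 1) -> 0, so X_n -> 0 in probability.
# hits = 10: omega lands in exactly one interval per level, so X_n(omega) = 1
# keeps recurring and the pointwise limit fails.
```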

In probability does not imply $L^p$ (the bump sequence). Take $\Omega = [0, 1]$ with Lebesgue measure and $X_n(\omega) = n \cdot \mathbf{1}_{[0, 1/n]}(\omega)$. Then $P(X_n > \varepsilon) = 1/n \to 0$ for any fixed $\varepsilon > 0$, so $X_n \xrightarrow{P} 0$. But $\mathbb{E}[X_n] = 1$ for every $n$, so $X_n \not\to 0$ in $L^1$.
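The bump sequence can be checked on a fine grid standing in for $[0, 1]$ (a sketch; the grid resolution and the values of $n$ are arbitrary): the probability of being away from $0$ shrinks like $1/n$, while the mean stays pinned at $1$.

```python
import numpy as np

# Bump sequence X_n = n * 1_{[0, 1/n]} on ([0, 1], Lebesgue measure),
# evaluated on a uniform grid approximating omega in [0, 1].
omega = np.linspace(0.0, 1.0, 1_000_001)

probs_far, means = [], []
for n in (10, 100, 1000):
    xn = n * (omega <= 1 / n)
    probs_far.append(float((xn > 0.5).mean()))  # ≈ P(|X_n - 0| > 1/2) = 1/n
    means.append(float(xn.mean()))              # ≈ E[X_n] = 1 for every n
```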

Almost surely does not imply $L^p$. Same sequence: $X_n \to 0$ pointwise on $(0, 1]$ (which has full measure), so $X_n \xrightarrow{a.s.} 0$, but $\mathbb{E}[X_n] = 1$.

$L^p$ does not imply almost surely. The typewriter sequence above is in $L^1$ ($\mathbb{E}[X_n] =$ interval length $\to 0$), so it converges to 0 in $L^1$ but not almost surely.

In distribution does not imply in probability. Take $X \sim \mathcal{N}(0, 1)$ and $X_n = -X$. The $X_n$ have the same distribution as $X$, so $X_n \xrightarrow{d} X$. But $|X_n - X| = 2|X|$, which does not converge to 0 in probability.

The pattern: the four counterexamples use just two constructions (typewriter and bump). Memorize those two and the rest follows.

Tools That Move Between Modes

Three tools let you upgrade a convergence claim to a stronger mode or transport convergence through a transformation.

Theorem

Continuous Mapping Theorem

Statement

Let $g : \mathbb{R}^d \to \mathbb{R}^k$ be continuous on a set $A$ with $P(X \in A) = 1$. Then:

  • $X_n \xrightarrow{a.s.} X \implies g(X_n) \xrightarrow{a.s.} g(X)$
  • $X_n \xrightarrow{P} X \implies g(X_n) \xrightarrow{P} g(X)$
  • $X_n \xrightarrow{d} X \implies g(X_n) \xrightarrow{d} g(X)$

The same convergence mode is preserved through any continuous transformation (continuous on the support of $X$).

Intuition

Continuous functions preserve closeness. If $X_n$ is close to $X$ in some sense, $g(X_n)$ is close to $g(X)$ in the same sense, because $g$ does not introduce discontinuities. The "continuous on $A$ with $P(X \in A) = 1$" hedge handles cases like $g(x) = 1/x$ when $X$ puts no mass on $0$.

Proof Sketch

For a.s. convergence, continuity at $X(\omega)$ for $\omega$ in the probability-1 set gives the result by the sequential characterization of continuity. For convergence in probability, fix $\varepsilon$ and use that $\{|g(X_n) - g(X)| > \varepsilon\}$ implies either $|X_n - X|$ is large or $X$ is in a region where $g$ has variation at least $\varepsilon$ over a small ball; both events have shrinking probability. For convergence in distribution, use the $\mathbb{E}[h(X_n)] \to \mathbb{E}[h(X)]$ formulation for $h$ bounded continuous and apply it to $h = f \circ g$ for any bounded continuous $f$.

Why It Matters

This is what lets you take a CLT statement like $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ and conclude $n(\bar{X}_n - \mu)^2 \xrightarrow{d} \sigma^2 \chi_1^2$ by applying the continuous map $g(x) = x^2$. Most "delta method" arguments in asymptotic statistics use this theorem under the hood.
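The chi-square conclusion can be sanity-checked by simulation (a Monte Carlo sketch; the Uniform(0,1) base distribution and sample sizes are arbitrary choices): both the mean of $n(\bar{X}_n - \mu)^2$ and a tail probability should match the $\sigma^2 \chi_1^2$ limit.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 50_000
mu, sigma2 = 0.5, 1.0 / 12.0       # mean and variance of Uniform(0, 1)

xbar = rng.random((reps, n)).mean(axis=1)
t = n * (xbar - mu) ** 2           # g(x) = x^2 applied to sqrt(n) * (xbar - mu)

mean_t = float(t.mean())                   # ≈ sigma2 * E[chi^2_1] = sigma2 ≈ 0.0833
tail = float((t > sigma2 * 3.841).mean())  # ≈ P(chi^2_1 > 3.841) ≈ 0.05
```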

Failure Mode

Discontinuity at a point of positive limit-mass breaks the theorem. If $g(x) = \mathbf{1}_{x \geq 0}$ and $X_n = -1/n$ deterministically, then $X_n \xrightarrow{a.s.} 0$ but $g(X_n) = 0$ while $g(0) = 1$. The limit random variable $X = 0$ falls on the discontinuity set of $g$, which has $P(X \in \{0\}) = 1 \neq 0$.

Slutsky's theorem (defined on the asymptotic-statistics page) handles addition and multiplication when one sequence converges in distribution and the other in probability to a constant. The combination "Slutsky + continuous mapping" covers most asymptotic distribution arguments in the wild.

Theorem

Skorokhod Representation

Statement

If $X_n \xrightarrow{d} X$ in a separable metric space, there exists a probability space $(\tilde\Omega, \tilde{\mathcal{F}}, \tilde{P})$ and random variables $\tilde X_n, \tilde X$ on it such that:

  • $\tilde X_n \stackrel{d}{=} X_n$ for every $n$ (same distribution),
  • $\tilde X \stackrel{d}{=} X$,
  • $\tilde X_n \xrightarrow{a.s.} \tilde X$ on $\tilde\Omega$.

Intuition

Convergence in distribution looks weak because it ignores the joint structure of the $X_n$. Skorokhod says you can always re-couple the sequence on a new probability space to make pointwise convergence true, without changing any marginal distribution. The original $X_n$ may not converge a.s. on the original space, but a distributional copy of them does on a possibly larger space.

Proof Sketch

On $\tilde\Omega = (0, 1)$ with Lebesgue measure, define $\tilde X_n$ as the quantile transform $F_{X_n}^{-1}(U)$ where $U$ is uniform on $(0, 1)$. Convergence of $F_{X_n}$ to $F_X$ at continuity points implies pointwise convergence of the inverse functions almost everywhere on $(0, 1)$. Generalizing to $\mathbb{R}^d$ and separable metric spaces requires more care, but the construction is the same in spirit.
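The quantile construction is easy to carry out for a family with closed-form quantile functions. The sketch below (an illustration, not part of the proof; the rates $1 + 1/n$ are an arbitrary choice) couples $X_n \sim \mathrm{Exp}(1 + 1/n)$ and $X \sim \mathrm{Exp}(1)$ through one shared uniform draw, making the convergence pointwise, and contrasts this with independent copies of the same marginals, which stay far apart pathwise.

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.random(100_000)            # the shared randomness U ~ Uniform(0, 1)

def exp_quantile(u, rate):
    """Inverse CDF of Exp(rate): F^{-1}(u) = -log(1 - u) / rate."""
    return -np.log1p(-u) / rate

x_tilde = exp_quantile(u, 1.0)     # the coupled limit
sup_gaps = [float(np.abs(exp_quantile(u, 1 + 1 / n) - x_tilde).max())
            for n in (1, 10, 100, 1000)]   # shrinks: a.s. convergence under the coupling

# Independent copies with the very same marginal laws do not get pathwise close:
indep_gap = float(np.abs(exp_quantile(rng.random(100_000), 1.001) - x_tilde).mean())
```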

Why It Matters

Skorokhod converts convergence-in-distribution problems into pointwise problems. Many proofs in weak convergence theory (Donsker's invariance principle, weak convergence of empirical processes) become easier once you can pretend the limit is pointwise. It also exposes the looseness of convergence in distribution: the joint structure across $n$ is unspecified, and you are free to pick a convenient coupling.

Failure Mode

The construction relies on separability of the underlying space. For non-separable spaces (e.g., $L^\infty$), Skorokhod-type representations exist only under additional hypotheses. For finite-dimensional $\mathbb{R}^d$ this is never an issue.

Special Cases and Extensions

Sub-sequences. Convergence in probability implies that some sub-sequence converges almost surely. This is the standard upgrade trick: to prove a sub-sequential a.s. claim, get the rate of in-probability convergence and extract a sub-sequence whose tail probabilities are summable, then apply the first Borel-Cantelli lemma.
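The upgrade trick can be written out in one line. Suppose, for each fixed $\varepsilon > 0$, the in-probability rate is $P(|X_n - X| > \varepsilon) \leq C_\varepsilon / n$ (a hypothetical polynomial rate, chosen for illustration). Along the sub-sequence $n_k = k^2$:

```latex
\sum_{k \geq 1} P\!\left(|X_{n_k} - X| > \varepsilon\right)
  \;\leq\; \sum_{k \geq 1} \frac{C_\varepsilon}{k^2}
  \;=\; \frac{C_\varepsilon \pi^2}{6} \;<\; \infty,
```

so the first Borel-Cantelli lemma gives $|X_{n_k} - X| > \varepsilon$ only finitely often, almost surely; intersecting over $\varepsilon = 1/m$, $m \in \mathbb{N}$, yields $X_{n_k} \to X$ almost surely.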

Uniform integrability. Convergence in probability plus uniform integrability of $\{|X_n|^p\}$ implies $L^p$ convergence. This is the canonical fix when you have in-probability convergence and want to upgrade to $L^p$ for moment-of-the-limit arguments.

Convergence to a constant. When the limit is deterministic, the bottom of the hierarchy collapses: convergence in distribution to a constant equals convergence in probability to that constant. This is why "consistent estimator" can be defined as in-probability convergence to the truth without ambiguity.

Common Confusions

Watch Out

Convergence in distribution does not constrain the joint law

$X_n \xrightarrow{d} X$ tells you the marginal distribution of $X_n$ converges to that of $X$. It says nothing about the joint distribution of $(X_n, X)$ or the correlations across different $n$. This is why convergence in distribution does not, in general, imply $X_n - X \to 0$ in any sense, and why Skorokhod has to construct a new coupling rather than use the original.

Watch Out

Almost sure and L^p are not comparable

There is no implication in either direction between a.s. convergence and $L^p$ convergence in general. A sequence can be a.s. convergent with non-vanishing $L^1$ distance to its limit (the bump example), or $L^p$ convergent without pointwise convergence (the typewriter). The right combination is "a.s. + dominating integrable function" (dominated convergence) or "in probability + uniform integrability" (Vitali). Without a uniformity hypothesis, you cannot trade one for the other.

Watch Out

The exception for constant limits is real and useful

"Convergence in distribution implies convergence in probability" is false for a general limit $X$, but true when $X$ is a constant. This single exception is what licenses the Slutsky-style decomposition $X_n = (X_n - c) + c$ when $X_n \xrightarrow{d} c$: you upgrade to in-probability for the constant part, and use this fact silently in nearly every consistency proof of an estimator.

Summary

  • Four modes of convergence: almost sure, in probability, in $L^p$, in distribution.
  • a.s. and $L^p$ each imply in probability; in probability implies in distribution; $L^q$ implies $L^p$ for $q \geq p$.
  • a.s. and $L^p$ are incomparable; the typewriter and bump sequences separate every other pair.
  • The continuous mapping theorem preserves a.s., in-probability, and in-distribution convergence; Skorokhod re-couples convergence in distribution into a.s. convergence on a new space.
  • Sub-sequential a.s. convergence and uniform integrability are the standard tools for upgrading convergence in probability.
  • Convergence in distribution to a constant collapses to convergence in probability to that constant.

Exercises

ExerciseCore

Problem

Let $X_n$ be independent with $P(X_n = n^2) = 1/n^2$ and $P(X_n = 0) = 1 - 1/n^2$. Determine which modes of convergence hold for $X_n \to 0$.

ExerciseAdvanced

Problem

Let $U \sim \text{Uniform}(0, 1)$ and define $X_n = \mathbf{1}\{U \leq 1/2 + (-1)^n / n\}$ for $n \geq 1$. Determine whether $X_n$ converges in distribution, in probability, or almost surely to a Bernoulli$(1/2)$ random variable.

References

Standard graduate texts:

  • Billingsley, "Probability and Measure" (3rd edition, Wiley, 1995), Sections 25-29
  • Durrett, "Probability: Theory and Examples" (5th edition, Cambridge, 2019), Sections 2.3, 3.2
  • Williams, "Probability with Martingales" (Cambridge, 1991), Chapters 7, 13
  • Resnick, "A Probability Path" (Birkhauser, 1999), Chapters 6-8

Weak convergence and Skorokhod:

  • Billingsley, "Convergence of Probability Measures" (2nd edition, Wiley, 1999), Chapter 1
  • van der Vaart, "Asymptotic Statistics" (Cambridge, 1998), Chapters 2-3

Reference handbook:

  • Folland, "Real Analysis" (2nd edition, Wiley, 1999), Chapter 2 covers the analogous modes for measurable functions


Last reviewed: April 18, 2026
