
Mathematical Infrastructure

Modes of Convergence of Random Variables

The four standard senses in which a sequence of random variables can converge: almost surely, in probability, in Lp, and in distribution. Their hierarchy, the strict counterexamples that separate them, and the supporting tools (Slutsky, continuous mapping, Skorokhod).


Why This Matters

When a learning theorem says "the empirical risk converges to the true risk," that statement has four possible meanings, each with different strength and different proof techniques. The law of large numbers gives almost-sure convergence (strong) or in-probability convergence (weak). The central limit theorem gives convergence in distribution. Stochastic gradient descent's convergence guarantees come in all four flavors depending on the assumptions you accept.

You cannot read asymptotic statistics, stochastic approximation, or any modern generalization bound without keeping these four modes straight. This page is the reference: definitions, the strict implication hierarchy, the counterexamples that separate them, and the standard tools used to move between them.

The Four Modes

Throughout, $(X_n)_{n \geq 1}$ is a sequence of random variables on a common probability space $(\Omega, \mathcal{F}, P)$, and $X$ is a random variable on the same space (except for convergence in distribution, which only needs the $X_n$ and $X$ to share a state space).

Definition

Almost Sure Convergence

$X_n \to X$ almost surely if $P\!\left(\{\omega : \lim_{n \to \infty} X_n(\omega) = X(\omega)\}\right) = 1$. Equivalently, the set of $\omega$ where the pointwise limit fails has probability zero.

Definition

Convergence in Probability

$X_n \to X$ in probability if for every $\varepsilon > 0$, $\lim_{n \to \infty} P(|X_n - X| > \varepsilon) = 0$. The probability of being far from $X$ shrinks to zero, but no single trajectory $\omega \mapsto (X_n(\omega))_n$ is required to converge.

Definition

Convergence in Lp

For $p \geq 1$, $X_n \to X$ in $L^p$ if $\mathbb{E}|X_n|^p < \infty$, $\mathbb{E}|X|^p < \infty$, and $\lim_{n \to \infty} \mathbb{E}|X_n - X|^p = 0$. The case $p = 2$ is mean-square convergence; the case $p = 1$ is convergence in mean.

Definition

Convergence in Distribution

$X_n \to X$ in distribution (or weakly) if $\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$ at every point $x$ where $F_X$ is continuous, where $F_Y(x) = P(Y \leq x)$. For $\mathbb{R}^d$-valued variables: $\mathbb{E}[g(X_n)] \to \mathbb{E}[g(X)]$ for every bounded continuous $g$.

The first three modes use $|X_n - X|$ and so require the sequence and limit to be on the same probability space. Convergence in distribution is a statement about laws (distributions), not about $\omega$-by-$\omega$ behavior; the variables can live on entirely different probability spaces.
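Convergence in distribution can be watched numerically through the CDF definition. The sketch below (a Monte Carlo illustration; the Exp(1) base distribution, sample sizes, and grid are arbitrary choices) standardizes sample means of exponential variables and measures the largest gap between their empirical CDF and the standard normal CDF on a grid. The CLT says this gap should shrink as $n$ grows.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def standardized_mean_samples(n, reps=20_000):
    """Draw `reps` realizations of sqrt(n) * (mean of n Exp(1) draws - 1)."""
    x = rng.exponential(1.0, size=(reps, n))
    return sqrt(n) * (x.mean(axis=1) - 1.0)

def normal_cdf(t):
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

grid = np.linspace(-3.0, 3.0, 61)

def kolmogorov_gap(samples):
    """Largest gap between the empirical CDF and the N(0,1) CDF over the grid."""
    emp = (samples[None, :] <= grid[:, None]).mean(axis=1)
    return float(np.abs(emp - np.array([normal_cdf(t) for t in grid])).max())

gap_small = kolmogorov_gap(standardized_mean_samples(2))    # visibly non-normal
gap_large = kolmogorov_gap(standardized_mean_samples(500))  # close to normal
```

Only the marginal laws enter this comparison, which is exactly why convergence in distribution is the weakest mode: the joint behavior of the samples across $n$ never appears.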

The Hierarchy

Theorem

Implications Among Modes of Convergence

Statement

The following implications hold and are strict (each non-implication has a counterexample below):

  • $X_n \xrightarrow{a.s.} X \implies X_n \xrightarrow{P} X$
  • $X_n \xrightarrow{L^p} X \implies X_n \xrightarrow{P} X$ for $p \geq 1$
  • $X_n \xrightarrow{P} X \implies X_n \xrightarrow{d} X$
  • $X_n \xrightarrow{L^q} X \implies X_n \xrightarrow{L^p} X$ for $q \geq p \geq 1$

The reverse implications all fail in general. The exception: $X_n \xrightarrow{d} c$ for a constant $c$ implies $X_n \xrightarrow{P} c$, because convergence in distribution to a point mass collapses to convergence in probability.

Intuition

Picture the hierarchy as a diagram: a.s. and $L^p$ both sit above convergence in probability, which sits above convergence in distribution. Almost sure and $L^p$ are incomparable: a.s. constrains the pointwise behavior; $L^p$ constrains the average size. A sequence can be a.s. convergent without bounded moments, and $L^p$ convergent without pointwise convergence. Convergence in distribution is the weakest because it ignores joint behavior across $n$ entirely.

Proof Sketch

a.s. $\Rightarrow$ in probability: If $X_n \to X$ a.s., the set $A_\varepsilon = \{|X_n - X| > \varepsilon \text{ infinitely often}\}$ has probability zero. The events $B_n = \bigcup_{m \geq n} \{|X_m - X| > \varepsilon\}$ decrease to $A_\varepsilon$, so continuity from above of $P$ gives $P(B_n) \to 0$, and $P(|X_n - X| > \varepsilon) \leq P(B_n)$.

$L^p$ $\Rightarrow$ in probability: Markov's inequality gives $P(|X_n - X| > \varepsilon) \leq \mathbb{E}|X_n - X|^p / \varepsilon^p \to 0$.

In probability $\Rightarrow$ in distribution: For $g$ bounded and continuous, split $\mathbb{E}[g(X_n)] - \mathbb{E}[g(X)]$ into the contribution where $|X_n - X| \leq \delta$ (small by uniform continuity of $g$ on a bounded set) and where $|X_n - X| > \delta$ (small by convergence in probability and boundedness of $g$).

$L^q$ $\Rightarrow$ $L^p$ for $q \geq p$: Jensen's inequality applied to the convex function $t \mapsto t^{q/p}$ gives $\left(\mathbb{E}|X_n - X|^p\right)^{q/p} \leq \mathbb{E}|X_n - X|^q$ (Lyapunov's inequality).

Why It Matters

The hierarchy tells you which mode to prove and which to use. To prove the strongest claim (a.s. convergence), use Borel-Cantelli or martingale arguments. To use a convergence guarantee in a CLT-style asymptotic argument, you only need convergence in distribution. Many generalization bounds give convergence in probability with explicit rate; upgrading to a.s. requires summability of the rate (first Borel-Cantelli).

Failure Mode

A common slip is treating "$X_n \to X$ in distribution" as if it implied $X_n - X \to 0$ in any sense. It does not. Take $X \sim \mathcal{N}(0, 1)$ and $X_n = -X$ for all $n$. Then $X_n \xrightarrow{d} X$ trivially (each $X_n$ already has the same distribution as $X$), but $|X_n - X| = 2|X|$ never shrinks. Convergence in distribution is about marginal laws, not about pathwise closeness.
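This failure mode is easy to see numerically. The sketch below (a Monte Carlo illustration; the sample size and grid are arbitrary) draws samples of $X$, sets $X_n = -X$, and checks that the two empirical CDFs agree while the pathwise distance stays near $2\,\mathbb{E}|X| = 2\sqrt{2/\pi} \approx 1.60$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200_000)   # samples of X ~ N(0, 1)
x_n = -x                           # X_n = -X: identical marginal distribution

# Marginal laws agree: the two empirical CDFs are close at every grid point.
grid = np.linspace(-2.0, 2.0, 41)
cdf_gap = float(np.abs(
    (x[None, :] <= grid[:, None]).mean(axis=1)
    - (x_n[None, :] <= grid[:, None]).mean(axis=1)
).max())

# But the pathwise distance |X_n - X| = 2|X| does not shrink at all.
mean_pathwise_gap = float(np.abs(x_n - x).mean())  # ≈ 2 * sqrt(2/pi) ≈ 1.60
```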

The Strict Counterexamples

Memorizing one counterexample for each non-implication is the cheapest way to keep the hierarchy straight.

In probability does not imply almost surely (the typewriter sequence). Take $\Omega = [0, 1]$ with Lebesgue measure. Index a sequence of intervals by $(k, j)$ where $k \geq 0$ and $1 \leq j \leq 2^k$: let $I_{k, j} = [(j-1)/2^k, j/2^k]$. Order them lexicographically as $X_1 = \mathbf{1}_{[0, 1]}$, $X_2 = \mathbf{1}_{[0, 1/2]}$, $X_3 = \mathbf{1}_{[1/2, 1]}$, $X_4 = \mathbf{1}_{[0, 1/4]}$, and so on. Then $P(X_n = 1) =$ length of the indicator's support $\to 0$, so $X_n \xrightarrow{P} 0$. But for every $\omega \in [0, 1]$, $X_n(\omega) = 1$ infinitely often (every $\omega$ falls inside infinitely many dyadic intervals), so $X_n(\omega) \not\to 0$. The pointwise limit fails everywhere.
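The typewriter sequence can be enumerated directly. The sketch below (self-contained; the choice of $\omega = 0.3$ and of ten dyadic levels is arbitrary) records the support length of each indicator, which plays the role of $P(X_n = 1)$, while tracking how often the fixed point $\omega$ is hit.

```python
def typewriter_intervals(levels):
    """Yield the intervals [j/2^k, (j+1)/2^k] in typewriter order, k = 0..levels-1."""
    for k in range(levels):
        for j in range(2 ** k):
            yield (j / 2 ** k, (j + 1) / 2 ** k)

omega = 0.3                 # any non-dyadic point of [0, 1] behaves the same way
lengths = []                # P(X_n = 1) for successive n
hits = 0                    # number of times X_n(omega) = 1
for (a, b) in typewriter_intervals(10):
    lengths.append(b - a)
    hits += int(a <= omega <= b)

# lengths[-1] = 2**-9: P(X_n = 1) -> 0, so X_n -> 0 in probability.
# hits = 10: omega lands in exactly one interval per level, so X_n(omega) = 1
# keeps recurring and the pointwise limit fails.
```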

In probability does not imply $L^p$ (the bump sequence). Take $\Omega = [0, 1]$ with Lebesgue measure and $X_n(\omega) = n \cdot \mathbf{1}_{[0, 1/n]}(\omega)$. Then $P(X_n > \varepsilon) = 1/n \to 0$ for any fixed $\varepsilon > 0$, so $X_n \xrightarrow{P} 0$. But $\mathbb{E}[X_n] = 1$ for every $n$, so $X_n \not\to 0$ in $L^1$.
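The bump sequence can be checked on a fine grid standing in for $[0, 1]$ (a sketch; the grid resolution and the values of $n$ are arbitrary): the probability of being away from $0$ shrinks like $1/n$, while the mean stays pinned at $1$.

```python
import numpy as np

# Bump sequence X_n = n * 1_{[0, 1/n]} on ([0, 1], Lebesgue measure),
# evaluated on a uniform grid approximating omega in [0, 1].
omega = np.linspace(0.0, 1.0, 1_000_001)

probs_far, means = [], []
for n in (10, 100, 1000):
    xn = n * (omega <= 1 / n)
    probs_far.append(float((xn > 0.5).mean()))  # ≈ P(|X_n - 0| > 1/2) = 1/n
    means.append(float(xn.mean()))              # ≈ E[X_n] = 1 for every n
```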

Almost surely does not imply $L^p$. Same sequence: $X_n \to 0$ pointwise on $(0, 1]$ (which has full measure), so $X_n \xrightarrow{a.s.} 0$, but $\mathbb{E}[X_n] = 1$.

$L^p$ does not imply almost surely. The typewriter sequence above is in $L^1$ ($\mathbb{E}[X_n] =$ interval length $\to 0$), so it converges to 0 in $L^1$ but not almost surely.

In distribution does not imply in probability. Take $X \sim \mathcal{N}(0, 1)$ and $X_n = -X$. The $X_n$ have the same distribution as $X$, so $X_n \xrightarrow{d} X$. But $|X_n - X| = 2|X|$, which does not converge to 0 in probability.

The pattern: the four counterexamples use just two constructions (typewriter and bump). Memorize those two and the rest follows.

Tools That Move Between Modes

Three tools let you upgrade a convergence claim to a stronger mode or transport convergence through a transformation.

Theorem

Continuous Mapping Theorem

Statement

Let $g : \mathbb{R}^d \to \mathbb{R}^k$ be continuous on a set $A$ with $P(X \in A) = 1$. Then:

  • $X_n \xrightarrow{a.s.} X \implies g(X_n) \xrightarrow{a.s.} g(X)$
  • $X_n \xrightarrow{P} X \implies g(X_n) \xrightarrow{P} g(X)$
  • $X_n \xrightarrow{d} X \implies g(X_n) \xrightarrow{d} g(X)$

The same convergence mode is preserved through any continuous transformation (continuous on the support of $X$).

Intuition

Continuous functions preserve closeness. If $X_n$ is close to $X$ in some sense, $g(X_n)$ is close to $g(X)$ in the same sense, because $g$ does not introduce discontinuities. The "continuous on $A$ with $P(X \in A) = 1$" hedge handles cases like $g(x) = 1/x$ when $X$ puts no mass on $0$.

Proof Sketch

For a.s. convergence, continuity at $X(\omega)$ for $\omega$ in the probability-1 set gives the result by the sequential characterization of continuity. For convergence in probability, fix $\varepsilon$ and use that $\{|g(X_n) - g(X)| > \varepsilon\}$ implies either $|X_n - X|$ is large or $X$ is in a region where $g$ has variation at least $\varepsilon$ over a small ball; both events have shrinking probability. For convergence in distribution, use the $\mathbb{E}[h(X_n)] \to \mathbb{E}[h(X)]$ formulation for $h$ bounded continuous and apply it to $h = f \circ g$ for any bounded continuous $f$.

Why It Matters

This is what lets you take a CLT statement like $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ and conclude $n(\bar{X}_n - \mu)^2 \xrightarrow{d} \sigma^2 \chi_1^2$ by applying the continuous map $g(x) = x^2$. Most "delta method" arguments in asymptotic statistics use this theorem under the hood.
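The chi-square conclusion can be sanity-checked by simulation (a Monte Carlo sketch; the Uniform(0,1) base distribution and sample sizes are arbitrary choices): both the mean of $n(\bar{X}_n - \mu)^2$ and a tail probability should match the $\sigma^2 \chi_1^2$ limit.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 50_000
mu, sigma2 = 0.5, 1.0 / 12.0       # mean and variance of Uniform(0, 1)

xbar = rng.random((reps, n)).mean(axis=1)
t = n * (xbar - mu) ** 2           # g(x) = x^2 applied to sqrt(n) * (xbar - mu)

mean_t = float(t.mean())                   # ≈ sigma2 * E[chi^2_1] = sigma2 ≈ 0.0833
tail = float((t > sigma2 * 3.841).mean())  # ≈ P(chi^2_1 > 3.841) ≈ 0.05
```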

Failure Mode

Discontinuity at a point of positive limit-mass breaks the theorem. If $g(x) = \mathbf{1}_{x \geq 0}$ and $X_n = -1/n$ deterministically, then $X_n \xrightarrow{a.s.} 0$ but $g(X_n) = 0$ while $g(0) = 1$. The limit random variable $X = 0$ falls on the discontinuity set of $g$, which has $P(X \in \{0\}) = 1 \neq 0$.

Slutsky's theorem (defined on the asymptotic-statistics page) handles addition and multiplication when one sequence converges in distribution and the other in probability to a constant. The combination "Slutsky + continuous mapping" covers most asymptotic distribution arguments in the wild.

Theorem

Skorokhod Representation

Statement

If $X_n \xrightarrow{d} X$ in a separable metric space, there exists a probability space $(\tilde\Omega, \tilde{\mathcal{F}}, \tilde{P})$ and random variables $\tilde X_n, \tilde X$ on it such that:

  • $\tilde X_n \stackrel{d}{=} X_n$ for every $n$ (same distribution),
  • $\tilde X \stackrel{d}{=} X$,
  • $\tilde X_n \xrightarrow{a.s.} \tilde X$ on $\tilde\Omega$.

Intuition

Convergence in distribution looks weak because it ignores the joint structure of the $X_n$. Skorokhod says you can always re-couple the sequence on a new probability space to make pointwise convergence true, without changing any marginal distribution. The original $X_n$ may not converge a.s. on the original space, but a distributional copy of them does on a possibly larger space.

Proof Sketch

On $\tilde\Omega = (0, 1)$ with Lebesgue measure, define $\tilde X_n$ as the quantile transform $F_{X_n}^{-1}(U)$ where $U$ is uniform on $(0, 1)$. Convergence of $F_{X_n}$ to $F_X$ at continuity points implies pointwise convergence of the inverse functions almost everywhere on $(0, 1)$. Generalizing to $\mathbb{R}^d$ and separable metric spaces requires more care, but the construction is the same in spirit.
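The quantile construction is easy to carry out for a family with closed-form quantile functions. The sketch below (an illustration, not part of the proof; the rates $1 + 1/n$ are an arbitrary choice) couples $X_n \sim \mathrm{Exp}(1 + 1/n)$ and $X \sim \mathrm{Exp}(1)$ through one shared uniform draw, making the convergence pointwise, and contrasts this with independent copies of the same marginals, which stay far apart pathwise.

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.random(100_000)            # the shared randomness U ~ Uniform(0, 1)

def exp_quantile(u, rate):
    """Inverse CDF of Exp(rate): F^{-1}(u) = -log(1 - u) / rate."""
    return -np.log1p(-u) / rate

x_tilde = exp_quantile(u, 1.0)     # the coupled limit
sup_gaps = [float(np.abs(exp_quantile(u, 1 + 1 / n) - x_tilde).max())
            for n in (1, 10, 100, 1000)]   # shrinks: a.s. convergence under the coupling

# Independent copies with the very same marginal laws do not get pathwise close:
indep_gap = float(np.abs(exp_quantile(rng.random(100_000), 1.001) - x_tilde).mean())
```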

Why It Matters

Skorokhod converts convergence-in-distribution problems into pointwise problems. Many proofs in weak convergence theory (Donsker's invariance principle, weak convergence of empirical processes) become easier once you can pretend the limit is pointwise. It also exposes the looseness of convergence in distribution: the joint structure across $n$ is unspecified, and you are free to pick a convenient coupling.

Failure Mode

The construction relies on separability of the underlying space. For non-separable spaces (e.g., $L^\infty$), Skorokhod-type representations exist only under additional hypotheses. For finite-dimensional $\mathbb{R}^d$ this is never an issue.

Special Cases and Extensions

Sub-sequences. Convergence in probability implies that some sub-sequence converges almost surely. This is the standard upgrade trick: to prove a sub-sequential a.s. claim, get the rate of in-probability convergence and extract a sub-sequence whose tail probabilities are summable, then apply the first Borel-Cantelli lemma.
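The upgrade trick can be written out in one line. Suppose, for each fixed $\varepsilon > 0$, the in-probability rate is $P(|X_n - X| > \varepsilon) \leq C_\varepsilon / n$ (a hypothetical polynomial rate, chosen for illustration). Along the sub-sequence $n_k = k^2$:

```latex
\sum_{k \geq 1} P\!\left(|X_{n_k} - X| > \varepsilon\right)
  \;\leq\; \sum_{k \geq 1} \frac{C_\varepsilon}{k^2}
  \;=\; \frac{C_\varepsilon \pi^2}{6} \;<\; \infty,
```

so the first Borel-Cantelli lemma gives $|X_{n_k} - X| > \varepsilon$ only finitely often, almost surely; intersecting over $\varepsilon = 1/m$, $m \in \mathbb{N}$, yields $X_{n_k} \to X$ almost surely.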

Uniform integrability. Convergence in probability plus uniform integrability of $\{|X_n|^p\}$ implies $L^p$ convergence. This is the canonical fix when you have in-probability convergence and want to upgrade to $L^p$ for moment-of-the-limit arguments.

Convergence to a constant. When the limit is deterministic, the bottom of the hierarchy collapses: convergence in distribution to a constant equals convergence in probability to that constant. This is why "consistent estimator" can be defined as in-probability convergence to the truth without ambiguity.

Common Confusions

Watch Out

Convergence in distribution does not constrain the joint law

$X_n \xrightarrow{d} X$ tells you the marginal distribution of $X_n$ converges to that of $X$. It says nothing about the joint distribution of $(X_n, X)$ or the correlations across different $n$. This is why convergence in distribution does not, in general, imply $X_n - X \to 0$ in any sense, and why Skorokhod has to construct a new coupling rather than use the original.

Watch Out

Almost sure and L^p are not comparable

There is no implication in either direction between a.s. convergence and $L^p$ convergence in general. A sequence can be a.s. convergent with non-vanishing $L^1$ distance to its limit (the bump example), or $L^p$ convergent without pointwise convergence (the typewriter). The right combination is "a.s. + dominating integrable function" (dominated convergence) or "in probability + uniform integrability" (Vitali). Without a uniformity hypothesis, you cannot trade one for the other.

Watch Out

The exception for constant limits is real and useful

"Convergence in distribution implies convergence in probability" is false for a general limit $X$, but true when $X$ is a constant. This single exception is what licenses the Slutsky-style decomposition $X_n = (X_n - c) + c$ when $X_n \xrightarrow{d} c$: you upgrade to in-probability for the constant part, and use this fact silently in nearly every consistency proof of an estimator.

Summary

  • Four modes of convergence: almost sure, in probability, in $L^p$, in distribution.
  • a.s. and $L^p$ each imply in probability; in probability implies in distribution; $L^q$ implies $L^p$ for $q \geq p$.
  • a.s. and $L^p$ are incomparable; the typewriter and bump sequences separate every other pair.
  • The continuous mapping theorem preserves a.s., in-probability, and in-distribution convergence; Skorokhod re-couples convergence in distribution into a.s. convergence on a new space.
  • Sub-sequential a.s. convergence and uniform integrability are the standard tools for upgrading convergence in probability.
  • Convergence in distribution to a constant collapses to convergence in probability to that constant.

Exercises

ExerciseCore

Problem

Let $X_n$ be independent with $P(X_n = n^2) = 1/n^2$ and $P(X_n = 0) = 1 - 1/n^2$. Determine which modes of convergence hold for $X_n \to 0$.

ExerciseAdvanced

Problem

Let $U \sim \text{Uniform}(0, 1)$ and define $X_n = \mathbf{1}\{U \leq 1/2 + (-1)^n / n\}$ for $n \geq 1$. Determine whether $X_n$ converges in distribution, in probability, or almost surely to a Bernoulli$(1/2)$ random variable.

References

Standard graduate texts:

  • Billingsley, "Probability and Measure" (3rd edition, Wiley, 1995), Sections 25-29
  • Durrett, "Probability: Theory and Examples" (5th edition, Cambridge, 2019), Sections 2.3, 3.2
  • Williams, "Probability with Martingales" (Cambridge, 1991), Chapters 7, 13
  • Resnick, "A Probability Path" (Birkhauser, 1999), Chapters 6-8

Weak convergence and Skorokhod:

  • Billingsley, "Convergence of Probability Measures" (2nd edition, Wiley, 1999), Chapter 1
  • van der Vaart, "Asymptotic Statistics" (Cambridge, 1998), Chapters 2-3

Reference handbook:

  • Folland, "Real Analysis" (2nd edition, Wiley, 1999), Chapter 2 covers the analogous modes for measurable functions


Last reviewed: April 18, 2026
