Mathematical Infrastructure
Modes of Convergence of Random Variables
The four standard senses in which a sequence of random variables can converge: almost surely, in probability, in $L^p$, and in distribution. Their hierarchy, the strict counterexamples that separate them, and the supporting tools (Slutsky, continuous mapping, Skorokhod).
Why This Matters
When a learning theorem says "the empirical risk converges to the true risk," that statement has four possible meanings, each with different strength and different proof techniques. The law of large numbers gives almost-sure convergence (strong) or in-probability convergence (weak). The central limit theorem gives convergence in distribution. Stochastic gradient descent's convergence guarantees come in all four flavors depending on the assumptions you accept.
You cannot read asymptotic statistics, stochastic approximation, or any modern generalization bound without keeping these four modes straight. This page is the reference: definitions, the strict implication hierarchy, the counterexamples that separate them, and the standard tools used to move between them.
The Four Modes
Throughout, $(X_n)_{n \ge 1}$ is a sequence of random variables on a common probability space $(\Omega, \mathcal{F}, \mathbb{P})$, and $X$ is a random variable on the same space (except for convergence in distribution, which only needs the $X_n$ and $X$ to share a state space).
Almost Sure Convergence
$X_n \to X$ almost surely if $\mathbb{P}\left(\lim_{n \to \infty} X_n = X\right) = 1$. Equivalently, the set of $\omega$ where the pointwise limit $X_n(\omega) \to X(\omega)$ fails has probability zero.
Convergence in Probability
$X_n \to X$ in probability if for every $\varepsilon > 0$, $\mathbb{P}(|X_n - X| > \varepsilon) \to 0$ as $n \to \infty$. The probability of $X_n$ being far from $X$ shrinks to zero, but no single trajectory is required to converge.
Convergence in Lp
For $p \ge 1$, $X_n \to X$ in $L^p$ if $\mathbb{E}|X_n|^p < \infty$, $\mathbb{E}|X|^p < \infty$, and $\mathbb{E}|X_n - X|^p \to 0$. The case $p = 2$ is mean-square convergence; the case $p = 1$ is convergence in mean.
Convergence in Distribution
$X_n \to X$ in distribution (or weakly) if $F_{X_n}(t) \to F_X(t)$ at every point $t$ where $F_X$ is continuous, where $F_X(t) = \mathbb{P}(X \le t)$. For $\mathbb{R}^d$-valued variables: $\mathbb{E}[f(X_n)] \to \mathbb{E}[f(X)]$ for every bounded continuous $f$.
The first three modes use the difference $|X_n - X|$ and so require the sequence and limit to be on the same probability space. Convergence in distribution is a statement about laws (distributions), not about $\omega$-by-$\omega$ behavior; the variables can live on entirely different probability spaces.
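A quick Monte Carlo check makes the first two on-space definitions concrete. The sequence below is an illustrative choice, not from the text: $X_n$ is the mean of $n$ Uniform(0,1) draws, which converges to the constant $1/2$, and we estimate the defining quantities $\mathbb{P}(|X_n - 1/2| > \varepsilon)$ and $\mathbb{E}|X_n - 1/2|$.

```python
import random

# Illustrative sequence (not from the text): X_n = mean of n Uniform(0,1)
# draws, which converges to 1/2. We estimate the quantities that define
# convergence in probability and in L^1 by Monte Carlo.
random.seed(0)

def sample_mean(n):
    return sum(random.random() for _ in range(n)) / n

eps, trials = 0.05, 2000
results = {}
for n in [10, 100, 1000]:
    draws = [sample_mean(n) for _ in range(trials)]
    p_far = sum(abs(x - 0.5) > eps for x in draws) / trials  # P(|X_n - 1/2| > eps)
    l1 = sum(abs(x - 0.5) for x in draws) / trials           # E|X_n - 1/2|
    results[n] = (p_far, l1)
    print(n, round(p_far, 3), round(l1, 4))
```

Both estimated quantities shrink as $n$ grows, which is exactly what the two definitions require.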
The Hierarchy
Implications Among Modes of Convergence
Statement
The following implications hold and are strict (each non-implication has a counterexample below):
- almost surely $\Rightarrow$ in probability $\Rightarrow$ in distribution
- $L^p \Rightarrow$ in probability for $p \ge 1$, and $L^q \Rightarrow L^p$ for $1 \le p \le q$
The reverse implications all fail in general. The exception: convergence in distribution to a constant implies convergence in probability, because convergence in distribution to a point mass collapses to convergence in probability.
Intuition
A diagram: a.s. and $L^p$ both sit above convergence in probability, which sits above convergence in distribution. Almost sure and $L^p$ are incomparable: a.s. constrains the pointwise behavior; $L^p$ constrains the average size. A sequence can be a.s. convergent without bounded moments, and $L^p$ convergent without pointwise convergence. Convergence in distribution is the weakest because it ignores joint behavior across $n$ entirely.
Proof Sketch
a.s. $\Rightarrow$ in probability: If $X_n \to X$ a.s., then for each $\varepsilon > 0$ the events $A_n = \bigcup_{m \ge n} \{|X_m - X| > \varepsilon\}$ decrease to a set of probability zero. By continuity from above of $\mathbb{P}$, $\mathbb{P}(|X_n - X| > \varepsilon) \le \mathbb{P}(A_n) \to 0$.
$L^p \Rightarrow$ in probability: Markov's inequality gives $\mathbb{P}(|X_n - X| > \varepsilon) \le \varepsilon^{-p} \, \mathbb{E}|X_n - X|^p \to 0$.
In probability $\Rightarrow$ in distribution: For $f$ bounded and continuous, split $\mathbb{E}[f(X_n)] - \mathbb{E}[f(X)]$ into the contribution where $|X_n - X| \le \delta$ (small by uniform continuity of $f$ on a bounded set) and where $|X_n - X| > \delta$ (small by convergence in probability and boundedness of $f$).
$L^q \Rightarrow L^p$ for $1 \le p \le q$: Jensen's inequality applied to the convex function $x \mapsto x^{q/p}$ gives $\mathbb{E}|X_n - X|^p \le \left(\mathbb{E}|X_n - X|^q\right)^{p/q}$.
Why It Matters
The hierarchy tells you which mode to prove and which to use. To prove the strongest claim (a.s. convergence), use Borel-Cantelli or martingale arguments. To use a convergence guarantee in a CLT-style asymptotic argument, you only need convergence in distribution. Many generalization bounds give convergence in probability with explicit rate; upgrading to a.s. requires summability of the rate (first Borel-Cantelli).
Failure Mode
A common slip is treating "$X_n \to X$ in distribution" as if it implied $X_n - X \to 0$ in any sense. It does not. Take $X \sim \mathcal{N}(0, 1)$ and $X_n = -X$ for all $n$. Then $X_n \to X$ in distribution trivially (each $X_n$ already has the same distribution as $X$), but $|X_n - X| = 2|X|$ never shrinks. Convergence in distribution is about marginal laws, not about pathwise closeness.
The Strict Counterexamples
Memorizing one counterexample for each non-implication is the cheapest way to keep the hierarchy straight.
In probability does not imply almost surely (the typewriter sequence). Take $\Omega = [0, 1]$ with Lebesgue measure. Index a sequence of intervals by pairs $(k, j)$ where $k \ge 0$ and $0 \le j < 2^k$: let $I_{k,j} = [j 2^{-k}, (j+1) 2^{-k}]$. Order them lexicographically as $I_{0,0}$, $I_{1,0}$, $I_{1,1}$, $I_{2,0}$, and so on, and let $X_n$ be the indicator of the $n$-th interval in this order. Then $\mathbb{P}(|X_n| > \varepsilon) \le$ length of the indicator's support $= 2^{-k} \to 0$, so $X_n \to 0$ in probability. But for every $\omega \in [0, 1]$, $X_n(\omega) = 1$ infinitely often (every $\omega$ falls inside infinitely many dyadic intervals), so $\limsup_n X_n(\omega) = 1$. The pointwise limit fails everywhere.
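The typewriter can be simulated directly; a minimal sketch (the enumeration helper and its name are my own):

```python
# Typewriter sequence: X_n is the indicator of the n-th dyadic interval,
# levels enumerated lexicographically: [0,1], [0,1/2], [1/2,1], [0,1/4], ...
def typewriter_interval(n):
    """Endpoints (left, right) of the n-th interval, 0-indexed."""
    k = 0
    while n >= 2 ** k:   # skip past the 2^k intervals at level k
        n -= 2 ** k
        k += 1
    return n / 2 ** k, (n + 1) / 2 ** k

N = 2 ** 12 - 1          # every interval up to level 11
omega = 0.3              # one fixed sample point
lengths, hits = [], []
for n in range(N):
    left, right = typewriter_interval(n)
    lengths.append(right - left)          # = P(X_n = 1), shrinks to 0
    hits.append(left <= omega < right)    # X_n(omega)

print(lengths[-1])       # support length at level 11: 2^-11, so X_n -> 0 in probability
print(sum(hits))         # omega is covered once per level: 12 ones, and it never stops
```

Every fixed $\omega$ keeps getting hit (once per level), which is exactly the failure of pointwise convergence.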
In probability does not imply $L^1$ (the bump sequence). Take $\Omega = [0, 1]$ with Lebesgue measure and $X_n = n \, \mathbf{1}_{[0, 1/n]}$. Then $\mathbb{P}(|X_n| > \varepsilon) \le 1/n \to 0$ for any $\varepsilon > 0$, so $X_n \to 0$ in probability. But $\mathbb{E}|X_n| = n \cdot (1/n) = 1$ for every $n$, so $X_n \not\to 0$ in $L^1$.
Almost surely does not imply $L^1$. Same sequence: $X_n \to 0$ pointwise on $(0, 1]$ (which has full measure), so $X_n \to 0$ a.s., but $\mathbb{E}|X_n| = 1$ for all $n$.
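The bump can be checked exactly, with no sampling (the helper name is mine):

```python
# Bump sequence X_n = n * 1_[0, 1/n] on [0,1] with Lebesgue measure.
def bump(n, omega):
    return n if 0.0 <= omega <= 1.0 / n else 0.0

for n in [1, 10, 100, 1000]:
    p_far = 1.0 / n       # P(|X_n| > eps) for any 0 < eps < n: shrinks to 0
    l1 = n * (1.0 / n)    # E|X_n| = height * width = 1 for every n
    print(n, p_far, l1, bump(n, 0.37))  # at omega = 0.37 the bump misses for n >= 3
```

The mass concentrates near 0 (in-probability and a.s. convergence) but never vanishes in expectation (no $L^1$ convergence).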
$L^p$ does not imply almost surely. The typewriter sequence above is in $L^p$ ($\mathbb{E}|X_n|^p =$ interval length $= 2^{-k} \to 0$), so it converges to 0 in $L^p$ but not almost surely.
In distribution does not imply in probability. Take $X \sim \mathcal{N}(0, 1)$ and $X_n = -X$. The $X_n$ have the same distribution as $X$, so $X_n \to X$ in distribution. But $|X_n - X| = 2|X|$, which does not converge to 0 in probability.
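A Monte Carlo check of the sign-flip example (assuming the $\mathcal{N}(0,1)$ choice above):

```python
import random

# Sign-flip example: X ~ N(0,1), X_n = -X. Same marginal law, so convergence
# in distribution is trivial, but the pathwise gap |X_n - X| = 2|X| stays put.
random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(20000)]

mean_x = sum(xs) / len(xs)                    # ~0: the law of X is N(0,1)
mean_flip = sum(-x for x in xs) / len(xs)     # ~0: the law of -X is also N(0,1)
gap = sum(abs(-x - x) for x in xs) / len(xs)  # E|X_n - X| = 2 E|X| = 2*sqrt(2/pi) ~ 1.60
print(round(mean_x, 2), round(mean_flip, 2), round(gap, 2))
```

The two marginal laws agree, yet the expected pathwise gap sits near $2\sqrt{2/\pi} \approx 1.6$ for every $n$.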
The pattern: the four counterexamples use just two constructions (typewriter and bump). Memorize those two and the rest follows.
Tools That Move Between Modes
Three tools let you upgrade a convergence claim to a stronger mode or transport convergence through a transformation.
Continuous Mapping Theorem
Statement
Let $g$ be continuous on a set $C$ with $\mathbb{P}(X \in C) = 1$. Then:
- $X_n \to X$ a.s. $\Rightarrow$ $g(X_n) \to g(X)$ a.s.
- $X_n \to X$ in probability $\Rightarrow$ $g(X_n) \to g(X)$ in probability
- $X_n \to X$ in distribution $\Rightarrow$ $g(X_n) \to g(X)$ in distribution
The same convergence mode is preserved through any continuous transformation ($g$ continuous on the support of $X$).
Intuition
Continuous functions preserve closeness. If $X_n$ is close to $X$ in some sense, $g(X_n)$ is close to $g(X)$ in the same sense, because $g$ does not introduce discontinuities. The "continuous on $C$ with $\mathbb{P}(X \in C) = 1$" hedge handles cases like $g(x) = 1/x$ when $X$ has a continuous distribution putting no mass on 0.
Proof Sketch
For a.s. convergence, continuity of $g$ at $X(\omega)$ for $\omega$ in the probability-1 set gives the result by the sequential characterization of continuity. For convergence in probability, fix $\varepsilon > 0$ and use that $|g(X_n) - g(X)| > \varepsilon$ implies either $|X_n - X|$ is large or $X$ lies in a region where $g$ has variation at least $\varepsilon$ over a small ball; both events have probability shrinking to zero. For convergence in distribution, use the $\mathbb{E}[f(X_n)] \to \mathbb{E}[f(X)]$ for bounded continuous $f$ formulation and apply it to $f \circ g$ for any bounded continuous $f$.
Why It Matters
This is what lets you take a CLT statement like $\sqrt{n}(\bar{X}_n - \mu) \Rightarrow \mathcal{N}(0, \sigma^2)$ and conclude $n(\bar{X}_n - \mu)^2 \Rightarrow \sigma^2 \chi^2_1$ by applying the continuous map $x \mapsto x^2$. Most "delta method" arguments in asymptotic statistics use this theorem under the hood.
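A sketch of that pipeline for Uniform(0,1) data (my own illustrative choice, with $\mu = 1/2$ and $\sigma^2 = 1/12$): by continuous mapping, $n(\bar{X}_n - 1/2)^2$ should behave like $\tfrac{1}{12}\chi^2_1$, whose mean is $1/12$.

```python
import random

# CLT + continuous mapping, Monte Carlo sketch: for i.i.d. Uniform(0,1),
# sqrt(n)(mean - 1/2) => N(0, 1/12), so n*(mean - 1/2)^2 => (1/12) * chi^2_1.
# Since E[chi^2_1] = 1, the sample average of n*(mean - 1/2)^2 should sit
# near 1/12 ~ 0.0833.
random.seed(2)
n, trials = 500, 4000
sq = []
for _ in range(trials):
    m = sum(random.random() for _ in range(n)) / n
    sq.append(n * (m - 0.5) ** 2)
est = sum(sq) / trials
print(round(est, 4))
```

The estimate lands near $1/12$, consistent with the squared-CLT limit.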
Failure Mode
Discontinuity at a point of positive limit-mass breaks the theorem. If $g = \mathbf{1}_{[0, \infty)}$ and $X_n = -1/n \to 0$ deterministically, then $X_n \to 0$ but $g(X_n) = 0$ for all $n$ while $g(0) = 1$. The limit random variable falls on the discontinuity set of $g$, which carries positive mass under the limit law (here $\mathbb{P}(X = 0) = 1$).
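The failure is easy to see numerically:

```python
# g = indicator of [0, inf): discontinuous exactly at the limit point 0.
def g(x):
    return 1 if x >= 0 else 0

xs = [-1.0 / n for n in range(1, 6)]   # x_n = -1/n -> 0 deterministically
print([g(x) for x in xs], g(0.0))      # g(x_n) = 0 for every n, but g(0) = 1
```

The sequence $g(X_n)$ is identically 0 and cannot converge to $g(0) = 1$ in any mode.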
Slutsky's theorem (stated on the asymptotic-statistics page) handles addition and multiplication when one sequence converges in distribution and the other converges in probability to a constant. The combination "Slutsky + continuous mapping" covers most asymptotic distribution arguments in the wild.
Skorokhod Representation
Statement
If $X_n \Rightarrow X$ in a separable metric space, there exist a probability space $(\tilde{\Omega}, \tilde{\mathcal{F}}, \tilde{\mathbb{P}})$ and random variables $\tilde{X}_n, \tilde{X}$ on it such that:
- $\tilde{X}_n \stackrel{d}{=} X_n$ for every $n$ (same distribution),
- $\tilde{X} \stackrel{d}{=} X$,
- $\tilde{X}_n \to \tilde{X}$ almost surely on $(\tilde{\Omega}, \tilde{\mathcal{F}}, \tilde{\mathbb{P}})$.
Intuition
Convergence in distribution looks weak because it ignores the joint structure of the $X_n$. Skorokhod says you can always re-couple the sequence on a new probability space to make pointwise convergence true, without changing any marginal distribution. The original $X_n$ may not converge a.s. on the original space, but a distributional copy of them does on a possibly larger space.
Proof Sketch
On $\tilde{\Omega} = (0, 1)$ with Lebesgue measure, define $\tilde{X}_n = F_{X_n}^{-1}(U)$ as the quantile transform, where $U$ is uniform on $(0, 1)$. Convergence of $F_{X_n}$ to $F_X$ at continuity points implies pointwise convergence of the inverse functions $F_{X_n}^{-1} \to F_X^{-1}$ almost everywhere on $(0, 1)$. Generalizing to $\mathbb{R}^d$ and separable metric spaces requires more care but the construction is the same in spirit.
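A concrete sketch of the quantile construction, using Exponential(rate) variables with rate $1 + 1/n \to 1$ (my own illustrative choice, picked because the exponential quantile has the closed form $F^{-1}(u) = -\ln(1-u)/\text{rate}$):

```python
import math

# Skorokhod coupling via the quantile transform. X_n ~ Exp(1 + 1/n) converges
# in distribution to Exp(1). The coupled copies X~_n(u) = F_n^{-1}(u) on the
# new space (0,1) converge for EVERY u, i.e. almost surely.
def quantile(n, u):
    return -math.log(1.0 - u) / (1.0 + 1.0 / n)

u = 0.73                              # one fixed point of the new sample space
path = [quantile(n, u) for n in (1, 10, 100, 1000)]
limit = -math.log(1.0 - u)            # Exp(1) quantile at the same u
print([round(x, 4) for x in path], round(limit, 4))
```

At every fixed $u$ the coupled path converges to the limit quantile, even though independent draws with the same marginals would not converge pathwise.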
Why It Matters
Skorokhod converts convergence-in-distribution problems into pointwise problems. Many proofs in weak convergence theory (Donsker's invariance principle, weak convergence of empirical processes) become easier once you can pretend the limit is pointwise. It also exposes the looseness of convergence in distribution: the joint structure across $n$ is unspecified, and you are free to pick a convenient coupling.
Failure Mode
The construction relies on separability of the underlying space. For non-separable spaces (e.g., the $\ell^\infty$-type spaces of empirical process theory) Skorokhod-type representations exist only under additional hypotheses. For finite-dimensional $\mathbb{R}^d$ this is never an issue.
Special Cases and Extensions
Sub-sequences. Convergence in probability implies that some sub-sequence converges almost surely. This is the standard upgrade trick: to prove a sub-sequential a.s. claim, get the rate of in-probability convergence and extract a sub-sequence whose tail probabilities are summable, then apply the first Borel-Cantelli lemma.
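On the typewriter sequence, one convenient sub-sequence picks the first interval of each dyadic level, i.e. the indicators of $[0, 2^{-k})$; their support lengths $2^{-k}$ are summable, so Borel-Cantelli gives a.s. convergence along it. A quick check at one fixed $\omega$ (illustrative code, my own naming):

```python
# Sub-sequence of the typewriter: the k-th term is the indicator of [0, 2^-k).
# Support lengths 2^-k are summable, so by first Borel-Cantelli the
# sub-sequence converges to 0 a.s., even though the full sequence does not.
def X_sub(k, omega):
    return 1 if 0.0 <= omega < 2.0 ** -k else 0

omega = 0.001
tail = [X_sub(k, omega) for k in range(30)]
print(tail)   # finitely many 1s, then 0s forever: pointwise convergence at omega
```

Every $\omega > 0$ is hit only finitely often along this sub-sequence, in contrast to the full sequence, which hits every $\omega$ infinitely often.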
Uniform integrability. Convergence in probability plus uniform integrability of $(|X_n|^p)_{n \ge 1}$ implies $L^p$ convergence. This is the canonical fix when you have in-probability convergence and want to upgrade to $L^p$ for moment-of-the-limit arguments.
Convergence to a constant. When the limit is deterministic, the four modes collapse: convergence in distribution to a constant equals convergence in probability to that constant. This is why "consistent estimator" can be defined as in-probability convergence to the truth without ambiguity.
Common Confusions
Convergence in distribution does not constrain the joint law
$X_n \to X$ in distribution tells you the marginal distribution of $X_n$ converges to that of $X$. It says nothing about the joint distribution of $(X_n, X)$ or the correlations across different $n$. This is why convergence in distribution does not, in general, imply $X_n - X \to 0$ in any sense, and why Skorokhod has to construct a new coupling rather than use the original.
Almost sure and L^p are not comparable
There is no implication in either direction between a.s. convergence and $L^p$ convergence in general. A sequence can be a.s. convergent with diverging $L^p$ norms (the bump example), or $L^p$ convergent without pointwise convergence (the typewriter). The right combination is "a.s. + dominating integrable function" (dominated convergence) or "in probability + uniform integrability" (Vitali). Without a uniformity hypothesis, you cannot trade one for the other.
The exception for constant limits is real and useful
"Convergence in distribution implies convergence in probability" is false for a general limit $X$, but true when $X$ is a constant. This single exception is what licenses the Slutsky-style decomposition when $Y_n \Rightarrow c$ for a constant $c$: you upgrade to in-probability convergence for the constant part, and use this fact silently in nearly every consistency proof of an estimator.
Summary
- Four modes of convergence: almost sure, in probability, in , in distribution.
- a.s. and $L^p$ each imply in probability; in probability implies in distribution; $L^q$ implies $L^p$ for $1 \le p \le q$.
- a.s. and $L^p$ are incomparable; the typewriter and bump sequences separate every other pair.
- The continuous mapping theorem preserves all three on-space modes; Skorokhod re-couples convergence in distribution into a.s. convergence on a new space.
- Sub-sequential a.s. convergence and uniform integrability are the standard tools for upgrading convergence in probability.
- Convergence in distribution to a constant collapses to convergence in probability to that constant.
Exercises
Problem
Let $X_n$ be independent with $\mathbb{P}(X_n = 1) = 1/n$ and $\mathbb{P}(X_n = 0) = 1 - 1/n$. Determine which modes of convergence hold for $X_n \to 0$.
Problem
Let $U$ be uniform on $[0, 1]$ and define $X_n = \mathbf{1}_{\{U \le 1/2\}}$ for even $n$ and $X_n = \mathbf{1}_{\{U > 1/2\}}$ for odd $n$. Determine whether $X_n$ converges in distribution, in probability, or almost surely, to a Bernoulli random variable.
References
Standard graduate texts:
- Billingsley, "Probability and Measure" (3rd edition, Wiley, 1995), Sections 25-29
- Durrett, "Probability: Theory and Examples" (5th edition, Cambridge, 2019), Sections 2.3, 3.2
- Williams, "Probability with Martingales" (Cambridge, 1991), Chapters 7, 13
- Resnick, "A Probability Path" (Birkhäuser, 1999), Chapters 6-8
Weak convergence and Skorokhod:
- Billingsley, "Convergence of Probability Measures" (2nd edition, Wiley, 1999), Chapter 1
- van der Vaart, "Asymptotic Statistics" (Cambridge, 1998), Chapters 2-3
Reference handbook:
- Folland, "Real Analysis" (2nd edition, Wiley, 1999), Chapter 2 covers the analogous modes for measurable functions
Next Topics
- Borel-Cantelli lemmas: the bridge from in-probability to almost-sure convergence
- Law of large numbers: applies all four modes to sample averages
- Central limit theorem: the canonical convergence-in-distribution result
- Asymptotic statistics: Slutsky, delta method, and the toolbox in action
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
Builds on This
- Borel-Cantelli Lemmas (Layer 0B)