Statistical Estimation
Law of Large Numbers
The weak and strong laws of large numbers: the sample mean converges to the population mean. Kolmogorov's conditions, the rate of convergence from the CLT, and why LLN justifies using empirical risk as a proxy for population risk.
Why This Matters
The law of large numbers is the most fundamental result in all of statistics. It says: if you average enough i.i.d. observations, the average converges to the expected value. This single fact justifies:
- Empirical risk minimization: the training loss converges to the population risk as $n \to \infty$. Without the LLN, there is no reason to believe that minimizing training loss has anything to do with minimizing true risk.
- Consistency of estimators: the sample mean $\bar{X}_n$ is a consistent estimator of the population mean $\mu = \mathbb{E}[X]$, the population quantity we target. More generally, maximum likelihood estimators are consistent under regularity conditions, and the proof ultimately relies on the LLN.
- Monte Carlo methods: estimating $\mathbb{E}[f(X)]$ by $\frac{1}{n}\sum_{i=1}^n f(X_i)$ works because of the LLN. Every MCMC method, every stochastic gradient estimator, and every simulation study rests on this.
If you do not understand the LLN, you do not understand why averaging works.
Mental Model
Flip a fair coin 10 times: you might get 7 heads (70%). Flip it 1,000 times: you will almost certainly get close to 50%. Flip it a million times: the fraction of heads will be within 0.1% of 50% with overwhelming probability.
The LLN formalizes this: the sample average $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ converges to the expected value $\mu$ as $n \to \infty$. The question is: what kind of convergence?
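This shrinking of the fluctuations is easy to see numerically. A minimal sketch, assuming numpy is available (the sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fraction of heads after n fair-coin flips, for increasing n:
# the sample mean tightens around 0.5 as n grows.
for n in [10, 1_000, 1_000_000]:
    flips = rng.integers(0, 2, size=n)  # 0 = tails, 1 = heads
    print(n, flips.mean())
```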
Modes of Convergence
Convergence in Probability
A sequence of random variables $X_n$ converges in probability to $X$ if for every $\varepsilon > 0$:
$$\lim_{n \to \infty} \mathbb{P}\big(|X_n - X| > \varepsilon\big) = 0.$$
This says: the probability of $X_n$ being far from $X$ vanishes, but it does not rule out occasional deviations. There might be rare "bad" events where $X_n$ is far from $X$, as long as these events become increasingly unlikely.
Almost Sure Convergence
A sequence $X_n$ converges almost surely (a.s.) to $X$ if:
$$\mathbb{P}\Big(\lim_{n \to \infty} X_n = X\Big) = 1.$$
Equivalently: $\mathbb{P}\big(\{\omega : X_n(\omega) \to X(\omega)\}\big) = 1$.
This is strictly stronger than convergence in probability. Almost sure convergence says: for almost every outcome $\omega$, the sequence $X_n(\omega)$ converges to $X(\omega)$. Not just "unlikely to be far away," but "actually converges along each path."
The relationship: Almost sure convergence implies convergence in probability, but not vice versa. The distinction matters in practice: convergence in probability allows for occasional large deviations that become rare; almost sure convergence says the sequence eventually settles down for each outcome.
Main Theorems
Weak Law of Large Numbers (WLLN)
Statement
If $X_1, X_2, \ldots$ are i.i.d. random variables with $\mathbb{E}|X_1| < \infty$ and $\mu = \mathbb{E}[X_1]$, then the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ converges in probability to $\mu$:
$$\bar{X}_n \xrightarrow{\;p\;} \mu.$$
That is, for every $\varepsilon > 0$: $\mathbb{P}\big(|\bar{X}_n - \mu| > \varepsilon\big) \to 0$ as $n \to \infty$.
Intuition
Averaging reduces fluctuations. Each $X_i$ fluctuates around $\mu$, but when you average many of them, the positive and negative deviations tend to cancel. The more you average, the less the sample mean fluctuates. Eventually, the probability of any fixed-size deviation vanishes.
Proof Sketch
(Simple proof assuming finite variance): If $\operatorname{Var}(X_1) = \sigma^2 < \infty$, then $\operatorname{Var}(\bar{X}_n) = \sigma^2/n$. By Chebyshev's inequality:
$$\mathbb{P}\big(|\bar{X}_n - \mu| > \varepsilon\big) \le \frac{\operatorname{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0.$$
(General proof with only finite mean): Use truncation. Define $Y_{n,i} = X_i \mathbf{1}\{|X_i| \le n\}$. Show that: (1) $\mathbb{E}[Y_{n,1}] \to \mu$ by dominated convergence, (2) $n\,\mathbb{P}(|X_1| > n) \to 0$, so the truncated and untruncated sample means agree with probability tending to 1, and (3) the variance of the truncated sample mean vanishes. Apply Chebyshev to the truncated variables.
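The Chebyshev rate can be checked numerically. A sketch, assuming numpy; the distribution (Exponential with mean 1, so $\sigma^2 = 1$) and the sample sizes are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
eps, sigma2 = 0.1, 1.0  # deviation threshold; Var of Exponential(1)

# Estimate P(|mean - mu| > eps) by simulation for Exponential(1) draws
# (mu = 1) and compare it to the Chebyshev bound sigma^2 / (n eps^2).
for n in [100, 400, 1600]:
    means = rng.exponential(1.0, size=(5_000, n)).mean(axis=1)
    dev_prob = np.mean(np.abs(means - 1.0) > eps)
    print(n, dev_prob, sigma2 / (n * eps**2))
```

The empirical deviation probability sits below the Chebyshev bound and shrinks as $n$ grows, as the proof predicts.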
Why It Matters
The WLLN is the justification for using sample statistics as estimates of population quantities. When you compute a sample mean, a sample variance, or an empirical risk, you are relying on the WLLN to guarantee that these quantities are close to their population counterparts for large $n$.
Crucially, the WLLN only requires a finite mean --- not a finite variance. The Cauchy distribution has no finite mean, and the LLN genuinely fails: the sample mean of i.i.d. Cauchy variables does not converge. For the strong law, a finite mean is both necessary and sufficient; for the weak law it is sufficient (and necessary outside of exceptional heavy-tailed cases).
Failure Mode
The WLLN fails when $\mathbb{E}|X_1| = \infty$. The canonical example is the Cauchy distribution: the sample mean of $n$ i.i.d. Cauchy variables has the same Cauchy distribution for every $n$. Averaging does not help because the tails are too heavy. This is why concentration inequalities (which give non-asymptotic bounds) always require moment conditions.
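This failure is visible in simulation. A sketch assuming numpy (sample size and snapshot points are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Running mean of i.i.d. standard Cauchy draws. Unlike the finite-mean
# case, the running mean does not settle down as n grows: a single huge
# observation can move it by a large amount at any stage.
x = rng.standard_cauchy(100_000)
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
print(running_mean[[999, 9_999, 99_999]])  # snapshots at n = 10^3, 10^4, 10^5
```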
Strong Law of Large Numbers (SLLN)
Statement
If $X_1, X_2, \ldots$ are i.i.d. with $\mathbb{E}|X_1| < \infty$ and $\mu = \mathbb{E}[X_1]$, then:
$$\bar{X}_n \xrightarrow{\;\text{a.s.}\;} \mu.$$
That is, $\mathbb{P}\big(\lim_{n \to \infty} \bar{X}_n = \mu\big) = 1$.
Intuition
The strong law says something more powerful than the weak law: not just that large deviations become unlikely, but that the sample mean actually converges for almost every sequence of outcomes. If you ran the experiment once and watched $\bar{X}_1, \bar{X}_2, \bar{X}_3, \ldots$, the sequence would converge to $\mu$ (with probability 1).
Proof Sketch
(Proof assuming finite fourth moment, for intuition): Compute $\mathbb{E}[(\bar{X}_n - \mu)^4]$. After expanding and using independence: $\mathbb{E}[(\bar{X}_n - \mu)^4] \le C/n^2$ for a constant $C$ depending on the fourth moment. Then $\sum_{n=1}^\infty \mathbb{E}[(\bar{X}_n - \mu)^4] < \infty$. By Markov's inequality: $\sum_{n=1}^\infty \mathbb{P}(|\bar{X}_n - \mu| > \varepsilon) < \infty$ for every $\varepsilon > 0$. By the first Borel-Cantelli lemma: $\mathbb{P}(|\bar{X}_n - \mu| > \varepsilon \text{ infinitely often}) = 0$. This gives almost sure convergence.
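Writing $Y_i = X_i - \mu$, the expansion works out as follows (a sketch assuming $\mathbb{E}[Y_1^4] < \infty$): every term in $\big(\sum_i Y_i\big)^4$ containing a lone factor $Y_i$ has zero expectation by independence, leaving only the $n$ pure fourth powers and the $3n(n-1)$ mixed squares:
$$
\mathbb{E}\big[(\bar{X}_n - \mu)^4\big]
= \frac{1}{n^4}\,\mathbb{E}\Big[\Big(\sum_{i=1}^n Y_i\Big)^4\Big]
= \frac{1}{n^4}\Big( n\,\mathbb{E}[Y_1^4] + 3n(n-1)\,\sigma^4 \Big)
\le \frac{C}{n^2}.
$$
Summing over $n$ gives a convergent series, which is exactly what Markov plus Borel-Cantelli need.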
(General proof with only finite mean): Uses Kolmogorov's truncation technique and the Kolmogorov three-series theorem. The key idea is to truncate, apply the SLLN for bounded variables (using Borel-Cantelli), and show the truncation error vanishes.
Why It Matters
The SLLN justifies Monte Carlo simulation. When you estimate $\mathbb{E}[f(X)]$ by running a simulation and averaging $\frac{1}{n}\sum_{i=1}^n f(X_i)$, you want to know that the estimate converges to the true value as you run longer. The SLLN guarantees this: with probability 1, your simulation will eventually give the right answer.
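As a concrete sketch (numpy assumed; the target quantity is an illustrative choice): estimating $\mathbb{E}[X^2]$ for $X \sim \mathcal{N}(0,1)$, whose true value is 1.

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo estimate of E[X^2] for X ~ N(0, 1); the true value is 1.
x = rng.standard_normal(1_000_000)
estimate = np.mean(x**2)
print(estimate)
```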
For ML: the SLLN implies that the empirical distribution converges to the true distribution (in a pointwise sense), which is the starting point for uniform convergence arguments that control the generalization gap.
Failure Mode
Like the WLLN, the SLLN requires $\mathbb{E}|X_1| < \infty$. An important subtlety: the SLLN can fail for non-identically distributed variables even with finite means. Kolmogorov's condition for the SLLN with independent (not identically distributed) variables requires: $\sum_{n=1}^\infty \operatorname{Var}(X_n)/n^2 < \infty$ (Kolmogorov's criterion).
Rate of Convergence
The LLN tells you that $\bar{X}_n \to \mu$, but it does not tell you how fast. The rate of convergence comes from the central limit theorem:
$$\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{\;d\;} \mathcal{N}(0, \sigma^2).$$
This says the fluctuations of $\bar{X}_n$ around $\mu$ are of order $\sigma/\sqrt{n}$. The $1/\sqrt{n}$ rate is universal (for distributions with finite variance) and explains why:
- Halving the error requires quadrupling the sample size
- Monte Carlo estimates converge slowly: 4x more computation for 2x more accuracy
- Concentration inequalities give non-asymptotic deviation bounds of order $1/\sqrt{n}$
The CLT goes beyond the LLN by characterizing the shape of the fluctuations (Gaussian), not just their decay.
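The "quadruple the sample, halve the error" rule can be checked directly. A sketch assuming numpy (Exponential(1) draws with true mean 1; trial counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Root-mean-square error of the sample mean of Exponential(1) draws
# (true mean 1): quadrupling n should roughly halve the RMS error,
# reflecting the 1/sqrt(n) rate from the CLT.
rms = {}
for n in [250, 1_000, 4_000]:
    means = rng.exponential(1.0, size=(5_000, n)).mean(axis=1)
    rms[n] = float(np.sqrt(np.mean((means - 1.0) ** 2)))
    print(n, rms[n])
```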
Kolmogorov's Conditions
For independent (but not necessarily identically distributed) random variables $X_n$ with $\mathbb{E}[X_n] = \mu_n$ and finite variances, the SLLN holds under Kolmogorov's condition:
$$\sum_{n=1}^\infty \frac{\operatorname{Var}(X_n)}{n^2} < \infty \quad \Longrightarrow \quad \frac{1}{n}\sum_{i=1}^n (X_i - \mu_i) \xrightarrow{\;\text{a.s.}\;} 0.$$
This is satisfied, for instance, if the variances are uniformly bounded. The $n^2$ in the denominator comes from the Kronecker lemma combined with the Kolmogorov three-series theorem.
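A sketch of the non-identically-distributed case, assuming numpy. The variances here grow like $\log n$ (an illustrative choice), so they are unbounded but $\sum_n \operatorname{Var}(X_n)/n^2 < \infty$, and the averaged centered sum still goes to zero:

```python
import numpy as np

rng = np.random.default_rng(5)

# Independent mean-zero Gaussians with Var(X_k) = log(k + 1): the
# variances grow without bound, but slowly enough that Kolmogorov's
# condition sum Var(X_k)/k^2 < infinity holds, so (1/n) sum X_k -> 0 a.s.
n = 200_000
sigmas = np.sqrt(np.log(np.arange(1, n + 1) + 1.0))
x = rng.standard_normal(n) * sigmas
running_mean = np.cumsum(x) / np.arange(1, n + 1)
print(running_mean[-1])  # near 0 despite the growing variances
```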
Canonical Examples
Coin flips
Let $X_i \sim \mathrm{Bernoulli}(1/2)$. Then $\bar{X}_n$ is the fraction of heads in $n$ flips. The LLN says $\bar{X}_n \to 1/2$. Chebyshev gives the rate: $\mathbb{P}(|\bar{X}_n - 1/2| > \varepsilon) \le \frac{1}{4n\varepsilon^2}$. For a deviation threshold $\varepsilon$ and failure probability at most $\delta$: $n \ge \frac{1}{4\varepsilon^2\delta}$ suffices.
The CLT gives a tighter bound: $\mathbb{P}(|\bar{X}_n - 1/2| > \varepsilon) \approx 2\big(1 - \Phi(2\varepsilon\sqrt{n})\big)$, where $\Phi$ is the standard normal CDF.
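Both bounds can be compared against simulation. A sketch using numpy and the standard library (the values $n = 10{,}000$ and $\varepsilon = 0.01$ are illustrative; $\Phi$ is computed from the error function):

```python
import math

import numpy as np

rng = np.random.default_rng(6)

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, eps = 10_000, 0.01
heads = rng.binomial(n, 0.5, size=20_000)        # 20,000 repetitions of n flips
empirical = np.mean(np.abs(heads / n - 0.5) > eps)
chebyshev = 1.0 / (4 * n * eps**2)               # 1 / (4 n eps^2)
clt = 2.0 * (1.0 - phi(2 * eps * math.sqrt(n)))  # 2 (1 - Phi(2 eps sqrt(n)))
print(empirical, chebyshev, clt)
```

The CLT approximation tracks the empirical frequency closely, while the Chebyshev bound is valid but much looser.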
Empirical risk as LLN
The empirical risk $\hat{R}_n(h) = \frac{1}{n}\sum_{i=1}^n \ell(h, Z_i)$ is the sample mean of the losses $\ell(h, Z_i)$, which are i.i.d. with mean $R(h) = \mathbb{E}[\ell(h, Z)]$. By the SLLN:
$$\hat{R}_n(h) \xrightarrow{\;\text{a.s.}\;} R(h)$$
for each fixed $h$. This is the pointwise convergence of empirical risk to population risk. The challenge of learning theory is to make this convergence uniform over the hypothesis class $\mathcal{H}$, which requires concentration inequalities and complexity measures (VC dimension, Rademacher complexity).
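A toy sketch of this pointwise convergence, assuming numpy. The data model, the noise rate, and the fixed hypothesis `h` are all hypothetical choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setup: X ~ Uniform(0, 1), clean label 1{X > 0.5} flipped
# with probability 0.1, and a fixed hypothesis h(x) = 1{x > 0.4}.
# Population 0-1 risk: h errs w.p. 0.9 on the strip (0.4, 0.5) (mass 0.1)
# and w.p. 0.1 elsewhere (mass 0.9), so R(h) = 0.1*0.9 + 0.9*0.1 = 0.18.
def h(x):
    return (x > 0.4).astype(int)

def sample(n):
    x = rng.uniform(0.0, 1.0, size=n)
    y = ((x > 0.5) ^ (rng.uniform(0.0, 1.0, size=n) < 0.1)).astype(int)
    return x, y

for n in [100, 10_000, 1_000_000]:
    x, y = sample(n)
    print(n, np.mean(h(x) != y))  # empirical risk approaches R(h) = 0.18
```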
Common Confusions
Convergence in probability is NOT almost sure convergence
Consider: let $X_n = 1$ with probability $1/n$ and $X_n = 0$ otherwise (independently). Then $X_n \to 0$ in probability (since $\mathbb{P}(|X_n| > \varepsilon) \le 1/n \to 0$). But $\sum_n 1/n = \infty$, so by the second Borel-Cantelli lemma (the events are independent), $X_n = 1$ infinitely often with probability 1. Thus $X_n \not\to 0$ almost surely.
In this example, large deviations keep happening, just less and less frequently. Convergence in probability allows this; almost sure convergence does not.
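This pattern is easy to simulate. A sketch assuming numpy, with $X_n = 1$ with probability $1/n$ and $0$ otherwise:

```python
import numpy as np

rng = np.random.default_rng(8)

# X_n = 1 with probability 1/n, else 0 (independent draws). Each decade
# [10^k, 10^(k+1)) contains about ln(10) ~ 2.3 expected hits, so the
# deviations never stop -- they just spread further and further apart.
N = 1_000_000
n = np.arange(1, N + 1)
x = (rng.uniform(0.0, 1.0, size=N) < 1.0 / n).astype(int)
for k in range(6):
    lo, hi = 10**k, 10**(k + 1)
    print(f"hits in [{lo}, {hi}): {x[lo - 1:hi - 1].sum()}")
```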
The LLN requires finite mean, not finite variance
The simplest proof of the WLLN uses Chebyshev's inequality and requires finite variance. But the WLLN holds with only a finite mean (the proof uses truncation). The SLLN also holds with only a finite mean. Finite variance gives you the rate of convergence (via the CLT), but convergence itself needs only a finite mean.
Conversely, if $\mathbb{E}|X_1| = \infty$, the LLN fails. The sample mean does not converge to any fixed value.
The LLN is about the sample mean, not individual observations
A common misstatement: "by the law of large numbers, extreme values become rare." This is wrong. Individual observations always have the same distribution, no matter how large $n$ is. It is the average that converges. Extreme values keep occurring at the same rate; they just get diluted by the average.
Summary
- WLLN: $\bar{X}_n \xrightarrow{p} \mu$ (convergence in probability) --- requires only $\mathbb{E}|X_1| < \infty$
- SLLN: $\bar{X}_n \xrightarrow{\text{a.s.}} \mu$ (almost sure convergence) --- also requires only $\mathbb{E}|X_1| < \infty$
- The SLLN is strictly stronger than the WLLN
- Rate of convergence: fluctuations are of order $1/\sqrt{n}$ (from the CLT)
- LLN justifies: empirical risk as proxy for population risk, consistency of estimators, Monte Carlo methods
- LLN fails for distributions with infinite mean (e.g., Cauchy)
Exercises
Problem
Simulate i.i.d. fair coin flips ($p = 0.5$) and plot $\bar{X}_n$ as a function of $n$ over several orders of magnitude. Run 10 independent simulations on the same plot. Verify visually that the sample mean converges to 0.5 and that the fluctuations decrease as $n$ grows.
Problem
Give an example of a sequence of independent random variables $X_n$ with $\mathbb{E}[X_n] = 0$ for all $n$ such that $\frac{1}{n}\sum_{i=1}^n X_i$ does not converge to 0 almost surely. Why does this not contradict the SLLN?
References
Canonical:
- Durrett, Probability: Theory and Examples (5th ed., 2019), Sections 2.2-2.4
- Billingsley, Probability and Measure (3rd ed., 1995), Sections 6, 22
Current:
- Vershynin, High-Dimensional Probability (2018), Chapter 0 (motivation)
- Wainwright, High-Dimensional Statistics (2019), Section 2.1
- Casella & Berger, Statistical Inference (2002), Chapters 5-10
- Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6
Next Topics
Building on the law of large numbers:
- Central limit theorem: the rate and shape of convergence
- Empirical risk minimization: the LLN in action for learning theory
- Concentration inequalities: non-asymptotic versions of the LLN
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)