
Statistical Estimation

Law of Large Numbers

The weak and strong laws of large numbers: the sample mean converges to the population mean. Kolmogorov's conditions, the rate of convergence from the CLT, and why LLN justifies using empirical risk as a proxy for population risk.


Why This Matters

The law of large numbers is the most fundamental result in all of statistics. It says: if you average enough i.i.d. observations, the average converges to the expected value. This single fact justifies:

  • Empirical risk minimization: the training loss $\frac{1}{n}\sum_i \ell(h(x_i), y_i)$ converges to the population risk $\mathbb{E}[\ell(h(X), Y)]$ as $n \to \infty$. Without the LLN, there is no reason to believe that minimizing training loss has anything to do with minimizing true risk.
  [Figure: sample mean vs. sample size $n$ (0 to 500) for three simulated sequences; all converge to the true mean $\mu$ despite early fluctuations.]
  • Consistency of estimators: the sample mean $\bar{X}_n$ is a consistent estimator of $\mu = \mathbb{E}[X]$, the population quantity we target. More generally, maximum likelihood estimators are consistent under regularity conditions, and the proof ultimately relies on the LLN.

  • Monte Carlo methods: estimating $\mathbb{E}[f(X)]$ by $\frac{1}{n}\sum_i f(X_i)$ works because of the LLN. Every MCMC method, every stochastic gradient estimator, and every simulation study rests on this.
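As a minimal illustration of the Monte Carlo bullet above, here is a sketch in standard-library Python (the helper name `monte_carlo_mean` and the uniform example are my own, not from any particular library):

```python
import random

def monte_carlo_mean(f, sampler, n, seed=0):
    """Estimate E[f(X)] by the sample mean (1/n) * sum_i f(X_i)."""
    rng = random.Random(seed)
    return sum(f(sampler(rng)) for _ in range(n)) / n

# E[X^2] = 1/3 for X ~ Uniform(0, 1); by the LLN the estimate
# approaches 1/3 as n grows.
estimate = monte_carlo_mean(lambda x: x * x, lambda rng: rng.random(), n=100_000)
print(abs(estimate - 1 / 3))  # small for large n
```

The only inputs are a function `f` and a sampler for `X`; the LLN is what licenses returning a plain average as the estimate.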

If you do not understand the LLN, you do not understand why averaging works.

Mental Model

Flip a fair coin 10 times: you might get 7 heads (70%). Flip it 1,000 times: you will almost certainly get close to 50%. Flip it a million times: the fraction of heads will be within 0.1% of 50% with overwhelming probability.

The LLN formalizes this: the sample average $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ converges to $\mu = \mathbb{E}[X_i]$ as $n \to \infty$. The question is: what kind of convergence?

Modes of Convergence

Definition

Convergence in Probability

A sequence $X_n$ converges in probability to $X$ if for every $\epsilon > 0$:

$$\lim_{n \to \infty} \Pr[|X_n - X| > \epsilon] = 0$$

This says: the probability of $X_n$ being far from $X$ vanishes, but it does not rule out occasional deviations. There might be rare "bad" events where $X_n$ is far from $X$, as long as these events become increasingly unlikely.

Definition

Almost Sure Convergence

A sequence $X_n$ converges almost surely (a.s.) to $X$ if:

$$\Pr\!\left[\lim_{n \to \infty} X_n = X\right] = 1$$

Equivalently: $\Pr[\{\omega : X_n(\omega) \to X(\omega)\}] = 1$.

This is strictly stronger than convergence in probability. Almost sure convergence says: for almost every outcome $\omega$, the sequence $X_1(\omega), X_2(\omega), \ldots$ converges to $X(\omega)$. Not just "unlikely to be far away," but "actually converges along each path."

The relationship: Almost sure convergence implies convergence in probability, but not vice versa. The distinction matters in practice: convergence in probability allows for occasional large deviations that become rare; almost sure convergence says the sequence eventually settles down for each outcome.

Main Theorems

Theorem

Weak Law of Large Numbers (WLLN)

Statement

If $X_1, X_2, \ldots$ are i.i.d. random variables with $\mathbb{E}[|X_i|] < \infty$ and mean $\mu = \mathbb{E}[X_i]$, then the sample mean converges in probability to $\mu$:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} \mu$$

That is, for every $\epsilon > 0$: $\lim_{n \to \infty} \Pr[|\bar{X}_n - \mu| > \epsilon] = 0$.

Intuition

Averaging reduces fluctuations. Each $X_i$ fluctuates around $\mu$, but when you average many of them, the positive and negative deviations tend to cancel. The more you average, the less the sample mean fluctuates. Eventually, the probability of any fixed-size deviation vanishes.

Proof Sketch

(Simple proof assuming finite variance): If $\text{Var}(X_i) = \sigma^2 < \infty$, then $\text{Var}(\bar{X}_n) = \sigma^2/n$. By Chebyshev's inequality:

$$\Pr[|\bar{X}_n - \mu| > \epsilon] \leq \frac{\text{Var}(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \to 0$$

(General proof with only finite mean): Use truncation. Define $Y_i = X_i \mathbf{1}\{|X_i| \leq n\}$. Show that: (1) $\Pr[X_i \neq Y_i] \to 0$ by dominated convergence, (2) $\text{Var}(Y_i)/n \to 0$, and (3) $\mathbb{E}[Y_i] \to \mu$. Apply Chebyshev to the truncated variables.
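The Chebyshev step can be checked numerically. The sketch below (standard-library Python; the function name is mine) compares the empirical deviation probability for Bernoulli(1/2) averages against the bound $\sigma^2/(n\epsilon^2)$:

```python
import random

def deviation_prob(n, eps, trials=2000, seed=0):
    """Empirical Pr[|sample mean - 1/2| > eps] for X_i ~ Bernoulli(1/2)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        mean = sum(rng.random() < 0.5 for _ in range(n)) / n
        if abs(mean - 0.5) > eps:
            exceed += 1
    return exceed / trials

n, eps = 400, 0.05
chebyshev_bound = 0.25 / (n * eps ** 2)   # sigma^2 = 1/4 for Bernoulli(1/2)
print(deviation_prob(n, eps), "<=", chebyshev_bound)
```

The empirical probability sits well below the Chebyshev bound, which is expected: Chebyshev is valid for every distribution with this variance and is therefore loose for any particular one.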

Why It Matters

The WLLN is the justification for using sample statistics as estimates of population quantities. When you compute a sample mean, a sample variance, or an empirical risk, you are relying on the WLLN to guarantee that these quantities are close to their population counterparts for large nn.

Crucially, the WLLN only requires a finite mean, not a finite variance. The Cauchy distribution has no finite mean, and the LLN genuinely fails: the sample mean of i.i.d. Cauchy variables does not converge. For the SLLN, a finite mean is both necessary and sufficient (Kolmogorov); for the WLLN it is sufficient, and the sharp condition (due to Feller) is only slightly weaker.

Failure Mode

The WLLN can fail when $\mathbb{E}[|X|] = \infty$. The canonical example is the Cauchy distribution: the sample mean of $n$ i.i.d. Cauchy variables has the same Cauchy distribution for every $n$. Averaging does not help because the tails are too heavy. This is why concentration inequalities (which give non-asymptotic bounds) always require moment conditions.
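A quick simulation makes the Cauchy failure vivid (standard-library Python; the sampler `tan(pi * (U - 1/2))` is the standard inverse-CDF draw from a standard Cauchy):

```python
import math
import random

def running_means(sampler, n, seed=0):
    """Sequence of sample means X̄_1, ..., X̄_n along one realization."""
    rng = random.Random(seed)
    total, means = 0.0, []
    for i in range(1, n + 1):
        total += sampler(rng)
        means.append(total / i)
    return means

# Standard Cauchy via the inverse CDF: tan(pi * (U - 1/2))
cauchy = lambda rng: math.tan(math.pi * (rng.random() - 0.5))
normal = lambda rng: rng.gauss(0.0, 1.0)

print(running_means(normal, 100_000)[-1])  # settles near 0: the LLN applies
print(running_means(cauchy, 100_000)[-1])  # keeps jumping run to run: the LLN fails
```

Rerunning with different seeds, the Gaussian running mean always ends near 0, while the Cauchy running mean lands somewhere different each time: one heavy-tailed draw can drag the whole average.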

Theorem

Strong Law of Large Numbers (SLLN)

Statement

If $X_1, X_2, \ldots$ are i.i.d. with $\mathbb{E}[|X_i|] < \infty$ and mean $\mu$, then:

$$\bar{X}_n \xrightarrow{\text{a.s.}} \mu$$

That is, $\Pr\!\left[\lim_{n \to \infty} \bar{X}_n = \mu\right] = 1$.

Intuition

The strong law says something more powerful than the weak law: not just that large deviations become unlikely, but that the sample mean actually converges for almost every sequence of outcomes. If you ran the experiment once and watched $\bar{X}_1, \bar{X}_2, \bar{X}_3, \ldots$, the sequence would converge to $\mu$ (with probability 1).

Proof Sketch

(Proof assuming finite fourth moment, for intuition): Compute $\mathbb{E}[(\bar{X}_n - \mu)^4]$. After expanding and using independence: $\mathbb{E}[(\bar{X}_n - \mu)^4] = O(1/n^2)$. Then $\sum_n \mathbb{E}[(\bar{X}_n - \mu)^4] < \infty$. By Markov's inequality: $\sum_n \Pr[|\bar{X}_n - \mu| > \epsilon] < \infty$ for every $\epsilon > 0$. By the first Borel-Cantelli lemma: $\Pr[|\bar{X}_n - \mu| > \epsilon \text{ infinitely often}] = 0$. This gives almost sure convergence.

(General proof with only finite mean): Uses Kolmogorov's truncation technique and the Kolmogorov three-series theorem. The key idea is to truncate, apply the SLLN for bounded variables (using Borel-Cantelli), and show the truncation error vanishes.

Why It Matters

The SLLN justifies Monte Carlo simulation. When you estimate $\mathbb{E}[f(X)]$ by running a simulation and averaging $f(X_1), \ldots, f(X_n)$, you want to know that the estimate converges to the true value as you run longer. The SLLN guarantees this: with probability 1, your simulation will eventually give the right answer.

For ML: the SLLN implies that the empirical distribution converges to the true distribution (in a pointwise sense), which is the starting point for uniform convergence arguments that control the generalization gap.

Failure Mode

Like the WLLN, the SLLN requires $\mathbb{E}[|X|] < \infty$. An important subtlety: the SLLN can fail for independent but non-identically distributed variables even when every mean is finite. Kolmogorov's sufficient condition for the SLLN with independent (not identically distributed) variables is $\sum_n \text{Var}(X_n)/n^2 < \infty$ (Kolmogorov's criterion).

Rate of Convergence

The LLN tells you that $\bar{X}_n \to \mu$, but it does not tell you how fast. The rate of convergence comes from the central limit theorem:

$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$$

This says the fluctuations of $\bar{X}_n$ around $\mu$ are of order $\sigma / \sqrt{n}$. The $1/\sqrt{n}$ rate is universal (for distributions with finite variance) and explains why:

  • Halving the error requires quadrupling the sample size
  • Monte Carlo estimates converge slowly: 4x more computation for 2x more accuracy
  • Concentration inequalities give bounds of order $\sigma/\sqrt{n}$

The CLT goes beyond the LLN by characterizing the shape of the fluctuations (Gaussian), not just their decay.
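The $1/\sqrt{n}$ rate is easy to verify empirically. This sketch (standard-library Python; the helper name is mine) estimates the standard deviation of $\bar{X}_n$ for uniform samples at two values of $n$:

```python
import random
import statistics

def sd_of_sample_mean(n, trials=3000, seed=0):
    """Empirical standard deviation of X̄_n for X_i ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]
    return statistics.pstdev(means)

# Quadrupling n should roughly halve the fluctuations (the 1/sqrt(n) rate).
s_100, s_400 = sd_of_sample_mean(100), sd_of_sample_mean(400)
print(s_100 / s_400)  # close to 2
```

The ratio hovers near 2 up to simulation noise, matching $\sigma/\sqrt{100}$ versus $\sigma/\sqrt{400}$.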

Kolmogorov's Conditions

For independent (but not necessarily identically distributed) random variables $X_1, X_2, \ldots$ with $\mathbb{E}[X_i] = \mu_i$, the SLLN $\frac{1}{n}\sum_i (X_i - \mu_i) \xrightarrow{\text{a.s.}} 0$ holds under Kolmogorov's condition:

$$\sum_{n=1}^\infty \frac{\text{Var}(X_n)}{n^2} < \infty$$

This is satisfied, for instance, if the variances are uniformly bounded. The $1/n^2$ denominator comes from the Kronecker lemma combined with the Kolmogorov three-series theorem.
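As a sanity check of the bounded-variance case, here is a simulation sketch (the specific means and variances below are assumptions chosen for illustration): independent $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$ with $\sigma_i \le 1.5$, so $\sum_n \text{Var}(X_n)/n^2 < \infty$ and the centered average should vanish.

```python
import math
import random

def centered_mean(n, seed=0):
    """(1/n) * sum_i (X_i - mu_i) for independent, non-identically
    distributed X_i ~ Normal(mu_i, sigma_i^2) with sigma_i <= 1.5,
    so Kolmogorov's criterion sum_n Var(X_n)/n^2 < infinity holds."""
    rng = random.Random(seed)
    total = 0.0
    for i in range(1, n + 1):
        mu_i = float(i % 5)                # means may vary; that is fine
        sigma_i = 1.0 + 0.5 * math.sin(i)  # bounded variances
        total += rng.gauss(mu_i, sigma_i) - mu_i
    return total / n

print(abs(centered_mean(200_000)))  # small: the SLLN holds here
```

Note that the means $\mu_i$ are allowed to wander; only the variances need to be controlled.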

Canonical Examples

Example

Coin flips

Let $X_i \sim \text{Bern}(1/2)$. Then $\bar{X}_n$ is the fraction of heads in $n$ flips. The LLN says $\bar{X}_n \to 1/2$. Chebyshev gives the rate: $\Pr[|\bar{X}_n - 1/2| > \epsilon] \leq \frac{1}{4n\epsilon^2}$. For $\epsilon = 0.01$: $n \geq 250{,}000$ suffices for the probability to be at most $0.01$.

The CLT gives a much sharper estimate: $\Pr[|\bar{X}_n - 1/2| > \epsilon] \approx 2\Phi(-2\epsilon\sqrt{n})$, where $\Phi$ is the standard normal CDF.
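Both estimates can be compared against simulation (standard-library Python; `phi` builds the normal CDF from `math.erf`, and the helper names are mine):

```python
import math
import random

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def empirical_prob(n, eps, trials=2000, seed=0):
    """Empirical Pr[|fraction of heads - 1/2| > eps] for fair coin flips."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(n))
        if abs(heads / n - 0.5) > eps:
            exceed += 1
    return exceed / trials

n, eps = 1000, 0.02
chebyshev = 1 / (4 * n * eps ** 2)      # = 0.625, valid but very loose
clt = 2 * phi(-2 * eps * math.sqrt(n))  # CLT approximation
print(empirical_prob(n, eps), clt, chebyshev)
```

The empirical probability lands close to the CLT value while the Chebyshev bound is off by a factor of roughly three, illustrating why the CLT, when applicable, is the better guide to sample-size planning.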

Example

Empirical risk as LLN

The empirical risk $\hat{R}_n(h) = \frac{1}{n}\sum_i \ell(h(x_i), y_i)$ is the sample mean of $\ell(h(X_i), Y_i)$, which are i.i.d. with mean $R(h) = \mathbb{E}[\ell(h(X), Y)]$. By the SLLN:

$$\hat{R}_n(h) \xrightarrow{\text{a.s.}} R(h)$$

for each fixed $h$. This is pointwise convergence of the empirical risk to the population risk. The challenge of learning theory is to make this convergence uniform over $h \in \mathcal{H}$, which requires concentration inequalities and complexity measures (VC dimension, Rademacher complexity).
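A toy simulation of this pointwise convergence (the synthetic population, 0-1 loss, and 10% label-noise rate below are all assumptions for illustration; with this noise, the fixed classifier `h` has true risk $R(h) = 0.1$):

```python
import random

def zero_one_loss(pred, y):
    return float(pred != y)

def empirical_risk(h, data):
    """The empirical risk: a sample mean of per-example losses."""
    return sum(zero_one_loss(h(x), y) for x, y in data) / len(data)

def sample(n, rng):
    """Toy population: X ~ Uniform(0,1), Y = 1{X > 1/2} with 10% label noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.1:
            y = 1 - y          # flip the label with probability 0.1
        data.append((x, y))
    return data

h = lambda x: int(x > 0.5)     # the Bayes classifier here; R(h) = 0.1
rng = random.Random(0)
for n in (100, 10_000, 100_000):
    print(n, empirical_risk(h, sample(n, rng)))  # approaches R(h) = 0.1
```

This is exactly the SLLN applied to the random variables $\ell(h(X_i), Y_i)$; nothing about learning enters until `h` is chosen using the same data.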

Common Confusions

Watch Out

Convergence in probability is NOT almost sure convergence

Consider: let $X_n = 1$ with probability $1/n$ and $X_n = 0$ otherwise (independently). Then $X_n \xrightarrow{P} 0$ (since $\Pr[|X_n| > \epsilon] = 1/n \to 0$). But $\sum_n \Pr[X_n = 1] = \sum 1/n = \infty$, so by the second Borel-Cantelli lemma (the events are independent), $X_n = 1$ infinitely often with probability 1. Thus $X_n \not\xrightarrow{\text{a.s.}} 0$.

In this example, large deviations keep happening, just less and less frequently. Convergence in probability allows this; almost sure convergence does not.
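This example is easy to simulate (standard-library Python; the window endpoints are arbitrary choices of mine): even deep into the sequence, ones keep appearing, because $\sum 1/n$ diverges.

```python
import random

def ones_in_window(start, end, seed=0):
    """Count n in [start, end) with X_n = 1, where X_n = 1 w.p. 1/n, independently."""
    rng = random.Random(seed)
    count = 0
    for n in range(1, end):
        hit = rng.random() < 1.0 / n   # X_n = 1 with probability 1/n
        if n >= start and hit:
            count += 1
    return count

# Over [1000, 100000) the expected count is sum 1/n ≈ ln(100) ≈ 4.6:
# ones keep appearing even though Pr[X_n = 1] -> 0.
print(ones_in_window(1_000, 100_000))
```

Pushing the window further out shrinks the count only logarithmically; no matter how late you look, a one eventually shows up, which is precisely what almost sure convergence forbids.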

Watch Out

The LLN requires finite mean, not finite variance

The simplest proof of the WLLN uses Chebyshev's inequality and requires finite variance. But the WLLN holds with only a finite mean (the proof uses truncation). The SLLN also holds with only a finite mean. Finite variance gives you the rate of convergence (via the CLT), but convergence itself needs only a finite mean.

Conversely, if $\mathbb{E}[|X|] = \infty$, the SLLN fails. The sample mean does not converge to any fixed value.

Watch Out

The LLN is about the sample mean, not individual observations

A common misstatement: "by the law of large numbers, extreme values become rare." This is wrong. Individual observations $X_i$ always have the same distribution, no matter how large $n$ is. It is the average $\bar{X}_n$ that converges. Extreme values keep occurring at the same rate; they just get diluted by the average.
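A short simulation separates the two claims (standard-library Python; the Gaussian population and the threshold 2 are my illustrative choices): the mean concentrates while the fraction of extreme observations stays constant.

```python
import random

rng = random.Random(0)
n = 100_000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]

sample_mean = sum(xs) / n                        # converges to 0 by the LLN
extreme_frac = sum(abs(x) > 2 for x in xs) / n   # stays near P(|Z| > 2) ≈ 0.0455

print(sample_mean, extreme_frac)
```

Doubling $n$ shrinks the fluctuations of `sample_mean` but leaves `extreme_frac` pinned at the population tail probability: the LLN is a statement about averages, not about individual draws.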

Summary

  • WLLN: $\bar{X}_n \xrightarrow{P} \mu$ (convergence in probability) --- requires only $\mathbb{E}[|X|] < \infty$
  • SLLN: $\bar{X}_n \xrightarrow{\text{a.s.}} \mu$ (almost sure convergence) --- also requires only $\mathbb{E}[|X|] < \infty$
  • The SLLN is strictly stronger than the WLLN
  • Rate of convergence: fluctuations are $O(\sigma/\sqrt{n})$ (from the CLT)
  • LLN justifies: empirical risk as proxy for population risk, consistency of estimators, Monte Carlo methods
  • LLN fails for distributions with infinite mean (e.g., Cauchy)

Exercises

ExerciseCore

Problem

Simulate $n$ i.i.d. coin flips ($X_i \sim \text{Bern}(0.5)$) and plot $\bar{X}_n$ as a function of $n$ for $n = 1, 2, \ldots, 10{,}000$. Run 10 independent simulations on the same plot. Verify visually that the sample mean converges to 0.5 and that the fluctuations decrease as $1/\sqrt{n}$.

ExerciseAdvanced

Problem

Give an example of a sequence of independent random variables $X_n$ with $\mathbb{E}[X_n] = 0$ for all $n$ such that $\frac{1}{n}\sum_{i=1}^n X_i$ does not converge to 0 almost surely. Why does this not contradict the SLLN?

References

Canonical:

  • Durrett, Probability: Theory and Examples (5th ed., 2019), Sections 2.2-2.4
  • Billingsley, Probability and Measure (3rd ed., 1995), Sections 6, 22

Current:

  • Vershynin, High-Dimensional Probability (2018), Chapter 0 (motivation)

  • Wainwright, High-Dimensional Statistics (2019), Section 2.1

  • Casella & Berger, Statistical Inference (2002), Chapters 5-10

  • Lehmann & Casella, Theory of Point Estimation (1998), Chapters 1-6

Next Topics

Building on the law of large numbers:

  • Central limit theorem: the rate and shape of convergence --- $\sqrt{n}(\bar{X}_n - \mu) \to \mathcal{N}(0, \sigma^2)$
  • Empirical risk minimization: the LLN in action for learning theory
  • Concentration inequalities: non-asymptotic versions of the LLN

Last reviewed: April 2026
