

Weak Law vs. Strong Law of Large Numbers

Both promise the sample mean converges to the population mean. The weak law uses convergence in probability and is easier to prove. The strong law uses almost-sure convergence and is what Monte Carlo simulation actually needs. The canonical counter-example shows the gap is real.

Last reviewed: May 12, 2026

What Each Promises

Both statements say the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ of i.i.d. variables with finite mean $\mu$ converges to $\mu$. They differ in what "converges" means.

Weak law (WLLN). For every $\epsilon > 0$, $\lim_{n \to \infty} \Pr\!\left[\lvert\bar{X}_n - \mu\rvert > \epsilon\right] = 0$. This is convergence in probability. The probability of a fixed-size deviation vanishes, but for any given $n$ there is still some chance of a large deviation.

Strong law (SLLN). $\Pr\!\left[\lim_{n \to \infty} \bar{X}_n = \mu\right] = 1$. This is almost-sure convergence. With probability one, the entire sequence $\bar{X}_1, \bar{X}_2, \ldots$ actually converges to $\mu$ as a deterministic limit, viewed sample path by sample path.

The almost-sure form is strictly stronger. Anything that converges almost surely converges in probability; the reverse fails.
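The difference is easiest to feel numerically. A minimal Python sketch (illustrative only; the fair-coin distribution, $\epsilon = 0.1$, trial count, and seed are arbitrary choices, not from the source) estimates the WLLN quantity $\Pr[\lvert\bar{X}_n - \mu\rvert > \epsilon]$ at a few values of $n$:

```python
import random

def deviation_probability(n, eps, trials=4000):
    """Estimate Pr[|X_bar_n - mu| > eps] for n fair coin flips (mu = 0.5)."""
    bad = 0
    for _ in range(trials):
        heads = sum(random.random() < 0.5 for _ in range(n))
        if abs(heads / n - 0.5) > eps:
            bad += 1
    return bad / trials

random.seed(0)
probs = {n: deviation_probability(n, 0.1) for n in (10, 100, 1000)}
print(probs)
```

The estimated probability shrinks toward zero as $n$ grows, which is exactly the WLLN statement; note that it says nothing about any single trajectory.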

The Gap Is Real

The simplest way to see that the gap matters: a sequence that converges in probability but not almost surely.

Let $X_n$ be independent Bernoulli variables with $\Pr[X_n = 1] = 1/n$ and $\Pr[X_n = 0] = 1 - 1/n$.

In probability: $\Pr[\lvert X_n - 0 \rvert > \epsilon] = 1/n \to 0$ for any $\epsilon \in (0, 1)$. So $X_n \xrightarrow{P} 0$.

Almost surely: the events $\{X_n = 1\}$ are independent and $\sum 1/n = \infty$. By the second Borel-Cantelli lemma, $\Pr[X_n = 1 \text{ infinitely often}] = 1$. So along almost every sample path, $X_n$ keeps returning to $1$ infinitely often. The sequence $X_n$ does not converge to $0$ for almost every $\omega$.

This is the canonical separation. Convergence in probability lets the exceptional "bad" event keep happening, as long as its probability shrinks. Almost-sure convergence requires that for almost every sample path, the bad event eventually stops.
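The counter-example can be watched directly. A short sketch (the seed and horizon are arbitrary): because $\sum 1/n$ diverges, hits at $X_n = 1$ never stop arriving, and up to index $N$ there are on the order of $\log N$ of them:

```python
import random

random.seed(1)
N = 200000
# X_n = 1 exactly when random() < 1/n; since sum(1/n) diverges,
# the second Borel-Cantelli lemma says hits keep arriving forever.
hits = [n for n in range(1, N + 1) if random.random() < 1.0 / n]
print(len(hits), hits[:5], hits[-1])
```

Only a handful of hits occur, but the last one sits at a large index; run longer and more always appear, so the sequence never settles at $0$ on this path.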

Proof-Effort Comparison

Aspect | Weak law | Strong law
Convergence mode | In probability | Almost sure
Minimal i.i.d. assumption | $\mathbb{E}\lvert X \rvert < \infty$ | $\mathbb{E}\lvert X \rvert < \infty$
Quick proof tool | Chebyshev's inequality (if $\mathrm{Var}(X) < \infty$) | Borel-Cantelli + 4th moment, or Etemadi
Hard proof | Truncation argument (Khintchine) | Kolmogorov three-series + truncation
Generalization beyond i.i.d. | Easier, many variants | Harder, fewer variants

The 4th-moment SLLN proof is short and goes via Markov + first Borel-Cantelli: assuming $\mathbb{E}X^4 < \infty$, compute $\mathbb{E}[(\bar{X}_n - \mu)^4] = O(1/n^2)$, sum over $n$ to get $\sum_n \Pr[\lvert\bar{X}_n - \mu\rvert > \epsilon] < \infty$, and conclude that the deviation event happens only finitely often. The minimal-assumption SLLN proof (only $\mathbb{E}\lvert X \rvert < \infty$) needs Etemadi's argument, which truncates and exploits pairwise independence along geometrically spaced subsequences. The WLLN under the same assumption needs only a single truncation step.
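For reference, the fourth-moment computation behind that sketch (writing $Y_i = X_i - \mu$, i.i.d. and mean-zero, with $\mathbb{E}Y^4 < \infty$): expanding $(\sum_i Y_i)^4$, only the $n$ diagonal terms $\mathbb{E}[Y_i^4]$ and the $3n(n-1)$ paired terms $\mathbb{E}[Y_i^2]\mathbb{E}[Y_j^2]$ survive, so

```latex
\mathbb{E}\big[(\bar{X}_n - \mu)^4\big]
  = \frac{n\,\mathbb{E}[Y^4] + 3n(n-1)\,(\mathbb{E}[Y^2])^2}{n^4}
  = O\!\left(\frac{1}{n^2}\right),
\qquad
\Pr\big[\lvert\bar{X}_n - \mu\rvert > \epsilon\big]
  \le \frac{\mathbb{E}\big[(\bar{X}_n - \mu)^4\big]}{\epsilon^4}
  = O\!\left(\frac{1}{\epsilon^4 n^2}\right),
```

and the right-hand side is summable over $n$, which is what the first Borel-Cantelli lemma needs.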

When Each Form Is What You Need

The weak law is enough whenever the question is about a single value of nn:

  • Confidence intervals. $\Pr[\lvert\bar{X}_n - \mu\rvert > \epsilon] \leq \alpha$ is exactly the kind of finite-$n$ statement the WLLN (via Chebyshev-type bounds) controls.
  • Hypothesis testing power. Probability that the test statistic crosses the threshold at sample size nn.
  • Polling and survey statistics. "How big does $n$ need to be for the estimator to be within $\epsilon$ of the truth with probability $1 - \delta$?"

The strong law is required whenever the question is about the whole trajectory:

  • Monte Carlo correctness. A long-running simulation needs the guarantee that its running average will eventually settle at the right value, sample path by sample path. The WLLN does not give this; the running average might keep wandering, just less and less often.
  • Pathwise statements in stochastic processes. "The empirical measure of a positive-recurrent Markov chain converges to the stationary distribution" is an a.s. statement; the weak version is too weak to use for individual realized trajectories.
  • Almost-sure consistency of estimators in asymptotic statistics. The classical MLE consistency result is a.s. consistency, not just convergence in probability.

In ML practice the WLLN suffices for most generalization arguments (empirical risk near population risk with high probability), but the SLLN is what justifies the statement "if I train for long enough on fresh i.i.d. samples, the training loss converges to the population loss along almost every training trajectory".
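The Monte Carlo claim is easy to watch on one sample path. A minimal sketch (Uniform(0, 1) samples, so $\mu = 0.5$; the seed and checkpoint spacing are arbitrary) tracks a single running average:

```python
import random

random.seed(2)
mu = 0.5                      # population mean of Uniform(0, 1)
total = 0.0
running = []                  # running average recorded at checkpoints
for n in range(1, 100001):
    total += random.random()
    if n % 10000 == 0:
        running.append(total / n)

print("final average:", running[-1], "deviation:", abs(running[-1] - mu))
```

The SLLN is the statement that this settling happens on almost every such path, not merely that a deviation at one fixed $n$ is unlikely.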

The Practical Gap

For most well-behaved distributions, the two limits look identical in simulation: both versions converge fast enough that the trajectory settles within an envelope of $O(1/\sqrt{n})$. The visible distinction shows up only in pathological cases:

  • The Bernoulli example above, which is a sequence of independent but not identically distributed variables. For sample means of i.i.d. variables with finite mean, the SLLN itself delivers almost-sure convergence, so the gap vanishes.
  • Sequences constructed deliberately to converge in probability but not almost surely (the "typewriter" sequence of sliding indicator functions, a standard modes-of-convergence example).

What this means in practice: for i.i.d. data with finite mean, you get both versions of the LLN for free. The gap is a theoretical-foundations issue, not a practical one. But the gap is real, and knowing which version your argument needs prevents over-claiming or under-claiming guarantees.
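A quick pathwise sanity check of that $O(1/\sqrt{n})$ envelope (illustrative; fair coin flips with $\sigma = 0.5$, arbitrary seed and horizon). The normalized deviation $\lvert\bar{X}_n - \mu\rvert\sqrt{n}/\sigma$ should stay modest along the whole trajectory:

```python
import math
import random

random.seed(3)
sigma = 0.5           # standard deviation of one fair coin flip
total = 0.0
worst = 0.0           # largest normalized deviation seen along the path
for n in range(1, 50001):
    total += random.random() < 0.5          # add one Bernoulli(1/2) flip
    dev = abs(total / n - 0.5)              # |X_bar_n - mu|
    worst = max(worst, dev * math.sqrt(n) / sigma)
print(worst)
```

The maximum stays at a few units (law-of-the-iterated-logarithm-type behavior), which is why the WLLN/SLLN gap is invisible in ordinary simulations of well-behaved distributions.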

Common Confusions

"The strong law is just a stronger version of the weak law." Strictly true as a statement about convergence modes, but misleading as a guide to which to apply. They answer different questions. WLLN bounds a finite-nn probability; SLLN guarantees a limit along sample paths. The "stronger" version costs more to prove and is what stochastic-process arguments need.

"For i.i.d. data the two are the same." Almost. For i.i.d. with finite mean, both hold under the same assumption, so as theorems they are equivalent in their scope. But the conclusions are different: one bounds a probability at each nn, the other asserts pathwise convergence.

"Convergence in probability implies convergence almost surely along a subsequence." True (this is a classical lemma), but not the same as almost-sure convergence of the original sequence. The subsequence may depend on ω\omega, and the inter-subsequence behavior can be wild.

Quick Decision Rule

Question | Use
"What is $\Pr[\lvert \bar{X}_n - \mu\rvert > \epsilon]$ at this $n$?" | WLLN
"Will my running average settle at $\mu$?" | SLLN
"Is my MLE consistent?" | SLLN (in classical statement); WLLN sufficient if you only want "converges in probability"
"Is empirical risk close to population risk with high probability at $n$?" | WLLN
"Does the Markov chain time-average converge to the stationary mean?" | SLLN

References

Canonical:

  • Durrett, Probability: Theory and Examples (5th ed., 2019), Sections 2.2-2.4 (WLLN, SLLN, Etemadi's proof).
  • Billingsley, Probability and Measure (3rd ed., 1995), Sections 6 and 22.
  • Kallenberg, Foundations of Modern Probability (3rd ed., 2021), Chapter 5 (strong laws and the Kolmogorov three-series theorem).

Current:

  • Resnick, A Probability Path (1999; reprint 2014), Chapter 7 (compact pedagogical treatment with the Borel-Cantelli counter-example).