

LLN and CLT Failures Under Heavy Tails

What breaks when finite mean or finite variance fails. Cauchy: the sample mean stays Cauchy no matter how large the sample. Pareto across the alpha regimes: when LLN still holds but CLT does not. Generalized CLT and stable-law limits. The consistency illusion in finance and reinsurance.

Advanced · Tier 1 · Stable · Core spine · ~45 min
For: ML, Stats

Why This Matters

The law of large numbers and the central limit theorem are the two load-bearing results behind sample-mean estimation, Monte Carlo, empirical risk minimization, and most confidence intervals. Both assume tail conditions. The LLN requires $\mathbb{E}\lvert X \rvert < \infty$. The classical CLT requires $\mathbb{E}[X^2] < \infty$. When either fails, the standard machinery does not just degrade gracefully. It produces estimators that look like they are converging and then snap back, confidence intervals that under-cover by orders of magnitude, and bootstrap procedures that return tighter and tighter answers around the wrong center.

This page is the failure-mode catalog. It walks through Cauchy (no mean), Pareto across the $\alpha$ regimes (mean exists or does not, variance exists or does not), the generalized CLT replacement when the standard CLT fails, and the domain-of-attraction language that says which distributions sit in which basin. It is the page to read before you compute a sample mean of insurance losses, financial returns, or token-frequency counts.

Quick Version

| Distribution | $\mathbb{E}\lvert X\rvert$ | $\mathrm{Var}(X)$ | LLN | Classical CLT | Limit law of normalized sum |
| --- | --- | --- | --- | --- | --- |
| Gaussian | finite | finite | holds | holds | Gaussian |
| Pareto, $\alpha > 2$ | finite | finite | holds | holds | Gaussian |
| Pareto, $1 < \alpha \leq 2$ | finite | infinite | holds | fails | $\alpha$-stable |
| Pareto, $\alpha \leq 1$ | infinite | infinite | fails | fails | $\alpha$-stable |
| Cauchy ($\alpha = 1$) | infinite | infinite | fails | fails | Cauchy (itself) |

The hierarchy is set by the tail index $\alpha$, not by variance alone. Infinite variance is where the standard CLT breaks. Infinite mean is where the LLN breaks. The two thresholds are different.

The Cauchy Case: LLN Fails Loudly

The Cauchy distribution has density $f(x) = \frac{1}{\pi (1 + x^2)}$. Its tails decay as $1/x^2$, slowly enough that $\mathbb{E}\lvert X\rvert = \infty$. There is no finite mean to converge to.

What goes wrong is sharper than "the sample mean has high variance". The sample mean of $n$ i.i.d. Cauchy variables has the same Cauchy distribution for every $n$. Averaging does nothing.
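
A quick numerical check of this (a sketch using NumPy; the sample size, repetition count, and seed are arbitrary choices): if averaging reduced spread at the usual $1/\sqrt{n}$ rate, the interquartile range of 1000-sample means would be about 30 times narrower than that of a single draw. For Cauchy it is not narrower at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1_000, 5_000

# one sample mean of n i.i.d. standard Cauchy draws per repetition
means = rng.standard_cauchy((reps, n)).mean(axis=1)

# The standard Cauchy has quartiles at -1 and +1, so its IQR is 2.
# If the sample mean is again standard Cauchy, the IQR of the means
# should be ~2 as well: averaging has not narrowed the spread.
q25, q75 = np.percentile(means, [25, 75])
print(f"IQR of {n}-sample means: {q75 - q25:.2f} (standard Cauchy IQR = 2)")
```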

Theorem

Sample Mean of i.i.d. Cauchy is Again Cauchy

Statement

If $X_1, \ldots, X_n$ are i.i.d. with the standard Cauchy distribution, then the sample mean satisfies $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \;\overset{d}{=}\; X_1$. The sample mean and a single observation have exactly the same distribution. Increasing $n$ does not narrow the spread of $\bar{X}_n$.

Intuition

Averaging is a noise-reduction operation when the noise has finite variance. Cauchy noise has infinite variance and a positive probability of producing, on any given sample, a value larger than the cumulative sum so far. That one observation dominates the average. Each additional sample brings a new chance to overwrite the running estimate.

Why It Matters

This is the cleanest counter-example to the intuition that "more data is always better". For Cauchy-tailed data, the sample mean is not a consistent estimator of any location parameter. The right estimator is the sample median, which is consistent and has finite asymptotic variance under Cauchy noise. The lesson generalizes: tail behavior dictates which estimator works, and the sample mean is the wrong default for heavy tails.
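
A minimal comparison of the two estimators (illustrative; the true location 5.0, the sample size, and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
true_loc = 5.0
x = true_loc + rng.standard_cauchy(100_000)

# The sample mean is dragged around by a handful of extreme draws;
# the sample median concentrates near the true location at rate 1/sqrt(n).
print(f"sample mean:   {x.mean():8.3f}")
print(f"sample median: {np.median(x):8.3f}")
```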

Failure Mode

The exact identity $\bar{X}_n \overset{d}{=} X_1$ depends on the symmetry and tail constant of the standard Cauchy. Other $\alpha = 1$ stable laws show the same self-similarity under averaging; a shifted Cauchy, for instance, averages to a Cauchy with the shift preserved. The qualitative failure of the LLN holds across all $\alpha \leq 1$ stable laws.

Optional Proof: via characteristic functions

The characteristic function of the standard Cauchy is $\varphi_X(t) = e^{-\lvert t\rvert}$. For i.i.d. $X_1, \ldots, X_n$ and $\bar{X}_n = \frac{1}{n}\sum X_i$,

$$\varphi_{\bar{X}_n}(t) = \prod_{i=1}^n \varphi_{X_i}(t/n) = \left(e^{-\lvert t/n\rvert}\right)^n = e^{-\lvert t\rvert} = \varphi_{X_1}(t).$$

By the uniqueness theorem for characteristic functions, $\bar{X}_n$ and $X_1$ have the same distribution. The proof is two lines. The reason it works is that $\varphi_X(t) = e^{-\lvert t\rvert}$ is the $\alpha = 1$ stable characteristic function, and the $\alpha = 1$ stability relation is exactly $\varphi(t)^n = \varphi(n t)$ up to the right scaling.

The same calculation shows that for any symmetric $\alpha$-stable law with characteristic function $\varphi(t) = e^{-\lvert t\rvert^\alpha}$, the normalized sum $n^{-1/\alpha}\sum X_i$ has the same distribution as $X_1$. The Cauchy case is $\alpha = 1$, which is why the unnormalized sample mean (normalization $n^{-1}$, equal to $n^{-1/\alpha}$ when $\alpha = 1$) is again Cauchy.

Pareto Across the Alpha Regimes

The Pareto distribution with shape $\alpha > 0$ has $\Pr[X > x] = (x_m/x)^\alpha$ for $x \geq x_m$, so the tail decays polynomially. Moments of order $k$ exist iff $k < \alpha$. This single parameter divides the parameter space into three qualitatively different regimes.
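
The threshold comes straight from the tail: $\mathbb{E}[X^k] = \int_{x_m}^\infty x^k\, \alpha x_m^\alpha x^{-\alpha-1}\, dx$ converges iff $k < \alpha$, and then equals $\alpha x_m^k/(\alpha - k)$. As a small sanity-check helper (illustrative, not part of any library):

```python
def pareto_moment(alpha: float, x_m: float, k: float) -> float:
    """k-th raw moment of Pareto(alpha, x_m): alpha * x_m**k / (alpha - k),
    finite iff k < alpha, infinite otherwise."""
    if k >= alpha:
        return float("inf")
    return alpha * x_m**k / (alpha - k)

print(pareto_moment(2.5, 1.0, 1))  # regime-A mean: 2.5/1.5
print(pareto_moment(1.5, 1.0, 1))  # regime-B mean: 3.0
print(pareto_moment(1.5, 1.0, 2))  # regime-B second moment: inf
```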

Regime A: $\alpha > 2$. Both LLN and CLT hold.

Mean $\mathbb{E}[X] = \frac{\alpha x_m}{\alpha - 1}$ and variance $\mathrm{Var}(X) = \frac{\alpha x_m^2}{(\alpha-1)^2(\alpha-2)}$ are both finite. The sample mean converges to $\mathbb{E}[X]$ almost surely by the SLLN, and the standardized sum $\sqrt{n}(\bar{X}_n - \mathbb{E}[X])$ converges in distribution to $\mathcal{N}(0, \mathrm{Var}(X))$ by the classical CLT.

The catch: the constants are large and the usual error rates are not available. For $\alpha = 2.5$, $\mathrm{Var}(X) = \frac{2.5\, x_m^2}{(1.5)^2 \cdot 0.5} \approx 2.2\, x_m^2$ is finite, but the Berry-Esseen bound requires a finite third absolute moment, and $\mathbb{E}[X^3] = \infty$ whenever $\alpha \leq 3$. The CLT holds, but the Gaussian approximation converges slowly.

Regime B: $1 < \alpha \leq 2$. LLN holds. CLT fails.

The mean is finite: $\mathbb{E}[X] = \frac{\alpha x_m}{\alpha - 1}$. By Khintchine's weak law, the sample mean converges in probability to $\mathbb{E}[X]$. The strong law also holds for i.i.d. variables, since a finite mean is sufficient.

Variance is infinite. The classical CLT does not apply. Instead, the normalized sum

$$\frac{1}{n^{1/\alpha}}\sum_{i=1}^n (X_i - \mathbb{E}[X])$$

converges in distribution to an $\alpha$-stable law. The normalization is $n^{1/\alpha}$, which grows faster than $\sqrt{n}$ when $\alpha < 2$: the fluctuations of the sum are larger. The limit is not Gaussian. It has the same power-law tails as the underlying distribution.

This is the regime where the consistency illusion bites. Run a Monte Carlo simulation with Pareto $\alpha = 1.5$ losses. The running average looks like it is settling down for thousands of iterations. Then a single draw lands in the deep tail, and the running average jumps by orders of magnitude. There is no $n$ large enough for the standard confidence interval $\bar{X}_n \pm 1.96\, s_n/\sqrt{n}$ to cover at the nominal rate, because $s_n^2$ does not stabilize.
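
The illusion is easy to reproduce with inverse-CDF sampling (a sketch; the seed and checkpoints are arbitrary, and any given run may or may not show a late jump):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, x_m, n = 1.5, 1.0, 1_000_000

# Pareto via inverse CDF: X = x_m * U**(-1/alpha) for U ~ Uniform(0, 1)
x = x_m * rng.random(n) ** (-1.0 / alpha)
running = np.cumsum(x) / np.arange(1, n + 1)

true_mean = alpha * x_m / (alpha - 1)  # = 3.0
for k in (10**3, 10**4, 10**5, 10**6):
    print(f"n = {k:>7}: running mean {running[k - 1]:.3f} (true mean {true_mean})")
```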

Regime C: $\alpha \leq 1$. Neither LLN nor CLT holds.

Mean is infinite. The sample mean does not converge to any finite value. The right normalization is again $n^{1/\alpha}$, and the limit is a totally skewed $\alpha$-stable law (since $X \geq x_m$ for the one-sided Pareto). Sample means are meaningless. Empirical procedures that compute them silently produce nonsense.

Watch Out

Heavy tail does not mean infinite mean

A common misreading: "Pareto is heavy-tailed, so its mean is infinite". Wrong for $\alpha > 1$. The mean exists and is well-defined; it is the variance that does not exist for $\alpha \leq 2$. A Pareto with $\alpha = 1.5$ has a finite mean of $3 x_m$, a finite median, a finite mode, and an infinite variance. Sample means converge to the mean by Khintchine's WLLN. What fails is the rate of convergence and the Gaussian fluctuation shape, not the limit itself.

The clean threshold is: $\alpha > k$ if and only if $\mathbb{E}[X^k] < \infty$. The LLN-relevant threshold is $\alpha > 1$. The CLT-relevant threshold is $\alpha > 2$.

The Generalized Central Limit Theorem

When variance is infinite but the tails are regularly varying, the standard CLT is replaced by a wider result. Define a positive random variable $X$ to have a regularly varying tail with index $\alpha$ if $\Pr[X > t x] / \Pr[X > t] \to x^{-\alpha}$ as $t \to \infty$ for every $x > 0$. Pareto, Student-$t$ with $\nu$ degrees of freedom (index $\alpha = \nu$), and the half-Cauchy ($\alpha = 1$) all satisfy this.

Theorem

Generalized CLT (Levy-Gnedenko)

Statement

If $X_1, X_2, \ldots$ are i.i.d. with regularly varying tails of index $\alpha \in (0, 2)$ satisfying the tail-balance condition, then there exist constants $a_n > 0$ and $b_n \in \mathbb{R}$ such that $\frac{1}{a_n}\left(\sum_{i=1}^n X_i - b_n\right) \xrightarrow{d} S_\alpha$ where $S_\alpha$ is an $\alpha$-stable distribution. The normalizing sequence satisfies $a_n = n^{1/\alpha} L(n)$ for a slowly varying $L$. When $\alpha > 1$, $b_n$ may be taken as $n\, \mathbb{E}[X_1]$; when $\alpha \leq 1$, the centering is more delicate and the unnormalized sum needs a different anchor.

Intuition

The classical CLT picks the Gaussian because $\sqrt{n}$ normalization plus finite variance forces the limit to satisfy a stability relation that only Gaussians satisfy. When the variance is infinite, the right normalization changes to $n^{1/\alpha}$, and the stability relation picks out the $\alpha$-stable laws instead. The Gaussian is the boundary case $\alpha = 2$.

Why It Matters

This is the failure-mode replacement, not a workaround. Confidence intervals constructed from the stable-law limit have the right coverage asymptotically; intervals constructed from the Gaussian approximation do not. The error distribution has power-law tails of index $\alpha$ rather than the Gaussian's $e^{-x^2/2}$ tail, and the rate of convergence is slower. Practical use requires either non-parametric methods (block maxima, peaks-over-threshold; see extreme-value theory) or estimating $\alpha$ first via Hill or similar tail-index estimators.

Failure Mode

The result needs the tail-balance condition: both tails (when applicable) must be regularly varying with the same index, with limit ratios $p, q \geq 0$ summing to 1. Asymmetric heavy tails get a one-sided stable limit. Distributions with slowly varying corrections (e.g. tails of the form $x^{-\alpha} \log x$) still belong to the $\alpha$-stable domain of attraction, but the norming sequence $a_n$ picks up an extra logarithmic factor beyond $n^{1/\alpha}$.

Quantitative Bound: domain-of-attraction characterization

Say $X$ is in the domain of attraction of an $\alpha$-stable law, written $X \in D(\alpha)$, if there exist $a_n, b_n$ such that $\frac{1}{a_n}(\sum X_i - b_n)$ converges in distribution to a non-degenerate $\alpha$-stable law. The Levy-Gnedenko characterization is:

  • $\alpha = 2$ (Gaussian domain): $X \in D(2)$ iff $\mathbb{E}[X^2 \mathbf{1}\{\lvert X\rvert \leq t\}]$ is slowly varying as $t \to \infty$. This includes all distributions with finite variance, plus a few without (e.g. $\Pr[X > x] \sim 1/(x^2 \log x)$).
  • $0 < \alpha < 2$: $X \in D(\alpha)$ iff $\Pr[\lvert X\rvert > t]$ is regularly varying with index $-\alpha$, and the tails are balanced: $\Pr[X > t]/\Pr[\lvert X\rvert > t] \to p$ for some $p \in [0, 1]$.

The Gaussian domain is enormous; the $\alpha$-stable domain for each $\alpha < 2$ is a thin slice of distributions sharing one tail index. This is why the standard CLT is "stable" in the usability sense: most realistic distributions live in $D(2)$. The exceptions are exactly the distributions that look heavy-tailed on a log-log plot.
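
The $n^{1/\alpha}$ norming can be checked by simulation (a sketch for Pareto $\alpha = 1.5$, where $a_n = n^{2/3}$; the repetition count and seed are arbitrary): under the stable norming the spread of the centered sums should hold roughly steady as $n$ grows, while under the CLT's $\sqrt{n}$ norming it keeps growing like $n^{1/\alpha - 1/2}$.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, mu, reps = 1.5, 3.0, 1_000  # Pareto(1.5, 1) has mean mu = 3

def iqr(v):
    q25, q75 = np.percentile(v, [25, 75])
    return q75 - q25

for n in (100, 1_000, 10_000):
    # reps independent sums of n Pareto(1.5, 1) draws, centered at n * mu
    sums = (rng.random((reps, n)) ** (-1.0 / alpha)).sum(axis=1) - n * mu
    print(f"n = {n:>6}: IQR / n^(2/3) = {iqr(sums / n ** (1 / alpha)):7.2f}   "
          f"IQR / sqrt(n) = {iqr(sums / np.sqrt(n)):7.2f}")
```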

The Consistency Illusion

A running average $\bar{X}_n$ from a heavy-tailed process can look convergent for thousands of samples and then jump. The visual pattern is:

  1. Many small samples accumulate. The running average drifts toward a plausible-looking value.
  2. A single deep-tail sample arrives. The largest observation scales like $n^{1/\alpha}$, so its contribution to the average scales like $n^{1/\alpha - 1}$: growing with $n$ when $\alpha < 1$, and shrinking only at the slow rate $n^{-(1 - 1/\alpha)}$ when $\alpha \in (1, 2)$.
  3. The cycle repeats. The average never settles.

For $\alpha \in (1, 2)$, the LLN guarantees the running average converges to $\mathbb{E}[X]$, but on time scales that depend on $\alpha$. The $1/\sqrt{n}$ rate from the CLT is wrong. The right rate is $1/n^{1-1/\alpha}$, and the exponent $1 - 1/\alpha$ approaches 0 as $\alpha \to 1$. At $\alpha = 1.1$, the rate is roughly $1/n^{0.09}$. A million samples reduce the error by a factor of roughly $3.5$, not by a factor of $1000$.
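
The arithmetic behind that claim (pure arithmetic, no simulation; the sample size is just illustrative):

```python
# Error shrinks like n^{-(1 - 1/alpha)}; the exponent collapses as alpha -> 1
N = 1_000_000
for alpha in (1.1, 1.5, 2.0):
    exponent = 1 - 1 / alpha
    print(f"alpha = {alpha}: exponent {exponent:.3f}, "
          f"{N:,} samples shrink the error by ~{N ** exponent:.1f}x")
```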

Example

Pareto sample-mean simulation

Take $X_i$ i.i.d. Pareto with $\alpha = 1.5$, $x_m = 1$, so $\mathbb{E}[X] = 3$. Simulate $\bar{X}_n$ for $n$ up to $10^6$. Typical runs show $\bar{X}_n$ oscillating in the range $[2.5, 4.5]$ even at $n = 10^6$, occasionally jumping to $15$ or $20$ after a single deep-tail sample. The standard 95% confidence interval $\bar{X}_n \pm 1.96\, s_n/\sqrt{n}$ is meaningless: $s_n$ itself does not stabilize, because $\mathrm{Var}(X) = \infty$. A correctly designed stable-law confidence interval is much wider and has the right coverage.
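
The non-stabilization of $s_n$ is visible directly (a sketch; one fresh run per sample size, and the numbers swing a lot between seeds, which is exactly the point):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 1.5  # Pareto(1.5, 1): infinite variance

# The sample standard deviation keeps growing (roughly like n^(1/6) here)
# instead of settling at a finite sigma -- there is no sigma to settle at.
for n in (10**3, 10**4, 10**5, 10**6):
    x = rng.random(n) ** (-1.0 / alpha)
    print(f"n = {n:>7}: sample std {x.std():10.2f}")
```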

Common Confusions

Watch Out

Variance is not the right threshold for LLN

Finite variance is sufficient for the LLN (via Chebyshev) but not necessary. The right condition for the i.i.d. LLN is a finite mean. Distributions like Pareto with $\alpha = 1.5$ have infinite variance and still satisfy the LLN. Conversely, distributions with infinite mean (Cauchy, Pareto $\alpha \leq 1$) fail the LLN, even though "infinite variance" sounds less drastic than "infinite mean".

Watch Out

CLT failure is not just slower convergence

The CLT does not gradually weaken as the tail gets heavier. It snaps off at $\alpha = 2$. For $\alpha < 2$, the limit law is not Gaussian; it is a different law, with different tail behavior, different quantiles, and different confidence-interval calibration. A confidence interval built on the Gaussian approximation does not just under-cover by a small factor. It under-covers by an asymptotically growing factor.

Watch Out

Cauchy is not just heavy-tailed Gaussian

Cauchy and Gaussian have the same symmetric, unimodal density shape, so intuition borrowed from "normal-ish" distributions is dangerous. The Cauchy distribution has no mean, no variance, and no finite absolute moment of any order $p \geq 1$ (only fractional moments of order $p < 1$ exist). Operations that rely on those moments (sample mean, sample variance, t-statistic) produce results that have no probabilistic interpretation in the limit. Use the median or the trimmed mean.

Connections to ML and Risk

Heavy-tailed phenomena show up across ML in ways that quietly break the standard analysis:

  • Gradient norms during training can be heavy-tailed, especially in language models and on noisy data. SGD-as-SDE arguments that assume Gaussian gradient noise break in this regime.
  • Token frequencies follow Zipf's law (power-law exponent $\alpha \approx 1$), which means sample-based estimates of language statistics suffer the Pareto $\alpha \in (1, 2)$ slow-convergence problem.
  • The generalization gap across seeds has been empirically observed to have heavy upper tails on some benchmarks, which makes "mean accuracy across 5 seeds" an unreliable summary.
  • Financial returns at daily resolution typically have $\alpha \in (2.5, 4)$ for major indices, sliding closer to 2 for individual equities and much lower for cryptocurrencies. Standard volatility models that assume finite kurtosis miss the tail risk by orders of magnitude.
  • Insurance and reinsurance losses routinely fit Pareto with $\alpha < 2$, which is why the actuarial literature has spent decades building tools that do not assume the CLT (peaks-over-threshold, copulas, excess-of-loss arrangements). See extreme-value theory.

Summary

  • The LLN needs $\mathbb{E}\lvert X\rvert < \infty$. The CLT needs $\mathbb{E}[X^2] < \infty$. These are different thresholds.
  • Cauchy and Pareto $\alpha \leq 1$ fail the LLN entirely. The sample mean does not converge.
  • Pareto $1 < \alpha \leq 2$ satisfies the LLN but fails the classical CLT. The sample mean converges to the population mean, but the fluctuations are $\alpha$-stable, not Gaussian, and the error decays like $n^{-(1-1/\alpha)}$, not $n^{-1/2}$.
  • The generalized CLT (Levy-Gnedenko) gives the right limit law: $\alpha$-stable. The domain of attraction is characterized by regular variation of the tail.
  • For heavy-tailed data, use the median or a trimmed mean as the location estimator, and use stable-law or peaks-over-threshold intervals for the uncertainty quantification.

Exercises

ExerciseCore

Problem

Simulate $n = 10^4$ i.i.d. standard Cauchy variables and plot the running sample mean as a function of $n$. Repeat the experiment 20 times on the same axes. Compare to the analogous plot for i.i.d. $\mathcal{N}(0, 1)$ variables.

ExerciseAdvanced

Problem

For i.i.d. Pareto $X_i$ with $\alpha = 1.5$, $x_m = 1$, derive the correct normalization $a_n$ such that $a_n^{-1}(\sum_i X_i - n\, \mathbb{E}[X])$ has a non-degenerate limit, and identify the limiting distribution as an $\alpha$-stable law. Why is the normalization $n^{2/3}$ and not $\sqrt{n}$?

References

Canonical:

  • Feller, An Introduction to Probability Theory and Its Applications, Vol II (2nd ed., 1971), Chapter XVII (stable laws and domain of attraction).
  • Durrett, Probability: Theory and Examples (5th ed., 2019), Sections 3.7 and 3.8 (stable laws, generalized CLT).
  • Resnick, Heavy-Tail Phenomena: Probabilistic and Statistical Modeling (2007), Chapters 1-3.

Current:

  • Embrechts, Klüppelberg, and Mikosch, Modelling Extremal Events for Insurance and Finance (1997), Chapter 1 (regular variation and the domain-of-attraction framework).
  • Nolan, Univariate Stable Distributions: Models for Heavy Tailed Data (2020), Chapters 1 and 3 (computational treatment of stable laws and Hill-type tail-index estimators).

Critique:

  • Taleb, Statistical Consequences of Fat Tails (2020), Chapters 3 and 5 on the consistency illusion and the misuse of sample moments under heavy tails.

Next Topics

Building on the failure modes:

  • Fat Tails — broader treatment of heavy-tailed phenomena and where they appear in ML and risk.
  • Extreme-Value Theory — Fisher-Tippett three-types theorem, peaks-over-threshold, and quantitative tail estimation.
  • Subexponential Random Variables — the borderline case between thin and heavy tails, with concentration inequalities that still apply.

Last reviewed: May 12, 2026
