

LLN and CLT Failures Under Heavy Tails

What breaks when finite mean or finite variance fails. Cauchy: the sample mean stays Cauchy no matter how large the sample. Pareto across the alpha regimes: when LLN still holds but CLT does not. Generalized CLT and stable-law limits. The consistency illusion in finance and reinsurance.

Advanced · Tier 1 · Stable · Core spine · ~45 min
For: ML, Stats

Why This Matters

The law of large numbers and the central limit theorem are the two load-bearing results behind sample-mean estimation, Monte Carlo, empirical risk minimization, and most confidence intervals. Both assume tail conditions. The LLN requires $\mathbb{E}\lvert X \rvert < \infty$. The classical CLT requires $\mathbb{E}[X^2] < \infty$. When either fails, the standard machinery does not just degrade gracefully. It produces estimators that look like they are converging and then snap back, confidence intervals that under-cover by orders of magnitude, and bootstrap procedures that return tighter and tighter answers around the wrong center.

This page is the failure-mode catalog. It walks through Cauchy (no mean), Pareto across the $\alpha$ regimes (mean exists or does not, variance exists or does not), the generalized CLT replacement when the standard CLT fails, and the domain-of-attraction language that says which distributions sit in which basin. It is the page to read before you compute a sample mean of insurance losses, financial returns, or token-frequency counts.

Quick Version

| Distribution | $\mathbb{E}\lvert X\rvert$ | $\mathrm{Var}(X)$ | LLN | Classical CLT | Limit law of normalized sum |
| --- | --- | --- | --- | --- | --- |
| Gaussian | finite | finite | holds | holds | Gaussian |
| Pareto, $\alpha > 2$ | finite | finite | holds | holds | Gaussian |
| Pareto, $1 < \alpha \leq 2$ | finite | infinite | holds | fails | $\alpha$-stable |
| Pareto, $\alpha \leq 1$ | infinite | infinite | fails | fails | $\alpha$-stable |
| Cauchy ($\alpha = 1$) | infinite | infinite | fails | fails | Cauchy (itself) |

The hierarchy is set by the tail index $\alpha$, not by variance alone. Infinite variance is where the standard CLT breaks. Infinite mean is where the LLN breaks. The two thresholds are different.

The Cauchy Case: LLN Fails Loudly

The Cauchy distribution has density $f(x) = \frac{1}{\pi (1 + x^2)}$. Its tails decay as $1/x^2$, slowly enough that $\mathbb{E}\lvert X\rvert = \infty$. There is no finite mean to converge to.

What goes wrong is sharper than "the sample mean has high variance". The sample mean of $n$ i.i.d. Cauchy variables has the same Cauchy distribution for every $n$. Averaging does nothing.
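
A quick numerical check of this (a sketch using NumPy; the sample size, repetition count, and seed are arbitrary choices): if averaging reduced spread at the usual $1/\sqrt{n}$ rate, the interquartile range of 1000-sample means would be about 30 times narrower than that of a single draw. For Cauchy it is not narrower at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1_000, 5_000

# one sample mean of n i.i.d. standard Cauchy draws per repetition
means = rng.standard_cauchy((reps, n)).mean(axis=1)

# The standard Cauchy has quartiles at -1 and +1, so its IQR is 2.
# If the sample mean is again standard Cauchy, the IQR of the means
# should be ~2 as well: averaging has not narrowed the spread.
q25, q75 = np.percentile(means, [25, 75])
print(f"IQR of {n}-sample means: {q75 - q25:.2f} (standard Cauchy IQR = 2)")
```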

Theorem

Sample Mean of i.i.d. Cauchy is Again Cauchy

Statement

If $X_1, \ldots, X_n$ are i.i.d. with the standard Cauchy distribution, then the sample mean satisfies $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \;\overset{d}{=}\; X_1$. The sample mean and a single observation have exactly the same distribution. Increasing $n$ does not narrow the spread of $\bar{X}_n$.

Intuition

Averaging is a noise-reduction operation when the noise has finite variance. Cauchy noise has infinite variance and a positive probability of producing, on any given sample, a value larger than the cumulative sum so far. That one observation dominates the average. Each additional sample brings a new chance to overwrite the running estimate.

Why It Matters

This is the cleanest counter-example to the intuition that "more data is always better". For Cauchy-tailed data, the sample mean is not a consistent estimator of any location parameter. The right estimator is the sample median, which is consistent and has finite asymptotic variance under Cauchy noise. The lesson generalizes: tail behavior dictates which estimator works, and the sample mean is the wrong default for heavy tails.
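
A minimal comparison of the two estimators (illustrative; the true location 5.0, the sample size, and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
true_loc = 5.0
x = true_loc + rng.standard_cauchy(100_000)

# The sample mean is dragged around by a handful of extreme draws;
# the sample median concentrates near the true location at rate 1/sqrt(n).
print(f"sample mean:   {x.mean():8.3f}")
print(f"sample median: {np.median(x):8.3f}")
```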

Failure Mode

The exact identity $\bar{X}_n \overset{d}{=} X_1$ depends on the symmetry and tail constant of the standard Cauchy. Other $\alpha = 1$ stable laws show the same self-similarity under averaging; a shifted Cauchy, for instance, averages to a Cauchy with the shift preserved. The qualitative failure of the LLN holds across all $\alpha \leq 1$ stable laws.

Optional Proof: via characteristic functions

The characteristic function of the standard Cauchy is $\varphi_X(t) = e^{-\lvert t\rvert}$. For i.i.d. $X_1, \ldots, X_n$ and $\bar{X}_n = \frac{1}{n}\sum X_i$,

$$\varphi_{\bar{X}_n}(t) = \prod_{i=1}^n \varphi_{X_i}(t/n) = \left(e^{-\lvert t/n\rvert}\right)^n = e^{-\lvert t\rvert} = \varphi_{X_1}(t).$$

By the uniqueness theorem for characteristic functions, $\bar{X}_n$ and $X_1$ have the same distribution. The proof is two lines. The reason it works is that $\varphi_X(t) = e^{-\lvert t\rvert}$ is the $\alpha = 1$ stable characteristic function, and the $\alpha = 1$ stability relation is exactly $\varphi(t)^n = \varphi(n t)$ up to the right scaling.

The same calculation shows that for any symmetric $\alpha$-stable law with characteristic function $\varphi(t) = e^{-\lvert t\rvert^\alpha}$, the normalized sum $n^{-1/\alpha}\sum X_i$ has the same distribution as $X_1$. The Cauchy case is $\alpha = 1$, which is why the unnormalized sample mean (normalization $n^{-1}$, equal to $n^{-1/\alpha}$ when $\alpha = 1$) is again Cauchy.

Pareto Across the Alpha Regimes

The Pareto distribution with shape $\alpha > 0$ has $\Pr[X > x] = (x_m/x)^\alpha$ for $x \geq x_m$, so the tail decays polynomially. Moments of order $k$ exist iff $k < \alpha$. This single parameter divides the parameter space into three qualitatively different regimes.
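
The threshold comes straight from the tail: $\mathbb{E}[X^k] = \int_{x_m}^\infty x^k\, \alpha x_m^\alpha x^{-\alpha-1}\, dx$ converges iff $k < \alpha$, and then equals $\alpha x_m^k/(\alpha - k)$. As a small sanity-check helper (illustrative, not part of any library):

```python
def pareto_moment(alpha: float, x_m: float, k: float) -> float:
    """k-th raw moment of Pareto(alpha, x_m): alpha * x_m**k / (alpha - k),
    finite iff k < alpha, infinite otherwise."""
    if k >= alpha:
        return float("inf")
    return alpha * x_m**k / (alpha - k)

print(pareto_moment(2.5, 1.0, 1))  # regime-A mean: 2.5/1.5
print(pareto_moment(1.5, 1.0, 1))  # regime-B mean: 3.0
print(pareto_moment(1.5, 1.0, 2))  # regime-B second moment: inf
```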

Regime A: $\alpha > 2$. Both LLN and CLT hold.

Mean $\mathbb{E}[X] = \frac{\alpha x_m}{\alpha - 1}$ and variance $\mathrm{Var}(X) = \frac{\alpha x_m^2}{(\alpha-1)^2(\alpha-2)}$ are both finite. The sample mean converges to $\mathbb{E}[X]$ almost surely by the SLLN, and the standardized sum $\sqrt{n}(\bar{X}_n - \mathbb{E}[X])$ converges in distribution to $\mathcal{N}(0, \mathrm{Var}(X))$ by the classical CLT.

The catch: the constants are large and the usual error rates are not available. For $\alpha = 2.5$, $\mathrm{Var}(X) = \frac{2.5\, x_m^2}{(1.5)^2 \cdot 0.5} \approx 2.2\, x_m^2$ is finite, but the Berry-Esseen bound requires a finite third absolute moment, and $\mathbb{E}[X^3] = \infty$ whenever $\alpha \leq 3$. The CLT holds, but the Gaussian approximation converges slowly.

Regime B: $1 < \alpha \leq 2$. LLN holds. CLT fails.

The mean is finite: $\mathbb{E}[X] = \frac{\alpha x_m}{\alpha - 1}$. By Khintchine's weak law, the sample mean converges in probability to $\mathbb{E}[X]$. The strong law also holds for i.i.d. variables, since a finite mean is sufficient.

Variance is infinite. The classical CLT does not apply. Instead, the normalized sum

$$\frac{1}{n^{1/\alpha}}\sum_{i=1}^n (X_i - \mathbb{E}[X])$$

converges in distribution to an $\alpha$-stable law. The normalization is $n^{1/\alpha}$, which grows faster than $\sqrt{n}$ when $\alpha < 2$: the fluctuations of the sum are larger. The limit is not Gaussian. It has the same power-law tails as the underlying distribution.

This is the regime where the consistency illusion bites. Run a Monte Carlo simulation with Pareto $\alpha = 1.5$ losses. The running average looks like it is settling down for thousands of iterations. Then a single draw lands in the deep tail, and the running average jumps by orders of magnitude. There is no $n$ large enough for the standard confidence interval $\bar{X}_n \pm 1.96\, s_n/\sqrt{n}$ to cover at the nominal rate, because $s_n^2$ does not stabilize.
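
The illusion is easy to reproduce with inverse-CDF sampling (a sketch; the seed and checkpoints are arbitrary, and any given run may or may not show a late jump):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, x_m, n = 1.5, 1.0, 1_000_000

# Pareto via inverse CDF: X = x_m * U**(-1/alpha) for U ~ Uniform(0, 1)
x = x_m * rng.random(n) ** (-1.0 / alpha)
running = np.cumsum(x) / np.arange(1, n + 1)

true_mean = alpha * x_m / (alpha - 1)  # = 3.0
for k in (10**3, 10**4, 10**5, 10**6):
    print(f"n = {k:>7}: running mean {running[k - 1]:.3f} (true mean {true_mean})")
```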

Regime C: $\alpha \leq 1$. Neither LLN nor CLT holds.

Mean is infinite. The sample mean does not converge to any finite value. The right normalization is again $n^{1/\alpha}$, and the limit is a totally skewed $\alpha$-stable law (since $X \geq x_m$ for the one-sided Pareto). Sample means are meaningless. Empirical procedures that compute them silently produce nonsense.

Watch Out

Heavy tail does not mean infinite mean

A common misreading: "Pareto is heavy-tailed, so its mean is infinite". Wrong for $\alpha > 1$. The mean exists and is well-defined; it is the variance that does not exist for $\alpha \leq 2$. A Pareto with $\alpha = 1.5$ has a finite mean of $3 x_m$, a finite median, a finite mode, and an infinite variance. Sample means converge to the mean by Khintchine's WLLN. What fails is the rate of convergence and the Gaussian fluctuation shape, not the limit itself.

The clean threshold is: $\alpha > k$ if and only if $\mathbb{E}[X^k] < \infty$. The LLN-relevant threshold is $\alpha > 1$. The CLT-relevant threshold is $\alpha > 2$.

The Generalized Central Limit Theorem

When variance is infinite but the tails are regularly varying, the standard CLT is replaced by a wider result. Define a positive random variable $X$ to have a regularly varying tail with index $\alpha$ if $\Pr[X > t x] / \Pr[X > t] \to x^{-\alpha}$ as $t \to \infty$ for every $x > 0$. Pareto, Student-$t$ with $\nu$ degrees of freedom (index $\alpha = \nu$), and the half-Cauchy ($\alpha = 1$) all satisfy this.

Theorem

Generalized CLT (Levy-Gnedenko)

Statement

If $X_1, X_2, \ldots$ are i.i.d. with regularly varying tails of index $\alpha \in (0, 2)$ satisfying the tail-balance condition, then there exist constants $a_n > 0$ and $b_n \in \mathbb{R}$ such that $\frac{1}{a_n}\left(\sum_{i=1}^n X_i - b_n\right) \xrightarrow{d} S_\alpha$ where $S_\alpha$ is an $\alpha$-stable distribution. The normalizing sequence satisfies $a_n = n^{1/\alpha} L(n)$ for a slowly varying $L$. When $\alpha > 1$, $b_n$ may be taken as $n\, \mathbb{E}[X_1]$; when $\alpha \leq 1$, the centering is more delicate and the unnormalized sum needs a different anchor.

Intuition

The classical CLT picks the Gaussian because $\sqrt{n}$ normalization plus finite variance forces the limit to satisfy a stability relation that only Gaussians satisfy. When the variance is infinite, the right normalization changes to $n^{1/\alpha}$, and the stability relation picks out the $\alpha$-stable laws instead. The Gaussian is the boundary case $\alpha = 2$.

Why It Matters

This is the failure-mode replacement, not a workaround. Confidence intervals constructed from the stable-law limit have the right coverage asymptotically; intervals constructed from the Gaussian approximation do not. The error distribution has power-law tails of index $\alpha$ rather than the Gaussian's $e^{-x^2/2}$ tail, and the rate of convergence is slower. Practical use requires either non-parametric methods (block maxima, peaks-over-threshold; see extreme-value theory) or estimating $\alpha$ first via Hill or similar tail-index estimators.

Failure Mode

The result needs the tail-balance condition: both tails (when applicable) must be regularly varying with the same index, with limit ratios $p, q \geq 0$ summing to 1. Asymmetric heavy tails get a one-sided stable limit. Distributions with slowly varying corrections (e.g. tails of the form $x^{-\alpha} \log x$) still belong to the $\alpha$-stable domain of attraction, but the norming sequence $a_n$ picks up an extra logarithmic factor beyond $n^{1/\alpha}$.

Quantitative Bound: domain-of-attraction characterization

Say $X$ is in the domain of attraction of an $\alpha$-stable law, written $X \in D(\alpha)$, if there exist $a_n, b_n$ such that $\frac{1}{a_n}(\sum X_i - b_n)$ converges in distribution to a non-degenerate $\alpha$-stable law. The Levy-Gnedenko characterization is:

  • $\alpha = 2$ (Gaussian domain): $X \in D(2)$ iff $\mathbb{E}[X^2 \mathbf{1}\{\lvert X\rvert \leq t\}]$ is slowly varying as $t \to \infty$. This includes all distributions with finite variance, plus a few without (e.g. $\Pr[X > x] \sim 1/(x^2 \log x)$).
  • $0 < \alpha < 2$: $X \in D(\alpha)$ iff $\Pr[\lvert X\rvert > t]$ is regularly varying with index $-\alpha$, and the tails are balanced: $\Pr[X > t]/\Pr[\lvert X\rvert > t] \to p$ for some $p \in [0, 1]$.

The Gaussian domain is enormous; the $\alpha$-stable domain for each $\alpha < 2$ is a thin slice of distributions sharing one tail index. This is why the standard CLT is "stable" in the usability sense: most realistic distributions live in $D(2)$. The exceptions are exactly the distributions that look heavy-tailed on a log-log plot.
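
The $n^{1/\alpha}$ norming can be checked by simulation (a sketch for Pareto $\alpha = 1.5$, where $a_n = n^{2/3}$; the repetition count and seed are arbitrary): under the stable norming the spread of the centered sums should hold roughly steady as $n$ grows, while under the CLT's $\sqrt{n}$ norming it keeps growing like $n^{1/\alpha - 1/2}$.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, mu, reps = 1.5, 3.0, 1_000  # Pareto(1.5, 1) has mean mu = 3

def iqr(v):
    q25, q75 = np.percentile(v, [25, 75])
    return q75 - q25

for n in (100, 1_000, 10_000):
    # reps independent sums of n Pareto(1.5, 1) draws, centered at n * mu
    sums = (rng.random((reps, n)) ** (-1.0 / alpha)).sum(axis=1) - n * mu
    print(f"n = {n:>6}: IQR / n^(2/3) = {iqr(sums / n ** (1 / alpha)):7.2f}   "
          f"IQR / sqrt(n) = {iqr(sums / np.sqrt(n)):7.2f}")
```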

The Consistency Illusion

A running average $\bar{X}_n$ from a heavy-tailed process can look convergent for thousands of samples and then jump. The visual pattern is:

  1. Many small samples accumulate. The running average drifts toward a plausible-looking value.
  2. A single deep-tail sample arrives. The largest observation scales like $n^{1/\alpha}$, so its contribution to the average scales like $n^{1/\alpha - 1}$: growing with $n$ when $\alpha < 1$, and shrinking only at the slow rate $n^{-(1 - 1/\alpha)}$ when $\alpha \in (1, 2)$.
  3. The cycle repeats. The average never settles.

For $\alpha \in (1, 2)$, the LLN guarantees the running average converges to $\mathbb{E}[X]$, but on time scales that depend on $\alpha$. The $1/\sqrt{n}$ rate from the CLT is wrong. The right rate is $1/n^{1-1/\alpha}$, and the exponent $1 - 1/\alpha$ approaches 0 as $\alpha \to 1$. At $\alpha = 1.1$, the rate is roughly $1/n^{0.09}$. A million samples reduce the error by a factor of roughly $3.5$, not by a factor of $1000$.
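
The arithmetic behind that claim (pure arithmetic, no simulation; the sample size is just illustrative):

```python
# Error shrinks like n^{-(1 - 1/alpha)}; the exponent collapses as alpha -> 1
N = 1_000_000
for alpha in (1.1, 1.5, 2.0):
    exponent = 1 - 1 / alpha
    print(f"alpha = {alpha}: exponent {exponent:.3f}, "
          f"{N:,} samples shrink the error by ~{N ** exponent:.1f}x")
```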

Example

Pareto sample-mean simulation

Take $X_i$ i.i.d. Pareto with $\alpha = 1.5$, $x_m = 1$, so $\mathbb{E}[X] = 3$. Simulate $\bar{X}_n$ for $n$ up to $10^6$. Typical runs show $\bar{X}_n$ oscillating in the range $[2.5, 4.5]$ even at $n = 10^6$, occasionally jumping to $15$ or $20$ after a single deep-tail sample. The standard 95% confidence interval $\bar{X}_n \pm 1.96\, s_n/\sqrt{n}$ is meaningless: $s_n$ itself does not stabilize, because $\mathrm{Var}(X) = \infty$. A correctly designed stable-law confidence interval is much wider and has the right coverage.
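
The non-stabilization of $s_n$ is visible directly (a sketch; one fresh run per sample size, and the numbers swing a lot between seeds, which is exactly the point):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 1.5  # Pareto(1.5, 1): infinite variance

# The sample standard deviation keeps growing (roughly like n^(1/6) here)
# instead of settling at a finite sigma -- there is no sigma to settle at.
for n in (10**3, 10**4, 10**5, 10**6):
    x = rng.random(n) ** (-1.0 / alpha)
    print(f"n = {n:>7}: sample std {x.std():10.2f}")
```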

Common Confusions

Watch Out

Variance is not the right threshold for LLN

Finite variance is sufficient for the LLN (via Chebyshev) but not necessary. The right condition for the i.i.d. LLN is a finite mean. Distributions like Pareto with $\alpha = 1.5$ have infinite variance and still satisfy the LLN. Conversely, distributions with infinite mean (Cauchy, Pareto $\alpha \leq 1$) fail the LLN, even though "infinite variance" sounds less drastic than "infinite mean".

Watch Out

CLT failure is not just slower convergence

The CLT does not gradually weaken as the tail gets heavier. It snaps off at $\alpha = 2$. For $\alpha < 2$, the limit law is not Gaussian; it is a different law, with different tail behavior, different quantiles, and different confidence-interval calibration. A confidence interval built on the Gaussian approximation does not just under-cover by a small factor. It under-covers by an asymptotically growing factor.

Watch Out

Cauchy is not just heavy-tailed Gaussian

Cauchy and Gaussian have the same symmetric, unimodal density shape, so intuition borrowed from "normal-ish" distributions is dangerous. The Cauchy distribution has no mean, no variance, and no finite absolute moment of any order $p \geq 1$ (only fractional moments of order $p < 1$ exist). Operations that rely on those moments (sample mean, sample variance, t-statistic) produce results that have no probabilistic interpretation in the limit. Use the median or the trimmed mean.

Connections to ML and Risk

Heavy-tailed phenomena show up across ML in ways that quietly break the standard analysis:

  • Gradient norms during training can be heavy-tailed, especially in language models and on noisy data. SGD-as-SDE arguments that assume Gaussian gradient noise break in this regime.
  • Token frequencies follow Zipf's law (power-law exponent $\alpha \approx 1$), which means sample-based estimates of language statistics suffer the Pareto $\alpha \in (1, 2)$ slow-convergence problem.
  • The generalization gap across seeds has been empirically observed to have heavy upper tails on some benchmarks, which makes "mean accuracy across 5 seeds" an unreliable summary.
  • Financial returns at daily resolution typically have $\alpha \in (2.5, 4)$ for major indices, sliding closer to 2 for individual equities and much lower for cryptocurrencies. Standard volatility models that assume finite kurtosis miss the tail risk by orders of magnitude.
  • Insurance and reinsurance losses routinely fit Pareto with $\alpha < 2$, which is why the actuarial literature has spent decades building tools that do not assume the CLT (peaks-over-threshold, copulas, excess-of-loss arrangements). See extreme-value theory.

Summary

  • The LLN needs $\mathbb{E}\lvert X\rvert < \infty$. The CLT needs $\mathbb{E}[X^2] < \infty$. These are different thresholds.
  • Cauchy and Pareto $\alpha \leq 1$ fail the LLN entirely. The sample mean does not converge.
  • Pareto $1 < \alpha \leq 2$ satisfies the LLN but fails the classical CLT. The sample mean converges to the population mean, but the fluctuations are $\alpha$-stable, not Gaussian, and the error decays like $n^{-(1-1/\alpha)}$, not $n^{-1/2}$.
  • The generalized CLT (Levy-Gnedenko) gives the right limit law: $\alpha$-stable. The domain of attraction is characterized by regular variation of the tail.
  • For heavy-tailed data, use the median or a trimmed mean as the location estimator, and use stable-law or peaks-over-threshold intervals for the uncertainty quantification.

Exercises

ExerciseCore

Problem

Simulate $n = 10^4$ i.i.d. standard Cauchy variables and plot the running sample mean as a function of $n$. Repeat the experiment 20 times on the same axes. Compare to the analogous plot for i.i.d. $\mathcal{N}(0, 1)$ variables.

ExerciseAdvanced

Problem

For i.i.d. Pareto $X_i$ with $\alpha = 1.5$, $x_m = 1$, derive the correct normalization $a_n$ such that $a_n^{-1}(\sum_i X_i - n\, \mathbb{E}[X])$ has a non-degenerate limit, and identify the limiting distribution as an $\alpha$-stable law. Why is the normalization $n^{2/3}$ and not $\sqrt{n}$?

References

Canonical:

  • Feller, An Introduction to Probability Theory and Its Applications, Vol II (2nd ed., 1971), Chapter XVII (stable laws and domain of attraction).
  • Durrett, Probability: Theory and Examples (5th ed., 2019), Sections 3.7 and 3.8 (stable laws, generalized CLT).
  • Resnick, Heavy-Tail Phenomena: Probabilistic and Statistical Modeling (2007), Chapters 1-3.

Current:

  • Embrechts, Klüppelberg, and Mikosch, Modelling Extremal Events for Insurance and Finance (1997), Chapter 1 (regular variation and the domain-of-attraction framework).
  • Nolan, Univariate Stable Distributions: Models for Heavy Tailed Data (2020), Chapters 1 and 3 (computational treatment of stable laws and Hill-type tail-index estimators).

Critique:

  • Taleb, Statistical Consequences of Fat Tails (2020), Chapters 3 and 5 on the consistency illusion and the misuse of sample moments under heavy tails.

Next Topics

Building on the failure modes:

  • Fat Tails — broader treatment of heavy-tailed phenomena and where they appear in ML and risk.
  • Extreme-Value Theory — Fisher-Tippett three-types theorem, peaks-over-threshold, and quantitative tail estimation.
  • Subexponential Random Variables — the borderline case between thin and heavy tails, with concentration inequalities that still apply.

Last reviewed: May 12, 2026
