
Probability

Fat Tails and Heavy-Tailed Distributions

When the tails dominate. Power laws, Pareto distributions, subexponential tails, why the law of large numbers converges slowly or fails, and why most of ML silently assumes thin tails.

Advanced · Tier 1 · Stable · ~50 min

Why This Matters

Most of ML theory assumes that random variables are well-behaved: bounded, subgaussian, or at least subexponential. Concentration inequalities, uniform convergence bounds, and PAC learning guarantees all depend on tail conditions that exclude heavy-tailed distributions. When those conditions fail, sample means become unreliable, confidence intervals become meaningless, and standard risk bounds collapse.

Fat tails are not a curiosity. They appear in financial returns, city populations, word frequencies, network degree distributions, wealth distributions, insurance claims, earthquake magnitudes, and increasingly in ML itself: gradient norms during training, loss spikes, token frequency distributions in language models, and the distribution of model performance across random seeds.

The central question is: when tails decay polynomially rather than exponentially, which tools from probability and statistics still work, and which ones silently give wrong answers?

[Figure: tail probabilities of a Gaussian and a Pareto ($\alpha = 2.5$) on log-log axes; since $\alpha > 2$, the Pareto mean and variance exist. The Gaussian tail decays exponentially while the Pareto tail decays as a power law: by $x = 10$ the Gaussian tail is around $10^{-22}$, many orders of magnitude below the Pareto tail.]

Mental Model

Compare two distributions with the same mean and variance:

Gaussian $\mathcal{N}(0, 1)$: the probability of exceeding $t$ drops as $e^{-t^2/2}$. An observation beyond $6\sigma$ has probability about $10^{-9}$. You will never see one in a finite dataset.

Pareto with $\alpha = 3$ (finite variance): the probability of exceeding $t$ drops as $t^{-3}$. An observation 100 times the typical value has probability $10^{-6}$. In a dataset of a million points, you expect one. That single observation can dominate the sample mean.

The Gaussian world is one where the sample mean is a reliable summary. The Pareto world is one where a single extreme observation can change the conclusion. The distinction between these two regimes is the subject of this page.

Core Definitions

Definition

Heavy-Tailed Distribution

A distribution $F$ is heavy-tailed if its moment generating function $M(t) = \mathbb{E}[e^{tX}]$ is infinite for all $t > 0$. Equivalently, the tail $\bar{F}(x) = P(X > x)$ decays slower than any exponential:

$$\lim_{x \to \infty} e^{\lambda x} \bar{F}(x) = \infty \quad \text{for all } \lambda > 0$$

This means the tails are "heavier" than any exponential distribution. The Pareto, log-normal, Cauchy, and stable distributions are all heavy-tailed. The Gaussian, exponential, and Poisson distributions are not.

Definition

Fat-Tailed (Regularly Varying) Distribution

A distribution $F$ is fat-tailed (or has a power-law tail) if its survival function satisfies:

$$\bar{F}(x) = P(X > x) = L(x) \cdot x^{-\alpha}$$

where $\alpha > 0$ is the tail index and $L(x)$ is a slowly varying function (meaning $L(tx)/L(t) \to 1$ as $t \to \infty$ for all $x > 0$).

The tail index $\alpha$ determines which moments exist:

  • $\mathbb{E}[|X|^p] < \infty$ if and only if $p < \alpha$
  • If $\alpha \leq 1$: the mean does not exist
  • If $\alpha \leq 2$: the variance does not exist
  • If $\alpha \leq 4$: the kurtosis does not exist

Every fat-tailed distribution is heavy-tailed, but the converse is false. The log-normal distribution is heavy-tailed but not fat-tailed (its tail decays as $\exp(-c (\log x)^2)$, not as a power law).

Definition

Pareto Distribution

The Pareto distribution with shape $\alpha > 0$ and scale $x_m > 0$ has survival function:

$$P(X > x) = \left(\frac{x_m}{x}\right)^\alpha, \quad x \geq x_m$$

and density $f(x) = \alpha x_m^\alpha x^{-(\alpha+1)}$ for $x \geq x_m$.

Moments: $\mathbb{E}[X^k] = \frac{\alpha x_m^k}{\alpha - k}$ when $k < \alpha$, and infinite otherwise.

For $\alpha > 1$: $\mathbb{E}[X] = \frac{\alpha x_m}{\alpha - 1}$. For $\alpha > 2$: $\mathrm{Var}(X) = \frac{\alpha x_m^2}{(\alpha - 1)^2(\alpha - 2)}$.
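These closed forms are easy to sanity-check by Monte Carlo. A minimal sketch (numpy assumed; the sampler uses the standard inverse-CDF trick — if $U \sim \text{Uniform}(0,1)$, then $x_m U^{-1/\alpha}$ has exactly the survival function above):

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
alpha, xm = 2.5, 1.0

# Closed-form moments (valid because alpha > 2)
mean_theory = alpha * xm / (alpha - 1)                       # = 5/3
var_theory = alpha * xm**2 / ((alpha - 1)**2 * (alpha - 2))  # = 20/9

# Inverse-CDF sampling: P(xm * U**(-1/alpha) > x) = (xm/x)**alpha
x = xm * rng.uniform(size=1_000_000) ** (-1 / alpha)

print(f"mean: theory {mean_theory:.4f}, sample {x.mean():.4f}")
# The sample variance converges slowly here: the fourth moment is
# infinite for alpha = 2.5, so its own fluctuations are heavy-tailed.
print(f"var:  theory {var_theory:.4f}, sample {x.var():.4f}")
```

Even with a million samples, the sample variance wanders around its theoretical value far more than the sample mean does — a preview of the estimation problems discussed below.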

Definition

Subexponential Distribution

A distribution $F$ on $[0, \infty)$ is subexponential if for i.i.d. $X_1, X_2$ with distribution $F$:

$$\lim_{x \to \infty} \frac{P(X_1 + X_2 > x)}{P(X_1 > x)} = 2$$

This says: the sum exceeds $x$ primarily because one of the summands exceeds $x$, not because both are moderately large. For $n$ i.i.d. copies:

$$P(X_1 + \cdots + X_n > x) \sim n \cdot P(X_1 > x)$$

The maximum dominates the sum. This is the "one big jump" principle, and it is the defining property of distributions where rare events drive aggregate behavior.
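The one-big-jump principle is directly visible in simulation. A sketch (numpy assumed, seed arbitrary): condition on the sum of two i.i.d. draws being large, and ask what fraction of the sum the larger draw contributes. For the thin-tailed exponential control, the split of a large sum is uniform, so the expected max share is exactly 3/4; for a Pareto it approaches 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

pareto = rng.uniform(size=(2, n)) ** (-1 / 1.5)  # Pareto(alpha=1.5), subexponential
expo = rng.exponential(size=(2, n))              # thin-tailed control

def max_share_given_large_sum(x: np.ndarray) -> float:
    """E[max/sum] conditioned on the sum falling in its top 0.1%."""
    s = x.sum(axis=0)
    big = s > np.quantile(s, 0.999)
    return float((x.max(axis=0)[big] / s[big]).mean())

p_share = max_share_given_large_sum(pareto)
e_share = max_share_given_large_sum(expo)
print(f"Pareto:      E[max/sum | sum large] ≈ {p_share:.2f}")
print(f"Exponential: E[max/sum | sum large] ≈ {e_share:.2f}")
```

For the Pareto pair, essentially the whole of a large sum comes from one draw; for the exponential pair it does not.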

Definition

Stable Distribution

A distribution is stable if a linear combination of two independent copies has the same distribution (up to location and scale). Formally, $X$ is stable if for any $a, b > 0$, there exist $c > 0$ and $d \in \mathbb{R}$ such that:

$$aX_1 + bX_2 \stackrel{d}{=} cX + d$$

where $X_1, X_2$ are independent copies of $X$.

Stable distributions are parameterized by four values: stability index $\alpha \in (0, 2]$, skewness $\beta \in [-1, 1]$, scale $\gamma > 0$, and location $\delta \in \mathbb{R}$. Only three special cases have closed-form densities: the Gaussian ($\alpha = 2$), the Cauchy ($\alpha = 1$, $\beta = 0$), and the Lévy ($\alpha = 1/2$, $\beta = 1$).

Main Theorems

Theorem

Generalized Central Limit Theorem (Stable Laws)

Statement

Let $X_1, X_2, \ldots$ be i.i.d. random variables. If there exist sequences $a_n > 0$ and $b_n \in \mathbb{R}$ such that:

$$\frac{X_1 + X_2 + \cdots + X_n - b_n}{a_n} \xrightarrow{d} S$$

where $S$ is a non-degenerate random variable, then $S$ must be a stable distribution with some index $\alpha \in (0, 2]$.

When $\alpha = 2$, $S$ is Gaussian and $a_n = \sqrt{n}$: this is the classical CLT.

When $\alpha < 2$, $S$ is a non-Gaussian stable distribution and $a_n = n^{1/\alpha} L(n)$ for a slowly varying function $L$. The limit is heavy-tailed with infinite variance.

A distribution with tail $P(|X| > x) \sim x^{-\alpha} L(x)$ for $\alpha \in (0, 2)$ is in the domain of attraction of the stable law with index $\alpha$.

Intuition

The classical CLT says: if you average i.i.d. variables with finite variance, the average is approximately Gaussian. The generalized CLT asks: what if the variance is infinite? The answer is that properly normalized sums still converge, but to a non-Gaussian stable distribution. The only possible limits for normalized sums of i.i.d. variables are stable distributions. This is a universality result: just as the Gaussian is the universal limit for finite-variance sums, stable laws are the universal limits for all i.i.d. sums.

Proof Sketch

The proof uses characteristic functions. If $\phi(t)$ is the characteristic function of $X_i$, then the characteristic function of the normalized sum is $[\phi(t/a_n)]^n \cdot e^{-itb_n/a_n}$. For this to converge to a non-degenerate limit, the limit must satisfy the stability property $\hat{S}(t)^n = \hat{S}(c_n t) e^{id_n t}$ for appropriate sequences. The Lévy-Khintchine representation theorem characterizes all infinitely divisible distributions, and the stability constraint restricts to the stable family. Gnedenko and Kolmogorov proved that a distribution is in the domain of attraction of a stable law with index $\alpha$ if and only if $P(|X| > x) = x^{-\alpha} L(x)$ for a slowly varying $L$ (when $\alpha < 2$).

Why It Matters

This theorem tells you what happens when you average data from fat-tailed distributions. The sample mean does not become Gaussian. It becomes stable, which means: the average is itself heavy-tailed, extreme values in the sample can dominate the average, and confidence intervals based on Gaussian quantiles are wrong. For $\alpha < 1$, the mean does not exist, and the sample mean does not converge at all.

For ML: if gradient noise has a stable distribution with $\alpha < 2$ (there is empirical evidence for this in some architectures), then SGD dynamics are qualitatively different from the Gaussian noise assumption used in most convergence analyses.

Failure Mode

The theorem requires i.i.d. observations. For dependent data (time series, Markov chains), different limit theorems apply, and the limits can be non-stable. The theorem also assumes the distribution is in the domain of attraction of some stable law, which excludes some pathological cases (distributions that oscillate between different tail behaviors).

Theorem

Slow LLN Convergence Under Fat Tails

Statement

Let $X_1, X_2, \ldots$ be i.i.d. with $P(X > x) \sim x^{-\alpha}$ for $\alpha \in (1, 2)$. The mean $\mu = \mathbb{E}[X]$ exists but the variance is infinite. The law of large numbers still holds:

$$\bar{X}_n \xrightarrow{\text{a.s.}} \mu$$

but the rate of convergence is $n^{1/\alpha - 1}$ rather than $n^{-1/2}$. The fluctuations of $\bar{X}_n$ around $\mu$ are of order $n^{1/\alpha - 1}$, and the normalized sum $(S_n - n\mu)/n^{1/\alpha}$ converges to a stable law, not a Gaussian.

Chebyshev-style confidence intervals do not apply because the variance is infinite.

Intuition

With finite variance, the sample mean fluctuates at rate $1/\sqrt{n}$. With infinite variance but finite mean ($1 < \alpha < 2$), the sample mean still converges, but much more slowly. The reason: occasional extreme observations create large jumps in the running average. These jumps become rarer as $n$ grows, but they are large enough that the average converges at rate $n^{1/\alpha - 1}$ rather than $n^{-1/2}$. Since $1/\alpha > 1/2$ when $\alpha < 2$, we have $1/\alpha - 1 > -1/2$, so convergence is slower.

Proof Sketch

The SLLN holds because $\mathbb{E}[|X|] < \infty$ (since $\alpha > 1$), which is both necessary and sufficient for the strong law under i.i.d. conditions. The convergence rate follows from the generalized CLT: the centering sequence is $b_n = n\mu$ and the scaling is $a_n = n^{1/\alpha}L(n)$. The fluctuations $\bar{X}_n - \mu$ are of order $a_n / n = n^{1/\alpha - 1}L(n)$.

Why It Matters

Standard sample-size calculations assume $1/\sqrt{n}$ convergence. If the data is fat-tailed with $\alpha = 1.5$, convergence is at rate $n^{-1/3}$ instead of $n^{-1/2}$. To halve the error, you need not 4 times as many samples but $2^3 = 8$ times as many. For $\alpha$ close to 1, convergence is arbitrarily slow. This matters whenever you estimate a population mean from fat-tailed data: financial returns, insurance claims, or the average loss across a dataset with occasional extreme outliers.
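The "8x the samples to halve the error" prediction can be checked empirically. A sketch (numpy assumed, seed arbitrary) using the median absolute error as a robust scale, since the error distribution itself is heavy-tailed:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 1.5
mu = alpha / (alpha - 1)  # true Pareto mean with xm = 1, here mu = 3

def error_scale(n: int, reps: int = 1_000) -> float:
    """Median |sample mean - mu| over many repetitions of size n."""
    means = (rng.uniform(size=(reps, n)) ** (-1 / alpha)).mean(axis=1)
    return float(np.median(np.abs(means - mu)))

# Rate n**(1/alpha - 1) = n**(-1/3): 8x the samples should roughly halve
# the error, not shrink it by sqrt(8) ≈ 2.83 as under finite variance.
e1, e8 = error_scale(1_000), error_scale(8_000)
print(f"n=1000: {e1:.4f}   n=8000: {e8:.4f}   ratio: {e1 / e8:.2f}  (predicted ≈ 2)")
```

The observed ratio sits near 2 rather than near 2.83, matching the $n^{-1/3}$ rate.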

Failure Mode

When $\alpha \leq 1$, the mean does not exist and the LLN fails entirely. The sample mean of i.i.d. Cauchy random variables ($\alpha = 1$) has the same Cauchy distribution for every $n$. Averaging does not help at all.
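This failure is easy to see numerically. A sketch (numpy assumed, seed arbitrary): the interquartile range of the sample mean of $n$ standard Cauchy draws stays at the population IQR of 2 no matter how large $n$ gets.

```python
import numpy as np

rng = np.random.default_rng(3)

iqrs = {}
for n in (1, 100, 10_000):
    # 1,000 independent sample means, each averaging n Cauchy draws
    means = rng.standard_cauchy(size=(1_000, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    iqrs[n] = q75 - q25
    print(f"n={n:>6}: IQR of the sample mean ≈ {iqrs[n]:.2f}")  # stays ≈ 2
```

Contrast this with finite-variance data, where the same IQR would shrink by a factor of 100 between $n = 1$ and $n = 10{,}000$.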

Heavy-Tailed vs. Fat-Tailed vs. Subexponential

These terms are related but distinct. The hierarchy is:

Fat-tailed (regularly varying) $\subset$ Subexponential $\subset$ Heavy-tailed

| Class | Tail decay | Example |
|---|---|---|
| Fat-tailed | $P(X > x) \sim x^{-\alpha} L(x)$ | Pareto, Student-$t$, stable ($\alpha < 2$) |
| Subexponential | Sum dominated by maximum | Log-normal, Weibull (shape $\beta < 1$) |
| Heavy-tailed | Slower than any exponential | All of the above |
| Thin-tailed | Exponential or faster decay | Gaussian, exponential, Poisson |

The key operational distinction: for subexponential distributions, extreme events are driven by a single large observation (the "one big jump" principle). For thin-tailed distributions, extreme events require many moderately large observations occurring together.

Canonical Examples

Example

Pareto vs. Gaussian: the single-observation problem

Draw $n = 1000$ samples from a Pareto distribution with $\alpha = 1.5$ and $x_m = 1$. The mean is $\mu = \alpha x_m / (\alpha - 1) = 3$. Compute the sample mean. Repeat 10,000 times.

You will find: the sample mean fluctuates wildly across repetitions. In some repetitions, a single observation of size $10^4$ or larger appears and dominates the entire sum. The distribution of the sample mean is itself heavy-tailed (it converges to a stable law with index $\alpha = 1.5$).

Now repeat with $n = 1000$ Gaussian samples with the same mean. (No variance-matched Gaussian exists: the Pareto with $\alpha = 1.5$ has infinite variance.) The sample mean is tightly concentrated around $\mu$, with fluctuations of order $\sigma / \sqrt{n}$. The distribution of the sample mean is Gaussian.

The difference: with Pareto data, the sample mean is not a reliable summary even for $n = 1000$.
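The experiment above can be sketched in a few lines of numpy (seed and the Gaussian scale of 3 are arbitrary choices; as noted, no variance-matched Gaussian exists):

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, alpha = 1_000, 10_000, 1.5
mu = alpha / (alpha - 1)  # = 3

x = rng.uniform(size=(reps, n)) ** (-1 / alpha)   # Pareto(1.5, xm=1) samples
pareto_means = x.mean(axis=1)
max_share = x.max(axis=1) / x.sum(axis=1)         # largest point's share of its sum
gauss_means = rng.normal(mu, 3.0, size=(reps, n)).mean(axis=1)

def spread(m: np.ndarray) -> float:
    """Central 98% range of the sample mean across repetitions."""
    return float(np.percentile(m, 99) - np.percentile(m, 1))

print(f"spread of sample mean: Pareto {spread(pareto_means):.2f} "
      f"vs Gaussian {spread(gauss_means):.2f}")
print(f"largest point's share of the sum: median {np.median(max_share):.1%}, "
      f"worst 1% of runs ≥ {np.percentile(max_share, 99):.1%}")
```

The Pareto sample mean is several times more dispersed than the Gaussian one, and in the worst repetitions a single observation carries a large fraction of the entire sum.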

Example

The 80/20 rule and power laws

A Pareto distribution with $\alpha$ slightly above 1 generates the "80/20" phenomenon. If wealth follows a Pareto distribution with $\alpha = 1.16$, then about 80% of total wealth is held by the top 20% of individuals. More precisely, the fraction of total wealth held by the top $p$ fraction of the population is $p^{1 - 1/\alpha}$.

This concentration gets more extreme as $\alpha$ decreases toward 1. For $\alpha = 1.05$, the top 20% holds about 93% of wealth.
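A quick sketch of the top-share formula (obtained by integrating $x f(x)$ above the $(1-p)$-quantile of an exact Pareto):

```python
def top_share(p: float, alpha: float) -> float:
    """Fraction of total wealth held by the richest fraction p of a
    Pareto(alpha) population: p ** (1 - 1/alpha)."""
    return p ** (1 - 1 / alpha)

print(f"alpha = 1.16: top 20% holds {top_share(0.20, 1.16):.1%}")  # ≈ 80%
print(f"alpha = 1.05: top 20% holds {top_share(0.20, 1.05):.1%}")  # ≈ 93%
print(f"alpha = 3.00: top 20% holds {top_share(0.20, 3.00):.1%}")  # ≈ 34%, far milder
```

Note how sensitive the concentration is to $\alpha$ near 1, and how quickly it fades once $\alpha$ is comfortably above 2.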

Implications for ML

Gradient Noise

Several empirical studies have found that stochastic gradient noise in deep learning is better modeled by stable distributions with $\alpha < 2$ than by Gaussian noise. If this is correct, then:

  • SGD convergence theory that assumes Gaussian noise gives the wrong rate
  • The learning rate schedule that is optimal under Gaussian noise may not be optimal under stable noise
  • Heavy-tailed gradient noise may actually help optimization by allowing escape from sharp local minima (a form of convex tinkering at the algorithmic level)

Loss Distributions

The distribution of per-sample losses in a training set is often heavy-tailed: most samples have small loss, but a few have very large loss. If you use the mean loss as your objective, rare high-loss samples can dominate gradient updates. Robust losses (Huber, trimmed mean) address this by down-weighting the tails.

Token Frequencies in Language Models

Word and token frequencies follow Zipf's law, which is a power law with $\alpha \approx 1$. This means: the most common token appears roughly $k$ times more often than the $k$-th most common token. Training a language model on this distribution means that rare tokens get very few gradient updates, and the model's performance on rare tokens converges much more slowly than on common tokens.

Why Subgaussian Assumptions Matter

Standard ML theory bounds (Hoeffding, McDiarmid, Rademacher complexity) assume bounded or subgaussian random variables. A zero-mean random variable $X$ is subgaussian with parameter $\sigma$ if $\mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2 / 2}$ for all $t$. This is a strong tail condition: it guarantees $P(|X| > t) \leq 2e^{-t^2/(2\sigma^2)}$.

Fat-tailed distributions violate this condition. When the data is fat-tailed:

  • Hoeffding's inequality does not apply
  • Standard uniform convergence bounds fail
  • The sample complexity for learning may be much worse than predicted by VC theory
  • Median-of-means estimators and truncation techniques are needed as replacements
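A minimal sketch of the median-of-means replacement (numpy assumed, seed arbitrary), run on symmetric heavy-tailed data — Student-$t$ with 1.5 degrees of freedom, which has tail index 1.5, mean 0, and infinite variance:

```python
import numpy as np

rng = np.random.default_rng(5)

def median_of_means(x: np.ndarray, k: int) -> float:
    """Split the sample into k groups, average each group, and return
    the median of the k group means."""
    return float(np.median([g.mean() for g in np.array_split(x, k)]))

errs_mean, errs_mom = [], []
for _ in range(2_000):
    x = rng.standard_t(df=1.5, size=2_000)  # true mean is 0
    errs_mean.append(abs(x.mean()))
    errs_mom.append(abs(median_of_means(x, k=20)))

# The plain mean suffers occasional huge errors from single extreme draws;
# median-of-means caps any one draw's influence at a single corrupted group.
print(f"95th-percentile |error|: mean {np.percentile(errs_mean, 95):.3f}, "
      f"median-of-means {np.percentile(errs_mom, 95):.3f}")
```

The design choice is the trade-off in $k$: more groups give more robustness (up to $\lfloor k/2 \rfloor$ groups can be ruined by outliers) at the cost of noisier group means.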

Common Confusions

Watch Out

Heavy-tailed does not mean infinite mean

A distribution can be heavy-tailed (MGF infinite for all $t > 0$) while still having a finite mean. A Pareto with $\alpha = 3$ has finite mean and finite variance, but it is still heavy-tailed. The term "heavy-tailed" refers to tail decay rate, not to the existence of moments.

Watch Out

Power laws are not confirmed by log-log plots

Plotting the empirical survival function on a log-log scale and observing an approximately straight line does not confirm a power law. Many distributions (log-normal, stretched exponential) look approximately linear on a log-log plot over a limited range. Rigorous testing requires methods like the Clauset-Shalizi-Newman procedure, which uses maximum likelihood estimation for the tail index and a goodness-of-fit test based on the Kolmogorov-Smirnov statistic.
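The MLE step of that procedure is simple to sketch (numpy assumed, seed arbitrary; for a fixed threshold $x_{\min}$ and the survival-function convention used on this page, the continuous MLE is the Hill estimator $\hat{\alpha} = n_{\text{tail}} / \sum_i \ln(x_i / x_{\min})$):

```python
import numpy as np

def hill_alpha(x: np.ndarray, x_min: float) -> float:
    """MLE of the power-law tail index for observations above x_min."""
    tail = x[x >= x_min]
    return float(tail.size / np.log(tail / x_min).sum())

rng = np.random.default_rng(6)
pareto = rng.uniform(size=100_000) ** (-1 / 2.5)  # exact Pareto, alpha = 2.5, xm = 1
print(f"Pareto data:     alpha_hat = {hill_alpha(pareto, 1.0):.2f}")  # ≈ 2.5

# A log-normal has no power-law tail, yet the estimator still returns a
# finite alpha_hat: a fitted exponent alone proves nothing without the
# accompanying goodness-of-fit test.
lognorm = rng.lognormal(mean=0.0, sigma=2.0, size=100_000)
print(f"log-normal data: alpha_hat = "
      f"{hill_alpha(lognorm, float(np.quantile(lognorm, 0.99))):.2f}")
```

This is exactly the log-log-plot trap in estimator form: the fit always succeeds, so only the goodness-of-fit test can reject the power-law hypothesis.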

Watch Out

The sample variance does not estimate the population variance under fat tails

If $X$ has a Pareto distribution with $\alpha = 2.5$, the population variance is finite. But the sample variance converges to the population variance extremely slowly, and the distribution of the sample variance is itself heavy-tailed. Standard confidence intervals for the variance (which assume $\chi^2$ limiting distributions) are unreliable. The sample kurtosis is meaningless when $\alpha \leq 4$.

Watch Out

Taleb's point is about sample statistics, not about the distribution itself

Taleb's critique is not that fat-tailed distributions are mysterious. It is that standard statistical tools (sample mean, sample variance, confidence intervals, p-values) give misleading answers when applied to fat-tailed data, because these tools implicitly assume thin tails. The distribution itself is perfectly well-defined mathematically. The problem is epistemological: you cannot reliably estimate its properties from a finite sample.

Misreadings of Fat Tails

Watch Out

A Black Swan is any rare event

Wrong. A Black Swan in Taleb's sense has three properties: rarity, extreme impact, and retrospective explainability (people invent stories after the fact as if the event was predictable). A coin flip landing heads 20 times in a row is rare but not a Black Swan because its impact is negligible. A financial crisis that wipes out savings and is afterward "explained" by obvious-in-hindsight factors is a Black Swan.

Watch Out

Fat tails mean we should give up on statistics

Wrong. Fat tails mean we should use the right statistics. Median-of-means estimators, trimmed means, robust regression, distribution-free concentration inequalities, and non-parametric methods all work under fat tails. The mistake is applying thin-tailed tools (sample mean + Gaussian confidence intervals) to fat-tailed data, not doing statistics at all.

Exercises

ExerciseCore

Problem

A Pareto distribution has shape parameter $\alpha$ and scale $x_m = 1$. For which values of $\alpha$ does the mean exist? The variance? The fourth moment? Compute the mean and variance when they exist.

ExerciseCore

Problem

Explain why the sample mean of $n$ i.i.d. Cauchy random variables has the same Cauchy distribution regardless of $n$. What does this imply about the law of large numbers?

ExerciseAdvanced

Problem

Let $X$ be a non-negative random variable with $P(X > x) = x^{-\alpha}$ for $x \geq 1$ and $\alpha \in (1, 2)$. Show that $\mathrm{Var}(X) = \infty$ but $\mathbb{E}[X] < \infty$, and compute the mean.

ExerciseAdvanced

Problem

The subexponential property states that for i.i.d. non-negative $X_1, X_2$ with distribution $F$: $P(X_1 + X_2 > x) / P(X_1 > x) \to 2$ as $x \to \infty$. Prove that this implies $P(\max(X_1, X_2) > x) / P(X_1 + X_2 > x) \to 1$. Interpret this result.

ExerciseResearch

Problem

The median-of-means estimator divides $n$ samples into $k$ groups, computes the sample mean of each group, and returns the median of these $k$ means. Explain why this estimator is more robust to fat tails than the ordinary sample mean, and under what conditions on $\alpha$ it achieves sub-Gaussian concentration.

References

Canonical:

  • Feller, An Introduction to Probability Theory and Its Applications, Vol. 2 (1971), Chapters VIII-IX
  • Nolan, Stable Distributions: Models for Heavy Tailed Data (2020), Chapters 1-3
  • Taleb, Statistical Consequences of Fat Tails (2020), Chapters 1-5

Current:

  • Embrechts, Klüppelberg, & Mikosch, Modelling Extremal Events (1997), Chapters 1-3
  • Vershynin, High-Dimensional Probability (2018), Section 2.8 (subgaussian and subexponential)
  • Lugosi & Mendelson, "Mean Estimation and Regression Under Heavy-Tailed Distributions: A Survey" (2019), Foundations and Trends in ML, Sections 1-4


Last reviewed: April 2026
