Fat Tails and Heavy-Tailed Distributions
When the tails dominate. Power laws, Pareto distributions, subexponential tails, why the law of large numbers converges slowly or fails, and why most of ML silently assumes thin tails.
Why This Matters
Most of ML theory assumes that random variables are well-behaved: bounded, subgaussian, or at least subexponential. Concentration inequalities, uniform convergence bounds, and PAC learning guarantees all depend on tail conditions that exclude heavy-tailed distributions. When those conditions fail, sample means become unreliable, confidence intervals become meaningless, and standard risk bounds collapse.
Fat tails are not a curiosity. They appear in financial returns, city populations, word frequencies, network degree distributions, wealth distributions, insurance claims, earthquake magnitudes, and increasingly in ML itself: gradient norms during training, loss spikes, token frequency distributions in language models, and the distribution of model performance across random seeds.
The central question is: when tails decay polynomially rather than exponentially, which tools from probability and statistics still work, and which ones silently give wrong answers?
Mental Model
Compare two distributions with the same mean and variance:
Gaussian $\mathcal{N}(\mu, \sigma^2)$: the probability of exceeding $\mu + k\sigma$ drops as $e^{-k^2/2}$. An observation beyond $10\sigma$ has probability about $10^{-23}$. You will never see one in a finite dataset.
Pareto with $\alpha = 3$ (finite variance): the probability of exceeding $x$ drops as $x^{-3}$. An observation 100 times the typical value has probability $100^{-3} = 10^{-6}$. In a dataset of a million points, you expect one. That single observation can dominate the sample mean.
The Gaussian world is one where the sample mean is a reliable summary. The Pareto world is one where a single extreme observation can change the conclusion. The distinction between these two regimes is the subject of this page.
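A quick numerical check of the two tail probabilities above, as a sketch using scipy (the parameter values match the comparison, not any fitted data):

```python
# Compare tail probabilities of a Gaussian and a Pareto with alpha = 3.
from scipy.stats import norm, pareto

# Gaussian: probability of a value more than 10 sigma above the mean.
print(norm.sf(10))          # ~7.6e-24, never observed in practice

# Pareto with tail index alpha = 3 (scipy's shape parameter b), scale 1:
# probability of exceeding 100x the minimum value.
print(pareto.sf(100, b=3))  # 1e-6, about one per million samples
```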
Core Definitions
Heavy-Tailed Distribution
A distribution is heavy-tailed if its moment generating function $\mathbb{E}[e^{tX}]$ is infinite for all $t > 0$. Equivalently, the tail decays more slowly than any exponential:

$$\limsup_{x \to \infty} e^{\lambda x} \Pr(X > x) = \infty \quad \text{for all } \lambda > 0.$$
This means the tails are "heavier" than any exponential distribution. The Pareto, log-normal, Cauchy, and stable distributions are all heavy-tailed. The Gaussian, exponential, and Poisson distributions are not.
Fat-Tailed (Regularly Varying) Distribution
A distribution is fat-tailed (or has a power-law tail) if its survival function satisfies:

$$\bar{F}(x) = \Pr(X > x) = L(x)\, x^{-\alpha},$$

where $\alpha > 0$ is the tail index and $L$ is a slowly varying function (meaning $L(tx)/L(x) \to 1$ as $x \to \infty$ for all $t > 0$).
The tail index $\alpha$ determines which moments exist:
- $\mathbb{E}[|X|^p] < \infty$ if and only if $p < \alpha$
- If $\alpha \le 1$: the mean does not exist
- If $\alpha \le 2$: the variance does not exist
- If $\alpha \le 4$: the kurtosis does not exist
Every fat-tailed distribution is heavy-tailed, but the converse is false. The log-normal distribution is heavy-tailed but not fat-tailed (its tail decays as $e^{-(\ln x)^2/(2\sigma^2)}$ up to slowly varying factors, not as a power law).
Pareto Distribution
The Pareto distribution with shape $\alpha > 0$ and scale $x_m > 0$ has survival function:

$$\bar{F}(x) = \left(\frac{x_m}{x}\right)^{\alpha}, \qquad x \ge x_m,$$

and density $f(x) = \alpha x_m^{\alpha} x^{-(\alpha+1)}$ for $x \ge x_m$.

Moments: $\mathbb{E}[X^p] = \dfrac{\alpha x_m^p}{\alpha - p}$ when $p < \alpha$, and infinite otherwise.

For $\alpha > 1$: $\mathbb{E}[X] = \dfrac{\alpha x_m}{\alpha - 1}$. For $\alpha > 2$: $\mathrm{Var}(X) = \dfrac{\alpha x_m^2}{(\alpha - 1)^2 (\alpha - 2)}$.
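The moment formulas are easy to check by simulation. A minimal sketch using numpy and inverse-transform sampling (seed and sample size are arbitrary):

```python
# Inverse-transform sampling for the Pareto: if U ~ Uniform(0,1), then
# x_m * U**(-1/alpha) has survival function (x_m / x)**alpha.
import numpy as np

rng = np.random.default_rng(0)
alpha, x_m, n = 3.0, 1.0, 10**6
x = x_m * rng.uniform(size=n) ** (-1.0 / alpha)

print(x.mean(), alpha * x_m / (alpha - 1))                       # both ~1.5
print(x.var(), alpha * x_m**2 / ((alpha - 1)**2 * (alpha - 2)))  # both ~0.75
```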
Subexponential Distribution
A distribution $F$ on $[0, \infty)$ is subexponential if for i.i.d. $X_1, X_2$ with distribution $F$:

$$\Pr(X_1 + X_2 > x) \sim 2 \Pr(X_1 > x) \quad \text{as } x \to \infty.$$

This says: the sum exceeds $x$ primarily because one of the summands exceeds $x$, not because both are moderately large. For $n$ i.i.d. copies:

$$\Pr(X_1 + \cdots + X_n > x) \sim \Pr\big(\max(X_1, \ldots, X_n) > x\big) \sim n \Pr(X_1 > x).$$
The maximum dominates the sum. This is the "one big jump" principle, and it is the defining property of distributions where rare events drive aggregate behavior.
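A Monte Carlo sketch of the one-big-jump principle, assuming a Pareto tail with $\alpha = 1.5$ (any subexponential distribution would do):

```python
# The ratio P(X1 + X2 > x) / (2 * P(X1 > x)) tends to 1 as x grows.
import numpy as np

rng = np.random.default_rng(1)
alpha, n = 1.5, 10**7
x1 = rng.uniform(size=n) ** (-1.0 / alpha)  # Pareto(alpha, 1) samples
x2 = rng.uniform(size=n) ** (-1.0 / alpha)

for x in [10.0, 100.0, 1000.0]:
    p_sum = np.mean(x1 + x2 > x)            # empirical P(X1 + X2 > x)
    p_tail = x ** -alpha                    # exact P(X1 > x)
    print(x, p_sum / (2 * p_tail))          # ratio approaches 1
```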
Stable Distribution
A distribution is stable if a linear combination of two independent copies has the same distribution (up to location and scale). Formally, $X$ is stable if for any $a, b > 0$, there exist $c > 0$ and $d \in \mathbb{R}$ such that:

$$aX_1 + bX_2 \stackrel{d}{=} cX + d,$$

where $X_1, X_2$ are independent copies of $X$.
Stable distributions are parameterized by four values: stability index $\alpha \in (0, 2]$, skewness $\beta \in [-1, 1]$, scale $\gamma > 0$, and location $\delta \in \mathbb{R}$. Only three special cases have closed-form densities: the Gaussian ($\alpha = 2$), the Cauchy ($\alpha = 1$, $\beta = 0$), and the Lévy ($\alpha = 1/2$, $\beta = 1$).
Main Theorems
Generalized Central Limit Theorem (Stable Laws)
Statement
Let $X_1, X_2, \ldots$ be i.i.d. random variables. If there exist sequences $a_n > 0$ and $b_n \in \mathbb{R}$ such that:

$$\frac{X_1 + \cdots + X_n - b_n}{a_n} \xrightarrow{d} Z,$$

where $Z$ is a non-degenerate random variable, then $Z$ must have a stable distribution with some index $\alpha \in (0, 2]$.

When $\alpha = 2$, $Z$ is Gaussian and $a_n \propto \sqrt{n}$: this is the classical CLT.

When $\alpha < 2$, $Z$ is a non-Gaussian stable distribution and $a_n = n^{1/\alpha} L(n)$ for a slowly varying function $L$. The limit is heavy-tailed with infinite variance.

A distribution with tail $\Pr(|X| > x) \sim c\, x^{-\alpha}$ for $0 < \alpha < 2$ is in the domain of attraction of the stable law with index $\alpha$.
Intuition
The classical CLT says: if you average i.i.d. variables with finite variance, the average is approximately Gaussian. The generalized CLT asks: what if the variance is infinite? The answer is that properly normalized sums still converge, but to a non-Gaussian stable distribution. The only possible limits for normalized sums of i.i.d. variables are stable distributions. This is a universality result: just as the Gaussian is the universal limit for finite-variance sums, stable laws are the universal limits for all i.i.d. sums.
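A simulation sketch of this universality, assuming Pareto summands with $\alpha = 1.5$; the interquartile range is used as the scale measure because the stable limit has infinite variance:

```python
# After centering by n*mu and scaling by n**(1/alpha), the fluctuations
# of Pareto(1.5) sums settle to a fixed (stable) scale as n grows.
import numpy as np

rng = np.random.default_rng(2)
alpha = 1.5
mu = alpha / (alpha - 1)                    # mean of Pareto(alpha, 1)

for n in [10**2, 10**3, 10**4]:
    sums = np.array([
        (rng.uniform(size=n) ** (-1.0 / alpha)).sum() for _ in range(2000)
    ])
    z = (sums - n * mu) / n ** (1.0 / alpha)
    print(n, np.subtract(*np.percentile(z, [75, 25])))  # IQR ~ constant in n
```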
Proof Sketch
The proof uses characteristic functions. If $\varphi$ is the characteristic function of $X_1$, then the characteristic function of the normalized sum $(S_n - b_n)/a_n$ is $e^{-itb_n/a_n}\,\varphi(t/a_n)^n$. For this to converge to a non-degenerate limit, the limit must satisfy the stability property for appropriate sequences. The Lévy-Khintchine representation theorem characterizes all infinitely divisible distributions, and the stability constraint restricts the limit to the stable family. Gnedenko and Kolmogorov proved that a distribution is in the domain of attraction of a stable law with index $\alpha < 2$ if and only if $\Pr(|X| > x) = x^{-\alpha} L(x)$ for a slowly varying $L$, together with a tail-balance condition between the left and right tails.
Why It Matters
This theorem tells you what happens when you average data from fat-tailed distributions. The sample mean does not become Gaussian. It becomes stable, which means: the average is itself heavy-tailed, extreme values in the sample can dominate the average, and confidence intervals based on Gaussian quantiles are wrong. For $\alpha \le 1$, the mean does not exist, and the sample mean does not converge at all.
For ML: if gradient noise has a stable distribution with $\alpha < 2$ (there is empirical evidence for this in some architectures), then SGD dynamics are qualitatively different from the Gaussian noise assumption used in most convergence analyses.
Failure Mode
The theorem requires i.i.d. observations. For dependent data (time series, Markov chains), different limit theorems apply, and the limits can be non-stable. The theorem also assumes the distribution is in the domain of attraction of some stable law, which excludes some pathological cases (distributions that oscillate between different tail behaviors).
Slow LLN Convergence Under Fat Tails
Statement
Let $X_1, X_2, \ldots$ be i.i.d. with $\Pr(|X_1| > x) \sim c\, x^{-\alpha}$ for $1 < \alpha < 2$. The mean $\mu = \mathbb{E}[X_1]$ exists but the variance is infinite. The law of large numbers still holds:

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \to \mu \quad \text{almost surely},$$

but the rate of convergence is $n^{-(1 - 1/\alpha)}$ rather than $n^{-1/2}$. The fluctuations of $\bar{X}_n$ around $\mu$ are of order $n^{1/\alpha - 1}$, and the normalized sum converges to a stable law, not a Gaussian.

Chebyshev-style confidence intervals do not apply because the variance is infinite.
Intuition
With finite variance, the sample mean fluctuates at rate $n^{-1/2}$. With infinite variance but finite mean ($1 < \alpha < 2$), the sample mean still converges, but much more slowly. The reason: occasional extreme observations create large jumps in the running average. These jumps become rarer as $n$ grows, but they are large enough that the average converges at rate $n^{-(1 - 1/\alpha)}$ rather than $n^{-1/2}$. Since $1/\alpha > 1/2$ when $\alpha < 2$, we have $1 - 1/\alpha < 1/2$, so convergence is slower.
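A sketch of this rate with $\alpha = 1.5$, where the predicted rate is $n^{-1/3}$; the rescaled error in the last column should stay roughly flat:

```python
# Sample-mean error under Pareto(1.5) data shrinks like n**(-1/3),
# so err * n**(1/3) is roughly constant (while err * sqrt(n) would grow).
import numpy as np

rng = np.random.default_rng(3)
alpha = 1.5
mu = alpha / (alpha - 1)

for n in [10**3, 10**4, 10**5]:
    means = np.array([
        (rng.uniform(size=n) ** (-1.0 / alpha)).mean() for _ in range(500)
    ])
    err = np.median(np.abs(means - mu))        # robust typical error
    print(n, err, err * n ** (1 - 1 / alpha))  # last column roughly constant
```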
Proof Sketch
The SLLN holds because $\mathbb{E}|X_1| < \infty$ (since $\alpha > 1$), which is both necessary and sufficient for the strong law under i.i.d. conditions. The convergence rate follows from the generalized CLT: the centering sequence is $b_n = n\mu$ and the scaling is $a_n = n^{1/\alpha}$ (up to slowly varying factors). The fluctuations of $\bar{X}_n$ are therefore of order $a_n / n = n^{1/\alpha - 1}$.
Why It Matters
Standard sample-size calculations assume $n^{-1/2}$ convergence. If the data is fat-tailed with $1 < \alpha < 2$, convergence is at rate $n^{-(1 - 1/\alpha)}$ instead of $n^{-1/2}$. To halve the error, you need not 4 times as many samples but $2^{\alpha/(\alpha - 1)}$ times as many (8 times as many for $\alpha = 1.5$). For $\alpha$ close to 1, convergence is arbitrarily slow. This matters whenever you estimate a population mean from fat-tailed data: financial returns, insurance claims, or the average loss across a dataset with occasional extreme outliers.
Failure Mode
When $\alpha \le 1$, the mean does not exist and the LLN fails entirely. The sample mean of $n$ i.i.d. Cauchy random variables ($\alpha = 1$) has the same Cauchy distribution for every $n$. Averaging does not help at all.
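A minimal check with numpy: the interquartile range of the sample mean of standard Cauchy draws stays near 2 (the Cauchy IQR, with quartiles at $\pm 1$) no matter how many samples are averaged:

```python
# The sample mean of n standard Cauchy variables is again standard Cauchy.
import numpy as np

rng = np.random.default_rng(4)
for n in [1, 100, 10_000]:
    means = rng.standard_cauchy(size=(2000, n)).mean(axis=1)
    print(n, np.subtract(*np.percentile(means, [75, 25])))  # ~2 for all n
```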
Heavy-Tailed vs. Fat-Tailed vs. Subexponential
These terms are related but distinct. The hierarchy is:
Fat-tailed (regularly varying) $\subset$ Subexponential $\subset$ Heavy-tailed
| Class | Tail decay | Example |
|---|---|---|
| Fat-tailed | Power law: $\bar{F}(x) = L(x)\, x^{-\alpha}$ | Pareto, Student-$t$, stable |
| Subexponential | Sum dominated by maximum | Log-normal, Weibull ($k < 1$) |
| Heavy-tailed | Slower than any exponential | All of the above |
| Thin-tailed | Exponential or faster decay | Gaussian, exponential, Poisson |
The key operational distinction: for subexponential distributions, extreme events are driven by a single large observation (the "one big jump" principle). For thin-tailed distributions, extreme events require many moderately large observations occurring together.
Canonical Examples
Pareto vs. Gaussian: the single-observation problem
Draw $n = 1000$ samples from a Pareto distribution with $\alpha = 1.5$ and $x_m = 1$. The mean is $\mu = \alpha x_m / (\alpha - 1) = 3$. Compute the sample mean. Repeat 10,000 times.
You will find: the sample mean fluctuates wildly across repetitions. In some repetitions, a single observation comparable to the sum of all the others appears and dominates the entire sum. The distribution of the sample mean is itself heavy-tailed (it converges to a stable law with index $\alpha = 1.5$).
Now repeat with $n = 1000$ Gaussian samples with the same mean (unit variance, say; the Pareto variance is infinite, so it cannot be matched). The sample mean is tightly concentrated around $\mu = 3$, with fluctuations of order $n^{-1/2} \approx 0.03$. The distribution of the sample mean is Gaussian.
The difference: with Pareto data, the sample mean is not a reliable summary even for $n = 1000$.
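A sketch of the experiment in numpy (the seed and the unit Gaussian variance are arbitrary choices):

```python
# 10,000 sample means of n = 1,000 Pareto(alpha=1.5, x_m=1) draws
# vs. Gaussian draws with the same mean.
import numpy as np

rng = np.random.default_rng(5)
alpha, n, reps = 1.5, 1000, 10_000
mu = alpha / (alpha - 1)                    # = 3.0

pareto_means = (rng.uniform(size=(reps, n)) ** (-1.0 / alpha)).mean(axis=1)
gauss_means = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)

for name, m in [("Pareto", pareto_means), ("Gaussian", gauss_means)]:
    print(f"{name}: min={m.min():.2f} median={np.median(m):.2f} max={m.max():.2f}")
# Gaussian means stay within ~0.1 of 3.0; Pareto means are right-skewed,
# with occasional repetitions far above 3.0 driven by one huge observation.
```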
The 80/20 rule and power laws
A Pareto distribution with $\alpha$ slightly above 1 generates the "80/20" phenomenon. If wealth follows a Pareto distribution with $\alpha \approx 1.16$, then about 80% of total wealth is held by the top 20% of individuals. More precisely, the fraction of total wealth held by the top fraction $p$ of the population is $p^{1 - 1/\alpha}$.
This concentration gets more extreme as $\alpha$ decreases toward 1. For $\alpha = 1.1$, the top 20% holds about 86% of wealth.
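The top-share formula is a one-liner to evaluate:

```python
# Under Pareto(alpha) wealth, the top fraction p of the population
# holds p**(1 - 1/alpha) of total wealth.
for alpha in [1.5, 1.16, 1.1, 1.05]:
    print(f"alpha={alpha}: top 20% holds {0.20 ** (1 - 1/alpha):.0%}")
# alpha=1.16 gives the 80/20 rule; alpha=1.1 ~86%; alpha=1.05 ~93%
```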
Implications for ML
Gradient Noise
Several empirical studies have found that stochastic gradient noise in deep learning is better modeled by stable distributions with $\alpha < 2$ than by Gaussian noise. If this is correct, then:
- SGD convergence theory that assumes Gaussian noise gives the wrong rate
- The learning rate schedule that is optimal under Gaussian noise may not be optimal under stable noise
- Heavy-tailed gradient noise may actually help optimization by allowing escape from sharp local minima (a form of convex tinkering at the algorithmic level)
Loss Distributions
The distribution of per-sample losses in a training set is often heavy-tailed: most samples have small loss, but a few have very large loss. If you use the mean loss as your objective, rare high-loss samples can dominate gradient updates. Robust losses (Huber, trimmed mean) address this by down-weighting the tails.
Token Frequencies in Language Models
Word and token frequencies follow Zipf's law, which is a power law with exponent close to 1. This means: the most common token appears roughly $k$ times more often than the $k$-th most common token. Training a language model on this distribution means that rare tokens get very few gradient updates, and the model's performance on rare tokens converges much more slowly than on common tokens.
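A back-of-the-envelope sketch of what this means for update counts; the corpus size and vocabulary below are hypothetical round numbers, not taken from any specific model:

```python
# Expected token counts under Zipf's law f(k) proportional to 1/k,
# for a hypothetical corpus of N tokens over a vocabulary of V types.
import numpy as np

V, N = 50_000, 10**9
freqs = 1.0 / np.arange(1, V + 1)
freqs /= freqs.sum()                   # normalize into a distribution

for k in [1, 100, 10_000, 50_000]:
    print(k, int(N * freqs[k - 1]))    # expected occurrences (= updates)
# rank 1 gets ~9e7 occurrences; rank 50,000 gets ~1.8e3: a 50,000x gap
# in how often each token's embedding receives a gradient update.
```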
Why Subgaussian Assumptions Matter
Standard ML theory bounds (Hoeffding, McDiarmid, Rademacher complexity) assume bounded or subgaussian random variables. A random variable $X$ is subgaussian with parameter $\sigma^2$ if $\mathbb{E}[e^{\lambda(X - \mathbb{E}X)}] \le e^{\lambda^2 \sigma^2 / 2}$ for all $\lambda \in \mathbb{R}$. This is a strong tail condition: it guarantees $\Pr(|X - \mathbb{E}X| > t) \le 2e^{-t^2/(2\sigma^2)}$.
Fat-tailed distributions violate this condition. When the data is fat-tailed:
- Hoeffding's inequality does not apply
- Standard uniform convergence bounds fail
- The sample complexity for learning may be much worse than predicted by VC theory
- Median-of-means estimators and truncation techniques are needed as replacements (see the sketch below)
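A minimal median-of-means sketch in numpy (the group count $k = 20$ and the tail index are illustrative; in theory $k$ scales with $\log(1/\delta)$ for failure probability $\delta$):

```python
# Median-of-means: split the sample into k groups, average each group,
# return the median of the group means. For i.i.d. data the grouping
# can be arbitrary, so no shuffle is needed here.
import numpy as np

def median_of_means(x: np.ndarray, k: int) -> float:
    groups = np.array_split(x, k)
    return float(np.median([g.mean() for g in groups]))

rng = np.random.default_rng(6)
alpha = 2.1                                      # heavy tail, finite variance
x = rng.uniform(size=10_000) ** (-1.0 / alpha)   # Pareto(alpha, 1) samples

print("true mean:      ", alpha / (alpha - 1))   # ~1.909
print("sample mean:    ", x.mean())
print("median of means:", median_of_means(x, k=20))
```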
Common Confusions
Heavy-tailed does not mean infinite mean
A distribution can be heavy-tailed (MGF infinite for all $t > 0$) while still having a finite mean. A Pareto with $\alpha = 3$ has finite mean and finite variance, but it is still heavy-tailed. The term "heavy-tailed" refers to the tail decay rate, not to the existence of moments.
Power laws are not confirmed by log-log plots
Plotting the empirical survival function on a log-log scale and observing an approximately straight line does not confirm a power law. Many distributions (log-normal, stretched exponential) look approximately linear on a log-log plot over a limited range. Rigorous testing requires methods like the Clauset-Shalizi-Newman procedure, which uses maximum likelihood estimation for the tail index and a goodness-of-fit test based on the Kolmogorov-Smirnov statistic.
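The maximum-likelihood step is the Hill estimator. A sketch, with the cutoff $m$ fixed by hand rather than selected by the Kolmogorov-Smirnov criterion of the full procedure:

```python
# Hill estimator: the MLE of the tail index computed from the m largest
# order statistics. Choosing m is the hard part, omitted here.
import numpy as np

def hill_estimator(x: np.ndarray, m: int) -> float:
    top = np.sort(x)[::-1]                       # descending order statistics
    return m / np.sum(np.log(top[:m]) - np.log(top[m]))

rng = np.random.default_rng(7)
x = rng.uniform(size=100_000) ** (-1.0 / 1.5)    # Pareto with true alpha = 1.5
print(hill_estimator(x, m=1000))                 # should be close to 1.5
```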
The sample variance does not estimate the population variance under fat tails
If $X$ has a Pareto distribution with $\alpha = 2.5$, the population variance is finite. But the sample variance converges to the population variance extremely slowly, and the distribution of the sample variance is itself heavy-tailed ($X^2$ has tail index $\alpha/2 = 1.25$). Standard confidence intervals for the variance (which assume a well-behaved limiting distribution) are unreliable. The sample kurtosis is meaningless when $\alpha \le 4$.
Taleb's point is about sample statistics, not about the distribution itself
Taleb's critique is not that fat-tailed distributions are mysterious. It is that standard statistical tools (sample mean, sample variance, confidence intervals, p-values) give misleading answers when applied to fat-tailed data, because these tools implicitly assume thin tails. The distribution itself is perfectly well-defined mathematically. The problem is epistemological: you cannot reliably estimate its properties from a finite sample.
Misreadings of Fat Tails
A Black Swan is any rare event
Wrong. A Black Swan in Taleb's sense has three properties: rarity, extreme impact, and retrospective explainability (people invent stories after the fact as if the event was predictable). A coin flip landing heads 20 times in a row is rare but not a Black Swan because its impact is negligible. A financial crisis that wipes out savings and is afterward "explained" by obvious-in-hindsight factors is a Black Swan.
Fat tails mean we should give up on statistics
Wrong. Fat tails mean we should use the right statistics. Median-of-means estimators, trimmed means, robust regression, distribution-free concentration inequalities, and non-parametric methods all work under fat tails. The mistake is applying thin-tailed tools (sample mean + Gaussian confidence intervals) to fat-tailed data, not doing statistics at all.
Exercises
Problem
A Pareto distribution has shape parameter $\alpha$ and scale $x_m$. For which values of $\alpha$ does the mean exist? The variance? The fourth moment? Compute the mean and variance when they exist.
Problem
Explain why the sample mean of $n$ i.i.d. Cauchy random variables has the same Cauchy distribution regardless of $n$. What does this imply about the law of large numbers?
Problem
Let $X$ be a non-negative random variable with $\Pr(X > x) = x^{-3/2}$ for $x \ge 1$ and $\Pr(X > x) = 1$ for $x < 1$. Show that $\mathbb{E}[X] < \infty$ but $\mathbb{E}[X^2] = \infty$, and compute the mean.
Problem
The subexponential property states that for i.i.d. non-negative $X_1, X_2$ with distribution $F$: $\Pr(X_1 + X_2 > x) \sim 2 \Pr(X_1 > x)$ as $x \to \infty$. Prove that this implies $\Pr(X_1 + X_2 > x) \sim \Pr(\max(X_1, X_2) > x)$. Interpret this result.
Problem
The median-of-means estimator divides $n$ samples into $k$ groups, computes the sample mean of each group, and returns the median of these $k$ means. Explain why this estimator is more robust to fat tails than the ordinary sample mean, and under what conditions on $k$ it achieves sub-Gaussian concentration.
References
Canonical:
- Feller, An Introduction to Probability Theory and Its Applications, Vol. 2 (1971), Chapters VIII-IX
- Nolan, Stable Distributions: Models for Heavy Tailed Data (2020), Chapters 1-3
- Taleb, Statistical Consequences of Fat Tails (2020), Chapters 1-5
Current:
- Embrechts, Klüppelberg, & Mikosch, Modelling Extremal Events (1997), Chapters 1-3
- Vershynin, High-Dimensional Probability (2018), Section 2.8 (subgaussian and subexponential)
- Lugosi & Mendelson, "Mean Estimation and Regression Under Heavy-Tailed Distributions: A Survey" (2019), Foundations and Trends in ML, Sections 1-4
Next Topics
Building on fat tails:
- Extreme value theory: the mathematics of maxima and peaks over threshold
- Subexponential random variables: the detailed theory of the "one big jump" principle
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Law of Large Numbers (Layer 0B)
Builds on This
- Extreme Value Theory (Layer 3)