Fat Tails and Heavy-Tailed Distributions
When the tails dominate. Power laws, Pareto distributions, subexponential tails, why the law of large numbers converges slowly or fails, and why most of ML silently assumes thin tails.
Why This Matters
Most of ML theory assumes that random variables are well-behaved: bounded, subgaussian, or at least subexponential. Concentration inequalities, uniform convergence bounds, and PAC learning guarantees all depend on tail conditions that exclude heavy-tailed distributions. When those conditions fail, sample means become unreliable, confidence intervals become meaningless, and standard risk bounds collapse.
Fat tails are not a curiosity. They appear in financial returns, city populations, word frequencies, network degree distributions, wealth distributions, insurance claims, earthquake magnitudes, and increasingly in ML itself: gradient norms during training, loss spikes, token frequency distributions in language models, and the distribution of model performance across random seeds.
The central question is: when tails decay polynomially rather than exponentially, which tools from probability and statistics still work, and which ones silently give wrong answers?
Mental Model
Compare two distributions with the same mean and variance:
Gaussian $\mathcal{N}(\mu, \sigma^2)$: the probability of exceeding $\mu + k\sigma$ drops as $e^{-k^2/2}$. An observation beyond $10\sigma$ has probability about $10^{-23}$. You will never see one in a finite dataset.
Pareto with $\alpha = 3$ (finite variance): the probability of exceeding $x$ drops as $x^{-3}$. An observation 100 times the typical value has probability $100^{-3} = 10^{-6}$. In a dataset of a million points, you expect one. That single observation can dominate the sample mean.
The Gaussian world is one where the sample mean is a reliable summary. The Pareto world is one where a single extreme observation can change the conclusion. The distinction between these two regimes is the subject of this page.
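A quick numerical check of the two tail probabilities above, as a sketch using scipy (the parameter values match the comparison, not any fitted data):

```python
# Compare tail probabilities of a Gaussian and a Pareto with alpha = 3.
from scipy.stats import norm, pareto

# Gaussian: probability of a value more than 10 sigma above the mean.
print(norm.sf(10))          # ~7.6e-24, never observed in practice

# Pareto with tail index alpha = 3 (scipy's shape parameter b), scale 1:
# probability of exceeding 100x the minimum value.
print(pareto.sf(100, b=3))  # 1e-6, about one per million samples
```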
Core Definitions
Heavy-Tailed Distribution
A distribution is heavy-tailed if its moment generating function $\mathbb{E}[e^{tX}]$ is infinite for all $t > 0$. Equivalently, the tail decays more slowly than any exponential:

$$\limsup_{x \to \infty} e^{\lambda x} \Pr(X > x) = \infty \quad \text{for all } \lambda > 0.$$
This means the tails are "heavier" than any exponential distribution. The Pareto, log-normal, Cauchy, and stable distributions are all heavy-tailed. The Gaussian, exponential, and Poisson distributions are not.
Fat-Tailed (Regularly Varying) Distribution
A distribution is fat-tailed (or has a power-law tail) if its survival function satisfies:

$$\bar{F}(x) = \Pr(X > x) = L(x)\, x^{-\alpha},$$

where $\alpha > 0$ is the tail index and $L$ is a slowly varying function (meaning $L(tx)/L(x) \to 1$ as $x \to \infty$ for all $t > 0$).
The tail index $\alpha$ determines which moments exist:
- $\mathbb{E}[|X|^p] < \infty$ if and only if $p < \alpha$
- If $\alpha \le 1$: the mean does not exist
- If $\alpha \le 2$: the variance does not exist
- If $\alpha \le 4$: the kurtosis does not exist
Every fat-tailed distribution is heavy-tailed, but the converse is false. The log-normal distribution is heavy-tailed but not fat-tailed (its tail decays as $e^{-(\ln x)^2/(2\sigma^2)}$ up to slowly varying factors, not as a power law).
Pareto Distribution
The Pareto distribution with shape $\alpha > 0$ and scale $x_m > 0$ has survival function:

$$\bar{F}(x) = \left(\frac{x_m}{x}\right)^{\alpha}, \qquad x \ge x_m,$$

and density $f(x) = \alpha x_m^{\alpha} x^{-(\alpha+1)}$ for $x \ge x_m$.

Moments: $\mathbb{E}[X^p] = \dfrac{\alpha x_m^p}{\alpha - p}$ when $p < \alpha$, and infinite otherwise.

For $\alpha > 1$: $\mathbb{E}[X] = \dfrac{\alpha x_m}{\alpha - 1}$. For $\alpha > 2$: $\mathrm{Var}(X) = \dfrac{\alpha x_m^2}{(\alpha - 1)^2 (\alpha - 2)}$.
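The moment formulas are easy to check by simulation. A minimal sketch using numpy and inverse-transform sampling (seed and sample size are arbitrary):

```python
# Inverse-transform sampling for the Pareto: if U ~ Uniform(0,1), then
# x_m * U**(-1/alpha) has survival function (x_m / x)**alpha.
import numpy as np

rng = np.random.default_rng(0)
alpha, x_m, n = 3.0, 1.0, 10**6
x = x_m * rng.uniform(size=n) ** (-1.0 / alpha)

print(x.mean(), alpha * x_m / (alpha - 1))                       # both ~1.5
print(x.var(), alpha * x_m**2 / ((alpha - 1)**2 * (alpha - 2)))  # both ~0.75
```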
Subexponential Distribution
A distribution $F$ on $[0, \infty)$ is subexponential if for i.i.d. $X_1, X_2$ with distribution $F$:

$$\Pr(X_1 + X_2 > x) \sim 2 \Pr(X_1 > x) \quad \text{as } x \to \infty.$$

This says: the sum exceeds $x$ primarily because one of the summands exceeds $x$, not because both are moderately large. For $n$ i.i.d. copies:

$$\Pr(X_1 + \cdots + X_n > x) \sim \Pr\big(\max(X_1, \ldots, X_n) > x\big) \sim n \Pr(X_1 > x).$$
The maximum dominates the sum. This is the "one big jump" principle, and it is the defining property of distributions where rare events drive aggregate behavior.
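A Monte Carlo sketch of the one-big-jump principle, assuming a Pareto tail with $\alpha = 1.5$ (any subexponential distribution would do):

```python
# The ratio P(X1 + X2 > x) / (2 * P(X1 > x)) tends to 1 as x grows.
import numpy as np

rng = np.random.default_rng(1)
alpha, n = 1.5, 10**7
x1 = rng.uniform(size=n) ** (-1.0 / alpha)  # Pareto(alpha, 1) samples
x2 = rng.uniform(size=n) ** (-1.0 / alpha)

for x in [10.0, 100.0, 1000.0]:
    p_sum = np.mean(x1 + x2 > x)            # empirical P(X1 + X2 > x)
    p_tail = x ** -alpha                    # exact P(X1 > x)
    print(x, p_sum / (2 * p_tail))          # ratio approaches 1
```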
Stable Distribution
A distribution is stable if a linear combination of two independent copies has the same distribution (up to location and scale). Formally, $X$ is stable if for any $a, b > 0$, there exist $c > 0$ and $d \in \mathbb{R}$ such that:

$$aX_1 + bX_2 \stackrel{d}{=} cX + d,$$

where $X_1, X_2$ are independent copies of $X$.
Stable distributions are parameterized by four values: stability index $\alpha \in (0, 2]$, skewness $\beta \in [-1, 1]$, scale $\gamma > 0$, and location $\delta \in \mathbb{R}$. Only three special cases have closed-form densities: the Gaussian ($\alpha = 2$), the Cauchy ($\alpha = 1$, $\beta = 0$), and the Lévy ($\alpha = 1/2$, $\beta = 1$).
Main Theorems
Generalized Central Limit Theorem (Stable Laws)
Statement
Let $X_1, X_2, \ldots$ be i.i.d. random variables. If there exist sequences $a_n > 0$ and $b_n \in \mathbb{R}$ such that:

$$\frac{X_1 + \cdots + X_n - b_n}{a_n} \xrightarrow{d} Z,$$

where $Z$ is a non-degenerate random variable, then $Z$ must have a stable distribution with some index $\alpha \in (0, 2]$.

When $\alpha = 2$, $Z$ is Gaussian and $a_n \propto \sqrt{n}$: this is the classical CLT.

When $\alpha < 2$, $Z$ is a non-Gaussian stable distribution and $a_n = n^{1/\alpha} L(n)$ for a slowly varying function $L$. The limit is heavy-tailed with infinite variance.

A distribution with tail $\Pr(|X| > x) \sim c\, x^{-\alpha}$ for $0 < \alpha < 2$ is in the domain of attraction of the stable law with index $\alpha$.
Intuition
The classical CLT says: if you average i.i.d. variables with finite variance, the average is approximately Gaussian. The generalized CLT asks: what if the variance is infinite? The answer is that properly normalized sums still converge, but to a non-Gaussian stable distribution. The only possible limits for normalized sums of i.i.d. variables are stable distributions. This is a universality result: just as the Gaussian is the universal limit for finite-variance sums, stable laws are the universal limits for all i.i.d. sums.
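A simulation sketch of this universality, assuming Pareto summands with $\alpha = 1.5$; the interquartile range is used as the scale measure because the stable limit has infinite variance:

```python
# After centering by n*mu and scaling by n**(1/alpha), the fluctuations
# of Pareto(1.5) sums settle to a fixed (stable) scale as n grows.
import numpy as np

rng = np.random.default_rng(2)
alpha = 1.5
mu = alpha / (alpha - 1)                    # mean of Pareto(alpha, 1)

for n in [10**2, 10**3, 10**4]:
    sums = np.array([
        (rng.uniform(size=n) ** (-1.0 / alpha)).sum() for _ in range(2000)
    ])
    z = (sums - n * mu) / n ** (1.0 / alpha)
    print(n, np.subtract(*np.percentile(z, [75, 25])))  # IQR ~ constant in n
```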
Proof Sketch
The proof uses characteristic functions. If $\varphi$ is the characteristic function of $X_1$, then the characteristic function of the normalized sum $(S_n - b_n)/a_n$ is $e^{-itb_n/a_n}\,\varphi(t/a_n)^n$. For this to converge to a non-degenerate limit, the limit must satisfy the stability property for appropriate sequences. The Lévy-Khintchine representation theorem characterizes all infinitely divisible distributions, and the stability constraint restricts the limit to the stable family. Gnedenko and Kolmogorov proved that a distribution is in the domain of attraction of a stable law with index $\alpha < 2$ if and only if $\Pr(|X| > x) = x^{-\alpha} L(x)$ for a slowly varying $L$, together with a tail-balance condition between the left and right tails.
Why It Matters
This theorem tells you what happens when you average data from fat-tailed distributions. The sample mean does not become Gaussian. It becomes stable, which means: the average is itself heavy-tailed, extreme values in the sample can dominate the average, and confidence intervals based on Gaussian quantiles are wrong. For $\alpha \le 1$, the mean does not exist, and the sample mean does not converge at all.
For ML: if gradient noise has a stable distribution with $\alpha < 2$ (there is empirical evidence for this in some architectures), then SGD dynamics are qualitatively different from the Gaussian noise assumption used in most convergence analyses.
Failure Mode
The theorem requires i.i.d. observations. For dependent data (time series, Markov chains), different limit theorems apply, and the limits can be non-stable. The theorem also assumes the distribution is in the domain of attraction of some stable law, which excludes some pathological cases (distributions that oscillate between different tail behaviors).
Slow LLN Convergence Under Fat Tails
Statement
Let $X_1, X_2, \ldots$ be i.i.d. with $\Pr(|X_1| > x) \sim c\, x^{-\alpha}$ for $1 < \alpha < 2$. The mean $\mu = \mathbb{E}[X_1]$ exists but the variance is infinite. The law of large numbers still holds:

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \to \mu \quad \text{almost surely},$$

but the rate of convergence is $n^{-(1 - 1/\alpha)}$ rather than $n^{-1/2}$. The fluctuations of $\bar{X}_n$ around $\mu$ are of order $n^{1/\alpha - 1}$, and the normalized sum converges to a stable law, not a Gaussian.

Chebyshev-style confidence intervals do not apply because the variance is infinite.
Intuition
With finite variance, the sample mean fluctuates at rate $n^{-1/2}$. With infinite variance but finite mean ($1 < \alpha < 2$), the sample mean still converges, but much more slowly. The reason: occasional extreme observations create large jumps in the running average. These jumps become rarer as $n$ grows, but they are large enough that the average converges at rate $n^{-(1 - 1/\alpha)}$ rather than $n^{-1/2}$. Since $1/\alpha > 1/2$ when $\alpha < 2$, we have $1 - 1/\alpha < 1/2$, so convergence is slower.
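A sketch of this rate with $\alpha = 1.5$, where the predicted rate is $n^{-1/3}$; the rescaled error in the last column should stay roughly flat:

```python
# Sample-mean error under Pareto(1.5) data shrinks like n**(-1/3),
# so err * n**(1/3) is roughly constant (while err * sqrt(n) would grow).
import numpy as np

rng = np.random.default_rng(3)
alpha = 1.5
mu = alpha / (alpha - 1)

for n in [10**3, 10**4, 10**5]:
    means = np.array([
        (rng.uniform(size=n) ** (-1.0 / alpha)).mean() for _ in range(500)
    ])
    err = np.median(np.abs(means - mu))        # robust typical error
    print(n, err, err * n ** (1 - 1 / alpha))  # last column roughly constant
```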
Proof Sketch
The SLLN holds because $\mathbb{E}|X_1| < \infty$ (since $\alpha > 1$), which is both necessary and sufficient for the strong law under i.i.d. conditions. The convergence rate follows from the generalized CLT: the centering sequence is $b_n = n\mu$ and the scaling is $a_n = n^{1/\alpha}$ (up to slowly varying factors). The fluctuations of $\bar{X}_n$ are therefore of order $a_n / n = n^{1/\alpha - 1}$.
Why It Matters
Standard sample-size calculations assume $n^{-1/2}$ convergence. If the data is fat-tailed with $1 < \alpha < 2$, convergence is at rate $n^{-(1 - 1/\alpha)}$ instead of $n^{-1/2}$. To halve the error, you need not 4 times as many samples but $2^{\alpha/(\alpha - 1)}$ times as many (8 times as many for $\alpha = 1.5$). For $\alpha$ close to 1, convergence is arbitrarily slow. This matters whenever you estimate a population mean from fat-tailed data: financial returns, insurance claims, or the average loss across a dataset with occasional extreme outliers.
Failure Mode
When $\alpha \le 1$, the mean does not exist and the LLN fails entirely. The sample mean of $n$ i.i.d. Cauchy random variables ($\alpha = 1$) has the same Cauchy distribution for every $n$. Averaging does not help at all.
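A minimal check with numpy: the interquartile range of the sample mean of standard Cauchy draws stays near 2 (the Cauchy IQR, with quartiles at $\pm 1$) no matter how many samples are averaged:

```python
# The sample mean of n standard Cauchy variables is again standard Cauchy.
import numpy as np

rng = np.random.default_rng(4)
for n in [1, 100, 10_000]:
    means = rng.standard_cauchy(size=(2000, n)).mean(axis=1)
    print(n, np.subtract(*np.percentile(means, [75, 25])))  # ~2 for all n
```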
Heavy-Tailed vs. Fat-Tailed vs. Subexponential
These terms are related but distinct. The hierarchy is:
Fat-tailed (regularly varying) $\subset$ Subexponential $\subset$ Heavy-tailed
| Class | Tail decay | Example |
|---|---|---|
| Fat-tailed | Power law: $\bar{F}(x) = L(x)\, x^{-\alpha}$ | Pareto, Student-$t$, stable |
| Subexponential | Sum dominated by maximum | Log-normal, Weibull ($k < 1$) |
| Heavy-tailed | Slower than any exponential | All of the above |
| Thin-tailed | Exponential or faster decay | Gaussian, exponential, Poisson |
The key operational distinction: for subexponential distributions, extreme events are driven by a single large observation (the "one big jump" principle). For thin-tailed distributions, extreme events require many moderately large observations occurring together.
Canonical Examples
Pareto vs. Gaussian: the single-observation problem
Draw $n = 1000$ samples from a Pareto distribution with $\alpha = 1.5$ and $x_m = 1$. The mean is $\mu = \alpha x_m / (\alpha - 1) = 3$. Compute the sample mean. Repeat 10,000 times.
You will find: the sample mean fluctuates wildly across repetitions. In some repetitions, a single observation comparable to the sum of all the others appears and dominates the entire sum. The distribution of the sample mean is itself heavy-tailed (it converges to a stable law with index $\alpha = 1.5$).
Now repeat with $n = 1000$ Gaussian samples with the same mean (unit variance, say; the Pareto variance is infinite, so it cannot be matched). The sample mean is tightly concentrated around $\mu = 3$, with fluctuations of order $n^{-1/2} \approx 0.03$. The distribution of the sample mean is Gaussian.
The difference: with Pareto data, the sample mean is not a reliable summary even for $n = 1000$.
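A sketch of the experiment in numpy (the seed and the unit Gaussian variance are arbitrary choices):

```python
# 10,000 sample means of n = 1,000 Pareto(alpha=1.5, x_m=1) draws
# vs. Gaussian draws with the same mean.
import numpy as np

rng = np.random.default_rng(5)
alpha, n, reps = 1.5, 1000, 10_000
mu = alpha / (alpha - 1)                    # = 3.0

pareto_means = (rng.uniform(size=(reps, n)) ** (-1.0 / alpha)).mean(axis=1)
gauss_means = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)

for name, m in [("Pareto", pareto_means), ("Gaussian", gauss_means)]:
    print(f"{name}: min={m.min():.2f} median={np.median(m):.2f} max={m.max():.2f}")
# Gaussian means stay within ~0.1 of 3.0; Pareto means are right-skewed,
# with occasional repetitions far above 3.0 driven by one huge observation.
```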
The 80/20 rule and power laws
A Pareto distribution with $\alpha$ slightly above 1 generates the "80/20" phenomenon. If wealth follows a Pareto distribution with $\alpha \approx 1.16$, then about 80% of total wealth is held by the top 20% of individuals. More precisely, the fraction of total wealth held by the top fraction $p$ of the population is $p^{1 - 1/\alpha}$.
This concentration gets more extreme as $\alpha$ decreases toward 1. For $\alpha = 1.1$, the top 20% holds about 86% of wealth.
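The top-share formula is a one-liner to evaluate:

```python
# Under Pareto(alpha) wealth, the top fraction p of the population
# holds p**(1 - 1/alpha) of total wealth.
for alpha in [1.5, 1.16, 1.1, 1.05]:
    print(f"alpha={alpha}: top 20% holds {0.20 ** (1 - 1/alpha):.0%}")
# alpha=1.16 gives the 80/20 rule; alpha=1.1 ~86%; alpha=1.05 ~93%
```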
Implications for ML
Gradient Noise
Several empirical studies have found that stochastic gradient noise in deep learning is better modeled by stable distributions with $\alpha < 2$ than by Gaussian noise. If this is correct, then:
- SGD convergence theory that assumes Gaussian noise gives the wrong rate
- The learning rate schedule that is optimal under Gaussian noise may not be optimal under stable noise
- Heavy-tailed gradient noise may actually help optimization by allowing escape from sharp local minima (a form of convex tinkering at the algorithmic level)
Loss Distributions
The distribution of per-sample losses in a training set is often heavy-tailed: most samples have small loss, but a few have very large loss. If you use the mean loss as your objective, rare high-loss samples can dominate gradient updates. Robust losses (Huber, trimmed mean) address this by down-weighting the tails.
Token Frequencies in Language Models
Word and token frequencies follow Zipf's law, which is a power law with exponent close to 1. This means: the most common token appears roughly $k$ times more often than the $k$-th most common token. Training a language model on this distribution means that rare tokens get very few gradient updates, and the model's performance on rare tokens converges much more slowly than on common tokens.
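A back-of-the-envelope sketch of what this means for update counts; the corpus size and vocabulary below are hypothetical round numbers, not taken from any specific model:

```python
# Expected token counts under Zipf's law f(k) proportional to 1/k,
# for a hypothetical corpus of N tokens over a vocabulary of V types.
import numpy as np

V, N = 50_000, 10**9
freqs = 1.0 / np.arange(1, V + 1)
freqs /= freqs.sum()                   # normalize into a distribution

for k in [1, 100, 10_000, 50_000]:
    print(k, int(N * freqs[k - 1]))    # expected occurrences (= updates)
# rank 1 gets ~9e7 occurrences; rank 50,000 gets ~1.8e3: a 50,000x gap
# in how often each token's embedding receives a gradient update.
```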
Why Subgaussian Assumptions Matter
Standard ML theory bounds (Hoeffding, McDiarmid, Rademacher complexity) assume bounded or subgaussian random variables. A random variable $X$ is subgaussian with parameter $\sigma^2$ if $\mathbb{E}[e^{\lambda(X - \mathbb{E}X)}] \le e^{\lambda^2 \sigma^2 / 2}$ for all $\lambda \in \mathbb{R}$. This is a strong tail condition: it guarantees $\Pr(|X - \mathbb{E}X| > t) \le 2e^{-t^2/(2\sigma^2)}$.
Fat-tailed distributions violate this condition. When the data is fat-tailed:
- Hoeffding's inequality does not apply
- Standard uniform convergence bounds fail
- The sample complexity for learning may be much worse than predicted by VC theory
- Median-of-means estimators and truncation techniques are needed as replacements (see the sketch below)
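A minimal median-of-means sketch in numpy (the group count $k = 20$ and the tail index are illustrative; in theory $k$ scales with $\log(1/\delta)$ for failure probability $\delta$):

```python
# Median-of-means: split the sample into k groups, average each group,
# return the median of the group means. For i.i.d. data the grouping
# can be arbitrary, so no shuffle is needed here.
import numpy as np

def median_of_means(x: np.ndarray, k: int) -> float:
    groups = np.array_split(x, k)
    return float(np.median([g.mean() for g in groups]))

rng = np.random.default_rng(6)
alpha = 2.1                                      # heavy tail, finite variance
x = rng.uniform(size=10_000) ** (-1.0 / alpha)   # Pareto(alpha, 1) samples

print("true mean:      ", alpha / (alpha - 1))   # ~1.909
print("sample mean:    ", x.mean())
print("median of means:", median_of_means(x, k=20))
```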
Common Confusions
Heavy-tailed does not mean infinite mean
A distribution can be heavy-tailed (MGF infinite for all $t > 0$) while still having a finite mean. A Pareto with $\alpha = 3$ has finite mean and finite variance, but it is still heavy-tailed. The term "heavy-tailed" refers to the tail decay rate, not to the existence of moments.
Power laws are not confirmed by log-log plots
Plotting the empirical survival function on a log-log scale and observing an approximately straight line does not confirm a power law. Many distributions (log-normal, stretched exponential) look approximately linear on a log-log plot over a limited range. Rigorous testing requires methods like the Clauset-Shalizi-Newman procedure, which uses maximum likelihood estimation for the tail index and a goodness-of-fit test based on the Kolmogorov-Smirnov statistic.
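The maximum-likelihood step is the Hill estimator. A sketch, with the cutoff $m$ fixed by hand rather than selected by the Kolmogorov-Smirnov criterion of the full procedure:

```python
# Hill estimator: the MLE of the tail index computed from the m largest
# order statistics. Choosing m is the hard part, omitted here.
import numpy as np

def hill_estimator(x: np.ndarray, m: int) -> float:
    top = np.sort(x)[::-1]                       # descending order statistics
    return m / np.sum(np.log(top[:m]) - np.log(top[m]))

rng = np.random.default_rng(7)
x = rng.uniform(size=100_000) ** (-1.0 / 1.5)    # Pareto with true alpha = 1.5
print(hill_estimator(x, m=1000))                 # should be close to 1.5
```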
The sample variance does not estimate the population variance under fat tails
If $X$ has a Pareto distribution with $\alpha = 2.5$, the population variance is finite. But the sample variance converges to the population variance extremely slowly, and the distribution of the sample variance is itself heavy-tailed ($X^2$ has tail index $\alpha/2 = 1.25$). Standard confidence intervals for the variance (which assume a well-behaved limiting distribution) are unreliable. The sample kurtosis is meaningless when $\alpha \le 4$.
Taleb's point is about sample statistics, not about the distribution itself
Taleb's critique is not that fat-tailed distributions are mysterious. It is that standard statistical tools (sample mean, sample variance, confidence intervals, p-values) give misleading answers when applied to fat-tailed data, because these tools implicitly assume thin tails. The distribution itself is perfectly well-defined mathematically. The problem is epistemological: you cannot reliably estimate its properties from a finite sample.
Misreadings of Fat Tails
A Black Swan is any rare event
Wrong. A Black Swan in Taleb's sense has three properties: rarity, extreme impact, and retrospective explainability (people invent stories after the fact as if the event was predictable). A coin flip landing heads 20 times in a row is rare but not a Black Swan because its impact is negligible. A financial crisis that wipes out savings and is afterward "explained" by obvious-in-hindsight factors is a Black Swan.
Fat tails mean we should give up on statistics
Wrong. Fat tails mean we should use the right statistics. Median-of-means estimators, trimmed means, robust regression, distribution-free concentration inequalities, and non-parametric methods all work under fat tails. The mistake is applying thin-tailed tools (sample mean + Gaussian confidence intervals) to fat-tailed data, not doing statistics at all.
Exercises
Problem
A Pareto distribution has shape parameter $\alpha$ and scale $x_m$. For which values of $\alpha$ does the mean exist? The variance? The fourth moment? Compute the mean and variance when they exist.
Problem
Explain why the sample mean of $n$ i.i.d. Cauchy random variables has the same Cauchy distribution regardless of $n$. What does this imply about the law of large numbers?
Problem
Let $X$ be a non-negative random variable with $\Pr(X > x) = x^{-3/2}$ for $x \ge 1$ and $\Pr(X > x) = 1$ for $x < 1$. Show that $\mathbb{E}[X] < \infty$ but $\mathbb{E}[X^2] = \infty$, and compute the mean.
Problem
The subexponential property states that for i.i.d. non-negative $X_1, X_2$ with distribution $F$: $\Pr(X_1 + X_2 > x) \sim 2 \Pr(X_1 > x)$ as $x \to \infty$. Prove that this implies $\Pr(X_1 + X_2 > x) \sim \Pr(\max(X_1, X_2) > x)$. Interpret this result.
Problem
The median-of-means estimator divides $n$ samples into $k$ groups, computes the sample mean of each group, and returns the median of these $k$ means. Explain why this estimator is more robust to fat tails than the ordinary sample mean, and under what conditions on $k$ it achieves sub-Gaussian concentration.
References
Canonical:
- Feller, An Introduction to Probability Theory and Its Applications, Vol. 2 (1971), Chapters VIII-IX
- Nolan, Stable Distributions: Models for Heavy Tailed Data (2020), Chapters 1-3
- Taleb, Statistical Consequences of Fat Tails (2020), Chapters 1-5
Current:
- Embrechts, Klüppelberg, & Mikosch, Modelling Extremal Events (1997), Chapters 1-3
- Vershynin, High-Dimensional Probability (2018), Section 2.8 (subgaussian and subexponential)
- Lugosi & Mendelson, "Mean Estimation and Regression Under Heavy-Tailed Distributions: A Survey" (2019), Foundations and Trends in ML, Sections 1-4
Next Topics
Building on fat tails:
- Extreme value theory: the mathematics of maxima and peaks over threshold
- Subexponential random variables: the detailed theory of the "one big jump" principle
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Law of Large Numbers (Layer 0B)
Builds on This
- Extreme Value Theory (Layer 3)