Pareto Distribution

Sneiderman, Robby

The Pareto distribution is the canonical power-law on a half-line. The Type I parameterization has survival function (x_m/x)^alpha for x at least x_m. The shape parameter alpha is the tail index. Three regimes of alpha matter for the law of large numbers and the central limit theorem: alpha at most 1 has no finite mean and breaks the LLN; 1 < alpha at most 2 has finite mean but infinite variance so the standard CLT fails (generalized CLT to a stable law); alpha greater than 2 admits both LLN and CLT in the usual form. Applications: wealth, city sizes, file sizes, network degree, insurance severity. The 80/20 'Pareto principle' is a specific case requiring alpha approximately 1.16.

Advanced · Tier 2 · Stable · Supporting · ~45 min

For: ML, Stats, Actuarial

Prerequisites

Common Probability Distributions, Central Limit Theorem, Law of Large Numbers, Distributions Atlas

Plain-Language Definition

The Pareto distribution is the simplest model of a power-law tail. A positive random variable $X$ is Pareto Type I with minimum value $x_m > 0$ and shape parameter $\alpha > 0$ if the probability of exceeding $x$ falls like a power of $x$ :

$\mathbb{P}(X > x) = \left(\frac{x_m}{x}\right)^\alpha, \quad x \geq x_m.$

The shape parameter $\alpha$ is called the tail index. A smaller $\alpha$ means a heavier tail, a slower decay of exceedance probabilities, and more weight in the upper extreme. The 80/20 rule, the long tail of file sizes on the internet, and the size distribution of cities and earthquakes all sit in the Pareto family with different tail indices.

The shape of the tail is what makes the Pareto interesting: depending on how heavy the tail is, the sample mean may not converge, or it may converge but to a non-Normal limit. The distinctions are sharp, controlled entirely by $\alpha$ .

Definition

Pareto Type I Distribution $X \sim \operatorname{Pareto}(x_m, \alpha)$

A random variable $X$ has a Pareto Type I distribution with scale $x_m > 0$ and shape $\alpha > 0$ when its survival function is

$S_X(x) = \mathbb{P}(X > x) = \left(\frac{x_m}{x}\right)^\alpha, \quad x \geq x_m,$

and $S_X(x) = 1$ for $x < x_m$ . The density is

$f_X(x) = \frac{\alpha\, x_m^\alpha}{x^{\alpha + 1}}, \quad x \geq x_m.$

The support starts at $x_m$ , not at 0; the distribution is left-bounded. The Type II (Lomax) parameterization shifts the support to start at 0 by replacing $x$ with $x_m + y$ where $y \geq 0$ ; the survival function becomes $(1 + y/x_m)^{-\alpha}$ . The two types share the same tail behavior but differ near the origin.
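As a minimal sketch of the definition, the functions below evaluate the survival function, density, and quantile function of a Pareto Type I and draw samples by inverting the survival function. The function names are illustrative and not taken from any library; only Python's standard library is assumed.

```python
import random

def pareto_survival(x, xm, alpha):
    """P(X > x) for Pareto Type I; equals 1 below the scale xm."""
    return 1.0 if x < xm else (xm / x) ** alpha

def pareto_pdf(x, xm, alpha):
    """Density alpha * xm**alpha / x**(alpha + 1) on x >= xm, zero elsewhere."""
    return 0.0 if x < xm else alpha * xm ** alpha / x ** (alpha + 1)

def pareto_quantile(p, xm, alpha):
    """Solve 1 - (xm / x)**alpha = p for x, with 0 <= p < 1."""
    return xm * (1.0 - p) ** (-1.0 / alpha)

def pareto_sample(n, xm, alpha, rng):
    """Inverse-transform sampling: feed uniform draws into the quantile function."""
    return [pareto_quantile(rng.random(), xm, alpha) for _ in range(n)]

rng = random.Random(0)
draws = pareto_sample(100_000, xm=2.0, alpha=1.5, rng=rng)
empirical = sum(d > 10.0 for d in draws) / len(draws)
# The empirical exceedance frequency should be close to (2/10)**1.5.
print(empirical, pareto_survival(10.0, 2.0, 1.5))
```

Every draw is at least $x_m$ , which reflects the left-bounded support described above.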

Why This Matters

The Pareto is the canonical heavy-tailed distribution in applied work for three reasons.

It is the limiting tail. A consequence of the Pickands-Balkema-de Haan theorem in extreme-value theory is that exceedances of a high threshold from any distribution in the Fréchet domain of attraction (i.e. with a regularly varying tail) converge to a Generalized Pareto. The Pareto Type II is the natural parametric model for threshold exceedances when the tail is power-law.
It separates the three asymptotic regimes. The sample mean of iid Pareto samples follows three distinct asymptotic laws depending on $\alpha$ . Small $\alpha$ breaks the law of large numbers; intermediate $\alpha$ admits the law of large numbers but breaks the classical central limit theorem; large $\alpha$ admits both in the usual form. The Pareto is the cleanest distribution to use as a stress test for any sample-mean-based estimator.
It is a useful baseline for tail-aware decisions. Wealth, city sizes, file sizes, network degree, insurance severity above a threshold, and earthquake energies (magnitude is a logarithmic scale of energy) are all power-law-shaped over significant ranges. Reporting only a sample mean for such data is misleading; the right summary is the tail index and a quantile, both of which the Pareto parameterizes directly.

The 80/20 principle ("80 percent of the wealth is held by 20 percent of the people") is a specific case of the Pareto distribution. For $\alpha > 1$ , the share of the total held by the top fraction $p$ of the population is $p^{1 - 1/\alpha}$ , so the 80/20 split requires $0.2^{1 - 1/\alpha} = 0.8$ . Solving for $\alpha$ gives $\alpha = \ln 5 / \ln 4 \approx 1.16$ . Other splits (90/10, 70/30) correspond to other values of $\alpha$ . The "rule" is a shorthand for a single point on a continuum, not a universal law.
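The 80/20 calculation can be sketched numerically. The sketch assumes the standard Lorenz-curve result that, for a Pareto Type I with $\alpha > 1$ , the top fraction $p$ of the population holds a share $p^{1 - 1/\alpha}$ of the total; the helper names are illustrative.

```python
import math

def top_share(p, alpha):
    """Share of the total held by the top fraction p (requires alpha > 1)."""
    return p ** (1.0 - 1.0 / alpha)

def alpha_for_split(p, share):
    """Tail index at which the top fraction p holds the given share.

    Obtained by solving p**(1 - 1/alpha) = share for alpha.
    """
    return math.log(p) / (math.log(p) - math.log(share))

for p, share in [(0.2, 0.8), (0.1, 0.9), (0.3, 0.7)]:
    a = alpha_for_split(p, share)
    print(f"top {p:.0%} holds {share:.0%}: alpha = {a:.3f}")
```

For the 80/20 split the function returns $\ln 5 / \ln 4$ ; the other two splits give different tail indices, which is the sense in which the rule is one point on a continuum.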

Survival, Mean, Variance

Theorem

Pareto Survival, Mean, and Variance

Statement

The survival function is $\mathbb{P}(X > x) = (x_m / x)^\alpha$ for $x \geq x_m$ . The $k$ -th moment exists if and only if $\alpha > k$ , in which case $\mathbb{E}[X^k] = \frac{\alpha\, x_m^k}{\alpha - k}.$ Specializing to $k = 1$ and $k = 2$ : $\mathbb{E}[X] = \frac{\alpha\, x_m}{\alpha - 1} \text{ for } \alpha > 1, \quad \operatorname{Var}(X) = \frac{\alpha\, x_m^2}{(\alpha - 1)^2 (\alpha - 2)} \text{ for } \alpha > 2.$ For $\alpha \leq 1$ the mean is infinite; for $1 < \alpha \leq 2$ the mean is finite but the variance is infinite.

Intuition

The integral defining $\mathbb{E}[X^k]$ converges at infinity if and only if $x^k \cdot x^{-(\alpha + 1)} = x^{k - \alpha - 1}$ has an integrable tail, i.e. $k - \alpha - 1 < -1$ , equivalently $\alpha > k$ . Below the threshold, the integral diverges, and the moment is infinite. Above the threshold, the integral is elementary.

Proof Sketch

For $\alpha > k$ , $\mathbb{E}[X^k] = \int_{x_m}^\infty x^k \cdot \alpha\, x_m^\alpha / x^{\alpha + 1}\, dx = \alpha\, x_m^\alpha \int_{x_m}^\infty x^{k - \alpha - 1}\, dx$ . The integral evaluates to $x_m^{k - \alpha}/(\alpha - k)$ , giving $\mathbb{E}[X^k] = \alpha\, x_m^k / (\alpha - k)$ . For $\alpha \leq k$ the integrand has a non-integrable tail and the moment is infinite.
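The moment formula and its existence threshold translate directly into code. The following is a small sketch with illustrative function names; it returns infinity below the threshold rather than raising an error, which is a design choice made here for convenience.

```python
import math

def pareto_moment(k, xm, alpha):
    """E[X**k] = alpha * xm**k / (alpha - k) when alpha > k, else infinity."""
    if alpha <= k:
        return math.inf
    return alpha * xm ** k / (alpha - k)

def pareto_variance(xm, alpha):
    """Var(X) = E[X**2] - E[X]**2, infinite whenever alpha <= 2."""
    if alpha <= 2:
        return math.inf
    return pareto_moment(2, xm, alpha) - pareto_moment(1, xm, alpha) ** 2

print(pareto_moment(1, 1.0, 3.0), pareto_variance(1.0, 3.0))  # finite mean and variance
print(pareto_moment(1, 1.0, 1.5), pareto_variance(1.0, 1.5))  # finite mean, infinite variance
print(pareto_moment(1, 1.0, 0.8))                             # infinite mean
```

Computing the variance as $\mathbb{E}[X^2] - \mathbb{E}[X]^2$ agrees with the closed form $\alpha\, x_m^2 / ((\alpha - 1)^2 (\alpha - 2))$ in the statement.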

Why It Matters

The thresholds for moment existence are the central organizing principle for working with the Pareto. A statement of the form "estimate the mean of $X$ " requires $\alpha > 1$ ; otherwise the sample mean does not estimate any well-defined population quantity. A statement involving the standard error of the sample mean requires $\alpha > 2$ ; otherwise the classical CLT-based standard error is infinite and a different asymptotic framework is needed.

Failure Mode

Software libraries differ on which $\alpha$ they call the "shape": some use the survival exponent (our $\alpha$ ), others use $\alpha + 1$ (the density exponent), others use $1/\alpha$ . Convert before plugging in. The same warning applies to academic papers: empirical-finance papers sometimes report tail exponents that differ by 1 from the parameter used by classical statistics texts.
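A short sketch of the conversions mentioned above, with hypothetical helper names. It also uses one concrete convention that can be checked: Python's standard-library `random.paretovariate(alpha)` takes the survival exponent and has minimum value 1, so a draw with scale $x_m$ is obtained by multiplying by $x_m$ .

```python
import random

def alpha_from_density_exponent(gamma):
    """If the density decays like x**(-gamma), the survival exponent is gamma - 1."""
    return gamma - 1.0

def alpha_from_inverse_shape(xi):
    """If a source reports xi = 1 / alpha, invert it to recover alpha."""
    return 1.0 / xi

# A source quoting density exponent 3.5 corresponds to alpha = 2.5 in this page's notation.
alpha = alpha_from_density_exponent(3.5)
xm = 2.0

rng = random.Random(1)
draws = [xm * rng.paretovariate(alpha) for _ in range(100_000)]
frac_above = sum(d > 4.0 for d in draws) / len(draws)
# Empirical exceedance frequency versus the survival function (xm / x)**alpha at x = 4.
print(frac_above, (xm / 4.0) ** alpha)
```

Checking a simulated exceedance probability against $(x_m/x)^\alpha$ , as done here, is a quick way to confirm which convention a given routine uses.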

Three Regimes for LLN and CLT

Theorem

LLN and CLT Regimes for Iid Pareto Samples

Statement

Let $\bar X_n = (1/n)\sum_{i=1}^n X_i$ and $S_n = \sum_{i=1}^n X_i$ .

Regime A ( $\alpha \leq 1$ ). The mean is infinite. $\bar X_n \to \infty$ almost surely. Neither the law of large numbers nor the standard central limit theorem applies. Under suitable centering and scaling, $S_n$ has a stable-law limit with index $\alpha$ .
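Regime B ( $1 < \alpha \leq 2$ ). The mean $\mu = \alpha\, x_m / (\alpha - 1)$ is finite. The variance is infinite. The law of large numbers holds: $\bar X_n \to \mu$ almost surely (Kolmogorov's strong law needs only a finite mean). The classical central limit theorem fails. For $1 < \alpha < 2$ , $(S_n - n\mu) / n^{1/\alpha}$ converges in distribution to a stable law with index $\alpha$ ; at the boundary $\alpha = 2$ the limit is Normal, but the normalization is of order $\sqrt{n \ln n}$ rather than $\sqrt n$ .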
Regime C ( $\alpha > 2$ ). Both the mean and the variance are finite. Standard law of large numbers and classical central limit theorem apply: $\bar X_n \to \mu$ almost surely and $\sqrt{n}(\bar X_n - \mu) \xrightarrow{d} N(0, \sigma^2)$ .
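Dividing the Regime B limit by $n$ makes the convergence rate of the sample mean explicit. The display below is only a rearrangement of the statement above for $1 < \alpha < 2$ , with $Z$ denoting the stable limit (a symbol introduced here for illustration):

$\frac{S_n - n\mu}{n^{1/\alpha}} = n^{1 - 1/\alpha}\,(\bar X_n - \mu) \xrightarrow{d} Z \quad \Longrightarrow \quad \bar X_n - \mu = O_p\!\left(n^{-(1 - 1/\alpha)}\right).$

For $\alpha = 1.5$ the exponent is $-1/3$ , compared with $-1/2$ in Regime C.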

Intuition

The classical CLT requires finite variance; the law of large numbers requires only finite mean. Pareto $\alpha$ controls both thresholds simultaneously. The boundary $\alpha = 2$ separates Normal limits from stable limits; the boundary $\alpha = 1$ separates law-of-large-numbers behavior from no-law-of-large-numbers behavior.

Proof Sketch

The mean condition $\mathbb{E}[X] < \infty$ requires $\alpha > 1$ . The variance condition $\operatorname{Var}(X) < \infty$ requires $\alpha > 2$ . Kolmogorov's strong law for iid sequences needs only a finite mean, so the sample mean converges almost surely to the population mean in both Regime B and Regime C (Khintchine's weak law gives convergence in probability under the same condition). With finite variance as well, the Lindeberg-Lévy CLT applies. Generalized CLT theory (Gnedenko-Kolmogorov; see Feller volume 2, chapter 17) gives stable-law limits for centered partial sums whenever the tail is regularly varying with index $\alpha < 2$ , which covers the Pareto case.

Why It Matters

The regime boundary at $\alpha = 2$ is the most consequential. Confidence intervals for the sample mean, $t$ -tests, $z$ -tests, and every standard-error calculation rely on the finite-variance CLT. When data is Pareto with $\alpha \leq 2$ , these procedures are miscalibrated: for $1 < \alpha < 2$ the true error of the sample mean shrinks like $n^{-(1 - 1/\alpha)}$ , slower than the $n^{-1/2}$ the formulas assume, and the coverage probabilities are uncontrolled in finite samples.

Failure Mode

The "median is more reliable than the mean for heavy-tailed data" advice is correct for $\alpha \leq 1$ , where no finite mean exists, but the median estimates a different population quantity and has its own bias-variance properties. For $1 < \alpha \leq 2$ , the mean is well defined and the sample mean converges; the problem is the slow convergence rate ( $n^{-(1 - 1/\alpha)}$ for $\alpha < 2$ ), not existence.

See also lln-failures-heavy-tails for the diagnostic plots that detect each regime from data.

Worked Example: Three Tail Indices

Consider Pareto Type I samples with $x_m = 1$ and three shape values $\alpha = 0.8, 1.5, 3.0$ .

For $\alpha = 0.8$ (Regime A), $\mathbb{E}[X] = \infty$ . A simulation of $n = 10^6$ iid samples produces a sample mean that drifts upward with $n$ and depends sensitively on the largest observation. The median is well defined: $q_{0.5} = 1 \cdot 0.5^{-1/0.8} = 2^{1/0.8} \approx 2.378$ .

For $\alpha = 1.5$ (Regime B), $\mathbb{E}[X] = 1.5 / 0.5 = 3$ . The sample mean converges to 3 almost surely, but its error shrinks like $n^{-(1 - 1/1.5)} = n^{-1/3}$ , slower than $n^{-1/2}$ . Standard errors computed from the sample variance are meaningless because the population variance is infinite.

For $\alpha = 3.0$ (Regime C), $\mathbb{E}[X] = 3 \cdot 1 / 2 = 1.5$ and $\operatorname{Var}(X) = 3 / (4 \cdot 1) = 0.75$ . Sample mean converges at the standard $n^{-1/2}$ rate, and $\sqrt{n}(\bar X_n - 1.5) \xrightarrow{d} N(0, 0.75)$ . Confidence intervals are conventional.

Across the three regimes, the population median is always finite: $q_{0.5}(\alpha) = x_m \cdot 2^{1/\alpha}$ , equal to $1.587$ for $\alpha = 1.5$ and $1.260$ for $\alpha = 3$ . The median is a stable summary even when the mean is not.
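The worked example can be reproduced with a short simulation. This is a sketch using only the standard library; the sample size, the seed, and the `simulate` helper are choices made here, and the printed sample means will differ from run to run if the seed is changed, most noticeably for $\alpha = 0.8$ .

```python
import random
import statistics

def simulate(alpha, n, seed):
    """Draw n Pareto Type I samples with x_m = 1; return (sample mean, sample median)."""
    rng = random.Random(seed)
    draws = [rng.paretovariate(alpha) for _ in range(n)]
    return statistics.fmean(draws), statistics.median(draws)

results = {}
for alpha in (0.8, 1.5, 3.0):
    mean, median = simulate(alpha, n=200_000, seed=42)
    results[alpha] = (mean, median)
    print(f"alpha={alpha}: mean={mean:.3f}, median={median:.3f}, "
          f"population median={2 ** (1 / alpha):.3f}")
```

In each regime the sample median should sit close to $2^{1/\alpha}$ , while only the $\alpha = 3$ sample mean can be expected to be tightly concentrated around its population value of 1.5.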

Common Misconceptions

Watch Out

Pareto with alpha at most 1 has no finite mean

The sample mean of Pareto data with $\alpha \leq 1$ diverges to infinity almost surely. Reporting a sample mean from such data is meaningless; the population quantity does not exist. Use the median or a quantile-based summary instead.

Watch Out

The 80/20 rule is a single point, not a universal property

The "80/20 rule" corresponds to a Pareto with $\alpha$ near $1.16$ . Other splits (90/10, 70/30) correspond to other values of $\alpha$ . The split is a one-parameter shorthand, not a separate empirical regularity. Quoting "the 80/20 rule applies" to a data set without computing $\alpha$ is a common error.

Watch Out

A power-law tail and a power-law density are not the same statement

The Pareto Type I has density $f(x) = \alpha\, x_m^\alpha / x^{\alpha + 1}$ , an exponent of $\alpha + 1$ in the density. The survival function has exponent $\alpha$ . Papers sometimes report the density exponent and label it $\alpha$ ; others report the survival exponent and use the same symbol. The two differ by 1. Always check which is meant.

Watch Out

Estimating alpha from a log-log plot is biased

Plotting $\log \mathbb{P}(X > x)$ against $\log x$ and reading off the slope is a quick visual check, not a valid estimator. The slope estimator has systematic bias, and the empirical survival function for the largest order statistics has substantial sampling variability. Use Hill's estimator or a maximum-likelihood fit above a chosen threshold; quantify the threshold sensitivity.
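A minimal sketch of Hill's estimator, assuming the common convention that the threshold is the $(k+1)$ -th largest observation and the estimate is $k$ divided by the sum of log-excesses of the $k$ largest values. The function name and the choices of $k$ are illustrative.

```python
import math
import random

def hill_estimator(data, k):
    """Hill estimate of the tail index from the k largest observations.

    The (k+1)-th largest value serves as the threshold.
    """
    ordered = sorted(data, reverse=True)
    threshold = ordered[k]
    log_excess = sum(math.log(x / threshold) for x in ordered[:k])
    return k / log_excess

rng = random.Random(7)
sample = [rng.paretovariate(1.5) for _ in range(50_000)]  # exact Pareto, alpha = 1.5

# Report the estimate at several thresholds to expose threshold sensitivity.
for k in (500, 2_000, 8_000):
    print(k, round(hill_estimator(sample, k), 3))
```

On exact Pareto data the estimates are stable across $k$ ; on real data, where only the far tail is power-law, the estimate typically moves with $k$ , which is why the threshold sensitivity should be reported.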

Comparison: Pareto vs Exponential vs Lognormal

The three nonnegative right-skewed distributions form a useful tail-weight ladder.

Exponential. Tail decays as $e^{-\lambda x}$ . Light-tailed; all moments exist; standard LLN and CLT.
Lognormal. Tail decays more slowly than any exponential but faster than any power. All moments exist; LLN and CLT hold; but tails are heavier than Exponential and the conditional mean excess grows with the threshold $u$ , roughly like $u / \ln u$ .
Pareto. Tail decays polynomially as $x^{-\alpha}$ . The $k$ -th moment exists only for $k < \alpha$ ; the LLN requires $\alpha > 1$ and the classical CLT requires $\alpha > 2$ .

Discriminating between these on data is the work of the mean-excess plot and the log-log survival plot. Pareto data with $\alpha > 1$ shows a mean-excess plot that rises linearly with the threshold, with slope $1/(\alpha - 1)$ ; Exponential data shows a horizontal mean-excess plot at every level; Lognormal data shows a curved mean-excess plot that eventually rises, but more slowly than linearly.
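The empirical mean-excess function is simple to compute. The sketch below compares it on simulated Pareto and Exponential data; it relies on the standard results that the Pareto Type I mean excess is $e(u) = u/(\alpha - 1)$ for $\alpha > 1$ and $u \geq x_m$ , and that the Exponential mean excess is constant at $1/\lambda$ . The function name, seed, and thresholds are illustrative.

```python
import random

def mean_excess(data, u):
    """Empirical mean excess e(u): average of (x - u) over observations x > u."""
    excesses = [x - u for x in data if x > u]
    return sum(excesses) / len(excesses)

rng = random.Random(3)
pareto = [rng.paretovariate(3.0) for _ in range(200_000)]  # x_m = 1, alpha = 3
expo = [rng.expovariate(1.0) for _ in range(200_000)]      # rate 1

for u in (1.5, 2.0, 3.0):
    # Theory: Pareto e(u) = u / (alpha - 1) = u / 2; Exponential e(u) = 1 / rate = 1.
    print(u, round(mean_excess(pareto, u), 3), round(mean_excess(expo, u), 3))
```

The Pareto column should grow roughly in proportion to $u$ while the Exponential column stays near 1; estimates at high thresholds rest on few observations and are correspondingly noisy.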

For the severity-modeling perspective on the Pareto, including peaks-over-threshold fitting and connections to the Generalized Pareto, see ActuaryPath's Pareto page at https://www.actuarypath.com/concepts/pareto-distribution/.

Maximum-Likelihood Estimator

For an iid Pareto Type I sample with known $x_m$ , the MLE of $\alpha$ is

$\widehat\alpha = \frac{n}{\sum_{i=1}^{n} \ln(X_i / x_m)}.$

This is the inverse of the average log-excess and is a special case of Hill's estimator. The MLE is consistent and asymptotically Normal with variance $\alpha^2 / n$ when $x_m$ is known. When $x_m$ is unknown, $\widehat x_m = \min_i X_i$ is the MLE and the MLE for $\alpha$ uses the same formula with the sample minimum.
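The estimator can be sketched in a few lines. The function below is illustrative, not a library routine: it uses the sample minimum for $x_m$ when no value is supplied, and the standard error shown is the plug-in value from the asymptotic variance $\alpha^2 / n$ .

```python
import math
import random

def pareto_mle(data, xm=None):
    """MLE of (x_m, alpha); if xm is None, the sample minimum is used for x_m."""
    if xm is None:
        xm = min(data)
    alpha_hat = len(data) / sum(math.log(x / xm) for x in data)
    return xm, alpha_hat

rng = random.Random(11)
true_xm, true_alpha = 2.0, 1.5
sample = [true_xm * rng.paretovariate(true_alpha) for _ in range(50_000)]

xm_hat, alpha_hat = pareto_mle(sample)
se = alpha_hat / math.sqrt(len(sample))  # plug-in standard error from alpha**2 / n
print(xm_hat, alpha_hat, se)
```

With 50,000 observations both estimates should land close to the true values of 2.0 and 1.5, and the estimated $x_m$ can never fall below the true scale.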

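Both MLEs are biased in finite samples: $\widehat\alpha$ overestimates $\alpha$ on average, and the sample minimum always lies at or above $x_m$ , with the relative bias of the minimum largest when $\alpha$ is small. The Hill estimator has known finite-sample bias documented in classical extreme-value theory references.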
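With $x_m$ known, the finite-sample bias of $\widehat\alpha$ can be worked out exactly. Each $\ln(X_i / x_m)$ is Exponential with rate $\alpha$ , so the sum $T = \sum_{i=1}^{n} \ln(X_i / x_m)$ is Gamma with shape $n$ and rate $\alpha$ (the symbol $T$ is introduced here for the derivation), and $\mathbb{E}[1/T] = \alpha / (n - 1)$ for $n \geq 2$ . Hence

$\mathbb{E}[\widehat\alpha] = n\, \mathbb{E}[1/T] = \frac{n\, \alpha}{n - 1}, \qquad \mathbb{E}[\widehat\alpha] - \alpha = \frac{\alpha}{n - 1}.$

Multiplying $\widehat\alpha$ by $(n - 1)/n$ removes this bias in the known- $x_m$ case.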

Exercises

Exercise (Core)

Problem

A power-law model for the size distribution of files on a server has $x_m = 1$ KB and tail index $\alpha = 2$ . Compute the median file size, the mean file size, and the probability that a file exceeds 100 KB.

Exercise (Core)

Problem

A Pareto Type I has $x_m = 100$ and $\alpha = 3$ . Find the 95th and 99th percentiles, and the conditional expectation given exceedance of 1000.

Exercise (Core)

Problem

Suppose $X \sim \operatorname{Pareto}(x_m = 1, \alpha = 1.5)$ . Compute $\mathbb{P}(X > 10)$ and $\mathbb{E}[X]$ , and explain why the sample variance from any iid sample is uninformative.

Exercise (Advanced)

Problem

Derive the maximum-likelihood estimator of $\alpha$ from an iid Pareto Type I sample with known $x_m$ .

Exercise (Advanced)

Problem

Show that if $X \sim \operatorname{Pareto}(x_m, \alpha)$ , then $Y = \ln(X / x_m)$ is $\operatorname{Exponential}(\alpha)$ .

Exercise (Advanced)

Problem

Find the value of $\alpha$ for which the Pareto Type I satisfies the "80/20" property: the top 20 percent of the population holds 80 percent of the total wealth.

References

Casella, G., and Berger, R. L. (2002). Statistical Inference, 2nd ed., Duxbury. Section 3.3 includes the Pareto in the catalog of continuous distributions; chapter 5 covers asymptotic theory and the conditions under which the CLT applies.
Blitzstein, J. K., and Hwang, J. (2019). Introduction to Probability, 2nd ed., Chapman and Hall / CRC. Chapter 6 has worked examples on Pareto wealth distributions and the LLN failure.
For peaks-over-threshold fitting, Generalized Pareto modeling, and the actuarial-severity perspective, see ActuaryPath's Pareto page at https://www.actuarypath.com/concepts/pareto-distribution/ and Klugman, Panjer, Willmot (2019), Loss Models, 5th ed., Wiley, Chapter 5.
For the stable-law limit theorems referenced in Regime B, Feller, W. (1971). An Introduction to Probability Theory and Its Applications, Volume 2, 2nd ed., Wiley. Chapter 17 covers stable laws and generalized central limit theorems.

Last reviewed: May 12, 2026
