
Probability

Extreme Value Theory

The mathematics of maxima and rare events. The Fisher-Tippett-Gnedenko theorem, the three extreme value distributions (Gumbel, Frechet, Weibull), peaks-over-threshold, and applications to tail risk and model evaluation.


Why This Matters

The law of large numbers and the central limit theorem describe the behavior of sums and averages. Extreme value theory (EVT) describes the behavior of maxima and minima. This is a different question with different answers and different applications.

When you evaluate a machine learning model, you often care about worst-case performance: the maximum prediction error, the largest loss on any input, the most confident wrong prediction. When you assess risk, you care about the worst outcome: the largest portfolio loss, the tallest wave, the strongest earthquake. When you use best-of-N sampling in language models, you care about the maximum quality score across $N$ samples.

EVT provides the mathematical tools for these problems. Just as the CLT says that properly normalized sums converge to a Gaussian, the Fisher-Tippett-Gnedenko theorem says that properly normalized maxima converge to one of exactly three distributions. Which one depends on the tail behavior of the underlying distribution.

Mental Model

Draw $n$ i.i.d. samples from some distribution $F$ and record the maximum $M_n = \max(X_1, \ldots, X_n)$. As $n$ grows, $M_n$ increases. The question is: after proper centering and scaling, does the distribution of $M_n$ converge to something universal?

The answer is yes, and the limit depends on how the tail of FF decays:

  • Exponential-type tails (Gaussian, exponential): the maximum grows logarithmically, and the limit is the Gumbel distribution
  • Polynomial tails (fat-tailed, Pareto): the maximum grows as a power of $n$, and the limit is the Frechet distribution
  • Bounded support (uniform, beta): the maximum approaches the upper bound, and the limit is the Weibull distribution
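These three growth rates are easy to see in simulation. The sketch below (assuming NumPy is available; the trial counts and sample sizes are illustrative choices) averages the sample maximum over repeated trials for a Gaussian, a Pareto with $\alpha = 2$, and a uniform distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_max(sampler, n, trials=50):
    """Average the sample maximum of n draws over repeated trials."""
    return float(np.mean([sampler(n).max() for _ in range(trials)]))

ns = [100, 1_000, 10_000, 100_000]
# Gumbel domain: the maximum grows like sqrt(2 log n).
gauss = [mean_max(rng.standard_normal, n) for n in ns]
# Frechet domain: Pareto(alpha=2, x_m=1) maximum grows like sqrt(n).
pareto = [mean_max(lambda n: rng.pareto(2.0, n) + 1.0, n) for n in ns]
# Weibull domain: the uniform maximum approaches the endpoint 1.
unif = [mean_max(lambda n: rng.uniform(0.0, 1.0, n), n) for n in ns]

for n, g, p, u in zip(ns, gauss, pareto, unif):
    print(f"n={n:>6}  gauss={g:5.2f}  pareto={p:9.1f}  uniform={u:.5f}")
```

Multiplying $n$ by 1000 moves the Gaussian maximum up by only about one unit, multiplies the Pareto maximum by roughly 30, and pushes the uniform maximum imperceptibly closer to 1.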

Core Definitions

Definition

Generalized Extreme Value Distribution

The generalized extreme value (GEV) distribution unifies the three extreme value types into a single family parameterized by location $\mu \in \mathbb{R}$, scale $\sigma > 0$, and shape $\xi \in \mathbb{R}$:

$$G(x) = \exp\left\{-\left[1 + \xi\left(\frac{x - \mu}{\sigma}\right)\right]^{-1/\xi}\right\}$$

defined on $\{x : 1 + \xi(x - \mu)/\sigma > 0\}$.

The three cases:

  • $\xi > 0$: Frechet type (heavy right tail, polynomial decay)
  • $\xi = 0$: Gumbel type (light tail, exponential-type decay). The formula is interpreted as the limit $\xi \to 0$: $G(x) = \exp(-e^{-(x-\mu)/\sigma})$
  • $\xi < 0$: Weibull type (finite right endpoint at $\mu - \sigma/\xi$)

The parameter $\xi$ is called the extreme value index or shape parameter. It determines the tail behavior of the maximum distribution.
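As a quick numerical check, SciPy's `genextreme` implements the GEV; note that SciPy parameterizes the shape as $c = -\xi$, with the sign flipped relative to the convention above. A minimal sketch:

```python
import numpy as np
from scipy.stats import genextreme

xi, mu, sigma = 0.5, 0.0, 1.0                 # Frechet-type GEV (xi > 0)
gev = genextreme(c=-xi, loc=mu, scale=sigma)  # SciPy's shape is c = -xi

# The CDF matches the closed form G(x) = exp(-[1 + xi*(x - mu)/sigma]^(-1/xi)).
x = 2.0
closed_form = np.exp(-(1 + xi * (x - mu) / sigma) ** (-1 / xi))
print(gev.cdf(x), closed_form)  # both ~0.7788

# xi < 0 gives a finite right endpoint at mu - sigma/xi.
weibull_type = genextreme(c=0.5, loc=0.0, scale=1.0)  # xi = -0.5, endpoint at 2
print(weibull_type.cdf(2.0))    # 1.0: no mass beyond the endpoint
```

The sign flip is a frequent source of bugs when moving between EVT references and SciPy code.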

Definition

Gumbel Distribution

The Gumbel distribution (Type I extreme value) has CDF:

$$G(x) = \exp(-e^{-(x - \mu)/\sigma}), \quad x \in \mathbb{R}$$

with mean $\mu + \gamma\sigma$ (where $\gamma \approx 0.5772$ is the Euler-Mascheroni constant) and variance $\pi^2\sigma^2/6$.

The Gumbel distribution arises as the limit for maxima of distributions with exponential-type tails: Gaussian, exponential, gamma, log-normal.
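The moment formulas can be verified against SciPy's `gumbel_r` (the right-skewed Gumbel); a minimal sketch with arbitrary example parameters:

```python
import numpy as np
from scipy.stats import gumbel_r

mu, sigma = 1.0, 2.0
dist = gumbel_r(loc=mu, scale=sigma)

gamma = 0.5772156649015329  # Euler-Mascheroni constant

# Mean mu + gamma*sigma and variance pi^2 sigma^2 / 6, as stated above.
print(dist.mean(), mu + gamma * sigma)       # both ~2.1544
print(dist.var(), np.pi**2 * sigma**2 / 6)   # both ~6.5797
```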

Definition

Frechet Distribution

The Frechet distribution (Type II extreme value) has CDF:

$$\Phi_\alpha(x) = \begin{cases} 0 & x \leq 0 \\ \exp(-x^{-\alpha}) & x > 0 \end{cases}$$

for $\alpha > 0$. This corresponds to GEV with $\xi = 1/\alpha > 0$.

The Frechet distribution has a polynomial right tail: $1 - \Phi_\alpha(x) \sim x^{-\alpha}$ for large $x$. It arises as the limit for maxima of fat-tailed distributions with tail index $\alpha$.

Definition

Reversed Weibull Distribution

The reversed Weibull distribution (Type III extreme value) has CDF:

$$\Psi_\alpha(x) = \begin{cases} \exp(-(-x)^\alpha) & x \leq 0 \\ 1 & x > 0 \end{cases}$$

for $\alpha > 0$. This corresponds to GEV with $\xi = -1/\alpha < 0$.

It arises as the limit for maxima of distributions with a finite right endpoint. The uniform and beta distributions are in this domain of attraction.

Definition

Domain of Attraction

A distribution $F$ belongs to the domain of attraction of an extreme value distribution $G$, written $F \in D(G)$, if there exist normalizing sequences $a_n > 0$ and $b_n$ such that:

$$P\left(\frac{M_n - b_n}{a_n} \leq x\right) = F(a_n x + b_n)^n \to G(x)$$

as $n \to \infty$.

Not every distribution belongs to some domain of attraction, but most distributions encountered in practice do.

Definition

Generalized Pareto Distribution

The generalized Pareto distribution (GPD) has CDF:

$$H(x) = 1 - \left(1 + \frac{\xi x}{\sigma}\right)^{-1/\xi}$$

for $x \geq 0$ when $\xi \geq 0$, and $0 \leq x \leq -\sigma/\xi$ when $\xi < 0$. The case $\xi = 0$ is interpreted as the exponential: $H(x) = 1 - e^{-x/\sigma}$.

The GPD is the natural distribution for exceedances over a threshold. If $F$ is in the domain of attraction of a GEV distribution, then the exceedances $X - u$ given $X > u$ approximately follow a GPD as $u$ approaches the right endpoint.
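SciPy's `genpareto` uses the same shape convention as the GPD above ($c = \xi$). The sketch below checks the CDF against the closed form and illustrates the exact threshold-stability identity: exceedances of a GPD above a higher threshold $v$ are again GPD with the same shape and rescaled scale $\sigma + \xi v$ (the specific parameter values are arbitrary):

```python
import numpy as np
from scipy.stats import genpareto

xi, sigma = 0.3, 1.0
gpd = genpareto(c=xi, scale=sigma)  # SciPy's shape c is the GPD shape xi

# CDF matches H(x) = 1 - (1 + xi*x/sigma)^(-1/xi).
x = 2.0
closed_form = 1 - (1 + xi * x / sigma) ** (-1 / xi)
print(gpd.cdf(x), closed_form)

# Threshold stability: X - v | X > v is GPD with shape xi, scale sigma + xi*v.
v, y = 1.5, 0.8
lhs = gpd.sf(v + y) / gpd.sf(v)                    # P(X - v > y | X > v)
rhs = genpareto(c=xi, scale=sigma + xi * v).sf(y)  # shifted-threshold GPD
print(lhs, rhs)  # identical up to floating point
```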

Main Theorems

Theorem

Fisher-Tippett-Gnedenko Theorem

Statement

Let $X_1, X_2, \ldots$ be i.i.d. random variables with distribution $F$, and let $M_n = \max(X_1, \ldots, X_n)$. If there exist sequences $a_n > 0$ and $b_n \in \mathbb{R}$ such that:

$$\frac{M_n - b_n}{a_n} \xrightarrow{d} G$$

for some non-degenerate distribution $G$, then $G$ must be a generalized extreme value distribution $\text{GEV}(\mu, \sigma, \xi)$ for some $\mu, \sigma, \xi$.

Equivalently, GG must be one of the three types:

  • Type I (Gumbel): $\xi = 0$, tails decay exponentially
  • Type II (Frechet): $\xi > 0$, polynomial right tail
  • Type III (Reversed Weibull): $\xi < 0$, finite right endpoint

The normalizing sequences are:

  • Frechet domain ($F \in D(\Phi_\alpha)$): $b_n = 0$, $a_n = F^{-1}(1 - 1/n)$, which grows as $n^{1/\alpha}$
  • Gumbel domain ($F \in D(\Lambda)$): $a_n$ and $b_n$ depend on the specific $F$ (e.g., for Gaussian: $b_n \sim \sqrt{2 \log n}$, $a_n \sim 1/\sqrt{2 \log n}$)
  • Weibull domain ($F \in D(\Psi_\alpha)$): $b_n = x^* = \sup\{x : F(x) < 1\}$, $a_n$ depends on the behavior of $F$ near $x^*$

Intuition

This theorem is the "CLT for maxima." Just as the CLT says that normalized sums of i.i.d. variables can only converge to a Gaussian (under finite variance), the Fisher-Tippett theorem says that normalized maxima can only converge to one of three types. The type is determined entirely by the tail behavior of the underlying distribution.

The key distinction: fat-tailed distributions produce Frechet limits (the maximum grows as a power of $n$), exponential-type tails produce Gumbel limits (the maximum grows logarithmically), and bounded distributions produce reversed Weibull limits (the maximum approaches the upper bound).

Proof Sketch

The proof uses the max-stability property. If $G$ is a non-degenerate limit of $F^n(a_n x + b_n)$, then $G$ must be max-stable: $G^n(c_n x + d_n) = G(x)$ for some sequences $c_n, d_n$. Taking logarithms: $n \log G(c_n x + d_n) = \log G(x)$. Since $\log G$ is a negative function that is zero at infinity, the functional equation constrains $\log G$ to be of the form $-(1 + \xi x)^{-1/\xi}$ (up to location and scale). The three types correspond to $\xi > 0$, $\xi = 0$, and $\xi < 0$. The characterization of domains of attraction uses Karamata's theory of regular variation (for the Frechet case) and the von Mises conditions (for all cases).

Why It Matters

The theorem tells you that regardless of the underlying distribution, the maximum of a large sample behaves in one of three ways. This enables:

  • Tail risk assessment: estimate the probability of events more extreme than any observed, by fitting a GEV to block maxima
  • Engineering reliability: design structures to withstand the maximum load over a 100-year period, using only 30 years of data
  • ML model evaluation: understand the distribution of worst-case errors across test inputs

The shape parameter $\xi$ is the single most important quantity: it determines whether extremes are bounded ($\xi < 0$), grow logarithmically ($\xi = 0$), or grow polynomially ($\xi > 0$).

Failure Mode

The theorem requires the existence of normalizing sequences. Some distributions are not in any domain of attraction (e.g., discrete distributions with irregular support can fail). The theorem is also asymptotic: for finite $n$, the GEV approximation may be poor. The rate of convergence to the GEV limit is often much slower than CLT convergence to the Gaussian, especially in the Gumbel domain (convergence is logarithmic in $n$).

Theorem

Pickands-Balkema-de Haan Theorem

Statement

If $F$ is in the domain of attraction of a GEV distribution with shape parameter $\xi$, then for a large threshold $u$, the distribution of exceedances $X - u$ given $X > u$ is approximately generalized Pareto:

$$P(X - u \leq y \mid X > u) \approx H_{\xi, \sigma(u)}(y) = 1 - \left(1 + \frac{\xi y}{\sigma(u)}\right)^{-1/\xi}$$

where $\sigma(u) > 0$ depends on $u$, and the approximation improves as $u \to x^* = \sup\{x : F(x) < 1\}$.

The shape parameter $\xi$ is the same as in the GEV limit. This connects block maxima analysis (GEV) with threshold exceedance analysis (GPD).

Intuition

Instead of looking at the maximum of a block of data, look at all observations that exceed a high threshold $u$. The distribution of how far they exceed $u$ is approximately GPD. This is more data-efficient than block maxima: instead of one maximum per block, you use all exceedances. The GPD shape parameter $\xi$ is the same as the GEV shape, so both methods estimate the same tail behavior.

Proof Sketch

Define the excess distribution $F_u(y) = P(X - u \leq y \mid X > u) = (F(u + y) - F(u))/(1 - F(u))$ for $y \geq 0$. If $F$ is in the domain of attraction of a GEV with parameter $\xi$, then $F$ satisfies a regular variation condition on its tail. This condition, combined with the relationship between $F^n$ and the GEV limit, implies that $F_u$ converges to a GPD with the same $\xi$ as $u \to x^*$. The proof uses the fact that the GPD is the only distribution with the "threshold stability" property: if $X - u \mid X > u$ follows a GPD, then $X - v \mid X > v$ also follows a GPD for $v > u$ (with adjusted scale).

Why It Matters

The Pickands-Balkema-de Haan theorem is the foundation of the peaks-over-threshold (POT) method, which is the standard approach for modeling extreme events in practice. Instead of dividing data into blocks and fitting a GEV to block maxima (which wastes data), the POT method fits a GPD to all exceedances above a threshold. This is more statistically efficient and is the dominant method in hydrology, finance, and insurance.

Failure Mode

The choice of threshold $u$ is critical. Too low, and the GPD approximation is poor (the asymptotic result has not kicked in). Too high, and there are too few exceedances for reliable estimation. Threshold selection is one of the most difficult practical aspects of EVT. Common approaches include mean residual life plots and checking the stability of parameter estimates across threshold choices.
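The mean residual life diagnostic can be sketched as follows: for a GPD tail, the mean excess over $u$ is linear in $u$ with slope $\xi/(1-\xi)$, so one looks for a threshold beyond which the plot is roughly linear. The Pareto data and the three quantile thresholds below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.pareto(3.0, 50_000) + 1.0  # Pareto(alpha=3): xi = 1/3, slope xi/(1-xi) = 0.5

def mean_excess(data, u):
    """Average exceedance over threshold u (one point of a mean residual life plot)."""
    return float((data[data > u] - u).mean())

thresholds = np.quantile(x, [0.90, 0.95, 0.99])
mrl = [mean_excess(x, u) for u in thresholds]
for u, m in zip(thresholds, mrl):
    print(f"u={u:5.2f}  mean excess={m:5.2f}")
# For Pareto(3) the exact mean excess is u/2, so the points fall near a line of slope 0.5.
```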

Domains of Attraction: Which Distributions Go Where

| Distribution | Domain | Shape $\xi$ | Why |
| --- | --- | --- | --- |
| Gaussian | Gumbel | $\xi = 0$ | Tail decays as $e^{-x^2/2}$ |
| Exponential | Gumbel | $\xi = 0$ | Tail decays as $e^{-x}$ |
| Log-normal | Gumbel | $\xi = 0$ | Tail decays faster than any power law |
| Pareto($\alpha$) | Frechet | $\xi = 1/\alpha$ | Tail decays as $x^{-\alpha}$ |
| Student-$t$($\nu$) | Frechet | $\xi = 1/\nu$ | Tail decays as $x^{-\nu}$ |
| Cauchy | Frechet | $\xi = 1$ | Tail decays as $x^{-1}$ |
| Uniform(0,1) | Rev. Weibull | $\xi = -1$ | Bounded above at 1 |
| Beta($a$,$b$) | Rev. Weibull | $\xi = -1/b$ | Bounded above at 1; density $\sim (1-x)^{b-1}$ near 1 |

Applications

Financial Tail Risk

Value-at-Risk (VaR) and Expected Shortfall (ES) estimate the probability and size of extreme losses. Standard approaches assume Gaussian returns, which drastically underestimate tail risk. EVT provides a principled alternative: fit a GPD to losses exceeding a high threshold, then extrapolate to estimate the probability of losses larger than any observed.
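A minimal POT sketch of this workflow, using simulated Student-$t$ "losses" as a stand-in for returns data (the 95th-percentile threshold is an illustrative choice, and `xi_hat`, `var_p` are hypothetical names for this example):

```python
import numpy as np
from scipy.stats import genpareto, t as student_t

rng = np.random.default_rng(0)
losses = student_t(df=4).rvs(size=100_000, random_state=rng)  # fat tail: xi = 1/4

u = np.quantile(losses, 0.95)                          # threshold (illustrative)
exceed = losses[losses > u] - u
xi_hat, _, sigma_hat = genpareto.fit(exceed, floc=0)   # GPD fit, location fixed at 0

# POT quantile formula: VaR_p = u + (sigma/xi) * [ (P(X > u)/(1 - p))^xi - 1 ]
p = 0.999
zeta_u = float((losses > u).mean())
var_p = u + (sigma_hat / xi_hat) * ((zeta_u / (1 - p)) ** xi_hat - 1)
print(xi_hat, var_p)  # xi_hat roughly 0.15-0.35; VaR near the true t_4 0.999 quantile
```

The point of the exercise: the 0.999 quantile is estimated from the fitted tail model, not read off the empirical distribution, so the same formula extrapolates beyond the largest observed loss.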

Best-of-N Sampling in LLMs

When generating $N$ candidate responses from a language model and selecting the best one according to a reward model, the quality of the selected response depends on the distribution of the maximum of $N$ scores. If scores are approximately Gaussian, the improvement from best-of-$N$ grows as $\sqrt{2 \log N}$ (Gumbel scaling). If scores are fat-tailed, the improvement grows as $N^{1/\alpha}$ (Frechet scaling), which is much faster.
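A quick simulation makes the two regimes concrete; Gaussian and Student-$t$ scores here are illustrative stand-ins for reward distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
Ns = [4, 16, 64, 256]
trials = 5_000

# Expected best-of-N score with Gaussian scores: grows like sqrt(2 log N).
gauss_best = [float(rng.standard_normal((trials, N)).max(axis=1).mean()) for N in Ns]
# With fat-tailed scores (t with 3 dof, Frechet domain, alpha = 3): grows like N^(1/3).
t_best = [float(rng.standard_t(3, (trials, N)).max(axis=1).mean()) for N in Ns]

for N, g, h in zip(Ns, gauss_best, t_best):
    print(f"N={N:>3}  gaussian={g:4.2f}  t3={h:5.2f}")
```

Both start near zero at small $N$, but the fat-tailed column pulls away quickly as $N$ grows.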

Flood Modeling and Return Levels

EVT is the standard tool for estimating "100-year floods" from shorter records. Fit a GEV to annual maximum river levels, then extrapolate to the 100-year return level: the level exceeded with probability 1/100 in any given year. The shape parameter $\xi$ determines whether the extrapolation is conservative ($\xi < 0$) or aggressive ($\xi > 0$).
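A sketch of the return-level calculation on synthetic data (hypothetical Gumbel-distributed annual maxima; recall SciPy's `genextreme` shape is $c = -\xi$):

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)
# Hypothetical record: 50 years of annual maximum river levels (Gumbel, xi = 0).
annual_max = rng.gumbel(loc=3.0, scale=0.8, size=50)

c_hat, loc_hat, scale_hat = genextreme.fit(annual_max)  # c = -xi
xi_hat = -c_hat

# 100-year return level = 0.99 quantile of the fitted GEV: the level exceeded
# with probability 1/100 in any given year.
level_100 = genextreme.ppf(0.99, c_hat, loc=loc_hat, scale=scale_hat)
print(xi_hat, level_100)
```

With only 50 block maxima, the uncertainty in $\xi$ (and hence in the return level) is substantial; in practice one reports confidence intervals, e.g. via profile likelihood.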

Model Evaluation

The worst-case error of a model across $n$ test inputs is the maximum of $n$ per-input errors. If per-input errors have an exponential-type tail, the worst-case error grows as $\log n$. If errors have a fat tail, the worst-case error grows as $n^{1/\alpha}$. This distinction matters for safety-critical applications where worst-case performance is the binding constraint.

Canonical Examples

Example

Maximum of Gaussian samples

Let $X_1, \ldots, X_n \sim \mathcal{N}(0, 1)$. The maximum $M_n$ satisfies:

$$\frac{M_n - b_n}{a_n} \xrightarrow{d} \text{Gumbel}$$

with $b_n = \sqrt{2 \log n} - \frac{\log \log n + \log(4\pi)}{2\sqrt{2 \log n}}$ and $a_n = 1/\sqrt{2 \log n}$.

For $n = 1000$: $b_n \approx 3.12$, $a_n \approx 0.27$. The expected maximum is about 3.24 (roughly $3.2\sigma$). For $n = 10^6$: $b_n \approx 4.77$, $a_n \approx 0.19$. The expected maximum is about $4.9\sigma$. The maximum grows as $\sqrt{2 \log n}$: very slowly.
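These numbers can be checked by simulation (a sketch; since convergence in the Gumbel domain is slow, some residual bias in the normalized maxima is expected at $n = 1000$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1_000, 10_000

# Normalizing constants from the example above.
b_n = np.sqrt(2 * np.log(n)) - (np.log(np.log(n)) + np.log(4 * np.pi)) / (
    2 * np.sqrt(2 * np.log(n))
)
a_n = 1 / np.sqrt(2 * np.log(n))

maxima = rng.standard_normal((trials, n)).max(axis=1)
print(b_n, a_n)              # ~3.12 and ~0.27
print(float(maxima.mean()))  # ~3.24, matching the stated expected maximum

# Normalized maxima are approximately standard Gumbel (mean gamma ~ 0.577),
# but convergence is only logarithmic in n, so a visible bias remains.
z = (maxima - b_n) / a_n
print(float(z.mean()))
```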

Example

Maximum of Pareto samples

Let $X_1, \ldots, X_n$ be i.i.d. Pareto with $\alpha = 2$, $x_m = 1$. The maximum satisfies:

$$\frac{M_n}{n^{1/\alpha}} = \frac{M_n}{\sqrt{n}} \xrightarrow{d} \text{Frechet}(\alpha = 2)$$

For $n = 1000$: the expected maximum is of order $\sqrt{1000} \approx 31.6$. For $n = 10^6$: the expected maximum is of order $\sqrt{10^6} = 1000$. The maximum grows as $n^{1/2}$: much faster than the Gaussian case.
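The Frechet limit is easy to verify empirically, since for exact Pareto data the convergence is fast (a sketch with illustrative sample sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10_000, 1_000

# Pareto(alpha=2, x_m=1) maxima, scaled by n^(1/alpha) = sqrt(n).
maxima = (rng.pareto(2.0, (trials, n)) + 1.0).max(axis=1)
scaled = maxima / np.sqrt(n)

# Empirical CDF of the scaled maxima vs the Frechet(2) limit exp(-x^-2).
emp = {x: float((scaled <= x).mean()) for x in (0.5, 1.0, 2.0)}
for x, p in emp.items():
    print(f"x={x}  empirical={p:.3f}  frechet={np.exp(-x**-2.0):.3f}")
```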

Common Confusions

Watch Out

EVT is not just about catastrophic events

EVT applies to any maximum or minimum, not just disasters. The maximum score in a class of students, the best performance across multiple model training runs, the longest wait time in a queue: these are all problems for EVT. The theory applies whenever you care about the extreme of a collection of random variables, regardless of the domain.

Watch Out

The GEV shape parameter is not the tail index

The GEV shape $\xi$ is the reciprocal of the tail index $\alpha$ for Frechet-type distributions: $\xi = 1/\alpha$. A distribution with tail index $\alpha = 2$ (finite mean, infinite variance) gives $\xi = 0.5$. Do not confuse $\xi$ with $\alpha$. In the EVT literature, larger $\xi > 0$ indicates heavier tails, while in the fat-tails literature, smaller $\alpha$ indicates heavier tails.

Watch Out

Block maxima vs. peaks over threshold

The block maxima method divides the data into blocks (e.g., years) and fits a GEV to the maximum of each block. The peaks-over-threshold (POT) method fits a GPD to all observations exceeding a chosen threshold. POT is generally more efficient (uses more data) but introduces an additional modeling choice (the threshold level). Both methods estimate the same tail behavior.

Exercises

ExerciseCore

Problem

Let $X_1, \ldots, X_n$ be i.i.d. Uniform(0, 1). Find the exact distribution of $M_n = \max(X_1, \ldots, X_n)$. Compute $\mathbb{E}[M_n]$ and verify that $M_n \to 1$ as $n \to \infty$.

ExerciseCore

Problem

The Gumbel distribution has CDF $G(x) = \exp(-e^{-x})$. Compute its mean, variance, and median.

ExerciseAdvanced

Problem

Show that the Pareto distribution with tail $P(X > x) = x^{-\alpha}$ for $x \geq 1$ is in the domain of attraction of the Frechet distribution. Find the normalizing sequences $a_n$ and $b_n$.

ExerciseAdvanced

Problem

In best-of-$N$ sampling from a language model, suppose the reward scores $R_1, \ldots, R_N$ are i.i.d. with a standard Gumbel distribution (CDF $G(x) = e^{-e^{-x}}$). What is the expected value of the maximum reward $M_N = \max(R_1, \ldots, R_N)$? How does this scale with $N$?

ExerciseResearch

Problem

Describe how you would use the peaks-over-threshold method to estimate the probability that a model's prediction error exceeds a value never seen in the test set. What are the key practical challenges?

References

Canonical:

  • Coles, An Introduction to Statistical Modeling of Extreme Values (2001), Chapters 1-5
  • de Haan & Ferreira, Extreme Value Theory: An Introduction (2006), Chapters 1-3
  • Embrechts, Kluppelberg, & Mikosch, Modelling Extremal Events (1997), Chapters 3-6

Current:

  • Beirlant, Goegebeur, Segers, & Teugels, Statistics of Extremes: Theory and Applications (2004), Chapters 1-5
  • Resnick, Heavy-Tail Phenomena: Probabilistic and Statistical Modeling (2007), Chapters 4-5
  • Hughes et al., "Best-of-N Jailbreaking" (2024), discusses EVT in the context of LLM sampling

Next Topics

Building on extreme value theory:

  • Extreme value theory connects back to fat tails through the Frechet domain of attraction
  • Order statistics provide the finite-sample foundation for EVT

Last reviewed: April 2026
