Probability
Extreme Value Theory
The mathematics of maxima and rare events. The Fisher-Tippett-Gnedenko theorem, the three extreme value distributions (Gumbel, Frechet, Weibull), peaks-over-threshold, and applications to tail risk and model evaluation.
Why This Matters
The law of large numbers and the central limit theorem describe the behavior of sums and averages. Extreme value theory (EVT) describes the behavior of maxima and minima. This is a different question with different answers and different applications.
When you evaluate a machine learning model, you often care about worst-case performance: the maximum prediction error, the largest loss on any input, the most confident wrong prediction. When you assess risk, you care about the worst outcome: the largest portfolio loss, the tallest wave, the strongest earthquake. When you use best-of-N sampling in language models, you care about the maximum quality score across samples.
EVT provides the mathematical tools for these problems. Just as the CLT says that properly normalized sums converge to a Gaussian, the Fisher-Tippett-Gnedenko theorem says that properly normalized maxima converge to one of exactly three distributions. Which one depends on the tail behavior of the underlying distribution.
Mental Model
Draw $n$ i.i.d. samples $X_1, \dots, X_n$ from some distribution $F$ and record the maximum $M_n = \max(X_1, \dots, X_n)$. As $n$ grows, $M_n$ increases. The question is: after proper centering and scaling, does the distribution of $M_n$ converge to something universal?
The answer is yes, and the limit depends on how the tail of $F$ decays:
- Exponential-type tails (Gaussian, exponential): the maximum grows logarithmically, and the limit is the Gumbel distribution
- Polynomial tails (fat-tailed, Pareto): the maximum grows as a power of $n$, and the limit is the Frechet distribution
- Bounded support (uniform, beta): the maximum approaches the upper bound, and the limit is the Weibull distribution
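These three growth regimes are easy to see in simulation. A minimal sketch using only the standard library (the sample sizes and trial counts are arbitrary illustrative choices): the Gaussian maximum creeps up slowly, the Pareto maximum races ahead, and the uniform maximum saturates at its upper bound.

```python
import random

random.seed(0)

def mean_max(sample, n, trials=2000):
    """Average maximum of n i.i.d. draws, estimated over many trials."""
    return sum(max(sample() for _ in range(n)) for _ in range(trials)) / trials

dists = {
    "gaussian": lambda: random.gauss(0.0, 1.0),   # exponential-type tail -> Gumbel domain
    "pareto":   lambda: random.random() ** -0.5,  # P(X > x) = x^-2      -> Frechet domain
    "uniform":  lambda: random.random(),          # bounded support      -> reversed Weibull domain
}

results = {name: {n: mean_max(d, n) for n in (10, 100, 1000)} for name, d in dists.items()}
for name, row in results.items():
    print(name, {n: round(v, 3) for n, v in row.items()})
```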
Core Definitions
Generalized Extreme Value Distribution
The generalized extreme value (GEV) distribution unifies the three extreme value types into a single family parameterized by location $\mu$, scale $\sigma > 0$, and shape $\xi$:

$$G(x) = \exp\left(-\left[1 + \xi \frac{x - \mu}{\sigma}\right]^{-1/\xi}\right)$$

defined on $\{x : 1 + \xi(x - \mu)/\sigma > 0\}$.
The three cases:
- $\xi > 0$: Frechet type (heavy right tail, polynomial decay)
- $\xi = 0$: Gumbel type (light tail, exponential-type decay). The formula is interpreted as the limit $\xi \to 0$: $G(x) = \exp\left(-e^{-(x - \mu)/\sigma}\right)$
- $\xi < 0$: Weibull type (finite right endpoint at $\mu - \sigma/\xi$)
The parameter $\xi$ is called the extreme value index or shape parameter. It determines the tail behavior of the maximum distribution.
Gumbel Distribution
The Gumbel distribution (Type I extreme value) has CDF:

$$G(x) = \exp\left(-e^{-(x - \mu)/\beta}\right)$$

with mean $\mu + \beta\gamma$ (where $\gamma \approx 0.5772$ is the Euler-Mascheroni constant) and variance $\pi^2 \beta^2 / 6$.
The Gumbel distribution arises as the limit for maxima of distributions with exponential-type tails: Gaussian, exponential, gamma, log-normal.
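As a sanity check on these moment formulas, Gumbel variates can be drawn by inverting the CDF: if $U \sim$ Uniform(0,1), then $\mu - \beta \ln(-\ln U)$ is Gumbel($\mu, \beta$). A quick simulation (the parameter values are arbitrary):

```python
import math
import random

random.seed(1)

mu, beta = 0.0, 2.0
gamma = 0.5772156649  # Euler-Mascheroni constant

# Inverse-CDF sampling: if U ~ Uniform(0,1), then mu - beta*log(-log(U)) is Gumbel(mu, beta).
draws = [mu - beta * math.log(-math.log(random.random())) for _ in range(200_000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / len(draws)

print(f"mean:     sample {sample_mean:.3f}  theory {mu + beta * gamma:.3f}")
print(f"variance: sample {sample_var:.3f}  theory {math.pi ** 2 * beta ** 2 / 6:.3f}")
```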
Frechet Distribution
The Frechet distribution (Type II extreme value) has CDF:

$$G(x) = \exp\left(-x^{-\alpha}\right)$$

for $x > 0$, with shape $\alpha > 0$. This corresponds to GEV with $\xi = 1/\alpha$.
The Frechet distribution has a polynomial right tail: $1 - G(x) \approx x^{-\alpha}$ for large $x$. It arises as the limit for maxima of fat-tailed distributions with tail index $\alpha$.
Reversed Weibull Distribution
The reversed Weibull distribution (Type III extreme value) has CDF:

$$G(x) = \exp\left(-(-x)^{\alpha}\right)$$

for $x < 0$ (and $G(x) = 1$ for $x \ge 0$), with shape $\alpha > 0$. This corresponds to GEV with $\xi = -1/\alpha$.
It arises as the limit for maxima of distributions with a finite right endpoint. The uniform and beta distributions are in this domain of attraction.
Domain of Attraction
A distribution $F$ belongs to the domain of attraction of an extreme value distribution $G$, written $F \in D(G)$, if there exist normalizing sequences $a_n > 0$ and $b_n$ such that:

$$P\left(\frac{M_n - b_n}{a_n} \le x\right) = F^n(a_n x + b_n) \to G(x)$$

as $n \to \infty$.
Not every distribution belongs to some domain of attraction, but most distributions encountered in practice do.
Generalized Pareto Distribution
The generalized Pareto distribution (GPD) has CDF:

$$H(y) = 1 - \left(1 + \frac{\xi y}{\sigma}\right)^{-1/\xi}$$

for $y \ge 0$ when $\xi \ge 0$, and for $0 \le y \le -\sigma/\xi$ when $\xi < 0$. The case $\xi = 0$ is interpreted as the exponential: $H(y) = 1 - e^{-y/\sigma}$.
The GPD is the natural distribution for exceedances over a threshold. If the normalized maxima of $X$ converge to a GEV distribution, then the exceedances $X - u$ given $X > u$ follow a GPD as the threshold $u$ approaches the right endpoint.
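The $\xi = 0$ case can be illustrated directly: the exponential distribution is memoryless, so its exceedances over any threshold are again exponential, which is exactly the GPD with $\xi = 0$. A quick check (the threshold and sample size are arbitrary):

```python
import math
import random

random.seed(2)

# Exponential(1) is in the Gumbel domain (xi = 0). By memorylessness, its
# exceedances over any threshold u are again Exponential(1) -- exactly the
# xi = 0 case of the GPD, so the mean excess should be near 1.
draws = [-math.log(random.random()) for _ in range(500_000)]

u = 2.0
excesses = [x - u for x in draws if x > u]
mean_excess = sum(excesses) / len(excesses)
print(len(excesses), round(mean_excess, 3))
```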
Main Theorems
Fisher-Tippett-Gnedenko Theorem
Statement
Let $X_1, X_2, \dots$ be i.i.d. random variables with distribution $F$, and let $M_n = \max(X_1, \dots, X_n)$. If there exist sequences $a_n > 0$ and $b_n$ such that:

$$P\left(\frac{M_n - b_n}{a_n} \le x\right) \to G(x)$$

for some non-degenerate distribution $G$, then $G$ must be a generalized extreme value distribution $G_\xi$ for some $\xi$.
Equivalently, $G$ must be one of the three types:
- Type I (Gumbel): $\xi = 0$, tails decay exponentially
- Type II (Frechet): $\xi > 0$, polynomial right tail
- Type III (Reversed Weibull): $\xi < 0$, finite right endpoint
The normalizing sequences are:
- Frechet domain ($\xi = 1/\alpha > 0$): $b_n = 0$, $a_n = F^{-1}(1 - 1/n)$, which grows as $n^{1/\alpha}$
- Gumbel domain ($\xi = 0$): $a_n$ and $b_n$ depend on the specific $F$ (e.g., for the Gaussian: $b_n \approx \sqrt{2\ln n}$, $a_n \approx 1/\sqrt{2\ln n}$)
- Weibull domain ($\xi < 0$): $b_n = x^*$ (the right endpoint), $a_n$ depends on the behavior of $F$ near $x^*$
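The Gaussian normalizing sequences can be checked numerically: the standardized maxima $(M_n - b_n)/a_n$ should be approximately standard Gumbel, whose mean is $\gamma \approx 0.577$. Convergence is only logarithmic in $n$, so a visible bias remains at moderate $n$. A minimal simulation (sample sizes are illustrative):

```python
import math
import random

random.seed(3)

n = 1000
# Normalizing sequences for the standard Gaussian (Gumbel domain):
sq = math.sqrt(2 * math.log(n))
bn = sq - (math.log(math.log(n)) + math.log(4 * math.pi)) / (2 * sq)
an = 1 / sq

# (M_n - b_n) / a_n should be roughly standard Gumbel (mean ~0.577).
# Convergence is logarithmic in n, so expect a visible bias at n = 1000.
trials = 2000
vals = [(max(random.gauss(0.0, 1.0) for _ in range(n)) - bn) / an for _ in range(trials)]
avg = sum(vals) / trials
print(round(avg, 3))
```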
Intuition
This theorem is the "CLT for maxima." Just as the CLT says that normalized sums of i.i.d. variables can only converge to a Gaussian (under finite variance), the Fisher-Tippett theorem says that normalized maxima can only converge to one of three types. The type is determined entirely by the tail behavior of the underlying distribution.
The key distinction: fat-tailed distributions produce Frechet limits (the maximum grows as a power of $n$), exponential-type tails produce Gumbel limits (the maximum grows logarithmically), and bounded distributions produce reversed Weibull limits (the maximum approaches the upper bound).
Proof Sketch
The proof uses the max-stability property. If $G$ is a non-degenerate limit of normalized maxima, then $G$ must be max-stable: $G^n(a_n x + b_n) = G(x)$ for some sequences $a_n > 0$, $b_n$. Taking logarithms: $n \log G(a_n x + b_n) = \log G(x)$. Since $\log G$ is a negative function that is zero at infinity, the functional equation constrains $-\log G$ to be of the form $(1 + \xi x)^{-1/\xi}$ (up to location and scale). The three types correspond to $\xi > 0$, $\xi = 0$, and $\xi < 0$. The characterization of domains of attraction uses Karamata's theory of regular variation (for the Frechet case) and the von Mises conditions (for all cases).
Why It Matters
The theorem tells you that regardless of the underlying distribution, the maximum of a large sample behaves in one of three ways. This enables:
- Tail risk assessment: estimate the probability of events more extreme than any observed, by fitting a GEV to block maxima
- Engineering reliability: design structures to withstand the maximum load over a 100-year period, using only 30 years of data
- ML model evaluation: understand the distribution of worst-case errors across test inputs
The shape parameter $\xi$ is the single most important quantity: it determines whether extremes are bounded ($\xi < 0$), grow logarithmically ($\xi = 0$), or grow polynomially ($\xi > 0$).
Failure Mode
The theorem requires the existence of normalizing sequences. Some distributions are not in any domain of attraction (e.g., discrete distributions with irregular support can fail). The theorem is also asymptotic: for finite $n$, the GEV approximation may be poor. The rate of convergence to the GEV limit is often much slower than CLT convergence to the Gaussian, especially in the Gumbel domain (convergence is logarithmic in $n$).
Pickands-Balkema-de Haan Theorem
Statement
If $F$ is in the domain of attraction of a GEV distribution with shape parameter $\xi$, then for a large threshold $u$, the distribution of exceedances $X - u$ given $X > u$ is approximately generalized Pareto:

$$P(X - u \le y \mid X > u) \approx H_{\xi, \sigma(u)}(y)$$

where the scale $\sigma(u)$ depends on $u$, and the approximation improves as $u \to x^*$ (the right endpoint).
The shape parameter $\xi$ is the same as in the GEV limit. This connects block maxima analysis (GEV) with threshold exceedance analysis (GPD).
Intuition
Instead of looking at the maximum of a block of data, look at all observations that exceed a high threshold $u$. The distribution of how far they exceed $u$ is approximately GPD. This is more data-efficient than block maxima: instead of one maximum per block, you use all exceedances. The GPD shape parameter is the same as the GEV shape, so both methods estimate the same tail behavior.
Proof Sketch
Define the excess distribution $F_u(y) = P(X - u \le y \mid X > u)$ for $y \ge 0$. If $F$ is in the domain of attraction of a GEV with parameter $\xi$, then $F$ satisfies a regular variation condition on its tail. This condition, combined with the relationship between $F_u$ and the GEV limit, implies that $F_u$ converges to a GPD with the same $\xi$ as $u \to x^*$. The proof uses the fact that the GPD is the only distribution with the "threshold stability" property: if $Y$ follows a GPD, then $Y - v$ given $Y > v$ also follows a GPD for $v > 0$ (with adjusted scale).
Why It Matters
The Pickands-Balkema-de Haan theorem is the foundation of the peaks-over-threshold (POT) method, which is the standard approach for modeling extreme events in practice. Instead of dividing data into blocks and fitting a GEV to block maxima (which wastes data), the POT method fits a GPD to all exceedances above a threshold. This is more statistically efficient and is the dominant method in hydrology, finance, and insurance.
Failure Mode
The choice of threshold $u$ is critical. Too low, and the GPD approximation is poor (the asymptotic result has not kicked in). Too high, and there are too few exceedances for reliable estimation. Threshold selection is one of the most difficult practical aspects of EVT. Common approaches include mean residual life plots and stability of parameter estimates across threshold choices.
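One simple way to see the POT idea in code, without a GPD-fitting library, is the Hill estimator: a classical estimator of $\xi$ built from the largest order statistics, closely related to fitting a GPD to exceedances. The sketch below assumes exact Pareto data with $\alpha = 2$, so the true shape is $\xi = 1/\alpha = 0.5$; the choice of $k$ plays the same role as the threshold choice discussed above.

```python
import math
import random

random.seed(4)

# Exact Pareto data with tail P(X > x) = x^(-2): tail index alpha = 2,
# so the true GPD/GEV shape is xi = 1/alpha = 0.5.
data = sorted((random.random() ** -0.5 for _ in range(100_000)), reverse=True)

# Hill estimator of xi from the k largest order statistics:
#   xi_hat = (1/k) * sum_{i <= k} [log X_(i) - log X_(k+1)]
# Choosing k is the same bias/variance trade-off as choosing the POT threshold.
k = 2000
xi_hat = sum(math.log(x) - math.log(data[k]) for x in data[:k]) / k
print(round(xi_hat, 3))
```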
Domains of Attraction: Which Distributions Go Where
| Distribution | Domain | Shape $\xi$ | Why |
|---|---|---|---|
| Gaussian | Gumbel | $\xi = 0$ | Tail decays as $e^{-x^2/2}$ |
| Exponential | Gumbel | $\xi = 0$ | Tail decays as $e^{-x}$ |
| Log-normal | Gumbel | $\xi = 0$ | Tail decays faster than any power law |
| Pareto($\alpha$) | Frechet | $\xi = 1/\alpha$ | Tail decays as $x^{-\alpha}$ |
| Student-$t$($\nu$) | Frechet | $\xi = 1/\nu$ | Tail decays as $x^{-\nu}$ |
| Cauchy | Frechet | $\xi = 1$ | Tail decays as $x^{-1}$ |
| Uniform(0,1) | Rev. Weibull | $\xi = -1$ | Bounded above at 1 |
| Beta($a$,$b$) | Rev. Weibull | $\xi = -1/b$ | Bounded above at 1 |
Applications
Financial Tail Risk
Value-at-Risk (VaR) and Expected Shortfall (ES) estimate the probability and size of extreme losses. Standard approaches assume Gaussian returns, which drastically underestimate tail risk. EVT provides a principled alternative: fit a GPD to losses exceeding a high threshold, then extrapolate to estimate the probability of losses larger than any observed.
Best-of-N Sampling in LLMs
When generating $n$ candidate responses from a language model and selecting the best one according to a reward model, the quality of the selected response depends on the distribution of the maximum of $n$ scores. If scores are approximately Gaussian, the improvement from best-of-$n$ grows as $\sqrt{\log n}$ (Gumbel scaling). If scores are fat-tailed with index $\alpha$, the improvement grows as $n^{1/\alpha}$ (Frechet scaling), which is much faster.
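In the idealized case where reward scores are i.i.d. standard Gumbel, the scaling is exact: the maximum of $n$ Gumbel variates is again Gumbel, shifted by $\ln n$, so the expected best-of-$n$ reward is $\ln n + \gamma$. A quick simulation of this assumption (trial counts are arbitrary):

```python
import math
import random

random.seed(5)

gamma = 0.5772156649  # Euler-Mascheroni constant

def gumbel():
    # Inverse-CDF sampling for the standard Gumbel
    return -math.log(-math.log(random.random()))

# Max-stability: the max of n standard Gumbels is Gumbel(log n, 1),
# so the expected best-of-n reward is log n + gamma.
results = {}
for n in (4, 16, 64):
    trials = 20_000
    results[n] = sum(max(gumbel() for _ in range(n)) for _ in range(trials)) / trials
    print(n, round(results[n], 3), round(math.log(n) + gamma, 3))
```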
Flood Modeling and Return Levels
EVT is the standard tool for estimating "100-year floods" from shorter records. Fit a GEV to annual maximum river levels, then extrapolate to the 100-year return level: the level exceeded with probability 1/100 in any given year. The shape parameter $\xi$ determines whether the extrapolation is conservative ($\xi < 0$: return levels approach a finite bound) or aggressive ($\xi > 0$: return levels grow without bound).
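The return-level calculation is just the GEV quantile function. A minimal sketch (the parameter values are illustrative, not fitted to real flood data):

```python
import math

# 100-year return level z_p from GEV(mu, sigma, xi): solve G(z_p) = 1 - p,
#   z_p = mu - (sigma / xi) * (1 - y^(-xi)),  with y = -log(1 - p),  xi != 0,
# and the Gumbel limit z_p = mu - sigma * log(y) at xi = 0.
def return_level(mu, sigma, xi, p):
    y = -math.log(1 - p)
    if abs(xi) < 1e-9:
        return mu - sigma * math.log(y)  # Gumbel limit
    return mu - (sigma / xi) * (1 - y ** (-xi))

# Same location/scale, three shapes: heavier tail -> higher return level.
for xi in (0.2, 0.0, -0.2):
    print(xi, round(return_level(5.0, 1.0, xi, 0.01), 2))
```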
Model Evaluation
The worst-case error of a model across $n$ test inputs is the maximum of $n$ per-input errors. If per-input errors have an exponential-type tail, the worst-case error grows roughly as $\log n$. If errors have a fat tail with index $\alpha$, the worst-case error grows as $n^{1/\alpha}$. This distinction matters for safety-critical applications where worst-case performance is the binding constraint.
Canonical Examples
Maximum of Gaussian samples
Let $X_1, \dots, X_n \sim N(0, 1)$ i.i.d. The maximum $M_n$ satisfies:

$$\frac{M_n - b_n}{a_n} \xrightarrow{d} \text{Gumbel}$$

with $b_n = \sqrt{2\ln n} - \dfrac{\ln \ln n + \ln 4\pi}{2\sqrt{2\ln n}}$ and $a_n = \dfrac{1}{\sqrt{2\ln n}}$.
For $n = 1000$: $b_n \approx 3.12$, $a_n \approx 0.27$. The expected maximum is about 3.24 (roughly $b_n + \gamma a_n$). For $n = 10^5$: $b_n \approx 4.28$, $a_n \approx 0.21$. The expected maximum is about 4.4. The maximum grows as $\sqrt{2\ln n}$: very slowly.
Maximum of Pareto samples
Let $X_1, \dots, X_n$ be i.i.d. Pareto with tail $P(X > x) = x^{-\alpha}$ for $x \ge 1$, with $\alpha = 2$. The maximum satisfies:

$$\frac{M_n}{n^{1/\alpha}} \xrightarrow{d} \text{Frechet}(\alpha)$$

For $n = 100$: the expected maximum is of order $n^{1/2} = 10$. For $n = 10^4$: the expected maximum is of order $n^{1/2} = 100$. The maximum grows as $n^{1/\alpha}$: much faster than the Gaussian case.
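A simulation confirms the Frechet scaling. Because Frechet(2) has infinite variance, the sample mean of the maxima is unstable, so the check below uses the median instead: $M_n / \sqrt{n}$ should have median near the Frechet(2) median $(\ln 2)^{-1/2} \approx 1.20$ (trial counts are arbitrary):

```python
import math
import random
from statistics import median

random.seed(6)

# Pareto tail P(X > x) = x^(-2), sampled by inversion: U^(-1/2) for U ~ Uniform(0,1).
# M_n / sqrt(n) converges in distribution to Frechet(2), whose median is
# (log 2)^(-1/2), roughly 1.20.
trials = 2000
ratios = {}
for n in (100, 1600):
    maxima = [max(random.random() ** -0.5 for _ in range(n)) for _ in range(trials)]
    ratios[n] = median(maxima) / math.sqrt(n)
    print(n, round(ratios[n], 3))
```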
Common Confusions
EVT is not just about catastrophic events
EVT applies to any maximum or minimum, not just disasters. The maximum score in a class of students, the best performance across multiple model training runs, the longest wait time in a queue: these are all problems for EVT. The theory applies whenever you care about the extreme of a collection of random variables, regardless of the domain.
The GEV shape parameter is not the tail index
The GEV shape $\xi$ is the reciprocal of the tail index $\alpha$ for Frechet-type distributions: $\xi = 1/\alpha$. A distribution with tail index $\alpha = 2$ (finite mean, infinite variance) gives $\xi = 1/2$. Do not confuse $\xi$ with $\alpha$. In the EVT literature, larger $\xi > 0$ indicates heavier tails, while in the fat-tails literature, smaller $\alpha$ indicates heavier tails.
Block maxima vs. peaks over threshold
The block maxima method divides the data into blocks (e.g., years) and fits a GEV to the maximum of each block. The peaks-over-threshold (POT) method fits a GPD to all observations exceeding a chosen threshold. POT is generally more efficient (uses more data) but introduces an additional modeling choice (the threshold level). Both methods estimate the same tail behavior.
Exercises
Problem
Let $X_1, \dots, X_n$ be i.i.d. Uniform(0, 1). Find the exact distribution of $M_n = \max(X_1, \dots, X_n)$. Compute $E[M_n]$ and verify that $E[M_n] \to 1$ as $n \to \infty$.
Problem
The Gumbel distribution has CDF $G(x) = \exp(-e^{-x})$. Compute its mean, variance, and median.
Problem
Show that the Pareto distribution with tail $P(X > x) = x^{-\alpha}$ for $x \ge 1$ is in the domain of attraction of the Frechet distribution. Find the normalizing sequences $a_n$ and $b_n$.
Problem
In best-of-$n$ sampling from a language model, suppose the reward scores are i.i.d. with a standard Gumbel distribution (CDF $G(x) = \exp(-e^{-x})$). What is the expected value of the maximum reward $M_n$? How does this scale with $n$?
Problem
Describe how you would use the peaks-over-threshold method to estimate the probability that a model's prediction error exceeds a value never seen in the test set. What are the key practical challenges?
References
Canonical:
- Coles, An Introduction to Statistical Modeling of Extreme Values (2001), Chapters 1-5
- de Haan & Ferreira, Extreme Value Theory: An Introduction (2006), Chapters 1-3
- Embrechts, Kluppelberg, & Mikosch, Modelling Extremal Events (1997), Chapters 3-6
Current:
- Beirlant, Goegebeur, Segers, & Teugels, Statistics of Extremes: Theory and Applications (2004), Chapters 1-5
- Resnick, Heavy-Tail Phenomena: Probabilistic and Statistical Modeling (2007), Chapters 4-5
- Nakagawa, Hashimoto, & Abe, "Best-of-N Jailbreaking" (2024), discusses EVT in the context of LLM sampling
Next Topics
Building on extreme value theory:
- Extreme value theory connects back to fat tails through the Frechet domain of attraction
- Order statistics provide the finite-sample foundation for EVT
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Order Statistics (Layer 1)
- Fat Tails and Heavy-Tailed Distributions (Layer 2)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Law of Large Numbers (Layer 0B)