Probability
Extreme Value Theory
The mathematics of maxima and rare events. The Fisher-Tippett-Gnedenko theorem, the three extreme value distributions (Gumbel, Frechet, Weibull), peaks-over-threshold, and applications to tail risk and model evaluation.
Why This Matters
The law of large numbers and the central limit theorem describe the behavior of sums and averages. Extreme value theory (EVT) describes the behavior of maxima and minima. This is a different question with different answers and different applications.
When you evaluate a machine learning model, you often care about worst-case performance: the maximum prediction error, the largest loss on any input, the most confident wrong prediction. When you assess risk, you care about the worst outcome: the largest portfolio loss, the tallest wave, the strongest earthquake. When you use best-of-N sampling in language models, you care about the maximum quality score across samples.
EVT provides the mathematical tools for these problems. Just as the CLT says that properly normalized sums converge to a Gaussian, the Fisher-Tippett-Gnedenko theorem says that properly normalized maxima converge to one of exactly three distributions. Which one depends on the tail behavior of the underlying distribution.
Mental Model
Draw $n$ i.i.d. samples $X_1, \dots, X_n$ from some distribution $F$ and record the maximum $M_n = \max(X_1, \dots, X_n)$. As $n$ grows, $M_n$ increases. The question is: after proper centering and scaling, does the distribution of $M_n$ converge to something universal?
The answer is yes, and the limit depends on how the tail of $F$ decays:
- Exponential-type tails (Gaussian, exponential): the maximum grows logarithmically, and the limit is the Gumbel distribution
- Polynomial tails (fat-tailed, Pareto): the maximum grows as a power of $n$, and the limit is the Frechet distribution
- Bounded support (uniform, beta): the maximum approaches the upper bound, and the limit is the Weibull distribution
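These three growth regimes are easy to see in simulation. A minimal sketch using only the standard library (the sample sizes and trial counts are arbitrary illustrative choices): the Gaussian maximum creeps up slowly, the Pareto maximum races ahead, and the uniform maximum saturates at its upper bound.

```python
import random

random.seed(0)

def mean_max(sample, n, trials=2000):
    """Average maximum of n i.i.d. draws, estimated over many trials."""
    return sum(max(sample() for _ in range(n)) for _ in range(trials)) / trials

dists = {
    "gaussian": lambda: random.gauss(0.0, 1.0),   # exponential-type tail -> Gumbel domain
    "pareto":   lambda: random.random() ** -0.5,  # P(X > x) = x^-2      -> Frechet domain
    "uniform":  lambda: random.random(),          # bounded support      -> reversed Weibull domain
}

results = {name: {n: mean_max(d, n) for n in (10, 100, 1000)} for name, d in dists.items()}
for name, row in results.items():
    print(name, {n: round(v, 3) for n, v in row.items()})
```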
Core Definitions
Generalized Extreme Value Distribution
The generalized extreme value (GEV) distribution unifies the three extreme value types into a single family parameterized by location $\mu$, scale $\sigma > 0$, and shape $\xi$:

$$G(x) = \exp\left(-\left[1 + \xi \frac{x - \mu}{\sigma}\right]^{-1/\xi}\right)$$

defined on $\{x : 1 + \xi(x - \mu)/\sigma > 0\}$.
The three cases:
- $\xi > 0$: Frechet type (heavy right tail, polynomial decay)
- $\xi = 0$: Gumbel type (light tail, exponential-type decay). The formula is interpreted as the limit $\xi \to 0$: $G(x) = \exp\left(-e^{-(x - \mu)/\sigma}\right)$
- $\xi < 0$: Weibull type (finite right endpoint at $\mu - \sigma/\xi$)
The parameter $\xi$ is called the extreme value index or shape parameter. It determines the tail behavior of the maximum distribution.
Gumbel Distribution
The Gumbel distribution (Type I extreme value) has CDF:

$$G(x) = \exp\left(-e^{-(x - \mu)/\beta}\right)$$

with mean $\mu + \beta\gamma$ (where $\gamma \approx 0.5772$ is the Euler-Mascheroni constant) and variance $\pi^2 \beta^2 / 6$.
The Gumbel distribution arises as the limit for maxima of distributions with exponential-type tails: Gaussian, exponential, gamma, log-normal.
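As a sanity check on these moment formulas, Gumbel variates can be drawn by inverting the CDF: if $U \sim$ Uniform(0,1), then $\mu - \beta \ln(-\ln U)$ is Gumbel($\mu, \beta$). A quick simulation (the parameter values are arbitrary):

```python
import math
import random

random.seed(1)

mu, beta = 0.0, 2.0
gamma = 0.5772156649  # Euler-Mascheroni constant

# Inverse-CDF sampling: if U ~ Uniform(0,1), then mu - beta*log(-log(U)) is Gumbel(mu, beta).
draws = [mu - beta * math.log(-math.log(random.random())) for _ in range(200_000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / len(draws)

print(f"mean:     sample {sample_mean:.3f}  theory {mu + beta * gamma:.3f}")
print(f"variance: sample {sample_var:.3f}  theory {math.pi ** 2 * beta ** 2 / 6:.3f}")
```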
Frechet Distribution
The Frechet distribution (Type II extreme value) has CDF:

$$G(x) = \exp\left(-x^{-\alpha}\right)$$

for $x > 0$, with shape $\alpha > 0$. This corresponds to GEV with $\xi = 1/\alpha$.
The Frechet distribution has a polynomial right tail: $1 - G(x) \approx x^{-\alpha}$ for large $x$. It arises as the limit for maxima of fat-tailed distributions with tail index $\alpha$.
Reversed Weibull Distribution
The reversed Weibull distribution (Type III extreme value) has CDF:

$$G(x) = \exp\left(-(-x)^{\alpha}\right)$$

for $x < 0$ (and $G(x) = 1$ for $x \ge 0$), with shape $\alpha > 0$. This corresponds to GEV with $\xi = -1/\alpha$.
It arises as the limit for maxima of distributions with a finite right endpoint. The uniform and beta distributions are in this domain of attraction.
Domain of Attraction
A distribution $F$ belongs to the domain of attraction of an extreme value distribution $G$, written $F \in D(G)$, if there exist normalizing sequences $a_n > 0$ and $b_n$ such that:

$$P\left(\frac{M_n - b_n}{a_n} \le x\right) = F^n(a_n x + b_n) \to G(x)$$

as $n \to \infty$.
Not every distribution belongs to some domain of attraction, but most distributions encountered in practice do.
Generalized Pareto Distribution
The generalized Pareto distribution (GPD) has CDF:

$$H(y) = 1 - \left(1 + \frac{\xi y}{\sigma}\right)^{-1/\xi}$$

for $y \ge 0$ when $\xi \ge 0$, and for $0 \le y \le -\sigma/\xi$ when $\xi < 0$. The case $\xi = 0$ is interpreted as the exponential: $H(y) = 1 - e^{-y/\sigma}$.
The GPD is the natural distribution for exceedances over a threshold. If the normalized maxima of $X$ converge to a GEV distribution, then the exceedances $X - u$ given $X > u$ follow a GPD as the threshold $u$ approaches the right endpoint.
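The $\xi = 0$ case can be illustrated directly: the exponential distribution is memoryless, so its exceedances over any threshold are again exponential, which is exactly the GPD with $\xi = 0$. A quick check (the threshold and sample size are arbitrary):

```python
import math
import random

random.seed(2)

# Exponential(1) is in the Gumbel domain (xi = 0). By memorylessness, its
# exceedances over any threshold u are again Exponential(1) -- exactly the
# xi = 0 case of the GPD, so the mean excess should be near 1.
draws = [-math.log(random.random()) for _ in range(500_000)]

u = 2.0
excesses = [x - u for x in draws if x > u]
mean_excess = sum(excesses) / len(excesses)
print(len(excesses), round(mean_excess, 3))
```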
Main Theorems
Fisher-Tippett-Gnedenko Theorem
Statement
Let $X_1, X_2, \dots$ be i.i.d. random variables with distribution $F$, and let $M_n = \max(X_1, \dots, X_n)$. If there exist sequences $a_n > 0$ and $b_n$ such that:

$$P\left(\frac{M_n - b_n}{a_n} \le x\right) \to G(x)$$

for some non-degenerate distribution $G$, then $G$ must be a generalized extreme value distribution $G_\xi$ for some $\xi$.
Equivalently, $G$ must be one of the three types:
- Type I (Gumbel): $\xi = 0$, tails decay exponentially
- Type II (Frechet): $\xi > 0$, polynomial right tail
- Type III (Reversed Weibull): $\xi < 0$, finite right endpoint
The normalizing sequences are:
- Frechet domain ($\xi = 1/\alpha > 0$): $b_n = 0$, $a_n = F^{-1}(1 - 1/n)$, which grows as $n^{1/\alpha}$
- Gumbel domain ($\xi = 0$): $a_n$ and $b_n$ depend on the specific $F$ (e.g., for the Gaussian: $b_n \approx \sqrt{2\ln n}$, $a_n \approx 1/\sqrt{2\ln n}$)
- Weibull domain ($\xi < 0$): $b_n = x^*$ (the right endpoint), $a_n$ depends on the behavior of $F$ near $x^*$
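The Gaussian normalizing sequences can be checked numerically: the standardized maxima $(M_n - b_n)/a_n$ should be approximately standard Gumbel, whose mean is $\gamma \approx 0.577$. Convergence is only logarithmic in $n$, so a visible bias remains at moderate $n$. A minimal simulation (sample sizes are illustrative):

```python
import math
import random

random.seed(3)

n = 1000
# Normalizing sequences for the standard Gaussian (Gumbel domain):
sq = math.sqrt(2 * math.log(n))
bn = sq - (math.log(math.log(n)) + math.log(4 * math.pi)) / (2 * sq)
an = 1 / sq

# (M_n - b_n) / a_n should be roughly standard Gumbel (mean ~0.577).
# Convergence is logarithmic in n, so expect a visible bias at n = 1000.
trials = 2000
vals = [(max(random.gauss(0.0, 1.0) for _ in range(n)) - bn) / an for _ in range(trials)]
avg = sum(vals) / trials
print(round(avg, 3))
```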
Intuition
This theorem is the "CLT for maxima." Just as the CLT says that normalized sums of i.i.d. variables can only converge to a Gaussian (under finite variance), the Fisher-Tippett theorem says that normalized maxima can only converge to one of three types. The type is determined entirely by the tail behavior of the underlying distribution.
The key distinction: fat-tailed distributions produce Frechet limits (the maximum grows as a power of $n$), exponential-type tails produce Gumbel limits (the maximum grows logarithmically), and bounded distributions produce reversed Weibull limits (the maximum approaches the upper bound).
Proof Sketch
The proof uses the max-stability property. If $G$ is a non-degenerate limit of normalized maxima, then $G$ must be max-stable: $G^n(a_n x + b_n) = G(x)$ for some sequences $a_n > 0$, $b_n$. Taking logarithms: $n \log G(a_n x + b_n) = \log G(x)$. Since $\log G$ is a negative function that is zero at infinity, the functional equation constrains $-\log G$ to be of the form $(1 + \xi x)^{-1/\xi}$ (up to location and scale). The three types correspond to $\xi > 0$, $\xi = 0$, and $\xi < 0$. The characterization of domains of attraction uses Karamata's theory of regular variation (for the Frechet case) and the von Mises conditions (for all cases).
Why It Matters
The theorem tells you that regardless of the underlying distribution, the maximum of a large sample behaves in one of three ways. This enables:
- Tail risk assessment: estimate the probability of events more extreme than any observed, by fitting a GEV to block maxima
- Engineering reliability: design structures to withstand the maximum load over a 100-year period, using only 30 years of data
- ML model evaluation: understand the distribution of worst-case errors across test inputs
The shape parameter $\xi$ is the single most important quantity: it determines whether extremes are bounded ($\xi < 0$), grow logarithmically ($\xi = 0$), or grow polynomially ($\xi > 0$).
Failure Mode
The theorem requires the existence of normalizing sequences. Some distributions are not in any domain of attraction (e.g., discrete distributions with irregular support can fail). The theorem is also asymptotic: for finite $n$, the GEV approximation may be poor. The rate of convergence to the GEV limit is often much slower than CLT convergence to the Gaussian, especially in the Gumbel domain (convergence is logarithmic in $n$).
Pickands-Balkema-de Haan Theorem
Statement
If $F$ is in the domain of attraction of a GEV distribution with shape parameter $\xi$, then for a large threshold $u$, the distribution of exceedances $X - u$ given $X > u$ is approximately generalized Pareto:

$$P(X - u \le y \mid X > u) \approx H_{\xi, \sigma(u)}(y)$$

where the scale $\sigma(u)$ depends on $u$, and the approximation improves as $u \to x^*$ (the right endpoint).
The shape parameter $\xi$ is the same as in the GEV limit. This connects block maxima analysis (GEV) with threshold exceedance analysis (GPD).
Intuition
Instead of looking at the maximum of a block of data, look at all observations that exceed a high threshold $u$. The distribution of how far they exceed $u$ is approximately GPD. This is more data-efficient than block maxima: instead of one maximum per block, you use all exceedances. The GPD shape parameter is the same as the GEV shape, so both methods estimate the same tail behavior.
Proof Sketch
Define the excess distribution $F_u(y) = P(X - u \le y \mid X > u)$ for $y \ge 0$. If $F$ is in the domain of attraction of a GEV with parameter $\xi$, then $F$ satisfies a regular variation condition on its tail. This condition, combined with the relationship between $F_u$ and the GEV limit, implies that $F_u$ converges to a GPD with the same $\xi$ as $u \to x^*$. The proof uses the fact that the GPD is the only distribution with the "threshold stability" property: if $Y$ follows a GPD, then $Y - v$ given $Y > v$ also follows a GPD for $v > 0$ (with adjusted scale).
Why It Matters
The Pickands-Balkema-de Haan theorem is the foundation of the peaks-over-threshold (POT) method, which is the standard approach for modeling extreme events in practice. Instead of dividing data into blocks and fitting a GEV to block maxima (which wastes data), the POT method fits a GPD to all exceedances above a threshold. This is more statistically efficient and is the dominant method in hydrology, finance, and insurance.
Failure Mode
The choice of threshold $u$ is critical. Too low, and the GPD approximation is poor (the asymptotic result has not kicked in). Too high, and there are too few exceedances for reliable estimation. Threshold selection is one of the most difficult practical aspects of EVT. Common approaches include mean residual life plots and stability of parameter estimates across threshold choices.
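One simple way to see the POT idea in code, without a GPD-fitting library, is the Hill estimator: a classical estimator of $\xi$ built from the largest order statistics, closely related to fitting a GPD to exceedances. The sketch below assumes exact Pareto data with $\alpha = 2$, so the true shape is $\xi = 1/\alpha = 0.5$; the choice of $k$ plays the same role as the threshold choice discussed above.

```python
import math
import random

random.seed(4)

# Exact Pareto data with tail P(X > x) = x^(-2): tail index alpha = 2,
# so the true GPD/GEV shape is xi = 1/alpha = 0.5.
data = sorted((random.random() ** -0.5 for _ in range(100_000)), reverse=True)

# Hill estimator of xi from the k largest order statistics:
#   xi_hat = (1/k) * sum_{i <= k} [log X_(i) - log X_(k+1)]
# Choosing k is the same bias/variance trade-off as choosing the POT threshold.
k = 2000
xi_hat = sum(math.log(x) - math.log(data[k]) for x in data[:k]) / k
print(round(xi_hat, 3))
```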
Domains of Attraction: Which Distributions Go Where
| Distribution | Domain | Shape $\xi$ | Why |
|---|---|---|---|
| Gaussian | Gumbel | $\xi = 0$ | Tail decays as $e^{-x^2/2}$ |
| Exponential | Gumbel | $\xi = 0$ | Tail decays as $e^{-x}$ |
| Log-normal | Gumbel | $\xi = 0$ | Tail decays faster than any power law |
| Pareto($\alpha$) | Frechet | $\xi = 1/\alpha$ | Tail decays as $x^{-\alpha}$ |
| Student-$t$($\nu$) | Frechet | $\xi = 1/\nu$ | Tail decays as $x^{-\nu}$ |
| Cauchy | Frechet | $\xi = 1$ | Tail decays as $x^{-1}$ |
| Uniform(0,1) | Rev. Weibull | $\xi = -1$ | Bounded above at 1 |
| Beta($a$,$b$) | Rev. Weibull | $\xi = -1/b$ | Bounded above at 1 |
Applications
Financial Tail Risk
Value-at-Risk (VaR) and Expected Shortfall (ES) estimate the probability and size of extreme losses. Standard approaches assume Gaussian returns, which drastically underestimate tail risk. EVT provides a principled alternative: fit a GPD to losses exceeding a high threshold, then extrapolate to estimate the probability of losses larger than any observed.
Best-of-N Sampling in LLMs
When generating $n$ candidate responses from a language model and selecting the best one according to a reward model, the quality of the selected response depends on the distribution of the maximum of $n$ scores. If scores are approximately Gaussian, the improvement from best-of-$n$ grows as $\sqrt{\log n}$ (Gumbel scaling). If scores are fat-tailed with index $\alpha$, the improvement grows as $n^{1/\alpha}$ (Frechet scaling), which is much faster.
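In the idealized case where reward scores are i.i.d. standard Gumbel, the scaling is exact: the maximum of $n$ Gumbel variates is again Gumbel, shifted by $\ln n$, so the expected best-of-$n$ reward is $\ln n + \gamma$. A quick simulation of this assumption (trial counts are arbitrary):

```python
import math
import random

random.seed(5)

gamma = 0.5772156649  # Euler-Mascheroni constant

def gumbel():
    # Inverse-CDF sampling for the standard Gumbel
    return -math.log(-math.log(random.random()))

# Max-stability: the max of n standard Gumbels is Gumbel(log n, 1),
# so the expected best-of-n reward is log n + gamma.
results = {}
for n in (4, 16, 64):
    trials = 20_000
    results[n] = sum(max(gumbel() for _ in range(n)) for _ in range(trials)) / trials
    print(n, round(results[n], 3), round(math.log(n) + gamma, 3))
```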
Flood Modeling and Return Levels
EVT is the standard tool for estimating "100-year floods" from shorter records. Fit a GEV to annual maximum river levels, then extrapolate to the 100-year return level: the level exceeded with probability 1/100 in any given year. The shape parameter $\xi$ determines whether the extrapolation is conservative ($\xi < 0$: return levels approach a finite bound) or aggressive ($\xi > 0$: return levels grow without bound).
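The return-level calculation is just the GEV quantile function. A minimal sketch (the parameter values are illustrative, not fitted to real flood data):

```python
import math

# 100-year return level z_p from GEV(mu, sigma, xi): solve G(z_p) = 1 - p,
#   z_p = mu - (sigma / xi) * (1 - y^(-xi)),  with y = -log(1 - p),  xi != 0,
# and the Gumbel limit z_p = mu - sigma * log(y) at xi = 0.
def return_level(mu, sigma, xi, p):
    y = -math.log(1 - p)
    if abs(xi) < 1e-9:
        return mu - sigma * math.log(y)  # Gumbel limit
    return mu - (sigma / xi) * (1 - y ** (-xi))

# Same location/scale, three shapes: heavier tail -> higher return level.
for xi in (0.2, 0.0, -0.2):
    print(xi, round(return_level(5.0, 1.0, xi, 0.01), 2))
```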
Model Evaluation
The worst-case error of a model across $n$ test inputs is the maximum of $n$ per-input errors. If per-input errors have an exponential-type tail, the worst-case error grows roughly as $\log n$. If errors have a fat tail with index $\alpha$, the worst-case error grows as $n^{1/\alpha}$. This distinction matters for safety-critical applications where worst-case performance is the binding constraint.
Canonical Examples
Maximum of Gaussian samples
Let $X_1, \dots, X_n \sim N(0, 1)$ i.i.d. The maximum $M_n$ satisfies:

$$\frac{M_n - b_n}{a_n} \xrightarrow{d} \text{Gumbel}$$

with $b_n = \sqrt{2\ln n} - \dfrac{\ln \ln n + \ln 4\pi}{2\sqrt{2\ln n}}$ and $a_n = \dfrac{1}{\sqrt{2\ln n}}$.
For $n = 1000$: $b_n \approx 3.12$, $a_n \approx 0.27$. The expected maximum is about 3.24 (roughly $b_n + \gamma a_n$). For $n = 10^5$: $b_n \approx 4.28$, $a_n \approx 0.21$. The expected maximum is about 4.4. The maximum grows as $\sqrt{2\ln n}$: very slowly.
Maximum of Pareto samples
Let $X_1, \dots, X_n$ be i.i.d. Pareto with tail $P(X > x) = x^{-\alpha}$ for $x \ge 1$, with $\alpha = 2$. The maximum satisfies:

$$\frac{M_n}{n^{1/\alpha}} \xrightarrow{d} \text{Frechet}(\alpha)$$

For $n = 100$: the expected maximum is of order $n^{1/2} = 10$. For $n = 10^4$: the expected maximum is of order $n^{1/2} = 100$. The maximum grows as $n^{1/\alpha}$: much faster than the Gaussian case.
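A simulation confirms the Frechet scaling. Because Frechet(2) has infinite variance, the sample mean of the maxima is unstable, so the check below uses the median instead: $M_n / \sqrt{n}$ should have median near the Frechet(2) median $(\ln 2)^{-1/2} \approx 1.20$ (trial counts are arbitrary):

```python
import math
import random
from statistics import median

random.seed(6)

# Pareto tail P(X > x) = x^(-2), sampled by inversion: U^(-1/2) for U ~ Uniform(0,1).
# M_n / sqrt(n) converges in distribution to Frechet(2), whose median is
# (log 2)^(-1/2), roughly 1.20.
trials = 2000
ratios = {}
for n in (100, 1600):
    maxima = [max(random.random() ** -0.5 for _ in range(n)) for _ in range(trials)]
    ratios[n] = median(maxima) / math.sqrt(n)
    print(n, round(ratios[n], 3))
```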
Common Confusions
EVT is not just about catastrophic events
EVT applies to any maximum or minimum, not just disasters. The maximum score in a class of students, the best performance across multiple model training runs, the longest wait time in a queue: these are all problems for EVT. The theory applies whenever you care about the extreme of a collection of random variables, regardless of the domain.
The GEV shape parameter is not the tail index
The GEV shape $\xi$ is the reciprocal of the tail index $\alpha$ for Frechet-type distributions: $\xi = 1/\alpha$. A distribution with tail index $\alpha = 2$ (finite mean, infinite variance) gives $\xi = 1/2$. Do not confuse $\xi$ with $\alpha$. In the EVT literature, larger $\xi > 0$ indicates heavier tails, while in the fat-tails literature, smaller $\alpha$ indicates heavier tails.
Block maxima vs. peaks over threshold
The block maxima method divides the data into blocks (e.g., years) and fits a GEV to the maximum of each block. The peaks-over-threshold (POT) method fits a GPD to all observations exceeding a chosen threshold. POT is generally more efficient (uses more data) but introduces an additional modeling choice (the threshold level). Both methods estimate the same tail behavior.
Exercises
Problem
Let $X_1, \dots, X_n$ be i.i.d. Uniform(0, 1). Find the exact distribution of $M_n = \max(X_1, \dots, X_n)$. Compute $E[M_n]$ and verify that $E[M_n] \to 1$ as $n \to \infty$.
Problem
The Gumbel distribution has CDF $G(x) = \exp(-e^{-x})$. Compute its mean, variance, and median.
Problem
Show that the Pareto distribution with tail $P(X > x) = x^{-\alpha}$ for $x \ge 1$ is in the domain of attraction of the Frechet distribution. Find the normalizing sequences $a_n$ and $b_n$.
Problem
In best-of-$n$ sampling from a language model, suppose the reward scores are i.i.d. with a standard Gumbel distribution (CDF $G(x) = \exp(-e^{-x})$). What is the expected value of the maximum reward $M_n$? How does this scale with $n$?
Problem
Describe how you would use the peaks-over-threshold method to estimate the probability that a model's prediction error exceeds a value never seen in the test set. What are the key practical challenges?
References
Canonical:
- Coles, An Introduction to Statistical Modeling of Extreme Values (2001), Chapters 1-5
- de Haan & Ferreira, Extreme Value Theory: An Introduction (2006), Chapters 1-3
- Embrechts, Kluppelberg, & Mikosch, Modelling Extremal Events (1997), Chapters 3-6
Current:
- Beirlant, Goegebeur, Segers, & Teugels, Statistics of Extremes: Theory and Applications (2004), Chapters 1-5
- Resnick, Heavy-Tail Phenomena: Probabilistic and Statistical Modeling (2007), Chapters 4-5
- Nakagawa, Hashimoto, & Abe, "Best-of-N Jailbreaking" (2024), discusses EVT in the context of LLM sampling
Next Topics
Building on extreme value theory:
- Extreme value theory connects back to fat tails through the Frechet domain of attraction
- Order statistics provide the finite-sample foundation for EVT
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Order Statistics (Layer 1)
- Fat Tails and Heavy-Tailed Distributions (Layer 2)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Law of Large Numbers (Layer 0B)