
Learning Theory Core

Glivenko-Cantelli Theorem

Uniform almost-sure convergence of the empirical CDF to the true CDF — the prototype of every uniform-convergence statement in statistics and learning theory. The classical statement, the DKW exponential rate, the function-class generalization, and the bridge to VC and Rademacher bounds.

Advanced · Tier 2 · Stable · Core spine · ~35 min

Why This Matters

The Glivenko-Cantelli theorem is the prototype of every uniform-convergence statement in statistics and learning theory. The classical version says the empirical CDF converges uniformly (over $x$) to the true CDF, almost surely. Stripped to its essence, this is one function class, the indicator-of-half-line class, for which the empirical mean converges to the population mean uniformly across the class.

Every modern statement in empirical-process theory, Rademacher complexity, VC dimension, and PAC learning is a generalization of this single classical result. Glivenko-Cantelli classes, the function classes for which empirical-mean-to-population-mean uniform convergence holds, are exactly the classes for which a sample-average estimator works without further assumptions. Finite VC dimension implies the Glivenko-Cantelli property; bounded Rademacher complexity gives a quantitative rate. The classical theorem is the logical anchor for the whole framework.

Mental Model

Draw $n$ i.i.d. samples $X_1, \ldots, X_n$ from an unknown distribution with CDF $F$. Build the empirical CDF $F_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[X_i \le x]$. For a fixed $x$, the law of large numbers says $F_n(x) \to F(x)$ almost surely. Glivenko-Cantelli says this convergence is uniform in $x$ — the worst-case gap over all $x$ also goes to zero almost surely.

The same idea generalizes: for any class $\mathcal{F}$ of measurable functions, ask whether $\sup_{f \in \mathcal{F}} |P_n f - P f| \to 0$ almost surely (or in probability), where $P_n f = \frac{1}{n}\sum_i f(X_i)$ is the empirical mean. Classes where this holds are called Glivenko-Cantelli classes, and they are exactly the classes one can hope to do statistics on with a finite sample.
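A minimal simulation makes the distinction between pointwise and uniform convergence concrete. The Uniform[0,1] choice (so $F(x) = x$) and the grid resolution are illustrative assumptions, not part of the theorem:

```python
# Pointwise gap at a fixed x versus worst-case gap over a grid of x-values,
# for Uniform[0,1] samples (F(x) = x on [0, 1]).
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 2001)          # fine grid approximating the sup over x

for n in [100, 1_000, 10_000, 100_000]:
    x = np.sort(rng.uniform(size=n))
    F_n = np.searchsorted(x, grid, side="right") / n   # empirical CDF on the grid
    gap_fixed = abs(F_n[1000] - 0.5)                   # gap at the single point x = 0.5
    gap_sup = np.abs(F_n - grid).max()                 # approximate sup_x |F_n(x) - F(x)|
    print(f"n={n:>6}  pointwise={gap_fixed:.4f}  sup={gap_sup:.4f}")
# Both columns shrink with n; the sup column is larger but decays at the same
# 1/sqrt(n) rate, as DKW (below) predicts.
```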

Formal Setup

Let $X_1, \ldots, X_n$ be i.i.d. samples from a probability distribution $P$ on $\mathbb{R}$ with CDF $F$. The empirical CDF is

$$F_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[X_i \le x], \qquad x \in \mathbb{R}.$$

The supremum-norm distance between empirical and true CDF is

$$\|F_n - F\|_\infty = \sup_{x \in \mathbb{R}} |F_n(x) - F(x)|.$$

This is the random variable Glivenko-Cantelli controls.
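For continuous $F$, this supremum can be computed exactly rather than on a grid: $F_n$ is a right-continuous staircase with jumps of $1/n$ at the order statistics, so the supremum is attained there. A sketch (the helper name `sup_distance` and the normal example are our choices):

```python
# Exact ||F_n - F||_inf for a continuous CDF F, via the order statistics:
# sup_x |F_n(x) - F(x)| = max_i max( i/n - F(x_(i)), F(x_(i)) - (i-1)/n ).
import numpy as np
from scipy.stats import norm

def sup_distance(x: np.ndarray, cdf) -> float:
    n = len(x)
    u = cdf(np.sort(x))                      # F evaluated at the order statistics
    i = np.arange(1, n + 1)
    return max(np.max(i / n - u), np.max(u - (i - 1) / n))

rng = np.random.default_rng(1)
print(sup_distance(rng.standard_normal(5_000), norm.cdf))   # typically around 0.01
```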

The Classical Theorem

Theorem

Glivenko-Cantelli Theorem

Statement

$$\|F_n - F\|_\infty = \sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \xrightarrow{\text{a.s.}} 0 \qquad \text{as } n \to \infty.$$

The convergence is uniform in $x$ and almost sure (not just in probability).

Intuition

For each fixed $x$, $F_n(x)$ is an empirical probability and converges to $F(x)$ by the strong law. The Glivenko-Cantelli step upgrades this pointwise convergence to uniform convergence by exploiting the monotonicity of CDFs: a finite grid of $x$-values controls the supremum, because between adjacent grid points $F$ increases by at most a prescribed $\epsilon$.

Proof Sketch

Fix $\epsilon > 0$. Pick a finite grid $-\infty = t_0 < t_1 < \cdots < t_K = \infty$ of approximate quantiles such that $F(t_k^-) - F(t_{k-1}) \le \epsilon$ for each $k$ (always possible since $F$ is monotone with range in $[0, 1]$; $K \approx 1/\epsilon$ points suffice). For $x \in [t_{k-1}, t_k)$, monotonicity of both $F_n$ and $F$ gives $F_n(x) - F(x) \le F_n(t_k^-) - F(t_{k-1}) \le [F_n(t_k^-) - F(t_k^-)] + \epsilon$, and symmetrically $F_n(x) - F(x) \ge F_n(t_{k-1}) - F(t_k^-) \ge [F_n(t_{k-1}) - F(t_{k-1})] - \epsilon$. The strong law applied at the finitely many grid points and their left limits (a finite union of measure-zero exceptional events is still measure zero) gives almost-sure convergence there; combined with the grid bound, $\sup_x |F_n(x) - F(x)| \le 2\epsilon$ eventually, almost surely, for every $\epsilon > 0$. Send $\epsilon \to 0$ along a countable sequence to conclude.
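The grid step is easy to check numerically. The sketch below assumes Uniform[0,1] data, so the $\epsilon$-quantile grid is simply $t_k = k\epsilon$, and verifies the proof's inequality that the full supremum is at most the worst grid-point deviation plus $\epsilon$:

```python
# Numerical check of the finite-grid argument for Uniform[0,1] data.
import numpy as np

rng = np.random.default_rng(2)
n, eps = 5_000, 0.05
x = np.sort(rng.uniform(size=n))

t = np.linspace(0.0, 1.0, int(1 / eps) + 1)         # quantile grid: F(t_k) - F(t_{k-1}) = eps
F_n_t = np.searchsorted(x, t, side="right") / n     # empirical CDF at the grid points
grid_dev = np.abs(F_n_t - t).max()                  # deviation at grid points only

i = np.arange(1, n + 1)                             # exact sup via order statistics
full_sup = max(np.max(i / n - x), np.max(x - (i - 1) / n))

print(grid_dev, full_sup, full_sup <= grid_dev + eps)   # final entry is always True
```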

Why It Matters

This is the single most cited result in empirical-process theory. Any time a paper says "by uniform convergence" or "by Glivenko-Cantelli," this is the result being appealed to (or its function-class generalization below). It is the reason the empirical CDF, the bootstrap, the Kolmogorov-Smirnov test, and every nonparametric estimator built on the empirical distribution function have any hope of working at all.

Failure Mode

The classical theorem requires i.i.d. sampling and a real-valued $X$ with a CDF. For multivariate $X$, the corresponding statement holds for the empirical measure on rectangular cells (or more generally, on any VC class of sets), but the $\sup$ over arbitrary measurable sets fails: the class of all Borel sets is too rich for any finite-sample uniform convergence to hold.
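A two-line demonstration of the failure, assuming any continuous sampling distribution: the indicator of the observed sample itself witnesses a uniform-convergence gap of 1 at every sample size.

```python
# Over the class of ALL measurable sets, sup |P_n(A) - P(A)| never shrinks:
# take A = {X_1, ..., X_n}. Then P_n(A) = 1 while P(A) = 0 (finite sets are
# null under a continuous distribution).
import numpy as np

rng = np.random.default_rng(3)
for n in [10, 1_000, 100_000]:
    x = rng.uniform(size=n)
    indicator = lambda t: np.isin(t, x).astype(float)   # indicator of the sample set
    print(n, indicator(x).mean() - 0.0)                 # gap stuck at 1.0 for every n
```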

DKW: An Exponential Rate

Glivenko-Cantelli says the supremum goes to zero, but it does not give a rate. The Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, in its sharp form due to Massart (1990), supplies one.

Theorem

Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality

Statement

For every $n \ge 1$ and every $\epsilon > 0$,

$$\mathbb{P}\bigl(\|F_n - F\|_\infty > \epsilon\bigr) \le 2 e^{-2 n \epsilon^2}.$$

The constant 2 in the exponent and the leading factor of 2 are both sharp (Massart 1990).

Intuition

This is a sub-Gaussian tail bound on the $L_\infty$ distance between empirical and true CDF, identical in shape to Hoeffding's inequality for a single bounded random variable. The reason a single sub-Gaussian rate suffices for the supremum (not just a fixed point) is that the indicator-of-half-line class is combinatorially trivial (VC dimension 1), so the supremum behaves no worse than a single Hoeffding tail with a small constant adjustment.

Proof Sketch

Massart's argument first reduces to the uniform case: for continuous $F$, the probability-integral transform $U_i = F(X_i)$ makes $\|F_n - F\|_\infty$ distribution-free, and the general case follows by approximation. He then carries out an exact, delicate analysis of the one-sided statistic $\sup_x (F_n(x) - F(x))$, obtaining the one-sided tail $e^{-2n\epsilon^2}$ (for $\epsilon$ above a small threshold); the two-sided bound follows by combining the two one-sided tails. The contribution of the 1990 paper is removing the inflated multiplicative constant of the original 1956 proof.

Why It Matters

DKW is what powers confidence bands for the empirical CDF. With probability at least $1 - \delta$, $\|F_n - F\|_\infty \le \sqrt{\log(2/\delta)/(2n)}$. This is the basis of the Kolmogorov-Smirnov goodness-of-fit test and of nonparametric confidence bands in survival analysis, change-point detection, and reliability theory.
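A sketch of the band construction; the helper names `dkw_halfwidth` and `dkw_band` are ours, not a library API:

```python
# DKW confidence band: with probability >= 1 - delta, F lies within
# eps(n, delta) of F_n simultaneously at every x.
import numpy as np

def dkw_halfwidth(n: int, delta: float) -> float:
    """Solve 2 exp(-2 n eps^2) = delta for eps."""
    return np.sqrt(np.log(2.0 / delta) / (2.0 * n))

def dkw_band(samples: np.ndarray, grid: np.ndarray, delta: float = 0.05):
    """Lower and upper envelopes for F on the grid, clipped to [0, 1]."""
    F_n = np.searchsorted(np.sort(samples), grid, side="right") / len(samples)
    eps = dkw_halfwidth(len(samples), delta)
    return np.clip(F_n - eps, 0.0, 1.0), np.clip(F_n + eps, 0.0, 1.0)

print(dkw_halfwidth(1_000, 0.05))   # ~0.0429, matching the worked example below
```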

Failure Mode

DKW gives a sub-Gaussian rate that holds uniformly across all CDFs $F$. Sharper distribution-specific bounds (e.g. for continuous $F$ via the probability-integral transform) sometimes give better constants, but the DKW form is the right default in practice.

Function-Class Generalization

The classical Glivenko-Cantelli theorem is the special case of a much more general statement.

Definition

Glivenko-Cantelli Class

A class $\mathcal{F}$ of measurable functions is Glivenko-Cantelli with respect to $P$ if

$$\sup_{f \in \mathcal{F}} |P_n f - P f| \xrightarrow{\text{a.s.}} 0,$$

where $P_n f = \frac{1}{n}\sum_{i=1}^n f(X_i)$ and $P f = \mathbb{E}_P[f(X)]$. The class is uniform Glivenko-Cantelli if the convergence holds uniformly over $P$ in some collection of distributions.

The classical theorem is the case $\mathcal{F} = \{\mathbf{1}[\cdot \le x] : x \in \mathbb{R}\}$, the indicator-of-half-line class.

The cleanest sufficient condition for the Glivenko-Cantelli property is finite VC dimension. The corresponding rate is supplied by Rademacher complexity.

| Sufficient condition | Strength of conclusion |
| --- | --- |
| Finite VC dimension | Uniform Glivenko-Cantelli over all $P$ (distribution-free) |
| Rademacher complexity $R_n(\mathcal{F}) \to 0$ | Quantitative rate $O(R_n(\mathcal{F}))$ on the sup-deviation |
| Finite $L_1(P)$ bracketing numbers at every $\epsilon > 0$ | Glivenko-Cantelli (van der Vaart 1998, Theorem 19.4); permits unbounded $\mathcal{F}$ with an integrable envelope |
| Uniform entropy (empirical covering-number) condition | Glivenko-Cantelli via symmetrization (Pollard 1984) |

Each of these is a different route to the same destination. VC dimension is the combinatorial route (used in classical PAC learning); Rademacher complexity is the data-dependent route (used in modern margin-based bounds); bracketing numbers are the empirical-process route (used in semiparametric statistics and M-estimation).

Connection to Uniform Convergence and ERM

The link to learning theory is direct. For binary classification with zero-one loss, the loss class $\{(x, y) \mapsto \mathbf{1}[h(x) \ne y] : h \in \mathcal{H}\}$ is Glivenko-Cantelli for every distribution iff ERM generalizes (in the population sense) for every distribution. Finite VC dimension of $\mathcal{H}$ is equivalent to this property: the Vapnik-Chervonenkis route to the Fundamental Theorem of Statistical Learning.

For real-valued losses and bounded function classes, the analogous statement uses Rademacher complexity. The empirical Rademacher complexity, computed on the actual sample, gives a data-dependent generalization bound that can be tighter than the worst-case VC bound when the realized data lies in a benign region.
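For the half-line class itself, the empirical Rademacher complexity is cheap to estimate by Monte Carlo, because the supremum over thresholds reduces to a maximum over prefix sums of the random signs along the sorted sample. A sketch (the function name and size choices are ours):

```python
# Empirical Rademacher complexity of {1[. <= t] : t} on a fixed sample:
# R_hat = E_sigma sup_t | (1/n) sum_i sigma_i 1[X_i <= t] |.
import numpy as np

def rademacher_halflines(x: np.ndarray, n_rounds: int = 2_000, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    n = len(x)
    order = np.argsort(x)
    total = 0.0
    for _ in range(n_rounds):
        sigma = rng.choice([-1.0, 1.0], size=n)
        prefix = np.cumsum(sigma[order]) / n      # partial sums over {i : X_i <= threshold}
        total += max(np.abs(prefix).max(), 0.0)   # include the empty threshold (value 0)
    return total / n_rounds

x = np.random.default_rng(4).uniform(size=500)
print(rademacher_halflines(x))   # small, shrinking roughly like sqrt(log n / n)
```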

This is why the Glivenko-Cantelli theorem is treated as the prototype: the qualitative statement (uniform convergence) is the classical content, and the modern theory upgrades the qualitative statement to a quantitative one with explicit rate constants.

Worked Example: Empirical CDF on Uniform[0, 1]

Take $X_1, \ldots, X_n \sim \mathrm{Uniform}[0, 1]$, so $F(x) = x$ for $x \in [0, 1]$. The empirical CDF is the staircase function with jumps at the order statistics. By DKW with $\delta = 0.05$:

$$\|F_n - F\|_\infty \le \sqrt{\log(40)/(2n)} \approx 1.358/\sqrt{n}$$

with probability at least $0.95$. Concrete numbers:

| $n$ | DKW $L_\infty$ bound | Pointwise std error $\sqrt{F(x)(1-F(x))/n}$ at $x = 0.5$ |
| --- | --- | --- |
| 100 | 0.136 | 0.050 |
| 1000 | 0.043 | 0.016 |
| 10000 | 0.014 | 0.005 |

The DKW bound is roughly $2.7\times$ the pointwise standard error at $x = 0.5$ for every $n$ — that is the price of demanding uniform control over the entire $x$-axis instead of pointwise control at a single $x$. The ratio is exactly $\sqrt{\log(2/\delta)/2} / \sqrt{F(1-F)} = 1.358/0.5 \approx 2.72$ at $\delta = 0.05$, independent of $n$ because both quantities scale as $1/\sqrt{n}$.
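The table is straightforward to reproduce:

```python
# DKW half-width versus the pointwise standard error at x = 0.5.
import numpy as np

delta = 0.05
for n in [100, 1_000, 10_000]:
    dkw = np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    pointwise = np.sqrt(0.25 / n)                # sqrt(F(1-F)/n) at F = 0.5
    print(f"n={n:>6}  DKW={dkw:.3f}  pointwise={pointwise:.3f}  ratio={dkw/pointwise:.2f}")
# The ratio is 2.72 for every n.
```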

Common Confusions

Watch Out

Pointwise convergence is not uniform convergence

For each fixed $x$, $F_n(x) \to F(x)$ a.s. by the strong law of large numbers, but this is pointwise convergence and does not by itself imply $\sup_x |F_n(x) - F(x)| \to 0$. Glivenko-Cantelli is the upgrade, and it leans on the monotonicity and boundedness of CDFs; without some such structural property, pointwise convergence does not imply uniform convergence.

Watch Out

DKW matches the single-point Hoeffding rate, despite being uniform

The exponential rate in DKW is $\exp(-2n\epsilon^2)$, identical in form to Hoeffding's bound for a single bounded variable. This is not a coincidence: the indicator-of-half-line class has VC dimension 1, so the supremum behaves like a single bounded random variable up to a small constant. For richer classes, the classical VC argument via Sauer-Shelah gives a tail of the form $C(2n)^d e^{-cn\epsilon^2}$ for VC dimension $d$, so the guaranteed uniform deviation grows to order $\sqrt{d \log n / n}$.

Watch Out

Glivenko-Cantelli does not say anything about distributional convergence

The Glivenko-Cantelli theorem says $\|F_n - F\|_\infty \to 0$ a.s. — it controls the $L_\infty$ distance pathwise but says nothing about the distribution of $\sqrt{n}(F_n - F)$ as a process. The distributional statement is Donsker's theorem: $\sqrt{n}(F_n - F)$ converges weakly to a Brownian bridge (time-changed by $F$). Donsker is strictly stronger than Glivenko-Cantelli; a standard sufficient condition is convergence of the bracketing-entropy integral, which is automatic for the indicator-of-half-line class but can fail for richer classes.
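The Donsker-level statement can be seen numerically: $\sqrt{n}\,\|F_n - F\|_\infty$ approaches the Kolmogorov distribution (the sup of the absolute Brownian bridge), which SciPy exposes as `kstwobign`. A sketch with arbitrarily chosen simulation sizes:

```python
# Empirical quantiles of sqrt(n) * ||F_n - F||_inf versus the Brownian-bridge limit.
import numpy as np
from scipy.stats import kstwobign

rng = np.random.default_rng(5)
n, reps = 2_000, 4_000
stats = np.empty(reps)
for r in range(reps):
    u = np.sort(rng.uniform(size=n))                 # Uniform[0,1], so F(x) = x
    i = np.arange(1, n + 1)
    stats[r] = np.sqrt(n) * max(np.max(i / n - u), np.max(u - (i - 1) / n))

for q in [0.5, 0.9, 0.95]:
    print(q, np.quantile(stats, q), kstwobign.ppf(q))
# The 0.95 quantile is ~1.358 -- the same constant as in the worked example above.
```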

Exercises

ExerciseCore

Problem

Use DKW to give a $1 - \delta$ confidence band for the empirical CDF based on $n = 200$ samples with $\delta = 0.05$. State the explicit value of the band half-width.

ExerciseAdvanced

Problem

Show that the class $\mathcal{F} = \{\mathbf{1}[\cdot \le x] : x \in \mathbb{R}\}$ has VC dimension exactly 1: exhibit a one-point set it shatters, and show that no two-point set can be shattered.

References

Canonical:

  • Glivenko, V. (1933). Sulla determinazione empirica delle leggi di probabilità. Giornale dell'Istituto Italiano degli Attuari, 4, 92-99. The Italian half of the original theorem; covers the case of continuous $F$.
  • Cantelli, F. P. (1933). Sulla determinazione empirica delle leggi di probabilità. Giornale dell'Istituto Italiano degli Attuari, 4, 421-424. Cantelli's complement covering the general case.
  • Dvoretzky, A., Kiefer, J., & Wolfowitz, J. (1956). Asymptotic minimax character of the sample distribution function. Annals of Mathematical Statistics, 27(3), 642-669. The original DKW with non-sharp constants.
  • Massart, P. (1990). The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Annals of Probability, 18(3), 1269-1283. The factor-of-2 sharp version that everyone now uses.

Current:

  • van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press. Chapter 19 covers Glivenko-Cantelli classes and the bracketing-entropy route in modern empirical-process language.
  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Chapters 4-5 give the modern non-asymptotic treatment with VC and Rademacher rates.
  • Pollard, D. (1984). Convergence of Stochastic Processes. Springer. The classical empirical-process textbook; chapters 2-3 give the function-class generalization with all the symmetrization machinery.
  • Vershynin, R. (2018). High-Dimensional Probability. Cambridge University Press. Chapter 8 connects DKW to sub-Gaussian concentration and Rademacher tails.

Next Topics

  • Empirical processes and chaining: the function-class generalization with bracketing entropy, the Talagrand chaining argument, and Donsker's theorem.
  • Rademacher complexity: the data-dependent quantity that gives the rate at which Glivenko-Cantelli convergence happens.
  • VC dimension: the combinatorial sufficient condition for the Glivenko-Cantelli property.
  • Uniform convergence: the learning-theory framing in which Glivenko-Cantelli is the prototype.

Last reviewed: May 4, 2026
