
Learning Theory Core

Glivenko-Cantelli Theorem

Uniform almost-sure convergence of the empirical CDF to the true CDF — the prototype of every uniform-convergence statement in statistics and learning theory. The classical statement, the DKW exponential rate, the function-class generalization, and the bridge to VC and Rademacher bounds.

Advanced · Tier 2 · Stable · Core spine · ~35 min

Why This Matters

The Glivenko-Cantelli theorem is the prototype of every uniform-convergence statement in statistics and learning theory. The classical version says the empirical CDF converges uniformly (over $x$) to the true CDF, almost surely. Stripped to its essence, this is one function class, the indicator-of-half-line class, for which the empirical mean converges to the population mean uniformly across the class.

Every modern statement in empirical-process theory, Rademacher complexity, VC dimension, and PAC learning is a generalization of this single classical result. Glivenko-Cantelli classes, the function classes for which empirical-mean-to-population-mean uniform convergence holds, are exactly the classes for which a sample-average estimator works without further assumptions. Finite VC dimension implies the Glivenko-Cantelli property; bounded Rademacher complexity gives a quantitative rate. The classical theorem is the logical anchor for the whole framework.

Mental Model

Draw $n$ i.i.d. samples $X_1, \ldots, X_n$ from an unknown distribution with CDF $F$. Build the empirical CDF $F_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[X_i \le x]$. For a fixed $x$, the law of large numbers says $F_n(x) \to F(x)$ almost surely. Glivenko-Cantelli says this convergence is uniform in $x$ — the worst-case gap over all $x$ also goes to zero almost surely.

The same idea generalizes: for any class $\mathcal{F}$ of measurable functions, ask whether $\sup_{f \in \mathcal{F}} |P_n f - P f| \to 0$ almost surely (or in probability), where $P_n f = \frac{1}{n}\sum_i f(X_i)$ is the empirical mean. Classes where this holds are called Glivenko-Cantelli classes, and they are exactly the classes one can hope to do statistics on with a finite sample.
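A minimal simulation makes the distinction between pointwise and uniform convergence concrete. The Uniform[0,1] choice (so $F(x) = x$) and the grid resolution are illustrative assumptions, not part of the theorem:

```python
# Pointwise gap at a fixed x versus worst-case gap over a grid of x-values,
# for Uniform[0,1] samples (F(x) = x on [0, 1]).
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 2001)          # fine grid approximating the sup over x

for n in [100, 1_000, 10_000, 100_000]:
    x = np.sort(rng.uniform(size=n))
    F_n = np.searchsorted(x, grid, side="right") / n   # empirical CDF on the grid
    gap_fixed = abs(F_n[1000] - 0.5)                   # gap at the single point x = 0.5
    gap_sup = np.abs(F_n - grid).max()                 # approximate sup_x |F_n(x) - F(x)|
    print(f"n={n:>6}  pointwise={gap_fixed:.4f}  sup={gap_sup:.4f}")
# Both columns shrink with n; the sup column is larger but decays at the same
# 1/sqrt(n) rate, as DKW (below) predicts.
```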

Formal Setup

Let $X_1, \ldots, X_n$ be i.i.d. samples from a probability distribution $P$ on $\mathbb{R}$ with CDF $F$. The empirical CDF is

$$F_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[X_i \le x], \qquad x \in \mathbb{R}.$$

The supremum-norm distance between empirical and true CDF is

$$\|F_n - F\|_\infty = \sup_{x \in \mathbb{R}} |F_n(x) - F(x)|.$$

This is the random variable Glivenko-Cantelli controls.
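For continuous $F$, this supremum can be computed exactly rather than on a grid: $F_n$ is a right-continuous staircase with jumps of $1/n$ at the order statistics, so the supremum is attained there. A sketch (the helper name `sup_distance` and the normal example are our choices):

```python
# Exact ||F_n - F||_inf for a continuous CDF F, via the order statistics:
# sup_x |F_n(x) - F(x)| = max_i max( i/n - F(x_(i)), F(x_(i)) - (i-1)/n ).
import numpy as np
from scipy.stats import norm

def sup_distance(x: np.ndarray, cdf) -> float:
    n = len(x)
    u = cdf(np.sort(x))                      # F evaluated at the order statistics
    i = np.arange(1, n + 1)
    return max(np.max(i / n - u), np.max(u - (i - 1) / n))

rng = np.random.default_rng(1)
print(sup_distance(rng.standard_normal(5_000), norm.cdf))   # typically around 0.01
```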

The Classical Theorem

Theorem

Glivenko-Cantelli Theorem

Statement

$$\|F_n - F\|_\infty = \sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \xrightarrow{\text{a.s.}} 0 \qquad \text{as } n \to \infty.$$

The convergence is uniform in $x$ and almost sure (not just in probability).

Intuition

For each fixed $x$, $F_n(x)$ is an empirical probability and converges to $F(x)$ by the strong law. The Glivenko-Cantelli step upgrades this pointwise convergence to uniform convergence by exploiting the monotonicity of CDFs: a finite grid of $x$-values controls the supremum, because between adjacent grid points $F$ increases by at most a prescribed $\epsilon$.

Proof Sketch

Fix $\epsilon > 0$. Pick a finite grid $-\infty = t_0 < t_1 < \cdots < t_K = \infty$ of approximate quantiles such that $F(t_k^-) - F(t_{k-1}) \le \epsilon$ for each $k$ (always possible since $F$ is monotone with range in $[0, 1]$; $K \approx 1/\epsilon$ points suffice). For $x \in [t_{k-1}, t_k)$, monotonicity of both $F_n$ and $F$ gives $F_n(x) - F(x) \le F_n(t_k^-) - F(t_{k-1}) \le [F_n(t_k^-) - F(t_k^-)] + \epsilon$, and symmetrically $F_n(x) - F(x) \ge F_n(t_{k-1}) - F(t_k^-) \ge [F_n(t_{k-1}) - F(t_{k-1})] - \epsilon$. The strong law applied at the finitely many grid points and their left limits (a finite union of measure-zero exceptional events is still measure zero) gives almost-sure convergence there; combined with the grid bound, $\sup_x |F_n(x) - F(x)| \le 2\epsilon$ eventually, almost surely, for every $\epsilon > 0$. Send $\epsilon \to 0$ along a countable sequence to conclude.
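The grid step is easy to check numerically. The sketch below assumes Uniform[0,1] data, so the $\epsilon$-quantile grid is simply $t_k = k\epsilon$, and verifies the proof's inequality that the full supremum is at most the worst grid-point deviation plus $\epsilon$:

```python
# Numerical check of the finite-grid argument for Uniform[0,1] data.
import numpy as np

rng = np.random.default_rng(2)
n, eps = 5_000, 0.05
x = np.sort(rng.uniform(size=n))

t = np.linspace(0.0, 1.0, int(1 / eps) + 1)         # quantile grid: F(t_k) - F(t_{k-1}) = eps
F_n_t = np.searchsorted(x, t, side="right") / n     # empirical CDF at the grid points
grid_dev = np.abs(F_n_t - t).max()                  # deviation at grid points only

i = np.arange(1, n + 1)                             # exact sup via order statistics
full_sup = max(np.max(i / n - x), np.max(x - (i - 1) / n))

print(grid_dev, full_sup, full_sup <= grid_dev + eps)   # final entry is always True
```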

Why It Matters

This is the single most cited result in empirical-process theory. Any time a paper says "by uniform convergence" or "by Glivenko-Cantelli," this is the result being appealed to (or its function-class generalization below). It is the reason the empirical CDF, the bootstrap, the Kolmogorov-Smirnov test, and every nonparametric estimator built on the empirical distribution function have any hope of working at all.

Failure Mode

The classical theorem requires i.i.d. sampling and a real-valued $X$ with a CDF. For multivariate $X$, the corresponding statement holds for the empirical measure on rectangular cells (or more generally, on any VC class of sets), but the $\sup$ over arbitrary measurable sets fails: the class of all Borel sets is too rich for any finite-sample uniform convergence to hold.
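A two-line demonstration of the failure, assuming any continuous sampling distribution: the indicator of the observed sample itself witnesses a uniform-convergence gap of 1 at every sample size.

```python
# Over the class of ALL measurable sets, sup |P_n(A) - P(A)| never shrinks:
# take A = {X_1, ..., X_n}. Then P_n(A) = 1 while P(A) = 0 (finite sets are
# null under a continuous distribution).
import numpy as np

rng = np.random.default_rng(3)
for n in [10, 1_000, 100_000]:
    x = rng.uniform(size=n)
    indicator = lambda t: np.isin(t, x).astype(float)   # indicator of the sample set
    print(n, indicator(x).mean() - 0.0)                 # gap stuck at 1.0 for every n
```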

DKW: An Exponential Rate

Glivenko-Cantelli says the supremum goes to zero, but it does not give a rate. The Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, in its sharp form due to Massart (1990), supplies one.

Theorem

Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality

Statement

For every $n \ge 1$ and every $\epsilon > 0$,

$$\mathbb{P}\bigl(\|F_n - F\|_\infty > \epsilon\bigr) \le 2 e^{-2 n \epsilon^2}.$$

The constant 2 in the exponent and the leading factor of 2 are both sharp (Massart 1990).

Intuition

This is a sub-Gaussian tail bound on the $L_\infty$ distance between empirical and true CDF, identical in shape to Hoeffding's inequality for a single bounded random variable. The reason a single sub-Gaussian rate suffices for the supremum (not just a fixed point) is that the indicator-of-half-line class is combinatorially trivial (VC dimension 1), so the supremum behaves no worse than a single Hoeffding tail with a small constant adjustment.

Proof Sketch

Massart's argument first reduces to the uniform case: for continuous $F$, the probability-integral transform $U_i = F(X_i)$ makes $\|F_n - F\|_\infty$ distribution-free, and the general case follows by approximation. He then carries out an exact, delicate analysis of the one-sided statistic $\sup_x (F_n(x) - F(x))$, obtaining the one-sided tail $e^{-2n\epsilon^2}$ (for $\epsilon$ above a small threshold); the two-sided bound follows by combining the two one-sided tails. The contribution of the 1990 paper is removing the inflated multiplicative constant of the original 1956 proof.

Why It Matters

DKW is what powers confidence bands for the empirical CDF. With probability at least $1 - \delta$, $\|F_n - F\|_\infty \le \sqrt{\log(2/\delta)/(2n)}$. This is the basis of the Kolmogorov-Smirnov goodness-of-fit test and of nonparametric confidence bands in survival analysis, change-point detection, and reliability theory.
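A sketch of the band construction; the helper names `dkw_halfwidth` and `dkw_band` are ours, not a library API:

```python
# DKW confidence band: with probability >= 1 - delta, F lies within
# eps(n, delta) of F_n simultaneously at every x.
import numpy as np

def dkw_halfwidth(n: int, delta: float) -> float:
    """Solve 2 exp(-2 n eps^2) = delta for eps."""
    return np.sqrt(np.log(2.0 / delta) / (2.0 * n))

def dkw_band(samples: np.ndarray, grid: np.ndarray, delta: float = 0.05):
    """Lower and upper envelopes for F on the grid, clipped to [0, 1]."""
    F_n = np.searchsorted(np.sort(samples), grid, side="right") / len(samples)
    eps = dkw_halfwidth(len(samples), delta)
    return np.clip(F_n - eps, 0.0, 1.0), np.clip(F_n + eps, 0.0, 1.0)

print(dkw_halfwidth(1_000, 0.05))   # ~0.0429, matching the worked example below
```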

Failure Mode

DKW gives a sub-Gaussian rate that holds uniformly across all CDFs $F$. Sharper distribution-specific bounds (e.g. for continuous $F$ via the probability-integral transform) sometimes give better constants, but the DKW form is the right default in practice.

Function-Class Generalization

The classical Glivenko-Cantelli theorem is the special case of a much more general statement.

Definition

Glivenko-Cantelli Class

A class $\mathcal{F}$ of measurable functions is Glivenko-Cantelli with respect to $P$ if

$$\sup_{f \in \mathcal{F}} |P_n f - P f| \xrightarrow{\text{a.s.}} 0,$$

where $P_n f = \frac{1}{n}\sum_{i=1}^n f(X_i)$ and $P f = \mathbb{E}_P[f(X)]$. The class is uniform Glivenko-Cantelli if the convergence holds uniformly over $P$ in some collection of distributions.

The classical theorem is the case $\mathcal{F} = \{\mathbf{1}[\cdot \le x] : x \in \mathbb{R}\}$, the indicator-of-half-line class.

The cleanest sufficient condition for the Glivenko-Cantelli property is finite VC dimension. The corresponding rate is supplied by Rademacher complexity.

| Sufficient condition | Strength of conclusion |
| --- | --- |
| Finite VC dimension | Uniform Glivenko-Cantelli over all $P$ (distribution-free) |
| Rademacher complexity $R_n(\mathcal{F}) \to 0$ | Quantitative rate $O(R_n(\mathcal{F}))$ on the sup-deviation |
| Finite $L_1(P)$ bracketing numbers at every $\epsilon > 0$ | Glivenko-Cantelli (van der Vaart 1998, Theorem 19.4); permits unbounded $\mathcal{F}$ with an integrable envelope |
| Uniform entropy (empirical covering-number) condition | Glivenko-Cantelli via symmetrization (Pollard 1984) |

Each of these is a different route to the same destination. VC dimension is the combinatorial route (used in classical PAC learning); Rademacher complexity is the data-dependent route (used in modern margin-based bounds); bracketing numbers are the empirical-process route (used in semiparametric statistics and M-estimation).

Connection to Uniform Convergence and ERM

The link to learning theory is direct. For binary classification with zero-one loss, the loss class $\{(x, y) \mapsto \mathbf{1}[h(x) \ne y] : h \in \mathcal{H}\}$ is Glivenko-Cantelli for every distribution iff ERM generalizes (in the population sense) for every distribution. Finite VC dimension of $\mathcal{H}$ is equivalent to this property: the Vapnik-Chervonenkis route to the Fundamental Theorem of Statistical Learning.

For real-valued losses and bounded function classes, the analogous statement uses Rademacher complexity. The empirical Rademacher complexity, computed on the actual sample, gives a data-dependent generalization bound that can be tighter than the worst-case VC bound when the realized data lies in a benign region.
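For the half-line class itself, the empirical Rademacher complexity is cheap to estimate by Monte Carlo, because the supremum over thresholds reduces to a maximum over prefix sums of the random signs along the sorted sample. A sketch (the function name and size choices are ours):

```python
# Empirical Rademacher complexity of {1[. <= t] : t} on a fixed sample:
# R_hat = E_sigma sup_t | (1/n) sum_i sigma_i 1[X_i <= t] |.
import numpy as np

def rademacher_halflines(x: np.ndarray, n_rounds: int = 2_000, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    n = len(x)
    order = np.argsort(x)
    total = 0.0
    for _ in range(n_rounds):
        sigma = rng.choice([-1.0, 1.0], size=n)
        prefix = np.cumsum(sigma[order]) / n      # partial sums over {i : X_i <= threshold}
        total += max(np.abs(prefix).max(), 0.0)   # include the empty threshold (value 0)
    return total / n_rounds

x = np.random.default_rng(4).uniform(size=500)
print(rademacher_halflines(x))   # small, shrinking roughly like sqrt(log n / n)
```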

This is why the Glivenko-Cantelli theorem is treated as the prototype: the qualitative statement (uniform convergence) is the classical content, and the modern theory upgrades the qualitative statement to a quantitative one with explicit rate constants.

Worked Example: Empirical CDF on Uniform[0, 1]

Take $X_1, \ldots, X_n \sim \mathrm{Uniform}[0, 1]$, so $F(x) = x$ for $x \in [0, 1]$. The empirical CDF is the staircase function with jumps at the order statistics. By DKW with $\delta = 0.05$:

$$\|F_n - F\|_\infty \le \sqrt{\log(40)/(2n)} \approx 1.358/\sqrt{n}$$

with probability at least $0.95$. Concrete numbers:

| $n$ | DKW $L_\infty$ bound | Pointwise std error $\sqrt{F(x)(1-F(x))/n}$ at $x = 0.5$ |
| --- | --- | --- |
| 100 | 0.136 | 0.050 |
| 1000 | 0.043 | 0.016 |
| 10000 | 0.014 | 0.005 |

The DKW bound is roughly $2.7\times$ the pointwise standard error at $x = 0.5$ for every $n$ — that is the price of demanding uniform control over the entire $x$-axis instead of pointwise control at a single $x$. The ratio is exactly $\sqrt{\log(2/\delta)/2} / \sqrt{F(1-F)} = 1.358/0.5 \approx 2.72$ at $\delta = 0.05$, independent of $n$ because both quantities scale as $1/\sqrt{n}$.
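The table is straightforward to reproduce:

```python
# DKW half-width versus the pointwise standard error at x = 0.5.
import numpy as np

delta = 0.05
for n in [100, 1_000, 10_000]:
    dkw = np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    pointwise = np.sqrt(0.25 / n)                # sqrt(F(1-F)/n) at F = 0.5
    print(f"n={n:>6}  DKW={dkw:.3f}  pointwise={pointwise:.3f}  ratio={dkw/pointwise:.2f}")
# The ratio is 2.72 for every n.
```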

Common Confusions

Watch Out

Pointwise convergence is not uniform convergence

For each fixed $x$, $F_n(x) \to F(x)$ a.s. by the strong law of large numbers, but this is pointwise convergence and does not by itself imply $\sup_x |F_n(x) - F(x)| \to 0$. Glivenko-Cantelli is the upgrade, and it leans on the monotonicity and boundedness of CDFs; without some such structural property, pointwise convergence does not imply uniform convergence.

Watch Out

DKW matches the single-point Hoeffding rate, despite being uniform

The exponential rate in DKW is $\exp(-2n\epsilon^2)$, identical in form to Hoeffding's bound for a single bounded variable. This is not a coincidence: the indicator-of-half-line class has VC dimension 1, so the supremum behaves like a single bounded random variable up to a small constant. For richer classes, the classical VC argument via Sauer-Shelah gives a tail of the form $C(2n)^d e^{-cn\epsilon^2}$ for VC dimension $d$, so the guaranteed uniform deviation grows to order $\sqrt{d \log n / n}$.

Watch Out

Glivenko-Cantelli does not say anything about distributional convergence

The Glivenko-Cantelli theorem says $\|F_n - F\|_\infty \to 0$ a.s. — it controls the $L_\infty$ distance pathwise but says nothing about the distribution of $\sqrt{n}(F_n - F)$ as a process. The distributional statement is Donsker's theorem: $\sqrt{n}(F_n - F)$ converges weakly to a Brownian bridge (time-changed by $F$). Donsker is strictly stronger than Glivenko-Cantelli; a standard sufficient condition is convergence of the bracketing-entropy integral, which is automatic for the indicator-of-half-line class but can fail for richer classes.
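The Donsker-level statement can be seen numerically: $\sqrt{n}\,\|F_n - F\|_\infty$ approaches the Kolmogorov distribution (the sup of the absolute Brownian bridge), which SciPy exposes as `kstwobign`. A sketch with arbitrarily chosen simulation sizes:

```python
# Empirical quantiles of sqrt(n) * ||F_n - F||_inf versus the Brownian-bridge limit.
import numpy as np
from scipy.stats import kstwobign

rng = np.random.default_rng(5)
n, reps = 2_000, 4_000
stats = np.empty(reps)
for r in range(reps):
    u = np.sort(rng.uniform(size=n))                 # Uniform[0,1], so F(x) = x
    i = np.arange(1, n + 1)
    stats[r] = np.sqrt(n) * max(np.max(i / n - u), np.max(u - (i - 1) / n))

for q in [0.5, 0.9, 0.95]:
    print(q, np.quantile(stats, q), kstwobign.ppf(q))
# The 0.95 quantile is ~1.358 -- the same constant as in the worked example above.
```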

Exercises

ExerciseCore

Problem

Use DKW to give a $1 - \delta$ confidence band for the empirical CDF based on $n = 200$ samples with $\delta = 0.05$. State the explicit value of the band half-width.

ExerciseAdvanced

Problem

Show that the class $\mathcal{F} = \{\mathbf{1}[\cdot \le x] : x \in \mathbb{R}\}$ has VC dimension exactly 1: exhibit a one-point set it shatters, and show that no two-point set can be shattered.

References

Canonical:

  • Glivenko, V. (1933). Sulla determinazione empirica delle leggi di probabilità. Giornale dell'Istituto Italiano degli Attuari, 4, 92-99. The Italian half of the original theorem; covers the case of continuous $F$.
  • Cantelli, F. P. (1933). Sulla determinazione empirica delle leggi di probabilità. Giornale dell'Istituto Italiano degli Attuari, 4, 421-424. Cantelli's complement covering the general case.
  • Dvoretzky, A., Kiefer, J., & Wolfowitz, J. (1956). Asymptotic minimax character of the sample distribution function. Annals of Mathematical Statistics, 27(3), 642-669. The original DKW with non-sharp constants.
  • Massart, P. (1990). The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Annals of Probability, 18(3), 1269-1283. The factor-of-2 sharp version that everyone now uses.

Current:

  • van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press. Chapter 19 covers Glivenko-Cantelli classes and the bracketing-entropy route in modern empirical-process language.
  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. Chapters 4-5 give the modern non-asymptotic treatment with VC and Rademacher rates.
  • Pollard, D. (1984). Convergence of Stochastic Processes. Springer. The classical empirical-process textbook; chapters 2-3 give the function-class generalization with all the symmetrization machinery.
  • Vershynin, R. (2018). High-Dimensional Probability. Cambridge University Press. Chapter 8 connects DKW to sub-Gaussian concentration and Rademacher tails.

Next Topics

  • Empirical processes and chaining: the function-class generalization with bracketing entropy, the Talagrand chaining argument, and Donsker's theorem.
  • Rademacher complexity: the data-dependent quantity that gives the rate at which Glivenko-Cantelli convergence happens.
  • VC dimension: the combinatorial sufficient condition for the Glivenko-Cantelli property.
  • Uniform convergence: the learning-theory framing in which Glivenko-Cantelli is the prototype.

Last reviewed: May 4, 2026
