
Concentration Probability

Empirical Processes and Chaining

Bounding the supremum of empirical processes via covering numbers and chaining: Dudley's entropy integral and Talagrand's generic chaining, the sharpest tools in classical learning theory.

Advanced · Tier 2 · Stable · ~65 min

Why This Matters

The central question of learning theory is: how fast does the empirical risk converge to the population risk, uniformly over a function class? This question reduces to bounding the supremum of an empirical process. Dudley's entropy integral and Talagrand's generic chaining are the two main tools for this task.

Dudley's bound converts covering number estimates (geometric information about the function class) into bounds on the expected supremum. Generic chaining improves on Dudley by a logarithmic factor in many cases. Together, they represent the culmination of the covering number approach to learning theory.

Formal Setup

Definition

Empirical Process

Given i.i.d. samples $X_1, \ldots, X_n$ from a distribution $P$ and a function class $\mathcal{F}$, the empirical process is:

$$G_n(f) = \sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}[f(X)]\right) = \sqrt{n}(P_n f - Pf)$$

where $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}$ is the empirical measure.
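To make the definition concrete, here is a minimal numerical sketch (an illustration added here, not part of the original text): for the class of threshold functions $f_t(x) = \mathbf{1}\{x \le t\}$ under $P = \mathrm{Uniform}[0,1]$, we have $Pf_t = t$ exactly, and $\sup_t |G_n(f_t)|$ is $\sqrt{n}$ times the Kolmogorov–Smirnov statistic, which stays $O(1)$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy function class: f_t(x) = 1{x <= t} for thresholds t on a grid.
# Under P = Uniform[0, 1], the population mean is P f_t = t exactly.
thresholds = np.linspace(0.0, 1.0, 101)
n = 1000
X = rng.uniform(0.0, 1.0, size=n)

# Empirical measure applied to each f_t: P_n f_t = (1/n) sum_i 1{X_i <= t}
Pn = (X[:, None] <= thresholds[None, :]).mean(axis=0)
P = thresholds

# Empirical process G_n(f_t) = sqrt(n) * (P_n f_t - P f_t)
Gn = np.sqrt(n) * (Pn - P)
sup_abs = np.abs(Gn).max()
print(f"sup_t |G_n(f_t)| = {sup_abs:.3f}")  # O(1), does not grow with n
```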

Definition

Supremum of the Empirical Process

The quantity of interest for uniform convergence is:

$$\sup_{f \in \mathcal{F}} |G_n(f)| = \sqrt{n} \sup_{f \in \mathcal{F}} |P_n f - Pf|$$

Bounding $\mathbb{E}[\sup_{f \in \mathcal{F}} |G_n(f)|]$ controls how well the empirical risk approximates the population risk uniformly over $\mathcal{F}$.

Definition

Covering Number

The covering number $N(\epsilon, \mathcal{F}, d)$ is the minimum number of balls of radius $\epsilon$ (in the metric $d$) needed to cover $\mathcal{F}$. The metric entropy is $\log N(\epsilon, \mathcal{F}, d)$.
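Covering numbers can be estimated by a greedy construction: repeatedly pick an uncovered point as a new center. The sketch below (an illustration added here; `covering_number` is a hypothetical helper) upper-bounds $N(\epsilon)$ for a discretized interval. For $[0,1]$ with the absolute-value metric, $N(\epsilon) \asymp 1/\epsilon$, so the metric entropy grows like $\log(1/\epsilon)$.

```python
import numpy as np

def covering_number(points, eps):
    """Greedy upper bound on N(eps): take an uncovered point as a new
    center until every point lies within eps of some center."""
    centers = []
    remaining = list(points)
    while remaining:
        c = remaining[0]
        centers.append(c)
        remaining = [p for p in remaining if abs(p - c) > eps]
    return len(centers)

pts = np.linspace(0.0, 1.0, 1001)  # discretization of the unit interval
for eps in (0.5, 0.1, 0.01):
    print(eps, covering_number(pts, eps))
# N(eps, [0,1], |.|) grows like 1/eps, so log N(eps) ~ log(1/eps)
```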

The Chaining Idea

The naive approach is to discretize $\mathcal{F}$ at a single resolution $\epsilon$ and take a union bound over the $\epsilon$-net. This gives:

$$\mathbb{E}\left[\sup_{f \in \mathcal{F}} |G_n(f)|\right] \lesssim \epsilon\sqrt{n} + \sqrt{\log N(\epsilon, \mathcal{F}, L_2(P_n))}$$

The problem: choosing $\epsilon$ involves a tradeoff. A small $\epsilon$ controls the approximation error but inflates the covering number; a large $\epsilon$ does the reverse.

Chaining resolves this by discretizing at all scales simultaneously. Build a sequence of increasingly fine $\epsilon$-nets $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots$ with $\epsilon_k = 2^{-k}$. For each $f \in \mathcal{F}$, write $f = f_0 + (f_1 - f_0) + (f_2 - f_1) + \cdots$, where $f_k$ is the nearest point to $f$ in the $k$-th net, and bound each increment separately.
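The telescoping decomposition can be checked numerically. A sketch (added here for illustration, assuming a one-dimensional parameter space so that the nets are just dyadic grids; `project` is a hypothetical helper):

```python
import numpy as np

# Dyadic nets on the parameter space [0, 1]: at scale eps_k = 2^{-k},
# a grid with spacing 2^{-k} is an eps_k-net.
K = 8
nets = [np.arange(0.0, 1.0 + 1e-12, 2.0 ** -k) for k in range(K + 1)]

def project(t, net):
    """Nearest net point pi_k(t)."""
    return net[np.argmin(np.abs(net - t))]

t = 0.3712
chain = [project(t, net) for net in nets]

# Telescoping: t ~ pi_0(t) + sum_k (pi_k(t) - pi_{k-1}(t)), up to the
# residual t - pi_K(t), which is at most 2^{-K}.
increments = [chain[k] - chain[k - 1] for k in range(1, K + 1)]
for k, inc in enumerate(increments, start=1):
    # Triangle inequality: |pi_k - pi_{k-1}| <= eps_k/2 + eps_{k-1} <= 3 * 2^{-k}
    assert abs(inc) <= 3 * 2.0 ** -k
print(chain[0] + sum(increments), "=", chain[-1])
```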

Main Theorems

Theorem

Dudley's Entropy Integral

Statement

Let $\mathcal{F}$ be a function class with envelope $F$ (meaning $|f| \leq F$ for all $f \in \mathcal{F}$) and let $\sigma^2 = \sup_{f \in \mathcal{F}} \operatorname{Var}(f(X))$. Then:

$$\mathbb{E}\left[\sup_{f \in \mathcal{F}} |G_n(f)|\right] \leq C \int_0^{\sigma} \sqrt{\log N(\epsilon, \mathcal{F}, L_2(P))}\, d\epsilon$$

where $C$ is a universal constant. The integral converges whenever the covering numbers grow at most polynomially in $1/\epsilon$ as $\epsilon \to 0$.

Intuition

At each scale $\epsilon$, the contribution to the supremum is controlled by $\sqrt{\log N(\epsilon)}$ (a union bound over the net at that scale). Chaining integrates these contributions across all scales. Coarse scales (large $\epsilon$) contribute little because the net is small; fine scales (small $\epsilon$) contribute little because the increments are small. The integral balances these.

Proof Sketch

Fix a sequence of scales $\epsilon_k = 2^{-k} \sigma$ and corresponding $\epsilon_k$-nets $\mathcal{N}_k$. For each $f$, let $\pi_k(f)$ be the closest point in $\mathcal{N}_k$. Write:

$$G_n(f) = G_n(\pi_0(f)) + \sum_{k=1}^K \left[G_n(\pi_k(f)) - G_n(\pi_{k-1}(f))\right] + \left[G_n(f) - G_n(\pi_K(f))\right]$$

The first term is bounded by $\sqrt{\log |\mathcal{N}_0|}$ via a union bound. Each increment $G_n(\pi_k(f)) - G_n(\pi_{k-1}(f))$ is sub-Gaussian with parameter $O(\epsilon_k)$, and there are at most $|\mathcal{N}_k| \cdot |\mathcal{N}_{k-1}|$ possible increment pairs, so scale $k$ contributes $O(\epsilon_k \sqrt{\log |\mathcal{N}_k|})$. Summing over scales gives the integral.
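One step implicit here deserves spelling out: the dyadic sum over scales is dominated by the entropy integral. Since $\sqrt{\log N(\epsilon)}$ is nonincreasing in $\epsilon$ and $\epsilon_k - \epsilon_{k+1} = \epsilon_k/2$ (with $\epsilon_0 = \sigma$),

$$\sum_{k \geq 0} \epsilon_k \sqrt{\log N(\epsilon_k)} = 2 \sum_{k \geq 0} (\epsilon_k - \epsilon_{k+1}) \sqrt{\log N(\epsilon_k)} \leq 2 \sum_{k \geq 0} \int_{\epsilon_{k+1}}^{\epsilon_k} \sqrt{\log N(\epsilon)}\, d\epsilon = 2 \int_0^{\sigma} \sqrt{\log N(\epsilon)}\, d\epsilon$$

because on each interval $[\epsilon_{k+1}, \epsilon_k]$ the integrand is at least $\sqrt{\log N(\epsilon_k)}$.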

Why It Matters

This is the standard tool for converting covering number estimates into generalization bounds. For a class with $\log N(\epsilon) = O(d \log(1/\epsilon))$ (like VC classes or linear predictors), the integral evaluates to $O(\sqrt{d})$, recovering the standard $O(\sqrt{d/n})$ generalization bound without going through VC theory.
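The $O(\sqrt{d})$ claim can be checked numerically. A sketch under the assumed entropy bound $\log N(\epsilon) = d \log(3/\epsilon)$ (`dudley_integral` is a hypothetical helper added here):

```python
import numpy as np

def dudley_integral(d, sigma=1.0, num=100_000):
    """Trapezoidal evaluation of int_0^sigma sqrt(d * log(3/eps)) d(eps),
    assuming the parametric entropy bound log N(eps) = d * log(3/eps)."""
    eps = np.linspace(1e-8, sigma, num)
    vals = np.sqrt(d * np.log(3.0 / eps))
    return float(np.sum((vals[1:] + vals[:-1]) * np.diff(eps)) / 2.0)

# The integral scales as sqrt(d): the ratio below is constant in d,
# because sqrt(d) factors out of the integrand exactly.
for d in (1, 4, 16, 64):
    print(d, round(dudley_integral(d) / np.sqrt(d), 4))
```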

Failure Mode

Dudley's bound is not tight in general. It can be loose by a $\sqrt{\log n}$ factor compared to the true expected supremum. For classes whose covering numbers have complex structure at different scales, generic chaining gives tighter results.

Theorem

Talagrand's Generic Chaining (Majorizing Measures)

Statement

For a centered Gaussian process $(X_t)_{t \in T}$ with canonical metric $d(s, t) = \sqrt{\mathbb{E}(X_s - X_t)^2}$, define the $\gamma_2$ functional:

$$\gamma_2(T, d) = \inf_{\{A_k\}} \sup_{t \in T} \sum_{k=0}^{\infty} 2^{k/2}\, d(t, A_k)$$

where the infimum is over all admissible sequences of subsets $A_k \subset T$ with $|A_0| = 1$ and $|A_k| \leq 2^{2^k}$. Then:

$$\frac{1}{L}\, \gamma_2(T, d) \leq \mathbb{E}\left[\sup_{t \in T} X_t\right] \leq L \cdot \gamma_2(T, d)$$

for a universal constant $L$. The $\gamma_2$ functional characterizes the expected supremum up to universal constants.

Intuition

Generic chaining replaces the uniform covering at each scale (Dudley) with an optimized covering that can vary across the set $T$. Some points in $T$ may need fine resolution while others do not. The $\gamma_2$ functional captures this by allowing different "chains" for different points.
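To see chaining-style bounds in action on a concrete Gaussian process, the sketch below (added here for illustration; `cover_size` is a hypothetical greedy-cover helper) compares a Monte Carlo estimate of $\mathbb{E}\sup_t X_t$ with a Dudley-type dyadic sum $\sum_k 2^{-k}\sqrt{\log N(2^{-k})}$ for the canonical process $X_t = \langle g, t \rangle$ on a finite index set:

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite index set T on the unit sphere in R^5; canonical process
# X_t = <g, t> with g ~ N(0, I). Its canonical metric is the
# Euclidean distance between index points.
m, N = 5, 200
T = rng.normal(size=(N, m))
T /= np.linalg.norm(T, axis=1, keepdims=True)

# Monte Carlo estimate of E sup_t X_t
g = rng.normal(size=(20_000, m))
mc = (g @ T.T).max(axis=1).mean()

def cover_size(points, eps):
    """Greedy upper bound on the covering number N(eps, T, ||.||)."""
    count, rem = 0, points
    while len(rem):
        c = rem[0]
        rem = rem[np.linalg.norm(rem - c, axis=1) > eps]
        count += 1
    return count

# Dudley-type dyadic sum over scales eps_k = 2^{-k}
dudley = sum(2.0 ** -k * np.sqrt(np.log(cover_size(T, 2.0 ** -k)))
             for k in range(8))
print(f"Monte Carlo E sup = {mc:.2f}, Dudley sum = {dudley:.2f}")
```

The Dudley sum (even before the universal constant) comes out of the same order as the true expected supremum here; generic chaining would optimize the sets at each scale instead of using one greedy cover.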

Proof Sketch

The upper bound follows the same chaining idea as Dudley, but with optimized nets AkA_k instead of minimal coverings. The lower bound (which makes this a characterization, not just an upper bound) uses the Sudakov minoration argument and a sophisticated construction of the admissible sequence.

Why It Matters

This is the definitive result on suprema of Gaussian processes. Since empirical processes can be compared to Gaussian processes via symmetrization and comparison inequalities, this provides the tightest possible bounds on learning theory quantities. It resolves a long-standing question in probability theory about when chaining gives the right answer.

Failure Mode

The $\gamma_2$ functional is hard to compute in general. For most applications, Dudley's entropy integral is sufficient and much more practical. Generic chaining is most useful for proving theoretical optimality results rather than computing explicit bounds.

Connection to Learning Theory

The Rademacher complexity of a function class $\mathcal{F}$ is:

$$\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i)\right]$$

where the $\sigma_i$ are i.i.d. uniform $\pm 1$ (Rademacher) signs, independent of the $X_i$.

By symmetrization, $\mathbb{E}[\sup_f |P_n f - Pf|] \leq 2\mathfrak{R}_n(\mathcal{F})$. Dudley's bound then gives:

$$\mathfrak{R}_n(\mathcal{F}) \leq \frac{C}{\sqrt{n}} \int_0^\infty \sqrt{\log N(\epsilon, \mathcal{F}, L_2(P_n))}\, d\epsilon$$

This is how covering numbers and chaining enter generalization bounds.
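For the linear class of the example below, the supremum inside the Rademacher complexity has a closed form via Cauchy–Schwarz: $\sup_{\|w\| \le 1} \frac{1}{n}\sum_i \sigma_i \langle w, x_i \rangle = \|\frac{1}{n}\sum_i \sigma_i x_i\|$. That makes a Monte Carlo estimate easy; a sketch with an assumed setup (added here, not from the original):

```python
import numpy as np

rng = np.random.default_rng(2)

# Data on the unit sphere in R^20, so ||x_i|| <= 1 as the class requires.
n, d = 500, 20
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# By Cauchy-Schwarz, sup over the unit ball of (1/n) sum_i sigma_i <w, x_i>
# equals the Euclidean norm of (1/n) sum_i sigma_i x_i.
sigma = rng.choice([-1.0, 1.0], size=(2_000, n))  # Rademacher draws
sups = np.linalg.norm(sigma @ X / n, axis=1)
rad = sups.mean()

# Jensen's inequality gives R_n <= sqrt(sum_i ||x_i||^2) / n = 1/sqrt(n) here.
print(f"R_n = {rad:.4f} vs bound 1/sqrt(n) = {1 / np.sqrt(n):.4f}")
```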

Canonical Examples

Example

Linear predictors in $\mathbb{R}^d$

For $\mathcal{F} = \{x \mapsto \langle w, x \rangle : \|w\| \leq 1, \|x\| \leq 1\}$, the covering number satisfies $\log N(\epsilon, \mathcal{F}, L_2) \leq d \log(3/\epsilon)$. Dudley's integral gives:

$$\int_0^1 \sqrt{d \log(3/\epsilon)}\, d\epsilon = O(\sqrt{d})$$

So $\mathfrak{R}_n(\mathcal{F}) = O(\sqrt{d/n})$, recovering the standard bound.
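The integral can be evaluated explicitly (a worked step added here): using $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ and the substitution $u = \log(1/\epsilon)$, which gives $\int_0^1 \sqrt{\log(1/\epsilon)}\, d\epsilon = \int_0^\infty \sqrt{u}\, e^{-u}\, du = \Gamma(3/2) = \sqrt{\pi}/2$,

$$\int_0^1 \sqrt{d \log(3/\epsilon)}\, d\epsilon \le \sqrt{d}\left(\sqrt{\log 3} + \int_0^1 \sqrt{\log(1/\epsilon)}\, d\epsilon\right) = \sqrt{d}\left(\sqrt{\log 3} + \frac{\sqrt{\pi}}{2}\right) \approx 1.93\,\sqrt{d}$$

so the bound holds with an explicit small constant.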

Common Confusions

Watch Out

Chaining is not a single-scale argument

The power of chaining comes from simultaneously using approximations at all scales. A single-scale argument (discretizing at one $\epsilon$) always involves a tradeoff between the approximation error and the size of the net. Chaining eliminates this tradeoff by summing contributions across scales.

Watch Out

Dudley vs Talagrand: when does the difference matter?

For most function classes encountered in ML (VC classes, kernel classes, neural network classes with bounded norms), Dudley's integral gives bounds that are tight up to logarithmic factors. Generic chaining matters when the geometry of the function class has different complexity at different scales, which is rare in standard applications but important in probability theory.

Summary

  • Empirical process theory reduces generalization to bounding $\sup_f |P_n f - Pf|$
  • Dudley's entropy integral: $\mathbb{E}[\sup |G_n|] \leq C \int \sqrt{\log N(\epsilon)}\, d\epsilon$
  • Chaining works by discretizing at all scales simultaneously
  • Generic chaining ($\gamma_2$) is tight for Gaussian processes but hard to compute
  • For practical ML bounds, Dudley's integral with covering number estimates suffices

Exercises

ExerciseCore

Problem

For a function class with $\log N(\epsilon, \mathcal{F}, L_2) = d \log(1/\epsilon)$, evaluate Dudley's entropy integral up to constants and state the resulting generalization bound.

ExerciseAdvanced

Problem

Explain why a single-scale $\epsilon$-net argument gives a bound of $O(\epsilon + \sqrt{d \log(1/\epsilon)/n})$ and why optimizing over $\epsilon$ gives $O(\sqrt{d \log n / n})$, which is worse than Dudley's $O(\sqrt{d/n})$ by a $\sqrt{\log n}$ factor.

ExerciseResearch

Problem

State Sudakov's minoration: a lower bound on the expected supremum of a Gaussian process in terms of a single-scale packing number. Explain why this shows Dudley's integral cannot be improved by more than a logarithmic factor.

References


  • Talagrand, Upper and Lower Bounds for Stochastic Processes (2014)

  • Boucheron, Lugosi, Massart, Concentration Inequalities (2013), Chapters 2-6

  • Wainwright, High-Dimensional Statistics (2019), Chapter 2

  • van Handel, Probability in High Dimension (2016), Chapters 1-3


Last reviewed: April 2026
