

Epsilon-Nets and Covering Numbers

Discretizing infinite sets for concentration arguments: epsilon-nets, covering numbers, packing numbers, the Dudley integral, and the connection to Rademacher complexity.


Why This Matters

The fundamental challenge in learning theory is controlling the supremum of a random process over an infinite set:

$$\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)|$$

For a finite class $\mathcal{H}$, you apply a concentration inequality to each $h$ and take a union bound. For an infinite class, the union bound over uncountably many hypotheses is infinite, hence useless.

Epsilon-nets solve this by discretizing: approximate the infinite set by a finite set of points (the epsilon-net), apply the union bound to the finite set, and bound the error of the approximation. The covering number measures how large this finite set needs to be. It is the bridge between "finite class" bounds and "infinite class" bounds.

Mental Model

Imagine you want to bound $\sup_{x \in T} f(x)$ where $T$ is an infinite set and $f$ is "roughly continuous." Strategy:

  1. Pick a finite set of points $\mathcal{N}_\epsilon \subset T$ such that every point in $T$ is within $\epsilon$ of some point in $\mathcal{N}_\epsilon$.
  2. Bound $\max_{x \in \mathcal{N}_\epsilon} f(x)$ using a union bound over $|\mathcal{N}_\epsilon|$ points.
  3. Bound $|f(x) - f(x')| \leq L\epsilon$ for nearby points (Lipschitz continuity or similar).
  4. Combine: $\sup_{x \in T} f(x) \leq \max_{x \in \mathcal{N}_\epsilon} f(x) + L\epsilon$.

The covering number $\mathcal{N}(\epsilon, T, d)$ tells you how many points you need in step 1. Steps 2-4 trade the "approximation error" $L\epsilon$ against the "union bound cost" $\log \mathcal{N}(\epsilon, T, d)$.
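The four steps can be sketched numerically. Below is a minimal illustration (not from the text) with the hypothetical choice $f(x) = \sin(10x)$ on $T = [0,1]$, which is Lipschitz with $L = 10$:

```python
import numpy as np

# Assumed example: f(x) = sin(10x) on T = [0, 1], Lipschitz with constant L = 10
L, eps = 10.0, 0.01
f = lambda x: np.sin(10 * x)

# Step 1: points spaced 2*eps apart form an eps-net of [0, 1]
net = np.arange(eps, 1.0, 2 * eps)

# Steps 2-4: sup f <= (max of f over the net) + L * eps
bound = f(net).max() + L * eps

true_sup = f(np.linspace(0, 1, 100001)).max()  # dense grid as ground truth
print(f"true sup = {true_sup:.4f}, net bound = {bound:.4f}")
```

Shrinking `eps` tightens the approximation error but enlarges the net, which is exactly the trade-off step 4 optimizes.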

Formal Definitions

Definition

Epsilon-Net (Covering)

An $\epsilon$-net (or $\epsilon$-covering) of a set $T$ in a metric space $(T, d)$ is a finite set $\mathcal{N}_\epsilon \subseteq T$ such that for every $x \in T$, there exists $y \in \mathcal{N}_\epsilon$ with $d(x, y) \leq \epsilon$.

Equivalently, $T \subseteq \bigcup_{y \in \mathcal{N}_\epsilon} B(y, \epsilon)$, where $B(y, \epsilon)$ is the closed ball of radius $\epsilon$ centered at $y$.

Definition

Covering Number

The covering number $\mathcal{N}(\epsilon, T, d)$ is the minimum size of an $\epsilon$-net of $T$ with respect to the metric $d$:

$$\mathcal{N}(\epsilon, T, d) = \min\{|\mathcal{N}_\epsilon| : \mathcal{N}_\epsilon \text{ is an } \epsilon\text{-net of } T\}$$

Smaller $\epsilon$ requires more points. As $\epsilon \to 0$, the covering number grows; the rate of growth captures the "metric complexity" of $T$.

Definition

Packing Number

The $\epsilon$-packing number $\mathcal{M}(\epsilon, T, d)$ is the maximum number of points in $T$ that are mutually $\epsilon$-separated:

$$\mathcal{M}(\epsilon, T, d) = \max\{|S| : S \subseteq T, \; d(x, y) > \epsilon \text{ for all } x \neq y \in S\}$$

A maximal packing is always a covering: if the packing is maximal, every point in $T$ is within $\epsilon$ of some packing point. This gives the fundamental relationship between covering and packing numbers.
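This fact is easy to check numerically: a greedy pass that keeps only points more than $\epsilon$ from everything kept so far produces a set that is simultaneously an $\epsilon$-packing and an $\epsilon$-net. A small sketch on a random point cloud (the cloud and parameters are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((2000, 2))  # finite point cloud standing in for T
eps = 0.5

# Greedy pass: keep a point only if it is > eps from every point kept so far.
# The result is an eps-separated set that cannot be extended: a maximal packing.
net = []
for x in T:
    if not net or np.linalg.norm(np.array(net) - x, axis=1).min() > eps:
        net.append(x)
net = np.array(net)

# Packing property: pairwise distances all exceed eps
D = np.linalg.norm(net[:, None] - net[None, :], axis=-1)
assert (D[np.triu_indices(len(net), k=1)] > eps).all()

# Covering property: every point of T lies within eps of some net point
dist = np.linalg.norm(T[:, None] - net[None, :], axis=-1).min(axis=1)
assert (dist <= eps).all()
print(f"{len(net)} points form both an eps-packing and an eps-net of {len(T)} points")
```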

Main Theorems

Theorem

Covering-Packing Duality

Statement

For any set $T$ in a metric space and any $\epsilon > 0$:

$$\mathcal{M}(2\epsilon, T, d) \leq \mathcal{N}(\epsilon, T, d) \leq \mathcal{M}(\epsilon, T, d)$$

That is, the covering number at scale $\epsilon$ is sandwiched between packing numbers at scales $\epsilon$ and $2\epsilon$.

Intuition

Upper bound ($\mathcal{N} \leq \mathcal{M}$): A maximal $\epsilon$-packing is automatically an $\epsilon$-covering. If you cannot add any more $\epsilon$-separated points, then every point in $T$ must be within $\epsilon$ of an existing packing point.

Lower bound ($\mathcal{M}(2\epsilon) \leq \mathcal{N}(\epsilon)$): If you have $m$ points that are pairwise $2\epsilon$-separated, then each requires its own "representative" in any $\epsilon$-net (two $2\epsilon$-separated points cannot share the same $\epsilon$-net point). So the covering must have at least $m$ points.

Proof Sketch

Upper bound: Take a maximal $\epsilon$-packing $S$. If some $x \in T$ had $d(x, y) > \epsilon$ for all $y \in S$, then $S \cup \{x\}$ would be a larger $\epsilon$-packing, contradicting maximality. So $S$ is an $\epsilon$-covering and $\mathcal{N}(\epsilon) \leq |S| = \mathcal{M}(\epsilon)$.

Lower bound: Let $S$ be a $2\epsilon$-packing with $|S| = \mathcal{M}(2\epsilon)$. Any $\epsilon$-covering $\mathcal{N}_\epsilon$ must contain a distinct point within $\epsilon$ of each $s \in S$ (since points in $S$ are more than $2\epsilon$ apart, their $\epsilon$-balls are disjoint). So $|\mathcal{N}_\epsilon| \geq |S|$.

Why It Matters

This duality is why covering and packing numbers are interchangeable (up to a factor of 2 in $\epsilon$) in most applications. You can use whichever is easier to compute: packing numbers are often easier to bound by volume arguments, while covering numbers are what you directly need in discretization arguments.
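The lower-bound mechanism (distinct representatives) can also be checked empirically. The sketch below builds a greedy $\epsilon$-net and a greedy $2\epsilon$-packing of the same point cloud and verifies that each packing point claims its own net point; `greedy_net` is a helper defined only for this illustration:

```python
import numpy as np

def greedy_net(points, eps):
    """Greedy maximal eps-packing (hence also an eps-net) of a finite point cloud."""
    net = []
    for x in points:
        if not net or np.linalg.norm(np.array(net) - x, axis=1).min() > eps:
            net.append(x)
    return np.array(net)

rng = np.random.default_rng(1)
T = rng.uniform(-1, 1, size=(3000, 2))
eps = 0.3

net = greedy_net(T, eps)           # an eps-net of T
packing = greedy_net(T, 2 * eps)   # a pairwise (> 2*eps)-separated subset of T

# Two points more than 2*eps apart cannot both lie within eps of one net point,
# so each packing point gets a distinct nearest net point.
nearest = np.linalg.norm(packing[:, None] - net[None, :], axis=-1).argmin(axis=1)
assert len(set(nearest)) == len(packing)
assert len(packing) <= len(net)    # M(2*eps) <= size of any eps-net
```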

Failure Mode

The factor of 2 between the scales matters in precise calculations. In some information-theoretic lower bounds (Fano, Le Cam), you need packing numbers specifically, and the factor of 2 affects the final constants. For upper bounds in learning theory, the factor of 2 is absorbed into constants.

Volume-Based Covering Number Bounds

Definition

Covering Numbers for Euclidean Balls

For the unit ball $B_2^d = \{x \in \mathbb{R}^d : \|x\|_2 \leq 1\}$ in the $\ell_2$ metric:

$$\left(\frac{1}{\epsilon}\right)^d \leq \mathcal{N}(\epsilon, B_2^d, \|\cdot\|_2) \leq \left(\frac{3}{\epsilon}\right)^d$$

Proof idea (upper bound): Take a maximal $\epsilon$-packing of $B_2^d$; by duality it is also an $\epsilon$-covering. The balls of radius $\epsilon/2$ around the packing points are disjoint and contained in the ball of radius $1 + \epsilon/2$, so comparing volumes:

$$\mathcal{N}(\epsilon) \leq \frac{\text{Vol}(B_2^d(1 + \epsilon/2))}{\text{Vol}(B_2^d(\epsilon/2))} = \left(1 + \frac{2}{\epsilon}\right)^d \leq \left(\frac{3}{\epsilon}\right)^d \quad \text{for } \epsilon \leq 1.$$

Proof idea (lower bound): A volume argument on the covering itself. The $\epsilon$-balls centered at the net points must cover $B_2^d$, so $\mathcal{N}(\epsilon) \cdot \text{Vol}(B_2^d(\epsilon)) \geq \text{Vol}(B_2^d)$, which gives $\mathcal{N}(\epsilon) \geq \text{Vol}(B_2^d)/\text{Vol}(B_2^d(\epsilon)) = (1/\epsilon)^d$.

Key takeaway: The logarithm of the covering number for Euclidean balls is $\log \mathcal{N}(\epsilon) = \Theta(d \log(1/\epsilon))$: it grows linearly in the dimension $d$ and logarithmically in $1/\epsilon$.
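As a sanity check of the volume bounds, the sketch below greedily builds a net of a dense grid approximating the unit disk ($d = 2$) and confirms its size lands between $(1/\epsilon)^d$ and $(1 + 2/\epsilon)^d$. This relies on the assumption that the fine grid is a good stand-in for the continuous ball:

```python
import numpy as np

def greedy_net(points, eps):
    net = []
    for x in points:
        if not net or np.linalg.norm(np.array(net) - x, axis=1).min() > eps:
            net.append(x)
    return np.array(net)

# Dense grid approximating the unit disk B_2^2
g = np.linspace(-1, 1, 201)
X, Y = np.meshgrid(g, g)
disk = np.c_[X.ravel(), Y.ravel()]
disk = disk[np.linalg.norm(disk, axis=1) <= 1]

d, eps = 2, 0.2
n_net = len(greedy_net(disk, eps))

# The greedy net is an eps-net and an eps-packing, so its size sits between
# the volume lower bound (1/eps)^d and the packing upper bound (1 + 2/eps)^d.
assert (1 / eps) ** d <= n_net <= (1 + 2 / eps) ** d
print(f"greedy net size: {n_net}; volume bounds: [{(1/eps)**d:.0f}, {(1+2/eps)**d:.0f}]")
```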

The Discretization Argument

The standard template for using covering numbers in learning theory:

Setup: Let $\{X_t\}_{t \in T}$ be a random process indexed by $T$, where each $X_t$ is centered and sub-Gaussian with parameter $\sigma$. We want to bound $\mathbb{E}[\sup_{t \in T} X_t]$.

Step 1 (Discretize): Fix $\epsilon > 0$ and let $\mathcal{N}_\epsilon$ be an $\epsilon$-net of $T$.

Step 2 (Union bound on net): For each $t \in \mathcal{N}_\epsilon$, sub-Gaussian concentration gives $\mathbb{P}(X_t \geq u) \leq e^{-u^2/(2\sigma^2)}$. By a union bound over the $|\mathcal{N}_\epsilon|$ points:

$$\mathbb{P}\!\left(\max_{t \in \mathcal{N}_\epsilon} X_t \geq u\right) \leq \mathcal{N}(\epsilon) \cdot e^{-u^2/(2\sigma^2)}$$

Setting $u = \sigma\sqrt{2\log \mathcal{N}(\epsilon)}$ gives the bound with high probability.

Step 3 (Approximation): For any $t \in T$, let $\pi(t) \in \mathcal{N}_\epsilon$ be its nearest net point. If $|X_t - X_{\pi(t)}| \leq L\epsilon$ (a Lipschitz condition), then:

$$\sup_{t \in T} X_t \leq \max_{t \in \mathcal{N}_\epsilon} X_t + L\epsilon$$

Step 4 (Optimize $\epsilon$): Balance the union bound cost $\sigma\sqrt{2\log \mathcal{N}(\epsilon)}$ against the approximation error $L\epsilon$.
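The template can be illustrated end to end for the process $X_t = \langle g, t\rangle$ on the unit circle with $g \sim N(0, I_2)$, a choice made here purely for illustration: the supremum is $\|g\|_2$ in closed form, and $t \mapsto X_t$ is $\|g\|_2$-Lipschitz, so Step 3 applies exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
eps = 0.1

# Step 1: an exact eps-net of the unit circle (consecutive points are eps apart)
delta = 2 * np.arcsin(eps / 2)           # angle whose chord length is eps
angles = np.arange(0, 2 * np.pi, delta)
net = np.c_[np.cos(angles), np.sin(angles)]

# Process X_t = <g, t>: sup_t X_t = ||g||_2, attained at t = g/||g||_2,
# and t -> X_t is ||g||_2-Lipschitz, so Step 3 applies with L = ||g||_2.
for _ in range(200):
    g = rng.standard_normal(2)
    true_sup = np.linalg.norm(g)
    net_max = (net @ g).max()            # Step 2: max over the finite net
    assert net_max <= true_sup <= net_max + true_sup * eps   # Step 3's sandwich
```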

The Dudley Integral

Instead of using a single epsilon-net at one scale, chaining uses nets at all scales simultaneously. This is Dudley's entropy integral.

Theorem

Dudley Entropy Integral Bound

Statement

Let $\{X_t\}_{t \in T}$ be a centered random process such that $X_s - X_t$ is sub-Gaussian with parameter $d(s, t)$ for all $s, t \in T$. Then:

$$\mathbb{E}\!\left[\sup_{t \in T} X_t\right] \leq C \int_0^{\text{diam}(T)} \sqrt{\log \mathcal{N}(\epsilon, T, d)}\, d\epsilon$$

where $C$ is a universal constant and $\text{diam}(T) = \sup_{s,t} d(s,t)$.

Intuition

The Dudley integral sums contributions from all scales $\epsilon$. At coarse scales (large $\epsilon$), the covering number is small but the approximation captures only the rough structure. At fine scales (small $\epsilon$), the covering number is large but the approximation is precise. The integral balances these scales.

The $\sqrt{\log \mathcal{N}(\epsilon)}$ at each scale comes from the sub-Gaussian union bound: to control the max over $\mathcal{N}(\epsilon)$ sub-Gaussian variables, you pay $\sqrt{\log \mathcal{N}(\epsilon)}$. Integrating this across scales gives the Dudley bound.

Proof Sketch

Chaining construction: Choose $\epsilon$-nets at geometrically decreasing scales $\epsilon_k = 2^{-k} \cdot \text{diam}(T)$ for $k = 0, 1, 2, \ldots$, and let $\pi_k(t)$ be the nearest point to $t$ in the $\epsilon_k$-net.

For any $t \in T$, telescope:

$$X_t = X_{\pi_0(t)} + \sum_{k=1}^{\infty} \left(X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\right)$$

Each increment $X_{\pi_k(t)} - X_{\pi_{k-1}(t)}$ is sub-Gaussian with parameter $d(\pi_k(t), \pi_{k-1}(t)) \leq \epsilon_{k-1} + \epsilon_k \leq 3\epsilon_k$.

At scale $k$, the increment takes at most $\mathcal{N}(\epsilon_k) \cdot \mathcal{N}(\epsilon_{k-1})$ possible values. The sub-Gaussian maximum over these values is controlled by $\epsilon_k \sqrt{\log \mathcal{N}(\epsilon_k)}$.

Summing over all scales kk and converting the sum to an integral gives the Dudley bound.

Why It Matters

The Dudley integral is the main tool for bounding the expected supremum of empirical processes. In learning theory, the empirical process $\sup_h |\hat{R}_n(h) - R(h)|$ is of this form. The Dudley integral converts the problem of bounding this supremum into computing (or bounding) covering numbers of the hypothesis class.

For the $d$-dimensional unit ball, $\log \mathcal{N}(\epsilon) \approx d\log(1/\epsilon)$, so the Dudley integral gives $\int_0^1 \sqrt{d\log(1/\epsilon)}\,d\epsilon = O(\sqrt{d})$, recovering the $\sqrt{d/n}$ rates of VC theory.
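This $O(\sqrt{d})$ scaling can be confirmed numerically. The sketch below evaluates the entropy integral with the upper bound $\log \mathcal{N}(\epsilon) \leq d \log(3/\epsilon)$ and checks that $I(d)/\sqrt{d}$ is a fixed constant across dimensions:

```python
import numpy as np

# Numerically evaluate the Dudley integral for the unit ball,
# using log N(eps) <= d * log(3/eps) on (0, 1]
eps = np.linspace(1e-6, 1.0, 200001)

def dudley(d):
    f = np.sqrt(d * np.log(3.0 / eps))
    return ((f[:-1] + f[1:]) / 2 * np.diff(eps)).sum()   # trapezoid rule

vals = {d: dudley(d) for d in (1, 10, 100, 1000)}
ratios = [v / np.sqrt(d) for d, v in vals.items()]

# sqrt(d) factors out of the integrand, so I(d)/sqrt(d) is the same
# constant for every d: the Dudley bound for the unit ball is O(sqrt(d))
assert np.allclose(ratios, ratios[0])
print({d: round(v, 3) for d, v in vals.items()})
```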

Failure Mode

The Dudley integral can be loose by a $\sqrt{\log n}$ factor compared to the tightest possible bound (given by the generic chaining / majorizing measures theorem of Talagrand). The Dudley integral uses a "worst-case" analysis at each scale, while generic chaining adapts to the local geometry of $T$. For most applications in learning theory, Dudley is sufficient.

Connection to Rademacher Complexity

The Rademacher complexity of a function class $\mathcal{F}$ on a sample $S = (x_1, \ldots, x_n)$ is:

$$\hat{\mathcal{R}}_n(\mathcal{F}) = \mathbb{E}_\sigma\!\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right]$$

where the $\sigma_i$ are i.i.d. Rademacher ($\pm 1$) variables.

The process $X_f = \frac{1}{n}\sum_i \sigma_i f(x_i)$ is sub-Gaussian with respect to the $\ell_2$ metric $d(f, g) = \frac{1}{\sqrt{n}}\sqrt{\sum_i (f(x_i) - g(x_i))^2}$ on the function class (viewed as vectors in $\mathbb{R}^n$).

Applying Dudley:

$$\hat{\mathcal{R}}_n(\mathcal{F}) \leq \frac{C}{\sqrt{n}} \int_0^{\text{diam}} \sqrt{\log \mathcal{N}(\epsilon, \mathcal{F}|_S, \|\cdot\|_2)}\,d\epsilon$$

where $\mathcal{F}|_S = \{(f(x_1), \ldots, f(x_n)) : f \in \mathcal{F}\}$ is the restriction of $\mathcal{F}$ to the sample. This connects covering numbers to generalization bounds via Rademacher complexity.
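For the linear class treated later in this page, the supremum inside the Rademacher expectation has a closed form, so the complexity can be estimated directly by Monte Carlo and compared against the standard $BR/\sqrt{n}$ bound. The setup below (sample sizes, seed) is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, B, R = 200, 5, 1.0, 1.0

X = rng.standard_normal((n, d))
X *= R / np.linalg.norm(X, axis=1, keepdims=True)   # data scaled so ||x_i||_2 = R

# For {x -> <w, x> : ||w||_2 <= B}, the sup is available in closed form:
# sup_w (1/n) sum_i sigma_i <w, x_i> = (B/n) * ||sum_i sigma_i x_i||_2
sigma = rng.choice([-1.0, 1.0], size=(10000, n))    # Rademacher draws
rad_hat = (B / n) * np.linalg.norm(sigma @ X, axis=1).mean()

# Jensen's inequality gives the dimension-free bound B * R / sqrt(n)
assert rad_hat <= B * R / np.sqrt(n)
print(f"Monte Carlo Rademacher complexity {rad_hat:.4f} <= bound {B*R/np.sqrt(n):.4f}")
```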

Canonical Examples

Example

Covering number of a finite set

If $|T| = N$ (a finite set), then $\mathcal{N}(\epsilon, T, d) \leq N$ for all $\epsilon > 0$ (the set is its own covering). For $\epsilon$ smaller than the minimum pairwise distance, $\mathcal{N}(\epsilon) = N$. The Dudley integral then scales as $\sqrt{\log N}$, recovering the finite-class union bound.

Example

Covering the unit sphere in R^d

The unit sphere $S^{d-1} = \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$ has:

$$\mathcal{N}(\epsilon, S^{d-1}, \|\cdot\|_2) \leq \left(\frac{3}{\epsilon}\right)^d$$

and $\log \mathcal{N}(\epsilon) \leq d\log(3/\epsilon)$. This is used in bounding the operator norm of random matrices: to control $\sup_{\|x\|=1} \|Ax\|$, discretize the sphere and union-bound over the net.
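A small demonstration of this sphere-net trick in $d = 2$, where an exact $\epsilon$-net of the circle is available in closed form. Since $x \mapsto \|Ax\|_2$ is $\|A\|$-Lipschitz, the net maximum pins the operator norm within a $(1-\epsilon)^{-1}$ factor (a deterministic version of the union-bound argument; the matrix here is a random illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
eps = 0.05

# Exact eps-net of the unit circle S^1 (consecutive points eps apart in chord)
delta = 2 * np.arcsin(eps / 2)
angles = np.arange(0, 2 * np.pi, delta)
net = np.c_[np.cos(angles), np.sin(angles)]

A = rng.standard_normal((4, 2))
op_norm = np.linalg.norm(A, 2)                     # true operator norm sup ||Ax||_2
net_max = np.linalg.norm(net @ A.T, axis=1).max()  # max of ||Ax||_2 over the net

# x -> ||Ax||_2 is ||A||-Lipschitz, so the net determines the norm up to 1/(1 - eps)
assert net_max <= op_norm <= net_max / (1 - eps)
print(f"||A|| = {op_norm:.4f}, net estimate in [{net_max:.4f}, {net_max/(1-eps):.4f}]")
```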

Example

Covering linear function classes

Let $\mathcal{F} = \{x \mapsto \langle w, x \rangle : \|w\|_2 \leq B\}$ with data satisfying $\|x_i\|_2 \leq R$. The restriction to a sample gives vectors in $\mathbb{R}^n$ with $\ell_2$ norm at most $BR\sqrt{n}$.

The covering number satisfies $\log \mathcal{N}(\epsilon, \mathcal{F}|_S, \|\cdot\|_2) \leq d \log(1 + 2BR\sqrt{n}/\epsilon)$.

Plugging into Dudley gives Rademacher complexity $O(BR\sqrt{d/n})$, the standard generalization bound for bounded linear predictors.

Common Confusions

Watch Out

Covering numbers depend on the metric

The same set $T$ can have vastly different covering numbers under different metrics: covering the $\ell_1$ ball in the $\ell_1$, $\ell_2$, or $\ell_\infty$ metric gives different answers. Always specify the metric. In learning theory, the metric is usually the empirical $L^2$ norm on the sample.

Watch Out

The net does not have to be a subset of T

Some definitions require $\mathcal{N}_\epsilon \subseteq T$ (internal covering); others allow $\mathcal{N}_\epsilon \subseteq M$, where $M$ is the ambient metric space (external covering). The difference is at most a factor of 2 in $\epsilon$: an internal $\epsilon$-net is an external $\epsilon$-net, and an external $\epsilon/2$-net of $T$ can be projected to an internal $\epsilon$-net. Most learning theory results hold with either definition.

Watch Out

Covering number growth rate, not the number itself, is what matters

A covering number of $10^{100}$ looks huge, but its logarithm is about $230$, so the union bound cost $\sqrt{\log \mathcal{N}} \approx 15$ is modest. What matters is $\log \mathcal{N}(\epsilon)$ as a function of $\epsilon$ and of the dimension/complexity of $T$, not the raw number.

Summary

  • $\epsilon$-net: finite set that approximates all of $T$ within $\epsilon$
  • Covering number $\mathcal{N}(\epsilon, T, d)$: minimum $\epsilon$-net size
  • Packing number $\mathcal{M}(\epsilon)$: maximum $\epsilon$-separated subset
  • Duality: $\mathcal{M}(2\epsilon) \leq \mathcal{N}(\epsilon) \leq \mathcal{M}(\epsilon)$
  • Euclidean ball: $\log \mathcal{N}(\epsilon) = \Theta(d \log(1/\epsilon))$
  • Discretization argument: union bound on the net plus Lipschitz approximation
  • Dudley integral: $\mathbb{E}[\sup_t X_t] \leq C\int_0^D \sqrt{\log \mathcal{N}(\epsilon)}\,d\epsilon$
  • Connects to Rademacher complexity via covering numbers of $\mathcal{F}|_S$

Exercises

Exercise (Core)

Problem

Compute the covering number $\mathcal{N}(\epsilon, [0, 1], |\cdot|)$ of the unit interval under the absolute value metric. What is its logarithm?

Exercise (Core)

Problem

Use a covering number argument to bound $\mathbb{E}[\sup_{t \in T} X_t]$ where $T = \{t_1, \ldots, t_N\}$ is a finite set and each $X_{t_i}$ is sub-Gaussian with parameter $\sigma$. Recover the standard result $\mathbb{E}[\max_i X_{t_i}] \leq \sigma\sqrt{2\log N}$.

Exercise (Advanced)

Problem

Let $B_2^d$ be the unit ball in $\mathbb{R}^d$ with the $\ell_2$ metric. Using the volume argument, prove that $\mathcal{N}(\epsilon, B_2^d, \|\cdot\|_2) \leq (1 + 2/\epsilon)^d$. Then compute the Dudley integral $\int_0^1 \sqrt{\log \mathcal{N}(\epsilon)}\,d\epsilon$ and show it is $O(\sqrt{d})$.


References

Canonical:

  • Vershynin, High-Dimensional Probability (2018), Chapter 4
  • van der Vaart & Wellner, Weak Convergence and Empirical Processes (1996), Chapter 2.5
  • Dudley, Uniform Central Limit Theorems (1999)

Current:

  • Wainwright, High-Dimensional Statistics (2019), Chapter 5
  • Talagrand, Upper and Lower Bounds for Stochastic Processes (2014)
  • Boucheron, Lugosi & Massart, Concentration Inequalities (2013), Chapters 2-6

Next Topics

Building on epsilon-nets and covering numbers:

  • Rademacher complexity: the data-dependent complexity measure that uses covering numbers
  • VC dimension: combinatorial complexity connected to covering numbers via Sauer's lemma
  • Generic chaining: Talagrand's refinement of the Dudley integral

Last reviewed: April 2026
