

Epsilon-Nets and Covering Numbers

Discretizing infinite sets for concentration arguments: epsilon-nets, covering numbers, packing numbers, the Dudley integral, and the connection to Rademacher complexity.


Why This Matters

The fundamental challenge in learning theory is controlling the supremum of a random process over an infinite set:

$$\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)|$$

For a finite class $\mathcal{H}$, you apply a concentration inequality to each $h$ and take a union bound. For an infinite class, the union bound over uncountably many hypotheses is infinite, hence useless.

Epsilon-nets solve this by discretizing: approximate the infinite set by a finite set of points (the epsilon-net), apply the union bound to the finite set, and bound the error of the approximation. The covering number measures how large this finite set needs to be. It is the bridge between "finite class" bounds and "infinite class" bounds.

Mental Model

Imagine you want to bound $\sup_{x \in T} f(x)$ where $T$ is an infinite set and $f$ is "roughly continuous." Strategy:

  1. Pick a finite set of points $\mathcal{N}_\epsilon \subset T$ such that every point in $T$ is within $\epsilon$ of some point in $\mathcal{N}_\epsilon$.
  2. Bound $\max_{x \in \mathcal{N}_\epsilon} f(x)$ using a union bound over $|\mathcal{N}_\epsilon|$ points.
  3. Bound $|f(x) - f(x')| \leq L\epsilon$ for nearby points (Lipschitz continuity or similar).
  4. Combine: $\sup_{x \in T} f(x) \leq \max_{x \in \mathcal{N}_\epsilon} f(x) + L\epsilon$.

The covering number $\mathcal{N}(\epsilon, T, d)$ tells you how many points you need in step 1. Steps 2-4 trade the "approximation error" $L\epsilon$ against the "union bound cost" $\log \mathcal{N}(\epsilon, T, d)$.
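The four steps can be sketched numerically. Below is a minimal illustration (not from the text) with the hypothetical choice $f(x) = \sin(10x)$ on $T = [0,1]$, which is Lipschitz with $L = 10$:

```python
import numpy as np

# Assumed example: f(x) = sin(10x) on T = [0, 1], Lipschitz with constant L = 10
L, eps = 10.0, 0.01
f = lambda x: np.sin(10 * x)

# Step 1: points spaced 2*eps apart form an eps-net of [0, 1]
net = np.arange(eps, 1.0, 2 * eps)

# Steps 2-4: sup f <= (max of f over the net) + L * eps
bound = f(net).max() + L * eps

true_sup = f(np.linspace(0, 1, 100001)).max()  # dense grid as ground truth
print(f"true sup = {true_sup:.4f}, net bound = {bound:.4f}")
```

Shrinking `eps` tightens the approximation error but enlarges the net, which is exactly the trade-off step 4 optimizes.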

Formal Definitions

Definition

Epsilon-Net (Covering)

An $\epsilon$-net (or $\epsilon$-covering) of a set $T$ in a metric space $(T, d)$ is a finite set $\mathcal{N}_\epsilon \subseteq T$ such that for every $x \in T$, there exists $y \in \mathcal{N}_\epsilon$ with $d(x, y) \leq \epsilon$.

Equivalently, $T \subseteq \bigcup_{y \in \mathcal{N}_\epsilon} B(y, \epsilon)$, where $B(y, \epsilon)$ is the closed ball of radius $\epsilon$ centered at $y$.

Definition

Covering Number

The covering number $\mathcal{N}(\epsilon, T, d)$ is the minimum size of an $\epsilon$-net of $T$ with respect to the metric $d$:

$$\mathcal{N}(\epsilon, T, d) = \min\{|\mathcal{N}_\epsilon| : \mathcal{N}_\epsilon \text{ is an } \epsilon\text{-net of } T\}$$

Smaller $\epsilon$ requires more points. As $\epsilon \to 0$, the covering number grows; the rate of growth captures the "metric complexity" of $T$.

Definition

Packing Number

The $\epsilon$-packing number $\mathcal{M}(\epsilon, T, d)$ is the maximum number of points in $T$ that are mutually $\epsilon$-separated:

$$\mathcal{M}(\epsilon, T, d) = \max\{|S| : S \subseteq T, \; d(x, y) > \epsilon \text{ for all } x \neq y \in S\}$$

A maximal packing is always a covering: if the packing is maximal, every point in $T$ is within $\epsilon$ of some packing point. This gives the fundamental relationship between covering and packing numbers.
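This fact is easy to check numerically: a greedy pass that keeps only points more than $\epsilon$ from everything kept so far produces a set that is simultaneously an $\epsilon$-packing and an $\epsilon$-net. A small sketch on a random point cloud (the cloud and parameters are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((2000, 2))  # finite point cloud standing in for T
eps = 0.5

# Greedy pass: keep a point only if it is > eps from every point kept so far.
# The result is an eps-separated set that cannot be extended: a maximal packing.
net = []
for x in T:
    if not net or np.linalg.norm(np.array(net) - x, axis=1).min() > eps:
        net.append(x)
net = np.array(net)

# Packing property: pairwise distances all exceed eps
D = np.linalg.norm(net[:, None] - net[None, :], axis=-1)
assert (D[np.triu_indices(len(net), k=1)] > eps).all()

# Covering property: every point of T lies within eps of some net point
dist = np.linalg.norm(T[:, None] - net[None, :], axis=-1).min(axis=1)
assert (dist <= eps).all()
print(f"{len(net)} points form both an eps-packing and an eps-net of {len(T)} points")
```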

Main Theorems

Theorem

Covering-Packing Duality

Statement

For any set $T$ in a metric space and any $\epsilon > 0$:

$$\mathcal{M}(2\epsilon, T, d) \leq \mathcal{N}(\epsilon, T, d) \leq \mathcal{M}(\epsilon, T, d)$$

That is, the covering number at scale $\epsilon$ is sandwiched between packing numbers at scales $\epsilon$ and $2\epsilon$.

Intuition

Upper bound ($\mathcal{N} \leq \mathcal{M}$): A maximal $\epsilon$-packing is automatically an $\epsilon$-covering. If you cannot add any more $\epsilon$-separated points, then every point in $T$ must be within $\epsilon$ of an existing packing point.

Lower bound ($\mathcal{M}(2\epsilon) \leq \mathcal{N}(\epsilon)$): If you have $m$ points that are pairwise $2\epsilon$-separated, then each requires its own "representative" in any $\epsilon$-net (two $2\epsilon$-separated points cannot share the same $\epsilon$-net point). So the covering must have at least $m$ points.

Proof Sketch

Upper bound: Take a maximal $\epsilon$-packing $S$. If some $x \in T$ had $d(x, y) > \epsilon$ for all $y \in S$, then $S \cup \{x\}$ would be a larger $\epsilon$-packing, contradicting maximality. So $S$ is an $\epsilon$-covering and $\mathcal{N}(\epsilon) \leq |S| = \mathcal{M}(\epsilon)$.

Lower bound: Let $S$ be a $2\epsilon$-packing with $|S| = \mathcal{M}(2\epsilon)$. Any $\epsilon$-covering $\mathcal{N}_\epsilon$ must contain a distinct point within $\epsilon$ of each $s \in S$ (since points in $S$ are more than $2\epsilon$ apart, their $\epsilon$-balls are disjoint). So $|\mathcal{N}_\epsilon| \geq |S|$.

Why It Matters

This duality is why covering and packing numbers are interchangeable (up to a factor of 2 in $\epsilon$) in most applications. You can use whichever is easier to compute: packing numbers are often easier to bound by volume arguments, while covering numbers are what you directly need in discretization arguments.
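The lower-bound mechanism (distinct representatives) can also be checked empirically. The sketch below builds a greedy $\epsilon$-net and a greedy $2\epsilon$-packing of the same point cloud and verifies that each packing point claims its own net point; `greedy_net` is a helper defined only for this illustration:

```python
import numpy as np

def greedy_net(points, eps):
    """Greedy maximal eps-packing (hence also an eps-net) of a finite point cloud."""
    net = []
    for x in points:
        if not net or np.linalg.norm(np.array(net) - x, axis=1).min() > eps:
            net.append(x)
    return np.array(net)

rng = np.random.default_rng(1)
T = rng.uniform(-1, 1, size=(3000, 2))
eps = 0.3

net = greedy_net(T, eps)           # an eps-net of T
packing = greedy_net(T, 2 * eps)   # a pairwise (> 2*eps)-separated subset of T

# Two points more than 2*eps apart cannot both lie within eps of one net point,
# so each packing point gets a distinct nearest net point.
nearest = np.linalg.norm(packing[:, None] - net[None, :], axis=-1).argmin(axis=1)
assert len(set(nearest)) == len(packing)
assert len(packing) <= len(net)    # M(2*eps) <= size of any eps-net
```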

Failure Mode

The factor of 2 between the scales matters in precise calculations. In some information-theoretic lower bounds (Fano, Le Cam), you need packing numbers specifically, and the factor of 2 affects the final constants. For upper bounds in learning theory, the factor of 2 is absorbed into constants.

Volume-Based Covering Number Bounds

Definition

Covering Numbers for Euclidean Balls

For the unit ball $B_2^d = \{x \in \mathbb{R}^d : \|x\|_2 \leq 1\}$ in the $\ell_2$ metric:

$$\left(\frac{1}{\epsilon}\right)^d \leq \mathcal{N}(\epsilon, B_2^d, \|\cdot\|_2) \leq \left(\frac{3}{\epsilon}\right)^d$$

Proof idea (upper bound): Take a maximal $\epsilon$-packing of $B_2^d$; by duality it is also an $\epsilon$-covering. The balls of radius $\epsilon/2$ around the packing points are disjoint and contained in the ball of radius $1 + \epsilon/2$, so comparing volumes:

$$\mathcal{N}(\epsilon) \leq \frac{\text{Vol}(B_2^d(1 + \epsilon/2))}{\text{Vol}(B_2^d(\epsilon/2))} = \left(1 + \frac{2}{\epsilon}\right)^d \leq \left(\frac{3}{\epsilon}\right)^d \quad \text{for } \epsilon \leq 1.$$

Proof idea (lower bound): A volume argument on the covering itself. The $\epsilon$-balls centered at the net points must cover $B_2^d$, so $\mathcal{N}(\epsilon) \cdot \text{Vol}(B_2^d(\epsilon)) \geq \text{Vol}(B_2^d)$, which gives $\mathcal{N}(\epsilon) \geq \text{Vol}(B_2^d)/\text{Vol}(B_2^d(\epsilon)) = (1/\epsilon)^d$.

Key takeaway: The logarithm of the covering number for Euclidean balls is $\log \mathcal{N}(\epsilon) = \Theta(d \log(1/\epsilon))$: it grows linearly in the dimension $d$ and logarithmically in $1/\epsilon$.
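As a sanity check of the volume bounds, the sketch below greedily builds a net of a dense grid approximating the unit disk ($d = 2$) and confirms its size lands between $(1/\epsilon)^d$ and $(1 + 2/\epsilon)^d$. This relies on the assumption that the fine grid is a good stand-in for the continuous ball:

```python
import numpy as np

def greedy_net(points, eps):
    net = []
    for x in points:
        if not net or np.linalg.norm(np.array(net) - x, axis=1).min() > eps:
            net.append(x)
    return np.array(net)

# Dense grid approximating the unit disk B_2^2
g = np.linspace(-1, 1, 201)
X, Y = np.meshgrid(g, g)
disk = np.c_[X.ravel(), Y.ravel()]
disk = disk[np.linalg.norm(disk, axis=1) <= 1]

d, eps = 2, 0.2
n_net = len(greedy_net(disk, eps))

# The greedy net is an eps-net and an eps-packing, so its size sits between
# the volume lower bound (1/eps)^d and the packing upper bound (1 + 2/eps)^d.
assert (1 / eps) ** d <= n_net <= (1 + 2 / eps) ** d
print(f"greedy net size: {n_net}; volume bounds: [{(1/eps)**d:.0f}, {(1+2/eps)**d:.0f}]")
```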

The Discretization Argument

The standard template for using covering numbers in learning theory:

Setup: Let $\{X_t\}_{t \in T}$ be a random process indexed by $T$, where each $X_t$ is centered and sub-Gaussian with parameter $\sigma$. We want to bound $\mathbb{E}[\sup_{t \in T} X_t]$.

Step 1 (Discretize): Fix $\epsilon > 0$ and let $\mathcal{N}_\epsilon$ be an $\epsilon$-net of $T$.

Step 2 (Union bound on net): For each $t \in \mathcal{N}_\epsilon$, sub-Gaussian concentration gives $\mathbb{P}(X_t \geq u) \leq e^{-u^2/(2\sigma^2)}$. By a union bound over the $|\mathcal{N}_\epsilon|$ points:

$$\mathbb{P}\!\left(\max_{t \in \mathcal{N}_\epsilon} X_t \geq u\right) \leq \mathcal{N}(\epsilon) \cdot e^{-u^2/(2\sigma^2)}$$

Setting $u = \sigma\sqrt{2\log \mathcal{N}(\epsilon)}$ gives the bound with high probability.

Step 3 (Approximation): For any $t \in T$, let $\pi(t) \in \mathcal{N}_\epsilon$ be its nearest net point. If $|X_t - X_{\pi(t)}| \leq L\epsilon$ (a Lipschitz condition), then:

$$\sup_{t \in T} X_t \leq \max_{t \in \mathcal{N}_\epsilon} X_t + L\epsilon$$

Step 4 (Optimize $\epsilon$): Balance the union bound cost $\sigma\sqrt{2\log \mathcal{N}(\epsilon)}$ against the approximation error $L\epsilon$.
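The template can be illustrated end to end for the process $X_t = \langle g, t\rangle$ on the unit circle with $g \sim N(0, I_2)$, a choice made here purely for illustration: the supremum is $\|g\|_2$ in closed form, and $t \mapsto X_t$ is $\|g\|_2$-Lipschitz, so Step 3 applies exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
eps = 0.1

# Step 1: an exact eps-net of the unit circle (consecutive points are eps apart)
delta = 2 * np.arcsin(eps / 2)           # angle whose chord length is eps
angles = np.arange(0, 2 * np.pi, delta)
net = np.c_[np.cos(angles), np.sin(angles)]

# Process X_t = <g, t>: sup_t X_t = ||g||_2, attained at t = g/||g||_2,
# and t -> X_t is ||g||_2-Lipschitz, so Step 3 applies with L = ||g||_2.
for _ in range(200):
    g = rng.standard_normal(2)
    true_sup = np.linalg.norm(g)
    net_max = (net @ g).max()            # Step 2: max over the finite net
    assert net_max <= true_sup <= net_max + true_sup * eps   # Step 3's sandwich
```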

The Dudley Integral

Instead of using a single epsilon-net at one scale, chaining uses nets at all scales simultaneously. This is Dudley's entropy integral.

Theorem

Dudley Entropy Integral Bound

Statement

Let $\{X_t\}_{t \in T}$ be a centered random process such that $X_s - X_t$ is sub-Gaussian with parameter $d(s, t)$ for all $s, t \in T$. Then:

$$\mathbb{E}\!\left[\sup_{t \in T} X_t\right] \leq C \int_0^{\text{diam}(T)} \sqrt{\log \mathcal{N}(\epsilon, T, d)}\, d\epsilon$$

where $C$ is a universal constant and $\text{diam}(T) = \sup_{s,t} d(s,t)$.

Intuition

The Dudley integral sums contributions from all scales $\epsilon$. At coarse scales (large $\epsilon$), the covering number is small but the approximation captures only the rough structure. At fine scales (small $\epsilon$), the covering number is large but the approximation is precise. The integral balances these scales.

The $\sqrt{\log \mathcal{N}(\epsilon)}$ at each scale comes from the sub-Gaussian union bound: to control the max over $\mathcal{N}(\epsilon)$ sub-Gaussian variables, you pay $\sqrt{\log \mathcal{N}(\epsilon)}$. Integrating this across scales gives the Dudley bound.

Proof Sketch

Chaining construction: Choose $\epsilon$-nets at geometrically decreasing scales $\epsilon_k = 2^{-k} \cdot \text{diam}(T)$ for $k = 0, 1, 2, \ldots$, and let $\pi_k(t)$ be the nearest point to $t$ in the $\epsilon_k$-net.

For any $t \in T$, telescope:

$$X_t = X_{\pi_0(t)} + \sum_{k=1}^{\infty} \left(X_{\pi_k(t)} - X_{\pi_{k-1}(t)}\right)$$

Each increment $X_{\pi_k(t)} - X_{\pi_{k-1}(t)}$ is sub-Gaussian with parameter $d(\pi_k(t), \pi_{k-1}(t)) \leq \epsilon_{k-1} + \epsilon_k \leq 3\epsilon_k$.

At scale $k$, the increment takes at most $\mathcal{N}(\epsilon_k) \cdot \mathcal{N}(\epsilon_{k-1})$ possible values. The sub-Gaussian maximum over these values is controlled by $\epsilon_k \sqrt{\log \mathcal{N}(\epsilon_k)}$.

Summing over all scales kk and converting the sum to an integral gives the Dudley bound.

Why It Matters

The Dudley integral is the main tool for bounding the expected supremum of empirical processes. In learning theory, the empirical process $\sup_h |\hat{R}_n(h) - R(h)|$ is of this form. The Dudley integral converts the problem of bounding this supremum into computing (or bounding) covering numbers of the hypothesis class.

For the $d$-dimensional unit ball, $\log \mathcal{N}(\epsilon) \approx d\log(1/\epsilon)$, so the Dudley integral gives $\int_0^1 \sqrt{d\log(1/\epsilon)}\,d\epsilon = O(\sqrt{d})$, recovering the $\sqrt{d/n}$ rates of VC theory.
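This $O(\sqrt{d})$ scaling can be confirmed numerically. The sketch below evaluates the entropy integral with the upper bound $\log \mathcal{N}(\epsilon) \leq d \log(3/\epsilon)$ and checks that $I(d)/\sqrt{d}$ is a fixed constant across dimensions:

```python
import numpy as np

# Numerically evaluate the Dudley integral for the unit ball,
# using log N(eps) <= d * log(3/eps) on (0, 1]
eps = np.linspace(1e-6, 1.0, 200001)

def dudley(d):
    f = np.sqrt(d * np.log(3.0 / eps))
    return ((f[:-1] + f[1:]) / 2 * np.diff(eps)).sum()   # trapezoid rule

vals = {d: dudley(d) for d in (1, 10, 100, 1000)}
ratios = [v / np.sqrt(d) for d, v in vals.items()]

# sqrt(d) factors out of the integrand, so I(d)/sqrt(d) is the same
# constant for every d: the Dudley bound for the unit ball is O(sqrt(d))
assert np.allclose(ratios, ratios[0])
print({d: round(v, 3) for d, v in vals.items()})
```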

Failure Mode

The Dudley integral can be loose by a $\sqrt{\log n}$ factor compared to the tightest possible bound (given by the generic chaining / majorizing measures theorem of Talagrand). The Dudley integral uses a "worst-case" analysis at each scale, while generic chaining adapts to the local geometry of $T$. For most applications in learning theory, Dudley is sufficient.

Connection to Rademacher Complexity

The Rademacher complexity of a function class $\mathcal{F}$ on a sample $S = (x_1, \ldots, x_n)$ is:

$$\hat{\mathcal{R}}_n(\mathcal{F}) = \mathbb{E}_\sigma\!\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x_i)\right]$$

where the $\sigma_i$ are i.i.d. Rademacher ($\pm 1$) variables.

The process $X_f = \frac{1}{n}\sum_i \sigma_i f(x_i)$ is sub-Gaussian with respect to the $\ell_2$ metric $d(f, g) = \frac{1}{\sqrt{n}}\sqrt{\sum_i (f(x_i) - g(x_i))^2}$ on the function class (viewed as vectors in $\mathbb{R}^n$).

Applying Dudley:

$$\hat{\mathcal{R}}_n(\mathcal{F}) \leq \frac{C}{\sqrt{n}} \int_0^{\text{diam}} \sqrt{\log \mathcal{N}(\epsilon, \mathcal{F}|_S, \|\cdot\|_2)}\,d\epsilon$$

where $\mathcal{F}|_S = \{(f(x_1), \ldots, f(x_n)) : f \in \mathcal{F}\}$ is the restriction of $\mathcal{F}$ to the sample. This connects covering numbers to generalization bounds via Rademacher complexity.
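For the linear class treated later in this page, the supremum inside the Rademacher expectation has a closed form, so the complexity can be estimated directly by Monte Carlo and compared against the standard $BR/\sqrt{n}$ bound. The setup below (sample sizes, seed) is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, B, R = 200, 5, 1.0, 1.0

X = rng.standard_normal((n, d))
X *= R / np.linalg.norm(X, axis=1, keepdims=True)   # data scaled so ||x_i||_2 = R

# For {x -> <w, x> : ||w||_2 <= B}, the sup is available in closed form:
# sup_w (1/n) sum_i sigma_i <w, x_i> = (B/n) * ||sum_i sigma_i x_i||_2
sigma = rng.choice([-1.0, 1.0], size=(10000, n))    # Rademacher draws
rad_hat = (B / n) * np.linalg.norm(sigma @ X, axis=1).mean()

# Jensen's inequality gives the dimension-free bound B * R / sqrt(n)
assert rad_hat <= B * R / np.sqrt(n)
print(f"Monte Carlo Rademacher complexity {rad_hat:.4f} <= bound {B*R/np.sqrt(n):.4f}")
```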

Canonical Examples

Example

Covering number of a finite set

If $|T| = N$ (a finite set), then $\mathcal{N}(\epsilon, T, d) \leq N$ for all $\epsilon > 0$ (the set is its own covering). For $\epsilon$ smaller than the minimum pairwise distance, $\mathcal{N}(\epsilon) = N$. The Dudley integral then scales as $\sqrt{\log N}$, recovering the finite-class union bound.

Example

Covering the unit sphere in R^d

The unit sphere $S^{d-1} = \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$ has:

$$\mathcal{N}(\epsilon, S^{d-1}, \|\cdot\|_2) \leq \left(\frac{3}{\epsilon}\right)^d$$

and $\log \mathcal{N}(\epsilon) \leq d\log(3/\epsilon)$. This is used in bounding the operator norm of random matrices: to control $\sup_{\|x\|=1} \|Ax\|$, discretize the sphere and union-bound over the net.
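A small demonstration of this sphere-net trick in $d = 2$, where an exact $\epsilon$-net of the circle is available in closed form. Since $x \mapsto \|Ax\|_2$ is $\|A\|$-Lipschitz, the net maximum pins the operator norm within a $(1-\epsilon)^{-1}$ factor (a deterministic version of the union-bound argument; the matrix here is a random illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
eps = 0.05

# Exact eps-net of the unit circle S^1 (consecutive points eps apart in chord)
delta = 2 * np.arcsin(eps / 2)
angles = np.arange(0, 2 * np.pi, delta)
net = np.c_[np.cos(angles), np.sin(angles)]

A = rng.standard_normal((4, 2))
op_norm = np.linalg.norm(A, 2)                     # true operator norm sup ||Ax||_2
net_max = np.linalg.norm(net @ A.T, axis=1).max()  # max of ||Ax||_2 over the net

# x -> ||Ax||_2 is ||A||-Lipschitz, so the net determines the norm up to 1/(1 - eps)
assert net_max <= op_norm <= net_max / (1 - eps)
print(f"||A|| = {op_norm:.4f}, net estimate in [{net_max:.4f}, {net_max/(1-eps):.4f}]")
```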

Example

Covering linear function classes

Let $\mathcal{F} = \{x \mapsto \langle w, x \rangle : \|w\|_2 \leq B\}$ with data satisfying $\|x_i\|_2 \leq R$. The restriction to a sample gives vectors in $\mathbb{R}^n$ with $\ell_2$ norm at most $BR\sqrt{n}$.

The covering number satisfies $\log \mathcal{N}(\epsilon, \mathcal{F}|_S, \|\cdot\|_2) \leq d \log(1 + 2BR\sqrt{n}/\epsilon)$.

Plugging into Dudley gives Rademacher complexity $O(BR\sqrt{d/n})$, the standard generalization bound for bounded linear predictors.

Common Confusions

Watch Out

Covering numbers depend on the metric

The same set $T$ can have vastly different covering numbers under different metrics: covering the $\ell_1$ ball in the $\ell_1$, $\ell_2$, or $\ell_\infty$ metric gives different answers. Always specify the metric. In learning theory, the metric is usually the empirical $L^2$ norm on the sample.

Watch Out

The net does not have to be a subset of T

Some definitions require $\mathcal{N}_\epsilon \subseteq T$ (internal covering); others allow $\mathcal{N}_\epsilon \subseteq M$, where $M$ is the ambient metric space (external covering). The difference is at most a factor of 2 in $\epsilon$: an internal $\epsilon$-net is an external $\epsilon$-net, and an external $\epsilon/2$-net of $T$ can be projected to an internal $\epsilon$-net. Most learning theory results hold with either definition.

Watch Out

Covering number growth rate, not the number itself, is what matters

A covering number of $10^{100}$ looks huge, but its logarithm is about $230$, so the union bound cost $\sqrt{\log \mathcal{N}} \approx 15$ is modest. What matters is $\log \mathcal{N}(\epsilon)$ as a function of $\epsilon$ and of the dimension/complexity of $T$, not the raw number.

Summary

  • $\epsilon$-net: finite set that approximates all of $T$ within $\epsilon$
  • Covering number $\mathcal{N}(\epsilon, T, d)$: minimum $\epsilon$-net size
  • Packing number $\mathcal{M}(\epsilon)$: maximum $\epsilon$-separated subset
  • Duality: $\mathcal{M}(2\epsilon) \leq \mathcal{N}(\epsilon) \leq \mathcal{M}(\epsilon)$
  • Euclidean ball: $\log \mathcal{N}(\epsilon) = \Theta(d \log(1/\epsilon))$
  • Discretization argument: union bound on the net plus Lipschitz approximation
  • Dudley integral: $\mathbb{E}[\sup_t X_t] \leq C\int_0^D \sqrt{\log \mathcal{N}(\epsilon)}\,d\epsilon$
  • Connects to Rademacher complexity via covering numbers of $\mathcal{F}|_S$

Exercises

Exercise (Core)

Problem

Compute the covering number $\mathcal{N}(\epsilon, [0, 1], |\cdot|)$ of the unit interval under the absolute value metric. What is its logarithm?

Exercise (Core)

Problem

Use a covering number argument to bound $\mathbb{E}[\sup_{t \in T} X_t]$ where $T = \{t_1, \ldots, t_N\}$ is a finite set and each $X_{t_i}$ is sub-Gaussian with parameter $\sigma$. Recover the standard result $\mathbb{E}[\max_i X_{t_i}] \leq \sigma\sqrt{2\log N}$.

Exercise (Advanced)

Problem

Let $B_2^d$ be the unit ball in $\mathbb{R}^d$ with the $\ell_2$ metric. Using the volume argument, prove that $\mathcal{N}(\epsilon, B_2^d, \|\cdot\|_2) \leq (1 + 2/\epsilon)^d$. Then compute the Dudley integral $\int_0^1 \sqrt{\log \mathcal{N}(\epsilon)}\,d\epsilon$ and show it is $O(\sqrt{d})$.


References

Canonical:

  • Vershynin, High-Dimensional Probability (2018), Chapter 4
  • van der Vaart & Wellner, Weak Convergence and Empirical Processes (1996), Chapter 2.5
  • Dudley, Uniform Central Limit Theorems (1999)

Current:

  • Wainwright, High-Dimensional Statistics (2019), Chapter 5
  • Talagrand, Upper and Lower Bounds for Stochastic Processes (2014)
  • Boucheron, Lugosi & Massart, Concentration Inequalities (2013), Chapters 2-6

Next Topics

Building on epsilon-nets and covering numbers:

  • Rademacher complexity: the data-dependent complexity measure that uses covering numbers
  • VC dimension: combinatorial complexity connected to covering numbers via Sauer's lemma
  • Generic chaining: Talagrand's refinement of the Dudley integral

Last reviewed: April 2026
