What Each Measures
Both covering numbers and packing numbers quantify the "size" of a set $T$ in a metric space at a given resolution $\varepsilon$. They answer dual questions about the same geometric object.
Covering numbers ask: what is the minimum number of $\varepsilon$-balls needed to cover the entire set?
Packing numbers ask: what is the maximum number of points in the set that are mutually $\varepsilon$-separated?
Side-by-Side Statement
Covering Number
The covering number $N(\varepsilon, T, d)$ is the minimum cardinality of an $\varepsilon$-net for $T$ in metric $d$. That is, the smallest set $\{x_1, \dots, x_N\} \subseteq T$ such that for every $t \in T$, there exists $x_i$ with $d(t, x_i) \le \varepsilon$.
Packing Number
The packing number $M(\varepsilon, T, d)$ is the maximum cardinality of an $\varepsilon$-packing of $T$. That is, the largest set $\{t_1, \dots, t_M\} \subseteq T$ such that $d(t_i, t_j) > \varepsilon$ for all $i \neq j$.
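Both quantities can be computed greedily on a finite metric space. The sketch below (hypothetical helper names, not from any library) builds a maximal packing and then checks that it automatically covers:

```python
def greedy_packing(points, eps, dist):
    """Greedily build a maximal eps-packing: keep a point only if it is
    more than eps away from every point kept so far."""
    kept = []
    for p in points:
        if all(dist(p, q) > eps for q in kept):
            kept.append(p)
    return kept

def is_eps_net(points, centers, eps, dist):
    """Check the covering property: every point is within eps of some center."""
    return all(any(dist(p, c) <= eps for c in centers) for p in points)

# Toy metric space: a fine grid on [0, 1] with the absolute-value metric.
grid = [i / 1000 for i in range(1001)]
dist = lambda x, y: abs(x - y)

packing = greedy_packing(grid, 0.25, dist)
print(len(packing), is_eps_net(grid, packing, 0.25, dist))
```

Because the greedy packing is maximal, no point of the space is more than $\varepsilon$ from it, so the covering check always succeeds.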
The Factor-of-2 Relationship
Covering-Packing Sandwich Inequality
Statement
For any subset $T$ of a metric space $(X, d)$ and any $\varepsilon > 0$:

$$M(2\varepsilon, T, d) \;\le\; N(\varepsilon, T, d) \;\le\; M(\varepsilon, T, d).$$
Intuition
The right inequality says: a maximal packing is also a covering. If you have a maximal $\varepsilon$-packing $\{t_1, \dots, t_M\}$, then every point in $T$ must be within $\varepsilon$ of some $t_i$ (otherwise you could add it to the packing, contradicting maximality). So the packing is itself an $\varepsilon$-net, giving $N(\varepsilon, T, d) \le M(\varepsilon, T, d)$.
The left inequality says: every $\varepsilon$-ball in a covering can contain at most one point from a $2\varepsilon$-packing (since packing points are more than $2\varepsilon$ apart, they cannot both be within $\varepsilon$ of the same center), giving $M(2\varepsilon, T, d) \le N(\varepsilon, T, d)$.
Failure Mode
The factor of 2 is tight in general. For the unit interval $[0,1]$ with the standard metric and $\varepsilon = 1/4$: $N(1/4, [0,1]) = 2$ (centers at $1/4$ and $3/4$) while $M(1/2, [0,1]) = 2$ (points at $0$ and $1$), so the left inequality holds with equality; meanwhile $M(1/4, [0,1]) = 4$. The gap matters when you need exact constants, but for asymptotic analysis (log covering number vs. log packing number), the factor of 2 vanishes.
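These interval numbers can be checked directly. A small sketch using exact rational arithmetic (the helper functions are illustrative, written for the conventions above):

```python
from fractions import Fraction

def interval_cover_centers(eps):
    """Minimum eps-cover of [0,1]: each ball covers an interval of length
    2*eps, so greedily placing centers at eps, 3*eps, ... is optimal."""
    centers, left = [], Fraction(0)
    while left < 1:
        centers.append(left + eps)
        left += 2 * eps
    return centers

def interval_packing(eps, step=Fraction(1, 1000)):
    """Maximum eps-packing of [0,1] (separation strictly greater than eps),
    built greedily left-to-right on a fine grid -- optimal in one dimension."""
    pts, x = [], Fraction(0)
    while x <= 1:
        if not pts or x - pts[-1] > eps:
            pts.append(x)
        x += step
    return pts

eps = Fraction(1, 4)
M2, N, M = interval_packing(2 * eps), interval_cover_centers(eps), interval_packing(eps)
print(len(M2), len(N), len(M))  # sandwich: M(2*eps) <= N(eps) <= M(eps)
```

With $\varepsilon = 1/4$ this prints the sandwich $2 \le 2 \le 4$ from the example above.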
Where Each Appears Naturally
Covering numbers in upper bounds
Covering numbers appear when you need to discretize a continuous set and apply a union bound. The standard pattern:
- Cover the hypothesis class $\mathcal{F}$ with an $\varepsilon$-net of size $N(\varepsilon, \mathcal{F}, d)$.
- Apply a concentration inequality to each element of the net (finitely many).
- Union bound over the net.
- Use the $\varepsilon$-net property to extend the bound to the full class.
This yields generalization bounds of the schematic form: with probability at least $1 - \delta$,

$$\sup_{f \in \mathcal{F}} \big| R(f) - \widehat{R}_n(f) \big| \;\lesssim\; \varepsilon + \sqrt{\frac{\log N(\varepsilon, \mathcal{F}, d) + \log(1/\delta)}{n}}.$$
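Plugging numbers into a bound of this shape is straightforward. The sketch below uses illustrative values; `covering_bound`, the constants, the Lipschitz assumption, and the entropy choice are all hypothetical, not from the references:

```python
import math

def covering_bound(log_N, n, delta, eps, lip=1.0):
    """Schematic covering-number bound: a discretization term (2 * lip * eps)
    plus a Hoeffding-style deviation term union-bounded over the net.
    Assumes a [0,1]-bounded loss, lip-Lipschitz in the function values."""
    return 2 * lip * eps + math.sqrt((log_N + math.log(1 / delta)) / (2 * n))

# Illustrative numbers: metric entropy d * log(3/eps) (the volumetric rate
# for a d-dimensional ball), n samples, confidence level 1 - delta.
d, n, delta, eps = 10, 10_000, 0.05, 0.05
bound = covering_bound(d * math.log(3 / eps), n, delta, eps)
print(round(bound, 3))
```

Note the trade-off in $\varepsilon$: a finer net shrinks the discretization term but inflates $\log N$, and the bound is typically optimized over $\varepsilon$.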
Packing numbers in lower bounds
Packing numbers appear in minimax lower bounds via Fano's inequality or Assouad's lemma. The pattern:
- Construct a large $\varepsilon$-packing $\{\theta_1, \dots, \theta_M\}$ of the parameter space.
- Show that distinguishing between packing elements requires many samples.
- Conclude that no estimator can achieve accuracy better than $\varepsilon/2$ with fewer than the required samples (an estimate within $\varepsilon/2$ of the truth would identify the packing element, since the elements are more than $\varepsilon$ apart).
The packing number gives the number of "well-separated hypotheses" you must distinguish among, which controls the information-theoretic difficulty.
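A minimal sketch of the arithmetic behind such a lower bound, via Fano's inequality (the packing size and per-sample KL divergence here are made-up illustrative values):

```python
import math

def fano_min_samples(log_M, kl_per_sample, target_error=0.5):
    """Fano's inequality: testing among M hypotheses from n i.i.d. samples errs
    with probability >= 1 - (n * KL_max + log 2) / log M.  Solving for the
    smallest n that could push the error below target_error gives a
    sample-size lower bound."""
    return ((1 - target_error) * log_M - math.log(2)) / kl_per_sample

# Hypothetical setup: a packing with M = 2^20 elements whose distributions
# are pairwise within KL divergence eps^2 per sample.
log_M, eps = 20 * math.log(2), 0.1
n_min = fano_min_samples(log_M, eps ** 2)
print(math.ceil(n_min))
```

Larger packings (bigger $\log M$) and harder-to-distinguish hypotheses (smaller per-sample KL) both push the required sample size up, which is exactly how packing numbers control information-theoretic difficulty.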
Key Assumptions That Differ
| | Covering number $N(\varepsilon, T, d)$ | Packing number $M(\varepsilon, T, d)$ |
|---|---|---|
| Asks | Minimum set to approximate all of $T$ | Maximum set of well-separated points |
| Extremal | Minimization (fewest centers) | Maximization (most points) |
| Proof role | Upper bounds (discretization + union bound) | Lower bounds (Fano, Assouad) |
| Relation to entropy | $\log N(\varepsilon, T, d)$ is the metric entropy | $\log M(\varepsilon, T, d)$ is the packing entropy |
| Asymptotically | Same rate as packing (up to a factor of 2 in $\varepsilon$) | Same rate as covering |
Metric Entropy
The metric entropy is $\log N(\varepsilon, T, d)$, which measures the number of bits needed to specify a point of $T$ to accuracy $\varepsilon$. By the sandwich inequality, $\log N(\varepsilon, T, d)$ and $\log M(\varepsilon, T, d)$ have the same asymptotic growth rate (up to replacing $\varepsilon$ by $2\varepsilon$).
Unit ball in R^d
For the unit ball $B_2^d$ with the Euclidean metric, volume comparison gives, for $0 < \varepsilon \le 1$:

$$\left(\frac{1}{\varepsilon}\right)^d \;\le\; N(\varepsilon, B_2^d, \|\cdot\|_2) \;\le\; \left(1 + \frac{2}{\varepsilon}\right)^d.$$
The metric entropy is $\Theta(d \log(1/\varepsilon))$, linear in the dimension. This is why dimension appears linearly in generalization bounds based on covering numbers.
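These volumetric bounds can be sanity-checked with a rough Monte Carlo packing in low dimension. This is a sketch under the assumption that a greedy packing of dense random samples approximates a maximal packing of the ball:

```python
import random

def greedy_packing_ball(dim, eps, n_samples=20_000, seed=0):
    """Monte Carlo eps-packing of the Euclidean unit ball: sample uniform
    points (by rejection from the cube) and keep each one that is more
    than eps from every point already kept."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_samples):
        p = [rng.uniform(-1, 1) for _ in range(dim)]
        if sum(x * x for x in p) > 1:
            continue  # outside the ball
        if all(sum((a - b) ** 2 for a, b in zip(p, q)) > eps ** 2 for q in kept):
            kept.append(p)
    return len(kept)

dim, eps = 2, 0.5
m = greedy_packing_ball(dim, eps)
lo = (1 / eps) ** dim        # volumetric lower bound: (1/eps)^d <= N(eps) <= M(eps)
hi = (1 + 4 / eps) ** dim    # M(eps) <= N(eps/2) <= (1 + 4/eps)^d
print(lo, m, hi)
```

The empirical packing count lands between the two volumetric bounds, as the sandwich inequality predicts.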
Lipschitz function classes
The class of $L$-Lipschitz functions from $[0,1]$ to $[0,1]$ has covering number $\exp(\Theta(L/\varepsilon))$ in the sup-norm. The metric entropy is $\Theta(L/\varepsilon)$, independent of any "dimension" parameter (polynomial rather than logarithmic in $1/\varepsilon$). This is why nonparametric regression over Lipschitz classes achieves the rate $n^{-2/3}$ (in squared error) in one dimension.
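The $L/\varepsilon$ scaling comes from the standard net construction: grid the domain with step $\varepsilon/L$ and quantize values to multiples of $\varepsilon$. A sketch of the resulting count (the constants here are schematic, not sharp):

```python
import math

def log_lipschitz_net_size(L, eps):
    """Upper bound on the log-size of a sup-norm net for L-Lipschitz
    f: [0,1] -> [0,1].  With domain step eps/L, f moves by at most eps
    between grid points, so its eps-quantized value moves by at most two
    quantization steps: <= 5 choices per step, times <= 1/eps + 1 choices
    for the starting value."""
    k = math.ceil(L / eps)                    # number of domain steps
    return math.log(1 / eps + 1) + k * math.log(5)

for eps in (0.2, 0.1, 0.05):
    print(eps, round(log_lipschitz_net_size(1.0, eps), 1))
```

Halving $\varepsilon$ roughly doubles the log-count, confirming the linear growth in $1/\varepsilon$ (contrast with $d \log(1/\varepsilon)$ for the finite-dimensional ball).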
Common Confusions
Covering and packing numbers are not the same
Despite agreeing up to a factor of 2 in the resolution $\varepsilon$, they are not interchangeable in proofs. Using a packing number where a covering number is needed (or vice versa) silently rescales $\varepsilon$ by a factor of 2, which can change the constants in the final bound. In asymptotic analysis this rarely matters, but in non-asymptotic bounds the constants can be important.
The metric matters as much as the set
The covering number of a hypothesis class depends heavily on the metric used. In learning theory, the relevant metric is usually the empirical $L_1$ or $L_2$ norm, not the parameter-space Euclidean norm. A class that is "small" in parameter space can be "large" in function space, and vice versa.
References
Canonical:
- van der Vaart & Wellner, Weak Convergence and Empirical Processes (1996), Section 2.1
- Vershynin, High-Dimensional Probability (2018), Chapter 4
Current:
- Wainwright, High-Dimensional Statistics (2019), Chapter 5