
Comparison

Covering Numbers vs. Packing Numbers

Covering numbers count the minimum size of an $\varepsilon$-net. Packing numbers count the maximum number of $\varepsilon$-separated points. The two agree up to a factor of 2 in the scale $\varepsilon$, but each arises naturally in a different proof context.

What Each Measures

Both covering numbers and packing numbers quantify the "size" of a set in a metric space at a given resolution $\varepsilon$. They answer dual questions about the same geometric object.

Covering numbers ask: what is the minimum number of $\varepsilon$-balls needed to cover the entire set?

Packing numbers ask: what is the maximum number of points in the set that are mutually $\varepsilon$-separated?

Side-by-Side Statement

Definition

Covering Number

The covering number $\mathcal{N}(\varepsilon, T, d)$ is the minimum cardinality of an $\varepsilon$-net for $T$ in the metric $d$. That is, the smallest set $\{t_1, \ldots, t_N\} \subseteq T$ such that for every $t \in T$, there exists $t_i$ with $d(t, t_i) \leq \varepsilon$.

Definition

Packing Number

The packing number $\mathcal{M}(\varepsilon, T, d)$ is the maximum cardinality of an $\varepsilon$-packing of $T$. That is, the largest set $\{t_1, \ldots, t_M\} \subseteq T$ such that $d(t_i, t_j) > \varepsilon$ for all $i \neq j$.

The Factor-of-2 Relationship

Theorem

Covering-Packing Sandwich Inequality

Statement

For any subset $T$ of a metric space $(X, d)$ and any $\varepsilon > 0$:

$$\mathcal{M}(2\varepsilon, T, d) \leq \mathcal{N}(\varepsilon, T, d) \leq \mathcal{M}(\varepsilon, T, d)$$

Intuition

The right inequality says: a maximal packing is also a covering. If you have a maximal $\varepsilon$-packing $\{t_1, \ldots, t_M\}$, then every point of $T$ must be within $\varepsilon$ of some $t_i$ (otherwise it could be added to the packing, contradicting maximality). So the packing is itself an $\varepsilon$-net.

The left inequality says: every $\varepsilon$-ball in a covering can contain at most one point of a $2\varepsilon$-packing (since packing points are more than $2\varepsilon$ apart, two of them cannot both be within $\varepsilon$ of the same center).
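The maximality argument behind the right inequality can be checked numerically: build a maximal packing greedily, then verify that the resulting set is automatically an $\varepsilon$-net. A sketch (the greedy construction and all names are ours, for illustration on a random point cloud):

```python
import random

def greedy_maximal_packing(points, eps, dist):
    """Greedily keep points that are > eps from everything kept so far.
    The result is maximal: no remaining point can be added."""
    chosen = []
    for p in points:
        if all(dist(p, q) > eps for q in chosen):
            chosen.append(p)
    return chosen

random.seed(0)
dist = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
cloud = [(random.random(), random.random()) for _ in range(500)]
eps = 0.2

packing = greedy_maximal_packing(cloud, eps, dist)

# Maximality forces the covering property: every point of the cloud
# is within eps of some packing point, so the packing is an eps-net.
assert all(min(dist(p, q) for q in packing) <= eps for p in cloud)
print(len(packing), "packing points; they also form an eps-net")
```

Any point that failed the covering check could have been added to the packing, so the assertion is exactly the contradiction used in the proof.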

Failure Mode

The factor of 2 is tight in general. For the unit interval $[0,1]$ with the standard metric and $\varepsilon = 1/4$: $\mathcal{N}(1/4) = 2$ (centers at $1/4$ and $3/4$) while $\mathcal{M}(1/4) = 4$ (points at $0, 1/3, 2/3, 1$). The gap matters when you need exact constants, but for asymptotic analysis (log covering number vs. log packing number), the factor of 2 vanishes.
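These two values can be verified by brute force on a fine grid. On the real line, greedy left-to-right selection yields a maximum-cardinality packing (the same exchange argument as interval scheduling), so the check below is exact up to grid resolution; all names are ours:

```python
def interval_packing_size(points, eps):
    """On the real line, greedy left-to-right gives a maximum-cardinality
    eps-packing (exchange argument, as in interval scheduling)."""
    last, count = None, 0
    for p in sorted(points):
        if last is None or p - last > eps:
            last, count = p, count + 1
    return count

grid = [i / 1000 for i in range(1001)]   # fine grid standing in for [0, 1]
eps = 0.25

# M(1/4) = 4: greedy finds 4 points, and 5 points cannot fit,
# since 4 gaps each > 1/4 would sum to more than 1.
print(interval_packing_size(grid, eps))  # 4

# N(1/4) = 2: {1/4, 3/4} covers the grid, but no single center can,
# because covering both endpoints 0 and 1 would need radius >= 1/2.
covered = all(min(abs(t - c) for c in (0.25, 0.75)) <= eps for t in grid)
one_center_suffices = any(all(abs(t - c) <= eps for t in grid) for c in grid)
print(covered, one_center_suffices)      # True False
```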

Where Each Appears Naturally

Covering numbers in upper bounds

Covering numbers appear when you need to discretize a continuous set and apply a union bound. The standard pattern:

  1. Cover the hypothesis class with an $\varepsilon$-net of size $\mathcal{N}(\varepsilon)$.
  2. Apply a concentration inequality to each element of the net (finitely many).
  3. Union bound over the net.
  4. Use the $\varepsilon$-net property to extend the bound to the full class.

This yields generalization bounds like:

$$\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_n(h)| \leq \varepsilon + \sqrt{\frac{\log \mathcal{N}(\varepsilon, \mathcal{H}, \|\cdot\|_\infty) + \log(2/\delta)}{2n}}$$
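To get a feel for how such a bound behaves, the sketch below just evaluates the displayed expression numerically. The entropy input `log_N` must be supplied from an estimate; here we use the unit-ball bound $d \log(3/\varepsilon)$ as an illustrative choice (the function name and the $\varepsilon \sim n^{-1/3}$ schedule are ours, not a prescription):

```python
import math

def covering_generalization_bound(log_N, n, eps, delta):
    """Evaluate eps + sqrt((log N(eps) + log(2/delta)) / (2n))."""
    return eps + math.sqrt((log_N + math.log(2 / delta)) / (2 * n))

# Illustration: a class whose entropy matches the unit-ball bound
# log N(eps) <= d * log(3 / eps), with d = 10 and delta = 0.05.
d, delta = 10, 0.05
for n in (1_000, 10_000, 100_000):
    eps = n ** (-1 / 3)              # crude choice: shrink eps as n grows
    log_N = d * math.log(3 / eps)
    print(n, round(covering_generalization_bound(log_N, n, eps, delta), 4))
```

Shrinking $\varepsilon$ with $n$ trades the discretization term against the entropy term; choosing the optimal $\varepsilon$ for a given entropy function is exactly the balancing act behind rate calculations.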

Packing numbers in lower bounds

Packing numbers appear in minimax lower bounds via Fano's inequality or Assouad's lemma. The pattern:

  1. Construct a large $\varepsilon$-packing of the parameter space.
  2. Show that distinguishing between packing elements requires many samples.
  3. Conclude that no estimator can achieve accuracy better than $\varepsilon$ with fewer than the required samples.

The packing number gives the number of "well-separated hypotheses" you must distinguish among, which controls the information-theoretic difficulty.
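The information-theoretic step is usually Fano's inequality. In the form most often quoted for an $M$-point packing $\{\theta_1, \ldots, \theta_M\}$ (a standard statement, included here for reference):

```latex
% Fano's inequality for identifying which of M well-separated parameters
% generated the data X; I(J; X) is the mutual information between a
% uniformly drawn index J and the observation X.
\inf_{\hat{\theta}} \max_{1 \leq j \leq M}
  \mathbb{P}_{\theta_j}\!\left(\hat{\theta} \neq \theta_j\right)
  \;\geq\; 1 - \frac{I(J; X) + \log 2}{\log M}
```

If the packing is large (big $\log M$) while the elements remain statistically close (small $I(J; X)$), the error probability is bounded away from zero, which is what forces the minimax lower bound.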

Key Assumptions That Differ

| | Covering Number | Packing Number |
|---|---|---|
| Asks | Minimum set to approximate all of $T$ | Maximum set of well-separated points |
| Extremal problem | Minimization (fewest centers) | Maximization (most points) |
| Proof role | Upper bounds (discretization + union bound) | Lower bounds (Fano, Assouad) |
| Relation to entropy | $\log \mathcal{N}(\varepsilon)$ is the metric entropy | $\log \mathcal{M}(\varepsilon)$ is the packing entropy |
| Asymptotically | Same rate as packing (up to $\varepsilon \to \varepsilon/2$) | Same rate as covering |

Metric Entropy

The metric entropy is $\log \mathcal{N}(\varepsilon, T, d)$, which measures the number of bits needed to specify a point of $T$ to accuracy $\varepsilon$. By the sandwich inequality, $\log \mathcal{M}(\varepsilon)$ and $\log \mathcal{N}(\varepsilon)$ have the same asymptotic growth rate.

Example

Unit ball in R^d

For the unit ball $B_2^d = \{x \in \mathbb{R}^d : \|x\|_2 \leq 1\}$ with the Euclidean metric, for any $0 < \varepsilon \leq 1$:

$$\left(\frac{1}{\varepsilon}\right)^d \leq \mathcal{N}(\varepsilon, B_2^d, \|\cdot\|_2) \leq \left(\frac{3}{\varepsilon}\right)^d$$

The metric entropy is $\Theta(d \log(1/\varepsilon))$, linear in the dimension. This is why dimension appears linearly in generalization bounds based on covering numbers.
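The upper bound comes from a volume argument: disjoint $\varepsilon/2$-balls around packing points fit inside a slightly enlarged ball, so any $\varepsilon$-packing has at most $(1 + 2/\varepsilon)^d \leq (3/\varepsilon)^d$ points. A sketch that checks this empirically for the unit disk ($d = 2$; the sampling and helper names are ours):

```python
import math
import random

def greedy_packing_size(points, eps):
    """Size of a greedy maximal eps-packing of a finite point cloud."""
    chosen = []
    for p in points:
        if all(math.dist(p, q) > eps for q in chosen):
            chosen.append(p)
    return len(chosen)

random.seed(1)
cloud = []
while len(cloud) < 2000:                     # rejection-sample the unit disk
    p = (random.uniform(-1, 1), random.uniform(-1, 1))
    if math.hypot(*p) <= 1:
        cloud.append(p)

d = 2
for eps in (0.5, 0.25, 0.125):
    size = greedy_packing_size(cloud, eps)
    # Volume bound: any eps-packing of the unit ball has at most
    # (1 + 2/eps)^d <= (3/eps)^d points for eps <= 1.
    assert size <= (3 / eps) ** d
    print(eps, size)
```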

Example

Lipschitz function classes

The class of $L$-Lipschitz functions from $[0,1]$ to $[0,1]$ has covering number satisfying $\log \mathcal{N}(\varepsilon) \asymp L/\varepsilon$ in the sup-norm. The metric entropy is therefore $\Theta(L/\varepsilon)$: polynomial rather than logarithmic in $1/\varepsilon$, the signature of a genuinely nonparametric class. Balancing approximation error against this entropy is why nonparametric regression over Lipschitz classes achieves the $n^{-2/3}$ rate (in squared error) in one dimension.
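A quick heuristic balance (not a full proof) shows where $n^{-2/3}$ comes from, using the entropy $\log \mathcal{N}(\varepsilon) \asymp L/\varepsilon$:

```latex
% Balance squared approximation error eps^2 against the estimation
% error driven by the metric entropy, log N(eps) / n:
\varepsilon^2 \;\asymp\; \frac{\log \mathcal{N}(\varepsilon)}{n}
  \;\asymp\; \frac{L}{n \varepsilon}
\quad\Longrightarrow\quad
\varepsilon^3 \asymp \frac{L}{n}
\quad\Longrightarrow\quad
\varepsilon^2 \asymp \left(\frac{L}{n}\right)^{2/3} = \Theta\!\left(n^{-2/3}\right)
```

The same balance with entropy $d \log(1/\varepsilon)$ instead recovers the familiar parametric $d/n$ rate, which is the cleanest way to see the parametric/nonparametric divide in entropy terms.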

Common Confusions

Watch Out

Covering and packing numbers are not the same

Despite being within a factor of 2, they are not interchangeable in proofs. Using a packing number where a covering number is needed (or vice versa) introduces an extra factor of 2 in the $\varepsilon$ argument, which can change the final bound. In asymptotic analysis this rarely matters, but in non-asymptotic bounds the constants can be important.

Watch Out

The metric matters as much as the set

The covering number of a hypothesis class depends heavily on the metric used. In learning theory, the relevant metric is usually the empirical $L_2$ or $L_\infty$ norm, not the parameter-space Euclidean norm. A class that is "small" in parameter space can be "large" in function space, and vice versa.
