What Each Measures
Both covering numbers and packing numbers quantify the "size" of a set $T$ in a metric space at a given resolution $\varepsilon$. They answer dual questions about the same geometric object.
Covering numbers ask: what is the minimum number of $\varepsilon$-balls needed to cover the entire set?
Packing numbers ask: what is the maximum number of points in the set that are mutually $\varepsilon$-separated?
Side-by-Side Statement
Covering Number
The covering number $N(\varepsilon, T, d)$ is the minimum cardinality of an $\varepsilon$-net for $T$ in metric $d$. That is, the smallest set $\{x_1, \dots, x_N\} \subseteq T$ such that for every $t \in T$, there exists $x_i$ with $d(t, x_i) \le \varepsilon$.
Packing Number
The packing number $M(\varepsilon, T, d)$ is the maximum cardinality of an $\varepsilon$-packing of $T$. That is, the largest set $\{t_1, \dots, t_M\} \subseteq T$ such that $d(t_i, t_j) > \varepsilon$ for all $i \neq j$.
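Both quantities can be computed greedily on a finite metric space. The sketch below (hypothetical helper names, not from any library) builds a maximal packing and then checks that it automatically covers:

```python
def greedy_packing(points, eps, dist):
    """Greedily build a maximal eps-packing: keep a point only if it is
    more than eps away from every point kept so far."""
    kept = []
    for p in points:
        if all(dist(p, q) > eps for q in kept):
            kept.append(p)
    return kept

def is_eps_net(points, centers, eps, dist):
    """Check the covering property: every point is within eps of some center."""
    return all(any(dist(p, c) <= eps for c in centers) for p in points)

# Toy metric space: a fine grid on [0, 1] with the absolute-value metric.
grid = [i / 1000 for i in range(1001)]
dist = lambda x, y: abs(x - y)

packing = greedy_packing(grid, 0.25, dist)
print(len(packing), is_eps_net(grid, packing, 0.25, dist))
```

Because the greedy packing is maximal, no point of the space is more than $\varepsilon$ from it, so the covering check always succeeds.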
The Factor-of-2 Relationship
Covering-Packing Sandwich Inequality
Statement
For any subset $T$ of a metric space $(X, d)$ and any $\varepsilon > 0$:

$$M(2\varepsilon, T, d) \;\le\; N(\varepsilon, T, d) \;\le\; M(\varepsilon, T, d).$$
Intuition
The right inequality says: a maximal packing is also a covering. If you have a maximal $\varepsilon$-packing $\{t_1, \dots, t_M\}$, then every point in $T$ must be within $\varepsilon$ of some $t_i$ (otherwise you could add it to the packing, contradicting maximality). So the packing is itself an $\varepsilon$-net, giving $N(\varepsilon, T, d) \le M(\varepsilon, T, d)$.
The left inequality says: every $\varepsilon$-ball in a covering can contain at most one point from a $2\varepsilon$-packing (since packing points are more than $2\varepsilon$ apart, they cannot both be within $\varepsilon$ of the same center), giving $M(2\varepsilon, T, d) \le N(\varepsilon, T, d)$.
Failure Mode
The factor of 2 is tight in general. For the unit interval $[0,1]$ with the standard metric and $\varepsilon = 1/4$: $N(1/4, [0,1]) = 2$ (centers at $1/4$ and $3/4$) while $M(1/2, [0,1]) = 2$ (points at $0$ and $1$), so the left inequality holds with equality; meanwhile $M(1/4, [0,1]) = 4$. The gap matters when you need exact constants, but for asymptotic analysis (log covering number vs. log packing number), the factor of 2 vanishes.
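These interval numbers can be checked directly. A small sketch using exact rational arithmetic (the helper functions are illustrative, written for the conventions above):

```python
from fractions import Fraction

def interval_cover_centers(eps):
    """Minimum eps-cover of [0,1]: each ball covers an interval of length
    2*eps, so greedily placing centers at eps, 3*eps, ... is optimal."""
    centers, left = [], Fraction(0)
    while left < 1:
        centers.append(left + eps)
        left += 2 * eps
    return centers

def interval_packing(eps, step=Fraction(1, 1000)):
    """Maximum eps-packing of [0,1] (separation strictly greater than eps),
    built greedily left-to-right on a fine grid -- optimal in one dimension."""
    pts, x = [], Fraction(0)
    while x <= 1:
        if not pts or x - pts[-1] > eps:
            pts.append(x)
        x += step
    return pts

eps = Fraction(1, 4)
M2, N, M = interval_packing(2 * eps), interval_cover_centers(eps), interval_packing(eps)
print(len(M2), len(N), len(M))  # sandwich: M(2*eps) <= N(eps) <= M(eps)
```

With $\varepsilon = 1/4$ this prints the sandwich $2 \le 2 \le 4$ from the example above.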
Where Each Appears Naturally
Covering numbers in upper bounds
Covering numbers appear when you need to discretize a continuous set and apply a union bound. The standard pattern:
- Cover the hypothesis class $\mathcal{F}$ with an $\varepsilon$-net of size $N(\varepsilon, \mathcal{F}, d)$.
- Apply a concentration inequality to each element of the net (finitely many).
- Union bound over the net.
- Use the $\varepsilon$-net property to extend the bound to the full class.
This yields generalization bounds of the schematic form: with probability at least $1 - \delta$,

$$\sup_{f \in \mathcal{F}} \big| R(f) - \widehat{R}_n(f) \big| \;\lesssim\; \varepsilon + \sqrt{\frac{\log N(\varepsilon, \mathcal{F}, d) + \log(1/\delta)}{n}}.$$
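Plugging numbers into a bound of this shape is straightforward. The sketch below uses illustrative values; `covering_bound`, the constants, the Lipschitz assumption, and the entropy choice are all hypothetical, not from the references:

```python
import math

def covering_bound(log_N, n, delta, eps, lip=1.0):
    """Schematic covering-number bound: a discretization term (2 * lip * eps)
    plus a Hoeffding-style deviation term union-bounded over the net.
    Assumes a [0,1]-bounded loss, lip-Lipschitz in the function values."""
    return 2 * lip * eps + math.sqrt((log_N + math.log(1 / delta)) / (2 * n))

# Illustrative numbers: metric entropy d * log(3/eps) (the volumetric rate
# for a d-dimensional ball), n samples, confidence level 1 - delta.
d, n, delta, eps = 10, 10_000, 0.05, 0.05
bound = covering_bound(d * math.log(3 / eps), n, delta, eps)
print(round(bound, 3))
```

Note the trade-off in $\varepsilon$: a finer net shrinks the discretization term but inflates $\log N$, and the bound is typically optimized over $\varepsilon$.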
Packing numbers in lower bounds
Packing numbers appear in minimax lower bounds via Fano's inequality or Assouad's lemma. The pattern:
- Construct a large $\varepsilon$-packing $\{\theta_1, \dots, \theta_M\}$ of the parameter space.
- Show that distinguishing between packing elements requires many samples.
- Conclude that no estimator can achieve accuracy better than $\varepsilon/2$ with fewer than the required samples (an estimate within $\varepsilon/2$ of the truth would identify the packing element, since the elements are more than $\varepsilon$ apart).
The packing number gives the number of "well-separated hypotheses" you must distinguish among, which controls the information-theoretic difficulty.
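A minimal sketch of the arithmetic behind such a lower bound, via Fano's inequality (the packing size and per-sample KL divergence here are made-up illustrative values):

```python
import math

def fano_min_samples(log_M, kl_per_sample, target_error=0.5):
    """Fano's inequality: testing among M hypotheses from n i.i.d. samples errs
    with probability >= 1 - (n * KL_max + log 2) / log M.  Solving for the
    smallest n that could push the error below target_error gives a
    sample-size lower bound."""
    return ((1 - target_error) * log_M - math.log(2)) / kl_per_sample

# Hypothetical setup: a packing with M = 2^20 elements whose distributions
# are pairwise within KL divergence eps^2 per sample.
log_M, eps = 20 * math.log(2), 0.1
n_min = fano_min_samples(log_M, eps ** 2)
print(math.ceil(n_min))
```

Larger packings (bigger $\log M$) and harder-to-distinguish hypotheses (smaller per-sample KL) both push the required sample size up, which is exactly how packing numbers control information-theoretic difficulty.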
Key Assumptions That Differ
| | Covering number $N(\varepsilon, T, d)$ | Packing number $M(\varepsilon, T, d)$ |
|---|---|---|
| Asks | Minimum set to approximate all of $T$ | Maximum set of well-separated points |
| Extremal | Minimization (fewest centers) | Maximization (most points) |
| Proof role | Upper bounds (discretization + union bound) | Lower bounds (Fano, Assouad) |
| Relation to entropy | $\log N(\varepsilon, T, d)$ is the metric entropy | $\log M(\varepsilon, T, d)$ is the packing entropy |
| Asymptotically | Same rate as packing (up to a factor of 2 in $\varepsilon$) | Same rate as covering |
Metric Entropy
The metric entropy is $\log N(\varepsilon, T, d)$, which measures the number of bits needed to specify a point of $T$ to accuracy $\varepsilon$. By the sandwich inequality, $\log N(\varepsilon, T, d)$ and $\log M(\varepsilon, T, d)$ have the same asymptotic growth rate (up to replacing $\varepsilon$ by $2\varepsilon$).
Unit ball in R^d
For the unit ball $B_2^d$ with the Euclidean metric, volume comparison gives, for $0 < \varepsilon \le 1$:

$$\left(\frac{1}{\varepsilon}\right)^d \;\le\; N(\varepsilon, B_2^d, \|\cdot\|_2) \;\le\; \left(1 + \frac{2}{\varepsilon}\right)^d.$$
The metric entropy is $\Theta(d \log(1/\varepsilon))$, linear in the dimension. This is why dimension appears linearly in generalization bounds based on covering numbers.
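These volumetric bounds can be sanity-checked with a rough Monte Carlo packing in low dimension. This is a sketch under the assumption that a greedy packing of dense random samples approximates a maximal packing of the ball:

```python
import random

def greedy_packing_ball(dim, eps, n_samples=20_000, seed=0):
    """Monte Carlo eps-packing of the Euclidean unit ball: sample uniform
    points (by rejection from the cube) and keep each one that is more
    than eps from every point already kept."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_samples):
        p = [rng.uniform(-1, 1) for _ in range(dim)]
        if sum(x * x for x in p) > 1:
            continue  # outside the ball
        if all(sum((a - b) ** 2 for a, b in zip(p, q)) > eps ** 2 for q in kept):
            kept.append(p)
    return len(kept)

dim, eps = 2, 0.5
m = greedy_packing_ball(dim, eps)
lo = (1 / eps) ** dim        # volumetric lower bound: (1/eps)^d <= N(eps) <= M(eps)
hi = (1 + 4 / eps) ** dim    # M(eps) <= N(eps/2) <= (1 + 4/eps)^d
print(lo, m, hi)
```

The empirical packing count lands between the two volumetric bounds, as the sandwich inequality predicts.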
Lipschitz function classes
The class of $L$-Lipschitz functions from $[0,1]$ to $[0,1]$ has covering number $\exp(\Theta(L/\varepsilon))$ in the sup-norm. The metric entropy is $\Theta(L/\varepsilon)$, independent of any "dimension" parameter (polynomial rather than logarithmic in $1/\varepsilon$). This is why nonparametric regression over Lipschitz classes achieves the rate $n^{-2/3}$ (in squared error) in one dimension.
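The $L/\varepsilon$ scaling comes from the standard net construction: grid the domain with step $\varepsilon/L$ and quantize values to multiples of $\varepsilon$. A sketch of the resulting count (the constants here are schematic, not sharp):

```python
import math

def log_lipschitz_net_size(L, eps):
    """Upper bound on the log-size of a sup-norm net for L-Lipschitz
    f: [0,1] -> [0,1].  With domain step eps/L, f moves by at most eps
    between grid points, so its eps-quantized value moves by at most two
    quantization steps: <= 5 choices per step, times <= 1/eps + 1 choices
    for the starting value."""
    k = math.ceil(L / eps)                    # number of domain steps
    return math.log(1 / eps + 1) + k * math.log(5)

for eps in (0.2, 0.1, 0.05):
    print(eps, round(log_lipschitz_net_size(1.0, eps), 1))
```

Halving $\varepsilon$ roughly doubles the log-count, confirming the linear growth in $1/\varepsilon$ (contrast with $d \log(1/\varepsilon)$ for the finite-dimensional ball).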
Common Confusions
Covering and packing numbers are not the same
Despite agreeing up to a factor of 2 in the resolution $\varepsilon$, they are not interchangeable in proofs. Using a packing number where a covering number is needed (or vice versa) silently rescales $\varepsilon$ by a factor of 2, which can change the constants in the final bound. In asymptotic analysis this rarely matters, but in non-asymptotic bounds the constants can be important.
The metric matters as much as the set
The covering number of a hypothesis class depends heavily on the metric used. In learning theory, the relevant metric is usually the empirical $L_1$ or $L_2$ norm, not the parameter-space Euclidean norm. A class that is "small" in parameter space can be "large" in function space, and vice versa.
References
Canonical:
- van der Vaart & Wellner, Weak Convergence and Empirical Processes (1996), Section 2.1
- Vershynin, High-Dimensional Probability (2018), Chapter 4
Current:
- Wainwright, High-Dimensional Statistics (2019), Chapter 5