Paper breakdown

Uniform Convergence May Be Unable to Explain Generalization in Deep Learning

Vaishnavh Nagarajan and J. Zico Kolter · 2019 · NeurIPS 2019

Constructs an overparameterised setting where every uniform-convergence-based generalisation bound — including margin and norm-based bounds — must be vacuous, while SGD generalises well. Establishes a theoretical limit on what classical statistical-learning theory can explain about deep networks.

Overview

Nagarajan and Kolter (2019) gave classical generalisation theory its sharpest negative result for deep learning. The paper constructs a simple overparameterised classification setup — a high-dimensional Gaussian-mixture problem trained with SGD — where the trained model generalises well but where every two-sided uniform-convergence bound that holds for the relevant hypothesis class must be vacuous.

The result is sharp in two senses. It applies not just to vanilla VC-dimension-style bounds (which were already known to be loose for deep networks) but to the entire class of uniform-convergence bounds, including Rademacher complexity, margin bounds, norm-based bounds (Bartlett-Foster-Telgarsky, Neyshabur, etc.), and PAC-Bayes bounds when those reduce to uniform convergence. It is a structural lower bound, not a tightness result for a specific bound.

The paper does not claim deep networks do not generalise. It claims that if you want to explain why they generalise, classical uniform-convergence machinery is provably insufficient — you need to use something about the training algorithm or the data distribution that does not factor through uniform convergence over the relevant hypothesis class.

Mathematical Contributions

The construction

Consider a binary classification problem on $\mathbb{R}^d$ with class-conditional distributions $\mathcal{N}(\mu_y, I_d/d)$ for $y \in \{-1, +1\}$, where $\|\mu_+ - \mu_-\| = 1$. The hypothesis class $\mathcal{H}$ is the set of linear classifiers obtained by running gradient descent with a fixed step size from a fixed initialisation on training sets of size $n$, as the training set varies. The class is in effect parameterised by the training set itself.

The key choice is dimension scaling: take $d$ growing with $n$ (the paper uses $d$ much larger than $n$). The trained classifier $h_S$ depends sensitively on $S$ (flipping a single sample's label produces a different $h_S$), yet $h_S$ generalises well on the original distribution.
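The following is a minimal numerical sketch of this setup, not the paper's experiment: a Gaussian mixture with $d \gg n$ and a linear classifier trained by full-batch gradient descent on the logistic loss. The dimensions, step size, iteration count, and the `sample`/`train` helpers are all illustrative assumptions; the point is to see both ingredients at once, a classifier that generalises well and that moves when a single training label is flipped.

```python
# Illustrative sketch only: high-dimensional Gaussian mixture with d >> n, linear
# classifier trained by full-batch gradient descent on the logistic loss.
# All sizes, step sizes, and iteration counts are assumptions, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5_000                        # d much larger than n
mu = np.zeros(d); mu[0] = 0.5           # class means +mu and -mu, so ||mu_+ - mu_-|| = 1

def sample(m):
    y = rng.choice([-1.0, 1.0], size=m)
    x = y[:, None] * mu + rng.normal(scale=np.sqrt(1.0 / d), size=(m, d))
    return x, y

def train(x, y, steps=500, lr=1.0):
    w = np.zeros(d)                     # fixed initialisation
    for _ in range(steps):
        margins = y * (x @ w)
        grad = -(y / (1.0 + np.exp(margins))) @ x / len(y)   # logistic-loss gradient
        w -= lr * grad
    return w

def accuracy(w, x, y):
    return np.mean(np.sign(x @ w) == y)

x_train, y_train = sample(n)
w = train(x_train, y_train)
x_test, y_test = sample(2_000)
print("train acc:", accuracy(w, x_train, y_train))
print("test acc: ", accuracy(w, x_test, y_test))        # close to 1 despite d >> n

# Sensitivity to the training set: flip a single label and retrain.
y_flip = y_train.copy(); y_flip[0] *= -1.0
w_flip = train(x_train, y_flip)
print("||w - w_flip||:", np.linalg.norm(w - w_flip))    # the classifier moves
print("test acc after the flip:", accuracy(w_flip, x_test, y_test))
```

Pushed to an adversarial extreme, the same two ingredients are what the impossibility argument in the next subsection exploits.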

Why uniform convergence must be vacuous

Uniform convergence at confidence $1 - \delta$ states:

$$\sup_{h \in \mathcal{H}} \big| L_D(h) - L_S(h) \big| \;\leq\; \epsilon(n, \mathcal{H}, \delta)$$

where $L_D$ is the population risk, $L_S$ the empirical risk, and $\epsilon$ a complexity term depending on $n$, the hypothesis class, and the failure probability $\delta$.

The paper's "two-sided" version of this bound (the standard one) requires $\epsilon$ to upper bound the gap for every $h \in \mathcal{H}$. The construction shows that within $\mathcal{H}$, the class of trained classifiers reachable from this distribution and algorithm, there exist hypotheses with gap close to $1/2$ even though the actual trained classifier has gap close to zero.

The mechanism is geometric. The trained classifier has a positive margin on its training set $S$, so it generalises. But within $\mathcal{H}$ one can construct, by adversarial choice of training set, a classifier whose decision boundary cuts through the population mass while still classifying its own training set perfectly. Such a classifier has empirical risk zero and population risk near $1/2$. Uniform convergence must accommodate this hypothesis, so the bound on $\sup_h |L_D(h) - L_S(h)|$ has to be at least roughly $1/2$, which is vacuous.
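Written out, the lower-bound step is a short chain of inequalities. Here $S'$ denotes the training set on which the bad classifier $h_{S'}$ was trained (a draw the high-probability bound must also cover), and the figures $0$ and $\approx 1/2$ are the ones quoted above:

$$\epsilon(n, \mathcal{H}, \delta) \;\geq\; \sup_{h \in \mathcal{H}} \big| L_D(h) - L_{S'}(h) \big| \;\geq\; \big| L_D(h_{S'}) - L_{S'}(h_{S'}) \big| \;\approx\; \tfrac{1}{2} - 0 \;=\; \tfrac{1}{2}.$$

Any valid two-sided bound must absorb this gap, so it cannot certify anything better than chance-level generalisation in this regime.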

Why this is sharp

Two technical points make the result a genuine lower bound rather than a hand-wave.

First, the relevant hypothesis class is "the set of classifiers actually produced by SGD on this distribution" — not "all linear classifiers" or "all networks of bounded norm." The paper proves that even this realised, algorithm-dependent class is rich enough to contain bad hypotheses for some training sets.

Second, the construction works for any uniform-convergence bound, including PAC-Bayes when it reduces to uniform convergence. The result is not about a specific complexity measure being too loose — it is about the uniform-convergence framework itself.

What the paper does not claim

The paper is careful about scope. It does not claim:

  • Deep networks fail to generalise.
  • All known generalisation bounds are vacuous (one-sided bounds and algorithm-dependent stability bounds escape the construction).
  • Generalisation is unexplainable.

It claims that if your bound controls the gap uniformly over the hypothesis class, using only the training set size, properties of the data distribution, and properties of that class, then on this construction it must be vacuous. To explain generalisation here, you need stability, implicit bias, or some other property of the trajectory SGD actually takes.

Why It Matters Now

This paper is the cleanest articulation of what classical statistical-learning theory cannot explain. Five years on, it shapes how the deep-learning-theory community frames its open problems.

The constructive consequence is that explanations of deep-learning generalisation must be algorithm-dependent. Algorithmic stability (Bousquet-Elisseeff and successors), implicit-bias arguments (gradient descent on logistic loss converges to the max-margin classifier on linearly separable data), and trajectory-aware bounds are the directions that survive the Nagarajan-Kolter critique because they do not factor through uniform convergence over a fixed hypothesis class. The benign-overfitting and double-descent literatures live in the same intellectual neighbourhood.
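The parenthetical implicit-bias claim is easy to see numerically. Below is a minimal sketch of the Soudry et al. result, not their experiment: gradient descent on the logistic loss over linearly separable 2D data, compared against the max-margin direction found by brute force over angles. The data generation, the far-away cluster, the step size, and the iteration counts are all assumptions chosen for illustration.

```python
# Illustrative sketch: gradient descent on the logistic loss over separable data
# converges in direction to the L2 max-margin separator (Soudry et al., 2018).
import numpy as np

rng = np.random.default_rng(1)

# Linearly separable 2D data for classifiers through the origin (no bias term).
v_true = np.array([2.0, 1.0]) / np.sqrt(5.0)
x = rng.normal(size=(200, 2))
x = x[np.abs(x @ v_true) > 0.3]            # enforce a positive margin
y = np.sign(x @ v_true)
# A far-away, correctly labelled cluster: it drags the early gradient direction
# around but does not move the max-margin separator.
x = np.vstack([x, np.tile([4.0, -1.5], (30, 1)) + 0.1 * rng.normal(size=(30, 2))])
y = np.concatenate([y, np.ones(30)])

def max_margin_direction(x, y, n_angles=20_000):
    """Brute-force the max-margin direction over angles (fine in 2D)."""
    thetas = np.linspace(0.0, 2.0 * np.pi, n_angles)
    dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
    worst_margin = (y[:, None] * (x @ dirs.T)).min(axis=0)
    return dirs[np.argmax(worst_margin)]

w_star = max_margin_direction(x, y)

w, lr = np.zeros(2), 0.1
checkpoints = {100, 10_000, 300_000}
for t in range(1, 300_001):
    margins = y * (x @ w)
    grad = -(y / (1.0 + np.exp(margins))) @ x / len(y)   # logistic-loss gradient
    w -= lr * grad
    if t in checkpoints:
        cos = (w / np.linalg.norm(w)) @ w_star
        print(f"step {t:>7}: cos(w, max-margin direction) = {cos:.5f}")
# The cosine should drift toward 1 as training continues, even though the
# training loss is already near zero long before the direction settles.
```

The training loss is effectively zero long before the direction stops moving; what keeps changing is which separator, among the many that fit the data, gradient descent is drifting toward. That trajectory-level fact is exactly the kind of algorithm-dependent property the Nagarajan-Kolter critique says an explanation must use.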

The paper also cleared up an earlier confusion. Many norm-based bounds proposed between 2017 and 2019 were vacuous on standard benchmarks yet were treated as "the right direction." Nagarajan and Kolter showed that this vacuity is not a quantitative shortcoming to be fixed by a tighter constant; it is a structural feature of uniform-convergence reasoning in this regime.

For practitioners, the paper is one of the few theory-side results that has clear implications for what to measure. If you care about generalisation, look at the algorithm and the trajectory, not at the hypothesis class. The interpretability and scaling-laws literatures both implicitly accept this premise.

References

Canonical:

  • Nagarajan, V., & Kolter, J. Z. (2019). "Uniform Convergence May Be Unable to Explain Generalization in Deep Learning." NeurIPS. arXiv:1902.04742.

Direct precursors and motivation:

  • Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). "Understanding Deep Learning Requires Rethinking Generalization." ICLR. arXiv:1611.03530. The paper that opened the question.
  • Neyshabur, B., Bhojanapalli, S., McAllester, D., & Srebro, N. (2017). "Exploring Generalization in Deep Learning." NeurIPS. arXiv:1706.08947.
  • Bartlett, P. L., Foster, D. J., & Telgarsky, M. J. (2017). "Spectrally-Normalized Margin Bounds for Neural Networks." NeurIPS. arXiv:1706.08498. One of the bounds the construction defeats.

Algorithm-dependent alternatives:

  • Soudry, D. et al. (2018). "The Implicit Bias of Gradient Descent on Separable Data." JMLR. arXiv:1710.10345.
  • Bartlett, P. L., Long, P. M., Lugosi, G., & Tsigler, A. (2020). "Benign overfitting in linear regression." PNAS. arXiv:1906.11300.
  • Hardt, M., Recht, B., & Singer, Y. (2016). "Train Faster, Generalize Better: Stability of Stochastic Gradient Descent." ICML. arXiv:1509.01240.

Standard treatments of uniform convergence:

  • Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge. Chapters 4-6, 26.
  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge. Chapter 4.
  • Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of Machine Learning. MIT Press. Chapter 3.

Last reviewed: May 5, 2026