Paper breakdown
Reconciling modern machine learning practice and the bias-variance trade-off
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal · 2018 · PNAS 2019
Documents the double-descent risk curve: as model capacity grows past the interpolation threshold, test error decreases again instead of monotonically rising. Reframes the bias-variance picture for over-parameterized models.
Overview
Belkin, Hsu, Ma, and Mandal (2018) collected experiments on random features, random forests, and two-layer neural networks, and found that test error plotted against model capacity traces a single recurring shape. The classical bias-variance picture predicts a U-shaped curve: a sweet spot at moderate capacity, with test error climbing as the model overfits at high capacity. The empirical curve does this for capacity below an "interpolation threshold", the smallest model class that can fit the training data exactly, but then descends again past that threshold, sometimes to test errors below the sweet-spot minimum.
The paper does not claim deep neural networks generalize because of double descent. It claims that the textbook U-curve is wrong as a general account of model capacity, and that the over-parameterized regime — fit the training data to zero error, then keep growing — is its own thing with its own risk curve. The naming and the figure are the contribution; the experiments themselves are short and the analysis is left to follow-on work.
The paper is now read alongside Zhang et al. (2016) and Nakkiran et al. (2019). Zhang showed deep networks fit random labels (so VC and Rademacher capacity bounds are vacuous). Belkin et al. showed the failure mode of those classical bounds is shaped, not random — there really is a regime where bigger is better. Nakkiran later extended the picture to show double descent as a function of training time and dataset size, not only model size.
Mathematical Contributions
The interpolation threshold
For a hypothesis class $\mathcal{H}_N$ indexed by a capacity parameter $N$ (e.g. number of features, number of parameters, depth), the interpolation threshold $N^*$ is the smallest $N$ such that $\mathcal{H}_N$ can fit the training set $\{(x_i, y_i)\}_{i=1}^n$ exactly:

$$N^* = \min\Big\{N \,:\, \min_{h \in \mathcal{H}_N} \frac{1}{n} \sum_{i=1}^{n} \ell\big(h(x_i), y_i\big) = 0\Big\}.$$

For random feature models with $N$ i.i.d. Gaussian features and $n$ training points, $N^* = n$. For shallow ReLU networks of width $H$, the parameter count scales as $H(d+1)$, where $d$ is the input dimension, so the threshold sits roughly where $H(d+1) \approx n$. The threshold is empirically the location of the test-error spike.
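A quick numerical sanity check of the $N^* = n$ claim (my own illustration, not from the paper): an $n \times N$ random ReLU feature matrix generically has rank $\min(n, N)$, so exact interpolation of generic labels first becomes possible at $N = n$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 10, 30
X = rng.standard_normal((n, d))        # n training inputs in d dimensions
y = rng.standard_normal(n)             # generic labels

def can_interpolate(n_features, seed=0):
    """True if least squares on n_features random ReLU features fits y exactly."""
    fr = np.random.default_rng(seed)
    W = fr.standard_normal((d, n_features))
    Z = np.maximum(X @ W, 0.0)                       # n x N feature matrix
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)     # min-norm least squares
    return bool(np.allclose(Z @ beta, y, atol=1e-6))

print([(N, can_interpolate(N)) for N in (10, 25, 30, 60)])
```

Below $N = n = 30$ the residual is generically nonzero; at and above it, the fit is exact.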
The risk curve
The paper's central figure plots test risk as a function of capacity $N$ for a fixed training set:

$$R(N) = \mathbb{E}_{(x,y)}\big[\ell\big(\hat{h}_N(x), y\big)\big],$$

where $\hat{h}_N$ is the empirical risk minimizer over $\mathcal{H}_N$ (with appropriate tie-breaking, typically the minimum-norm interpolating solution). For $N < N^*$, $R(N)$ decreases: bias falls faster than variance grows. As $N \to N^*$ from below, $R(N)$ spikes: the unique interpolating solution is forced onto the data and inherits its noise. For $N > N^*$, $R(N)$ decreases again because the interpolation manifold widens and the implicit minimum-norm tie-break selects a smoother fit.
Why the spike
At the threshold the interpolating hypothesis is unique. For random features with standard Gaussian weights and $N \approx n$, the design matrix $Z \in \mathbb{R}^{n \times N}$ is approximately square, its smallest singular value approaches zero, and the closed-form solution $\hat{\beta} = Z^{+} y$ has a norm $\|\hat{\beta}\|_2$ that diverges. The interpolating fit is forced through every label, including label noise, and exhibits high variance at test points.
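The mechanism is easy to see numerically with a plain Gaussian design standing in for the feature matrix (an illustrative sketch, not the paper's code): the smallest singular value collapses as the design becomes square, and the norm of the fitted coefficients blows up with it.

```python
import numpy as np

n = 200
y = np.random.default_rng(0).standard_normal(n)   # arbitrary fixed targets

def min_sv_and_norm(n_features, seed=1):
    """Smallest singular value of an n x N Gaussian design and the norm of
    the min-norm least-squares solution Z^+ y (an interpolant when N >= n)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, n_features)) / np.sqrt(n_features)
    smin = float(np.linalg.svd(Z, compute_uv=False).min())
    beta_norm = float(np.linalg.norm(np.linalg.pinv(Z) @ y))
    return smin, beta_norm

for N in [100, 190, 200, 210, 400]:               # square (and worst) at N = n = 200
    smin, bnorm = min_sv_and_norm(N)
    print(f"N={N:4d}  sigma_min={smin:.4f}  ||beta||={bnorm:10.1f}")
```

The square case ($N = n$) has the smallest singular value and by far the largest coefficient norm; widening the design restores conditioning.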
Why the second descent
For $N > n$ the system is under-determined: many coefficient vectors $\beta$ interpolate the training data, i.e. satisfy $Z\beta = y$. The minimum-$\ell_2$-norm solution $\hat{\beta} = Z^{+} y$ (Moore-Penrose pseudoinverse) is one of them, and it is the implicit choice made by gradient descent from zero initialization. As $N$ grows, the interpolation manifold grows, and the minimum-norm interpolant uses smaller per-feature weights, reducing variance. The bias does not increase fast enough to offset the variance reduction, so test risk descends.
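The claim that gradient descent from zero initialization lands on the pseudoinverse solution can be checked directly (a sketch for the linear least-squares case):

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 20, 50                                  # over-parameterized: N > n
Z = rng.standard_normal((n, N)) / np.sqrt(N)
y = rng.standard_normal(n)

beta_minnorm = np.linalg.pinv(Z) @ y           # Moore-Penrose min-l2-norm interpolant

beta, lr = np.zeros(N), 0.1                    # gradient descent from zero init
for _ in range(20000):
    beta -= lr * Z.T @ (Z @ beta - y)          # gradient of 0.5 * ||Z beta - y||^2

print(np.allclose(beta, beta_minnorm, atol=1e-8))   # GD recovers the min-norm solution
print(np.allclose(Z @ beta, y, atol=1e-8))          # and it interpolates
```

The iterates never leave the row space of $Z$ when started at zero, which is exactly why the limit is the minimum-norm interpolant rather than an arbitrary one.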
This argument is made rigorous for random feature regression by Mei and Montanari (2019) and Hastie et al. (2019), with explicit asymptotic risk expressions in the proportional-asymptotics regime where $N/n \to \gamma$ is held fixed (along with $d/n$ for random features). The closed-form curve matches the experimental observation in the original paper.
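For the simplest setting, ridgeless least squares with isotropic Gaussian covariates, the limit takes a closed form. Sketched here from the isotropic case of Hastie et al., with signal strength $r^2 = \|\beta\|^2$, noise variance $\sigma^2$, and aspect ratio $\gamma = N/n$ (see their paper for the precise conditions):

```latex
R(\gamma) =
\begin{cases}
\sigma^2 \, \dfrac{\gamma}{1-\gamma}, & \gamma < 1 \quad \text{(under-parameterized)},\\[6pt]
r^2\!\left(1 - \dfrac{1}{\gamma}\right) + \dfrac{\sigma^2}{\gamma - 1}, & \gamma > 1 \quad \text{(over-parameterized)}.
\end{cases}
```

The risk diverges as $\gamma \to 1$ from either side (the spike at the interpolation threshold) and descends again for $\gamma > 1$, reproducing the double-descent shape analytically.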
What "capacity" means here
The paper uses several capacity proxies: the number of trees for random forests, the number of features for random feature models, and the width for neural networks. The key property is that the proxy crosses the interpolation threshold within the experimentally accessible range. VC dimension and Rademacher complexity are explicitly not the right axis for this picture: test risk plotted against those quantities does not trace a clean curve, and they do not mark a sharp interpolation threshold.
Connections to TheoremPath Topics
- Double descent — the standard treatment with model-wise, sample-wise, and epoch-wise variants.
- Bias-variance tradeoff — the classical decomposition the paper revises for the over-parameterized regime.
- Benign overfitting — the closely related phenomenon (zero training error, low test error) studied via min-norm interpolation in linear regression.
- Implicit bias and modern generalization — the implicit min-norm bias of gradient descent that selects the descending side of the curve.
- Uniform convergence — why classical capacity bounds become vacuous in this regime.
- Rademacher complexity — why distribution-free Rademacher bounds cannot capture the second descent.
- PAC learning framework — the framework against which the curve breaks.
Why It Matters Now
The paper is the canonical empirical evidence that classical capacity-based generalization theory does not describe the regime modern deep learning actually operates in. Combined with Zhang et al. (2016)'s memorization result, it forced the field to take seriously the question of why over-parameterized networks generalize, which produced the implicit-bias literature, the neural-tangent-kernel line of work, and the benign-overfitting analyses for linear and kernel regression.
For practice, the lesson is concrete: do not trust a sweet-spot story. If you have noticeable test-error degradation as model size grows, you may be at the interpolation threshold; the next step is to add capacity, not subtract it. This is now the orthodoxy in deep learning, but in 2018 it was an empirical curiosity.
The follow-on work matters more than the original paper for theory. Nakkiran et al. (2019) extended the picture to deep networks and to sample size and training time. Mei and Montanari (2019) gave the closed-form risk for random feature regression. Frei, Vardi, Bartlett, and Srebro (2023) showed that benign overfitting requires specific data distributions; not all over-parameterization is benign. The 2018 paper named the phenomenon and made it visible; the 2019-2023 papers explained it.
The paper is also a reminder that classical learning theory (VC, Rademacher) was not wrong — it was answering a different question. Distribution-free worst-case bounds are sharp for adversarial data, but real data has structure that makes the min-norm interpolant generalize. The interesting theory question shifted from "can we bound the worst case" to "what does gradient descent actually find."
References
Canonical:
- Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). "Reconciling modern machine learning practice and the bias-variance trade-off." PNAS 116(32). arXiv:1812.11118.
Direct precursors:
- Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). "Understanding deep learning requires rethinking generalization." ICLR. arXiv:1611.03530. Random labels are perfectly fit; classical bounds vacuous.
- Geman, S., Bienenstock, E., & Doursat, R. (1992). "Neural Networks and the Bias/Variance Dilemma." Neural Computation 4(1). The classical U-curve formulation.
Theoretical analyses:
- Hastie, T., Montanari, A., Rosset, S., & Tibshirani, R. J. (2022). "Surprises in High-Dimensional Ridgeless Least Squares Interpolation." Annals of Statistics 50(2). arXiv:1903.08560. Closed-form asymptotic risk for min-norm interpolation.
- Mei, S., & Montanari, A. (2022). "The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve." Communications on Pure and Applied Mathematics 75(4). arXiv:1908.05355. The first analytic derivation of the double-descent shape.
- Bartlett, P. L., Long, P. M., Lugosi, G., & Tsigler, A. (2020). "Benign overfitting in linear regression." PNAS 117(48). arXiv:1906.11300. Conditions under which interpolation is harmless.
Follow-on work:
- Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2020). "Deep Double Descent: Where Bigger Models and More Data Hurt." ICLR. arXiv:1912.02292. Extends to deep nets, training time, dataset size.
- Frei, S., Vardi, G., Bartlett, P., & Srebro, N. (2023). "Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from KKT Conditions." arXiv:2303.01462.
Standard textbook:
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. Chapter 7 — the classical bias-variance picture this paper revises.
- Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of Machine Learning (2nd ed.). MIT Press. Chapter 3 — VC and Rademacher framework.
Last reviewed: May 5, 2026