
Modern Generalization

Double Descent

Test error follows a double-descent curve: it decreases, peaks at the interpolation threshold, then decreases again in the overparameterized regime, defying classical bias-variance intuition.


Why This Matters

[Figure: test error vs. model complexity (parameters / samples), contrasting the classical bias-variance U-curve with the modern double-descent curve; the peak sits at the interpolation threshold between the underparameterized and overparameterized regimes.]

For decades, the conventional wisdom was simple: as model complexity increases beyond a certain point, test error goes up (overfitting). This is the textbook U-shaped bias-variance tradeoff curve. Double descent breaks this picture. In modern overparameterized models, test error can decrease again past the interpolation threshold, creating a second descent. The phenomenon was observed by Belkin, Hsu, Mandal (2018) in "Overfitting or Perfect Fitting?" and by Spigler, Geiger, d'Ascoli, Sarao Mannelli, Biroli, Wyart (2018) as a jamming transition. Belkin et al. (2019) coined the term "double descent" and documented it systematically across linear models, random features, decision trees, and neural networks. It has since been given substantial theoretical footing via random matrix theory and benign overfitting results (Hastie et al. 2022, Bartlett et al. 2020, Mei & Montanari 2022), but the tension between overparameterization and classical uniform-convergence bounds remains an open question for general architectures.

Mental Model

Imagine increasing the number of parameters $p$ in your model while keeping the training set size $n$ fixed:

  1. Underparameterized regime ($p \ll n$): Classical behavior. More parameters reduce bias, test error decreases.
  2. Interpolation threshold ($p \approx n$): The model has just enough capacity to perfectly fit (interpolate) the training data, but it does so in the most fragile way possible: any noise in the data gets amplified. Test error peaks.
  3. Overparameterized regime ($p \gg n$): Many models interpolate the data. Gradient descent from zero initialization stays in the row space of $X$; the unique interpolator in $\text{row}(X)$ is the minimum Euclidean norm solution, which turns out to be surprisingly smooth. Test error decreases again (a minimal numerical sketch of this follows the list).
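
A minimal numerical sketch of point 3, assuming standard Gaussian features and an arbitrary target; the sizes, seed, and step size below are illustrative choices, not from the text. It checks that gradient descent on the least-squares loss from zero initialization reaches the same interpolator as the closed-form minimum-norm solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                                   # overparameterized: p >> n
X = rng.standard_normal((n, p)) / np.sqrt(p)
y = rng.standard_normal(n)

# Closed-form minimum Euclidean norm interpolator: w = X^T (X X^T)^{-1} y
w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)

# Gradient descent on (1/2) ||Xw - y||^2 starting from w = 0
w = np.zeros(p)
step = 0.5 / np.linalg.norm(X, 2) ** 2           # below 1/L, so the iteration is stable
for _ in range(20_000):
    w -= step * X.T @ (X @ w - y)

print("training residual:", np.linalg.norm(X @ w - y))               # ~ 0: interpolation
print("gap to min-norm solution:", np.linalg.norm(w - w_min_norm))   # ~ 0: same interpolator
```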

The Double Descent Phenomenon

Definition

Model-wise Double Descent

Fix a training set of size $n$. Plot test error as a function of model complexity (e.g., number of parameters $p$). The curve exhibits:

  1. A classical U-shape for $p < n$
  2. A peak at $p \approx n$ (the interpolation threshold)
  3. A second descent for $p > n$, where test error can drop below the minimum achieved in the classical regime

This was demonstrated empirically by Belkin et al. (2019) across linear models, random features, decision trees, and neural networks.

Definition

Epoch-wise Double Descent

Fix the model architecture. Plot test error as a function of training epochs. The curve can exhibit a similar pattern: test error first decreases, then increases (as the model begins to interpolate noisy labels), then decreases again with further training. This is especially pronounced in large models trained with label noise.

Definition

Interpolation Threshold

The interpolation threshold is the critical model complexity at which the model first achieves zero training error. For linear regression with $p$ features and $n$ samples, this occurs at $p = n$. At this point, the system of equations $Xw = y$ transitions from overdetermined (no exact solution) to underdetermined (infinitely many solutions).
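
A small illustration of this definition under i.i.d. Gaussian features (the sizes are arbitrary): below $p = n$ the least-squares fit leaves a nonzero training residual, while at and beyond $p = n$ the data can be interpolated exactly; for $p > n$, `numpy.linalg.lstsq` returns the minimum-norm interpolator among the infinitely many solutions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
y = rng.standard_normal(n)

for p in (10, 30, 100):
    X = rng.standard_normal((n, p))
    w, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares / min-norm solution
    print(f"p = {p:3d}   training residual = {np.linalg.norm(X @ w - y):.2e}")
# p < n: overdetermined, residual > 0 (no exact solution)
# p >= n: residual ~ 0 (interpolation); for p > n there are infinitely many interpolators
```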

The Peak Explained

Theorem

Divergence at the Interpolation Threshold

Statement

Consider linear regression with $p$ features and $n$ samples where $X$ has i.i.d. entries, noise variance $\sigma^2 > 0$, and $\gamma = p/n$. The variance term of the ridgeless least-squares risk is proportional to $\sigma^2 / |\gamma - 1|$, which diverges as $\gamma \to 1$ from either side:

R(\hat{w}) \to \infty \quad \text{as} \quad p/n \to 1, \quad \sigma^2 > 0

At exactly $\gamma = 1$ the design matrix is square and typically extremely ill-conditioned. In the noiseless case ($\sigma^2 = 0$) the variance term vanishes and no peak appears. The divergence is driven by the smallest nonzero singular value of $X$ collapsing at the transition point.

Intuition

At $p = n$, the design matrix $X$ is square. Its smallest singular value approaches zero, meaning the system is nearly rank-deficient. The minimum-norm solution must "stretch" enormously to interpolate the data through this near-singular system, amplifying noise into wild predictions.

Proof Sketch

Fix $c = p/n$. For i.i.d. Gaussian entries (Marchenko-Pastur; Bai & Silverstein 2010, Ch. 3; Vershynin, HDP 2018, §4.7):

  • If $c < 1$: $X^\top X / n$ has full rank $p$ almost surely and its smallest eigenvalue converges to $(1 - \sqrt{c})^2$, which vanishes as $c \to 1$.
  • If $c > 1$: $X^\top X / n$ is $p \times p$ of rank at most $n$, with exactly $p - n$ zero eigenvalues. Its smallest nonzero eigenvalue (equivalently, the smallest eigenvalue of $XX^\top / n$) converges to $(\sqrt{c} - 1)^2$; with the alternative normalization $XX^\top / p$ the limit is $(1 - \sqrt{1/c})^2$. Both vanish as $c \to 1$.
  • At $c = 1$: the bulk edge collapses to $0$.

The variance of the min-norm estimator involves an inverse-eigenvalue sum against this spectrum, which diverges as the edge approaches zero from either side.
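
A numerical check of this sketch, assuming i.i.d. standard Gaussian entries; $n$ and the grid of aspect ratios are arbitrary. The smallest nonzero eigenvalue of $X^\top X / n$ should track the Marchenko-Pastur edge $(1 - \sqrt{c})^2$ and collapse toward zero as $c \to 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

for c in (0.5, 0.9, 1.0, 1.1, 2.0):
    p = int(round(c * n))
    X = rng.standard_normal((n, p))
    s = np.linalg.svd(X, compute_uv=False)       # singular values of X (all nonzero a.s.)
    smallest_nonzero_eig = s.min() ** 2 / n      # smallest nonzero eigenvalue of X^T X / n
    mp_edge = (1 - np.sqrt(c)) ** 2              # Marchenko-Pastur lower-edge prediction
    print(f"c = {c:.1f}   empirical {smallest_nonzero_eig:.4f}   MP edge {mp_edge:.4f}")
```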

Why It Matters

This gives a precise mathematical explanation for the interpolation peak. It is not just an empirical curiosity. It is a consequence of spectral properties of random matrices. The peak is sharpest when there is label noise; with zero noise, it can disappear.

Failure Mode

This analysis assumes isotropic (structureless) features. Real data has structure (low effective rank, clustered features), which can shift, broaden, or even eliminate the peak. The theory gives qualitative but not always quantitative predictions for structured data.

Min-Norm Interpolation and the Second Descent

Theorem

Min-Norm Interpolation in Overparameterized Regression

Statement

For overparameterized linear regression ($p > n$, $X$ of full row rank), gradient descent on the least-squares loss initialized at zero converges to the minimum Euclidean norm interpolator:

\hat{w}_{\text{min-norm}} = X^\top (X X^\top)^{-1} y

For isotropic features with $p/n \to \gamma > 1$ and unit-norm signal, the Hastie-Montanari-Rosset-Tibshirani (2022, Thm 1) asymptotic risk is

R(\hat{w}_{\text{min-norm}}) = \frac{\sigma^2}{\gamma - 1} + \|\beta^*\|^2 \left(1 - \frac{1}{\gamma}\right)

The first term is variance; the second is bias. As $\gamma \to \infty$, the variance vanishes and the bias term approaches $\|\beta^*\|^2$, so the total risk converges to $\|\beta^*\|^2$, the risk of the trivial estimator $\hat{w} = 0$. At any finite $\gamma$, both terms move with $\gamma$; the risk does not sit at a fixed bias floor.
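
A hedged Monte Carlo check of the stated formula, assuming isotropic Gaussian features, a unit-norm signal, and Gaussian noise; the sample size, noise level, aspect ratio, and seed are illustrative. Risk here means the excess risk $\mathbb{E}\|\hat{w} - \beta^*\|^2$ (with $\Sigma = I$), which excludes the irreducible noise floor and matches the formula above.

```python
import numpy as np

rng = np.random.default_rng(3)
n, gamma, sigma2, n_trials = 200, 4.0, 0.5, 50
p = int(gamma * n)

risks = []
for _ in range(n_trials):
    beta = rng.standard_normal(p)
    beta /= np.linalg.norm(beta)                   # unit-norm signal
    X = rng.standard_normal((n, p))                # isotropic features, Sigma = I
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    w = X.T @ np.linalg.solve(X @ X.T, y)          # min-norm interpolator
    risks.append(np.sum((w - beta) ** 2))          # excess risk ||w - beta*||^2

theory = sigma2 / (gamma - 1) + 1.0 * (1 - 1 / gamma)
print(f"Monte Carlo excess risk ~ {np.mean(risks):.3f}   HMRT formula ~ {theory:.3f}")
```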

Intuition

When there are many more parameters than data points, the model has many ways to interpolate. The minimum-norm solution is the simplest (smallest weight vector), which acts as an implicit regularizer. More parameters means the minimum-norm solution can spread the "work" of fitting the data across more dimensions, requiring less extreme weights.

Proof Sketch

Decompose the risk into bias and variance. The bias term equals $\|\beta^*\|^2 (1 - n/p)$, the squared norm of the component of $\beta^*$ outside the row space of $X$. The variance term equals $\sigma^2 \cdot n/(p-n)$ for the min-norm estimator (from Marchenko-Pastur theory). As $p \to \infty$ with $n$ fixed, the variance term $\to 0$.

Why It Matters

This explains the second descent: past the interpolation threshold, adding more parameters reduces the variance term because the min-L2 bias spreads coordinate-wise error across more dimensions. The profile is specific to the min-L2 implicit bias. A different implicit bias (e.g. min-L1 from a different algorithm or loss) gives a different variance profile; see Hastie et al. 2022 Thm 2 and the Gunasekar et al. 2018 characterization of implicit bias by algorithm and geometry.

Failure Mode

The min-L2 characterization fails outside the stated assumptions:

  • Logistic regression on separable data: gradient descent converges in direction to the max-L2-margin solution, not to a min-norm interpolator (Soudry, Hoffer, Nacson, Gunasekar, Srebro 2018, arXiv:1710.10345).
  • Finite-width neural networks: no clean min-norm characterization outside the NTK regime; implicit bias depends on architecture, parameterization, and step size.
  • Other algorithms and geometries: Gunasekar, Lee, Soudry, Srebro (2018) "Characterizing Implicit Bias in Terms of Optimization Geometry" enumerates which settings give min-L2, min-L1, or other implicit regularizers.
  • Structured ground truth: even within the linear-regression setting, sparse or anisotropic $\beta^*$ can be better recovered by explicit regularization (ridge, lasso) than by the min-norm interpolator, especially near the threshold.

Benign Overfitting: Effective Rank Conditions

Bartlett, Long, Lugosi, Tsigler (2020), "Benign Overfitting in Linear Regression" (PNAS 117(48):30063-30070), identify when min-norm interpolation of noisy data is consistent. For a Gaussian linear model with covariance $\Sigma$, define the effective ranks

r_k(\Sigma) = \frac{\sum_{i > k} \lambda_i}{\lambda_{k+1}}, \qquad R_k(\Sigma) = \frac{\left(\sum_{i > k} \lambda_i\right)^2}{\sum_{i > k} \lambda_i^2}

where $\lambda_1 \ge \lambda_2 \ge \dots$ are the eigenvalues of $\Sigma$. Their main theorem gives matching upper and lower bounds on the excess risk of the min-norm interpolator in terms of $r_k$ and $R_k$. Consistency requires a fast-decay head plus a slow-decay tail: a small number of directions carrying signal, and a large effective-rank tail that absorbs noise. With exponentially decaying or uniformly heavy spectra the bound fails; benign overfitting is a spectrum-specific phenomenon.
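
A short sketch of the two effective-rank quantities on two illustrative spectra (a slowly decaying $1/i$ tail versus an exponentially decaying one); the decay rates and cut-off $k$ are arbitrary choices, not from the paper, and are only meant to show how the tail shape drives $r_k$ and $R_k$.

```python
import numpy as np

def effective_ranks(eigs, k):
    """r_k and R_k from Bartlett et al. (2020), for eigs sorted in decreasing order."""
    tail = eigs[k:]                               # lambda_{k+1}, lambda_{k+2}, ...
    r_k = tail.sum() / tail[0]
    R_k = tail.sum() ** 2 / np.sum(tail ** 2)
    return r_k, R_k

p, k = 10_000, 10
i = np.arange(1, p + 1, dtype=float)

for name, eigs in (("1/i spectrum (heavy tail)", 1.0 / i),
                   ("2^-i spectrum (fast decay)", 2.0 ** -i)):
    r_k, R_k = effective_ranks(eigs, k)
    print(f"{name}:  r_k = {r_k:.1f}   R_k = {R_k:.1f}")
# The heavy-tailed spectrum has large effective ranks (noise can be absorbed);
# the exponentially decaying spectrum has tiny effective ranks, so the consistency
# conditions fail.
```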

Why Classical Bias-Variance Fails

The classical bias-variance tradeoff predicts a U-shaped curve:

\text{Test error} = \text{Bias}^2 + \text{Variance} + \text{Noise}

As complexity increases, bias decreases monotonically and variance increases monotonically. Their sum is U-shaped with a unique minimum.

This fails in the overparameterized regime because:

  1. Variance is not monotonically increasing. Past the interpolation threshold, variance decreases with more parameters, due to the interaction of the min-L2 implicit bias with high-dimensional feature geometry. Under a different implicit bias the profile changes (HMRT 2022 Thm 2).
  2. Bias is no longer monotonically decreasing. For isotropic features the bias term $\|\beta^*\|^2 (1 - 1/\gamma)$ increases with $\gamma$ in the overparameterized regime; the total risk can still decrease there because the drop in variance dominates.
  3. The interpolation threshold is a phase transition, not a smooth tradeoff. The classical framework has no mechanism for the peak.

Canonical Examples

Example

Double descent in linear regression

Consider $y = x^\top w^* + \epsilon$ with $w^* \in \mathbb{R}^p$, $n = 100$, and $\epsilon \sim \mathcal{N}(0, 0.1)$. Plot test MSE vs. $p$:

  • For $p < 100$: test MSE decreases as the model captures more of the signal
  • At $p = 100$: test MSE spikes (the interpolation peak)
  • For $p > 100$: test MSE decreases, eventually approaching the noise floor

With $p = 1000$ (10x overparameterized), the test MSE can be lower than that of the best $p < 100$ model.
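
A minimal simulation in the spirit of this example, under assumptions of my own that the text leaves open: the data come from a fixed pool of $D = 1000$ Gaussian features with a unit-norm signal spread across all of them, and the model fits only the first $p$ features by (min-norm) least squares. With this setup the interpolation peak near $p = n$ and the second descent are clearly visible; the depth of the initial descent, and whether the overparameterized end beats the best underparameterized model, depend on how the signal is distributed across features.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_test, D, sigma2 = 100, 2000, 1000, 0.1      # matches n = 100 and noise variance 0.1

beta = rng.standard_normal(D)
beta /= np.linalg.norm(beta)                     # unit-norm signal over all D features
X_tr = rng.standard_normal((n, D))
X_te = rng.standard_normal((n_test, D))
y_tr = X_tr @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
y_te = X_te @ beta + np.sqrt(sigma2) * rng.standard_normal(n_test)

for p in (10, 50, 90, 100, 110, 200, 500, 1000):
    # Fit using only the first p features; lstsq returns the min-norm solution when p > n.
    w, *_ = np.linalg.lstsq(X_tr[:, :p], y_tr, rcond=None)
    mse = np.mean((X_te[:, :p] @ w - y_te) ** 2)
    print(f"p = {p:5d}   test MSE = {mse:.2f}")
# Expected pattern: moderate MSE for small p, a spike near p = n = 100, then a second
# descent for p >> n toward the lowest values in the sweep.
```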

Example

Epoch-wise double descent in ResNets

Nakkiran et al. (2020, arXiv:1912.02292) showed that ResNet-18 trained on CIFAR-10 with 20% label noise exhibits epoch-wise double descent: test error first decreases, then increases as the model begins memorizing noisy labels, then decreases again with further training. In one configuration from the paper these three phases appear around epochs 50, 150, and 500, but those specific values are indicative only. Exact peak locations depend on learning rate schedule, batch size, width multiplier, and optimizer. The second descent requires sufficient model capacity to enter the interpolation regime.

Common Confusions

Watch Out

Double descent does not mean you should always use huge models

Double descent shows that overparameterization can generalize well, but properly tuned regularization (e.g., optimal ridge penalty) often outperforms the second descent. The practical lesson is not "bigger is always better" but rather "the interpolation regime is not catastrophic."
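
A quick hedged comparison illustrating this point, assuming the same isotropic Gaussian linear model as in the earlier sketches with an aspect ratio just past the threshold; the grid of ridge penalties and all constants are arbitrary. A reasonably tuned ridge penalty should come in well below the ridgeless (min-norm) interpolator near the peak.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma2 = 200, 240, 0.5                     # gamma = 1.2, just past the threshold
beta = rng.standard_normal(p)
beta /= np.linalg.norm(beta)
X = rng.standard_normal((n, p))
y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)

def ridge(lmbda):
    # Ridge estimator (X^T X + lambda I)^{-1} X^T y; lambda -> 0 recovers the min-norm fit.
    return np.linalg.solve(X.T @ X + lmbda * np.eye(p), X.T @ y)

w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)
print(f"min-norm excess risk: {np.sum((w_min_norm - beta) ** 2):.3f}")
for lmbda in (0.1, 1.0, 10.0, 100.0):
    w = ridge(lmbda)
    print(f"ridge lambda = {lmbda:6.1f}   excess risk: {np.sum((w - beta) ** 2):.3f}")
```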

Watch Out

The peak requires noise

With zero label noise, there may be no interpolation peak. The double descent phenomenon is most dramatic when there is label noise, because the model is forced to fit the noise exactly at the interpolation threshold. Clean data can still show double descent, but the peak is much less pronounced.

Watch Out

Double descent is not unique to neural networks

The phenomenon appears in linear regression, random features, kernel methods, decision trees, and boosting. It is a property of overparameterized interpolation, not of any specific architecture.

Summary

  • Test error can follow a double descent curve: decrease, peak at the interpolation threshold ($p \approx n$), then decrease again
  • The peak is explained by random matrix theory: the condition number diverges at $p = n$
  • In the overparameterized regime, gradient descent finds the min-norm interpolator, which generalizes well
  • Epoch-wise double descent also occurs: test error can temporarily increase then decrease with more training
  • Classical bias-variance tradeoff fails because variance is non-monotone in the overparameterized regime

Exercises

ExerciseCore

Problem

In linear regression with $n = 50$ samples and isotropic Gaussian features, you observe a sharp peak in test error at $p = 50$ parameters. You add $L_2$ regularization (ridge regression) with penalty $\lambda > 0$. What happens to the peak? Why?

ExerciseAdvanced

Problem

Consider the min-norm interpolator $\hat{w} = X^\top(XX^\top)^{-1}y$ in the regime $p = 5n$. The noise variance is $\sigma^2$. Using the formula $\text{Var} = \sigma^2 n/(p-n)$, compute the variance component of the test risk. Compare this to the variance at $p = 1.1n$ (just past the threshold).

References

Canonical:

  • Belkin, Hsu, Mandal, "Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate", NeurIPS 2018; arXiv:1806.05161. Early observation of the non-monotone generalization behavior of interpolating estimators.
  • Spigler, Geiger, d'Ascoli, Sarao Mannelli, Biroli, Wyart, "A jamming transition from under- to over-parametrization affects generalization in deep learning", Journal of Physics A (2018); arXiv:1810.09665. Independent observation of the interpolation peak as a jamming transition in physics language.
  • Belkin, Hsu, Ma, Mandal, "Reconciling modern machine-learning practice and the classical bias-variance trade-off", PNAS 116(32):15849-15854 (2019); arXiv:1812.11118. Coined the term "double descent" and documented it systematically across linear models, random features, and neural networks.
  • Nakkiran, Kaplun, Bansal, Yang, Barak, Sutskever, "Deep Double Descent: Where Bigger Models and More Data Can Hurt", ICLR 2020; arXiv:1912.02292. Model-wise, sample-wise, and epoch-wise double descent in modern deep networks.
  • Hastie, Montanari, Rosset, Tibshirani, "Surprises in High-Dimensional Ridgeless Least Squares Interpolation", Annals of Statistics 50(2):949-986 (2022); arXiv:1903.08560. Random-matrix analysis of the ridgeless min-norm interpolator; Theorems 1 and 2 give the isotropic and anisotropic asymptotic risks.

Theory of the peak and the second descent:

  • Bartlett, Long, Lugosi, Tsigler, "Benign overfitting in linear regression", PNAS 117(48):30063-30070 (2020); arXiv:1906.11300. Effective-rank conditions $r_k(\Sigma)$ and $R_k(\Sigma)$ under which min-norm interpolation of noisy data is consistent.
  • Mei, Montanari, "The generalization error of random features regression: Precise asymptotics and the double descent curve", Communications on Pure and Applied Mathematics (2022); arXiv:1908.05355. Exact asymptotics for random feature models.
  • Advani, Saxe, Sompolinsky, "High-dimensional dynamics of generalization error in neural networks", Neural Networks 132:428-446 (2020); arXiv:1710.03667. Early analytic account of non-monotone generalization error.

Implicit bias and Marchenko-Pastur background:

  • Soudry, Hoffer, Nacson, Gunasekar, Srebro, "The Implicit Bias of Gradient Descent on Separable Data", JMLR 19(70):1-57 (2018); arXiv:1710.10345. Logistic regression on separable data converges in direction to the max-L2-margin solution, not a min-norm interpolator.
  • Gunasekar, Lee, Soudry, Srebro, "Characterizing Implicit Bias in Terms of Optimization Geometry", ICML 2018; arXiv:1802.08246. Which algorithm-loss-geometry triples yield min-L2, min-L1, or other implicit regularizers.
  • Bai, Silverstein, Spectral Analysis of Large Dimensional Random Matrices (Springer, 2010), Chapter 3. Marchenko-Pastur law and edge behavior for both $c < 1$ and $c > 1$.
  • Vershynin, High-Dimensional Probability, Cambridge, 2018, §4.7. Non-asymptotic bounds on extreme singular values of random matrices.

Next Topics

  • Benign overfitting: when interpolating noisy data is provably harmless
  • Implicit bias: why gradient descent finds min-norm solutions

Last reviewed: April 2026
