Modern Generalization
Double Descent
Test error follows a double-descent curve: it decreases, peaks at the interpolation threshold, then decreases again in the overparameterized regime, defying classical bias-variance intuition.
Why This Matters
For decades, the conventional wisdom was simple: as model complexity increases beyond a certain point, test error goes up (overfitting). This is the textbook U-shaped bias-variance tradeoff curve. Double descent breaks this picture. In modern overparameterized models, test error can decrease again past the interpolation threshold, creating a second descent. The phenomenon was observed by Belkin, Hsu, Mandal (2018) in "Overfitting or Perfect Fitting?" and by Spigler, Geiger, d'Ascoli, Sarao Mannelli, Biroli, Wyart (2018) as a jamming transition. Belkin et al. (2019) coined the term "double descent" and documented it systematically across linear models, random features, decision trees, and neural networks. It has since been given substantial theoretical footing via random matrix theory and benign overfitting results (Hastie et al. 2022, Bartlett et al. 2020, Mei & Montanari 2022), but the tension between overparameterization and classical uniform-convergence bounds remains an open question for general architectures.
Mental Model
Imagine increasing the number of parameters in your model while keeping the training set size fixed:
- Underparameterized regime (p < n): Classical behavior. More parameters reduce bias, test error decreases.
- Interpolation threshold (p ≈ n): The model has just enough capacity to perfectly fit (interpolate) the training data, but it does so in the most fragile way possible: any noise in the data gets amplified. Test error peaks.
- Overparameterized regime (p > n): Many models interpolate the data. Gradient descent from zero initialization stays in the row space of X; the unique interpolator in that row space is the minimum Euclidean norm solution, which turns out to be surprisingly smooth. Test error decreases again.
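The overparameterized bullet can be checked numerically. The sketch below (illustrative sizes and step size, assuming NumPy) runs gradient descent from zero on an overparameterized least-squares problem and compares the result with the pseudo-inverse (minimum-norm) solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                           # overparameterized: more features than samples
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Gradient descent on the least-squares loss, starting from zero.
# The iterate stays in the row space of X throughout.
beta = np.zeros(p)
lr = 0.005                              # small enough that lr * lambda_max(X^T X) < 2
for _ in range(20000):
    beta -= lr * X.T @ (X @ beta - y)

beta_minnorm = np.linalg.pinv(X) @ y    # minimum Euclidean norm interpolator

fit_gap = np.linalg.norm(X @ beta - y)          # how well GD interpolates
sol_gap = np.linalg.norm(beta - beta_minnorm)   # distance to the min-norm solution
```

Both gaps come out numerically zero: among the infinitely many interpolators, gradient descent from zero initialization lands on the minimum-norm one.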
The Double Descent Phenomenon
Model-wise Double Descent
Fix a training set of size n. Plot test error as a function of model complexity (e.g., the number of parameters p). The curve exhibits:
- A classical U-shape for p < n
- A peak at p ≈ n (the interpolation threshold)
- A second descent for p > n, where test error can drop below the minimum achieved in the classical regime
This was demonstrated empirically by Belkin et al. (2019) across linear models, random features, decision trees, and neural networks.
Epoch-wise Double Descent
Fix the model architecture. Plot test error as a function of training epochs. The curve can exhibit a similar pattern: test error first decreases, then increases (as the model begins to interpolate noisy labels), then decreases again with further training. This is especially pronounced in large models trained with label noise.
Interpolation Threshold
The interpolation threshold is the critical model complexity at which the model first achieves zero training error. For linear regression with p features and n samples, this occurs at p = n. At this point, the system of equations transitions from overdetermined (no exact solution in general) to underdetermined (infinitely many solutions).
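The overdetermined-to-underdetermined transition can be observed directly with a least-squares solver (a minimal sketch assuming NumPy; the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
y = rng.standard_normal(n)

def train_mse(p):
    """Training MSE of least squares with p random Gaussian features."""
    X = rng.standard_normal((n, p))
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # min-norm solution when p > n
    return np.mean((X @ beta - y) ** 2)

mse_under = train_mse(n // 2)   # p < n: overdetermined, nonzero residual
mse_at = train_mse(n)           # p = n: first exact fit (square, invertible a.s.)
mse_over = train_mse(2 * n)     # p > n: underdetermined, exact fit among many
```

The training error drops to (numerical) zero exactly once p reaches n and stays there.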
The Peak Explained
Divergence at the Interpolation Threshold
Statement
Consider linear regression with p features and n samples, where the design matrix X has i.i.d. standard Gaussian entries, the noise variance is σ² > 0, and γ = p/n. The variance term of the ridgeless least-squares risk is proportional to σ²γ/(1 − γ) for γ < 1 and to σ²/(γ − 1) for γ > 1, and it diverges as γ → 1 from either side.
At exactly γ = 1 the min-norm estimator is not uniquely defined. In the noiseless case (σ² = 0) the variance term vanishes and no peak appears. The divergence is driven by the smallest nonzero singular value of X collapsing at the transition point.
Intuition
At p = n, the design matrix X is square. Its smallest singular value approaches zero, meaning the system is nearly rank-deficient. The minimum-norm solution must "stretch" enormously to interpolate the data through this near-singular system, amplifying noise into wild predictions.
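This collapse of the smallest singular value is easy to see empirically (a sketch assuming NumPy; the dimension and trial count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

def smallest_sv(p, trials=20):
    """Smallest nonzero singular value of an n x p Gaussian design, scaled by sqrt(n)."""
    vals = []
    for _ in range(trials):
        s = np.linalg.svd(rng.standard_normal((n, p)), compute_uv=False)
        vals.append(s[min(n, p) - 1] / np.sqrt(n))
    return np.mean(vals)

sv_half = smallest_sv(n // 2)   # p/n = 0.5: edge near 1 - sqrt(0.5) ~ 0.29
sv_square = smallest_sv(n)      # p/n = 1:  edge collapses toward zero
```

At aspect ratio 1/2 the smallest singular value sits at a healthy distance from zero; for the square design it is an order of magnitude smaller, which is exactly what blows up the variance of the interpolator.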
Proof Sketch
Fix γ = p/n. For i.i.d. Gaussian entries (Marchenko-Pastur; Bai-Silverstein 2010, Ch. 3; Vershynin, HDP 2018, §4.7):
- If γ < 1: XᵀX/n has full rank almost surely and its smallest eigenvalue converges to (1 − √γ)², which vanishes as γ → 1.
- If γ > 1: XᵀX is of rank at most n, with p − n exactly zero eigenvalues. The smallest nonzero eigenvalue, equivalently the smallest eigenvalue of XXᵀ/p, converges to (1 − √(1/γ))², which vanishes as γ → 1.
- At γ = 1: the lower bulk edge collapses to 0.
The variance of the min-norm estimator involves an inverse-eigenvalue sum against this spectrum, which diverges as the edge approaches zero from either side.
Why It Matters
This gives a precise mathematical explanation for the interpolation peak. It is not just an empirical curiosity. It is a consequence of spectral properties of random matrices. The peak is sharpest when there is label noise; with zero noise, it can disappear.
Failure Mode
This analysis assumes isotropic (structureless) features. Real data has structure (low effective rank, clustered features), which can shift, broaden, or even eliminate the peak. The theory gives qualitative but not always quantitative predictions for structured data.
Min-Norm Interpolation and the Second Descent
Min-Norm Interpolation in Overparameterized Regression
Statement
Consider least squares with p > n and a design matrix X of full row rank. Gradient descent on the least-squares loss, initialized at zero, converges to the minimum Euclidean norm interpolator β̂ = Xᵀ(XXᵀ)⁻¹y = X⁺y.
For isotropic features with γ = p/n > 1 and a unit-norm signal (r² = ‖β*‖² = 1), the Hastie-Montanari-Rosset-Tibshirani (2022, Thm 1) asymptotic risk is
R(γ) = σ²/(γ − 1) + r²(1 − 1/γ).
The first term is variance; the second is bias. As γ → ∞, the variance vanishes and the bias term approaches r², so the total risk converges to r² (the null-signal risk). The risk does not converge to a fixed bias; both terms move with γ.
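To make the bias-variance split concrete, the HMRT Thm 1 expressions can be tabulated directly (a sketch; σ² = 0.25 and r² = 1 are illustrative values, not from the text):

```python
def hmrt_risk(gamma, sigma2=0.25, r2=1.0):
    """Asymptotic ridgeless risk for isotropic features (HMRT 2022, Thm 1).

    gamma = p/n. Underparameterized: pure variance sigma2 * gamma / (1 - gamma).
    Overparameterized: variance sigma2 / (gamma - 1) plus bias r2 * (1 - 1/gamma).
    """
    if gamma < 1:
        return sigma2 * gamma / (1 - gamma)
    return sigma2 / (gamma - 1) + r2 * (1 - 1 / gamma)

# Risk explodes near gamma = 1 and settles toward the null risk r2 as gamma grows.
risk_near_peak = hmrt_risk(1.1)   # variance-dominated: 0.25/0.1 + small bias
risk_moderate = hmrt_risk(2.0)    # 0.25/1 + 1 * (1 - 1/2) = 0.75
risk_huge = hmrt_risk(100.0)      # close to the null risk r2 = 1
```

Note that with these illustrative constants the risk has an interior minimum and then creeps back up toward r², illustrating the point above that both terms move with γ.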
Intuition
When there are many more parameters than data points, the model has many ways to interpolate. The minimum-norm solution is the simplest (smallest weight vector), which acts as an implicit regularizer. More parameters means the minimum-norm solution can spread the "work" of fitting the data across more dimensions, requiring less extreme weights.
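The "spreading" intuition shows up as a shrinking norm of the min-norm interpolator as p grows (a sketch assuming NumPy; the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
y = rng.standard_normal(n)

def mean_minnorm(p, trials=20):
    """Average Euclidean norm of the min-norm interpolator over random designs."""
    norms = []
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        norms.append(np.linalg.norm(np.linalg.pinv(X) @ y))
    return np.mean(norms)

# With more features the same fit is spread over more coordinates:
# ||beta||^2 is roughly ||y||^2 / p for isotropic Gaussian features.
norm_2n = mean_minnorm(2 * n)
norm_20n = mean_minnorm(20 * n)
```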
Proof Sketch
Decompose the risk into bias and variance. The variance term equals σ²/(γ − 1) for the min-norm estimator (from Marchenko-Pastur theory). As p → ∞ with n fixed, γ → ∞ and the variance term tends to 0.
Why It Matters
This explains the second descent: past the interpolation threshold, adding more parameters reduces the variance term because the min-L2 bias spreads coordinate-wise error across more dimensions. The σ²/(γ − 1) profile is specific to the min-L2 implicit bias. A different implicit bias (e.g. min-L1 from a different algorithm or loss) gives a different variance profile; see Hastie et al. 2022 Thm 2 and the Gunasekar et al. 2018 characterization of implicit bias by algorithm and geometry.
Failure Mode
The min-L2 characterization fails outside the stated assumptions:
- Logistic regression on separable data: gradient descent converges in direction to the max-L2-margin solution, not to a min-norm interpolator (Soudry, Hoffer, Nacson, Gunasekar, Srebro 2018, arXiv:1710.10345).
- Finite-width neural networks: no clean min-norm characterization outside the NTK regime; implicit bias depends on architecture, parameterization, and step size.
- Other algorithms and geometries: Gunasekar, Lee, Soudry, Srebro (2018) "Characterizing Implicit Bias in Terms of Optimization Geometry" enumerates which settings give min-L2, min-L1, or other implicit regularizers.
- Structured ground truth: even within the linear-regression setting, a sparse or anisotropic signal can be better recovered by explicit regularization (ridge, lasso) than by the min-norm interpolator, especially near the threshold.
Benign Overfitting: Effective Rank Conditions
Bartlett, Long, Lugosi, Tsigler (2020), "Benign Overfitting in Linear Regression" (PNAS 117(48):30063-30070), identify when min-norm interpolation of noisy data is consistent. For a Gaussian linear model with covariance Σ, define the effective ranks
r_k(Σ) = (∑_{i>k} λ_i) / λ_{k+1} and R_k(Σ) = (∑_{i>k} λ_i)² / ∑_{i>k} λ_i²,
where λ_1 ≥ λ_2 ≥ ⋯ are the eigenvalues of Σ in decreasing order. Their main theorem gives matching upper and lower bounds on the excess risk of the min-norm interpolator in terms of r_k and R_k. Consistency requires a fast-decay head plus a slow-decay tail: a small number of directions carrying signal, and a large effective-rank tail that absorbs noise. With exponentially decaying or uniformly heavy spectra the bound fails; benign overfitting is a spectrum-specific phenomenon.
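The two effective ranks are simple to compute for a given spectrum. The sketch below (assuming NumPy; the two example spectra are illustrative, chosen to echo the decay conditions above) contrasts a slowly decaying tail with an exponentially decaying one:

```python
import numpy as np

def effective_ranks(eigs, k):
    """r_k and R_k of Bartlett et al. (2020) for eigenvalues in decreasing order."""
    tail = eigs[k:]                               # lambda_{k+1}, lambda_{k+2}, ...
    r_k = tail.sum() / tail[0]                    # tail sum relative to lambda_{k+1}
    R_k = tail.sum() ** 2 / (tail ** 2).sum()     # squared tail sum over tail of squares
    return r_k, R_k

i = np.arange(1, 2001, dtype=float)
slow_tail = 1.0 / (i * np.log(i + 1) ** 2)        # heavy tail: large effective rank
fast_tail = 2.0 ** (-i)                           # exponential decay: tiny effective rank
r_slow, R_slow = effective_ranks(slow_tail, k=10)
r_fast, R_fast = effective_ranks(fast_tail, k=10)
```

The exponentially decaying spectrum has effective ranks of order one regardless of dimension, while the heavy-tailed spectrum's effective ranks grow with the ambient dimension, which is the regime where the noise-absorbing tail can exist.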
Why Classical Bias-Variance Fails
The classical bias-variance tradeoff predicts a U-shaped curve:
As complexity increases, bias decreases monotonically and variance increases monotonically. Their sum is U-shaped with a unique minimum.
This fails in the overparameterized regime because:
- Variance is not monotonically increasing. Past the interpolation threshold, variance decreases with more parameters, due to the interaction of the min-L2 implicit bias with high-dimensional feature geometry. Under a different implicit bias the profile changes (HMRT 2022 Thm 2).
- Bias is non-monotone in the overparameterized regime. For isotropic features the bias term r²(1 − 1/γ) increases with γ. The total risk can still decrease because the variance decrease dominates.
- The interpolation threshold is a phase transition, not a smooth tradeoff. The classical framework has no mechanism for the peak.
Canonical Examples
Double descent in linear regression
Consider y = Xβ* + ε with isotropic Gaussian features, label-noise variance σ² > 0, and a fixed training set size n. Plot test MSE vs. p:
- For p < n: test MSE decreases as the model captures more of the signal
- At p = n: test MSE spikes (the interpolation peak)
- For p ≫ n: test MSE decreases, eventually approaching the noise floor
With p = 10n (10x overparameterized), the test MSE can be lower than that of the best underparameterized model.
Epoch-wise double descent in ResNets
Nakkiran et al. (2020, arXiv:1912.02292) showed that ResNet-18 trained on CIFAR-10 with 20% label noise exhibits epoch-wise double descent: test error first decreases, then increases as the model begins memorizing noisy labels, then decreases again with further training. In one configuration from the paper these three phases appear around epochs 50, 150, and 500, but those specific values are indicative only. Exact peak locations depend on learning rate schedule, batch size, width multiplier, and optimizer. The second descent requires sufficient model capacity to enter the interpolation regime.
Common Confusions
Double descent does not mean you should always use huge models
Double descent shows that overparameterization can generalize well, but properly tuned regularization (e.g., optimal ridge penalty) often outperforms the second descent. The practical lesson is not "bigger is always better" but rather "the interpolation regime is not catastrophic."
The peak requires noise
With zero label noise, there may be no interpolation peak. The double descent phenomenon is most dramatic when there is label noise, because the model is forced to fit the noise exactly at the interpolation threshold. Clean data can still show double descent, but the peak is much less pronounced.
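The role of noise can be isolated in a well-specified linear setting: put all the signal in the first k features, so that a model with p = n features is correctly specified, and compare the error at the threshold with and without label noise (a sketch assuming NumPy; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, D, k = 40, 400, 10
beta_star = np.zeros(D)
beta_star[:k] = rng.standard_normal(k)
beta_star /= np.linalg.norm(beta_star)        # unit-norm, supported on first k coords

def peak_height(sigma, trials=50):
    """Test MSE of the p = n fit; the model contains the true signal (k << n)."""
    errs = []
    for _ in range(trials):
        X = rng.standard_normal((n, D))
        y = X @ beta_star + sigma * rng.standard_normal(n)
        b = np.linalg.pinv(X[:, :n]) @ y      # p = n: near-singular square solve
        Xt = rng.standard_normal((500, D))
        errs.append(np.mean((Xt[:, :n] @ b - Xt @ beta_star) ** 2))
    return float(np.mean(errs))

peak_noisy = peak_height(0.5)   # label noise amplified by the near-singular solve
peak_clean = peak_height(0.0)   # no noise: exact recovery, no peak
```

With zero noise the square solve recovers the signal exactly and the threshold error is numerically zero; with label noise the same near-singular solve amplifies the noise into a large peak.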
Double descent is not unique to neural networks
The phenomenon appears in linear regression, random features, kernel methods, decision trees, and boosting. It is a property of overparameterized interpolation, not of any specific architecture.
Summary
- Test error can follow a double descent curve: decrease, peak at the interpolation threshold (p = n), then decrease again
- The peak is explained by random matrix theory: the condition number of the design matrix diverges at p = n
- In the overparameterized regime, gradient descent finds the min-norm interpolator, which generalizes well
- Epoch-wise double descent also occurs: test error can temporarily increase then decrease with more training
- Classical bias-variance tradeoff fails because variance is non-monotone in the overparameterized regime
Exercises
Problem
In linear regression with n samples and isotropic Gaussian features, you observe a sharp peak in test error at p = n parameters. You add regularization (ridge regression) with penalty λ > 0. What happens to the peak? Why?
Problem
Consider the min-norm interpolator in the overparameterized regime γ = p/n > 1 with noise variance σ². Using the asymptotic formula σ²/(γ − 1) for the variance component of the test risk, evaluate it at large γ and compare to its value just past the threshold (γ slightly above 1).
References
Canonical:
- Belkin, Hsu, Mandal, "Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate", NeurIPS 2018; arXiv:1806.05161. Early observation of the non-monotone generalization behavior of interpolating estimators.
- Spigler, Geiger, d'Ascoli, Sarao Mannelli, Biroli, Wyart, "A jamming transition from under- to over-parametrization affects generalization in deep learning", Journal of Physics A (2018); arXiv:1810.09665. Independent observation of the interpolation peak as a jamming transition in physics language.
- Belkin, Hsu, Ma, Mandal, "Reconciling modern machine-learning practice and the classical bias-variance trade-off", PNAS 116(32):15849-15854 (2019); arXiv:1812.11118. Coined the term "double descent" and documented it systematically across linear models, random features, and neural networks.
- Nakkiran, Kaplun, Bansal, Yang, Barak, Sutskever, "Deep Double Descent: Where Bigger Models and More Data Can Hurt", ICLR 2020; arXiv:1912.02292. Model-wise, sample-wise, and epoch-wise double descent in modern deep networks.
- Hastie, Montanari, Rosset, Tibshirani, "Surprises in High-Dimensional Ridgeless Least Squares Interpolation", Annals of Statistics 50(2):949-986 (2022); arXiv:1903.08560. Random-matrix analysis of the ridgeless min-norm interpolator; Theorems 1 and 2 give the isotropic and anisotropic asymptotic risks.
Theory of the peak and the second descent:
- Bartlett, Long, Lugosi, Tsigler, "Benign overfitting in linear regression", PNAS 117(48):30063-30070 (2020); arXiv:1906.11300. Effective-rank conditions r_k and R_k under which min-norm interpolation of noisy data is consistent.
- Mei, Montanari, "The generalization error of random features regression: Precise asymptotics and the double descent curve", Communications on Pure and Applied Mathematics (2022); arXiv:1908.05355. Exact asymptotics for random feature models.
- Advani, Saxe, Sompolinsky, "High-dimensional dynamics of generalization error in neural networks", Neural Networks 132:428-446 (2020); arXiv:1710.03667. Early analytic account of non-monotone generalization error.
Implicit bias and Marchenko-Pastur background:
- Soudry, Hoffer, Nacson, Gunasekar, Srebro, "The Implicit Bias of Gradient Descent on Separable Data", JMLR 19(70):1-57 (2018); arXiv:1710.10345. Logistic regression on separable data converges in direction to the max-L2-margin solution, not a min-norm interpolator.
- Gunasekar, Lee, Soudry, Srebro, "Characterizing Implicit Bias in Terms of Optimization Geometry", ICML 2018; arXiv:1802.08246. Which algorithm-loss-geometry triples yield min-L2, min-L1, or other implicit regularizers.
- Bai, Silverstein, Spectral Analysis of Large Dimensional Random Matrices (Springer, 2010), Chapter 3. Marchenko-Pastur law and edge behavior for both γ < 1 and γ > 1.
- Vershynin, High-Dimensional Probability, Cambridge, 2018, §4.7. Non-asymptotic bounds on extreme singular values of random matrices.
Next Topics
- Benign overfitting: when interpolating noisy data is provably harmless
- Implicit bias: why gradient descent finds min-norm solutions
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Implicit Bias and Modern Generalization (Layer 4)
- Gradient Descent Variants (Layer 1)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Linear Regression (Layer 1)
- Maximum Likelihood Estimation (Layer 0B)
- Common Probability Distributions (Layer 0A)
- VC Dimension (Layer 2)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Rademacher Complexity (Layer 3)
- Random Matrix Theory Overview (Layer 4)
- Matrix Concentration (Layer 3)
- Sub-Gaussian Random Variables (Layer 2)
- Sub-Exponential Random Variables (Layer 2)
- Epsilon-Nets and Covering Numbers (Layer 3)