
Modern Generalization

Double Descent

Test error follows a double-descent curve: it decreases, peaks at the interpolation threshold, then decreases again in the overparameterized regime, defying classical bias-variance intuition.


Why This Matters

[Figure: test error vs. model complexity (parameters / samples), contrasting the classical bias-variance U-curve with the modern double-descent curve; the peak sits at the interpolation threshold between the underparameterized and overparameterized regimes.]

For decades, the conventional wisdom was simple: as model complexity increases beyond a certain point, test error goes up (overfitting). This is the textbook U-shaped bias-variance tradeoff curve. Double descent breaks this picture. In modern overparameterized models, test error can decrease again past the interpolation threshold, creating a second descent. The phenomenon was observed by Belkin, Hsu, Mandal (2018) in "Overfitting or Perfect Fitting?" and by Spigler, Geiger, d'Ascoli, Sarao Mannelli, Biroli, Wyart (2018) as a jamming transition. Belkin et al. (2019) coined the term "double descent" and documented it systematically across linear models, random features, decision trees, and neural networks. It has since been given substantial theoretical footing via random matrix theory and benign overfitting results (Hastie et al. 2022, Bartlett et al. 2020, Mei & Montanari 2022), but the tension between overparameterization and classical uniform-convergence bounds remains an open question for general architectures.

Mental Model

Imagine increasing the number of parameters $p$ in your model while keeping the training set size $n$ fixed:

  1. Underparameterized regime ($p \ll n$): Classical behavior. More parameters reduce bias, test error decreases.
  2. Interpolation threshold ($p \approx n$): The model has just enough capacity to perfectly fit (interpolate) the training data, but it does so in the most fragile way possible: any noise in the data gets amplified. Test error peaks.
  3. Overparameterized regime ($p \gg n$): Many models interpolate the data. Gradient descent from zero initialization stays in the row space of $X$; the unique interpolator in $\text{row}(X)$ is the minimum Euclidean norm solution, which turns out to be surprisingly smooth. Test error decreases again (a minimal numerical sketch of this follows the list).
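
A minimal numerical sketch of point 3, assuming standard Gaussian features and an arbitrary target; the sizes, seed, and step size below are illustrative choices, not from the text. It checks that gradient descent on the least-squares loss from zero initialization reaches the same interpolator as the closed-form minimum-norm solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                                   # overparameterized: p >> n
X = rng.standard_normal((n, p)) / np.sqrt(p)
y = rng.standard_normal(n)

# Closed-form minimum Euclidean norm interpolator: w = X^T (X X^T)^{-1} y
w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)

# Gradient descent on (1/2) ||Xw - y||^2 starting from w = 0
w = np.zeros(p)
step = 0.5 / np.linalg.norm(X, 2) ** 2           # below 1/L, so the iteration is stable
for _ in range(20_000):
    w -= step * X.T @ (X @ w - y)

print("training residual:", np.linalg.norm(X @ w - y))               # ~ 0: interpolation
print("gap to min-norm solution:", np.linalg.norm(w - w_min_norm))   # ~ 0: same interpolator
```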

The Double Descent Phenomenon

Definition

Model-wise Double Descent

Fix a training set of size $n$. Plot test error as a function of model complexity (e.g., number of parameters $p$). The curve exhibits:

  1. A classical U-shape for $p < n$
  2. A peak at $p \approx n$ (the interpolation threshold)
  3. A second descent for $p > n$, where test error can drop below the minimum achieved in the classical regime

This was demonstrated empirically by Belkin et al. (2019) across linear models, random features, decision trees, and neural networks.

Definition

Epoch-wise Double Descent

Fix the model architecture. Plot test error as a function of training epochs. The curve can exhibit a similar pattern: test error first decreases, then increases (as the model begins to interpolate noisy labels), then decreases again with further training. This is especially pronounced in large models trained with label noise.

Definition

Interpolation Threshold

The interpolation threshold is the critical model complexity at which the model first achieves zero training error. For linear regression with $p$ features and $n$ samples, this occurs at $p = n$. At this point, the system of equations $Xw = y$ transitions from overdetermined (no exact solution) to underdetermined (infinitely many solutions).
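
A small illustration of this definition under i.i.d. Gaussian features (the sizes are arbitrary): below $p = n$ the least-squares fit leaves a nonzero training residual, while at and beyond $p = n$ the data can be interpolated exactly; for $p > n$, `numpy.linalg.lstsq` returns the minimum-norm interpolator among the infinitely many solutions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
y = rng.standard_normal(n)

for p in (10, 30, 100):
    X = rng.standard_normal((n, p))
    w, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares / min-norm solution
    print(f"p = {p:3d}   training residual = {np.linalg.norm(X @ w - y):.2e}")
# p < n: overdetermined, residual > 0 (no exact solution)
# p >= n: residual ~ 0 (interpolation); for p > n there are infinitely many interpolators
```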

The Peak Explained

Theorem

Divergence at the Interpolation Threshold

Statement

Consider linear regression with $p$ features and $n$ samples where $X$ has i.i.d. entries, noise variance $\sigma^2 > 0$, and $\gamma = p/n$. The variance term of the ridgeless least-squares risk is proportional to $\sigma^2 / |\gamma - 1|$, which diverges as $\gamma \to 1$ from either side:

R(\hat{w}) \to \infty \quad \text{as} \quad p/n \to 1, \quad \sigma^2 > 0

At exactly $\gamma = 1$ the design matrix is square and typically extremely ill-conditioned. In the noiseless case ($\sigma^2 = 0$) the variance term vanishes and no peak appears. The divergence is driven by the smallest nonzero singular value of $X$ collapsing at the transition point.

Intuition

At $p = n$, the design matrix $X$ is square. Its smallest singular value approaches zero, meaning the system is nearly rank-deficient. The minimum-norm solution must "stretch" enormously to interpolate the data through this near-singular system, amplifying noise into wild predictions.

Proof Sketch

Fix $c = p/n$. For i.i.d. Gaussian entries (Marchenko-Pastur; Bai & Silverstein 2010, Ch. 3; Vershynin, HDP 2018, §4.7):

  • If $c < 1$: $X^\top X / n$ has full rank $p$ almost surely and its smallest eigenvalue converges to $(1 - \sqrt{c})^2$, which vanishes as $c \to 1$.
  • If $c > 1$: $X^\top X / n$ is $p \times p$ of rank at most $n$, with exactly $p - n$ zero eigenvalues. Its smallest nonzero eigenvalue (equivalently, the smallest eigenvalue of $XX^\top / n$) converges to $(\sqrt{c} - 1)^2$; with the alternative normalization $XX^\top / p$ the limit is $(1 - \sqrt{1/c})^2$. Both vanish as $c \to 1$.
  • At $c = 1$: the bulk edge collapses to $0$.

The variance of the min-norm estimator involves an inverse-eigenvalue sum against this spectrum, which diverges as the edge approaches zero from either side.
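
A numerical check of this sketch, assuming i.i.d. standard Gaussian entries; $n$ and the grid of aspect ratios are arbitrary. The smallest nonzero eigenvalue of $X^\top X / n$ should track the Marchenko-Pastur edge $(1 - \sqrt{c})^2$ and collapse toward zero as $c \to 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

for c in (0.5, 0.9, 1.0, 1.1, 2.0):
    p = int(round(c * n))
    X = rng.standard_normal((n, p))
    s = np.linalg.svd(X, compute_uv=False)       # singular values of X (all nonzero a.s.)
    smallest_nonzero_eig = s.min() ** 2 / n      # smallest nonzero eigenvalue of X^T X / n
    mp_edge = (1 - np.sqrt(c)) ** 2              # Marchenko-Pastur lower-edge prediction
    print(f"c = {c:.1f}   empirical {smallest_nonzero_eig:.4f}   MP edge {mp_edge:.4f}")
```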

Why It Matters

This gives a precise mathematical explanation for the interpolation peak. It is not just an empirical curiosity. It is a consequence of spectral properties of random matrices. The peak is sharpest when there is label noise; with zero noise, it can disappear.

Failure Mode

This analysis assumes isotropic (structureless) features. Real data has structure (low effective rank, clustered features), which can shift, broaden, or even eliminate the peak. The theory gives qualitative but not always quantitative predictions for structured data.

Min-Norm Interpolation and the Second Descent

Theorem

Min-Norm Interpolation in Overparameterized Regression

Statement

For overparameterized linear regression ($p > n$, $X$ of full row rank), gradient descent on the least-squares loss initialized at zero converges to the minimum Euclidean norm interpolator:

\hat{w}_{\text{min-norm}} = X^\top (X X^\top)^{-1} y

For isotropic features with $p/n \to \gamma > 1$ and unit-norm signal, the Hastie-Montanari-Rosset-Tibshirani (2022, Thm 1) asymptotic risk is

R(\hat{w}_{\text{min-norm}}) = \frac{\sigma^2}{\gamma - 1} + \|\beta^*\|^2 \left(1 - \frac{1}{\gamma}\right)

The first term is variance; the second is bias. As $\gamma \to \infty$, the variance vanishes and the bias term approaches $\|\beta^*\|^2$, so the total risk converges to $\|\beta^*\|^2$, the risk of the trivial estimator $\hat{w} = 0$. At any finite $\gamma$, both terms move with $\gamma$; the risk does not sit at a fixed bias floor.
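
A hedged Monte Carlo check of the stated formula, assuming isotropic Gaussian features, a unit-norm signal, and Gaussian noise; the sample size, noise level, aspect ratio, and seed are illustrative. Risk here means the excess risk $\mathbb{E}\|\hat{w} - \beta^*\|^2$ (with $\Sigma = I$), which excludes the irreducible noise floor and matches the formula above.

```python
import numpy as np

rng = np.random.default_rng(3)
n, gamma, sigma2, n_trials = 200, 4.0, 0.5, 50
p = int(gamma * n)

risks = []
for _ in range(n_trials):
    beta = rng.standard_normal(p)
    beta /= np.linalg.norm(beta)                   # unit-norm signal
    X = rng.standard_normal((n, p))                # isotropic features, Sigma = I
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    w = X.T @ np.linalg.solve(X @ X.T, y)          # min-norm interpolator
    risks.append(np.sum((w - beta) ** 2))          # excess risk ||w - beta*||^2

theory = sigma2 / (gamma - 1) + 1.0 * (1 - 1 / gamma)
print(f"Monte Carlo excess risk ~ {np.mean(risks):.3f}   HMRT formula ~ {theory:.3f}")
```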

Intuition

When there are many more parameters than data points, the model has many ways to interpolate. The minimum-norm solution is the simplest (smallest weight vector), which acts as an implicit regularizer. More parameters means the minimum-norm solution can spread the "work" of fitting the data across more dimensions, requiring less extreme weights.

Proof Sketch

Decompose the risk into bias and variance. The bias term equals $\|\beta^*\|^2 (1 - n/p)$, the squared norm of the component of $\beta^*$ outside the row space of $X$. The variance term equals $\sigma^2 \cdot n/(p-n)$ for the min-norm estimator (from Marchenko-Pastur theory). As $p \to \infty$ with $n$ fixed, the variance term $\to 0$.

Why It Matters

This explains the second descent: past the interpolation threshold, adding more parameters reduces the variance term because the min-L2 bias spreads coordinate-wise error across more dimensions. The profile is specific to the min-L2 implicit bias. A different implicit bias (e.g. min-L1 from a different algorithm or loss) gives a different variance profile; see Hastie et al. 2022 Thm 2 and the Gunasekar et al. 2018 characterization of implicit bias by algorithm and geometry.

Failure Mode

The min-L2 characterization fails outside the stated assumptions:

  • Logistic regression on separable data: gradient descent converges in direction to the max-L2-margin solution, not to a min-norm interpolator (Soudry, Hoffer, Nacson, Gunasekar, Srebro 2018, arXiv:1710.10345).
  • Finite-width neural networks: no clean min-norm characterization outside the NTK regime; implicit bias depends on architecture, parameterization, and step size.
  • Other algorithms and geometries: Gunasekar, Lee, Soudry, Srebro (2018) "Characterizing Implicit Bias in Terms of Optimization Geometry" enumerates which settings give min-L2, min-L1, or other implicit regularizers.
  • Structured ground truth: even within the linear-regression setting, sparse or anisotropic $\beta^*$ can be better recovered by explicit regularization (ridge, lasso) than by the min-norm interpolator, especially near the threshold.

Benign Overfitting: Effective Rank Conditions

Bartlett, Long, Lugosi, Tsigler (2020), "Benign Overfitting in Linear Regression" (PNAS 117(48):30063-30070), identify when min-norm interpolation of noisy data is consistent. For a Gaussian linear model with covariance $\Sigma$, define the effective ranks

r_k(\Sigma) = \frac{\sum_{i > k} \lambda_i}{\lambda_{k+1}}, \qquad R_k(\Sigma) = \frac{\left(\sum_{i > k} \lambda_i\right)^2}{\sum_{i > k} \lambda_i^2}

where $\lambda_1 \ge \lambda_2 \ge \dots$ are the eigenvalues of $\Sigma$. Their main theorem gives matching upper and lower bounds on the excess risk of the min-norm interpolator in terms of $r_k$ and $R_k$. Consistency requires a fast-decay head plus a slow-decay tail: a small number of directions carrying signal, and a large effective-rank tail that absorbs noise. With exponentially decaying or uniformly heavy spectra the bound fails; benign overfitting is a spectrum-specific phenomenon.
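
A short sketch of the two effective-rank quantities on two illustrative spectra (a slowly decaying $1/i$ tail versus an exponentially decaying one); the decay rates and cut-off $k$ are arbitrary choices, not from the paper, and are only meant to show how the tail shape drives $r_k$ and $R_k$.

```python
import numpy as np

def effective_ranks(eigs, k):
    """r_k and R_k from Bartlett et al. (2020), for eigs sorted in decreasing order."""
    tail = eigs[k:]                               # lambda_{k+1}, lambda_{k+2}, ...
    r_k = tail.sum() / tail[0]
    R_k = tail.sum() ** 2 / np.sum(tail ** 2)
    return r_k, R_k

p, k = 10_000, 10
i = np.arange(1, p + 1, dtype=float)

for name, eigs in (("1/i spectrum (heavy tail)", 1.0 / i),
                   ("2^-i spectrum (fast decay)", 2.0 ** -i)):
    r_k, R_k = effective_ranks(eigs, k)
    print(f"{name}:  r_k = {r_k:.1f}   R_k = {R_k:.1f}")
# The heavy-tailed spectrum has large effective ranks (noise can be absorbed);
# the exponentially decaying spectrum has tiny effective ranks, so the consistency
# conditions fail.
```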

Why Classical Bias-Variance Fails

The classical bias-variance tradeoff predicts a U-shaped curve:

\text{Test error} = \text{Bias}^2 + \text{Variance} + \text{Noise}

As complexity increases, bias decreases monotonically and variance increases monotonically. Their sum is U-shaped with a unique minimum.

This fails in the overparameterized regime because:

  1. Variance is not monotonically increasing. Past the interpolation threshold, variance decreases with more parameters, due to the interaction of the min-L2 implicit bias with high-dimensional feature geometry. Under a different implicit bias the profile changes (HMRT 2022 Thm 2).
  2. Bias is no longer monotonically decreasing. For isotropic features the bias term $\|\beta^*\|^2 (1 - 1/\gamma)$ increases with $\gamma$ in the overparameterized regime; the total risk can still decrease there because the drop in variance dominates.
  3. The interpolation threshold is a phase transition, not a smooth tradeoff. The classical framework has no mechanism for the peak.

Canonical Examples

Example

Double descent in linear regression

Consider $y = x^\top w^* + \epsilon$ with $w^* \in \mathbb{R}^p$, $n = 100$, and $\epsilon \sim \mathcal{N}(0, 0.1)$. Plot test MSE vs. $p$:

  • For $p < 100$: test MSE decreases as the model captures more of the signal
  • At $p = 100$: test MSE spikes (the interpolation peak)
  • For $p > 100$: test MSE decreases, eventually approaching the noise floor

With $p = 1000$ (10x overparameterized), the test MSE can be lower than that of the best $p < 100$ model.
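
A minimal simulation in the spirit of this example, under assumptions of my own that the text leaves open: the data come from a fixed pool of $D = 1000$ Gaussian features with a unit-norm signal spread across all of them, and the model fits only the first $p$ features by (min-norm) least squares. With this setup the interpolation peak near $p = n$ and the second descent are clearly visible; the depth of the initial descent, and whether the overparameterized end beats the best underparameterized model, depend on how the signal is distributed across features.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_test, D, sigma2 = 100, 2000, 1000, 0.1      # matches n = 100 and noise variance 0.1

beta = rng.standard_normal(D)
beta /= np.linalg.norm(beta)                     # unit-norm signal over all D features
X_tr = rng.standard_normal((n, D))
X_te = rng.standard_normal((n_test, D))
y_tr = X_tr @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
y_te = X_te @ beta + np.sqrt(sigma2) * rng.standard_normal(n_test)

for p in (10, 50, 90, 100, 110, 200, 500, 1000):
    # Fit using only the first p features; lstsq returns the min-norm solution when p > n.
    w, *_ = np.linalg.lstsq(X_tr[:, :p], y_tr, rcond=None)
    mse = np.mean((X_te[:, :p] @ w - y_te) ** 2)
    print(f"p = {p:5d}   test MSE = {mse:.2f}")
# Expected pattern: moderate MSE for small p, a spike near p = n = 100, then a second
# descent for p >> n toward the lowest values in the sweep.
```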

Example

Epoch-wise double descent in ResNets

Nakkiran et al. (2020, arXiv:1912.02292) showed that ResNet-18 trained on CIFAR-10 with 20% label noise exhibits epoch-wise double descent: test error first decreases, then increases as the model begins memorizing noisy labels, then decreases again with further training. In one configuration from the paper these three phases appear around epochs 50, 150, and 500, but those specific values are indicative only. Exact peak locations depend on learning rate schedule, batch size, width multiplier, and optimizer. The second descent requires sufficient model capacity to enter the interpolation regime.

Common Confusions

Watch Out

Double descent does not mean you should always use huge models

Double descent shows that overparameterization can generalize well, but properly tuned regularization (e.g., optimal ridge penalty) often outperforms the second descent. The practical lesson is not "bigger is always better" but rather "the interpolation regime is not catastrophic."
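
A quick hedged comparison illustrating this point, assuming the same isotropic Gaussian linear model as in the earlier sketches with an aspect ratio just past the threshold; the grid of ridge penalties and all constants are arbitrary. A reasonably tuned ridge penalty should come in well below the ridgeless (min-norm) interpolator near the peak.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma2 = 200, 240, 0.5                     # gamma = 1.2, just past the threshold
beta = rng.standard_normal(p)
beta /= np.linalg.norm(beta)
X = rng.standard_normal((n, p))
y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)

def ridge(lmbda):
    # Ridge estimator (X^T X + lambda I)^{-1} X^T y; lambda -> 0 recovers the min-norm fit.
    return np.linalg.solve(X.T @ X + lmbda * np.eye(p), X.T @ y)

w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)
print(f"min-norm excess risk: {np.sum((w_min_norm - beta) ** 2):.3f}")
for lmbda in (0.1, 1.0, 10.0, 100.0):
    w = ridge(lmbda)
    print(f"ridge lambda = {lmbda:6.1f}   excess risk: {np.sum((w - beta) ** 2):.3f}")
```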

Watch Out

The peak requires noise

With zero label noise, there may be no interpolation peak. The double descent phenomenon is most dramatic when there is label noise, because the model is forced to fit the noise exactly at the interpolation threshold. Clean data can still show double descent, but the peak is much less pronounced.

Watch Out

Double descent is not unique to neural networks

The phenomenon appears in linear regression, random features, kernel methods, decision trees, and boosting. It is a property of overparameterized interpolation, not of any specific architecture.

Summary

  • Test error can follow a double descent curve: decrease, peak at the interpolation threshold ($p \approx n$), then decrease again
  • The peak is explained by random matrix theory: the condition number diverges at $p = n$
  • In the overparameterized regime, gradient descent finds the min-norm interpolator, which generalizes well
  • Epoch-wise double descent also occurs: test error can temporarily increase then decrease with more training
  • Classical bias-variance tradeoff fails because variance is non-monotone in the overparameterized regime

Exercises

ExerciseCore

Problem

In linear regression with $n = 50$ samples and isotropic Gaussian features, you observe a sharp peak in test error at $p = 50$ parameters. You add $L_2$ regularization (ridge regression) with penalty $\lambda > 0$. What happens to the peak? Why?

ExerciseAdvanced

Problem

Consider the min-norm interpolator $\hat{w} = X^\top(XX^\top)^{-1}y$ in the regime $p = 5n$. The noise variance is $\sigma^2$. Using the formula $\text{Var} = \sigma^2 n/(p-n)$, compute the variance component of the test risk. Compare this to the variance at $p = 1.1n$ (just past the threshold).

References

Canonical:

  • Belkin, Hsu, Mandal, "Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate", NeurIPS 2018; arXiv:1806.05161. Early observation of the non-monotone generalization behavior of interpolating estimators.
  • Spigler, Geiger, d'Ascoli, Sarao Mannelli, Biroli, Wyart, "A jamming transition from under- to over-parametrization affects generalization in deep learning", Journal of Physics A (2018); arXiv:1810.09665. Independent observation of the interpolation peak as a jamming transition in physics language.
  • Belkin, Hsu, Ma, Mandal, "Reconciling modern machine-learning practice and the classical bias-variance trade-off", PNAS 116(32):15849-15854 (2019); arXiv:1812.11118. Coined the term "double descent" and documented it systematically across linear models, random features, and neural networks.
  • Nakkiran, Kaplun, Bansal, Yang, Barak, Sutskever, "Deep Double Descent: Where Bigger Models and More Data Can Hurt", ICLR 2020; arXiv:1912.02292. Model-wise, sample-wise, and epoch-wise double descent in modern deep networks.
  • Hastie, Montanari, Rosset, Tibshirani, "Surprises in High-Dimensional Ridgeless Least Squares Interpolation", Annals of Statistics 50(2):949-986 (2022); arXiv:1903.08560. Random-matrix analysis of the ridgeless min-norm interpolator; Theorems 1 and 2 give the isotropic and anisotropic asymptotic risks.

Theory of the peak and the second descent:

  • Bartlett, Long, Lugosi, Tsigler, "Benign overfitting in linear regression", PNAS 117(48):30063-30070 (2020); arXiv:1906.11300. Effective-rank conditions $r_k(\Sigma)$ and $R_k(\Sigma)$ under which min-norm interpolation of noisy data is consistent.
  • Mei, Montanari, "The generalization error of random features regression: Precise asymptotics and the double descent curve", Communications on Pure and Applied Mathematics (2022); arXiv:1908.05355. Exact asymptotics for random feature models.
  • Advani, Saxe, Sompolinsky, "High-dimensional dynamics of generalization error in neural networks", Neural Networks 132:428-446 (2020); arXiv:1710.03667. Early analytic account of non-monotone generalization error.

Implicit bias and Marchenko-Pastur background:

  • Soudry, Hoffer, Nacson, Gunasekar, Srebro, "The Implicit Bias of Gradient Descent on Separable Data", JMLR 19(70):1-57 (2018); arXiv:1710.10345. Logistic regression on separable data converges in direction to the max-L2-margin solution, not a min-norm interpolator.
  • Gunasekar, Lee, Soudry, Srebro, "Characterizing Implicit Bias in Terms of Optimization Geometry", ICML 2018; arXiv:1802.08246. Which algorithm-loss-geometry triples yield min-L2, min-L1, or other implicit regularizers.
  • Bai, Silverstein, Spectral Analysis of Large Dimensional Random Matrices (Springer, 2010), Chapter 3. Marchenko-Pastur law and edge behavior for both $c < 1$ and $c > 1$.
  • Vershynin, High-Dimensional Probability, Cambridge, 2018, §4.7. Non-asymptotic bounds on extreme singular values of random matrices.

Next Topics

  • Benign overfitting: when interpolating noisy data is provably harmless
  • Implicit bias: why gradient descent finds min-norm solutions

Last reviewed: April 2026
