
Modern Generalization

Benign Overfitting

When interpolation (zero training error) does not hurt generalization: the min-norm interpolator fits noise in harmless directions while preserving signal. Bartlett et al. 2020, effective rank conditions, and why benign overfitting happens in overparameterized but not classical regimes.

Advanced · Tier 2 · ~65 min

Why This Matters

Classical statistics teaches a clear lesson: fitting the training data perfectly, including its noise, leads to poor generalization. This is overfitting, and it is the central cautionary tale of introductory ML courses.

Modern deep learning contradicts this lesson daily. Models with billions of parameters achieve zero training loss on noisy data and still generalize well. They interpolate the training data (including the noise) and yet their test performance is excellent.

Benign overfitting is the formal study of when and why interpolation is harmless. It provides the sharpest theoretical explanation for the overparameterized generalization puzzle: under specific conditions on the data covariance structure, the minimum-norm interpolating solution fits noise in directions that do not affect predictions on new data.

Mental Model

Imagine fitting a curve through noisy data points in 2 dimensions. With exactly as many parameters as data points, the curve must contort wildly to pass through every point, amplifying noise into large oscillations. This is catastrophic overfitting.

Now imagine the same data in 1000 dimensions. The model has 1000 parameters for the same number of data points. It can still interpolate every point, but now it has immense freedom in how it interpolates. The minimum-norm solution uses this freedom to fit the noise in the "extra" 998 dimensions that have nothing to do with the signal. The noise component is spread so thinly across so many irrelevant directions that it barely affects predictions on new data.

The key insight: in high dimensions, noise can hide in harmless directions.

The Setup

Consider linear regression with $n$ training samples in $\mathbb{R}^d$ where $d > n$:

$$y_i = x_i^\top w^* + \epsilon_i, \quad i = 1, \ldots, n$$

where $w^* \in \mathbb{R}^d$ is the true signal, $x_i \sim \mathcal{N}(0, \Sigma)$ with population covariance $\Sigma$, and $\epsilon_i$ is independent noise with $\mathbb{E}[\epsilon_i] = 0$, $\operatorname{Var}(\epsilon_i) = \sigma^2$.

Since $d > n$, the system $Xw = y$ is underdetermined and has infinitely many solutions. Gradient descent from zero initialization converges to the minimum-norm interpolator, which is the Moore-Penrose pseudoinverse solution $\hat{w} = X^+ y$. When $X$ has full row rank (generic for $d > n$), the pseudoinverse has the closed form:

$$\hat{w} = X^+ y = X^\top (X X^\top)^{-1} y$$

Equivalently, $\hat{w}$ is the ridgeless limit $\lim_{\lambda \downarrow 0} (X^\top X + \lambda I)^{-1} X^\top y$ of ridge regression. Under the benign overfitting conditions stated below, the excess risk of $\hat{w}$ converges to zero as $n \to \infty$ despite interpolating every noisy label.
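A quick numerical sanity check of these identities. This is a sketch with illustrative choices of `n`, `d`, and the noise level, not values taken from the theory:

```python
import numpy as np

# Minimum-norm interpolation in the overparameterized regime (d > n).
# All sizes and the noise level below are illustrative choices.
rng = np.random.default_rng(0)
n, d = 20, 200
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d) / np.sqrt(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)

# Closed form: w_hat = X^T (X X^T)^{-1} y (X has full row rank a.s.)
w_hat = X.T @ np.linalg.solve(X @ X.T, y)

# 1. It interpolates: training error is zero up to round-off.
assert np.allclose(X @ w_hat, y)

# 2. It matches the pseudoinverse solution X^+ y.
assert np.allclose(w_hat, np.linalg.pinv(X) @ y)

# 3. It is the ridgeless limit of ridge regression.
lam = 1e-8
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
assert np.allclose(w_hat, w_ridge, atol=1e-5)
```

The ridge path stays inside the row space of $X$, which is why it lands exactly on the min-norm solution as $\lambda \downarrow 0$.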

Definition

Benign Overfitting

Benign overfitting occurs when a model perfectly interpolates the training data (achieving zero training error, including fitting the noise) while maintaining low test error. The excess risk $R(\hat{w}) - R(w^*)$ converges to zero (or a small value) even though $\hat{w}$ interpolates every noisy training label.

Definition

Catastrophic Overfitting

Catastrophic overfitting occurs when interpolation does harm generalization: the model fits the noise in a way that corrupts predictions on new data. The excess risk is large, typically diverging as noise variance increases.

Note. The term is used in two disjoint senses in the literature. Mallinar et al. (2022) use it for the benign/tempered/catastrophic trichotomy of interpolators studied here. Wong, Rice, and Kolter (2020, arXiv 2001.03994) use "catastrophic overfitting" for a different phenomenon: the sudden collapse of robust accuracy during single-step adversarial training. On this page the term refers exclusively to the Mallinar et al. usage.

Definition

Effective Ranks of a Covariance (Bartlett et al. 2020)

Let the population covariance $\Sigma$ have eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots > 0$. Following Bartlett, Long, Lugosi, Tsigler (2020, Definition 1), two distinct effective ranks govern the behavior of the min-norm interpolator:

$$r_k(\Sigma) = \frac{\sum_{i > k} \lambda_i}{\lambda_{k+1}} \qquad \text{(bias-controlling)}$$

$$R_k(\Sigma) = \frac{\left( \sum_{i > k} \lambda_i \right)^2}{\sum_{i > k} \lambda_i^2} \qquad \text{(variance-controlling)}$$

The quantity $r_k(\Sigma)$ measures how heavy the tail is relative to the leading tail eigenvalue $\lambda_{k+1}$, and governs the bias of the interpolator. The quantity $R_k(\Sigma)$ is the squared ratio of the $\ell_1$ to $\ell_2$ norm of the tail spectrum, and governs the variance. In general $r_k(\Sigma) \leq R_k(\Sigma)$ (since $\sum_{i>k} \lambda_i^2 \leq \lambda_{k+1} \sum_{i>k} \lambda_i$), with equality when all tail eigenvalues are equal. The two conditions $r_k(\Sigma) \geq b n$ and $R_k(\Sigma) \geq c n$ must hold simultaneously for benign overfitting.
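Both ranks are one-liners to compute from a spectrum. A sketch (the example spectrum, two signal eigenvalues plus a flat tail, is an arbitrary illustration):

```python
import numpy as np

def effective_ranks(eigvals, k):
    """Bias rank r_k and variance rank R_k of the tail beyond index k.
    `eigvals` must be sorted in decreasing order."""
    tail = eigvals[k:]                          # lambda_{k+1}, lambda_{k+2}, ...
    r_k = tail.sum() / tail[0]                  # bias-controlling rank
    R_k = tail.sum() ** 2 / (tail ** 2).sum()   # variance-controlling rank
    return r_k, R_k

# A flat tail makes both ranks equal to the number of tail eigenvalues.
spectrum = np.concatenate([[4.0, 2.0], np.full(500, 0.02)])
r2, R2 = effective_ranks(spectrum, 2)
print(r2, R2)  # both 500: r_k = R_k when the tail is flat
```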

The question is: what distinguishes benign from catastrophic overfitting?

The Bartlett et al. 2020 Result

Theorem

Benign Overfitting in Linear Regression

Statement

Let the eigenvalues of the population covariance $\Sigma$ be $\lambda_1 \geq \lambda_2 \geq \cdots > 0$, and let $r_k(\Sigma)$ and $R_k(\Sigma)$ be the two effective ranks defined above. Assume there exist constants $b, c > 0$ and an index $k^* \geq 0$ such that both conditions hold:

$$r_{k^*}(\Sigma) \geq b \cdot n \qquad \text{(bias condition)}$$

$$R_{k^*}(\Sigma) \geq c \cdot n \qquad \text{(variance condition)}$$

Then the excess risk of the minimum-norm interpolator satisfies, with high probability:

$$R(\hat{w}) - R(w^*) \lesssim \underbrace{\|w^*\|_\Sigma^2 \left( \sqrt{\frac{k^*}{n}} + \frac{n}{r_{k^*}(\Sigma)} \right)}_{\text{bias}} + \underbrace{\sigma^2 \left( \frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)} \right)}_{\text{variance}}$$

Benign overfitting occurs when both $r_{k^*}(\Sigma)/n \to \infty$ and $R_{k^*}(\Sigma)/n \to \infty$ with $k^* = o(n)$: the tail is simultaneously heavy enough (bias) and flat enough (variance) that both terms vanish. If either effective rank condition fails, interpolation is not benign.

Intuition

The covariance spectrum splits into two parts: the first $k^*$ eigenvalues carry the signal, and the remaining eigenvalues form the "tail." The minimum-norm interpolator fits the noise using directions in the tail. Two properties of the tail matter. Its size relative to the leading tail eigenvalue (captured by $r_{k^*}(\Sigma)$) determines whether the interpolator can recover the signal directions, i.e. controls the bias. Its flatness (captured by $R_{k^*}(\Sigma)$) determines whether noise is spread across many directions or concentrated in a few, i.e. controls the variance. Both properties must be strong for benign overfitting.

High variance effective rank in the tail means many nearly-equal small eigenvalues. The noise hides in these many low-variance directions, each too weak to distort predictions.

Proof Sketch

Decompose the excess risk into bias and variance. The bias comes from the minimum-norm interpolator not perfectly recovering $w^*$ in the top $k^*$ eigendirections, and is controlled by $k^*/n$ together with $n/r_{k^*}(\Sigma)$ times the signal energy. The variance comes from fitting the noise $\epsilon$: the contribution of the noise component of $\hat{w}$ to test error is of order $\sigma^2 \left( k^*/n + n/R_{k^*}(\Sigma) \right)$. The key step uses random matrix theory to show that the resolvent $(XX^\top)^{-1}$ projects the noise into the high-dimensional tail, where it is diluted by the variance effective rank $R_{k^*}(\Sigma)$. The proof requires careful control of the eigenvalues of the sample covariance $XX^\top$ via non-asymptotic random matrix bounds.

Why It Matters

This is the first rigorous result characterizing exactly when interpolation is benign. Both effective rank conditions are checkable (in principle) for any data distribution. The result explains why overparameterized models generalize: high-dimensional data whose covariance has a heavy, flat tail satisfies both conditions. It also explains why the classical regime fails: with $d \leq n$ there is no high-dimensional tail to absorb the noise, so $R_k(\Sigma)/n$ cannot grow.

Failure Mode

The result is proved for linear regression with Gaussian (more generally, sub-Gaussian) features. For nonlinear models (neural networks), the analysis does not directly apply. There are extensions to kernel regression and random features models, but the precise conditions for benign overfitting in deep networks remain an open problem. Also, the result requires sub-Gaussian noise: heavy-tailed noise distributions may violate the conditions.

The Effective Rank Condition

Proposition

Variance Effective Rank Determines Benign vs Catastrophic

Statement

Fix $k$ so that the bias effective rank $r_k(\Sigma) = \sum_{j>k} \lambda_j / \lambda_{k+1}$ already satisfies $r_k(\Sigma) \gtrsim n$. The variance effective rank of the tail is:

$$R_k(\Sigma) = \frac{\left( \sum_{j > k} \lambda_j \right)^2}{\sum_{j > k} \lambda_j^2}$$

Conditional on the bias condition, the overfitting regime is determined by $R_k(\Sigma)$:

  • Benign when $R_k(\Sigma)/n \to \infty$: the noise variance component $\sigma^2 n / R_k(\Sigma) \to 0$.
  • Catastrophic when $R_k(\Sigma)/n \to 0$: the noise variance component $\sigma^2 n / R_k(\Sigma) \to \infty$.
  • Tempered when $R_k(\Sigma)/n \to c$ for a constant $c > 0$: the noise contributes a non-trivial but bounded amount to the excess risk (see Mallinar et al. 2022).

The variance rank $R_k(\Sigma)$ is maximized when all tail eigenvalues are equal (flat spectrum), giving $R_k(\Sigma) = d - k$. It is minimized when one tail eigenvalue dominates (peaked spectrum), giving $R_k(\Sigma) \approx 1$.
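These two extremes are easy to verify numerically. Tail sizes and eigenvalues below are illustrative:

```python
import numpy as np

def variance_rank(tail):
    # R_k computed directly from the tail eigenvalues
    return tail.sum() ** 2 / (tail ** 2).sum()

m = 1000  # number of tail eigenvalues, d - k

flat = np.full(m, 1e-3)                                 # flat spectrum
peaked = np.concatenate([[1.0], np.full(m - 1, 1e-6)])  # one dominant direction

print(variance_rank(flat))    # 1000.0: the maximum, R_k = d - k
print(variance_rank(peaked))  # ~1.002: noise concentrates in one direction
```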

Intuition

The effective rank counts the "effective number of dimensions" in the tail. A flat spectrum means many equally important directions, so noise gets diluted across all of them. A peaked spectrum means a few dominant directions, so noise concentrates and damages predictions.

For benign overfitting, you need $d - k \gg n$ (many more tail dimensions than samples) and a relatively flat tail spectrum (so the effective rank is close to $d - k$, not much smaller).

Why It Matters

This gives a concrete diagnostic: examine the eigenvalue spectrum of your data covariance. If the top eigenvalues capture the signal and the remaining eigenvalues are many and roughly equal, benign overfitting is expected. If the spectrum has a long, slowly decaying tail (power law), the effective rank may not grow fast enough for benign overfitting. A word of caution on a common claim: natural images do not have a rapidly decaying spectrum. Simoncelli and Olshausen (2001) document that natural image power spectra follow a power law $\sim 1/f^\alpha$ with $\alpha \approx 1.6$ to $2.0$, which translates to a slow polynomial decay of covariance eigenvalues rather than exponential decay. Under the Mallinar et al. (2022) taxonomy this typically places natural images in the tempered regime rather than the benign one. Exact benign overfitting requires a tail heavy enough that $R_k(\Sigma)/n \to \infty$, which most naturally occurring spectra only approach, not satisfy.

Why Benign Overfitting Happens in Overparameterized Regimes

In the classical regime ($d < n$), there is no interpolation: the model cannot perfectly fit the training data (in general). Overfitting in this regime means fitting noise at the expense of increasing test error. The bias-variance tradeoff applies straightforwardly.

In the overparameterized regime ($d \gg n$), interpolation is possible, and the minimum-norm solution has a specific structure:

$$\hat{w} = \underbrace{X^\top (X X^\top)^{-1} X w^*}_{\text{signal component}} + \underbrace{X^\top (X X^\top)^{-1} \epsilon}_{\text{noise component}}$$

The signal component recovers $w^*$ projected onto the row space of $X$. The noise component is $X^\top (X X^\top)^{-1} \epsilon$. The key question is whether the noise component corrupts test predictions.
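The decomposition can be checked directly. A sketch with illustrative sizes and noise level:

```python
import numpy as np

# Verify w_hat = signal component + noise component numerically.
rng = np.random.default_rng(1)
n, d = 30, 300
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d) / np.sqrt(d)
eps = 0.3 * rng.standard_normal(n)
y = X @ w_star + eps

G_inv = np.linalg.inv(X @ X.T)       # (X X^T)^{-1}, an n x n matrix
signal = X.T @ G_inv @ X @ w_star    # projection of w* onto the row space of X
noise = X.T @ G_inv @ eps            # what the interpolator does with the noise

w_hat = np.linalg.pinv(X) @ y        # min-norm interpolator
assert np.allclose(w_hat, signal + noise)
```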

For a new test point $x_{\text{test}}$, the noise contribution to the prediction is:

$$x_{\text{test}}^\top X^\top (X X^\top)^{-1} \epsilon$$

When $d \gg n$ and the tail spectrum is flat, $x_{\text{test}}$ and the noise-fitting directions are nearly orthogonal (in high dimensions, independent random vectors are nearly orthogonal). The inner product is small, and the noise barely affects the prediction.

This is the geometric essence of benign overfitting: in high dimensions, the model interpolates noise in directions that are orthogonal to the directions that matter for prediction.
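This near-orthogonality is visible in a small simulation. The isotropic features, sizes, and trial count below are illustrative choices; the mean squared noise contribution to a test prediction drops sharply once $d \gg n$:

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_sq_noise_contribution(n, d, sigma=1.0, trials=200):
    """Average of (x_test^T X^T (X X^T)^{-1} eps)^2 over random draws."""
    total = 0.0
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        eps = sigma * rng.standard_normal(n)
        x_test = rng.standard_normal(d)
        total += (x_test @ X.T @ np.linalg.solve(X @ X.T, eps)) ** 2
    return total / trials

near = mean_sq_noise_contribution(20, 40)    # d near n: order 1
far = mean_sq_noise_contribution(20, 2000)   # d >> n: much smaller
print(near, far)
```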

Connection to Double Descent

Benign overfitting explains the second descent in the double descent curve:

  1. At the interpolation threshold ($d \approx n$): the variance effective rank of the tail is small relative to $n$, so the noise component is concentrated in relatively few directions. Overfitting is catastrophic. Test error peaks.

  2. Far past the threshold ($d \gg n$): the variance effective rank of the tail is large ($R_k(\Sigma) \gg n$). The noise is spread thinly. Overfitting is benign. Test error decreases.

The transition from catastrophic to benign overfitting as $d/n$ increases produces the second descent. In this sense, double descent is the phase transition from catastrophic to benign overfitting along the overparameterization axis.
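A minimal simulation of this transition. It uses isotropic features, a simplifying assumption: with $\Sigma = I$ the variance peak at $d \approx n$ and the second descent both appear, but the bias does not vanish as $d$ grows, so this sketch shows the shape of the curve rather than strictly benign overfitting. All sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, trials = 40, 1.0, 100

def excess_risk(d):
    """Mean excess risk of the min-norm interpolator.
    For Sigma = I this is E ||w_hat - w*||^2."""
    errs = []
    for _ in range(trials):
        w_star = rng.standard_normal(d)
        w_star /= np.linalg.norm(w_star)        # fixed signal norm
        X = rng.standard_normal((n, d))
        y = X @ w_star + sigma * rng.standard_normal(n)
        w_hat = np.linalg.pinv(X) @ y
        errs.append(np.sum((w_hat - w_star) ** 2))
    return float(np.mean(errs))

results = {d: excess_risk(d) for d in [48, 80, 400, 4000]}
for d, err in results.items():
    print(d, err)
# risk peaks just past d = n = 40, then descends as d grows
```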

Benign, Tempered, or Catastrophic: The Refined Taxonomy

Mallinar, Simon, Abedsoltan, Pandit, Belkin, and Nakkiran (2022, arXiv 2207.06569) argue that the benign/catastrophic dichotomy is too coarse for real models. They propose a trichotomy based on how excess risk scales with label noise:

  • Benign: excess risk from noise is $o(\sigma^2)$ as $n \to \infty$. The interpolator is asymptotically as good as the Bayes predictor. Rare in practice.
  • Tempered: excess risk from noise is $\Theta(\sigma^2)$, proportional to the noise level, so interpolation is suboptimal but not ruinous. Empirically the typical regime for kernel regression, random features, and many deep networks on real data.
  • Catastrophic: excess risk from noise diverges in $\sigma^2$ or in $n$. The interpolator is useless.

The taxonomy is driven by the tail of the relevant kernel or covariance spectrum. A power-law tail $\lambda_j \sim j^{-\alpha}$ on kernel eigenvalues typically yields a tempered regime, not a benign one, because $R_k(\Sigma)/n$ stays bounded rather than diverging. The practical takeaway: most "benign overfitting" observed in deep learning is more precisely tempered overfitting. Interpolation is not ruinous, but it is not free either, and explicit regularization still pays.
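This can be checked for a concrete power-law spectrum. The exponent, cutoff, and the $d = 100n$ scaling are illustrative choices; the point is that $R_k(\Sigma)/n$ does not diverge as $n$ grows, so the benign condition fails:

```python
import numpy as np

# Variance rank R_k for a power-law spectrum lambda_j = j^(-alpha).
def variance_rank(eigvals, k):
    tail = eigvals[k:]
    return tail.sum() ** 2 / (tail ** 2).sum()

alpha, k = 1.0, 10
ratios = []
for n in [100, 1000, 10000]:
    d = 100 * n                                 # ambient dimension grows with n
    lam = np.arange(1, d + 1, dtype=float) ** (-alpha)
    ratios.append(variance_rank(lam, k) / n)
    print(n, ratios[-1])
# R_k / n stays bounded (here it even shrinks), so the benign
# requirement R_k / n -> infinity fails for this spectrum
```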

Connection to Random Matrix Theory

The proof of benign overfitting relies on precise control of the eigenvalues of the sample covariance matrix $X^\top X$ (or equivalently $XX^\top$). The key tools from random matrix theory:

  • Marchenko-Pastur law: describes the limiting spectral distribution of $XX^\top / n$ when $d/n \to \gamma$. Determines how sample eigenvalues relate to population eigenvalues.
  • Resolvent estimates: the resolvent $(XX^\top + \lambda I)^{-1}$ (at $\lambda = 0$ for the interpolator) controls the noise amplification. RMT provides sharp bounds on the resolvent trace and quadratic forms.
  • Spiked covariance model: when $\Sigma$ has a few large eigenvalues (signal) and many small ones (noise), the sample eigenvalues undergo phase transitions. The BBP (Baik-Ben Arous-Péché) transition determines when signal eigenvalues are detectable above the noise bulk.
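A quick check of the Marchenko-Pastur support for isotropic features (sizes are illustrative; the spectrum of $X^\top X / n$ is used here, which shares its nonzero eigenvalues with $X X^\top / n$):

```python
import numpy as np

# For X with iid N(0, 1) entries and gamma = d/n < 1, the eigenvalues of
# X^T X / n concentrate on [(1 - sqrt(gamma))^2, (1 + sqrt(gamma))^2].
rng = np.random.default_rng(4)
n, d = 2000, 500                       # gamma = 0.25
X = rng.standard_normal((n, d))
evals = np.linalg.eigvalsh(X.T @ X / n)

gamma = d / n
lo, hi = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
print(evals.min(), lo)                 # both near 0.25
print(evals.max(), hi)                 # both near 2.25
```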

Common Confusions

Watch Out

Benign overfitting does not mean all overfitting is harmless

Benign overfitting requires specific conditions on the data covariance spectrum. In low dimensions, with a peaked covariance spectrum, or with insufficient overparameterization, overfitting is catastrophic. The theory identifies when overfitting is benign, not that it is always benign. Most textbook examples of overfitting (polynomial regression in 1D, small neural networks) are in the catastrophic regime.

Watch Out

Benign overfitting does not mean regularization is useless

Even when overfitting is benign, ridge regression with optimal $\lambda$ can outperform the min-norm interpolator. Benign overfitting says the interpolator generalizes well, not that it generalizes optimally. The practical implication is that interpolation is not catastrophic in the overparameterized regime, but explicit regularization can still help.

Watch Out

The effective rank is about the tail, not the full spectrum

The condition for benign overfitting involves the effective rank of the covariance eigenvalues beyond the signal subspace, not the full spectrum. A dataset can have low overall effective rank (signal concentrated in a few directions) but high tail effective rank (many small noise eigenvalues). It is the tail that determines whether noise fitting is harmless.

Watch Out

Benign overfitting in linear models does not directly explain deep networks

The Bartlett et al. result is for linear regression. Neural networks are not linear. Extensions via the neural tangent kernel and random features provide partial bridges, but a complete theory of benign overfitting for deep networks does not exist. The linear result provides the right intuition and identifies the relevant quantities (effective rank, covariance spectrum), but the precise conditions for deep networks remain an active research area.

Summary

  • Benign overfitting: zero training error + good test error (interpolation is harmless)
  • Bartlett et al. (2020) identify two effective ranks of the tail spectrum: $r_k(\Sigma) = \sum_{j>k} \lambda_j / \lambda_{k+1}$ (bias-controlling) and $R_k(\Sigma) = (\sum_{j>k} \lambda_j)^2 / \sum_{j>k} \lambda_j^2$ (variance-controlling)
  • The theorem requires both $r_k(\Sigma) \gtrsim n$ and $R_k(\Sigma) \gtrsim n$
  • High variance rank $R_k(\Sigma)$ means noise is spread across many directions, each too weak to harm predictions
  • Mallinar et al. (2022) refine the picture: most real overfitting is tempered, not strictly benign
  • Catastrophic overfitting (Mallinar sense, not the Wong-Rice-Kolter adversarial sense): low tail variance rank, noise concentrates in few directions
  • The transition from catastrophic to benign/tempered overfitting explains the second descent in double descent
  • Requires $d \gg n$ (overparameterization) and a heavy, flat-ish tail spectrum
  • Linear theory provides intuition; deep network theory is still incomplete
  • Linear theory provides intuition; deep network theory is still incomplete

Exercises

ExerciseCore

Problem

Consider a covariance matrix $\Sigma$ with eigenvalues $\lambda_1 = 10$, $\lambda_2 = 5$, and $\lambda_j = 0.01$ for $j = 3, \ldots, 1000$. Compute both effective ranks $r_2(\Sigma)$ and $R_2(\Sigma)$ of the tail (eigenvalues beyond the top 2). Is this a benign overfitting regime for $n = 100$ samples?

ExerciseAdvanced

Problem

Now consider a covariance with power-law decay: $\lambda_j = j^{-\alpha}$ for $j = 1, \ldots, d$ with $d = 10000$ and $n = 100$. For what values of $\alpha$ is overfitting benign? Compute the variance rank $R_{10}(\Sigma)$ for $\alpha \in \{0.5, 1.0, 2.0\}$.

ExerciseResearch

Problem

The benign overfitting theory for linear regression requires the noise component $X^\top (X X^\top)^{-1} \epsilon$ to be small in prediction norm. For a neural network in the neural tangent kernel (NTK) regime, the analogous object is $\Phi^\top (\Phi \Phi^\top)^{-1} \epsilon$, where $\Phi$ is the feature matrix from the NTK. What conditions on the NTK spectrum would you need for benign overfitting, and why might these conditions be harder to verify than in the linear case?

References

Canonical:

  • Bartlett, Long, Lugosi, Tsigler, "Benign Overfitting in Linear Regression" (PNAS 117(48), 2020, arXiv 1906.11300). Definition 1 introduces the two effective ranks $r_k(\Sigma)$ and $R_k(\Sigma)$; Theorem 4 gives the bias/variance decomposition used on this page.
  • Belkin, Hsu, Ma, Mandal, "Reconciling Modern Machine Learning Practice and the Classical Bias-Variance Trade-Off" (PNAS 116(32), 2019). Original empirical double-descent evidence.

Current:

  • Tsigler and Bartlett, "Benign Overfitting in Ridge Regression" (JMLR 24, 2023). Extends the analysis to ridge and gives matching lower bounds.
  • Chatterji and Long, "Finite-Sample Analysis of Interpolating Linear Classifiers in the Overparameterized Regime" (JMLR 22, 2021). Logistic / max-margin classification analogue.
  • Mallinar, Simon, Abedsoltan, Pandit, Belkin, Nakkiran, "Benign, Tempered, or Catastrophic: Toward a Refined Taxonomy of Overfitting" (NeurIPS 2022, arXiv 2207.06569). Argues most real overfitting is tempered, not benign.
  • Hastie, Montanari, Rosset, Tibshirani, "Surprises in High-Dimensional Ridgeless Least Squares Interpolation" (Annals of Statistics 50(2), 2022). Asymptotic risk of the min-norm interpolator under proportional asymptotics.

Spectral assumptions on natural data:

  • Simoncelli and Olshausen, "Natural Image Statistics and Neural Representation" (Annual Review of Neuroscience 24, 2001). Establishes that natural image power spectra follow a power law roughly $1/f^\alpha$ with $\alpha \approx 1.6$ to $2.0$, not exponential decay.

Background on adversarial "catastrophic overfitting":

  • Wong, Rice, Kolter, "Fast is Better than Free: Revisiting Adversarial Training" (ICLR 2020, arXiv 2001.03994). Uses "catastrophic overfitting" for a distinct phenomenon in single-step adversarial training, not for the trichotomy of this page.

Next Topics

The natural next steps from benign overfitting:

  • Neural tangent kernel: the infinite-width regime where neural networks become kernel machines and benign overfitting analysis can be extended
  • Double descent: the full generalization curve that benign overfitting explains in the overparameterized regime

Last reviewed: April 2026
