
Modern Generalization

Benign Overfitting

When interpolation (zero training error) does not hurt generalization: the min-norm interpolator fits noise in harmless directions while preserving signal. Bartlett et al. 2020, effective rank conditions, and why benign overfitting happens in overparameterized but not classical regimes.

Advanced · Tier 2 · ~65 min

Why This Matters

Classical statistics teaches a clear lesson: fitting the training data perfectly, including its noise, leads to poor generalization. This is overfitting, and it is the central cautionary tale of introductory ML courses.

Modern deep learning contradicts this lesson daily. Models with billions of parameters achieve zero training loss on noisy data and still generalize well. They interpolate the training data (including the noise) and yet their test performance is excellent.

Benign overfitting is the formal study of when and why interpolation is harmless. It provides the sharpest theoretical explanation for the overparameterized generalization puzzle: under specific conditions on the data covariance structure, the minimum-norm interpolating solution fits noise in directions that do not affect predictions on new data.

Mental Model

Imagine fitting a curve through noisy data points in 2 dimensions. With exactly as many parameters as data points, the curve must contort wildly to pass through every point, amplifying noise into large oscillations. This is catastrophic overfitting.

Now imagine the same data in 1000 dimensions. The model has 1000 parameters for the same number of data points. It can still interpolate every point, but now it has immense freedom in how it interpolates. The minimum-norm solution uses this freedom to fit the noise in the "extra" 998 dimensions that have nothing to do with the signal. The noise component is spread so thinly across so many irrelevant directions that it barely affects predictions on new data.

The key insight: in high dimensions, noise can hide in harmless directions.

The Setup

Consider linear regression with $n$ training samples in $\mathbb{R}^d$ where $d > n$:

$$y_i = x_i^\top w^* + \epsilon_i, \quad i = 1, \ldots, n$$

where $w^* \in \mathbb{R}^d$ is the true signal, $x_i \sim \mathcal{N}(0, \Sigma)$ with population covariance $\Sigma$, and $\epsilon_i$ is independent noise with $\mathbb{E}[\epsilon_i] = 0$, $\operatorname{Var}(\epsilon_i) = \sigma^2$.

Since $d > n$, the system $Xw = y$ is underdetermined and has infinitely many solutions. Gradient descent from zero initialization converges to the minimum-norm interpolator, which is the Moore-Penrose pseudoinverse solution $\hat{w} = X^+ y$. When $X$ has full row rank (generic for $d > n$), the pseudoinverse has the closed form:

$$\hat{w} = X^+ y = X^\top (X X^\top)^{-1} y$$

Equivalently, $\hat{w}$ is the ridgeless limit $\lim_{\lambda \downarrow 0} (X^\top X + \lambda I)^{-1} X^\top y$ of ridge regression. Under the benign overfitting conditions stated below, the excess risk of $\hat{w}$ converges to zero as $n \to \infty$ despite interpolating every noisy label.
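A quick numerical sanity check of these identities. This is a sketch with illustrative choices of `n`, `d`, and the noise level, not values taken from the theory:

```python
import numpy as np

# Minimum-norm interpolation in the overparameterized regime (d > n).
# All sizes and the noise level below are illustrative choices.
rng = np.random.default_rng(0)
n, d = 20, 200
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d) / np.sqrt(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)

# Closed form: w_hat = X^T (X X^T)^{-1} y (X has full row rank a.s.)
w_hat = X.T @ np.linalg.solve(X @ X.T, y)

# 1. It interpolates: training error is zero up to round-off.
assert np.allclose(X @ w_hat, y)

# 2. It matches the pseudoinverse solution X^+ y.
assert np.allclose(w_hat, np.linalg.pinv(X) @ y)

# 3. It is the ridgeless limit of ridge regression.
lam = 1e-8
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
assert np.allclose(w_hat, w_ridge, atol=1e-5)
```

The ridge path stays inside the row space of $X$, which is why it lands exactly on the min-norm solution as $\lambda \downarrow 0$.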

Definition

Benign Overfitting

Benign overfitting occurs when a model perfectly interpolates the training data (achieving zero training error, including fitting the noise) while maintaining low test error. The excess risk $R(\hat{w}) - R(w^*)$ converges to zero (or a small value) even though $\hat{w}$ interpolates every noisy training label.

Definition

Catastrophic Overfitting

Catastrophic overfitting occurs when interpolation does harm generalization: the model fits the noise in a way that corrupts predictions on new data. The excess risk is large, typically diverging as noise variance increases.

Note. The term is used in two disjoint senses in the literature. Mallinar et al. (2022) use it for the benign/tempered/catastrophic trichotomy of interpolators studied here. Wong, Rice, and Kolter (2020, arXiv 2001.03994) use "catastrophic overfitting" for a different phenomenon: the sudden collapse of robust accuracy during single-step adversarial training. On this page the term refers exclusively to the Mallinar et al. usage.

Definition

Effective Ranks of a Covariance (Bartlett et al. 2020)

Let the population covariance $\Sigma$ have eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots > 0$. Following Bartlett, Long, Lugosi, Tsigler (2020, Definition 1), two distinct effective ranks govern the behavior of the min-norm interpolator:

$$r_k(\Sigma) = \frac{\sum_{i > k} \lambda_i}{\lambda_{k+1}} \qquad \text{(bias-controlling)}$$

$$R_k(\Sigma) = \frac{\left( \sum_{i > k} \lambda_i \right)^2}{\sum_{i > k} \lambda_i^2} \qquad \text{(variance-controlling)}$$

The quantity $r_k(\Sigma)$ measures how heavy the tail is relative to the leading tail eigenvalue $\lambda_{k+1}$, and governs the bias of the interpolator. The quantity $R_k(\Sigma)$ is the squared ratio of the $\ell_1$ to $\ell_2$ norm of the tail spectrum, and governs the variance. In general $r_k(\Sigma) \leq R_k(\Sigma)$ (since $\sum_{i>k} \lambda_i^2 \leq \lambda_{k+1} \sum_{i>k} \lambda_i$), with equality when all tail eigenvalues are equal. The two conditions $r_k(\Sigma) \geq b n$ and $R_k(\Sigma) \geq c n$ must hold simultaneously for benign overfitting.
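Both ranks are one-liners to compute from a spectrum. A sketch (the example spectrum, two signal eigenvalues plus a flat tail, is an arbitrary illustration):

```python
import numpy as np

def effective_ranks(eigvals, k):
    """Bias rank r_k and variance rank R_k of the tail beyond index k.
    `eigvals` must be sorted in decreasing order."""
    tail = eigvals[k:]                          # lambda_{k+1}, lambda_{k+2}, ...
    r_k = tail.sum() / tail[0]                  # bias-controlling rank
    R_k = tail.sum() ** 2 / (tail ** 2).sum()   # variance-controlling rank
    return r_k, R_k

# A flat tail makes both ranks equal to the number of tail eigenvalues.
spectrum = np.concatenate([[4.0, 2.0], np.full(500, 0.02)])
r2, R2 = effective_ranks(spectrum, 2)
print(r2, R2)  # both 500: r_k = R_k when the tail is flat
```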

The question is: what distinguishes benign from catastrophic overfitting?

The Bartlett et al. 2020 Result

Theorem

Benign Overfitting in Linear Regression

Statement

Let the eigenvalues of the population covariance $\Sigma$ be $\lambda_1 \geq \lambda_2 \geq \cdots > 0$, and let $r_k(\Sigma)$ and $R_k(\Sigma)$ be the two effective ranks defined above. Assume there exist constants $b, c > 0$ and an index $k^* \geq 0$ such that both conditions hold:

$$r_{k^*}(\Sigma) \geq b \cdot n \qquad \text{(bias condition)}$$

$$R_{k^*}(\Sigma) \geq c \cdot n \qquad \text{(variance condition)}$$

Then the excess risk of the minimum-norm interpolator satisfies, with high probability:

$$R(\hat{w}) - R(w^*) \lesssim \underbrace{\|w^*\|_\Sigma^2 \left( \sqrt{\frac{k^*}{n}} + \frac{n}{r_{k^*}(\Sigma)} \right)}_{\text{bias}} + \underbrace{\sigma^2 \left( \frac{k^*}{n} + \frac{n}{R_{k^*}(\Sigma)} \right)}_{\text{variance}}$$

Benign overfitting occurs when both $r_{k^*}(\Sigma)/n \to \infty$ and $R_{k^*}(\Sigma)/n \to \infty$ with $k^* = o(n)$: the tail is simultaneously heavy enough (bias) and flat enough (variance) that both terms vanish. If either effective rank condition fails, interpolation is not benign.

Intuition

The covariance spectrum splits into two parts: the first $k^*$ eigenvalues carry the signal, and the remaining eigenvalues form the "tail." The minimum-norm interpolator fits the noise using directions in the tail. Two properties of the tail matter. Its size relative to the leading tail eigenvalue (captured by $r_{k^*}(\Sigma)$) determines whether the interpolator can recover the signal directions, i.e. controls the bias. Its flatness (captured by $R_{k^*}(\Sigma)$) determines whether noise is spread across many directions or concentrated in a few, i.e. controls the variance. Both properties must be strong for benign overfitting.

High variance effective rank in the tail means many nearly-equal small eigenvalues. The noise hides in these many low-variance directions, each too weak to distort predictions.

Proof Sketch

Decompose the excess risk into bias and variance. The bias comes from the minimum-norm interpolator not perfectly recovering $w^*$ in the top $k^*$ eigendirections, and is controlled by $k^*/n$ together with $n/r_{k^*}(\Sigma)$ times the signal energy. The variance comes from fitting the noise $\epsilon$: the contribution of the noise component of $\hat{w}$ to test error is of order $\sigma^2 \left( k^*/n + n/R_{k^*}(\Sigma) \right)$. The key step uses random matrix theory to show that the resolvent $(XX^\top)^{-1}$ projects the noise into the high-dimensional tail, where it is diluted by the variance effective rank $R_{k^*}(\Sigma)$. The proof requires careful control of the eigenvalues of the sample covariance $XX^\top$ via non-asymptotic random matrix bounds.

Why It Matters

This is the first rigorous result characterizing exactly when interpolation is benign. Both effective rank conditions are checkable (in principle) for any data distribution. The result explains why overparameterized models generalize: high-dimensional data whose covariance has a heavy, flat tail satisfies both conditions. It also explains why the classical regime fails: with $d \leq n$ there is no high-dimensional tail to absorb the noise, so $R_k(\Sigma)/n$ cannot grow.

Failure Mode

The result is proved for linear regression with Gaussian (more generally, sub-Gaussian) features. For nonlinear models (neural networks), the analysis does not directly apply. There are extensions to kernel regression and random features models, but the precise conditions for benign overfitting in deep networks remain an open problem. Also, the result requires sub-Gaussian noise: heavy-tailed noise distributions may violate the conditions.

The Effective Rank Condition

Proposition

Variance Effective Rank Determines Benign vs Catastrophic

Statement

Fix $k$ so that the bias effective rank $r_k(\Sigma) = \sum_{j>k} \lambda_j / \lambda_{k+1}$ already satisfies $r_k(\Sigma) \gtrsim n$. The variance effective rank of the tail is:

$$R_k(\Sigma) = \frac{\left( \sum_{j > k} \lambda_j \right)^2}{\sum_{j > k} \lambda_j^2}$$

Conditional on the bias condition, the overfitting regime is determined by $R_k(\Sigma)$:

  • Benign when $R_k(\Sigma)/n \to \infty$: the noise variance component $\sigma^2 n / R_k(\Sigma) \to 0$.
  • Catastrophic when $R_k(\Sigma)/n \to 0$: the noise variance component $\sigma^2 n / R_k(\Sigma) \to \infty$.
  • Tempered when $R_k(\Sigma)/n \to c$ for a constant $c > 0$: the noise contributes a non-trivial but bounded amount to the excess risk (see Mallinar et al. 2022).

The variance rank $R_k(\Sigma)$ is maximized when all tail eigenvalues are equal (flat spectrum), giving $R_k(\Sigma) = d - k$. It is minimized when one tail eigenvalue dominates (peaked spectrum), giving $R_k(\Sigma) \approx 1$.
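These two extremes are easy to verify numerically. Tail sizes and eigenvalues below are illustrative:

```python
import numpy as np

def variance_rank(tail):
    # R_k computed directly from the tail eigenvalues
    return tail.sum() ** 2 / (tail ** 2).sum()

m = 1000  # number of tail eigenvalues, d - k

flat = np.full(m, 1e-3)                                 # flat spectrum
peaked = np.concatenate([[1.0], np.full(m - 1, 1e-6)])  # one dominant direction

print(variance_rank(flat))    # 1000.0: the maximum, R_k = d - k
print(variance_rank(peaked))  # ~1.002: noise concentrates in one direction
```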

Intuition

The effective rank counts the "effective number of dimensions" in the tail. A flat spectrum means many equally important directions, so noise gets diluted across all of them. A peaked spectrum means a few dominant directions, so noise concentrates and damages predictions.

For benign overfitting, you need $d - k \gg n$ (many more tail dimensions than samples) and a relatively flat tail spectrum (so the effective rank is close to $d - k$, not much smaller).

Why It Matters

This gives a concrete diagnostic: examine the eigenvalue spectrum of your data covariance. If the top eigenvalues capture the signal and the remaining eigenvalues are many and roughly equal, benign overfitting is expected. If the spectrum has a long, slowly decaying tail (power law), the effective rank may not grow fast enough for benign overfitting. A word of caution on a common claim: natural images do not have a rapidly decaying spectrum. Simoncelli and Olshausen (2001) document that natural image power spectra follow a power law $\sim 1/f^\alpha$ with $\alpha \approx 1.6$ to $2.0$, which translates to a slow polynomial decay of covariance eigenvalues rather than exponential decay. Under the Mallinar et al. (2022) taxonomy this typically places natural images in the tempered regime rather than the benign one. Exact benign overfitting requires a tail heavy enough that $R_k(\Sigma)/n \to \infty$, which most naturally occurring spectra only approach, not satisfy.

Why Benign Overfitting Happens in Overparameterized Regimes

In the classical regime ($d < n$), there is no interpolation: the model cannot perfectly fit the training data (in general). Overfitting in this regime means fitting noise at the expense of increasing test error. The bias-variance tradeoff applies straightforwardly.

In the overparameterized regime ($d \gg n$), interpolation is possible, and the minimum-norm solution has a specific structure:

$$\hat{w} = \underbrace{X^\top (X X^\top)^{-1} X w^*}_{\text{signal component}} + \underbrace{X^\top (X X^\top)^{-1} \epsilon}_{\text{noise component}}$$

The signal component recovers $w^*$ projected onto the row space of $X$. The noise component is $X^\top (X X^\top)^{-1} \epsilon$. The key question is whether the noise component corrupts test predictions.
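The decomposition can be checked directly. A sketch with illustrative sizes and noise level:

```python
import numpy as np

# Verify w_hat = signal component + noise component numerically.
rng = np.random.default_rng(1)
n, d = 30, 300
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d) / np.sqrt(d)
eps = 0.3 * rng.standard_normal(n)
y = X @ w_star + eps

G_inv = np.linalg.inv(X @ X.T)       # (X X^T)^{-1}, an n x n matrix
signal = X.T @ G_inv @ X @ w_star    # projection of w* onto the row space of X
noise = X.T @ G_inv @ eps            # what the interpolator does with the noise

w_hat = np.linalg.pinv(X) @ y        # min-norm interpolator
assert np.allclose(w_hat, signal + noise)
```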

For a new test point $x_{\text{test}}$, the noise contribution to the prediction is:

$$x_{\text{test}}^\top X^\top (X X^\top)^{-1} \epsilon$$

When $d \gg n$ and the tail spectrum is flat, $x_{\text{test}}$ and the noise-fitting directions are nearly orthogonal (in high dimensions, independent random vectors are nearly orthogonal). The inner product is small, and the noise barely affects the prediction.

This is the geometric essence of benign overfitting: in high dimensions, the model interpolates noise in directions that are orthogonal to the directions that matter for prediction.
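This near-orthogonality is visible in a small simulation. The isotropic features, sizes, and trial count below are illustrative choices; the mean squared noise contribution to a test prediction drops sharply once $d \gg n$:

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_sq_noise_contribution(n, d, sigma=1.0, trials=200):
    """Average of (x_test^T X^T (X X^T)^{-1} eps)^2 over random draws."""
    total = 0.0
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        eps = sigma * rng.standard_normal(n)
        x_test = rng.standard_normal(d)
        total += (x_test @ X.T @ np.linalg.solve(X @ X.T, eps)) ** 2
    return total / trials

near = mean_sq_noise_contribution(20, 40)    # d near n: order 1
far = mean_sq_noise_contribution(20, 2000)   # d >> n: much smaller
print(near, far)
```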

Connection to Double Descent

Benign overfitting explains the second descent in the double descent curve:

  1. At the interpolation threshold ($d \approx n$): the variance effective rank of the tail is small relative to $n$, so the noise component is concentrated in relatively few directions. Overfitting is catastrophic. Test error peaks.

  2. Far past the threshold ($d \gg n$): the variance effective rank of the tail is large ($R_k(\Sigma) \gg n$). The noise is spread thinly. Overfitting is benign. Test error decreases.

The transition from catastrophic to benign overfitting as $d/n$ increases produces the second descent. In this sense, double descent is the phase transition from catastrophic to benign overfitting along the overparameterization axis.
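A minimal simulation of this transition. It uses isotropic features, a simplifying assumption: with $\Sigma = I$ the variance peak at $d \approx n$ and the second descent both appear, but the bias does not vanish as $d$ grows, so this sketch shows the shape of the curve rather than strictly benign overfitting. All sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, trials = 40, 1.0, 100

def excess_risk(d):
    """Mean excess risk of the min-norm interpolator.
    For Sigma = I this is E ||w_hat - w*||^2."""
    errs = []
    for _ in range(trials):
        w_star = rng.standard_normal(d)
        w_star /= np.linalg.norm(w_star)        # fixed signal norm
        X = rng.standard_normal((n, d))
        y = X @ w_star + sigma * rng.standard_normal(n)
        w_hat = np.linalg.pinv(X) @ y
        errs.append(np.sum((w_hat - w_star) ** 2))
    return float(np.mean(errs))

results = {d: excess_risk(d) for d in [48, 80, 400, 4000]}
for d, err in results.items():
    print(d, err)
# risk peaks just past d = n = 40, then descends as d grows
```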

Benign, Tempered, or Catastrophic: The Refined Taxonomy

Mallinar, Simon, Abedsoltan, Pandit, Belkin, and Nakkiran (2022, arXiv 2207.06569) argue that the benign/catastrophic dichotomy is too coarse for real models. They propose a trichotomy based on how excess risk scales with label noise:

  • Benign: excess risk from noise is $o(\sigma^2)$ as $n \to \infty$. The interpolator is asymptotically as good as the Bayes predictor. Rare in practice.
  • Tempered: excess risk from noise is $\Theta(\sigma^2)$, proportional to the noise level, so interpolation is suboptimal but not ruinous. Empirically the typical regime for kernel regression, random features, and many deep networks on real data.
  • Catastrophic: excess risk from noise diverges in $\sigma^2$ or in $n$. The interpolator is useless.

The taxonomy is driven by the tail of the relevant kernel or covariance spectrum. A power-law tail $\lambda_j \sim j^{-\alpha}$ on kernel eigenvalues typically yields a tempered regime, not a benign one, because $R_k(\Sigma)/n$ stays bounded rather than diverging. The practical takeaway: most "benign overfitting" observed in deep learning is more precisely tempered overfitting. Interpolation is not ruinous, but it is not free either, and explicit regularization still pays.
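This can be checked for a concrete power-law spectrum. The exponent, cutoff, and the $d = 100n$ scaling are illustrative choices; the point is that $R_k(\Sigma)/n$ does not diverge as $n$ grows, so the benign condition fails:

```python
import numpy as np

# Variance rank R_k for a power-law spectrum lambda_j = j^(-alpha).
def variance_rank(eigvals, k):
    tail = eigvals[k:]
    return tail.sum() ** 2 / (tail ** 2).sum()

alpha, k = 1.0, 10
ratios = []
for n in [100, 1000, 10000]:
    d = 100 * n                                 # ambient dimension grows with n
    lam = np.arange(1, d + 1, dtype=float) ** (-alpha)
    ratios.append(variance_rank(lam, k) / n)
    print(n, ratios[-1])
# R_k / n stays bounded (here it even shrinks), so the benign
# requirement R_k / n -> infinity fails for this spectrum
```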

Connection to Random Matrix Theory

The proof of benign overfitting relies on precise control of the eigenvalues of the sample covariance matrix $X^\top X$ (or equivalently $XX^\top$). The key tools from random matrix theory:

  • Marchenko-Pastur law: describes the limiting spectral distribution of $XX^\top / n$ when $d/n \to \gamma$. Determines how sample eigenvalues relate to population eigenvalues.
  • Resolvent estimates: the resolvent $(XX^\top + \lambda I)^{-1}$ (at $\lambda = 0$ for the interpolator) controls the noise amplification. RMT provides sharp bounds on the resolvent trace and quadratic forms.
  • Spiked covariance model: when $\Sigma$ has a few large eigenvalues (signal) and many small ones (noise), the sample eigenvalues undergo phase transitions. The BBP (Baik-Ben Arous-Péché) transition determines when signal eigenvalues are detectable above the noise bulk.
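A quick check of the Marchenko-Pastur support for isotropic features (sizes are illustrative; the spectrum of $X^\top X / n$ is used here, which shares its nonzero eigenvalues with $X X^\top / n$):

```python
import numpy as np

# For X with iid N(0, 1) entries and gamma = d/n < 1, the eigenvalues of
# X^T X / n concentrate on [(1 - sqrt(gamma))^2, (1 + sqrt(gamma))^2].
rng = np.random.default_rng(4)
n, d = 2000, 500                       # gamma = 0.25
X = rng.standard_normal((n, d))
evals = np.linalg.eigvalsh(X.T @ X / n)

gamma = d / n
lo, hi = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
print(evals.min(), lo)                 # both near 0.25
print(evals.max(), hi)                 # both near 2.25
```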

Common Confusions

Watch Out

Benign overfitting does not mean all overfitting is harmless

Benign overfitting requires specific conditions on the data covariance spectrum. In low dimensions, with a peaked covariance spectrum, or with insufficient overparameterization, overfitting is catastrophic. The theory identifies when overfitting is benign, not that it is always benign. Most textbook examples of overfitting (polynomial regression in 1D, small neural networks) are in the catastrophic regime.

Watch Out

Benign overfitting does not mean regularization is useless

Even when overfitting is benign, ridge regression with optimal $\lambda$ can outperform the min-norm interpolator. Benign overfitting says the interpolator generalizes well, not that it generalizes optimally. The practical implication is that interpolation is not catastrophic in the overparameterized regime, but explicit regularization can still help.

Watch Out

The effective rank is about the tail, not the full spectrum

The condition for benign overfitting involves the effective rank of the covariance eigenvalues beyond the signal subspace, not the full spectrum. A dataset can have low overall effective rank (signal concentrated in a few directions) but high tail effective rank (many small noise eigenvalues). It is the tail that determines whether noise fitting is harmless.

Watch Out

Benign overfitting in linear models does not directly explain deep networks

The Bartlett et al. result is for linear regression. Neural networks are not linear. Extensions via the neural tangent kernel and random features provide partial bridges, but a complete theory of benign overfitting for deep networks does not exist. The linear result provides the right intuition and identifies the relevant quantities (effective rank, covariance spectrum), but the precise conditions for deep networks remain an active research area.

Summary

  • Benign overfitting: zero training error + good test error (interpolation is harmless)
  • Bartlett et al. (2020) identify two effective ranks of the tail spectrum: $r_k(\Sigma) = \sum_{j>k} \lambda_j / \lambda_{k+1}$ (bias-controlling) and $R_k(\Sigma) = (\sum_{j>k} \lambda_j)^2 / \sum_{j>k} \lambda_j^2$ (variance-controlling)
  • The theorem requires both $r_k(\Sigma) \gtrsim n$ and $R_k(\Sigma) \gtrsim n$
  • High variance rank $R_k(\Sigma)$ means noise is spread across many directions, each too weak to harm predictions
  • Mallinar et al. (2022) refine the picture: most real overfitting is tempered, not strictly benign
  • Catastrophic overfitting (Mallinar sense, not the Wong-Rice-Kolter adversarial sense): low tail variance rank, noise concentrates in few directions
  • The transition from catastrophic to benign/tempered overfitting explains the second descent in double descent
  • Requires $d \gg n$ (overparameterization) and a heavy, flat-ish tail spectrum
  • Linear theory provides intuition; deep network theory is still incomplete
  • Linear theory provides intuition; deep network theory is still incomplete

Exercises

ExerciseCore

Problem

Consider a covariance matrix $\Sigma$ with eigenvalues $\lambda_1 = 10$, $\lambda_2 = 5$, and $\lambda_j = 0.01$ for $j = 3, \ldots, 1000$. Compute both effective ranks $r_2(\Sigma)$ and $R_2(\Sigma)$ of the tail (eigenvalues beyond the top 2). Is this a benign overfitting regime for $n = 100$ samples?

ExerciseAdvanced

Problem

Now consider a covariance with power-law decay: $\lambda_j = j^{-\alpha}$ for $j = 1, \ldots, d$ with $d = 10000$ and $n = 100$. For what values of $\alpha$ is overfitting benign? Compute the variance rank $R_{10}(\Sigma)$ for $\alpha \in \{0.5, 1.0, 2.0\}$.

ExerciseResearch

Problem

The benign overfitting theory for linear regression requires the noise component $X^\top (X X^\top)^{-1} \epsilon$ to be small in prediction norm. For a neural network in the neural tangent kernel (NTK) regime, the analogous object is $\Phi^\top (\Phi \Phi^\top)^{-1} \epsilon$, where $\Phi$ is the feature matrix from the NTK. What conditions on the NTK spectrum would you need for benign overfitting, and why might these conditions be harder to verify than in the linear case?

References

Canonical:

  • Bartlett, Long, Lugosi, Tsigler, "Benign Overfitting in Linear Regression" (PNAS 117(48), 2020, arXiv 1906.11300). Definition 1 introduces the two effective ranks $r_k(\Sigma)$ and $R_k(\Sigma)$; Theorem 4 gives the bias/variance decomposition used on this page.
  • Belkin, Hsu, Ma, Mandal, "Reconciling Modern Machine Learning Practice and the Classical Bias-Variance Trade-Off" (PNAS 116(32), 2019). Original empirical double-descent evidence.

Current:

  • Tsigler and Bartlett, "Benign Overfitting in Ridge Regression" (JMLR 24, 2023). Extends the analysis to ridge and gives matching lower bounds.
  • Chatterji and Long, "Finite-Sample Analysis of Interpolating Linear Classifiers in the Overparameterized Regime" (JMLR 22, 2021). Logistic / max-margin classification analogue.
  • Mallinar, Simon, Abedsoltan, Pandit, Belkin, Nakkiran, "Benign, Tempered, or Catastrophic: Toward a Refined Taxonomy of Overfitting" (NeurIPS 2022, arXiv 2207.06569). Argues most real overfitting is tempered, not benign.
  • Hastie, Montanari, Rosset, Tibshirani, "Surprises in High-Dimensional Ridgeless Least Squares Interpolation" (Annals of Statistics 50(2), 2022). Asymptotic risk of the min-norm interpolator under proportional asymptotics.

Spectral assumptions on natural data:

  • Simoncelli and Olshausen, "Natural Image Statistics and Neural Representation" (Annual Review of Neuroscience 24, 2001). Establishes that natural image power spectra follow a power law roughly $1/f^\alpha$ with $\alpha \approx 1.6$ to $2.0$, not exponential decay.

Background on adversarial "catastrophic overfitting":

  • Wong, Rice, Kolter, "Fast is Better than Free: Revisiting Adversarial Training" (ICLR 2020, arXiv 2001.03994). Uses "catastrophic overfitting" for a distinct phenomenon in single-step adversarial training, not for the trichotomy of this page.

Next Topics

The natural next steps from benign overfitting:

  • Neural tangent kernel: the infinite-width regime where neural networks become kernel machines and benign overfitting analysis can be extended
  • Double descent: the full generalization curve that benign overfitting explains in the overparameterized regime

Last reviewed: April 2026
