

Implicit Bias and Modern Generalization

Why classical generalization theory breaks for overparameterized models: the random labels experiment, the interpolation threshold, implicit bias of gradient descent, double descent, and the frontier of understanding why deep learning works.

Advanced · Tier 1 · Current · ~90 min

Why This Matters

Implicit bias of gradient descent: among all interpolating solutions, GD finds the minimum-norm one

Figure: contours of $\|w\|^2$ in the $(w_1, w_2)$ parameter plane; the interpolation manifold (train loss = 0) contains many solutions, and the GD path from the initialization $w(0)$ ends at the min-norm solution, the point of the manifold on the smallest norm contour. Many interpolating solutions exist. GD picks the minimum-norm one.

This is arguably the most important open question in machine learning theory: why do overparameterized models generalize?

Classical statistical learning theory (VC dimension, Rademacher complexity, uniform convergence) says that models with more parameters than training examples should overfit catastrophically. Neural networks routinely have millions or billions of parameters, trained on far fewer examples, yet they generalize well to unseen data. The bounds predicted by classical theory are not just loose. They are vacuous, predicting generalization gaps larger than 1 (meaningless for bounded loss).

Something fundamental is missing from the classical picture. This topic explores what we know about that missing piece.

Mental Model

Classical theory says: fitting the training data perfectly (interpolation) is dangerous. More model complexity means more overfitting risk. The optimal strategy is to find the sweet spot where the model is complex enough to capture the signal but simple enough not to memorize the noise.

Modern deep learning says: fit the training data perfectly (zero training loss), use a model with far more parameters than data points, and somehow generalize well anyway. This contradicts the classical story completely.

The resolution involves two key ideas:

  1. Not all interpolating solutions are equal. Among the infinitely many ways to interpolate the data, gradient descent finds a particular one with special properties (minimum norm, maximum margin).
  2. The bias-variance tradeoff has a second act. Beyond the classical overfitting peak, test error can decrease again as model complexity increases further.

The Experiment That Changed Everything

Zhang et al. (2017): Random Labels

The experiment is simple and devastating:

  1. Take a standard image classification dataset (CIFAR-10, ImageNet).
  2. Replace all labels with random labels (uniformly random class assignments, completely independent of the images).
  3. Train a standard deep network (ResNet, Inception) on this randomized data.

Result: The network achieves zero training error: it perfectly memorizes the random label assignments. Training takes longer than with true labels, but the network eventually fits every single random label.

Why this is devastating: The network has enough capacity to memorize pure noise. Any complexity measure that depends only on the hypothesis class $\mathcal{H}$ (like VC dimension) must be at least as large as the training set size $n$; otherwise the class could not shatter $n$ points with arbitrary labels. But if the complexity measure is $\geq n$, then the generalization bound $O(\sqrt{\text{complexity}/n})$ is at least $O(1)$, which is vacuous.

The same network architecture that memorizes random labels also generalizes well on real labels. The hypothesis class is the same in both cases. So the hypothesis class alone cannot explain generalization; something about the interaction between the data and the algorithm must be responsible.
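The flavor of the experiment can be reproduced in miniature with an overparameterized linear model standing in for a deep network (an illustrative sketch, not the original setup; sizes and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 500                 # far more parameters than examples
X = rng.standard_normal((n, d))
y = 2.0 * rng.integers(0, 2, size=n) - 1.0     # labels are pure noise (+/-1)

# Min-norm interpolator: what GD from zero init finds on the square loss
w = np.linalg.pinv(X) @ y

train_err = np.mean(np.sign(X @ w) != y)
print("train error:", train_err)               # 0.0 -- the noise is memorized

# Fresh inputs with fresh random labels: accuracy is at chance
X_new = rng.standard_normal((1000, d))
y_new = 2.0 * rng.integers(0, 2, size=1000) - 1.0
print("test error:", np.mean(np.sign(X_new @ w) != y_new))   # ~0.5
```

Zero training error on noise plus chance-level test error is exactly the combination that makes hypothesis-class-only complexity measures vacuous.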

Where Classical Theory Breaks

Definition

Vacuous Bound

A generalization bound is vacuous if it predicts a generalization gap larger than the trivial maximum. For loss bounded in $[0, 1]$, a bound predicting a gap $> 1$ provides no information. You already knew the gap was at most 1 without any theory.

For a typical deep network trained on CIFAR-10:

  • VC dimension of the hypothesis class: at least $\Omega(n)$ (since it can fit random labels on $n$ points). Plugging into the standard $\sqrt{\mathrm{VC}\log n / n}$ bound gives $O(\sqrt{\log n})$, which exceeds 1 for all practical $n$ and is therefore vacuous.
  • Rademacher complexity: for the full hypothesis class (all functions representable by the network), also $\Omega(1)$: vacuous.
  • Parameter counting: billions of parameters, thousands or millions of training examples. Classical bounds that grow with the number of parameters are completely uninformative.

The issue is not that these bounds are loose by a constant factor. They are qualitatively wrong. They cannot distinguish between a network that memorizes random labels and one that learns genuine patterns.
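The arithmetic behind the first bullet is easy to check directly (a sketch; the CIFAR-10 training-set size $n = 50{,}000$ is the only assumed figure):

```python
import math

# A class that fits random labels on n points has VC dimension >= n.
# Plug VC = n into the sqrt(VC * log n / n) bound shape:
n = 50_000                                # CIFAR-10 training-set size
bound = math.sqrt(n * math.log(n) / n)    # simplifies to sqrt(log n)
print(f"predicted generalization gap >= {bound:.2f}")   # ~3.3, but the gap is at most 1
```

For any bounded loss the true gap cannot exceed 1, so a predicted gap above 3 carries no information at all.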

The Interpolation Threshold

Definition

Interpolation Threshold

The interpolation threshold is the model complexity at which the model first becomes expressive enough to perfectly fit (interpolate) the training data, achieving zero training error. For a linear model with $d$ features and $n$ data points, the interpolation threshold is approximately $d = n$.

Classical theory focuses on the regime $d < n$ (underparameterized). In this regime, the bias-variance tradeoff behaves as expected:

  • Low complexity ($d \ll n$): high bias, low variance. Underfitting.
  • Increasing complexity: bias decreases, variance increases. Test error improves then worsens.
  • At the interpolation threshold ($d \approx n$): variance explodes. The model overfits maximally. Test error is worst here.

Classical theory says: stop here. Do not increase complexity beyond this point. Regularize.

But what happens if you keep going?

The Implicit Bias of Gradient Descent

When the model has more parameters than data points ($d > n$), there are infinitely many parameter vectors that perfectly interpolate the training data. Gradient descent does not find an arbitrary interpolating solution. It finds a specific one.

Theorem

Implicit Bias of GD for Linear Models

Statement

Consider the linear regression problem $\min_w \|Xw - y\|^2$ where $X \in \mathbb{R}^{n \times d}$ with $d > n$ and $X$ has rank $n$ (so the training data can be perfectly interpolated). Gradient descent initialized at $w_0 = 0$ converges to:

$$w^* = X^T(XX^T)^{-1}y = \arg\min_{w: Xw = y} \|w\|_2$$

This is the minimum $\ell_2$-norm interpolating solution, i.e. the pseudoinverse solution $w^* = X^\dagger y$.

Intuition

Gradient descent always moves in the direction of $-\nabla L = -X^T(Xw - y)$, which lies in the row space of $X$. Since $w_0 = 0$ and every update is in the row space of $X$, the final solution $w^*$ lies entirely in the row space of $X$. Among all solutions to $Xw = y$, the one in the row space of $X$ is the one with minimum norm (because any component in the null space of $X$ would add to the norm without affecting the predictions).

Proof Sketch

The gradient is $\nabla L(w) = X^T(Xw - y)$, which lies in the column space of $X^T$, i.e. the row space of $X$. Since $w_0 = 0$ and every gradient step adds a vector in the row space of $X$:

$$w_t = -\sum_{s=0}^{t-1} \eta_s X^T(Xw_s - y) \in \mathrm{rowspace}(X)$$

By induction, $w_t$ stays in the row space for all $t$. As $t \to \infty$ (with appropriate step size), $w_t$ converges to the unique element of $\{w : Xw = y\} \cap \mathrm{rowspace}(X)$, which is $X^\dagger y$, the minimum-norm solution.
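The sketch can be checked numerically in a few lines of numpy (dimensions, seed, step size, and iteration count below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 100                       # overparameterized: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Gradient descent on ||Xw - y||^2 from zero initialization
w = np.zeros(d)
lr = 1.0 / np.linalg.norm(X, ord=2) ** 2   # step size below the stability limit
for _ in range(1000):
    w -= lr * X.T @ (X @ w - y)

# The pseudoinverse (minimum-norm) interpolator
w_minnorm = np.linalg.pinv(X) @ y

print(np.allclose(X @ w, y, atol=1e-6))      # GD interpolates the data
print(np.allclose(w, w_minnorm, atol=1e-6))  # and lands on the min-norm solution
```

With any nonzero initialization component in the null space of $X$, the second check would fail: GD preserves that component forever, which is why zero initialization matters for the theorem.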

Why It Matters

This theorem reveals that gradient descent has a built-in preference, an implicit regularizer, even without any explicit regularization term. The algorithm does not just find any interpolating solution; it finds the one with smallest norm. For linear models, minimum norm corresponds to maximum smoothness, which promotes generalization. This implicit regularization is invisible to classical theory, which only analyzes the hypothesis class and ignores the algorithm.

Failure Mode

This result is exact only for linear models and gradient descent from zero initialization. For nonlinear models (neural networks), the implicit bias is more complex and depends on the architecture, activation functions, initialization scheme, and learning rate. The min-norm characterization does not directly apply to deep networks, but the qualitative insight, that GD finds "simple" interpolators, appears to hold more broadly.

Implicit Bias for Classification

For linearly separable binary classification with logistic loss, gradient descent exhibits a different but equally remarkable implicit bias. Assume the data is linearly separable (no bias term, through-origin classifier) and the step size is small enough to guarantee convergence. Gradient descent on the logistic loss $\sum_i \log(1 + \exp(-y_i w^T x_i))$ converges in direction (not magnitude, since the loss approaches zero as $\|w\| \to \infty$) to the max-margin classifier:

$$\frac{w_t}{\|w_t\|} \to \frac{w_{\mathrm{SVM}}}{\|w_{\mathrm{SVM}}\|}$$

where $w_{\mathrm{SVM}} = \arg\max_{w: y_i w^T x_i \geq 1} \|w\|^{-1}$ is the hard-margin SVM solution. Soudry et al. (2018) work in the through-origin setting with no bias term; the general form with bias $b$ requires $y_i(w^T x_i + b) \geq 1$ and a separate analysis.

The convergence in direction is very slow: Soudry, Hoffer, Nacson, Gunasekar, Srebro (JMLR 2018, arXiv:1710.10345, Theorem 3) prove a rate of $O(1/\log t)$. In practice this means the implicit bias takes an enormous number of iterations to fully manifest, even though the training loss drops rapidly.

This means gradient descent on logistic regression implicitly solves the SVM problem without any explicit margin maximization. The regularization is entirely a consequence of the optimization dynamics.
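A small numpy sketch of this directional convergence on a hand-built four-point separable dataset (the data, learning rate, and iteration count are illustrative choices; per the $O(1/\log t)$ rate, the match is close but not exact):

```python
import numpy as np

# Through-origin separable data: (2,0),(0,1) labeled +1 and their mirrors labeled -1.
# The hard-margin SVM (min ||w|| s.t. y_i w.x_i >= 1) gives w = (1/2, 1),
# so the max-margin direction is (1, 2)/sqrt(5).
X = np.array([[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
lr = 0.1
for _ in range(100_000):
    margins = y * (X @ w)
    # gradient of sum_i log(1 + exp(-margin_i))
    grad = -(X.T * y) @ (1.0 / (1.0 + np.exp(margins)))
    w -= lr * grad

direction = w / np.linalg.norm(w)
svm_direction = np.array([1.0, 2.0]) / np.sqrt(5.0)
print("cosine to max-margin direction:", direction @ svm_direction)  # close to 1
```

Even after 100,000 steps the cosine similarity is high but not exactly 1, illustrating how slowly the logarithmic convergence in direction plays out while the training loss itself is already near zero.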

The Double Descent Curve

Proposition

Double Descent in Random-Features Linear Regression

Statement

In random-features linear regression with i.i.d. label noise of variance $\sigma^2$, under the Marchenko-Pastur spectral limit (with $n, d \to \infty$ and $d/n \to \gamma$), the expected test risk of the minimum-norm interpolator exhibits a peak at the interpolation threshold $d = n$ (i.e., $\gamma = 1$) and decreases again for $d > n$. Concretely, the variance component of the risk is of order:

$$\text{variance} \sim \frac{\sigma^2 d}{n - d} \quad (d < n), \qquad \text{variance} \sim \frac{\sigma^2 n}{d - n} \quad (d > n),$$

which diverges as $d \to n$ from either side. These explicit formulas require i.i.d. isotropic Gaussian features. For structured features with non-isotropic covariance, Bartlett, Long, Lugosi, Tsigler (2020) show that the divergence at $d = n$ can be mild or even absent: the shape of the risk curve depends on the spectrum of $\Sigma$. This is the linear-model instance of the "double descent" shape (Hastie, Montanari, Rosset, Tibshirani 2022; Bartlett, Long, Lugosi, Tsigler 2020). A broader empirical double descent for deep networks (Belkin et al. 2019, Nakkiran et al. 2020) is documented but is an empirical observation, not a consequence of this result.

Intuition

At $d = n$, the feature matrix $X$ becomes nearly singular almost surely. The minimum-norm interpolator inverts small singular values, amplifying noise. Away from $d = n$, either there is spare data ($d < n$, OLS is stable) or spare parameters ($d > n$, the noise component is spread across many directions and each direction contributes less to predictions).

Proof Sketch

Sketch of the linear case only. Let $X \in \mathbb{R}^{n \times d}$ have i.i.d. rows and let $\hat{w}$ be OLS when $d < n$ and the minimum-norm interpolator $X^\dagger y$ when $d > n$. Decompose the expected test risk into bias$^2$ and variance. The variance is

$$\mathbb{E}\bigl[\|\hat{w} - \mathbb{E}[\hat{w}]\|_\Sigma^2\bigr] = \sigma^2 \operatorname{tr}\bigl(\Sigma \cdot \mathrm{Cov}(\hat{w})\bigr),$$

which under the Marchenko-Pastur law has a closed form dominated by $\sigma^2 \gamma / |1 - \gamma|$ for isotropic features. This diverges at $\gamma = 1$ and decays on both sides. Bias behaves smoothly through $\gamma = 1$ under mild assumptions, so the risk shape is dictated by the variance. See Hastie et al. (2022), Theorems 1-3, for the full derivation; this sketch covers only the linear case, not deep networks.

Why It Matters

This result is a provable instance of risk non-monotonicity under the minimum-norm interpolator. It disproves the universal claim that more parameters must harm generalization and provides a concrete mechanism (variance driven by ill-conditioning at $d = n$) that can be stated, proved, and checked.

Failure Mode

The closed-form result is specific to linear models with i.i.d. Gaussian-like features in the asymptotic Marchenko-Pastur regime. It does not prove double descent for neural networks. Empirical double-descent curves for deep networks (width-wise, epoch-wise, sample-wise) are observations that extend this picture by analogy, not by theorem. Regularization (ridge, weight decay, early stopping) flattens or removes the interpolation peak even in the linear case.
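Under the stated isotropic-Gaussian assumptions, the peak at $d = n$ is easy to exhibit numerically (sizes, signal, noise level, and trial count below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, sigma = 50, 400, 1.0
w_true = np.zeros(D)
w_true[:5] = 1.0                 # signal lives in the first 5 coordinates

def avg_test_risk(d, trials=30):
    """Average test MSE of the least-squares / min-norm fit using d features."""
    out = []
    for _ in range(trials):
        X = rng.standard_normal((n, D))
        y = X @ w_true + sigma * rng.standard_normal(n)
        w_hat = np.linalg.pinv(X[:, :d]) @ y       # OLS if d < n, min-norm if d > n
        X_test = rng.standard_normal((500, D))
        out.append(np.mean((X_test[:, :d] @ w_hat - X_test @ w_true) ** 2))
    return float(np.mean(out))

risks = {d: avg_test_risk(d) for d in [10, 40, 50, 60, 200]}
for d, r in risks.items():
    print(d, round(r, 2))        # risk spikes near the threshold d = n = 50
```

The sweep reproduces the linear double-descent shape: moderate risk well below the threshold, a sharp spike near $d = n$ where the feature matrix is nearly singular, and lower risk again deep in the overparameterized regime.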

Grokking

Grokking is an empirical phenomenon (not a theorem) of delayed generalization. Training accuracy reaches 100% early, test accuracy stays at chance, and then, long after training loss has converged, test accuracy transitions sharply to near 100%. The gap between memorization and generalization can span many orders of magnitude in training steps.

Originally reported by Power, Burda, Edwards, Babuschkin, and Misra (2022), "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets" (arXiv:2201.02177), on small algorithmic tasks such as modular arithmetic. The effect typically requires weight decay. Without weight decay, the network tends to stay in the memorization regime indefinitely.

The standard interpretation is a slow transition from a memorization circuit to a generalization circuit. Weight decay continues to shrink parameters after training loss has saturated, and this drift eventually favors a lower-norm circuit that happens to implement the true function. Nanda, Chan, Lieberum, Smith, and Steinhardt (2023), "Progress Measures for Grokking via Mechanistic Interpretability" (arXiv:2301.05217), gave a concrete mechanistic account for modular addition, identifying a discrete Fourier circuit that forms during the grokking transition and tracks it with interpretable progress measures.

Grokking is a data point, not a law. It is observed reliably on small algorithmic datasets and intermittently on other tasks. The mechanism it illustrates (implicit bias continuing to act after training loss plateaus) is consistent with the broader implicit-bias picture above, but the quantitative transition time is not predicted by current theory.

Benign Overfitting: A Preview

The phenomenon of interpolation without harm is called benign overfitting. It occurs when:

  1. The model perfectly interpolates the training data (including noise).
  2. Despite memorizing noise, the model still generalizes well on test data.

This happens when the "noise component" of the interpolating solution is spread thinly across many dimensions, corrupting the predictions only slightly. The conditions for benign overfitting have been worked out precisely for linear regression by Bartlett, Long, Lugosi, Tsigler, "Benign Overfitting in Linear Regression" (PNAS 2020). The precise conditions are stated in terms of two effective-rank quantities of the covariance $\Sigma$ with eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots$:

$$r_k(\Sigma) = \frac{\sum_{i > k} \lambda_i}{\lambda_{k+1}}, \qquad R_k(\Sigma) = \frac{\left(\sum_{i > k} \lambda_i\right)^2}{\sum_{i > k} \lambda_i^2}.$$

The first, $r_k$, captures how concentrated the leading signal directions are (small $r_k$ means a fast-decaying head). The second, $R_k$, measures how spread out the tail is (large $R_k$ means the tail has many comparable small eigenvalues). Benign overfitting requires $r_k(\Sigma)$ small relative to $n$ (the head is low-dimensional so signal is learnable) and $R_k(\Sigma)$ large (the tail absorbs noise across many directions). We do not derive these conditions here; the point is that benign overfitting has precise spectral conditions that can be checked.
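The two quantities are cheap to compute for any given spectrum; a sketch contrasting two hypothetical eigenvalue profiles (the numbers are illustrative, not from the paper):

```python
import numpy as np

def effective_ranks(eigs, k):
    """r_k and R_k of Bartlett et al. (2020) for eigenvalues sorted descending."""
    tail = eigs[k:]                                  # lambda_{k+1}, lambda_{k+2}, ...
    r_k = tail.sum() / tail[0]
    R_k = tail.sum() ** 2 / (tail ** 2).sum()
    return r_k, R_k

k, p = 5, 1000
# Fast-decaying tail: the spectrum is concentrated in a few directions.
fast = 2.0 ** -np.arange(p)
# Flat tail after a strong head: many comparable small eigenvalues.
flat = np.concatenate([np.array([100.0, 50.0, 25.0, 12.0, 6.0]), np.full(p - 5, 0.01)])

print("fast tail:", effective_ranks(fast, k))   # both ranks small
print("flat tail:", effective_ranks(flat, k))   # both ~ p - k = 995: tail spread wide
```

A geometrically decaying tail yields small effective ranks (the tail cannot absorb noise), while a long flat tail yields $R_k$ on the order of the tail dimension, the regime where interpolated noise is spread across many directions.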

What This Means for Practice

The implicit bias perspective suggests a shift in how we think about generalization:

Old view (classical): Generalization ≈ complexity control. Restrict the hypothesis class. Add regularization. Use the simplest model that fits the data.

New view (modern): Generalization ≈ algorithm + architecture + data. The optimization algorithm (GD/SGD) implicitly selects "simple" solutions from a vast hypothesis class. The architecture (convolutions, skip connections) builds in useful inductive biases. The data structure (natural images have low intrinsic dimension) enables good interpolation.

This does not mean classical theory is wrong. It is correct but incomplete. It accurately describes the underparameterized regime. The new theory extends the picture to the overparameterized regime where modern deep learning lives.

Common Confusions

Watch Out

Implicit bias does not mean GD always finds the best solution

The minimum-norm interpolator is not always the best predictor. In some settings, explicit regularization (ridge regression with carefully tuned $\lambda$) outperforms the implicit minimum-norm bias of GD. Implicit bias is a description of what GD does, not a guarantee that what it does is optimal. The remarkable empirical observation is that for deep networks on natural data, the implicit bias often works surprisingly well.

Watch Out

Double descent is about model complexity, not just parameters

The x-axis in the double descent curve is "effective model complexity," not raw parameter count. You can observe double descent by varying width, depth, training epochs (epoch-wise double descent), or even the number of data points. The interpolation threshold is the critical regime where the model barely has enough capacity to fit the data. This is where overfitting peaks.

Watch Out

Overparameterization does not always help

Double descent shows that more parameters can help, not that they always do. With terrible architecture choices, poor initialization, or adversarial data, overparameterization can fail. The theory describes behavior under specific conditions (gradient descent, reasonable architecture, natural data distributions). It is not a blank check to make models arbitrarily large.

Watch Out

Classical bounds are not wrong. They are just not tight

VC dimension and Rademacher complexity bounds are mathematically correct. They provide valid upper bounds on generalization error. The problem is that these bounds are extremely loose for overparameterized models. They give vacuous results because they only depend on the hypothesis class and ignore the algorithm. Tighter bounds (PAC-Bayes, compression-based, algorithm-dependent) can sometimes give non-vacuous results, but even these are far from explaining the full picture.

Summary

  • Zhang et al. (2017): networks can memorize random labels, proving that classical complexity measures (VC, Rademacher) are vacuous for deep learning
  • The interpolation threshold ($d \approx n$) is where overfitting is worst
  • Gradient descent has implicit bias: for linear models, it finds the minimum-norm interpolator (regression) or max-margin classifier (logistic)
  • Double descent: test error decreases, increases (classical peak), then decreases again past interpolation
  • Benign overfitting: interpolating noise is harmless when the noise component is spread across many small eigenvalue directions
  • Generalization in modern ML depends on the algorithm-architecture-data interaction, not just hypothesis class complexity

Exercises

ExerciseCore

Problem

Consider a linear regression problem with $X \in \mathbb{R}^{n \times d}$ where $d = 2n$ (twice as many features as samples). Assume $X$ has full row rank. Write down the minimum-norm interpolating solution $w^*$ and explain why $\|w^*\|$ tends to be smaller when $d$ is larger (all else being equal).

ExerciseAdvanced

Problem

Explain the double descent phenomenon in terms of the bias-variance decomposition. Specifically, for the minimum-norm interpolator in linear regression with random features:

(a) Why does variance diverge as $d \to n$ from below? (b) Why does variance decrease as $d$ increases past $n$? (c) What happens to bias in the overparameterized regime?

ExerciseAdvanced

Problem

Why does the random labels experiment of Zhang et al. invalidate uniform convergence bounds (VC, Rademacher) as an explanation for generalization of deep networks? Be precise about what the experiment proves and what it does not prove.

References

Canonical:

  • Zhang, Bengio, Hardt, Recht, Vinyals, "Understanding Deep Learning Requires Rethinking Generalization" (ICLR 2017, arXiv:1611.03530)
  • Belkin, Hsu, Ma, Mandal, "Reconciling modern machine learning practice and the bias-variance trade-off" (PNAS 2019)
  • Bartlett, Long, Lugosi, Tsigler, "Benign overfitting in linear regression" (PNAS 2020)

Current:

  • Nakkiran, Kaplun, Bansal, Yang, Barak, Sutskever, "Deep Double Descent" (ICLR 2020)
  • Soudry, Hoffer, Nacson, Gunasekar, Srebro, "The implicit bias of gradient descent on separable data" (JMLR 2018)
  • Power, Burda, Edwards, Babuschkin, Misra, "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets" (arXiv:2201.02177, 2022)
  • Nanda, Chan, Lieberum, Smith, Steinhardt, "Progress Measures for Grokking via Mechanistic Interpretability" (ICLR 2023, arXiv:2301.05217)
  • Chizat, Oyallon, Bach, "On Lazy Training in Differentiable Programming" (NeurIPS 2019, arXiv:1812.07956)
  • Zhang, Bengio, Hardt, Recht, Vinyals, "Understanding Deep Learning (Still) Requires Rethinking Generalization" (Communications of the ACM 2021)
  • Nagarajan, Kolter, "Uniform Convergence May Be Unable to Explain Generalization in Deep Learning" (NeurIPS 2019, arXiv:1902.04742)

Next Topics

The natural next steps from this topic:

  • Double descent: detailed analysis of the double descent curve, epoch-wise double descent, and the role of regularization
  • Benign overfitting: precise conditions under which interpolation is harmless, and the eigenvalue decay requirements
  • Neural tangent kernel: the infinite-width limit where neural networks become kernel machines, connecting deep learning to classical kernel theory

Last reviewed: April 2026
