

Implicit Bias and Modern Generalization

Why classical generalization theory breaks for overparameterized models: the random labels experiment, the interpolation threshold, implicit bias of gradient descent, double descent, and the frontier of understanding why deep learning works.

Advanced · Tier 1 · Current · ~90 min

Why This Matters

Implicit bias of gradient descent: among all interpolating solutions, GD finds the minimum-norm one

Figure: contours of $\|w\|^2$ in the $(w_1, w_2)$ parameter plane; the interpolation manifold (train loss = 0) contains many solutions, and the GD path from the initialization $w(0)$ ends at the min-norm solution, the point of the manifold on the smallest norm contour. Many interpolating solutions exist. GD picks the minimum-norm one.

This is arguably the most important open question in machine learning theory: why do overparameterized models generalize?

Classical statistical learning theory (VC dimension, Rademacher complexity, uniform convergence) says that models with more parameters than training examples should overfit catastrophically. Neural networks routinely have millions or billions of parameters, trained on far fewer examples, yet they generalize well to unseen data. The bounds predicted by classical theory are not just loose. They are vacuous, predicting generalization gaps larger than 1 (meaningless for bounded loss).

Something fundamental is missing from the classical picture. This topic explores what we know about that missing piece.

Mental Model

Classical theory says: fitting the training data perfectly (interpolation) is dangerous. More model complexity means more overfitting risk. The optimal strategy is to find the sweet spot where the model is complex enough to capture the signal but simple enough not to memorize the noise.

Modern deep learning says: fit the training data perfectly (zero training loss), use a model with far more parameters than data points, and somehow generalize well anyway. This contradicts the classical story completely.

The resolution involves two key ideas:

  1. Not all interpolating solutions are equal. Among the infinitely many ways to interpolate the data, gradient descent finds a particular one with special properties (minimum norm, maximum margin).
  2. The bias-variance tradeoff has a second act. Beyond the classical overfitting peak, test error can decrease again as model complexity increases further.

The Experiment That Changed Everything

Zhang et al. (2017): Random Labels

The experiment is simple and devastating:

  1. Take a standard image classification dataset (CIFAR-10, ImageNet).
  2. Replace all labels with random labels (uniformly random class assignments, completely independent of the images).
  3. Train a standard deep network (ResNet, Inception) on this randomized data.

Result: The network achieves zero training error: it perfectly memorizes the random label assignments. Training takes longer than with true labels, but the network eventually fits every single random label.

Why this is devastating: The network has enough capacity to memorize pure noise. Any complexity measure that depends only on the hypothesis class $\mathcal{H}$ (like VC dimension) must be at least as large as the training set size $n$; otherwise the class could not shatter $n$ points with arbitrary labels. But if the complexity measure is $\geq n$, then the generalization bound $O(\sqrt{\text{complexity}/n})$ is at least $O(1)$, which is vacuous.

The same network architecture that memorizes random labels also generalizes well on real labels. The hypothesis class is the same in both cases. So the hypothesis class alone cannot explain generalization; something about the interaction between the data and the algorithm must be responsible.
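The flavor of the experiment can be reproduced in miniature with an overparameterized linear model standing in for a deep network (an illustrative sketch, not the original setup; sizes and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 500                 # far more parameters than examples
X = rng.standard_normal((n, d))
y = 2.0 * rng.integers(0, 2, size=n) - 1.0     # labels are pure noise (+/-1)

# Min-norm interpolator: what GD from zero init finds on the square loss
w = np.linalg.pinv(X) @ y

train_err = np.mean(np.sign(X @ w) != y)
print("train error:", train_err)               # 0.0 -- the noise is memorized

# Fresh inputs with fresh random labels: accuracy is at chance
X_new = rng.standard_normal((1000, d))
y_new = 2.0 * rng.integers(0, 2, size=1000) - 1.0
print("test error:", np.mean(np.sign(X_new @ w) != y_new))   # ~0.5
```

Zero training error on noise plus chance-level test error is exactly the combination that makes hypothesis-class-only complexity measures vacuous.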

Where Classical Theory Breaks

Definition

Vacuous Bound

A generalization bound is vacuous if it predicts a generalization gap larger than the trivial maximum. For loss bounded in $[0, 1]$, a bound predicting a gap $> 1$ provides no information. You already knew the gap was at most 1 without any theory.

For a typical deep network trained on CIFAR-10:

  • VC dimension of the hypothesis class: at least $\Omega(n)$ (since it can fit random labels on $n$ points). Plugging into the standard $\sqrt{\mathrm{VC}\log n / n}$ bound gives $O(\sqrt{\log n})$, which exceeds 1 for all practical $n$ and is therefore vacuous.
  • Rademacher complexity: for the full hypothesis class (all functions representable by the network), also $\Omega(1)$: vacuous.
  • Parameter counting: billions of parameters, thousands or millions of training examples. Classical bounds that grow with the number of parameters are completely uninformative.

The issue is not that these bounds are loose by a constant factor. They are qualitatively wrong. They cannot distinguish between a network that memorizes random labels and one that learns genuine patterns.
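The arithmetic behind the first bullet is easy to check directly (a sketch; the CIFAR-10 training-set size $n = 50{,}000$ is the only assumed figure):

```python
import math

# A class that fits random labels on n points has VC dimension >= n.
# Plug VC = n into the sqrt(VC * log n / n) bound shape:
n = 50_000                                # CIFAR-10 training-set size
bound = math.sqrt(n * math.log(n) / n)    # simplifies to sqrt(log n)
print(f"predicted generalization gap >= {bound:.2f}")   # ~3.3, but the gap is at most 1
```

For any bounded loss the true gap cannot exceed 1, so a predicted gap above 3 carries no information at all.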

The Interpolation Threshold

Definition

Interpolation Threshold

The interpolation threshold is the model complexity at which the model first becomes expressive enough to perfectly fit (interpolate) the training data, achieving zero training error. For a linear model with $d$ features and $n$ data points, the interpolation threshold is approximately $d = n$.

Classical theory focuses on the regime $d < n$ (underparameterized). In this regime, the bias-variance tradeoff behaves as expected:

  • Low complexity ($d \ll n$): high bias, low variance. Underfitting.
  • Increasing complexity: bias decreases, variance increases. Test error improves then worsens.
  • At the interpolation threshold ($d \approx n$): variance explodes. The model overfits maximally. Test error is worst here.

Classical theory says: stop here. Do not increase complexity beyond this point. Regularize.

But what happens if you keep going?

The Implicit Bias of Gradient Descent

When the model has more parameters than data points ($d > n$), there are infinitely many parameter vectors that perfectly interpolate the training data. Gradient descent does not find an arbitrary interpolating solution. It finds a specific one.

Theorem

Implicit Bias of GD for Linear Models

Statement

Consider the linear regression problem $\min_w \|Xw - y\|^2$ where $X \in \mathbb{R}^{n \times d}$ with $d > n$ and $X$ has rank $n$ (so the training data can be perfectly interpolated). Gradient descent initialized at $w_0 = 0$ converges to:

$$w^* = X^T(XX^T)^{-1}y = \arg\min_{w: Xw = y} \|w\|_2$$

This is the minimum $\ell_2$-norm interpolating solution, i.e. the pseudoinverse solution $w^* = X^\dagger y$.

Intuition

Gradient descent always moves in the direction of $-\nabla L = -X^T(Xw - y)$, which lies in the row space of $X$. Since $w_0 = 0$ and every update is in the row space of $X$, the final solution $w^*$ lies entirely in the row space of $X$. Among all solutions to $Xw = y$, the one in the row space of $X$ is the one with minimum norm (because any component in the null space of $X$ would add to the norm without affecting the predictions).

Proof Sketch

The gradient is $\nabla L(w) = X^T(Xw - y)$, which lies in the column space of $X^T$, i.e. the row space of $X$. Since $w_0 = 0$ and every gradient step adds a vector in the row space of $X$:

$$w_t = -\sum_{s=0}^{t-1} \eta_s X^T(Xw_s - y) \in \mathrm{rowspace}(X)$$

By induction, $w_t$ stays in the row space for all $t$. As $t \to \infty$ (with appropriate step size), $w_t$ converges to the unique element of $\{w : Xw = y\} \cap \mathrm{rowspace}(X)$, which is $X^\dagger y$, the minimum-norm solution.
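The sketch can be checked numerically in a few lines of numpy (dimensions, seed, step size, and iteration count below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 100                       # overparameterized: d > n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Gradient descent on ||Xw - y||^2 from zero initialization
w = np.zeros(d)
lr = 1.0 / np.linalg.norm(X, ord=2) ** 2   # step size below the stability limit
for _ in range(1000):
    w -= lr * X.T @ (X @ w - y)

# The pseudoinverse (minimum-norm) interpolator
w_minnorm = np.linalg.pinv(X) @ y

print(np.allclose(X @ w, y, atol=1e-6))      # GD interpolates the data
print(np.allclose(w, w_minnorm, atol=1e-6))  # and lands on the min-norm solution
```

With any nonzero initialization component in the null space of $X$, the second check would fail: GD preserves that component forever, which is why zero initialization matters for the theorem.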

Why It Matters

This theorem reveals that gradient descent has a built-in preference, an implicit regularizer, even without any explicit regularization term. The algorithm does not just find any interpolating solution; it finds the one with smallest norm. For linear models, minimum norm corresponds to maximum smoothness, which promotes generalization. This implicit regularization is invisible to classical theory, which only analyzes the hypothesis class and ignores the algorithm.

Failure Mode

This result is exact only for linear models and gradient descent from zero initialization. For nonlinear models (neural networks), the implicit bias is more complex and depends on the architecture, activation functions, initialization scheme, and learning rate. The min-norm characterization does not directly apply to deep networks, but the qualitative insight, that GD finds "simple" interpolators, appears to hold more broadly.

Implicit Bias for Classification

For linearly separable binary classification with logistic loss, gradient descent exhibits a different but equally remarkable implicit bias. Assume the data is linearly separable (no bias term, through-origin classifier) and the step size is small enough to guarantee convergence. Gradient descent on the logistic loss $\sum_i \log(1 + \exp(-y_i w^T x_i))$ converges in direction (not magnitude, since the loss approaches zero as $\|w\| \to \infty$) to the max-margin classifier:

$$\frac{w_t}{\|w_t\|} \to \frac{w_{\mathrm{SVM}}}{\|w_{\mathrm{SVM}}\|}$$

where $w_{\mathrm{SVM}} = \arg\max_{w: y_i w^T x_i \geq 1} \|w\|^{-1}$ is the hard-margin SVM solution. Soudry et al. (2018) work in the through-origin setting with no bias term; the general form with bias $b$ requires $y_i(w^T x_i + b) \geq 1$ and a separate analysis.

The convergence in direction is very slow: Soudry, Hoffer, Nacson, Gunasekar, Srebro (JMLR 2018, arXiv:1710.10345, Theorem 3) prove a rate of $O(1/\log t)$. In practice this means the implicit bias takes an enormous number of iterations to fully manifest, even though the training loss drops rapidly.

This means gradient descent on logistic regression implicitly solves the SVM problem without any explicit margin maximization. The regularization is entirely a consequence of the optimization dynamics.
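A small numpy sketch of this directional convergence on a hand-built four-point separable dataset (the data, learning rate, and iteration count are illustrative choices; per the $O(1/\log t)$ rate, the match is close but not exact):

```python
import numpy as np

# Through-origin separable data: (2,0),(0,1) labeled +1 and their mirrors labeled -1.
# The hard-margin SVM (min ||w|| s.t. y_i w.x_i >= 1) gives w = (1/2, 1),
# so the max-margin direction is (1, 2)/sqrt(5).
X = np.array([[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
lr = 0.1
for _ in range(100_000):
    margins = y * (X @ w)
    # gradient of sum_i log(1 + exp(-margin_i))
    grad = -(X.T * y) @ (1.0 / (1.0 + np.exp(margins)))
    w -= lr * grad

direction = w / np.linalg.norm(w)
svm_direction = np.array([1.0, 2.0]) / np.sqrt(5.0)
print("cosine to max-margin direction:", direction @ svm_direction)  # close to 1
```

Even after 100,000 steps the cosine similarity is high but not exactly 1, illustrating how slowly the logarithmic convergence in direction plays out while the training loss itself is already near zero.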

The Double Descent Curve

Proposition

Double Descent in Random-Features Linear Regression

Statement

In random-features linear regression with i.i.d. label noise of variance $\sigma^2$, under the Marchenko-Pastur spectral limit (with $n, d \to \infty$ and $d/n \to \gamma$), the expected test risk of the minimum-norm interpolator exhibits a peak at the interpolation threshold $d = n$ (i.e., $\gamma = 1$) and decreases again for $d > n$. Concretely, the variance component of the risk is of order:

$$\text{variance} \sim \frac{\sigma^2 d}{n - d} \quad (d < n), \qquad \text{variance} \sim \frac{\sigma^2 n}{d - n} \quad (d > n),$$

which diverges as $d \to n$ from either side. These explicit formulas require i.i.d. isotropic Gaussian features. For structured features with non-isotropic covariance, Bartlett, Long, Lugosi, Tsigler (2020) show that the divergence at $d = n$ can be mild or even absent: the shape of the risk curve depends on the spectrum of $\Sigma$. This is the linear-model instance of the "double descent" shape (Hastie, Montanari, Rosset, Tibshirani 2022; Bartlett, Long, Lugosi, Tsigler 2020). A broader empirical double descent for deep networks (Belkin et al. 2019, Nakkiran et al. 2020) is documented but is an empirical observation, not a consequence of this result.

Intuition

At $d = n$, the feature matrix $X$ becomes nearly singular almost surely. The minimum-norm interpolator inverts small singular values, amplifying noise. Away from $d = n$, either there is spare data ($d < n$, OLS is stable) or spare parameters ($d > n$, the noise component is spread across many directions and each direction contributes less to predictions).

Proof Sketch

Sketch of the linear case only. Let $X \in \mathbb{R}^{n \times d}$ have i.i.d. rows and let $\hat{w}$ be OLS when $d < n$ and the minimum-norm interpolator $X^\dagger y$ when $d > n$. Decompose the expected test risk into bias$^2$ and variance. The variance is

$$\mathbb{E}\bigl[\|\hat{w} - \mathbb{E}[\hat{w}]\|_\Sigma^2\bigr] = \sigma^2 \operatorname{tr}\bigl(\Sigma \cdot \mathrm{Cov}(\hat{w})\bigr),$$

which under the Marchenko-Pastur law has a closed form dominated by $\sigma^2 \gamma / |1 - \gamma|$ for isotropic features. This diverges at $\gamma = 1$ and decays on both sides. Bias behaves smoothly through $\gamma = 1$ under mild assumptions, so the risk shape is dictated by the variance. See Hastie et al. (2022), Theorems 1-3, for the full derivation; this sketch covers only the linear case, not deep networks.

Why It Matters

This result is a provable instance of risk non-monotonicity under the minimum-norm interpolator. It disproves the universal claim that more parameters must harm generalization and provides a concrete mechanism (variance driven by ill-conditioning at $d = n$) that can be stated, proved, and checked.

Failure Mode

The closed-form result is specific to linear models with i.i.d. Gaussian-like features in the asymptotic Marchenko-Pastur regime. It does not prove double descent for neural networks. Empirical double-descent curves for deep networks (width-wise, epoch-wise, sample-wise) are observations that extend this picture by analogy, not by theorem. Regularization (ridge, weight decay, early stopping) flattens or removes the interpolation peak even in the linear case.
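Under the stated isotropic-Gaussian assumptions, the peak at $d = n$ is easy to exhibit numerically (sizes, signal, noise level, and trial count below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, sigma = 50, 400, 1.0
w_true = np.zeros(D)
w_true[:5] = 1.0                 # signal lives in the first 5 coordinates

def avg_test_risk(d, trials=30):
    """Average test MSE of the least-squares / min-norm fit using d features."""
    out = []
    for _ in range(trials):
        X = rng.standard_normal((n, D))
        y = X @ w_true + sigma * rng.standard_normal(n)
        w_hat = np.linalg.pinv(X[:, :d]) @ y       # OLS if d < n, min-norm if d > n
        X_test = rng.standard_normal((500, D))
        out.append(np.mean((X_test[:, :d] @ w_hat - X_test @ w_true) ** 2))
    return float(np.mean(out))

risks = {d: avg_test_risk(d) for d in [10, 40, 50, 60, 200]}
for d, r in risks.items():
    print(d, round(r, 2))        # risk spikes near the threshold d = n = 50
```

The sweep reproduces the linear double-descent shape: moderate risk well below the threshold, a sharp spike near $d = n$ where the feature matrix is nearly singular, and lower risk again deep in the overparameterized regime.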

Grokking

Grokking is an empirical phenomenon (not a theorem) of delayed generalization. Training accuracy reaches 100% early, test accuracy stays at chance, and then, long after training loss has converged, test accuracy transitions sharply to near 100%. The gap between memorization and generalization can span many orders of magnitude in training steps.

Originally reported by Power, Burda, Edwards, Babuschkin, and Misra (2022), "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets" (arXiv:2201.02177), on small algorithmic tasks such as modular arithmetic. The effect typically requires weight decay. Without weight decay, the network tends to stay in the memorization regime indefinitely.

The standard interpretation is a slow transition from a memorization circuit to a generalization circuit. Weight decay continues to shrink parameters after training loss has saturated, and this drift eventually favors a lower-norm circuit that happens to implement the true function. Nanda, Chan, Lieberum, Smith, and Steinhardt (2023), "Progress Measures for Grokking via Mechanistic Interpretability" (arXiv:2301.05217), gave a concrete mechanistic account for modular addition, identifying a discrete Fourier circuit that forms during the grokking transition and tracks it with interpretable progress measures.

Grokking is a data point, not a law. It is observed reliably on small algorithmic datasets and intermittently on other tasks. The mechanism it illustrates (implicit bias continuing to act after training loss plateaus) is consistent with the broader implicit-bias picture above, but the quantitative transition time is not predicted by current theory.

Benign Overfitting: A Preview

The phenomenon of interpolation without harm is called benign overfitting. It occurs when:

  1. The model perfectly interpolates the training data (including noise).
  2. Despite memorizing noise, the model still generalizes well on test data.

This happens when the "noise component" of the interpolating solution is spread thinly across many dimensions, corrupting the predictions only slightly. The conditions for benign overfitting have been worked out precisely for linear regression by Bartlett, Long, Lugosi, Tsigler, "Benign Overfitting in Linear Regression" (PNAS 2020). The precise conditions are stated in terms of two effective-rank quantities of the covariance $\Sigma$ with eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots$:

$$r_k(\Sigma) = \frac{\sum_{i > k} \lambda_i}{\lambda_{k+1}}, \qquad R_k(\Sigma) = \frac{\left(\sum_{i > k} \lambda_i\right)^2}{\sum_{i > k} \lambda_i^2}.$$

The first, $r_k$, captures how concentrated the leading signal directions are (small $r_k$ means a fast-decaying head). The second, $R_k$, measures how spread out the tail is (large $R_k$ means the tail has many comparable small eigenvalues). Benign overfitting requires $r_k(\Sigma)$ small relative to $n$ (the head is low-dimensional so signal is learnable) and $R_k(\Sigma)$ large (the tail absorbs noise across many directions). We do not derive these conditions here; the point is that benign overfitting has precise spectral conditions that can be checked.
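The two quantities are cheap to compute for any given spectrum; a sketch contrasting two hypothetical eigenvalue profiles (the numbers are illustrative, not from the paper):

```python
import numpy as np

def effective_ranks(eigs, k):
    """r_k and R_k of Bartlett et al. (2020) for eigenvalues sorted descending."""
    tail = eigs[k:]                                  # lambda_{k+1}, lambda_{k+2}, ...
    r_k = tail.sum() / tail[0]
    R_k = tail.sum() ** 2 / (tail ** 2).sum()
    return r_k, R_k

k, p = 5, 1000
# Fast-decaying tail: the spectrum is concentrated in a few directions.
fast = 2.0 ** -np.arange(p)
# Flat tail after a strong head: many comparable small eigenvalues.
flat = np.concatenate([np.array([100.0, 50.0, 25.0, 12.0, 6.0]), np.full(p - 5, 0.01)])

print("fast tail:", effective_ranks(fast, k))   # both ranks small
print("flat tail:", effective_ranks(flat, k))   # both ~ p - k = 995: tail spread wide
```

A geometrically decaying tail yields small effective ranks (the tail cannot absorb noise), while a long flat tail yields $R_k$ on the order of the tail dimension, the regime where interpolated noise is spread across many directions.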

What This Means for Practice

The implicit bias perspective suggests a shift in how we think about generalization:

Old view (classical): Generalization ≈ complexity control. Restrict the hypothesis class. Add regularization. Use the simplest model that fits the data.

New view (modern): Generalization ≈ algorithm + architecture + data. The optimization algorithm (GD/SGD) implicitly selects "simple" solutions from a vast hypothesis class. The architecture (convolutions, skip connections) builds in useful inductive biases. The data structure (natural images have low intrinsic dimension) enables good interpolation.

This does not mean classical theory is wrong. It is correct but incomplete. It accurately describes the underparameterized regime. The new theory extends the picture to the overparameterized regime where modern deep learning lives.

Common Confusions

Watch Out

Implicit bias does not mean GD always finds the best solution

The minimum-norm interpolator is not always the best predictor. In some settings, explicit regularization (ridge regression with carefully tuned $\lambda$) outperforms the implicit minimum-norm bias of GD. Implicit bias is a description of what GD does, not a guarantee that what it does is optimal. The remarkable empirical observation is that for deep networks on natural data, the implicit bias often works surprisingly well.

Watch Out

Double descent is about model complexity, not just parameters

The x-axis in the double descent curve is "effective model complexity," not raw parameter count. You can observe double descent by varying width, depth, training epochs (epoch-wise double descent), or even the number of data points. The interpolation threshold is the critical regime where the model barely has enough capacity to fit the data. This is where overfitting peaks.

Watch Out

Overparameterization does not always help

Double descent shows that more parameters can help, not that they always do. With terrible architecture choices, poor initialization, or adversarial data, overparameterization can fail. The theory describes behavior under specific conditions (gradient descent, reasonable architecture, natural data distributions). It is not a blank check to make models arbitrarily large.

Watch Out

Classical bounds are not wrong. They are just not tight

VC dimension and Rademacher complexity bounds are mathematically correct. They provide valid upper bounds on generalization error. The problem is that these bounds are extremely loose for overparameterized models. They give vacuous results because they only depend on the hypothesis class and ignore the algorithm. Tighter bounds (PAC-Bayes, compression-based, algorithm-dependent) can sometimes give non-vacuous results, but even these are far from explaining the full picture.

Summary

  • Zhang et al. (2017): networks can memorize random labels, proving that classical complexity measures (VC, Rademacher) are vacuous for deep learning
  • The interpolation threshold ($d \approx n$) is where overfitting is worst
  • Gradient descent has implicit bias: for linear models, it finds the minimum-norm interpolator (regression) or max-margin classifier (logistic)
  • Double descent: test error decreases, increases (classical peak), then decreases again past interpolation
  • Benign overfitting: interpolating noise is harmless when the noise component is spread across many small eigenvalue directions
  • Generalization in modern ML depends on the algorithm-architecture-data interaction, not just hypothesis class complexity

Exercises

ExerciseCore

Problem

Consider a linear regression problem with $X \in \mathbb{R}^{n \times d}$ where $d = 2n$ (twice as many features as samples). Assume $X$ has full row rank. Write down the minimum-norm interpolating solution $w^*$ and explain why $\|w^*\|$ tends to be smaller when $d$ is larger (all else being equal).

ExerciseAdvanced

Problem

Explain the double descent phenomenon in terms of the bias-variance decomposition. Specifically, for the minimum-norm interpolator in linear regression with random features:

(a) Why does variance diverge as $d \to n$ from below? (b) Why does variance decrease as $d$ increases past $n$? (c) What happens to bias in the overparameterized regime?

ExerciseAdvanced

Problem

Why does the random labels experiment of Zhang et al. invalidate uniform convergence bounds (VC, Rademacher) as an explanation for generalization of deep networks? Be precise about what the experiment proves and what it does not prove.

References

Canonical:

  • Zhang, Bengio, Hardt, Recht, Vinyals, "Understanding Deep Learning Requires Rethinking Generalization" (ICLR 2017, arXiv:1611.03530)
  • Belkin, Hsu, Ma, Mandal, "Reconciling modern machine learning practice and the bias-variance trade-off" (PNAS 2019)
  • Bartlett, Long, Lugosi, Tsigler, "Benign overfitting in linear regression" (PNAS 2020)

Current:

  • Nakkiran, Kaplun, Bansal, Yang, Barak, Sutskever, "Deep Double Descent" (ICLR 2020)
  • Soudry, Hoffer, Nacson, Gunasekar, Srebro, "The implicit bias of gradient descent on separable data" (JMLR 2018)
  • Power, Burda, Edwards, Babuschkin, Misra, "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets" (arXiv:2201.02177, 2022)
  • Nanda, Chan, Lieberum, Smith, Steinhardt, "Progress Measures for Grokking via Mechanistic Interpretability" (ICLR 2023, arXiv:2301.05217)
  • Chizat, Oyallon, Bach, "On Lazy Training in Differentiable Programming" (NeurIPS 2019, arXiv:1812.07956)
  • Zhang, Bengio, Hardt, Recht, Vinyals, "Understanding Deep Learning (Still) Requires Rethinking Generalization" (Communications of the ACM 2021)
  • Nagarajan, Kolter, "Uniform Convergence May Be Unable to Explain Generalization in Deep Learning" (NeurIPS 2019, arXiv:1902.04742)

Next Topics

The natural next steps from this topic:

  • Double descent: detailed analysis of the double descent curve, epoch-wise double descent, and the role of regularization
  • Benign overfitting: precise conditions under which interpolation is harmless, and the eigenvalue decay requirements
  • Neural tangent kernel: the infinite-width limit where neural networks become kernel machines, connecting deep learning to classical kernel theory

Last reviewed: April 2026
