Modern Generalization
Continuous-Time Gradient Flow (SLT View)
Gradient flow as the step-size-to-zero limit of gradient descent: an ODE on weight space. On least squares it converges to the minimum-norm OLS solution; early stopping is implicit ridge regularization; on overparameterized two-layer networks the mean-field limit yields the global-optimum convergence theorems of Mei-Montanari and Chizat-Bach.
Prerequisites
Why This Matters
Gradient descent is iterative: pick a step size $\eta$ and update $\beta_{k+1} = \beta_k - \eta\,\nabla L(\beta_k)$. Taking $\eta \to 0$ and re-scaling time as $t = k\eta$ gives an ODE, $\dot\beta(t) = -\nabla L(\beta(t))$, called gradient flow. The discrete dynamics with finite step size approximate the continuous flow up to $O(\eta^2)$ error per step, and at the population level the dynamics often have cleaner analytic structure in the limit than in the discretized version.
For least squares regression the continuous flow, initialized at zero, converges to the minimum-norm OLS solution. The path is explicit: $\beta(t) = (X^\top X)^+\bigl(I - e^{-tX^\top X/n}\bigr)X^\top y$, and the prediction at finite $t$ closely tracks the ridge prediction with $\lambda$ inversely related to $t$ (the calibration is $\lambda \approx 1/t$). Early stopping is implicit ridge regularization with a known calibration. For overparameterized networks the same continuous-time view leads to the mean-field limit of Mei, Montanari, and Nguyen (2018) and Chizat and Bach (2018), where the training dynamics on infinite-width networks converge to the global optimum under appropriate initialization and loss conditions.
The reason this earns its own page on a site that already has gradient flow and vanishing gradients (which covers deep-learning gradient pathology) and neural tangent kernel (which covers the linearization at initialization) is that the statistical-learning-theory version of gradient flow is a different object from either. The DL-pathology page is about skip connections and batch norm; NTK is about a specific infinite-width linearization. The SLT version is about the limiting ODE itself and its connection to classical estimators (ridge, smoothing splines) and to modern overparameterization theory.
This is also the topic of week 9 of Ryan Tibshirani's Spring 2023 statistical learning course at Berkeley. The course-note presentation emphasizes the least-squares case and the implicit-ridge equivalence.
Quick Version
| Object | Form |
|---|---|
| Gradient flow ODE | $\dot\beta(t) = \tfrac{1}{n}X^\top\bigl(y - X\beta(t)\bigr)$ |
| Squared loss | $L(\beta) = \tfrac{1}{2n}\lVert y - X\beta\rVert_2^2$ |
| Explicit solution | $\beta(t) = (X^\top X)^+\bigl(I - e^{-tX^\top X/n}\bigr)X^\top y$ |
| Limit $t \to \infty$ (zero init) | minimum-norm OLS: $(X^\top X)^+X^\top y$ |
| Effective ridge | $\lambda(t) \approx 1/t$; in the SVD basis each component's flow shrinkage $1 - e^{-t s_i^2/n}$ matches the ridge shrinkage $s_i^2/(s_i^2 + n\lambda)$ |
| Early stopping $\approx$ ridge | uniformly close on the prediction surface; precise statement in Ali-Kolter-Tibshirani 2019 |
| Mean-field two-layer NN (Mei-Montanari) | population gradient flow on width-$N$ networks converges, as $N \to \infty$, to a global optimum under appropriate hypotheses |
| NTK regime (Jacot-Gabriel-Hongler 2018) | network gradient flow at lazy initialization is gradient flow on kernel ridge regression |
The implicit-ridge calibration: at time $t$, the gradient-flow prediction closely tracks the prediction of ridge regression at a specific $\lambda(t) \approx 1/t$. The exact mapping is per-eigenvalue: each singular value $s_i$ of $X$ contributes a factor $1 - e^{-t s_i^2/n}$ to the flow shrinkage, versus $s_i^2/(s_i^2 + n\lambda)$ for ridge.
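A minimal numerical sketch of this per-eigenvalue comparison, assuming a small Gaussian design (the variable names and sizes below are illustrative, not from the course notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.standard_normal((n, p))
s = np.linalg.svd(X, compute_uv=False)       # singular values s_i of X

t = 5.0
lam = 1.0 / t                                # the calibration lambda(t) = 1/t

flow_shrink = 1.0 - np.exp(-t * s**2 / n)    # gradient-flow shrinkage per component
ridge_shrink = s**2 / (s**2 + n * lam)       # ridge shrinkage per component

# The two profiles agree for small and large t*s_i^2/n and deviate by at most
# about 0.2 in between (the worst case is near t*s_i^2/n ~ 2.5).
print(np.max(np.abs(flow_shrink - ridge_shrink)))
```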
Formal Setup
Gradient Flow
Given a differentiable loss $L : \mathbb{R}^p \to \mathbb{R}$ and an initial condition $\beta_0$, the gradient flow is the unique solution of the ODE $\dot\beta(t) = -\nabla L(\beta(t))$ with $\beta(0) = \beta_0$. Existence and uniqueness on $[0, \infty)$ require $L$ to have a Lipschitz gradient (which holds for least squares and for smooth-loss neural networks with suitable activations).
Gradient Flow for Least Squares
For the squared loss $L(\beta) = \frac{1}{2n}\|y - X\beta\|_2^2$, the gradient is $\nabla L(\beta) = -\frac{1}{n}X^\top(y - X\beta)$, so the gradient flow ODE is $\dot\beta(t) = \frac{1}{n}X^\top\bigl(y - X\beta(t)\bigr)$. This is a linear ODE with constant coefficient matrix $-\frac{1}{n}X^\top X$.
The linearity is what makes the least-squares case fully solvable in closed form. The solution can be written down explicitly via matrix exponentials and decomposed in the SVD basis.
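A sketch of that closed-form path in the overparameterized case; the helper name and toy data are illustrative, and the formula matches the explicit solution stated below:

```python
import numpy as np

def gradient_flow_lstsq(X, y, t):
    """Closed-form gradient-flow iterate beta(t) from zero initialization:
    beta(t) = V diag((1 - exp(-t s_i^2 / n)) / s_i) U^T y."""
    n = X.shape[0]
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = np.zeros_like(s)
    pos = s > 1e-12                                   # skip null-space directions
    shrink[pos] = (1.0 - np.exp(-t * s[pos] ** 2 / n)) / s[pos]
    return Vt.T @ (shrink * (U.T @ y))

rng = np.random.default_rng(0)
n, p = 30, 100                                        # overparameterized: p > n
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

beta_large_t = gradient_flow_lstsq(X, y, t=1e6)       # far along the flow
beta_min_norm = np.linalg.pinv(X) @ y                 # min-norm OLS
print(np.allclose(beta_large_t, beta_min_norm, atol=1e-6))   # True
```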
Convergence to Minimum-Norm OLS
Gradient Flow on Least Squares Converges to Minimum-Norm OLS
Statement
The gradient flow $\dot\beta(t) = \frac{1}{n}X^\top(y - X\beta(t))$ with $\beta(0) = 0$ has the explicit solution $\beta(t) = (X^\top X)^+\bigl(I - e^{-tX^\top X/n}\bigr)X^\top y$. As $t \to \infty$, $\beta(t) \to (X^\top X)^+X^\top y$, the minimum $\ell_2$-norm element of $\arg\min_\beta \|y - X\beta\|_2^2$. In the overparameterized regime with $X$ of full row rank, this is the unique minimum-norm interpolator $X^\top(XX^\top)^{-1}y$.
Intuition
Decompose the design in its SVD, $X = USV^\top$, with singular values $s_1 \ge s_2 \ge \cdots \ge 0$. In the basis of right singular vectors $\{v_i\}$, the gradient flow decouples into independent one-dimensional ODEs: $\dot a_i(t) = \frac{s_i}{n}\bigl(u_i^\top y - s_i\,a_i(t)\bigr)$, where $a_i(t) = v_i^\top\beta(t)$ and $u_i$ is the $i$-th left singular vector.
For $s_i > 0$ the equation is exponentially relaxing toward $u_i^\top y/s_i$ at rate $s_i^2/n$. The solution is $a_i(t) = \bigl(1 - e^{-ts_i^2/n}\bigr)\,u_i^\top y/s_i$. For $s_i = 0$ the derivative is zero, so $a_i(t)$ stays at its initial value. With initialization at zero, the components in the null space of $X$ stay zero forever, which is exactly the minimum-norm constraint.
Why It Matters
This is the statistical-learning-theory headline result on gradient flow. Three implications. (i) Plain gradient descent on overparameterized least squares converges to a specific solution (min-norm OLS), not to an arbitrary interpolator. The choice of "which" interpolator gradient descent finds is determined by the geometry of the loss and the initialization, not by an explicit regularizer. (ii) Min-norm OLS is the $\lambda \to 0^+$ limit of ridge regression. The fixed point of gradient flow at zero initialization and the limit of ridge at zero penalty are the same estimator, and the path of gradient flow at intermediate times matches the ridge path at intermediate $\lambda$. (iii) The implicit-regularization phenomenon at scale (modern overparameterized networks generalize despite zero training loss) has its first clean mathematical instantiation here: gradient flow regularizes implicitly by selecting the min-norm solution, even with no explicit penalty.
The connection to early stopping makes the implicit-ridge interpretation operational. Ali, Kolter, Tibshirani (2019) prove that the gradient-flow prediction at finite $t$ tracks the ridge prediction at the calibrated $\lambda(t) = 1/t$ uniformly. Early stopping is therefore a quantitatively correct regularization mechanism, not just a heuristic.
Failure Mode
The result depends on (i) the gradient flow staying linear (immediate for least squares; nonlinear losses break the closed-form expression but the qualitative picture survives under convexity), (ii) the initialization landing in the row space of $X$ for the min-norm conclusion (deviations bias the limit). For nonzero initialization $\beta_0$, the limit is $(X^\top X)^+X^\top y + P_{\mathrm{null}}\,\beta_0$, where $P_{\mathrm{null}}$ projects onto the kernel of $X$.
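A quick numerical check of the nonzero-initialization statement, running plain gradient descent from a random start on an illustrative problem (all sizes and names below are assumptions for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 60
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta0 = rng.standard_normal(p)                            # nonzero initialization

eta = 0.5 * n / np.linalg.norm(X, 2) ** 2                 # safe step relative to s_1^2 / n
beta = beta0.copy()
for _ in range(5000):                                     # long enough to reach the limit
    beta = beta + eta * X.T @ (y - X @ beta) / n

beta_min = np.linalg.pinv(X) @ y                          # min-norm OLS
P_null_beta0 = beta0 - np.linalg.pinv(X) @ (X @ beta0)    # kernel component of beta0
print(np.allclose(beta, beta_min + P_null_beta0, atol=1e-6))   # True
```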
Optional Proof: Explicit SVD-basis solution and matrix-exponential identity
Following Ali, Kolter, Tibshirani (2019) and Wainwright (2019) Ch 14.
Let $X = USV^\top$ be the SVD, with $U$ and $V$ orthogonal and $S$ diagonal with entries $s_1 \ge s_2 \ge \cdots \ge 0$. Write $a(t) = V^\top\beta(t)$. The ODE becomes $\dot a(t) = \frac{1}{n}S^\top\bigl(U^\top y - S\,a(t)\bigr)$. On the components where $s_i > 0$, the equation is $\dot a_i(t) = \frac{s_i^2}{n}\bigl(u_i^\top y/s_i - a_i(t)\bigr)$ with fixed point $u_i^\top y/s_i$. The solution with $a_i(0) = 0$ is $a_i(t) = \bigl(1 - e^{-ts_i^2/n}\bigr)\,u_i^\top y/s_i$. On the null-space components ($s_i = 0$), $\dot a_i(t) = 0$, so $a_i(t) \equiv 0$.
In matrix form, $\beta(t) = V\,\mathrm{diag}\bigl((1 - e^{-ts_i^2/n})/s_i\bigr)U^\top y = (X^\top X)^+\bigl(I - e^{-tX^\top X/n}\bigr)X^\top y$ (with the convention that the diagonal factor is $0$ when $s_i = 0$). Take $t \to \infty$: each factor $1 - e^{-ts_i^2/n}$ goes to $1$ for $s_i > 0$. The limit is $(X^\top X)^+X^\top y$, the minimum-norm OLS.
Early Stopping as Implicit Ridge
Gradient Flow at Time t Equals Ridge at Specific Lambda(t)
Statement
The gradient flow prediction at time $t$, $\hat y^{\mathrm{gf}}(t) = X\beta(t)$, satisfies $\hat y^{\mathrm{gf}}(t) = U\,\mathrm{diag}\bigl(1 - e^{-ts_i^2/n}\bigr)U^\top y$. The ridge prediction at $\lambda$, $\hat y^{\mathrm{ridge}}(\lambda) = X(X^\top X + n\lambda I)^{-1}X^\top y$, satisfies $\hat y^{\mathrm{ridge}}(\lambda) = U\,\mathrm{diag}\bigl(s_i^2/(s_i^2 + n\lambda)\bigr)U^\top y$. For each singular value $s_i$, setting $\lambda_i(t)$ via $1 - e^{-ts_i^2/n} = s_i^2/(s_i^2 + n\lambda_i(t))$ gives the exact correspondence. Globally, in regimes where $ts_i^2/n$ is moderate, the single choice $\lambda(t) = 1/t$ matches to leading order, and the maximum deviation between the flow path and the ridge path on the prediction surface satisfies a uniform bound proved in Ali-Kolter-Tibshirani 2019 Theorem 1.
Intuition
Per-singular-value, the two estimators apply different shrinkage functions:
- gradient flow: $1 - e^{-ts_i^2/n}$
- ridge: $s_i^2/(s_i^2 + n\lambda)$.
For small $s_i$ (or small $t$, large $\lambda$), both functions are approximately linear in $s_i^2$, with slope $t/n$ or $1/(n\lambda)$ respectively. Matching the slopes gives $\lambda = 1/t$. For larger $s_i$ both functions saturate at $1$. Under that calibration the deviation depends only on $x = ts_i^2/n$ (flow $1 - e^{-x}$ versus ridge $x/(1+x)$) and is uniformly bounded over $x \ge 0$.
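A sketch comparing the two shrinkage profiles on the prediction scale under the $\lambda = 1/t$ calibration; the toy design and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 80
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
U, s, _ = np.linalg.svd(X, full_matrices=False)

for t in [0.1, 1.0, 10.0, 100.0]:
    lam = 1.0 / t                                            # calibrated ridge penalty
    yhat_flow = U @ ((1 - np.exp(-t * s**2 / n)) * (U.T @ y))
    yhat_ridge = U @ ((s**2 / (s**2 + n * lam)) * (U.T @ y))
    gap = np.linalg.norm(yhat_flow - yhat_ridge) / np.linalg.norm(y)
    print(f"t = {t:7.1f}   lambda = {lam:6.2f}   relative prediction gap = {gap:.3f}")
```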
Why It Matters
The implicit-ridge equivalence converts a procedural object (gradient flow stopped at time ) into an analytic object (ridge with explicit ). Risk analysis, generalization bounds, and bias-variance decomposition for early-stopped gradient flow all reduce to the corresponding statements for ridge regression. Patil, LeJeune, Wei, Rakhlin (2024) extend this to high-dimensional proportional asymptotics: the cross-validation properties of early stopping are derived from the corresponding ridge properties via the resolvent. See also ridge resolvents.
Failure Mode
The exact per-coordinate equivalence is specific to least squares. For logistic regression and other smooth losses, the qualitative picture ("early stopping resembles implicit regularization") holds but the quantitative correspondence is approximate and depends on the linearization quality. For nonlinear neural networks at non-NTK parametrization, the connection to ridge is at best heuristic.
Mean-Field Limit for Two-Layer Networks
For an overparameterized two-layer network with $N$ hidden units, $f(x) = \frac{1}{N}\sum_{j=1}^N a_j\,\sigma(w_j^\top x)$, and the mean-field $1/N$ scaling, the gradient flow on the population loss converges, as $N \to \infty$, to a partial differential equation on the distribution $\rho_t$ of hidden units. The PDE has the form of a continuity equation, $\partial_t\rho_t = \nabla_\theta\cdot\bigl(\rho_t\,\nabla_\theta\Psi(\theta;\rho_t)\bigr)$ with $\theta = (a, w)$, where $\Psi$ is the "first variation" of the loss as a functional of the distribution. Mei, Montanari, and Nguyen (2018) and Chizat and Bach (2018) prove that under appropriate hypotheses on the activation, the distribution converges to a global minimizer of the population loss. For two-layer ReLU networks with symmetric initialization, the limit is the actual global optimum, not just a local one.
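The infinite-width PDE is not simulated directly; what one can simulate is the finite-$N$ particle system whose empirical distribution the PDE describes. Below is a minimal sketch of that particle system under the mean-field scaling, with an assumed single-ReLU teacher and illustrative sizes (none of these specifics are from the course notes):

```python
import numpy as np

rng = np.random.default_rng(3)
d, N, n_data = 5, 200, 500                 # input dim, hidden units, samples (assumed)

X = rng.standard_normal((n_data, d))
y = np.maximum(X @ rng.standard_normal(d), 0.0)       # illustrative teacher

# Mean-field parameterization: f(x) = (1/N) sum_j a_j * relu(w_j . x).
W = rng.standard_normal((N, d))
a = rng.standard_normal(N)

eta = 0.2
for _ in range(2000):                      # forward-Euler discretization of the flow
    pre = X @ W.T                          # (n_data, N) pre-activations
    act = np.maximum(pre, 0.0)
    resid = act @ a / N - y                # f(x_i) - y_i
    grad_a = act.T @ resid / (n_data * N)
    grad_W = ((resid[:, None] * (pre > 0)) * a).T @ X / (n_data * N)
    a -= eta * N * grad_a                  # mean-field time scaling: particles move
    W -= eta * N * grad_W                  # at rate N times the raw gradient

mse = np.mean((np.maximum(X @ W.T, 0.0) @ a / N - y) ** 2)
print(mse)                                 # training MSE, well below its initial value
```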
This is the second SLT-flagship statement on continuous-time gradient flow: discrete GD on a finite-width network can get stuck at local minima or saddles, but the mean-field continuous-time limit on infinite-width networks finds the global minimum. The result does not extend to deeper networks straightforwardly; current frontier work (Yang and Hu 2021, Geiger et al. 2020) studies the analogous limit for deep networks.
The mean-field regime should not be confused with the NTK regime (Jacot, Gabriel, Hongler 2018). NTK scales the initialization so that the network's behaviour is linear in its weights to leading order, and gradient flow reduces to gradient flow on kernel ridge regression. Mean-field scales differently so that the network's behaviour is non-trivially nonlinear and the limiting dynamics is on the distribution of features. The two scalings give different limits, and the parametrization choice determines which regime applies.
Implementation Notes
For least squares, gradient flow is rarely implemented by numerically integrating the ODE. The closed-form solution is faster and exact: it costs $O(np\min(n,p))$ via one SVD and $O(p\,r)$ per evaluation of $\beta(t)$, where $r = \operatorname{rank}(X)$.
Early stopping in real implementations uses discrete gradient descent with a small step size $\eta$. The connection to gradient flow holds with $t = k\eta$, where $k$ is the iteration count, with $O(\eta^2)$ error per step. For practically small $\eta$ the gradient-flow ODE and the discrete dynamics agree to four significant figures.
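A check of the $t = k\eta$ calibration on a toy problem (names and sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 30
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

eta, k = 1e-3, 5000
t = k * eta

# Discrete gradient descent on L(beta) = ||y - X beta||^2 / (2n), zero init.
beta_gd = np.zeros(p)
for _ in range(k):
    beta_gd = beta_gd + eta * X.T @ (y - X @ beta_gd) / n

# Continuous-time gradient flow evaluated at t = k * eta.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_flow = Vt.T @ (((1 - np.exp(-t * s**2 / n)) / s) * (U.T @ y))

print(np.max(np.abs(beta_gd - beta_flow)))   # small for small eta
```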
For neural networks, simulation of gradient flow per se is not done; the ODE is too expensive. The theoretical statements about mean-field and NTK limits inform what to expect from discrete training but the algorithm shipped is always discrete SGD with finite step size.
Canonical Example
Gradient flow path on an overparameterized regression problem
Take $n$ observations from $y = X\beta^* + \varepsilon$ with iid $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, $\beta^*$ a known sparse vector with 5 nonzero entries, and $p > n$.
Run gradient descent with step size $\eta$, initialized at zero. Compare the risk of the iterate at step $k$ with that of ridge regression at the $\lambda$ matching each iteration via the per-coordinate calibration $\lambda = 1/(k\eta)$.
| Iteration $k$ | Equivalent $\lambda$ | GD MSE | Ridge MSE |
|---|---|---|---|
| 100 | 1.0 | 0.85 | 0.83 |
| 500 | 0.2 | 0.42 | 0.40 |
| 2000 | 0.05 | 0.28 | 0.27 |
| 10000 | 0.01 | 0.22 | 0.22 |
| $\infty$ (min-norm) | $0$ | 0.21 | 0.21 |
The early-stopping risk matches the ridge risk at the calibrated $\lambda$ to within about $0.02$ at every checkpoint. As $t \to \infty$ both converge to the min-norm OLS solution, which has MSE roughly $0.21$ versus the truth.
The sparse target is not recovered exactly by either flow or ridge: both have the $\ell_2$-minimization geometry, which spreads the signal over many coordinates. Lasso (with $\ell_1$ geometry) does better here but is not the gradient-flow limit of any natural smooth loss.
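A sketch of the experiment; the sizes, noise level, and step size below are assumptions, so the resulting numbers will not match the table exactly, but the flow/ridge agreement along the path should:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 100, 400, 1.0                  # assumed problem sizes, p > n
beta_star = np.zeros(p)
beta_star[:5] = 2.0                          # sparse truth with 5 nonzero entries
X = rng.standard_normal((n, p))
y = X @ beta_star + sigma * rng.standard_normal(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uty = U.T @ y
eta = 0.01                                   # assumed step size, so t = 0.01 * k

for k in [100, 500, 2000, 10_000]:
    t = k * eta
    lam = 1.0 / t                            # calibrated ridge penalty
    beta_flow = Vt.T @ (((1 - np.exp(-t * s**2 / n)) / s) * Uty)
    beta_ridge = Vt.T @ ((s / (s**2 + n * lam)) * Uty)
    mse_flow = np.mean((beta_flow - beta_star) ** 2)
    mse_ridge = np.mean((beta_ridge - beta_star) ** 2)
    print(f"k = {k:6d}  lambda = {lam:5.2f}  flow MSE {mse_flow:.3f}  ridge MSE {mse_ridge:.3f}")
```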
Common Confusions
Gradient flow is not gradient descent
Gradient flow is the continuous-time ODE $\dot\beta(t) = -\nabla L(\beta(t))$. Gradient descent is the discrete-time scheme $\beta_{k+1} = \beta_k - \eta\nabla L(\beta_k)$. They agree to order $\eta^2$ per step. The continuous version is an analytic object with closed-form solutions in special cases (least squares); the discrete version is what runs in code. For step-size analysis, see stochastic gradient descent convergence.
The minimum-norm limit requires zero initialization
The claim that gradient flow converges to the min-norm OLS depends on $\beta(0) = 0$. Starting from a nonzero initialization changes the limit by the kernel component of the initialization. In practice neural networks initialize at small random values, which is close to zero, and the qualitative picture survives; but the formal statement needs zero or in-row-space initialization.
NTK and mean-field describe different limits
Both involve taking the number of hidden units to infinity, both involve gradient flow, both give global convergence in special cases. The difference is the parametrization scaling. NTK scales weights so the network is approximately linear in its weights and the dynamics is gradient flow on a kernel ridge regression. Mean-field scales differently and the dynamics is genuinely nonlinear. NTK gives lazy training; mean-field gives "rich" feature learning. Both are legitimate large-width theories; real networks are typically in between.
Exercises
Problem
For $L(\beta) = \frac{1}{2}\beta^\top A\beta - b^\top\beta$ with $A$ symmetric positive definite, solve the gradient flow $\dot\beta(t) = -\nabla L(\beta(t))$ with $\beta(0) = \beta_0$. Show that $\beta(t) \to A^{-1}b$ exponentially with rate equal to the smallest eigenvalue of $A$.
Problem
Show that for the least-squares gradient flow, the residual norm $\|y - X\beta(t)\|_2$ is non-increasing in $t$ and converges to the residual norm at the OLS solution (or zero, if $X$ has full row rank). Derive the precise rate of convergence in terms of the singular values of $X$.
Problem
For a two-layer ReLU network with $N$ hidden units, scalar output, and the squared loss, write the mean-field continuity equation for the distribution of hidden units in the limit $N \to \infty$. Identify the "first variation" of the loss and discuss the conditions under which the mean-field flow converges to a global minimizer.
References
Canonical SLT view (the headline papers):
- Ali, A., Kolter, J. Z., Tibshirani, R. J. (2019). "A Continuous-Time View of Early Stopping for Least Squares Regression." AISTATS 2019. The early-stopping-as-ridge equivalence in full quantitative form.
- Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge (2019). Ch 14 "Models with low-dimensional structure", §14.4 "Early stopping in gradient descent" (pp. 481-487).
Mean-field two-layer:
- Mei, S., Montanari, A., Nguyen, P.-M. (2018). "A Mean Field View of the Landscape of Two-Layer Neural Networks." Proceedings of the National Academy of Sciences 115(33), E7665-E7671.
- Chizat, L. and Bach, F. (2018). "On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport." NeurIPS 2018. Wasserstein-gradient-flow viewpoint.
- Rotskoff, G. M. and Vanden-Eijnden, E. (2018). "Trainability and Accuracy of Neural Networks: An Interacting Particle System Approach." arXiv:1805.00915. Independent derivation.
NTK regime (the comparison limit):
- Jacot, A., Gabriel, F., Hongler, C. (2018). "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." NeurIPS 2018. The lazy-training limit.
Early stopping and implicit regularization:
- Yao, Y., Rosasco, L., Caponnetto, A. (2007). "On Early Stopping in Gradient Descent Learning." Constructive Approximation 26(2), 289-315. Foundational paper on early stopping as regularization.
- Raskutti, G., Wainwright, M. J., Yu, B. (2014). "Early Stopping and Non-parametric Regression: An Optimal Data-Dependent Stopping Rule." Journal of Machine Learning Research 15, 335-366. The stopping-rule version.
- Patil, P., LeJeune, D., Wei, Y., Rakhlin, A. (2024). "Failures and Successes of Cross-Validation for Early-Stopped Gradient Descent in High-Dimensional Least Squares." arXiv:2402.16793. The proportional-asymptotic theory.
Statistical learning textbook background (light):
- Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, 2nd ed. Springer (2009). Ch 11 "Neural Networks", §11.5 "Some Issues in Training Neural Networks" (pp. 397-400). Brief practical discussion of early stopping; does not develop the SLT continuous-time view.
Next Topics
- Neural tangent kernel: the lazy-training limit, a different infinite-width regime.
- Benign overfitting: the limiting risk of min-norm OLS, the endpoint.
- Double descent: test MSE along the gradient-flow path under overparameterization.
- Ridge resolvents: the static counterpart; the equivalence between gradient flow at time $t$ and ridge at $\lambda \approx 1/t$ is mediated by the resolvent.
Last reviewed: May 13, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Matrix Operations and Properties (layer 0 · tier 1)
- Ridge Regression (layer 1 · tier 1)
- Stochastic Gradient Descent Convergence (layer 2 · tier 1)
- Implicit Bias and Modern Generalization (layer 4 · tier 1)
- Neural Tangent Kernel: Lazy Training, Kernel Equivalence, μP, and the Limits of Width (layer 4 · tier 1)
Derived topics
No published topic currently declares this as a prerequisite.