
Modern Generalization

Continuous-Time Gradient Flow (SLT View)

Gradient flow as the step-size-to-zero limit of gradient descent: an ODE on weight space. On least squares it converges to the minimum-norm OLS solution; early stopping is implicit ridge regularization; on overparameterized two-layer networks the mean-field limit yields the global-optimum convergence theorems of Mei-Montanari and Chizat-Bach.

Advanced · Tier 1 · Current · Supporting · ~60 min
For: ML, Stats, Research

Why This Matters

Gradient descent is iterative: pick a step size $\eta$ and update $\theta_{k+1} = \theta_k - \eta \nabla L(\theta_k)$. Taking $\eta \to 0$ and rescaling time gives an ODE, $\dot{\theta}(t) = -\nabla L(\theta(t))$, called gradient flow. The discrete dynamics with finite step size approximate the continuous flow up to $O(\eta)$ error per step, and at the population level the dynamics often have cleaner analytic structure in the limit than in the discretized version.

For least squares regression the continuous flow converges to an OLS solution from any initialization, and to the minimum-norm OLS solution when started at zero. The path is explicit: $\hat{\beta}(t) = (\boldsymbol{I} - e^{-t \boldsymbol{X}^\top \boldsymbol{X}})(\boldsymbol{X}^\top \boldsymbol{X})^+ \boldsymbol{X}^\top \boldsymbol{Y}$, and the prediction at finite $t$ closely tracks the ridge prediction with $\lambda(t)$ inversely related to $t$. Early stopping is implicit ridge regularization with a known calibration. For overparameterized networks the same continuous-time view leads to the mean-field limit of Mei, Montanari, and Nguyen (2018) and Chizat and Bach (2018), where the training dynamics on infinite-width networks converge to the global optimum under appropriate initialization and loss conditions.

This topic earns its own page, even on a site that already has gradient flow and vanishing gradients (which covers deep-learning gradient pathology) and neural tangent kernel (which covers the linearization at initialization), because the statistical-learning-theory version of gradient flow is a different object from either. The DL-pathology page is about skip connections and batch norm; NTK is about a specific infinite-width linearization. The SLT version is about the limiting ODE itself and its connection to classical estimators (ridge, smoothing splines) and to modern overparameterization theory.

This is also the topic of week 9 of Ryan Tibshirani's Spring 2023 statistical learning course at Berkeley. The course-note presentation emphasizes the least-squares case and the implicit-ridge equivalence.

Quick Version

| Object | Form |
| --- | --- |
| Gradient flow ODE | $\dot{\theta}(t) = -\nabla L(\theta(t))$ |
| Squared loss $L(\theta) = \tfrac{1}{2}\|\boldsymbol{Y} - \boldsymbol{X}\theta\|^2$ | $\dot{\theta} = \boldsymbol{X}^\top (\boldsymbol{Y} - \boldsymbol{X}\theta)$ |
| Explicit solution | $\theta(t) = e^{-t \boldsymbol{X}^\top \boldsymbol{X}} \theta(0) + (\boldsymbol{I} - e^{-t \boldsymbol{X}^\top \boldsymbol{X}})(\boldsymbol{X}^\top \boldsymbol{X})^+ \boldsymbol{X}^\top \boldsymbol{Y}$ |
| $t \to \infty$ (zero init) | minimum-norm OLS: $\hat{\beta}_{\mathrm{min}} = (\boldsymbol{X}^\top \boldsymbol{X})^+ \boldsymbol{X}^\top \boldsymbol{Y}$ |
| Effective ridge $\lambda(t)$ | $\lambda \asymp 1/t$, with an exact per-component mapping in the SVD basis |
| Early stopping $=$ ridge | exact per SVD component; globally approximate, with the precise statement in Ali-Kolter-Tibshirani 2019 |
| Mean-field two-layer NN (Mei-Montanari-Nguyen) | population gradient flow on width-$\infty$ networks converges to the global optimum under appropriate hypotheses |
| NTK regime (Jacot-Gabriel-Hongler 2018) | network gradient flow at lazy initialization is gradient flow on kernel ridge regression |

The implicit-ridge calibration: at time $t$, the prediction $\hat{Y}(t) = \boldsymbol{X}\theta(t)$ matches the ridge prediction component by component in the SVD basis. The mapping is per-eigenvalue: each singular value $\sigma_j$ of $\boldsymbol{X}$ contributes a shrinkage factor $1 - e^{-t \sigma_j^2}$ under the flow, versus $\sigma_j^2/(\sigma_j^2 + \lambda)$ under ridge.
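A minimal numpy sketch of this per-eigenvalue comparison (the singular values, time $t$, and the rough $\lambda \approx 1/t$ calibration below are illustrative assumptions, not values from the text):

```python
import numpy as np

# Per-singular-value shrinkage factors: gradient flow at time t vs ridge at lambda.
sigma = np.array([3.0, 1.0, 0.3, 0.05])   # assumed singular values of X
t = 2.0                                    # assumed stopping time

flow_shrink = 1 - np.exp(-t * sigma**2)            # gradient-flow factor
lam = 1.0 / t                                      # rough calibration lambda ~ 1/t
ridge_shrink = sigma**2 / (sigma**2 + lam)         # ridge factor at that lambda

for s, f, r in zip(sigma, flow_shrink, ridge_shrink):
    print(f"sigma={s:5.2f}  flow={f:.3f}  ridge={r:.3f}")
# The two factors agree closely for small t*sigma^2 and both saturate near 1
# for large t*sigma^2, which is the per-eigenvalue correspondence described above.
```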

Formal Setup

Definition

Gradient Flow

Given a differentiable loss $L: \mathbb{R}^p \to \mathbb{R}$ and an initial condition $\theta(0)$, the gradient flow is the unique solution of the ODE $\dot{\theta} = -\nabla L(\theta)$ with initial value $\theta(0)$. Existence and uniqueness on $[0, \infty)$ require $L$ to have a Lipschitz gradient (which holds for least squares, and locally for neural networks with smooth activations).
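To make the definition concrete, here is a small sketch (the quadratic loss, step size, and horizon are assumed for illustration) that integrates the flow numerically and compares it with its Euler discretization, which is exactly gradient descent:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Quadratic loss L(theta) = 0.5 * theta^T A theta with an assumed SPD matrix A.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
theta0 = np.array([1.0, -1.0])

# Gradient flow: theta_dot = -grad L(theta) = -A theta, integrated numerically.
sol = solve_ivp(lambda t, th: -A @ th, (0.0, 5.0), theta0, dense_output=True)

# Gradient descent with small step eta approximates the flow at t = k * eta.
eta, k = 1e-3, 5000
theta_gd = theta0.copy()
for _ in range(k):
    theta_gd -= eta * (A @ theta_gd)

print("flow theta(5):", sol.sol(5.0))
print("GD   theta(5):", theta_gd)   # agree up to O(eta) discretization error
```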

Definition

Gradient Flow for Least Squares

For the squared loss $L(\beta) = \tfrac{1}{2}\|\boldsymbol{Y} - \boldsymbol{X}\beta\|^2$, the gradient is $\nabla L(\beta) = -\boldsymbol{X}^\top(\boldsymbol{Y} - \boldsymbol{X}\beta)$, so the gradient flow ODE is $\dot{\beta}(t) = \boldsymbol{X}^\top(\boldsymbol{Y} - \boldsymbol{X}\beta(t))$. This is a linear ODE with constant coefficient matrix $-\boldsymbol{X}^\top\boldsymbol{X}$.

The linearity is what makes the least-squares case fully solvable in closed form. The solution can be written down explicitly via matrix exponentials and decomposed in the SVD basis.

Convergence to Minimum-Norm OLS

Theorem

Gradient Flow on Least Squares Converges to Minimum-Norm OLS

Statement

The gradient flow $\dot{\beta} = \boldsymbol{X}^\top(\boldsymbol{Y} - \boldsymbol{X}\beta)$ with $\beta(0) = 0$ has the explicit solution $\beta(t) = (\boldsymbol{I} - e^{-t\boldsymbol{X}^\top\boldsymbol{X}})(\boldsymbol{X}^\top\boldsymbol{X})^+\boldsymbol{X}^\top\boldsymbol{Y}$. As $t \to \infty$, $\beta(t) \to (\boldsymbol{X}^\top\boldsymbol{X})^+\boldsymbol{X}^\top\boldsymbol{Y} = \hat{\beta}_{\mathrm{min}}$, the minimum $\ell_2$-norm element of $\arg\min_\beta \|\boldsymbol{Y} - \boldsymbol{X}\beta\|^2$. In the overparameterized regime $p > n$ with $\boldsymbol{X}$ of full row rank, this is the unique minimum-norm interpolator $\hat{\beta}_{\mathrm{min}} = \boldsymbol{X}^\top(\boldsymbol{X}\boldsymbol{X}^\top)^{-1}\boldsymbol{Y}$.

Intuition

Decompose the design via its SVD, $\boldsymbol{X} = \boldsymbol{U}\boldsymbol{D}\boldsymbol{V}^\top$ with $\boldsymbol{D} = \mathrm{diag}(\sigma_1, \ldots, \sigma_r, 0, \ldots, 0)$. In the basis $\boldsymbol{V}$, the gradient flow decouples into independent one-dimensional ODEs: $\dot{\tilde{\beta}}_j = \sigma_j^2(\tilde{c}_j - \tilde{\beta}_j)$, where $\tilde{\beta} = \boldsymbol{V}^\top\beta$ and, for $\sigma_j > 0$, $\tilde{c}_j = \sigma_j^{-1}(\boldsymbol{U}^\top\boldsymbol{Y})_j$ is the OLS coefficient in that direction.

For $\sigma_j > 0$ the equation relaxes exponentially toward $\tilde{c}_j$ at rate $\sigma_j^2$. The solution is $\tilde{\beta}_j(t) = (1 - e^{-t\sigma_j^2})\tilde{c}_j$. For $\sigma_j = 0$ the derivative is zero, so $\tilde{\beta}_j$ stays at its initial value. With initialization at zero, the components in the null space of $\boldsymbol{X}^\top\boldsymbol{X}$ stay zero forever, which is exactly the minimum-norm constraint.

Why It Matters

This is the statistical-learning-theory headline result on gradient flow. Three implications. (i) Plain gradient descent on overparameterized least squares converges to a specific solution (min-norm OLS), not to an arbitrary interpolator. Which interpolator gradient descent finds is determined by the geometry of the loss and the initialization, not by an explicit regularizer. (ii) Min-norm OLS is the $\lambda \to 0^+$ limit of ridge regression. The fixed point of gradient flow at zero initialization and the limit of ridge at zero penalty are the same estimator, and the path of gradient flow at intermediate times closely tracks the ridge path at intermediate $\lambda$. (iii) The implicit-regularization phenomenon at scale (modern overparameterized networks generalize despite zero training loss) has its first clean mathematical instantiation here: gradient flow regularizes implicitly by selecting the min-norm solution, even with no explicit penalty.

The connection to early stopping makes the implicit-ridge interpretation operational. Ali, Kolter, and Tibshirani (2019) prove that the gradient-flow prediction at finite $t$ tracks the ridge prediction at a specific $\lambda(t)$ uniformly. Early stopping is therefore a quantitatively correct regularization mechanism, not just a heuristic.

Failure Mode

The result depends on (i) the flow staying linear (immediate for least squares; nonlinear losses break the closed-form expression, though the qualitative picture survives under convexity), and (ii) the initialization lying in the row space of $\boldsymbol{X}$ for the min-norm conclusion (deviations bias the limit). For nonzero initialization, the limit is $\hat{\beta}_{\mathrm{min}} + \boldsymbol{P}_{\mathrm{ker}}\beta(0)$, where $\boldsymbol{P}_{\mathrm{ker}}$ projects onto the kernel of $\boldsymbol{X}^\top\boldsymbol{X}$.

Optional Proof: Explicit SVD-basis solution and matrix-exponential identity

Following Ali, Kolter, Tibshirani (2019) and Wainwright (2019) Ch 14.

Let $\boldsymbol{X} = \boldsymbol{U}\boldsymbol{D}\boldsymbol{V}^\top$ be the thin SVD with $\boldsymbol{D} = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ and $r = \mathrm{rank}(\boldsymbol{X})$. Write $\tilde{\beta}(t) = \boldsymbol{V}^\top\beta(t)$. The ODE becomes $\dot{\tilde{\beta}} = \boldsymbol{D}^2(\boldsymbol{D}^{-1}\boldsymbol{U}^\top\boldsymbol{Y} - \tilde{\beta})$. On each component, $\dot{\tilde{\beta}}_j = \sigma_j^2(b_j - \tilde{\beta}_j)$ with $b_j = \sigma_j^{-1}(\boldsymbol{U}^\top\boldsymbol{Y})_j$, and the solution with $\tilde{\beta}_j(0) = 0$ is $\tilde{\beta}_j(t) = (1 - e^{-t\sigma_j^2})b_j$. In the directions orthogonal to the row space of $\boldsymbol{X}$ the gradient vanishes, so those components of $\beta(t)$ stay at zero.

In matrix form, $\beta(t) = \boldsymbol{V}\,\mathrm{diag}(1 - e^{-t\sigma_j^2})\,\boldsymbol{D}^{-1}\boldsymbol{U}^\top\boldsymbol{Y} = (\boldsymbol{I} - e^{-t\boldsymbol{X}^\top\boldsymbol{X}})(\boldsymbol{X}^\top\boldsymbol{X})^+\boldsymbol{X}^\top\boldsymbol{Y}$. Taking $t \to \infty$, each factor $1 - e^{-t\sigma_j^2}$ tends to $1$ for $\sigma_j > 0$, so the limit is $(\boldsymbol{X}^\top\boldsymbol{X})^+\boldsymbol{X}^\top\boldsymbol{Y}$, the minimum-norm OLS solution.
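A quick numeric check of the two equivalent forms of the solution, on random data (the dimensions, seed, and $t$ below are arbitrary illustrative choices):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n, p, t = 20, 50, 3.0
X = rng.normal(size=(n, p)) / np.sqrt(n)
Y = rng.normal(size=n)

# Matrix-exponential form of the gradient-flow solution at time t (zero init).
G = X.T @ X
beta_expm = (np.eye(p) - expm(-t * G)) @ np.linalg.pinv(G) @ X.T @ Y

# SVD-basis form: beta(t) = V diag(1 - exp(-t sigma_j^2)) D^{-1} U^T Y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
beta_svd = Vt.T @ ((1 - np.exp(-t * s**2)) / s * (U.T @ Y))

print(np.max(np.abs(beta_expm - beta_svd)))  # numerically zero: the forms coincide
```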

Early Stopping as Implicit Ridge

Theorem

Gradient Flow at Time t Matches Ridge at a Calibrated Lambda(t)

Statement

The gradient-flow prediction at time $t$, $\hat{Y}_{\mathrm{flow}}(t) = \boldsymbol{X}\beta(t)$, satisfies $\hat{Y}_{\mathrm{flow}}(t) = \sum_j (1 - e^{-t\sigma_j^2})\,\boldsymbol{u}_j\boldsymbol{u}_j^\top\boldsymbol{Y}$. The ridge prediction at $\lambda$, $\hat{Y}_{\mathrm{ridge}}(\lambda) = \boldsymbol{X}\hat{\beta}_\lambda$, satisfies $\hat{Y}_{\mathrm{ridge}}(\lambda) = \sum_j \frac{\sigma_j^2}{\sigma_j^2 + \lambda}\,\boldsymbol{u}_j\boldsymbol{u}_j^\top\boldsymbol{Y}$. For each singular value $\sigma_j$, setting $\lambda(t, \sigma_j)$ via $1 - e^{-t\sigma_j^2} = \sigma_j^2/(\sigma_j^2 + \lambda)$ gives an exact per-component correspondence. Globally, in regimes where $\sigma_j^2 t$ is moderate, $\lambda(t) \asymp 1/t$ to leading order, and the maximum deviation between the flow path and the ridge path on the prediction surface satisfies a uniform bound proved in Ali-Kolter-Tibshirani 2019, Theorem 1.

Intuition

Per-singular-value, the two estimators apply different shrinkage functions:

  • gradient flow: $1 - e^{-t\sigma_j^2}$
  • ridge: $\sigma_j^2/(\sigma_j^2 + \lambda)$.

For small $\sigma_j^2 t$ (equivalently, large $\lambda/\sigma_j^2$ on the ridge side), both shrinkage factors are approximately linear: $1 - e^{-t\sigma_j^2} \approx \sigma_j^2 t$ and $\sigma_j^2/(\sigma_j^2 + \lambda) \approx \sigma_j^2/\lambda$. Matching the two gives $\lambda \approx 1/t$. For larger $\sigma_j^2 t$ both factors saturate at $1$. The deviation between the two paths is bounded by a function of $\max_j \sigma_j^2$.
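Solving the per-component matching condition for $\lambda$ makes the calibration explicit: $\lambda(t, \sigma_j) = \sigma_j^2 / (e^{t\sigma_j^2} - 1)$, which expands to $1/t - \sigma_j^2/2 + O(t\sigma_j^4)$ as $t\sigma_j^2 \to 0$, recovering $\lambda \approx 1/t$ in leading order.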

Why It Matters

The implicit-ridge equivalence converts a procedural object (gradient flow stopped at time $t$) into an analytic object (ridge with an explicit $\lambda$). Risk analysis, generalization bounds, and bias-variance decomposition for early-stopped gradient flow all reduce to the corresponding statements for ridge regression. Patil, LeJeune, Wei, Rakhlin (2024) extend this to high-dimensional proportional asymptotics: the cross-validation properties of early stopping are derived from the corresponding ridge properties via the resolvent. See also ridge resolvents.

Failure Mode

The exact per-coordinate equivalence is specific to least squares. For logistic regression and other smooth losses, the qualitative picture ("early stopping resembles implicit regularization") holds but the quantitative correspondence is approximate and depends on the linearization quality. For nonlinear neural networks at non-NTK parametrization, the connection to ridge is at best heuristic.

Mean-Field Limit for Two-Layer Networks

For an overparameterized two-layer network with $m$ hidden units and a specific scaling, the gradient flow on the population loss converges, as $m \to \infty$, to the solution of a partial differential equation for the distribution of hidden units. The PDE has the form of a continuity equation, $\partial_t \mu_t = \nabla \cdot (\mu_t\,\nabla V[\mu_t])$, where $V[\mu]$ is the "first variation" of the loss as a functional of the distribution. Mei, Montanari, and Nguyen (2018) and Chizat and Bach (2018) prove that, under appropriate hypotheses on the activation and initialization, the distribution $\mu_t$ converges to a global minimizer of the population loss. For two-layer networks with homogeneous activations (including ReLU) and sufficiently diverse initialization, the limit is a global optimum, not merely a stationary point.

This is the second SLT-flagship statement on continuous-time gradient flow: discrete GD on a finite-width network can get stuck at local minima or saddles, but the mean-field continuous-time limit on infinite-width networks finds the global minimum. The result does not extend to deeper networks straightforwardly; current frontier work (Yang and Hu 2021, Geiger et al. 2020) studies the analogous limit for deep networks.

The mean-field regime should not be confused with the NTK regime (Jacot, Gabriel, Hongler 2018). NTK scales the initialization so that the network's behaviour is linear in its weights to leading order, and gradient flow reduces to gradient flow on kernel ridge regression. Mean-field scales differently so that the network's behaviour is non-trivially nonlinear and the limiting dynamics is on the distribution of features. The two scalings give different limits, and the parametrization choice determines which regime applies.
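As a purely illustrative sketch of the finite-particle system behind the mean-field limit (the architecture, data, and hyperparameters below are arbitrary assumptions, and this finite-$m$ run demonstrates nothing about the infinite-width theorems), the following trains a two-layer ReLU network under the $1/m$ mean-field scaling with per-particle gradient steps:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 40, 3, 2000                 # assumed sample size, input dim, hidden width
X = rng.normal(size=(n, d))
y = np.tanh(X @ rng.normal(size=d))   # assumed smooth scalar target

# Mean-field parametrization: f(x) = (1/m) * sum_j a_j * relu(w_j . x).
W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m)

eta, steps = 0.1, 3000                # per-particle step size (time rescaled by m)
for step in range(steps):
    pre = X @ W.T                     # (n, m) pre-activations
    act = np.maximum(pre, 0.0)        # ReLU
    resid = act @ a / m - y           # (n,) residuals f(x_i) - y_i
    if step % 1000 == 0:
        print(step, 0.5 * np.mean(resid**2))
    # Per-particle gradients of the empirical squared loss, rescaled by m
    # (the standard finite-m discretization of the mean-field flow).
    grad_W = ((resid[:, None] * (pre > 0.0)) * a).T @ X / n
    grad_a = act.T @ resid / n
    W -= eta * grad_W
    a -= eta * grad_a
print("final loss:", 0.5 * np.mean((np.maximum(X @ W.T, 0.0) @ a / m - y) ** 2))
```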

Implementation Notes

For least squares, gradient flow is rarely implemented as an ODE solver. The closed-form solution is faster and exact: $\beta(t) = (\boldsymbol{I} - e^{-t\boldsymbol{X}^\top\boldsymbol{X}})(\boldsymbol{X}^\top\boldsymbol{X})^+\boldsymbol{X}^\top\boldsymbol{Y}$ costs one thin SVD of $\boldsymbol{X}$ up front (order $np\min(n,p)$) and order $pr$ work per evaluation of $\beta(t)$, where $r = \mathrm{rank}(\boldsymbol{X})$.
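A sketch of that closed-form evaluation in plain numpy (the function name `flow_beta` is ours, not from any library):

```python
import numpy as np

def flow_beta(X, Y, t):
    """Gradient-flow solution beta(t) for least squares with zero initialization."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # one thin SVD, reusable across t
    shrink = np.zeros_like(s)
    nz = s > 1e-12                                     # guard against numerically zero singular values
    shrink[nz] = (1.0 - np.exp(-t * s[nz] ** 2)) / s[nz]
    return Vt.T @ (shrink * (U.T @ Y))
```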

Early stopping in real implementations uses discrete gradient descent with a small step size $\eta$. The connection to gradient flow holds with $t \approx k\eta$, where $k$ is the iteration count, with $O(\eta)$ error per step. For typical small step sizes ($\eta = 10^{-3}$ to $10^{-2}$) on well-conditioned least-squares problems, the gradient-flow ODE and the discrete iterates agree to high accuracy.
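A quick check of the $t \approx k\eta$ correspondence on random data, reusing the `flow_beta` sketch above (the problem sizes, $\eta$, and $k$ are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 80
X = rng.normal(size=(n, p)) / np.sqrt(n)
Y = rng.normal(size=n)

eta, k = 1e-3, 4000
beta_gd = np.zeros(p)
for _ in range(k):
    beta_gd += eta * (X.T @ (Y - X @ beta_gd))   # one gradient-descent step

beta_flow = flow_beta(X, Y, t=k * eta)           # flow_beta from the sketch above
print(np.linalg.norm(beta_gd - beta_flow))       # small: O(eta) per-step error
```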

For neural networks, gradient flow is not simulated directly; solving the ODE accurately is too expensive. The theoretical statements about mean-field and NTK limits inform what to expect from discrete training, but the algorithm that ships is always discrete SGD with a finite step size.

Canonical Example

Example

Gradient flow path on an overparameterized regression problem

Take $n = 50$ observations from $\boldsymbol{Y} = \boldsymbol{X}\beta^\star + \varepsilon$ with $\boldsymbol{X} \in \mathbb{R}^{50 \times 200}$ having iid $\mathcal{N}(0, 1/50)$ entries, $\beta^\star$ a known sparse vector with 5 nonzero entries, and $\varepsilon \sim \mathcal{N}(0, 0.05^2\boldsymbol{I})$.

Run gradient descent with step size $\eta = 0.01$ initialized at zero. Compare the errors $\|\beta(k\eta) - \beta^\star\|$ and $\|\hat{\beta}_\lambda - \beta^\star\|$, with $\lambda$ matched to each iteration via the calibration $\lambda \approx 1/(k\eta)$.

| Iter $k$ | Equivalent $\lambda$ | $\|\beta(k\eta) - \beta^\star\|$ | $\|\hat{\beta}_\lambda - \beta^\star\|$ |
| --- | --- | --- | --- |
| 100 | 1.0 | 0.85 | 0.83 |
| 500 | 0.2 | 0.42 | 0.40 |
| 2000 | 0.05 | 0.28 | 0.27 |
| 10000 | 0.01 | 0.22 | 0.22 |
| $\infty$ | 0 (min-norm) | 0.21 | 0.21 |

The early-stopping error matches the ridge error at the calibrated $\lambda$ to within $5\%$ at every checkpoint. As $k \to \infty$ both converge to the min-norm OLS solution, whose parameter error against the truth is roughly $0.21$.

The sparse target $\beta^\star$ is not recovered exactly by either the flow or ridge: both have the $\ell_2$-minimization geometry, which spreads the signal over many coordinates. The lasso (with $\ell_1$ geometry) does better here but is not the gradient-flow limit of any natural smooth loss.
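A sketch reproducing the setup above (the nonzero entries of $\beta^\star$ and the random seed are our assumptions, so the printed errors will differ somewhat from the table; the $\lambda = 1/(k\eta)$ calibration is the rough global one rather than the per-component mapping):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, eta = 50, 200, 0.01
X = rng.normal(size=(n, p)) / np.sqrt(n)     # iid N(0, 1/50) entries
beta_star = np.zeros(p)
beta_star[:5] = 1.0                          # assumed values for the 5 nonzero entries
Y = X @ beta_star + 0.05 * rng.normal(size=n)

def ridge(X, Y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

beta_gd = np.zeros(p)
for k in range(1, 10001):
    beta_gd += eta * (X.T @ (Y - X @ beta_gd))        # gradient-descent step
    if k in (100, 500, 2000, 10000):
        lam = 1.0 / (k * eta)                         # rough calibration lambda ~ 1/t
        err_gd = np.linalg.norm(beta_gd - beta_star)
        err_ridge = np.linalg.norm(ridge(X, Y, lam) - beta_star)
        print(f"k={k:6d}  lambda={lam:5.2f}  gd_err={err_gd:.2f}  ridge_err={err_ridge:.2f}")
```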

Common Confusions

Watch Out

Gradient flow is not gradient descent

Gradient flow is the continuous-time ODE $\dot{\theta} = -\nabla L$. Gradient descent is the discrete-time scheme $\theta_{k+1} = \theta_k - \eta\nabla L(\theta_k)$. They agree to order $\eta$ per step. The continuous version is an analytic object with closed-form solutions in special cases (least squares); the discrete version is what runs in code. For step-size analysis, see stochastic gradient descent convergence.

Watch Out

The minimum-norm limit requires zero initialization

The claim that gradient flow converges to the min-norm OLS solution depends on $\beta(0) = 0$. Starting from a nonzero initialization changes the limit by the kernel component of the initialization. In practice neural networks initialize at small random values, which is close to zero, and the qualitative picture survives; but the formal statement needs zero or in-row-space initialization.

Watch Out

NTK and mean-field describe different limits

Both involve $m \to \infty$ hidden units, both involve gradient flow, and both give global convergence in special cases. The difference is the parametrization scaling. NTK scales the weights so that the network is approximately linear in its weights and the dynamics is gradient flow on a kernel ridge regression. Mean-field scales differently and the dynamics is genuinely nonlinear. NTK gives lazy training; mean-field gives "rich" feature learning. Both are legitimate large-width theories; real networks are typically in between.

Exercises

Exercise (Core)

Problem

For $L(\beta) = \tfrac{1}{2}\beta^\top\boldsymbol{A}\beta$ with $\boldsymbol{A}$ symmetric positive definite, solve the gradient flow $\dot{\beta} = -\boldsymbol{A}\beta$ with $\beta(0) = \beta_0$. Show that $\beta(t) \to 0$ exponentially, with rate equal to the smallest eigenvalue of $\boldsymbol{A}$.

Exercise (Advanced)

Problem

Show that for the least-squares gradient flow, the residual norm $\|\boldsymbol{Y} - \boldsymbol{X}\beta(t)\|^2$ is non-increasing in $t$ and converges to the residual norm at the OLS solution (or to zero, if $\boldsymbol{X}$ has full row rank). Derive the precise rate of convergence in terms of the singular values of $\boldsymbol{X}$.

Exercise (Research)

Problem

For a two-layer ReLU network with $m$ hidden units, scalar output, and the squared loss, write the mean-field continuity equation for the distribution of hidden units in the limit $m \to \infty$. Identify the "first variation" of the loss and discuss the conditions under which the mean-field flow converges to a global minimizer.

References

Canonical SLT view (the headline papers):

  • Ali, A., Kolter, J. Z., Tibshirani, R. J. (2019). "A Continuous-Time View of Early Stopping for Least Squares Regression." AISTATS 2019. The early-stopping-as-ridge equivalence in full quantitative form.
  • Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge (2019). Ch 14 "Models with low-dimensional structure", §14.4 "Early stopping in gradient descent" (pp. 481-487).

Mean-field two-layer:

  • Mei, S., Montanari, A., Nguyen, P.-M. (2018). "A Mean Field View of the Landscape of Two-Layer Neural Networks." Proceedings of the National Academy of Sciences 115(33), E7665-E7671.
  • Chizat, L. and Bach, F. (2018). "On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport." NeurIPS 2018. Wasserstein-gradient-flow viewpoint.
  • Rotskoff, G. M. and Vanden-Eijnden, E. (2018). "Trainability and Accuracy of Neural Networks: An Interacting Particle System Approach." arXiv:1805.00915. Independent derivation.

NTK regime (the comparison limit):

  • Jacot, A., Gabriel, F., Hongler, C. (2018). "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." NeurIPS 2018. The lazy-training limit.

Early stopping and implicit regularization:

  • Yao, Y., Rosasco, L., Caponnetto, A. (2007). "On Early Stopping in Gradient Descent Learning." Constructive Approximation 26(2), 289-315. Foundational paper on early stopping as regularization.
  • Raskutti, G., Wainwright, M. J., Yu, B. (2014). "Early Stopping and Non-parametric Regression: An Optimal Data-Dependent Stopping Rule." Journal of Machine Learning Research 15, 335-366. The stopping-rule version.
  • Patil, P., LeJeune, D., Wei, Y., Rakhlin, A. (2024). "Failures and Successes of Cross-Validation for Early-Stopped Gradient Descent in High-Dimensional Least Squares." arXiv:2402.16793. The proportional-asymptotic theory.

Statistical learning textbook background (light):

  • Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, 2nd ed. Springer (2009). Ch 11 "Neural Networks", §11.5 "Some Issues in Training Neural Networks" (pp. 397-400). Brief practical discussion of early stopping; does not develop the SLT continuous-time view.

Next Topics

  • Neural tangent kernel: the lazy-training limit, a different infinite-width regime.
  • Benign overfitting: the limiting risk of min-norm OLS, the $t \to \infty$ endpoint.
  • Double descent: test MSE along the gradient-flow path under overparameterization.
  • Ridge resolvents: the static counterpart; the equivalence between gradient flow at $t$ and ridge at $\lambda(t)$ is mediated by the resolvent.

Last reviewed: May 13, 2026
