Modern Generalization
Neural Tangent Kernel
In the infinite-width limit, neural networks trained with gradient descent behave like kernel regression with a specific kernel, the Neural Tangent Kernel, connecting deep learning to classical kernel theory.
Why This Matters
The Neural Tangent Kernel is one of the most cleanly stated theoretical results in modern deep learning. It shows that in a specific limit (infinite width with NTK parameterization), training a neural network with gradient descent is mathematically equivalent to kernel regression with a particular kernel. This gave the field a rigorous connection between deep networks and classical kernel theory, with explicit convergence rates and generalization bounds.
The NTK result is also, in large part, a negative result about its own regime. The lazy-training limit it describes is precisely the regime where networks do not learn features: the representation stays at initialization. Real finite-width networks operate outside the NTK regime when feature learning matters, which is why mean-field and maximal-update (μP) parameterizations are needed to describe practical training. NTK is best understood as a clean boundary case, not an explanation of why finite networks generalize.
Mental Model
Consider a neural network $f(x; \theta)$ with parameters $\theta \in \mathbb{R}^p$. At initialization $\theta_0$, linearize the network around $\theta_0$:

$$f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0)$$

This is just a first-order Taylor expansion. The key insight of NTK theory: when the network is sufficiently wide, the parameters barely move during training (relative to their scale), so this linearization is accurate throughout training. The linearized model is a kernel method with kernel determined by the gradient features $\nabla_\theta f(\cdot; \theta_0)$.
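The linearization claim is easy to check numerically. A minimal sketch (the two-layer tanh network and the $1/\sqrt{n}$ output scaling are assumptions for illustration): perturb the parameters slightly and compare the true change in the output to the first-order Taylor prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, W, a):
    # Two-layer tanh network with NTK-style 1/sqrt(n) output scaling.
    n = a.shape[0]
    return a @ np.tanh(W @ x) / np.sqrt(n)

def grad_f(x, W, a):
    # Gradient of the scalar output w.r.t. all parameters, flattened.
    n = a.shape[0]
    h = np.tanh(W @ x)
    dW = np.outer(a * (1 - h**2), x) / np.sqrt(n)
    da = h / np.sqrt(n)
    return np.concatenate([dW.ravel(), da])

d, n = 5, 1000
W0, a0 = rng.normal(size=(n, d)), rng.normal(size=n)
x = rng.normal(size=d)

# Small parameter perturbation: compare the true output change to the
# first-order Taylor expansion around (W0, a0).
delta = 1e-3 * rng.normal(size=n * d + n)
W1 = W0 + delta[: n * d].reshape(n, d)
a1 = a0 + delta[n * d :]

true_change = f(x, W1, a1) - f(x, W0, a0)
linear_change = grad_f(x, W0, a0) @ delta
print(true_change, linear_change)  # nearly identical for small delta
```

The gap between the two numbers is the second-order Taylor remainder, which is what NTK theory shows vanishes at large width throughout training.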
The Neural Tangent Kernel
Neural Tangent Kernel
For a neural network $f(x; \theta)$ with parameters $\theta \in \mathbb{R}^p$, the Neural Tangent Kernel is:

$$K_\theta(x, x') = \nabla_\theta f(x; \theta)^\top \nabla_\theta f(x'; \theta) = \sum_{j=1}^{p} \frac{\partial f(x; \theta)}{\partial \theta_j}\, \frac{\partial f(x'; \theta)}{\partial \theta_j}$$

This is the inner product of the gradients of the network output with respect to all parameters, evaluated at inputs $x$ and $x'$.
At initialization $\theta_0$, this defines $K_{\theta_0}$. The kernel depends on the architecture (depth, width, activation function) and the initialization distribution.
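The definition translates directly to code. A minimal sketch, assuming a small two-layer ReLU network: stack the per-input parameter gradients and take inner products to form the empirical NTK Gram matrix, which is symmetric and positive semidefinite by construction.

```python
import numpy as np

rng = np.random.default_rng(1)

d, n = 3, 500
W = rng.normal(size=(n, d))
a = rng.normal(size=n)

def grad_f(x):
    # Gradient of f(x) = a^T relu(Wx)/sqrt(n) w.r.t. (W, a), flattened.
    h = np.maximum(W @ x, 0.0)
    s = (W @ x > 0).astype(float)          # relu'
    dW = np.outer(a * s, x) / np.sqrt(n)
    da = h / np.sqrt(n)
    return np.concatenate([dW.ravel(), da])

# Empirical NTK on a small batch: K[i, j] = <grad f(x_i), grad f(x_j)>.
X = rng.normal(size=(6, d))
G = np.stack([grad_f(x) for x in X])
K = G @ G.T

print(np.allclose(K, K.T))                    # symmetric by construction
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # positive semidefinite
```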
Lazy Training Regime
A neural network is in the lazy training regime if the parameters stay close to their initialization throughout training:

$$\sup_{t \ge 0} \frac{\|\theta_t - \theta_0\|}{\|\theta_0\|} \to 0 \quad \text{as the width } n \to \infty$$

In this regime, the NTK remains approximately constant, and the network dynamics are well-approximated by the linearized model. The name "lazy" reflects that the features (the gradients $\nabla_\theta f(\cdot; \theta_0)$) do not change; only the linear combination of features is learned.
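Laziness can be observed directly: train the same toy regression task at several widths under NTK-style $1/\sqrt{n}$ scaling and measure how far the parameters travel relative to their initial norm. A rough sketch (the task, widths, step size, and step count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 16
X = rng.normal(size=(m, d))
y = rng.normal(size=m)

def train_rel_movement(n, steps=200, lr=0.5):
    # Two-layer ReLU net f(x) = a^T relu(Wx)/sqrt(n), full-batch GD on
    # squared loss; returns ||theta_t - theta_0|| / ||theta_0||.
    W = rng.normal(size=(n, d)); a = rng.normal(size=n)
    W0, a0 = W.copy(), a.copy()
    for _ in range(steps):
        H = np.maximum(X @ W.T, 0.0)            # (m, n) relu features
        preds = H @ a / np.sqrt(n)
        r = (preds - y) / m                     # scaled residuals
        S = (X @ W.T > 0).astype(float)         # relu'
        gW = ((r[:, None] * S * a).T @ X) / np.sqrt(n)
        ga = H.T @ r / np.sqrt(n)
        W -= lr * gW; a -= lr * ga
    num = np.sqrt(np.sum((W - W0) ** 2) + np.sum((a - a0) ** 2))
    den = np.sqrt(np.sum(W0 ** 2) + np.sum(a0 ** 2))
    return num / den

rel = [train_rel_movement(n) for n in (64, 1024, 16384)]
print(rel)  # relative movement shrinks as width grows
```

The absolute displacement stays roughly constant across widths while the initial norm grows with width, so the *relative* movement shrinks: exactly the lazy-regime picture.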
Core Theoretical Results
NTK Convergence at Infinite Width
Statement
Consider a fully connected network with $L$ hidden layers of width $n$, with NTK parameterization. As $n \to \infty$:
- The random kernel $K_{\theta_0}$ converges in probability to a deterministic kernel $K_\infty$
- The limiting kernel depends only on the architecture (depth, activation function) and is independent of the random initialization
The limiting kernel can be computed recursively layer by layer.
Intuition
At infinite width, the law of large numbers kicks in: each layer computes a sum of many independent random terms, which concentrates around its expectation. The randomness of initialization washes out, leaving a deterministic kernel that depends only on the architecture.
Proof Sketch
Proceed by induction on depth. At each layer, the pre-activations are sums of independent terms (one per neuron in the previous layer). By the CLT, these converge to a Gaussian process as $n \to \infty$. The kernel of this GP at layer $\ell$ is determined recursively by the kernel at layer $\ell - 1$ and the activation function $\sigma$, via the formula

$$\Sigma^{(\ell)}(x, x') = \mathbb{E}_{(u, v)}\left[\sigma(u)\,\sigma(v)\right],$$

where the covariance of $(u, v)$ is determined by $\Sigma^{(\ell-1)}$.
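For ReLU the layer-to-layer expectation has a closed form (the Cho-Saul arccosine kernel), so the recursion can be sketched in a few lines. The sketch below assumes He-style weight variance $2/\mathrm{fan\_in}$ so that no extra constants appear and norms are preserved layer to layer:

```python
import numpy as np

def kappa1(rho):
    # Degree-1 arccosine kernel: 2 * E[relu(u) relu(v)] for unit-variance
    # jointly Gaussian (u, v) with correlation rho (Cho & Saul).
    rho = np.clip(rho, -1.0, 1.0)
    phi = np.arccos(rho)
    return (np.sin(phi) + (np.pi - phi) * np.cos(phi)) / np.pi

def nngp(x, xp, depth):
    # Recursive NNGP kernel of a deep ReLU network, assuming weight
    # variance 2/fan_in (He init) so kappa1(1) = 1 preserves the diagonal.
    s_xx, s_pp, s_xp = x @ x, xp @ xp, x @ xp
    for _ in range(depth):
        rho = s_xp / np.sqrt(s_xx * s_pp)
        s_xp = np.sqrt(s_xx * s_pp) * kappa1(rho)
        s_xx, s_pp = s_xx * kappa1(1.0), s_pp * kappa1(1.0)
    return s_xp

x = np.array([1.0, 0.0]); xp = np.array([0.0, 1.0])
print([nngp(x, xp, L) for L in (1, 2, 3)])
```

For orthogonal unit inputs the depth-1 value is $\kappa_1(0) = 1/\pi$, and the correlation grows with depth toward the fixed point at 1, a well-known degeneracy of deep kernels at initialization.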
Why It Matters
This shows that infinitely wide neural networks at initialization are Gaussian processes with a specific kernel. Combined with the constancy result below, this means training such networks is equivalent to kernel regression, a fully solved problem.
Failure Mode
The finite-width NTK deviates from the infinite-width limit at rate $O(1/\sqrt{n})$ (Arora et al. 2019, arXiv:1904.11955, Cor. 6.2; Lee et al. 2019, Thm. 2.2). Quantitatively, at the widths reported in Arora et al. (2019, Table 1) the typical relative error is 2 to 5 percent, and at smaller widths it exceeds 10 percent. For practically sized networks, the infinite-width approximation has non-negligible error.
NTK Stays Constant During Training
Statement
For a network with width $n$ trained by gradient flow on the squared loss, as $n \to \infty$:
- The NTK stays equal to its value at initialization throughout training: $K_{\theta_t} = K_{\theta_0}$ for all $t \ge 0$
- The training loss converges to zero exponentially: $\mathcal{L}(t) \le e^{-2\lambda_{\min} t}\, \mathcal{L}(0)$, where $\lambda_{\min}$ is the smallest eigenvalue of the Gram matrix of $K_\infty$ on the training data
- The trained network is equivalent to kernel regression with kernel $K_\infty$
Intuition
When the width is enormous, each parameter contributes a tiny amount to the output. Training changes each parameter by a tiny amount. The gradient features barely change, so the kernel stays constant. The dynamics become linear in the function space, and the solution is exactly kernel regression.
Proof Sketch
Under NTK parameterization, each individual weight is $O(1)$ at initialization, so the full parameter vector satisfies $\|\theta_0\| = \Theta(\sqrt{p})$, where $p$ is the number of parameters. During gradient flow, each parameter moves by only $O(1/\sqrt{p})$ per coordinate, giving aggregate displacement $\|\theta_t - \theta_0\|$ that is $O(1)$ in total but $O(1/\sqrt{p})$ relative to $\|\theta_0\|$, so the relative movement vanishes as $n \to \infty$ (Chizat, Oyallon, Bach 2019).
The change in the NTK is second-order in the parameter displacement:

$$\|K_{\theta_t} - K_{\theta_0}\| \lesssim \|\nabla_\theta f\| \cdot \|\nabla_\theta^2 f\|_{\mathrm{op}} \cdot \|\theta_t - \theta_0\|,$$

where $\nabla_\theta^2 f$ is the Hessian of $f$ in parameter space. Under NTK parameterization, $\|\nabla_\theta^2 f\|_{\mathrm{op}} = O(1/\sqrt{n})$ and $\|\theta_t - \theta_0\| = O(1)$, so the product is $O(1/\sqrt{n})$ (Lee et al. 2019, "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent", arXiv:1902.06720).
With a constant kernel, the training dynamics in function space are linear:

$$\dot f_t(x) = -\sum_{i=1}^{m} K_\infty(x, x_i)\,\big(f_t(x_i) - y_i\big),$$

which converges exponentially at rate $\lambda_{\min}(K_\infty)$.
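These linear dynamics are easy to simulate: freeze a positive-definite Gram matrix and run gradient descent on the predictions directly. A minimal sketch, with an arbitrary RBF Gram matrix standing in for the NTK Gram matrix:

```python
import numpy as np

# With a frozen kernel, training is linear dynamics on the predictions:
# f <- f - eta * K (f - y).  The loss decays geometrically, at a rate
# governed by the spectrum of the Gram matrix K.
rng = np.random.default_rng(0)
m = 10
X = rng.normal(size=(m, 3))
# Arbitrary positive-definite Gram matrix (RBF) standing in for the NTK.
K = np.exp(-0.5 * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
y = rng.normal(size=m)

eta = 1.0 / np.linalg.eigvalsh(K).max()  # stable step size
f = np.zeros(m)
losses = []
for _ in range(2000):
    losses.append(0.5 * np.sum((f - y) ** 2))
    f -= eta * K @ (f - y)

print(losses[0], losses[-1])  # the loss is driven close to zero
```

Each eigencomponent of the residual shrinks by a factor $(1 - \eta\lambda_i)$ per step, so the slowest direction, governed by the smallest eigenvalue, sets the convergence rate.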
Why It Matters
This is the central result of NTK theory. It says: infinitely wide networks trained with gradient descent are exactly kernel methods. This immediately imports decades of kernel theory: RKHS norm bounds, generalization guarantees, spectral analysis of convergence rates.
Failure Mode
The constancy of the NTK is precisely the lazy training limitation. If the kernel does not change, the network does not learn new features. It only learns a linear combination of the initial random features. This is why NTK theory cannot explain the success of feature learning in practice.
The Lazy Regime vs. The Rich Regime
This is the critical conceptual distinction:
| Property | Lazy Regime (NTK) | Rich Regime (Feature Learning) |
|---|---|---|
| Parameters move | Very little | Substantially |
| Features | Fixed at initialization | Learned during training |
| Equivalent to | Kernel regression | Multiple infinite-width limits (mean-field, μP); less clean than NTK |
| Width | Very large | Practical |
| Theory | Well-understood | Active research |
| Performance | Often suboptimal | Often state-of-the-art |
Real neural networks that achieve state-of-the-art performance are typically in the rich regime: they learn hierarchical features that are very different from their random initialization. NTK theory describes a regime where this feature learning is suppressed.
The NTK for Specific Architectures
NTK for a two-layer ReLU network
For a two-layer network with ReLU activation $\sigma(z) = \max(z, 0)$, the infinite-width NTK admits a closed form via the Cho-Saul arccosine kernels. Let $u = \frac{\langle x, x' \rangle}{\|x\|\,\|x'\|}$ and $\varphi = \arccos(u)$. Define

$$\kappa_0(u) = \frac{\pi - \varphi}{\pi}, \qquad \kappa_1(u) = \frac{\sin\varphi + (\pi - \varphi)\cos\varphi}{\pi}.$$

The first-layer NNGP kernel is $\Sigma(x, x') = \|x\|\,\|x'\|\,\kappa_1(u)$, and the NTK is

$$K(x, x') = \|x\|\,\|x'\|\,\kappa_1(u) + \langle x, x' \rangle\,\kappa_0(u).$$

Sanity check: for unit vectors with $x = x'$, $u = 1$ and $\varphi = 0$, giving $\kappa_0(1) = 1$ and $\kappa_1(1) = 1$, so $K(x, x) = 2$, matching Jacot-Gabriel-Hongler 2018 Prop. 2 and Arora et al. 2019 Thm. 3.1. This kernel is universal (dense in continuous functions) and positive definite on distinct points.
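The closed form is a few lines of code. A sketch that also runs the sanity check above ($K(x, x) = 2$ for a unit vector, and $K = \kappa_1(0) = 1/\pi$ for orthogonal unit vectors):

```python
import numpy as np

def relu_ntk(x, xp):
    # Closed-form infinite-width NTK of a two-layer ReLU network
    # (both layers trained), via the arccosine kernels kappa0, kappa1.
    nx, nxp = np.linalg.norm(x), np.linalg.norm(xp)
    u = np.clip(x @ xp / (nx * nxp), -1.0, 1.0)
    phi = np.arccos(u)
    k0 = (np.pi - phi) / np.pi
    k1 = (np.sin(phi) + (np.pi - phi) * np.cos(phi)) / np.pi
    return nx * nxp * k1 + (x @ xp) * k0

e1 = np.array([1.0, 0.0, 0.0])
e2 = np.array([0.0, 1.0, 0.0])
print(relu_ntk(e1, e1), relu_ntk(e1, e2))  # 2.0 and 1/pi
```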
Common Confusions
NTK does not describe practical neural networks
The NTK regime requires width to be extremely large, often unrealistically so. Practical networks (GPT, ResNets, etc.) are not in the lazy regime: they learn features. NTK is a theoretical tool for understanding one extreme of neural network behavior, not a description of how practical networks work.
Constant NTK means no feature learning
A common point of confusion: "NTK theory proves neural networks are kernel methods." More precisely, NTK theory proves that infinitely wide networks in the lazy regime are kernel methods. The interesting behavior of practical networks (feature learning, representation learning, transfer) happens precisely when the NTK changes during training.
NTK parameterization vs standard parameterization
The NTK result requires a specific parameterization (scaling each layer's output by $1/\sqrt{n}$) that differs from standard parameterization and from the mean-field parameterization. Different parameterizations lead to qualitatively different infinite-width limits: the mean-field parameterization leads to feature learning even at infinite width. PyTorch default initialization is not NTK parameterization. Practitioners applying NTK results to standard-init models are making an unstated approximation.
Finite width matters quantitatively
At the widths tested by Arora et al. (2019, Table 1), the finite-width NTK typically differs from the infinite-width limit by 2 to 5 percent; at smaller widths the error exceeds 10 percent. Practical transformer MLP widths run 1024 to 8192, so the infinite-width approximation carries non-negligible error. "Infinite-width predictions" for real models are first-order approximations, not exact characterizations. (Arora et al. 2019, Table 1; Lee et al. 2019.)
Summary
- The NTK is $K_\theta(x, x') = \nabla_\theta f(x; \theta)^\top \nabla_\theta f(x'; \theta)$
- At infinite width, the NTK converges to a deterministic kernel and stays constant during training
- Infinite-width networks trained with GD are equivalent to kernel regression with the NTK
- This is the lazy regime: features are fixed, only the linear readout is learned
- Real networks learn features (rich regime), which NTK theory does not capture
- NTK was a major theoretical advance but is incomplete as a theory of deep learning
Exercises
Problem
Warm-up: For a linear model $f(x; \theta) = \theta^\top x$, compute the NTK $K(x, x')$. Then, for a two-layer linear network $f(x; W, a) = a^\top W x$ with $W \in \mathbb{R}^{n \times d}$ and $a \in \mathbb{R}^n$, compute the NTK at a fixed $(W, a)$. Show that depth changes the NTK even for linear networks, and explain what this implies about the role of depth in NTK theory.
Problem
Suppose you have a two-layer network of width $n$, and the NTK Gram matrix on the training points has minimum eigenvalue $\lambda_{\min} > 0$. The training loss at time $t$ under gradient flow satisfies $\mathcal{L}(t) \le e^{-2\lambda_{\min} t}\, \mathcal{L}(0)$. Express, in terms of $\lambda_{\min}$, $\mathcal{L}(0)$, and a target value $\epsilon$, how many time units it takes until the loss is guaranteed to reach $\epsilon$.
Problem
Explain why the NTK framework cannot account for the empirical observation that deeper networks learn increasingly abstract features at higher layers. What property of the NTK regime prevents this?
References
Canonical (NTK):
- Jacot, Gabriel, Hongler, "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (NeurIPS 2018), Prop. 2, Cor. 2
- Lee et al., "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent" (NeurIPS 2019, arXiv:1902.06720), Thm. 2.2
- Arora et al., "On Exact Computation with an Infinitely Wide Neural Net" (NeurIPS 2019, arXiv:1904.11955), Thm. 3.1, Cor. 6.2
- Du et al., "Gradient Descent Finds Global Minima of Deep Neural Networks" (ICML 2019)
- Allen-Zhu, Li, Song, "A Convergence Theory for Deep Learning via Over-Parameterization" (ICML 2019)
Mean-field primary sources:
- Mei, Montanari, Nguyen, "A Mean Field View of the Landscape of Two-Layer Neural Networks" (PNAS 2018, arXiv:1804.06561)
- Rotskoff, Vanden-Eijnden, "Parameters as Interacting Particles: Long Time Convergence and Asymptotic Error Scaling of Neural Networks" (arXiv:1805.00915)
- Chizat, Bach, "On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport" (NeurIPS 2018, arXiv:1805.09545)
Lazy vs. rich and muP:
- Chizat, Oyallon, Bach, "On Lazy Training in Differentiable Programming" (NeurIPS 2019)
- Yang, Hu, "Tensor Programs IV: Feature Learning in Infinite-Width Neural Networks" (ICML 2021)
Next Topics
- Kernels and RKHS: the classical kernel theory that NTK connects to
- Implicit bias: what inductive bias does gradient descent impose?
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Kernels and Reproducing Kernel Hilbert Spaces (Layer 3)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Rademacher Complexity (Layer 3)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- VC Dimension (Layer 2)
- Implicit Bias and Modern Generalization (Layer 4)
- Gradient Descent Variants (Layer 1)
- Linear Regression (Layer 1)
- Maximum Likelihood Estimation (Layer 0B)
Builds on This
- Lazy vs Feature Learning (Layer 4)
- Mean Field Theory (Layer 4)