
Modern Generalization

Neural Tangent Kernel

In the infinite-width limit, neural networks trained with gradient descent behave like kernel regression with a specific kernel, the Neural Tangent Kernel, connecting deep learning to classical kernel theory.


Why This Matters

The Neural Tangent Kernel is one of the most cleanly stated theoretical results in modern deep learning. It shows that in a specific limit (infinite width with NTK parameterization), training a neural network with gradient descent is mathematically equivalent to kernel regression with a particular kernel. This gave the field a rigorous connection between deep networks and classical kernel theory, with explicit convergence rates and generalization bounds.

The NTK result is also, in large part, a negative result about its own regime. The lazy-training limit it describes is precisely the regime where networks do not learn features: the representation stays at initialization. Real finite-width networks operate outside the NTK regime when feature learning matters, which is why mean-field and maximal-update (μP) parameterizations are needed to describe practical training. NTK is best understood as a clean boundary case, not an explanation of why finite networks generalize.

Mental Model

Consider a neural network f(x; \theta) with parameters \theta. At initialization \theta_0, linearize the network around \theta_0:

f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0)

This is just a first-order Taylor expansion. The key insight of NTK theory: when the network is sufficiently wide, the parameters \theta barely move during training (relative to their scale), so this linearization is accurate throughout training. The linearized model is a kernel method with kernel determined by \nabla_\theta f.
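The accuracy of this Taylor expansion is easy to probe numerically. The sketch below (a minimal pure-Python illustration; the tiny tanh network and all sizes are invented for the example) perturbs the parameters by \epsilon in a fixed random direction and compares the true output against the linearized prediction. Because the network is smooth, the gap should shrink roughly like \epsilon^2.

```python
import math
import random

random.seed(0)
m, d = 8, 3  # toy width and input dimension, chosen only for illustration

# Parameters of f(x) = (1/sqrt(m)) * sum_j a_j * tanh(w_j . x)
w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
a = [random.gauss(0, 1) for _ in range(m)]

def f(x, w, a):
    return sum(a[j] * math.tanh(sum(w[j][i] * x[i] for i in range(d)))
               for j in range(m)) / math.sqrt(m)

def grad(x, w, a):
    # Gradient of f wrt all parameters, flattened as (a_0..a_m, then w rows).
    g = []
    pre = [sum(w[j][i] * x[i] for i in range(d)) for j in range(m)]
    for j in range(m):
        g.append(math.tanh(pre[j]) / math.sqrt(m))              # df/da_j
    for j in range(m):
        s = 1.0 - math.tanh(pre[j]) ** 2                        # tanh'(pre_j)
        for i in range(d):
            g.append(a[j] * s * x[i] / math.sqrt(m))            # df/dw_ji
    return g

x = [0.5, -0.2, 0.1]
g = grad(x, w, a)

def perturbed(eps):
    # Fixed random direction delta, scaled by eps; return (true f, linearized f).
    random.seed(1)
    delta = [random.gauss(0, 1) for _ in range(len(g))]
    a2 = [a[j] + eps * delta[j] for j in range(m)]
    w2 = [[w[j][i] + eps * delta[m + j * d + i] for i in range(d)]
          for j in range(m)]
    lin = f(x, w, a) + eps * sum(gi * di for gi, di in zip(g, delta))
    return f(x, w2, a2), lin

errs = {}
for eps in (0.1, 0.01):
    true_val, lin_val = perturbed(eps)
    errs[eps] = abs(true_val - lin_val)
    print(eps, errs[eps])
```

Shrinking \epsilon by 10x should shrink the linearization error by roughly 100x, which is the sense in which "parameters barely move" implies "the linear model is accurate".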

The Neural Tangent Kernel

Definition

Neural Tangent Kernel

For a neural network f(x; \theta) with parameters \theta, the Neural Tangent Kernel is:

\Theta(x, x') = \left\langle \nabla_\theta f(x; \theta), \; \nabla_\theta f(x'; \theta) \right\rangle = \sum_{k} \frac{\partial f(x; \theta)}{\partial \theta_k} \frac{\partial f(x'; \theta)}{\partial \theta_k}

This is the inner product of the gradients of the network output with respect to all parameters, evaluated at inputs x and x'.

At initialization \theta_0, this defines \Theta_0(x, x'). The kernel depends on the architecture (depth, width, activation function) and the initialization distribution.
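The definition can be evaluated directly for a small network. The sketch below (an illustrative pure-Python implementation; the two-layer ReLU net and the sample inputs are made up for the demo) builds the gradient vector \nabla_\theta f explicitly and takes inner products. The resulting Gram matrix on a set of inputs is symmetric and positive semidefinite by construction, since it is a Gram matrix of gradient vectors.

```python
import math
import random

random.seed(0)
m, d = 200, 3  # toy width and input dimension

# Parameters of f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x)
w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
a = [random.gauss(0, 1) for _ in range(m)]

def grads(x):
    # Flattened gradient of f wrt all parameters (a_j, then w_j rows).
    g = []
    for j in range(m):
        pre = sum(w[j][i] * x[i] for i in range(d))
        act = max(pre, 0.0)                 # relu(pre)
        ind = 1.0 if pre > 0 else 0.0       # relu'(pre)
        g.append(act / math.sqrt(m))                            # df/da_j
        g.extend(a[j] * ind * x[i] / math.sqrt(m) for i in range(d))  # df/dw_ji
    return g

def ntk(x, xp):
    gx, gxp = grads(x), grads(xp)
    return sum(p * q for p, q in zip(gx, gxp))

xs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
K = [[ntk(xi, xj) for xj in xs] for xi in xs]
for row in K:
    print(row)
```

At finite width this empirical kernel is random (it depends on the draw of w and a); the convergence results below describe what happens to it as m grows.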

Definition

Lazy Training Regime

A neural network is in the lazy training regime if the parameters stay close to their initialization throughout training:

\|\theta_t - \theta_0\| = o(\|\theta_0\|)

In this regime, the NTK \Theta_t(x, x') \approx \Theta_0(x, x') remains approximately constant, and the network dynamics are well approximated by the linearized model. The name "lazy" reflects that the features (the gradients \nabla_\theta f) do not change; only the linear combination of features is learned.
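The width dependence of laziness can be seen in a toy experiment. The sketch below (pure Python; the dataset, widths, learning rate, and step count are all arbitrary choices for illustration) trains a two-layer tanh network under the 1/\sqrt{m} scaling by full-batch gradient descent and reports the relative parameter movement \|\theta_t - \theta_0\| / \|\theta_0\|. The wider network should move relatively less.

```python
import math
import random

def relative_movement(m, steps=100, lr=0.5, seed=0):
    # Train f(x) = (1/sqrt(m)) * sum_j a_j tanh(w_j . x) on 4 toy points
    # and return ||theta_t - theta_0|| / ||theta_0||.
    random.seed(seed)
    d = 2
    X = [[0.3, 0.7], [-0.5, 0.2], [0.9, -0.4], [-0.1, -0.8]]
    y = [0.5, -0.3, 0.8, -0.6]
    w = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
    a = [random.gauss(0, 1) for _ in range(m)]
    w0 = [row[:] for row in w]
    a0 = a[:]
    s = 1.0 / math.sqrt(m)
    n = len(X)
    for _ in range(steps):
        ga = [0.0] * m
        gw = [[0.0] * d for _ in range(m)]
        for x, t in zip(X, y):
            th = [math.tanh(sum(w[j][i] * x[i] for i in range(d)))
                  for j in range(m)]
            r = s * sum(a[j] * th[j] for j in range(m)) - t  # residual f(x) - y
            for j in range(m):
                ga[j] += r * s * th[j]
                c = r * s * a[j] * (1.0 - th[j] ** 2)
                for i in range(d):
                    gw[j][i] += c * x[i]
        for j in range(m):
            a[j] -= lr * ga[j] / n
            for i in range(d):
                w[j][i] -= lr * gw[j][i] / n
    move = math.sqrt(sum((a[j] - a0[j]) ** 2 for j in range(m)) +
                     sum((w[j][i] - w0[j][i]) ** 2
                         for j in range(m) for i in range(d)))
    norm0 = math.sqrt(sum(v * v for v in a0) +
                      sum(v * v for row in w0 for v in row))
    return move / norm0

narrow, wide = relative_movement(20), relative_movement(1000)
print(narrow, wide)
```

The absolute movement needed to fit the data stays roughly O(1), while \|\theta_0\| grows with the parameter count, so the ratio shrinks: this is exactly the o(\|\theta_0\|) condition in the definition.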

Core Theoretical Results

Theorem

NTK Convergence at Infinite Width

Statement

Consider a fully connected network with L hidden layers of width m, with NTK parameterization. As m \to \infty:

  1. The random kernel \Theta_0(x, x') converges in probability to a deterministic kernel \Theta^*(x, x')
  2. The limiting kernel \Theta^* depends only on the architecture (depth, activation function) and is independent of the random initialization

The limiting kernel can be computed recursively layer by layer.

Intuition

At infinite width, the law of large numbers kicks in: each layer computes a sum of many independent random terms, which concentrates around its expectation. The randomness of initialization washes out, leaving a deterministic kernel that depends only on the architecture.

Proof Sketch

Proceed by induction on depth. At each layer, the pre-activations are sums of m independent terms (one per neuron in the previous layer). By the CLT, these converge to a Gaussian process as m \to \infty. The kernel of this GP at layer l is determined recursively by the kernel at layer l-1 and the activation function, via the formula K^{(l)}(x,x') = \mathbb{E}_{(u,v) \sim \mathcal{N}}[\sigma(u)\sigma(v)], where the covariance of (u,v) is determined by K^{(l-1)}.
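Specialized to ReLU on unit-norm inputs, this layer-by-layer recursion can be written in a few lines using the arccosine kernels \kappa_0 and \kappa_1 (defined in the two-layer example later in this page). The sketch below assumes the normalized convention in which each layer's NNGP diagonal stays at 1; conventions differ slightly across papers.

```python
import math

def kappa0(u):
    # (1/pi) * (pi - arccos(u)); expectation of relu'(a)relu'(b), normalized.
    u = max(-1.0, min(1.0, u))
    return (math.pi - math.acos(u)) / math.pi

def kappa1(u):
    # (1/pi) * (sqrt(1-u^2) + (pi - arccos(u)) * u); E[relu(a)relu(b)], normalized.
    u = max(-1.0, min(1.0, u))
    return (math.sqrt(max(0.0, 1.0 - u * u)) + (math.pi - math.acos(u)) * u) / math.pi

def relu_ntk(u, n_hidden_layers):
    # Infinite-width NTK of a fully connected ReLU net evaluated at two
    # unit-norm inputs with correlation u, via the layer recursion:
    #   Sigma^(l) = kappa1(Sigma^(l-1)),  Theta^(l) = Sigma^(l) + Theta^(l-1) * kappa0(Sigma^(l-1))
    sigma = u   # NNGP kernel at layer 0 (the inputs themselves)
    theta = u   # NTK accumulator
    for _ in range(n_hidden_layers):
        theta = kappa1(sigma) + theta * kappa0(sigma)
        sigma = kappa1(sigma)
    return theta

print(relu_ntk(1.0, 1))  # one hidden layer, x = x'
```

With one hidden layer and u = 1 this gives 2, consistent with the closed-form example below; more generally, on the diagonal the NTK grows by 1 per layer under this normalization.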

Why It Matters

This shows that infinitely wide neural networks at initialization are Gaussian processes with a specific kernel. Combined with the constancy result below, this means training such networks is equivalent to kernel regression, a fully solved problem.

Failure Mode

The finite-width NTK deviates from the infinite-width limit at rate O(1/\sqrt{m}) (Arora et al. 2019, arXiv:1904.11955, Cor. 6.2; Lee et al. 2019, Thm. 2.2). Quantitatively, at m = 1024 the typical relative error is 2 to 5 percent (Arora et al. 2019, Table 1); at m = 128 it exceeds 10 percent. For practically sized networks, the infinite-width approximation has non-negligible error.
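The O(1/\sqrt{m}) deviation is easy to observe empirically. The sketch below (pure Python; seed choices, widths, and test inputs are arbitrary) samples a two-layer ReLU network at initialization, computes its empirical NTK between two fixed unit inputs, and compares against the closed-form limit \kappa_1(u) + u\,\kappa_0(u). The \sqrt{2/m} output scaling is assumed here so that the empirical kernel matches the convention in which \Theta^*(x, x) = 2.

```python
import math
import random

def kappa0(u):
    return (math.pi - math.acos(u)) / math.pi

def kappa1(u):
    return (math.sqrt(max(0.0, 1.0 - u * u)) + (math.pi - math.acos(u)) * u) / math.pi

def empirical_ntk(m, x, xp, seed):
    # Empirical NTK of f(x) = sqrt(2/m) * sum_j a_j relu(w_j . x) at init.
    random.seed(seed)
    d = len(x)
    dot = sum(xi * yi for xi, yi in zip(x, xp))
    tot = 0.0
    for _ in range(m):
        wj = [random.gauss(0, 1) for _ in range(d)]
        aj = random.gauss(0, 1)
        p = sum(wi * xi for wi, xi in zip(wj, x))
        q = sum(wi * xi for wi, xi in zip(wj, xp))
        # a-gradient term: relu(p)*relu(q); w-gradient term: a^2 * 1{p>0}1{q>0} * x.x'
        tot += (2.0 / m) * (max(p, 0.0) * max(q, 0.0)
                            + aj * aj * float(p > 0) * float(q > 0) * dot)
    return tot

x, xp = [1.0, 0.0], [0.6, 0.8]   # unit vectors with correlation u = 0.6
u = 0.6
exact = kappa1(u) + u * kappa0(u)

def avg_rel_err(m, n_seeds=5):
    return sum(abs(empirical_ntk(m, x, xp, seed=s) / exact - 1.0)
               for s in range(n_seeds)) / n_seeds

err_narrow, err_wide = avg_rel_err(100), avg_rel_err(10000)
print(err_narrow, err_wide)
```

Increasing the width by 100x should shrink the typical relative error by roughly 10x, in line with the 1/\sqrt{m} rate quoted above.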

Theorem

NTK Stays Constant During Training

Statement

For a network with width m trained by gradient flow on the squared loss, as m \to \infty:

  1. The NTK \Theta_t stays equal to \Theta_0 throughout training: \sup_t \|\Theta_t - \Theta_0\| \to 0
  2. The training loss converges to zero exponentially: L(t) \leq L(0) \cdot e^{-2\lambda_{\min} t}, where \lambda_{\min} is the smallest eigenvalue of \Theta^* on the training data
  3. The trained network is equivalent to kernel regression with kernel \Theta^*

Intuition

When the width is enormous, each parameter contributes a tiny amount to the output. Training changes each parameter by a tiny amount. The gradient features \nabla_\theta f(x; \theta) barely change, so the kernel stays constant. The dynamics become linear in the function space, and the solution is exactly kernel regression.

Proof Sketch

Under NTK parameterization, each individual weight is O(1/\sqrt{m}) at initialization. During gradient flow, each parameter moves by O(1/m) per coordinate, giving aggregate Frobenius-norm displacement \|\theta_t - \theta_0\| that is O(1) in total but O(1/\sqrt{m}) relative to \|\theta_0\|, so the relative movement vanishes as m \to \infty (Chizat, Oyallon, Bach 2019).

The change in the NTK is second-order in the parameter displacement: \|\Theta_t - \Theta_0\| \leq \|H_f\| \cdot \|\theta_t - \theta_0\|^2, where H_f is the Hessian of f in parameter space. Under NTK parameterization, \|H_f\|_{\mathrm{op}} = O(1/\sqrt{m}) and \|\theta_t - \theta_0\|^2 = O(1), so the product is O(1/\sqrt{m}) \to 0 (Lee et al. 2019, "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent", arXiv:1902.06720).

With a constant kernel, the training dynamics in function space are linear: \dot{f} = -\Theta^* \cdot (f - y), which converges exponentially at rate 2\lambda_{\min}(\Theta^*).
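These linear function-space dynamics can be simulated directly with a fixed kernel. The sketch below (an illustration, not the original derivation; the two training points, targets, step size, and step count are invented) discretizes \dot{f} = -K(f - y) with Euler steps on a 2x2 NTK Gram matrix built from the closed-form two-layer ReLU kernel, and checks the e^{-2\lambda_{\min} t} loss bound. For a 2x2 matrix with constant diagonal, the eigenvalues are known in closed form.

```python
import math

def kappa0(u):
    return (math.pi - math.acos(u)) / math.pi

def kappa1(u):
    return (math.sqrt(max(0.0, 1.0 - u * u)) + (math.pi - math.acos(u)) * u) / math.pi

u = 0.6
off = kappa1(u) + u * kappa0(u)      # NTK between two unit inputs with correlation u
K = [[2.0, off], [off, 2.0]]         # NTK Gram matrix on the two training points
lam_min = 2.0 - off                  # eigenvalues of this 2x2 matrix are 2 +/- off

y = [1.0, -1.0]
f = [0.0, 0.0]                       # function values at the training points
eta, steps = 0.01, 500

def loss(f):
    return sum((fi - yi) ** 2 for fi, yi in zip(f, y))

L0 = loss(f)
for _ in range(steps):
    r = [f[0] - y[0], f[1] - y[1]]           # residual f - y
    f = [f[i] - eta * sum(K[i][j] * r[j] for j in range(2)) for i in range(2)]

t = eta * steps
bound = L0 * math.exp(-2.0 * lam_min * t)    # theorem's exponential bound
print(loss(f), bound)
```

The discrete iteration contracts each eigencomponent by (1 - \eta\lambda) per step, and (1 - x)^k \leq e^{-kx}, so the simulated loss sits at or below the continuous-time bound.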

Why It Matters

This is the central result of NTK theory. It says: infinitely wide networks trained with gradient descent are exactly kernel methods. This immediately imports decades of kernel theory: RKHS norm bounds, generalization guarantees, spectral analysis of convergence rates.

Failure Mode

The constancy of the NTK is precisely the lazy training limitation. If the kernel does not change, the network does not learn new features. It only learns a linear combination of the initial random features. This is why NTK theory cannot explain the success of feature learning in practice.

The Lazy Regime vs. The Rich Regime

This is the critical conceptual distinction:

| Property | Lazy Regime (NTK) | Rich Regime (Feature Learning) |
| --- | --- | --- |
| Parameters move | Very little | Substantially |
| Features | Fixed at initialization | Learned during training |
| Equivalent to | Kernel regression | Multiple infinite-width limits (mean-field, μP); less clean than NTK |
| Width | Very large | Practical |
| Theory | Well understood | Active research |
| Performance | Often suboptimal | Often state-of-the-art |

Real neural networks that achieve state-of-the-art performance are typically in the rich regime: they learn hierarchical features that are very different from their random initialization. NTK theory describes a regime where this feature learning is suppressed.

The NTK for Specific Architectures

Example

NTK for a two-layer ReLU network

For a two-layer network f(x) = \frac{1}{\sqrt{m}}\sum_{j=1}^m a_j \sigma(w_j^\top x) with ReLU activation \sigma(z) = \max(0, z), the infinite-width NTK admits a closed form via the Cho-Saul arccosine kernels. Let u = x^\top x' / (\|x\|\|x'\|) and \alpha = \arccos(u). Define

\kappa_0(u) = \frac{1}{\pi}(\pi - \arccos(u)), \qquad \kappa_1(u) = \frac{1}{\pi}\left(\sqrt{1 - u^2} + (\pi - \arccos(u))\, u\right).

The first-layer NNGP kernel is K^{(1)}(x, x') = \|x\|\|x'\| \cdot \kappa_1(u), and the NTK is

\Theta^*(x, x') = \|x\|\|x'\| \left[\kappa_1(u) + u \cdot \kappa_0(u)\right] = \frac{\|x\|\|x'\|}{\pi}\left[\sin\alpha + (\pi - \alpha)\cos\alpha\right] + \frac{\|x\|\|x'\|\cos\alpha}{\pi}(\pi - \alpha).

Sanity check: for unit vectors with x = x', we have \alpha = 0 and u = 1, giving \kappa_1(1) = 1 and \kappa_0(1) = 1, so \Theta^*(x, x) = 2, matching Jacot-Gabriel-Hongler 2018 Prop. 2 and Arora et al. 2019 Thm. 3.1. This kernel is universal (dense in continuous functions) and positive definite on distinct points.
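The closed form above translates directly into code. This is a minimal sketch of the example's formula (the function name is ours); the assertions mirror the sanity check, including the \|x\|\|x'\| prefactor for non-unit inputs.

```python
import math

def ntk_two_layer_relu(x, xp):
    # Closed-form infinite-width NTK: ||x|| ||x'|| * (kappa1(u) + u * kappa0(u)).
    nx = math.sqrt(sum(v * v for v in x))
    nxp = math.sqrt(sum(v * v for v in xp))
    u = max(-1.0, min(1.0, sum(a * b for a, b in zip(x, xp)) / (nx * nxp)))
    k0 = (math.pi - math.acos(u)) / math.pi
    k1 = (math.sqrt(max(0.0, 1.0 - u * u)) + (math.pi - math.acos(u)) * u) / math.pi
    return nx * nxp * (k1 + u * k0)

print(ntk_two_layer_relu([1.0, 0.0], [1.0, 0.0]))  # unit x = x': expect 2
```

On the diagonal the kernel scales with \|x\|^2, so e.g. doubling \|x\| quadruples \Theta^*(x, x).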

Common Confusions

Watch Out

NTK does not describe practical neural networks

The NTK regime requires the width to be extremely large, often unrealistically so. Practical networks (GPT, ResNets, etc.) are not in the lazy regime: they learn features. NTK is a theoretical tool for understanding one extreme of neural network behavior, not a description of how practical networks work.

Watch Out

Constant NTK means no feature learning

A common point of confusion: "NTK theory proves neural networks are kernel methods." More precisely, NTK theory proves that infinitely wide networks in the lazy regime are kernel methods. The interesting behavior of practical networks (feature learning, representation learning, transfer) happens precisely when the NTK changes during training.

Watch Out

NTK parameterization vs standard parameterization

The NTK result requires a specific parameterization (scaling each layer's output by 1/\sqrt{m}) that differs from the parameterizations used in practice. Different parameterizations lead to qualitatively different infinite-width limits: the mean-field parameterization, for instance, leads to feature learning even at infinite width. PyTorch's default initialization is not NTK parameterization, so practitioners applying NTK results to standard-init models are making an unstated approximation.

Watch Out

Finite width matters quantitatively

At m = 1024, the finite-width NTK typically differs from the infinite-width limit by 2 to 5 percent. At m = 128, the error exceeds 10 percent. Practical transformer MLP widths run 1024 to 8192, so the infinite-width approximation carries non-negligible error. "Infinite-width predictions" for real models are first-order approximations, not exact characterizations. (Arora et al. 2019, Table 1; Lee et al. 2019.)

Summary

  • The NTK is \Theta(x, x') = \langle \nabla_\theta f(x), \nabla_\theta f(x') \rangle
  • At infinite width, the NTK converges to a deterministic kernel and stays constant during training
  • Infinite-width networks trained with GD are equivalent to kernel regression with the NTK
  • This is the lazy regime: features are fixed, only the linear readout is learned
  • Real networks learn features (rich regime), which NTK theory does not capture
  • NTK was a major theoretical advance but is incomplete as a theory of deep learning

Exercises

ExerciseCore

Problem

Warm-up: For a linear model f(x; w) = w^\top x with w \in \mathbb{R}^d, compute the NTK \Theta(x, x'). Then, for a two-layer linear network f(x; W, v) = v^\top W x with W \in \mathbb{R}^{m \times d} and v \in \mathbb{R}^m, compute the NTK at a fixed (W, v). Show that depth changes the NTK even for linear networks, and explain what this implies about the role of depth in NTK theory.

ExerciseAdvanced

Problem

Suppose you have a two-layer network of width m, and the NTK on the n training points has minimum eigenvalue \lambda_{\min}. The training loss at time t under gradient flow satisfies L(t) \leq L(0) e^{-2\lambda_{\min} t}. If \lambda_{\min} = 0.01 and L(0) = 1.0, how many time units until the loss reaches 10^{-6}?

ExerciseAdvanced

Problem

Explain why the NTK framework cannot account for the empirical observation that deeper networks learn increasingly abstract features at higher layers. What property of the NTK regime prevents this?


References

Canonical (NTK):

  • Jacot, Gabriel, Hongler, "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (NeurIPS 2018), Prop. 2, Cor. 2
  • Lee et al., "Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent" (NeurIPS 2019, arXiv:1902.06720), Thm. 2.2
  • Arora et al., "On Exact Computation with an Infinitely Wide Neural Net" (NeurIPS 2019, arXiv:1904.11955), Thm. 3.1, Cor. 6.2
  • Du et al., "Gradient Descent Finds Global Minima of Deep Neural Networks" (ICML 2019)
  • Allen-Zhu, Li, Song, "A Convergence Theory for Deep Learning via Over-Parameterization" (ICML 2019)

Mean-field primary sources:

  • Mei, Montanari, Nguyen, "A Mean Field View of the Landscape of Two-Layer Neural Networks" (PNAS 2018, arXiv:1804.06561)
  • Rotskoff, Vanden-Eijnden, "Parameters as Interacting Particles: Long Time Convergence and Asymptotic Error Scaling of Neural Networks" (arXiv:1805.00915)
  • Chizat, Bach, "On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport" (NeurIPS 2018, arXiv:1805.09545)

Lazy vs. rich and μP:

  • Chizat, Oyallon, Bach, "On Lazy Training in Differentiable Programming" (NeurIPS 2019)
  • Yang, Hu, "Tensor Programs IV: Feature Learning in Infinite-Width Neural Networks" (ICML 2021)

Next Topics

  • Kernels and RKHS: the classical kernel theory that NTK connects to
  • Implicit bias: what inductive bias does gradient descent impose?

Last reviewed: April 2026
