
Paper breakdown

Neural Tangent Kernel: Convergence and Generalization in Neural Networks

Arthur Jacot, Franck Gabriel, and Clément Hongler · 2018 · NeurIPS 2018

Proves that, in the infinite-width limit and at the right initialisation scale, gradient-flow training of a neural network is equivalent to kernel regression with a fixed deterministic kernel. Gives the first proof of global convergence for a non-convex deep-network training procedure.

Overview

Jacot, Gabriel, and Hongler (2018) gave deep-network training its first satisfactory theoretical limit. Take a fully-connected network with $L$ hidden layers, scale the per-layer width $n$ to infinity at a particular initialisation (the so-called NTK parameterisation), and run gradient flow on a squared loss. The paper proves two things. First, the empirical neural tangent kernel (the kernel induced by the gradient of the network output with respect to its parameters) converges in probability to a deterministic limit $\Theta_\infty$. Second, it stays constant during training. As a result, the network's training trajectory in function space is identical to gradient flow on the squared loss for kernel regression with kernel $\Theta_\infty$.

This linearisation was a turning point. Before NTK, deep-network training was treated as an empirical art with no clean theoretical handle. After NTK, the infinite-width regime — also called the "lazy" regime — became a tractable model in which questions about convergence, generalisation, and inductive bias have closed-form answers.

Mathematical Contributions

The NTK definition

Let $f(x; \theta) \in \mathbb{R}$ be the scalar output of a network with parameters $\theta \in \mathbb{R}^P$. The empirical neural tangent kernel is:

$$\hat{\Theta}(x, x') = \nabla_\theta f(x; \theta)^\top \nabla_\theta f(x'; \theta)$$

It is the inner product, in parameter space, of the parameter-gradients of the function evaluated at two inputs. It depends on the current $\theta$, so generically it is a random object that evolves with training.
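
As a concrete illustration, here is a minimal sketch of the empirical NTK computed from Jacobians in JAX, for a small fully-connected ReLU network with the $1/\sqrt{n}$ scaling applied in the forward pass. The architecture, widths, and function names are illustrative choices, not taken from the paper:

```python
import jax
import jax.numpy as jnp

def init_params(key, widths=(3, 64, 64, 1)):
    """I.i.d. standard-Gaussian weights; the 1/sqrt(n) NTK scaling is
    applied in the forward pass rather than folded into the init."""
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        key, sub = jax.random.split(key)
        params.append(jax.random.normal(sub, (n_in, n_out)))
    return params

def f(params, x):
    """Scalar network output f(x; theta) for a batch of inputs x."""
    h = x
    for i, W in enumerate(params):
        h = h @ W / jnp.sqrt(W.shape[0])          # NTK parameterisation
        if i < len(params) - 1:
            h = jax.nn.relu(h)
    return h.squeeze(-1)

def empirical_ntk(params, x1, x2):
    """hat{Theta}(x, x') = <grad_theta f(x; theta), grad_theta f(x'; theta)>."""
    j1 = jax.jacobian(f)(params, x1)              # pytree of (n1, *W.shape) arrays
    j2 = jax.jacobian(f)(params, x2)              # pytree of (n2, *W.shape) arrays
    dot = lambda a, b: jnp.tensordot(
        a, b, axes=(list(range(1, a.ndim)), list(range(1, b.ndim))))
    return sum(jax.tree_util.tree_leaves(jax.tree_util.tree_map(dot, j1, j2)))

key_p, key_x = jax.random.split(jax.random.PRNGKey(0))
params = init_params(key_p)
x = jax.random.normal(key_x, (5, 3))
print(empirical_ntk(params, x, x).shape)          # (5, 5) Gram matrix
```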

Convergence to a deterministic limit

The paper proves, under the NTK parameterisation (each weight matrix is initialised i.i.d. Gaussian and scaled by $1/\sqrt{n}$, where $n$ is the layer width), that as $n \to \infty$:

$$\hat{\Theta}(x, x') \to \Theta_\infty(x, x')$$

in probability at initialisation, and the limit kernel is recursive:

$$\Theta_\infty^{(L)}(x, x') = \Theta_\infty^{(L-1)}(x, x') \cdot \dot{\Sigma}^{(L)}(x, x') + \Sigma^{(L)}(x, x')$$

where $\Sigma^{(L)}$ is the NNGP kernel (the Gaussian-process kernel of the network output at initialisation) and $\dot{\Sigma}^{(L)}$ is the derivative covariance under the activation. Both are computable in closed form for ReLU and other standard non-linearities.
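
For ReLU both expectations have the arc-cosine closed forms, so the recursion can be evaluated directly. Below is a minimal sketch for a bias-free ReLU network with inputs normalised so that $\Sigma^{(1)}(x, x') = x^\top x'/d$; scaling conventions for the activation vary between papers, so treat the constants and function names as illustrative:

```python
import jax.numpy as jnp

def relu_expectations(s1, s2, c):
    """Closed-form Gaussian expectations for ReLU.

    With (u, v) ~ N(0, [[s1, c], [c, s2]]) and t = arccos(c / sqrt(s1*s2)):
      E[relu(u) relu(v)]   = sqrt(s1*s2) * (sin t + (pi - t) cos t) / (2*pi)
      E[relu'(u) relu'(v)] = (pi - t) / (2*pi)
    """
    norm = jnp.sqrt(s1 * s2)
    cos_t = jnp.clip(c / norm, -1.0, 1.0)
    t = jnp.arccos(cos_t)
    sigma = norm * (jnp.sin(t) + (jnp.pi - t) * cos_t) / (2 * jnp.pi)
    sigma_dot = (jnp.pi - t) / (2 * jnp.pi)
    return sigma, sigma_dot

def limit_ntk(x, xp, depth):
    """Recursive limit NTK for a bias-free ReLU network:
    Theta^(l) = Theta^(l-1) * Sigma_dot^(l) + Sigma^(l),
    starting from Sigma^(1)(x, x') = x.x'/d and Theta^(1) = Sigma^(1)."""
    d = x.shape[0]
    sigma = x @ xp / d                    # Sigma^(1)(x, x')
    s_x, s_xp = x @ x / d, xp @ xp / d    # diagonal entries of Sigma^(1)
    theta = sigma                         # Theta^(1)
    for _ in range(2, depth + 1):
        sigma_next, sigma_dot = relu_expectations(s_x, s_xp, sigma)
        s_x, _ = relu_expectations(s_x, s_x, s_x)     # Sigma^(l)(x, x)
        s_xp, _ = relu_expectations(s_xp, s_xp, s_xp) # Sigma^(l)(x', x')
        theta = theta * sigma_dot + sigma_next
        sigma = sigma_next
    return theta

x, xp = jnp.array([1.0, 0.5, -0.2]), jnp.array([0.3, -1.0, 0.8])
print(limit_ntk(x, xp, depth=3))
```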

Constancy of the kernel during training

The second result is that, under the same width scaling and on a finite training set, the empirical NTK does not move during training: $\hat{\Theta}(x, x'; \theta(t)) \approx \hat{\Theta}(x, x'; \theta(0))$ for all $t$. The argument is that parameter changes during training are $O(1/\sqrt{n})$ per parameter while the kernel's sensitivity to those changes is also $O(1/\sqrt{n})$, so the deviation is $O(1/n)$ and vanishes in the infinite-width limit.
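
One way to see the effect numerically is to compare the empirical kernel at initialisation with the kernel after some gradient-descent steps, at increasing widths: the relative change should shrink as the width grows. The sketch below uses a one-hidden-layer network with illustrative hyperparameters and is not a statement about the rate itself:

```python
import jax
import jax.numpy as jnp

def make_net(width):
    def f(params, x):
        W1, W2 = params
        h = jax.nn.relu(x @ W1 / jnp.sqrt(x.shape[-1]))   # NTK scaling
        return (h @ W2 / jnp.sqrt(width)).squeeze(-1)
    return f

def ntk(f, params, x):
    """Gram matrix of parameter-gradients on the batch x."""
    j = jax.jacobian(f)(params, x)
    dot = lambda a, b: jnp.tensordot(
        a, b, axes=(list(range(1, a.ndim)), list(range(1, b.ndim))))
    return sum(jax.tree_util.tree_leaves(jax.tree_util.tree_map(dot, j, j)))

key_x = jax.random.PRNGKey(0)
x = jax.random.normal(key_x, (8, 4))
y = jnp.sin(x.sum(-1))

for width in (16, 256, 4096):
    k1, k2 = jax.random.split(jax.random.PRNGKey(1))
    params = [jax.random.normal(k1, (4, width)), jax.random.normal(k2, (width, 1))]
    f = make_net(width)
    K0 = ntk(f, params, x)
    loss = lambda p: 0.5 * jnp.sum((f(p, x) - y) ** 2)
    for _ in range(100):                       # full-batch gradient descent
        g = jax.grad(loss)(params)
        params = [p - 0.1 * gi for p, gi in zip(params, g)]
    Kt = ntk(f, params, x)
    change = jnp.linalg.norm(Kt - K0) / jnp.linalg.norm(K0)
    print(width, float(change))   # relative Frobenius change shrinks with width
```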

Equivalence to kernel regression

With a constant deterministic NTK $\Theta_\infty$ and squared loss, gradient flow on the network parameters induces the function-space ODE:

$$\dot{f}(x, t) = -\Theta_\infty(x, X)\, \big(f(X, t) - y\big)$$

where $X$ is the matrix of training inputs and $y$ the vector of labels. This is the gradient-flow ODE for kernel regression with kernel $\Theta_\infty$ and squared loss. In the limit $t \to \infty$ (and ignoring the contribution of the network's initial output), the learned function is the kernel-regression solution:

$$f^*(x) = \Theta_\infty(x, X)\, \Theta_\infty(X, X)^{-1}\, y$$

Convergence is global (no local minima problem) because kernel regression is a convex quadratic in function space. The non-convexity of the parameter-space loss is invisible from the function-space view.
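
A minimal sketch of this correspondence, using a placeholder positive-definite kernel in place of $\Theta_\infty$ (the limit-NTK recursion sketched earlier could be substituted): on the training set the ODE has the exact solution $f(X, t) = y + e^{-\Theta_\infty(X, X)\, t}\,\big(f(X, 0) - y\big)$, and the $t \to \infty$ predictor is the ridgeless kernel regressor above. Function names here are hypothetical:

```python
import jax.numpy as jnp
from jax.scipy.linalg import expm

def ntk_regression_predict(theta_test_train, theta_train_train, y):
    """t -> infinity limit: f*(x) = Theta(x, X) Theta(X, X)^{-1} y."""
    return theta_test_train @ jnp.linalg.solve(theta_train_train, y)

def training_set_trajectory(theta_train_train, y, f0, t):
    """Exact solution of f_dot = -Theta (f - y) on the training set:
    f(X, t) = y + expm(-Theta * t) @ (f(X, 0) - y)."""
    return y + expm(-theta_train_train * t) @ (f0 - y)

# Toy example with a stand-in PSD kernel, not the actual NTK.
X = jnp.array([[0.0], [0.5], [1.0]])
y = jnp.array([0.0, 1.0, 0.5])
K = jnp.exp(-(X - X.T) ** 2)
f0 = jnp.zeros(3)

print(training_set_trajectory(K, y, f0, t=200.0))   # approaches y as t grows
print(ntk_regression_predict(K, K, y))              # equals y: interpolation on X
```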

What this does and does not explain

The NTK theorem explains why training converges and predicts the function the converged network represents. It does not, on its own, explain why deep networks generalise better than the equivalent kernel method. In the NTK regime, a deep network is asymptotically equivalent to a fixed kernel — there is no representation learning. The generalisation gap between NTK predictions and actual deep networks at finite width is the gap that the implicit-bias and feature-learning literatures address.


Why It Matters Now

The NTK regime is not where modern large models operate. The "lazy" regime in which the kernel stays constant requires a particular initialisation scaling, and in practice — especially at large scale — feature learning happens, the kernel changes, and the linearised analysis becomes a poor approximation. So the NTK is not a recipe for predicting GPT-4.

It still matters for three reasons.

First, NTK gave the field its first global-convergence guarantee for a deep, non-convex training procedure. The proof technique — track the parameter-space dynamics, lift to function space, exploit convexity in function space — appears throughout the analysis of overparameterised models, including in two-layer mean-field networks, in transformer-stylised models, and in some convergence analyses of SGD.

Second, the NTK is now a baseline kernel. It is a well-defined, computable kernel for any architecture, and "does the deep network beat its NTK?" is a legitimate test for whether feature learning is happening. When the answer is no, the deep network is doing kernel regression with extra steps, which is a useful warning.

Third, the framing — that an infinite-width neural network in a particular limit is a fixed kernel — connected deep learning to the older kernel-method literature in a way that everyone could agree on. That bridge let statistical-learning-theory tools (Rademacher bounds, capacity arguments) be applied to deep networks at all, even if only in this limit.

References

Canonical:

  • Jacot, A., Gabriel, F., & Hongler, C. (2018). "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." NeurIPS. arXiv:1806.07572.

Direct precursors:

  • Neal, R. M. (1995). Bayesian Learning for Neural Networks. PhD thesis, University of Toronto. The infinite-width-as-Gaussian-process observation.
  • Lee, J. et al. (2018). "Deep Neural Networks as Gaussian Processes." ICLR. arXiv:1711.00165. Modern NNGP treatment.

Convergence and generalisation follow-ups:

  • Du, S. S. et al. (2019). "Gradient Descent Provably Optimizes Over-parameterized Neural Networks." ICLR. arXiv:1810.02054.
  • Allen-Zhu, Z., Li, Y., & Song, Z. (2019). "A Convergence Theory for Deep Learning via Over-Parameterization." ICML. arXiv:1811.03962.
  • Arora, S., Du, S. S., Hu, W., Li, Z., & Wang, R. (2019). "Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks." ICML. arXiv:1901.08584.

Feature-learning critique:

  • Chizat, L., Oyallon, E., & Bach, F. (2019). "On Lazy Training in Differentiable Programming." NeurIPS. arXiv:1812.07956.
  • Mei, S., Montanari, A., & Nguyen, P.-M. (2018). "A Mean Field View of the Landscape of Two-Layer Neural Networks." PNAS. arXiv:1804.06561.
  • Yang, G., & Hu, E. J. (2021). "Tensor Programs IV: Feature Learning in Infinite-Width Neural Networks." ICML. arXiv:2011.14522.

Surveys:

  • Roberts, D. A., Yaida, S., & Hanin, B. (2022). The Principles of Deep Learning Theory. Cambridge. Chapters 4-5.


Last reviewed: May 5, 2026