Paper breakdown
Neural Tangent Kernel: Convergence and Generalization in Neural Networks
Arthur Jacot, Franck Gabriel, and Clément Hongler · 2018 · NeurIPS 2018
Proves that, in the infinite-width limit and at the right initialisation scale, gradient-flow training of a neural network is equivalent to kernel regression with a fixed deterministic kernel. Gives the first proof of global convergence for a non-convex deep-network training procedure.
Overview
Jacot, Gabriel, and Hongler (2018) gave deep-network training its first satisfactory theoretical limit. Take a fully-connected network with hidden layers, scale the per-layer width to infinity at a particular initialisation (the so-called NTK parameterisation), and run gradient flow on a squared loss. The paper proves two things. The empirical neural tangent kernel — the kernel induced by the gradient of the network output with respect to its parameters — converges in probability to a deterministic limit $\Theta_\infty$, and stays constant during training. As a result, the network's training trajectory in function space is identical to gradient flow on the squared loss for kernel regression with kernel $\Theta_\infty$.
This linearisation was a turning point. Before NTK, deep-network training was treated as an empirical art with no clean theoretical handle. After NTK, the infinite-width regime — also called the "lazy" regime — became a tractable model in which questions about convergence, generalisation, and inductive bias have closed-form answers.
Mathematical Contributions
The NTK definition
Let $f(x; \theta)$ be the scalar output of a network with parameters $\theta \in \mathbb{R}^P$. The empirical neural tangent kernel is:

$$\hat\Theta_\theta(x, x') = \big\langle \nabla_\theta f(x; \theta),\; \nabla_\theta f(x'; \theta) \big\rangle = \sum_{p=1}^{P} \partial_{\theta_p} f(x; \theta)\, \partial_{\theta_p} f(x'; \theta)$$

It is the inner product, in parameter space, of the parameter-gradients of the function evaluated at two inputs. It depends on the current $\theta$, so generically it is a random object that evolves with training.
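As a concrete check of the definition, the kernel can be computed directly for a small two-layer network. A minimal sketch, assuming a tanh network in the NTK parameterisation; the helper names, width, and data are illustrative, not from the paper:

```python
import numpy as np

def init_params(n, d, rng):
    # NTK parameterisation: iid standard Gaussians; the 1/sqrt(n)
    # scaling lives in the forward pass rather than in the init variance.
    return rng.standard_normal((n, d)), rng.standard_normal(n)

def grad_f(x, W, a):
    """Gradient of f(x) = a . tanh(W x) / sqrt(n) w.r.t. (W, a), flattened."""
    n = a.shape[0]
    act = np.tanh(W @ x)                                        # hidden activations
    dW = (a * (1 - act**2))[:, None] * x[None, :] / np.sqrt(n)  # (n, d)
    da = act / np.sqrt(n)                                       # (n,)
    return np.concatenate([dW.ravel(), da])

def empirical_ntk(X, W, a):
    """Gram matrix K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>."""
    G = np.stack([grad_f(x, W, a) for x in X])                  # (N, P)
    return G @ G.T

rng = np.random.default_rng(0)
W, a = init_params(n=512, d=3, rng=rng)
X = rng.standard_normal((4, 3))
K = empirical_ntk(X, W, a)      # symmetric PSD 4x4 Gram matrix
```

Because the kernel is a Gram matrix of gradients, it is symmetric positive semi-definite by construction, whatever the width or the current parameters.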
Convergence to a deterministic limit
The paper proves, under the NTK parameterisation (each weight matrix is initialised iid Gaussian and scaled by $1/\sqrt{n_\ell}$, where $n_\ell$ is the layer width), that as $n_1, \dots, n_{L-1} \to \infty$:

$$\hat\Theta_\theta \longrightarrow \Theta_\infty$$

in probability at initialisation, and the limit kernel is recursive:

$$\Theta_\infty^{(\ell)}(x, x') = \Theta_\infty^{(\ell-1)}(x, x')\,\dot\Sigma^{(\ell)}(x, x') + \Sigma^{(\ell)}(x, x'), \qquad \Theta_\infty^{(1)} = \Sigma^{(1)},$$

where $\Sigma^{(\ell)}$ is the NNGP kernel (the Gaussian-process kernel of the network at initialisation) and $\dot\Sigma^{(\ell)}$ is the derivative covariance under the activation. Both are computable in closed form for ReLU and other standard non-linearities.
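For ReLU, both Gaussian expectations have arc-cosine closed forms, so the recursion can be evaluated directly. A sketch, assuming a bias-free network with the usual $c_\sigma = 2$ normalisation (an assumption of this example, chosen so diagonal variances are preserved layer to layer):

```python
import numpy as np

def relu_ntk(x1, x2, depth):
    """Limit NTK for a `depth`-layer fully connected ReLU network,
    NTK parameterisation, no biases, c_sigma = 2 normalisation."""
    d = x1.shape[0]
    # layer 1 is linear: Sigma^(1)(x, x') = x . x' / d, and Theta^(1) = Sigma^(1)
    s11, s22, s12 = x1 @ x1 / d, x2 @ x2 / d, x1 @ x2 / d
    theta = s12
    for _ in range(depth - 1):
        rho = np.clip(s12 / np.sqrt(s11 * s22), -1.0, 1.0)
        ang = np.arccos(rho)
        # arc-cosine closed forms for E[relu(u) relu(v)] and
        # E[relu'(u) relu'(v)] under the previous layer's Gaussian
        sigma_dot = (np.pi - ang) / np.pi
        s12 = np.sqrt(s11 * s22) * (np.sin(ang) + (np.pi - ang) * rho) / np.pi
        theta = theta * sigma_dot + s12
        # s11, s22 stay fixed: the normalisation preserves Sigma(x, x)
    return theta
```

On the diagonal ($x = x'$) each layer adds one copy of $\Sigma^{(1)}(x, x)$, so $\Theta_\infty(x, x)$ grows linearly with depth, which is a quick sanity check on the recursion.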
Constancy of the kernel during training
The second result is that, under the same width scaling and on a finite training set, the empirical NTK does not move during training: $\hat\Theta_{\theta(t)} = \Theta_\infty$ for all $t \ge 0$. The argument is that parameter changes during training are $O(1/\sqrt{n})$ per parameter, while the kernel's sensitivity to those changes carries the same $1/\sqrt{n}$ suppression, so the kernel's total deviation from its initial value is $O(1/\sqrt{n})$ and vanishes in the infinite-width limit.
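The shrinking deviation is easy to observe numerically: train two-layer tanh networks of different widths by gradient descent and compare how far the empirical kernel drifts. A toy experiment; the widths, step counts, and data here are illustrative, not the paper's:

```python
import numpy as np

def grads(X, W, a):
    """Per-example gradients of f(x) = a . tanh(W x) / sqrt(n), shape (N, P)."""
    n = a.shape[0]
    act = np.tanh(X @ W.T)
    dW = ((1 - act**2) * a)[:, :, None] * X[:, None, :] / np.sqrt(n)
    da = act / np.sqrt(n)
    return np.concatenate([dW.reshape(len(X), -1), da], axis=1)

def train_drift(n, steps=300, lr=0.05, seed=0):
    """Relative Frobenius change of the empirical NTK over `steps` of GD."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((8, 3))
    y = np.sin(2.0 * X[:, 0])
    W, a = rng.standard_normal((n, 3)), rng.standard_normal(n)
    G = grads(X, W, a)
    K0 = G @ G.T                                     # kernel at initialisation
    for _ in range(steps):
        act = np.tanh(X @ W.T)
        r = act @ a / np.sqrt(n) - y                 # residuals f(X) - y
        da = act.T @ r / np.sqrt(n)                  # dL/da for squared loss
        dW = a[:, None] * (((1 - act**2) * r[:, None]).T @ X) / np.sqrt(n)
        a -= lr * da
        W -= lr * dW
    G = grads(X, W, a)
    K1 = G @ G.T                                     # kernel after training
    return np.linalg.norm(K1 - K0) / np.linalg.norm(K0)

drift_narrow = train_drift(n=32)
drift_wide = train_drift(n=2048)                     # drifts far less
```

With the same data and seed, the wide network's kernel moves much less than the narrow one's, consistent with the $O(1/\sqrt{n})$ scaling.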
Equivalence to kernel regression
With a constant deterministic NTK and squared loss, gradient flow on the network parameters induces the function-space ODE:

$$\partial_t f_t(X) = -\,\Theta_\infty(X, X)\,\big(f_t(X) - y\big),$$

where $\Theta_\infty(X, X)$ is the $N \times N$ Gram matrix on the training inputs $X$ and $y$ the vector of labels. This is the gradient-flow ODE for kernel regression with kernel $\Theta_\infty$ and squared loss. In the limit $t \to \infty$, and starting from the zero function, the network's learned function is the kernel-regression solution:

$$f_\infty(x) = \Theta_\infty(x, X)\,\Theta_\infty(X, X)^{-1}\, y$$
Convergence is global (no local minima problem) because kernel regression is a convex quadratic in function space. The non-convexity of the parameter-space loss is invisible from the function-space view.
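Because the kernel is constant, the function-space ODE is linear and can be integrated exactly by eigendecomposition of the Gram matrix. A sketch with a stand-in RBF kernel playing the role of $\Theta_\infty$ (an assumption for illustration; any positive-definite kernel behaves the same way):

```python
import numpy as np

def kernel(A, B, bw=1.0):
    # stand-in positive-definite kernel (RBF); the limit NTK plays this role
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bw**2))

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 2))           # training inputs
y = np.sin(X[:, 0])                        # training labels
K = kernel(X, X)                           # Gram matrix Theta(X, X)

# solve  d/dt f_t(X) = -K (f_t(X) - y)  exactly via eigendecomposition:
# the residual decays as exp(-K t) (f_0(X) - y)
evals, evecs = np.linalg.eigh(K)
f0 = np.zeros(len(X))                      # start from the zero function

def f_t(t):
    decay = evecs @ (np.exp(-evals * t)[:, None] * evecs.T)
    return y + decay @ (f0 - y)

# the residual norm shrinks monotonically: convex quadratic in function space
res = [np.linalg.norm(f_t(t) - y) for t in (0.0, 1.0, 10.0)]

# t -> infinity limit off the training set: plain kernel regression
x_test = rng.standard_normal((3, 2))
pred = kernel(x_test, X) @ np.linalg.solve(K, y)
```

Every eigencomponent of the residual decays independently at rate given by its eigenvalue, which is exactly why there is no local-minimum problem in this picture.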
What this does and does not explain
The NTK theorem explains why training converges and predicts the function the converged network represents. It does not, on its own, explain why deep networks generalise better than the equivalent kernel method. In the NTK regime, a deep network is asymptotically equivalent to a fixed kernel — there is no representation learning. The generalisation gap between NTK predictions and actual deep networks at finite width is the gap that the implicit-bias and feature-learning literatures address.
Connections to TheoremPath Topics
- Neural tangent kernel — modern presentation including empirical NTK, finite-width corrections, and lazy/feature-learning transitions.
- Feedforward networks and backpropagation — the parameter-gradient computation the NTK depends on.
- Gaussian processes for ML — the related NNGP limit, where the network at initialisation is itself a Gaussian process.
- Implicit bias and modern generalization — what NTK does not capture: the inductive bias from feature learning.
- Training dynamics and loss landscapes — the alternative framing the NTK linearises.
- Mean-field theory — the alternative infinite-width parameterisation where features do learn (Mei-Montanari, Chizat-Bach).
Why It Matters Now
The NTK regime is not where modern large models operate. The "lazy" regime in which the kernel stays constant requires a particular initialisation scaling, and in practice — especially at large scale — feature learning happens, the kernel changes, and the linearised analysis becomes a poor approximation. So the NTK is not a recipe for predicting GPT-4.
It still matters for three reasons.
First, NTK gave the field its first global-convergence guarantee for a deep, non-convex training procedure. The proof technique — track the parameter-space dynamics, lift to function space, exploit convexity in function space — appears throughout the analysis of overparameterised models, including in two-layer mean-field networks, in stylised transformer models, and in some convergence analyses of SGD.
Second, the NTK is now a baseline kernel. It is a well-defined, computable kernel for any architecture, and "does the deep network beat its NTK?" is a legitimate test for whether feature learning is happening. When the answer is no, the deep network is doing kernel regression with extra steps, which is a useful warning.
Third, the framing — that an infinite-width neural network in a particular limit is a fixed kernel — connected deep learning to the older kernel-method literature in a way that everyone could agree on. That bridge let statistical-learning-theory tools (Rademacher bounds, capacity arguments) be applied to deep networks at all, even if only in this limit.
References
Canonical:
- Jacot, A., Gabriel, F., & Hongler, C. (2018). "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." NeurIPS. arXiv:1806.07572.
Direct precursors:
- Neal, R. M. (1996). Bayesian Learning for Neural Networks. PhD thesis, University of Toronto. The infinite-width-as-Gaussian-process observation.
- Lee, J. et al. (2018). "Deep Neural Networks as Gaussian Processes." ICLR. arXiv:1711.00165. Modern NNGP treatment.
Convergence and generalisation follow-ups:
- Du, S. S. et al. (2019). "Gradient Descent Provably Optimizes Over-parameterized Neural Networks." ICLR. arXiv:1810.02054.
- Allen-Zhu, Z., Li, Y., & Song, Z. (2019). "A Convergence Theory for Deep Learning via Over-Parameterization." ICML. arXiv:1811.03962.
- Arora, S., Du, S. S., Hu, W., Li, Z., & Wang, R. (2019). "Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks." ICML. arXiv:1901.08584.
Feature-learning critique:
- Chizat, L., Oyallon, E., & Bach, F. (2019). "On Lazy Training in Differentiable Programming." NeurIPS. arXiv:1812.07956.
- Mei, S., Montanari, A., & Nguyen, P.-M. (2018). "A Mean Field View of the Landscape of Two-Layer Neural Networks." PNAS. arXiv:1804.06561.
- Yang, G., & Hu, E. J. (2021). "Tensor Programs IV: Feature Learning in Infinite-Width Neural Networks." ICML. arXiv:2011.14522.
Surveys:
- Roberts, D. A., Yaida, S., & Hanin, B. (2022). The Principles of Deep Learning Theory. Cambridge. Chapters 4-5.
Last reviewed: May 5, 2026