What Each Describes
Both NTK and mean-field theory describe the behavior of neural networks in the infinite-width limit. They answer the same question: what happens as width $m \to \infty$? But they arrive at qualitatively different answers because they use different scalings of the learning rate and initialization.
NTK (Neural Tangent Kernel) describes the lazy regime. Weights stay close to initialization, the network behaves like a linear model in a fixed feature space, and training dynamics reduce to kernel regression with a deterministic kernel.
Mean-field theory describes the rich regime. Weights move substantially during training, the network learns new features, and the distribution of neurons evolves according to a Wasserstein gradient flow on a space of probability measures.
Side-by-Side Statement
NTK Regime
Consider a two-layer network $f(x) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} a_i \, \sigma(w_i^\top x)$ with learning rate $\eta = \Theta(1)$, or standard parameterization with $\eta = \Theta(1/m)$. In the limit $m \to \infty$:
Training dynamics become linear. The NTK $K_m(x, x') = \langle \nabla_\theta f(x), \nabla_\theta f(x') \rangle$ converges to a deterministic kernel $K_\infty$ and stays approximately constant throughout training.
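A sketch of the concentration claim, assuming the NTK parameterization $f(x) = \frac{1}{\sqrt{m}} \sum_i a_i\,\mathrm{ReLU}(w_i^\top x)$; `empirical_ntk` is an illustrative helper, not a library function:

```python
# Sketch: empirical NTK of a two-layer ReLU network at initialization.
# As width m grows, the kernel value fluctuates less across random
# initializations, illustrating convergence to a deterministic kernel.
import numpy as np

def empirical_ntk(W, a, x1, x2, m):
    """Inner product of parameter gradients <df(x1)/dtheta, df(x2)/dtheta>."""
    def grads(x):
        pre = W @ x                     # preactivations, shape (m,)
        act = np.maximum(pre, 0.0)      # ReLU
        ind = (pre > 0).astype(float)   # ReLU derivative
        g_a = act / np.sqrt(m)                      # df/da_i
        g_W = (a * ind / np.sqrt(m))[:, None] * x   # df/dW_i
        return g_a, g_W
    ga1, gW1 = grads(x1)
    ga2, gW2 = grads(x2)
    return ga1 @ ga2 + np.sum(gW1 * gW2)

rng = np.random.default_rng(0)
d = 5
x1, x2 = rng.normal(size=d), rng.normal(size=d)

for m in [10, 100, 10_000]:
    vals = [empirical_ntk(rng.normal(size=(m, d)), rng.normal(size=m),
                          x1, x2, m) for _ in range(20)]
    print(f"m={m:>6}: mean={np.mean(vals):.3f}, std={np.std(vals):.3f}")
```

The standard deviation across initializations shrinks with width, which is the "deterministic kernel" part of the statement; approximate constancy during training would additionally require tracking the kernel along a training trajectory.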
Mean-Field Regime
Consider a two-layer network $f(x) = \mathbb{E}_{\theta \sim \mu}[\phi(x; \theta)]$, where $\mu$ is the distribution of neuron parameters $\theta = (a, w)$ and $\phi(x; \theta) = a \, \sigma(w^\top x)$. With appropriate scaling (learning rate $\eta = \Theta(m)$ in mean-field parameterization), training evolves according to:
$$\partial_t \mu_t = \nabla_\theta \cdot \Big( \mu_t \, \nabla_\theta \frac{\delta R}{\delta \mu}[\mu_t] \Big),$$

where $R[\mu]$ is the population risk functional. This is a Wasserstein gradient flow. Features change throughout training.
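One way to see these dynamics concretely is a particle discretization: train a finite set of neurons under the mean-field parameterization $f(x) = \frac{1}{m}\sum_i a_i\,\sigma(w_i x)$ with an $m$-scaled learning rate, so each neuron takes $O(1)$ steps. A minimal sketch on a toy 1-d regression task (target and hyperparameters are illustrative, not from the source):

```python
# Sketch: particle discretization of the mean-field dynamics. The 1/m output
# scale cancels against the m-scaled learning rate, leaving an O(1) per-neuron
# gradient flow; the empirical measure over neurons approximates the
# Wasserstein flow on mu_t.
import numpy as np

rng = np.random.default_rng(0)
m, n_steps, lr = 200, 500, 0.5     # lr is the O(1) per-particle rate
X = np.linspace(-2, 2, 40)         # 1-d inputs
y = np.sin(X)                      # toy regression target

a = rng.normal(size=m)             # output weights
w = rng.normal(size=m)             # input weights

for _ in range(n_steps):
    h = np.tanh(np.outer(w, X))    # (m, n) hidden activations
    f = (a @ h) / m                # mean-field output scaling 1/m
    r = f - y                      # residuals
    # Per-particle gradients of the squared loss; the 1/m from the output
    # scale has been cancelled against the m-scaled learning rate.
    grad_a = h @ r / len(X)
    grad_w = (a[:, None] * (1 - h ** 2) * X[None, :]) @ r / len(X)
    a -= lr * grad_a
    w -= lr * grad_w

print("final mse:", np.mean((a @ np.tanh(np.outer(w, X)) / m - y) ** 2))
```

Because $f \approx 0$ at initialization under the $1/m$ scaling, all of the fit comes from neurons moving, i.e. from the measure evolving.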
Where Each Is Stronger
NTK wins on mathematical tractability
The NTK regime reduces neural network training to kernel regression, which is completely solved. You get closed-form expressions for training dynamics, convergence rates, and generalization bounds. The kernel is deterministic at initialization and does not change during training. This makes proofs clean and results precise.
For a network trained with squared loss, the training dynamics of the residual $r_t = f_t(X) - y$ become:

$$\frac{d r_t}{dt} = -H \, r_t,$$

where $H$ is the NTK Gram matrix with entries $H_{ij} = K_\infty(x_i, x_j)$. This is a linear ODE with explicit solution $r_t = e^{-Ht} r_0$.
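A minimal sketch of the closed-form solution, using an RBF Gram matrix as a stand-in for the NTK Gram matrix (any symmetric PSD $H$ works the same way):

```python
# Sketch: solve the linear residual ODE  dr/dt = -H r  in closed form via the
# eigendecomposition of the symmetric PSD Gram matrix H. The RBF kernel below
# is a stand-in for an actual NTK.
import numpy as np

rng = np.random.default_rng(0)
n = 8
X = rng.normal(size=(n, 2))
y = rng.normal(size=n)

# Stand-in PSD Gram matrix (RBF kernel between training points).
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
H = np.exp(-sq)

r0 = -y                         # residual at t=0, taking f_0 = 0
lam, U = np.linalg.eigh(H)      # H = U diag(lam) U^T

def residual(t):
    """Closed-form r(t) = exp(-H t) r0: each eigenmode decays as exp(-lam t)."""
    return U @ (np.exp(-lam * t) * (U.T @ r0))

for t in [0.0, 1.0, 10.0]:
    print(f"t={t:>4}: ||r(t)|| = {np.linalg.norm(residual(t)):.4f}")
```

Each eigencomponent of the residual decays exponentially; the slowest rate is the smallest eigenvalue of $H$.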
Mean-field wins on explaining feature learning
Real neural networks learn features. The first layer of a trained vision model learns edge detectors; the first layer of a language model learns token embeddings. NTK theory cannot explain this because in the NTK regime, features are frozen at their random initial values. Mean-field theory captures the evolution of features through the evolving measure $\mu_t$.
Empirically, networks trained at practical scales are closer to the mean-field regime than the NTK regime. The NTK approximation requires either very large width or very small learning rates, neither of which matches standard practice.
Where Each Fails
NTK fails to explain why neural networks outperform kernels
If NTK theory were the whole story, a neural network would perform identically to kernel regression with the NTK. But neural networks consistently outperform their corresponding NTK on practical tasks. This gap is precisely what feature learning buys you, and NTK theory misses it entirely.
Mean-field fails on finite-width networks
Mean-field theory requires $m \to \infty$ just as NTK does. The convergence rates and the quality of the approximation at finite width are less well understood than for NTK. The PDE describing the measure evolution is nonlinear and generally does not admit closed-form solutions. Proving global convergence in the mean-field regime requires assumptions (e.g., log-Sobolev inequalities) that are hard to verify for practical architectures.
Both fail for deep networks
Both theories are best understood for two-layer networks. Extensions to deep networks exist but are substantially more complex. For NTK, the kernel changes across layers and depth creates additional challenges. For mean-field, the interaction between layers makes the measure evolution coupled and harder to analyze.
Key Assumptions That Differ
| | NTK Regime | Mean-Field Regime |
|---|---|---|
| Width | $m \to \infty$ | $m \to \infty$ |
| Learning rate | Small: $\eta = \Theta(1)$ in NTK param, or $\eta = \Theta(1/m)$ in standard param | $\eta = \Theta(m)$ in mean-field param |
| Weight movement | $O(1/\sqrt{m})$ (vanishing) | $\Theta(1)$ (substantial) |
| Feature learning | No (kernel is fixed) | Yes (measure evolves) |
| Training dynamics | Linear ODE | Nonlinear PDE (Wasserstein flow) |
| Math tools | Kernel theory, random matrix theory | Optimal transport, PDEs on measure spaces |
The Interpolation: Parameterization Controls the Regime
Parameterization Interpolates Regimes
Statement
Consider the parameterization $f(x) = \frac{1}{m^{\gamma}} \sum_{i=1}^{m} a_i \, \sigma(w_i^\top x)$ with learning rate $\eta$. The effective learning rate per neuron is $\eta / m^{\gamma}$.
When $\eta / m^{\gamma} \to 0$ (small effective rate), the network is in the NTK/lazy regime: weights barely move and the kernel stays fixed.
When $\eta / m^{\gamma} = \Theta(1)$ (balanced scaling), the network is in the mean-field/rich regime: weights move substantially and features evolve.
Standard parameterization ($\gamma = 1/2$) with $\eta = \Theta(1)$ gives NTK. Mean-field parameterization ($\gamma = 1$) with $\eta = \Theta(m)$ gives mean-field.
Intuition
The choice of how you scale the output and the learning rate with width determines whether you get lazy or rich behavior. This is not a property of the network architecture but of the training setup. The same architecture can be in either regime depending on the parameterization.
What to Memorize
- NTK = lazy: Weights barely move, no feature learning, training is kernel regression.
- Mean-field = rich: Weights move substantially, features are learned, training is a PDE on measure space.
- The control knob: The ratio of learning rate to output scaling determines the regime. Larger effective learning rate pushes toward mean-field.
- The practical gap: Real networks at practical width and learning rate are closer to mean-field than NTK. NTK theory is mathematically elegant but does not explain why neural networks outperform kernel methods.
When a Researcher Would Use Each
Proving convergence guarantees
If you need a clean convergence proof for overparameterized networks, the NTK regime is the right tool. The linear dynamics give exponential convergence to zero training loss when the NTK Gram matrix is positive definite, which holds for sufficiently wide networks with distinct training points.
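The rate can be read off directly from the linearized residual dynamics: with $r_t = f_t(X) - y$,

$$\frac{d}{dt}\,\|r_t\|^2 = -2\, r_t^\top H\, r_t \;\le\; -2\,\lambda_{\min}(H)\,\|r_t\|^2 \quad\Longrightarrow\quad \|r_t\| \le e^{-\lambda_{\min}(H)\, t}\,\|r_0\|,$$

so positive definiteness of the Gram matrix ($\lambda_{\min}(H) > 0$) gives exponential convergence of the training loss.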
Understanding representation learning
If you want to study how a network learns task-relevant features from data, you need mean-field theory or the related maximal update parameterization (muP). NTK theory cannot capture this phenomenon by construction.
Hyperparameter transfer across widths
The maximal update parameterization (muP), which is closely connected to the mean-field regime, allows hyperparameters tuned on narrow networks to transfer to wider networks. This is a practical consequence of the mean-field scaling. NTK parameterization does not provide this transfer.
Common Confusions
NTK does not mean the network is literally a kernel method
The NTK regime means training dynamics are equivalent to kernel regression. The network itself is still a neural network with nonlinear activations. The point is that in the lazy regime, the nonlinearity is never exploited during training because the weights do not move enough to explore new features.
Mean-field does not mean finite-width networks learn features
Mean-field theory guarantees feature learning in the infinite-width limit with specific parameterization. Whether a finite-width network is in the lazy or rich regime depends on the width, learning rate, initialization scale, and training time. A very wide network with a very small learning rate can be in the lazy regime even at practical scales.
Both theories require infinite width
A common misunderstanding is that NTK is the infinite-width theory and mean-field is the finite-width theory. Both require . They differ in how other quantities (learning rate, initialization) scale with width. Finite-width corrections to both theories are active research areas.