
Comparison

NTK Regime vs. Mean-Field Regime

Two limiting theories of wide neural networks: NTK linearizes training dynamics around initialization (lazy regime), while mean-field theory captures feature learning through substantial weight movement (rich regime).

What Each Describes

Both NTK and mean-field theory describe the behavior of neural networks in the infinite-width limit. They answer the same question: what happens as the width $m \to \infty$? But they arrive at qualitatively different answers because they use different scalings of the learning rate and initialization with width.

NTK (Neural Tangent Kernel) describes the lazy regime. Weights stay close to initialization, the network behaves like a linear model in a fixed feature space, and training dynamics reduce to kernel regression with a deterministic kernel.

Mean-field theory describes the rich regime. Weights move substantially during training, the network learns new features, and the distribution of neurons evolves according to a Wasserstein gradient flow on a space of probability measures.

Side-by-Side Statement

Definition

NTK Regime

Consider a two-layer network $f(x; \theta) = \frac{1}{\sqrt{m}} \sum_{j=1}^m a_j \sigma(w_j^\top x)$ trained with learning rate $\eta = O(1)$ (or, equivalently, standard parameterization without the $1/\sqrt{m}$ factor and $\eta = O(1/m)$). In the limit $m \to \infty$:

$$f(x; \theta_t) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta_t - \theta_0)$$

Training dynamics become linear. The NTK $K(x, x') = \nabla_\theta f(x)^\top \nabla_\theta f(x')$ converges to a deterministic kernel and stays approximately constant throughout training.
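As a sanity check, the empirical NTK of a two-layer network can be computed directly from the analytic gradients. The numpy sketch below (toy inputs, tanh activation, all sizes invented for illustration) shows the kernel value concentrating around a deterministic limit as the width grows:

```python
import numpy as np

def ntk(x1, x2, W, a):
    """Empirical NTK of f(x) = (1/sqrt(m)) * sum_j a_j tanh(w_j . x)."""
    m = a.shape[0]
    h1, h2 = np.tanh(W @ x1), np.tanh(W @ x2)   # hidden activations
    d1, d2 = 1 - h1**2, 1 - h2**2               # tanh derivatives
    # grad wrt a_j is h_j / sqrt(m); grad wrt w_j is a_j * d_j * x / sqrt(m)
    return (h1 @ h2 + (a * d1) @ (a * d2) * (x1 @ x2)) / m

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=3), rng.normal(size=3)
stds = []
for m in (100, 10_000):
    ks = [ntk(x1, x2, rng.normal(size=(m, 3)), rng.normal(size=m))
          for _ in range(20)]
    stds.append(np.std(ks))
print(stds)   # spread across random initializations shrinks as m grows
```

The spread of the kernel value across independent initializations shrinks roughly like $1/\sqrt{m}$, which is the sense in which the NTK becomes deterministic at infinite width.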

Definition

Mean-Field Regime

Consider a two-layer network $f(x; \mu) = \int a \, \sigma(w^\top x) \, d\mu(a, w)$, where $\mu$ is the distribution of neuron parameters. With appropriate scaling (learning rate $\eta = O(1)$ in mean-field parameterization), training evolves $\mu_t$ according to:

$$\partial_t \mu_t = \nabla_\theta \cdot \Big( \mu_t \, \nabla_\theta \tfrac{\delta \Phi}{\delta \mu}[\mu_t] \Big)$$

where $\Phi(\mu)$ is the population risk functional and $\frac{\delta \Phi}{\delta \mu}$ is its first variation (a function of the neuron parameters). This is a Wasserstein gradient flow. Features change throughout training.
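At finite $m$, this flow is just gradient descent on the neurons, viewed as $m$ interacting particles sampled from $\mu_0$. A minimal numpy sketch (toy regression data, tanh units; the target, sizes, and step count are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 2, 300
X = rng.normal(size=(n, d))
y = np.tanh(X @ np.array([1.0, -1.0]))          # toy regression target

# Particles (a_j, w_j) sampled from mu_0; f(x) = (1/m) sum_j a_j tanh(w_j . x)
a, W = rng.normal(size=m), rng.normal(size=(m, d))
W0 = W.copy()

def risk(a, W):
    return 0.5 * np.mean((np.tanh(X @ W.T) @ a / m - y) ** 2)

risk0, eta = risk(a, W), 0.2
for _ in range(2000):
    H = np.tanh(X @ W.T)                         # (n, m) activations
    r = H @ a / m - y                            # residuals
    # Per-particle gradients of the empirical risk
    ga = H.T @ r / (m * n)
    gW = ((a * (1 - H ** 2)) * r[:, None]).T @ X / (m * n)
    # The factor m in the step implements the mean-field time scale:
    # eta * m * grad gives O(1) motion per particle
    a -= eta * m * ga
    W -= eta * m * gW

print(risk0, risk(a, W))                         # risk decreases
print(np.linalg.norm(W - W0) / np.sqrt(m))       # per-particle movement stays O(1)
```

The factor of $m$ in the step size compensates for the $1/m$ output scaling; without it the particles would freeze as $m$ grows, which is exactly the lazy regime.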

Where Each Is Stronger

NTK wins on mathematical tractability

The NTK regime reduces neural network training to kernel regression, which is completely solved. You get closed-form expressions for training dynamics, convergence rates, and generalization bounds. The kernel is deterministic at initialization and does not change during training. This makes proofs clean and results precise.

For a network trained with squared loss, the training dynamics become:

$$\frac{d}{dt} f_t = -K (f_t - y)$$

where $K$ is the NTK Gram matrix. This is a linear ODE with explicit solution.
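Since $K$ is constant in this regime, the solution is $f_t = y + e^{-Kt}(f_0 - y)$. A quick numpy check of the closed form against explicit Euler integration (a random positive-definite matrix stands in for the Gram matrix; sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
G = rng.normal(size=(n, n))
K = G @ G.T + 0.1 * np.eye(n)         # stand-in positive-definite NTK Gram matrix
y = rng.normal(size=n)                # training labels
f0 = np.zeros(n)                      # predictions at initialization

# Closed form f_t = y + exp(-K t)(f0 - y), via eigendecomposition of K
lam, U = np.linalg.eigh(K)
def f(t):
    return y + U @ (np.exp(-lam * t) * (U.T @ (f0 - y)))

# Explicit Euler on df/dt = -K (f - y) should match the closed form
ft, dt = f0.copy(), 1e-4
for _ in range(int(2.0 / dt)):
    ft = ft - dt * K @ (ft - y)

print(np.max(np.abs(f(2.0) - ft)))    # small discretization error
print(np.max(np.abs(f(50.0) - y)))    # exponential convergence to the labels
```

Convergence is exponential in each eigendirection of $K$, with rate given by the corresponding eigenvalue; positive definiteness of the Gram matrix is what guarantees the training loss goes to zero.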

Mean-field wins on explaining feature learning

Real neural networks learn features. The first layer of a trained vision model learns edge detectors; the first layer of a language model learns token embeddings. NTK theory cannot explain this because in the NTK regime, features are frozen at their random initial values. Mean-field theory captures the evolution of features through the evolving measure $\mu_t$.

Empirically, networks trained at practical scales are closer to the mean-field regime than the NTK regime. The NTK approximation requires either very large width or very small learning rates, neither of which matches standard practice.

Where Each Fails

NTK fails to explain why neural networks outperform kernels

If NTK theory were the whole story, a neural network would perform identically to kernel regression with the NTK. But neural networks consistently outperform their corresponding NTK on practical tasks. This gap is precisely what feature learning buys you, and NTK theory misses it entirely.

Mean-field fails on finite-width networks

Mean-field theory requires $m \to \infty$ just as NTK does. The convergence rates and the quality of the approximation at finite width are less well understood than for NTK. The PDE describing the measure evolution is nonlinear and generally does not admit closed-form solutions. Proving global convergence in the mean-field regime requires assumptions (e.g., log-Sobolev inequalities) that are hard to verify for practical architectures.

Both fail for deep networks

Both theories are best understood for two-layer networks. Extensions to deep networks exist but are substantially more complex. For NTK, the kernel changes across layers and depth creates additional challenges. For mean-field, the interaction between layers makes the measure evolution coupled and harder to analyze.

Key Assumptions That Differ

| | NTK Regime | Mean-Field Regime |
|---|---|---|
| Width | $m \to \infty$ | $m \to \infty$ |
| Learning rate | $\eta = O(1)$ in NTK param ($O(1/m)$ in standard param) | $\eta = O(1)$ in mean-field param |
| Weight movement | $\|\theta_t - \theta_0\| = O(1/\sqrt{m})$ | $\|\theta_t - \theta_0\| = O(1)$ |
| Feature learning | No (kernel is fixed) | Yes (measure evolves) |
| Training dynamics | Linear ODE | Nonlinear PDE (Wasserstein flow) |
| Math tools | Kernel theory, random matrix theory | Optimal transport, PDEs on measure spaces |

The Interpolation: Parameterization Controls the Regime

Theorem

Parameterization Interpolates Regimes

Statement

Consider the parameterization $f(x) = \frac{1}{m^\alpha} \sum_{j=1}^m a_j \sigma(w_j^\top x)$ with learning rate $\eta = m^\beta$, where $\beta$ is chosen so that the function-space dynamics stay $O(1)$ as $m \to \infty$ (this forces $\beta = 2\alpha - 1$). The per-neuron weight movement over training then scales as $m^{\alpha - 1}$.

When $\alpha < 1$ (vanishing per-neuron movement), the network is in the NTK/lazy regime: weights barely move and the kernel stays fixed.

When $\alpha = 1$ ($O(1)$ per-neuron movement), the network is in the mean-field/rich regime: weights move substantially and features evolve.

NTK parameterization ($\alpha = 1/2$) with $\eta = O(1)$ gives NTK. Mean-field parameterization ($\alpha = 1$) with $\eta = O(m)$ (equivalently, $\eta = O(1)$ after rescaling time by $m$) gives mean-field.

Intuition

The choice of how you scale the output and the learning rate with width determines whether you get lazy or rich behavior. This is not a property of the network architecture but of the training setup. The same architecture can be in either regime depending on the parameterization.
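This can be checked numerically by training the same tiny architecture under the two parameterizations and measuring how far the first-layer weights travel. A sketch in numpy (toy data; the step counts, sizes, and base learning rate are invented for illustration):

```python
import numpy as np

def rel_movement(m, param, steps=500, seed=0):
    """Train f(x) = scale * sum_j a_j tanh(w_j . x) by gradient descent and
    return the relative first-layer movement ||W_T - W_0|| / ||W_0||."""
    rng = np.random.default_rng(seed)
    n, d = 40, 3
    X = rng.normal(size=(n, d))
    y = np.tanh(X @ np.ones(d))                       # toy target
    a, W = rng.normal(size=m), rng.normal(size=(m, d))
    W0 = W.copy()
    # NTK scaling: 1/sqrt(m) output with eta = O(1);
    # mean-field scaling: 1/m output with eta = O(m)
    scale, eta = (m ** -0.5, 0.2) if param == "ntk" else (1.0 / m, 0.2 * m)
    for _ in range(steps):
        H = np.tanh(X @ W.T)                          # (n, m) activations
        r = scale * (H @ a) - y                       # residuals
        ga = scale * (H.T @ r) / n
        gW = scale * ((a * (1 - H ** 2)) * r[:, None]).T @ X / n
        a -= eta * ga
        W -= eta * gW
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

for m in (100, 1600):
    print(m, rel_movement(m, "ntk"), rel_movement(m, "mean_field"))
```

Under the NTK scaling the relative movement shrinks as the width grows, while under the mean-field scaling it stays roughly constant: the same architecture, two regimes, selected purely by the training setup.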

What to Memorize

  1. NTK = lazy: Weights barely move, no feature learning, training is kernel regression.

  2. Mean-field = rich: Weights move substantially, features are learned, training is a PDE on measure space.

  3. The control knob: How the output scale and learning rate are scaled with width determines the regime, not the architecture. Larger effective per-neuron updates push toward mean-field.

  4. The practical gap: Real networks at practical width and learning rate are closer to mean-field than NTK. NTK theory is mathematically elegant but does not explain why neural networks outperform kernel methods.

When a Researcher Would Use Each

Example

Proving convergence guarantees

If you need a clean convergence proof for overparameterized networks, the NTK regime is the right tool. The linear dynamics give exponential convergence to zero training loss when the NTK Gram matrix is positive definite, which holds for sufficiently wide networks with distinct training points.

Example

Understanding representation learning

If you want to study how a network learns task-relevant features from data, you need mean-field theory or the related maximal update parameterization (muP). NTK theory cannot capture this phenomenon by construction.

Example

Hyperparameter transfer across widths

The maximal update parameterization (muP), which is closely connected to the mean-field regime, allows hyperparameters tuned on narrow networks to transfer to wider networks. This is a practical consequence of the mean-field scaling. NTK parameterization does not provide this transfer.

Common Confusions

Watch Out

NTK does not mean the network is literally a kernel method

The NTK regime means training dynamics are equivalent to kernel regression. The network itself is still a neural network with nonlinear activations. The point is that in the lazy regime, the nonlinearity is never exploited during training because the weights do not move enough to explore new features.

Watch Out

Mean-field does not mean finite-width networks learn features

Mean-field theory guarantees feature learning in the infinite-width limit with specific parameterization. Whether a finite-width network is in the lazy or rich regime depends on the width, learning rate, initialization scale, and training time. A very wide network with a very small learning rate can be in the lazy regime even at practical scales.

Watch Out

Both theories require infinite width

A common misunderstanding is that NTK is the infinite-width theory and mean-field is the finite-width theory. Both require $m \to \infty$. They differ in how other quantities (learning rate, initialization) scale with width. Finite-width corrections to both theories are active research areas.