

Mean Field Theory

The mean field limit of neural networks: as width goes to infinity under the right scaling, neurons become independent particles whose weight distribution evolves by Wasserstein gradient flow, capturing feature learning that the NTK regime misses.


Why This Matters

The Neural Tangent Kernel shows that infinitely wide networks in the lazy regime behave like kernel methods --- but this is precisely the regime where networks do not learn features. Practical neural networks learn representations, and NTK cannot explain this.

Mean field theory provides an alternative infinite-width limit that does capture feature learning. Under a different parameterization (mean field scaling instead of NTK scaling), the network weights move substantially during training. In the infinite-width limit, individual neurons become independent, and the distribution of weights evolves according to a partial differential equation --- specifically, a Wasserstein gradient flow.

This is the theoretical framework for understanding what happens beyond the kernel regime, making it one of the most important directions in modern deep learning theory.

Mental Model

Think of a two-layer neural network as a collection of $m$ "particles" (neurons), each with a weight vector $w_j$. Each particle contributes $a_j \sigma(w_j^\top x) / m$ to the output. As $m \to \infty$, the sum becomes an integral over a probability distribution $\mu$ of weights:

$$f(x; \mu) = \int a \, \sigma(w^\top x) \, d\mu(a, w)$$

Training the network is equivalent to moving the particles $\{(a_j, w_j)\}_{j=1}^m$ by gradient descent. In the infinite-width limit, this becomes evolving the distribution $\mu$ by a continuous flow. The distribution moves in the direction that decreases the loss fastest --- this is Wasserstein gradient flow.

The key difference from NTK: in NTK, each particle barely moves (order $1/\sqrt{m}$ displacement). In mean field, particles move substantially (order 1 displacement). This substantial movement is what enables feature learning.

Formal Setup

Consider a two-layer neural network:

$$f_m(x) = \frac{1}{m}\sum_{j=1}^m a_j \sigma(w_j^\top x)$$

where $\theta_j = (a_j, w_j) \in \mathbb{R}^{d+1}$ are the parameters of neuron $j$, and $\sigma$ is an activation function (e.g., ReLU).

Definition

Mean Field Parameterization

In the mean field parameterization, the network output scales as $1/m$ (one factor of width in the denominator):

$$f_m(x) = \frac{1}{m}\sum_{j=1}^m \phi(x; \theta_j)$$

where $\phi(x; \theta) = a \, \sigma(w^\top x)$ is the contribution of a single neuron with parameters $\theta = (a, w)$.

Compare to the NTK parameterization, $f_m(x) = \frac{1}{\sqrt{m}}\sum_{j=1}^m a_j \sigma(w_j^\top x)$, where the output scales as $1/\sqrt{m}$ and individual parameters move only $O(1/\sqrt{m})$ during training.

The $1/m$ scaling in mean field is crucial: it means each neuron's contribution is order $1/m$, so the network depends on the distribution of neurons, not on any individual one.
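
A quick numerical illustration of this point (a sketch, not taken from the text: the Gaussian initialization, ReLU activation, and the specific widths are assumptions). Under the $1/m$ scaling, deleting a single neuron perturbs the output by roughly $1/m$:

```python
# Sketch: under 1/m scaling, each neuron's contribution to the output is O(1/m).
# Assumed setup: standard Gaussian weights, ReLU activation, a single input in R^8.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)

for m in [100, 1_000, 10_000]:
    a = rng.normal(size=m)
    W = rng.normal(size=(m, 8))
    phi = a * np.maximum(W @ x, 0.0)   # per-neuron contributions a_j * relu(w_j . x)
    f_mf = phi.mean()                  # mean field output: (1/m) * sum_j phi_j
    f_drop = phi[1:].mean()            # same network with one neuron removed
    print(f"m={m:6d}  f_mf={f_mf:+.4f}  |change from removing one neuron| = {abs(f_mf - f_drop):.2e}")
```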

Definition

Empirical Measure of Neurons

The empirical measure of the neuron parameters is:

$$\mu_m = \frac{1}{m}\sum_{j=1}^m \delta_{\theta_j}$$

where $\delta_{\theta_j}$ is a point mass at $\theta_j$. As $m \to \infty$, if the neurons are initialized i.i.d. from some distribution $\mu_0$, then $\mu_m \to \mu_0$ by the law of large numbers. The network output becomes:

$$f(x; \mu) = \int \phi(x; \theta) \, d\mu(\theta)$$

This is a linear functional of the measure $\mu$.
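
A small Monte Carlo check of this limit (a sketch; the choice of $\mu_0$ below, a Gaussian with mean-one output weights, is an assumption made here so that the limiting integral is not trivially zero). The finite-width average $\frac{1}{m}\sum_j \phi(x; \theta_j)$ approaches $\int \phi(x; \theta)\, d\mu_0(\theta)$, which is estimated with a very large sample:

```python
# Sketch: the empirical average over i.i.d. neurons converges to the integral
# against mu_0 (law of large numbers). mu_0 here is an assumed example:
# w ~ N(0, I_4), a ~ N(1, 0.25), activation ReLU.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)

def f_hat(m):
    a = 1.0 + 0.5 * rng.normal(size=m)
    W = rng.normal(size=(m, 4))
    return np.mean(a * np.maximum(W @ x, 0.0))   # (1/m) sum_j a_j relu(w_j . x)

reference = f_hat(1_000_000)                      # stand-in for the integral over mu_0
for m in [100, 1_000, 10_000, 100_000]:
    print(f"m={m:7d}  |f_m(x) - f(x; mu_0)| ~ {abs(f_hat(m) - reference):.4f}")
```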

Definition

Wasserstein Gradient Flow

The Wasserstein gradient flow is the continuous-time evolution of a probability measure $\mu_t$ that follows the steepest descent direction of a functional $\mathcal{L}(\mu)$ in the Wasserstein-2 metric space:

$$\partial_t \mu_t = \nabla \cdot \left(\mu_t \nabla \frac{\delta \mathcal{L}}{\delta \mu}(\mu_t)\right)$$

where $\frac{\delta \mathcal{L}}{\delta \mu}$ is the first variation (functional derivative) of $\mathcal{L}$ with respect to $\mu$.

In the neural network context, $\mathcal{L}(\mu) = L(f(\cdot; \mu))$ is the training loss viewed as a functional of the weight distribution. The first variation $\frac{\delta \mathcal{L}}{\delta \mu}(\mu, \theta)$ measures how the loss changes when an infinitesimal mass of neurons is added at $\theta$; its gradient in $\theta$, evaluated at the current distribution, plays the role of the gradient seen by a single neuron located at $\theta$.
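
As a concrete instance (a sketch under the assumption of an empirical-risk loss with a differentiable per-example loss $\ell$; the dataset notation below is introduced here for illustration, not taken from the setup above): because $f(x; \mu)$ is linear in $\mu$, the first variation has a simple closed form,

$$\mathcal{L}(\mu) = \frac{1}{n}\sum_{i=1}^n \ell\bigl(f(x_i; \mu),\, y_i\bigr), \qquad \frac{\delta \mathcal{L}}{\delta \mu}(\mu, \theta) = \frac{1}{n}\sum_{i=1}^n \partial_1 \ell\bigl(f(x_i; \mu),\, y_i\bigr)\, \phi(x_i; \theta),$$

so a neuron (particle) sitting at $\theta$ feels the gradient

$$\nabla_\theta \frac{\delta \mathcal{L}}{\delta \mu}(\mu, \theta) = \frac{1}{n}\sum_{i=1}^n \partial_1 \ell\bigl(f(x_i; \mu),\, y_i\bigr)\, \nabla_\theta \phi(x_i; \theta),$$

where $\partial_1 \ell$ denotes the derivative of $\ell$ in its first argument.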

Main Theorems

Theorem

Mean Field Limit for Two-Layer Networks

Statement

Consider a two-layer network with $m$ neurons trained by gradient flow on loss $L$. Under regularity conditions on $\sigma$ and $L$, as $m \to \infty$:

  1. The empirical measure $\mu_m(t)$ converges weakly to a deterministic measure $\mu(t)$ for all $t \geq 0$
  2. The limiting measure $\mu(t)$ satisfies the mean field PDE:

$$\partial_t \mu_t = \nabla \cdot \left(\mu_t \nabla_\theta \frac{\delta \mathcal{L}}{\delta \mu}(\mu_t, \theta)\right)$$

  3. The network output $f_m(x; t)$ converges to $f(x; \mu_t) = \int \phi(x; \theta) \, d\mu_t(\theta)$
  4. Each neuron evolves independently in the limit, following $\dot{\theta}_j = -\nabla_\theta \frac{\delta \mathcal{L}}{\delta \mu}(\mu_t, \theta_j)$, where $\mu_t$ is the population-level distribution

Intuition

The $1/m$ scaling means each individual neuron has a vanishing effect on the total output. As $m \to \infty$, changing one neuron does not affect the loss gradient seen by other neurons. This is the "propagation of chaos" phenomenon: interacting particles become independent in the many-particle limit. Each neuron follows its own gradient as if the distribution $\mu_t$ were fixed --- but $\mu_t$ itself evolves self-consistently as the aggregate of all neurons.

This is analogous to how individual gas molecules become effectively independent in the thermodynamic limit, even though they all interact via the mean field.

Proof Sketch

The proof uses propagation of chaos techniques from mathematical physics:

Step 1: Show that the gradient update for neuron $j$ depends on the other neurons only through the empirical measure $\mu_m$. The gradient is $\nabla_{\theta_j} L = \frac{1}{m}\nabla_\theta \frac{\delta \mathcal{L}}{\delta \mu}(\mu_m, \theta_j)$, so gradient flow run on the accelerated time scale $mt$ (equivalently, with a learning rate scaled by $m$) gives each neuron the velocity $-\nabla_\theta \frac{\delta \mathcal{L}}{\delta \mu}(\mu_m, \theta_j)$.

Step 2: Show that $\mu_m(t)$ concentrates around a deterministic trajectory $\mu(t)$. This uses the law of large numbers for interacting particle systems: as $m \to \infty$, the empirical measure of particles that are initialized i.i.d. and interact only through their empirical measure converges to the solution of the mean field PDE.

Step 3: Verify that the limiting PDE is well-posed (existence and uniqueness of solutions) under the regularity assumptions on $\sigma$.

Why It Matters

This theorem says that infinitely wide mean-field networks are described by a PDE, not by a kernel. The distribution $\mu_t$ evolves nontrivially during training --- the neurons move to new locations in parameter space. This is feature learning: the features $\sigma(w_j^\top x)$ change during training because the $w_j$ change. NTK theory, by contrast, freezes the features at their initialization.

The mean field limit shows that feature learning is not a finite-width artifact --- it persists at infinite width under the right scaling.

Failure Mode

The regularity conditions on $\sigma$ typically require smoothness, which excludes ReLU. Extensions to ReLU exist but require more delicate analysis. The convergence rate is typically $O(1/\sqrt{m})$ in Wasserstein distance, which is slow. Most importantly, the mean field limit applies cleanly only to two-layer networks. Extending to deep networks requires multi-layer mean field theories that are still under active development.

Proposition

Training Loss Decreases Along Wasserstein Gradient Flow

Statement

Along the Wasserstein gradient flow $\mu_t$, the training loss $\mathcal{L}(\mu_t)$ is non-increasing:

$$\frac{d}{dt}\mathcal{L}(\mu_t) = -\int \left\|\nabla_\theta \frac{\delta \mathcal{L}}{\delta \mu}(\mu_t, \theta)\right\|^2 d\mu_t(\theta) \leq 0$$

The loss decreases at a rate proportional to the expected squared gradient norm under the current weight distribution.

Intuition

This is the infinite-width analogue of "gradient descent decreases the loss." Each neuron moves in its negative gradient direction, and the aggregate effect is a decrease in the loss. The rate of decrease depends on how large the gradients are on average under $\mu_t$. The flow stops (reaches a critical point) when the gradient is zero $\mu_t$-almost everywhere.
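
A small numerical sanity check (a sketch: the toy dataset, width, bias-augmented input, and step size are assumptions, and exact monotonicity holds only for the continuous-time flow, so a small discrete step is used). It runs the finite-$m$ particle dynamics under mean field scaling and tracks the training loss:

```python
# Sketch: finite-m particle dynamics under mean field scaling on an assumed toy
# 1D regression problem (input augmented with a constant bias coordinate).
# For a small step, the loss should be approximately non-increasing.
import numpy as np

rng = np.random.default_rng(2)
x1 = np.linspace(-1.0, 1.0, 64)
X = np.stack([x1, np.ones_like(x1)], axis=1)     # (n, 2), second column is a bias input
y = np.abs(x1) - 0.5                             # assumed toy target
m, dt = 500, 0.02                                # dt = step size in mean-field time
a, W = rng.normal(size=m), rng.normal(size=(m, 2))

losses = []
for _ in range(2001):
    pre = X @ W.T                                # (n, m) pre-activations
    h = np.maximum(pre, 0.0)                     # ReLU features
    f = h @ a / m                                # mean field output
    losses.append(0.5 * np.mean((f - y) ** 2))
    r = (f - y) / len(y)                         # d(loss)/d(output), squared loss
    grad_a = h.T @ r / m                         # = (1/m) * gradient of the first variation in a
    grad_W = ((pre > 0) * r[:, None]).T @ X * a[:, None] / m
    a -= dt * m * grad_a                         # learning rate scaled by m: one particle step
    W -= dt * m * grad_W

print("loss at steps 0/500/1000/2000:", [round(losses[k], 4) for k in (0, 500, 1000, 2000)])
print("fraction of non-increasing steps:", float(np.mean(np.diff(losses) <= 0)))
```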

Proof Sketch

By the chain rule in Wasserstein space:

$$\frac{d}{dt}\mathcal{L}(\mu_t) = \int \frac{\delta \mathcal{L}}{\delta \mu}(\mu_t, \theta) \, \partial_t \mu_t(\theta)\, d\theta$$

Substituting the mean field PDE and integrating by parts:

$$= -\int \left\|\nabla_\theta \frac{\delta \mathcal{L}}{\delta \mu}\right\|^2 d\mu_t \leq 0$$

The integration by parts moves the divergence operator $\nabla \cdot$ onto the first variation, producing the squared gradient norm with a negative sign.

Why It Matters

This guarantees convergence to a critical point (under appropriate compactness conditions). Combined with global optimality results for over-parameterized mean field networks, it shows that gradient flow on infinitely wide networks finds good solutions. The key advantage over NTK is that this convergence happens while the features are being learned.

Failure Mode

The critical point reached by the flow may be a saddle point or local minimum, not a global minimum. Global convergence results require additional assumptions (e.g., convexity of the loss functional in Wasserstein space, or specific properties of the activation function).

Mean Field vs. NTK: The Central Comparison

| Property | NTK (Lazy Regime) | Mean Field (Rich Regime) |
| --- | --- | --- |
| Parameterization | $1/\sqrt{m}$ scaling | $1/m$ scaling |
| Weight movement | $O(1/\sqrt{m})$ --- vanishing | $O(1)$ --- substantial |
| Features | Frozen at initialization | Learned during training |
| Infinite-width limit | Kernel regression (linear) | Wasserstein gradient flow (nonlinear) |
| Mathematical tool | Kernel theory, RKHS | PDE, optimal transport |
| Feature learning | No | Yes |
| Captures practice | Poorly | Better (but still limited) |

The fundamental reason for the difference: under the $1/\sqrt{m}$ NTK scaling, each neuron's weights change by only order $1/\sqrt{m}$ during training, so the linearization of the network around initialization stays accurate. The $1/m$ mean field scaling means the output depends on the average over $m$ neurons. Each neuron can move substantially (order 1) because its individual contribution to the output is only $1/m$.
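
This contrast can be probed numerically. The sketch below (the toy target, bias-augmented input, and all hyperparameters are assumptions; exact numbers will vary) trains the same two-layer ReLU network under both parameterizations and reports the average displacement of individual first-layer weight vectors. The mean field run uses a learning rate scaled by $m$, matching the accelerated time scale of the mean field limit, so both runs cover a comparable amount of gradient-flow time:

```python
# Sketch: compare per-neuron weight movement under NTK (1/sqrt(m)) and
# mean field (1/m) scaling on an assumed toy regression problem.
import numpy as np

rng = np.random.default_rng(3)
x1 = np.linspace(-1.0, 1.0, 128)
X = np.stack([x1, np.ones_like(x1)], axis=1)        # constant bias coordinate appended
y = np.abs(x1) - 0.5                                 # assumed toy target

def train(m, scaling, steps=4000, dt=0.05):
    a, W = rng.normal(size=m), rng.normal(size=(m, 2))
    W0 = W.copy()
    norm = m if scaling == "mean_field" else np.sqrt(m)
    lr = dt * m if scaling == "mean_field" else dt   # mean-field time runs m times faster
    for _ in range(steps):
        pre = X @ W.T
        h = np.maximum(pre, 0.0)
        f = h @ a / norm
        r = (f - y) / len(y)
        grad_a = h.T @ r / norm
        grad_W = ((pre > 0) * r[:, None]).T @ X * a[:, None] / norm
        a, W = a - lr * grad_a, W - lr * grad_W
    loss = 0.5 * np.mean((np.maximum(X @ W.T, 0.0) @ a / norm - y) ** 2)
    return loss, np.mean(np.linalg.norm(W - W0, axis=1))

for m in [100, 400, 1600]:
    for scaling in ("ntk", "mean_field"):
        loss, disp = train(m, scaling)
        print(f"m={m:5d}  {scaling:10s}  loss={loss:.4f}  mean per-neuron |w - w0| = {disp:.3f}")
```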

Canonical Examples

Example

Mean field dynamics for a toy problem

Consider fitting the target $\mathrm{sign}(x)$ with a two-layer ReLU network on $[-1, 1]$. Under NTK: the random features $\sigma(w_j^\top x)$ are fixed, and the network can only learn a linear combination of these random features. This is like trying to approximate a step function using a fixed random basis.

Under mean field: the weight distribution $\mu_t$ evolves, and neurons move to parameter values that help represent the jump at $x = 0$. The neurons discover that placing features near the discontinuity is useful. This is feature learning in action: the network adapts its features to the target function, rather than relying on random features.

Common Confusions

Watch Out

Mean field does not mean mean field approximation from physics

In statistical physics, "mean field approximation" means replacing interactions with their average --- an approximation that becomes exact in certain limits. In the neural network context, the mean field limit is not an approximation of a finite-width network. It is the exact limit as width goes to infinity under the $1/m$ scaling. The name comes from the same mathematical structure (propagation of chaos, interacting particle systems), but it is a theorem, not an approximation.

Watch Out

Mean field is not a replacement for NTK --- they describe different scaling regimes

NTK and mean field describe different limits of the same architecture with different parameterizations. Neither is "wrong." The question is which limit better describes the behavior of a practical network. For very wide networks with small learning rate, NTK is relevant. For networks that learn features (which includes most practical networks), the mean field perspective is more informative.

Watch Out

Mean field theory is currently limited to shallow networks

The cleanest mean field results are for two-layer networks. Deep mean field theory exists (e.g., through tensor programs) but is substantially more complex. The infinite-width limit for deep networks depends on the order in which layers are taken to infinity, and different orderings give different limits.

Summary

  • Mean field parameterization uses $1/m$ scaling; NTK uses $1/\sqrt{m}$
  • At infinite width, neurons become independent particles (propagation of chaos)
  • The weight distribution evolves by Wasserstein gradient flow (a PDE)
  • Mean field captures feature learning: weights move $O(1)$, not $O(1/\sqrt{m})$
  • NTK freezes features (lazy regime); mean field learns features (rich regime)
  • The mean field limit is a PDE, not a kernel --- structurally different mathematical object
  • Currently best understood for two-layer networks; deep extensions are active research
  • Mean field is the right framework for understanding why neural networks outperform kernel methods

Exercises

Exercise (Advanced)

Problem

Consider a two-layer network $f_m(x) = \frac{1}{m}\sum_{j=1}^m a_j \sigma(w_j x)$ with scalar input $x \in \mathbb{R}$. Under the mean field limit, write the network output as an integral over the weight distribution $\mu$ and compute the first variation $\frac{\delta \mathcal{L}}{\delta \mu}(\theta)$ for the squared loss $\mathcal{L}(\mu) = \frac{1}{2}(f(x_0; \mu) - y_0)^2$ at a single data point $(x_0, y_0)$.

Exercise (Advanced)

Problem

Explain why the NTK parameterization ($1/\sqrt{m}$ scaling) prevents feature learning in the infinite-width limit, while the mean field parameterization ($1/m$ scaling) allows it. Consider the magnitude of the gradient update for a single neuron's weight $w_j$ in both cases.

Exercise (Research)

Problem

The mean field PDE is a Wasserstein gradient flow. Explain what it means for the loss functional $\mathcal{L}(\mu)$ to be displacement convex in the Wasserstein-2 metric, and why displacement convexity would guarantee global convergence of the mean field dynamics to the global minimum. Does displacement convexity hold for typical neural network loss functionals?


References

Canonical:

  • Mei, Montanari, Nguyen, "A Mean Field View of the Landscape of Two-Layer Neural Networks" (PNAS 2018)
  • Chizat & Bach, "On the Global Convergence of Gradient Descent for Over-Parameterized Models using Optimal Transport" (NeurIPS 2018)

Current:

  • Rotskoff & Vanden-Eijnden, "Trainability and Accuracy of Artificial Neural Networks" (CPAM 2022)
  • Yang and Hu, "Tensor Programs" series (2020-2023) --- extends mean field ideas to deep networks via the feature learning limit
