Modern Generalization
Lazy vs Feature Learning
The fundamental dichotomy in neural network training: lazy regime (NTK, kernel-like, weights barely move) versus rich/feature learning regime (weights move substantially, representations emerge).
Why This Matters
When you train a neural network, one of two qualitatively different things can happen. Either the weights barely move from initialization and the network behaves like a fixed kernel method (the lazy regime), or the weights move substantially and the network learns new representations of the data (the feature learning regime).
This distinction is not academic. It determines whether deep learning is more powerful than kernel methods, whether depth and architecture actually matter, or whether neural networks are just expensive kernel machines.
Mental Model
Think of a neural network at initialization as a starting point in weight space. Training moves the weights by gradient descent. In the lazy regime, the weights move so little that the network is well approximated by its first-order Taylor expansion around initialization. In the feature learning regime, the weights move far enough that the Taylor approximation breaks down and the network genuinely reorganizes its internal representations.
The question "which regime am I in?" is controlled by two knobs: network width and learning rate (relative to width).
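This picture can be checked numerically. Below is a minimal numpy sketch (an illustrative toy; the width, input dimension, and step size are arbitrary choices): take one gradient-sized parameter step in a wide one-hidden-layer tanh network and compare the exact new output with the first-order Taylor prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 4096, 10               # hidden width and input dimension (arbitrary)
W = rng.normal(size=(m, d))   # first layer, N(0, 1) entries
a = rng.normal(size=m)        # output layer, N(0, 1) entries

def f(W, a, x):
    # NTK-style parameterization: output scaled by 1/sqrt(m)
    return a @ np.tanh(W @ x) / np.sqrt(m)

x = rng.normal(size=d)
h = np.tanh(W @ x)

# Gradients of f with respect to W and a at initialization.
gW = np.outer(a * (1 - h**2), x) / np.sqrt(m)
ga = h / np.sqrt(m)

# One gradient-sized step in parameter space.
eta = 0.1
exact = f(W + eta * gW, a + eta * ga, x)
linear = f(W, a, x) + eta * (np.sum(gW**2) + ga @ ga)  # first-order Taylor

print(abs(exact - linear))  # Taylor remainder: tiny relative to the O(1) change in f
```

At this width the remainder is orders of magnitude smaller than the change in the output, which is exactly the lazy-regime picture; shrinking the width makes the linearization visibly worse.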
The Lazy Regime
Lazy Training / NTK Regime
A neural network is in the lazy regime if, throughout training, the parameters remain close to their initialization in the sense that the function computed by the network is well approximated by its linearization:

$$f(x; \theta) \approx f(x; \theta_0) + \langle \nabla_\theta f(x; \theta_0),\ \theta - \theta_0 \rangle$$
In this regime, the network behaves as a kernel method with the Neural Tangent Kernel $\Theta(x, x') = \langle \nabla_\theta f(x; \theta_0),\ \nabla_\theta f(x'; \theta_0) \rangle$.
In the lazy regime, the features (the kernel) are fixed at initialization. The network only learns the output-layer linear combination. This is exactly kernel regression with the NTK.
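A minimal numpy sketch of this correspondence, assuming a one-hidden-layer tanh network: build the empirical NTK Gram matrix from parameter gradients at initialization, then run kernel ridge regression with it. In the lazy regime this is, up to the ridge term, what training the network computes.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, n = 2000, 5, 8          # width, input dim, sample count (arbitrary)
W = rng.normal(size=(m, d))
a = rng.normal(size=m)
X = rng.normal(size=(n, d))

def grads(x):
    """Flattened gradient of f(x) = a @ tanh(W x) / sqrt(m) w.r.t. (W, a)."""
    h = np.tanh(W @ x)
    gW = np.outer(a * (1 - h**2), x) / np.sqrt(m)
    ga = h / np.sqrt(m)
    return np.concatenate([gW.ravel(), ga])

G = np.stack([grads(x) for x in X])   # n x (number of parameters)
K = G @ G.T                           # empirical NTK Gram matrix

# Lazy-regime training = kernel (ridge) regression with K.
y = rng.normal(size=n)
lam = 1e-6
alpha = np.linalg.solve(K + lam * np.eye(n), y)
preds = K @ alpha
print(np.max(np.abs(preds - y)))      # near-interpolation of the training labels
```

The kernel here is fixed by the random initialization; only the coefficients `alpha` depend on the data, mirroring the "output-layer linear combination" point above.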
When Does the Lazy Regime Arise?
The key result: as the width $m \to \infty$ with standard parameterization and learning rate $\eta = O(1)$, the network enters the lazy regime.
Lazy Regime via Large Width
Statement
Consider a fully-connected network of width $m$ with standard parameterization (weights initialized as $\mathcal{N}(0, 1)$, output scaled by $1/\sqrt{m}$). For learning rate $\eta = O(1)$, as $m \to \infty$, the relative change in the weights during training satisfies:

$$\frac{\|\theta_t - \theta_0\|}{\|\theta_0\|} = O\!\left(\frac{1}{\sqrt{m}}\right)$$
The training dynamics converge to those of kernel gradient descent with the (fixed) NTK at initialization.
Intuition
With standard parameterization, the gradient signal per parameter is $O(1/\sqrt{m})$. Over $T$ steps of gradient descent with learning rate $\eta$, each weight moves by $O(\eta T/\sqrt{m})$. Since there are $O(m^2)$ weights in each hidden layer, the initialization has norm $\Theta(m)$ while the total displacement has norm $O(\eta T \sqrt{m})$, so the relative change is $O(1/\sqrt{m})$.
Proof Sketch
Track $\|\theta_t - \theta_0\|$ by summing the squared gradient updates. Each gradient has norm $O(1/\sqrt{m})$ per parameter due to the $1/\sqrt{m}$ output scaling. Over $T$ steps, the total displacement per parameter is $O(\eta T/\sqrt{m})$. Show that this displacement is small enough that the Hessian remainder in the Taylor expansion is negligible, so the linearization holds.
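The $1/\sqrt{m}$ prediction can be checked empirically. The following numpy sketch (a toy regression setup; the widths, step count, and learning rate are arbitrary choices) trains one-hidden-layer tanh networks of increasing width and reports the relative parameter movement, which should shrink roughly like $1/\sqrt{m}$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, steps, eta = 5, 20, 100, 0.2
X = rng.normal(size=(n, d))
y = np.sin(X @ rng.normal(size=d))    # arbitrary smooth target

def rel_change(m, seed=0):
    """Train by full-batch GD and return ||theta_T - theta_0|| / ||theta_0||."""
    r = np.random.default_rng(seed)
    W = r.normal(size=(m, d)); a = r.normal(size=m)
    W0, a0 = W.copy(), a.copy()
    for _ in range(steps):
        H = np.tanh(X @ W.T)                  # n x m hidden activations
        res = H @ a / np.sqrt(m) - y          # squared-loss residuals
        ga = H.T @ res / (n * np.sqrt(m))
        gW = ((res[:, None] * (1 - H**2) * a).T @ X) / (n * np.sqrt(m))
        W -= eta * gW; a -= eta * ga
    num = np.sqrt(np.sum((W - W0)**2) + np.sum((a - a0)**2))
    den = np.sqrt(np.sum(W0**2) + np.sum(a0**2))
    return num / den

for m in [64, 256, 1024, 4096]:
    print(m, rel_change(m))   # relative movement shrinks as width grows
```

The absolute displacement stays roughly $O(1)$ across widths; it is the $\Theta(\sqrt{m})$-sized initialization norm in the denominator that makes wide networks lazy in relative terms.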
Why It Matters
This theorem explains why the NTK theory accurately describes training of very wide networks with standard learning rates. It also reveals the limitation: the NTK regime is a degenerate limit where the network does not learn features.
Failure Mode
The bound requires standard parameterization. With maximal update parameterization (muP), the scaling is chosen so that feature learning persists even at infinite width. The lazy regime is parameterization-dependent, not an intrinsic property of wide networks.
The Feature Learning Regime
Feature Learning / Rich Regime
A neural network is in the feature learning regime if, during training, the internal representations (hidden layer activations) change substantially from their initial values. The network learns data-dependent features rather than relying on random initialization features.
Feature learning is what makes deep learning powerful. The network discovers useful intermediate representations, such as edge detectors in vision or syntactic structures in language, that are not present at initialization.
What Controls the Regime?
The two main controls are parameterization and learning rate:
| Setting | Parameterization | Learning Rate | Regime |
|---|---|---|---|
| Standard param, moderate LR | Standard ($1/\sqrt{m}$ output scaling) | $\eta = O(1)$ | Lazy |
| Standard param, large LR | Standard ($1/\sqrt{m}$ output scaling) | $\eta$ grows with $m$ | Feature learning |
| muP | Maximal update ($1/m$ scaling for certain layers) | Width-dependent per-layer rates | Feature learning |
| Mean field | $1/m$ output scaling | $\eta = \Theta(m)$ | Feature learning |
The key insight: it is the ratio of learning rate to width that matters, not either quantity alone.
Why the Lazy Regime Is Limited
Feature Learning Separation
Statement
There exist data distributions where:

- Any kernel method using the NTK at initialization requires $\Omega(d^{c})$ samples for some constant $c > 1$, where $d$ is the input dimension.
- A neural network in the feature learning regime achieves low error with $O(d)$ samples.

The gap can be exponential in the relevant dimension.
Intuition
The NTK is fixed at initialization and reflects random features. If the target function depends on a low-dimensional structure in the data (e.g., a sparse subset of coordinates), the random NTK kernel wastes capacity on all directions equally. A feature-learning network can discover and focus on the relevant directions, achieving a sample complexity that depends on the intrinsic dimension rather than the ambient dimension.
Proof Sketch
Construct a distribution where the label depends on a single direction $u$ via $y = \sigma(\langle u, x \rangle)$. Show that the NTK inner product cannot distinguish $u$ from other directions without $\Omega(d)$ samples per direction. A feature-learning network, by contrast, can learn $u$ directly by gradient descent on the first layer, needing only $O(d)$ total samples.
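A minimal numpy sketch of the mechanism (with an assumed relu single-index target and a zero-output initialization, so the residual is simply $-y$): after a single gradient step, the first-layer gradient rows already point strongly toward the hidden direction $u$, far more than toward a random control direction.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, m = 20, 2000, 50
u = np.zeros(d); u[0] = 1.0           # hidden relevant direction
X = rng.normal(size=(n, d))
y = np.maximum(X @ u, 0.0)            # single-index target y = relu(<u, x>)

W = rng.normal(size=(m, d)) / np.sqrt(d)   # random first-layer directions
H = np.tanh(X @ W.T)                       # n x m hidden activations

# First-layer gradient of the squared loss at a zero-output initialization
# (residual = -y), up to per-neuron signs and scale factors.
G = ((y[:, None] * (1 - H**2)).T @ X) / n  # m x d: one gradient row per neuron

# Random control direction, made orthogonal to u for a clean comparison.
v = rng.normal(size=d)
v -= (v @ u) * u
v /= np.linalg.norm(v)

norms = np.linalg.norm(G, axis=1)
cos_u = G @ u / norms
cos_v = G @ v / norms
print(np.mean(cos_u), np.mean(np.abs(cos_v)))
```

The gradient concentrates on $u$ because the label correlates with $\langle u, x \rangle$; a fixed NTK has no analogous mechanism for reallocating capacity toward $u$.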
Why It Matters
This is the theoretical justification for why deep learning is not just kernel regression. Feature learning gives neural networks a qualitative advantage on structured problems, which describes most real problems.
Failure Mode
The separation requires the data to have exploitable structure. On unstructured (e.g., purely random) problems, kernel methods and feature-learning networks perform comparably.
Connection to Mean Field Theory
The mean field perspective provides an alternative infinite-width limit where feature learning is preserved. Instead of tracking individual neurons, you track the distribution of neurons. In this limit:
- The width $m \to \infty$, but the dynamics are described by a distributional PDE rather than a fixed kernel.
- Neurons can move to new locations in activation space, representing genuine feature learning.
- The optimization landscape is convex in the space of measures (under certain conditions).
This is the theoretical counterpart to the lazy/NTK limit. The lazy limit is the "linearized" theory; the mean field limit is the "nonlinear" theory.
Common Confusions
Width alone does not determine the regime
A common misconception is that wider networks are always in the lazy regime. This is only true under standard parameterization with $O(1)$ learning rate. With muP or mean-field scaling, infinitely wide networks can still learn features. The regime depends on the parameterization-width-learning rate triple, not width alone.
Lazy regime is not the same as linear models
In the lazy regime, the network is linearized around initialization, but the features at initialization are still nonlinear random features. A lazy network is a kernel method with a specific (NTK) kernel, not a simple linear model. It can still fit complex functions; it just cannot adapt its kernel to the data.
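To see this concretely, here is a toy numpy comparison (with an assumed target $y = x_1 x_2$): a plain linear model in the raw inputs fails, while a frozen random-feature model, standing in for a lazy network, fits the data with only a linear readout.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, m = 2, 400, 2000
X = rng.normal(size=(n, d))
y = X[:, 0] * X[:, 1]                 # nonlinear target; linear-in-x fit is ~useless

# Plain linear model in the raw inputs (with a bias column).
A = np.c_[X, np.ones(n)]
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
lin_err = np.mean((A @ beta - y) ** 2)

# "Lazy" stand-in: fixed random nonlinear features, trained linear readout only.
W = rng.normal(size=(m, d)); b = rng.normal(size=m)
Phi = np.tanh(X @ W.T + b) / np.sqrt(m)
theta = np.linalg.solve(Phi.T @ Phi + 1e-6 * np.eye(m), Phi.T @ y)
lazy_err = np.mean((Phi @ theta - y) ** 2)
print(lin_err, lazy_err)              # random features fit far better than linear
```

The random features here never adapt to the data; everything the model learns lives in the readout `theta`, which is exactly the lazy-regime situation.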
Feature learning does not mean the NTK theory is useless
The NTK theory provides exact predictions for training dynamics, convergence rates, and generalization in the lazy regime. Even for feature-learning networks, the NTK at initialization often provides a useful lower bound on performance. The NTK theory fails only when you ask: does the network do better than the kernel prediction?
Summary
- The lazy regime: weights barely move, network is a kernel machine (NTK)
- The feature learning regime: weights move substantially, network learns data-dependent representations
- Standard parameterization + large width + $O(1)$ learning rate = lazy regime
- muP or mean-field parameterization preserves feature learning at large width
- Feature learning can be exponentially more sample-efficient than kernel methods on structured problems
- The central question: real networks learn features, but the theory of feature learning is much harder than NTK theory
Exercises
Problem
Explain in your own words why increasing width with standard parameterization pushes a network into the lazy regime. What happens to the per-parameter gradient magnitude as $m$ grows?
Problem
Consider a single hidden layer network with the first-layer weights $W$ fixed and only the output weights $a$ trained. In the lazy regime, what kernel does this correspond to? Write it explicitly.
Problem
The muP (maximal update parameterization) is designed so that feature learning persists at infinite width. What is the key difference in how the learning rate scales with width in muP versus standard parameterization? Why does this allow feature learning?
Related Comparisons
References
Canonical:
- Jacot, Gabriel, Hongler, "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (NeurIPS 2018)
- Chizat, Oyallon, Bach, "On Lazy Training in Differentiable Programming" (NeurIPS 2019)
Current:
- Yang & Hu, "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer" (muP, 2022)
- Ba, Erdogdu, Suzuki et al., various works on feature learning in two-layer networks (2022-2024)
Next Topics
The natural next steps from lazy vs feature learning:
- Double descent: how the interpolation threshold interacts with the lazy/rich transition
- Implicit bias of gradient descent: what solutions does GD find in the feature learning regime?
- Neural scaling laws: how do scaling behaviors differ between regimes?
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Neural Tangent Kernel (Layer 4)
- Kernels and Reproducing Kernel Hilbert Spaces (Layer 3)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Rademacher Complexity (Layer 3)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- VC Dimension (Layer 2)
- Implicit Bias and Modern Generalization (Layer 4)
- Gradient Descent Variants (Layer 1)
- Linear Regression (Layer 1)
- Maximum Likelihood Estimation (Layer 0B)
- Mean Field Theory (Layer 4)