Modern Generalization
Lazy vs Feature Learning
The fundamental dichotomy in neural network training: lazy regime (NTK, kernel-like, weights barely move) versus rich/feature learning regime (weights move substantially, representations emerge).
Why This Matters
When you train a neural network, one of two qualitatively different things can happen. Either the weights barely move from initialization and the network behaves like a fixed kernel method (the lazy regime), or the weights move substantially and the network learns new representations of the data (the feature learning regime).
This distinction is not academic. It determines whether deep learning is more powerful than kernel methods, whether depth and architecture actually matter, or whether neural networks are just expensive kernel machines.
Mental Model
Think of a neural network at initialization as a starting point in weight space. Training moves the weights by gradient descent. In the lazy regime, the weights move so little that the network is well approximated by its first-order Taylor expansion around initialization. In the feature learning regime, the weights move far enough that the Taylor approximation breaks down and the network genuinely reorganizes its internal representations.
The question "which regime am I in?" is controlled by two knobs: network width and learning rate (relative to width).
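This picture can be checked numerically. Below is a minimal numpy sketch (an illustrative toy; the width, input dimension, and step size are arbitrary choices): take one gradient-sized parameter step in a wide one-hidden-layer tanh network and compare the exact new output with the first-order Taylor prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 4096, 10               # hidden width and input dimension (arbitrary)
W = rng.normal(size=(m, d))   # first layer, N(0, 1) entries
a = rng.normal(size=m)        # output layer, N(0, 1) entries

def f(W, a, x):
    # NTK-style parameterization: output scaled by 1/sqrt(m)
    return a @ np.tanh(W @ x) / np.sqrt(m)

x = rng.normal(size=d)
h = np.tanh(W @ x)

# Gradients of f with respect to W and a at initialization.
gW = np.outer(a * (1 - h**2), x) / np.sqrt(m)
ga = h / np.sqrt(m)

# One gradient-sized step in parameter space.
eta = 0.1
exact = f(W + eta * gW, a + eta * ga, x)
linear = f(W, a, x) + eta * (np.sum(gW**2) + ga @ ga)  # first-order Taylor

print(abs(exact - linear))  # Taylor remainder: tiny relative to the O(1) change in f
```

At this width the remainder is orders of magnitude smaller than the change in the output, which is exactly the lazy-regime picture; shrinking the width makes the linearization visibly worse.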
The Lazy Regime
Lazy Training / NTK Regime
A neural network is in the lazy regime if, throughout training, the parameters remain close to their initialization in the sense that the function computed by the network is well approximated by its linearization:

$$f(x; \theta) \approx f(x; \theta_0) + \langle \nabla_\theta f(x; \theta_0),\ \theta - \theta_0 \rangle$$
In this regime, the network behaves as a kernel method with the Neural Tangent Kernel $\Theta(x, x') = \langle \nabla_\theta f(x; \theta_0),\ \nabla_\theta f(x'; \theta_0) \rangle$.
In the lazy regime, the features (the kernel) are fixed at initialization. The network only learns the output-layer linear combination. This is exactly kernel regression with the NTK.
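A minimal numpy sketch of this correspondence, assuming a one-hidden-layer tanh network: build the empirical NTK Gram matrix from parameter gradients at initialization, then run kernel ridge regression with it. In the lazy regime this is, up to the ridge term, what training the network computes.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, n = 2000, 5, 8          # width, input dim, sample count (arbitrary)
W = rng.normal(size=(m, d))
a = rng.normal(size=m)
X = rng.normal(size=(n, d))

def grads(x):
    """Flattened gradient of f(x) = a @ tanh(W x) / sqrt(m) w.r.t. (W, a)."""
    h = np.tanh(W @ x)
    gW = np.outer(a * (1 - h**2), x) / np.sqrt(m)
    ga = h / np.sqrt(m)
    return np.concatenate([gW.ravel(), ga])

G = np.stack([grads(x) for x in X])   # n x (number of parameters)
K = G @ G.T                           # empirical NTK Gram matrix

# Lazy-regime training = kernel (ridge) regression with K.
y = rng.normal(size=n)
lam = 1e-6
alpha = np.linalg.solve(K + lam * np.eye(n), y)
preds = K @ alpha
print(np.max(np.abs(preds - y)))      # near-interpolation of the training labels
```

The kernel here is fixed by the random initialization; only the coefficients `alpha` depend on the data, mirroring the "output-layer linear combination" point above.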
When Does the Lazy Regime Arise?
The key result: as the width $m \to \infty$ with standard parameterization and learning rate $\eta = O(1)$, the network enters the lazy regime.
Lazy Regime via Large Width
Statement
Consider a fully-connected network of width $m$ with standard parameterization (weights initialized as $\mathcal{N}(0, 1)$, output scaled by $1/\sqrt{m}$). For learning rate $\eta = O(1)$, as $m \to \infty$, the relative change in the weights during training satisfies:

$$\frac{\|\theta_t - \theta_0\|}{\|\theta_0\|} = O\!\left(\frac{1}{\sqrt{m}}\right)$$
The training dynamics converge to those of kernel gradient descent with the (fixed) NTK at initialization.
Intuition
With standard parameterization, the gradient signal per parameter is $O(1/\sqrt{m})$. Over $T$ steps of gradient descent with learning rate $\eta$, each weight moves by $O(\eta T/\sqrt{m})$. Since there are $O(m^2)$ weights in each hidden layer, the initialization has norm $\Theta(m)$ while the total displacement has norm $O(\eta T \sqrt{m})$, so the relative change is $O(1/\sqrt{m})$.
Proof Sketch
Track $\|\theta_t - \theta_0\|$ by summing the squared gradient updates. Each gradient has norm $O(1/\sqrt{m})$ per parameter due to the $1/\sqrt{m}$ output scaling. Over $T$ steps, the total displacement per parameter is $O(\eta T/\sqrt{m})$. Show that this displacement is small enough that the Hessian remainder in the Taylor expansion is negligible, so the linearization holds.
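The $1/\sqrt{m}$ prediction can be checked empirically. The following numpy sketch (a toy regression setup; the widths, step count, and learning rate are arbitrary choices) trains one-hidden-layer tanh networks of increasing width and reports the relative parameter movement, which should shrink roughly like $1/\sqrt{m}$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, steps, eta = 5, 20, 100, 0.2
X = rng.normal(size=(n, d))
y = np.sin(X @ rng.normal(size=d))    # arbitrary smooth target

def rel_change(m, seed=0):
    """Train by full-batch GD and return ||theta_T - theta_0|| / ||theta_0||."""
    r = np.random.default_rng(seed)
    W = r.normal(size=(m, d)); a = r.normal(size=m)
    W0, a0 = W.copy(), a.copy()
    for _ in range(steps):
        H = np.tanh(X @ W.T)                  # n x m hidden activations
        res = H @ a / np.sqrt(m) - y          # squared-loss residuals
        ga = H.T @ res / (n * np.sqrt(m))
        gW = ((res[:, None] * (1 - H**2) * a).T @ X) / (n * np.sqrt(m))
        W -= eta * gW; a -= eta * ga
    num = np.sqrt(np.sum((W - W0)**2) + np.sum((a - a0)**2))
    den = np.sqrt(np.sum(W0**2) + np.sum(a0**2))
    return num / den

for m in [64, 256, 1024, 4096]:
    print(m, rel_change(m))   # relative movement shrinks as width grows
```

The absolute displacement stays roughly $O(1)$ across widths; it is the $\Theta(\sqrt{m})$-sized initialization norm in the denominator that makes wide networks lazy in relative terms.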
Why It Matters
This theorem explains why the NTK theory accurately describes training of very wide networks with standard learning rates. It also reveals the limitation: the NTK regime is a degenerate limit where the network does not learn features.
Failure Mode
The bound requires standard parameterization. With maximal update parameterization (muP), the scaling is chosen so that feature learning persists even at infinite width. The lazy regime is parameterization-dependent, not an intrinsic property of wide networks.
The Feature Learning Regime
Feature Learning / Rich Regime
A neural network is in the feature learning regime if, during training, the internal representations (hidden layer activations) change substantially from their initial values. The network learns data-dependent features rather than relying on random initialization features.
Feature learning is what makes deep learning powerful. The network discovers useful intermediate representations, such as edge detectors in vision or syntactic structures in language, that are not present at initialization.
What Controls the Regime?
The two main controls are parameterization and learning rate:
| Setting | Parameterization | Learning Rate | Regime |
|---|---|---|---|
| Standard param, moderate LR | Standard ($1/\sqrt{m}$ output scaling) | $\eta = O(1)$ | Lazy |
| Standard param, large LR | Standard ($1/\sqrt{m}$ output scaling) | $\eta$ grows with $m$ | Feature learning |
| muP | Maximal update ($1/m$ scaling for certain layers) | Width-dependent per-layer rates | Feature learning |
| Mean field | $1/m$ output scaling | $\eta = \Theta(m)$ | Feature learning |
The key insight: it is the ratio of learning rate to width that matters, not either quantity alone.
Why the Lazy Regime Is Limited
Feature Learning Separation
Statement
There exist data distributions where:

- Any kernel method using the NTK at initialization requires $\Omega(d^{c})$ samples for some constant $c > 1$, where $d$ is the input dimension.
- A neural network in the feature learning regime achieves low error with $O(d)$ samples.

The gap can be exponential in the relevant dimension.
Intuition
The NTK is fixed at initialization and reflects random features. If the target function depends on a low-dimensional structure in the data (e.g., a sparse subset of coordinates), the random NTK kernel wastes capacity on all directions equally. A feature-learning network can discover and focus on the relevant directions, achieving a sample complexity that depends on the intrinsic dimension rather than the ambient dimension.
Proof Sketch
Construct a distribution where the label depends on a single direction $u$ via $y = \sigma(\langle u, x \rangle)$. Show that the NTK inner product cannot distinguish $u$ from other directions without $\Omega(d)$ samples per direction. A feature-learning network, by contrast, can learn $u$ directly by gradient descent on the first layer, needing only $O(d)$ total samples.
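A minimal numpy sketch of the mechanism (with an assumed relu single-index target and a zero-output initialization, so the residual is simply $-y$): after a single gradient step, the first-layer gradient rows already point strongly toward the hidden direction $u$, far more than toward a random control direction.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, m = 20, 2000, 50
u = np.zeros(d); u[0] = 1.0           # hidden relevant direction
X = rng.normal(size=(n, d))
y = np.maximum(X @ u, 0.0)            # single-index target y = relu(<u, x>)

W = rng.normal(size=(m, d)) / np.sqrt(d)   # random first-layer directions
H = np.tanh(X @ W.T)                       # n x m hidden activations

# First-layer gradient of the squared loss at a zero-output initialization
# (residual = -y), up to per-neuron signs and scale factors.
G = ((y[:, None] * (1 - H**2)).T @ X) / n  # m x d: one gradient row per neuron

# Random control direction, made orthogonal to u for a clean comparison.
v = rng.normal(size=d)
v -= (v @ u) * u
v /= np.linalg.norm(v)

norms = np.linalg.norm(G, axis=1)
cos_u = G @ u / norms
cos_v = G @ v / norms
print(np.mean(cos_u), np.mean(np.abs(cos_v)))
```

The gradient concentrates on $u$ because the label correlates with $\langle u, x \rangle$; a fixed NTK has no analogous mechanism for reallocating capacity toward $u$.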
Why It Matters
This is the theoretical justification for why deep learning is not just kernel regression. Feature learning gives neural networks a qualitative advantage on structured problems, which describes most real problems.
Failure Mode
The separation requires the data to have exploitable structure. On unstructured (e.g., purely random) problems, kernel methods and feature-learning networks perform comparably.
Connection to Mean Field Theory
The mean field perspective provides an alternative infinite-width limit where feature learning is preserved. Instead of tracking individual neurons, you track the distribution of neurons. In this limit:
- The width $m \to \infty$, but the dynamics are described by a distributional PDE rather than a fixed kernel.
- Neurons can move to new locations in activation space, representing genuine feature learning.
- The optimization landscape is convex in the space of measures (under certain conditions).
This is the theoretical counterpart to the lazy/NTK limit. The lazy limit is the "linearized" theory; the mean field limit is the "nonlinear" theory.
Common Confusions
Width alone does not determine the regime
A common misconception is that wider networks are always in the lazy regime. This is only true under standard parameterization with $O(1)$ learning rate. With muP or mean-field scaling, infinitely wide networks can still learn features. The regime depends on the parameterization-width-learning rate triple, not width alone.
Lazy regime is not the same as linear models
In the lazy regime, the network is linearized around initialization, but the features at initialization are still nonlinear random features. A lazy network is a kernel method with a specific (NTK) kernel, not a simple linear model. It can still fit complex functions; it just cannot adapt its kernel to the data.
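To see this concretely, here is a toy numpy comparison (with an assumed target $y = x_1 x_2$): a plain linear model in the raw inputs fails, while a frozen random-feature model, standing in for a lazy network, fits the data with only a linear readout.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, m = 2, 400, 2000
X = rng.normal(size=(n, d))
y = X[:, 0] * X[:, 1]                 # nonlinear target; linear-in-x fit is ~useless

# Plain linear model in the raw inputs (with a bias column).
A = np.c_[X, np.ones(n)]
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
lin_err = np.mean((A @ beta - y) ** 2)

# "Lazy" stand-in: fixed random nonlinear features, trained linear readout only.
W = rng.normal(size=(m, d)); b = rng.normal(size=m)
Phi = np.tanh(X @ W.T + b) / np.sqrt(m)
theta = np.linalg.solve(Phi.T @ Phi + 1e-6 * np.eye(m), Phi.T @ y)
lazy_err = np.mean((Phi @ theta - y) ** 2)
print(lin_err, lazy_err)              # random features fit far better than linear
```

The random features here never adapt to the data; everything the model learns lives in the readout `theta`, which is exactly the lazy-regime situation.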
Feature learning does not mean the NTK theory is useless
The NTK theory provides exact predictions for training dynamics, convergence rates, and generalization in the lazy regime. Even for feature-learning networks, the NTK at initialization often provides a useful lower bound on performance. The NTK theory fails only when you ask: does the network do better than the kernel prediction?
Summary
- The lazy regime: weights barely move, network is a kernel machine (NTK)
- The feature learning regime: weights move substantially, network learns data-dependent representations
- Standard parameterization + large width + $O(1)$ learning rate = lazy regime
- muP or mean-field parameterization preserves feature learning at large width
- Feature learning can be exponentially more sample-efficient than kernel methods on structured problems
- The central question: real networks learn features, but the theory of feature learning is much harder than NTK theory
Exercises
Problem
Explain in your own words why increasing width with standard parameterization pushes a network into the lazy regime. What happens to the per-parameter gradient magnitude as $m$ grows?
Problem
Consider a single hidden layer network with the first-layer weights $W$ fixed and only the output weights $a$ trained. In the lazy regime, what kernel does this correspond to? Write it explicitly.
Problem
The muP (maximal update parameterization) is designed so that feature learning persists at infinite width. What is the key difference in how the learning rate scales with width in muP versus standard parameterization? Why does this allow feature learning?
Related Comparisons
References
Canonical:
- Jacot, Gabriel, Hongler, "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (NeurIPS 2018)
- Chizat, Oyallon, Bach, "On Lazy Training in Differentiable Programming" (NeurIPS 2019)
Current:
- Yang & Hu, "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer" (muP, 2022)
- Ba, Erdogdu, Suzuki et al., various works on feature learning in two-layer networks (2022-2024)
Next Topics
The natural next steps from lazy vs feature learning:
- Double descent: how the interpolation threshold interacts with the lazy/rich transition
- Implicit bias of gradient descent: what solutions does GD find in the feature learning regime?
- Neural scaling laws: how do scaling behaviors differ between regimes?
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Neural Tangent Kernel (Layer 4)
- Kernels and Reproducing Kernel Hilbert Spaces (Layer 3)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Rademacher Complexity (Layer 3)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- VC Dimension (Layer 2)
- Implicit Bias and Modern Generalization (Layer 4)
- Gradient Descent Variants (Layer 1)
- Linear Regression (Layer 1)
- Maximum Likelihood Estimation (Layer 0B)
- Mean Field Theory (Layer 4)