What Each Regime Describes
Both regimes describe the training dynamics of neural networks, but they make opposite predictions about what happens to the learned representations.
Lazy regime (NTK): the network parameters stay close to their initialization throughout training. The network output is well-approximated by its first-order Taylor expansion around the initialization θ₀:

f(x; θ) ≈ f(x; θ₀) + ∇_θ f(x; θ₀)ᵀ (θ − θ₀)

The resulting predictor is a kernel method with the neural tangent kernel Θ(x, x′) = ∇_θ f(x; θ₀)ᵀ ∇_θ f(x′; θ₀). Features are fixed at initialization.
Feature learning regime: the parameters move far from initialization. The network learns data-dependent features that differ from the random features at initialization. Internal representations change during training to become task-relevant.
Side-by-Side Statement
Lazy Regime
A network operates in the lazy regime when the change in parameters ‖θ(t) − θ₀‖ remains small relative to the scale of θ₀ throughout training. The network function is approximately linear in the parameters:

f(x; θ) ≈ f(x; θ₀) + ∇_θ f(x; θ₀)ᵀ (θ − θ₀)
Training reduces to kernel regression with a fixed kernel (the NTK).
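To make the linearization concrete, here is a minimal numpy sketch (a toy setup of my own, not from the source): a one-hidden-layer network's output after a small parameter perturbation is compared against its first-order Taylor expansion around θ₀.

```python
import numpy as np

# Toy lazy-regime check (illustrative setup, all names assumed):
# one hidden layer, tanh activation, scalar output f(x) = a @ tanh(W @ x).
rng = np.random.default_rng(0)
n = 64                                    # hidden width
x = rng.normal(size=3)                    # a single input
W = rng.normal(size=(n, 3)) / np.sqrt(3)  # first-layer weights (part of theta_0)
a = rng.normal(size=n) / np.sqrt(n)       # readout weights (part of theta_0)

def f(W_, a_):
    return a_ @ np.tanh(W_ @ x)

# Gradients of f with respect to the parameters, evaluated at theta_0
h = np.tanh(W @ x)
grad_a = h                                # df/da
grad_W = np.outer(a * (1 - h**2), x)      # df/dW

# Small parameter perturbation -- the lazy-regime assumption
dW = 1e-3 * rng.normal(size=W.shape)
da = 1e-3 * rng.normal(size=a.shape)

exact = f(W + dW, a + da)
linear = f(W, a) + np.sum(grad_W * dW) + grad_a @ da
print(abs(exact - linear))  # first-order expansion tracks the exact output
```

The discrepancy is second order in the perturbation size, which is exactly why training that keeps ‖θ − θ₀‖ small reduces to a linear (kernel) model.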
Feature Learning Regime
A network operates in the feature learning regime when internal representations (hidden layer activations) change substantially during training. The effective kernel at the end of training differs from the initial NTK. The network discovers structure in the data that was not present in the random initialization.
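One way to see the kernel move is to compare the empirical NTK Gram matrix before and after training. The sketch below (an illustrative toy setup of my own, not from the source) trains a narrow one-hidden-layer net by plain gradient descent and measures the relative change in the Gram matrix.

```python
import numpy as np

# Toy feature-learning diagnostic (assumed setup): at small width, the
# empirical NTK Gram matrix changes during training.
rng = np.random.default_rng(3)
n, d, m = 8, 3, 10                  # width, input dim, sample count
X = rng.normal(size=(m, d))
y = rng.normal(size=m)
W = rng.normal(size=(n, d))
a = rng.normal(size=n)

def gram(W_, a_):
    """Empirical NTK Gram matrix for f(x) = a @ tanh(W @ x) / sqrt(n)."""
    H = np.tanh(X @ W_.T)                                        # (m, n)
    Ja = H / np.sqrt(n)                                          # grads w.r.t. a
    JW = ((1 - H**2) * a_)[:, :, None] * X[:, None, :] / np.sqrt(n)
    J = np.concatenate([Ja, JW.reshape(m, -1)], axis=1)
    return J @ J.T

K0 = gram(W, a)
for _ in range(500):                # plain gradient descent on squared loss
    H = np.tanh(X @ W.T)
    err = H @ a / np.sqrt(n) - y
    a -= 0.5 * H.T @ err / (np.sqrt(n) * m)
    W -= 0.5 * (((err[:, None] * (1 - H**2)) * a).T @ X) / (np.sqrt(n) * m)
K1 = gram(W, a)

change = np.linalg.norm(K1 - K0) / np.linalg.norm(K0)
print(change)  # substantially nonzero at small width: the kernel moved
```

In the lazy limit this relative change would shrink toward zero; a nonvanishing value is the signature of feature learning.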
Where Each Is Stronger
Lazy regime wins on theoretical tractability
The NTK theory provides exact convergence guarantees for gradient descent, generalization bounds via kernel theory, and a clean characterization of which functions the network can learn. The entire training trajectory is governed by a fixed kernel, making analysis possible.
Feature learning wins on practical performance
Deep learning's empirical success comes from feature learning, not kernel behavior. Features learned by deep networks on ImageNet, language modeling, and other tasks outperform any fixed kernel, including the NTK. The ability to learn hierarchical, task-specific representations is what makes deep learning different from kernel methods.
What Controls the Regime
Width
The NTK theory shows that as width n → ∞ (with standard parameterization), the network enters the lazy regime. The kernel converges to a deterministic limit and stays fixed during training. At finite width, the kernel changes during training, enabling feature learning.
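A hedged illustration of the width effect, under an assumed toy setup with the NTK output scaling 1/√n: after the same number of gradient steps, the relative parameter movement shrinks as width grows.

```python
import numpy as np

# Toy experiment (assumed setup, not from the source): relative movement
# ||W - W0|| / ||W0|| after gradient descent, at two widths, under
# standard (NTK) parameterization with readout scaled by 1/sqrt(n).
rng = np.random.default_rng(1)
X = rng.normal(size=(16, 4))        # toy inputs
y = rng.normal(size=16)             # toy targets

def relative_movement(n, steps=50, lr=0.1):
    W = rng.normal(size=(n, 4))
    a = rng.normal(size=n)
    W0 = W.copy()
    for _ in range(steps):
        h = np.tanh(X @ W.T)                       # (16, n)
        err = h @ a / np.sqrt(n) - y               # NTK output scaling
        grad_a = h.T @ err / (np.sqrt(n) * len(y))
        grad_W = (((err[:, None] * (1 - h**2)) * a).T @ X) / (np.sqrt(n) * len(y))
        a -= lr * grad_a
        W -= lr * grad_W
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

narrow = relative_movement(16)
wide = relative_movement(4096)
print(narrow, wide)  # the wide network moves much less, relatively
```

The wide network still fits the data comparably well; it just does so with proportionally tiny parameter changes, which is the lazy regime in action.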
Learning rate
Small learning rates keep parameters close to initialization (lazy). Larger learning rates allow parameters to move farther, enabling feature learning. The critical learning rate scales inversely with width under standard parameterization.
Parameterization
The standard (NTK) parameterization scales the output by 1/√n, which sends the network to the lazy regime as width grows. The mean-field (maximal update) parameterization (µP) scales differently, preserving feature learning even at large width. The choice of parameterization determines whether infinite-width limits are lazy or feature-learning.
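A toy sketch of the two readout scalings (assumed forms, not from the source): at a random initialization, the NTK-scaled output has O(1) fluctuations, while a mean-field readout that divides by n rather than √n concentrates near zero, since each unit contributes only O(1/n).

```python
import numpy as np

# Illustrative comparison of output scalings at random init (toy forms):
#   NTK:        f(x) = a @ tanh(W @ x) / sqrt(n)
#   mean-field: f(x) = a @ tanh(W @ x) / n
rng = np.random.default_rng(4)
x = rng.normal(size=3)

def output(n, scaling):
    W = rng.normal(size=(n, 3))
    a = rng.normal(size=n)
    s = np.sqrt(n) if scaling == "ntk" else n   # mean-field divides by n
    return a @ np.tanh(W @ x) / s

outs = {n: {s: output(n, s) for s in ("ntk", "mean-field")}
        for n in (100, 10_000)}
print(outs)  # mean-field output shrinks with width; NTK output stays O(1)
```

The smaller per-unit contribution under the mean-field scaling is what lets individual units move appreciably during training without blowing up the output, preserving feature learning at large width.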
| Factor | Lazy regime | Feature learning |
|---|---|---|
| Width | Very large (infinite limit) | Finite, moderate |
| Learning rate | Small (scales as 1/n) | O(1) or larger |
| Parameterization | Standard/NTK | Mean-field/µP |
| Training duration | Short | Long enough for features to form |
Where Each Fails
Lazy regime fails to explain deep learning performance
The NTK for standard architectures (fully connected, CNNs) is a relatively simple kernel that does not capture the empirical success of deep learning. NTK-regime networks perform comparably to classical kernel methods, not better. Any claim that "NTK explains why deep learning works" is misleading. The NTK regime is a tractable limit, not a faithful description of trained networks.
Feature learning theory is incomplete
While we observe that trained networks learn good features, the theory for why gradient descent discovers good features is much less developed than NTK theory. Results exist for specific architectures and data distributions (e.g., learning single-index models, sparse functions), but a general theory of feature learning remains open.
Key Assumptions That Differ
| | Lazy regime | Feature learning |
|---|---|---|
| Parameter movement | ‖θ − θ₀‖/‖θ₀‖ → 0 as width n → ∞ | Stays O(1); parameters move far from initialization |
| Kernel | Fixed at initialization | Changes during training |
| Effective model | Kernel regression | Nonlinear representation learning |
| Theory status | Well-understood | Active research |
| Practical relevance | Limited | High |
When a Researcher Would Use Each
Proving convergence guarantees for overparameterized networks
If you want to prove that gradient descent on a wide network converges to zero training loss, the lazy/NTK framework is the standard tool. The analysis reduces to showing that the minimum eigenvalue of the NTK Gram matrix is positive and the kernel does not change much during training.
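A small sketch of the eigenvalue check (toy empirical NTK, assumed setup): form the Jacobian of the network output at initialization for each sample, take the Gram matrix of those Jacobians, and verify that its minimum eigenvalue is positive.

```python
import numpy as np

# Empirical NTK Gram matrix for a toy one-hidden-layer net (assumed setup):
# f(x) = a @ tanh(W @ x) / sqrt(n). The lazy-regime convergence argument
# needs lambda_min of this matrix to be positive.
rng = np.random.default_rng(2)
n, d, m = 256, 5, 20                 # width, input dim, sample count
X = rng.normal(size=(m, d))
W = rng.normal(size=(n, d))
a = rng.normal(size=n)

def jacobian(x):
    # gradient of f(x) with respect to all parameters (a, W) at init
    h = np.tanh(W @ x)
    grad_a = h / np.sqrt(n)
    grad_W = np.outer(a * (1 - h**2), x) / np.sqrt(n)
    return np.concatenate([grad_a, grad_W.ravel()])

J = np.stack([jacobian(x) for x in X])   # (m, num_params)
K = J @ J.T                              # empirical NTK Gram matrix
lam_min = np.linalg.eigvalsh(K).min()
print(lam_min)  # positive => gradient descent converges in the lazy analysis
```

With far more parameters than samples, the Jacobians are generically linearly independent, so the Gram matrix is positive definite; the NTK proofs then show the kernel stays close to this initial matrix throughout training.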
Designing transfer learning systems
If you want to understand why pretrained features transfer across tasks, you need the feature learning perspective. The lazy regime predicts that random features (at initialization) are as good as trained features, which contradicts the entire motivation for pretraining and fine-tuning.
Hyperparameter transfer across model scales
The µP framework uses the feature learning regime to derive learning rate and initialization schemes that transfer across widths. This is practically valuable: tune hyperparameters on a small model and transfer to a large one. This only works in the feature learning regime; in the lazy regime, optimal hyperparameters change with width.
Common Confusions
The NTK is not wrong, it is a specific limit
The NTK theory is mathematically correct. It describes the infinite-width, NTK-parameterized limit. The issue is that this limit does not capture what makes finite-width trained networks powerful. The NTK is a valid but limited model, not a general theory of deep learning.
Feature learning does not mean the NTK is useless
The NTK provides useful tools even outside the lazy regime. The initial NTK determines early-time training dynamics. NTK analysis gives necessary conditions for trainability. And the gap between NTK performance and actual network performance quantifies how much feature learning contributes.
Width alone does not determine the regime
A very wide network can still learn features if the parameterization and learning rate are chosen appropriately (µP). Width pushes toward the lazy regime only under standard parameterization with correspondingly small learning rates.
What to Memorize
- Lazy = kernel: weights barely move, output is linear in parameters, fixed NTK governs training.
- Feature learning = representation learning: weights move, internal features adapt to the task, kernel changes.
- Width → ∞ + standard parameterization → lazy. Width → ∞ + µP → feature learning.
- Practical deep learning is feature learning, but lazy regime theory is more complete.