What Each Regime Describes
Both regimes describe the training dynamics of neural networks, but they make opposite predictions about what happens to the learned representations.
Lazy regime (NTK): the network parameters stay close to their initialization throughout training. The network output is well-approximated by its first-order Taylor expansion around the initialization θ₀:

f(x; θ) ≈ f(x; θ₀) + ∇_θ f(x; θ₀)ᵀ (θ − θ₀)

The resulting predictor is a kernel method with the neural tangent kernel Θ(x, x′) = ∇_θ f(x; θ₀)ᵀ ∇_θ f(x′; θ₀). Features are fixed at initialization.
Feature learning regime: the parameters move far from initialization. The network learns data-dependent features that differ from the random features at initialization. Internal representations change during training to become task-relevant.
Side-by-Side Statement
Lazy Regime
A network operates in the lazy regime when the change in parameters ‖θ(t) − θ₀‖ remains small relative to the scale of θ₀ throughout training. The network function is approximately linear in the parameters:

f(x; θ) ≈ f(x; θ₀) + ∇_θ f(x; θ₀)ᵀ (θ − θ₀)
Training reduces to kernel regression with a fixed kernel (the NTK).
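To make the linearization concrete, here is a minimal numpy sketch (a toy setup of my own, not from the source): a one-hidden-layer network's output after a small parameter perturbation is compared against its first-order Taylor expansion around θ₀.

```python
import numpy as np

# Toy lazy-regime check (illustrative setup, all names assumed):
# one hidden layer, tanh activation, scalar output f(x) = a @ tanh(W @ x).
rng = np.random.default_rng(0)
n = 64                                    # hidden width
x = rng.normal(size=3)                    # a single input
W = rng.normal(size=(n, 3)) / np.sqrt(3)  # first-layer weights (part of theta_0)
a = rng.normal(size=n) / np.sqrt(n)       # readout weights (part of theta_0)

def f(W_, a_):
    return a_ @ np.tanh(W_ @ x)

# Gradients of f with respect to the parameters, evaluated at theta_0
h = np.tanh(W @ x)
grad_a = h                                # df/da
grad_W = np.outer(a * (1 - h**2), x)      # df/dW

# Small parameter perturbation -- the lazy-regime assumption
dW = 1e-3 * rng.normal(size=W.shape)
da = 1e-3 * rng.normal(size=a.shape)

exact = f(W + dW, a + da)
linear = f(W, a) + np.sum(grad_W * dW) + grad_a @ da
print(abs(exact - linear))  # first-order expansion tracks the exact output
```

The discrepancy is second order in the perturbation size, which is exactly why training that keeps ‖θ − θ₀‖ small reduces to a linear (kernel) model.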
Feature Learning Regime
A network operates in the feature learning regime when internal representations (hidden layer activations) change substantially during training. The effective kernel at the end of training differs from the initial NTK. The network discovers structure in the data that was not present in the random initialization.
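One way to see the kernel move is to compare the empirical NTK Gram matrix before and after training. The sketch below (an illustrative toy setup of my own, not from the source) trains a narrow one-hidden-layer net by plain gradient descent and measures the relative change in the Gram matrix.

```python
import numpy as np

# Toy feature-learning diagnostic (assumed setup): at small width, the
# empirical NTK Gram matrix changes during training.
rng = np.random.default_rng(3)
n, d, m = 8, 3, 10                  # width, input dim, sample count
X = rng.normal(size=(m, d))
y = rng.normal(size=m)
W = rng.normal(size=(n, d))
a = rng.normal(size=n)

def gram(W_, a_):
    """Empirical NTK Gram matrix for f(x) = a @ tanh(W @ x) / sqrt(n)."""
    H = np.tanh(X @ W_.T)                                        # (m, n)
    Ja = H / np.sqrt(n)                                          # grads w.r.t. a
    JW = ((1 - H**2) * a_)[:, :, None] * X[:, None, :] / np.sqrt(n)
    J = np.concatenate([Ja, JW.reshape(m, -1)], axis=1)
    return J @ J.T

K0 = gram(W, a)
for _ in range(500):                # plain gradient descent on squared loss
    H = np.tanh(X @ W.T)
    err = H @ a / np.sqrt(n) - y
    a -= 0.5 * H.T @ err / (np.sqrt(n) * m)
    W -= 0.5 * (((err[:, None] * (1 - H**2)) * a).T @ X) / (np.sqrt(n) * m)
K1 = gram(W, a)

change = np.linalg.norm(K1 - K0) / np.linalg.norm(K0)
print(change)  # substantially nonzero at small width: the kernel moved
```

In the lazy limit this relative change would shrink toward zero; a nonvanishing value is the signature of feature learning.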
Where Each Is Stronger
Lazy regime wins on theoretical tractability
The NTK theory provides exact convergence guarantees for gradient descent, generalization bounds via kernel theory, and a clean characterization of which functions the network can learn. The entire training trajectory is governed by a fixed kernel, making analysis possible.
Feature learning wins on practical performance
Deep learning's empirical success comes from feature learning, not kernel behavior. Features learned by deep networks on ImageNet, language modeling, and other tasks outperform any fixed kernel, including the NTK. The ability to learn hierarchical, task-specific representations is what makes deep learning different from kernel methods.
What Controls the Regime
Width
The NTK theory shows that as width n → ∞ (with standard parameterization), the network enters the lazy regime. The kernel converges to a deterministic limit and stays fixed during training. At finite width, the kernel changes during training, enabling feature learning.
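A hedged illustration of the width effect, under an assumed toy setup with the NTK output scaling 1/√n: after the same number of gradient steps, the relative parameter movement shrinks as width grows.

```python
import numpy as np

# Toy experiment (assumed setup, not from the source): relative movement
# ||W - W0|| / ||W0|| after gradient descent, at two widths, under
# standard (NTK) parameterization with readout scaled by 1/sqrt(n).
rng = np.random.default_rng(1)
X = rng.normal(size=(16, 4))        # toy inputs
y = rng.normal(size=16)             # toy targets

def relative_movement(n, steps=50, lr=0.1):
    W = rng.normal(size=(n, 4))
    a = rng.normal(size=n)
    W0 = W.copy()
    for _ in range(steps):
        h = np.tanh(X @ W.T)                       # (16, n)
        err = h @ a / np.sqrt(n) - y               # NTK output scaling
        grad_a = h.T @ err / (np.sqrt(n) * len(y))
        grad_W = (((err[:, None] * (1 - h**2)) * a).T @ X) / (np.sqrt(n) * len(y))
        a -= lr * grad_a
        W -= lr * grad_W
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

narrow = relative_movement(16)
wide = relative_movement(4096)
print(narrow, wide)  # the wide network moves much less, relatively
```

The wide network still fits the data comparably well; it just does so with proportionally tiny parameter changes, which is the lazy regime in action.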
Learning rate
Small learning rates keep parameters close to initialization (lazy). Larger learning rates allow parameters to move farther, enabling feature learning. The critical learning rate scales inversely with width under standard parameterization.
Parameterization
The standard (NTK) parameterization scales the output by 1/√n, which sends the network to the lazy regime as width grows. The mean-field (maximal update) parameterization (µP) scales differently, preserving feature learning even at large width. The choice of parameterization determines whether infinite-width limits are lazy or feature-learning.
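A toy sketch of the two readout scalings (assumed forms, not from the source): at a random initialization, the NTK-scaled output has O(1) fluctuations, while a mean-field readout that divides by n rather than √n concentrates near zero, since each unit contributes only O(1/n).

```python
import numpy as np

# Illustrative comparison of output scalings at random init (toy forms):
#   NTK:        f(x) = a @ tanh(W @ x) / sqrt(n)
#   mean-field: f(x) = a @ tanh(W @ x) / n
rng = np.random.default_rng(4)
x = rng.normal(size=3)

def output(n, scaling):
    W = rng.normal(size=(n, 3))
    a = rng.normal(size=n)
    s = np.sqrt(n) if scaling == "ntk" else n   # mean-field divides by n
    return a @ np.tanh(W @ x) / s

outs = {n: {s: output(n, s) for s in ("ntk", "mean-field")}
        for n in (100, 10_000)}
print(outs)  # mean-field output shrinks with width; NTK output stays O(1)
```

The smaller per-unit contribution under the mean-field scaling is what lets individual units move appreciably during training without blowing up the output, preserving feature learning at large width.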
| Factor | Lazy regime | Feature learning |
|---|---|---|
| Width | Very large (infinite limit) | Finite, moderate |
| Learning rate | Small (scales as 1/n) | O(1) or larger |
| Parameterization | Standard/NTK | Mean-field/µP |
| Training duration | Short | Long enough for features to form |
Where Each Fails
Lazy regime fails to explain deep learning performance
The NTK for standard architectures (fully connected, CNNs) is a relatively simple kernel that does not capture the empirical success of deep learning. NTK-regime networks perform comparably to classical kernel methods, not better. Any claim that "NTK explains why deep learning works" is misleading. The NTK regime is a tractable limit, not a faithful description of trained networks.
Feature learning theory is incomplete
While we observe that trained networks learn good features, the theory for why gradient descent discovers good features is much less developed than NTK theory. Results exist for specific architectures and data distributions (e.g., learning single-index models, sparse functions), but a general theory of feature learning remains open.
Key Assumptions That Differ
| | Lazy regime | Feature learning |
|---|---|---|
| Parameter movement | ‖θ − θ₀‖/‖θ₀‖ → 0 as width n → ∞ | Stays O(1); parameters move far from initialization |
| Kernel | Fixed at initialization | Changes during training |
| Effective model | Kernel regression | Nonlinear representation learning |
| Theory status | Well-understood | Active research |
| Practical relevance | Limited | High |
When a Researcher Would Use Each
Proving convergence guarantees for overparameterized networks
If you want to prove that gradient descent on a wide network converges to zero training loss, the lazy/NTK framework is the standard tool. The analysis reduces to showing that the minimum eigenvalue of the NTK Gram matrix is positive and the kernel does not change much during training.
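A small sketch of the eigenvalue check (toy empirical NTK, assumed setup): form the Jacobian of the network output at initialization for each sample, take the Gram matrix of those Jacobians, and verify that its minimum eigenvalue is positive.

```python
import numpy as np

# Empirical NTK Gram matrix for a toy one-hidden-layer net (assumed setup):
# f(x) = a @ tanh(W @ x) / sqrt(n). The lazy-regime convergence argument
# needs lambda_min of this matrix to be positive.
rng = np.random.default_rng(2)
n, d, m = 256, 5, 20                 # width, input dim, sample count
X = rng.normal(size=(m, d))
W = rng.normal(size=(n, d))
a = rng.normal(size=n)

def jacobian(x):
    # gradient of f(x) with respect to all parameters (a, W) at init
    h = np.tanh(W @ x)
    grad_a = h / np.sqrt(n)
    grad_W = np.outer(a * (1 - h**2), x) / np.sqrt(n)
    return np.concatenate([grad_a, grad_W.ravel()])

J = np.stack([jacobian(x) for x in X])   # (m, num_params)
K = J @ J.T                              # empirical NTK Gram matrix
lam_min = np.linalg.eigvalsh(K).min()
print(lam_min)  # positive => gradient descent converges in the lazy analysis
```

With far more parameters than samples, the Jacobians are generically linearly independent, so the Gram matrix is positive definite; the NTK proofs then show the kernel stays close to this initial matrix throughout training.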
Designing transfer learning systems
If you want to understand why pretrained features transfer across tasks, you need the feature learning perspective. The lazy regime predicts that random features (at initialization) are as good as trained features, which contradicts the entire motivation for pretraining and fine-tuning.
Hyperparameter transfer across model scales
The µP framework uses the feature learning regime to derive learning rate and initialization schemes that transfer across widths. This is practically valuable: tune hyperparameters on a small model and transfer to a large one. This only works in the feature learning regime; in the lazy regime, optimal hyperparameters change with width.
Common Confusions
The NTK is not wrong, it is a specific limit
The NTK theory is mathematically correct. It describes the infinite-width, NTK-parameterized limit. The issue is that this limit does not capture what makes finite-width trained networks powerful. The NTK is a valid but limited model, not a general theory of deep learning.
Feature learning does not mean the NTK is useless
The NTK provides useful tools even outside the lazy regime. The initial NTK determines early-time training dynamics. NTK analysis gives necessary conditions for trainability. And the gap between NTK performance and actual network performance quantifies how much feature learning contributes.
Width alone does not determine the regime
A very wide network can still learn features if the parameterization and learning rate are chosen appropriately (µP). Width pushes toward the lazy regime only under standard parameterization with correspondingly small learning rates.
What to Memorize
- Lazy = kernel: weights barely move, output is linear in parameters, fixed NTK governs training.
- Feature learning = representation learning: weights move, internal features adapt to the task, kernel changes.
- Width → ∞ + standard parameterization → lazy. Width → ∞ + µP → feature learning.
- Practical deep learning is feature learning, but lazy regime theory is more complete.