
Comparison

Lazy (NTK) Regime vs. Feature Learning Regime

Neural networks can operate in two regimes: the lazy regime, where weights barely move and the network behaves like a fixed kernel, or the feature learning regime, where weights move substantially and the network learns task-specific representations.

What Each Regime Describes

Both regimes describe the training dynamics of neural networks, but they make opposite predictions about what happens to the learned representations.

Lazy regime (NTK): the network parameters $\theta$ stay close to their initialization throughout training. The network output is well-approximated by its first-order Taylor expansion around $\theta_0$. The resulting predictor is a kernel method with the neural tangent kernel $K(x, x') = \nabla_\theta f(x; \theta_0)^\top \nabla_\theta f(x'; \theta_0)$. Features are fixed at initialization.
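As a concrete illustration of this kernel, here is a minimal numpy sketch for a one-hidden-layer tanh network (the architecture, sizes, and data are illustrative assumptions, not from the text): the empirical NTK is just the inner product of parameter gradients at initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 512                       # input dim and hidden width (illustrative)
W = rng.standard_normal((m, d))     # hidden weights at initialization
a = rng.standard_normal(m)          # output weights at initialization

def grad_f(x):
    """Gradient of f(x) = a . tanh(Wx) / sqrt(m) w.r.t. all parameters."""
    h = np.tanh(W @ x)
    ga = h / np.sqrt(m)                              # gradient w.r.t. a
    gW = (a * (1 - h**2))[:, None] * x / np.sqrt(m)  # gradient w.r.t. W
    return np.concatenate([ga, gW.ravel()])

def ntk(x1, x2):
    """K(x, x') = grad f(x; theta_0) . grad f(x'; theta_0)."""
    return grad_f(x1) @ grad_f(x2)

X = rng.standard_normal((4, d))
K = np.array([[ntk(x1, x2) for x2 in X] for x1 in X])  # 4x4 NTK Gram matrix
```

By construction the Gram matrix is symmetric and positive semidefinite, as any valid kernel matrix must be.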

Feature learning regime: the parameters $\theta$ move far from initialization. The network learns data-dependent features that differ from the random features at initialization. Internal representations change during training to become task-relevant.

Side-by-Side Statement

Definition

Lazy Regime

A network operates in the lazy regime when the change in parameters $\|\theta_T - \theta_0\|$ remains small relative to the scale of $\theta_0$ throughout training. The network function is approximately linear in the parameters:

$$f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0)$$

Training reduces to kernel regression with a fixed kernel (the NTK).
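The quality of this linearization can be checked numerically. A minimal numpy sketch (the network size, tanh activation, and perturbation scales are illustrative assumptions): perturb the parameters by $\epsilon\,\delta$ in a fixed random direction $\delta$ and compare the exact output with the first-order prediction; the error should shrink quadratically in $\epsilon$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 3, 256                       # illustrative sizes
W0 = rng.standard_normal((m, d))    # hidden weights at init
a0 = rng.standard_normal(m)         # output weights at init
x = rng.standard_normal(d)

def f(a, W):
    """One-hidden-layer network with NTK-style 1/sqrt(m) output scaling."""
    return a @ np.tanh(W @ x) / np.sqrt(m)

# Analytic gradient of f at (a0, W0)
h = np.tanh(W0 @ x)
ga = h / np.sqrt(m)                               # gradient w.r.t. a
gW = (a0 * (1 - h**2))[:, None] * x / np.sqrt(m)  # gradient w.r.t. W

# One random perturbation direction, tested at two scales
da = rng.standard_normal(m)
dW = rng.standard_normal((m, d))

def linearization_error(eps):
    exact = f(a0 + eps * da, W0 + eps * dW)
    linear = f(a0, W0) + eps * (ga @ da + np.sum(gW * dW))
    return abs(exact - linear)

# Quadratic shrinkage: the network is locally linear in its parameters.
err_big, err_small = linearization_error(1e-2), linearization_error(1e-3)
```

Shrinking `eps` by 10x should shrink the Taylor error by roughly 100x, confirming that the leftover term is second order.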

Definition

Feature Learning Regime

A network operates in the feature learning regime when internal representations (hidden layer activations) change substantially during training. The effective kernel at the end of training differs from the initial NTK. The network discovers structure in the data that was not present in the random initialization.

Where Each Is Stronger

Lazy regime wins on theoretical tractability

The NTK theory provides exact convergence guarantees for gradient descent, generalization bounds via kernel theory, and a clean characterization of which functions the network can learn. The entire training trajectory is governed by a fixed kernel, making analysis possible.

Feature learning wins on practical performance

Deep learning's empirical success comes from feature learning, not kernel behavior. Features learned by deep networks on ImageNet, language modeling, and other tasks outperform any fixed kernel, including the NTK. The ability to learn hierarchical, task-specific representations is what makes deep learning different from kernel methods.

What Controls the Regime

Width

The NTK theory shows that as width $\to \infty$ (with standard parameterization), the network enters the lazy regime. The kernel converges to a deterministic limit and stays fixed during training. At finite width, the kernel changes during training, enabling feature learning.
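One way to see this numerically is to train tiny networks of increasing width under the NTK parameterization and track how far the parameters travel. A hedged numpy sketch (the dataset, learning rate, step count, and widths are all illustrative choices): the relative movement $\|\theta_T - \theta_0\| / \|\theta_0\|$ should shrink as width grows.

```python
import numpy as np

def relative_movement(m, steps=500, lr=0.1):
    """Train a width-m tanh network (NTK parameterization, full-batch GD on
    MSE) on a tiny regression task; return ||theta_T - theta_0|| / ||theta_0||."""
    rng = np.random.default_rng(0)
    d, n = 2, 4
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)
    W = rng.standard_normal((m, d))
    a = rng.standard_normal(m)
    W0, a0 = W.copy(), a.copy()
    for _ in range(steps):
        H = np.tanh(X @ W.T)                                    # (n, m)
        r = H @ a / np.sqrt(m) - y                              # residuals
        ga = H.T @ r / np.sqrt(m)                               # grad w.r.t. a
        gW = (r[:, None] * (1 - H**2) * a).T @ X / np.sqrt(m)   # grad w.r.t. W
        a -= lr * ga
        W -= lr * gW
    num = np.sqrt(np.sum((a - a0) ** 2) + np.sum((W - W0) ** 2))
    den = np.sqrt(np.sum(a0 ** 2) + np.sum(W0 ** 2))
    return num / den

# Wider networks move relatively less: the lazy regime emerges with width.
moves = [relative_movement(m) for m in (16, 256, 4096)]
```

The total movement stays roughly $O(1)$ (enough to fit the few data points), while $\|\theta_0\|$ grows like $\sqrt{\text{width}}$, so the ratio decays.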

Learning rate

Small learning rates keep parameters close to initialization (lazy). Larger learning rates allow parameters to move farther, enabling feature learning. The critical learning rate scales inversely with width under standard parameterization.

Parameterization

The standard (NTK) parameterization scales the output by $1/\sqrt{\text{width}}$, which sends the network to the lazy regime as width grows. The mean-field (maximal update) parameterization ($\mu$P) scales differently, preserving feature learning even at large width. The choice of parameterization determines whether infinite-width limits are lazy or feature-learning.

| Factor | Lazy regime | Feature learning |
| --- | --- | --- |
| Width | Very large (infinite limit) | Finite, moderate |
| Learning rate | Small (scales as $1/\text{width}$) | $O(1)$ or larger |
| Parameterization | Standard/NTK | Mean-field/$\mu$P |
| Training duration | Short | Long enough for features to form |

Where Each Fails

Lazy regime fails to explain deep learning performance

The NTK for standard architectures (fully connected, CNNs) is a relatively simple kernel that does not capture the empirical success of deep learning. NTK-regime networks perform comparably to classical kernel methods, not better. Any claim that "NTK explains why deep learning works" is misleading. The NTK regime is a tractable limit, not a faithful description of trained networks.

Feature learning theory is incomplete

While we observe that trained networks learn good features, the theory for why gradient descent discovers good features is much less developed than NTK theory. Results exist for specific architectures and data distributions (e.g., learning single-index models, sparse functions), but a general theory of feature learning remains open.

Key Assumptions That Differ

| | Lazy regime | Feature learning |
| --- | --- | --- |
| Parameter movement | $\|\theta - \theta_0\| \to 0$ as width $\to \infty$ | $\|\theta - \theta_0\| = \Theta(1)$ |
| Kernel | Fixed at initialization | Changes during training |
| Effective model | Kernel regression | Nonlinear representation learning |
| Theory status | Well-understood | Active research |
| Practical relevance | Limited | High |

When a Researcher Would Use Each

Example

Proving convergence guarantees for overparameterized networks

If you want to prove that gradient descent on a wide network converges to zero training loss, the lazy/NTK framework is the standard tool. The analysis reduces to showing that the minimum eigenvalue of the NTK Gram matrix is positive and the kernel does not change much during training.
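That eigenvalue condition can be checked on a toy example. A minimal numpy sketch (the architecture and all sizes are illustrative assumptions): build the Jacobian of the network outputs with respect to the parameters at initialization, form the NTK Gram matrix, and inspect its smallest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, n = 3, 1024, 8                 # width m much larger than n samples
W = rng.standard_normal((m, d))
a = rng.standard_normal(m)
X = rng.standard_normal((n, d))

H = np.tanh(X @ W.T)                 # (n, m) hidden activations
# Per-example parameter gradients of f(x) = a . tanh(Wx) / sqrt(m)
Ga = H / np.sqrt(m)                                             # w.r.t. a
GW = ((1 - H**2) * a)[:, :, None] * X[:, None, :] / np.sqrt(m)  # w.r.t. W
G = np.concatenate([Ga, GW.reshape(n, -1)], axis=1)  # (n, num_params) Jacobian
K = G @ G.T                          # empirical NTK Gram matrix
lam_min = np.linalg.eigvalsh(K)[0]   # positive => the kernel problem is well-posed
```

A strictly positive minimum eigenvalue on distinct inputs is the key quantity in NTK convergence proofs: it lower-bounds the rate at which gradient descent drives the training loss to zero.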

Example

Designing transfer learning systems

If you want to understand why pretrained features transfer across tasks, you need the feature learning perspective. The lazy regime predicts that random features (at initialization) are as good as trained features, which contradicts the entire motivation for pretraining and fine-tuning.

Example

Hyperparameter transfer across model scales

The $\mu$P framework uses the feature learning regime to derive learning rate and initialization schemes that transfer across widths. This is practically valuable: tune hyperparameters on a small model and transfer to a large one. This only works in the feature learning regime; in the lazy regime, optimal hyperparameters change with width.

Common Confusions

Watch Out

The NTK is not wrong; it is a specific limit

The NTK theory is mathematically correct. It describes the infinite-width, NTK-parameterized limit. The issue is that this limit does not capture what makes finite-width trained networks powerful. The NTK is a valid but limited model, not a general theory of deep learning.

Watch Out

Feature learning does not mean the NTK is useless

The NTK provides useful tools even outside the lazy regime. The initial NTK determines early-time training dynamics. NTK analysis gives necessary conditions for trainability. And the gap between NTK performance and actual network performance quantifies how much feature learning contributes.

Watch Out

Width alone does not determine the regime

A very wide network can still learn features if the parameterization and learning rate are chosen appropriately ($\mu$P). Width pushes toward the lazy regime only under standard parameterization with correspondingly small learning rates.

What to Memorize

  1. Lazy = kernel: weights barely move, output is linear in parameters, fixed NTK governs training.
  2. Feature learning = representation learning: weights move, internal features adapt to the task, kernel changes.
  3. Width + standard parameterization $\to$ lazy. Width + $\mu$P $\to$ feature learning.
  4. Practical deep learning is feature learning, but lazy regime theory is more complete.