
Comparison

Kernel Methods vs. Feature Learning

Kernel methods fix a feature map and learn weights. Feature learning methods learn the features themselves. The NTK regime is kernel-like; the rich regime learns features. When each approach suffices and when it does not.

What Each Does

Both kernel methods and feature learning solve supervised learning problems, but they differ in what is fixed and what is learned.

Kernel methods fix a feature map $\phi: \mathcal{X} \to \mathcal{F}$ (implicitly, via a kernel $k(x, x') = \langle \phi(x), \phi(x') \rangle$) and learn only the linear weights in that feature space.

Feature learning learns the feature map itself. A neural network trained in the "rich" or "feature learning" regime jointly learns both the representation and the linear head.

Side-by-Side Statement

Definition

Kernel Prediction

A kernel method predicts via:

$$f(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$$

The kernel $k$ is chosen before seeing data; only the coefficients $\alpha_i$ are learned from data. The representer theorem guarantees that the regularized empirical-risk minimizer over the RKHS takes exactly this form.
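As a minimal sketch of this recipe, kernel ridge regression in numpy: the kernel and its bandwidth are fixed up front, and only $\alpha$ is fit by solving a linear system. The function names, toy data, and hyperparameters here are illustrative, not any library's API:

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    """Gaussian RBF kernel matrix: k(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def fit_kernel_ridge(X, y, lam, gamma):
    """Only the coefficients alpha are learned; the kernel is fixed a priori."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, gamma):
    """f(x) = sum_i alpha_i k(x, x_i), the representer-theorem form."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0])
alpha = fit_kernel_ridge(X, y, lam=1e-3, gamma=10.0)
y_hat = predict(X, alpha, X, gamma=10.0)
```

Note that nothing about the representation changes during fitting: a different task would require choosing a different `gamma` or a different kernel by hand.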

Definition

Feature Learning Prediction

A neural network in the feature learning regime computes:

$$f(x) = w^\top \phi_\theta(x)$$

where $\phi_\theta$ is a learned feature map parameterized by $\theta$. Both $w$ and $\theta$ are updated during training, so the representation $\phi_\theta$ changes to align with the task.
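A minimal sketch of joint training in numpy, with a one-hidden-layer network and hand-written gradients (the architecture, learning rate, and toy target are all illustrative assumptions). The point is that the gradient step touches both the head $w$ and the feature parameters $\theta$ (here, `W`):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.tanh(X[:, 0])                 # toy target: depends on one direction only

W = 0.5 * rng.normal(size=(16, 5))   # theta: parameters of the feature map
w = np.zeros(16)                     # linear head
W0 = W.copy()

lr = 0.1
for _ in range(500):
    H = np.tanh(X @ W.T)             # phi_theta(x): the current features
    err = H @ w - y
    grad_w = H.T @ err / len(X)      # gradient for the head ...
    grad_W = ((1 - H**2) * np.outer(err, w)).T @ X / len(X)  # ... and for theta
    w -= lr * grad_w                 # both are updated jointly:
    W -= lr * grad_W                 # the representation adapts to the task

mse = np.mean((np.tanh(X @ W.T) @ w - y) ** 2)
```

After training, `W` has moved away from its initialization, i.e. the feature map itself has changed; in a kernel method the analogous quantity stays fixed.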

Where Each Is Stronger

Kernel methods win on theory and small data

Kernel methods come with strong theoretical guarantees. The RKHS norm provides a natural complexity measure. Generalization bounds depend on the RKHS norm of the learned function, not on the number of parameters. For small datasets with well-chosen kernels, kernel methods can match or beat neural networks.

Kernel methods also lead to convex optimization: with ridge regularization, the training problem has a unique global minimum. There is no concern about local minima, saddle points, or sensitivity to initialization.

Feature learning wins on representation quality

The central limitation of kernel methods is that the feature map is fixed before seeing data. If the kernel does not capture task-relevant structure, no amount of data will help. A Gaussian RBF kernel treats all directions in input space equally. It cannot discover that only a low-dimensional subspace matters.
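The isotropy claim can be checked directly: a one-line numpy sketch (the coordinate indices are arbitrary) shows that the RBF kernel's similarity is identical whether you perturb the task-relevant axis or an irrelevant one.

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    """Isotropic Gaussian kernel: depends only on the distance ||x - z||."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.zeros(10)
e_relevant, e_irrelevant = np.eye(10)[0], np.eye(10)[7]
# Moving along the "relevant" axis or an irrelevant one is indistinguishable:
k1 = rbf(x, x + 0.3 * e_relevant)
k2 = rbf(x, x + 0.3 * e_irrelevant)
```

Since `k1 == k2` for any pair of axes, the kernel has no mechanism for discovering that only a low-dimensional subspace matters.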

Feature learning adapts the representation to the task. A neural network trained on images learns edge detectors, then texture detectors, then object-part detectors. These features transfer across tasks. No fixed kernel achieves this.

The NTK Connection

Theorem

Neural Tangent Kernel Regime

Statement

In the infinite-width limit with NTK parameterization, the neural tangent kernel $\Theta(x, x') = \langle \nabla_\theta f_\theta(x), \nabla_\theta f_\theta(x') \rangle$ converges to a deterministic kernel $\Theta^*$ at initialization and remains constant during training. The network's training dynamics become equivalent to kernel regression with kernel $\Theta^*$.

Intuition

When the network is very wide, each parameter moves very little during training. The gradient feature map $\nabla_\theta f_\theta$ barely changes from its random initialization, so the network is effectively doing kernel regression with a fixed (random) feature map. This is the "lazy" regime: the parameters are lazy; they do not move far enough to learn new features.
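The empirical NTK is easy to compute for a small network. A sketch in numpy for a one-hidden-layer network in NTK parameterization, $f(x) = v^\top \tanh(Wx)/\sqrt{m}$, with analytic parameter gradients (helper names and sizes are ours, chosen for illustration):

```python
import numpy as np

def param_grad(x, W, v):
    """Gradient of f(x) = v . tanh(Wx) / sqrt(m) w.r.t. all parameters (v, W)."""
    m = len(v)
    h = np.tanh(W @ x)
    g_v = h / np.sqrt(m)                                        # df/dv
    g_W = (v * (1 - h**2))[:, None] * x[None, :] / np.sqrt(m)   # df/dW
    return np.concatenate([g_v, g_W.ravel()])

def empirical_ntk(x, xp, W, v):
    """Theta(x, x') = <grad_theta f(x), grad_theta f(x')>."""
    return param_grad(x, W, v) @ param_grad(xp, W, v)

rng = np.random.default_rng(0)
d, m = 3, 5000                  # large width: the kernel concentrates
x, xp = rng.normal(size=d), rng.normal(size=d)
W, v = rng.normal(size=(m, d)), rng.normal(size=m)
theta = empirical_ntk(x, xp, W, v)
```

At this width the value of `theta` depends mostly on `x` and `xp`, not on the random draw of `W` and `v`; that concentration is exactly the lazy-regime picture above.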

Failure Mode

Finite-width networks deviate from the NTK regime. With large learning rates or small width, the feature map changes substantially during training. This is precisely the regime where neural networks outperform kernel methods, because they learn task-adapted features.

When Kernel Methods Suffice

Kernel methods are the right tool when:

  1. The data has known structure that a standard kernel captures. For example, string kernels for protein sequences, graph kernels for molecular data.
  2. The sample size is small (hundreds to low thousands). Kernel methods have stronger finite-sample guarantees and no optimization difficulties.
  3. You need principled uncertainty quantification. Gaussian processes (which are kernel methods) provide closed-form posterior distributions.
  4. Interpretability of the solution matters. The representer theorem makes the solution a weighted sum over training points.
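Point 3 can be made concrete: a Gaussian process regression posterior has a closed form. A minimal numpy sketch under an RBF prior (the function name and all hyperparameters are illustrative, not from any library):

```python
import numpy as np

def gp_posterior(X, y, X_star, gamma=1.0, noise=1e-2):
    """Exact GP regression posterior under an RBF prior: mean and variance."""
    k = lambda A, B: np.exp(-gamma * ((A[:, None] - B[None, :]) ** 2).sum(-1))
    K = k(X, X) + noise * np.eye(len(X))      # prior covariance + noise
    Ks = k(X_star, X)
    mean = Ks @ np.linalg.solve(K, y)
    cov = k(X_star, X_star) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.diag(cov)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0])
X_star = np.array([[0.0], [5.0]])    # one point near the data, one far away
mean, var = gp_posterior(X, y, X_star)
```

The posterior variance is near zero inside the data and reverts to the prior far from it, which is the calibrated-uncertainty behavior the list item refers to.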

When You Need Feature Learning

Feature learning is necessary when:

  1. The relevant features are unknown. Images, audio, and text have complex hierarchical structure that no hand-designed kernel captures well.
  2. Transfer learning matters. Learned features transfer to new tasks; kernel features do not adapt.
  3. The dataset is large. Exact kernel methods scale as $O(n^3)$ in time and $O(n^2)$ in memory for $n$ training points. Neural networks scale much better with data.
  4. The task requires compositional features. Deep networks compose simple features into complex ones. Kernel methods are shallow (even "deep kernels" do not compose features in the same way).
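The scaling claim in point 3 is concrete enough to compute (the choice of $n = 4000$ is arbitrary): an exact kernel method must store the full $n \times n$ Gram matrix, and solving the resulting linear system costs $O(n^3)$.

```python
import numpy as np

n = 4000
K = np.empty((n, n))           # Gram matrix: every exact kernel method stores this
gram_mb = K.nbytes / 1e6       # n^2 float64 entries -> 128 MB at n = 4000
# Solving (K + lam*I) alpha = y is O(n^3): doubling n costs ~8x the time,
# whereas a neural network's per-epoch cost grows only linearly in n.
```

At $n = 10^6$ the Gram matrix alone would need 8 TB, which is why exact kernel methods are rarely used at that scale.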

Key Assumptions That Differ

|  | Kernel Methods | Feature Learning |
| --- | --- | --- |
| Feature map | Fixed (chosen a priori) | Learned from data |
| Optimization | Convex | Non-convex |
| Scalability | $O(n^2)$ memory, $O(n^3)$ time | Scales with architecture, not $n$ |
| Theory | Tight RKHS bounds | Looser, still developing |
| Transfer | Kernel must be redesigned per task | Features transfer across tasks |

Empirical Evidence

On standard vision benchmarks, the NTK of a convolutional network underperforms the same network trained with feature learning by 10-20% accuracy. This gap grows with dataset size and task complexity. The NTK captures the "easy" structure (low-frequency components) but misses the hierarchical, task-specific features that make deep learning work.

On tabular data with small samples, kernel methods (especially Gaussian processes) remain competitive with or superior to neural networks.

Common Confusions

Watch Out

NTK does not mean neural networks are kernel methods

The NTK describes a limiting regime where neural networks behave like kernel methods. Real neural networks, trained with practical learning rates and finite width, operate in a different regime where features change during training. The NTK is a useful theoretical tool, not a description of how practical networks work.

Watch Out

Deep kernels are not feature learning

Composing kernels (e.g., the arc-cosine kernel that mimics ReLU networks) produces a fixed kernel, not a learned representation. The resulting "deep kernel" is still a kernel method with all its limitations: the feature map does not adapt to the task.
