What Each Does
Both kernel methods and feature learning solve supervised learning problems, but they differ in what is fixed and what is learned.
Kernel methods fix a feature map (implicitly, via a kernel $k$) and learn only the linear weights in that feature space.
Feature learning learns the feature map itself. A neural network trained in the "rich" or "feature learning" regime jointly learns both the representation and the linear head.
Side-by-Side Statement
Kernel Prediction
A kernel method predicts via

$$f(x) = \sum_{i=1}^{n} \alpha_i \, k(x, x_i).$$

The kernel $k$ is chosen before seeing data. Only the coefficients $\alpha_i$ are learned from data. The representer theorem guarantees this form is optimal over the RKHS.
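A minimal sketch in numpy makes this concrete: kernel ridge regression fixes the kernel up front and fits only the coefficients by solving a linear system (the RBF bandwidth `gamma` and ridge parameter `lam` below are illustrative choices, not values from the text).

```python
import numpy as np

def rbf_kernel(A, B, gamma=10.0):
    # k(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit(X, y, lam=1e-3):
    # Solve (K + lam I) alpha = y; the kernel itself never changes
    K = rbf_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new):
    # f(x) = sum_i alpha_i k(x, x_i): the representer-theorem form
    return rbf_kernel(X_new, X_train) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
y = np.sin(3 * X[:, 0])
alpha = fit(X, y)
print(np.max(np.abs(predict(X, alpha, X) - y)))  # small training error
```

Note that the only trained object is the coefficient vector `alpha`; everything about the feature space is decided by `rbf_kernel` before any data is seen.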
Feature Learning Prediction
A neural network in the feature learning regime computes

$$f(x) = w^\top \phi_\theta(x),$$

where $\phi_\theta$ is a learned feature map parameterized by $\theta$. Both $w$ and $\theta$ are updated during training. The representation changes to align with the task.
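As a sketch of this regime, assuming a toy two-layer ReLU network trained with full-batch gradient descent in numpy (the XOR-like target, width, and learning rate are illustrative), both the feature map and the head are updated:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 64, 2, 32
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] * X[:, 1])            # XOR-like target: needs a nonlinear feature

V = rng.normal(size=(m, d)) / np.sqrt(d)  # feature-map parameters (theta)
w = rng.normal(size=m) / np.sqrt(m)       # linear head
lr = 0.05

def forward(X, V, w):
    H = np.maximum(X @ V.T, 0.0)          # phi_theta(x): the learned features
    return H, H @ w

_, pred = forward(X, V, w)
loss0 = np.mean((pred - y) ** 2)
for _ in range(1000):
    H, pred = forward(X, V, w)
    g = 2 * (pred - y) / n                        # dLoss/dpred
    w_grad = H.T @ g                              # head gradient
    V_grad = ((g[:, None] * w) * (H > 0)).T @ X   # feature-map gradient
    w -= lr * w_grad
    V -= lr * V_grad                              # the representation itself moves
_, pred = forward(X, V, w)
loss_final = np.mean((pred - y) ** 2)
print(loss0, loss_final)                  # loss drops as the features adapt
```

The key line is the update to `V`: unlike the kernel setting, the map from inputs to features is itself a trained object.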
Where Each Is Stronger
Kernel methods win on theory and small data
Kernel methods come with strong theoretical guarantees. The RKHS norm provides a natural complexity measure. Generalization bounds depend on the RKHS norm of the learned function, not on the number of parameters. For small datasets with well-chosen kernels, kernel methods can match or beat neural networks.
Kernel methods are also convex: the optimization problem has a unique global minimum. There is no concern about local minima, saddle points, or sensitivity to initialization.
Feature learning wins on representation quality
The central limitation of kernel methods is that the feature map is fixed before seeing data. If the kernel does not capture task-relevant structure, no amount of data will help. A Gaussian RBF kernel treats all directions in input space equally. It cannot discover that only a low-dimensional subspace matters.
Feature learning adapts the representation to the task. A neural network trained on images learns edge detectors, then texture detectors, then object-part detectors. These features transfer across tasks. No fixed kernel achieves this.
The NTK Connection
Neural Tangent Kernel Regime
Statement
In the infinite-width limit with standard parameterization, the neural tangent kernel $\Theta(x, x') = \langle \nabla_\theta f_\theta(x), \nabla_\theta f_\theta(x') \rangle$ converges to a deterministic kernel $\Theta_\infty$ at initialization and remains constant during training. The network's training dynamics become equivalent to kernel regression with the kernel $\Theta_\infty$.
Intuition
When the network is very wide, each parameter moves very little during training, so the feature map barely changes from its random initialization. The network is effectively doing kernel regression with a fixed (random) feature map. This is the "lazy" regime: the parameters are lazy; they do not move far enough to learn new features.
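A toy experiment, under an assumed NTK-style $1/\sqrt{m}$ output scaling, illustrates the lazy regime: after the same number of gradient steps, the hidden features of a wide network have moved far less (relative to their size at initialization) than those of a narrow one.

```python
import numpy as np

def feature_drift(m, steps=200, lr=0.1, seed=0):
    """Relative change of the hidden features after `steps` full-batch GD steps."""
    rng = np.random.default_rng(seed)
    n, d = 32, 4
    X = rng.normal(size=(n, d))
    y = np.sin(X[:, 0])
    V = rng.normal(size=(m, d))               # hidden-layer weights
    a = rng.normal(size=m)                    # output weights
    H0 = np.maximum(X @ V.T, 0.0)             # features at initialization
    for _ in range(steps):
        H = np.maximum(X @ V.T, 0.0)
        pred = H @ a / np.sqrt(m)             # NTK-style 1/sqrt(m) scaling
        g = 2 * (pred - y) / n
        grad_a = H.T @ g / np.sqrt(m)
        grad_V = ((g[:, None] * a / np.sqrt(m)) * (H > 0)).T @ X
        a -= lr * grad_a
        V -= lr * grad_V
    H1 = np.maximum(X @ V.T, 0.0)
    return np.linalg.norm(H1 - H0) / np.linalg.norm(H0)

narrow, wide = feature_drift(m=16), feature_drift(m=4096)
print(narrow, wide)   # the wide network's features barely move
```

The drift shrinks roughly like $1/\sqrt{m}$ under this parameterization, which is the quantitative content of "the parameters are lazy."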
Failure Mode
Finite-width networks deviate from the NTK regime. With large learning rates or small width, the feature map changes substantially during training. This is precisely the regime where neural networks outperform kernel methods, because they learn task-adapted features.
When Kernel Methods Suffice
Kernel methods are the right tool when:
- The data has known structure that a standard kernel captures. For example, string kernels for protein sequences, graph kernels for molecular data.
- The sample size is small (hundreds to low thousands). Kernel methods have stronger finite-sample guarantees and no optimization difficulties.
- You need exact uncertainty quantification. Gaussian processes (which are kernel methods) provide calibrated posterior distributions.
- Interpretability of the solution matters. The representer theorem makes the solution a weighted sum over training points.
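The uncertainty-quantification point can be made concrete with a minimal Gaussian process regression sketch (numpy; the lengthscale and noise level below are assumed values): the posterior standard deviation is small at a training input and reverts toward the prior far from the data.

```python
import numpy as np

def rbf(A, B, ls=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def gp_posterior(X, y, X_new, noise=1e-2, ls=0.5):
    # Standard GP regression equations via a Cholesky factorization
    K = rbf(X, X, ls) + noise * np.eye(len(X))
    K_s = rbf(X_new, X, ls)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    cov = rbf(X_new, X_new, ls) - v.T @ v
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

X = np.array([[-1.0], [0.0], [1.0]])       # three training points
y = np.sin(X[:, 0])
X_new = np.array([[0.0], [2.0]])           # one point near the data, one far away
mean, std = gp_posterior(X, y, X_new)
print(mean, std)   # std is small at x=0 (observed) and near the prior at x=2
```

The posterior variance comes in closed form from the same kernel used for prediction; no ensembling or post-hoc calibration is needed.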
When You Need Feature Learning
Feature learning is necessary when:
- The relevant features are unknown. Images, audio, and text have complex hierarchical structure that no hand-designed kernel captures well.
- Transfer learning matters. Learned features transfer to new tasks; kernel features do not adapt.
- The dataset is large. Kernel methods scale as $O(n^3)$ in time and $O(n^2)$ in memory for $n$ training points. Neural networks scale much better with data.
- The task requires compositional features. Deep networks compose simple features into complex ones. Kernel methods are shallow (even "deep kernels" do not compose features in the same way).
Key Assumptions That Differ
| | Kernel Methods | Feature Learning |
|---|---|---|
| Feature map | Fixed (chosen a priori) | Learned from data |
| Optimization | Convex | Non-convex |
| Scalability | $O(n^2)$ memory, $O(n^3)$ time | Scales with architecture, not $n$ |
| Theory | Tight RKHS bounds | Looser, still developing |
| Transfer | Kernel must be redesigned per task | Features transfer across tasks |
Empirical Evidence
On standard vision benchmarks, the NTK of a convolutional network underperforms the same network trained with feature learning by 10-20 percentage points of accuracy. This gap grows with dataset size and task complexity. The NTK captures the "easy" structure (low-frequency components) but misses the hierarchical, task-specific features that make deep learning work.
On tabular data with small samples, kernel methods (especially Gaussian processes) remain competitive with or superior to neural networks.
Common Confusions
NTK does not mean neural networks are kernel methods
The NTK describes a limiting regime where neural networks behave like kernel methods. Real neural networks, trained with practical learning rates and finite width, operate in a different regime where features change during training. The NTK is a useful theoretical tool, not a description of how practical networks work.
Deep kernels are not feature learning
Composing kernels (e.g., the arc-cosine kernel that mimics ReLU networks) produces a fixed kernel, not a learned representation. The resulting "deep kernel" is still a kernel method with all its limitations: the feature map does not adapt to the task.
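For concreteness, the order-1 arc-cosine kernel (Cho & Saul's construction, which mimics an infinitely wide one-hidden-layer ReLU network) is a fixed closed-form function of the two inputs:

$$k_1(x, x') = \frac{1}{\pi}\,\|x\|\,\|x'\|\,\bigl(\sin\theta + (\pi - \theta)\cos\theta\bigr), \qquad \theta = \angle(x, x').$$

Nothing in this expression depends on training data, which is exactly why composing such kernels still yields a fixed, non-adaptive feature map.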
References
Canonical:
- Schölkopf & Smola, Learning with Kernels (2002), Chapters 1-2
- Jacot, Gabriel, Hongler, "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (NeurIPS 2018)
Current:
- Yang & Hu, "Feature Learning in Infinite-Width Neural Networks" (ICML 2021)
- Chizat & Bach, "On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport" (NeurIPS 2018)