Applied ML

Kernel Methods for Molecules

Tanimoto kernels on Morgan fingerprints, Coulomb-matrix and SOAP descriptors for materials, FCHL atomic kernels, and GP regression on SMILES with string kernels as a calibrated baseline against GNNs.

Advanced · Tier 3 · ~15 min

Why This Matters

Before message-passing neural networks dominated molecular property benchmarks, the standard pipeline was a fixed descriptor plus a kernel machine. That pipeline did not go away. Kernel ridge regression on FCHL or SOAP still beats GNNs in the small-data regime that matters most in chemistry: a few hundred to a few thousand DFT-labeled structures of new chemistry, where neural models overfit and miscalibrate.

A positive-definite molecular kernel encodes a similarity prior strong enough to be useful before any model is fit. A Gaussian process on top returns calibrated uncertainty, which matters when an active-learning loop picks the next DFT calculation or compound to synthesize.
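The GP machinery behind that active-learning loop fits in a few lines. A minimal NumPy sketch, assuming a precomputed positive-definite kernel; the function name and interface are illustrative, not from any particular library:

```python
import numpy as np

def gp_posterior(K_train, K_cross, k_diag, y, noise=1e-2):
    """GP regression posterior from precomputed kernel blocks.

    K_train: (n, n) kernel among labeled points
    K_cross: (m, n) kernel between candidates and labeled points
    k_diag:  (m,) kernel of each candidate with itself
    """
    n = K_train.shape[0]
    # Cholesky of the regularized Gram matrix; noise also acts as jitter.
    L = np.linalg.cholesky(K_train + noise * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_cross @ alpha                      # posterior mean
    v = np.linalg.solve(L, K_cross.T)           # (n, m)
    # Posterior variance; clip tiny negative values from round-off.
    var = np.maximum(k_diag - np.sum(v ** 2, axis=0), 0.0)
    return mean, var

# Active-learning step: query the candidate the model is least sure about.
# next_idx = int(np.argmax(var))
```

The variance is what a plain kernel ridge fit does not give you: points far from all labeled data revert to the prior variance, flagging them as informative queries.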

Core Ideas

The most-used molecular kernel is Tanimoto on circular fingerprints. ECFP (Rogers-Hahn 2010, J. Chem. Inf. Model. 50) and the equivalent Morgan fingerprint encode each atom's neighborhood up to radius $r$ (typically 2) as a hashed bit vector $\mathbf{f}(x) \in \{0,1\}^d$ with $d = 1024$ or $2048$. The Tanimoto similarity is

$$T(x, x') = \frac{\langle \mathbf{f}(x), \mathbf{f}(x') \rangle}{\|\mathbf{f}(x)\|^2 + \|\mathbf{f}(x')\|^2 - \langle \mathbf{f}(x), \mathbf{f}(x') \rangle}$$

This is positive definite (Gower 1971), so it plugs into kernel ridge regression and Gaussian processes directly. The MinHash and SECFP extensions trade some bit collisions for better generalization on rare substructures.
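The formula vectorizes cleanly over a whole fingerprint matrix. A dependency-free NumPy sketch; in practice the 0/1 rows would come from RDKit's Morgan fingerprint generator, which is not shown here:

```python
import numpy as np

def tanimoto_kernel(F):
    """Tanimoto (Jaccard) kernel between rows of a binary fingerprint matrix.

    F: (n, d) array of 0/1 fingerprint bits (e.g. Morgan/ECFP bit vectors).
    Returns the (n, n) positive-definite kernel matrix.
    """
    F = F.astype(float)
    inner = F @ F.T                       # <f(x), f(x')>
    sq = np.diag(inner)                   # ||f(x)||^2, i.e. the bit counts
    denom = sq[:, None] + sq[None, :] - inner
    # Convention here: two all-zero fingerprints get similarity 0.
    return np.divide(inner, denom, out=np.zeros_like(inner), where=denom > 0)
```

Because the fingerprints are binary, the inner product is just the count of shared on-bits, so the whole Gram matrix is one matrix multiply plus elementwise arithmetic.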

For materials and 3D chemistry the descriptor changes but the strategy does not. The Coulomb matrix (Rupp et al. 2012, Phys. Rev. Lett. 108) encodes a molecule as $M_{ij} = Z_i Z_j / \|R_i - R_j\|$ off-diagonal and $0.5 Z_i^{2.4}$ on the diagonal, sorted by row norm for permutation invariance. SOAP (Bartók-Kondor-Csányi 2013, Phys. Rev. B 87) projects a smooth atomic density onto spherical harmonics for rotation and permutation invariance. FCHL (Christensen et al. 2020, J. Chem. Phys. 152) refines the same smooth-density idea with chemistry-aware basis functions and is the strongest kernel baseline on QM7b and QM9 atomization energies.
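The Coulomb matrix is simple enough to write out directly. A NumPy sketch of the sorted, zero-padded variant described above, assuming positions in atomic units; the function name and padding convention are illustrative:

```python
import numpy as np

def coulomb_matrix(Z, R, max_atoms):
    """Sorted, zero-padded Coulomb matrix (Rupp et al. 2012).

    Z: (n,) nuclear charges; R: (n, 3) positions in bohr;
    max_atoms: fixed padding size shared across the dataset.
    """
    n = len(Z)
    M = np.zeros((max_atoms, max_atoms))
    for i in range(n):
        M[i, i] = 0.5 * Z[i] ** 2.4                      # diagonal term
        for j in range(i + 1, n):
            M[i, j] = M[j, i] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    # Sort rows/columns by descending row norm for permutation invariance;
    # the all-zero padding rows naturally sort last.
    order = np.argsort(-np.linalg.norm(M, axis=1))
    return M[np.ix_(order, order)]
```

Padding every molecule to the same `max_atoms` is what makes the flattened matrices comparable inputs to a single kernel, which is exactly the point raised in the second Watch Out below.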

Kernel ridge regression $\hat{f}(x) = \mathbf{k}(x)^\top (K + \lambda I)^{-1} \mathbf{y}$ is the workhorse fit. The $O(n^3)$ training cost caps practical dataset sizes near $n \sim 10^5$ without low-rank approximations. For SMILES, string kernels and GP-BO implementations like GAUCHE (Griffiths et al. 2023) make Bayesian optimization over chemical space tractable at typical discovery-campaign scale.
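The KRR formula is a single dense solve. A minimal NumPy sketch of the equation above, with illustrative function names; for $n$ beyond a few tens of thousands you would swap the dense solve for a Nyström or other low-rank approximation:

```python
import numpy as np

def krr_fit(K, y, lam=1e-8):
    """Solve (K + lam * I) alpha = y. The O(n^3) step is this solve."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def krr_predict(K_cross, alpha):
    """K_cross[i, j] = k(test_i, train_j); prediction is k(x)^T alpha."""
    return K_cross @ alpha
```

Any of the kernels in this section (Tanimoto on fingerprints, a Gaussian kernel on Coulomb matrices, FCHL) can supply `K` and `K_cross`; the fit itself never changes.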

The fair comparison against a GNN is not million-molecule benchmarks. On a 500-molecule slice of new chemistry, FCHL plus kernel ridge typically matches or beats a freshly trained GNN, gives calibrated uncertainty, and trains in seconds. The GNN wins decisively once $n > 10^4$ or when the descriptor stops being a good prior.

Common Confusions

Watch Out

Tanimoto similarity is not a metric and does not satisfy the triangle inequality

The Tanimoto distance $1 - T$ is a metric on binary vectors, but the similarity itself is not. More importantly, two molecules with $T = 0.85$ on ECFP4 can have very different activities (the "activity cliff" phenomenon) and two with $T = 0.3$ can share a binding mode. Treat Tanimoto as a similarity prior, not a guarantee.

Watch Out

Coulomb matrices are not the same descriptor across molecules

The matrix is $n_\text{atom} \times n_\text{atom}$, and a molecule with 12 atoms produces a different-sized object than one with 18. Standard practice pads to a fixed maximum size and sorts rows by norm, but the sort-by-norm step is only piecewise smooth in the coordinates, which matters for force-field applications.

References

Rogers 2010 ECFP

Rogers, Hahn, "Extended-Connectivity Fingerprints," J. Chem. Inf. Model. 50(5), 2010, pp. 742-754.

Rupp 2012 Coulomb matrix

Rupp, Tkatchenko, Müller, von Lilienfeld, "Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning," Phys. Rev. Lett. 108, 2012, 058301.

Bartók 2013 SOAP

Bartók, Kondor, Csányi, "On representing chemical environments," Phys. Rev. B 87, 2013, 184115, arXiv:1209.3140.

Christensen 2020 FCHL

Christensen, Bratholm, Faber, von Lilienfeld, "FCHL revisited: Faster and more accurate quantum machine learning," J. Chem. Phys. 152(4), 2020, 044107.

Griffiths 2024 GAUCHE

Griffiths et al., "GAUCHE: A Library for Gaussian Processes in Chemistry," NeurIPS Datasets and Benchmarks 2023, arXiv:2212.02314.

Faber 2017 prediction errors

Faber et al., "Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error," J. Chem. Theory Comput. 13(11), 2017, pp. 5255-5264.

Last reviewed: April 18, 2026
