Applied ML

Kernel Methods for Molecules

Tanimoto kernels on Morgan fingerprints, Coulomb-matrix and SOAP descriptors for materials, FCHL atomic kernels, and GP regression on SMILES with string kernels as a calibrated baseline against GNNs.

Advanced · Tier 3 · ~15 min

Why This Matters

Before message-passing neural networks dominated molecular property benchmarks, the standard pipeline was a fixed descriptor plus a kernel machine. That pipeline did not go away. Kernel ridge regression on FCHL or SOAP still beats GNNs in the small-data regime that matters most in chemistry: a few hundred to a few thousand DFT-labeled structures of new chemistry, where neural models overfit and miscalibrate.

A positive-definite molecular kernel encodes a similarity prior strong enough to be useful before any model is fit. A Gaussian process on top returns calibrated uncertainty, which matters when an active-learning loop picks the next DFT calculation or compound to synthesize.
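The GP machinery behind that active-learning loop fits in a few lines. A minimal NumPy sketch, assuming a precomputed positive-definite kernel; the function name and interface are illustrative, not from any particular library:

```python
import numpy as np

def gp_posterior(K_train, K_cross, k_diag, y, noise=1e-2):
    """GP regression posterior from precomputed kernel blocks.

    K_train: (n, n) kernel among labeled points
    K_cross: (m, n) kernel between candidates and labeled points
    k_diag:  (m,) kernel of each candidate with itself
    """
    n = K_train.shape[0]
    # Cholesky of the regularized Gram matrix; noise also acts as jitter.
    L = np.linalg.cholesky(K_train + noise * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_cross @ alpha                      # posterior mean
    v = np.linalg.solve(L, K_cross.T)           # (n, m)
    # Posterior variance; clip tiny negative values from round-off.
    var = np.maximum(k_diag - np.sum(v ** 2, axis=0), 0.0)
    return mean, var

# Active-learning step: query the candidate the model is least sure about.
# next_idx = int(np.argmax(var))
```

The variance is what a plain kernel ridge fit does not give you: points far from all labeled data revert to the prior variance, flagging them as informative queries.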

Core Ideas

The most-used molecular kernel is Tanimoto on circular fingerprints. ECFP (Rogers-Hahn 2010, J. Chem. Inf. Model. 50) and the equivalent Morgan fingerprint encode each atom's neighborhood up to radius $r$ (typically 2) as a hashed bit vector $\mathbf{f}(x) \in \{0,1\}^d$ with $d = 1024$ or $2048$. The Tanimoto similarity is

$$T(x, x') = \frac{\langle \mathbf{f}(x), \mathbf{f}(x') \rangle}{\|\mathbf{f}(x)\|^2 + \|\mathbf{f}(x')\|^2 - \langle \mathbf{f}(x), \mathbf{f}(x') \rangle}$$

This is positive definite (Gower 1971), so it plugs into kernel ridge regression and Gaussian processes directly. The MinHash and SECFP extensions trade some bit collisions for better generalization on rare substructures.
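The formula vectorizes cleanly over a whole fingerprint matrix. A dependency-free NumPy sketch; in practice the 0/1 rows would come from RDKit's Morgan fingerprint generator, which is not shown here:

```python
import numpy as np

def tanimoto_kernel(F):
    """Tanimoto (Jaccard) kernel between rows of a binary fingerprint matrix.

    F: (n, d) array of 0/1 fingerprint bits (e.g. Morgan/ECFP bit vectors).
    Returns the (n, n) positive-definite kernel matrix.
    """
    F = F.astype(float)
    inner = F @ F.T                       # <f(x), f(x')>
    sq = np.diag(inner)                   # ||f(x)||^2, i.e. the bit counts
    denom = sq[:, None] + sq[None, :] - inner
    # Convention here: two all-zero fingerprints get similarity 0.
    return np.divide(inner, denom, out=np.zeros_like(inner), where=denom > 0)
```

Because the fingerprints are binary, the inner product is just the count of shared on-bits, so the whole Gram matrix is one matrix multiply plus elementwise arithmetic.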

For materials and 3D chemistry the descriptor changes but the strategy does not. The Coulomb matrix (Rupp et al. 2012, Phys. Rev. Lett. 108) encodes a molecule as $M_{ij} = Z_i Z_j / \|R_i - R_j\|$ off-diagonal and $0.5 Z_i^{2.4}$ on the diagonal, sorted by row norm for permutation invariance. SOAP (Bartók-Kondor-Csányi 2013, Phys. Rev. B 87) projects a smooth atomic density onto spherical harmonics for rotation and permutation invariance. FCHL (Christensen et al. 2020, J. Chem. Phys. 152) refines the same smooth-density idea with chemistry-aware basis functions and is the strongest kernel baseline on QM7b and QM9 atomization energies.
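The Coulomb matrix is simple enough to write out directly. A NumPy sketch of the sorted, zero-padded variant described above, assuming positions in atomic units; the function name and padding convention are illustrative:

```python
import numpy as np

def coulomb_matrix(Z, R, max_atoms):
    """Sorted, zero-padded Coulomb matrix (Rupp et al. 2012).

    Z: (n,) nuclear charges; R: (n, 3) positions in bohr;
    max_atoms: fixed padding size shared across the dataset.
    """
    n = len(Z)
    M = np.zeros((max_atoms, max_atoms))
    for i in range(n):
        M[i, i] = 0.5 * Z[i] ** 2.4                      # diagonal term
        for j in range(i + 1, n):
            M[i, j] = M[j, i] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    # Sort rows/columns by descending row norm for permutation invariance;
    # the all-zero padding rows naturally sort last.
    order = np.argsort(-np.linalg.norm(M, axis=1))
    return M[np.ix_(order, order)]
```

Padding every molecule to the same `max_atoms` is what makes the flattened matrices comparable inputs to a single kernel, which is exactly the point raised in the second Watch Out below.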

Kernel ridge regression $\hat{f}(x) = \mathbf{k}(x)^\top (K + \lambda I)^{-1} \mathbf{y}$ is the workhorse fit. The $O(n^3)$ training cost caps practical dataset sizes near $n \sim 10^5$ without low-rank approximations. For SMILES, string kernels and GP-BO implementations like GAUCHE (Griffiths et al. 2023) make Bayesian optimization over chemical space tractable at typical discovery-campaign scale.
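The KRR formula is a single dense solve. A minimal NumPy sketch of the equation above, with illustrative function names; for $n$ beyond a few tens of thousands you would swap the dense solve for a Nyström or other low-rank approximation:

```python
import numpy as np

def krr_fit(K, y, lam=1e-8):
    """Solve (K + lam * I) alpha = y. The O(n^3) step is this solve."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def krr_predict(K_cross, alpha):
    """K_cross[i, j] = k(test_i, train_j); prediction is k(x)^T alpha."""
    return K_cross @ alpha
```

Any of the kernels in this section (Tanimoto on fingerprints, a Gaussian kernel on Coulomb matrices, FCHL) can supply `K` and `K_cross`; the fit itself never changes.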

The fair comparison against a GNN is not million-molecule benchmarks. On a 500-molecule slice of new chemistry, FCHL plus kernel ridge typically matches or beats a freshly trained GNN, gives calibrated uncertainty, and trains in seconds. The GNN wins decisively once $n > 10^4$ or when the descriptor stops being a good prior.

Common Confusions

Watch Out

Tanimoto similarity is not a metric and does not satisfy the triangle inequality

The Tanimoto distance $1 - T$ is a metric on binary vectors, but the similarity itself is not. More importantly, two molecules with $T = 0.85$ on ECFP4 can have very different activities (the "activity cliff" phenomenon) and two with $T = 0.3$ can share a binding mode. Treat Tanimoto as a similarity prior, not a guarantee.

Watch Out

Coulomb matrices are not the same descriptor across molecules

The matrix is $n_\text{atom} \times n_\text{atom}$, and a molecule with 12 atoms produces a different-sized object than one with 18. Standard practice pads to a fixed maximum size and sorts rows by norm, but the sort-by-norm step is only piecewise smooth in the coordinates, which matters for force-field applications.

References

Rogers 2010 ECFP

Rogers, Hahn, "Extended-Connectivity Fingerprints," J. Chem. Inf. Model. 50(5), 2010, pp. 742-754.

Rupp 2012 Coulomb matrix

Rupp, Tkatchenko, Müller, von Lilienfeld, "Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning," Phys. Rev. Lett. 108, 2012, 058301.

Bartók 2013 SOAP

Bartók, Kondor, Csányi, "On representing chemical environments," Phys. Rev. B 87, 2013, 184115, arXiv:1209.3140.

Christensen 2020 FCHL

Christensen, Bratholm, Faber, von Lilienfeld, "FCHL revisited: Faster and more accurate quantum machine learning," J. Chem. Phys. 152(4), 2020, 044107.

Griffiths 2024 GAUCHE

Griffiths et al., "GAUCHE: A Library for Gaussian Processes in Chemistry," NeurIPS Datasets and Benchmarks 2023, arXiv:2212.02314.

Faber 2017 prediction errors

Faber et al., "Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error," J. Chem. Theory Comput. 13(11), 2017, pp. 5255-5264.

Last reviewed: April 18, 2026
