Applied ML
Kernel Methods for Molecules
Tanimoto kernels on Morgan fingerprints, Coulomb-matrix and SOAP descriptors for materials, FCHL atomic kernels, and GP regression on SMILES with string kernels as a calibrated baseline against GNNs.
Why This Matters
Before message-passing neural networks dominated molecular property benchmarks, the standard pipeline was a fixed descriptor plus a kernel machine. That pipeline did not go away. Kernel ridge regression on FCHL or SOAP still beats GNNs in the small-data regime that matters most in chemistry: a few hundred to a few thousand DFT-labeled structures of new chemistry, where neural models overfit and miscalibrate.
A positive-definite molecular kernel encodes a similarity prior good enough to work without training. A Gaussian process on top returns calibrated uncertainty, which matters when an active-learning loop picks the next DFT calculation or compound to synthesize.
Core Ideas
The most-used molecular kernel is Tanimoto on circular fingerprints. ECFP (Rogers-Hahn 2010, J. Chem. Inf. Model. 50) and the equivalent Morgan fingerprint encode each atom's neighborhood up to radius $r$ (typically $r = 2$, i.e. ECFP4) as a hashed bit vector of length $d = 1024$ or $2048$. The Tanimoto similarity of bit vectors $a$ and $b$ is

$$k_T(a, b) = \frac{a \cdot b}{a \cdot a + b \cdot b - a \cdot b},$$

the number of shared on-bits over the number of bits set in either fingerprint.
This is positive definite (Gower 1971) so it plugs into kernel ridge regression and Gaussian processes directly. The MinHash and SECFP extensions trade some bit collisions for better generalization on rare substructures.
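As a minimal sketch, the Tanimoto kernel can be computed directly on fingerprints stored as Python sets of on-bit indices. The fingerprints below are hypothetical stand-ins for the on-bits a cheminformatics toolkit would produce:

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprint bit sets."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are identical
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

# Toy fingerprints: sets of hashed substructure identifiers
# (hypothetical values, not real ECFP4 bits).
fp_ethanol  = {12, 87, 301, 555}
fp_methanol = {12, 87, 301}
fp_benzene  = {44, 98, 702, 811, 954}

print(tanimoto(fp_ethanol, fp_methanol))  # shares 3 of 4 bits -> 0.75
print(tanimoto(fp_ethanol, fp_benzene))   # disjoint bit sets  -> 0.0
```

The set form and the bit-vector form of the formula agree because for binary vectors the dot product counts shared on-bits.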
For materials and 3D chemistry the descriptor changes but the strategy does not. The Coulomb matrix (Rupp et al. 2012, Phys. Rev. Lett. 108) encodes a molecule as $M_{ij} = Z_i Z_j / |R_i - R_j|$ off-diagonal and $M_{ii} = \tfrac{1}{2} Z_i^{2.4}$ on the diagonal, sorted by row norm for permutation invariance. SOAP (Bartók-Kondor-Csányi 2013, Phys. Rev. B 87) projects a smooth atom density onto spherical harmonics for rotation- and permutation-invariance. FCHL (Christensen et al. 2020, J. Chem. Phys. 152) extends SOAP with chemistry-aware basis functions and is the strongest kernel baseline on QM7b and QM9 atomization energies.
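A minimal sketch of the sorted Coulomb matrix, assuming nuclear charges and coordinates in atomic units; the H2 geometry below is illustrative:

```python
import math

def coulomb_matrix(Z, R):
    """Sorted Coulomb matrix: M_ii = 0.5*Z_i^2.4, M_ij = Z_i*Z_j/|R_i - R_j|."""
    n = len(Z)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i][j] = 0.5 * Z[i] ** 2.4
            else:
                M[i][j] = Z[i] * Z[j] / math.dist(R[i], R[j])
    # Permutation invariance: order atoms by descending row norm and
    # apply the same permutation to rows and columns.
    order = sorted(range(n), key=lambda i: -math.hypot(*M[i]))
    return [[M[i][j] for j in order] for i in order]

# H2 with an illustrative bond length of 1.4 bohr (~0.74 Angstrom).
Z = [1, 1]
R = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)]
M = coulomb_matrix(Z, R)
# Diagonal entries: 0.5 * 1**2.4 = 0.5; off-diagonal: 1*1/1.4
```

To compare molecules of different sizes, the matrix is then zero-padded to the largest atom count in the dataset, as discussed under Common Confusions below.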
Kernel ridge regression is the workhorse fit. The $O(n^3)$ cost of solving the kernel system caps practical dataset sizes near $n \approx 10^4$ without low-rank approximations. For SMILES, string kernels and GP-BO implementations like GAUCHE (Griffiths et al. 2023) make Bayesian optimization over chemical space tractable at typical discovery-campaign scale.
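Kernel ridge regression on a Tanimoto kernel fits in a few lines. The sketch below solves $(K + \lambda I)\,\alpha = y$ with a pure-Python Gauss-Jordan elimination and trains on hypothetical fingerprint sets:

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprint bit sets."""
    if not a and not b:
        return 1.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def solve(A, y):
    """Solve A x = y by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    A = [row[:] + [yi] for row, yi in zip(A, y)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

def krr_fit(X, y, lam=1e-2):
    """Return alpha = (K + lam*I)^{-1} y for the Tanimoto kernel matrix K."""
    n = len(X)
    K = [[tanimoto(X[i], X[j]) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(K, y)

def krr_predict(X, alpha, x):
    """Prediction is a kernel-weighted sum over training points."""
    return sum(a * tanimoto(xi, x) for a, xi in zip(alpha, X))

# Toy training set: hypothetical fingerprints with made-up labels.
X = [{1, 2, 3}, {1, 2, 4}, {5, 6, 7}]
y = [1.0, 0.9, -1.0]
alpha = krr_fit(X, y)
print(krr_predict(X, alpha, {1, 2, 3}))  # close to 1.0 for small lam
```

The explicit elimination makes the $O(n^3)$ scaling visible; in practice one would use a Cholesky factorization of $K + \lambda I$ and, beyond $n \approx 10^4$, a low-rank approximation.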
The fair comparison against a GNN is not million-molecule benchmarks. On a 500-molecule slice of new chemistry, FCHL plus kernel ridge typically matches or beats a freshly trained GNN, gives calibrated uncertainty, and trains in seconds. The GNN wins decisively once training data reaches the tens of thousands of molecules, or when the descriptor stops being a good prior.
Common Confusions
Tanimoto similarity is not a metric and does not satisfy the triangle inequality
Tanimoto distance $1 - k_T$ is a metric on binary vectors, but the similarity itself is not. More importantly, two molecules with Tanimoto similarity above 0.85 on ECFP4 can have very different activities (the "activity cliff" phenomenon), and two with similarity below 0.3 can share a binding mode. Treat Tanimoto as a similarity prior, not a guarantee.
Coulomb matrices are not the same descriptor across molecules
The matrix is $N \times N$ for an $N$-atom molecule, so a molecule with 12 atoms produces a different-sized object than one with 18. Standard practice pads to a fixed maximum size and sorts rows by norm, but the sort-by-norm step is only piecewise smooth in the coordinates, which matters for force-field applications.
References
Rogers 2010 ECFP
Rogers, Hahn, "Extended-Connectivity Fingerprints," J. Chem. Inf. Model. 50(5), 2010, pp. 742-754.
Rupp 2012 Coulomb matrix
Rupp, Tkatchenko, Müller, von Lilienfeld, "Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning," Phys. Rev. Lett. 108, 2012, 058301.
Bartók 2013 SOAP
Bartók, Kondor, Csányi, "On representing chemical environments," Phys. Rev. B 87, 2013, 184115, arXiv:1209.3140.
Christensen 2020 FCHL
Christensen, Bratholm, Faber, von Lilienfeld, "FCHL revisited: Faster and more accurate quantum machine learning," J. Chem. Phys. 152(4), 2020, 044107.
Griffiths 2023 GAUCHE
Griffiths et al., "GAUCHE: A Library for Gaussian Processes in Chemistry," NeurIPS Datasets and Benchmarks 2023, arXiv:2212.02314.
Faber 2017 prediction errors
Faber et al., "Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error," J. Chem. Theory Comput. 13(11), 2017, pp. 5255-5264.
Last reviewed: April 18, 2026
Prerequisites
Foundations this topic depends on.
- Kernels and Reproducing Kernel Hilbert Spaces (Layer 3)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Rademacher Complexity (Layer 3)
- Empirical Risk Minimization (Layer 2)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- VC Dimension (Layer 2)
- Gaussian Processes for Machine Learning (Layer 4)