Applied ML

Deep Generative Models for Molecules

VAEs on SMILES, Junction-Tree VAEs on graphs, normalizing flows like GraphAF and MoFlow, and equivariant 3D diffusion (EDM, GeoLDM). The evaluation pitfalls — validity, novelty, drug-likeness — are as important as the architectures.

Advanced · Tier 3 · Current · ~15 min

Why This Matters

Generative models for molecules promise inverse design: condition on a property (binding affinity, band gap, solubility) and sample structures that are likely to satisfy it. The hard part is not the conditioning. It is generating chemically valid, synthesizable, and genuinely novel candidates rather than minor edits of training-set molecules that happen to fool a property predictor.

The history of the field is partly a history of broken benchmarks. Validity rates near 100% are easy to claim with grammar-aware decoders; novelty and uniqueness collapse if the model memorizes ZINC. Drug-likeness metrics like QED can be optimized to ceiling values by molecules no chemist would synthesize. A reader of this literature should treat headline numbers with the same skepticism they apply to image-generation FID scores.

Core Ideas

The first wave used SMILES strings as the molecular representation. Gómez-Bombarelli et al. (2018, ACS Cent. Sci. 4) trained a character-level VAE on ZINC SMILES and ran Bayesian optimization in the latent space. Random samples produced syntactically broken SMILES roughly 70% of the time. Grammar VAE and SD-VAE attacked this by constraining the decoder: Grammar VAE emits production rules of a context-free SMILES grammar, and SD-VAE adds semantic checks that a context-free grammar cannot express, such as matched ring-closure digits. The junction-tree VAE (Jin, Barzilay, Jaakkola 2018, ICML) sidestepped strings entirely by decoding a tree of chemical substructures and assembling them into a valid molecular graph.
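For orientation, the objective a character-level VAE of this kind maximizes is the standard evidence lower bound. The notation here is ours, not the paper's: $x$ is a SMILES string of length $T$, $z$ the latent code, $q_\phi$ the encoder, and $p_\theta$ an autoregressive character decoder.

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}, z) \right]
  - \mathrm{KL}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right),
\qquad p(z) = \mathcal{N}(0, I)
```

Nothing in this objective forces a decoded string to parse: the reconstruction term rewards matching training characters one at a time, so syntactic validity has to come from the decoder's structure (a grammar, a junction tree) rather than from the loss.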

Normalizing flows on molecular graphs (GraphAF, Shi et al. 2020, ICLR; MoFlow, Zang and Wang 2020, KDD) parameterize an invertible map between a Gaussian latent and the discrete adjacency and node-type tensors of the graph. The benefit over VAEs is exact log-likelihood and tractable inverses, useful for property optimization with gradient ascent in latent space. The cost is architectural complexity around discrete variables: dequantization adds noise, and autoregressive flows make sampling sequential.
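The two ingredients named above, dequantization and an exact likelihood through an invertible map, can be illustrated in one dimension. This is a minimal sketch, not GraphAF or MoFlow: the function names are ours, and the fixed `shift` and `log_scale` stand in for values that a real model would predict with a network conditioned on the partial graph.

```python
import math
import random


def standard_normal_logpdf(z):
    """Log-density of N(0, 1) evaluated at z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def dequantize(category, rng):
    """Uniform dequantization: spread the integer label k over [k, k + 1)."""
    return category + rng.random()


def affine_forward(x, shift, log_scale):
    """Invertible map x -> z, returned together with log|dz/dx|."""
    z = (x - shift) * math.exp(-log_scale)
    return z, -log_scale


def affine_inverse(z, shift, log_scale):
    """Exact inverse of affine_forward, used when sampling."""
    return z * math.exp(log_scale) + shift


def log_likelihood(x, shift, log_scale):
    """Change of variables: log p(x) = log N(z; 0, 1) + log|dz/dx|."""
    z, log_det = affine_forward(x, shift, log_scale)
    return standard_normal_logpdf(z) + log_det


# Toy usage: atom-type index 2 becomes a continuous value, is mapped to the
# latent, and is recovered by the inverse followed by a floor.
rng = random.Random(0)
x = dequantize(2, rng)
z, _ = affine_forward(x, shift=1.5, log_scale=0.3)
recovered_category = math.floor(affine_inverse(z, shift=1.5, log_scale=0.3))
```

The floor at the end is where dequantization noise is discarded again. In an autoregressive flow this forward-and-inverse pair is applied once per node and edge, which is why sampling is sequential.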

The current frontier is 3D diffusion. EDM (Hoogeboom et al. 2022, ICML) trains an E(3)-equivariant denoising network on QM9 atomic coordinates and types, generating geometries directly without an intermediate graph step. GeoLDM (Xu et al. 2023, ICML) adds a latent autoencoder so the diffusion runs in a lower-dimensional space, with reported gains on validity-and-uniqueness metrics. For pocket-conditioned drug design, the DiffSBDD line (Schneuing et al. 2024) and PocketFlow port the same machinery to ligand generation inside a fixed protein binding pocket.
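Equivariance here means that rotating or translating the input coordinates transforms the output the same way. A minimal sketch, in the spirit of EGNN-style coordinate updates and not EDM's actual network, makes the property checkable numerically. The weight function `1 / (1 + d^2)` is a hypothetical stand-in for a learned MLP, and the three-atom geometry is an illustrative toy.

```python
import math


def egnn_style_update(coords):
    """Coordinate update x_i' = x_i + sum_j w(d_ij^2) * (x_i - x_j).

    The weight depends only on squared distances (invariant), and the
    direction x_i - x_j rotates with the input, so the update is equivariant.
    """
    updated = []
    for i, xi in enumerate(coords):
        delta = [0.0, 0.0, 0.0]
        for j, xj in enumerate(coords):
            if i == j:
                continue
            diff = [a - b for a, b in zip(xi, xj)]
            d2 = sum(c * c for c in diff)
            weight = 1.0 / (1.0 + d2)  # hypothetical stand-in for a learned MLP
            for k in range(3):
                delta[k] += weight * diff[k]
        updated.append([xi[k] + delta[k] for k in range(3)])
    return updated


def rotate_z_then_translate(coords, angle, shift):
    """Rotate every point about the z-axis by `angle`, then add `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        [c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2]]
        for x, y, z in coords
    ]


# Toy three-atom geometry (coordinates are illustrative only).
toy_coords = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
angle, shift = 0.7, (1.0, -2.0, 0.5)

# Transform-then-update should equal update-then-transform.
transform_first = egnn_style_update(rotate_z_then_translate(toy_coords, angle, shift))
update_first = rotate_z_then_translate(egnn_style_update(toy_coords), angle, shift)
max_gap = max(
    abs(p - q)
    for row_a, row_b in zip(transform_first, update_first)
    for p, q in zip(row_a, row_b)
)
```

`max_gap` should sit at floating-point noise. The check covers one rotation and one translation only; E(3) also includes reflections, which the same update respects for the same reason.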

Evaluation pitfalls cut across all of these. Validity is the easy metric, usually > 95%. Uniqueness measures distinct samples in n draws and is sensitive to mode collapse. Novelty is typically reported only against ZINC, not larger commercial catalogs. GuacaMol and MOSES include "drug-likeness gaming" filters because earlier papers reported near-perfect penalized logP on molecules that violated synthetic feasibility (Renz et al. 2019).
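The three headline metrics are simple set operations once a canonicalizer is fixed. The sketch below assumes one common convention (validity over all samples, uniqueness over valid samples, novelty over unique valid samples); papers differ on the denominators, so check each benchmark's definition. The `canonicalize` argument is a placeholder: with RDKit it could wrap `Chem.MolFromSmiles` and `Chem.MolToSmiles`, while here a hypothetical lookup table keeps the example self-contained.

```python
def generation_metrics(samples, training_set, canonicalize):
    """Validity, uniqueness, and novelty for a list of generated strings.

    `canonicalize(s)` must return a canonical string, or None if `s` is
    invalid. `training_set` is assumed to hold canonical strings already.
    """
    canonical = [canonicalize(s) for s in samples]
    valid = [c for c in canonical if c is not None]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(samples) if samples else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Hypothetical toy canonicalizer: a lookup table instead of a real parser.
toy_canonical_forms = {
    "CCO": "CCO",
    "OCC": "CCO",  # same molecule as CCO, written from the other end
    "c1ccccc1": "c1ccccc1",
}
samples = ["CCO", "OCC", "c1ccccc1", "C(", "CCO"]
metrics = generation_metrics(samples, {"CCO"}, toy_canonical_forms.get)
```

On this toy input, four of five strings are valid, but they collapse to two distinct molecules, only one of which is absent from the training set. Skipping canonicalization would count `CCO` and `OCC` as different and inflate both uniqueness and novelty.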

Common Confusions

Watch Out

Validity is a syntactic check, not a chemistry check

RDKit-validity confirms a parseable SMILES with consistent valences. It does not check that the molecule is stable, isolable, or known to exist. A 100% validity rate can coexist with a sample population that is mostly strained ring systems or unstable functional groups. Use SAscore and expert filters in addition to validity.

Watch Out

Novelty against ZINC is a weak novelty claim

ZINC is large but is not the universe of known compounds. A model that samples novel-versus-ZINC molecules is often producing compounds present in PubChem or in proprietary catalogs. The literal claim "this molecule has never been made" requires checking against patent and commercial databases.
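How much a novelty number depends on the reference set can be made explicit by reporting it cumulatively as references are added. This is a minimal sketch with hypothetical names and placeholder identifiers; real use would pass canonical SMILES or InChIKeys exported from each database.

```python
def novelty_by_reference(generated, references):
    """Cumulative novelty as each reference collection is added in order.

    `references` maps a label to a collection of canonical identifiers.
    Each reported value is the fraction of generated molecules not found
    in any reference seen so far.
    """
    generated = set(generated)
    seen = set()
    report = {}
    for label, known in references.items():
        seen |= set(known)
        report[label] = len(generated - seen) / len(generated) if generated else 0.0
    return report


# Placeholder identifiers; the labels are illustrative, not real database dumps.
generated = {"mol_a", "mol_b", "mol_c", "mol_d"}
references = {
    "zinc_like": {"mol_a"},
    "plus_pubchem_like": {"mol_b", "mol_c"},
    "plus_vendor_like": {"mol_c"},
}
novelty_report = novelty_by_reference(generated, references)
```

In this toy case novelty drops from 0.75 against the first reference to 0.25 once the second is included, which is the pattern the warning above describes.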

References

Gómez-Bombarelli 2018

Gómez-Bombarelli et al., "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules," ACS Cent. Sci. 4(2), 2018, pp. 268-276, arXiv:1610.02415.

Jin 2018 JT-VAE

Jin, Barzilay, Jaakkola, "Junction Tree Variational Autoencoder for Molecular Graph Generation," ICML 2018, arXiv:1802.04364.

Shi 2020 GraphAF

Shi et al., "GraphAF: a Flow-based Autoregressive Model for Molecular Graph Generation," ICLR 2020, arXiv:2001.09382.

Hoogeboom 2022 EDM

Hoogeboom, Satorras, Vignac, Welling, "Equivariant Diffusion for Molecule Generation in 3D," ICML 2022, arXiv:2203.17003.

Xu 2023 GeoLDM

Xu et al., "Geometric Latent Diffusion Models for 3D Molecule Generation," ICML 2023, arXiv:2305.01140.

Renz 2019 evaluation

Renz et al., "On failure modes in molecule generation and optimization," Drug Discov. Today: Technol. 32-33, 2019, pp. 55-63.

Last reviewed: April 18, 2026