

Reinforcement Learning for Drug Discovery

Molecule generation as RL: REINVENT and MolDQN treat SMILES or graph edits as actions; rewards come from QED, synthetic accessibility, and docking. GFlowNets target diversity instead of single-mode optimization.


Why This Matters

The space of drug-like molecules is estimated at $10^{60}$. Brute-force search and human chemist intuition both fail at that scale. The early generative-chemistry wave (2017-2019) used VAEs over SMILES; the next wave reframed generation as sequential decision-making and applied policy gradients. The unit of action is either a token in a SMILES string or an edit on a molecular graph; the reward is a scalar score combining drug-likeness, synthetic accessibility, and predicted binding to a target.

The case for RL over straight likelihood maximization: rewards in chemistry are non-differentiable (a docking simulator, a wet-lab assay, a third-party predictor), and you want to bias generation toward high-reward regions of a space the prior model already covers. Policy-gradient fine-tuning of a pretrained generator is the standard pattern.

The case against single-mode RL is mode collapse. Drug discovery wants a diverse panel of $10^2$ to $10^3$ candidates, not the single highest-reward molecule. GFlowNets (Bengio et al. 2021, NeurIPS) reframe generation to sample proportionally to the reward, returning an inherently diverse population.

Core Ideas

SMILES as a sequential action space. A SMILES string is a token sequence over an alphabet of atoms, bonds, and ring markers. A pretrained autoregressive model defines a policy $\pi_\theta(a_t \mid s_{<t})$ over next tokens. REINVENT (Olivecrona, Blaschke, Engkvist, Chen 2017, J. Cheminform. 9) fine-tunes a SMILES RNN with policy gradients, augmented by a KL penalty to a frozen prior so the policy stays in chemically valid regions. The objective is $\mathbb{E}_{x \sim \pi_\theta}[R(x)] - \beta \, \mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{prior}})$, which is structurally identical to RLHF for language models. See policy gradient theorem.
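The KL-regularized objective can be demonstrated at its smallest scale: a one-step categorical policy over a hypothetical 8-token vocabulary with made-up per-token rewards (the real REINVENT optimizes full SMILES sequences with an RNN). For this one-step case the policy gradient of $\mathbb{E}[R] - \beta\,\mathrm{KL}$ has a closed form, which the sketch below ascends directly:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 8                              # hypothetical toy token vocabulary
prior_logits = rng.normal(size=VOCAB)  # frozen pretrained prior
theta = prior_logits.copy()            # agent policy, initialized from the prior
reward = rng.uniform(size=VOCAB)       # stand-in scalar reward R(x) per token
beta, lr = 0.1, 0.5                    # KL weight and learning rate

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

prior = softmax(prior_logits)
for _ in range(200):
    pi = softmax(theta)
    # exact gradient of E_pi[R] - beta * KL(pi || prior) w.r.t. logits for a
    # one-step categorical policy: pi_i * (adv_i - E_pi[adv])
    adv = reward - beta * (np.log(pi) - np.log(prior))
    theta += lr * pi * (adv - pi @ adv)

pi = softmax(theta)
# mass shifts toward high-reward tokens while the KL term keeps the policy
# anchored to the prior (smaller beta = stronger reward tilt)
```

The fixed point of this objective is the tilted distribution $\pi^*(x) \propto \pi_{\mathrm{prior}}(x)\,e^{R(x)/\beta}$, which is why $\beta$ controls how far fine-tuning is allowed to drift from pretrained chemistry.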

Graph edits as actions. SMILES has edge cases (invalid strings, ambiguous resonance) that graph representations avoid. MolDQN (Zhou et al. 2019, Sci. Rep. 9) treats molecule construction as Q-learning on a graph: states are partial molecules, actions are atom additions, bond additions, or bond removals. The Q-function scores each (state, action) pair; training uses double Q-learning with prioritized replay. Generation produces only valid molecules by construction.
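A minimal sketch of the same idea, assuming a toy chain-building MDP in place of real molecular graphs: states are atom tuples, actions add an atom or stop, and a stand-in terminal score replaces the chemistry rewards. Real MolDQN uses a neural Q-function and graph edits; this tabular version shows only the double Q-learning mechanics:

```python
import random
from collections import defaultdict

random.seed(0)

# hypothetical chain-building MDP standing in for molecular graph editing
ATOMS, MAX_LEN, STOP = ["C", "N", "O"], 3, "stop"

def actions(state):
    adds = [f"add:{a}" for a in ATOMS] if len(state) < MAX_LEN else []
    return adds + [STOP]

def terminal_reward(state):
    # stand-in score; real MolDQN uses QED/SA/docking composites
    return state.count("N") - 0.5 * state.count("O")

def step(state, action):
    if action == STOP:
        return state, terminal_reward(state), True
    return state + (action.split(":")[1],), 0.0, False

qa, qb = defaultdict(float), defaultdict(float)
alpha, gamma, eps = 0.2, 1.0, 0.2

for _ in range(2000):
    s, done = (), False
    while not done:
        acts = actions(s)
        if random.random() < eps:                 # epsilon-greedy exploration
            a = random.choice(acts)
        else:
            a = max(acts, key=lambda x: qa[(s, x)] + qb[(s, x)])
        s2, r, done = step(s, a)
        # double Q-learning: one table picks the argmax, the other evaluates it
        learn, other = (qa, qb) if random.random() < 0.5 else (qb, qa)
        target = r
        if not done:
            best = max(actions(s2), key=lambda x: learn[(s2, x)])
            target += gamma * other[(s2, best)]
        learn[(s, a)] += alpha * (target - learn[(s, a)])
        s = s2

# greedy rollout with the learned Q-values
s, done = (), False
while not done:
    a = max(actions(s), key=lambda x: qa[(s, x)] + qb[(s, x)])
    s, r, done = step(s, a)
```

The two-table trick addresses the max-operator overestimation bias that single-table Q-learning suffers from, which matters when the reward signal is as noisy as a docking score.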

Reward design. Common reward components: drug-likeness QED (Bickerton et al. 2012, normalized $[0, 1]$ score over molecular weight, logP, HBD/HBA, rotatable bonds, aromatic rings, structural alerts); synthetic accessibility SA (Ertl, Schuffenhauer 2009, $[1, 10]$ scale, lower is easier); docking score from AutoDock or Glide as a binding-affinity proxy; selectivity over off-target proteins; predicted ADMET properties from QSAR models. The composite reward is usually a weighted sum or a multi-objective scalarization. Reward hacking is universal: optimizers find molecules with great QED that are synthetically intractable, or great docking scores that are docking artifacts, not real binders. See reward hacking.
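A weighted-sum scalarization can be sketched in a few lines. The component ranges (QED in $[0,1]$, SA in $[1,10]$) come from the text; the normalizers and the docking cap of -12 kcal/mol are illustrative assumptions, not standard values:

```python
def composite_reward(scores, weights):
    """Weighted-sum scalarization of normalized reward components."""
    # each normalizer maps a raw score into [0, 1] with 1 = better;
    # the -12 kcal/mol docking cap is an assumed illustrative constant
    normalizers = {
        "qed":  lambda v: v,                        # QED already lives in [0, 1]
        "sa":   lambda v: (10.0 - v) / 9.0,         # SA in [1, 10], lower is easier
        "dock": lambda v: min(max(-v / 12.0, 0.0), 1.0),  # more negative = better
    }
    total = sum(weights.values())
    return sum(w * normalizers[k](scores[k]) for k, w in weights.items()) / total

# decent drug-likeness, easy synthesis, moderate docking score
r = composite_reward({"qed": 0.7, "sa": 3.0, "dock": -9.0},
                     {"qed": 1.0, "sa": 0.5, "dock": 2.0})
```

Putting all normalization choices in one place like this also makes the reward auditable, which helps when diagnosing the hacking failure modes described above.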

GFlowNets for diversity. Standard policy gradients converge toward the argmax. GFlowNets train a sampler $\pi_\theta$ such that $\pi_\theta(x) \propto R(x)$, treating the reward as an unnormalized target distribution. The training objective enforces a flow-conservation condition on the generation DAG: the sum of incoming flows to a state equals the sum of outgoing flows, with terminal states absorbing flow equal to their reward. Sampling from a trained GFlowNet yields candidates with frequency proportional to their reward, which gives diverse high-reward sets in one pass. Bengio et al. 2021 introduced the framework; subsequent work scaled it to molecule and biological-sequence design.
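The flow-conservation condition can be made concrete on a tiny generation tree. Real GFlowNets learn the flows with objectives like trajectory balance; here the flows are computed exactly by recursion over a made-up two-letter sequence space, and sampling each child with probability proportional to its flow yields $\pi(x) \propto R(x)$:

```python
import itertools
import random

random.seed(0)

ALPHABET, LENGTH = ["A", "B", "C"], 2     # hypothetical toy sequence space
R = {"AA": 1, "AB": 2, "AC": 1, "BA": 4, "BB": 1,
     "BC": 1, "CA": 1, "CB": 1, "CC": 8}  # made-up positive rewards

# flow conservation on the generation tree: terminal flow is the reward,
# and each internal state's flow is the sum of its children's flows
def flow(state):
    if len(state) == LENGTH:
        return R["".join(state)]
    return sum(flow(state + (a,)) for a in ALPHABET)

# the forward policy P(child | state) = flow(child) / flow(state)
# samples complete sequences with probability proportional to reward
def sample():
    state = ()
    while len(state) < LENGTH:
        children = [state + (a,) for a in ALPHABET]
        state = random.choices(children, weights=[flow(c) for c in children])[0]
    return "".join(state)

Z = flow(())          # partition function: total reward mass (20 here)
counts = {"".join(p): 0 for p in itertools.product(ALPHABET, repeat=LENGTH)}
n = 20000
for _ in range(n):
    counts[sample()] += 1
# empirical frequency of "CC" approaches R["CC"] / Z = 8 / 20 = 0.4
```

Note the contrast with the argmax behavior of policy gradients: "CC" is sampled most often, but "BA" and "AB" still appear with frequency proportional to their rewards, which is exactly the diverse-panel property drug discovery wants.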

Exploration in chemical space. Even with diversity-aware methods, the reachable subset of the $10^{60}$ molecules from a pretrained generator is small. Pretraining on ChEMBL or ZINC biases toward known chemistry. Out-of-distribution generation is hard because the reward model was also trained on similar data and extrapolates poorly. Active learning with uncertainty estimates, or coupling to wet-lab feedback loops, is the current frontier. See exploration vs. exploitation.
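One common active-learning pattern pairs an ensemble reward model with an upper-confidence-bound acquisition rule: ensemble disagreement serves as a cheap uncertainty proxy, and candidates are prioritized by predicted reward plus an uncertainty bonus. A minimal sketch with made-up ensemble predictions:

```python
import numpy as np

# scores from a (hypothetical) ensemble of 3 reward models over 4 candidates;
# member disagreement is a cheap uncertainty proxy for out-of-distribution inputs
preds = np.array([
    [0.60, 0.50, 0.30, 0.55],
    [0.62, 0.10, 0.32, 0.56],
    [0.58, 0.90, 0.28, 0.54],
])

mean, std = preds.mean(axis=0), preds.std(axis=0)
kappa = 2.0                      # exploration weight (assumed)
ucb = mean + kappa * std         # upper-confidence-bound acquisition
pick = int(np.argmax(ucb))
# pure exploitation would pick candidate 0 (highest mean); the UCB rule
# prefers candidate 1, whose high ensemble disagreement signals unexplored space
```

In a wet-lab loop, the picked candidates are assayed, the measurements are added to the reward model's training set, and the acquisition repeats, trading synthesis budget for reduced extrapolation error.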

Common Confusions

Watch Out

Validity is not goodness

RNN policies on SMILES regularly produce invalid strings or chemically nonsensical fragments. Validity is the bare minimum, not a measure of usefulness. Most papers report validity, uniqueness, and novelty as table-stakes; the harder bar is whether generated molecules are synthetically tractable, drug-like, and selective for the intended target. Most published RL-for-chemistry methods overstate progress by reporting on these proxy metrics without prospective wet-lab validation.

Watch Out

Docking scores are noisy proxies, not binding affinities

Docking simulates rigid-body fitting of a ligand into a protein pocket. Real binding involves conformational change, water displacement, and entropic effects that docking ignores. Correlation between docking score and measured affinity is often $r < 0.5$. Optimizing docking aggressively tends to find pose artifacts. More expensive methods (free-energy perturbation, MM-PBSA) help but are 100x slower.


Last reviewed: April 18, 2026
