

Reinforcement Learning for Drug Discovery

Molecule generation as RL: REINVENT and MolDQN treat SMILES or graph edits as actions; rewards come from QED, synthetic accessibility, and docking. GFlowNets target diversity instead of single-mode optimization.


Why This Matters

The space of drug-like molecules is estimated at $10^{60}$. Brute-force search and human chemist intuition both fail at that scale. The early generative-chemistry wave (2017-2019) used VAEs over SMILES; the next wave reframed generation as sequential decision-making and applied policy gradients. The unit of action is either a token in a SMILES string or an edit on a molecular graph; the reward is a scalar score combining drug-likeness, synthetic accessibility, and predicted binding to a target.

The case for RL over straight likelihood maximization: rewards in chemistry are non-differentiable (a docking simulator, a wet-lab assay, a third-party predictor), and you want to bias generation toward high-reward regions of a space the prior model already covers. Policy-gradient fine-tuning of a pretrained generator is the standard pattern.

The case against single-mode RL is mode collapse. Drug discovery wants a diverse panel of $10^2$ to $10^3$ candidates, not the single highest-reward molecule. GFlowNets (Bengio et al. 2021, NeurIPS) reframe generation to sample proportionally to the reward, returning an inherently diverse population.

Core Ideas

SMILES as a sequential action space. A SMILES string is a token sequence over an alphabet of atoms, bonds, and ring markers. A pretrained autoregressive model defines a policy $\pi_\theta(a_t \mid s_{<t})$ over next tokens. REINVENT (Olivecrona, Blaschke, Engkvist, Chen 2017, J. Cheminform. 9) fine-tunes a SMILES RNN with policy gradients, augmented by a KL penalty to a frozen prior so the policy stays in chemically valid regions. The objective is $\mathbb{E}_{x \sim \pi_\theta}[R(x)] - \beta \, \mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{prior}})$, which is structurally identical to RLHF for language models. See policy gradient theorem.
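The KL-regularized objective can be demonstrated at its smallest scale: a one-step categorical policy over a hypothetical 8-token vocabulary with made-up per-token rewards (the real REINVENT optimizes full SMILES sequences with an RNN). For this one-step case the policy gradient of $\mathbb{E}[R] - \beta\,\mathrm{KL}$ has a closed form, which the sketch below ascends directly:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 8                              # hypothetical toy token vocabulary
prior_logits = rng.normal(size=VOCAB)  # frozen pretrained prior
theta = prior_logits.copy()            # agent policy, initialized from the prior
reward = rng.uniform(size=VOCAB)       # stand-in scalar reward R(x) per token
beta, lr = 0.1, 0.5                    # KL weight and learning rate

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

prior = softmax(prior_logits)
for _ in range(200):
    pi = softmax(theta)
    # exact gradient of E_pi[R] - beta * KL(pi || prior) w.r.t. logits for a
    # one-step categorical policy: pi_i * (adv_i - E_pi[adv])
    adv = reward - beta * (np.log(pi) - np.log(prior))
    theta += lr * pi * (adv - pi @ adv)

pi = softmax(theta)
# mass shifts toward high-reward tokens while the KL term keeps the policy
# anchored to the prior (smaller beta = stronger reward tilt)
```

The fixed point of this objective is the tilted distribution $\pi^*(x) \propto \pi_{\mathrm{prior}}(x)\,e^{R(x)/\beta}$, which is why $\beta$ controls how far fine-tuning is allowed to drift from pretrained chemistry.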

Graph edits as actions. SMILES has edge cases (invalid strings, ambiguous resonance) that graph representations avoid. MolDQN (Zhou et al. 2019, Sci. Rep. 9) treats molecule construction as Q-learning on a graph: states are partial molecules, actions are atom additions, bond additions, or bond removals. The Q-function scores each (state, action) pair; training uses double Q-learning with prioritized replay. Generation produces only valid molecules by construction.
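A minimal sketch of the same idea, assuming a toy chain-building MDP in place of real molecular graphs: states are atom tuples, actions add an atom or stop, and a stand-in terminal score replaces the chemistry rewards. Real MolDQN uses a neural Q-function and graph edits; this tabular version shows only the double Q-learning mechanics:

```python
import random
from collections import defaultdict

random.seed(0)

# hypothetical chain-building MDP standing in for molecular graph editing
ATOMS, MAX_LEN, STOP = ["C", "N", "O"], 3, "stop"

def actions(state):
    adds = [f"add:{a}" for a in ATOMS] if len(state) < MAX_LEN else []
    return adds + [STOP]

def terminal_reward(state):
    # stand-in score; real MolDQN uses QED/SA/docking composites
    return state.count("N") - 0.5 * state.count("O")

def step(state, action):
    if action == STOP:
        return state, terminal_reward(state), True
    return state + (action.split(":")[1],), 0.0, False

qa, qb = defaultdict(float), defaultdict(float)
alpha, gamma, eps = 0.2, 1.0, 0.2

for _ in range(2000):
    s, done = (), False
    while not done:
        acts = actions(s)
        if random.random() < eps:                 # epsilon-greedy exploration
            a = random.choice(acts)
        else:
            a = max(acts, key=lambda x: qa[(s, x)] + qb[(s, x)])
        s2, r, done = step(s, a)
        # double Q-learning: one table picks the argmax, the other evaluates it
        learn, other = (qa, qb) if random.random() < 0.5 else (qb, qa)
        target = r
        if not done:
            best = max(actions(s2), key=lambda x: learn[(s2, x)])
            target += gamma * other[(s2, best)]
        learn[(s, a)] += alpha * (target - learn[(s, a)])
        s = s2

# greedy rollout with the learned Q-values
s, done = (), False
while not done:
    a = max(actions(s), key=lambda x: qa[(s, x)] + qb[(s, x)])
    s, r, done = step(s, a)
```

The two-table trick addresses the max-operator overestimation bias that single-table Q-learning suffers from, which matters when the reward signal is as noisy as a docking score.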

Reward design. Common reward components: drug-likeness QED (Bickerton et al. 2012, normalized $[0, 1]$ score over molecular weight, logP, HBD/HBA, rotatable bonds, aromatic rings, structural alerts); synthetic accessibility SA (Ertl, Schuffenhauer 2009, $[1, 10]$ scale, lower is easier); docking score from AutoDock or Glide as a binding-affinity proxy; selectivity over off-target proteins; predicted ADMET properties from QSAR models. The composite reward is usually a weighted sum or a multi-objective scalarization. Reward hacking is universal: optimizers find molecules with great QED that are synthetically intractable, or great docking scores that are docking artifacts, not real binders. See reward hacking.
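A weighted-sum scalarization can be sketched in a few lines. The component ranges (QED in $[0,1]$, SA in $[1,10]$) come from the text; the normalizers and the docking cap of -12 kcal/mol are illustrative assumptions, not standard values:

```python
def composite_reward(scores, weights):
    """Weighted-sum scalarization of normalized reward components."""
    # each normalizer maps a raw score into [0, 1] with 1 = better;
    # the -12 kcal/mol docking cap is an assumed illustrative constant
    normalizers = {
        "qed":  lambda v: v,                        # QED already lives in [0, 1]
        "sa":   lambda v: (10.0 - v) / 9.0,         # SA in [1, 10], lower is easier
        "dock": lambda v: min(max(-v / 12.0, 0.0), 1.0),  # more negative = better
    }
    total = sum(weights.values())
    return sum(w * normalizers[k](scores[k]) for k, w in weights.items()) / total

# decent drug-likeness, easy synthesis, moderate docking score
r = composite_reward({"qed": 0.7, "sa": 3.0, "dock": -9.0},
                     {"qed": 1.0, "sa": 0.5, "dock": 2.0})
```

Putting all normalization choices in one place like this also makes the reward auditable, which helps when diagnosing the hacking failure modes described above.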

GFlowNets for diversity. Standard policy gradients converge toward the argmax. GFlowNets train a sampler $\pi_\theta$ such that $\pi_\theta(x) \propto R(x)$, treating the reward as an unnormalized target distribution. The training objective enforces a flow-conservation condition on the generation DAG: the sum of incoming flows to a state equals the sum of outgoing flows, with terminal states absorbing flow equal to their reward. Sampling from a trained GFlowNet yields candidates with frequency proportional to their reward, which gives diverse high-reward sets in one pass. Bengio et al. 2021 introduced the framework; subsequent work scaled it to molecule and biological-sequence design.
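The flow-conservation condition can be made concrete on a tiny generation tree. Real GFlowNets learn the flows with objectives like trajectory balance; here the flows are computed exactly by recursion over a made-up two-letter sequence space, and sampling each child with probability proportional to its flow yields $\pi(x) \propto R(x)$:

```python
import itertools
import random

random.seed(0)

ALPHABET, LENGTH = ["A", "B", "C"], 2     # hypothetical toy sequence space
R = {"AA": 1, "AB": 2, "AC": 1, "BA": 4, "BB": 1,
     "BC": 1, "CA": 1, "CB": 1, "CC": 8}  # made-up positive rewards

# flow conservation on the generation tree: terminal flow is the reward,
# and each internal state's flow is the sum of its children's flows
def flow(state):
    if len(state) == LENGTH:
        return R["".join(state)]
    return sum(flow(state + (a,)) for a in ALPHABET)

# the forward policy P(child | state) = flow(child) / flow(state)
# samples complete sequences with probability proportional to reward
def sample():
    state = ()
    while len(state) < LENGTH:
        children = [state + (a,) for a in ALPHABET]
        state = random.choices(children, weights=[flow(c) for c in children])[0]
    return "".join(state)

Z = flow(())          # partition function: total reward mass (20 here)
counts = {"".join(p): 0 for p in itertools.product(ALPHABET, repeat=LENGTH)}
n = 20000
for _ in range(n):
    counts[sample()] += 1
# empirical frequency of "CC" approaches R["CC"] / Z = 8 / 20 = 0.4
```

Note the contrast with the argmax behavior of policy gradients: "CC" is sampled most often, but "BA" and "AB" still appear with frequency proportional to their rewards, which is exactly the diverse-panel property drug discovery wants.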

Exploration in chemical space. Even with diversity-aware methods, the reachable subset of the $10^{60}$ molecules from a pretrained generator is small. Pretraining on ChEMBL or ZINC biases toward known chemistry. Out-of-distribution generation is hard because the reward model was also trained on similar data and extrapolates poorly. Active learning with uncertainty estimates, or coupling to wet-lab feedback loops, is the current frontier. See exploration vs. exploitation.
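One common active-learning pattern pairs an ensemble reward model with an upper-confidence-bound acquisition rule: ensemble disagreement serves as a cheap uncertainty proxy, and candidates are prioritized by predicted reward plus an uncertainty bonus. A minimal sketch with made-up ensemble predictions:

```python
import numpy as np

# scores from a (hypothetical) ensemble of 3 reward models over 4 candidates;
# member disagreement is a cheap uncertainty proxy for out-of-distribution inputs
preds = np.array([
    [0.60, 0.50, 0.30, 0.55],
    [0.62, 0.10, 0.32, 0.56],
    [0.58, 0.90, 0.28, 0.54],
])

mean, std = preds.mean(axis=0), preds.std(axis=0)
kappa = 2.0                      # exploration weight (assumed)
ucb = mean + kappa * std         # upper-confidence-bound acquisition
pick = int(np.argmax(ucb))
# pure exploitation would pick candidate 0 (highest mean); the UCB rule
# prefers candidate 1, whose high ensemble disagreement signals unexplored space
```

In a wet-lab loop, the picked candidates are assayed, the measurements are added to the reward model's training set, and the acquisition repeats, trading synthesis budget for reduced extrapolation error.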

Common Confusions

Watch Out

Validity is not goodness

RNN policies on SMILES regularly produce invalid strings or chemically nonsensical fragments. Validity is the bare minimum, not a measure of usefulness. Most papers report validity, uniqueness, and novelty as table-stakes; the harder bar is whether generated molecules are synthetically tractable, drug-like, and selective for the intended target. Most published RL-for-chemistry methods overstate progress by reporting on these proxy metrics without prospective wet-lab validation.

Watch Out

Docking scores are noisy proxies, not binding affinities

Docking simulates rigid-body fitting of a ligand into a protein pocket. Real binding involves conformational change, water displacement, and entropic effects that docking ignores. Correlation between docking score and measured affinity is often $r < 0.5$. Optimizing docking aggressively tends to find pose artifacts. More expensive methods (free-energy perturbation, MM-PBSA) help but are 100x slower.


Last reviewed: April 18, 2026
