Reinforcement Learning for Synthesis Planning

Sneiderman, Robby

Applied ML

Retrosynthesis as tree search: AlphaChem-style MCTS over learned reaction templates, transformer-based template-free models, and reward shaping with synthetic accessibility heuristics.

Advanced · Tier 3 · Current · Reference · ~15 min

Prerequisites

Markov Decision Processes, Policy Gradient Theorem

Why This Matters

Retrosynthesis is the inverse problem of organic synthesis: given a target molecule, propose a sequence of reactions that build it from purchasable starting materials. The search tree branches on every disconnection (thousands of templates apply to a typical target) and goes 5-15 steps deep before reaching commercial precursors. Exhaustive enumeration is infeasible. Expert chemists prune by chemical intuition; that intuition is exactly what learned policies and value functions can encode.

Synthesis planning sits between real bottlenecks. A 20-step drug candidate is effectively unsynthesizable even with excellent predicted activity. In process chemistry, route cost is measured in weeks of bench work and kilograms of waste solvent. A planner that returns a 6-step route instead of an 11-step one is a direct win, not a benchmark game.

Core Ideas

The classic formulation treats retrosynthesis as a Markov decision process: the state is the current set of unresolved target molecules, the action is a (template, target) pair, the transition applies the template's disconnection, and the episode terminates when every leaf is in a stock catalog. The reward is sparse: 1 for a complete route, 0 otherwise, often shaped by route length and step plausibility.
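The MDP above can be made concrete with a toy sketch. Everything here is a hypothetical stand-in: molecules are plain strings, the two "templates" are hand-written rules, and `STOCK`, `TEMPLATES`, `initial_state`, `legal_actions`, and `step` are illustrative names rather than any library's API. The state tracks only the unresolved (non-stock) molecules, and the reward is the unshaped sparse signal.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy chemistry: purchasable building blocks and two hand-written rules.
STOCK = {"A", "B", "C"}
TEMPLATES: Dict[str, Callable[[str], Optional[List[str]]]] = {
    "split_ABC": lambda m: ["AB", "C"] if m == "ABC" else None,
    "split_AB": lambda m: ["A", "B"] if m == "AB" else None,
}

Action = Tuple[str, str]  # (template name, molecule to disconnect)


def initial_state(target: str) -> Counter:
    """State = multiset of unresolved molecules; stock molecules are already resolved."""
    return Counter() if target in STOCK else Counter([target])


def legal_actions(state: Counter) -> List[Action]:
    """Every (template, molecule) pair whose template applies to an open molecule."""
    return [
        (name, mol)
        for mol in sorted(state)
        for name, rule in sorted(TEMPLATES.items())
        if rule(mol) is not None
    ]


def step(state: Counter, action: Action) -> Tuple[Counter, float, bool]:
    """Apply one disconnection; return (next_state, reward, done)."""
    name, mol = action
    precursors = TEMPLATES[name](mol)
    if state[mol] < 1 or precursors is None:
        raise ValueError(f"illegal action {action!r}")
    nxt = Counter(state)
    nxt[mol] -= 1
    if nxt[mol] == 0:
        del nxt[mol]
    for p in precursors:
        if p not in STOCK:  # only non-purchasable precursors stay open
            nxt[p] += 1
    done = len(nxt) == 0  # every leaf is in the stock catalog
    return nxt, (1.0 if done else 0.0), done
```

Starting from `initial_state("ABC")`, two steps (`split_ABC`, then `split_AB`) empty the state and yield the only nonzero reward of the episode.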

Segler, Preuss, and Waller (2018, Nature 555) introduced this MCTS formulation, which the community calls AlphaChem by analogy with AlphaGo. Three networks do the work: an expansion policy $p_\phi(\text{template} \mid \text{molecule})$ trained on millions of literature reactions, a fast rollout policy used inside playouts, and an in-scope filter that predicts whether a (template, target) pair will react. On a held-out test set the planner solved 80% of targets within 5 seconds versus 4% for a best-first heuristic baseline.
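To show how a policy prior can steer tree search, here is a minimal sketch of prior-weighted selection. It uses the AlphaGo-style PUCT rule as an assumption; the exact selection formula and constants in Segler et al. may differ. `Edge`, `puct_score`, `select`, `backup`, and `c_puct` are illustrative names.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class Edge:
    """Statistics for one (template, molecule) action at a tree node."""
    prior: float            # expansion-policy probability for this template
    visits: int = 0
    value_sum: float = 0.0

    @property
    def q(self) -> float:
        # Mean backed-up reward; 0.0 for an unvisited edge.
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(edge: Edge, parent_visits: int, c_puct: float = 1.5) -> float:
    """Exploitation term plus a prior-weighted exploration bonus (assumed PUCT form)."""
    bonus = c_puct * edge.prior * math.sqrt(parent_visits) / (1 + edge.visits)
    return edge.q + bonus


def select(edges: Dict[str, Edge], c_puct: float = 1.5) -> str:
    """Pick the action with the highest PUCT score."""
    parent_visits = sum(e.visits for e in edges.values()) + 1
    return max(edges, key=lambda a: puct_score(edges[a], parent_visits, c_puct))


def backup(edge: Edge, reward: float) -> None:
    """Record the outcome of one playout through this edge."""
    edge.visits += 1
    edge.value_sum += reward
```

With no visits, the search follows the prior. After repeated failed playouts through the favored template, the exploration bonus shrinks and a lower-prior template gets its turn.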

Template-free models replace the symbolic template library with a sequence-to-sequence transformer that maps product SMILES to reactant SMILES directly. The architecture is the Molecular Transformer, introduced for forward reaction prediction (Schwaller et al. 2019, ACS Cent. Sci. 5) and applied to retrosynthetic pathway prediction with a hyper-graph exploration strategy (Schwaller et al. 2020, Chem. Sci. 11). This removes the template-extraction bottleneck and handles long-tail chemistry that templates miss, at the cost of occasional invalid SMILES outputs and reduced interpretability of the proposed disconnection.

Reward shaping is unavoidable. A binary "reached stock" signal is too sparse for policy gradient methods, so practitioners blend in SAscore (Ertl and Schuffenhauer 2009) for synthetic accessibility, route length, and in-scope confidence. Shaping changes the optimal policy, so any bias in the shaping terms carries over into the routes the planner prefers: SAscore was calibrated against a 2009 chemist survey and has known biases against fluorine-rich and macrocyclic chemistry.
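One way such a blend could look is sketched below. The functional form, the weights `w_len`, `w_sa`, `w_conf`, and the name `shaped_reward` are all assumptions chosen for illustration, not a published recipe. The only external fact relied on is that SAscore runs from 1 (easy) to 10 (hard).

```python
import math
from typing import Sequence


def _mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs) if xs else 0.0


def shaped_reward(
    solved: bool,
    n_steps: int,
    open_leaf_sascores: Sequence[float],
    step_confidences: Sequence[float],
    w_len: float = 0.05,   # hypothetical weight on route length
    w_sa: float = 0.02,    # hypothetical weight on leaf difficulty
    w_conf: float = 0.5,   # hypothetical weight on step plausibility
) -> float:
    """Sparse terminal reward plus three shaping terms (illustrative form only)."""
    terminal = 1.0 if solved else 0.0
    length_penalty = w_len * n_steps
    # SAscore is 1 (easy) to 10 (hard); penalize how far open leaves sit above 1.
    sa_penalty = w_sa * _mean([s - 1.0 for s in open_leaf_sascores])
    # Mean log-confidence is <= 0; confidences must lie in (0, 1].
    plausibility = w_conf * _mean([math.log(c) for c in step_confidences])
    return terminal - length_penalty - sa_penalty + plausibility
```

Under these weights a solved 6-step route outscores a solved 11-step route at equal confidence, and an unsolved partial route still receives a nonzero signal that depends on how accessible its open leaves look. Changing any weight changes which routes the planner prefers, which is exactly the bias risk described above.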

Definition

Retrosynthesis Search State $s_{t}$

In retrosynthesis planning, a search state is the current multiset of unresolved molecules. An action chooses one molecule and one disconnection rule or model proposal, replacing that molecule with its candidate precursors.

Proposition

Sparse Reward Search Bottleneck

Statement

Pure terminal rewards make retrosynthesis planning sample-inefficient because most partial routes receive no learning signal before the search tree has already exploded.
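A back-of-the-envelope version of this statement, under a deliberately crude model that is not taken from the cited papers: suppose an untrained planner picks actions at random, each choice independently keeps the route solvable with probability $\rho$ (a hypothetical parameter), and a complete route needs $d$ such choices in a row. Then

$$
P(\text{reward} = 1) = \rho^{d}, \qquad \mathbb{E}[\text{rollouts until first reward}] = \rho^{-d}.
$$

With illustrative values $\rho = 0.1$ and $d = 10$, a single rollout succeeds with probability $10^{-10}$, so on the order of $10^{10}$ rollouts pass before the first nonzero learning signal arrives.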

Intuition

The planner needs hints before reaching a purchasable-leaf route. Template priors, value estimates, in-scope filters, and synthetic-accessibility penalties all supply intermediate pressure.

Failure Mode

Shaped rewards can optimize the wrong chemistry: a route can look short, cheap, and plausible to the model while failing under real conditions.

Exercise (Core)

Problem

A planner rewards only complete routes and gives zero reward to every partial route. What failure mode should you expect on a molecule whose route depth is ten steps?

Common Confusions

Watch Out

Template policies are not retrieving from a fixed library at inference

Templates are extracted once from a reaction corpus (USPTO, Reaxys), but the policy network is a learned conditional distribution over templates given a target. New target molecules trigger generalization across templates, not a database lookup. A template that never co-occurred with the exact target functional group can still be ranked highly if the policy learned the underlying disconnection logic.

Watch Out

A high-confidence route is not a working route

Reported solve rates are computed against in-stock catalogs and learned in-scope filters, not against wet-lab outcomes. An 8-step route with every step at 85% predicted confidence has roughly $0.85^8 \approx 27\%$ expected end-to-end success if step outcomes are independent, even under the model's own assumptions, and real conditions (solvent, scale, impurities) degrade the number further.
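The compounding arithmetic is easy to check, and inverting it shows how demanding per-step confidence has to be. This is a small sketch that assumes step outcomes are independent; `route_success` and `required_step_confidence` are illustrative helper names.

```python
import math
from typing import Sequence


def route_success(step_confidences: Sequence[float]) -> float:
    """End-to-end success probability, assuming independent steps."""
    return math.prod(step_confidences)


def required_step_confidence(target_success: float, n_steps: int) -> float:
    """Uniform per-step confidence needed to hit a target end-to-end success."""
    return target_success ** (1.0 / n_steps)


print(round(route_success([0.85] * 8), 4))          # 0.2725
print(round(required_step_confidence(0.5, 8), 3))   # 0.917
```

Even a coin-flip chance of completing an 8-step route requires roughly 92% confidence at every step under this independence assumption.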

References

Segler, Preuss, Waller, "Planning chemical syntheses with deep neural networks and symbolic AI," Nature 555, 2018, pp. 604-610.
Schwaller et al., "Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction," ACS Cent. Sci. 5(9), 2019, pp. 1572-1583.
Schwaller et al., "Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy," Chem. Sci. 11, 2020, pp. 3316-3325.
Coley et al., "A robotic platform for flow synthesis of organic compounds informed by AI planning," Science 365(6453), 2019, eaax1566.
Ertl, Schuffenhauer, "Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions," J. Cheminform. 1:8, 2009.
Genheden et al., "AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning," J. Cheminform. 12:70, 2020.
