
Attention for Protein Structure: AlphaFold and Successors

AlphaFold2's Evoformer fuses MSA and pair representations through axial attention; the structure module decodes coordinates with invariant point attention. ESMFold replaces MSAs with a language model.


Why This Matters

For five decades the protein folding problem was framed as physics: given an amino-acid sequence, find the tertiary structure that minimizes free energy. AlphaFold2 (Jumper et al. 2021, Nature 596) reframed it as supervised learning over the Protein Data Bank, with attention doing the heavy lifting. On CASP14 the median backbone RMSD for free-modeling targets dropped from roughly 5 angstroms to about 1 angstrom, comparable to experimental resolution.

The architectural ideas matter beyond structure prediction. The Evoformer is a template for fusing two related tensors (MSA and pair) by alternating attention along their shared axes. Invariant point attention (IPA) is a clean way to make geometric attention SE(3)-equivariant without the heavy machinery of fully equivariant networks. Recycling (running the model multiple times with its own outputs as additional inputs) is a useful trick for problems where iterative refinement helps.

The successor wave is split. AlphaFold-Multimer extended AF2 to complexes; ESMFold (Lin et al. 2023, Science 379) showed a protein language model can replace the MSA at modest accuracy cost; RoseTTAFold's three-track architecture (Baek et al. 2021, Science 373) ran in parallel to AF2 with similar ideas. AlphaFold3 (Abramson et al. 2024, Nature 630) generalized to ligands, nucleic acids, and modifications using a diffusion head.

Core Ideas

Multiple sequence alignment as input. Evolutionary covariance carries structural information: residues that contact in 3D often co-vary across homologs. AlphaFold2's input is an MSA tensor of shape $(N_{\mathrm{seq}}, N_{\mathrm{res}}, c)$ together with a pair representation of shape $(N_{\mathrm{res}}, N_{\mathrm{res}}, c)$ initialized from MSA statistics and templates. The MSA is the central feature, not an afterthought.
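The co-variation signal can be seen directly in a toy alignment. The sketch below is plain NumPy and purely illustrative (the MSA and the mutual-information readout are not AF2's featurization): two columns that mutate in tandem score high mutual information, while independent columns score near zero.

```python
import numpy as np

# Toy MSA: 6 homologs x 5 residues, integer-encoded amino acids.
# Columns 1 and 3 co-vary (hypothetical compensating mutations),
# mimicking the signal a real 3D contact leaves across homologs.
msa = np.array([
    [0, 1, 2, 3, 0],
    [0, 2, 2, 0, 0],
    [0, 1, 2, 3, 1],
    [0, 2, 2, 0, 1],
    [0, 1, 2, 3, 2],
    [0, 2, 2, 0, 2],
])

def column_mi(msa, i, j):
    """Empirical mutual information between alignment columns i and j."""
    mi = 0.0
    for a in np.unique(msa[:, i]):
        for b in np.unique(msa[:, j]):
            p_ab = np.mean((msa[:, i] == a) & (msa[:, j] == b))
            p_a = np.mean(msa[:, i] == a)
            p_b = np.mean(msa[:, j] == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

print(column_mi(msa, 1, 3))  # high: columns co-vary
print(column_mi(msa, 1, 4))  # ~0: independent columns
```

Real pipelines use more robust statistics (e.g. inverse covariance methods) on alignments with thousands of rows, but the underlying signal is the same.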

Evoformer block: axial attention over MSA and pair. Each block alternates: row-wise attention on the MSA biased by the pair representation, column-wise attention on the MSA, then triangle attention and triangle multiplicative updates on the pair. Triangle updates encode the constraint that distances are consistent across triples $(i, j, k)$: if $i$ is close to $j$ and $j$ is close to $k$, then $i$ and $k$ have constrained distance. Stacking 48 Evoformer blocks builds a structurally aware pair representation.
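A minimal sketch of the triangle multiplicative update ("outgoing edges" version), assuming random projection weights and omitting the gating, layer norm, and output projection of the real block:

```python
import numpy as np

def triangle_multiply_outgoing(z, w_a, w_b):
    """Simplified triangle multiplicative update on a pair tensor z of
    shape (N, N, c). Not AF2's full block: gating and norms are omitted."""
    a = z @ w_a  # project each edge feature, shape (N, N, c)
    b = z @ w_b
    # update[i, j] aggregates a[i, k] * b[j, k] over the third residue k:
    # the i-k and j-k edges jointly constrain the i-j edge.
    return np.einsum('ikc,jkc->ijc', a, b)

rng = np.random.default_rng(0)
N, c = 4, 8
z = rng.normal(size=(N, N, c))
w_a = rng.normal(size=(c, c))
w_b = rng.normal(size=(c, c))
upd = triangle_multiply_outgoing(z, w_a, w_b)
print(upd.shape)  # (4, 4, 8)
```

The "incoming edges" variant aggregates over `a[k, i] * b[k, j]` instead; AF2 applies both in sequence.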

Structure module and invariant point attention. The pair representation feeds the structure module, which iteratively places residue frames in 3D. IPA computes attention with queries and keys that include 3D points expressed in each residue's local frame; because all geometric quantities transform together under global rotations and translations, the attention is invariant to global rigid motion. The output is a set of backbone frames plus side-chain torsion angles, with refinement by gradient descent on a violation loss to enforce stereochemistry.
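The invariance claim is easy to check numerically: attention logits built from distances between query and key points do not change when one global rigid motion is applied to all points. A small sketch (not AF2's full IPA, which also mixes scalar and pair-bias terms into the logits):

```python
import numpy as np

def rigid_apply(R, t, x):
    """Apply a global rotation R and translation t to points x (N, 3)."""
    return x @ R.T + t

def point_attention_logits(q_pts, k_pts):
    """Logits from squared distances between query and key points.
    Rigid motions preserve distances, so these logits are invariant."""
    d2 = np.sum((q_pts[:, None, :] - k_pts[None, :, :]) ** 2, axis=-1)
    return -d2

rng = np.random.default_rng(1)
N = 5
q = rng.normal(size=(N, 3))
k = rng.normal(size=(N, 3))

# Random orthogonal matrix via QR, plus a random translation.
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
t = rng.normal(size=3)

logits = point_attention_logits(q, k)
logits_moved = point_attention_logits(rigid_apply(R, t, q),
                                      rigid_apply(R, t, k))
print(np.allclose(logits, logits_moved))  # True
```

In AF2 the points are produced in each residue's local frame and mapped to global coordinates by that residue's predicted rigid transform; the distance-based score then inherits this invariance.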

Recycling and self-distillation. Running the model three or four times, feeding each pass's pair representation and predicted structure back as additional inputs, improves accuracy. Training also uses self-distillation: AF2 predictions on unlabeled sequences become pseudo-labels for the next training round. This bootstraps from $\sim 10^5$ PDB structures to $\sim 10^8$ training examples.
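Recycling reduces to a short loop. A hedged sketch with a stand-in model (the callable name and signature here are illustrative, not AF2's actual interface):

```python
import numpy as np

def predict_with_recycling(model, features, n_recycle=3):
    """Call `model` n_recycle + 1 times, feeding each pass's outputs back
    in as extra inputs. `model` is any callable
    (features, prev_pair, prev_coords) -> (pair, coords)."""
    prev_pair, prev_coords = None, None
    for _ in range(n_recycle + 1):
        prev_pair, prev_coords = model(features, prev_pair, prev_coords)
    return prev_coords

# Stand-in "model": each pass moves the estimate halfway to a fixed target,
# mimicking iterative refinement (a real model is a full network).
target = np.array([1.0, 2.0, 3.0])

def toy_model(features, prev_pair, prev_coords):
    prev = np.zeros(3) if prev_coords is None else prev_coords
    return None, 0.5 * prev + 0.5 * target

coords = predict_with_recycling(toy_model, features=None, n_recycle=3)
print(np.linalg.norm(coords - target))  # shrinks as n_recycle grows
```

Because recycling reuses the same weights, it adds inference cost but no parameters; during training AF2 samples the number of recycling passes and backpropagates only through the last one.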

ESMFold and language-model folding. ESM-2 is a family of masked language models (up to 15B parameters) trained on UniRef sequences; its internal attention maps already encode contact propensities. ESMFold passes ESM-2 representations into a folding head similar to AF2's structure module, skipping the MSA construction step. This is roughly an order of magnitude faster at inference, with about 1 angstrom RMSD penalty on average. For metagenomic sequences with no MSA available, ESMFold is the only option; the ESM Metagenomic Atlas released $\sim 6 \times 10^8$ predicted structures.
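One standard way to read contacts out of language-model attention maps is to symmetrize them and apply average product correction (APC) to remove background row/column effects. The sketch below is a simplified illustration of that readout, not ESMFold's folding trunk:

```python
import numpy as np

def apc(attn):
    """Average product correction: subtract the expected background
    coupling (row mean x column mean / global mean) from a symmetric
    attention map before reading off contacts."""
    row = attn.mean(axis=0, keepdims=True)   # (1, N)
    col = attn.mean(axis=1, keepdims=True)   # (N, 1)
    return attn - row * col / attn.mean()

rng = np.random.default_rng(2)
N = 6
raw = rng.random((N, N))        # stand-in for one LM attention head
sym = (raw + raw.T) / 2         # symmetrize: contacts are symmetric
corrected = apc(sym)
print(corrected.shape)          # (6, 6)
```

In practice many heads across many layers are combined (often with a small learned regression) before thresholding into predicted contacts.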

RoseTTAFold's three-track design. Baek et al. (2021) ran 1D (sequence), 2D (pair), and 3D (coordinate) tracks in parallel, with information flowing among them at each block. Developed independently and broadly contemporaneous with AF2, it was somewhat less accurate at release but cheaper to train. RoseTTAFold All-Atom (2024) generalized to small molecules and covalent modifications.

Multimer and complexes. AlphaFold-Multimer (Evans et al. 2022) modified the MSA construction and loss to handle protein-protein interfaces. Performance on heterodimers is solid; large complexes and conformational change still fail more often than monomer prediction.

Common Confusions

Watch Out

AlphaFold predicts a single structure, not a conformational ensemble

Many proteins exist in multiple functional states (open vs. closed, apo vs. bound). AF2 returns one structure plus a per-residue confidence (pLDDT). It tends to return the most populated or most templated state. Predicting alternative conformations and disorder is an active area; subsampled MSAs and multiple seeds are partial workarounds.
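The subsampling workaround is simple to implement. A hypothetical helper (not part of any AF2 release) that keeps the query row and draws a random subset of homologs per seed:

```python
import numpy as np

def subsample_msa(msa, max_seqs, seed):
    """Keep the query (row 0) and a random subset of the remaining rows.
    Illustrative helper, not an AF2 API: shallow MSAs across several
    seeds are a common trick for probing alternative conformations,
    since they weaken the averaged evolutionary signal."""
    rng = np.random.default_rng(seed)
    n_keep = min(max_seqs - 1, msa.shape[0] - 1)
    keep = rng.choice(np.arange(1, msa.shape[0]), size=n_keep, replace=False)
    return msa[np.concatenate(([0], np.sort(keep)))]

msa = np.arange(40).reshape(10, 4)   # toy MSA: 10 sequences x 4 columns
sub = subsample_msa(msa, max_seqs=4, seed=0)
print(sub.shape)   # (4, 4): query + 3 sampled homologs
```

Running the same target over many seeds and clustering the outputs is the usual way to surface minority conformations.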

Watch Out

High pLDDT does not mean correct

pLDDT is a per-residue self-confidence trained against backbone accuracy. It correlates well with correctness on average and is a good proxy for ordered vs. disordered regions. It can still be confidently wrong, especially for designed proteins, novel folds, and multi-state systems. Use it as a triage signal, not a guarantee.
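A triage step might look like the following sketch. The threshold of 70 follows common practice (and pLDDT below ~50 often flags disorder), but these cutoffs are conventions, not guarantees:

```python
import numpy as np

def triage_by_plddt(plddt, confident=70.0):
    """Mark residues with pLDDT at or above the cutoff as confidently
    modeled. Treat the result as a triage signal, not ground truth:
    high-pLDDT regions can still be wrong for novel folds or
    multi-state proteins."""
    return np.asarray(plddt) >= confident

plddt = np.array([95.0, 88.0, 62.0, 40.0, 91.0])
mask = triage_by_plddt(plddt)
print(mask)          # [ True  True False False  True]
print(mask.mean())   # fraction of confidently modeled residues
```

A typical downstream policy is to trust only contiguous high-confidence segments and to re-examine (or experimentally validate) anything below the cutoff.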

Last reviewed: April 18, 2026
