
Attention for Protein Structure: AlphaFold and Successors

AlphaFold2's Evoformer fuses MSA and pair representations through axial attention; the structure module decodes coordinates with invariant point attention. ESMFold replaces MSAs with a language model.


Why This Matters

For five decades the protein folding problem was framed as physics: given an amino-acid sequence, find the tertiary structure that minimizes free energy. AlphaFold2 (Jumper et al. 2021, Nature 596) reframed it as supervised learning over the Protein Data Bank, with attention doing the heavy lifting. On CASP14 the median backbone RMSD for free-modeling targets dropped from roughly 5 angstroms to about 1 angstrom, comparable to experimental resolution.

The architectural ideas matter beyond structure prediction. The Evoformer is a template for fusing two related tensors (MSA and pair) by alternating attention along their shared axes. Invariant point attention (IPA) is a clean way to make geometric attention SE(3)-equivariant without the heavy machinery of fully equivariant networks. Recycling (running the model multiple times with its own outputs as additional inputs) is a useful trick for problems where iterative refinement helps.

The successor wave is split. AlphaFold-Multimer extended AF2 to complexes; ESMFold (Lin et al. 2023, Science 379) showed a protein language model can replace the MSA at modest accuracy cost; RoseTTAFold's three-track architecture (Baek et al. 2021, Science 373) ran in parallel to AF2 with similar ideas. AlphaFold3 (Abramson et al. 2024, Nature 630) generalized to ligands, nucleic acids, and modifications using a diffusion head.

Core Ideas

Multiple sequence alignment as input. Evolutionary covariance carries structural information: residues that contact in 3D often co-vary across homologs. AlphaFold2's input is an MSA tensor of shape $(N_{\mathrm{seq}}, N_{\mathrm{res}}, c)$ together with a pair representation of shape $(N_{\mathrm{res}}, N_{\mathrm{res}}, c)$ initialized from MSA statistics and templates. The MSA is the central feature, not an afterthought.
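The co-variation signal can be seen directly in a toy alignment. The sketch below is plain NumPy and purely illustrative (the MSA and the mutual-information readout are not AF2's featurization): two columns that mutate in tandem score high mutual information, while independent columns score near zero.

```python
import numpy as np

# Toy MSA: 6 homologs x 5 residues, integer-encoded amino acids.
# Columns 1 and 3 co-vary (hypothetical compensating mutations),
# mimicking the signal a real 3D contact leaves across homologs.
msa = np.array([
    [0, 1, 2, 3, 0],
    [0, 2, 2, 0, 0],
    [0, 1, 2, 3, 1],
    [0, 2, 2, 0, 1],
    [0, 1, 2, 3, 2],
    [0, 2, 2, 0, 2],
])

def column_mi(msa, i, j):
    """Empirical mutual information between alignment columns i and j."""
    mi = 0.0
    for a in np.unique(msa[:, i]):
        for b in np.unique(msa[:, j]):
            p_ab = np.mean((msa[:, i] == a) & (msa[:, j] == b))
            p_a = np.mean(msa[:, i] == a)
            p_b = np.mean(msa[:, j] == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

print(column_mi(msa, 1, 3))  # high: columns co-vary
print(column_mi(msa, 1, 4))  # ~0: independent columns
```

Real pipelines use more robust statistics (e.g. inverse covariance methods) on alignments with thousands of rows, but the underlying signal is the same.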

Evoformer block: axial attention over MSA and pair. Each block alternates: row-wise attention on the MSA biased by the pair representation, column-wise attention on the MSA, then triangle attention and triangle multiplicative updates on the pair. Triangle updates encode the constraint that distances are consistent across triples $(i, j, k)$: if $i$ is close to $j$ and $j$ is close to $k$, then $i$ and $k$ have constrained distance. Stacking 48 Evoformer blocks builds a structurally aware pair representation.
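A minimal sketch of the triangle multiplicative update ("outgoing edges" version), assuming random projection weights and omitting the gating, layer norm, and output projection of the real block:

```python
import numpy as np

def triangle_multiply_outgoing(z, w_a, w_b):
    """Simplified triangle multiplicative update on a pair tensor z of
    shape (N, N, c). Not AF2's full block: gating and norms are omitted."""
    a = z @ w_a  # project each edge feature, shape (N, N, c)
    b = z @ w_b
    # update[i, j] aggregates a[i, k] * b[j, k] over the third residue k:
    # the i-k and j-k edges jointly constrain the i-j edge.
    return np.einsum('ikc,jkc->ijc', a, b)

rng = np.random.default_rng(0)
N, c = 4, 8
z = rng.normal(size=(N, N, c))
w_a = rng.normal(size=(c, c))
w_b = rng.normal(size=(c, c))
upd = triangle_multiply_outgoing(z, w_a, w_b)
print(upd.shape)  # (4, 4, 8)
```

The "incoming edges" variant aggregates over `a[k, i] * b[k, j]` instead; AF2 applies both in sequence.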

Structure module and invariant point attention. The pair representation feeds the structure module, which iteratively places residue frames in 3D. IPA computes attention with queries and keys that include 3D points expressed in each residue's local frame; because all geometric quantities transform together under global rotations and translations, the attention is invariant to global rigid motion. The output is a set of backbone frames plus side-chain torsion angles, with refinement by gradient descent on a violation loss to enforce stereochemistry.
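The invariance claim is easy to check numerically: attention logits built from distances between query and key points do not change when one global rigid motion is applied to all points. A small sketch (not AF2's full IPA, which also mixes scalar and pair-bias terms into the logits):

```python
import numpy as np

def rigid_apply(R, t, x):
    """Apply a global rotation R and translation t to points x (N, 3)."""
    return x @ R.T + t

def point_attention_logits(q_pts, k_pts):
    """Logits from squared distances between query and key points.
    Rigid motions preserve distances, so these logits are invariant."""
    d2 = np.sum((q_pts[:, None, :] - k_pts[None, :, :]) ** 2, axis=-1)
    return -d2

rng = np.random.default_rng(1)
N = 5
q = rng.normal(size=(N, 3))
k = rng.normal(size=(N, 3))

# Random orthogonal matrix via QR, plus a random translation.
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
t = rng.normal(size=3)

logits = point_attention_logits(q, k)
logits_moved = point_attention_logits(rigid_apply(R, t, q),
                                      rigid_apply(R, t, k))
print(np.allclose(logits, logits_moved))  # True
```

In AF2 the points are produced in each residue's local frame and mapped to global coordinates by that residue's predicted rigid transform; the distance-based score then inherits this invariance.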

Recycling and self-distillation. Running the model three or four times, feeding each pass's pair representation and predicted structure back as additional inputs, improves accuracy. Training also uses self-distillation: AF2 predictions on unlabeled sequences become pseudo-labels for the next training round. This bootstraps from $\sim 10^5$ PDB structures to $\sim 10^8$ training examples.
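Recycling reduces to a short loop. A hedged sketch with a stand-in model (the callable name and signature here are illustrative, not AF2's actual interface):

```python
import numpy as np

def predict_with_recycling(model, features, n_recycle=3):
    """Call `model` n_recycle + 1 times, feeding each pass's outputs back
    in as extra inputs. `model` is any callable
    (features, prev_pair, prev_coords) -> (pair, coords)."""
    prev_pair, prev_coords = None, None
    for _ in range(n_recycle + 1):
        prev_pair, prev_coords = model(features, prev_pair, prev_coords)
    return prev_coords

# Stand-in "model": each pass moves the estimate halfway to a fixed target,
# mimicking iterative refinement (a real model is a full network).
target = np.array([1.0, 2.0, 3.0])

def toy_model(features, prev_pair, prev_coords):
    prev = np.zeros(3) if prev_coords is None else prev_coords
    return None, 0.5 * prev + 0.5 * target

coords = predict_with_recycling(toy_model, features=None, n_recycle=3)
print(np.linalg.norm(coords - target))  # shrinks as n_recycle grows
```

Because recycling reuses the same weights, it adds inference cost but no parameters; during training AF2 samples the number of recycling passes and backpropagates only through the last one.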

ESMFold and language-model folding. ESM-2 is a family of masked language models (up to 15B parameters) trained on UniRef sequences; its internal attention maps already encode contact propensities. ESMFold passes ESM-2 representations into a folding head similar to AF2's structure module, skipping the MSA construction step. This is roughly an order of magnitude faster at inference, with about 1 angstrom RMSD penalty on average. For metagenomic sequences with no MSA available, ESMFold is the only option; the ESM Metagenomic Atlas released $\sim 6 \times 10^8$ predicted structures.
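One standard way to read contacts out of language-model attention maps is to symmetrize them and apply average product correction (APC) to remove background row/column effects. The sketch below is a simplified illustration of that readout, not ESMFold's folding trunk:

```python
import numpy as np

def apc(attn):
    """Average product correction: subtract the expected background
    coupling (row mean x column mean / global mean) from a symmetric
    attention map before reading off contacts."""
    row = attn.mean(axis=0, keepdims=True)   # (1, N)
    col = attn.mean(axis=1, keepdims=True)   # (N, 1)
    return attn - row * col / attn.mean()

rng = np.random.default_rng(2)
N = 6
raw = rng.random((N, N))        # stand-in for one LM attention head
sym = (raw + raw.T) / 2         # symmetrize: contacts are symmetric
corrected = apc(sym)
print(corrected.shape)          # (6, 6)
```

In practice many heads across many layers are combined (often with a small learned regression) before thresholding into predicted contacts.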

RoseTTAFold's three-track design. Baek et al. (2021) ran 1D (sequence), 2D (pair), and 3D (coordinate) tracks in parallel, with information flowing among them at each block. Developed independently and broadly contemporaneous with AF2, it was somewhat less accurate at release but cheaper to train. RoseTTAFold All-Atom (2024) generalized to small molecules and covalent modifications.

Multimer and complexes. AlphaFold-Multimer (Evans et al. 2022) modified the MSA construction and loss to handle protein-protein interfaces. Performance on heterodimers is solid; large complexes and conformational change still fail more often than monomer prediction.

Common Confusions

Watch Out

AlphaFold predicts a single structure, not a conformational ensemble

Many proteins exist in multiple functional states (open vs. closed, apo vs. bound). AF2 returns one structure plus a per-residue confidence (pLDDT). It tends to return the most populated or most templated state. Predicting alternative conformations and disorder is an active area; subsampled MSAs and multiple seeds are partial workarounds.
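The subsampling workaround is simple to implement. A hypothetical helper (not part of any AF2 release) that keeps the query row and draws a random subset of homologs per seed:

```python
import numpy as np

def subsample_msa(msa, max_seqs, seed):
    """Keep the query (row 0) and a random subset of the remaining rows.
    Illustrative helper, not an AF2 API: shallow MSAs across several
    seeds are a common trick for probing alternative conformations,
    since they weaken the averaged evolutionary signal."""
    rng = np.random.default_rng(seed)
    n_keep = min(max_seqs - 1, msa.shape[0] - 1)
    keep = rng.choice(np.arange(1, msa.shape[0]), size=n_keep, replace=False)
    return msa[np.concatenate(([0], np.sort(keep)))]

msa = np.arange(40).reshape(10, 4)   # toy MSA: 10 sequences x 4 columns
sub = subsample_msa(msa, max_seqs=4, seed=0)
print(sub.shape)   # (4, 4): query + 3 sampled homologs
```

Running the same target over many seeds and clustering the outputs is the usual way to surface minority conformations.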

Watch Out

High pLDDT does not mean correct

pLDDT is a per-residue self-confidence trained against backbone accuracy. It correlates well with correctness on average and is a good proxy for ordered vs. disordered regions. It can still be confidently wrong, especially for designed proteins, novel folds, and multi-state systems. Use it as a triage signal, not a guarantee.
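A triage step might look like the following sketch. The threshold of 70 follows common practice (and pLDDT below ~50 often flags disorder), but these cutoffs are conventions, not guarantees:

```python
import numpy as np

def triage_by_plddt(plddt, confident=70.0):
    """Mark residues with pLDDT at or above the cutoff as confidently
    modeled. Treat the result as a triage signal, not ground truth:
    high-pLDDT regions can still be wrong for novel folds or
    multi-state proteins."""
    return np.asarray(plddt) >= confident

plddt = np.array([95.0, 88.0, 62.0, 40.0, 91.0])
mask = triage_by_plddt(plddt)
print(mask)          # [ True  True False False  True]
print(mask.mean())   # fraction of confidently modeled residues
```

A typical downstream policy is to trust only contiguous high-confidence segments and to re-examine (or experimentally validate) anything below the cutoff.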

Last reviewed: April 18, 2026
