Paper Breakdowns
Landmark machine-learning and statistics papers, broken down into their core mathematical contributions. Each breakdown links the equations to the topic pages on TheoremPath that develop the machinery in detail.
1992 · COLT 1992
A Training Algorithm for Optimal Margin Classifiers
Bernhard E. Boser, Isabelle M. Guyon, Vladimir N. Vapnik
The first paper to combine the maximum-margin hyperplane with the kernel trick. Casts margin maximisation as a convex quadratic program whose dual depends on the training data only through pairwise inner products, so explicit feature maps can be replaced by kernel evaluations.
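For orientation, the dual problem at the heart of the construction, written in conventional SVM notation (the α are dual variables, the y_i labels in {−1, +1}, k the kernel) rather than the paper's own:

```latex
% Hard-margin dual: the data enter only through k(x_i, x_j),
% so explicit feature maps are never computed.
\max_{\alpha_i \ge 0} \;\; \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j \, y_i y_j \, k(x_i, x_j)
\qquad \text{subject to} \qquad \sum_{i=1}^{n} \alpha_i y_i = 0
```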
2013 · ICLR 2014
Auto-Encoding Variational Bayes
Diederik P. Kingma, Max Welling
Introduces the variational autoencoder. Combines amortised inference with the reparameterisation trick to give a tractable, low-variance gradient estimator of the evidence lower bound for deep latent-variable models.
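The objective in question, in the usual notation (q_\phi the amortised encoder, p_\theta the decoder); a sketch of the standard form, not a quote from the paper:

```latex
% Evidence lower bound; the reparameterisation
% z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \epsilon \sim \mathcal{N}(0, I),
% makes the expectation term differentiable in \phi.
\log p_\theta(x) \;\ge\; \mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\bigl[\log p_\theta(x \mid z)\bigr]
  - \mathrm{KL}\!\bigl(q_\phi(z \mid x) \,\|\, p(z)\bigr)
```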
2014 · NeurIPS 2014
Generative Adversarial Nets
Ian J. Goodfellow et al.
Reframes generative modelling as a two-player minimax game between a generator and a discriminator. Shows that, with the discriminator trained to optimality, the generator's objective reduces to minimising the Jensen-Shannon divergence to the data distribution.
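The value function of the game, in standard notation (G the generator, D the discriminator, p_z the noise prior):

```latex
% At the optimal discriminator, minimising over G is equivalent to
% minimising 2 \cdot \mathrm{JSD}(p_{\mathrm{data}} \,\|\, p_g) - \log 4.
\min_G \max_D \; V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z}\!\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```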
2015 · CVPR 2016
Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Introduces residual connections — adding the input of a block to its output — to enable training of networks more than an order of magnitude deeper than was previously stable. Wins ImageNet 2015 with a 152-layer model.
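The building block itself is one line; writing \mathcal{F} for the stacked layers inside a block, in the conventional notation:

```latex
% Residual block: the layers learn the residual F(x) rather than the full
% mapping, and the identity shortcut gives gradients a direct path back.
y = \mathcal{F}(x, \{W_i\}) + x
```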
2017 · NeurIPS 2017
Attention Is All You Need
Ashish Vaswani et al.
Replaces recurrence and convolution in sequence transduction with stacked self-attention. Establishes the transformer block — multi-head scaled dot-product attention plus position-wise feed-forward layers — that every modern large language model still uses.
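The core operation, as usually stated (Q, K, V are the query, key, and value matrices, d_k the key dimension):

```latex
% Scaled dot-product attention; multi-head attention runs h copies of this
% in learned subspaces and concatenates the results.
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```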
2018 · NAACL 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Combines a transformer encoder with a masked-language-model objective to learn deep bidirectional context, then fine-tunes on downstream NLP tasks. Establishes pre-train-then-fine-tune as the dominant paradigm for two years and the technical scaffolding for everything after.
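One conventional way to write the pre-training objective (M the set of masked positions, \tilde{x} the corrupted input); a sketch of the standard form, not the paper's own formula:

```latex
% Masked-language-model loss: predict each masked token from the full
% bidirectional context of the corrupted sequence.
\mathcal{L}_{\mathrm{MLM}}(\theta)
  = - \sum_{i \in M} \log p_\theta\!\bigl(x_i \mid \tilde{x}\bigr)
```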
2018 · NeurIPS 2018
Neural Tangent Kernel: Convergence and Generalization in Neural Networks
Arthur Jacot, Franck Gabriel, Clément Hongler
Proves that, in the infinite-width limit and at the right initialisation scale, gradient-flow training of a neural network is equivalent to kernel regression with a fixed deterministic kernel. Gives one of the first global-convergence guarantees for a non-convex deep-network training procedure.
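The kernel in question, for a scalar-output network f(x; \theta), in standard notation:

```latex
% Neural tangent kernel; in the infinite-width limit (at the NTK
% parameterisation) \Theta stays fixed during training, so gradient flow
% on the squared loss reproduces kernel regression with \Theta.
\Theta(x, x')
  = \bigl\langle \nabla_\theta f(x; \theta), \, \nabla_\theta f(x'; \theta) \bigr\rangle
```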
2019 · NeurIPS 2019
Uniform Convergence May Be Unable to Explain Generalization in Deep Learning
Vaishnavh Nagarajan, J. Zico Kolter
Constructs an overparameterised setting in which every uniform-convergence-based generalisation bound, including margin- and norm-based bounds, must be nearly vacuous while SGD generalises well. Establishes a theoretical limit on what classical statistical-learning theory can explain about deep networks.
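For reference, the generic shape of the bounds the paper rules out, in standard notation (ℓ the loss, m the sample size, \mathcal{H} the hypothesis class or any algorithm-dependent subset of it):

```latex
% Uniform-convergence bound: the paper exhibits settings where this
% supremum stays large even over the hypotheses SGD actually outputs,
% while the learned network's test error is small.
\sup_{h \in \mathcal{H}} \left|
    \mathbb{E}_{(x, y) \sim \mathcal{D}}\bigl[\ell(h(x), y)\bigr]
    - \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i), y_i)
  \right| \;\le\; \epsilon(m, \delta)
```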