ML Theory Roadmap
The whole curriculum on one page, from measure-theoretic foundations through modern deep learning and the research frontier. Tier-1 landmarks are the 198 core pages worth reading first.
Layer 0A · Axioms
60 topics. Sets, functions, logic, linear algebra, real analysis, measure-theoretic basics.
Foundations
- ●Beta Distribution
- ●Common Inequalities
- ●Common Probability Distributions
- ●Compactness and Heine-Borel
- ●Continuity in Rⁿ
- ●Differentiation in Rⁿ
- ●Distributions Atlas
- ●Eigenvalues and Eigenvectors
- ●Expectation, Variance, Covariance, and Moments
- ●Exponential Distribution
- ●Exponential Function Properties
- ●Gamma Distribution
- ●Inner Product Spaces and Orthogonality
- ●Joint, Marginal, and Conditional Distributions
- ●Kolmogorov Probability Axioms
- ●Linear Independence
- ●Matrix Norms
- ●Matrix Operations and Properties
- ●Metric Spaces, Convergence, and Completeness
- ●Normal Distribution
- ●Poisson Distribution
- ●Positive Semidefinite Matrices
- ●Random Variables
- ●Sets, Functions, and Relations
- ●Singular Value Decomposition
- ●Taylor Expansion
- ●Tensors and Tensor Operations
- ●Vectors, Matrices, and Linear Maps
- Basic Logic and Proof Techniques
- Birthday Paradox
- Cantor's Theorem and Uncountability
- Cardinality and Countability
- Category Theory
- Counting and Combinatorics
- Discrete and Continuous Distribution Pairs
- Hypergeometric Distribution
- Integration and Change of Variables
- Inverse and Implicit Function Theorem
- Lognormal Distribution
- Moment Generating Functions
- Monty Hall Problem
- Peano Axioms
- Scale, Location, and Shape Parameters
- Sequences and Series of Functions
- Triangular Distribution
- Type Theory
- Zermelo-Fraenkel Set Theory
- Foundational Dependencies
- Vieta Jumping
Mathematical Infrastructure
Numerical Optimization
Layer 0B · Infrastructure
26 topics. Measure theory, functional analysis, convex duality, numerical foundations.
Foundations
Mathematical Infrastructure
Statistical Estimation
- ●Asymptotic Statistics: M-Estimators, Delta Method, LAN
- ●Central Limit Theorem
- ●Conjugate Priors
- ●Cramér-Rao Bound: Information Inequality, Achievability, and Sharper Variants
- ●Fisher Information: Curvature, KL Geometry, and the Natural Gradient
- ●Law of Large Numbers
- ●Maximum A Posteriori (MAP) Estimation
- ●Maximum Likelihood Estimation: Theory, Information Identity, and Asymptotic Efficiency
- ●Shrinkage Estimation and the James-Stein Estimator: Inadmissibility, SURE, and Brown's Characterization
- ●The Multivariate Normal Distribution
- Bayesian Estimation
- Method of Moments
- Stein's Paradox
- Sufficient Statistics and Exponential Families
- Basu's Theorem
Infrastructure
Layer 1 · Core Tools
75 topics. Concentration, estimation, information theory, optimization primitives, CLT.
Foundations
- ●Gram Matrices and Kernel Matrices
- ●KL Divergence
- ●Numerical Stability and Conditioning
- ●Skewness, Kurtosis, and Higher Moments
- ●Total Variation Distance
- Benford's Law
- Cramér-Wold Theorem
- Markov Chains and Steady State
- Multivariate Distributions Atlas
- Pareto Distribution
- Relational Algebra
- Signals and Systems for ML
- Weibull Distribution
Mathematical Infrastructure
Concentration Probability
Statistical Foundations
Statistical Estimation
Numerical Optimization
Optimization Function Classes
Algorithms Foundations
Learning Theory Core
ML Methods
- ●Activation Functions
- ●Cross-Entropy Loss: MLE, KL Divergence, and Classification
- ●Data Preprocessing and Feature Engineering
- ●K-Means Clustering
- ●Linear Regression
- ●Logistic Regression
- ●Loss Functions Catalog
- ●Principal Component Analysis
- ●Ridge Regression
- K-Nearest Neighbors
- Multi-Class and Multi-Label Classification
- Naive Bayes
- Perceptron
Sampling MCMC
Methodology
- ●Confusion Matrices and Classification Metrics
- ●Confusion Matrix: MCC, Kappa, and Cost-Sensitive Evaluation
- ●Model Evaluation Best Practices
- ●Train-Test Split and Data Leakage
- ●Types of Bias in Statistics
- Base Rate Fallacy
- Class Imbalance and Resampling
- Exploratory Data Analysis
- Hardware for ML Practitioners
- ML Project Lifecycle
- Simpson's Paradox
Layer 2 · Learning Theory
163 topics. ERM, VC, Rademacher, PAC, stability, kernels, uniform convergence.
Mathematical Infrastructure
Concentration Probability
Statistical Foundations
Statistical Estimation
Decision Theory
Numerical Optimization
Optimization Function Classes
Algorithms Foundations
Learning Theory Core
ML Methods
- ●AIC and BIC
- ●Bagging
- ●Feedforward Networks and Backpropagation
- ●Gauss-Markov Theorem
- ●Gradient Boosting
- ●Lasso Regression
- ●Overfitting and Underfitting
- ●Random Forests
- ●Skip Connections and ResNets
- ●Support Vector Machines
- ●The Kernel Trick
- ●Universal Approximation Theorem
- AdaBoost
- Anomaly Detection
- Autoencoders
- Decision Trees and Ensembles
- Dimensionality Reduction Theory
- Distributional Semantics
- Elastic Net
- Ensemble Methods Theory
- Gaussian Mixture Models and EM
- Generalized Additive Models
- Hyperbolic Embeddings for Graphs
- Natural Language Processing Foundations
- PageRank Algorithm
- Recommender Systems
- Spectral Clustering
- t-SNE and UMAP
- Time Series Forecasting Basics
- Word Embeddings
- XGBoost
- Boltzmann Machines and Hopfield Networks
- Cubist and Model Trees
- Logspline Density Estimation
- MARS (Multivariate Adaptive Regression Splines)
- NMF (Nonnegative Matrix Factorization)
- Self-Organizing Maps
- Wavelet Smoothing
Sampling MCMC
Training Techniques
Methodology
- Convex Tinkering
- Evaluation Metrics and Properties
- Feature Importance and Interpretability
- Hypothesis Testing for ML
- Meta-Analysis
- P-Hacking and Multiple Testing
- Proper Scoring Rules
- Reproducibility and Experimental Rigor
- ROC Curve and AUC
- Statistical Significance and Multiple Comparisons
- Experiment Tracking and Tooling
- Statistical Paradoxes Collection
LLM Construction
RL Theory
Applied Math
Applied Statistics
Learning Theory
Predictive Uncertainty
Sequential Inference
Layer 3 · ML Methods
152 topics. Regression, SVMs, neural nets, optimization, regularization, NTK.
Mathematical Infrastructure
Concentration Probability
Statistical Foundations
Decision Theory
Numerical Optimization
Optimization Function Classes
Algorithms Foundations
Learning Theory Core
Modern Generalization
ML Methods
- ●Score Matching
- ●Variational Autoencoders
- AlexNet and Deep Learning History
- Contrastive Learning
- Convolutional Neural Networks
- Deep Learning for Time Series
- DeepONet
- EM Algorithm Variants
- Fourier Neural Operator
- Gaussian Process Regression
- Generative Adversarial Networks
- Graph Neural Networks
- Meta-Learning
- Object Detection and Segmentation
- Optimal Brain Surgeon and Pruning Theory
- Probability Flow ODE
- Recurrent Neural Networks
- Semantic Search and Embeddings
- Speech and Audio ML
- Transfer Learning
- Bayesian Neural Networks
- Energy-Based Models
- Mixture Density Networks
- Normalizing Flows
- Quantum Machine Learning Overview
- Reservoir Computing and Echo State Networks
- Wave-Based Neural Networks
Sampling MCMC
Training Techniques
Methodology
- ●Causal Inference and the Ladder of Causation
- ●The Bitter Lesson
- Ablation Study Design
- Commons Governance and Institutional Analysis
- Federated Learning
- Leverage Points in Complex Systems
- Synthetic Data Generation
- Anthropic Bias and Observation Selection
- Benchmarking Methodology
- Causal Inference Basics
- Official Statistics and National Surveys
LLM Construction
RL Theory
- ●Policy Gradient Theorem
- ●Reward Design and Reward Misspecification
- Actor-Critic Methods
- DDPG: Deep Deterministic Policy Gradient
- Markov Games and Self-Play
- Model-Based Reinforcement Learning
- No-Regret Learning
- Offline Reinforcement Learning
- Online Learning and Bandits
- Policy Optimization: PPO and TRPO
- Policy Representations
- Self-Play and Multi-Agent RL
- TD3: Twin Delayed Deep Deterministic Policy Gradient
- Options and Temporal Abstraction
- Reinforcement Learning Environments and Benchmarks
AI Safety
Bayesian ML Frontier
Causal Semiparametric
Learning Theory
ML Applications
Optimization
Predictive Uncertainty
Layer 4 · Deep Learning
113 topics. Transformers, attention, training dynamics, double descent, scaling.
Statistical Foundations
Modern Generalization
- ●Implicit Bias and Modern Generalization
- ●Neural Tangent Kernel: Lazy Training, Kernel Equivalence, μP, and the Limits of Width
- Benign Overfitting
- Double Descent
- Grokking
- Lazy vs Feature Learning
- Mean Field Theory
- Neural Network Optimization Landscape
- Gaussian Processes for Machine Learning
- Sparse Recovery and Compressed Sensing
- Wasserstein Distances
LLM Construction
- ●Attention Is All You Need (Paper)
- ●Hallucination Theory
- ●Scaling Laws
- ●Sparse Autoencoders for Interpretability: TopK, JumpReLU, Matryoshka, and Scaling
- Attention Mechanism Theory
- Attention Sinks and Retrieval Decay
- Attention Variants and Efficiency
- BERT and the Pretrain-Finetune Paradigm
- Efficient Transformers Survey
- Forgetting Transformer (FoX)
- Induction Heads
- Iterative Magnitude Pruning and the Lottery Ticket Hypothesis
- Mixture of Experts
- Residual Stream and Transformer Internals
- RLHF and Alignment
- Sparse Attention and Long Context
- Training Dynamics and Loss Landscapes
- Transformer Architecture
- Attention as Kernel Regression
- Byte-Level Language Models
- Neural Architecture Search
- Positional Encoding
- Tokenization and Information Theory
Beyond LLMs
- ●CLIP, OpenCLIP, and SigLIP: Contrastive Language-Image Pretraining
- ●Diffusion Models
- ●Vision Transformer Lineage: ViT, DeiT, Swin, MAE, DINOv2, SAM
- Equilibrium and Implicit-Layer Models
- Equivariant Deep Learning
- Flow Matching
- JEPA and Joint Embedding
- Kolmogorov-Arnold Networks (KANs)
- Mamba and State-Space Models
- Physics-Informed Neural Networks
- Self-Supervised Vision
- World Models and Planning
- 3D Gaussian Splatting
- Occupancy Networks and Neural Fields
AI Safety
Model Timeline
Applied ML
- Causal Inference for Policy Evaluation
- Agent-Based Modeling with ML
- Anomaly Detection for Gravitational Waves
- Attention for Protein Structure: AlphaFold and Successors
- Autoencoders for Low-Dimensional Dynamical Structures
- Autoencoders for Single-Cell RNA-seq
- Clustering for Gene Expression
- CNNs for Medical Imaging
- CNNs for Signal Feature Extraction
- Deep Generative Models for Cosmic Structures
- Deep Generative Models for Molecules
- Deep RL for Control
- Gaussian Processes in Astronomy
- Graph Neural Networks for Molecules
- Hebbian Learning
- Kernel Methods for Molecules
- Lyapunov-Based Machine Learning for Chaos
- Macroeconomic Time-Series Forecasting
- NLP for Economic Text Analysis
- Nonlinear Dynamics and Chaos Fundamentals
- Predictive Coding and Autoencoders in the Brain
- Reinforcement Learning for Auction Design
- Reinforcement Learning for Drug Discovery
- Reinforcement Learning for Synthesis Planning
- Representation Learning in Cosmology
- Reward Systems and Reinforcement Learning Neuroscience
- RNNs for Signal Sequences
- Spiking Neural Networks
- SVM for RF Classification
- Symbolic Regression and Equation Discovery
Formal Verification
Infrastructure
- Broadcast Joins in Distributed Compute
- CUDA Programming Fundamentals
- Dask Parallel Python
- Docker and Containers for ML
- Git and GitLab for ML Research
- Hadoop and Distributed Storage
- Kafka Streaming Platform
- Kubernetes for ML Workloads
- Modal: Serverless GPU Platform
- Pandas and NumPy Fundamentals
- Python for ML Research
- Ray Distributed Python
- Running ML Workloads on GPUs
- Snowflake Data Warehouse
- Weights and Biases for Experiment Tracking
Layer 5 · Frontier
66 topics. RLHF, alignment, interpretability, reasoning, agents, scaling laws.
Modern Generalization
Methodology
LLM Construction
- ●Chain-of-Thought and Reasoning
- ●Reinforcement Learning from Human Feedback
- Context Engineering
- Document Intelligence
- DPO vs GRPO vs RL for Reasoning
- Edge and On-Device ML
- Flash Attention
- Fused Kernels
- GPU Compute Model
- Inference Systems Overview
- Inference-Time Scaling Laws
- KV Cache
- KV Cache Optimization
- Latent Reasoning
- Memory Systems for LLMs
- Multi-Token Prediction
- Multimodal RAG
- PaddleOCR and Practical OCR
- Parallel Processing Fundamentals
- Post-Training Overview
- Prefix Caching
- Prompt Engineering and In-Context Learning
- Reasoning Data Curation
- Scaling Compute-Optimal Training
- Speculative Decoding and Quantization
- Structured Output and Constrained Generation
- Test-Time Compute and Search
- Tool-Augmented Reasoning
- Agent Protocols: MCP and A2A
- AMD Competition Landscape
- ASML and Chip Manufacturing
- Distributed Training Theory
- Donut and OCR-Free Document Understanding
- Megakernels
- Model Merging and Weight Averaging
- NVIDIA GPU Architectures
- Plan-then-Generate
- Quantization Theory
- Table Extraction and Structure Recognition
RL Theory
Beyond LLMs
AI Safety
Model Timeline
AI History
How to use this map
- ● Amber dots are tier-1 landmarks. Read these first.
- Each page links down to its prerequisites and up to what builds on it. No concept floats without grounding.
- Use the gap finder to pick a destination and get a BFS-ordered reading list.
- The interactive graph shows the same map with click-to-explore and path tracing.
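As a rough illustration of what a BFS-ordered reading list could look like, here is a minimal Python sketch. It is not the gap finder's actual implementation. The function name `reading_list`, the `known` parameter, and the prerequisite edges in `PREREQS` are all assumptions made for this example; only the page titles come from the map.

```python
from collections import deque

# Hypothetical prerequisite edges, for illustration only. The page titles
# appear on the map, but these links are assumptions, not the site's graph.
PREREQS = {
    "Ridge Regression": ["Linear Regression", "Matrix Norms"],
    "Linear Regression": ["Matrix Operations and Properties"],
    "Matrix Norms": ["Matrix Operations and Properties"],
    "Matrix Operations and Properties": [],
}


def reading_list(prereqs, target, known=frozenset()):
    """Return the pages to read to reach `target`, most distant first.

    prereqs: dict mapping a page to the pages it directly depends on.
    known:   pages the reader has already covered (assumed parameter).
    """
    order = [target]
    seen = {target}
    queue = deque([target])
    # Breadth-first search outward from the target along prerequisite links.
    while queue:
        page = queue.popleft()
        for dep in prereqs.get(page, []):
            # Skip pages already queued or already known to the reader.
            if dep not in seen and dep not in known:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    # Reverse so pages discovered last (farthest from the target) come first.
    return order[::-1]


if __name__ == "__main__":
    for title in reading_list(PREREQS, "Ridge Regression"):
        print(title)
```

Reversed BFS order sorts pages by how late they were discovered, which roughly tracks hop distance from the destination. It is not guaranteed to respect every prerequisite edge when a page is reachable by paths of different lengths, so a stricter tool might add a topological sort on top.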
Planned additions
Topics in progress, primarily AI safety and alignment.
- Scalable oversight. Bowman et al. 2022, debate and market-based precedents, sandwiching experiments. Scope conditions matter: what the setup can and cannot tell us.
- Deceptive alignment. Hubinger et al. 2019/2021 mesa-optimizer framing. Separate the empirical evidence from the philosophical argument.
- Alignment faking. Greenblatt et al. 2024 (Anthropic). Include the limitations section explicitly.
- DPO. Currently folded into dpo-vs-grpo. Deserves its own page: Rafailov et al. 2023, the implicit-reward view, and the overoptimization story. Follow-up on IPO, KTO, SimPO and the broader DPO family.
- Verifiable-reward RL (RLVR). Reasoning training with programmatically checkable rewards: math graders, code executors, proof verifiers. Scope what verifiers can and cannot certify, and the reward-hacking surface when the verifier is imperfect. Needs careful separation from general RLHF.
- Inference-time scaling beyond CoT. Budgeted search, verifier-guided decoding, reward-model reranking, parallel sampling with aggregation. The current inference-time-scaling-laws page covers the scaling story; this topic deserves a systems-level companion on how the compute is actually spent.
- Agent systems as systems. Long-horizon tool use, failure recovery, memory design, evaluation under distribution shift, benchmark contamination. Current agent pages cover the components; a systems-view page on how they compose and fail in production is missing.
- Weak-to-strong generalization. Burns et al. 2023 (OpenAI). What the setup can and cannot tell us about alignment at scale.
- Instrumental convergence. Omohundro, Bostrom framings. Flag explicitly where the philosophical argument outruns the empirical support.
- Jailbreaks. Attack taxonomy, measurement difficulties, why robust alignment is not a solved problem. Needs honest threat-model scoping, not incident anecdotes.
- Superposition. Elhage et al. 2022 toy-models paper, the interference vs capacity trade-off, and the connection to sparse autoencoders.