Curriculum
The Full Theory Library
Every concept organized by depth layer and module. Layer 0 is foundations. Layer 5 is applied systems. Every topic links down to its prerequisites until you hit axioms.
Foundations (Layer 0A)
Axioms, definitions, and notation. The bedrock everything else depends on.
Common Inequalities
The algebraic and probabilistic inequalities that appear everywhere in ML theory: Cauchy-Schwarz, Jensen, AM-GM, Hölder, Minkowski, Young, Markov, and Chebyshev.
Common Probability Distributions
The essential probability distributions for ML: Bernoulli through Dirichlet, with PMFs/PDFs, moments, relationships, and when each arises in practice.
Compactness and Heine-Borel
Sequential compactness, the Heine-Borel theorem in finite dimensions, the extreme value theorem, and why compactness is the key assumption in optimization.
Computability Theory
What can be computed? Turing machines, decidability, the Church-Turing thesis, recursive and recursively enumerable sets, reductions, Rice's theorem, and connections to learning theory.
Continuity in R^n
Epsilon-delta continuity, uniform continuity, and Lipschitz continuity in Euclidean space. Lipschitz constants control how fast function values change and appear throughout optimization and generalization theory.
Differentiation in R^n
Partial derivatives, the gradient, directional derivatives, the total derivative (Fréchet), and the multivariable chain rule. Why the gradient points in the steepest ascent direction, and why this matters for all of optimization.
Eigenvalues and Eigenvectors
Eigenvalues and eigenvectors: the directions a matrix scales without rotating. Characteristic polynomial, diagonalization, the spectral theorem for symmetric matrices, and the direct connection to PCA.
Expectation, Variance, Covariance, and Moments
Expectation, variance, covariance, correlation, linearity of expectation, variance of sums, and moment-based reasoning in ML.
Exponential Function Properties
The exponential function e^x: series definition, algebraic properties, and why it appears everywhere in ML. Softmax, MGFs, the Chernoff method, Boltzmann distributions, and exponential families all reduce to properties of exp.
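The shift identity exp(x_i - c) / sum_j exp(x_j - c) = exp(x_i) / sum_j exp(x_j) is also what makes softmax computable at large logits. A minimal Python sketch (function name and toy logits are illustrative):

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtracting the max uses the identity
    exp(x_i - c) / sum_j exp(x_j - c) == exp(x_i) / sum_j exp(x_j)."""
    c = max(logits)
    exps = [math.exp(x - c) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Without the shift, exp(1000) overflows a float; with it, the result is fine.
probs = softmax([1000.0, 1000.0, 999.0])
```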
Gram Matrices and Kernel Matrices
The Gram matrix G_{ij} = <x_i, x_j> encodes pairwise inner products of a dataset. Always PSD. Appears in kernel methods, PCA, SVD, and attention. Understanding it connects linear algebra to ML.
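A quick numerical check of the PSD claim, using NumPy (the 5x3 toy dataset is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 data points in R^3

G = X @ X.T                   # Gram matrix: G[i, j] = <x_i, x_j>

# G is symmetric and PSD: all eigenvalues >= 0 up to floating-point noise
# (rank is at most 3 here, so two eigenvalues sit at ~0).
eigvals = np.linalg.eigvalsh(G)
```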
Inner Product Spaces and Orthogonality
Inner product axioms, Cauchy-Schwarz inequality, orthogonality, Gram-Schmidt, projections, and the bridge to Hilbert spaces.
Joint, Marginal, and Conditional Distributions
Joint distributions, marginalization, conditional distributions, independence, Bayes' theorem, and the chain rule of probability.
KL Divergence
Kullback-Leibler divergence measures how one probability distribution differs from another. Asymmetric, always non-negative, and central to variational inference, MLE, and RLHF.
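The asymmetry and non-negativity are easy to see numerically for discrete distributions; a small illustrative sketch:

```python
import math

def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions
    (terms with p_i = 0 contribute nothing by convention)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl(p, q)   # KL(p || q)
reverse = kl(q, p)   # KL(q || p): generally a different number
```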
Matrix Norms
Frobenius, spectral, and nuclear norms for matrices. Submultiplicativity. When and why each norm appears in ML theory.
Matrix Operations and Properties
Essential matrix operations for ML: trace, determinant, inverse, pseudoinverse, Schur complement, and the Sherman-Morrison-Woodbury formula. When and why each matters.
Metric Spaces, Convergence, and Completeness
Metric space axioms, convergence of sequences, Cauchy sequences, completeness, and the Banach fixed-point theorem.
Numerical Stability and Conditioning
Continuous math becomes real only through finite-precision approximation. Condition numbers, backward stability, catastrophic cancellation, and why theorems about reals do not transfer cleanly to floating-point.
Positive Semidefinite Matrices
PSD matrices: equivalent characterizations, Cholesky decomposition, Schur complement, and Loewner ordering. Covariance matrices are PSD. Hessians of convex functions are PSD. These facts connect linear algebra to optimization and statistics.
Sets, Functions, and Relations
The language underneath all of mathematics: sets, Cartesian products, functions, injectivity, surjectivity, equivalence relations, and quotient sets.
Singular Value Decomposition
The SVD A = U Sigma V^T: the most important matrix factorization in applied mathematics. Geometric interpretation, relationship to eigendecomposition, low-rank approximation via Eckart-Young, and applications from PCA to pseudoinverses.
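The Eckart-Young statement can be verified directly with NumPy: the Frobenius error of the rank-k truncation equals the root-sum-square of the discarded singular values (the toy matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                        # target rank
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation

# Eckart-Young: Frobenius error of the truncation is exactly
# sqrt(sum of the discarded singular values squared).
err = np.linalg.norm(A - A_k, "fro")
predicted = np.sqrt(np.sum(s[k:] ** 2))
```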
Skewness, Kurtosis, and Higher Moments
Distribution shape beyond mean and variance: skewness measures tail asymmetry, kurtosis measures tail heaviness, cumulants are the cleaner language, and heavy-tailed distributions break all of these.
Taylor Expansion
Taylor approximation in one and many variables. Every optimization algorithm is a Taylor approximation: gradient descent uses first order, Newton's method uses second order.
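A quick numerical check of the order hierarchy, using exp (which is its own derivative); the step size and helper names are illustrative:

```python
import math

def first_order(f, df, x0, h):
    """First-order Taylor step: f(x0) + f'(x0) * h, error O(h^2)."""
    return f(x0) + df(x0) * h

def second_order(f, df, d2f, x0, h):
    """Second-order Taylor step, error O(h^3)."""
    return f(x0) + df(x0) * h + 0.5 * d2f(x0) * h * h

f, df, d2f = math.exp, math.exp, math.exp   # exp equals all its derivatives
x0, h = 0.0, 0.1

true_val = f(x0 + h)
e1 = abs(true_val - first_order(f, df, x0, h))
e2 = abs(true_val - second_order(f, df, d2f, x0, h))
```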
Tensors and Tensor Operations
What a tensor actually is: a multilinear map with specific transformation rules, how tensor contraction generalizes matrix multiplication, Einstein summation, tensor decompositions, and how ML frameworks use the word tensor to mean multidimensional array.
Vectors, Matrices, and Linear Maps
Vector spaces, linear maps, matrix representation, rank, nullity, and the rank-nullity theorem. The algebraic backbone of ML.
Basic Logic and Proof Techniques
The fundamental proof strategies used throughout mathematics: direct proof, contradiction, contrapositive, induction, construction, and cases. Required vocabulary for reading any theorem.
Cantor's Theorem and Uncountability
Cantor's diagonal argument proves the reals are uncountable. The power set of any set has strictly greater cardinality. These results are the origin of the distinction between countable and uncountable infinity.
Cardinality and Countability
Two sets have the same cardinality when a bijection exists between them. The naturals, integers, and rationals are countable. The reals are uncountable, proved by Cantor's diagonal argument.
Counting and Combinatorics
Counting principles, binomial and multinomial coefficients, inclusion-exclusion, and Stirling's approximation. These tools appear whenever you count hypotheses, bound shattering coefficients, or analyze combinatorial arguments in learning theory.
Cramér-Wold Theorem
A multivariate distribution is uniquely determined by all of its one-dimensional projections. This reduces multivariate convergence in distribution to checking univariate projections, and is the standard tool for proving the multivariate CLT.
Integration and Change of Variables
Riemann integration, improper integrals, the substitution rule, multivariate change of variables via the Jacobian determinant, and Fubini's theorem. The computational backbone of probability and ML.
Inverse and Implicit Function Theorem
The inverse function theorem guarantees local invertibility when the Jacobian is nonsingular. The implicit function theorem guarantees that constraint surfaces are locally graphs. Both are essential for constrained optimization and implicit layers.
Markov Chains and Steady State
Markov chains: the Markov property, transition matrices, stationary distributions, irreducibility, aperiodicity, the ergodic theorem, and mixing time. The backbone of PageRank, MCMC, and reinforcement learning.
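A minimal sketch of reaching steady state by repeated transitions, for an illustrative two-state chain whose stationary distribution works out to (5/6, 1/6):

```python
# Two-state chain: P[i][j] = probability of moving from state i to state j.
P = [[0.9, 0.1],
     [0.5, 0.5]]

pi = [0.5, 0.5]                        # any starting distribution works
for _ in range(200):                   # pi <- pi P, repeated until it stops moving
    pi = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]

# The fixed point solves pi = pi P; for this chain that is pi = (5/6, 1/6).
```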
Moment Generating Functions
The moment generating function M(t) = E[e^{tX}] encodes all moments of a distribution. The Chernoff method, sub-Gaussian bounds, and exponential family theory all reduce to MGF conditions.
SAT, SMT, and Automated Reasoning
SAT solvers decide Boolean satisfiability (NP-complete). SMT solvers extend SAT with theories like arithmetic and arrays. These tools verify constraints, discharge proof obligations, and complement LLMs in AI agent pipelines.
Sequences and Series of Functions
Pointwise vs uniform convergence of function sequences, the Weierstrass M-test, and why uniform convergence preserves continuity. The concept that makes learning theory work.
Signals and Systems for ML
Linear time-invariant systems, convolution, Fourier transform, and the sampling theorem. The signal processing foundations that underpin CNNs, efficient attention, audio ML, and frequency-domain analysis of training dynamics.
Formal Languages and Automata
Regular languages, context-free grammars, pushdown automata, the Chomsky hierarchy, pumping lemmas, and connections to parsing, neural sequence models, and computational complexity.
Mathematical Infrastructure (Layer 0B)
Serious math machinery: measure theory, functional analysis, convex duality.
Convex Duality
Fenchel conjugates, the Fenchel-Moreau theorem, weak and strong duality, KKT conditions, and why duality gives the kernel trick for SVMs, connects regularization to constraints, and enables adversarial formulations in DRO.
Measure-Theoretic Probability
The rigorous foundations of probability: sigma-algebras, measures, measurable functions as random variables, Lebesgue integration, and the convergence theorems that make modern probability and statistics possible.
Radon-Nikodym and Conditional Expectation
The Radon-Nikodym theorem: what 'density' really means. Absolute continuity, the Radon-Nikodym derivative, conditional expectation as a projection, tower property, and why this undergirds likelihood ratios, importance sampling, and KL divergence.
Functional Analysis Core
The four pillars of functional analysis: Hahn-Banach (extending functionals), Uniform Boundedness (pointwise bounded implies uniformly bounded), Open Mapping (surjective bounded operators have open images), and Banach-Alaoglu (dual unit ball is weak-* compact). These underpin RKHS theory, optimization in function spaces, and duality.
Information Theory Foundations
The core of information theory for ML: entropy, cross-entropy, KL divergence, mutual information, data processing inequality, and the chain rules that connect them. The language of variational inference, generalization bounds, and representation learning.
Itô's Lemma
The chain rule of stochastic calculus: if X_t follows an SDE, then f(X_t) follows a modified SDE with an extra second-order correction term that has no analogue in ordinary calculus.
Martingale Theory
Martingales and their convergence properties: Doob martingale, optional stopping theorem, martingale convergence, Azuma-Hoeffding inequality, and Freedman inequality. The tools behind McDiarmid's inequality, online learning regret bounds, and stochastic approximation.
Information Geometry
Riemannian geometry on the space of probability distributions: the Fisher information metric, natural gradient descent, exponential families as dually flat manifolds, and the connection to mirror descent.
Spectral Theory of Operators
Spectral theorem for compact self-adjoint operators on Hilbert spaces: every such operator has a countable orthonormal eigenbasis with real eigenvalues accumulating only at zero. This is the infinite-dimensional backbone of PCA, kernel methods, and neural tangent kernel theory.
Stochastic Calculus for ML
Brownian motion, Itô integrals, Itô's lemma, and stochastic differential equations: the mathematical machinery behind diffusion models, score-based generative models, and Langevin dynamics.
Statistical Estimation (Layer 0B)
MLE, Fisher information, Cramér-Rao, LLN, CLT — the estimation core.
Central Limit Theorem
The CLT: the sample mean is approximately Gaussian for large n, regardless of the original distribution. Berry-Esseen rate, multivariate CLT, and why CLT explains asymptotic normality of MLE, confidence intervals, and the ubiquity of the Gaussian.
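An illustrative simulation: standardized sample means of Exponential(1) draws (a decidedly non-Gaussian distribution) land close to N(0, 1), so roughly 95% fall within +/- 1.96. Sample sizes and trial counts are arbitrary:

```python
import random
import statistics

random.seed(0)
n, trials = 200, 2000

# Standardized means: sqrt(n) * (mean - mu) / sigma, with mu = sigma = 1
# for the Exponential(1) distribution.
z = []
for _ in range(trials):
    m = statistics.fmean(random.expovariate(1.0) for _ in range(n))
    z.append((m - 1.0) * n ** 0.5)

coverage = sum(abs(v) < 1.96 for v in z) / trials   # should be near 0.95
```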
Cramér-Rao Bound
The fundamental lower bound on the variance of any unbiased estimator: no unbiased estimator can have variance smaller than the reciprocal of the Fisher information.
Fisher Information
The Fisher information quantifies how much a sample tells you about an unknown parameter: it measures the curvature of the log-likelihood, sets the Cramér-Rao lower bound on estimator variance, and serves as a natural Riemannian metric on parameter space.
Law of Large Numbers
The weak and strong laws of large numbers: the sample mean converges to the population mean. Kolmogorov's conditions, the rate of convergence from the CLT, and why LLN justifies using empirical risk as a proxy for population risk.
Maximum Likelihood Estimation
MLE: find the parameter that maximizes the likelihood of observed data. Consistency, asymptotic normality, Fisher information, Cramér-Rao efficiency, and when MLE fails.
Shrinkage Estimation and the James-Stein Estimator
In three or more dimensions, the sample mean is inadmissible for estimating a multivariate normal mean. The James-Stein estimator shrinks toward zero and dominates the MLE in total MSE, a result that shocked the statistics world.
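The domination is easy to reproduce in simulation; a sketch using the positive-part James-Stein estimator, with an illustrative true mean and arbitrary constants:

```python
import random

random.seed(0)
d, trials = 10, 500
theta = [1.0] * d                      # true mean of a N(theta, I) observation

mse_mle = mse_js = 0.0
for _ in range(trials):
    x = [random.gauss(t, 1.0) for t in theta]     # the MLE is x itself
    s = sum(v * v for v in x)
    shrink = max(0.0, 1.0 - (d - 2) / s)          # positive-part James-Stein
    mse_mle += sum((xi - t) ** 2 for xi, t in zip(x, theta))
    mse_js += sum((shrink * xi - t) ** 2 for xi, t in zip(x, theta))

mse_mle /= trials   # close to d = 10, the risk of the MLE
mse_js /= trials    # strictly smaller: shrinkage dominates for d >= 3
```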
Bayesian Estimation
The Bayesian approach to parameter estimation: encode prior beliefs, update with data via Bayes rule, and obtain a full posterior distribution over parameters. Conjugate priors, MAP estimation, and the Bernstein-von Mises theorem showing that the posterior concentrates around the true parameter.
Goodness-of-Fit Tests
KS test, Shapiro-Wilk, Anderson-Darling, and the probability integral transform: testing whether data follows a specified distribution, with exact power comparisons and when each test fails.
Method of Moments
Match sample moments to population moments to estimate parameters. Simpler than MLE but less efficient. Covers classical MoM, generalized method of moments (GMM), and when MoM is the better choice.
Sufficient Statistics and Exponential Families
Sufficient statistics compress data without losing information about the parameter. The Neyman-Fisher factorization theorem, exponential families, completeness, and Rao-Blackwell improvement of estimators.
Asymptotic Statistics
The large-sample toolbox: delta method, Slutsky's theorem, asymptotic normality of MLE, local asymptotic normality, and Fisher efficiency. These results justify nearly every confidence interval and hypothesis test used in practice.
Basu's Theorem
A complete sufficient statistic is independent of every ancillary statistic. This provides the cleanest method for proving independence between statistics without computing joint distributions.
Learning Theory Core (Layer 1-2)
ERM, uniform convergence, VC dimension, Rademacher complexity.
Algorithmic Stability
Algorithmic stability provides generalization bounds by analyzing how much a learning algorithm's output changes when a single training example is replaced: a structurally different lens from complexity-based approaches.
Empirical Risk Minimization
The foundational principle of statistical learning: minimize average loss on training data as a proxy for minimizing true population risk.
Hypothesis Classes and Function Spaces
What is a hypothesis class, why the choice of hypothesis class determines what ERM can learn, and the approximation-estimation tradeoff: bigger classes reduce approximation error but increase estimation error.
PAC Learning Framework
The foundational formalization of what it means to learn from data: a concept is PAC-learnable if an algorithm can, with high probability, find a hypothesis that is approximately correct, using a polynomial number of samples.
Rademacher Complexity
A data-dependent measure of hypothesis class complexity that gives tighter generalization bounds than VC dimension by measuring how well the class can fit random noise on the actual data.
Sample Complexity Bounds
How many samples do you need to learn? Tight answers for finite hypothesis classes, VC classes, and Rademacher-bounded classes, plus matching lower bounds via Fano and Le Cam.
Uniform Convergence
Uniform convergence of empirical risk to population risk over an entire hypothesis class: the key property that makes ERM provably work.
VC Dimension
The Vapnik-Chervonenkis dimension: a combinatorial measure of hypothesis class complexity that characterizes learnability in binary classification.
Kolmogorov Complexity and MDL
Kolmogorov complexity measures the shortest program that produces a string. The Minimum Description Length principle selects models that compress data best, providing a computable approximation to an incomputable ideal.
Concentration & Probability (Layer 1-3)
Hoeffding through matrix Bernstein. The workhorse inequality family.
Chernoff Bounds
The Chernoff method: the universal technique for deriving exponential tail bounds by optimizing over the moment generating function, yielding the tightest possible exponential concentration.
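A sketch of the method for a Binomial(100, 0.5) upper tail: minimize the MGF bound over a grid of t and compare against the exact tail probability (grid and numbers are illustrative):

```python
import math

n, p, a = 100, 0.5, 75          # bound P(S >= 75) for S ~ Binomial(100, 0.5)

# Chernoff: P(S >= a) <= min_t exp(-t*a) * E[e^{tS}],
# where E[e^{tS}] = (1 - p + p * e^t)^n for a Binomial sum.
bound = min(
    math.exp(-t * a) * (1 - p + p * math.exp(t)) ** n
    for t in [k / 1000 for k in range(1, 3000)]
)

# Exact tail for comparison: the bound is valid but not tight.
exact = sum(math.comb(n, k) for k in range(a, n + 1)) / 2 ** n
```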
Concentration Inequalities
Bounds on how far random variables deviate from their expectations: Markov, Chebyshev, Hoeffding, and Bernstein. Used throughout generalization theory, bandits, and sample complexity.
Epsilon-Nets and Covering Numbers
Discretizing infinite sets for concentration arguments: epsilon-nets, covering numbers, packing numbers, the Dudley integral, and the connection to Rademacher complexity.
Matrix Concentration
Matrix Bernstein, Matrix Hoeffding, Weyl's inequality, and Davis-Kahan: the operator-norm concentration tools needed for covariance estimation, dimensionality reduction, and spectral analysis in high dimensions.
McDiarmid's Inequality
The bounded-differences inequality: if changing one input to a function changes the output by at most c_i, the function concentrates around its mean with sub-Gaussian tails.
Sub-Exponential Random Variables
The distributional class between sub-Gaussian and heavy-tailed: heavier tails than Gaussian, the psi_1 norm, Bernstein condition, and the two-regime concentration bound.
Sub-Gaussian Random Variables
Sub-Gaussian random variables: the precise characterization of 'light-tailed' behavior that underpins every concentration inequality in learning theory.
Symmetrization Inequality
The symmetrization technique: the proof template that connects the generalization gap to Rademacher complexity by introducing a ghost sample and random signs.
Contraction Inequality
The Ledoux-Talagrand contraction principle: composing a function class with an L-Lipschitz function phi satisfying phi(0) = 0 scales Rademacher complexity by at most a factor of L, letting you bound the complexity of the loss class by that of the hypothesis class.
Empirical Processes and Chaining
Bounding the supremum of empirical processes via covering numbers and chaining: Dudley's entropy integral and Talagrand's generic chaining, the sharpest tools in classical learning theory.
Hanson-Wright Inequality
Concentration of quadratic forms X^T A X for sub-Gaussian random vectors: the two-term bound involving the Frobenius norm (Gaussian regime) and the operator norm (exponential regime).
Measure Concentration and Geometric Functional Analysis
High-dimensional geometry is counterintuitive: Lipschitz functions concentrate, random projections preserve distances, and most of a sphere's measure sits near the equator. Johnson-Lindenstrauss, Gaussian concentration, and Lévy's lemma.
Restricted Isometry Property
The restricted isometry property (RIP): when a measurement matrix approximately preserves norms of sparse vectors, enabling exact sparse recovery via L1 minimization. Random Gaussian matrices satisfy RIP with O(s log(n/s)) rows.
Optimization & Function Classes (Layer 1-3)
Convex optimization, regularization, kernels, RKHS.
Convex Optimization Basics
Convex sets, convex functions, gradient descent convergence, strong convexity, and duality: the optimization foundation that every learning-theoretic result silently depends on.
Gradient Descent Variants
From full-batch to stochastic to mini-batch gradient descent, plus momentum, Nesterov acceleration, AdaGrad, RMSProp, and Adam. Why mini-batch SGD with momentum is the practical default.
Gradient Flow and Vanishing Gradients
Why deep networks are hard to train: gradients shrink or explode as they propagate through layers. The Jacobian product chain, sigmoid saturation, ReLU dead neurons, skip connections, normalization, and gradient clipping.
Stochastic Gradient Descent Convergence
SGD convergence rates for convex and strongly convex functions, the role of noise as both curse and blessing, mini-batch variance reduction, learning rate schedules, and the Robbins-Monro conditions.
Bias-Variance Tradeoff
The classical decomposition of mean squared error into bias squared, variance, and irreducible noise. The U-shaped test error curve, why it breaks in modern ML (double descent), and the connection to regularization.
Cross-Validation Theory
The theory behind cross-validation as a model selection tool: LOO-CV, K-fold, the bias-variance tradeoff of the CV estimator itself, and why CV estimates generalization error.
Kernels and Reproducing Kernel Hilbert Spaces
Kernel functions, Mercer's theorem, the RKHS reproducing property, and the representer theorem: the mathematical framework that enables learning in infinite-dimensional function spaces via finite-dimensional computations.
Preconditioned Optimizers: Shampoo, K-FAC, and Natural Gradient
Optimizers that use curvature information to precondition gradients: the natural gradient via Fisher information, K-FAC's Kronecker approximation, and Shampoo's full-matrix preconditioning. How they connect to Riemannian optimization and why they outperform Adam on certain architectures.
Regularization Theory
Why unconstrained ERM overfits and how regularization controls complexity: Tikhonov (L2), sparsity (L1), elastic net, early stopping, dropout, the Bayesian prior connection, and the link to algorithmic stability.
Riemannian Optimization and Manifold Constraints
Optimization on curved spaces: the Stiefel manifold for orthogonal matrices, symmetric positive definite matrices, Riemannian gradient descent, retractions, and why flat-space intuitions break on manifolds. The geometric backbone of Shampoo, Muon, and constrained neural network training.
Stability and Optimization Dynamics
Convergence of gradient descent for smooth and strongly convex objectives, the descent lemma, gradient flow as a continuous-time limit, Lyapunov stability analysis, and the edge of stability phenomenon.
Stochastic Approximation Theory
The Robbins-Monro framework, ODE method, and Polyak-Ruppert averaging: the unified theory behind why SGD, Q-learning, and TD-learning converge.
Statistical Foundations (Layer 2-3)
Minimax, Fano, information-theoretic lower bounds, random matrix theory.
Design-Based vs. Model-Based Inference
Two philosophies of statistical inference from survey data: design-based inference where randomness comes from the sampling design, and model-based inference where randomness comes from a statistical model, with the model-assisted hybrid approach.
Detection Theory
Binary hypothesis testing, the Neyman-Pearson lemma (likelihood ratio tests are most powerful), ROC curves, Bayesian detection, and sequential testing. Classification IS detection theory. ROC/AUC comes directly from here.
Fano Inequality
Fano inequality as the standard tool for information-theoretic lower bounds: if X -> Y -> X_hat forms a Markov chain, the error probability is bounded below in terms of the conditional entropy H(X|Y) and the alphabet size.
High-Dimensional Covariance Estimation
When dimension d is comparable to sample size n, the sample covariance matrix fails. Shrinkage estimators (Ledoit-Wolf), banding and tapering for structured covariance, and Graphical Lasso for sparse precision matrices.
Kernel Two-Sample Tests
Maximum Mean Discrepancy (MMD): a kernel-based nonparametric test for whether two samples come from the same distribution, with unbiased estimation, permutation testing, and applications to GAN evaluation.
Minimax Lower Bounds
Why upper bounds are not enough: minimax risk, Le Cam two-point method, Fano inequality, and Assouad lemma for proving that no estimator can beat a given rate.
Neyman-Pearson and Hypothesis Testing Theory
The likelihood ratio test is the most powerful test for simple hypotheses (Neyman-Pearson lemma), UMP tests extend this to one-sided composites, and the power function characterizes a test's behavior across the parameter space.
Nonresponse and Missing Data
The taxonomy of missingness mechanisms (MCAR, MAR, MNAR), their consequences for inference, and the major correction methods: multiple imputation, inverse probability weighting, and the EM algorithm.
Order Statistics
Order statistics are the sorted values of a random sample. Their distributions govern quantile estimation, confidence intervals for medians, and the behavior of extremes.
Random Matrix Theory Overview
Why the spectra of random matrices matter for ML: Marchenko-Pastur law, Wigner semicircle, spiked models, and their applications to covariance estimation, PCA, and overparameterization.
Robust Statistics and M-Estimators
When data has outliers or model assumptions are wrong, classical estimators break. M-estimators generalize MLE to handle contamination gracefully.
Sample Size Determination
How to compute the number of observations needed to estimate means, proportions, and treatment effects with specified precision and power, including corrections for finite populations and complex designs.
Survey Sampling Methods
The major probability sampling designs used in survey statistics: simple random, stratified, cluster, systematic, multi-stage, and multi-phase sampling, with their variance properties and estimators.
Survival Analysis
Modeling time-to-event data with censoring: Kaplan-Meier curves, hazard functions, and the Cox proportional hazards model.
Copulas
Copulas separate the dependence structure of a multivariate distribution from its marginals. Sklar's theorem guarantees that any joint CDF can be decomposed into marginals and a copula, making dependence modeling modular.
Longitudinal Surveys and Panel Data
Analysis of data where the same units are measured repeatedly over time: fixed effects, random effects, difference-in-differences, and the problems of attrition and time-varying confounding.
Small Area Estimation
Methods for producing reliable estimates in domains where direct survey estimates have too few observations for useful precision, using Fay-Herriot and unit-level models that borrow strength across areas.
Modern Generalization Theory (Layer 3-4)
Implicit bias, double descent, NTK, mean field — where classical theory fails.
Implicit Bias and Modern Generalization
Why classical generalization theory breaks for overparameterized models: the random labels experiment, the interpolation threshold, implicit bias of gradient descent, double descent, and the frontier of understanding why deep learning works.
Benign Overfitting
When interpolation (zero training error) does not hurt generalization: the min-norm interpolator fits noise in harmless directions while preserving signal. Bartlett et al. 2020, effective rank conditions, and why benign overfitting happens in overparameterized but not classical regimes.
Double Descent
Test error follows a double-descent curve: it decreases, peaks at the interpolation threshold, then decreases again in the overparameterized regime, defying classical bias-variance intuition.
Grokking
Models can memorize training data quickly, then generalize much later after continued training. This delayed generalization, called grokking, breaks the assumption that overfitting is a terminal state and connects to weight decay, implicit regularization, and phase transitions in learning.
Lazy vs Feature Learning
The fundamental dichotomy in neural network training: lazy regime (NTK, kernel-like, weights barely move) versus rich/feature learning regime (weights move substantially, representations emerge).
Mean Field Theory
The mean field limit of neural networks: as width goes to infinity under the right scaling, neurons become independent particles whose weight distribution evolves by Wasserstein gradient flow, capturing feature learning that the NTK regime misses.
Neural Network Optimization Landscape
Loss surface geometry of neural networks: saddle points dominate in high dimensions, mode connectivity, flat vs sharp minima, Sharpness-Aware Minimization, and the edge of stability phenomenon.
Neural Tangent Kernel
In the infinite-width limit, neural networks trained with gradient descent behave like kernel regression with a specific kernel, the Neural Tangent Kernel, connecting deep learning to classical kernel theory.
Optimal Transport and Earth Mover's Distance
The Monge and Kantorovich formulations of optimal transport, the linear programming dual, Sinkhorn regularization, and applications to WGANs, domain adaptation, and fairness.
PAC-Bayes Bounds
Generalization bounds that depend on the KL divergence between a learned posterior and a prior over hypotheses. PAC-Bayes gives non-vacuous bounds for overparameterized networks where VC and Rademacher bounds fail.
Representation Learning Theory
What makes a good learned representation: the information bottleneck, contrastive learning, sufficient statistics, rate-distortion theory, and why representation learning is the central unsolved problem of deep learning.
Gaussian Processes for Machine Learning
A distribution over functions specified by a mean and kernel: closed-form posterior predictions with uncertainty, connection to kernel ridge regression, marginal likelihood for model selection, and the cubic cost bottleneck.
Information Bottleneck
The information bottleneck principle: compress the input X into a representation T that preserves information about the target Y. The Lagrangian formulation, connection to deep learning, Shwartz-Ziv and Tishby claims, and why the compression story may not hold for ReLU networks.
Open Problems in ML Theory
A curated list of genuinely open problems in machine learning theory: why overparameterized networks generalize, the right complexity measure for deep learning, feature learning beyond NTK, why scaling laws hold, emergent abilities, transformer-specific theory, and post-training theory.
Sparse Recovery and Compressed Sensing
Recover a sparse signal from far fewer measurements than its ambient dimension: the restricted isometry property, basis pursuit via L1 minimization, random measurement matrices, and applications from MRI to single-pixel cameras.
Wasserstein Distances
The Wasserstein (earth mover's) distance measures the minimum cost of transporting one probability distribution to another, with deep connections to optimal transport, GANs, and distributional robustness.
LLM Construction (Layer 4-5)
Transformer math, attention, KV cache, optimizers, scaling laws, RLHF.
Fine-Tuning and Adaptation
Methods for adapting pretrained models to new tasks: full fine-tuning, feature extraction, LoRA, QLoRA, adapter layers, and prompt tuning. When to use each, and how catastrophic forgetting constrains adaptation.
Hallucination Theory
Why large language models confabulate, the mathematical frameworks for understanding when model outputs are unreliable, and what current theory says about mitigation.
Optimizer Theory: SGD, Adam, and Muon
Convergence theory of SGD (convex and strongly convex), momentum methods (Polyak and Nesterov), Adam as adaptive + momentum, why SGD can generalize better, the Muon optimizer, and learning rate schedules.
Reinforcement Learning from Human Feedback: Deep Dive
The full RLHF pipeline: supervised fine-tuning, Bradley-Terry reward modeling, PPO with KL penalty, reward hacking via Goodhart, and the post-RLHF landscape of DPO, GRPO, and RLVR.
Attention Mechanism Theory
Mathematical formulation of attention: scaled dot-product attention as soft dictionary lookup, why sqrt(d_k) scaling prevents softmax saturation, multi-head attention, and the connection to kernel methods.
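A minimal NumPy rendering of the formula softmax(Q K^T / sqrt(d_k)) V (shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) logits
    scores -= scores.max(axis=-1, keepdims=True)    # stability shift
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries, d_k = 4
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 2))   # 5 values, d_v = 2
out, W = attention(Q, K, V)
```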
Attention Mechanisms History
The evolution of attention from Bahdanau (2014) additive alignment to Luong dot-product attention to self-attention in transformers. How attention solved the fixed-length bottleneck of seq2seq models.
Attention Sinks and Retrieval Decay
Why transformers disproportionately attend to initial tokens (attention sinks), how StreamingLLM exploits this for infinite-length inference, and how retrieval accuracy degrades with distance and position within the context window.
Attention Variants and Efficiency
Multi-head, multi-query, grouped-query, linear, and sparse attention: how each variant trades expressivity for efficiency, and when to use which.
Bits, Nats, Perplexity, and BPB
The four units people use to measure language model quality, how they relate to each other, when to use each one, and how mixing them up leads to wrong conclusions.
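The unit conversions are simple enough to state as code — a minimal sketch, assuming a hypothetical loss of 2.0 nats per token and illustrative token/byte counts:

```python
import math

def nats_to_bits(nats):
    """1 nat = 1/ln(2) bits."""
    return nats / math.log(2)

def perplexity(nats_per_token):
    """Perplexity = e^(cross-entropy measured in nats)."""
    return math.exp(nats_per_token)

def bits_per_byte(nats_per_token, tokens, total_bytes):
    """Convert per-token loss to per-byte loss: total bits / total bytes."""
    return nats_to_bits(nats_per_token) * tokens / total_bytes

loss = 2.0                     # hypothetical cross-entropy, nats/token
ppl = perplexity(loss)         # ~7.39
bits = nats_to_bits(loss)      # ~2.89 bits/token
bpb = bits_per_byte(loss, tokens=1000, total_bytes=4000)
```

Comparing a bits-per-token number against a nats-per-token number without converting is the classic mix-up this entry warns about.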
Chain-of-Thought and Reasoning
Chain-of-thought prompting, why intermediate reasoning steps improve LLM performance, self-consistency, tree-of-thought, and the connection to inference-time compute scaling.
Context Engineering
The discipline of building, routing, compressing, retrieving, and persisting context for LLMs: beyond prompt design into systems engineering for what the model sees.
Decoding Strategies
How language models select output tokens: greedy decoding, beam search, temperature scaling, top-k sampling, and nucleus (top-p) sampling. The tradeoffs between coherence, diversity, and quality.
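The strategies listed above compose naturally: temperature rescales logits, then top-k and top-p filter the distribution before sampling. A toy sketch over a plain logit list (illustrative, not a production implementation):

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Temperature scaling, then top-k and nucleus (top-p) filtering, then sampling."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(l - m) for l in scaled]
    z = sum(probs)
    probs = [p / z for p in probs]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k is not None:
        order = order[:top_k]          # keep only the k most probable tokens
    if top_p is not None:              # keep smallest prefix with mass >= top_p
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        order = kept
    z = sum(probs[i] for i in order)   # renormalize over surviving tokens
    r, acc = rng.random() * z, 0.0
    for i in order:
        acc += probs[i]
        if r <= acc:
            return i
    return order[-1]
```

Setting `top_k=1` recovers greedy decoding; temperature near zero does the same by concentrating all mass on the argmax.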
Document Intelligence
Beyond OCR: understanding document layout, tables, figures, and structure using models that combine text, spatial position, and visual features to extract structured information from PDFs, invoices, and contracts.
DPO vs GRPO vs RL for Reasoning
Head-to-head comparison of preference optimization methods: DPO's implicit reward, GRPO's group-relative comparisons, and RL with verifier feedback for reasoning. When each works and when each fails.
Edge and On-Device ML
Running models on phones, embedded devices, and edge servers: pruning, distillation, quantization, TinyML, and hardware-aware neural architecture search under memory, compute, and power constraints.
Efficient Transformers Survey
Sub-quadratic attention variants (linear attention, Linformer, Performer, Longformer, BigBird) and why FlashAttention, a hardware-aware exact method, made most of them unnecessary in practice.
Flash Attention
IO-aware exact attention: tile the Q, K, and V matrices into SRAM-sized blocks and compute attention without materializing the full attention matrix in HBM, reducing memory reads and writes from quadratic to linear.
Forgetting Transformer (FoX)
FoX adds a data-dependent forget gate to softmax attention. The gate down-weights unnormalized attention scores between past and present positions, giving the transformer a learned, recency-biased decay. FoX is FlashAttention-compatible, works without positional embeddings, and improves long-context language modeling and length extrapolation.
Fused Kernels
Combine multiple GPU operations into a single kernel launch to eliminate intermediate HBM reads and writes. Why kernel fusion is the primary optimization technique for memory-bound ML operations.
GPU Compute Model
How GPUs execute ML workloads: streaming multiprocessors, warps, memory hierarchy (registers, SRAM, L2, HBM), arithmetic intensity, the roofline model, and why most ML operations are memory-bound.
Induction Heads
Induction heads are attention head circuits that implement pattern completion: given a sequence like [A][B]...[A], they predict [B]. They are a leading candidate mechanism for in-context learning, with strong causal evidence in small attention-only models and correlational evidence in large transformers. They emerge through a phase transition during training.
Inference Systems Overview
The modern LLM inference stack: batching strategies, scheduling, memory management with paged attention, model parallelism for serving, and why FLOPs do not equal latency when memory bandwidth is the bottleneck.
Inference-Time Scaling Laws
How additional compute at inference time (repeated sampling, search, verification) improves output quality, why gains are task-dependent, and why verifier quality matters more than raw sample count.
Knowledge Distillation
Training a small student model to mimic a large teacher: soft targets, temperature scaling, dark knowledge, and why the teacher's mistakes carry useful information about class structure.
KV Cache Optimization
Advanced techniques for managing the KV cache memory bottleneck: paged attention for fragmentation-free allocation, prefix caching for shared prompts, token eviction for long sequences, and quantized KV cache for reduced footprint.
KV Cache
Why autoregressive generation recomputes attention at every step, how caching past key-value pairs makes it linear, and the memory bottleneck that drives MQA, GQA, and paged attention.
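The caching idea fits in a few lines — a toy single-head sketch (the `KVCache` class and shapes are illustrative, not any serving framework's API):

```python
import numpy as np

class KVCache:
    """Append-only cache of past keys/values for one attention head."""
    def __init__(self, d_k):
        self.K = np.empty((0, d_k))
        self.V = np.empty((0, d_k))

    def step(self, q, k, v):
        # cache the new token's key/value, then attend q over all cached pairs
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        scores = self.K @ q / np.sqrt(q.shape[0])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ self.V              # O(t) work at step t instead of recomputing O(t^2)

rng = np.random.default_rng(0)
cache = KVCache(d_k=4)
for _ in range(3):                     # three decode steps; cache grows one row per step
    q, k, v = rng.normal(size=(3, 4))
    out = cache.step(q, k, v)
```

The cache itself is what becomes the memory bottleneck: it grows linearly in sequence length per layer per head, which is what motivates MQA, GQA, and paged allocation.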
Latent Reasoning
Reasoning in hidden state space instead of generating chain-of-thought tokens: recurrent computation and continuous thought for scaling inference compute without scaling output length.
Memory Systems for LLMs
Taxonomy of LLM memory: short-term (KV cache), working (scratchpad), long-term (retrieval), and parametric (weights). Why extending context alone is insufficient and how memory consolidation works.
Mixture of Experts
Sparse computation via learned routing: replace dense FFN layers with multiple expert networks, activate only a subset per token, and scale capacity without proportional compute.
Model Compression and Pruning
Reducing model size without proportional accuracy loss: unstructured and structured pruning, magnitude pruning, the lottery ticket hypothesis, entropy coding for compressed weights, and knowledge distillation as compression.
Multi-Token Prediction
Predicting k future tokens simultaneously using auxiliary prediction heads: forces planning, improves code generation, and connects to speculative decoding.
Multimodal RAG
RAG beyond text: retrieving images, tables, charts, and PDFs alongside text. Document parsing, multimodal chunking, vision-language retrievers, agentic RAG, and reasoning RAG with chain-of-thought retrieval.
PaddleOCR and Practical OCR
A practitioner's guide to modern OCR toolkits: PaddleOCR's three-stage pipeline, TrOCR's transformer approach, EasyOCR, and Tesseract. When to use which, and what accuracy to expect.
Parallel Processing Fundamentals
Data, tensor, pipeline, expert, and sequence parallelism: the five strategies for distributing model training and inference across multiple GPUs, and how frontier labs combine all of them.
Perplexity and Language Model Evaluation
Perplexity as exp(cross-entropy): the standard intrinsic metric for language models, its information-theoretic interpretation, connection to bits-per-byte, and why low perplexity alone does not guarantee useful generation.
Post-Training Overview
The full post-training stack in 2026: SFT, RLHF, DPO, GRPO, constitutional AI, verifier-guided training, and self-improvement loops. Why post-training is now its own discipline.
Prefix Caching
Reuse computed KV cache entries across requests that share the same prefix. Radix attention trees enable efficient lookup. Significant latency savings for prefix-heavy production workloads.
Prompt Engineering and In-Context Learning
In-context learning allows LLMs to adapt to new tasks from examples in the prompt without weight updates. Theories for why it works, prompting strategies, and why prompt engineering is configuring inference-time computation.
Reasoning Data Curation
How to build training data for reasoning models: math with verified solutions, code with test cases, rejection sampling, external verification, self-play for problem generation, and the connection to RLVR.
Residual Stream and Transformer Internals
The residual stream as the central computational highway in transformers: attention and FFN blocks read from and write to it. Pre-norm vs post-norm, FFN as key-value memory, and the logit lens for inspecting intermediate representations.
RLHF and Alignment
The RLHF pipeline for aligning language models with human preferences: reward modeling, PPO fine-tuning, KL penalties, DPO, and why none of it guarantees truthfulness.
Scaling Compute-Optimal Training
Chinchilla scaling: how to optimally allocate a fixed compute budget between model size and training data, why many models were undertrained, and the post-Chinchilla reality of data quality and inference cost.
Scaling Laws
Power-law relationships between loss and compute, parameters, and data: Kaplan scaling, Chinchilla-optimal training, emergent abilities, and whether scaling laws are fundamental or empirical.
Sparse Attention and Long Context
Standard attention is O(n^2). Sparse patterns (Longformer, Sparse Transformer, Reformer), ring attention for distributed sequences, streaming with attention sinks, and why extending context is harder than it sounds.
Sparse Autoencoders for Interpretability
Sparse autoencoders decompose neural network activations into interpretable feature directions by learning an overcomplete dictionary with sparsity constraints. They are the primary tool for extracting monosemantic features from polysemantic neurons.
Speculative Decoding and Quantization
Two core inference optimizations: speculative decoding for latency (draft-verify parallelism) and quantization for memory and throughput (reducing weight precision without destroying quality).
Structured Output and Constrained Generation
Forcing LLM output to conform to schemas via constrained decoding: logit masking for valid tokens, grammar-guided generation, and why constraining the search space can improve both reliability and quality.
Test-Time Compute and Search
One of the biggest frontier shifts: spending more compute at inference through repeated sampling, verifier-guided search, MCTS for reasoning, chain-of-thought as compute, and latent reasoning.
Token Prediction and Language Modeling
Language modeling as probability assignment over sequences. Autoregressive and masked prediction objectives, perplexity evaluation, and the connection between prediction and compression.
Tool-Augmented Reasoning
LLMs that call external tools during reasoning: Toolformer for learning when to invoke APIs, ReAct for interleaving thought and action, and code-as-thought for replacing verbal arithmetic with executed programs.
Training Dynamics and Loss Landscapes
The geometry of neural network loss surfaces: why saddle points dominate over local minima in high dimensions, how flat minima relate to generalization, and why SGD finds solutions that generalize.
Transformer Architecture
The mathematical formulation of the transformer block: self-attention, multi-head attention, layer normalization, FFN blocks, positional encoding, and parameter counting.
AMD Competition Landscape
AMD's MI300X and MI325X GPUs compete with NVIDIA on memory bandwidth and capacity but lag on software ecosystem. Competition matters because pricing, supply diversity, and vendor lock-in determine who can train and serve models.
ASML and Chip Manufacturing
ASML is the sole manufacturer of EUV lithography machines used to produce every advanced AI chip. Understanding the semiconductor supply chain reveals a critical concentration risk for AI compute.
Attention as Kernel Regression
Softmax attention viewed as Nadaraya-Watson kernel regression: the output at each position is a kernel-weighted average of values, with the softmax kernel K(q,k) = exp(q^T k / sqrt(d)). Connects attention to classical nonparametric statistics and motivates linear attention via random features.
Distributed Training Theory
Training frontier models requires thousands of GPUs. Data parallelism, model parallelism, and communication-efficient methods make this possible.
Donut and OCR-Free Document Understanding
End-to-end document understanding without OCR: Donut reads document images directly and generates structured output, bypassing the error-prone OCR pipeline. Nougat extends this to academic paper parsing.
Model Merging and Weight Averaging
Combining trained models by averaging or interpolating their weights: SWA, SLERP, TIES-Merging, DARE. Why it works (loss landscape mode connectivity), when it fails, and applications to combining specialized models.
Neural Architecture Search
Automating network architecture design: search spaces, search strategies (RL, evolutionary, differentiable), performance estimation via weight sharing, and the gap between NAS hype and practical gains.
NVIDIA GPU Architectures
A concrete reference for the GPU hardware that determines what ML models can be trained and served: H100, H200, B200, GB200, and Rubin architectures, with emphasis on memory bandwidth as the primary bottleneck.
Plan-then-Generate
Beyond strict left-to-right token generation: plan what to say before producing tokens. Outline-then-fill, hierarchical generation, and multi-token prediction. Planning reduces incoherence in long outputs and enables structural editing.
Positional Encoding
Why attention needs position information, sinusoidal encoding, learned positions, RoPE (rotary position encoding via 2D rotations), ALiBi, and why RoPE became the default for modern LLMs.
Quantization Theory
Reduce model weight precision from FP32 to FP16, INT8, or INT4. Post-training quantization, quantization-aware training, GPTQ, AWQ, and GGUF. Quantization is how large language models actually get deployed.
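The simplest case in this entry, symmetric per-tensor INT8 post-training quantization, can be sketched directly (a toy illustration; GPTQ/AWQ add calibration and error compensation on top of this idea):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: w ~ scale * q with q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=256).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()      # rounding error is bounded by scale / 2
```

The error bound makes the core tradeoff visible: a single outlier weight inflates `scale` and degrades every other weight, which is why per-channel scales and outlier handling dominate practical schemes.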
Table Extraction and Structure Recognition
Detecting tables in documents, recognizing row and column structure, and extracting cell content. Why tables are hard: merged cells, borderless layouts, nested headers, and cascading pipeline errors.
Tokenization and Information Theory
Tokenization determines an LLM's vocabulary and shapes everything from compression efficiency to multilingual ability. Information theory explains what good tokenization looks like.
Methodology & Experimental Design
Hypothesis testing, ablations, significance, reproducibility.
The Bitter Lesson
Sutton's meta-principle: scalable general methods that exploit computation tend to beat hand-crafted domain-specific approaches in the long run. Search and learning win; brittle cleverness loses.
Causal Inference and the Ladder of Causation
Pearl's three-level hierarchy: association, intervention, counterfactual. Structural causal models, do-calculus, the adjustment formula, and why prediction is not causation.
Confusion Matrices and Classification Metrics
The complete framework for evaluating binary classifiers: confusion matrix entries, precision, recall, F1, specificity, ROC curves, AUC, and precision-recall curves. When to use each metric.
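The core metrics fall straight out of the confusion-matrix counts — a minimal sketch over label lists (the helper name is illustrative):

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, and F1 from raw confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many real
    recall = tp / (tp + fn) if tp + fn else 0.0      # of real positives, how many found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 3 true positives, 1 false positive, 1 false negative
p, r, f1 = binary_metrics([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0])
```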
Confusion Matrix Deep Dive
Beyond binary classification: multi-class confusion matrices, micro vs macro averaging, Matthews correlation coefficient, Cohen's kappa, cost-sensitive evaluation, and how to read a confusion matrix without getting the axes wrong.
The Era of Experience
Sutton and Silver's thesis: the next phase of AI moves beyond imitation from human data toward agents that learn predominantly from their own experience. Text is not enough for general intelligence.
Model Evaluation Best Practices
Train/validation/test splits, cross-validation, stratified and temporal splits, data leakage, reporting with confidence intervals, and statistical tests for model comparison. Why single-number comparisons are misleading.
Train-Test Split and Data Leakage
Why you need three data splits, how to construct them correctly, and the common ways information leaks from test data into training. Temporal splits, stratified splits, and leakage detection.
Types of Bias in Statistics
A taxonomy of the major biases that corrupt statistical inference: selection bias, measurement bias, survivorship bias, response bias, attrition bias, publication bias, and their consequences for ML.
Ablation Study Design
How to properly design ablation studies: remove one component at a time, measure the effect against proper baselines, and report results with statistical rigor.
Class Imbalance and Resampling
When class frequencies differ dramatically, standard accuracy is misleading. Resampling, cost-sensitive learning, and threshold tuning restore meaningful evaluation and training.
Convex Tinkering
Taleb's concept applied to ML research: designing small experiments with bounded downside and unbounded upside, and why this strategy dominates scale-first approaches under uncertainty.
Evaluation Metrics and Properties
The metrics that determine whether a model is good: accuracy, precision, recall, F1, AUC-ROC, log loss, RMSE, calibration, and proper scoring rules. Why choosing the right metric matters more than improving the wrong one.
Exploratory Data Analysis
The disciplined practice of looking at data before modeling: summary statistics, distributions, correlations, missing values, outliers, and class balance. You cannot model what you do not understand.
Feature Importance and Interpretability
Methods for attributing model predictions to input features: permutation importance, SHAP values, LIME, partial dependence, and why none of these imply causality.
Federated Learning
Train a global model without centralizing data. FedAvg, communication efficiency, non-IID convergence challenges, differential privacy integration, and applications in healthcare and mobile computing.
Hardware for ML Practitioners
Practical hardware guidance for ML work: GPU memory as the real bottleneck, when local GPUs make sense, cloud options compared, and why you should not spend $5000 before knowing what you need.
Hypothesis Testing for ML
Null and alternative hypotheses, p-values, confidence intervals, error types, multiple comparisons, and proper statistical tests for comparing ML models.
Meta-Analysis
Combining results from multiple studies: fixed-effect and random-effects models, heterogeneity measures, publication bias, and why this matters for ML benchmarking.
ML Project Lifecycle
The full ML project workflow from problem definition through deployment and monitoring. Why most projects fail at data quality, not model architecture. Cross-functional requirements and MLOps basics.
P-Hacking and Multiple Testing
How selective reporting and multiple comparisons inflate false positive rates, and how Bonferroni and Benjamini-Hochberg corrections control them. Why hyperparameter tuning is multiple testing and benchmark shopping is p-hacking.
Proper Scoring Rules
A scoring rule is proper if the expected score is maximized when the forecaster reports their true belief. Log score and Brier score are strictly proper. Accuracy is not. Why this matters for calibrated probability estimates.
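The properness claim can be checked numerically — a small sketch showing that under the log score, reporting the true belief beats shading it (probabilities here are illustrative):

```python
import math

def log_score(p, outcome):
    """Strictly proper; higher is better."""
    return math.log(p if outcome else 1 - p)

def brier_score(p, outcome):
    """Strictly proper; lower is better."""
    return (p - outcome) ** 2

def expected_log_score(report, true_p):
    """Expected log score when the event truly occurs with probability true_p."""
    return true_p * log_score(report, 1) + (1 - true_p) * log_score(report, 0)

honest = expected_log_score(0.7, 0.7)   # report the true belief
shaded = expected_log_score(0.9, 0.7)   # overconfident report scores worse in expectation
```

Accuracy fails this test: any report on the majority side of 0.5 earns the same expected accuracy, so it gives forecasters no incentive to report calibrated probabilities.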
Reproducibility and Experimental Rigor
What it takes to make ML experiments truly reproducible: seeds, variance reporting, data hygiene, configuration management, and the discipline of multi-run evaluation.
Statistical Significance and Multiple Comparisons
p-values, significance levels, confidence intervals, and the multiple comparisons problem. Bonferroni correction, Benjamini-Hochberg FDR control, and why these matter for model selection and benchmark evaluation.
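The Benjamini-Hochberg step-up procedure mentioned here is short enough to state exactly (the p-values below are illustrative):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of hypotheses rejected under BH FDR control at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:   # BH step-up condition
            k_max = rank                      # reject everything up to the largest passing rank
    return sorted(order[:k_max])

# one strong effect, one borderline effect, three nulls
rejected = benjamini_hochberg([0.001, 0.02, 0.04, 0.30, 0.50], alpha=0.05)
```

Note the step-up behavior: the 0.02 p-value is rejected at rank 2 (threshold 0.02) even though plain Bonferroni (threshold 0.01) would retain it.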
Synthetic Data Generation
Using models to generate training data: LLM-generated instructions, diffusion-based image augmentation, code synthesis. When synthetic data helps (low-resource, privacy) and when it hurts (model collapse).
Benchmarking Methodology
What makes a good benchmark, how benchmarks fail (contamination, leaderboard gaming, single-number comparisons), and how to report results honestly with variance, seeds, and proper statistical practice.
Causal Inference Basics
Correlation is not causation. The potential outcomes framework, average treatment effects, confounders, and the methods that let you estimate causal effects from data.
Energy Efficiency and Green AI
The compute cost of training frontier models, carbon footprint, FLOPs vs wall-clock time vs dollars, and why reporting efficiency matters. Efficient alternatives: distillation, pruning, quantization, and scaling laws for optimal compute allocation.
Experiment Tracking and Tooling
MLflow, Weights and Biases, TensorBoard, Hydra, and the discipline of recording everything. What to log, how to compare experiments, and why you cannot reproduce what you did not record.
Official Statistics and National Surveys
How government statistical agencies produce population, economic, and social data through censuses and surveys, with quality frameworks and implications for ML practitioners using these datasets.
Training Techniques & Regularization
Adam, dropout, batch norm, data augmentation, learning rate schedules.
Adam Optimizer
Adaptive moment estimation: first and second moment tracking, bias correction, AdamW vs Adam+L2, learning rate warmup, and when Adam fails compared to SGD.
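The update rule described here fits in a few lines — a toy scalar-list sketch with the standard defaults, not a production optimizer:

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; t is the 1-indexed step count used for bias correction."""
    m = [b1 * mi + (1 - b1) * g for mi, g in zip(m, grad)]        # 1st moment (momentum)
    v = [b2 * vi + (1 - b2) * g * g for vi, g in zip(v, grad)]    # 2nd moment (scale)
    m_hat = [mi / (1 - b1 ** t) for mi in m]                      # bias correction: moments
    v_hat = [vi / (1 - b2 ** t) for vi in v]                      # start at zero
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# minimize f(x) = x^2 from x = 5; the gradient is 2x
theta, m, v = [5.0], [0.0], [0.0]
for t in range(1, 2001):
    theta, m, v = adam_step(theta, [2 * theta[0]], m, v, t, lr=0.05)
```

The AdamW distinction lives outside this update: weight decay is applied directly to `theta` rather than folded into `grad`, so it is not rescaled by the second-moment denominator.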
Batch Normalization
Normalize activations within mini-batches to stabilize and accelerate training, then scale and shift with learnable parameters.
Dropout
Dropout as stochastic regularization: Bernoulli masking, inverted dropout scaling, implicit ensemble interpretation, Bayesian connection via MC dropout, and equivalence to L2 regularization in linear models.
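The inverted-dropout scaling mentioned here is the detail most worth seeing in code — a minimal sketch over a plain list of activations:

```python
import random

def inverted_dropout(x, p_drop, rng=random):
    """Zero each unit with probability p_drop and scale survivors by 1/(1 - p_drop),
    so the expected activation is unchanged and test time needs no rescaling."""
    keep = 1.0 - p_drop
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]

rng = random.Random(0)
x = [1.0] * 10000
y = inverted_dropout(x, p_drop=0.3, rng=rng)
mean = sum(y) / len(y)     # ~1.0: the scaling preserves the expectation
```

At inference the layer is simply the identity; the 1/(1-p) factor during training is what makes that consistent.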
Learning Rate Scheduling
Why learning rate moves the loss more than any other hyperparameter, and how warmup, cosine decay, step decay, cyclic schedules, and 1cycle policy control training dynamics.
Regularization in Practice
Practical guide to regularization: L2 (weight decay), L1 (sparsity), dropout, early stopping, data augmentation, and batch normalization. How each constrains model complexity and how to choose between them.
Weight Initialization
Why initialization determines whether gradients vanish or explode, and how Xavier and He initialization preserve variance across layers.
Activation Checkpointing
Trade compute for memory by recomputing activations during the backward pass instead of storing them all. Reduces memory from O(L) to O(sqrt(L)) for L layers.
Batch Size and Learning Dynamics
How batch size affects what SGD finds: gradient noise, implicit regularization, the linear scaling rule, sharp vs flat minima, and the gradient noise scale as the key quantity governing the tradeoff.
Data Augmentation Theory
Why data augmentation works as a regularizer: invariance injection, effective sample size, MixUp, CutMix, and the connection to Vicinal Risk Minimization.
Label Smoothing and Regularization
Replace hard targets with soft targets to prevent overconfidence: the label smoothing formula, its connection to maximum entropy regularization, calibration effects, and when it hurts.
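The label smoothing formula is y_k = (1 - eps) * 1[k = target] + eps / K. A one-function sketch (the helper name is illustrative):

```python
def smooth_labels(target, num_classes, eps=0.1):
    """Soft targets from a hard label: mass (1 - eps) on the target, eps spread uniformly."""
    return [(1 - eps) * (1.0 if k == target else 0.0) + eps / num_classes
            for k in range(num_classes)]

y = smooth_labels(target=2, num_classes=4, eps=0.1)   # [0.025, 0.025, 0.925, 0.025]
```

Because the target distribution never reaches a one-hot vertex, the logit gap the model is pushed toward stays finite, which is the mechanism behind the overconfidence and calibration effects described above.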
Mixed Precision Training
Train with FP16 or BF16 for speed while keeping FP32 master weights for accuracy. Loss scaling, overflow prevention, and when mixed precision fails.
Curriculum Learning
Train on easy examples first, gradually increase difficulty. Curriculum learning can speed convergence and improve generalization, but defining difficulty is the hard part. Self-paced learning, anti-curriculum, and the connection to importance sampling.
AI Safety & Alignment
RLHF failure modes, hallucination theory, interpretability, reward hacking.
Adversarial Machine Learning
Attacks across the ML lifecycle: evasion (adversarial examples), poisoning (corrupt training data), model extraction (steal via queries), privacy leakage (membership inference), and LLM jailbreaks. Why defenses are hard.
Calibration and Uncertainty Quantification
When a model says 70% confidence, is it right 70% of the time? Calibration measures the alignment between predicted probabilities and actual outcomes. Temperature scaling, Platt scaling, conformal prediction, and MC dropout provide practical tools for trustworthy uncertainty.
Catastrophic Forgetting
Fine-tuning a neural network on new data destroys knowledge of old data. Understanding the stability-plasticity dilemma and its mitigation strategies (EWC, progressive networks, replay) is essential for continual learning and safe LLM fine-tuning.
Constitutional AI
Anthropic's approach to alignment: replace human harmlessness labels with a constitution of principles that the model uses to self-evaluate, enabling scalable AI feedback.
Continual Learning and Forgetting
Learning sequentially without destroying previous knowledge: Elastic Weight Consolidation, progressive networks, replay methods, and the stability-plasticity tradeoff in deployed systems.
Data Contamination and Evaluation
When training data overlaps test benchmarks, model scores become meaningless. Types of contamination, detection methods, dynamic benchmarks, and how to read evaluation claims skeptically.
Differential Privacy
Formal privacy guarantees for algorithms: epsilon-delta DP, Laplace and Gaussian mechanisms, composition theorems, DP-SGD for training neural networks, and the privacy-utility tradeoff.
Ethics and Fairness in ML
Fairness definitions (demographic parity, equalized odds, calibration), the impossibility theorem showing they cannot all hold simultaneously, bias sources, and mitigation strategies at each stage of the pipeline.
LLM Application Security
The OWASP LLM Top 10: prompt injection, insecure output handling, training data poisoning, model denial of service, supply chain vulnerabilities, sensitive information disclosure, insecure plugin design, excessive agency, overreliance, and model theft. Standard application security for the GenAI era.
Mechanistic Interpretability
Understanding what individual neurons and circuits compute inside neural networks: sparse autoencoders, superposition, induction heads, probing, and the limits of interpretability.
Model Collapse and Data Quality
When models train on their own outputs, the learned distribution narrows, tails disappear, and quality degrades across generations. Why synthetic data feedback loops threaten pretraining data quality and how to mitigate them.
Out-of-Distribution Detection
Methods for detecting when test inputs differ from training data, where naive softmax confidence fails and principled alternatives based on energy, Mahalanobis distance, and typicality succeed.
Red-Teaming and Adversarial Evaluation
Systematically trying to make models produce harmful or incorrect outputs: manual and automated red-teaming, jailbreaks, prompt injection, adversarial suffixes, and why adversarial evaluation is necessary before deployment.
Reward Hacking
Goodhart's law for AI: when models exploit reward model weaknesses instead of being genuinely helpful, including verbosity hacking, sycophancy, and structured mitigation strategies.
Reward Models and Verifiers
Reward models trained on human preferences vs verifiers that check output correctness. Bradley-Terry models, process vs outcome rewards, Goodhart's law, and why verifiers are more robust.
Verifier Design and Process Reward
Detailed treatment of verifier types, process vs outcome reward models, verifier-guided search, self-verification, and the connection to test-time compute scaling. How to design reward signals for reasoning models.
Reinforcement Learning Theory
MDPs, Bellman, policy gradients, multi-agent, game theory.
Kalman Filter
Optimal state estimation for linear Gaussian systems via recursive prediction and update steps using the Kalman gain.
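The predict-update cycle is clearest in the scalar case — a 1-D sketch for a constant-state model with assumed process variance `q` and measurement variance `r`:

```python
import random

def kalman_1d(z_seq, q=1e-4, r=0.1, x0=0.0, p0=1.0):
    """Scalar Kalman filter for x_t = x_{t-1} + process noise, z_t = x_t + meas. noise."""
    x, p = x0, p0
    estimates = []
    for z in z_seq:
        p = p + q                      # predict: variance grows by process noise
        k = p / (p + r)                # Kalman gain: trust in the new measurement
        x = x + k * (z - x)            # update with the measurement residual
        p = (1 - k) * p                # posterior variance shrinks
        estimates.append(x)
    return estimates

# noisy measurements of a constant value 5.0 (toy data)
rng = random.Random(0)
zs = [5.0 + rng.gauss(0, 0.3) for _ in range(200)]
est = kalman_1d(zs, r=0.09)            # r matches the measurement variance 0.3^2
```

The gain `k` is the whole story: large prediction uncertainty pulls the estimate toward the measurement, large measurement noise pulls it toward the prediction.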
Markov Decision Processes
The mathematical framework for sequential decision-making under uncertainty: states, actions, transitions, rewards, and the Bellman equations that make solving them possible.
Policy Gradient Theorem
The fundamental result enabling gradient-based optimization of parameterized policies: the policy gradient theorem and the algorithms it spawns.
Value Iteration and Policy Iteration
The two foundational algorithms for solving MDPs exactly: value iteration applies the Bellman optimality operator until convergence, while policy iteration alternates between exact evaluation and greedy improvement.
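Value iteration is compact enough to show whole — a sketch on a hypothetical two-state MDP, with `P[s][a]` a list of `(prob, next_state)` pairs and `R[s][a]` the immediate reward:

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Apply the Bellman optimality operator until the value function stops changing."""
    n = len(P)
    V = [0.0] * n
    while True:
        V_new = [max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                     for a in range(len(P[s])))
                 for s in range(n)]
        if max(abs(a - b) for a, b in zip(V, V_new)) < tol:
            return V_new
        V = V_new

# state 0: stay (reward 0) or move to state 1 (reward 1); state 1 is absorbing, reward 0
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 1)], [(1.0, 1)]]]
R = [[0.0, 1.0],
     [0.0, 0.0]]
V = value_iteration(P, R, gamma=0.9)   # V[0] = 1, V[1] = 0
```

Convergence follows from the operator being a gamma-contraction in the sup norm; policy iteration trades these cheap sweeps for exact evaluation of each intermediate policy.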
Actor-Critic Methods
The dominant paradigm for deep RL and LLM training: an actor (policy network) guided by a critic (value network), with advantage estimation, PPO clipping, and entropy regularization.
Agentic RL and Tool Use
The shift from passive sequence generation to autonomous multi-turn decision making. LLMs as RL policies, tool use as actions, ReAct, AgentRL, and why agentic RL differs from chat RLHF.
Bayesian State Estimation
The filtering problem: recursively estimate a hidden state from noisy observations using predict-update cycles. Kalman filter for linear Gaussian systems, particle filters for the general case.
Exploration vs Exploitation
The fundamental tradeoff in sequential decision-making: exploit known good actions to collect reward now, or explore uncertain actions to discover potentially better strategies. Epsilon-greedy, Boltzmann exploration, UCB, count-based methods, and intrinsic motivation.
GraphSLAM and Factor Graphs
SLAM as graph optimization: poses as nodes, constraints as edges, factor graph representation, MAP estimation via nonlinear least squares, and the sparsity structure that makes large-scale mapping tractable.
Markov Games and Self-Play
Multi-agent extensions of MDPs where multiple agents with separate rewards interact. Nash equilibria, minimax values in zero-sum games, and self-play as a training method.
Minimax and Saddle Points
Minimax theorems characterize when max-min equals min-max. Saddle points arise in zero-sum games, duality theory, GANs, and robust optimization.
Multi-Agent Collaboration
Multiple LLM agents working together on complex tasks: debate for improving reasoning, division of labor across specialist agents, structured communication protocols, and when multi-agent outperforms single-agent systems.
Multi-Armed Bandits Theory
The exploration-exploitation tradeoff formalized: K arms, regret as the cost of not knowing the best arm, and algorithms (UCB, Thompson sampling) that achieve near-optimal regret bounds.
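The UCB1 rule — pull the arm maximizing empirical mean plus sqrt(2 ln t / n_pulls) — can be sketched directly (the `ucb1` helper and Bernoulli arms are illustrative):

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """UCB1: optimism in the face of uncertainty via a confidence bonus."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                        # initialize by playing each arm once
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        counts[arm] += 1
        sums[arm] += pull(arm)
    return counts

# Bernoulli arms with means 0.2 and 0.8; UCB concentrates pulls on the better arm
rng = random.Random(0)
counts = ucb1(lambda a: 1.0 if rng.random() < [0.2, 0.8][a] else 0.0,
              n_arms=2, horizon=2000)
```

The bonus shrinks as an arm is pulled, so suboptimal arms are revisited only logarithmically often — the mechanism behind the near-optimal regret bounds this entry covers.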
No-Regret Learning
Online learning against adversarial losses: regret as cumulative loss minus the best fixed action in hindsight, multiplicative weights, follow the regularized leader, and why no-regret dynamics converge to Nash equilibria in zero-sum games.
Offline Reinforcement Learning
Learning policies from a fixed dataset without environment interaction: distributional shift as the core challenge, conservative Q-learning (CQL) as the standard fix, and Decision Transformer as an alternative sequence modeling approach.
Online Learning and Bandits
Sequential decision making with adversarial or stochastic feedback: the bandit setting, explore-exploit tradeoff, UCB, Thompson sampling, and regret bounds. Connections to RL and A/B testing.
Policy Optimization: PPO and TRPO
Trust region and clipped surrogate methods for stable policy optimization: TRPO enforces a KL constraint for monotonic improvement, PPO replaces it with a simpler clipped objective, and PPO became the default optimizer for LLM post-training.
Policy Representations
How to parameterize policies in reinforcement learning: categorical for discrete actions, Gaussian for continuous actions, and why the choice affects gradient variance and exploration.
Q-Learning
Model-free, off-policy value learning: the Q-learning update rule, convergence under Robbins-Monro conditions, and the deep Q-network revolution that introduced function approximation, experience replay, and the deadly triad.
Self-Play and Multi-Agent RL
Self-play as a training paradigm for competitive games, fictitious play convergence, AlphaGo/AlphaZero, and the challenges of multi-agent reinforcement learning: non-stationarity, partial observability, and centralized training.
Temporal Difference Learning
Temporal difference methods bootstrap value estimates from other value estimates, enabling online, incremental learning without waiting for episode termination. TD(0), SARSA, and TD(lambda) with eligibility traces.
Active SLAM and POMDPs
Choosing robot actions to simultaneously map an environment and localize, formulated as a partially observable Markov decision process over belief states.
Agent Protocols: MCP and A2A
The protocol layer for AI agents: MCP (Model Context Protocol) for tool access, A2A (Agent-to-Agent) for inter-agent communication, and why standardized interfaces matter for the agent ecosystem.
Mean-Field Games
The many-agent limit of strategic interactions: as the number of agents goes to infinity, each agent solves an MDP against the population distribution, and equilibrium becomes a fixed-point condition on the mean field.
Options and Temporal Abstraction
The options framework for hierarchical RL: temporally extended actions with initiation sets, internal policies, and termination conditions. Semi-MDPs and learning options end-to-end.
Particle Filters
Sequential Monte Carlo: represent the posterior over hidden states as a set of weighted particles, propagate through dynamics, reweight by likelihood, and resample to combat degeneracy.
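The propagate/reweight/resample loop described above can be sketched for a toy 1D model: a random-walk state observed under Gaussian noise (an illustrative choice of dynamics and likelihood, not a general-purpose filter):

```python
import math
import random

def particle_filter_step(particles, weights, obs, rng,
                         process_std=0.5, obs_std=1.0):
    """One SMC step for a 1D random-walk state-space model:
    propagate through dynamics, reweight by Gaussian likelihood,
    then resample to combat weight degeneracy."""
    particles = [x + rng.gauss(0.0, process_std) for x in particles]       # propagate
    weights = [w * math.exp(-0.5 * ((obs - x) / obs_std) ** 2)             # reweight
               for w, x in zip(weights, particles)]
    total = sum(weights)
    weights = [w / total for w in weights]
    particles = rng.choices(particles, weights=weights, k=len(particles))  # resample
    return particles, [1.0 / len(particles)] * len(particles)

# Track a stationary true state at 3.0 from noisy observations.
rng = random.Random(0)
n = 500
particles = [rng.uniform(-10.0, 10.0) for _ in range(n)]
weights = [1.0 / n] * n
for _ in range(50):
    particles, weights = particle_filter_step(particles, weights,
                                              3.0 + rng.gauss(0.0, 1.0), rng)
est = sum(particles) / n
```

Multinomial resampling after every step is the simplest scheme; practical filters resample only when the effective sample size drops, and use lower-variance schemes such as systematic resampling.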
Reinforcement Learning Environments and Benchmarks
The standard RL evaluation stack: Gymnasium API, classic control tasks, Atari, MuJoCo, ProcGen, the sim-to-real gap, and why benchmark performance is a poor predictor of real-world RL capability.
Robust Adversarial Policies
Robust MDPs optimize against worst-case transition dynamics within an uncertainty set. Adversarial policies formalize distribution shift in RL as a game between agent and environment.
Visual and Semantic SLAM
Replacing laser range finders with cameras for SLAM, and enriching maps with semantic labels to improve data association and planning.
Beyond LLMs
JEPA, world models, vision-first AI, diffusion, state-space models.
CLIP and OpenCLIP in Practice
CLIP learns a shared embedding space for images and text via contrastive learning on 400M pairs. Practical guide to zero-shot classification, image search, OpenCLIP variants, embedding geometry, and known limitations.
Diffusion Models
Generative models that learn to reverse a noise-adding process: the math of score matching, denoising, SDEs, and why diffusion dominates image generation.
Equilibrium and Implicit-Layer Models
Deep Equilibrium Models (DEQ) replace explicit depth with a fixed-point equation: instead of stacking L layers, solve for the equilibrium state where one more layer would not change the output. This enables infinite-depth networks with constant memory, using implicit differentiation for backprop.
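The fixed-point idea is easy to see numerically. A toy sketch that solves z* = tanh(W z* + x) by plain iteration; a real DEQ would use Anderson or Broyden acceleration and backprop implicitly through the equilibrium, and the contraction setup here is an illustrative assumption:

```python
import numpy as np

def deq_forward(W, x, tol=1e-8, max_iter=500):
    """Naive fixed-point iteration for z* = tanh(W z* + x): applying the
    'layer' once more at z* leaves the output unchanged."""
    z = np.zeros_like(x)
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
W *= 0.5 / np.linalg.norm(W, 2)   # spectral norm 0.5: tanh(Wz + x) is a contraction
x = rng.standard_normal(8)
z_star = deq_forward(W, x)
```

Because only z* is needed for the implicit backward pass, memory is constant in the effective depth, which is the advantage the blurb above points to.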
Equivariant Deep Learning
Networks that respect symmetry: if the input transforms under a group action, the output transforms predictably. Equivariance generalizes translation equivariance in CNNs to rotations, permutations, and gauge symmetries, reducing sample complexity and improving generalization on structured data.
Florence and Vision Foundation Models
Vision foundation models that unify classification, detection, segmentation, captioning, and VQA under a single pretrained backbone. Florence, Florence-2, and the path toward GPT-4V-style multimodal understanding.
Flow Matching
Learn a velocity field that transports noise to data along straight-line paths. Simpler training than diffusion, faster sampling, and cleaner math.
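The straight-line objective is short enough to write out. A sketch of the conditional flow-matching loss under rectified-flow (linear) paths; `v_theta` stands in for any velocity network and the function name is illustrative:

```python
import numpy as np

def flow_matching_loss(v_theta, x0, x1, rng):
    """Conditional flow matching with straight-line paths: interpolate
    x_t = (1 - t) x0 + t x1, and regress the velocity field at (x_t, t)
    onto the constant displacement x1 - x0."""
    t = rng.uniform(0.0, 1.0, size=(x0.shape[0], 1))
    x_t = (1.0 - t) * x0 + t * x1
    return float(np.mean((v_theta(x_t, t) - (x1 - x0)) ** 2))
```

An oracle field that always returns x1 - x0 achieves zero loss, and sampling reduces to integrating the learned field from noise at t = 0 to data at t = 1, which is where the faster-sampling claim comes from.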
JEPA and Joint Embedding
LeCun's Joint Embedding Predictive Architecture: learning by predicting abstract representations rather than pixels, energy-based formulation, I-JEPA and V-JEPA implementations, and the connection to contrastive learning and world models.
Mamba and State-Space Models
Linear-time sequence modeling via structured state spaces: S4, HiPPO initialization, selective state-space models (Mamba), and the architectural fork from transformers.
Neural ODEs and Continuous-Depth Networks
Treating neural network depth as a continuous variable: the ODE formulation of residual networks, the adjoint method for memory-efficient backpropagation, connections to dynamical systems theory, and practical limitations.
Self-Supervised Vision
Learning visual representations without labels: contrastive methods (SimCLR, MoCo), self-distillation (DINO/DINOv2), and masked image modeling (MAE). Why self-supervised vision matters for transfer learning and label-scarce domains.
Test-Time Training and Adaptive Inference
Updating model parameters at inference time using self-supervised objectives on the test input itself. TTT layers replace fixed linear recurrences (as in Mamba) with learned update rules, blurring the boundary between training and inference.
Video World Models
Turning pretrained video diffusion models into interactive world simulators: condition on actions to generate future frames, enabling RL agent training, robot planning, and game AI without physical environments.
Vision Transformer Lineage
The evolution of visual representation learning: from CNNs (AlexNet, ResNet) to ViT (pure attention for images), Swin (hierarchical attention), and DINOv2 (self-supervised ViT with self-distillation), with connections to CLIP.
World Models and Planning
Learning a model of the environment and planning inside it: Dreamer's latent dynamics, MuZero's learned model with MCTS, video world models, and why planning in imagination is the path to sample-efficient and safe AI.
Audio Language Models
Models that process and generate speech alongside text: audio tokenization, Whisper for transcription, end-to-end voice models, music generation, and the audio-language frontier.
Continuous Thought Machines
Neural networks that process information through continuous-time internal dynamics rather than discrete layer-by-layer computation. Inspired by neural ODEs and dynamical systems, these architectures let the network 'think' for a variable amount of internal time before producing an output.
3D Gaussian Splatting
Represent a 3D scene as millions of 3D Gaussians, each with position, covariance, opacity, and color. Render by projecting to 2D and alpha-compositing. Real-time, high-quality novel view synthesis without neural networks at render time.
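The compositing step described above is a front-to-back accumulation along each ray. A minimal sketch for one ray of already depth-sorted, already projected Gaussians (projection and 2D covariance handling omitted):

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing along one ray:
    C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j),
    where the product is the transmittance remaining in front of Gaussian i."""
    C = np.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        C += transmittance * a * np.asarray(c, dtype=float)
        transmittance *= (1.0 - a)
    return C
```

Because each Gaussian's contribution is weighted by the transmittance left over from those in front of it, an opaque splat occludes everything behind it, and the whole render is differentiable with respect to positions, covariances, opacities, and colors.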
Occupancy Networks and Neural Fields
Representing 3D geometry and appearance as continuous functions parameterized by neural networks: NeRF, occupancy networks, DeepSDF, volume rendering, and the connection to Gaussian splatting.
World Model Evaluation
How to measure whether a learned world model is useful: prediction accuracy, controllability, sim-to-real transfer, planning quality, and why long-horizon evaluation is hard.