Curriculum
The Full Theory Library
Every concept organized by depth layer and module. Layer 0 is foundations. Layer 5 is applied systems. Every topic links down to its prerequisites until you hit axioms.
Foundations (Layer 0A)
Axioms, definitions, and notation. The bedrock everything else depends on.
Common Inequalities
The algebraic and probabilistic inequalities that appear everywhere in ML theory: Cauchy-Schwarz, Jensen, AM-GM, Hölder, Minkowski, Young, Markov, and Chebyshev.
Common Probability Distributions
The essential probability distributions for ML: Bernoulli through Dirichlet, with PMFs/PDFs, moments, relationships, and when each arises in practice.
Compactness and Heine-Borel
Sequential compactness, the Heine-Borel theorem in finite dimensions, the extreme value theorem, and why compactness is the key assumption in optimization.
Computability Theory
What can be computed? Turing machines, decidability, the Church-Turing thesis, recursive and recursively enumerable sets, reductions, Rice's theorem, and connections to learning theory.
Continuity in R^n
Epsilon-delta continuity, uniform continuity, and Lipschitz continuity in Euclidean space. Lipschitz constants control how fast function values change and appear throughout optimization and generalization theory.
Differentiation in R^n
Partial derivatives, the gradient, directional derivatives, the total derivative (Fréchet), and the multivariable chain rule. Why the gradient points in the steepest ascent direction, and why this matters for all of optimization.
Eigenvalues and Eigenvectors
Eigenvalues and eigenvectors: the directions a matrix scales without rotating. Characteristic polynomial, diagonalization, the spectral theorem for symmetric matrices, and the direct connection to PCA.
Expectation, Variance, Covariance, and Moments
Expectation, variance, covariance, correlation, linearity of expectation, variance of sums, and moment-based reasoning in ML.
Exponential Function Properties
The exponential function e^x: series definition, algebraic properties, and why it appears everywhere in ML. Softmax, MGFs, the Chernoff method, Boltzmann distributions, and exponential families all reduce to properties of exp.
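The shift identity exp(x_i - c) / sum_j exp(x_j - c) = exp(x_i) / sum_j exp(x_j) is also what makes softmax computable at large logits. A minimal Python sketch (function name and toy logits are illustrative):

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtracting the max uses the identity
    exp(x_i - c) / sum_j exp(x_j - c) == exp(x_i) / sum_j exp(x_j)."""
    c = max(logits)
    exps = [math.exp(x - c) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Without the shift, exp(1000) overflows a float; with it, the result is fine.
probs = softmax([1000.0, 1000.0, 999.0])
```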
Gram Matrices and Kernel Matrices
The Gram matrix G_{ij} = <x_i, x_j> encodes pairwise inner products of a dataset. Always PSD. Appears in kernel methods, PCA, SVD, and attention. Understanding it connects linear algebra to ML.
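A quick numerical check of the PSD claim, using NumPy (the 5x3 toy dataset is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 data points in R^3

G = X @ X.T                   # Gram matrix: G[i, j] = <x_i, x_j>

# G is symmetric and PSD: all eigenvalues >= 0 up to floating-point noise
# (rank is at most 3 here, so two eigenvalues sit at ~0).
eigvals = np.linalg.eigvalsh(G)
```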
Inner Product Spaces and Orthogonality
Inner product axioms, Cauchy-Schwarz inequality, orthogonality, Gram-Schmidt, projections, and the bridge to Hilbert spaces.
Joint, Marginal, and Conditional Distributions
Joint distributions, marginalization, conditional distributions, independence, Bayes' theorem, and the chain rule of probability.
KL Divergence
Kullback-Leibler divergence measures how one probability distribution differs from another. Asymmetric, always non-negative, and central to variational inference, MLE, and RLHF.
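The asymmetry and non-negativity are easy to see numerically for discrete distributions; a small illustrative sketch:

```python
import math

def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions
    (terms with p_i = 0 contribute nothing by convention)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl(p, q)   # KL(p || q)
reverse = kl(q, p)   # KL(q || p): generally a different number
```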
Matrix Norms
Frobenius, spectral, and nuclear norms for matrices. Submultiplicativity. When and why each norm appears in ML theory.
Matrix Operations and Properties
Essential matrix operations for ML: trace, determinant, inverse, pseudoinverse, Schur complement, and the Sherman-Morrison-Woodbury formula. When and why each matters.
Metric Spaces, Convergence, and Completeness
Metric space axioms, convergence of sequences, Cauchy sequences, completeness, and the Banach fixed-point theorem.
Numerical Stability and Conditioning
Continuous math becomes real only through finite-precision approximation. Condition numbers, backward stability, catastrophic cancellation, and why theorems about reals do not transfer cleanly to floating-point.
Positive Semidefinite Matrices
PSD matrices: equivalent characterizations, Cholesky decomposition, Schur complement, and Loewner ordering. Covariance matrices are PSD. Hessians of convex functions are PSD. These facts connect linear algebra to optimization and statistics.
Sets, Functions, and Relations
The language underneath all of mathematics: sets, Cartesian products, functions, injectivity, surjectivity, equivalence relations, and quotient sets.
Singular Value Decomposition
The SVD A = U Sigma V^T: the most important matrix factorization in applied mathematics. Geometric interpretation, relationship to eigendecomposition, low-rank approximation via Eckart-Young, and applications from PCA to pseudoinverses.
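The Eckart-Young statement can be verified directly with NumPy: the Frobenius error of the rank-k truncation equals the root-sum-square of the discarded singular values (the toy matrix is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                        # target rank
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation

# Eckart-Young: Frobenius error of the truncation is exactly
# sqrt(sum of the discarded singular values squared).
err = np.linalg.norm(A - A_k, "fro")
predicted = np.sqrt(np.sum(s[k:] ** 2))
```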
Skewness, Kurtosis, and Higher Moments
Distribution shape beyond mean and variance: skewness measures tail asymmetry, kurtosis measures tail heaviness, cumulants are the cleaner language, and heavy-tailed distributions break all of these.
Taylor Expansion
Taylor approximation in one and many variables. Every optimization algorithm is a Taylor approximation: gradient descent uses first order, Newton's method uses second order.
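A quick numerical check of the order hierarchy, using exp (which is its own derivative); the step size and helper names are illustrative:

```python
import math

def first_order(f, df, x0, h):
    """First-order Taylor step: f(x0) + f'(x0) * h, error O(h^2)."""
    return f(x0) + df(x0) * h

def second_order(f, df, d2f, x0, h):
    """Second-order Taylor step, error O(h^3)."""
    return f(x0) + df(x0) * h + 0.5 * d2f(x0) * h * h

f, df, d2f = math.exp, math.exp, math.exp   # exp equals all its derivatives
x0, h = 0.0, 0.1

true_val = f(x0 + h)
e1 = abs(true_val - first_order(f, df, x0, h))
e2 = abs(true_val - second_order(f, df, d2f, x0, h))
```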
Tensors and Tensor Operations
What a tensor actually is: a multilinear map with specific transformation rules, how tensor contraction generalizes matrix multiplication, Einstein summation, tensor decompositions, and how ML frameworks use the word tensor to mean multidimensional array.
Vectors, Matrices, and Linear Maps
Vector spaces, linear maps, matrix representation, rank, nullity, and the rank-nullity theorem. The algebraic backbone of ML.
Basic Logic and Proof Techniques
The fundamental proof strategies used throughout mathematics: direct proof, contradiction, contrapositive, induction, construction, and cases. Required vocabulary for reading any theorem.
Cantor's Theorem and Uncountability
Cantor's diagonal argument proves the reals are uncountable. The power set of any set has strictly greater cardinality. These results are the origin of the distinction between countable and uncountable infinity.
Cardinality and Countability
Two sets have the same cardinality when a bijection exists between them. The naturals, integers, and rationals are countable. The reals are uncountable, proved by Cantor's diagonal argument.
Counting and Combinatorics
Counting principles, binomial and multinomial coefficients, inclusion-exclusion, and Stirling's approximation. These tools appear whenever you count hypotheses, bound shattering coefficients, or analyze combinatorial arguments in learning theory.
Cramér-Wold Theorem
A multivariate distribution is uniquely determined by all of its one-dimensional projections. This reduces multivariate convergence in distribution to checking univariate projections, and is the standard tool for proving the multivariate CLT.
Integration and Change of Variables
Riemann integration, improper integrals, the substitution rule, multivariate change of variables via the Jacobian determinant, and Fubini's theorem. The computational backbone of probability and ML.
Inverse and Implicit Function Theorem
The inverse function theorem guarantees local invertibility when the Jacobian is nonsingular. The implicit function theorem guarantees that constraint surfaces are locally graphs. Both are essential for constrained optimization and implicit layers.
Markov Chains and Steady State
Markov chains: the Markov property, transition matrices, stationary distributions, irreducibility, aperiodicity, the ergodic theorem, and mixing time. The backbone of PageRank, MCMC, and reinforcement learning.
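A minimal sketch of reaching steady state by repeated transitions, for an illustrative two-state chain whose stationary distribution works out to (5/6, 1/6):

```python
# Two-state chain: P[i][j] = probability of moving from state i to state j.
P = [[0.9, 0.1],
     [0.5, 0.5]]

pi = [0.5, 0.5]                        # any starting distribution works
for _ in range(200):                   # pi <- pi P, repeated until it stops moving
    pi = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]

# The fixed point solves pi = pi P; for this chain that is pi = (5/6, 1/6).
```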
Moment Generating Functions
The moment generating function M(t) = E[e^{tX}] encodes all moments of a distribution. The Chernoff method, sub-Gaussian bounds, and exponential family theory all reduce to MGF conditions.
SAT, SMT, and Automated Reasoning
SAT solvers decide Boolean satisfiability (NP-complete). SMT solvers extend SAT with theories like arithmetic and arrays. These tools verify constraints, discharge proof obligations, and complement LLMs in AI agent pipelines.
Sequences and Series of Functions
Pointwise vs uniform convergence of function sequences, the Weierstrass M-test, and why uniform convergence preserves continuity. The concept that makes learning theory work.
Signals and Systems for ML
Linear time-invariant systems, convolution, Fourier transform, and the sampling theorem. The signal processing foundations that underpin CNNs, efficient attention, audio ML, and frequency-domain analysis of training dynamics.
Formal Languages and Automata
Regular languages, context-free grammars, pushdown automata, the Chomsky hierarchy, pumping lemmas, and connections to parsing, neural sequence models, and computational complexity.
Mathematical Infrastructure (Layer 0B)
Serious math machinery: measure theory, functional analysis, convex duality.
Convex Duality
Fenchel conjugates, the Fenchel-Moreau theorem, weak and strong duality, KKT conditions, and why duality gives the kernel trick for SVMs, connects regularization to constraints, and enables adversarial formulations in DRO.
Measure-Theoretic Probability
The rigorous foundations of probability: sigma-algebras, measures, measurable functions as random variables, Lebesgue integration, and the convergence theorems that make modern probability and statistics possible.
Radon-Nikodym and Conditional Expectation
The Radon-Nikodym theorem: what 'density' really means. Absolute continuity, the Radon-Nikodym derivative, conditional expectation as a projection, tower property, and why this undergirds likelihood ratios, importance sampling, and KL divergence.
Functional Analysis Core
The four pillars of functional analysis: Hahn-Banach (extending functionals), Uniform Boundedness (pointwise bounded implies uniformly bounded), Open Mapping (surjective bounded operators have open images), and Banach-Alaoglu (dual unit ball is weak-* compact). These underpin RKHS theory, optimization in function spaces, and duality.
Information Theory Foundations
The core of information theory for ML: entropy, cross-entropy, KL divergence, mutual information, data processing inequality, and the chain rules that connect them. The language of variational inference, generalization bounds, and representation learning.
Itô's Lemma
The chain rule of stochastic calculus: if X_t follows an SDE, then f(X_t) follows a modified SDE with an extra second-order correction term that has no analogue in ordinary calculus.
Martingale Theory
Martingales and their convergence properties: Doob martingale, optional stopping theorem, martingale convergence, Azuma-Hoeffding inequality, and Freedman inequality. The tools behind McDiarmid's inequality, online learning regret bounds, and stochastic approximation.
Information Geometry
Riemannian geometry on the space of probability distributions: the Fisher information metric, natural gradient descent, exponential families as dually flat manifolds, and the connection to mirror descent.
Spectral Theory of Operators
Spectral theorem for compact self-adjoint operators on Hilbert spaces: every such operator has a countable orthonormal eigenbasis with real eigenvalues accumulating only at zero. This is the infinite-dimensional backbone of PCA, kernel methods, and neural tangent kernel theory.
Stochastic Calculus for ML
Brownian motion, Itô integrals, Itô's lemma, and stochastic differential equations: the mathematical machinery behind diffusion models, score-based generative models, and Langevin dynamics.
Statistical Estimation (Layer 0B)
MLE, Fisher information, Cramér-Rao, LLN, CLT — the estimation core.
Central Limit Theorem
The CLT: the sample mean is approximately Gaussian for large n, regardless of the original distribution. Berry-Esseen rate, multivariate CLT, and why CLT explains asymptotic normality of MLE, confidence intervals, and the ubiquity of the Gaussian.
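An illustrative simulation: standardized sample means of Exponential(1) draws (a decidedly non-Gaussian distribution) land close to N(0, 1), so roughly 95% fall within +/- 1.96. Sample sizes and trial counts are arbitrary:

```python
import random
import statistics

random.seed(0)
n, trials = 200, 2000

# Standardized means: sqrt(n) * (mean - mu) / sigma, with mu = sigma = 1
# for the Exponential(1) distribution.
z = []
for _ in range(trials):
    m = statistics.fmean(random.expovariate(1.0) for _ in range(n))
    z.append((m - 1.0) * n ** 0.5)

coverage = sum(abs(v) < 1.96 for v in z) / trials   # should be near 0.95
```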
Cramér-Rao Bound
The fundamental lower bound on the variance of any unbiased estimator: no unbiased estimator can have variance smaller than the reciprocal of the Fisher information.
Fisher Information
The Fisher information quantifies how much a sample tells you about an unknown parameter: it measures the curvature of the log-likelihood, sets the Cramér-Rao lower bound on estimator variance, and serves as a natural Riemannian metric on parameter space.
Law of Large Numbers
The weak and strong laws of large numbers: the sample mean converges to the population mean. Kolmogorov's conditions, the rate of convergence from the CLT, and why LLN justifies using empirical risk as a proxy for population risk.
Maximum Likelihood Estimation
MLE: find the parameter that maximizes the likelihood of observed data. Consistency, asymptotic normality, Fisher information, Cramér-Rao efficiency, and when MLE fails.
Shrinkage Estimation and the James-Stein Estimator
In three or more dimensions, the sample mean is inadmissible for estimating a multivariate normal mean. The James-Stein estimator shrinks toward zero and dominates the MLE in total MSE, a result that shocked the statistics world.
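The domination is easy to reproduce in simulation; a sketch using the positive-part James-Stein estimator, with an illustrative true mean and arbitrary constants:

```python
import random

random.seed(0)
d, trials = 10, 500
theta = [1.0] * d                      # true mean of a N(theta, I) observation

mse_mle = mse_js = 0.0
for _ in range(trials):
    x = [random.gauss(t, 1.0) for t in theta]     # the MLE is x itself
    s = sum(v * v for v in x)
    shrink = max(0.0, 1.0 - (d - 2) / s)          # positive-part James-Stein
    mse_mle += sum((xi - t) ** 2 for xi, t in zip(x, theta))
    mse_js += sum((shrink * xi - t) ** 2 for xi, t in zip(x, theta))

mse_mle /= trials   # close to d = 10, the risk of the MLE
mse_js /= trials    # strictly smaller: shrinkage dominates for d >= 3
```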
Bayesian Estimation
The Bayesian approach to parameter estimation: encode prior beliefs, update with data via Bayes rule, and obtain a full posterior distribution over parameters. Conjugate priors, MAP estimation, and the Bernstein-von Mises theorem showing that the posterior concentrates around the true parameter.
Goodness-of-Fit Tests
KS test, Shapiro-Wilk, Anderson-Darling, and the probability integral transform: testing whether data follows a specified distribution, with exact power comparisons and when each test fails.
Method of Moments
Match sample moments to population moments to estimate parameters. Simpler than MLE but less efficient. Covers classical MoM, generalized method of moments (GMM), and when MoM is the better choice.
Sufficient Statistics and Exponential Families
Sufficient statistics compress data without losing information about the parameter. The Neyman-Fisher factorization theorem, exponential families, completeness, and Rao-Blackwell improvement of estimators.
Asymptotic Statistics
The large-sample toolbox: delta method, Slutsky's theorem, asymptotic normality of MLE, local asymptotic normality, and Fisher efficiency. These results justify nearly every confidence interval and hypothesis test used in practice.
Basu's Theorem
A complete sufficient statistic is independent of every ancillary statistic. This provides the cleanest method for proving independence between statistics without computing joint distributions.
Learning Theory Core (Layer 1-2)
ERM, uniform convergence, VC dimension, Rademacher complexity.
Algorithmic Stability
Algorithmic stability provides generalization bounds by analyzing how much a learning algorithm's output changes when a single training example is replaced: a structurally different lens from complexity-based approaches.
Empirical Risk Minimization
The foundational principle of statistical learning: minimize average loss on training data as a proxy for minimizing true population risk.
Hypothesis Classes and Function Spaces
What is a hypothesis class, why the choice of hypothesis class determines what ERM can learn, and the approximation-estimation tradeoff: bigger classes reduce approximation error but increase estimation error.
PAC Learning Framework
The foundational formalization of what it means to learn from data: a concept is PAC-learnable if an algorithm can, with high probability, find a hypothesis that is approximately correct, using a polynomial number of samples.
Rademacher Complexity
A data-dependent measure of hypothesis class complexity that gives tighter generalization bounds than VC dimension by measuring how well the class can fit random noise on the actual data.
Sample Complexity Bounds
How many samples do you need to learn? Tight answers for finite hypothesis classes, VC classes, and Rademacher-bounded classes, plus matching lower bounds via Fano and Le Cam.
Uniform Convergence
Uniform convergence of empirical risk to population risk over an entire hypothesis class: the key property that makes ERM provably work.
VC Dimension
The Vapnik-Chervonenkis dimension: a combinatorial measure of hypothesis class complexity that characterizes learnability in binary classification.
Kolmogorov Complexity and MDL
Kolmogorov complexity measures the shortest program that produces a string. The Minimum Description Length principle selects models that compress data best, providing a computable approximation to an incomputable ideal.
Concentration & Probability (Layer 1-3)
Hoeffding through matrix Bernstein. The workhorse inequality family.
Chernoff Bounds
The Chernoff method: the universal technique for deriving exponential tail bounds by optimizing over the moment generating function, yielding the tightest possible exponential concentration.
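A sketch of the method for a Binomial(100, 0.5) upper tail: minimize the MGF bound over a grid of t and compare against the exact tail probability (grid and numbers are illustrative):

```python
import math

n, p, a = 100, 0.5, 75          # bound P(S >= 75) for S ~ Binomial(100, 0.5)

# Chernoff: P(S >= a) <= min_t exp(-t*a) * E[e^{tS}],
# where E[e^{tS}] = (1 - p + p * e^t)^n for a Binomial sum.
bound = min(
    math.exp(-t * a) * (1 - p + p * math.exp(t)) ** n
    for t in [k / 1000 for k in range(1, 3000)]
)

# Exact tail for comparison: the bound is valid but not tight.
exact = sum(math.comb(n, k) for k in range(a, n + 1)) / 2 ** n
```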
Concentration Inequalities
Bounds on how far random variables deviate from their expectations: Markov, Chebyshev, Hoeffding, and Bernstein. Used throughout generalization theory, bandits, and sample complexity.
Epsilon-Nets and Covering Numbers
Discretizing infinite sets for concentration arguments: epsilon-nets, covering numbers, packing numbers, the Dudley integral, and the connection to Rademacher complexity.
Matrix Concentration
Matrix Bernstein, Matrix Hoeffding, Weyl's inequality, and Davis-Kahan: the operator-norm concentration tools needed for covariance estimation, dimensionality reduction, and spectral analysis in high dimensions.
McDiarmid's Inequality
The bounded-differences inequality: if changing one input to a function changes the output by at most c_i, the function concentrates around its mean with sub-Gaussian tails.
Sub-Exponential Random Variables
The distributional class between sub-Gaussian and heavy-tailed: heavier tails than Gaussian, the psi_1 norm, Bernstein condition, and the two-regime concentration bound.
Sub-Gaussian Random Variables
Sub-Gaussian random variables: the precise characterization of 'light-tailed' behavior that underpins every concentration inequality in learning theory.
Symmetrization Inequality
The symmetrization technique: the proof template that connects the generalization gap to Rademacher complexity by introducing a ghost sample and random signs.
Contraction Inequality
The Ledoux-Talagrand contraction principle: composing a function class with an L-Lipschitz function phi satisfying phi(0) = 0 scales Rademacher complexity by at most a factor of L, letting you bound the complexity of the loss class by that of the hypothesis class.
Empirical Processes and Chaining
Bounding the supremum of empirical processes via covering numbers and chaining: Dudley's entropy integral and Talagrand's generic chaining, the sharpest tools in classical learning theory.
Hanson-Wright Inequality
Concentration of quadratic forms X^T A X for sub-Gaussian random vectors: the two-term bound involving the Frobenius norm (Gaussian regime) and the operator norm (exponential regime).
Measure Concentration and Geometric Functional Analysis
High-dimensional geometry is counterintuitive: Lipschitz functions concentrate, random projections preserve distances, and most of a sphere's measure sits near the equator. Johnson-Lindenstrauss, Gaussian concentration, and Lévy's lemma.
Restricted Isometry Property
The restricted isometry property (RIP): when a measurement matrix approximately preserves norms of sparse vectors, enabling exact sparse recovery via L1 minimization. Random Gaussian matrices satisfy RIP with O(s log(n/s)) rows.
Optimization & Function Classes (Layer 1-3)
Convex optimization, regularization, kernels, RKHS.
Convex Optimization Basics
Convex sets, convex functions, gradient descent convergence, strong convexity, and duality: the optimization foundation that every learning-theoretic result silently depends on.
Gradient Descent Variants
From full-batch to stochastic to mini-batch gradient descent, plus momentum, Nesterov acceleration, AdaGrad, RMSProp, and Adam. Why mini-batch SGD with momentum is the practical default.
Gradient Flow and Vanishing Gradients
Why deep networks are hard to train: gradients shrink or explode as they propagate through layers. The Jacobian product chain, sigmoid saturation, ReLU dead neurons, skip connections, normalization, and gradient clipping.
Stochastic Gradient Descent Convergence
SGD convergence rates for convex and strongly convex functions, the role of noise as both curse and blessing, mini-batch variance reduction, learning rate schedules, and the Robbins-Monro conditions.
Bias-Variance Tradeoff
The classical decomposition of mean squared error into bias squared, variance, and irreducible noise. The U-shaped test error curve, why it breaks in modern ML (double descent), and the connection to regularization.
Cross-Validation Theory
The theory behind cross-validation as a model selection tool: LOO-CV, K-fold, the bias-variance tradeoff of the CV estimator itself, and why CV estimates generalization error.
Kernels and Reproducing Kernel Hilbert Spaces
Kernel functions, Mercer's theorem, the RKHS reproducing property, and the representer theorem: the mathematical framework that enables learning in infinite-dimensional function spaces via finite-dimensional computations.
Preconditioned Optimizers: Shampoo, K-FAC, and Natural Gradient
Optimizers that use curvature information to precondition gradients: the natural gradient via Fisher information, K-FAC's Kronecker approximation, and Shampoo's full-matrix preconditioning. How they connect to Riemannian optimization and why they outperform Adam on certain architectures.
Regularization Theory
Why unconstrained ERM overfits and how regularization controls complexity: Tikhonov (L2), sparsity (L1), elastic net, early stopping, dropout, the Bayesian prior connection, and the link to algorithmic stability.
Riemannian Optimization and Manifold Constraints
Optimization on curved spaces: the Stiefel manifold for orthogonal matrices, symmetric positive definite matrices, Riemannian gradient descent, retractions, and why flat-space intuitions break on manifolds. The geometric backbone of Shampoo, Muon, and constrained neural network training.
Stability and Optimization Dynamics
Convergence of gradient descent for smooth and strongly convex objectives, the descent lemma, gradient flow as a continuous-time limit, Lyapunov stability analysis, and the edge of stability phenomenon.
Stochastic Approximation Theory
The Robbins-Monro framework, ODE method, and Polyak-Ruppert averaging: the unified theory behind why SGD, Q-learning, and TD-learning converge.
Statistical Foundations (Layer 2-3)
Minimax, Fano, information-theoretic lower bounds, random matrix theory.
Design-Based vs. Model-Based Inference
Two philosophies of statistical inference from survey data: design-based inference where randomness comes from the sampling design, and model-based inference where randomness comes from a statistical model, with the model-assisted hybrid approach.
Detection Theory
Binary hypothesis testing, the Neyman-Pearson lemma (likelihood ratio tests are most powerful), ROC curves, Bayesian detection, and sequential testing. Classification IS detection theory. ROC/AUC comes directly from here.
Fano Inequality
Fano inequality as the standard tool for information-theoretic lower bounds: if X -> Y -> X_hat forms a Markov chain, the error probability is bounded below in terms of the conditional entropy H(X|Y) and the alphabet size.
High-Dimensional Covariance Estimation
When dimension d is comparable to sample size n, the sample covariance matrix fails. Shrinkage estimators (Ledoit-Wolf), banding and tapering for structured covariance, and Graphical Lasso for sparse precision matrices.
Kernel Two-Sample Tests
Maximum Mean Discrepancy (MMD): a kernel-based nonparametric test for whether two samples come from the same distribution, with unbiased estimation, permutation testing, and applications to GAN evaluation.
Minimax Lower Bounds
Why upper bounds are not enough: minimax risk, Le Cam two-point method, Fano inequality, and Assouad lemma for proving that no estimator can beat a given rate.
Neyman-Pearson and Hypothesis Testing Theory
The likelihood ratio test is the most powerful test for simple hypotheses (Neyman-Pearson lemma), UMP tests extend this to one-sided composites, and the power function characterizes a test's behavior across the parameter space.
Nonresponse and Missing Data
The taxonomy of missingness mechanisms (MCAR, MAR, MNAR), their consequences for inference, and the major correction methods: multiple imputation, inverse probability weighting, and the EM algorithm.
Order Statistics
Order statistics are the sorted values of a random sample. Their distributions govern quantile estimation, confidence intervals for medians, and the behavior of extremes.
Random Matrix Theory Overview
Why the spectra of random matrices matter for ML: Marchenko-Pastur law, Wigner semicircle, spiked models, and their applications to covariance estimation, PCA, and overparameterization.
Robust Statistics and M-Estimators
When data has outliers or model assumptions are wrong, classical estimators break. M-estimators generalize MLE to handle contamination gracefully.
Sample Size Determination
How to compute the number of observations needed to estimate means, proportions, and treatment effects with specified precision and power, including corrections for finite populations and complex designs.
Survey Sampling Methods
The major probability sampling designs used in survey statistics: simple random, stratified, cluster, systematic, multi-stage, and multi-phase sampling, with their variance properties and estimators.
Survival Analysis
Modeling time-to-event data with censoring: Kaplan-Meier curves, hazard functions, and the Cox proportional hazards model.
Copulas
Copulas separate the dependence structure of a multivariate distribution from its marginals. Sklar's theorem guarantees that any joint CDF can be decomposed into marginals and a copula, making dependence modeling modular.
Longitudinal Surveys and Panel Data
Analysis of data where the same units are measured repeatedly over time: fixed effects, random effects, difference-in-differences, and the problems of attrition and time-varying confounding.
Small Area Estimation
Methods for producing reliable estimates in domains where direct survey estimates have too few observations for useful precision, using Fay-Herriot and unit-level models that borrow strength across areas.
Modern Generalization Theory (Layer 3-4)
Implicit bias, double descent, NTK, mean field — where classical theory fails.
Implicit Bias and Modern Generalization
Why classical generalization theory breaks for overparameterized models: the random labels experiment, the interpolation threshold, implicit bias of gradient descent, double descent, and the frontier of understanding why deep learning works.
Benign Overfitting
When interpolation (zero training error) does not hurt generalization: the min-norm interpolator fits noise in harmless directions while preserving signal. Bartlett et al. 2020, effective rank conditions, and why benign overfitting happens in overparameterized but not classical regimes.
Double Descent
Test error follows a double-descent curve: it decreases, peaks at the interpolation threshold, then decreases again in the overparameterized regime, defying classical bias-variance intuition.
Grokking
Models can memorize training data quickly, then generalize much later after continued training. This delayed generalization, called grokking, breaks the assumption that overfitting is a terminal state and connects to weight decay, implicit regularization, and phase transitions in learning.
Lazy vs Feature Learning
The fundamental dichotomy in neural network training: lazy regime (NTK, kernel-like, weights barely move) versus rich/feature learning regime (weights move substantially, representations emerge).
Mean Field Theory
The mean field limit of neural networks: as width goes to infinity under the right scaling, neurons become independent particles whose weight distribution evolves by Wasserstein gradient flow, capturing feature learning that the NTK regime misses.
Neural Network Optimization Landscape
Loss surface geometry of neural networks: saddle points dominate in high dimensions, mode connectivity, flat vs sharp minima, Sharpness-Aware Minimization, and the edge of stability phenomenon.
Neural Tangent Kernel
In the infinite-width limit, neural networks trained with gradient descent behave like kernel regression with a specific kernel, the Neural Tangent Kernel, connecting deep learning to classical kernel theory.
Optimal Transport and Earth Mover's Distance
The Monge and Kantorovich formulations of optimal transport, the linear programming dual, Sinkhorn regularization, and applications to WGANs, domain adaptation, and fairness.
PAC-Bayes Bounds
Generalization bounds that depend on the KL divergence between a learned posterior and a prior over hypotheses. PAC-Bayes gives non-vacuous bounds for overparameterized networks where VC and Rademacher bounds fail.
Representation Learning Theory
What makes a good learned representation: the information bottleneck, contrastive learning, sufficient statistics, rate-distortion theory, and why representation learning is the central unsolved problem of deep learning.
Gaussian Processes for Machine Learning
A distribution over functions specified by a mean and kernel: closed-form posterior predictions with uncertainty, connection to kernel ridge regression, marginal likelihood for model selection, and the cubic cost bottleneck.
Information Bottleneck
The information bottleneck principle: compress the input X into a representation T that preserves information about the target Y. The Lagrangian formulation, connection to deep learning, Shwartz-Ziv and Tishby claims, and why the compression story may not hold for ReLU networks.
Open Problems in ML Theory
A curated list of genuinely open problems in machine learning theory: why overparameterized networks generalize, the right complexity measure for deep learning, feature learning beyond NTK, why scaling laws hold, emergent abilities, transformer-specific theory, and post-training theory.
Sparse Recovery and Compressed Sensing
Recover a sparse signal from far fewer measurements than its ambient dimension: the restricted isometry property, basis pursuit via L1 minimization, random measurement matrices, and applications from MRI to single-pixel cameras.
Wasserstein Distances
The Wasserstein (earth mover's) distance measures the minimum cost of transporting one probability distribution to another, with deep connections to optimal transport, GANs, and distributional robustness.
LLM Construction (Layer 4-5)
Transformer math, attention, KV cache, optimizers, scaling laws, RLHF.
Fine-Tuning and Adaptation
Methods for adapting pretrained models to new tasks: full fine-tuning, feature extraction, LoRA, QLoRA, adapter layers, and prompt tuning. When to use each, and how catastrophic forgetting constrains adaptation.
Hallucination Theory
Why large language models confabulate, the mathematical frameworks for understanding when model outputs are unreliable, and what current theory says about mitigation.
Optimizer Theory: SGD, Adam, and Muon
Convergence theory of SGD (convex and strongly convex), momentum methods (Polyak and Nesterov), Adam as adaptive + momentum, why SGD can generalize better, the Muon optimizer, and learning rate schedules.
Reinforcement Learning from Human Feedback: Deep Dive
The full RLHF pipeline: supervised fine-tuning, Bradley-Terry reward modeling, PPO with KL penalty, reward hacking via Goodhart, and the post-RLHF landscape of DPO, GRPO, and RLVR.
Attention Mechanism Theory
Mathematical formulation of attention: scaled dot-product attention as soft dictionary lookup, why sqrt(d_k) scaling prevents softmax saturation, multi-head attention, and the connection to kernel methods.
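A minimal NumPy rendering of the formula softmax(Q K^T / sqrt(d_k)) V (shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) logits
    scores -= scores.max(axis=-1, keepdims=True)    # stability shift
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries, d_k = 4
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 2))   # 5 values, d_v = 2
out, W = attention(Q, K, V)
```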
Attention Mechanisms History
The evolution of attention from Bahdanau (2014) additive alignment to Luong dot-product attention to self-attention in transformers. How attention solved the fixed-length bottleneck of seq2seq models.
Attention Sinks and Retrieval Decay
Why transformers disproportionately attend to initial tokens (attention sinks), how StreamingLLM exploits this for infinite-length inference, and how retrieval accuracy degrades with distance and position within the context window.
Attention Variants and Efficiency
Multi-head, multi-query, grouped-query, linear, and sparse attention: how each variant trades expressivity for efficiency, and when to use which.
Bits, Nats, Perplexity, and BPB
The four units people use to measure language model quality, how they relate to each other, when to use each one, and how mixing them up leads to wrong conclusions.
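The unit conversions are simple enough to state as code — a minimal sketch, assuming a hypothetical loss of 2.0 nats per token and illustrative token/byte counts:

```python
import math

def nats_to_bits(nats):
    """1 nat = 1/ln(2) bits."""
    return nats / math.log(2)

def perplexity(nats_per_token):
    """Perplexity = e^(cross-entropy measured in nats)."""
    return math.exp(nats_per_token)

def bits_per_byte(nats_per_token, tokens, total_bytes):
    """Convert per-token loss to per-byte loss: total bits / total bytes."""
    return nats_to_bits(nats_per_token) * tokens / total_bytes

loss = 2.0                     # hypothetical cross-entropy, nats/token
ppl = perplexity(loss)         # ~7.39
bits = nats_to_bits(loss)      # ~2.89 bits/token
bpb = bits_per_byte(loss, tokens=1000, total_bytes=4000)
```

Comparing a bits-per-token number against a nats-per-token number without converting is the classic mix-up this entry warns about.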
Chain-of-Thought and Reasoning
Chain-of-thought prompting, why intermediate reasoning steps improve LLM performance, self-consistency, tree-of-thought, and the connection to inference-time compute scaling.
Context Engineering
The discipline of building, routing, compressing, retrieving, and persisting context for LLMs: beyond prompt design into systems engineering for what the model sees.
Decoding Strategies
How language models select output tokens: greedy decoding, beam search, temperature scaling, top-k sampling, and nucleus (top-p) sampling. The tradeoffs between coherence, diversity, and quality.
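The strategies listed above compose naturally: temperature rescales logits, then top-k and top-p filter the distribution before sampling. A toy sketch over a plain logit list (illustrative, not a production implementation):

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Temperature scaling, then top-k and nucleus (top-p) filtering, then sampling."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(l - m) for l in scaled]
    z = sum(probs)
    probs = [p / z for p in probs]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k is not None:
        order = order[:top_k]          # keep only the k most probable tokens
    if top_p is not None:              # keep smallest prefix with mass >= top_p
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        order = kept
    z = sum(probs[i] for i in order)   # renormalize over surviving tokens
    r, acc = rng.random() * z, 0.0
    for i in order:
        acc += probs[i]
        if r <= acc:
            return i
    return order[-1]
```

Setting `top_k=1` recovers greedy decoding; temperature near zero does the same by concentrating all mass on the argmax.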
Document Intelligence
Beyond OCR: understanding document layout, tables, figures, and structure using models that combine text, spatial position, and visual features to extract structured information from PDFs, invoices, and contracts.
DPO vs GRPO vs RL for Reasoning
Head-to-head comparison of preference optimization methods: DPO's implicit reward, GRPO's group-relative comparisons, and RL with verifier feedback for reasoning. When each works and when each fails.
Edge and On-Device ML
Running models on phones, embedded devices, and edge servers: pruning, distillation, quantization, TinyML, and hardware-aware neural architecture search under memory, compute, and power constraints.
Efficient Transformers Survey
Sub-quadratic attention variants (linear attention, Linformer, Performer, Longformer, BigBird) and why FlashAttention, a hardware-aware exact method, made most of them unnecessary in practice.
Flash Attention
IO-aware exact attention: tile the Q, K, and V matrices into SRAM-sized blocks and compute attention without materializing the full attention matrix in HBM, reducing memory reads and writes from quadratic to linear.
Forgetting Transformer (FoX)
FoX adds a data-dependent forget gate to softmax attention. The gate down-weights unnormalized attention scores between past and present positions, giving the transformer a learned, recency-biased decay. FoX is FlashAttention-compatible, works without positional embeddings, and improves long-context language modeling and length extrapolation.
Fused Kernels
Combine multiple GPU operations into a single kernel launch to eliminate intermediate HBM reads and writes. Why kernel fusion is the primary optimization technique for memory-bound ML operations.
GPU Compute Model
How GPUs execute ML workloads: streaming multiprocessors, warps, memory hierarchy (registers, SRAM, L2, HBM), arithmetic intensity, the roofline model, and why most ML operations are memory-bound.
Induction Heads
Induction heads are attention head circuits that implement pattern completion: given a sequence like [A][B]...[A], they predict [B]. They are a leading candidate mechanism for in-context learning, with strong causal evidence in small attention-only models and correlational evidence in large transformers. They emerge through a phase transition during training.
Inference Systems Overview
The modern LLM inference stack: batching strategies, scheduling, memory management with paged attention, model parallelism for serving, and why FLOPs do not equal latency when memory bandwidth is the bottleneck.
Inference-Time Scaling Laws
How additional compute at inference time (repeated sampling, search, verification) improves output quality, why gains are task-dependent, and why verifier quality matters more than raw sample count.
Knowledge Distillation
Training a small student model to mimic a large teacher: soft targets, temperature scaling, dark knowledge, and why the teacher's mistakes carry useful information about class structure.
KV Cache Optimization
Advanced techniques for managing the KV cache memory bottleneck: paged attention for fragmentation-free allocation, prefix caching for shared prompts, token eviction for long sequences, and quantized KV cache for reduced footprint.
KV Cache
Why autoregressive generation recomputes attention at every step, how caching past key-value pairs makes it linear, and the memory bottleneck that drives MQA, GQA, and paged attention.
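The caching idea fits in a few lines — a toy single-head sketch (the `KVCache` class and shapes are illustrative, not any serving framework's API):

```python
import numpy as np

class KVCache:
    """Append-only cache of past keys/values for one attention head."""
    def __init__(self, d_k):
        self.K = np.empty((0, d_k))
        self.V = np.empty((0, d_k))

    def step(self, q, k, v):
        # cache the new token's key/value, then attend q over all cached pairs
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        scores = self.K @ q / np.sqrt(q.shape[0])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ self.V              # O(t) work at step t instead of recomputing O(t^2)

rng = np.random.default_rng(0)
cache = KVCache(d_k=4)
for _ in range(3):                     # three decode steps; cache grows one row per step
    q, k, v = rng.normal(size=(3, 4))
    out = cache.step(q, k, v)
```

The cache itself is what becomes the memory bottleneck: it grows linearly in sequence length per layer per head, which is what motivates MQA, GQA, and paged allocation.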
Latent Reasoning
Reasoning in hidden state space instead of generating chain-of-thought tokens: recurrent computation and continuous thought for scaling inference compute without scaling output length.
Memory Systems for LLMs
Taxonomy of LLM memory: short-term (KV cache), working (scratchpad), long-term (retrieval), and parametric (weights). Why extending context alone is insufficient and how memory consolidation works.
Mixture of Experts
Sparse computation via learned routing: replace dense FFN layers with multiple expert networks, activate only a subset per token, and scale capacity without proportional compute.
Model Compression and Pruning
Reducing model size without proportional accuracy loss: unstructured and structured pruning, magnitude pruning, the lottery ticket hypothesis, entropy coding for compressed weights, and knowledge distillation as compression.
Multi-Token Prediction
Predicting k future tokens simultaneously using auxiliary prediction heads: forces planning, improves code generation, and connects to speculative decoding.
Multimodal RAG
RAG beyond text: retrieving images, tables, charts, and PDFs alongside text. Document parsing, multimodal chunking, vision-language retrievers, agentic RAG, and reasoning RAG with chain-of-thought retrieval.
PaddleOCR and Practical OCR
A practitioner's guide to modern OCR toolkits: PaddleOCR's three-stage pipeline, TrOCR's transformer approach, EasyOCR, and Tesseract. When to use which, and what accuracy to expect.
Parallel Processing Fundamentals
Data, tensor, pipeline, expert, and sequence parallelism: the five strategies for distributing model training and inference across multiple GPUs, and how frontier labs combine all of them.
Perplexity and Language Model Evaluation
Perplexity as exp(cross-entropy): the standard intrinsic metric for language models, its information-theoretic interpretation, connection to bits-per-byte, and why low perplexity alone does not guarantee useful generation.
Post-Training Overview
The full post-training stack in 2026: SFT, RLHF, DPO, GRPO, constitutional AI, verifier-guided training, and self-improvement loops. Why post-training is now its own discipline.
Prefix Caching
Reuse computed KV cache entries across requests that share the same prefix. Radix attention trees enable efficient lookup. Significant latency savings for prefix-heavy production workloads.
Prompt Engineering and In-Context Learning
In-context learning allows LLMs to adapt to new tasks from examples in the prompt without weight updates. Theories for why it works, prompting strategies, and why prompt engineering is configuring inference-time computation.
Reasoning Data Curation
How to build training data for reasoning models: math with verified solutions, code with test cases, rejection sampling, external verification, self-play for problem generation, and the connection to RLVR.
Residual Stream and Transformer Internals
The residual stream as the central computational highway in transformers: attention and FFN blocks read from and write to it. Pre-norm vs post-norm, FFN as key-value memory, and the logit lens for inspecting intermediate representations.
RLHF and Alignment
The RLHF pipeline for aligning language models with human preferences: reward modeling, PPO fine-tuning, KL penalties, DPO, and why none of it guarantees truthfulness.
Scaling Compute-Optimal Training
Chinchilla scaling: how to optimally allocate a fixed compute budget between model size and training data, why many models were undertrained, and the post-Chinchilla reality of data quality and inference cost.
Scaling Laws
Power-law relationships between loss and compute, parameters, and data: Kaplan scaling, Chinchilla-optimal training, emergent abilities, and whether scaling laws are fundamental or empirical.
Sparse Attention and Long Context
Standard attention is O(n^2). Sparse patterns (Longformer, Sparse Transformer, Reformer), ring attention for distributed sequences, streaming with attention sinks, and why extending context is harder than it sounds.
Sparse Autoencoders for Interpretability
Sparse autoencoders decompose neural network activations into interpretable feature directions by learning an overcomplete dictionary with sparsity constraints. They are the primary tool for extracting monosemantic features from polysemantic neurons.
Speculative Decoding and Quantization
Two core inference optimizations: speculative decoding for latency (draft-verify parallelism) and quantization for memory and throughput (reducing weight precision without destroying quality).
Structured Output and Constrained Generation
Forcing LLM output to conform to schemas via constrained decoding: logit masking for valid tokens, grammar-guided generation, and why constraining the search space can improve both reliability and quality.
Test-Time Compute and Search
One of the biggest frontier shifts: spending more compute at inference through repeated sampling, verifier-guided search, MCTS for reasoning, chain-of-thought as compute, and latent reasoning.
Token Prediction and Language Modeling
Language modeling as probability assignment over sequences. Autoregressive and masked prediction objectives, perplexity evaluation, and the connection between prediction and compression.
Tool-Augmented Reasoning
LLMs that call external tools during reasoning: Toolformer for learning when to invoke APIs, ReAct for interleaving thought and action, and code-as-thought for replacing verbal arithmetic with executed programs.
Training Dynamics and Loss Landscapes
The geometry of neural network loss surfaces: why saddle points dominate over local minima in high dimensions, how flat minima relate to generalization, and why SGD finds solutions that generalize.
Transformer Architecture
The mathematical formulation of the transformer block: self-attention, multi-head attention, layer normalization, FFN blocks, positional encoding, and parameter counting.
AMD Competition Landscape
AMD's MI300X and MI325X GPUs compete with NVIDIA on memory bandwidth and capacity but lag on software ecosystem. Competition matters because pricing, supply diversity, and vendor lock-in determine who can train and serve models.
ASML and Chip Manufacturing
ASML is the sole manufacturer of EUV lithography machines used to produce every advanced AI chip. Understanding the semiconductor supply chain reveals a critical concentration risk for AI compute.
Attention as Kernel Regression
Softmax attention viewed as Nadaraya-Watson kernel regression: the output at each position is a kernel-weighted average of values, with the softmax kernel K(q,k) = exp(q^T k / sqrt(d)). Connects attention to classical nonparametric statistics and motivates linear attention via random features.
Distributed Training Theory
Training frontier models requires thousands of GPUs. Data parallelism, model parallelism, and communication-efficient methods make this possible.
Donut and OCR-Free Document Understanding
End-to-end document understanding without OCR: Donut reads document images directly and generates structured output, bypassing the error-prone OCR pipeline. Nougat extends this to academic paper parsing.
Model Merging and Weight Averaging
Combining trained models by averaging or interpolating their weights: SWA, SLERP, TIES-Merging, DARE. Why it works (loss landscape mode connectivity), when it fails, and applications to combining specialized models.
Neural Architecture Search
Automating network architecture design: search spaces, search strategies (RL, evolutionary, differentiable), performance estimation via weight sharing, and the gap between NAS hype and practical gains.
NVIDIA GPU Architectures
A concrete reference for the GPU hardware that determines what ML models can be trained and served: H100, H200, B200, GB200, and Rubin architectures, with emphasis on memory bandwidth as the primary bottleneck.
Plan-then-Generate
Beyond strict left-to-right token generation: plan what to say before producing tokens. Outline-then-fill, hierarchical generation, and multi-token prediction. Planning reduces incoherence in long outputs and enables structural editing.
Positional Encoding
Why attention needs position information, sinusoidal encoding, learned positions, RoPE (rotary position encoding via 2D rotations), ALiBi, and why RoPE became the default for modern LLMs.
Quantization Theory
Reduce model weight precision from FP32 to FP16, INT8, or INT4. Post-training quantization, quantization-aware training, GPTQ, AWQ, and GGUF. Quantization is how large language models actually get deployed.
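The simplest case in this entry, symmetric per-tensor INT8 post-training quantization, can be sketched directly (a toy illustration; GPTQ/AWQ add calibration and error compensation on top of this idea):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8: w ~ scale * q with q in [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=256).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()      # rounding error is bounded by scale / 2
```

The error bound makes the core tradeoff visible: a single outlier weight inflates `scale` and degrades every other weight, which is why per-channel scales and outlier handling dominate practical schemes.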
Table Extraction and Structure Recognition
Detecting tables in documents, recognizing row and column structure, and extracting cell content. Why tables are hard: merged cells, borderless layouts, nested headers, and cascading pipeline errors.
Tokenization and Information Theory
Tokenization determines an LLM's vocabulary and shapes everything from compression efficiency to multilingual ability. Information theory explains what good tokenization looks like.
Methodology & Experimental Design
Hypothesis testing, ablations, significance, reproducibility.
The Bitter Lesson
Sutton's meta-principle: scalable general methods that exploit computation tend to beat hand-crafted domain-specific approaches in the long run. Search and learning win; brittle cleverness loses.
Causal Inference and the Ladder of Causation
Pearl's three-level hierarchy: association, intervention, counterfactual. Structural causal models, do-calculus, the adjustment formula, and why prediction is not causation.
Confusion Matrices and Classification Metrics
The complete framework for evaluating binary classifiers: confusion matrix entries, precision, recall, F1, specificity, ROC curves, AUC, and precision-recall curves. When to use each metric.
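The core metrics fall straight out of the confusion-matrix counts — a minimal sketch over label lists (the helper name is illustrative):

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, and F1 from raw confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many real
    recall = tp / (tp + fn) if tp + fn else 0.0      # of real positives, how many found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 3 true positives, 1 false positive, 1 false negative
p, r, f1 = binary_metrics([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0])
```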
Confusion Matrix Deep Dive
Beyond binary classification: multi-class confusion matrices, micro vs macro averaging, Matthews correlation coefficient, Cohen's kappa, cost-sensitive evaluation, and how to read a confusion matrix without getting the axes wrong.
The Era of Experience
Sutton and Silver's thesis: the next phase of AI moves beyond imitation from human data toward agents that learn predominantly from their own experience. Text is not enough for general intelligence.
Model Evaluation Best Practices
Train/validation/test splits, cross-validation, stratified and temporal splits, data leakage, reporting with confidence intervals, and statistical tests for model comparison. Why single-number comparisons are misleading.
Train-Test Split and Data Leakage
Why you need three data splits, how to construct them correctly, and the common ways information leaks from test data into training. Temporal splits, stratified splits, and leakage detection.
Types of Bias in Statistics
A taxonomy of the major biases that corrupt statistical inference: selection bias, measurement bias, survivorship bias, response bias, attrition bias, publication bias, and their consequences for ML.
Ablation Study Design
How to properly design ablation studies: remove one component at a time, measure the effect against proper baselines, and report results with statistical rigor.
Class Imbalance and Resampling
When class frequencies differ dramatically, standard accuracy is misleading. Resampling, cost-sensitive learning, and threshold tuning restore meaningful evaluation and training.
Convex Tinkering
Taleb's concept applied to ML research: designing small experiments with bounded downside and unbounded upside, and why this strategy dominates scale-first approaches under uncertainty.
Evaluation Metrics and Properties
The metrics that determine whether a model is good: accuracy, precision, recall, F1, AUC-ROC, log loss, RMSE, calibration, and proper scoring rules. Why choosing the right metric matters more than improving the wrong one.
Exploratory Data Analysis
The disciplined practice of looking at data before modeling: summary statistics, distributions, correlations, missing values, outliers, and class balance. You cannot model what you do not understand.
Feature Importance and Interpretability
Methods for attributing model predictions to input features: permutation importance, SHAP values, LIME, partial dependence, and why none of these imply causality.
Federated Learning
Train a global model without centralizing data. FedAvg, communication efficiency, non-IID convergence challenges, differential privacy integration, and applications in healthcare and mobile computing.
Hardware for ML Practitioners
Practical hardware guidance for ML work: GPU memory as the real bottleneck, when local GPUs make sense, cloud options compared, and why you should not spend $5000 before knowing what you need.
Hypothesis Testing for ML
Null and alternative hypotheses, p-values, confidence intervals, error types, multiple comparisons, and proper statistical tests for comparing ML models.
Meta-Analysis
Combining results from multiple studies: fixed-effect and random-effects models, heterogeneity measures, publication bias, and why this matters for ML benchmarking.
ML Project Lifecycle
The full ML project workflow from problem definition through deployment and monitoring. Why most projects fail at data quality, not model architecture. Cross-functional requirements and MLOps basics.
P-Hacking and Multiple Testing
How selective reporting and multiple comparisons inflate false positive rates, and how Bonferroni and Benjamini-Hochberg corrections control them. Why hyperparameter tuning is multiple testing and benchmark shopping is p-hacking.
Proper Scoring Rules
A scoring rule is proper if the expected score is maximized when the forecaster reports their true belief. Log score and Brier score are strictly proper. Accuracy is not. Why this matters for calibrated probability estimates.
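The properness claim can be checked numerically — a small sketch showing that under the log score, reporting the true belief beats shading it (probabilities here are illustrative):

```python
import math

def log_score(p, outcome):
    """Strictly proper; higher is better."""
    return math.log(p if outcome else 1 - p)

def brier_score(p, outcome):
    """Strictly proper; lower is better."""
    return (p - outcome) ** 2

def expected_log_score(report, true_p):
    """Expected log score when the event truly occurs with probability true_p."""
    return true_p * log_score(report, 1) + (1 - true_p) * log_score(report, 0)

honest = expected_log_score(0.7, 0.7)   # report the true belief
shaded = expected_log_score(0.9, 0.7)   # overconfident report scores worse in expectation
```

Accuracy fails this test: any report on the majority side of 0.5 earns the same expected accuracy, so it gives forecasters no incentive to report calibrated probabilities.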
Reproducibility and Experimental Rigor
What it takes to make ML experiments truly reproducible: seeds, variance reporting, data hygiene, configuration management, and the discipline of multi-run evaluation.
Statistical Significance and Multiple Comparisons
p-values, significance levels, confidence intervals, and the multiple comparisons problem. Bonferroni correction, Benjamini-Hochberg FDR control, and why these matter for model selection and benchmark evaluation.
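The Benjamini-Hochberg step-up procedure mentioned here is short enough to state exactly (the p-values below are illustrative):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of hypotheses rejected under BH FDR control at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:   # BH step-up condition
            k_max = rank                      # reject everything up to the largest passing rank
    return sorted(order[:k_max])

# one strong effect, one borderline effect, three nulls
rejected = benjamini_hochberg([0.001, 0.02, 0.04, 0.30, 0.50], alpha=0.05)
```

Note the step-up behavior: the 0.02 p-value is rejected at rank 2 (threshold 0.02) even though plain Bonferroni (threshold 0.01) would retain it.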
Synthetic Data Generation
Using models to generate training data: LLM-generated instructions, diffusion-based image augmentation, code synthesis. When synthetic data helps (low-resource, privacy) and when it hurts (model collapse).
Benchmarking Methodology
What makes a good benchmark, how benchmarks fail (contamination, leaderboard gaming, single-number comparisons), and how to report results honestly with variance, seeds, and proper statistical practice.
Causal Inference Basics
Correlation is not causation. The potential outcomes framework, average treatment effects, confounders, and the methods that let you estimate causal effects from data.
Energy Efficiency and Green AI
The compute cost of training frontier models, carbon footprint, FLOPs vs wall-clock time vs dollars, and why reporting efficiency matters. Efficient alternatives: distillation, pruning, quantization, and scaling laws for optimal compute allocation.
Experiment Tracking and Tooling
MLflow, Weights and Biases, TensorBoard, Hydra, and the discipline of recording everything. What to log, how to compare experiments, and why you cannot reproduce what you did not record.
Official Statistics and National Surveys
How government statistical agencies produce population, economic, and social data through censuses and surveys, with quality frameworks and implications for ML practitioners using these datasets.
Training Techniques & Regularization
Adam, dropout, batch norm, data augmentation, learning rate schedules.
Adam Optimizer
Adaptive moment estimation: first and second moment tracking, bias correction, AdamW vs Adam+L2, learning rate warmup, and when Adam fails compared to SGD.
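The update rule described here fits in a few lines — a toy scalar-list sketch with the standard defaults, not a production optimizer:

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; t is the 1-indexed step count used for bias correction."""
    m = [b1 * mi + (1 - b1) * g for mi, g in zip(m, grad)]        # 1st moment (momentum)
    v = [b2 * vi + (1 - b2) * g * g for vi, g in zip(v, grad)]    # 2nd moment (scale)
    m_hat = [mi / (1 - b1 ** t) for mi in m]                      # bias correction: moments
    v_hat = [vi / (1 - b2 ** t) for vi in v]                      # start at zero
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# minimize f(x) = x^2 from x = 5; the gradient is 2x
theta, m, v = [5.0], [0.0], [0.0]
for t in range(1, 2001):
    theta, m, v = adam_step(theta, [2 * theta[0]], m, v, t, lr=0.05)
```

The AdamW distinction lives outside this update: weight decay is applied directly to `theta` rather than folded into `grad`, so it is not rescaled by the second-moment denominator.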
Batch Normalization
Normalize activations within mini-batches to stabilize and accelerate training, then scale and shift with learnable parameters.
Dropout
Dropout as stochastic regularization: Bernoulli masking, inverted dropout scaling, implicit ensemble interpretation, Bayesian connection via MC dropout, and equivalence to L2 regularization in linear models.
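The inverted-dropout scaling mentioned here is the detail most worth seeing in code — a minimal sketch over a plain list of activations:

```python
import random

def inverted_dropout(x, p_drop, rng=random):
    """Zero each unit with probability p_drop and scale survivors by 1/(1 - p_drop),
    so the expected activation is unchanged and test time needs no rescaling."""
    keep = 1.0 - p_drop
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]

rng = random.Random(0)
x = [1.0] * 10000
y = inverted_dropout(x, p_drop=0.3, rng=rng)
mean = sum(y) / len(y)     # ~1.0: the scaling preserves the expectation
```

At inference the layer is simply the identity; the 1/(1-p) factor during training is what makes that consistent.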
Learning Rate Scheduling
Why learning rate moves the loss more than any other hyperparameter, and how warmup, cosine decay, step decay, cyclic schedules, and 1cycle policy control training dynamics.
Regularization in Practice
Practical guide to regularization: L2 (weight decay), L1 (sparsity), dropout, early stopping, data augmentation, and batch normalization. How each constrains model complexity and how to choose between them.
Weight Initialization
Why initialization determines whether gradients vanish or explode, and how Xavier and He initialization preserve variance across layers.
Activation Checkpointing
Trade compute for memory by recomputing activations during the backward pass instead of storing them all. Reduces memory from O(L) to O(sqrt(L)) for L layers.
Batch Size and Learning Dynamics
How batch size affects what SGD finds: gradient noise, implicit regularization, the linear scaling rule, sharp vs flat minima, and the gradient noise scale as the key quantity governing the tradeoff.
Data Augmentation Theory
Why data augmentation works as a regularizer: invariance injection, effective sample size, MixUp, CutMix, and the connection to Vicinal Risk Minimization.
Label Smoothing and Regularization
Replace hard targets with soft targets to prevent overconfidence: the label smoothing formula, its connection to maximum entropy regularization, calibration effects, and when it hurts.
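The label smoothing formula is y_k = (1 - eps) * 1[k = target] + eps / K. A one-function sketch (the helper name is illustrative):

```python
def smooth_labels(target, num_classes, eps=0.1):
    """Soft targets from a hard label: mass (1 - eps) on the target, eps spread uniformly."""
    return [(1 - eps) * (1.0 if k == target else 0.0) + eps / num_classes
            for k in range(num_classes)]

y = smooth_labels(target=2, num_classes=4, eps=0.1)   # [0.025, 0.025, 0.925, 0.025]
```

Because the target distribution never reaches a one-hot vertex, the logit gap the model is pushed toward stays finite, which is the mechanism behind the overconfidence and calibration effects described above.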
Mixed Precision Training
Train with FP16 or BF16 for speed while keeping FP32 master weights for accuracy. Loss scaling, overflow prevention, and when mixed precision fails.
Curriculum Learning
Train on easy examples first, gradually increase difficulty. Curriculum learning can speed convergence and improve generalization, but defining difficulty is the hard part. Self-paced learning, anti-curriculum, and the connection to importance sampling.
AI Safety & Alignment
RLHF failure modes, hallucination theory, interpretability, reward hacking.
Adversarial Machine Learning
Attacks across the ML lifecycle: evasion (adversarial examples), poisoning (corrupt training data), model extraction (steal via queries), privacy leakage (membership inference), and LLM jailbreaks. Why defenses are hard.
Calibration and Uncertainty Quantification
When a model says 70% confidence, is it right 70% of the time? Calibration measures the alignment between predicted probabilities and actual outcomes. Temperature scaling, Platt scaling, conformal prediction, and MC dropout provide practical tools for trustworthy uncertainty.
Catastrophic Forgetting
Fine-tuning a neural network on new data destroys knowledge of old data. Understanding the stability-plasticity dilemma and its mitigation strategies (EWC, progressive networks, replay) is essential for continual learning and safe LLM fine-tuning.
Constitutional AI
Anthropic's approach to alignment: replace human harmlessness labels with a constitution of principles that the model uses to self-evaluate, enabling scalable AI feedback.
Continual Learning and Forgetting
Learning sequentially without destroying previous knowledge: Elastic Weight Consolidation, progressive networks, replay methods, and the stability-plasticity tradeoff in deployed systems.
Data Contamination and Evaluation
When training data overlaps test benchmarks, model scores become meaningless. Types of contamination, detection methods, dynamic benchmarks, and how to read evaluation claims skeptically.
Differential Privacy
Formal privacy guarantees for algorithms: epsilon-delta DP, Laplace and Gaussian mechanisms, composition theorems, DP-SGD for training neural networks, and the privacy-utility tradeoff.
Ethics and Fairness in ML
Fairness definitions (demographic parity, equalized odds, calibration), the impossibility theorem showing they cannot all hold simultaneously, bias sources, and mitigation strategies at each stage of the pipeline.
LLM Application Security
The OWASP LLM Top 10: prompt injection, insecure output handling, training data poisoning, model denial of service, supply chain vulnerabilities, sensitive information disclosure, insecure plugin design, excessive agency, overreliance, and model theft. Standard application security for the GenAI era.
Mechanistic Interpretability
Understanding what individual neurons and circuits compute inside neural networks: sparse autoencoders, superposition, induction heads, probing, and the limits of interpretability.
Model Collapse and Data Quality
When models train on their own outputs, the learned distribution narrows, tails disappear, and quality degrades across generations. Why synthetic data feedback loops threaten pretraining data quality and how to mitigate them.
Out-of-Distribution Detection
Methods for detecting when test inputs differ from training data, where naive softmax confidence fails and principled alternatives based on energy, Mahalanobis distance, and typicality succeed.
Red-Teaming and Adversarial Evaluation
Systematically trying to make models produce harmful or incorrect outputs: manual and automated red-teaming, jailbreaks, prompt injection, adversarial suffixes, and why adversarial evaluation is necessary before deployment.
Reward Hacking
Goodhart's law for AI: when models exploit reward model weaknesses instead of being genuinely helpful, including verbosity hacking, sycophancy, and structured mitigation strategies.
Reward Models and Verifiers
Reward models trained on human preferences vs verifiers that check output correctness. Bradley-Terry models, process vs outcome rewards, Goodhart's law, and why verifiers are more robust.
Verifier Design and Process Reward
Detailed treatment of verifier types, process vs outcome reward models, verifier-guided search, self-verification, and the connection to test-time compute scaling. How to design reward signals for reasoning models.
Reinforcement Learning Theory
MDPs, Bellman, policy gradients, multi-agent, game theory.
Kalman Filter
Optimal state estimation for linear Gaussian systems via recursive prediction and update steps using the Kalman gain.
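The predict-update cycle is clearest in the scalar case — a 1-D sketch for a constant-state model with assumed process variance `q` and measurement variance `r`:

```python
import random

def kalman_1d(z_seq, q=1e-4, r=0.1, x0=0.0, p0=1.0):
    """Scalar Kalman filter for x_t = x_{t-1} + process noise, z_t = x_t + meas. noise."""
    x, p = x0, p0
    estimates = []
    for z in z_seq:
        p = p + q                      # predict: variance grows by process noise
        k = p / (p + r)                # Kalman gain: trust in the new measurement
        x = x + k * (z - x)            # update with the measurement residual
        p = (1 - k) * p                # posterior variance shrinks
        estimates.append(x)
    return estimates

# noisy measurements of a constant value 5.0 (toy data)
rng = random.Random(0)
zs = [5.0 + rng.gauss(0, 0.3) for _ in range(200)]
est = kalman_1d(zs, r=0.09)            # r matches the measurement variance 0.3^2
```

The gain `k` is the whole story: large prediction uncertainty pulls the estimate toward the measurement, large measurement noise pulls it toward the prediction.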
Markov Decision Processes
The mathematical framework for sequential decision-making under uncertainty: states, actions, transitions, rewards, and the Bellman equations that make solving them possible.
Policy Gradient Theorem
The fundamental result enabling gradient-based optimization of parameterized policies: the policy gradient theorem and the algorithms it spawns.
Value Iteration and Policy Iteration
The two foundational algorithms for solving MDPs exactly: value iteration applies the Bellman optimality operator until convergence, while policy iteration alternates between exact evaluation and greedy improvement.
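Value iteration is compact enough to show whole — a sketch on a hypothetical two-state MDP, with `P[s][a]` a list of `(prob, next_state)` pairs and `R[s][a]` the immediate reward:

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Apply the Bellman optimality operator until the value function stops changing."""
    n = len(P)
    V = [0.0] * n
    while True:
        V_new = [max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                     for a in range(len(P[s])))
                 for s in range(n)]
        if max(abs(a - b) for a, b in zip(V, V_new)) < tol:
            return V_new
        V = V_new

# state 0: stay (reward 0) or move to state 1 (reward 1); state 1 is absorbing, reward 0
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 1)], [(1.0, 1)]]]
R = [[0.0, 1.0],
     [0.0, 0.0]]
V = value_iteration(P, R, gamma=0.9)   # V[0] = 1, V[1] = 0
```

Convergence follows from the operator being a gamma-contraction in the sup norm; policy iteration trades these cheap sweeps for exact evaluation of each intermediate policy.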
Actor-Critic Methods
The dominant paradigm for deep RL and LLM training: an actor (policy network) guided by a critic (value network), with advantage estimation, PPO clipping, and entropy regularization.
Agentic RL and Tool Use
The shift from passive sequence generation to autonomous multi-turn decision making. LLMs as RL policies, tool use as actions, ReAct, AgentRL, and why agentic RL differs from chat RLHF.
Bayesian State Estimation
The filtering problem: recursively estimate a hidden state from noisy observations using predict-update cycles. Kalman filter for linear Gaussian systems, particle filters for the general case.
Exploration vs Exploitation
The fundamental tradeoff in sequential decision-making: exploit known good actions to collect reward now, or explore uncertain actions to discover potentially better strategies. Epsilon-greedy, Boltzmann exploration, UCB, count-based methods, and intrinsic motivation.
GraphSLAM and Factor Graphs
SLAM as graph optimization: poses as nodes, constraints as edges, factor graph representation, MAP estimation via nonlinear least squares, and the sparsity structure that makes large-scale mapping tractable.
Markov Games and Self-Play
Multi-agent extensions of MDPs where multiple agents with separate rewards interact. Nash equilibria, minimax values in zero-sum games, and self-play as a training method.
Minimax and Saddle Points
Minimax theorems characterize when max-min equals min-max. Saddle points arise in zero-sum games, duality theory, GANs, and robust optimization.
Multi-Agent Collaboration
Multiple LLM agents working together on complex tasks: debate for improving reasoning, division of labor across specialist agents, structured communication protocols, and when multi-agent outperforms single-agent systems.
Multi-Armed Bandits Theory
The exploration-exploitation tradeoff formalized: K arms, regret as the cost of not knowing the best arm, and algorithms (UCB, Thompson sampling) that achieve near-optimal regret bounds.
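The UCB1 rule — pull the arm maximizing empirical mean plus sqrt(2 ln t / n_pulls) — can be sketched directly (the `ucb1` helper and Bernoulli arms are illustrative):

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """UCB1: optimism in the face of uncertainty via a confidence bonus."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                        # initialize by playing each arm once
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        counts[arm] += 1
        sums[arm] += pull(arm)
    return counts

# Bernoulli arms with means 0.2 and 0.8; UCB concentrates pulls on the better arm
rng = random.Random(0)
counts = ucb1(lambda a: 1.0 if rng.random() < [0.2, 0.8][a] else 0.0,
              n_arms=2, horizon=2000)
```

The bonus shrinks as an arm is pulled, so suboptimal arms are revisited only logarithmically often — the mechanism behind the near-optimal regret bounds this entry covers.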
No-Regret Learning
Online learning against adversarial losses: regret as cumulative loss minus the best fixed action in hindsight, multiplicative weights, follow the regularized leader, and why no-regret dynamics converge to Nash equilibria in zero-sum games.
Offline Reinforcement Learning
Learning policies from a fixed dataset without environment interaction: distributional shift as the core challenge, conservative Q-learning (CQL) as the standard fix, and Decision Transformer as an alternative sequence modeling approach.
Online Learning and Bandits
Sequential decision making with adversarial or stochastic feedback: the bandit setting, explore-exploit tradeoff, UCB, Thompson sampling, and regret bounds. Connections to RL and A/B testing.
Policy Optimization: PPO and TRPO
Trust region and clipped surrogate methods for stable policy optimization: TRPO enforces a KL constraint for monotonic improvement, PPO replaces it with a simpler clipped objective, and PPO became the default optimizer for LLM post-training.
Policy Representations
How to parameterize policies in reinforcement learning: categorical for discrete actions, Gaussian for continuous actions, and why the choice affects gradient variance and exploration.
Q-Learning
Model-free, off-policy value learning: the Q-learning update rule, convergence under Robbins-Monro conditions, and the deep Q-network revolution that introduced function approximation, experience replay, and the deadly triad.
Self-Play and Multi-Agent RL
Self-play as a training paradigm for competitive games, fictitious play convergence, AlphaGo/AlphaZero, and the challenges of multi-agent reinforcement learning: non-stationarity, partial observability, and centralized training.
Temporal Difference Learning
Temporal difference methods bootstrap value estimates from other value estimates, enabling online, incremental learning without waiting for episode termination. TD(0), SARSA, and TD(lambda) with eligibility traces.
Active SLAM and POMDPs
Choosing robot actions to simultaneously map an environment and localize, formulated as a partially observable Markov decision process over belief states.
Agent Protocols: MCP and A2A
The protocol layer for AI agents: MCP (Model Context Protocol) for tool access, A2A (Agent-to-Agent) for inter-agent communication, and why standardized interfaces matter for the agent ecosystem.
Mean-Field Games
The many-agent limit of strategic interactions: as the number of agents goes to infinity, each agent solves an MDP against the population distribution, and equilibrium becomes a fixed-point condition on the mean field.
Options and Temporal Abstraction
The options framework for hierarchical RL: temporally extended actions with initiation sets, internal policies, and termination conditions. Semi-MDPs and learning options end-to-end.
Particle Filters
Sequential Monte Carlo: represent the posterior over hidden states as a set of weighted particles, propagate through dynamics, reweight by likelihood, and resample to combat degeneracy.
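The propagate/reweight/resample loop described above can be sketched for a toy 1D model: a random-walk state observed under Gaussian noise (an illustrative choice of dynamics and likelihood, not a general-purpose filter):

```python
import math
import random

def particle_filter_step(particles, weights, obs, rng,
                         process_std=0.5, obs_std=1.0):
    """One SMC step for a 1D random-walk state-space model:
    propagate through dynamics, reweight by Gaussian likelihood,
    then resample to combat weight degeneracy."""
    particles = [x + rng.gauss(0.0, process_std) for x in particles]       # propagate
    weights = [w * math.exp(-0.5 * ((obs - x) / obs_std) ** 2)             # reweight
               for w, x in zip(weights, particles)]
    total = sum(weights)
    weights = [w / total for w in weights]
    particles = rng.choices(particles, weights=weights, k=len(particles))  # resample
    return particles, [1.0 / len(particles)] * len(particles)

# Track a stationary true state at 3.0 from noisy observations.
rng = random.Random(0)
n = 500
particles = [rng.uniform(-10.0, 10.0) for _ in range(n)]
weights = [1.0 / n] * n
for _ in range(50):
    particles, weights = particle_filter_step(particles, weights,
                                              3.0 + rng.gauss(0.0, 1.0), rng)
est = sum(particles) / n
```

Multinomial resampling after every step is the simplest scheme; practical filters resample only when the effective sample size drops, and use lower-variance schemes such as systematic resampling.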
Reinforcement Learning Environments and Benchmarks
The standard RL evaluation stack: Gymnasium API, classic control tasks, Atari, MuJoCo, ProcGen, the sim-to-real gap, and why benchmark performance is a poor predictor of real-world RL capability.
Robust Adversarial Policies
Robust MDPs optimize against worst-case transition dynamics within an uncertainty set. Adversarial policies formalize distribution shift in RL as a game between agent and environment.
Visual and Semantic SLAM
Replacing laser range finders with cameras for SLAM, and enriching maps with semantic labels to improve data association and planning.
Beyond LLMs
JEPA, world models, vision-first AI, diffusion, state-space models.
CLIP and OpenCLIP in Practice
CLIP learns a shared embedding space for images and text via contrastive learning on 400M pairs. Practical guide to zero-shot classification, image search, OpenCLIP variants, embedding geometry, and known limitations.
Diffusion Models
Generative models that learn to reverse a noise-adding process: the math of score matching, denoising, SDEs, and why diffusion dominates image generation.
Equilibrium and Implicit-Layer Models
Deep Equilibrium Models (DEQ) replace explicit depth with a fixed-point equation: instead of stacking L layers, solve for the equilibrium state where one more layer would not change the output. This enables infinite-depth networks with constant memory, using implicit differentiation for backprop.
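The fixed-point idea is easy to see numerically. A toy sketch that solves z* = tanh(W z* + x) by plain iteration; a real DEQ would use Anderson or Broyden acceleration and backprop implicitly through the equilibrium, and the contraction setup here is an illustrative assumption:

```python
import numpy as np

def deq_forward(W, x, tol=1e-8, max_iter=500):
    """Naive fixed-point iteration for z* = tanh(W z* + x): applying the
    'layer' once more at z* leaves the output unchanged."""
    z = np.zeros_like(x)
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + x)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
W *= 0.5 / np.linalg.norm(W, 2)   # spectral norm 0.5: tanh(Wz + x) is a contraction
x = rng.standard_normal(8)
z_star = deq_forward(W, x)
```

Because only z* is needed for the implicit backward pass, memory is constant in the effective depth, which is the advantage the blurb above points to.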
Equivariant Deep Learning
Networks that respect symmetry: if the input transforms under a group action, the output transforms predictably. Equivariance generalizes translation equivariance in CNNs to rotations, permutations, and gauge symmetries, reducing sample complexity and improving generalization on structured data.
Florence and Vision Foundation Models
Vision foundation models that unify classification, detection, segmentation, captioning, and VQA under a single pretrained backbone. Florence, Florence-2, and the path toward GPT-4V-style multimodal understanding.
Flow Matching
Learn a velocity field that transports noise to data along straight-line paths. Simpler training than diffusion, faster sampling, and cleaner math.
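The straight-line objective is short enough to write out. A sketch of the conditional flow-matching loss under rectified-flow (linear) paths; `v_theta` stands in for any velocity network and the function name is illustrative:

```python
import numpy as np

def flow_matching_loss(v_theta, x0, x1, rng):
    """Conditional flow matching with straight-line paths: interpolate
    x_t = (1 - t) x0 + t x1, and regress the velocity field at (x_t, t)
    onto the constant displacement x1 - x0."""
    t = rng.uniform(0.0, 1.0, size=(x0.shape[0], 1))
    x_t = (1.0 - t) * x0 + t * x1
    return float(np.mean((v_theta(x_t, t) - (x1 - x0)) ** 2))
```

An oracle field that always returns x1 - x0 achieves zero loss, and sampling reduces to integrating the learned field from noise at t = 0 to data at t = 1, which is where the faster-sampling claim comes from.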
JEPA and Joint Embedding
LeCun's Joint Embedding Predictive Architecture: learning by predicting abstract representations rather than pixels, energy-based formulation, I-JEPA and V-JEPA implementations, and the connection to contrastive learning and world models.
Mamba and State-Space Models
Linear-time sequence modeling via structured state spaces: S4, HiPPO initialization, selective state-space models (Mamba), and the architectural fork from transformers.
Neural ODEs and Continuous-Depth Networks
Treating neural network depth as a continuous variable: the ODE formulation of residual networks, the adjoint method for memory-efficient backpropagation, connections to dynamical systems theory, and practical limitations.
Self-Supervised Vision
Learning visual representations without labels: contrastive methods (SimCLR, MoCo), self-distillation (DINO/DINOv2), and masked image modeling (MAE). Why self-supervised vision matters for transfer learning and label-scarce domains.
Test-Time Training and Adaptive Inference
Updating model parameters at inference time using self-supervised objectives on the test input itself. TTT layers replace fixed linear recurrences (as in Mamba) with learned update rules, blurring the boundary between training and inference.
Video World Models
Turning pretrained video diffusion models into interactive world simulators: condition on actions to generate future frames, enabling RL agent training, robot planning, and game AI without physical environments.
Vision Transformer Lineage
The evolution of visual representation learning: from CNNs (AlexNet, ResNet) to ViT (pure attention for images), Swin (hierarchical attention), and DINOv2 (self-supervised ViT with self-distillation), with connections to CLIP.
World Models and Planning
Learning a model of the environment and planning inside it: Dreamer's latent dynamics, MuZero's learned model with MCTS, video world models, and why planning in imagination is the path to sample-efficient and safe AI.
Audio Language Models
Models that process and generate speech alongside text: audio tokenization, Whisper for transcription, end-to-end voice models, music generation, and the audio-language frontier.
Continuous Thought Machines
Neural networks that process information through continuous-time internal dynamics rather than discrete layer-by-layer computation. Inspired by neural ODEs and dynamical systems, these architectures let the network 'think' for a variable amount of internal time before producing an output.
3D Gaussian Splatting
Represent a 3D scene as millions of 3D Gaussians, each with position, covariance, opacity, and color. Render by projecting to 2D and alpha-compositing. Real-time, high-quality novel view synthesis without neural networks at render time.
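The compositing step described above is a front-to-back accumulation along each ray. A minimal sketch for one ray of already depth-sorted, already projected Gaussians (projection and 2D covariance handling omitted):

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back alpha compositing along one ray:
    C = sum_i c_i * alpha_i * prod_{j<i} (1 - alpha_j),
    where the product is the transmittance remaining in front of Gaussian i."""
    C = np.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        C += transmittance * a * np.asarray(c, dtype=float)
        transmittance *= (1.0 - a)
    return C
```

Because each Gaussian's contribution is weighted by the transmittance left over from those in front of it, an opaque splat occludes everything behind it, and the whole render is differentiable with respect to positions, covariances, opacities, and colors.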
Occupancy Networks and Neural Fields
Representing 3D geometry and appearance as continuous functions parameterized by neural networks: NeRF, occupancy networks, DeepSDF, volume rendering, and the connection to Gaussian splatting.
World Model Evaluation
How to measure whether a learned world model is useful: prediction accuracy, controllability, sim-to-real transfer, planning quality, and why long-horizon evaluation is hard.