# TheoremPath — full index

> Complete listing of all 491 published topics and 53 comparison pages on TheoremPath. Grouped by layer. Intended for LLM ingestion and research use with attribution. Maintained by Robby Sneiderman (https://github.com/Robby955). Canonical URL: https://theorempath.com

## How to cite

Link to the specific topic page. Example:

> TheoremPath. "Hoeffding's Inequality." https://theorempath.com/topics/hoeffdings-inequality

Topic-page structure: each page states assumptions, proof sketch, failure modes, worked examples, and references to canonical textbooks. See the page itself for the authoritative content; summaries here are one-sentence descriptions.

## Layer 0A — Axioms

- [Common Inequalities](https://theorempath.com/topics/common-inequalities) (tier 1, core): The algebraic and probabilistic inequalities that appear everywhere in ML theory: Cauchy-Schwarz, Jensen, AM-GM, Hölder, Minkowski, Young, Markov, and Chebyshev.
- [Common Probability Distributions](https://theorempath.com/topics/common-probability-distributions) (tier 1, core): The essential probability distributions for ML: Bernoulli through Dirichlet, with PMFs/PDFs, moments, relationships, and when each arises in practice.
- [Compactness and Heine-Borel](https://theorempath.com/topics/compactness-and-heine-borel) (tier 1, core): Sequential compactness, the Heine-Borel theorem in finite dimensions, the extreme value theorem, and why compactness is the key assumption in optimization.
- [Computability Theory](https://theorempath.com/topics/computability-theory) (tier 1, core): What can be computed? Turing machines, decidability, the Church-Turing thesis, recursive and recursively enumerable sets, reductions, Rice's theorem, and connections to learning theory.
- [Continuity in R^n](https://theorempath.com/topics/continuity-in-rn) (tier 1, core): Epsilon-delta continuity, uniform continuity, and Lipschitz continuity in Euclidean space. Lipschitz constants control how fast function values change and appear throughout optimization and generalization theory.
- [Differentiation in R^n](https://theorempath.com/topics/differentiation-in-rn) (tier 1, core): Partial derivatives, the gradient, directional derivatives, the total derivative (Fréchet), and the multivariable chain rule. Why the gradient points in the steepest ascent direction, and why this matters for all of optimization.
- [Dynamic Programming](https://theorempath.com/topics/dynamic-programming) (tier 1, core): Solve complex optimization problems by decomposing them into overlapping subproblems with optimal substructure. The algorithmic backbone of sequence models, control theory, and reinforcement learning.
- [Eigenvalues and Eigenvectors](https://theorempath.com/topics/eigenvalues-and-eigenvectors) (tier 1, core): Eigenvalues and eigenvectors: the directions a matrix scales without rotating. Characteristic polynomial, diagonalization, the spectral theorem for symmetric matrices, and the direct connection to PCA.
- [Expectation, Variance, Covariance, and Moments](https://theorempath.com/topics/expectation-variance-covariance-moments) (tier 1, core): Expectation, variance, covariance, correlation, linearity of expectation, variance of sums, and moment-based reasoning in ML.
- [Exponential Function Properties](https://theorempath.com/topics/exponential-function-properties) (tier 1, core): The exponential function e^x: series definition, algebraic properties, and why it appears everywhere in ML. Softmax, MGFs, the Chernoff method, Boltzmann distributions, and exponential families all reduce to properties of exp.
- [Floating-Point Arithmetic](https://theorempath.com/topics/floating-point-arithmetic) (tier 1, core): How computers represent real numbers, why they get it wrong, and why ML uses float32, float16, bfloat16, and int8. IEEE 754, machine epsilon, overflow, underflow, and catastrophic cancellation.
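The catastrophic cancellation named in the Floating-Point Arithmetic entry can be seen in a few lines of Python; this is an illustrative sketch (the quadratic example is not from the topic page), comparing a naive and a rationalized quadratic-formula root:

```python
import math
import sys

# Machine epsilon for float64: the gap between 1.0 and the next representable float.
eps = sys.float_info.epsilon  # equals 2**-52

# Catastrophic cancellation: subtracting nearly equal numbers destroys relative
# accuracy. Solve x^2 - 1e8*x + 1 = 0; the true roots are near 1e8 and 1e-8.
a, b, c = 1.0, -1e8, 1.0
disc = math.sqrt(b * b - 4 * a * c)
naive_root = (-b - disc) / (2 * a)    # -b and disc nearly cancel: digits destroyed
stable_root = (2 * c) / (-b + disc)   # rationalized form avoids the cancellation
```

The stable form agrees with the true small root to near machine precision, while the naive form loses most of its significant digits to the subtraction.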
- [Inner Product Spaces and Orthogonality](https://theorempath.com/topics/inner-product-spaces-and-orthogonality) (tier 1, core): Inner product axioms, Cauchy-Schwarz inequality, orthogonality, Gram-Schmidt, projections, and the bridge to Hilbert spaces.
- [Joint, Marginal, and Conditional Distributions](https://theorempath.com/topics/joint-marginal-conditional-distributions) (tier 1, core): Joint distributions, marginalization, conditional distributions, independence, Bayes theorem, and the chain rule of probability.
- [Matrix Norms](https://theorempath.com/topics/matrix-norms) (tier 1, core): Frobenius, spectral, and nuclear norms for matrices. Submultiplicativity. When and why each norm appears in ML theory.
- [Matrix Operations and Properties](https://theorempath.com/topics/matrix-operations-and-properties) (tier 1, core): Essential matrix operations for ML: trace, determinant, inverse, pseudoinverse, Schur complement, and the Sherman-Morrison-Woodbury formula. When and why each matters.
- [Metric Spaces, Convergence, and Completeness](https://theorempath.com/topics/metric-spaces-convergence-completeness) (tier 1, core): Metric space axioms, convergence of sequences, Cauchy sequences, completeness, and the Banach fixed-point theorem.
- [Positive Semidefinite Matrices](https://theorempath.com/topics/positive-semidefinite-matrices) (tier 1, core): PSD matrices: equivalent characterizations, Cholesky decomposition, Schur complement, and Loewner ordering. Covariance matrices are PSD. Hessians of convex functions are PSD. These facts connect linear algebra to optimization and statistics.
- [Sets, Functions, and Relations](https://theorempath.com/topics/sets-functions-and-relations) (tier 1, core): The language underneath all of mathematics: sets, Cartesian products, functions, injectivity, surjectivity, equivalence relations, and quotient sets.
- [Singular Value Decomposition](https://theorempath.com/topics/singular-value-decomposition) (tier 1, core): The SVD A = U Sigma V^T: the most important matrix factorization in applied mathematics. Geometric interpretation, relationship to eigendecomposition, low-rank approximation via Eckart-Young, and applications from PCA to pseudoinverses.
- [Taylor Expansion](https://theorempath.com/topics/taylor-expansion) (tier 1, core): Taylor approximation in one and many variables. Every optimization algorithm is a Taylor approximation: gradient descent uses first order, Newton's method uses second order.
- [Tensors and Tensor Operations](https://theorempath.com/topics/tensors-and-tensor-operations) (tier 1, core): What a tensor actually is: a multilinear map with specific transformation rules, how tensor contraction generalizes matrix multiplication, Einstein summation, tensor decompositions, and how ML frameworks use the word tensor to mean multidimensional array.
- [The Hessian Matrix](https://theorempath.com/topics/the-hessian-matrix) (tier 1, core): The matrix of second partial derivatives: encodes curvature, determines the nature of critical points, and is the central object in second-order optimization.
- [The Jacobian Matrix](https://theorempath.com/topics/the-jacobian-matrix) (tier 1, core): The matrix of all first partial derivatives of a vector-valued function: encodes the best linear approximation, connects to the chain rule in matrix form, and is the backbone of backpropagation.
- [Vectors, Matrices, and Linear Maps](https://theorempath.com/topics/vectors-matrices-and-linear-maps) (tier 1, core): Vector spaces, linear maps, matrix representation, rank, nullity, and the rank-nullity theorem. The algebraic backbone of ML.
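The Taylor Expansion entry's point that gradient descent sees a first-order model while Newton's method sees a second-order one can be made concrete with partial sums of the series for e^x. A minimal sketch (the function and expansion point are illustrative):

```python
import math

def taylor_exp(x, order):
    """Partial sum of the Taylor series of e^x around 0, up to the given order."""
    return sum(x ** n / math.factorial(n) for n in range(order + 1))

x = 0.5
first_order = taylor_exp(x, 1)    # 1 + x: the linear model gradient descent uses
second_order = taylor_exp(x, 2)   # adds the curvature term x^2/2, as Newton's method does
true_value = math.exp(x)
```

Adding the quadratic term shrinks the approximation error, and taking enough terms recovers e^x to machine precision.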
- [Basic Logic and Proof Techniques](https://theorempath.com/topics/basic-logic-and-proof-techniques) (tier 2, core): The fundamental proof strategies used throughout mathematics: direct proof, contradiction, contrapositive, induction, construction, and cases. Required vocabulary for reading any theorem.
- [Birthday Paradox](https://theorempath.com/topics/birthday-paradox) (tier 2, core): In a group of 23 people, the probability that two share a birthday exceeds 50%. Pairwise collision counting explains why this threshold is so low.
- [Cantor's Theorem and Uncountability](https://theorempath.com/topics/cantors-theorem-and-uncountability) (tier 2, core): Cantor's diagonal argument proves the reals are uncountable. The power set of any set has strictly greater cardinality. These results are the origin of the distinction between countable and uncountable infinity.
- [Cardinality and Countability](https://theorempath.com/topics/cardinality-and-countability) (tier 2, core): Two sets have the same cardinality when a bijection exists between them. The naturals, integers, and rationals are countable. The reals are uncountable, proved by Cantor's diagonal argument.
- [Category Theory](https://theorempath.com/topics/category-theory) (tier 2, advanced): Categories, functors, natural transformations, universal properties, adjunctions, and the Yoneda lemma. The language of abstract structure that unifies algebra, topology, logic, and increasingly appears in ML theory.
- [Counting and Combinatorics](https://theorempath.com/topics/counting-and-combinatorics) (tier 2, core): Counting principles, binomial and multinomial coefficients, inclusion-exclusion, and Stirling's approximation. These tools appear whenever you count hypotheses, bound shattering coefficients, or analyze combinatorial arguments in learning theory.
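The 23-person threshold in the Birthday Paradox entry can be checked exactly from the product form of the all-distinct probability. A short sketch under the uniform, independent-birthday assumption:

```python
def birthday_collision_prob(n, days=365):
    """Exact P(at least two of n people share a birthday), assuming independent
    uniform birthdays: 1 - (days/days) * ((days-1)/days) * ... * ((days-n+1)/days)."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct
```

With 23 people the collision probability is about 0.507, just past 50%; with 22 it falls just short.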
- [Gödel's Incompleteness Theorems](https://theorempath.com/topics/godels-incompleteness-theorems) (tier 2, core): Gödel's first incompleteness theorem: any consistent formal system containing basic arithmetic has true but unprovable statements. The second: such a system cannot prove its own consistency. These are hard limits on what formal reasoning can achieve.
- [Graph Algorithms Essentials](https://theorempath.com/topics/graph-algorithms-essentials) (tier 2, core): The graph algorithms every ML practitioner needs: BFS, DFS, Dijkstra, MST, and topological sort. Why they matter for computational graphs, knowledge graphs, dependency resolution, and GNNs.
- [Greedy Algorithms](https://theorempath.com/topics/greedy-algorithms) (tier 2, core): The greedy paradigm: make the locally optimal choice at each step and never look back. When matroid structure or the exchange argument guarantees global optimality.
- [Integration and Change of Variables](https://theorempath.com/topics/integration-and-change-of-variables) (tier 2, core): Riemann integration, improper integrals, the substitution rule, multivariate change of variables via the Jacobian determinant, and Fubini's theorem. The computational backbone of probability and ML.
- [Inverse and Implicit Function Theorem](https://theorempath.com/topics/inverse-and-implicit-function-theorem) (tier 2, core): The inverse function theorem guarantees local invertibility when the Jacobian is nonsingular. The implicit function theorem guarantees that constraint surfaces are locally graphs. Both are essential for constrained optimization and implicit layers.
- [Knapsack Problem](https://theorempath.com/topics/knapsack-problem) (tier 2, core): The canonical constrained optimization problem: 0/1 knapsack (NP-hard, pseudo-polynomial DP), fractional knapsack (greedy), FPTAS, and connections to Lagrangian relaxation in ML.
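The pseudo-polynomial DP mentioned in the Knapsack Problem entry fits in a few lines. A sketch of the standard O(n * capacity) table (the item values below are illustrative):

```python
def knapsack_01(values, weights, capacity):
    """0/1 knapsack by dynamic programming: best[c] is the maximum value
    achievable with total weight <= c. O(n * capacity) time, O(capacity) space."""
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        # Scan capacities downward so each item is taken at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

best_value = knapsack_01([60, 100, 120], [10, 20, 30], 50)
```

The downward scan is the whole trick: scanning upward would let an item be reused, which solves the unbounded variant instead.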
- [Lambda Calculus](https://theorempath.com/topics/lambda-calculus) (tier 2, core): Lambda calculus is the simplest model of computation: just variables, abstraction, and application. It is equivalent to Turing machines in power, and it is the foundation of functional programming, type theory, and the Curry-Howard correspondence.
- [Moment Generating Functions](https://theorempath.com/topics/moment-generating-functions) (tier 2, core): The moment generating function M(t) = E[e^{tX}] encodes all moments of a distribution. The Chernoff method, sub-Gaussian bounds, and exponential family theory all reduce to MGF conditions.
- [Monty Hall Problem](https://theorempath.com/topics/monty-hall-problem) (tier 2, core): Three doors, one car, two goats. You pick a door, the host reveals a goat behind another. Switching wins 2/3 of the time. Bayes theorem makes this precise.
- [Peano Axioms](https://theorempath.com/topics/peano-axioms) (tier 2, core): The five axioms that define the natural numbers: zero exists, every number has a successor, successors are injective, zero is not a successor, and induction. All of arithmetic follows from these.
- [Sequences and Series of Functions](https://theorempath.com/topics/sequences-and-series-of-functions) (tier 2, core): Pointwise vs uniform convergence of function sequences, the Weierstrass M-test, and why uniform convergence preserves continuity. The concept that makes learning theory work.
- [Sorting Algorithms](https://theorempath.com/topics/sorting-algorithms) (tier 2, core): Comparison-based sorting lower bound, quicksort, mergesort, heapsort, and non-comparison sorts. The foundational algorithms behind efficient data processing and search.
- [Type Theory](https://theorempath.com/topics/type-theory) (tier 2, advanced): Types as propositions, terms as proofs. Simply typed lambda calculus, the Curry-Howard correspondence, dependent types, and connections to programming language foundations and formal verification.
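The 2/3 switching probability in the Monty Hall entry is easy to verify by simulation. A minimal sketch (the seed and trial count are arbitrary choices):

```python
import random

def monty_hall_trial(switch, rng):
    """One game: returns True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a goat door that is neither the player's pick nor the car.
    opened = next(d for d in doors if d != pick and d != car)
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(0)
n = 100_000
switch_wins = sum(monty_hall_trial(True, rng) for _ in range(n)) / n
stay_wins = sum(monty_hall_trial(False, rng) for _ in range(n)) / n
```

Over many trials the switching strategy wins close to 2/3 of the time and staying close to 1/3, matching the Bayes-theorem analysis.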
- [Zermelo-Fraenkel Set Theory](https://theorempath.com/topics/zermelo-fraenkel-set-theory) (tier 2, core): The ZFC axioms form the standard foundation for mathematics. Extensionality, pairing, union, power set, infinity, separation, replacement, choice, and foundation prevent paradoxes while being expressive enough for all of modern mathematics.
- [Formal Languages and Automata](https://theorempath.com/topics/formal-languages) (tier 3, core): Regular languages, context-free grammars, pushdown automata, the Chomsky hierarchy, pumping lemmas, and connections to parsing, neural sequence models, and computational complexity.
- [Foundational Dependencies](https://theorempath.com/topics/foundational-dependencies) (tier 3, advanced): Which axiomatic systems does each branch of TheoremPath depend on? A map from content to foundations.
- [P vs NP](https://theorempath.com/topics/p-vs-np) (tier 3, core): A central open problem in computer science: is every problem whose solution can be verified in polynomial time also solvable in polynomial time? Covers P, NP, NP-completeness, reductions, the Cook-Levin theorem, and relevance for ML.
- [Vieta Jumping](https://theorempath.com/topics/vieta-jumping) (tier 3, advanced): A competition number theory technique: given a Diophantine equation in two variables, fix one variable, treat the equation as a quadratic in the other, and use Vieta's formulas to jump to a new integer solution. Repeated jumping produces a contradiction or forces a known base case.

## Layer 0B — Infrastructure

- [Central Limit Theorem](https://theorempath.com/topics/central-limit-theorem) (tier 1, core): The CLT: the sample mean is approximately Gaussian for large n, regardless of the original distribution. Berry-Esseen rate, multivariate CLT, and why CLT explains asymptotic normality of MLE, confidence intervals, and the ubiquity of the Gaussian.
- [Convex Duality](https://theorempath.com/topics/convex-duality) (tier 1, core): Fenchel conjugates, the Fenchel-Moreau theorem, weak and strong duality, KKT conditions, and why duality gives the kernel trick for SVMs, connects regularization to constraints, and enables adversarial formulations in DRO.
- [Cramér-Rao Bound](https://theorempath.com/topics/cramer-rao-bound) (tier 1, core): The fundamental lower bound on the variance of any unbiased estimator: no unbiased estimator can have variance smaller than the reciprocal of the Fisher information.
- [Deep Learning (Goodfellow, Bengio, Courville)](https://theorempath.com/topics/deep-learning-goodfellow-book) (tier 1, core): Reading guide for the Goodfellow, Bengio, Courville textbook (2016). What it covers, which chapters still matter in 2026, what has aged, and how to use it efficiently.
- [Editorial Principles](https://theorempath.com/topics/editorial-principles) (tier 1, core): How TheoremPath treats knowledge, uncertainty, fairness, and systems. Six intellectual lenses with scope conditions: Simon for bounded intelligence, Pearl for causality, Meadows for systems, Ostrom for governance, Rawls for fairness, Taleb for uncertainty.
- [Fisher Information](https://theorempath.com/topics/fisher-information) (tier 1, core): The Fisher information quantifies how much a sample tells you about an unknown parameter: it measures the curvature of the log-likelihood, sets the Cramér-Rao lower bound on estimator variance, and serves as a natural Riemannian metric on parameter space.
- [Law of Large Numbers](https://theorempath.com/topics/law-of-large-numbers) (tier 1, core): The weak and strong laws of large numbers: the sample mean converges to the population mean. Kolmogorov's conditions, the rate of convergence from the CLT, and why LLN justifies using empirical risk as a proxy for population risk.
- [Maximum Likelihood Estimation](https://theorempath.com/topics/maximum-likelihood-estimation) (tier 1, core): MLE: find the parameter that maximizes the likelihood of observed data. Consistency, asymptotic normality, Fisher information, Cramér-Rao efficiency, and when MLE fails.
- [Measure-Theoretic Probability](https://theorempath.com/topics/measure-theoretic-probability) (tier 1, core): The rigorous foundations of probability: sigma-algebras, measures, measurable functions as random variables, Lebesgue integration, and the convergence theorems that make modern probability and statistics possible.
- [Radon-Nikodym and Conditional Expectation](https://theorempath.com/topics/radon-nikodym-and-conditional-expectation) (tier 1, core): The Radon-Nikodym theorem: what 'density' really means. Absolute continuity, the Radon-Nikodym derivative, conditional expectation as a projection, tower property, and why this undergirds likelihood ratios, importance sampling, and KL divergence.
- [Shrinkage Estimation and the James-Stein Estimator](https://theorempath.com/topics/shrinkage-estimation-james-stein) (tier 1, core): In three or more dimensions, the sample mean is inadmissible for estimating a multivariate normal mean. The James-Stein estimator shrinks toward zero and dominates the MLE in total MSE, a result that shocked the statistics world.
- [The Elements of Statistical Learning (Hastie, Tibshirani, Friedman)](https://theorempath.com/topics/elements-of-statistical-learning-book) (tier 1, core): Reading guide for ESL (2009, 2nd edition). The standard graduate statistics/ML textbook. Covers linear methods, trees, boosting, SVMs, ensemble methods. What to read, what to skip, and where it excels.
- [Bayesian Estimation](https://theorempath.com/topics/bayesian-estimation) (tier 2, core): The Bayesian approach to parameter estimation: encode prior beliefs, update with data via Bayes rule, and obtain a full posterior distribution over parameters. Conjugate priors, MAP estimation, and the Bernstein-von Mises theorem, which shows that the posterior concentrates around the true parameter.
- [Functional Analysis Core](https://theorempath.com/topics/functional-analysis-core) (tier 2, advanced): The four pillars of functional analysis: Hahn-Banach (extending functionals), Uniform Boundedness (pointwise bounded implies uniformly bounded), Open Mapping (surjective bounded operators have open images), and Banach-Alaoglu (dual unit ball is weak-* compact). These underpin RKHS theory, optimization in function spaces, and duality.
- [Information Theory Foundations](https://theorempath.com/topics/information-theory-foundations) (tier 2, core): The core of information theory for ML: entropy, cross-entropy, KL divergence, mutual information, data processing inequality, and the chain rules that connect them. The language of variational inference, generalization bounds, and representation learning.
- [Martingale Theory](https://theorempath.com/topics/martingale-theory) (tier 2, advanced): Martingales and their convergence properties: Doob martingales, the optional stopping theorem, martingale convergence, the Azuma-Hoeffding inequality, and the Freedman inequality. The tools behind McDiarmid's inequality, online learning regret bounds, and stochastic approximation.
- [Method of Moments](https://theorempath.com/topics/method-of-moments) (tier 2, core): Match sample moments to population moments to estimate parameters. Simpler than MLE but less efficient. Covers classical MoM, generalized method of moments (GMM), and when MoM is the better choice.
- [Stein's Paradox](https://theorempath.com/topics/steins-paradox) (tier 2, core): In dimension d >= 3, the sample mean is inadmissible for estimating the mean of a multivariate normal under squared error loss. The James-Stein estimator dominates it by shrinking toward zero.
- [Sufficient Statistics and Exponential Families](https://theorempath.com/topics/sufficient-statistics-and-exponential-families) (tier 2, core): Sufficient statistics compress data without losing information about the parameter. The Neyman-Fisher factorization theorem, exponential families, completeness, and Rao-Blackwell improvement of estimators.
- [Asymptotic Statistics](https://theorempath.com/topics/asymptotic-statistics) (tier 3, advanced): The large-sample toolbox: delta method, Slutsky's theorem, asymptotic normality of MLE, local asymptotic normality, and Fisher efficiency. These results justify nearly every confidence interval and hypothesis test used in practice.
- [Basu's Theorem](https://theorempath.com/topics/basu-theorem) (tier 3, advanced): A complete sufficient statistic is independent of every ancillary statistic. This provides the cleanest method for proving independence between statistics without computing joint distributions.
- [Spectral Theory of Operators](https://theorempath.com/topics/spectral-theory-of-operators) (tier 3, advanced): Spectral theorem for compact self-adjoint operators on Hilbert spaces: every such operator has a countable orthonormal eigenbasis with real eigenvalues accumulating only at zero. This is the infinite-dimensional backbone of PCA, kernel methods, and neural tangent kernel theory.

## Layer 1 — Core mathematical objects

- [Activation Functions](https://theorempath.com/topics/activation-functions) (tier 1, core): Nonlinear activation functions in neural networks: sigmoid, tanh, ReLU, Leaky ReLU, GELU, and SiLU. Their gradients, saturation behavior, and impact on trainability.
- [Automatic Differentiation](https://theorempath.com/topics/automatic-differentiation) (tier 1, core): Forward mode computes Jacobian-vector products, reverse mode computes vector-Jacobian products: backpropagation is reverse-mode autodiff, and the asymmetry between the two modes explains why training neural networks is efficient.
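The forward mode described in the Automatic Differentiation entry can be sketched with dual numbers in plain Python. This is a minimal illustration supporting only addition and multiplication; real autodiff systems cover far more operations:

```python
class Dual:
    """Dual number a + b*eps with eps^2 = 0: the value and its derivative
    propagate together, which is exactly forward-mode autodiff."""

    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val, self.val * other.dot + self.dot * other.val)

    __rmul__ = __mul__

def derivative(f, x):
    """Evaluate f'(x) by seeding the dual (derivative) part with 1."""
    return f(Dual(x, 1.0)).dot

# d/dx (x^2 + 3x) at x = 2 is 2*2 + 3 = 7
g = derivative(lambda x: x * x + 3 * x, 2.0)
```

One forward pass with a seeded dual part yields one directional derivative; reverse mode instead yields all input sensitivities of one output, which is why it wins for scalar losses over many parameters.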
- [Chernoff Bounds](https://theorempath.com/topics/chernoff-bounds) (tier 1, core): The Chernoff method: the universal technique for deriving exponential tail bounds by optimizing over the moment generating function, yielding the tightest possible exponential concentration.
- [Concentration Inequalities](https://theorempath.com/topics/concentration-inequalities) (tier 1, core): Bounds on how far random variables deviate from their expectations: Markov, Chebyshev, Hoeffding, and Bernstein. Used throughout generalization theory, bandits, and sample complexity.
- [Conditioning and Condition Number](https://theorempath.com/topics/conditioning-and-condition-number) (tier 1, core): The condition number measures how sensitive a problem is to perturbations in its input. Ill-conditioned matrices turn small errors into catastrophic ones, and understanding conditioning is essential for any computation involving linear algebra.
- [Confusion Matrices and Classification Metrics](https://theorempath.com/topics/confusion-matrices-and-classification-metrics) (tier 1, core): The complete framework for evaluating binary classifiers: confusion matrix entries, precision, recall, F1, specificity, ROC curves, AUC, and precision-recall curves. When to use each metric.
- [Confusion Matrix Deep Dive](https://theorempath.com/topics/confusion-matrix-deep-dive) (tier 1, core): Beyond binary classification: multi-class confusion matrices, micro vs macro averaging, Matthews correlation coefficient, Cohen's kappa, cost-sensitive evaluation, and how to read a confusion matrix without getting the axes wrong.
- [Convex Optimization Basics](https://theorempath.com/topics/convex-optimization-basics) (tier 1, core): Convex sets, convex functions, gradient descent convergence, strong convexity, and duality: the optimization foundation that every learning-theoretic result silently depends on.
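As a sanity check on the Concentration Inequalities entry, the Hoeffding bound can be compared against simulated deviations of a sample mean. A sketch (the distribution, seed, and parameters are arbitrary choices):

```python
import math
import random

def hoeffding_bound(n, t):
    """Hoeffding: P(|sample mean - true mean| >= t) <= 2 exp(-2 n t^2)
    for n i.i.d. draws bounded in [0, 1]."""
    return 2.0 * math.exp(-2.0 * n * t * t)

rng = random.Random(1)
n, t, trials = 200, 0.1, 2000
exceed = 0
for _ in range(trials):
    mean = sum(rng.random() for _ in range(n)) / n  # Uniform(0,1), true mean 1/2
    if abs(mean - 0.5) >= t:
        exceed += 1
empirical = exceed / trials
bound = hoeffding_bound(n, t)  # 2 e^{-4}, about 0.037
```

The empirical exceedance frequency stays below the bound, as it must; Hoeffding is valid but typically loose compared to what the simulation shows.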
- [Cross-Entropy Loss Deep Dive](https://theorempath.com/topics/cross-entropy-loss-deep-dive) (tier 1, core): Why cross-entropy is the correct loss for classification: its derivation as negative log-likelihood, connection to KL divergence, why MSE fails for classification, and practical variants including label smoothing and focal loss.
- [Data Preprocessing and Feature Engineering](https://theorempath.com/topics/data-preprocessing-and-feature-engineering) (tier 1, core): Standardization, scaling, encoding, imputation, and feature selection. Why most algorithms assume centered, scaled inputs and what breaks when you skip preprocessing.
- [Gradient Descent Variants](https://theorempath.com/topics/gradient-descent-variants) (tier 1, core): From full-batch to stochastic to mini-batch gradient descent, plus momentum, Nesterov acceleration, AdaGrad, RMSProp, and Adam. Why mini-batch SGD with momentum is the practical default.
- [Gram Matrices and Kernel Matrices](https://theorempath.com/topics/gram-matrices-and-kernel-matrices) (tier 1, core): The Gram matrix G_{ij} = ⟨x_i, x_j⟩ encodes pairwise inner products of a dataset. Always PSD. Appears in kernel methods, PCA, SVD, and attention. Understanding it connects linear algebra to ML.
- [K-Means Clustering](https://theorempath.com/topics/k-means-clustering) (tier 1, core): Lloyd's algorithm for partitional clustering: the within-cluster sum of squares objective, convergence guarantees, k-means++ initialization, choosing k, and the connection to EM for Gaussians.
- [KL Divergence](https://theorempath.com/topics/kl-divergence) (tier 1, core): Kullback-Leibler divergence measures how one probability distribution differs from another. Asymmetric, always non-negative, and central to variational inference, MLE, and RLHF.
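The asymmetry and non-negativity noted in the KL Divergence entry are easy to see numerically for discrete distributions. A minimal sketch (the example distributions are illustrative):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i log(p_i / q_i) for discrete distributions.
    Terms with p_i = 0 contribute 0; any q_i = 0 where p_i > 0 makes it infinite."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
forward = kl_divergence(p, q)   # KL(p || q)
reverse = kl_divergence(q, p)   # KL(q || p): a different number in general
```

Both directions are non-negative and zero only when the distributions match, but forward and reverse KL penalize mismatches differently, which is why the choice of direction matters in variational inference.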
- [Linear Regression](https://theorempath.com/topics/linear-regression) (tier 1, core): Ordinary least squares as projection, the normal equations, the hat matrix, Gauss-Markov optimality, and the connection to maximum likelihood under Gaussian noise.
- [Log-Probability Computation](https://theorempath.com/topics/log-probability-computation) (tier 1, core): Working in log space prevents underflow when multiplying many small probabilities. The log-sum-exp trick provides a numerically stable way to compute log(sum(exp(x_i))), and it underlies stable softmax, log-likelihoods, and the forward algorithm for HMMs.
- [Logistic Regression](https://theorempath.com/topics/logistic-regression) (tier 1, core): The foundational linear classifier: sigmoid link function, maximum likelihood estimation, cross-entropy loss, gradient derivation, and regularized variants.
- [Loss Functions Catalog](https://theorempath.com/topics/loss-functions-catalog) (tier 1, core): A systematic catalog of loss functions for classification, regression, and representation learning: cross-entropy, MSE, hinge, Huber, focal, KL divergence, and contrastive loss.
- [Matrix Calculus](https://theorempath.com/topics/matrix-calculus) (tier 1, core): The differentiation identities you actually use in ML: derivatives of traces, log-determinants, and quadratic forms with respect to matrices and vectors.
- [Model Evaluation Best Practices](https://theorempath.com/topics/model-evaluation-best-practices) (tier 1, core): Train/validation/test splits, cross-validation, stratified and temporal splits, data leakage, reporting with confidence intervals, and statistical tests for model comparison. Why single-number comparisons are misleading.
- [Newton's Method](https://theorempath.com/topics/newtons-method) (tier 1, core): The gold standard for fast local convergence: use second-order information (the Hessian) to take optimal quadratic steps. Quadratic convergence when it works, catastrophic failure when it doesn't.
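The log-sum-exp trick from the Log-Probability Computation entry, and the stable softmax built on it, look like this in plain Python (a standard sketch, not code from the topic page):

```python
import math

def log_sum_exp(xs):
    """Stable log(sum(exp(x_i))): shifting by the max makes the largest
    exponent exp(0) = 1, so nothing overflows."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def softmax(xs):
    """Stable softmax: exp(x_i - logsumexp(x)) sums to 1 by construction."""
    lse = log_sum_exp(xs)
    return [math.exp(x - lse) for x in xs]

# Naive math.exp(1002.0) overflows float64; the shifted version is fine.
probs = softmax([1000.0, 1001.0, 1002.0])
```

The shift by the maximum changes nothing mathematically (it cancels between numerator and denominator) but keeps every intermediate value in range.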
- [Numerical Stability and Conditioning](https://theorempath.com/topics/numerical-stability) (tier 1, core): Continuous math becomes real only through finite-precision approximation. Condition numbers, backward stability, catastrophic cancellation, and why theorems about reals do not transfer cleanly to floating-point.
- [Overfitting and Underfitting](https://theorempath.com/topics/overfitting-and-underfitting) (tier 1, core): The two failure modes of supervised learning: models that memorize noise versus models too simple to capture signal. Diagnosis via training-validation gaps.
- [PAC Learning Framework](https://theorempath.com/topics/pac-learning-framework) (tier 1, core): The foundational formalization of what it means to learn from data: a concept is PAC-learnable if an algorithm can, with high probability, find a hypothesis that is approximately correct, using a polynomial number of samples.
- [Principal Component Analysis](https://theorempath.com/topics/principal-component-analysis) (tier 1, core): Dimensionality reduction via variance maximization: PCA as eigendecomposition of the covariance matrix, PCA as truncated SVD of the centered data matrix, reconstruction error, and when sample PCA works.
- [Regularization in Practice](https://theorempath.com/topics/regularization-in-practice) (tier 1, core): Practical guide to regularization: L2 (weight decay), L1 (sparsity), dropout, early stopping, data augmentation, and batch normalization. How each constrains model complexity and how to choose between them.
- [Skewness, Kurtosis, and Higher Moments](https://theorempath.com/topics/skewness-kurtosis-and-higher-moments) (tier 1, core): Distribution shape beyond mean and variance: skewness measures tail asymmetry, kurtosis measures tail extremeness, cumulants are the cleaner language, and heavy-tailed distributions break all of these.
- [Softmax and Numerical Stability](https://theorempath.com/topics/softmax-and-numerical-stability) (tier 1, core): The softmax function maps arbitrary reals to a probability distribution. Getting it right numerically (avoiding overflow and underflow) is the first lesson in writing ML code that actually works.
- [Train-Test Split and Data Leakage](https://theorempath.com/topics/train-test-split-and-data-leakage) (tier 1, core): Why you need three data splits, how to construct them correctly, and the common ways information leaks from test data into training. Temporal splits, stratified splits, and leakage detection.
- [Types of Bias in Statistics](https://theorempath.com/topics/types-of-bias-in-statistics) (tier 1, core): A taxonomy of the major biases that corrupt statistical inference: selection bias, measurement bias, survivorship bias, response bias, attrition bias, publication bias, and their consequences for ML.
- [Understanding Machine Learning (Shalev-Shwartz, Ben-David)](https://theorempath.com/topics/understanding-machine-learning-book) (tier 1, core): Reading guide for the definitive learning theory textbook. Covers PAC learning, VC dimension, Rademacher complexity, uniform convergence, stability, online learning, boosting, and regularization with rigorous proofs.
- [Ascent Algorithms and Hill Climbing](https://theorempath.com/topics/ascent-algorithms) (tier 2, core): Gradient ascent, hill climbing, and their failure modes: local optima, plateaus, and ridges. Random restarts and simulated annealing as strategies for escaping local optima.
- [Base Rate Fallacy](https://theorempath.com/topics/base-rate-fallacy) (tier 2, core): Ignoring the prior probability (base rate) when interpreting test results. A 99% accurate test for a 1% prevalence disease gives only 50% positive predictive value.
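The 50% figure in the Base Rate Fallacy entry follows directly from Bayes' theorem. A tiny sketch, reading "99% accurate" as both 99% sensitivity and 99% specificity:

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(condition | positive test) by Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# 99% sensitive and specific test, 1% prevalence: PPV is only 50%.
ppv = positive_predictive_value(0.99, 0.99, 0.01)
```

At 1% prevalence the false positives from the healthy 99% exactly match the true positives from the sick 1%, so a positive result is a coin flip; the same test at 50% prevalence has a PPV of 99%.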
- [Benford's Law](https://theorempath.com/topics/benfords-law) (tier 2, core): The leading digit of naturally occurring numerical data is not uniformly distributed: digit 1 appears about 30% of the time, digit 9 about 5%. This arises from scale invariance and logarithmic density, and has real applications in fraud detection, election auditing, and data integrity checks. - [Class Imbalance and Resampling](https://theorempath.com/topics/class-imbalance-and-resampling) (tier 2, core): When class frequencies differ dramatically, standard accuracy is misleading. Resampling, cost-sensitive learning, and threshold tuning restore meaningful evaluation and training. - [Cramér-Wold Theorem](https://theorempath.com/topics/cramer-wold-theorem) (tier 2, advanced): A multivariate distribution is uniquely determined by all of its one-dimensional projections. This reduces multivariate convergence in distribution to checking univariate projections, and is the standard tool for proving the multivariate CLT. - [Exploratory Data Analysis](https://theorempath.com/topics/exploratory-data-analysis) (tier 2, core): The disciplined practice of looking at data before modeling: summary statistics, distributions, correlations, missing values, outliers, and class balance. You cannot model what you do not understand. - [Fast Fourier Transform](https://theorempath.com/topics/fast-fourier-transform) (tier 2, core): The Cooley-Tukey FFT reduces the discrete Fourier transform from O(n^2) to O(n log n), enabling efficient convolution, spectral methods, and Fourier features for kernel approximation. - [Goodness-of-Fit Tests](https://theorempath.com/topics/goodness-of-fit-tests) (tier 2, core): KS test, Shapiro-Wilk, Anderson-Darling, and the probability integral transform: testing whether data follows a specified distribution, with exact power comparisons and when each test fails.
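The Benford's Law entry above can be checked empirically: powers of 2 are a classic Benford-conforming sequence (because log10(2) is irrational, the fractional parts of n·log10(2) equidistribute). A short stdlib-only sketch comparing the empirical leading-digit frequencies against log10(1 + 1/d):

```python
import math
from collections import Counter

# Leading digits of 2^n for n = 1..3000.
counts = Counter(int(str(2 ** n)[0]) for n in range(1, 3001))
empirical = {d: counts[d] / 3000 for d in range(1, 10)}

# Benford's predicted frequency for leading digit d.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
```

Digit 1 comes out near 0.301 and digit 9 near 0.046, matching the roughly 30% and 5% figures quoted above.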
- [Hardware for ML Practitioners](https://theorempath.com/topics/hardware-for-ml-practitioners) (tier 2, core): Practical hardware guidance for ML work: GPU memory as the real bottleneck, when local GPUs make sense, cloud options compared, and why you should not spend $5000 before knowing what you need. - [K-Nearest Neighbors](https://theorempath.com/topics/knn) (tier 2, core): Classify by majority vote of the k closest training points: no training phase, universal consistency as n and k grow, and the curse of dimensionality that makes distance meaningless in high dimensions. - [Markov Chains and Steady State](https://theorempath.com/topics/markov-chains-and-steady-state) (tier 2, core): Markov chains: the Markov property, transition matrices, stationary distributions, irreducibility, aperiodicity, the ergodic theorem, and mixing time. The backbone of PageRank, MCMC, and reinforcement learning. - [Matrix Multiplication Algorithms](https://theorempath.com/topics/matrix-multiplication-algorithms) (tier 2, core): From naive O(n^3) to Strassen's O(n^{2.807}) to the open question of the true exponent omega. What we know, what we do not, and why it matters for ML. - [ML Project Lifecycle](https://theorempath.com/topics/ml-project-lifecycle) (tier 2, core): The full ML project workflow from problem definition through deployment and monitoring. Why most projects fail at data quality, not model architecture. Cross-functional requirements and MLOps basics. - [Multi-Class and Multi-Label Classification](https://theorempath.com/topics/multi-class-and-multi-label-classification) (tier 2, core): Multi-class (exactly one label, softmax + cross-entropy) vs multi-label (multiple labels, sigmoid + binary cross-entropy). One-vs-rest, one-vs-one, hierarchical classification, and evaluation metrics. 
- [Naive Bayes](https://theorempath.com/topics/naive-bayes) (tier 2, core): The simplest generative classifier: assume conditional independence of features given the class, estimate class-conditional densities, and classify via MAP. Why it works despite the wrong independence assumption. - [Numerical Linear Algebra](https://theorempath.com/topics/numerical-linear-algebra) (tier 2, core): Algorithms for solving linear systems, computing eigenvalues, and factoring matrices. Every linear regression, PCA, and SVD computation depends on these methods. - [Order Statistics](https://theorempath.com/topics/order-statistics) (tier 2, core): Order statistics are the sorted values of a random sample. Their distributions govern quantile estimation, confidence intervals for medians, and the behavior of extremes. - [Perceptron](https://theorempath.com/topics/perceptron) (tier 2, core): Rosenblatt's perceptron (1958): the simplest neural network, the first learning algorithm with a convergence proof, and the lesson that linear separability is both powerful and limiting. - [Rejection Sampling](https://theorempath.com/topics/rejection-sampling) (tier 2, core): The simplest exact sampling method: propose from an envelope distribution and accept or reject to produce exact independent draws from a target, but it is doomed to fail in high dimensions. - [Relational Algebra](https://theorempath.com/topics/relational-algebra) (tier 2, core): The mathematical foundation of SQL and relational databases. Selection, projection, join, set operations, and Codd's theorem connecting algebra to relational calculus. - [Secant Method](https://theorempath.com/topics/secant-method) (tier 2, core): A derivative-free root-finding method that approximates Newton's method using two previous function evaluations, achieving superlinear convergence of order approximately 1.618.
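The secant method entry above fits in a few lines: replace the derivative in Newton's update with the slope through the last two iterates. A minimal sketch (the test function x^2 - 2 is just an illustrative choice):

```python
def secant(f, x0, x1, tol=1e-12, max_iter=100):
    """Derivative-free root finding: Newton's update with f'(x_n)
    replaced by the secant slope (f(x1) - f(x0)) / (x1 - x0)."""
    for _ in range(max_iter):
        f0, f1 = f(x0), f(x1)
        if f1 - f0 == 0:  # flat secant: cannot divide, give up
            break
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    return x1

root = secant(lambda x: x * x - 2.0, 1.0, 2.0)  # converges to sqrt(2)
```

Only one new function evaluation per iteration is needed, which is why the secant method can beat Newton's method in wall-clock time when derivatives are expensive.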
- [Signals and Systems for ML](https://theorempath.com/topics/signals-and-systems-for-ml) (tier 2, core): Linear time-invariant systems, convolution, Fourier transform, and the sampling theorem. The signal processing foundations that underpin CNNs, efficient attention, audio ML, and frequency-domain analysis of training dynamics. - [Simpson's Paradox](https://theorempath.com/topics/simpsons-paradox) (tier 2, core): A trend present in every subgroup can reverse when the subgroups are combined. This happens when a confounding variable determines both group membership and outcome. - [Winsorization](https://theorempath.com/topics/winsorization) (tier 3, core): Clip extreme values to a fixed percentile instead of removing them. Preserves sample size, reduces outlier sensitivity, and improves stability of downstream estimators. ## Layer 2 — Probability, statistics, optimization - [Adam Optimizer](https://theorempath.com/topics/adam-optimizer) (tier 1, core): Adaptive moment estimation: first and second moment tracking, bias correction, AdamW vs Adam+L2, learning rate warmup, and when Adam fails compared to SGD. - [AIC and BIC](https://theorempath.com/topics/aic-and-bic) (tier 1, core): Akaike and Bayesian information criteria for model selection: how they trade off fit versus complexity, when to use each, and their connection to cross-validation. - [Bagging](https://theorempath.com/topics/bagging) (tier 1, core): Bootstrap Aggregation: train B models on bootstrap samples and average their predictions. Why averaging reduces variance, how correlation limits the gain, and out-of-bag error estimation. - [Batch Normalization](https://theorempath.com/topics/batch-normalization) (tier 1, core): Normalize activations within mini-batches to stabilize and accelerate training, then scale and shift with learnable parameters. - [Bellman Equations](https://theorempath.com/topics/bellman-equations) (tier 1, core): The recursive backbone of RL. 
State-value and action-value Bellman equations, the contraction mapping property, convergence of value iteration, and why recursive decomposition is the central idea in sequential decision-making. - [Bootstrap Methods](https://theorempath.com/topics/bootstrap-methods) (tier 1, core): The nonparametric bootstrap: resample with replacement to approximate sampling distributions, construct confidence intervals, and quantify uncertainty without distributional assumptions. - [Bounded Rationality](https://theorempath.com/topics/bounded-rationality) (tier 1, core): Real agents optimize under limited information, limited compute, and limited foresight. Simon's satisficing, heuristics, and the implications for search, planning, and agent design in ML. - [Dropout](https://theorempath.com/topics/dropout) (tier 1, core): Dropout as stochastic regularization: Bernoulli masking, inverted dropout scaling, implicit ensemble interpretation, Bayesian connection via MC dropout, and equivalence to L2 regularization in linear models. - [Empirical Risk Minimization](https://theorempath.com/topics/empirical-risk-minimization) (tier 1, core): The foundational principle of statistical learning: minimize average loss on training data as a proxy for minimizing true population risk. - [Fat Tails and Heavy-Tailed Distributions](https://theorempath.com/topics/fat-tails) (tier 1, advanced): When the tails dominate. Power laws, Pareto distributions, subexponential tails, why the law of large numbers converges slowly or fails, and why most of ML silently assumes thin tails. - [Feedforward Networks and Backpropagation](https://theorempath.com/topics/feedforward-networks-and-backpropagation) (tier 1, core): Feedforward neural networks as compositions of affine transforms and nonlinearities, the universal approximation theorem, and backpropagation as reverse-mode automatic differentiation on the computational graph. 
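The bootstrap entry above is simple enough to sketch directly: resample with replacement, recompute the statistic, and read a percentile confidence interval off the empirical quantiles. A stdlib-only sketch on a small made-up sample (the data values are illustrative, not from any real dataset):

```python
import random

random.seed(0)  # for repeatability of this illustration
data = [2.1, 3.4, 1.8, 5.0, 2.7, 4.2, 3.9, 2.5, 3.1, 4.8]  # hypothetical sample

def bootstrap_ci(sample, stat, n_boot=5000, alpha=0.05):
    """Percentile bootstrap: resample with replacement n_boot times,
    recompute the statistic each time, and take empirical quantiles."""
    reps = sorted(
        stat([random.choice(sample) for _ in sample]) for _ in range(n_boot)
    )
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

mean = lambda xs: sum(xs) / len(xs)
lo, hi = bootstrap_ci(data, mean)  # 95% CI for the mean, no normality assumed
```

The same loop works for medians, correlations, or any plug-in statistic, which is the point of the method: no distributional assumptions, just resampling.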
- [Game Theory Foundations](https://theorempath.com/topics/game-theory) (tier 1, core): Strategic interaction between rational agents. Normal-form games, dominant strategies, Nash equilibrium existence, mixed strategies, and connections to minimax, mechanism design, and multi-agent RL. - [Gauss-Markov Theorem](https://theorempath.com/topics/gauss-markov-theorem) (tier 1, core): Among all linear unbiased estimators, ordinary least squares has the smallest variance. This is the BLUE theorem, and understanding when its assumptions fail is just as important as the result itself. - [Gibbs Sampling](https://theorempath.com/topics/gibbs-sampling) (tier 1, core): The workhorse MCMC algorithm for Bayesian models: sample each variable from its full conditional distribution, cycling through all variables; every proposal is automatically accepted. - [Gradient Boosting](https://theorempath.com/topics/gradient-boosting) (tier 1, advanced): Gradient boosting as functional gradient descent: fit weak learners to pseudo-residuals sequentially, reducing bias at each round. Covers AdaBoost, shrinkage, XGBoost second-order methods, and LightGBM leaf-wise growth. - [Gradient Flow and Vanishing Gradients](https://theorempath.com/topics/gradient-flow-and-vanishing-gradients) (tier 1, core): Why deep networks are hard to train: gradients shrink or explode as they propagate through layers. The Jacobian product chain, sigmoid saturation, ReLU dead neurons, skip connections, normalization, and gradient clipping. - [High-Dimensional Probability (Vershynin)](https://theorempath.com/topics/high-dimensional-probability-book) (tier 1, core): Reading guide for Vershynin's textbook on sub-Gaussian and sub-exponential random variables, concentration inequalities, random matrices, covering numbers, and high-dimensional geometry. The modern reference for probabilistic tools in ML theory.
- [Hypothesis Classes and Function Spaces](https://theorempath.com/topics/hypothesis-classes-and-function-spaces) (tier 1, core): What is a hypothesis class, why the choice of hypothesis class determines what ERM can learn, and the approximation-estimation tradeoff: bigger classes reduce approximation error but increase estimation error. - [Importance Sampling](https://theorempath.com/topics/importance-sampling) (tier 1, core): Estimate expectations under one distribution by sampling from another and reweighting: a technique that is powerful when done right and catastrophically unreliable when done wrong. - [Information Retrieval Foundations](https://theorempath.com/topics/information-retrieval) (tier 1, core): Search as a first-class capability. TF-IDF, BM25, inverted indexes, precision/recall, reranking, and why retrieval is not just vector DB plus embeddings. - [Kalman Filter](https://theorempath.com/topics/kalman-filter) (tier 1, core): Optimal state estimation for linear Gaussian systems via recursive prediction and update steps using the Kalman gain. - [Lasso Regression](https://theorempath.com/topics/lasso-regression) (tier 1, core): OLS with L1 regularization: sparsity, the geometry of why L1 selects variables, proximal gradient descent, LARS, and elastic net. - [Learning Rate Scheduling](https://theorempath.com/topics/learning-rate-scheduling) (tier 1, core): Why learning rate moves the loss more than any other hyperparameter, and how warmup, cosine decay, step decay, cyclic schedules, and 1cycle policy control training dynamics. - [Markov Decision Processes](https://theorempath.com/topics/markov-decision-processes) (tier 1, core): The mathematical framework for sequential decision-making under uncertainty: states, actions, transitions, rewards, and the Bellman equations that make solving them possible. 
- [Metropolis-Hastings Algorithm](https://theorempath.com/topics/metropolis-hastings) (tier 1, core): The foundational MCMC algorithm: construct a Markov chain whose stationary distribution is your target by accepting or rejecting proposed moves according to a carefully chosen ratio. - [Proximal Gradient Methods](https://theorempath.com/topics/proximal-gradient-methods) (tier 1, core): Optimize composite objectives by alternating gradient steps on smooth terms with proximal operators on nonsmooth terms. ISTA and its accelerated variant FISTA. - [Quasi-Newton Methods](https://theorempath.com/topics/quasi-newton-methods) (tier 1, core): Approximate the Hessian instead of computing it: BFGS builds a dense approximation, L-BFGS stores only a few vectors. Superlinear convergence without second derivatives. - [Random Forests](https://theorempath.com/topics/random-forests) (tier 1, core): Random forests combine bagging with random feature subsampling to decorrelate trees, reducing ensemble variance beyond what pure bagging achieves. Out-of-bag estimation, variable importance, consistency theory, and practical strengths and weaknesses. - [Ridge Regression](https://theorempath.com/topics/ridge-regression) (tier 1, core): OLS with L2 regularization: closed-form shrinkage, bias-variance tradeoff, SVD interpretation, and the Bayesian connection to Gaussian priors. - [Sample Complexity Bounds](https://theorempath.com/topics/sample-complexity-bounds) (tier 1, core): How many samples do you need to learn? Tight answers for finite hypothesis classes, VC classes, and Rademacher-bounded classes, plus matching lower bounds via Fano and Le Cam. - [Skip Connections and ResNets](https://theorempath.com/topics/skip-connections-and-resnets) (tier 1, core): Residual connections let gradients flow through identity paths, enabling training of very deep networks. ResNets learn residual functions F(x) = H(x) - x, which is easier than learning H(x) directly. 
- [Stochastic Gradient Descent Convergence](https://theorempath.com/topics/stochastic-gradient-descent-convergence) (tier 1, core): SGD convergence rates for convex and strongly convex functions, the role of noise as both curse and blessing, mini-batch variance reduction, learning rate schedules, and the Robbins-Monro conditions. - [Sub-Exponential Random Variables](https://theorempath.com/topics/subexponential-random-variables) (tier 1, core): The distributional class between sub-Gaussian and heavy-tailed: heavier tails than Gaussian, the psi_1 norm, Bernstein condition, and the two-regime concentration bound. - [Sub-Gaussian Random Variables](https://theorempath.com/topics/subgaussian-random-variables) (tier 1, core): Sub-Gaussian random variables: the precise characterization of 'light-tailed' behavior that underpins every concentration inequality in learning theory. - [Support Vector Machines](https://theorempath.com/topics/support-vector-machines) (tier 1, core): Maximum-margin classifiers via convex optimization: hard margin, soft margin with slack variables, hinge loss, the dual formulation, and the kernel trick. - [The EM Algorithm](https://theorempath.com/topics/em-algorithm) (tier 1, core): Expectation-Maximization: the principled way to do maximum likelihood when some variables are unobserved. Derives the ELBO, proves monotonic convergence, and shows why EM is the backbone of latent variable models. - [Uniform Convergence](https://theorempath.com/topics/uniform-convergence) (tier 1, core): Uniform convergence of empirical risk to population risk over an entire hypothesis class: the key property that makes ERM provably work. - [Universal Approximation Theorem](https://theorempath.com/topics/universal-approximation-theorem) (tier 1, core): A single hidden layer neural network can approximate any continuous function on a compact set to arbitrary accuracy. 
Why this is both important and misleading: it says nothing about width, weight-finding, or generalization. - [Value Iteration and Policy Iteration](https://theorempath.com/topics/value-iteration-and-policy-iteration) (tier 1, core): The two foundational algorithms for solving MDPs exactly: value iteration applies the Bellman optimality operator until convergence, while policy iteration alternates between exact evaluation and greedy improvement. - [VC Dimension](https://theorempath.com/topics/vc-dimension) (tier 1, core): The Vapnik-Chervonenkis dimension: a combinatorial measure of hypothesis class complexity that characterizes learnability in binary classification. - [Weight Initialization](https://theorempath.com/topics/weight-initialization) (tier 1, core): Why initialization determines whether gradients vanish or explode, and how Xavier and He initialization preserve variance across layers. - [AdaBoost](https://theorempath.com/topics/adaboost) (tier 2, core): AdaBoost as iterative reweighting of misclassified samples, exponential loss minimization, weak-to-strong learner amplification, margin theory, and the connection to coordinate descent. - [Anomaly Detection](https://theorempath.com/topics/anomaly-detection) (tier 2, core): Methods for identifying data points that deviate from expected patterns: isolation forests, one-class SVMs, autoencoders, statistical distances, and why the absence of anomaly labels makes this problem structurally harder than classification. - [Arrow's Impossibility Theorem](https://theorempath.com/topics/arrows-impossibility) (tier 2, core): No ranked voting system with three or more candidates can satisfy all of Arrow's fairness axioms simultaneously. Arrow's theorem, the Gibbard-Satterthwaite extension, and connections to social choice, mechanism design, and preference aggregation in ML.
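The value iteration entry above reduces to one loop: apply the Bellman optimality operator until the contraction converges. A minimal sketch on a made-up 2-state, 2-action MDP (all transition probabilities and rewards here are illustrative):

```python
# Toy MDP: P[s][a] is a list of (prob, next_state, reward) triples.
# Action 1 always moves to state 1 and pays reward 1; action 0 pays 0.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
gamma = 0.9  # discount factor

V = [0.0, 0.0]
for _ in range(200):  # the Bellman operator is a gamma-contraction
    V = [
        max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        for s in P
    ]
```

For this toy problem the optimal policy always takes action 1, so both state values converge to 1 / (1 - 0.9) = 10, the discounted sum of a unit reward per step.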
- [Augmented Lagrangian and ADMM](https://theorempath.com/topics/augmented-lagrangian-and-admm) (tier 2, advanced): The augmented Lagrangian adds a quadratic penalty to enforce constraints while preserving exact solutions. ADMM splits problems into tractable subproblems solved alternately, enabling distributed and non-smooth optimization for Lasso, matrix completion, and large-scale ML. - [Autoencoders](https://theorempath.com/topics/autoencoders) (tier 2, advanced): Encoder-decoder architectures for unsupervised representation learning: undercomplete bottlenecks, sparse and denoising variants, and the connection between linear autoencoders and PCA. - [Batch Size and Learning Dynamics](https://theorempath.com/topics/batch-size-and-learning-dynamics) (tier 2, core): How batch size affects what SGD finds: gradient noise, implicit regularization, the linear scaling rule, sharp vs flat minima, and the gradient noise scale as the key quantity governing the tradeoff. - [Bayesian Optimization for Hyperparameters](https://theorempath.com/topics/bayesian-optimization-for-hyperparameters) (tier 2, advanced): Hyperparameter tuning as black-box optimization with a Gaussian process surrogate. Acquisition functions (EI, UCB, PI), sample efficiency for expensive evaluations, TPE as an alternative, and why grid search is wasteful. - [Bayesian State Estimation](https://theorempath.com/topics/bayesian-state-estimation) (tier 2, core): The filtering problem: recursively estimate a hidden state from noisy observations using predict-update cycles. Kalman filter for linear Gaussian systems, particle filters for the general case. - [Bias-Variance Tradeoff](https://theorempath.com/topics/bias-variance-tradeoff) (tier 2, core): The classical decomposition of mean squared error into bias squared, variance, and irreducible noise. The U-shaped test error curve, why it breaks in modern ML (double descent), and the connection to regularization. 
- [Burn-in and Convergence Diagnostics](https://theorempath.com/topics/burn-in-convergence-diagnostics) (tier 2, core): Knowing when an MCMC chain has reached stationarity and when to trust its samples. Burn-in, Gelman-Rubin R-hat, effective sample size, trace plots, and autocorrelation. - [Computer Architecture for ML](https://theorempath.com/topics/computer-architecture-for-ml) (tier 2, core): Memory movement often matters more than FLOPs. Memory hierarchy, bandwidth vs. compute, roofline model, GPU architecture, kernel fusion, and why parameter count alone does not tell the computational story. - [Conjugate Gradient Methods](https://theorempath.com/topics/conjugate-gradient-methods) (tier 2, advanced): Solving Ax = b when A is symmetric positive definite by choosing A-conjugate search directions. Exact convergence in n steps, condition-number-dependent rates, nonlinear CG variants, and the connection to Krylov subspaces. - [Convex Tinkering](https://theorempath.com/topics/convex-tinkering) (tier 2, core): Taleb's concept applied to ML research: designing small experiments with bounded downside and unbounded upside, and why this strategy dominates scale-first approaches under uncertainty. - [Coordinate Descent](https://theorempath.com/topics/coordinate-descent) (tier 2, core): Optimize by updating one coordinate (or block) at a time while holding others fixed. The default solver for Lasso because each coordinate update has a closed-form solution. - [Cross-Validation Theory](https://theorempath.com/topics/cross-validation-theory) (tier 2, core): The theory behind cross-validation as a model selection tool: LOO-CV, K-fold, the bias-variance tradeoff of the CV estimator itself, and why CV estimates generalization error. - [Cryptographic Hash Functions](https://theorempath.com/topics/hash-functions) (tier 2, core): One-way compression of arbitrary data to fixed-length digests. 
Collision resistance, preimage resistance, the birthday attack bound, Merkle-Damgard construction, and applications from digital signatures to ML model fingerprinting. - [Data Augmentation Theory](https://theorempath.com/topics/data-augmentation-theory) (tier 2, core): Why data augmentation works as a regularizer: invariance injection, effective sample size, MixUp, CutMix, and the connection to Vicinal Risk Minimization. - [Decision Theory Foundations](https://theorempath.com/topics/decision-theory-foundations) (tier 2, core): The mathematical framework for rational choice. States, actions, consequences, Savage axioms, subjective probability, and the bridge between probability theory, utility theory, and statistical decision theory. - [Decision Trees and Ensembles](https://theorempath.com/topics/decision-trees-and-ensembles) (tier 2, advanced): Greedy recursive partitioning with splitting criteria, pruning, and why combining weak learners via bagging (random forests) and boosting (gradient boosting) yields strong predictors. - [Design-Based vs. Model-Based Inference](https://theorempath.com/topics/design-based-vs-model-based-inference) (tier 2, core): Two philosophies of statistical inference from survey data: design-based inference where randomness comes from the sampling design, and model-based inference where randomness comes from a statistical model, with the model-assisted hybrid approach. - [Detection Theory](https://theorempath.com/topics/detection-theory) (tier 2, advanced): Binary hypothesis testing, the Neyman-Pearson lemma (likelihood ratio tests are most powerful), ROC curves, Bayesian detection, and sequential testing. Classification IS detection theory. ROC/AUC comes directly from here. 
- [Dimensionality Reduction Theory](https://theorempath.com/topics/dimensionality-reduction-theory) (tier 2, core): Why and how to reduce dimensions: the curse of dimensionality, PCA, random projections (JL lemma), t-SNE, UMAP, and when each method preserves the structure you care about. - [Distributional Semantics](https://theorempath.com/topics/distributional-semantics) (tier 2, core): You shall know a word by the company it keeps. The distributional hypothesis, co-occurrence matrices, PMI, SVD-based embeddings, and the mathematical bridge from linguistics to word vectors and transformers. - [Elastic Net](https://theorempath.com/topics/elastic-net) (tier 2, core): Combining L1 and L2 penalties: elastic net gets sparsity from lasso and grouping stability from ridge, solving the failure mode where lasso arbitrarily selects among correlated features. - [Ensemble Methods Theory](https://theorempath.com/topics/ensemble-methods-theory) (tier 2, core): Why combining multiple models outperforms any single model: bias-variance decomposition for ensembles, diversity conditions, and the theoretical foundations of bagging, boosting, and stacking. - [Evaluation Metrics and Properties](https://theorempath.com/topics/evaluation-metrics-and-properties) (tier 2, core): The metrics that determine whether a model is good: accuracy, precision, recall, F1, AUC-ROC, log loss, RMSE, calibration, and proper scoring rules. Why choosing the right metric matters more than improving the wrong one. - [Expected Utility Theory](https://theorempath.com/topics/expected-utility) (tier 2, core): The axiomatic foundation of rational choice under uncertainty. Von Neumann-Morgenstern utility, the independence axiom, risk aversion from concavity, and where the theory breaks (Allais paradox, prospect theory). 
- [Exploration vs Exploitation](https://theorempath.com/topics/exploration-vs-exploitation) (tier 2, core): The fundamental tradeoff in sequential decision-making: exploit known good actions to collect reward now, or explore uncertain actions to discover potentially better strategies. Epsilon-greedy, Boltzmann exploration, UCB, count-based methods, and intrinsic motivation. - [Feature Importance and Interpretability](https://theorempath.com/topics/feature-importance-and-interpretability) (tier 2, core): Methods for attributing model predictions to input features: permutation importance, SHAP values, LIME, partial dependence, and why none of these imply causality. - [Gaussian Mixture Models and EM](https://theorempath.com/topics/gaussian-mixture-models-and-em) (tier 2, core): GMMs as soft clustering with per-component Gaussians: EM derivation (E-step responsibilities, M-step parameter updates), convergence guarantees, model selection with BIC/AIC, and the connection to k-means as the hard-assignment limit. - [Generalized Additive Models](https://theorempath.com/topics/generalized-additive-models) (tier 2, core): GAMs: y = alpha + sum f_j(x_j) where each f_j is a smooth function. Interpretable nonlinear regression with backfitting, P-splines, and partial effect plots. - [Hypothesis Testing for ML](https://theorempath.com/topics/hypothesis-testing-for-ml) (tier 2, core): Null and alternative hypotheses, p-values, confidence intervals, error types, multiple comparisons, and proper statistical tests for comparing ML models. - [Implicit Differentiation](https://theorempath.com/topics/implicit-differentiation) (tier 2, advanced): Differentiating through implicit equations and optimization problems: the implicit function theorem gives dy/dx without solving for y explicitly. Applications to bilevel optimization, deep equilibrium models, hyperparameter optimization, and meta-learning. 
- [Kelly Criterion](https://theorempath.com/topics/kelly-criterion) (tier 2, core): The mathematically optimal bet size. Maximize expected log wealth, the Kelly fraction, connections to information theory and Shannon, and why full Kelly is often too aggressive in practice. - [Kolmogorov Complexity and MDL](https://theorempath.com/topics/kolmogorov-complexity-and-mdl) (tier 2, core): Kolmogorov complexity measures the shortest program that produces a string. The Minimum Description Length principle selects models that compress data best, providing a computable approximation to an incomputable ideal. - [Label Smoothing and Regularization](https://theorempath.com/topics/label-smoothing-and-regularization) (tier 2, core): Replace hard targets with soft targets to prevent overconfidence: the label smoothing formula, its connection to maximum entropy regularization, calibration effects, and when it hurts. - [Line Search Methods](https://theorempath.com/topics/line-search-methods) (tier 2, core): Choose step sizes that guarantee sufficient decrease and curvature conditions. Armijo and Wolfe conditions, backtracking, and why step size selection makes or breaks gradient descent. - [Meta-Analysis](https://theorempath.com/topics/meta-analysis) (tier 2, core): Combining results from multiple studies: fixed-effect and random-effects models, heterogeneity measures, publication bias, and why this matters for ML benchmarking. - [Minimax and Saddle Points](https://theorempath.com/topics/minimax-saddle-points) (tier 2, core): Minimax theorems characterize when max-min equals min-max. Saddle points arise in zero-sum games, duality theory, GANs, and robust optimization. - [Multi-Armed Bandits Theory](https://theorempath.com/topics/multi-armed-bandits-theory) (tier 2, core): The exploration-exploitation tradeoff formalized: K arms, regret as the cost of not knowing the best arm, and algorithms (UCB, Thompson sampling) that achieve near-optimal regret bounds. 
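The Kelly criterion entry above has a closed form for a binary bet, worth writing out: with win probability p and net odds b (win b per unit staked), maximizing expected log wealth gives f* = (bp - (1-p)) / b. A minimal sketch:

```python
def kelly_fraction(p, b):
    """Optimal fraction of bankroll to stake on a binary bet:
    win probability p, net odds b (win b per unit staked).
    f* = (b*p - (1-p)) / b; a negative value means the edge is
    against you and the optimal bet is zero."""
    return (b * p - (1 - p)) / b

f = kelly_fraction(0.6, 1.0)  # even-odds bet, 60% win probability -> stake 20%
```

As the entry notes, practitioners often bet a fraction of this ("half Kelly") because full Kelly's log-wealth optimality comes with severe drawdowns.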
- [Nash Equilibrium](https://theorempath.com/topics/nash-equilibrium) (tier 2, core): No player can improve by unilateral deviation. Existence via Kakutani fixed-point theorem, computation complexity (PPAD-completeness), refinements, and why Nash equilibria can be inefficient (price of anarchy). - [Natural Language Processing Foundations](https://theorempath.com/topics/natural-language-processing-foundations) (tier 2, core): The progression from bag-of-words to transformers: tokenization, language modeling, TF-IDF, sequence-to-sequence, attention, and why the pre-train then fine-tune paradigm replaced task-specific architectures. - [Neyman-Pearson and Hypothesis Testing Theory](https://theorempath.com/topics/neyman-pearson-and-hypothesis-testing-theory) (tier 2, core): The likelihood ratio test is the most powerful test for simple hypotheses (Neyman-Pearson lemma), UMP tests extend this to one-sided composites, and the power function characterizes a test's behavior across the parameter space. - [Nonresponse and Missing Data](https://theorempath.com/topics/nonresponse-and-missing-data) (tier 2, core): The taxonomy of missingness mechanisms (MCAR, MAR, MNAR), their consequences for inference, and the major correction methods: multiple imputation, inverse probability weighting, and the EM algorithm. - [P-Hacking and Multiple Testing](https://theorempath.com/topics/p-hacking-and-multiple-testing) (tier 2, core): How selective reporting and multiple comparisons inflate false positive rates, and how Bonferroni and Benjamini-Hochberg corrections control them. Why hyperparameter tuning is multiple testing and benchmark shopping is p-hacking. - [PageRank Algorithm](https://theorempath.com/topics/pagerank-algorithm) (tier 2, core): PageRank as the stationary distribution of a random walk on a graph: damping factor, power iteration, eigenvector interpretation, and applications beyond web search. 
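The PageRank entry above can be sketched as plain power iteration on a tiny made-up link graph (the graph and damping factor 0.85 are illustrative; every node here has at least one outlink, so no dangling-node correction is needed):

```python
# node -> list of outbound links; a hypothetical 4-page web.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n, d = 4, 0.85  # number of nodes, damping factor

pr = [1.0 / n] * n  # start from the uniform distribution
for _ in range(100):  # power iteration toward the stationary distribution
    new = [(1 - d) / n] * n  # teleportation mass
    for node, outs in links.items():
        for dest in outs:
            new[dest] += d * pr[node] / len(outs)  # split rank over outlinks
    pr = new
```

Node 2, which is linked from three of the four pages, ends up with the highest rank, and the scores remain a probability distribution throughout because rank mass is conserved.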
- [Projected Gradient Descent](https://theorempath.com/topics/projected-gradient-descent) (tier 2, core): Constrained convex optimization by alternating gradient steps with projections onto the feasible set. Same convergence rates as unconstrained gradient descent when projections are cheap.
- [Proper Scoring Rules](https://theorempath.com/topics/proper-scoring-rules) (tier 2, advanced): A scoring rule is proper if the expected score is maximized when the forecaster reports their true belief. Log score and Brier score are strictly proper. Accuracy is not. Why this matters for calibrated probability estimates.
- [Public-Key Cryptography](https://theorempath.com/topics/public-key-cryptography) (tier 2, core): RSA, Diffie-Hellman, discrete logarithm, trapdoor one-way functions, and the number-theoretic assumptions underlying modern encryption. The mathematical bridge between algebra and practical security.
- [Q-Learning](https://theorempath.com/topics/q-learning) (tier 2, core): Model-free, off-policy value learning: the Q-learning update rule, convergence under Robbins-Monro conditions, and the deep Q-network revolution that introduced function approximation, experience replay, and the deadly triad.
- [Rao-Blackwellization](https://theorempath.com/topics/rao-blackwellization) (tier 2, core): The Rao-Blackwell theorem: conditioning an estimator on a sufficient statistic reduces variance without increasing bias. In MCMC, this means replacing sample averages with conditional expectations for lower-variance estimates at no extra sampling cost.
- [Recommender Systems](https://theorempath.com/topics/recommender-systems) (tier 2, core): User-item interaction modeling via matrix factorization, collaborative filtering, and content-based methods: the math of SVD-based recommendations, cold start, implicit feedback, and why evaluation is harder than the model.
- [Regularization Theory](https://theorempath.com/topics/regularization-theory) (tier 2, core): Why unconstrained ERM overfits and how regularization controls complexity: Tikhonov (L2), sparsity (L1), elastic net, early stopping, dropout, the Bayesian prior connection, and the link to algorithmic stability.
- [Reproducibility and Experimental Rigor](https://theorempath.com/topics/reproducibility-and-experimental-rigor) (tier 2, core): What it takes to make ML experiments truly reproducible: seeds, variance reporting, data hygiene, configuration management, and the discipline of multi-run evaluation.
- [Sample Size Determination](https://theorempath.com/topics/sample-size-determination) (tier 2, core): How to compute the number of observations needed to estimate means, proportions, and treatment effects with specified precision and power, including corrections for finite populations and complex designs.
- [SAT, SMT, and Automated Reasoning](https://theorempath.com/topics/sat-smt-and-automated-reasoning) (tier 2, advanced): SAT solvers decide Boolean satisfiability (NP-complete). SMT solvers extend SAT with theories like arithmetic and arrays. These tools verify constraints, discharge proof obligations, and complement LLMs in AI agent pipelines.
- [Signal Detection Theory](https://theorempath.com/topics/signal-detection-theory) (tier 2, core): The mathematical framework for binary decisions under noise. ROC curves, d-prime, likelihood ratios, the Neyman-Pearson lemma connection, and why SDT is the foundation of both psychophysics and ML classification evaluation.
- [Spectral Clustering](https://theorempath.com/topics/spectral-clustering) (tier 2, core): Clustering via the eigenvectors of a graph Laplacian: embed data using the bottom eigenvectors, then run k-means in the embedding space. Finds non-convex clusters that k-means alone cannot.
- [Stability and Optimization Dynamics](https://theorempath.com/topics/stability-and-optimization-dynamics) (tier 2, advanced): Convergence of gradient descent for smooth and strongly convex objectives, the descent lemma, gradient flow as a continuous-time limit, Lyapunov stability analysis, and the edge of stability phenomenon.
- [Statistical Significance and Multiple Comparisons](https://theorempath.com/topics/statistical-significance-and-multiple-comparisons) (tier 2, core): p-values, significance levels, confidence intervals, and the multiple comparisons problem. Bonferroni correction, Benjamini-Hochberg FDR control, and why these matter for model selection and benchmark evaluation.
- [Stochastic Approximation Theory](https://theorempath.com/topics/stochastic-approximation-theory) (tier 2, advanced): The Robbins-Monro framework, ODE method, and Polyak-Ruppert averaging: the unified theory behind why SGD, Q-learning, and TD-learning converge.
- [Survey Sampling Methods](https://theorempath.com/topics/survey-sampling-methods) (tier 2, core): The major probability sampling designs used in survey statistics: simple random, stratified, cluster, systematic, multi-stage, and multi-phase sampling, with their variance properties and estimators.
- [t-SNE and UMAP](https://theorempath.com/topics/tsne-and-umap) (tier 2, core): Two dominant nonlinear dimensionality reduction methods: t-SNE preserves local neighborhoods via KL divergence with a Student-t kernel; UMAP uses fuzzy simplicial sets and cross-entropy. Both excel at visualization but have important limitations.
- [Temporal Difference Learning](https://theorempath.com/topics/td-learning) (tier 2, core): Temporal difference methods bootstrap value estimates from other value estimates, enabling online, incremental learning without waiting for episode termination. TD(0), SARSA, and TD(lambda) with eligibility traces.
- [Time Series Forecasting Basics](https://theorempath.com/topics/time-series-forecasting-basics) (tier 2, core): Stationarity, autocorrelation, AR, MA, ARIMA, exponential smoothing, and why classical methods still beat deep learning on many forecasting benchmarks.
- [Trust Region Methods](https://theorempath.com/topics/trust-region-methods) (tier 2, advanced): Build a local quadratic model of the objective, minimize it within a trusted ball, and adjust the ball size based on prediction quality. Trust region methods provide convergence guarantees even for non-convex problems and inspired TRPO in reinforcement learning.
- [Variance Reduction Techniques](https://theorempath.com/topics/variance-reduction-techniques) (tier 2, core): Get the same accuracy with fewer samples by exploiting correlation, known quantities, and stratification. Antithetic variates, control variates, stratification, and Rao-Blackwellization.
- [Von Neumann Minimax Theorem](https://theorempath.com/topics/minimax-theorem) (tier 2, core): In finite two-person zero-sum games, max-min equals min-max. The fundamental theorem of game theory, its proof via linear programming duality, and connections to adversarial robustness, GANs, and online learning.
- [Whitening and Decorrelation](https://theorempath.com/topics/whitening-and-decorrelation) (tier 2, core): Transform data to have identity covariance, removing correlations and normalizing scales. ZCA and PCA whitening, why whitening helps optimization, and connections to batch normalization.
- [Word Embeddings](https://theorempath.com/topics/word-embeddings) (tier 2, core): Dense vector representations of words: Word2Vec (skip-gram, CBOW), negative sampling, GloVe, the distributional hypothesis, and why embeddings transformed NLP from sparse features to learned representations.
- [XGBoost](https://theorempath.com/topics/xgboost) (tier 2, core): XGBoost as second-order gradient boosting: Taylor expansion of the loss, regularized objective, optimal leaf weights, split gain formula, and the system optimizations that made it dominant on tabular data.
- [Adaptive Rejection Sampling](https://theorempath.com/topics/adaptive-rejection-sampling) (tier 3, advanced): For log-concave densities, build a piecewise linear upper hull of log f(x) that tightens automatically with each evaluation. Sample from the envelope and accept/reject.
- [Boltzmann Machines and Hopfield Networks](https://theorempath.com/topics/boltzmann-machines-and-hopfield-networks) (tier 3, core): Energy-based models for associative memory and generative learning: Hopfield networks store patterns via energy minimization, Boltzmann machines add stochasticity and hidden units, and RBMs enable tractable learning through contrastive divergence.
- [Cubist and Model Trees](https://theorempath.com/topics/cubist-and-model-trees) (tier 3, core): M5 model trees put linear regression at each leaf of a decision tree. Cubist extends this with rule simplification and smoothing. A useful hybrid of interpretability and prediction power.
- [Curriculum Learning](https://theorempath.com/topics/curriculum-learning) (tier 3, core): Train on easy examples first, gradually increase difficulty. Curriculum learning can speed convergence and improve generalization, but defining difficulty is the hard part. Self-paced learning, anti-curriculum, and the connection to importance sampling.
- [Experiment Tracking and Tooling](https://theorempath.com/topics/experiment-tracking-and-tooling) (tier 3, core): MLflow, Weights and Biases, TensorBoard, Hydra, and the discipline of recording everything. What to log, how to compare experiments, and why you cannot reproduce what you did not record.
- [Griddy Gibbs Sampling](https://theorempath.com/topics/griddy-gibbs) (tier 3, core): When Gibbs sampling encounters a non-conjugate full conditional, approximate it on a grid and sample from the piecewise constant or piecewise linear approximation. Simple, effective for smooth univariate conditionals.
- [Logspline Density Estimation](https://theorempath.com/topics/logsplines) (tier 3, advanced): Model the log-density as a spline, then normalize to get a smooth, positive density estimate. Connection to exponential families, knot selection by BIC, and flexible nonparametric density estimation.
- [MARS (Multivariate Adaptive Regression Splines)](https://theorempath.com/topics/mars-multivariate-adaptive-regression-splines) (tier 3, core): MARS: automatically discover nonlinear relationships using piecewise linear hinge functions, forward-backward selection, and a direct connection to ReLU networks.
- [Model Theory Basics](https://theorempath.com/topics/model-theory-basics) (tier 3, advanced): Model theory separates syntax (formulas, proofs) from semantics (structures, truth). Soundness: provable implies true. Completeness: true in all models implies provable. Compactness and Lowenheim-Skolem reveal that first-order logic cannot pin down infinite structures uniquely.
- [NMF (Nonnegative Matrix Factorization)](https://theorempath.com/topics/nmf-nonnegative-matrix-factorization) (tier 3, core): Factor V into W*H with all entries nonnegative: parts-based additive representation, multiplicative update rules, and applications to topic modeling and image decomposition.
- [Proof Theory and Cut-Elimination](https://theorempath.com/topics/proof-theory-and-cut-elimination) (tier 3, advanced): Proof theory studies the structure of proofs as mathematical objects. The cut-elimination theorem (Gentzen's Hauptsatz) shows that every proof using lemmas can be transformed into a direct proof. This connects to normalization in type theory and tactic design in proof assistants.
- [Self-Organizing Maps](https://theorempath.com/topics/self-organizing-maps) (tier 3, core): Kohonen networks: competitive learning on a grid that produces topology-preserving mappings from high-dimensional input to low-dimensional discrete maps.
- [Slice Sampling](https://theorempath.com/topics/slice-sampling) (tier 3, core): Slice sampling draws from a target distribution by uniformly sampling from the region under its density curve. It introduces an auxiliary variable to avoid tuning proposal distributions, unlike random-walk Metropolis-Hastings.
- [Squeezed Rejection Sampling](https://theorempath.com/topics/squeezed-rejection-sampling) (tier 3, core): An optimization of rejection sampling that adds a cheap lower bound (squeeze function) to avoid expensive target density evaluations when the sample clearly falls in the accept or reject region.
- [Statistical Paradoxes Collection](https://theorempath.com/topics/statistical-paradoxes-collection) (tier 3, advanced): A curated collection of statistical paradoxes that trip up practitioners: Lindley's paradox, Lord's paradox, Freedman's paradox, Hand's paradox, and the low birth weight paradox. Each with statement, mechanism, and lesson.
- [Tabu Search](https://theorempath.com/topics/tabu-search) (tier 3, core): Local search with memory: maintain a list of recently visited solutions to prevent cycling, use aspiration criteria to override the tabu when a move leads to a new best, and balance intensification against diversification.
- [Wavelet Smoothing](https://theorempath.com/topics/wavelet-smoothing) (tier 3, advanced): Wavelet transforms decompose signals into localized frequency components. Thresholding wavelet coefficients denoises signals while adapting to local smoothness, achieving minimax-optimal rates over Besov spaces.
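Several entries in this layer describe concrete algorithms; as one illustration, the power iteration named in the PageRank Algorithm entry above fits in a few lines. This is a minimal sketch assuming NumPy: the function name `pagerank`, the adjacency convention, and the damping default of 0.85 are illustrative choices, not content from the topic page.

```python
import numpy as np

def pagerank(adj, d=0.85, tol=1e-10):
    """Power iteration for PageRank: r = d * P @ r + (1 - d) / n, where P is
    the column-stochastic transition matrix of the link graph.
    Convention (illustrative): adj[i, j] = 1 means page j links to page i."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=0)                        # out-degree of each page
    safe = np.where(out_deg > 0, out_deg, 1.0)       # avoid division by zero
    P = np.where(out_deg > 0, adj / safe, 1.0 / n)   # dangling pages jump uniformly
    r = np.full(n, 1.0 / n)                          # start from the uniform distribution
    while True:
        r_next = d * (P @ r) + (1.0 - d) / n
        if np.abs(r_next - r).sum() < tol:           # L1 convergence check
            return r_next
        r = r_next
```

On a directed 3-node cycle the stationary distribution is uniform, so the iteration converges immediately; the damping term guarantees convergence in general because the update is a contraction with factor d.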
## Layer 3 — Learning theory, ML methods

- [Algorithmic Stability](https://theorempath.com/topics/algorithmic-stability) (tier 1, advanced): Algorithmic stability provides generalization bounds by analyzing how much a learning algorithm's output changes when a single training example is replaced: a structurally different lens from complexity-based approaches.
- [Causal Inference and the Ladder of Causation](https://theorempath.com/topics/causal-inference-pearl) (tier 1, advanced): Pearl's three-level hierarchy: association, intervention, counterfactual. Structural causal models, do-calculus, the adjustment formula, and why prediction is not causation.
- [Epsilon-Nets and Covering Numbers](https://theorempath.com/topics/epsilon-nets-and-covering-numbers) (tier 1, core): Discretizing infinite sets for concentration arguments: epsilon-nets, covering numbers, packing numbers, the Dudley integral, and the connection to Rademacher complexity.
- [Fine-Tuning and Adaptation](https://theorempath.com/topics/fine-tuning-and-adaptation) (tier 1, core): Methods for adapting pretrained models to new tasks: full fine-tuning, feature extraction, LoRA, QLoRA, adapter layers, and prompt tuning. When to use each, and how catastrophic forgetting constrains adaptation.
- [Matrix Concentration](https://theorempath.com/topics/matrix-concentration) (tier 1, advanced): Matrix Bernstein, Matrix Hoeffding, Weyl's inequality, and Davis-Kahan: the operator-norm concentration tools needed for covariance estimation, dimensionality reduction, and spectral analysis in high dimensions.
- [McDiarmid's Inequality](https://theorempath.com/topics/mcdiarmids-inequality) (tier 1, advanced): The bounded-differences inequality: if changing one input to a function changes the output by at most c_i, the function concentrates around its mean with sub-Gaussian tails.
- [Optimizer Theory: SGD, Adam, and Muon](https://theorempath.com/topics/optimizer-theory-sgd-adam-muon) (tier 1, advanced): Convergence theory of SGD (convex and strongly convex), momentum methods (Polyak and Nesterov), Adam as adaptive + momentum, why SGD can generalize better, the Muon optimizer, and learning rate schedules.
- [Policy Gradient Theorem](https://theorempath.com/topics/policy-gradient-theorem) (tier 1, advanced): The fundamental result enabling gradient-based optimization of parameterized policies: the policy gradient theorem and the algorithms it spawns.
- [Rademacher Complexity](https://theorempath.com/topics/rademacher-complexity) (tier 1, advanced): A data-dependent measure of hypothesis class complexity that gives tighter generalization bounds than VC dimension by measuring how well the class can fit random noise on the actual data.
- [Reward Design and Reward Misspecification](https://theorempath.com/topics/reward-design) (tier 1, advanced): The hardest problem in RL: specifying what you want. Reward shaping, potential-based shaping theorem, specification gaming, Goodhart's law in RL, and the bridge from classic RL to alignment.
- [Symmetrization Inequality](https://theorempath.com/topics/symmetrization-inequality) (tier 1, advanced): The symmetrization technique: the proof template that connects the generalization gap to Rademacher complexity by introducing a ghost sample and random signs.
- [The Bitter Lesson](https://theorempath.com/topics/bitter-lesson) (tier 1, core): Sutton's meta-principle: scalable general methods that exploit computation tend to beat hand-crafted domain-specific approaches in the long run. Search and learning win; brittle cleverness loses.
- [Variational Autoencoders](https://theorempath.com/topics/variational-autoencoders) (tier 1, advanced): Deriving the ELBO, the reparameterization trick for backpropagation through sampling, and how VAEs turn autoencoders into principled generative models via amortized variational inference.
- [Ablation Study Design](https://theorempath.com/topics/ablation-study-design) (tier 2, core): How to properly design ablation studies: remove one component at a time, measure the effect against proper baselines, and report results with statistical rigor.
- [Activation Checkpointing](https://theorempath.com/topics/activation-checkpointing) (tier 2, core): Trade compute for memory by recomputing activations during the backward pass instead of storing them all. Reduces memory from O(L) to O(sqrt(L)) for L layers.
- [Actor-Critic Methods](https://theorempath.com/topics/actor-critic-methods) (tier 2, advanced): The dominant paradigm for deep RL and LLM training: an actor (policy network) guided by a critic (value network), with advantage estimation, PPO clipping, and entropy regularization.
- [AlexNet and Deep Learning History](https://theorempath.com/topics/alexnet-and-deep-learning-history) (tier 2, core): AlexNet (2012) proved deep CNNs work at scale on real vision tasks, reigniting deep learning. Key innovations: GPU training, ReLU, dropout, data augmentation. The path from AlexNet through VGGNet, GoogLeNet, ResNet, to vision transformers.
- [Attention Mechanisms History](https://theorempath.com/topics/attention-mechanisms-history) (tier 2, core): The evolution of attention from Bahdanau (2014) additive alignment to Luong dot-product attention to self-attention in transformers. How attention solved the fixed-length bottleneck of seq2seq models.
- [Auction Theory](https://theorempath.com/topics/auction-theory) (tier 2, advanced): First-price, second-price, English, Dutch. Revenue equivalence, optimal reserve prices, Myerson's theorem, and connections to ad auctions, spectrum allocation, and ML compute markets.
- [Bits, Nats, Perplexity, and BPB](https://theorempath.com/topics/bits-nats-perplexity-bpb) (tier 2, core): The four units people use to measure language model quality, how they relate to each other, when to use each one, and how mixing them up leads to wrong conclusions.
- [Calibration and Uncertainty Quantification](https://theorempath.com/topics/calibration-and-uncertainty) (tier 2, advanced): When a model says 70% confidence, is it right 70% of the time? Calibration measures the alignment between predicted probabilities and actual outcomes. Temperature scaling, Platt scaling, conformal prediction, and MC dropout provide practical tools for trustworthy uncertainty.
- [CAP Theorem](https://theorempath.com/topics/cap-theorem) (tier 2, core): In a distributed system, you cannot simultaneously guarantee consistency, availability, and partition tolerance. Brewer's conjecture, the Gilbert-Lynch proof, PACELC, and practical implications for ML infrastructure.
- [Commons Governance and Institutional Analysis](https://theorempath.com/topics/commons-governance-ostrom) (tier 2, core): Ostrom's framework for managing shared resources without privatization or central control. Design principles for durable institutions, the IAD framework, and applications to open-source, benchmarks, and data commons in ML.
- [Continual Learning and Forgetting](https://theorempath.com/topics/continual-learning-and-forgetting) (tier 2, advanced): Learning sequentially without destroying previous knowledge: Elastic Weight Consolidation, progressive networks, replay methods, and the stability-plasticity tradeoff in deployed systems.
- [Contraction Inequality](https://theorempath.com/topics/contraction-inequality) (tier 2, advanced): The Ledoux-Talagrand contraction principle: composing a function class with an L-Lipschitz function phi satisfying phi(0)=0 scales Rademacher complexity by at most a factor of L, letting you bound the complexity of the loss class by that of the hypothesis class.
- [Contrastive Learning](https://theorempath.com/topics/contrastive-learning) (tier 2, advanced): Learning representations by pulling positive pairs together and pushing negative pairs apart, with theoretical grounding in mutual information maximization.
- [Convolutional Neural Networks](https://theorempath.com/topics/convolutional-neural-networks) (tier 2, advanced): How weight sharing and local connectivity exploit spatial structure: convolution as cross-correlation, translation equivariance, pooling for approximate invariance, and the conv-pool-fc architecture.
- [Decoding Strategies](https://theorempath.com/topics/decoding-strategies) (tier 2, core): How language models select output tokens: greedy decoding, beam search, temperature scaling, top-k sampling, and nucleus (top-p) sampling. The tradeoffs between coherence, diversity, and quality.
- [Differential Privacy](https://theorempath.com/topics/differential-privacy) (tier 2, advanced): Formal privacy guarantees for algorithms: epsilon-delta DP, Laplace and Gaussian mechanisms, composition theorems, DP-SGD for training neural networks, and the privacy-utility tradeoff.
- [Distributed Consensus](https://theorempath.com/topics/distributed-consensus) (tier 2, advanced): How do distributed nodes agree on a value? The FLP impossibility result, Paxos, Raft, Byzantine fault tolerance, and why consensus underpins replicated state machines from databases to distributed ML training.
- [EM Algorithm Variants](https://theorempath.com/topics/expectation-maximization-variants) (tier 2, advanced): Variants of EM for when the standard algorithm is intractable: Monte Carlo EM, Variational EM, Stochastic EM, and ECM. Connection to VAEs as amortized variational EM.
- [Empirical Processes and Chaining](https://theorempath.com/topics/empirical-processes-and-chaining) (tier 2, advanced): Bounding the supremum of empirical processes via covering numbers and chaining: Dudley's entropy integral and Talagrand's generic chaining, the sharpest tools in classical learning theory.
- [Ethics and Fairness in ML](https://theorempath.com/topics/ethics-and-fairness-in-ml) (tier 2, advanced): Fairness definitions (demographic parity, equalized odds, calibration), the impossibility theorem showing they cannot all hold simultaneously, bias sources, and mitigation strategies at each stage of the pipeline.
- [Extreme Value Theory](https://theorempath.com/topics/extreme-value-theory) (tier 2, advanced): The mathematics of maxima and rare events. The Fisher-Tippett-Gnedenko theorem, the three extreme value distributions (Gumbel, Frechet, Weibull), peaks-over-threshold, and applications to tail risk and model evaluation.
- [Fano Inequality](https://theorempath.com/topics/fanos-inequality) (tier 2, advanced): Fano's inequality as the standard tool for information-theoretic lower bounds: if X -> Y -> X_hat, then the error probability is bounded below in terms of the conditional entropy and the alphabet size.
- [Federated Learning](https://theorempath.com/topics/federated-learning) (tier 2, advanced): Train a global model without centralizing data. FedAvg, communication efficiency, non-IID convergence challenges, differential privacy integration, and applications in healthcare and mobile computing.
- [Formal Verification and Proof Assistants](https://theorempath.com/topics/formal-verification) (tier 2, advanced): Some behaviors should be proven, not merely hoped for. Model checking, theorem proving, Lean/Coq, verified compilers, and why verification matters for infrastructure, protocols, and safety-critical AI.
- [Gaussian Process Regression](https://theorempath.com/topics/gaussian-processes-regression) (tier 2, advanced): Inference with Gaussian processes: the prior-to-posterior update in closed form, the role of kernel choice, marginal likelihood for hyperparameter selection, sparse approximations for scalability, and the connection to Bayesian optimization.
- [Generative Adversarial Networks](https://theorempath.com/topics/generative-adversarial-networks) (tier 2, advanced): The minimax game between generator and discriminator: Nash equilibrium at the data distribution, mode collapse, the Wasserstein distance fix, StyleGAN, and why diffusion models have largely replaced GANs for image generation.
- [Graph Neural Networks](https://theorempath.com/topics/graph-neural-networks) (tier 2, advanced): Message passing on graphs: GCN, GAT, GraphSAGE, the WL isomorphism test as an expressivity ceiling, over-smoothing in deep GNNs, and applications to molecules, social networks, and knowledge graphs.
- [GraphSLAM and Factor Graphs](https://theorempath.com/topics/graphslam-and-factor-graphs) (tier 2, advanced): SLAM as graph optimization: poses as nodes, constraints as edges, factor graph representation, MAP estimation via nonlinear least squares, and the sparsity structure that makes large-scale mapping tractable.
- [Hamiltonian Monte Carlo](https://theorempath.com/topics/hamiltonian-monte-carlo) (tier 2, advanced): HMC uses gradient information and Hamiltonian dynamics to propose large, distant moves in parameter space that are still accepted with high probability: dramatically outperforming random-walk Metropolis-Hastings in high dimensions.
- [Hanson-Wright Inequality](https://theorempath.com/topics/hanson-wright-inequality) (tier 2, advanced): Concentration of quadratic forms X^T A X for sub-Gaussian random vectors: the two-term bound involving the Frobenius norm (Gaussian regime) and the operator norm (exponential regime).
- [High-Dimensional Covariance Estimation](https://theorempath.com/topics/high-dimensional-covariance-estimation) (tier 2, advanced): When dimension d is comparable to sample size n, the sample covariance matrix fails. Shrinkage estimators (Ledoit-Wolf), banding and tapering for structured covariance, and Graphical Lasso for sparse precision matrices.
- [Ito's Lemma](https://theorempath.com/topics/ito-lemma) (tier 2, advanced): The chain rule of stochastic calculus: if X_t follows an SDE, then f(X_t) follows a modified SDE with an extra second-order correction term that has no analogue in ordinary calculus.
- [Kernel Two-Sample Tests](https://theorempath.com/topics/kernel-two-sample-tests) (tier 2, advanced): Maximum Mean Discrepancy (MMD): a kernel-based nonparametric test for whether two samples come from the same distribution, with unbiased estimation, permutation testing, and applications to GAN evaluation.
- [Kernels and Reproducing Kernel Hilbert Spaces](https://theorempath.com/topics/kernels-and-rkhs) (tier 2, advanced): Kernel functions, Mercer's theorem, the RKHS reproducing property, and the representer theorem: the mathematical framework that enables learning in infinite-dimensional function spaces via finite-dimensional computations.
- [Knowledge Distillation](https://theorempath.com/topics/knowledge-distillation) (tier 2, advanced): Training a small student model to mimic a large teacher: soft targets, temperature scaling, dark knowledge, and why the teacher's mistakes carry useful information about class structure.
- [Leverage Points in Complex Systems](https://theorempath.com/topics/leverage-points-systems) (tier 2, core): Meadows's hierarchy of intervention points: parameters, buffers, feedback loops, information flows, rules, goals, paradigms. Why shallow tweaks fail and deep levers are counterintuitive.
- [Markov Games and Self-Play](https://theorempath.com/topics/markov-games-and-self-play) (tier 2, advanced): Multi-agent extensions of MDPs where multiple agents with separate rewards interact. Nash equilibria, minimax values in zero-sum games, and self-play as a training method.
- [Measure Concentration and Geometric Functional Analysis](https://theorempath.com/topics/measure-concentration-and-geometric-fa) (tier 2, advanced): High-dimensional geometry is counterintuitive: Lipschitz functions concentrate, random projections preserve distances, and most of a sphere's measure sits near the equator. Johnson-Lindenstrauss, Gaussian concentration, and Levy's lemma.
- [Mechanism Design](https://theorempath.com/topics/mechanism-design) (tier 2, advanced): Inverse game theory: designing rules so that self-interested agents produce desired outcomes. The revelation principle, VCG mechanisms, Myerson's optimal auction, and applications to ML marketplaces and data economics.
- [Meta-Learning](https://theorempath.com/topics/meta-learning) (tier 2, advanced): Learning to learn: find model initializations or embedding spaces that enable fast adaptation to new tasks from few examples. MAML, prototypical networks, and the connection to few-shot learning and in-context learning in LLMs.
- [Minimax Lower Bounds](https://theorempath.com/topics/minimax-lower-bounds) (tier 2, advanced): Why upper bounds are not enough: minimax risk, Le Cam two-point method, Fano inequality, and Assouad lemma for proving that no estimator can beat a given rate.
- [Mirror Descent and Frank-Wolfe](https://theorempath.com/topics/mirror-descent-and-frank-wolfe) (tier 2, advanced): Mirror descent generalizes gradient descent via Bregman divergences, recovering multiplicative weights and exponentiated gradient as special cases. Frank-Wolfe replaces projections with linear minimization, making it projection-free.
- [Mixed Precision Training](https://theorempath.com/topics/mixed-precision-training) (tier 2, core): Train with FP16 or BF16 for speed while keeping FP32 master weights for accuracy. Loss scaling, overflow prevention, and when mixed precision fails.
- [Model Compression and Pruning](https://theorempath.com/topics/model-compression-and-pruning) (tier 2, core): Reducing model size without proportional accuracy loss: unstructured and structured pruning, magnitude pruning, the lottery ticket hypothesis, entropy coding for compressed weights, and knowledge distillation as compression.
- [Model-Based Reinforcement Learning](https://theorempath.com/topics/model-based-rl) (tier 2, advanced): Learning a model of the environment and planning with it. Dyna architecture, learned world models, planning as simulated experience, sample efficiency, and the model-error problem.
- [No-Regret Learning](https://theorempath.com/topics/no-regret-learning) (tier 2, advanced): Online learning against adversarial losses: regret as cumulative loss minus the best fixed action in hindsight, multiplicative weights, follow the regularized leader, and why no-regret dynamics converge to Nash equilibria in zero-sum games.
- [Object Detection and Segmentation](https://theorempath.com/topics/object-detection-and-segmentation) (tier 2, advanced): Localizing and classifying objects in images: two-stage (R-CNN), one-stage (YOLO, SSD), anchor-free (CenterNet, FCOS) detectors, semantic and instance segmentation, and the IoU/mAP evaluation framework.
- [Offline Reinforcement Learning](https://theorempath.com/topics/offline-reinforcement-learning) (tier 2, advanced): Learning policies from a fixed dataset without environment interaction: distributional shift as the core challenge, conservative Q-learning (CQL) as the standard fix, and Decision Transformer as an alternative sequence modeling approach.
- [Online Convex Optimization](https://theorempath.com/topics/online-convex-optimization) (tier 2, advanced): A general framework for sequential decision-making with convex losses: online gradient descent, follow the regularized leader, adaptive methods, and the O(sqrt(T)) regret guarantee that unifies many algorithms.
- [Online Learning and Bandits](https://theorempath.com/topics/online-learning-and-bandits) (tier 2, advanced): Sequential decision making with adversarial or stochastic feedback: the bandit setting, explore-exploit tradeoff, UCB, Thompson sampling, and regret bounds. Connections to RL and A/B testing.
- [Optimal Brain Surgery and Pruning Theory](https://theorempath.com/topics/optimal-brain-surgery-and-pruning-theory) (tier 2, advanced): Principled weight pruning via second-order information: Optimal Brain Damage uses the Hessian diagonal, Optimal Brain Surgeon uses the full inverse Hessian, and both reveal why magnitude pruning is a crude but popular approximation.
- [Optimal Transport and Earth Mover's Distance](https://theorempath.com/topics/optimal-transport-and-earth-movers-distance) (tier 2, advanced): The Monge and Kantorovich formulations of optimal transport, the linear programming dual, Sinkhorn regularization, and applications to WGANs, domain adaptation, and fairness.
- [Out-of-Distribution Detection](https://theorempath.com/topics/out-of-distribution-detection) (tier 2, advanced): Methods for detecting when test inputs differ from training data, where naive softmax confidence fails and principled alternatives based on energy, Mahalanobis distance, and typicality succeed.
- [PAC-Bayes Bounds](https://theorempath.com/topics/pac-bayes-bounds) (tier 2, advanced): Generalization bounds that depend on the KL divergence between a learned posterior and a prior over hypotheses. PAC-Bayes gives non-vacuous bounds for overparameterized networks where VC and Rademacher bounds fail. - [Perplexity and Language Model Evaluation](https://theorempath.com/topics/perplexity-and-language-model-evaluation) (tier 2, core): Perplexity as exp(cross-entropy): the standard intrinsic metric for language models, its information-theoretic interpretation, connection to bits-per-byte, and why low perplexity alone does not guarantee useful generation. - [Policy Optimization: PPO and TRPO](https://theorempath.com/topics/policy-optimization-ppo-trpo) (tier 2, advanced): Trust region and clipped surrogate methods for stable policy optimization: TRPO enforces a KL constraint for monotonic improvement, PPO replaces it with a simpler clipped objective, and PPO became the default optimizer for LLM post-training. - [Policy Representations](https://theorempath.com/topics/policy-representations) (tier 2, core): How to parameterize policies in reinforcement learning: categorical for discrete actions, Gaussian for continuous actions, and why the choice affects gradient variance and exploration. - [Preconditioned Optimizers: Shampoo, K-FAC, and Natural Gradient](https://theorempath.com/topics/preconditioned-optimizers) (tier 2, advanced): Optimizers that use curvature information to precondition gradients: the natural gradient via Fisher information, K-FAC's Kronecker approximation, and Shampoo's full-matrix preconditioning. How they connect to Riemannian optimization and why they outperform Adam on certain architectures. - [Prospect Theory](https://theorempath.com/topics/prospect-theory) (tier 2, core): How people actually make decisions under risk. 
Kahneman and Tversky's model: reference dependence, loss aversion, probability weighting, and why expected utility fails as a descriptive theory. - [Recurrent Neural Networks](https://theorempath.com/topics/recurrent-neural-networks) (tier 2, advanced): Sequential processing via hidden state recurrence: the simple RNN, vanishing and exploding gradients, LSTM gating mechanisms, and why transformers have largely replaced RNNs. - [Representation Learning Theory](https://theorempath.com/topics/representation-learning-theory) (tier 2, advanced): What makes a good learned representation: the information bottleneck, contrastive learning, sufficient statistics, rate-distortion theory, and why representation learning is the central unsolved problem of deep learning. - [Restricted Isometry Property](https://theorempath.com/topics/restricted-isometry-property) (tier 2, advanced): The restricted isometry property (RIP): when a measurement matrix approximately preserves norms of sparse vectors, enabling exact sparse recovery via L1 minimization. Random Gaussian matrices satisfy RIP with O(s log(n/s)) rows. - [Riemannian Optimization and Manifold Constraints](https://theorempath.com/topics/riemannian-optimization) (tier 2, advanced): Optimization on curved spaces: the Stiefel manifold for orthogonal matrices, symmetric positive definite matrices, Riemannian gradient descent, retractions, and why flat-space intuitions break on manifolds. The geometric backbone of Shampoo, Muon, and constrained neural network training. - [Robust Statistics and M-Estimators](https://theorempath.com/topics/robust-statistics-and-m-estimators) (tier 2, advanced): When data has outliers or model assumptions are wrong, classical estimators break. M-estimators generalize MLE to handle contamination gracefully. 
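The M-estimator idea from the robust statistics entry above fits in a short sketch: a Huber estimate of location computed by iteratively reweighted averaging. The toy data, the delta threshold, and the iteration count are illustrative assumptions, not anything prescribed by the topic page.

```python
def huber_location(xs, delta=1.0, iters=100):
    """Huber M-estimate of location via iteratively reweighted averaging.

    Residuals inside [-delta, delta] get weight 1 (like the mean);
    larger residuals get weight delta/|r|, bounding outlier influence.
    """
    mu = sum(xs) / len(xs)    # start from the (non-robust) mean
    for _ in range(iters):
        ws = [1.0 if abs(x - mu) <= delta else delta / abs(x - mu)
              for x in xs]
        mu = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
    return mu

data = [0.1, -0.2, 0.05, 0.0, 100.0]   # one gross outlier
mu = huber_location(data)
```

The plain mean of this data is pulled near 20 by the single outlier, while the Huber estimate stays close to the bulk of the points.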
- [Second-Order Optimization Methods](https://theorempath.com/topics/second-order-optimization-methods) (tier 2, advanced): Newton's method, Gauss-Newton, natural gradient, and K-FAC: how curvature information accelerates convergence, why the Hessian is too expensive to compute at scale, and Hessian-free alternatives that use Hessian-vector products. - [Self-Play and Multi-Agent RL](https://theorempath.com/topics/self-play-and-multi-agent-rl) (tier 2, advanced): Self-play as a training paradigm for competitive games, fictitious play convergence, AlphaGo/AlphaZero, and the challenges of multi-agent reinforcement learning: non-stationarity, partial observability, and centralized training. - [Semantic Search and Embeddings](https://theorempath.com/topics/semantic-search-and-embeddings) (tier 2, core): Dense vector representations for semantic similarity: bi-encoders, cross-encoders, approximate nearest neighbor search, cosine similarity geometry, and the RAG retrieval pipeline. - [Speech and Audio ML](https://theorempath.com/topics/speech-and-audio-ml) (tier 2, advanced): Machine learning for audio: mel spectrograms as 2D representations, CTC loss for sequence alignment, Whisper for speech recognition, text-to-speech synthesis, and why continuous audio signals are harder than discrete text. - [Survival Analysis](https://theorempath.com/topics/survival-analysis) (tier 2, advanced): Modeling time-to-event data with censoring: Kaplan-Meier curves, hazard functions, and the Cox proportional hazards model. - [Synthetic Data Generation](https://theorempath.com/topics/synthetic-data-generation) (tier 2, advanced): Using models to generate training data: LLM-generated instructions, diffusion-based image augmentation, code synthesis. When synthetic data helps (low-resource, privacy) and when it hurts (model collapse). 
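The Kaplan-Meier product-limit curve from the survival analysis entry above can be computed directly. The five-subject dataset below is a toy example for illustration only.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve for right-censored data.

    times:  observed time for each subject
    events: 1 if the event occurred, 0 if censored
    Returns a list of (time, S(t)) steps at the distinct event times.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = 0
        removed = 0
        # group ties at the same observed time
        while i < len(order) and times[order[i]] == t:
            deaths += events[order[i]]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk   # product-limit factor
            curve.append((t, surv))
        at_risk -= removed
    return curve

curve = kaplan_meier([1, 2, 2, 3, 5], [1, 1, 0, 1, 0])
```

Censored subjects leave the risk set without contributing a death, which is exactly how censoring enters the estimate.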
- [Token Prediction and Language Modeling](https://theorempath.com/topics/token-prediction-and-language-modeling) (tier 2, core): Language modeling as probability assignment over sequences. Autoregressive and masked prediction objectives, perplexity evaluation, and the connection between prediction and compression. - [Transfer Learning](https://theorempath.com/topics/transfer-learning) (tier 2, advanced): Pretrain on a large dataset, fine-tune on a smaller target: why lower layers learn transferable features, feature extraction vs fine-tuning, domain adaptation, negative transfer, and the foundation model paradigm. - [Unsolved Problems in Computer Science](https://theorempath.com/topics/unsolved-problems-in-computer-science) (tier 2, advanced): P vs NP, the matrix multiplication exponent, one-way functions, the unique games conjecture, BPP vs P, natural proofs, and circuit lower bounds. The problems that define the limits of computation. - [Zero-Knowledge Proofs](https://theorempath.com/topics/zero-knowledge-proofs) (tier 2, advanced): Prove you know a secret without revealing it. Interactive proofs, the simulation paradigm, ZK completeness, SNARKs, and connections to complexity theory and verifiable computation. - [Anthropic Bias and Observation Selection](https://theorempath.com/topics/anthropic-bias-and-observation-selection) (tier 3, advanced): Observation selection effects constrain what you can observe by the fact that you exist to observe it. The self-sampling assumption, the Doomsday argument, and connections to selection bias in statistics. - [Bayesian Neural Networks](https://theorempath.com/topics/bayesian-neural-networks) (tier 3, advanced): Place a prior over neural network weights and compute the posterior given data. Exact inference is intractable, so we approximate: variational inference, MC dropout, Laplace approximation, SWAG. Principled uncertainty, high cost, limited scaling evidence. 
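Perplexity as exp(cross-entropy), referenced in the language-modeling entries above, reduces to a few lines. The token probabilities below are made-up numbers for illustration, not model outputs.

```python
import math

def perplexity(probs):
    """Perplexity = exp(mean negative log-likelihood) over observed tokens.

    probs: the model's probability assigned to each observed token.
    """
    nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(nll)

# A model that puts probability 0.25 on every observed token behaves
# like a uniform choice over 4 options, so its perplexity is 4.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

This is why perplexity is read as an effective branching factor: lower means the model is, on average, less "surprised" per token.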
- [Benchmarking Methodology](https://theorempath.com/topics/benchmarking-methodology) (tier 3, core): What makes a good benchmark, how benchmarks fail (contamination, leaderboard gaming, single-number comparisons), and how to report results honestly with variance, seeds, and proper statistical practice. - [Causal Inference Basics](https://theorempath.com/topics/causal-inference-basics) (tier 3, advanced): Correlation is not causation. The potential outcomes framework, average treatment effects, confounders, and the methods that let you estimate causal effects from data. - [Copulas](https://theorempath.com/topics/copulas) (tier 3, advanced): Copulas separate the dependence structure of a multivariate distribution from its marginals. Sklar's theorem guarantees that any joint CDF can be decomposed into marginals and a copula, making dependence modeling modular. - [Coupling Arguments and Mixing Time](https://theorempath.com/topics/coupling-arguments-and-mixing-time) (tier 3, advanced): Coupling constructs two Markov chains on the same probability space so they eventually meet, bounding total variation distance and mixing time. Spectral gap and coupling inequality are the main tools for proving how fast MCMC converges to stationarity. - [Dependent Type Theory](https://theorempath.com/topics/dependent-type-theory) (tier 3, advanced): Types that depend on values. Pi types generalize function types; Sigma types generalize pairs. The Curry-Howard correspondence extends to full logic: propositions are types, proofs are programs. This is the foundation of proof assistants like Lean and Coq. - [Energy-Based Models](https://theorempath.com/topics/energy-based-models) (tier 3, advanced): A unifying framework for generative modeling: assign low energy to likely configurations via E(x), define probability through the Boltzmann distribution, and train without computing the intractable partition function. 
- [Information Bottleneck](https://theorempath.com/topics/information-bottleneck) (tier 3, advanced): The information bottleneck principle: compress the input X into a representation T that preserves information about the target Y. The Lagrangian formulation, connection to deep learning, Shwartz-Ziv and Tishby claims, and why the compression story may not hold for ReLU networks. - [Information Geometry](https://theorempath.com/topics/information-geometry) (tier 3, advanced): Riemannian geometry on the space of probability distributions: the Fisher information metric, natural gradient descent, exponential families as dually flat manifolds, and the connection to mirror descent. - [Interior Point Methods](https://theorempath.com/topics/interior-point-methods) (tier 3, advanced): Barrier functions transform constrained optimization into unconstrained problems. Newton steps on the barrier objective trace the central path to the constrained optimum with polynomial convergence. - [Longitudinal Surveys and Panel Data](https://theorempath.com/topics/longitudinal-surveys-and-panel-data) (tier 3, advanced): Analysis of data where the same units are measured repeatedly over time: fixed effects, random effects, difference-in-differences, and the problems of attrition and time-varying confounding. - [MCMC for Markov Random Fields](https://theorempath.com/topics/mcmc-for-markov-random-fields) (tier 3, advanced): Gibbs sampling on undirected graphical models. The joint distribution factorizes over cliques, and each variable is sampled conditioned on its Markov blanket. Ising model, image denoising, and simulated annealing. - [Mixture Density Networks](https://theorempath.com/topics/mixture-density-networks) (tier 3, advanced): Neural networks that output the parameters of a mixture model instead of a single point prediction: handling multi-modal conditional distributions, the negative log-likelihood loss, and applications to inverse problems. 
- [Nonlinear Gauss-Seidel](https://theorempath.com/topics/nonlinear-gauss-seidel) (tier 3, advanced): Block coordinate descent with Newton-like updates within each block. Converges under contraction conditions for structured optimization problems. The EM algorithm is a special case. - [Normalizing Flows](https://theorempath.com/topics/normalization-flows) (tier 3, advanced): Generative models that transform a simple base distribution through invertible mappings, enabling exact log-likelihood computation via the change of variables formula. - [Official Statistics and National Surveys](https://theorempath.com/topics/official-statistics-and-national-surveys) (tier 3, core): How government statistical agencies produce population, economic, and social data through censuses and surveys, with quality frameworks and implications for ML practitioners using these datasets. - [Open Problems in Matrix Computation](https://theorempath.com/topics/open-problems-in-matrix-computation) (tier 3, advanced): The unsolved questions in numerical linear algebra: the true exponent of matrix multiplication, practical fast algorithms, sparse matrix multiplication, randomized methods, and why these matter for scaling ML. - [Options and Temporal Abstraction](https://theorempath.com/topics/options-and-temporal-abstraction) (tier 3, advanced): The options framework for hierarchical RL: temporally extended actions with initiation sets, internal policies, and termination conditions. Semi-MDPs and learning options end-to-end. - [Particle Filters](https://theorempath.com/topics/particle-filters) (tier 3, advanced): Sequential Monte Carlo: represent the posterior over hidden states as a set of weighted particles, propagate through dynamics, reweight by likelihood, and resample to combat degeneracy. - [Perfect Sampling](https://theorempath.com/topics/perfect-sampling) (tier 3, advanced): Coupling from the past (Propp-Wilson): run Markov chains backward in time until all starting states coalesce. 
The result is an exact sample from the stationary distribution with zero burn-in bias. - [Reinforcement Learning Environments and Benchmarks](https://theorempath.com/topics/reinforcement-learning-environments-and-benchmarks) (tier 3, core): The standard RL evaluation stack: Gymnasium API, classic control tasks, Atari, MuJoCo, ProcGen, the sim-to-real gap, and why benchmark performance is a poor predictor of real-world RL capability. - [Reservoir Computing and Echo State Networks](https://theorempath.com/topics/reservoir-computing-and-echo-state-networks) (tier 3, advanced): Fixed random recurrent networks with trained linear readouts: the echo state property, why random high-dimensional projections carry computational power, extreme learning machines, and connections to state-space models. - [Reversible Jump MCMC](https://theorempath.com/topics/reversible-jump-mcmc) (tier 3, advanced): MCMC for model selection: propose moves that change the number of parameters, maintain detailed balance across dimensions via Jacobian corrections, and sample over model space and parameter space simultaneously. - [Small Area Estimation](https://theorempath.com/topics/small-area-estimation) (tier 3, advanced): Methods for producing reliable estimates in domains where direct survey estimates have too few observations for useful precision, using Fay-Herriot and unit-level models that borrow strength across areas. - [Stochastic Calculus for ML](https://theorempath.com/topics/stochastic-calculus-for-ml) (tier 3, advanced): Brownian motion, Ito integrals, Ito's lemma, and stochastic differential equations: the mathematical machinery behind diffusion models, score-based generative models, and Langevin dynamics. - [Submodular Optimization](https://theorempath.com/topics/submodular-optimization) (tier 3, advanced): Submodular functions exhibit diminishing returns. 
The greedy algorithm achieves a (1-1/e) approximation for monotone submodular maximization under cardinality constraints, with applications in feature selection, sensor placement, and data summarization. ## Layer 4 — Deep learning, modern architectures - [Attention Is All You Need (Paper)](https://theorempath.com/topics/attention-is-all-you-need-paper) (tier 1, advanced): The 2017 paper that introduced the transformer: self-attention replacing recurrence, multi-head attention, positional encoding, and what survived versus what changed in modern LLMs. - [Hallucination Theory](https://theorempath.com/topics/hallucination-theory) (tier 1, advanced): Why large language models confabulate, the mathematical frameworks for understanding when model outputs are unreliable, and what current theory says about mitigation. - [Implicit Bias and Modern Generalization](https://theorempath.com/topics/implicit-bias-and-modern-generalization) (tier 1, advanced): Why classical generalization theory breaks for overparameterized models: the random labels experiment, the interpolation threshold, implicit bias of gradient descent, double descent, and the frontier of understanding why deep learning works. - [The Era of Experience](https://theorempath.com/topics/era-of-experience) (tier 1, advanced): Sutton and Silver's thesis: the next phase of AI moves beyond imitation from human data toward agents that learn predominantly from their own experience. Text is not enough for general intelligence. - [Adversarial Machine Learning](https://theorempath.com/topics/adversarial-machine-learning) (tier 2, advanced): Attacks across the ML lifecycle: evasion (adversarial examples), poisoning (corrupt training data), model extraction (steal via queries), privacy leakage (membership inference), and LLM jailbreaks. Why defenses are hard. 
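The greedy (1 - 1/e) guarantee from the submodular optimization entry above is easy to see on a coverage function, which is monotone submodular. The ground sets and budget below are toy values for illustration.

```python
def greedy_max_coverage(sets, k):
    """Greedy maximization of a coverage function under a cardinality
    constraint: repeatedly pick the set with the largest marginal gain.
    """
    chosen = []
    covered = set()
    for _ in range(k):
        best_i, best_gain = None, 0
        for i, s in enumerate(sets):
            if i in chosen:
                continue
            gain = len(s - covered)   # marginal gain of adding set i
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:
            break                     # no remaining set adds anything
        chosen.append(best_i)
        covered |= sets[best_i]
    return chosen, covered

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
chosen, covered = greedy_max_coverage(sets, k=2)
```

Diminishing returns means each greedy step captures at least a 1/k fraction of the remaining optimal value, which compounds to the (1 - 1/e) bound.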
- [Attention Mechanism Theory](https://theorempath.com/topics/attention-mechanism-theory) (tier 2, advanced): Mathematical formulation of attention: scaled dot-product attention as soft dictionary lookup, why sqrt(d_k) scaling prevents softmax saturation, multi-head attention, and the connection to kernel methods. - [Attention Sinks and Retrieval Decay](https://theorempath.com/topics/attention-sinks-and-retrieval-decay) (tier 2, advanced): Why transformers disproportionately attend to initial tokens (attention sinks), how StreamingLLM exploits this for infinite-length inference, and how retrieval accuracy degrades with distance and position within the context window. - [Attention Variants and Efficiency](https://theorempath.com/topics/attention-variants-and-efficiency) (tier 2, advanced): Multi-head, multi-query, grouped-query, linear, and sparse attention: how each variant trades expressivity for efficiency, and when to use which. - [Benign Overfitting](https://theorempath.com/topics/benign-overfitting) (tier 2, advanced): When interpolation (zero training error) does not hurt generalization: the min-norm interpolator fits noise in harmless directions while preserving signal. Bartlett et al. 2020, effective rank conditions, and why benign overfitting happens in overparameterized but not classical regimes. - [BERT and the Pretrain-Finetune Paradigm](https://theorempath.com/topics/bert-and-pretrain-finetune-paradigm) (tier 2, core): BERT introduced bidirectional pretraining with masked language modeling. The pretrain-finetune paradigm it established, train once on a large corpus then adapt to many tasks, became the default approach for NLP and beyond. - [Catastrophic Forgetting](https://theorempath.com/topics/catastrophic-forgetting) (tier 2, advanced): Fine-tuning a neural network on new data destroys knowledge of old data. 
Understanding the stability-plasticity dilemma and mitigation strategies (EWC, progressive networks, replay) is essential for continual learning and safe LLM fine-tuning. - [CLIP and OpenCLIP in Practice](https://theorempath.com/topics/clip-and-openclip-in-practice) (tier 2, advanced): CLIP learns a shared embedding space for images and text via contrastive learning on 400M pairs. Practical guide to zero-shot classification, image search, OpenCLIP variants, embedding geometry, and known limitations. - [Diffusion Models](https://theorempath.com/topics/diffusion-models) (tier 2, advanced): Generative models that learn to reverse a noise-adding process: the math of score matching, denoising, SDEs, and why diffusion dominates image generation. - [Double Descent](https://theorempath.com/topics/double-descent) (tier 2, advanced): Test error follows a double-descent curve: it decreases, peaks at the interpolation threshold, then decreases again in the overparameterized regime, defying classical bias-variance intuition. - [Efficient Transformers Survey](https://theorempath.com/topics/efficient-transformers-survey) (tier 2, advanced): Sub-quadratic attention variants (linear attention, Linformer, Performer, Longformer, BigBird) and why FlashAttention, a hardware-aware exact method, made most of them unnecessary in practice. - [Equilibrium and Implicit-Layer Models](https://theorempath.com/topics/equilibrium-and-implicit-models) (tier 2, advanced): Deep Equilibrium Models (DEQ) replace explicit depth with a fixed-point equation: instead of stacking L layers, solve for the equilibrium state where one more layer would not change the output. This enables infinite-depth networks with constant memory, using implicit differentiation for backprop. - [Equivariant Deep Learning](https://theorempath.com/topics/equivariant-deep-learning) (tier 2, advanced): Networks that respect symmetry: if the input transforms under a group action, the output transforms predictably.
Equivariance generalizes translation equivariance in CNNs to rotations, permutations, and gauge symmetries, reducing sample complexity and improving generalization on structured data. - [Flow Matching](https://theorempath.com/topics/flow-matching) (tier 2, research): Learn a velocity field that transports noise to data along straight-line paths. Simpler training than diffusion, faster sampling, and cleaner math. - [Forgetting Transformer (FoX)](https://theorempath.com/topics/fox-forget-gate) (tier 2, advanced): FoX adds a data-dependent forget gate to softmax attention. The gate down-weights unnormalized attention scores between past and present positions, giving the transformer a learned, recency-biased decay. FoX is FlashAttention-compatible, works without positional embeddings, and improves long-context language modeling and length extrapolation. - [Grokking](https://theorempath.com/topics/grokking) (tier 2, advanced): Models can memorize training data quickly, then generalize much later after continued training. This delayed generalization, called grokking, breaks the assumption that overfitting is a terminal state and connects to weight decay, implicit regularization, and phase transitions in learning. - [Induction Heads](https://theorempath.com/topics/induction-heads) (tier 2, advanced): Induction heads are attention head circuits that implement pattern completion: given a sequence like [A][B]...[A], they predict [B]. They are a leading candidate mechanism for in-context learning, with strong causal evidence in small attention-only models and correlational evidence in large transformers. They emerge through a phase transition during training. 
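Scaled dot-product attention, as described in the attention entries above, is a short computation. The shapes below (no batching, no masking, no learned projections) are a minimal sketch with arbitrary random inputs.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    The sqrt(d_k) factor keeps the logit variance O(1) as the key
    dimension grows, so the softmax does not saturate.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 3))
out, w = attention(Q, K, V)
```

Each output row is a convex combination of value rows, which is the "soft dictionary lookup" reading of attention.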
- [JEPA and Joint Embedding](https://theorempath.com/topics/jepa-and-joint-embedding) (tier 2, advanced): LeCun's Joint Embedding Predictive Architecture: learning by predicting abstract representations rather than pixels, energy-based formulation, I-JEPA and V-JEPA implementations, and the connection to contrastive learning and world models. - [Lazy vs Feature Learning](https://theorempath.com/topics/lazy-vs-feature-learning) (tier 2, advanced): The fundamental dichotomy in neural network training: lazy regime (NTK, kernel-like, weights barely move) versus rich/feature learning regime (weights move substantially, representations emerge). - [Mamba and State-Space Models](https://theorempath.com/topics/mamba-and-state-space-models) (tier 2, advanced): Linear-time sequence modeling via structured state spaces: S4, HiPPO initialization, selective state-space models (Mamba), and the architectural fork from transformers. - [Mean Field Theory](https://theorempath.com/topics/mean-field-theory) (tier 2, advanced): The mean field limit of neural networks: as width goes to infinity under the right scaling, neurons become independent particles whose weight distribution evolves by Wasserstein gradient flow, capturing feature learning that the NTK regime misses. - [Mechanistic Interpretability](https://theorempath.com/topics/mechanistic-interpretability) (tier 2, research): Understanding what individual neurons and circuits compute inside neural networks: sparse autoencoders, superposition, induction heads, probing, and the limits of interpretability. - [Mixture of Experts](https://theorempath.com/topics/mixture-of-experts) (tier 2, advanced): Sparse computation via learned routing: replace dense FFN layers with multiple expert networks, activate only a subset per token, and scale capacity without proportional compute. 
- [Multi-Agent Collaboration](https://theorempath.com/topics/multi-agent-collaboration) (tier 2, advanced): Multiple LLM agents working together on complex tasks: debate for improving reasoning, division of labor across specialist agents, structured communication protocols, and when multi-agent outperforms single-agent systems. - [Neural Network Optimization Landscape](https://theorempath.com/topics/neural-network-optimization-landscape) (tier 2, advanced): Loss surface geometry of neural networks: saddle points dominate in high dimensions, mode connectivity, flat vs sharp minima, Sharpness-Aware Minimization, and the edge of stability phenomenon. - [Neural ODEs and Continuous-Depth Networks](https://theorempath.com/topics/neural-odes) (tier 2, advanced): Treating neural network depth as a continuous variable: the ODE formulation of residual networks, the adjoint method for memory-efficient backpropagation, connections to dynamical systems theory, and practical limitations. - [Neural Tangent Kernel](https://theorempath.com/topics/neural-tangent-kernel) (tier 2, advanced): In the infinite-width limit, neural networks trained with gradient descent behave like kernel regression with a specific kernel, the Neural Tangent Kernel, connecting deep learning to classical kernel theory. - [Physics-Informed Neural Networks](https://theorempath.com/topics/physics-informed-neural-networks) (tier 2, advanced): Embedding PDE constraints directly into the neural network loss function via automatic differentiation. When physics-informed learning works, when it fails, and what alternatives exist. - [Random Matrix Theory Overview](https://theorempath.com/topics/random-matrix-theory-overview) (tier 2, advanced): Why the spectra of random matrices matter for ML: Marchenko-Pastur law, Wigner semicircle, spiked models, and their applications to covariance estimation, PCA, and overparameterization.
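The Marchenko-Pastur support named in the random matrix entry above can be checked numerically: sample eigenvalues of a white-noise sample covariance concentrate on [(1 - sqrt(gamma))^2, (1 + sqrt(gamma))^2]. The dimensions and seed below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 200               # samples, dimension; gamma = p/n = 0.1
X = rng.standard_normal((n, p))
S = X.T @ X / n                # sample covariance of pure noise
evals = np.linalg.eigvalsh(S)

gamma = p / n
lo = (1 - np.sqrt(gamma)) ** 2  # Marchenko-Pastur lower edge
hi = (1 + np.sqrt(gamma)) ** 2  # Marchenko-Pastur upper edge
```

Even though the true covariance is the identity, the sample spectrum spreads over the whole MP bulk, which is why naive eigenvalue-based inference misleads in high dimensions.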
- [Residual Stream and Transformer Internals](https://theorempath.com/topics/residual-stream-and-transformer-internals) (tier 2, advanced): The residual stream as the central computational highway in transformers: attention and FFN blocks read from and write to it. Pre-norm vs post-norm, FFN as key-value memory, and the logit lens for inspecting intermediate representations. - [RLHF and Alignment](https://theorempath.com/topics/rlhf-and-alignment) (tier 2, advanced): The RLHF pipeline for aligning language models with human preferences: reward modeling, PPO fine-tuning, KL penalties, DPO, and why none of it guarantees truthfulness. - [Scaling Laws](https://theorempath.com/topics/scaling-laws) (tier 2, advanced): Power-law relationships between loss and compute, parameters, and data: Kaplan scaling, Chinchilla-optimal training, emergent abilities, and whether scaling laws are fundamental or empirical. - [Self-Supervised Vision](https://theorempath.com/topics/self-supervised-vision) (tier 2, advanced): Learning visual representations without labels: contrastive methods (SimCLR, MoCo), self-distillation (DINO/DINOv2), and masked image modeling (MAE). Why self-supervised vision matters for transfer learning and label-scarce domains. - [Sparse Attention and Long Context](https://theorempath.com/topics/sparse-attention-and-long-context) (tier 2, advanced): Standard attention is O(n^2). Sparse patterns (Longformer, Sparse Transformer, Reformer), ring attention for distributed sequences, streaming with attention sinks, and why extending context is harder than it sounds. - [Sparse Autoencoders for Interpretability](https://theorempath.com/topics/sparse-autoencoders) (tier 2, advanced): Sparse autoencoders decompose neural network activations into interpretable feature directions by learning an overcomplete dictionary with sparsity constraints. They are the primary tool for extracting monosemantic features from polysemantic neurons. 
- [Training Dynamics and Loss Landscapes](https://theorempath.com/topics/training-dynamics-and-loss-landscapes) (tier 2, advanced): The geometry of neural network loss surfaces: why saddle points dominate over local minima in high dimensions, how flat minima relate to generalization, and why SGD finds solutions that generalize. - [Transformer Architecture](https://theorempath.com/topics/transformer-architecture) (tier 2, advanced): The mathematical formulation of the transformer block: self-attention, multi-head attention, layer normalization, FFN blocks, positional encoding, and parameter counting. - [Vision Transformer Lineage](https://theorempath.com/topics/vision-transformer-lineage) (tier 2, advanced): The evolution of visual representation learning: from CNNs (AlexNet, ResNet) to ViT (pure attention for images), Swin (hierarchical attention), and DINOv2 (self-supervised ViT with self-distillation), with connections to CLIP. - [World Models and Planning](https://theorempath.com/topics/world-models-and-planning) (tier 2, advanced): Learning a model of the environment and planning inside it: Dreamer's latent dynamics, MuZero's learned model with MCTS, video world models, and why planning in imagination is the path to sample-efficient and safe AI. - [3D Gaussian Splatting](https://theorempath.com/topics/gaussian-splatting) (tier 3, advanced): Represent a 3D scene as millions of 3D Gaussians, each with position, covariance, opacity, and color. Render by projecting to 2D and alpha-compositing. Real-time, high-quality novel view synthesis without neural networks at render time. - [Active SLAM and POMDPs](https://theorempath.com/topics/active-slam-and-pomdps) (tier 3, advanced): Choosing robot actions to simultaneously map an environment and localize, formulated as a partially observable Markov decision process over belief states. 
- [Attention as Kernel Regression](https://theorempath.com/topics/attention-as-kernel-regression) (tier 3, advanced): Softmax attention viewed as Nadaraya-Watson kernel regression: the output at each position is a kernel-weighted average of values, with the softmax kernel K(q,k) = exp(q^T k / sqrt(d)). Connects attention to classical nonparametric statistics and motivates linear attention via random features. - [Gaussian Processes for Machine Learning](https://theorempath.com/topics/gaussian-processes-for-ml) (tier 3, advanced): A distribution over functions specified by a mean and kernel: closed-form posterior predictions with uncertainty, connection to kernel ridge regression, marginal likelihood for model selection, and the cubic cost bottleneck. - [Mean-Field Games](https://theorempath.com/topics/mean-field-games) (tier 3, research): The many-agent limit of strategic interactions: as the number of agents goes to infinity, each agent solves an MDP against the population distribution, and equilibrium becomes a fixed-point condition on the mean field. - [Neural Architecture Search](https://theorempath.com/topics/neural-architecture-search) (tier 3, advanced): Automating network architecture design: search spaces, search strategies (RL, evolutionary, differentiable), performance estimation via weight sharing, and the gap between NAS hype and practical gains. - [Number Theory and Machine Learning](https://theorempath.com/topics/number-theory-and-ml) (tier 3, advanced): The emerging two-way street between number theory and machine learning: how number-theoretic tools improve ML systems, and how ML is discovering new mathematical structure in classical problems. 
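The closed-form GP posterior from the Gaussian processes entry above can be written out directly: mean K*^T (K + sigma^2 I)^{-1} y and covariance K** - K*^T (K + sigma^2 I)^{-1} K*. The RBF kernel, length scale, noise level, and sin-curve training data are illustrative assumptions.

```python
import numpy as np

def gp_posterior(X, y, X_star, length_scale=1.0, noise=1e-2):
    """Exact GP regression posterior with an RBF kernel (a sketch)."""
    def rbf(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * sq / length_scale**2)

    K = rbf(X, X) + noise * np.eye(len(X))   # train gram + noise jitter
    K_star = rbf(X, X_star)                  # train/test cross-covariances
    # one linear solve handles both the mean and covariance terms
    sol = np.linalg.solve(K, np.column_stack([y, K_star]))
    alpha, W = sol[:, 0], sol[:, 1:]
    mean = K_star.T @ alpha
    cov = rbf(X_star, X_star) - K_star.T @ W
    return mean, np.diag(cov)

X = np.linspace(0, 3, 8)[:, None]
y = np.sin(X).ravel()
mean, var = gp_posterior(X, y, X)   # predict back at the training inputs
```

The single `solve` against the train gram matrix is the cubic-cost bottleneck the entry mentions; everything after it is matrix multiplication.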
- [Occupancy Networks and Neural Fields](https://theorempath.com/topics/occupancy-networks-and-neural-fields) (tier 3, advanced): Representing 3D geometry and appearance as continuous functions parameterized by neural networks: NeRF, occupancy networks, DeepSDF, volume rendering, and the connection to Gaussian splatting. - [Positional Encoding](https://theorempath.com/topics/positional-encoding) (tier 3, advanced): Why attention needs position information, sinusoidal encoding, learned positions, RoPE (rotary position encoding via 2D rotations), ALiBi, and why RoPE became the default for modern LLMs. - [Robust Adversarial Policies](https://theorempath.com/topics/robust-adversarial-policies) (tier 3, advanced): Robust MDPs optimize against worst-case transition dynamics within an uncertainty set. Adversarial policies formalize distribution shift in RL as a game between agent and environment. - [Sparse Recovery and Compressed Sensing](https://theorempath.com/topics/sparse-recovery-and-compressed-sensing) (tier 3, advanced): Recover a sparse signal from far fewer measurements than its ambient dimension: the restricted isometry property, basis pursuit via L1 minimization, random measurement matrices, and applications from MRI to single-pixel cameras. - [Tokenization and Information Theory](https://theorempath.com/topics/tokenization-and-information-theory) (tier 3, advanced): Tokenization determines an LLM's vocabulary and shapes everything from compression efficiency to multilingual ability. Information theory explains what good tokenization looks like. - [Visual and Semantic SLAM](https://theorempath.com/topics/visual-semantic-slam) (tier 3, advanced): Replacing laser range finders with cameras for SLAM, and enriching maps with semantic labels to improve data association and planning. 
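RoPE, from the positional encoding entry above, rotates each 2-D pair of features by a position-dependent angle. This single-vector sketch uses the common base constant 10000; the query/key vectors are arbitrary illustrative values.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotary position embedding for one vector: rotate feature pair
    (vec[i], vec[i+1]) by angle pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# The defining property: the q-k inner product after RoPE depends
# only on the relative offset between the two positions.
rel_73 = dot(rope(q, 7), rope(k, 3))
rel_51 = dot(rope(q, 5), rope(k, 1))
```

Because rotations compose, attention logits become functions of relative position without any additive position embedding, which is the property behind RoPE's length-extrapolation behavior.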
- [Wasserstein Distances](https://theorempath.com/topics/wasserstein-distances) (tier 3, advanced): The Wasserstein (earth mover's) distance measures the minimum cost of transporting one probability distribution to another, with deep connections to optimal transport, GANs, and distributional robustness.

## Layer 5 — Applied systems, frontier research

- [Reinforcement Learning from Human Feedback: Deep Dive](https://theorempath.com/topics/reinforcement-learning-from-human-feedback-deep-dive) (tier 1, advanced): The full RLHF pipeline: supervised fine-tuning, Bradley-Terry reward modeling, PPO with KL penalty, reward hacking via Goodhart, and the post-RLHF landscape of DPO, GRPO, and RLVR.
- [Agentic RL and Tool Use](https://theorempath.com/topics/agentic-rl-and-tool-use) (tier 2, advanced): The shift from passive sequence generation to autonomous multi-turn decision making. LLMs as RL policies, tool use as actions, ReAct, AgentRL, and why agentic RL differs from chat RLHF.
- [AI Labs Landscape](https://theorempath.com/topics/ai-labs-landscape-2026) (tier 2, core): Factual reference on the major AI research labs and companies: what they build, key technical contributions, and research focus areas. Current landscape as of April 2026.
- [Chain-of-Thought and Reasoning](https://theorempath.com/topics/chain-of-thought-and-reasoning) (tier 2, advanced): Chain-of-thought prompting, why intermediate reasoning steps improve LLM performance, self-consistency, tree-of-thought, and the connection to inference-time compute scaling.
- [Claude Model Family](https://theorempath.com/topics/claude-model-family) (tier 2, core): Anthropic's Claude series from Claude 1 through the Claude 4.x family (4, 4.5, 4.6, 4.7), covering Constitutional AI, extended thinking, computer use, long context, and safety-focused design.
- [Constitutional AI](https://theorempath.com/topics/constitutional-ai) (tier 2, advanced): Anthropic's approach to alignment: replace human harmlessness labels with a constitution of principles that the model uses to self-evaluate, enabling scalable AI feedback.
- [Context Engineering](https://theorempath.com/topics/context-engineering) (tier 2, advanced): The discipline of building, routing, compressing, retrieving, and persisting context for LLMs: beyond prompt design into systems engineering for what the model sees.
- [Data Contamination and Evaluation](https://theorempath.com/topics/data-contamination-and-evaluation) (tier 2, advanced): When training data overlaps test benchmarks, model scores become meaningless. Types of contamination, detection methods, dynamic benchmarks, and how to read evaluation claims skeptically.
- [DeepSeek Models](https://theorempath.com/topics/deepseek-models) (tier 2, core): DeepSeek's model family: MoE architectures with Multi-head Latent Attention, fine-grained expert routing, and RL-trained reasoning in DeepSeek-R1.
- [Document Intelligence](https://theorempath.com/topics/document-intelligence) (tier 2, advanced): Beyond OCR: understanding document layout, tables, figures, and structure using models that combine text, spatial position, and visual features to extract structured information from PDFs, invoices, and contracts.
- [DPO vs GRPO vs RL for Reasoning](https://theorempath.com/topics/dpo-vs-grpo-vs-rl-reasoning) (tier 2, advanced): Head-to-head comparison of preference optimization methods: DPO's implicit reward, GRPO's group-relative comparisons, and RL with verifier feedback for reasoning. When each works and when each fails.
- [Edge and On-Device ML](https://theorempath.com/topics/edge-and-on-device-ml) (tier 2, advanced): Running models on phones, embedded devices, and edge servers: pruning, distillation, quantization, TinyML, and hardware-aware neural architecture search under memory, compute, and power constraints.
- [Flash Attention](https://theorempath.com/topics/flash-attention) (tier 2, advanced): IO-aware exact attention: tile QKV matrices into SRAM-sized blocks, compute attention without materializing the full attention matrix in HBM, reducing memory reads/writes from quadratic to linear.
- [Florence and Vision Foundation Models](https://theorempath.com/topics/florence-and-vision-foundation-models) (tier 2, advanced): Vision foundation models that unify classification, detection, segmentation, captioning, and VQA under a single pretrained backbone. Florence, Florence-2, and the path toward GPT-4V-style multimodal understanding.
- [Fused Kernels](https://theorempath.com/topics/fused-kernels) (tier 2, advanced): Combine multiple GPU operations into a single kernel launch to eliminate intermediate HBM reads and writes. Why kernel fusion is the primary optimization technique for memory-bound ML operations.
- [Gemini and Google Models](https://theorempath.com/topics/gemini-and-google-models) (tier 2, core): Google's model lineage from PaLM through Gemini 2.0: native multimodality, extreme long context, TPU infrastructure, and the Gemma open-weight series.
- [GPT Series Evolution](https://theorempath.com/topics/gpt-series-evolution) (tier 2, core): The progression from GPT-1 (117M) to GPT-4: how each generation revealed new capabilities through scale, training methodology, and alignment techniques.
- [GPU Compute Model](https://theorempath.com/topics/gpu-compute-model) (tier 2, advanced): How GPUs execute ML workloads: streaming multiprocessors, warps, memory hierarchy (registers, SRAM, L2, HBM), arithmetic intensity, the roofline model, and why most ML operations are memory-bound.
- [History of Artificial Intelligence](https://theorempath.com/topics/history-of-ai) (tier 2, core): Five recurring questions drive AI history: representation, learning, search, uncertainty, and tractability. From Turing 1936 through the transformer era, every advance answered at least one of these questions differently.
- [Inference Systems Overview](https://theorempath.com/topics/inference-systems-overview) (tier 2, advanced): The modern LLM inference stack: batching strategies, scheduling, memory management with paged attention, model parallelism for serving, and why FLOPs do not equal latency when memory bandwidth is the bottleneck.
- [Inference-Time Scaling Laws](https://theorempath.com/topics/inference-time-scaling-laws) (tier 2, advanced): How additional compute at inference time (repeated sampling, search, verification) improves output quality, why gains are task-dependent, and why verifier quality matters more than raw sample count.
- [KV Cache](https://theorempath.com/topics/kv-cache) (tier 2, advanced): Why autoregressive generation recomputes attention at every step, how caching past key-value pairs makes it linear, and the memory bottleneck that drives MQA, GQA, and paged attention.
- [KV Cache Optimization](https://theorempath.com/topics/kv-cache-optimization) (tier 2, advanced): Advanced techniques for managing the KV cache memory bottleneck: paged attention for fragmentation-free allocation, prefix caching for shared prompts, token eviction for long sequences, and quantized KV cache for reduced footprint.
- [Latent Reasoning](https://theorempath.com/topics/latent-reasoning) (tier 2, advanced): Reasoning in hidden state space instead of generating chain-of-thought tokens: recurrent computation and continuous thought for scaling inference compute without scaling output length.
- [LLaMA and Open Weight Models](https://theorempath.com/topics/llama-and-open-weight-models) (tier 2, core): The open weight movement in large language models: LLaMA 1/2/3, the ecosystem of fine-tuning and quantization tools, and why open weights changed the dynamics of AI research.
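The KV-cache entries above reduce to one idea: cache each token's key and value once, and only ever compute attention for the newest query. A minimal single-head sketch, with illustrative names (real systems batch, page, and quantize this cache):

```python
import math

def decode_step(q, kv_cache, new_k, new_v):
    # One autoregressive step: append this token's (key, value) to the
    # cache, then attend over everything cached. Past K and V are reused,
    # never recomputed -- that is the entire point of the KV cache,
    # turning quadratic recomputation into linear per-step work.
    kv_cache.append((new_k, new_v))
    d = len(q)
    scores = [math.exp(sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d))
              for k, _ in kv_cache]
    z = sum(scores)
    return [sum(s * v[j] for s, (_, v) in zip(scores, kv_cache)) / z
            for j in range(len(new_v))]
```

The cache grows by one (key, value) pair per generated token, which is exactly the memory bottleneck that MQA, GQA, and paged attention target.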
- [LLM Application Security](https://theorempath.com/topics/llm-application-security) (tier 2, advanced): The OWASP LLM Top 10: prompt injection, insecure output handling, training data poisoning, model denial of service, supply chain vulnerabilities, sensitive information disclosure, insecure plugin design, excessive agency, overreliance, and model theft. Standard application security for the GenAI era.
- [Memory Systems for LLMs](https://theorempath.com/topics/memory-systems-for-llms) (tier 2, advanced): Taxonomy of LLM memory: short-term (KV cache), working (scratchpad), long-term (retrieval), and parametric (weights). Why extending context alone is insufficient and how memory consolidation works.
- [Model Collapse and Data Quality](https://theorempath.com/topics/model-collapse-and-data-quality) (tier 2, advanced): When models train on their own outputs, the learned distribution narrows, tails disappear, and quality degrades across generations. Why synthetic data feedback loops threaten pretraining data quality and how to mitigate them.
- [Model Comparison Table](https://theorempath.com/topics/model-comparison-table) (tier 2, core): Structured comparison of major LLM families as of early 2026: architecture, parameters, context length, open weights, and key strengths, with discussion of what comparison tables cannot tell you.
- [Model Timeline](https://theorempath.com/topics/model-timeline) (tier 2, core): A structured factual timeline of major language and multimodal models from GPT-2 through the current frontier, with parameter counts, key innovations, and the ideas that defined each era.
- [Multi-Token Prediction](https://theorempath.com/topics/multi-token-prediction) (tier 2, advanced): Predicting k future tokens simultaneously using auxiliary prediction heads: forces planning, improves code generation, and connects to speculative decoding.
- [Multimodal RAG](https://theorempath.com/topics/multimodal-rag) (tier 2, advanced): RAG beyond text: retrieving images, tables, charts, and PDFs alongside text. Document parsing, multimodal chunking, vision-language retrievers, agentic RAG, and reasoning RAG with chain-of-thought retrieval.
- [PaddleOCR and Practical OCR](https://theorempath.com/topics/paddleocr-and-practical-ocr) (tier 2, advanced): A practitioner's guide to modern OCR toolkits: PaddleOCR's three-stage pipeline, TrOCR's transformer approach, EasyOCR, and Tesseract. When to use which, and what accuracy to expect.
- [Parallel Processing Fundamentals](https://theorempath.com/topics/parallel-processing-fundamentals) (tier 2, advanced): Data, tensor, pipeline, expert, and sequence parallelism: the five strategies for distributing model training and inference across multiple GPUs, and how frontier labs combine all of them.
- [Post-Training Overview](https://theorempath.com/topics/post-training-overview) (tier 2, advanced): The full post-training stack in 2026: SFT, RLHF, DPO, GRPO, constitutional AI, verifier-guided training, and self-improvement loops. Why post-training is now its own discipline.
- [Prefix Caching](https://theorempath.com/topics/prefix-caching) (tier 2, advanced): Share computed KV cache entries across requests that share the same prefix. Radix attention trees enable efficient lookup. Significant latency savings for prefix-heavy production workloads.
- [Prompt Engineering and In-Context Learning](https://theorempath.com/topics/prompt-engineering-and-in-context-learning) (tier 2, advanced): In-context learning allows LLMs to adapt to new tasks from examples in the prompt without weight updates. Theories for why it works, prompting strategies, and why prompt engineering is configuring inference-time computation.
- [Reasoning Data Curation](https://theorempath.com/topics/reasoning-data-curation) (tier 2, advanced): How to build training data for reasoning models: math with verified solutions, code with test cases, rejection sampling, external verification, self-play for problem generation, and the connection to RLVR.
- [Red-Teaming and Adversarial Evaluation](https://theorempath.com/topics/red-teaming-and-adversarial-eval) (tier 2, advanced): Systematically trying to make models produce harmful or incorrect outputs: manual and automated red-teaming, jailbreaks, prompt injection, adversarial suffixes, and why adversarial evaluation is necessary before deployment.
- [Reward Hacking](https://theorempath.com/topics/reward-hacking) (tier 2, advanced): Goodhart's law for AI: when models exploit reward model weaknesses instead of being genuinely helpful, including verbosity hacking, sycophancy, and structured mitigation strategies.
- [Reward Models and Verifiers](https://theorempath.com/topics/reward-models-and-verifiers) (tier 2, advanced): Reward models trained on human preferences vs verifiers that check output correctness. Bradley-Terry models, process vs outcome rewards, Goodhart's law, and why verifiers are more robust.
- [Scaling Compute-Optimal Training](https://theorempath.com/topics/scaling-compute-optimal-training) (tier 2, advanced): Chinchilla scaling: how to optimally allocate a fixed compute budget between model size and training data, why many models were undertrained, and the post-Chinchilla reality of data quality and inference cost.
- [Speculative Decoding and Quantization](https://theorempath.com/topics/speculative-decoding-and-quantization) (tier 2, advanced): Two core inference optimizations: speculative decoding for latency (draft-verify parallelism) and quantization for memory and throughput (reducing weight precision without destroying quality).
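The quantization half of the speculative-decoding-and-quantization entry above can be sketched in a few lines. A hedged example of symmetric per-tensor int8 quantization, the simplest scheme (GPTQ and AWQ, covered on the page, are considerably more sophisticated; names here are illustrative):

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: map floats in [-a, a] to
    # integers in [-127, 127] with a single scale factor.
    a = max(abs(w) for w in weights)
    scale = a / 127 if a else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats; per-weight error is at most scale / 2.
    return [qi * scale for qi in q]
```

The round-trip error per weight is bounded by half the quantization step, which is why outlier weights (which inflate the scale) are the central obstacle at low bit widths.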
- [Structured Output and Constrained Generation](https://theorempath.com/topics/structured-output-and-constrained-generation) (tier 2, advanced): Forcing LLM output to conform to schemas via constrained decoding: logit masking for valid tokens, grammar-guided generation, and why constraining the search space can improve both reliability and quality.
- [Test-Time Compute and Search](https://theorempath.com/topics/test-time-compute-and-search) (tier 2, advanced): One of the biggest frontier shifts: spending more compute at inference through repeated sampling, verifier-guided search, MCTS for reasoning, chain-of-thought as compute, and latent reasoning.
- [Test-Time Training and Adaptive Inference](https://theorempath.com/topics/test-time-training) (tier 2, advanced): Updating model parameters at inference time using self-supervised objectives on the test input itself. TTT layers replace fixed linear recurrences (as in Mamba) with learned update rules, blurring the boundary between training and inference.
- [Tool-Augmented Reasoning](https://theorempath.com/topics/tool-augmented-reasoning) (tier 2, advanced): LLMs that call external tools during reasoning: Toolformer for learning when to invoke APIs, ReAct for interleaving thought and action, and code-as-thought for replacing verbal arithmetic with executed programs.
- [Verifier Design and Process Reward](https://theorempath.com/topics/verifier-design-and-process-reward) (tier 2, advanced): Detailed treatment of verifier types, process vs outcome reward models, verifier-guided search, self-verification, and the connection to test-time compute scaling. How to design reward signals for reasoning models.
- [Video World Models](https://theorempath.com/topics/video-world-models) (tier 2, advanced): Turning pretrained video diffusion models into interactive world simulators: condition on actions to generate future frames, enabling RL agent training, robot planning, and game AI without physical environments.
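The structured-output entry above rests on one mechanism: before choosing a token, set the logits of tokens that would violate the schema to negative infinity. A minimal greedy-decoding sketch (token ids and names are hypothetical):

```python
import math

def constrained_argmax(logits, allowed_ids):
    # Logit masking for constrained decoding: disallowed tokens get
    # -inf, so the greedy choice is always schema-valid.
    masked = [l if i in allowed_ids else -math.inf
              for i, l in enumerate(logits)]
    return max(range(len(masked)), key=lambda i: masked[i])
```

In a real system a grammar engine recomputes `allowed_ids` after every emitted token, so the mask tracks the current parser state rather than a fixed whitelist.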
- [Agent Protocols: MCP and A2A](https://theorempath.com/topics/agent-protocols-mcp-a2a) (tier 3, advanced): The protocol layer for AI agents: MCP (Model Context Protocol) for tool access, A2A (Agent-to-Agent) for inter-agent communication, and why standardized interfaces matter for the agent ecosystem.
- [AMD Competition Landscape](https://theorempath.com/topics/amd-competition-landscape) (tier 3, core): AMD's MI300X and MI325X GPUs compete with NVIDIA on memory bandwidth and capacity but lag on software ecosystem. Competition matters because pricing, supply diversity, and vendor lock-in determine who can train and serve models.
- [ASML and Chip Manufacturing](https://theorempath.com/topics/asml-and-chip-manufacturing) (tier 3, core): ASML is the sole manufacturer of EUV lithography machines used to produce every advanced AI chip. Understanding the semiconductor supply chain reveals a critical concentration risk for AI compute.
- [Audio Language Models](https://theorempath.com/topics/audio-language-models) (tier 3, advanced): Models that process and generate speech alongside text: audio tokenization, Whisper for transcription, end-to-end voice models, music generation, and the audio-language frontier.
- [Continuous Thought Machines](https://theorempath.com/topics/continuous-thought-machines) (tier 3, research): Neural networks that process information through continuous-time internal dynamics rather than discrete layer-by-layer computation. Inspired by neural ODEs and dynamical systems, these architectures let the network 'think' for a variable amount of internal time before producing an output.
- [Distributed Training Theory](https://theorempath.com/topics/distributed-training-theory) (tier 3, advanced): Training frontier models requires thousands of GPUs. Data parallelism, model parallelism, and communication-efficient methods make this possible.
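The simplest ingredient of the distributed-training entry above is data parallelism: each worker computes gradients on its own data shard, and an all-reduce averages them so every replica applies the identical update. A toy sketch of the averaging step only (production systems run ring or tree all-reduce over a collective-communication library, not a Python loop):

```python
def allreduce_mean(grads_per_worker):
    # Average per-worker gradient vectors elementwise. After this step
    # every worker holds the same averaged gradient, keeping all model
    # replicas in sync after the optimizer update.
    n = len(grads_per_worker)
    dim = len(grads_per_worker[0])
    return [sum(g[i] for g in grads_per_worker) / n for i in range(dim)]
```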
- [Donut and OCR-Free Document Understanding](https://theorempath.com/topics/donut-and-ocr-free-document-understanding) (tier 3, advanced): End-to-end document understanding without OCR: Donut reads document images directly and generates structured output, bypassing the error-prone OCR pipeline. Nougat extends this to academic paper parsing.
- [Energy Efficiency and Green AI](https://theorempath.com/topics/energy-efficiency-and-green-ai) (tier 3, core): The compute cost of training frontier models, carbon footprint, FLOPs vs wall-clock time vs dollars, and why reporting efficiency matters. Efficient alternatives: distillation, pruning, quantization, and scaling laws for optimal compute allocation.
- [Key Researchers and Ideas](https://theorempath.com/topics/key-researchers-and-ideas) (tier 3, core): Reference mapping key researchers to their technical contributions: Hinton, LeCun, Bengio, Sutskever, Amodei, Hassabis, Karpathy. Who did what, and what still matters in 2026.
- [Model Merging and Weight Averaging](https://theorempath.com/topics/model-merging-and-weight-averaging) (tier 3, advanced): Combining trained models by averaging or interpolating their weights: SWA, SLERP, TIES-Merging, DARE. Why it works (loss landscape mode connectivity), when it fails, and applications to combining specialized models.
- [NVIDIA GPU Architectures](https://theorempath.com/topics/nvidia-gpu-architectures) (tier 3, advanced): A concrete reference for the GPU hardware that determines what ML models can be trained and served: H100, H200, B200, GB200, and Rubin architectures, with emphasis on memory bandwidth as the primary bottleneck.
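The model-merging entry above, in its simplest form, is parameter-wise interpolation between checkpoints that share an architecture. A hedged sketch of uniform weight averaging (as in SWA; SLERP, TIES, and DARE add interpolation geometry, sign resolution, and sparsification on top — names below are illustrative):

```python
def merge_by_averaging(models, coeffs=None):
    # Merge checkpoints by interpolating each named parameter vector.
    # Uniform coefficients give plain weight averaging; this only makes
    # sense for models with identical architectures, ideally fine-tuned
    # from a common base (mode connectivity).
    n = len(models)
    coeffs = coeffs or [1.0 / n] * n
    merged = {}
    for name in models[0]:
        merged[name] = [sum(c * m[name][i] for c, m in zip(coeffs, models))
                        for i in range(len(models[0][name]))]
    return merged
```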
- [Open Problems in ML Theory](https://theorempath.com/topics/open-problems-in-ml-theory) (tier 3, research): A curated list of genuinely open problems in machine learning theory: why overparameterized networks generalize, the right complexity measure for deep learning, feature learning beyond NTK, why scaling laws hold, emergent abilities, transformer-specific theory, and post-training theory.
- [Plan-then-Generate](https://theorempath.com/topics/plan-then-generate) (tier 3, research): Beyond strict left-to-right token generation: plan what to say before producing tokens. Outline-then-fill, hierarchical generation, and multi-token prediction. Planning reduces incoherence in long outputs and enables structural editing.
- [Quantization Theory](https://theorempath.com/topics/quantization-theory) (tier 3, advanced): Reduce model weight precision from FP32 to FP16, INT8, or INT4. Post-training quantization, quantization-aware training, GPTQ, AWQ, and GGUF. Quantization is how large language models actually get deployed.
- [Qwen and Chinese Models](https://theorempath.com/topics/qwen-and-chinese-models) (tier 3, core): The Chinese open-weight model ecosystem: Qwen (Alibaba), Yi (01.AI), Baichuan, GLM (Zhipu AI), and Kimi (Moonshot AI), with a focus on multilingual capability and independent scaling.
- [Table Extraction and Structure Recognition](https://theorempath.com/topics/table-extraction-and-structure-recognition) (tier 3, advanced): Detecting tables in documents, recognizing row and column structure, and extracting cell content. Why tables are hard: merged cells, borderless layouts, nested headers, and cascading pipeline errors.
- [World Model Evaluation](https://theorempath.com/topics/world-model-evaluation) (tier 3, advanced): How to measure whether a learned world model is useful: prediction accuracy, controllability (sim-to-real transfer), planning quality, and why long-horizon evaluation is hard.

## Side-by-side comparisons

- [Adam vs. SGD](https://theorempath.com/compare/adam-vs-sgd): Adam adapts the learning rate per parameter using first and second moment estimates for fast early convergence; SGD with momentum uses a single global learning rate and often finds flatter minima that generalize better. The choice depends on your priorities: speed to convergence or final model quality.
- [AdamW vs. Adam](https://theorempath.com/compare/adamw-vs-adam): Adam applies L2 regularization inside the gradient, where the adaptive scaling distorts the penalty. AdamW decouples weight decay from the adaptive step, applying it directly to parameters. This distinction matters: every modern transformer uses AdamW, not Adam with L2.
- [Autoregressive Models vs. Diffusion Models](https://theorempath.com/compare/autoregressive-vs-diffusion): Autoregressive models generate tokens sequentially via next-token prediction; diffusion models generate data by iteratively denoising from Gaussian noise. Sequential discrete generation vs. parallel continuous denoising: why LLMs dominate text and diffusion dominates images.
- [Autoregressive Models vs. JEPA](https://theorempath.com/compare/autoregressive-vs-jepa): Two competing paradigms for learning world models: autoregressive models predict raw tokens or pixels sequentially, while JEPA predicts abstract representations in a learned latent space without generating observable outputs.
- [Azuma-Hoeffding vs. Freedman Inequality](https://theorempath.com/compare/azuma-hoeffding-vs-freedman): Azuma-Hoeffding uses only bounded increments and gives sub-Gaussian tails. Freedman incorporates the predictable quadratic variation and is tighter when the variance is small. Freedman interpolates between sub-Gaussian and sub-exponential behavior.
- [CNN vs. ViT vs. Swin Transformer](https://theorempath.com/compare/cnn-vs-vit-vs-swin): CNNs bake in local inductive bias and translation equivariance. ViT applies global self-attention to image patches but needs large datasets. Swin Transformer uses hierarchical shifted windows to get the best of both: local efficiency with global reasoning.
- [Contrastive Loss vs. Triplet Loss](https://theorempath.com/compare/contrastive-vs-triplet): Contrastive loss (InfoNCE) pushes apart a query from N-1 negatives simultaneously using a softmax over similarities. Triplet loss pushes apart one anchor-negative pair relative to one anchor-positive pair with a fixed margin. InfoNCE scales better with batch size and dominates modern self-supervised learning. Triplet loss is simpler but requires careful hard negative mining to train effectively.
- [Covering Numbers vs. Packing Numbers](https://theorempath.com/compare/covering-vs-packing-numbers): Covering numbers count the minimum eps-net size. Packing numbers count the maximum number of eps-separated points. They agree up to a factor of 2 in the scale, M(2 eps) <= N(eps) <= M(eps), but each appears naturally in different proof contexts.
- [Cramér-Rao Bound vs. Minimax Lower Bounds](https://theorempath.com/compare/cramer-rao-vs-minimax): Two frameworks for bounding estimation difficulty: Cramér-Rao gives a local lower bound for unbiased estimators at a single parameter value, while minimax lower bounds apply to all estimators over an entire parameter class.
- [Cross-Entropy vs. MSE Loss](https://theorempath.com/compare/cross-entropy-vs-mse): Cross-entropy is the natural loss for classification because it equals the negative log-likelihood of a Bernoulli or categorical model, produces strong gradients even when the model is confidently wrong, and decomposes as entropy plus KL divergence. MSE is the natural loss for regression, corresponding to Gaussian likelihood, but causes gradient saturation when paired with sigmoid or softmax outputs.
- [Dense Transformers vs. Mixture-of-Experts](https://theorempath.com/compare/dense-vs-mixture-of-experts): Dense transformers activate all parameters for every token, giving simple training but high compute per token. Mixture-of-experts routes each token to k of N experts, achieving higher total capacity with lower per-token compute, at the cost of routing complexity and load balancing challenges.
- [Diffusion Models vs. GANs vs. VAEs](https://theorempath.com/compare/diffusion-vs-gans-vs-vaes): Three generative model families compared: GANs use adversarial training for sharp samples but suffer mode collapse, VAEs optimize ELBO for smooth latent spaces but produce blurry outputs, and diffusion models iteratively denoise for high quality at the cost of slow sampling.
- [Dropout vs. Batch Normalization](https://theorempath.com/compare/dropout-vs-batch-norm): Dropout regularizes by stochastic masking of activations, approximating an ensemble of exponentially many subnetworks. Batch normalization normalizes activations to stabilize training, with an incidental regularization effect from mini-batch noise. Both reduce overfitting, but through completely different mechanisms, and they interact poorly because dropout shifts the statistics that batch norm estimates.
- [Early Stopping vs. Weight Decay](https://theorempath.com/compare/early-stopping-vs-weight-decay): Early stopping halts training when validation loss increases, limiting effective model capacity by restricting optimization time. Weight decay adds an explicit penalty on weight magnitude to the loss function. For linear models, early stopping with gradient descent is equivalent to L2 regularization. In deep networks, they control capacity through different mechanisms and are typically used together.
- [Encoder-Only vs. Decoder-Only vs. Encoder-Decoder](https://theorempath.com/compare/encoder-only-vs-decoder-only-vs-encoder-decoder): Encoder-only models (BERT) use bidirectional attention for classification and extraction. Decoder-only models (GPT) use causal masking for autoregressive generation. Encoder-decoder models (T5) use cross-attention to condition generation on a fully encoded input. The architecture choice determines what tasks the model can perform natively.
- [Fano's Method vs. Le Cam's Method](https://theorempath.com/compare/fano-vs-le-cam): Two techniques for proving minimax lower bounds: Fano reduces to many-hypothesis testing via mutual information, Le Cam reduces to binary hypothesis testing via total variation distance.
- [FlashAttention vs. Vanilla Attention](https://theorempath.com/compare/flash-attention-vs-vanilla-attention): FlashAttention and vanilla attention compute the exact same output. The difference is entirely in IO complexity: vanilla materializes the full n x n attention matrix in GPU HBM, while FlashAttention tiles the computation into SRAM blocks using an online softmax trick, reducing memory from O(n^2) to O(n) and achieving 2-4x wall-clock speedup.
- [Focal Loss vs. Cross-Entropy Loss](https://theorempath.com/compare/focal-vs-cross-entropy): Cross-entropy loss treats all examples equally, weighting each by its negative log-probability. Focal loss multiplies the cross-entropy by a modulating factor that downweights well-classified (easy) examples, focusing training on hard examples. Focal loss is a strict generalization of cross-entropy (setting the focusing parameter to zero recovers cross-entropy). It is most effective for severe class imbalance where easy negatives dominate the gradient.
- [Frequentist vs. Bayesian Inference](https://theorempath.com/compare/frequentist-vs-bayesian): Two foundational philosophies of statistical inference: frequentists treat parameters as fixed unknowns and data as random, Bayesians treat parameters as random variables with prior distributions and compute posteriors.
- [Gradient Clipping vs. Weight Decay](https://theorempath.com/compare/gradient-clipping-vs-weight-decay): Gradient clipping limits the magnitude of gradients during backpropagation, preventing training instability from exploding gradients. Weight decay limits the magnitude of weights, preventing overfitting by penalizing large parameters. They address different problems: clipping is about training stability, weight decay is about generalization. Modern LLM training uses both.
- [Hoeffding vs. Bernstein Inequality](https://theorempath.com/compare/hoeffding-vs-bernstein): When to use range-only bounds vs. variance-aware bounds: Bernstein is tighter when variance is small, Hoeffding is simpler and sufficient when it is not.
- [Kaplan vs. Chinchilla Scaling](https://theorempath.com/compare/kaplan-vs-chinchilla-scaling): Kaplan (2020) said to scale up parameters faster than data. Chinchilla (2022) showed the opposite: many large models were undertrained. The disagreement came from a methodological flaw in how Kaplan fitted the scaling exponents.
- [Kernel Methods vs. Feature Learning](https://theorempath.com/compare/kernel-methods-vs-feature-learning): Kernel methods fix a feature map and learn weights. Feature learning methods learn the features themselves. The NTK regime is kernel-like; the rich regime learns features. When each approach suffices and when it does not.
- [KL Divergence vs. Cross-Entropy](https://theorempath.com/compare/kl-divergence-vs-cross-entropy): Cross-entropy and KL divergence are related by a constant: H(P,Q) = H(P) + KL(P||Q). When the true distribution P is fixed (as in supervised classification), minimizing cross-entropy is equivalent to minimizing KL divergence. They differ in symmetry, interpretation, and usage context.
- [L1 vs. L2 Regularization](https://theorempath.com/compare/l1-vs-l2): L1 (Lasso) penalizes the absolute value of weights, producing sparse solutions via the diamond geometry of the L1 ball. L2 (Ridge) penalizes squared weights, shrinking all coefficients toward zero without eliminating any. The choice depends on whether the true model is sparse or dense.
- [Lazy (NTK) Regime vs. Feature Learning Regime](https://theorempath.com/compare/lazy-vs-feature-learning): Neural networks can operate in two regimes: the lazy regime where weights barely move and the network behaves like a fixed kernel, or the feature learning regime where weights move substantially and learn task-specific representations.
- [LoRA vs. Full Fine-Tune vs. QLoRA](https://theorempath.com/compare/lora-vs-full-finetune-vs-qlora): Full fine-tuning updates all parameters and achieves the best task performance but requires storing a complete copy of the model per task. LoRA freezes the base model and trains low-rank additive matrices, cutting trainable parameters by 100x or more. QLoRA quantizes the base model to 4-bit and applies LoRA on top, enabling fine-tuning of 65B models on a single 48GB GPU.
- [Martingale CLT vs. Classical CLT](https://theorempath.com/compare/martingale-clt-vs-classical-clt): The classical CLT requires iid random variables. The martingale CLT extends to dependent sequences with mean-zero increments. The martingale version is needed for stochastic approximation and online learning.
- [MLE vs. Method of Moments](https://theorempath.com/compare/mle-vs-method-of-moments): Two classical estimation strategies: MLE maximizes the likelihood and is asymptotically efficient, while the method of moments matches sample moments to population moments and is simpler but typically less efficient.
- [Model-Based vs. Model-Free RL](https://theorempath.com/compare/model-based-vs-model-free-rl): Model-based RL learns a dynamics model and plans internally (Dreamer, MuZero), while model-free RL learns value functions or policies directly from experience (DQN, PPO). The tradeoff is sample efficiency vs. model error.
- [Multi-Head vs. Multi-Query vs. Grouped-Query Attention](https://theorempath.com/compare/multi-head-vs-multi-query-vs-gqa): Multi-head attention (MHA) gives each head its own K, V projections. Multi-query attention (MQA) shares a single K, V across all heads. Grouped-query attention (GQA) shares K, V within groups of heads. MQA and GQA reduce KV cache size during autoregressive inference, trading a small quality loss for dramatically lower memory and faster decoding.
- [NTK Regime vs. Mean-Field Regime](https://theorempath.com/compare/ntk-vs-mean-field): Two limiting theories of wide neural networks: NTK linearizes training dynamics around initialization (lazy regime), while mean-field theory captures feature learning through substantial weight movement (rich regime).
- [On-Policy vs. Off-Policy Learning](https://theorempath.com/compare/on-policy-vs-off-policy): On-policy methods learn from data generated by the current policy (SARSA, PPO), ensuring consistency but wasting samples. Off-policy methods learn from any data including replay buffers (Q-learning, SAC), gaining efficiency at the cost of stability.
- [PAC Learning vs. Agnostic PAC Learning](https://theorempath.com/compare/pac-vs-agnostic-pac): Realizable PAC learning assumes the target is in the hypothesis class. Agnostic PAC drops this assumption and competes with the best hypothesis in the class. Agnostic learning is harder, requiring uniform convergence and yielding slower sample complexity.
- [Pointwise vs. Uniform Convergence](https://theorempath.com/compare/pointwise-vs-uniform-convergence): Pointwise convergence allows different rates at different points. Uniform convergence requires the same rate everywhere. Learning theory needs uniform convergence because ERM must work simultaneously for all hypotheses.
- [PPO vs. SAC](https://theorempath.com/compare/ppo-vs-sac): Two dominant actor-critic algorithms compared: PPO uses clipped surrogate objectives on-policy, while SAC maximizes entropy off-policy. PPO excels in discrete/LLM settings, SAC in continuous robotics.
- [Pre-Norm vs. Post-Norm](https://theorempath.com/compare/pre-norm-vs-post-norm): Post-norm places layer normalization after the residual addition, matching the original transformer. Pre-norm places it before the sublayer, inside the residual branch. Pre-norm enables stable training at depth without learning rate warmup, which is why GPT, LLaMA, and most modern LLMs use it. Post-norm can achieve better final performance but requires careful initialization and warmup.
- [Post-Training Quantization vs. Quantization-Aware Training](https://theorempath.com/compare/quantization-ptq-vs-qat): PTQ quantizes a pretrained model with no retraining. QAT simulates quantization during training to recover quality. GPTQ and AWQ are modern PTQ methods that close much of the gap. The tradeoff is compute cost vs. model quality at low bit widths.
- [Ridge vs. Lasso Regression](https://theorempath.com/compare/ridge-vs-lasso): L2 penalty shrinks all coefficients toward zero; L1 penalty drives some exactly to zero. Ridge has a closed-form solution and handles multicollinearity; Lasso performs variable selection but requires iterative solvers.
- [RLHF vs. DPO vs. GRPO](https://theorempath.com/compare/rlhf-vs-dpo-vs-grpo): Three approaches to aligning language models with human preferences. RLHF trains a separate reward model and optimizes via PPO. DPO eliminates the reward model by reparameterizing the preference objective. GRPO uses group-relative scoring without a reward model, suited for reasoning tasks with verifiable answers.
- [RMSNorm vs. LayerNorm](https://theorempath.com/compare/rmsnorm-vs-layernorm): LayerNorm normalizes activations by centering (subtracting the mean) and scaling (dividing by the standard deviation), then applies a learned affine transformation. RMSNorm drops the mean centering step and normalizes by the root mean square only. RMSNorm is 10-15% faster at the same expressivity for transformer training. LLaMA, Gemma, Mistral, and most modern LLMs use RMSNorm.
- [RoPE vs. ALiBi vs.
Sinusoidal Positional Encoding](https://theorempath.com/compare/rope-vs-alibi-vs-sinusoidal): Sinusoidal positional encoding (original Transformer) adds fixed position vectors to token embeddings. RoPE (Rotary Position Embedding) applies position-dependent rotation to query and key vectors, encoding relative position through dot product geometry. ALiBi (Attention with Linear Biases) adds a linear position-dependent penalty directly to attention scores. RoPE extrapolates better and dominates modern LLMs. ALiBi is simplest to implement. Sinusoidal is largely historical. - [Self-Play vs. Independent Learning](https://theorempath.com/compare/self-play-vs-independent-learning): Two approaches to multi-agent reinforcement learning: self-play trains an agent against copies of itself in a non-stationary environment, while independent learning treats other agents as part of a fixed environment. Self-play converges in two-player zero-sum games; independent learning can cycle or diverge. - [SFT vs. DPO](https://theorempath.com/compare/sft-vs-dpo): Supervised fine-tuning (SFT) learns from demonstration data by maximizing log-likelihood on human-written outputs. Direct preference optimization (DPO) learns from pairwise preference data by directly optimizing the policy without fitting a separate reward model. SFT is simpler and data-efficient for instruction following. DPO is preferred when preference signals are available and you want to skip the reward model stage of RLHF. - [Shampoo vs. Adam vs. Muon](https://theorempath.com/compare/shampoo-vs-adam-vs-muon): Three approaches to preconditioning gradients: Adam (diagonal, per-parameter), Shampoo (full-matrix Kronecker), and Muon (orthogonalized updates via Newton-Schulz). Each uses increasingly rich curvature information at increasing computational cost. - [Sub-Gaussian vs. 
Sub-Exponential Random Variables](https://theorempath.com/compare/subgaussian-vs-subexponential): Two tail regimes for concentration: sub-Gaussian gives exp(-ct^2), sub-exponential gives exp(-ct) for large deviations, and the boundary between them explains when classical bounds break down. - [SVM vs. Logistic Regression](https://theorempath.com/compare/svm-vs-logistic-regression): SVMs maximize the margin using hinge loss and produce sparse support vectors; logistic regression maximizes likelihood using log loss and produces calibrated probabilities. SVMs handle nonlinearity via the kernel trick; LR needs explicit feature engineering. - [SwiGLU vs. GELU vs. ReLU](https://theorempath.com/compare/swiglu-vs-gelu-vs-relu): ReLU is the simplest activation: zero for negative inputs, identity for positive. GELU applies a smooth, probabilistic gate based on the Gaussian CDF. SwiGLU combines the Swish activation with a gated linear unit, using an extra linear projection to gate the hidden representation. SwiGLU outperforms GELU and ReLU in transformer feed-forward networks at the cost of additional parameters. LLaMA, PaLM, and Gemma use SwiGLU. GPT-2 and BERT use GELU. - [Transformer vs. Mamba vs. TTT](https://theorempath.com/compare/transformer-vs-mamba-vs-ttt): Three competing sequence architectures: attention (exact retrieval, quadratic cost), state-space models (linear cost, compressed state), and test-time training (gradient-based state updates, rich memory). Each makes different tradeoffs between memory, compute, and retrieval ability. - [Value Iteration vs. Policy Iteration](https://theorempath.com/compare/value-iteration-vs-policy-iteration): Both algorithms find optimal policies for finite MDPs. Value iteration applies the Bellman optimality operator repeatedly and extracts the policy at the end. Policy iteration alternates between full policy evaluation and greedy improvement, converging in fewer iterations but with more work per iteration. - [VC Dimension vs. 
Rademacher Complexity](https://theorempath.com/compare/vc-dimension-vs-rademacher-complexity): Worst-case combinatorial complexity vs. data-dependent average-case complexity: when each gives tighter generalization bounds. - [Weak Duality vs. Strong Duality](https://theorempath.com/compare/weak-vs-strong-duality): Weak duality always holds: dual optimal is at most primal optimal. Strong duality says they are equal, but requires constraint qualifications like Slater's condition. KKT conditions become necessary and sufficient under strong duality. - [Weight Decay vs. L2 Regularization](https://theorempath.com/compare/weight-decay-vs-l2): Weight decay and L2 regularization are identical for SGD but diverge under adaptive optimizers. L2 adds the penalty gradient before adaptive scaling, so heavily updated parameters get less regularization. Weight decay subtracts directly from weights after the update, applying uniform regularization regardless of gradient history.
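The last comparison above (weight decay vs. L2 under adaptive optimizers) can be checked numerically. Below is a minimal illustrative sketch, not taken from any TheoremPath page: a hand-rolled single Adam step on a scalar parameter, with all constants (learning rate, decay strength, gradient value) chosen for illustration. It shows that when the raw gradient is large, Adam's normalization almost entirely absorbs an L2 penalty folded into the gradient, while AdamW-style decoupled decay still shrinks the weight by the full `lr * lam * w`.

```python
import numpy as np

def adam_step(w, g, m, v, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, t=1):
    """One bias-corrected Adam step on scalar parameter w with gradient g."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)          # bias-corrected first moment
    v_hat = v / (1 - b2**t)          # bias-corrected second moment
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

w0, grad, lam, lr = 1.0, 100.0, 0.01, 0.1

# L2 regularization: add lam * w to the gradient, then let Adam scale the sum.
w_l2, _, _ = adam_step(w0, grad + lam * w0, 0.0, 0.0, lr=lr)

# Decoupled weight decay (AdamW-style): plain gradient through Adam,
# then subtract lr * lam * w directly from the weight.
w_plain, _, _ = adam_step(w0, grad, 0.0, 0.0, lr=lr)
w_wd = w_plain - lr * lam * w0

# With grad = 100, Adam's update is ~lr * sign(grad) either way, so the L2
# penalty's effect is on the order of 1e-10, while decoupled decay removes
# a full lr * lam * w = 0.001 from the weight.
print(f"L2-in-gradient:  {w_l2:.6f}")
print(f"decoupled decay: {w_wd:.6f}")
```

Running the sketch, both variants land near 0.9, but the decoupled version is smaller by roughly the full decay term, which is exactly the divergence the comparison page describes.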