# TheoremPath

> TheoremPath is a structured theory library for ML, statistics, and mathematics. 491 topic pages linked by a prerequisite graph from axioms to frontier research. Every theorem lists its assumptions, proof sketch, failure modes, and textbook references.

Maintained by Robby Sneiderman (https://github.com/Robby955). Built with Next.js, MDX, and KaTeX. Content is file-based and version-controlled.

## How to cite

If you cite or quote from TheoremPath, please link to the specific topic page. Example:

> TheoremPath. "Concentration Inequalities." https://theorempath.com/topics/concentration-inequalities

## Navigation entry points

- [Curriculum](https://theorempath.com/curriculum): Structured starting paths by background.
- [Atlas](https://theorempath.com/atlas): Interactive graph of all topics with grounding-path tracing.
- [Demos](https://theorempath.com/demos): Interactive demos for core mechanics.
- [Diagnostic](https://theorempath.com/diagnostic): Short quiz that surfaces gaps in your current knowledge.
- [Topics index](https://theorempath.com/topics): Alphabetical list of all 491 topics.

## Foundations (Layer 0A axioms, 0B infrastructure)

- [Asymptotic Statistics](https://theorempath.com/topics/asymptotic-statistics): The large-sample toolbox: delta method, Slutsky's theorem, asymptotic normality of MLE, local asymptotic normality, and Fisher efficiency. These results justify nearly every confidence interval and hypothesis test used in practice.
- [Basic Logic and Proof Techniques](https://theorempath.com/topics/basic-logic-and-proof-techniques): The fundamental proof strategies used throughout mathematics: direct proof, contradiction, contrapositive, induction, construction, and cases. Required vocabulary for reading any theorem.
- [Basu's Theorem](https://theorempath.com/topics/basu-theorem): A complete sufficient statistic is independent of every ancillary statistic. This provides the cleanest method for proving independence between statistics without computing joint distributions.
- [Bayesian Estimation](https://theorempath.com/topics/bayesian-estimation): The Bayesian approach to parameter estimation: encode prior beliefs, update with data via Bayes rule, and obtain a full posterior distribution over parameters. Conjugate priors, MAP estimation, and the Bernstein-von Mises theorem showing that the posterior concentrates around the true parameter.
- [Birthday Paradox](https://theorempath.com/topics/birthday-paradox): In a group of 23 people, the probability that two share a birthday exceeds 50%. Pairwise collision counting explains why this threshold is so low.
- [Cantor's Theorem and Uncountability](https://theorempath.com/topics/cantors-theorem-and-uncountability): Cantor's diagonal argument proves the reals are uncountable. The power set of any set has strictly greater cardinality. These results are the origin of the distinction between countable and uncountable infinity.
- [Cardinality and Countability](https://theorempath.com/topics/cardinality-and-countability): Two sets have the same cardinality when a bijection exists between them. The naturals, integers, and rationals are countable. The reals are uncountable, proved by Cantor's diagonal argument.
- [Category Theory](https://theorempath.com/topics/category-theory): Categories, functors, natural transformations, universal properties, adjunctions, and the Yoneda lemma. The language of abstract structure that unifies algebra, topology, logic, and increasingly appears in ML theory.
- [Central Limit Theorem](https://theorempath.com/topics/central-limit-theorem): The CLT: the sample mean is approximately Gaussian for large n, regardless of the original distribution. Berry-Esseen rate, multivariate CLT, and why CLT explains asymptotic normality of MLE, confidence intervals, and the ubiquity of the Gaussian.
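The 23-person threshold in the Birthday Paradox entry can be checked directly via the pairwise collision count (a minimal Python sketch, assuming 365 equally likely birthdays; the function name is mine):

```python
def birthday_collision_prob(n: int, days: int = 365) -> float:
    """P(at least two of n people share a birthday), assuming uniform birthdays."""
    p_no_collision = 1.0
    for i in range(n):
        # The (i+1)-th person must avoid all i earlier birthdays.
        p_no_collision *= (days - i) / days
    return 1.0 - p_no_collision

print(birthday_collision_prob(22))  # just under 0.5
print(birthday_collision_prob(23))  # just over 0.5
```

The crossing point at n = 23 falls out of the product, with no simulation needed.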
- [Common Inequalities](https://theorempath.com/topics/common-inequalities): The algebraic and probabilistic inequalities that appear everywhere in ML theory: Cauchy-Schwarz, Jensen, AM-GM, Hölder, Minkowski, Young, Markov, and Chebyshev.
- [Common Probability Distributions](https://theorempath.com/topics/common-probability-distributions): The essential probability distributions for ML: Bernoulli through Dirichlet, with PMFs/PDFs, moments, relationships, and when each arises in practice.
- [Compactness and Heine-Borel](https://theorempath.com/topics/compactness-and-heine-borel): Sequential compactness, the Heine-Borel theorem in finite dimensions, the extreme value theorem, and why compactness is the key assumption in optimization.
- [Computability Theory](https://theorempath.com/topics/computability-theory): What can be computed? Turing machines, decidability, the Church-Turing thesis, recursive and recursively enumerable sets, reductions, Rice's theorem, and connections to learning theory.
- [Continuity in R^n](https://theorempath.com/topics/continuity-in-rn): Epsilon-delta continuity, uniform continuity, and Lipschitz continuity in Euclidean space. Lipschitz constants control how fast function values change and appear throughout optimization and generalization theory.
- [Convex Duality](https://theorempath.com/topics/convex-duality): Fenchel conjugates, the Fenchel-Moreau theorem, weak and strong duality, KKT conditions, and why duality gives the kernel trick for SVMs, connects regularization to constraints, and enables adversarial formulations in DRO.
- [Counting and Combinatorics](https://theorempath.com/topics/counting-and-combinatorics): Counting principles, binomial and multinomial coefficients, inclusion-exclusion, and Stirling's approximation. These tools appear whenever you count hypotheses, bound shattering coefficients, or analyze combinatorial arguments in learning theory.
- [Cramér-Rao Bound](https://theorempath.com/topics/cramer-rao-bound): The fundamental lower bound on the variance of any unbiased estimator: no unbiased estimator can have variance smaller than the reciprocal of the Fisher information.
- [Deep Learning (Goodfellow, Bengio, Courville)](https://theorempath.com/topics/deep-learning-goodfellow-book): Reading guide for the Goodfellow, Bengio, Courville textbook (2016). What it covers, which chapters still matter in 2026, what has aged, and how to use it efficiently.
- [Differentiation in R^n](https://theorempath.com/topics/differentiation-in-rn): Partial derivatives, the gradient, directional derivatives, the total derivative (Fréchet), and the multivariable chain rule. Why the gradient points in the steepest ascent direction, and why this matters for all of optimization.
- [Dynamic Programming](https://theorempath.com/topics/dynamic-programming): Solve complex optimization problems by decomposing them into overlapping subproblems with optimal substructure. The algorithmic backbone of sequence models, control theory, and reinforcement learning.
- [Editorial Principles](https://theorempath.com/topics/editorial-principles): How TheoremPath treats knowledge, uncertainty, fairness, and systems. Six intellectual lenses with scope conditions: Simon for bounded intelligence, Pearl for causality, Meadows for systems, Ostrom for governance, Rawls for fairness, Taleb for uncertainty.
- [Eigenvalues and Eigenvectors](https://theorempath.com/topics/eigenvalues-and-eigenvectors): Eigenvalues and eigenvectors: the directions a matrix scales without rotating. Characteristic polynomial, diagonalization, the spectral theorem for symmetric matrices, and the direct connection to PCA.
- [The Elements of Statistical Learning (Hastie, Tibshirani, Friedman)](https://theorempath.com/topics/elements-of-statistical-learning-book): Reading guide for ESL (2009, 2nd edition). The standard graduate statistics/ML textbook. Covers linear methods, trees, boosting, SVMs, ensemble methods. What to read, what to skip, and where it excels.
- [Expectation, Variance, Covariance, and Moments](https://theorempath.com/topics/expectation-variance-covariance-moments): Expectation, variance, covariance, correlation, linearity of expectation, variance of sums, and moment-based reasoning in ML.
- [Exponential Function Properties](https://theorempath.com/topics/exponential-function-properties): The exponential function e^x: series definition, algebraic properties, and why it appears everywhere in ML. Softmax, MGFs, the Chernoff method, Boltzmann distributions, and exponential families all reduce to properties of exp.
- [Fisher Information](https://theorempath.com/topics/fisher-information): The Fisher information quantifies how much a sample tells you about an unknown parameter: it measures the curvature of the log-likelihood, sets the Cramér-Rao lower bound on estimator variance, and serves as a natural Riemannian metric on parameter space.
- [Floating-Point Arithmetic](https://theorempath.com/topics/floating-point-arithmetic): How computers represent real numbers, why they get it wrong, and why ML uses float32, float16, bfloat16, and int8. IEEE 754, machine epsilon, overflow, underflow, and catastrophic cancellation.
- [Formal Languages and Automata](https://theorempath.com/topics/formal-languages): Regular languages, context-free grammars, pushdown automata, the Chomsky hierarchy, pumping lemmas, and connections to parsing, neural sequence models, and computational complexity.
- [Foundational Dependencies](https://theorempath.com/topics/foundational-dependencies): Which axiomatic systems does each branch of TheoremPath depend on? A map from content to foundations.
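The mechanics listed under Floating-Point Arithmetic (machine epsilon, rounding, catastrophic cancellation) are directly observable in a few lines (a Python sketch; Python's `float` is IEEE 754 binary64):

```python
import sys

# Machine epsilon: the gap between 1.0 and the next representable float.
eps = sys.float_info.epsilon
print(eps)               # 2**-52 for binary64

# Rounding: 0.1 and 0.2 have no exact binary representation.
print(0.1 + 0.2 == 0.3)  # False
print(0.1 + 0.2)         # 0.30000000000000004

# Absorption: near 1e16 the spacing between floats exceeds 1, so +1 is lost.
x = 1e16
print((x + 1) - x)       # 0.0
```

The same effects, scaled up, are why stable softmax and log-space computation exist (see the Log-Probability Computation entry below).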
- [Functional Analysis Core](https://theorempath.com/topics/functional-analysis-core): The four pillars of functional analysis: Hahn-Banach (extending functionals), Uniform Boundedness (pointwise bounded implies uniformly bounded), Open Mapping (surjective bounded operators have open images), and Banach-Alaoglu (dual unit ball is weak-* compact). These underpin RKHS theory, optimization in function spaces, and duality.
- [Gödel's Incompleteness Theorems](https://theorempath.com/topics/godels-incompleteness-theorems): Gödel's first incompleteness theorem: any consistent formal system containing basic arithmetic has true but unprovable statements. The second: such a system cannot prove its own consistency. These are hard limits on what formal reasoning can achieve.
- [Graph Algorithms Essentials](https://theorempath.com/topics/graph-algorithms-essentials): The graph algorithms every ML practitioner needs: BFS, DFS, Dijkstra, MST, and topological sort. Why they matter for computational graphs, knowledge graphs, dependency resolution, and GNNs.
- [Greedy Algorithms](https://theorempath.com/topics/greedy-algorithms): The greedy paradigm: make the locally optimal choice at each step and never look back. When matroid structure or the exchange argument guarantees global optimality.
- [Information Theory Foundations](https://theorempath.com/topics/information-theory-foundations): The core of information theory for ML: entropy, cross-entropy, KL divergence, mutual information, data processing inequality, and the chain rules that connect them. The language of variational inference, generalization bounds, and representation learning.
- [Inner Product Spaces and Orthogonality](https://theorempath.com/topics/inner-product-spaces-and-orthogonality): Inner product axioms, Cauchy-Schwarz inequality, orthogonality, Gram-Schmidt, projections, and the bridge to Hilbert spaces.
- [Integration and Change of Variables](https://theorempath.com/topics/integration-and-change-of-variables): Riemann integration, improper integrals, the substitution rule, multivariate change of variables via the Jacobian determinant, and Fubini's theorem. The computational backbone of probability and ML.
- [Inverse and Implicit Function Theorem](https://theorempath.com/topics/inverse-and-implicit-function-theorem): The inverse function theorem guarantees local invertibility when the Jacobian is nonsingular. The implicit function theorem guarantees that constraint surfaces are locally graphs. Both are essential for constrained optimization and implicit layers.
- [Joint, Marginal, and Conditional Distributions](https://theorempath.com/topics/joint-marginal-conditional-distributions): Joint distributions, marginalization, conditional distributions, independence, Bayes theorem, and the chain rule of probability.
- [Knapsack Problem](https://theorempath.com/topics/knapsack-problem): The canonical constrained optimization problem: 0/1 knapsack (NP-hard, pseudo-polynomial DP), fractional knapsack (greedy), FPTAS, and connections to Lagrangian relaxation in ML.
- [Lambda Calculus](https://theorempath.com/topics/lambda-calculus): Lambda calculus is the simplest model of computation: just variables, abstraction, and application. It is equivalent to Turing machines in power, and it is the foundation of functional programming, type theory, and the Curry-Howard correspondence.
- [Law of Large Numbers](https://theorempath.com/topics/law-of-large-numbers): The weak and strong laws of large numbers: the sample mean converges to the population mean. Kolmogorov's conditions, the rate of convergence from the CLT, and why LLN justifies using empirical risk as a proxy for population risk.
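The Law of Large Numbers entry is easy to watch in action: the empirical mean of i.i.d. draws approaches the population mean as n grows (a seeded Python sketch using Uniform(0, 1) draws, whose population mean is 0.5; the function name is mine):

```python
import random

random.seed(0)

def empirical_mean(n: int) -> float:
    """Sample mean of n i.i.d. Uniform(0, 1) draws."""
    return sum(random.random() for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    # Deviation from 0.5 shrinks roughly like 1/sqrt(n), per the CLT rate.
    print(n, empirical_mean(n))
```

This is exactly the sense in which empirical risk stands in for population risk in ERM.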
- [Martingale Theory](https://theorempath.com/topics/martingale-theory): Martingales and their convergence properties: Doob martingale, optional stopping theorem, martingale convergence, Azuma-Hoeffding inequality, and Freedman inequality. The tools behind McDiarmid's inequality, online learning regret bounds, and stochastic approximation.
- [Matrix Norms](https://theorempath.com/topics/matrix-norms): Frobenius, spectral, and nuclear norms for matrices. Submultiplicativity. When and why each norm appears in ML theory.
- [Matrix Operations and Properties](https://theorempath.com/topics/matrix-operations-and-properties): Essential matrix operations for ML: trace, determinant, inverse, pseudoinverse, Schur complement, and the Sherman-Morrison-Woodbury formula. When and why each matters.
- [Maximum Likelihood Estimation](https://theorempath.com/topics/maximum-likelihood-estimation): MLE: find the parameter that maximizes the likelihood of observed data. Consistency, asymptotic normality, Fisher information, Cramér-Rao efficiency, and when MLE fails.
- [Measure-Theoretic Probability](https://theorempath.com/topics/measure-theoretic-probability): The rigorous foundations of probability: sigma-algebras, measures, measurable functions as random variables, Lebesgue integration, and the convergence theorems that make modern probability and statistics possible.
- [Method of Moments](https://theorempath.com/topics/method-of-moments): Match sample moments to population moments to estimate parameters. Simpler than MLE but less efficient. Covers classical MoM, generalized method of moments (GMM), and when MoM is the better choice.
- [Metric Spaces, Convergence, and Completeness](https://theorempath.com/topics/metric-spaces-convergence-completeness): Metric space axioms, convergence of sequences, Cauchy sequences, completeness, and the Banach fixed-point theorem.
- [Moment Generating Functions](https://theorempath.com/topics/moment-generating-functions): The moment generating function M(t) = E[e^{tX}] encodes all moments of a distribution. The Chernoff method, sub-Gaussian bounds, and exponential family theory all reduce to MGF conditions.
- [Monty Hall Problem](https://theorempath.com/topics/monty-hall-problem): Three doors, one car, two goats. You pick a door, the host reveals a goat behind another. Switching wins 2/3 of the time. Bayes theorem makes this precise.
- [P vs NP](https://theorempath.com/topics/p-vs-np): A central open problem in computer science: is every problem whose solution can be verified in polynomial time also solvable in polynomial time? Covers P, NP, NP-completeness, reductions, Cook-Levin theorem, and relevance for ML.
- [Peano Axioms](https://theorempath.com/topics/peano-axioms): The five axioms that define the natural numbers: zero exists, every number has a successor, successors are injective, zero is not a successor, and induction. All of arithmetic follows from these.
- [Positive Semidefinite Matrices](https://theorempath.com/topics/positive-semidefinite-matrices): PSD matrices: equivalent characterizations, Cholesky decomposition, Schur complement, and Loewner ordering. Covariance matrices are PSD. Hessians of convex functions are PSD. These facts connect linear algebra to optimization and statistics.
- [Radon-Nikodym and Conditional Expectation](https://theorempath.com/topics/radon-nikodym-and-conditional-expectation): The Radon-Nikodym theorem: what 'density' really means. Absolute continuity, the Radon-Nikodym derivative, conditional expectation as a projection, tower property, and why this undergirds likelihood ratios, importance sampling, and KL divergence.
- [Sequences and Series of Functions](https://theorempath.com/topics/sequences-and-series-of-functions): Pointwise vs uniform convergence of function sequences, the Weierstrass M-test, and why uniform convergence preserves continuity. The concept that makes learning theory work.
- [Sets, Functions, and Relations](https://theorempath.com/topics/sets-functions-and-relations): The language underneath all of mathematics: sets, Cartesian products, functions, injectivity, surjectivity, equivalence relations, and quotient sets.
- [Shrinkage Estimation and the James-Stein Estimator](https://theorempath.com/topics/shrinkage-estimation-james-stein): In three or more dimensions, the sample mean is inadmissible for estimating a multivariate normal mean. The James-Stein estimator shrinks toward zero and dominates the MLE in total MSE, a result that shocked the statistics world.
- [Singular Value Decomposition](https://theorempath.com/topics/singular-value-decomposition): The SVD A = U Sigma V^T: the most important matrix factorization in applied mathematics. Geometric interpretation, relationship to eigendecomposition, low-rank approximation via Eckart-Young, and applications from PCA to pseudoinverses.
- [Sorting Algorithms](https://theorempath.com/topics/sorting-algorithms): Comparison-based sorting lower bound, quicksort, mergesort, heapsort, and non-comparison sorts. The foundational algorithms behind efficient data processing and search.
- [Spectral Theory of Operators](https://theorempath.com/topics/spectral-theory-of-operators): Spectral theorem for compact self-adjoint operators on Hilbert spaces: every such operator has a countable orthonormal eigenbasis with real eigenvalues accumulating only at zero. This is the infinite-dimensional backbone of PCA, kernel methods, and neural tangent kernel theory.
- [Stein's Paradox](https://theorempath.com/topics/steins-paradox): In dimension d >= 3, the sample mean is inadmissible for estimating the mean of a multivariate normal under squared error loss. The James-Stein estimator dominates it by shrinking toward zero.
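The Monty Hall entry's 2/3 claim fits in a one-screen simulation (a seeded Python sketch; the host is assumed to always open a goat door the player did not pick):

```python
import random

def monty_hall_switch_wins(trials: int, seed: int = 42) -> float:
    """Fraction of trials in which the always-switch strategy wins the car."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)
        pick = rng.randrange(3)
        # Host opens a goat door that is neither the player's pick nor the car.
        host = next(d for d in range(3) if d != pick and d != car)
        # Player switches to the remaining unopened door.
        switched = next(d for d in range(3) if d != pick and d != host)
        wins += (switched == car)
    return wins / trials

print(monty_hall_switch_wins(100_000))  # close to 2/3
```

Switching wins exactly when the first pick missed the car, which happens with probability 2/3; the simulation just restates that argument.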
- [Sufficient Statistics and Exponential Families](https://theorempath.com/topics/sufficient-statistics-and-exponential-families): Sufficient statistics compress data without losing information about the parameter. The Neyman-Fisher factorization theorem, exponential families, completeness, and Rao-Blackwell improvement of estimators.
- [Taylor Expansion](https://theorempath.com/topics/taylor-expansion): Taylor approximation in one and many variables. Every optimization algorithm is a Taylor approximation: gradient descent uses first order, Newton's method uses second order.
- [Tensors and Tensor Operations](https://theorempath.com/topics/tensors-and-tensor-operations): What a tensor actually is: a multilinear map with specific transformation rules, how tensor contraction generalizes matrix multiplication, Einstein summation, tensor decompositions, and how ML frameworks use the word tensor to mean multidimensional array.
- [The Hessian Matrix](https://theorempath.com/topics/the-hessian-matrix): The matrix of second partial derivatives: encodes curvature, determines the nature of critical points, and is the central object in second-order optimization.
- [The Jacobian Matrix](https://theorempath.com/topics/the-jacobian-matrix): The matrix of all first partial derivatives of a vector-valued function: encodes the best linear approximation, connects to the chain rule in matrix form, and is the backbone of backpropagation.
- [Type Theory](https://theorempath.com/topics/type-theory): Types as propositions, terms as proofs. Simply typed lambda calculus, the Curry-Howard correspondence, dependent types, and connections to programming language foundations and formal verification.
- [Vectors, Matrices, and Linear Maps](https://theorempath.com/topics/vectors-matrices-and-linear-maps): Vector spaces, linear maps, matrix representation, rank, nullity, and the rank-nullity theorem. The algebraic backbone of ML.
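The Taylor Expansion entry's claim that gradient descent and Newton's method are first- and second-order Taylor approximations can be seen numerically: each added order shrinks the error near the expansion point (a Python sketch approximating e^x at x = 0.1):

```python
import math

x = 0.1
exact = math.exp(x)
first_order = 1 + x              # linear model, as in gradient descent
second_order = 1 + x + x**2 / 2  # quadratic model, as in Newton's method

print(abs(exact - first_order))   # ~5.2e-3
print(abs(exact - second_order))  # ~1.7e-4
```

The first-order error scales like x^2 and the second-order error like x^3, which is the whole content of the remainder term.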
- [Vieta Jumping](https://theorempath.com/topics/vieta-jumping): A competition number theory technique: given a Diophantine equation in two variables, fix one variable, treat the equation as a quadratic in the other, and use Vieta's formulas to jump to a new integer solution. Repeated jumping produces a contradiction or forces a known base case.
- [Zermelo-Fraenkel Set Theory](https://theorempath.com/topics/zermelo-fraenkel-set-theory): The ZFC axioms form the standard foundation for mathematics. Extensionality, pairing, union, power set, infinity, separation, replacement, choice, and foundation prevent paradoxes while being expressive enough for all of modern mathematics.

## Core topics (Tier 1)

- [Activation Functions](https://theorempath.com/topics/activation-functions): Nonlinear activation functions in neural networks: sigmoid, tanh, ReLU, Leaky ReLU, GELU, and SiLU. Their gradients, saturation behavior, and impact on trainability.
- [Adam Optimizer](https://theorempath.com/topics/adam-optimizer): Adaptive moment estimation: first and second moment tracking, bias correction, AdamW vs Adam+L2, learning rate warmup, and when Adam fails compared to SGD.
- [AIC and BIC](https://theorempath.com/topics/aic-and-bic): Akaike and Bayesian information criteria for model selection: how they trade off fit versus complexity, when to use each, and their connection to cross-validation.
- [Algorithmic Stability](https://theorempath.com/topics/algorithmic-stability): Algorithmic stability provides generalization bounds by analyzing how much a learning algorithm's output changes when a single training example is replaced: a structurally different lens from complexity-based approaches.
- [Attention Is All You Need (Paper)](https://theorempath.com/topics/attention-is-all-you-need-paper): The 2017 paper that introduced the transformer: self-attention replacing recurrence, multi-head attention, positional encoding, and what survived versus what changed in modern LLMs.
- [Automatic Differentiation](https://theorempath.com/topics/automatic-differentiation): Forward mode computes Jacobian-vector products, reverse mode computes vector-Jacobian products: backpropagation is reverse-mode autodiff, and the asymmetry between the two modes explains why training neural networks is efficient.
- [Bagging](https://theorempath.com/topics/bagging): Bootstrap Aggregation: train B models on bootstrap samples and average their predictions. Why averaging reduces variance, how correlation limits the gain, and out-of-bag error estimation.
- [Batch Normalization](https://theorempath.com/topics/batch-normalization): Normalize activations within mini-batches to stabilize and accelerate training, then scale and shift with learnable parameters.
- [Bellman Equations](https://theorempath.com/topics/bellman-equations): The recursive backbone of RL. State-value and action-value Bellman equations, the contraction mapping property, convergence of value iteration, and why recursive decomposition is the central idea in sequential decision-making.
- [The Bitter Lesson](https://theorempath.com/topics/bitter-lesson): Sutton's meta-principle: scalable general methods that exploit computation tend to beat hand-crafted domain-specific approaches in the long run. Search and learning win; brittle cleverness loses.
- [Bootstrap Methods](https://theorempath.com/topics/bootstrap-methods): The nonparametric bootstrap: resample with replacement to approximate sampling distributions, construct confidence intervals, and quantify uncertainty without distributional assumptions.
- [Bounded Rationality](https://theorempath.com/topics/bounded-rationality): Real agents optimize under limited information, limited compute, and limited foresight. Simon's satisficing, heuristics, and the implications for search, planning, and agent design in ML.
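The Bootstrap Methods entry reduces to three steps: resample with replacement, recompute the statistic, read quantiles off the resampled values. A minimal percentile-interval sketch in Python (data, seed, and function names are illustrative):

```python
import random

def bootstrap_ci(data, stat, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    reps = sorted(
        stat([rng.choice(data) for _ in range(len(data))])  # one resample
        for _ in range(n_boot)
    )
    lo = reps[int(n_boot * alpha / 2)]
    hi = reps[int(n_boot * (1 - alpha / 2))]
    return lo, hi

mean = lambda xs: sum(xs) / len(xs)
data = [2.1, 3.4, 2.9, 4.0, 3.3, 2.7, 3.8, 3.1]
print(bootstrap_ci(data, mean))  # 95% interval around the sample mean
```

No distributional assumption enters anywhere: the resampled statistics stand in for the unknown sampling distribution.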
- [Causal Inference and the Ladder of Causation](https://theorempath.com/topics/causal-inference-pearl): Pearl's three-level hierarchy: association, intervention, counterfactual. Structural causal models, do-calculus, the adjustment formula, and why prediction is not causation.
- [Chernoff Bounds](https://theorempath.com/topics/chernoff-bounds): The Chernoff method: the universal technique for deriving exponential tail bounds by optimizing over the moment generating function, yielding the tightest possible exponential concentration.
- [Concentration Inequalities](https://theorempath.com/topics/concentration-inequalities): Bounds on how far random variables deviate from their expectations: Markov, Chebyshev, Hoeffding, and Bernstein. Used throughout generalization theory, bandits, and sample complexity.
- [Conditioning and Condition Number](https://theorempath.com/topics/conditioning-and-condition-number): The condition number measures how sensitive a problem is to perturbations in its input. Ill-conditioned matrices turn small errors into catastrophic ones, and understanding conditioning is essential for any computation involving linear algebra.
- [Confusion Matrices and Classification Metrics](https://theorempath.com/topics/confusion-matrices-and-classification-metrics): The complete framework for evaluating binary classifiers: confusion matrix entries, precision, recall, F1, specificity, ROC curves, AUC, and precision-recall curves. When to use each metric.
- [Confusion Matrix Deep Dive](https://theorempath.com/topics/confusion-matrix-deep-dive): Beyond binary classification: multi-class confusion matrices, micro vs macro averaging, Matthews correlation coefficient, Cohen's kappa, cost-sensitive evaluation, and how to read a confusion matrix without getting the axes wrong.
- [Convex Optimization Basics](https://theorempath.com/topics/convex-optimization-basics): Convex sets, convex functions, gradient descent convergence, strong convexity, and duality: the optimization foundation that every learning-theoretic result silently depends on.
- [Cross-Entropy Loss Deep Dive](https://theorempath.com/topics/cross-entropy-loss-deep-dive): Why cross-entropy is the correct loss for classification: its derivation as negative log-likelihood, connection to KL divergence, why MSE fails for classification, and practical variants including label smoothing and focal loss.
- [Data Preprocessing and Feature Engineering](https://theorempath.com/topics/data-preprocessing-and-feature-engineering): Standardization, scaling, encoding, imputation, and feature selection. Why most algorithms assume centered, scaled inputs and what breaks when you skip preprocessing.
- [Dropout](https://theorempath.com/topics/dropout): Dropout as stochastic regularization: Bernoulli masking, inverted dropout scaling, implicit ensemble interpretation, Bayesian connection via MC dropout, and equivalence to L2 regularization in linear models.
- [The EM Algorithm](https://theorempath.com/topics/em-algorithm): Expectation-Maximization: the principled way to do maximum likelihood when some variables are unobserved. Derives the ELBO, proves monotonic convergence, and shows why EM is the backbone of latent variable models.
- [Empirical Risk Minimization](https://theorempath.com/topics/empirical-risk-minimization): The foundational principle of statistical learning: minimize average loss on training data as a proxy for minimizing true population risk.
- [Epsilon-Nets and Covering Numbers](https://theorempath.com/topics/epsilon-nets-and-covering-numbers): Discretizing infinite sets for concentration arguments: epsilon-nets, covering numbers, packing numbers, the Dudley integral, and the connection to Rademacher complexity.
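The core identity behind the Cross-Entropy Loss entry, that cross-entropy on a one-hot label is just the negative log-probability of the true class, fits in a few lines (a Python sketch with a max-shifted stable softmax; the logits are made up):

```python
import math

def cross_entropy(logits, true_class):
    """-log softmax(logits)[true_class], computed stably in log space."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[true_class]  # = -log p(true_class)

logits = [2.0, 0.5, -1.0]
print(cross_entropy(logits, 0))  # small: the model already favors class 0
print(cross_entropy(logits, 2))  # large: class 2 gets low probability
```

Note the loss never exponentiates a large logit directly, which is the same numerical concern the Log-Probability Computation entry covers.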
- [The Era of Experience](https://theorempath.com/topics/era-of-experience): Sutton and Silver's thesis: the next phase of AI moves beyond imitation from human data toward agents that learn predominantly from their own experience. Text is not enough for general intelligence.
- [Fat Tails and Heavy-Tailed Distributions](https://theorempath.com/topics/fat-tails): When the tails dominate. Power laws, Pareto distributions, subexponential tails, why the law of large numbers converges slowly or fails, and why most of ML silently assumes thin tails.
- [Feedforward Networks and Backpropagation](https://theorempath.com/topics/feedforward-networks-and-backpropagation): Feedforward neural networks as compositions of affine transforms and nonlinearities, the universal approximation theorem, and backpropagation as reverse-mode automatic differentiation on the computational graph.
- [Fine-Tuning and Adaptation](https://theorempath.com/topics/fine-tuning-and-adaptation): Methods for adapting pretrained models to new tasks: full fine-tuning, feature extraction, LoRA, QLoRA, adapter layers, and prompt tuning. When to use each, and how catastrophic forgetting constrains adaptation.
- [Game Theory Foundations](https://theorempath.com/topics/game-theory): Strategic interaction between rational agents. Normal-form games, dominant strategies, Nash equilibrium existence, mixed strategies, and connections to minimax, mechanism design, and multi-agent RL.
- [Gauss-Markov Theorem](https://theorempath.com/topics/gauss-markov-theorem): Among all linear unbiased estimators, ordinary least squares has the smallest variance: the BLUE theorem. Understanding when its assumptions fail is just as important as the result itself.
- [Gibbs Sampling](https://theorempath.com/topics/gibbs-sampling): The workhorse MCMC algorithm for Bayesian models: sample each variable from its full conditional distribution, cycling through all variables, and every proposal is automatically accepted.
- [Gradient Boosting](https://theorempath.com/topics/gradient-boosting): Gradient boosting as functional gradient descent: fit weak learners to pseudo-residuals sequentially, reducing bias at each round. Covers AdaBoost, shrinkage, XGBoost second-order methods, and LightGBM leaf-wise growth.
- [Gradient Descent Variants](https://theorempath.com/topics/gradient-descent-variants): From full-batch to stochastic to mini-batch gradient descent, plus momentum, Nesterov acceleration, AdaGrad, RMSProp, and Adam. Why mini-batch SGD with momentum is the practical default.
- [Gradient Flow and Vanishing Gradients](https://theorempath.com/topics/gradient-flow-and-vanishing-gradients): Why deep networks are hard to train: gradients shrink or explode as they propagate through layers. The Jacobian product chain, sigmoid saturation, ReLU dead neurons, skip connections, normalization, and gradient clipping.
- [Gram Matrices and Kernel Matrices](https://theorempath.com/topics/gram-matrices-and-kernel-matrices): The Gram matrix G_{ij} = ⟨x_i, x_j⟩ encodes pairwise inner products of a dataset. Always PSD. Appears in kernel methods, PCA, SVD, and attention. Understanding it connects linear algebra to ML.
- [Hallucination Theory](https://theorempath.com/topics/hallucination-theory): Why large language models confabulate, the mathematical frameworks for understanding when model outputs are unreliable, and what current theory says about mitigation.
- [High-Dimensional Probability (Vershynin)](https://theorempath.com/topics/high-dimensional-probability-book): Reading guide for Vershynin's textbook on sub-Gaussian and sub-exponential random variables, concentration inequalities, random matrices, covering numbers, and high-dimensional geometry. The modern reference for probabilistic tools in ML theory.
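The "always PSD" claim in the Gram matrix entry follows from z^T G z = ||sum_i z_i x_i||^2 >= 0, which a few lines can verify numerically (a pure-Python sketch on a made-up dataset; function names are mine):

```python
def gram_matrix(X):
    """G[i][j] = <x_i, x_j> for the rows x_i of X."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    return [[dot(xi, xj) for xj in X] for xi in X]

def quad_form(G, z):
    """z^T G z, which equals ||sum_i z_i x_i||^2 and so is never negative."""
    n = len(z)
    return sum(z[i] * G[i][j] * z[j] for i in range(n) for j in range(n))

X = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
G = gram_matrix(X)

print(G[0][1] == G[1][0])  # symmetric by symmetry of the inner product
for z in ([1, -2, 1], [0.5, 0.5, -1], [-3, 1, 2]):
    print(quad_form(G, z) >= 0)  # True for every z
```

Kernel matrices inherit the same property because a kernel is an inner product in feature space.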
- [Hypothesis Classes and Function Spaces](https://theorempath.com/topics/hypothesis-classes-and-function-spaces): What is a hypothesis class, why the choice of hypothesis class determines what ERM can learn, and the approximation-estimation tradeoff: bigger classes reduce approximation error but increase estimation error.
- [Implicit Bias and Modern Generalization](https://theorempath.com/topics/implicit-bias-and-modern-generalization): Why classical generalization theory breaks for overparameterized models: the random labels experiment, the interpolation threshold, implicit bias of gradient descent, double descent, and the frontier of understanding why deep learning works.
- [Importance Sampling](https://theorempath.com/topics/importance-sampling): Estimate expectations under one distribution by sampling from another and reweighting: a technique that is powerful when done right and catastrophically unreliable when done wrong.
- [Information Retrieval Foundations](https://theorempath.com/topics/information-retrieval): Search as a first-class capability. TF-IDF, BM25, inverted indexes, precision/recall, reranking, and why retrieval is not just vector DB plus embeddings.
- [K-Means Clustering](https://theorempath.com/topics/k-means-clustering): Lloyd's algorithm for partitional clustering: the within-cluster sum of squares objective, convergence guarantees, k-means++ initialization, choosing k, and the connection to EM for Gaussians.
- [Kalman Filter](https://theorempath.com/topics/kalman-filter): Optimal state estimation for linear Gaussian systems via recursive prediction and update steps using the Kalman gain.
- [KL Divergence](https://theorempath.com/topics/kl-divergence): Kullback-Leibler divergence measures how one probability distribution differs from another. Asymmetric, always non-negative, and central to variational inference, MLE, and RLHF.
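A minimal sketch of the KL divergence entry above (the distributions p and q are made-up examples, and the helper name is an assumption), showing the two properties the entry states: non-negativity and asymmetry.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p_i = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

forward = kl_divergence(p, q)  # KL(P || Q)
reverse = kl_divergence(q, p)  # KL(Q || P): generally a different number
```

KL(P || P) is exactly zero, both directions are non-negative, and forward and reverse values differ, which is the asymmetry the entry highlights.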
- [Lasso Regression](https://theorempath.com/topics/lasso-regression): OLS with L1 regularization: sparsity, the geometry of why L1 selects variables, proximal gradient descent, LARS, and elastic net.
- [Learning Rate Scheduling](https://theorempath.com/topics/learning-rate-scheduling): Why learning rate moves the loss more than any other hyperparameter, and how warmup, cosine decay, step decay, cyclic schedules, and 1cycle policy control training dynamics.
- [Linear Regression](https://theorempath.com/topics/linear-regression): Ordinary least squares as projection, the normal equations, the hat matrix, Gauss-Markov optimality, and the connection to maximum likelihood under Gaussian noise.
- [Log-Probability Computation](https://theorempath.com/topics/log-probability-computation): Working in log space prevents underflow when multiplying many small probabilities. The log-sum-exp trick provides a numerically stable way to compute log(sum(exp(x_i))), and it underlies stable softmax, log-likelihoods, and the forward algorithm for HMMs.
- [Logistic Regression](https://theorempath.com/topics/logistic-regression): The foundational linear classifier: sigmoid link function, maximum likelihood estimation, cross-entropy loss, gradient derivation, and regularized variants.
- [Loss Functions Catalog](https://theorempath.com/topics/loss-functions-catalog): A systematic catalog of loss functions for classification, regression, and representation learning: cross-entropy, MSE, hinge, Huber, focal, KL divergence, and contrastive loss.
- [Markov Decision Processes](https://theorempath.com/topics/markov-decision-processes): The mathematical framework for sequential decision-making under uncertainty: states, actions, transitions, rewards, and the Bellman equations that make solving them possible.
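The log-sum-exp trick from the Log-Probability Computation entry above can be sketched as follows (the helper name and the test logits are illustrative assumptions): shift by the maximum so every exponent is at most zero, then add the shift back.

```python
import numpy as np

def log_sum_exp(x):
    """Numerically stable log(sum(exp(x_i))).

    Naively, exp(1000) overflows to inf. Subtracting m = max(x) keeps every
    exponent <= 0, and the identity LSE(x) = m + LSE(x - m) restores the value.
    """
    x = np.asarray(x, float)
    m = np.max(x)
    return float(m + np.log(np.sum(np.exp(x - m))))

with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(np.array([1000.0, 1000.0]))))  # inf: overflow
stable = log_sum_exp([1000.0, 1000.0])  # exact value is 1000 + log(2)
```

The naive evaluation overflows to infinity while the shifted version returns the exact finite answer.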
- [Matrix Calculus](https://theorempath.com/topics/matrix-calculus): The differentiation identities you actually use in ML: derivatives of traces, log-determinants, and quadratic forms with respect to matrices and vectors.
- [Matrix Concentration](https://theorempath.com/topics/matrix-concentration): Matrix Bernstein, Matrix Hoeffding, Weyl's inequality, and Davis-Kahan: the operator-norm concentration tools needed for covariance estimation, dimensionality reduction, and spectral analysis in high dimensions.
- [McDiarmid's Inequality](https://theorempath.com/topics/mcdiarmids-inequality): The bounded-differences inequality: if changing one input to a function changes the output by at most c_i, the function concentrates around its mean with sub-Gaussian tails.
- [Metropolis-Hastings Algorithm](https://theorempath.com/topics/metropolis-hastings): The foundational MCMC algorithm: construct a Markov chain whose stationary distribution is your target by accepting or rejecting proposed moves according to a carefully chosen ratio.
- [Model Evaluation Best Practices](https://theorempath.com/topics/model-evaluation-best-practices): Train/validation/test splits, cross-validation, stratified and temporal splits, data leakage, reporting with confidence intervals, and statistical tests for model comparison. Why single-number comparisons are misleading.
- [Newton's Method](https://theorempath.com/topics/newtons-method): The gold standard for fast local convergence: use second-order information (the Hessian) to take optimal quadratic steps. Quadratic convergence when it works, catastrophic failure when it doesn't.
- [Numerical Stability and Conditioning](https://theorempath.com/topics/numerical-stability): Continuous math becomes real only through finite-precision approximation. Condition numbers, backward stability, catastrophic cancellation, and why theorems about reals do not transfer cleanly to floating-point.
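A minimal random-walk Metropolis sketch for the Metropolis-Hastings entry above (the target, step size, and sample count are assumptions chosen for illustration): with a symmetric Gaussian proposal the Hastings correction cancels, so the acceptance ratio reduces to the ratio of target densities.

```python
import numpy as np

def metropolis_hastings(log_target, n_samples=20000, step=1.0, seed=0):
    """Random-walk Metropolis sampler.

    The proposal x' = x + step * N(0, 1) is symmetric, so the move is accepted
    with probability min(1, pi(x') / pi(x)), computed in log space.
    """
    rng = np.random.default_rng(seed)
    x = 0.0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        proposal = x + step * rng.normal()
        if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
            x = proposal  # accept; otherwise stay at x
        samples[t] = x
    return samples

# Target: standard normal, known only up to a constant via its log density.
samples = metropolis_hastings(lambda z: -0.5 * z**2)
```

The chain's empirical mean and standard deviation should approximate the target's 0 and 1, a standard sanity check for a correctly tuned sampler.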
- [Optimizer Theory: SGD, Adam, and Muon](https://theorempath.com/topics/optimizer-theory-sgd-adam-muon): Convergence theory of SGD (convex and strongly convex), momentum methods (Polyak and Nesterov), Adam as adaptive + momentum, why SGD can generalize better, the Muon optimizer, and learning rate schedules.
- [Overfitting and Underfitting](https://theorempath.com/topics/overfitting-and-underfitting): The two failure modes of supervised learning: models that memorize noise versus models too simple to capture signal. Diagnosis via training-validation gaps.
- [PAC Learning Framework](https://theorempath.com/topics/pac-learning-framework): The foundational formalization of what it means to learn from data: a concept is PAC-learnable if an algorithm can, with high probability, find a hypothesis that is approximately correct, using a polynomial number of samples.
- [Policy Gradient Theorem](https://theorempath.com/topics/policy-gradient-theorem): The fundamental result enabling gradient-based optimization of parameterized policies: the policy gradient theorem and the algorithms it spawns.
- [Principal Component Analysis](https://theorempath.com/topics/principal-component-analysis): Dimensionality reduction via variance maximization: PCA as eigendecomposition of the covariance matrix, PCA as truncated SVD of the centered data matrix, reconstruction error, and when sample PCA works.
- [Proximal Gradient Methods](https://theorempath.com/topics/proximal-gradient-methods): Optimize composite objectives by alternating gradient steps on smooth terms with proximal operators on nonsmooth terms. ISTA and its accelerated variant FISTA.
- [Quasi-Newton Methods](https://theorempath.com/topics/quasi-newton-methods): Approximate the Hessian instead of computing it: BFGS builds a dense approximation, L-BFGS stores only a few vectors. Superlinear convergence without second derivatives.
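The PCA-as-truncated-SVD view from the Principal Component Analysis entry above, as a short numpy sketch (the synthetic data, with one dominant direction, is an assumption chosen so the result is easy to check):

```python
import numpy as np

def pca(X, k):
    """PCA via truncated SVD of the centered data matrix.

    Rows of X are observations. Returns the top-k principal directions
    (right singular vectors) and the per-component explained variance.
    """
    Xc = X - X.mean(axis=0)  # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    explained_variance = S[:k] ** 2 / (len(X) - 1)
    return components, explained_variance

rng = np.random.default_rng(0)
# 200 points stretched along coordinate 0, so one component dominates.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])
components, var = pca(X, k=2)
```

The leading component should align (up to sign) with the stretched axis, and its explained variance should dominate the second component's.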
- [Rademacher Complexity](https://theorempath.com/topics/rademacher-complexity): A data-dependent measure of hypothesis class complexity that gives tighter generalization bounds than VC dimension by measuring how well the class can fit random noise on the actual data.
- [Random Forests](https://theorempath.com/topics/random-forests): Random forests combine bagging with random feature subsampling to decorrelate trees, reducing ensemble variance beyond what pure bagging achieves. Out-of-bag estimation, variable importance, consistency theory, and practical strengths and weaknesses.
- [Regularization in Practice](https://theorempath.com/topics/regularization-in-practice): Practical guide to regularization: L2 (weight decay), L1 (sparsity), dropout, early stopping, data augmentation, and batch normalization. How each constrains model complexity and how to choose between them.
- [Reinforcement Learning from Human Feedback: Deep Dive](https://theorempath.com/topics/reinforcement-learning-from-human-feedback-deep-dive): The full RLHF pipeline: supervised fine-tuning, Bradley-Terry reward modeling, PPO with KL penalty, reward hacking via Goodhart, and the post-RLHF landscape of DPO, GRPO, and RLVR.
- [Reward Design and Reward Misspecification](https://theorempath.com/topics/reward-design): The hardest problem in RL: specifying what you want. Reward shaping, potential-based shaping theorem, specification gaming, Goodhart's law in RL, and the bridge from classic RL to alignment.
- [Ridge Regression](https://theorempath.com/topics/ridge-regression): OLS with L2 regularization: closed-form shrinkage, bias-variance tradeoff, SVD interpretation, and the Bayesian connection to Gaussian priors.
- [Sample Complexity Bounds](https://theorempath.com/topics/sample-complexity-bounds): How many samples do you need to learn? Tight answers for finite hypothesis classes, VC classes, and Rademacher-bounded classes, plus matching lower bounds via Fano and Le Cam.
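The closed-form shrinkage from the Ridge Regression entry above, sketched in numpy (the synthetic problem and lambda values are assumptions): setting lam = 0 recovers ordinary least squares, and a larger lam shrinks the coefficient vector.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

w_ols = ridge_fit(X, y, lam=0.0)     # lam = 0: ordinary least squares
w_ridge = ridge_fit(X, y, lam=10.0)  # larger lam shrinks toward zero
```

With low noise and n much larger than d, the OLS solution sits near the true weights, while the ridge solution has strictly smaller norm, the shrinkage the entry describes.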
- [Skewness, Kurtosis, and Higher Moments](https://theorempath.com/topics/skewness-kurtosis-and-higher-moments): Distribution shape beyond mean and variance: skewness measures tail asymmetry, kurtosis measures tail extremeness, cumulants are the cleaner language, and heavy-tailed distributions break all of these.
- [Skip Connections and ResNets](https://theorempath.com/topics/skip-connections-and-resnets): Residual connections let gradients flow through identity paths, enabling training of very deep networks. ResNets learn residual functions F(x) = H(x) - x, which is easier than learning H(x) directly.
- [Softmax and Numerical Stability](https://theorempath.com/topics/softmax-and-numerical-stability): The softmax function maps arbitrary reals to a probability distribution. Getting it right numerically (avoiding overflow and underflow) is the first lesson in writing ML code that actually works.
- [Stochastic Gradient Descent Convergence](https://theorempath.com/topics/stochastic-gradient-descent-convergence): SGD convergence rates for convex and strongly convex functions, the role of noise as both curse and blessing, mini-batch variance reduction, learning rate schedules, and the Robbins-Monro conditions.
- [Sub-Exponential Random Variables](https://theorempath.com/topics/subexponential-random-variables): The distributional class between sub-Gaussian and heavy-tailed: heavier tails than Gaussian, the psi_1 norm, Bernstein condition, and the two-regime concentration bound.
- [Sub-Gaussian Random Variables](https://theorempath.com/topics/subgaussian-random-variables): Sub-Gaussian random variables: the precise characterization of 'light-tailed' behavior that underpins every concentration inequality in learning theory.
- [Support Vector Machines](https://theorempath.com/topics/support-vector-machines): Maximum-margin classifiers via convex optimization: hard margin, soft margin with slack variables, hinge loss, the dual formulation, and the kernel trick.
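The overflow-safe softmax from the Softmax and Numerical Stability entry above, as a minimal sketch (the input logits are illustrative): subtracting the maximum cancels in the ratio, so the result is mathematically identical but never overflows.

```python
import numpy as np

def softmax(x):
    """Stable softmax: shift so the largest exponent is exp(0) = 1.

    Equals exp(x_i) / sum_j exp(x_j) exactly, because the common factor
    exp(-max(x)) cancels between numerator and denominator.
    """
    x = np.asarray(x, float)
    z = np.exp(x - np.max(x))
    return z / z.sum()

probs = softmax([1000.0, 1001.0, 1002.0])  # naive exp(1000) would overflow
```

The output is a finite, properly normalized distribution that preserves the ordering of the logits.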
- [Symmetrization Inequality](https://theorempath.com/topics/symmetrization-inequality): The symmetrization technique: the proof template that connects the generalization gap to Rademacher complexity by introducing a ghost sample and random signs.
- [Train-Test Split and Data Leakage](https://theorempath.com/topics/train-test-split-and-data-leakage): Why you need three data splits, how to construct them correctly, and the common ways information leaks from test data into training. Temporal splits, stratified splits, and leakage detection.
- [Types of Bias in Statistics](https://theorempath.com/topics/types-of-bias-in-statistics): A taxonomy of the major biases that corrupt statistical inference: selection bias, measurement bias, survivorship bias, response bias, attrition bias, publication bias, and their consequences for ML.
- [Understanding Machine Learning (Shalev-Shwartz, Ben-David)](https://theorempath.com/topics/understanding-machine-learning-book): Reading guide for the definitive learning theory textbook. Covers PAC learning, VC dimension, Rademacher complexity, uniform convergence, stability, online learning, boosting, and regularization with rigorous proofs.
- [Uniform Convergence](https://theorempath.com/topics/uniform-convergence): Uniform convergence of empirical risk to population risk over an entire hypothesis class: the key property that makes ERM provably work.
- [Universal Approximation Theorem](https://theorempath.com/topics/universal-approximation-theorem): A single hidden layer neural network can approximate any continuous function on a compact set to arbitrary accuracy. Why this is both important and misleading: it says nothing about width, weight-finding, or generalization.
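A minimal sketch of the temporal split described in the Train-Test Split and Data Leakage entry above (the split fractions and the helper name are assumptions): every training timestamp precedes every validation and test timestamp, so no future information leaks backward.

```python
import numpy as np

def temporal_split(X, y, train_frac=0.6, val_frac=0.2):
    """Split time-ordered data into train/validation/test without shuffling.

    Shuffling a time series leaks future information into training; a
    temporal split keeps the three blocks in chronological order.
    """
    n = len(X)
    i = int(n * train_frac)
    j = int(n * (train_frac + val_frac))
    return (X[:i], y[:i]), (X[i:j], y[i:j]), (X[j:], y[j:])

t = np.arange(100)  # timestamps double as features in this toy example
(train, _), (val, _), (test, _) = temporal_split(t, t)
```

The three blocks have the requested sizes and are strictly ordered in time.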
- [Value Iteration and Policy Iteration](https://theorempath.com/topics/value-iteration-and-policy-iteration): The two foundational algorithms for solving MDPs exactly: value iteration applies the Bellman optimality operator until convergence, while policy iteration alternates between exact evaluation and greedy improvement.
- [Variational Autoencoders](https://theorempath.com/topics/variational-autoencoders): Deriving the ELBO, the reparameterization trick for backpropagation through sampling, and how VAEs turn autoencoders into principled generative models via amortized variational inference.
- [VC Dimension](https://theorempath.com/topics/vc-dimension): The Vapnik-Chervonenkis dimension: a combinatorial measure of hypothesis class complexity that characterizes learnability in binary classification.
- [Weight Initialization](https://theorempath.com/topics/weight-initialization): Why initialization determines whether gradients vanish or explode, and how Xavier and He initialization preserve variance across layers.

## Featured theorem pages

- [Algorithmic Stability](https://theorempath.com/topics/algorithmic-stability): Algorithmic stability provides generalization bounds by analyzing how much a learning algorithm's output changes when a single training example is replaced: a structurally different lens from complexity-based approaches.
- [Attention Is All You Need (Paper)](https://theorempath.com/topics/attention-is-all-you-need-paper): The 2017 paper that introduced the transformer: self-attention replacing recurrence, multi-head attention, positional encoding, and what survived versus what changed in modern LLMs.
- [Causal Inference and the Ladder of Causation](https://theorempath.com/topics/causal-inference-pearl): Pearl's three-level hierarchy: association, intervention, counterfactual. Structural causal models, do-calculus, the adjustment formula, and why prediction is not causation.
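The Bellman-operator iteration from the Value Iteration and Policy Iteration entry can be sketched on a made-up two-state MDP (the transition and reward tensors below are assumptions chosen so the exact answer is easy to verify by hand):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration: iterate the Bellman optimality operator to a fixed point.

    P[a] is the state-transition matrix under action a, R[a][s] the reward for
    taking a in s. The operator is a gamma-contraction, so iteration converges.
    """
    n_actions = len(P)
    V = np.zeros(P[0].shape[0])
    while True:
        Q = np.array([R[a] + gamma * P[a] @ V for a in range(n_actions)])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)  # optimal values and greedy policy
        V = V_new

# Toy MDP: action 1 always moves to state 1 and pays 1; action 0 pays nothing.
P = np.array([[[1.0, 0.0], [1.0, 0.0]],   # action 0: always go to state 0
              [[0.0, 1.0], [0.0, 1.0]]])  # action 1: always go to state 1
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])
V, policy = value_iteration(P, R)
```

Always taking action 1 earns reward 1 per step, so the optimal value from either state is 1 / (1 - 0.9) = 10, and the greedy policy picks action 1 everywhere.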
- [The Era of Experience](https://theorempath.com/topics/era-of-experience): Sutton and Silver's thesis: the next phase of AI moves beyond imitation from human data toward agents that learn predominantly from their own experience. Text is not enough for general intelligence.
- [Fat Tails and Heavy-Tailed Distributions](https://theorempath.com/topics/fat-tails): When the tails dominate. Power laws, Pareto distributions, subexponential tails, why the law of large numbers converges slowly or fails, and why most of ML silently assumes thin tails.
- [Gradient Boosting](https://theorempath.com/topics/gradient-boosting): Gradient boosting as functional gradient descent: fit weak learners to pseudo-residuals sequentially, reducing bias at each round. Covers AdaBoost, shrinkage, XGBoost second-order methods, and LightGBM leaf-wise growth.
- [Hallucination Theory](https://theorempath.com/topics/hallucination-theory): Why large language models confabulate, the mathematical frameworks for understanding when model outputs are unreliable, and what current theory says about mitigation.
- [Implicit Bias and Modern Generalization](https://theorempath.com/topics/implicit-bias-and-modern-generalization): Why classical generalization theory breaks for overparameterized models: the random labels experiment, the interpolation threshold, implicit bias of gradient descent, double descent, and the frontier of understanding why deep learning works.
- [Matrix Concentration](https://theorempath.com/topics/matrix-concentration): Matrix Bernstein, Matrix Hoeffding, Weyl's inequality, and Davis-Kahan: the operator-norm concentration tools needed for covariance estimation, dimensionality reduction, and spectral analysis in high dimensions.
- [McDiarmid's Inequality](https://theorempath.com/topics/mcdiarmids-inequality): The bounded-differences inequality: if changing one input to a function changes the output by at most c_i, the function concentrates around its mean with sub-Gaussian tails.
- [Optimizer Theory: SGD, Adam, and Muon](https://theorempath.com/topics/optimizer-theory-sgd-adam-muon): Convergence theory of SGD (convex and strongly convex), momentum methods (Polyak and Nesterov), Adam as adaptive + momentum, why SGD can generalize better, the Muon optimizer, and learning rate schedules. - [Policy Gradient Theorem](https://theorempath.com/topics/policy-gradient-theorem): The fundamental result enabling gradient-based optimization of parameterized policies: the policy gradient theorem and the algorithms it spawns. - [Rademacher Complexity](https://theorempath.com/topics/rademacher-complexity): A data-dependent measure of hypothesis class complexity that gives tighter generalization bounds than VC dimension by measuring how well the class can fit random noise on the actual data. - [Reinforcement Learning from Human Feedback: Deep Dive](https://theorempath.com/topics/reinforcement-learning-from-human-feedback-deep-dive): The full RLHF pipeline: supervised fine-tuning, Bradley-Terry reward modeling, PPO with KL penalty, reward hacking via Goodhart, and the post-RLHF landscape of DPO, GRPO, and RLVR. - [Reward Design and Reward Misspecification](https://theorempath.com/topics/reward-design): The hardest problem in RL: specifying what you want. Reward shaping, potential-based shaping theorem, specification gaming, Goodhart's law in RL, and the bridge from classic RL to alignment. - [Symmetrization Inequality](https://theorempath.com/topics/symmetrization-inequality): The symmetrization technique: the proof template that connects the generalization gap to Rademacher complexity by introducing a ghost sample and random signs. - [Variational Autoencoders](https://theorempath.com/topics/variational-autoencoders): Deriving the ELBO, the reparameterization trick for backpropagation through sampling, and how VAEs turn autoencoders into principled generative models via amortized variational inference. ## Side-by-side comparisons - [Adam vs. 
SGD](https://theorempath.com/compare/adam-vs-sgd): Adam adapts the learning rate per parameter using first and second moment estimates for fast early convergence; SGD with momentum uses a single global learning rate and often finds flatter minima that generalize better. The choice depends on your priorities: speed to convergence or final model quality. - [AdamW vs. Adam](https://theorempath.com/compare/adamw-vs-adam): Adam applies L2 regularization inside the gradient, where the adaptive scaling distorts the penalty. AdamW decouples weight decay from the adaptive step, applying it directly to parameters. This distinction matters: every modern transformer uses AdamW, not Adam with L2. - [Autoregressive Models vs. Diffusion Models](https://theorempath.com/compare/autoregressive-vs-diffusion): Autoregressive models generate tokens sequentially via next-token prediction; diffusion models generate data by iteratively denoising from Gaussian noise. Sequential discrete generation vs. parallel continuous denoising: why LLMs dominate text and diffusion dominates images. - [Autoregressive Models vs. JEPA](https://theorempath.com/compare/autoregressive-vs-jepa): Two competing paradigms for learning world models: autoregressive models predict raw tokens or pixels sequentially, while JEPA predicts abstract representations in a learned latent space without generating observable outputs. - [Azuma-Hoeffding vs. Freedman Inequality](https://theorempath.com/compare/azuma-hoeffding-vs-freedman): Azuma-Hoeffding uses only bounded increments and gives sub-Gaussian tails. Freedman incorporates the predictable quadratic variation and is tighter when the variance is small. Freedman interpolates between sub-Gaussian and sub-exponential behavior. - [CNN vs. ViT vs. Swin Transformer](https://theorempath.com/compare/cnn-vs-vit-vs-swin): CNNs bake in local inductive bias and translation equivariance. ViT applies global self-attention to image patches but needs large datasets. 
Swin Transformer uses hierarchical shifted windows to get the best of both: local efficiency with global reasoning. - [Contrastive Loss vs. Triplet Loss](https://theorempath.com/compare/contrastive-vs-triplet): Contrastive loss (InfoNCE) pushes apart a query from N-1 negatives simultaneously using a softmax over similarities. Triplet loss pushes apart one anchor-negative pair relative to one anchor-positive pair with a fixed margin. InfoNCE scales better with batch size and dominates modern self-supervised learning. Triplet loss is simpler but requires careful hard negative mining to train effectively. - [Covering Numbers vs. Packing Numbers](https://theorempath.com/compare/covering-vs-packing-numbers): Covering numbers count the minimum eps-net size. Packing numbers count the maximum number of eps-separated points. They are within a factor of 2, but each appears naturally in different proof contexts. - [Cramér-Rao Bound vs. Minimax Lower Bounds](https://theorempath.com/compare/cramer-rao-vs-minimax): Two frameworks for bounding estimation difficulty: Cramér-Rao gives a local lower bound for unbiased estimators at a single parameter value, while minimax lower bounds apply to all estimators over an entire parameter class. - [Cross-Entropy vs. MSE Loss](https://theorempath.com/compare/cross-entropy-vs-mse): Cross-entropy is the natural loss for classification because it equals the negative log-likelihood of a Bernoulli or categorical model, produces strong gradients even when the model is confidently wrong, and decomposes as entropy plus KL divergence. MSE is the natural loss for regression, corresponding to Gaussian likelihood, but causes gradient saturation when paired with sigmoid or softmax outputs. - [Dense Transformers vs. Mixture-of-Experts](https://theorempath.com/compare/dense-vs-mixture-of-experts): Dense transformers activate all parameters for every token, giving simple training but high compute per token. 
Mixture-of-experts routes each token to k of N experts, achieving higher total capacity with lower per-token compute, at the cost of routing complexity and load balancing challenges. - [Diffusion Models vs. GANs vs. VAEs](https://theorempath.com/compare/diffusion-vs-gans-vs-vaes): Three generative model families compared: GANs use adversarial training for sharp samples but suffer mode collapse, VAEs optimize ELBO for smooth latent spaces but produce blurry outputs, and diffusion models iteratively denoise for high quality at the cost of slow sampling. - [Dropout vs. Batch Normalization](https://theorempath.com/compare/dropout-vs-batch-norm): Dropout regularizes by stochastic masking of activations, approximating an ensemble of exponentially many subnetworks. Batch normalization normalizes activations to stabilize training, with an incidental regularization effect from mini-batch noise. Both reduce overfitting, but through completely different mechanisms, and they interact poorly because dropout shifts the statistics that batch norm estimates. - [Early Stopping vs. Weight Decay](https://theorempath.com/compare/early-stopping-vs-weight-decay): Early stopping halts training when validation loss increases, limiting effective model capacity by restricting optimization time. Weight decay adds an explicit penalty on weight magnitude to the loss function. For linear models, early stopping with gradient descent is equivalent to L2 regularization. In deep networks, they control capacity through different mechanisms and are typically used together. - [Encoder-Only vs. Decoder-Only vs. Encoder-Decoder](https://theorempath.com/compare/encoder-only-vs-decoder-only-vs-encoder-decoder): Encoder-only models (BERT) use bidirectional attention for classification and extraction. Decoder-only models (GPT) use causal masking for autoregressive generation. Encoder-decoder models (T5) use cross-attention to condition generation on a fully encoded input. 
The architecture choice determines what tasks the model can perform natively. - [Fano's Method vs. Le Cam's Method](https://theorempath.com/compare/fano-vs-le-cam): Two techniques for proving minimax lower bounds: Fano reduces to many-hypothesis testing via mutual information, Le Cam reduces to binary hypothesis testing via total variation distance. - [FlashAttention vs. Vanilla Attention](https://theorempath.com/compare/flash-attention-vs-vanilla-attention): FlashAttention and vanilla attention compute the exact same output. The difference is entirely in IO complexity: vanilla materializes the full n x n attention matrix in GPU HBM, while FlashAttention tiles the computation into SRAM blocks using an online softmax trick, reducing memory from O(n^2) to O(n) and achieving 2-4x wall-clock speedup. - [Focal Loss vs. Cross-Entropy Loss](https://theorempath.com/compare/focal-vs-cross-entropy): Cross-entropy loss treats all examples equally, weighting each by its negative log-probability. Focal loss multiplies the cross-entropy by a modulating factor that downweights well-classified (easy) examples, focusing training on hard examples. Focal loss is a strict generalization of cross-entropy (setting the focusing parameter to zero recovers cross-entropy). It is most effective for severe class imbalance where easy negatives dominate the gradient. - [Frequentist vs. Bayesian Inference](https://theorempath.com/compare/frequentist-vs-bayesian): Two foundational philosophies of statistical inference: frequentists treat parameters as fixed unknowns and data as random, Bayesians treat parameters as random variables with prior distributions and compute posteriors. - [Gradient Clipping vs. Weight Decay](https://theorempath.com/compare/gradient-clipping-vs-weight-decay): Gradient clipping limits the magnitude of gradients during backpropagation, preventing training instability from exploding gradients. 
Weight decay limits the magnitude of weights, preventing overfitting by penalizing large parameters. They address different problems: clipping is about training stability, weight decay is about generalization. Modern LLM training uses both. - [Hoeffding vs. Bernstein Inequality](https://theorempath.com/compare/hoeffding-vs-bernstein): When to use range-only bounds vs. variance-aware bounds: Bernstein is tighter when variance is small, Hoeffding is simpler and sufficient when it is not. - [Kaplan vs. Chinchilla Scaling](https://theorempath.com/compare/kaplan-vs-chinchilla-scaling): Kaplan (2020) said scale up parameters faster than data. Chinchilla (2022) showed the opposite: many large models were undertrained. The disagreement came from a methodological flaw in how Kaplan fitted the scaling exponents. - [Kernel Methods vs. Feature Learning](https://theorempath.com/compare/kernel-methods-vs-feature-learning): Kernel methods fix a feature map and learn weights. Feature learning methods learn the features themselves. The NTK regime is kernel-like; the rich regime learns features. When each approach suffices and when it does not. - [KL Divergence vs. Cross-Entropy](https://theorempath.com/compare/kl-divergence-vs-cross-entropy): Cross-entropy and KL divergence are related by a constant: H(P,Q) = H(P) + KL(P||Q). When the true distribution P is fixed (as in supervised classification), minimizing cross-entropy is equivalent to minimizing KL divergence. They differ in symmetry, interpretation, and usage context. - [L1 vs. L2 Regularization](https://theorempath.com/compare/l1-vs-l2): L1 (Lasso) penalizes the absolute value of weights, producing sparse solutions via the diamond geometry of the L1 ball. L2 (Ridge) penalizes squared weights, shrinking all coefficients toward zero without eliminating any. The choice depends on whether the true model is sparse or dense. - [Lazy (NTK) Regime vs. 
Feature Learning Regime](https://theorempath.com/compare/lazy-vs-feature-learning): Neural networks can operate in two regimes: the lazy regime where weights barely move and the network behaves like a fixed kernel, or the feature learning regime where weights move substantially and learn task-specific representations. - [LoRA vs. Full Fine-Tune vs. QLoRA](https://theorempath.com/compare/lora-vs-full-finetune-vs-qlora): Full fine-tuning updates all parameters and achieves the best task performance but requires storing a complete copy of the model per task. LoRA freezes the base model and trains low-rank additive matrices, cutting trainable parameters by 100x or more. QLoRA quantizes the base model to 4-bit and applies LoRA on top, enabling fine-tuning of 65B models on a single 48GB GPU. - [Martingale CLT vs. Classical CLT](https://theorempath.com/compare/martingale-clt-vs-classical-clt): The classical CLT requires iid random variables. The martingale CLT extends to dependent sequences with mean-zero increments. The martingale version is needed for stochastic approximation and online learning. - [MLE vs. Method of Moments](https://theorempath.com/compare/mle-vs-method-of-moments): Two classical estimation strategies: MLE maximizes the likelihood and is asymptotically efficient, while Method of Moments matches sample moments to population moments and is simpler but typically less efficient. - [Model-Based vs. Model-Free RL](https://theorempath.com/compare/model-based-vs-model-free-rl): Model-based RL learns a dynamics model and plans internally (Dreamer, MuZero), while model-free RL learns value functions or policies directly from experience (DQN, PPO). The tradeoff is sample efficiency vs. model error. - [Multi-Head vs. Multi-Query vs. Grouped-Query Attention](https://theorempath.com/compare/multi-head-vs-multi-query-vs-gqa): Multi-head attention (MHA) gives each head its own K, V projections. Multi-query attention (MQA) shares a single K, V across all heads. 
Grouped-query attention (GQA) shares K, V within groups of heads. MQA and GQA reduce KV cache size during autoregressive inference, trading a small quality loss for dramatically lower memory and faster decoding. - [NTK Regime vs. Mean-Field Regime](https://theorempath.com/compare/ntk-vs-mean-field): Two limiting theories of wide neural networks: NTK linearizes training dynamics around initialization (lazy regime), while mean-field theory captures feature learning through substantial weight movement (rich regime). - [On-Policy vs. Off-Policy Learning](https://theorempath.com/compare/on-policy-vs-off-policy): On-policy methods learn from data generated by the current policy (SARSA, PPO), ensuring consistency but wasting samples. Off-policy methods learn from any data including replay buffers (Q-learning, SAC), gaining efficiency at the cost of stability. - [PAC Learning vs. Agnostic PAC Learning](https://theorempath.com/compare/pac-vs-agnostic-pac): Realizable PAC learning assumes the target is in the hypothesis class. Agnostic PAC drops this assumption and competes with the best hypothesis in the class. Agnostic learning is harder, requiring uniform convergence and yielding slower sample complexity. - [Pointwise vs. Uniform Convergence](https://theorempath.com/compare/pointwise-vs-uniform-convergence): Pointwise convergence allows different rates at different points. Uniform convergence requires the same rate everywhere. Learning theory needs uniform convergence because ERM must work simultaneously for all hypotheses. - [PPO vs. SAC](https://theorempath.com/compare/ppo-vs-sac): Two dominant actor-critic algorithms compared: PPO uses clipped surrogate objectives on-policy, while SAC maximizes entropy off-policy. PPO excels in discrete/LLM settings, SAC in continuous robotics. - [Pre-Norm vs. Post-Norm](https://theorempath.com/compare/pre-norm-vs-post-norm): Post-norm places layer normalization after the residual addition, matching the original transformer. 
Pre-norm places it before the sublayer, inside the residual branch. Pre-norm enables stable training at depth without learning rate warmup, which is why GPT, LLaMA, and most modern LLMs use it. Post-norm can achieve better final performance but requires careful initialization and warmup. - [Post-Training Quantization vs. Quantization-Aware Training](https://theorempath.com/compare/quantization-ptq-vs-qat): PTQ quantizes a pretrained model with no retraining. QAT simulates quantization during training to recover quality. GPTQ and AWQ are modern PTQ methods that close much of the gap. The tradeoff is compute cost vs. model quality at low bit widths. - [Ridge vs. Lasso Regression](https://theorempath.com/compare/ridge-vs-lasso): L2 penalty shrinks all coefficients toward zero; L1 penalty drives some exactly to zero. Ridge has a closed-form solution and handles multicollinearity; Lasso performs variable selection but requires iterative solvers. - [RLHF vs. DPO vs. GRPO](https://theorempath.com/compare/rlhf-vs-dpo-vs-grpo): Three approaches to aligning language models with human preferences. RLHF trains a separate reward model and optimizes via PPO. DPO eliminates the reward model by reparameterizing the preference objective. GRPO uses group-relative scoring without a reward model, suited for reasoning tasks with verifiable answers. - [RMSNorm vs. LayerNorm](https://theorempath.com/compare/rmsnorm-vs-layernorm): LayerNorm normalizes activations by centering (subtracting the mean) and scaling (dividing by the standard deviation), then applies a learned affine transformation. RMSNorm drops the mean centering step and normalizes by the root mean square only. RMSNorm is roughly 10-15% faster in transformer training with comparable quality. LLaMA, Gemma, Mistral, and most modern LLMs use RMSNorm. - [RoPE vs. ALiBi vs. 
Sinusoidal Positional Encoding](https://theorempath.com/compare/rope-vs-alibi-vs-sinusoidal): Sinusoidal positional encoding (original Transformer) adds fixed position vectors to token embeddings. RoPE (Rotary Position Embedding) applies position-dependent rotation to query and key vectors, encoding relative position through dot product geometry. ALiBi (Attention with Linear Biases) adds a linear position-dependent penalty directly to attention scores. RoPE extrapolates better and dominates modern LLMs. ALiBi is simplest to implement. Sinusoidal is largely historical. - [Self-Play vs. Independent Learning](https://theorempath.com/compare/self-play-vs-independent-learning): Two approaches to multi-agent reinforcement learning: self-play trains an agent against copies of itself in a non-stationary environment, while independent learning treats other agents as part of a fixed environment. Self-play converges in two-player zero-sum games; independent learning can cycle or diverge. - [SFT vs. DPO](https://theorempath.com/compare/sft-vs-dpo): Supervised fine-tuning (SFT) learns from demonstration data by maximizing log-likelihood on human-written outputs. Direct preference optimization (DPO) learns from pairwise preference data by directly optimizing the policy without fitting a separate reward model. SFT is simpler and data-efficient for instruction following. DPO is preferred when preference signals are available and you want to skip the reward model stage of RLHF. - [Shampoo vs. Adam vs. Muon](https://theorempath.com/compare/shampoo-vs-adam-vs-muon): Three approaches to preconditioning gradients: Adam (diagonal, per-parameter), Shampoo (full-matrix Kronecker), and Muon (orthogonalized updates via Newton-Schulz). Each uses increasingly rich curvature information at increasing computational cost. - [Sub-Gaussian vs. 
Sub-Exponential Random Variables](https://theorempath.com/compare/subgaussian-vs-subexponential): Two tail regimes for concentration: sub-Gaussian gives exp(-ct^2), sub-exponential gives exp(-ct) for large deviations, and the boundary between them explains when classical bounds break down. - [SVM vs. Logistic Regression](https://theorempath.com/compare/svm-vs-logistic-regression): SVMs maximize the margin using hinge loss and produce sparse support vectors; logistic regression maximizes likelihood using log loss and produces calibrated probabilities. SVMs handle nonlinearity via the kernel trick; LR needs explicit feature engineering. - [SwiGLU vs. GELU vs. ReLU](https://theorempath.com/compare/swiglu-vs-gelu-vs-relu): ReLU is the simplest activation: zero for negative inputs, identity for positive. GELU applies a smooth, probabilistic gate based on the Gaussian CDF. SwiGLU combines the Swish activation with a gated linear unit, using an extra linear projection to gate the hidden representation. SwiGLU outperforms GELU and ReLU in transformer feed-forward networks at the cost of additional parameters. LLaMA, PaLM, and Gemma use SwiGLU. GPT-2 and BERT use GELU. - [Transformer vs. Mamba vs. TTT](https://theorempath.com/compare/transformer-vs-mamba-vs-ttt): Three competing sequence architectures: attention (exact retrieval, quadratic cost), state-space models (linear cost, compressed state), and test-time training (gradient-based state updates, rich memory). Each makes different tradeoffs between memory, compute, and retrieval ability. - [Value Iteration vs. Policy Iteration](https://theorempath.com/compare/value-iteration-vs-policy-iteration): Both algorithms find optimal policies for finite MDPs. Value iteration applies the Bellman optimality operator repeatedly and extracts the policy at the end. Policy iteration alternates between full policy evaluation and greedy improvement, converging in fewer iterations but with more work per iteration. - [VC Dimension vs. 
Rademacher Complexity](https://theorempath.com/compare/vc-dimension-vs-rademacher-complexity): Worst-case combinatorial complexity vs. data-dependent average-case complexity: when each gives tighter generalization bounds. - [Weak Duality vs. Strong Duality](https://theorempath.com/compare/weak-vs-strong-duality): Weak duality always holds: dual optimal is at most primal optimal. Strong duality says they are equal, but requires constraint qualifications like Slater's condition. KKT conditions become necessary and sufficient under strong duality. - [Weight Decay vs. L2 Regularization](https://theorempath.com/compare/weight-decay-vs-l2): Weight decay and L2 regularization are identical for SGD but diverge under adaptive optimizers. L2 adds the penalty gradient before adaptive scaling, so heavily updated parameters get less regularization. Weight decay subtracts directly from weights after the update, applying uniform regularization regardless of gradient history. ## Additional - [llms-full.txt](https://theorempath.com/llms-full.txt): Complete index of all 491 topics with summaries. - [sitemap.xml](https://theorempath.com/sitemap.xml): Machine-readable sitemap. - [rss.xml](https://theorempath.com/rss.xml): Feed of recently updated topics.