Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Curriculum

The Full Theory Library

Every concept organized by depth layer and module. Layer 0 is foundations. Layer 5 is applied systems. Every topic links down to its prerequisites until you hit axioms.

431 curriculum topics + 60 reference and insight pages (491 total)

Foundations (Layer 0A)

Axioms, definitions, and notation. The bedrock everything else depends on.

L0A · Tier 1 · ~50m

Common Inequalities

The algebraic and probabilistic inequalities that appear everywhere in ML theory: Cauchy-Schwarz, Jensen, AM-GM, Hölder, Minkowski, Young, Markov, and Chebyshev.

L0A · Tier 1 · ~65m

Common Probability Distributions

The essential probability distributions for ML: Bernoulli through Dirichlet, with PMFs/PDFs, moments, relationships, and when each arises in practice.

sets functions and relations
L0A · Tier 1 · ~40m

Compactness and Heine-Borel

Sequential compactness, the Heine-Borel theorem in finite dimensions, the extreme value theorem, and why compactness is the key assumption in optimization.

metric spaces convergence completeness
L0A · Tier 1 · ~50m

Computability Theory

What can be computed? Turing machines, decidability, the Church-Turing thesis, recursive and recursively enumerable sets, reductions, Rice's theorem, and connections to learning theory.

basic logic and proof techniques · sets functions and relations
L0A · Tier 1 · ~35m

Continuity in R^n

Epsilon-delta continuity, uniform continuity, and Lipschitz continuity in Euclidean space. Lipschitz constants control how fast function values change and appear throughout optimization and generalization theory.

metric spaces convergence completeness
L0A · Tier 1 · ~40m

Differentiation in R^n

Partial derivatives, the gradient, directional derivatives, the total derivative (Frechet), and the multivariable chain rule. Why the gradient points in the steepest ascent direction, and why this matters for all of optimization.

sets functions and relations
L0A · Tier 1 · ~55m

Eigenvalues and Eigenvectors

Eigenvalues and eigenvectors: the directions a matrix scales without rotating. Characteristic polynomial, diagonalization, the spectral theorem for symmetric matrices, and the direct connection to PCA.

matrix operations and properties
L0A · Tier 1 · ~55m

Expectation, Variance, Covariance, and Moments

Expectation, variance, covariance, correlation, linearity of expectation, variance of sums, and moment-based reasoning in ML.

common probability distributions
L0A · Tier 1 · ~30m

Exponential Function Properties

The exponential function e^x: series definition, algebraic properties, and why it appears everywhere in ML. Softmax, MGFs, the Chernoff method, Boltzmann distributions, and exponential families all reduce to properties of exp.

L1 · Tier 1 · ~40m

Gram Matrices and Kernel Matrices

The Gram matrix G_{ij} = <x_i, x_j> encodes pairwise inner products of a dataset. Always PSD. Appears in kernel methods, PCA, SVD, and attention. Understanding it connects linear algebra to ML.

inner product spaces and orthogonality · eigenvalues and eigenvectors
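The PSD fact stated above is easy to check numerically. A minimal sketch (illustrative random data, not from the curriculum): build the Gram matrix G = X X^T for a matrix X whose rows are data points, and confirm that every eigenvalue is non-negative up to floating-point error.

```python
import numpy as np

# Hypothetical dataset: 4 points in R^3, one per row.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))

# Gram matrix: G[i, j] = <x_i, x_j>, i.e. G = X X^T.
G = X @ X.T

# G is symmetric, and PSD: all eigenvalues >= 0 (up to float error).
eigvals = np.linalg.eigvalsh(G)
assert np.allclose(G, G.T)
assert eigvals.min() >= -1e-10
```

Since X has only 3 columns, G is a 4x4 matrix of rank at most 3, so at least one eigenvalue is (numerically) zero — PSD but not positive definite.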
L0A · Tier 1 · ~40m

Inner Product Spaces and Orthogonality

Inner product axioms, Cauchy-Schwarz inequality, orthogonality, Gram-Schmidt, projections, and the bridge to Hilbert spaces.

vectors matrices and linear maps
L0A · Tier 1 · ~40m

Joint, Marginal, and Conditional Distributions

Joint distributions, marginalization, conditional distributions, independence, Bayes' theorem, and the chain rule of probability.

common probability distributions
L1 · Tier 1 · ~45m

KL Divergence

Kullback-Leibler divergence measures how one probability distribution differs from another. Asymmetric, always non-negative, and central to variational inference, MLE, and RLHF.

common probability distributions · information theory foundations
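The two properties named above — non-negativity and asymmetry — can be verified directly for discrete distributions. A minimal pure-Python sketch with made-up distributions p and q:

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as aligned probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

assert kl(p, p) == 0.0                     # zero when the distributions match
assert kl(p, q) > 0 and kl(q, p) > 0       # otherwise strictly positive
assert abs(kl(p, q) - kl(q, p)) > 1e-6     # and asymmetric: KL(p||q) != KL(q||p)
```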
L0A · Tier 1 · ~35m

Matrix Norms

Frobenius, spectral, and nuclear norms for matrices. Submultiplicativity. When and why each norm appears in ML theory.

vectors matrices and linear maps
L0A · Tier 1 · ~50m

Matrix Operations and Properties

Essential matrix operations for ML: trace, determinant, inverse, pseudoinverse, Schur complement, and the Sherman-Morrison-Woodbury formula. When and why each matters.

sets functions and relations
L0A · Tier 1 · ~55m

Metric Spaces, Convergence, and Completeness

Metric space axioms, convergence of sequences, Cauchy sequences, completeness, and the Banach fixed-point theorem.

L1 · Tier 1 · ~45m

Numerical Stability and Conditioning

Continuous math becomes real only through finite-precision approximation. Condition numbers, backward stability, catastrophic cancellation, and why theorems about reals do not transfer cleanly to floating-point.

floating point arithmetic · matrix operations and properties
L0A · Tier 1 · ~40m

Positive Semidefinite Matrices

PSD matrices: equivalent characterizations, Cholesky decomposition, Schur complement, and Loewner ordering. Covariance matrices are PSD. Hessians of convex functions are PSD. These facts connect linear algebra to optimization and statistics.

eigenvalues and eigenvectors
L0A · Tier 1 · ~40m

Sets, Functions, and Relations

The language underneath all of mathematics: sets, Cartesian products, functions, injectivity, surjectivity, equivalence relations, and quotient sets.

basic logic and proof techniques
L0A · Tier 1 · ~60m

Singular Value Decomposition

The SVD A = U Sigma V^T: the most important matrix factorization in applied mathematics. Geometric interpretation, relationship to eigendecomposition, low-rank approximation via Eckart-Young, and applications from PCA to pseudoinverses.

eigenvalues and eigenvectors
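The Eckart-Young statement above has a one-line numerical check: truncating the SVD to rank k yields spectral-norm error exactly equal to the (k+1)-th singular value. A sketch with an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

# Thin SVD: A = U diag(s) V^T with singular values sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation: keep the k largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: spectral-norm error of the rank-k truncation
# equals the (k+1)-th singular value (index k, 0-based).
err = np.linalg.norm(A - A_k, ord=2)
assert np.isclose(err, s[k])
```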
L1 · Tier 1 · ~55m

Skewness, Kurtosis, and Higher Moments

Distribution shape beyond mean and variance: skewness measures tail asymmetry, kurtosis measures tail extremeness, cumulants are the cleaner language, and heavy-tailed distributions break all of these.

common probability distributions · expectation variance covariance moments
L0A · Tier 1 · ~40m

Taylor Expansion

Taylor approximation in one and many variables. Every optimization algorithm is a Taylor approximation: gradient descent uses first order, Newton's method uses second order.
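The claim that every optimization algorithm rests on Taylor approximation comes down to the remainder: the first-order error is O(h^2), so halving the step roughly quarters the error. A small illustrative check with f(x) = e^x at x = 0 (a choice made here for the example, not from the curriculum):

```python
import math

# f(x) = e^x at x = 0: f(0) = 1, f'(0) = 1.
# First-order Taylor error: |e^h - (1 + h)| ~ h^2 / 2.
def first_order_error(h):
    return abs(math.exp(h) - (1.0 + h))

e1 = first_order_error(1e-3)
e2 = first_order_error(5e-4)

# Halving h cuts the error by roughly 4x, confirming O(h^2).
assert 3.5 < e1 / e2 < 4.5
```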

L0A · Tier 1 · ~55m

Tensors and Tensor Operations

What a tensor actually is: a multilinear map with specific transformation rules, how tensor contraction generalizes matrix multiplication, Einstein summation, tensor decompositions, and how ML frameworks use the word tensor to mean multidimensional array.

eigenvalues and eigenvectors
L0A · Tier 1 · ~45m

Vectors, Matrices, and Linear Maps

Vector spaces, linear maps, matrix representation, rank, nullity, and the rank-nullity theorem. The algebraic backbone of ML.

L0A · Tier 2 · ~40m

Basic Logic and Proof Techniques

The fundamental proof strategies used throughout mathematics: direct proof, contradiction, contrapositive, induction, construction, and cases. Required vocabulary for reading any theorem.

L0A · Tier 2 · ~45m

Cantor's Theorem and Uncountability

Cantor's diagonal argument proves the reals are uncountable. The power set of any set has strictly greater cardinality. These results are the origin of the distinction between countable and uncountable infinity.

L0A · Tier 2 · ~35m

Cardinality and Countability

Two sets have the same cardinality when a bijection exists between them. The naturals, integers, and rationals are countable. The reals are uncountable, proved by Cantor's diagonal argument.

sets functions and relations
L0A · Tier 2 · ~35m

Counting and Combinatorics

Counting principles, binomial and multinomial coefficients, inclusion-exclusion, and Stirling's approximation. These tools appear whenever you count hypotheses, bound shattering coefficients, or analyze combinatorial arguments in learning theory.

L1 · Tier 2 · ~30m

Cramér-Wold Theorem

A multivariate distribution is uniquely determined by all of its one-dimensional projections. This reduces multivariate convergence in distribution to checking univariate projections, and is the standard tool for proving multivariate CLT.

central limit theorem · measure theoretic probability
L0A · Tier 2 · ~40m

Integration and Change of Variables

Riemann integration, improper integrals, the substitution rule, multivariate change of variables via the Jacobian determinant, and Fubini's theorem. The computational backbone of probability and ML.

L0A · Tier 2 · ~40m

Inverse and Implicit Function Theorem

The inverse function theorem guarantees local invertibility when the Jacobian is nonsingular. The implicit function theorem guarantees that constraint surfaces are locally graphs. Both are essential for constrained optimization and implicit layers.

the jacobian matrix
L1 · Tier 2 · ~50m

Markov Chains and Steady State

Markov chains: the Markov property, transition matrices, stationary distributions, irreducibility, aperiodicity, the ergodic theorem, and mixing time. The backbone of PageRank, MCMC, and reinforcement learning.

common probability distributions · eigenvalues and eigenvectors
L0A · Tier 2 · ~35m

Moment Generating Functions

The moment generating function M(t) = E[e^{tX}] encodes all moments of a distribution. The Chernoff method, sub-Gaussian bounds, and exponential family theory all reduce to MGF conditions.

expectation variance covariance moments
L2 · Tier 2 · ~55m

SAT, SMT, and Automated Reasoning

SAT solvers decide Boolean satisfiability (NP-complete). SMT solvers extend SAT with theories like arithmetic and arrays. These tools verify constraints, discharge proof obligations, and complement LLMs in AI agent pipelines.

p vs np
L0A · Tier 2 · ~35m

Sequences and Series of Functions

Pointwise vs uniform convergence of function sequences, the Weierstrass M-test, and why uniform convergence preserves continuity. The concept that makes learning theory work.

metric spaces convergence completeness
L1 · Tier 2 · ~60m

Signals and Systems for ML

Linear time-invariant systems, convolution, Fourier transform, and the sampling theorem. The signal processing foundations that underpin CNNs, efficient attention, audio ML, and frequency-domain analysis of training dynamics.

L0A · Tier 3 · ~45m

Formal Languages and Automata

Regular languages, context-free grammars, pushdown automata, the Chomsky hierarchy, pumping lemmas, and connections to parsing, neural sequence models, and computational complexity.

basic logic and proof techniques · sets functions and relations

Mathematical Infrastructure (Layer 0B)

Serious math machinery: measure theory, functional analysis, convex duality.

L0B · Tier 1 · ~75m

Convex Duality

Fenchel conjugates, the Fenchel-Moreau theorem, weak and strong duality, KKT conditions, and why duality gives the kernel trick for SVMs, connects regularization to constraints, and enables adversarial formulations in DRO.

convex optimization basics
L0B · Tier 1 · ~80m

Measure-Theoretic Probability

The rigorous foundations of probability: sigma-algebras, measures, measurable functions as random variables, Lebesgue integration, and the convergence theorems that make modern probability and statistics possible.

L0B · Tier 1 · ~80m

Radon-Nikodym and Conditional Expectation

The Radon-Nikodym theorem: what 'density' really means. Absolute continuity, the Radon-Nikodym derivative, conditional expectation as a projection, tower property, and why this undergirds likelihood ratios, importance sampling, and KL divergence.

measure theoretic probability
L0B · Tier 2 · ~75m

Functional Analysis Core

The four pillars of functional analysis: Hahn-Banach (extending functionals), Uniform Boundedness (pointwise bounded implies uniformly bounded), Open Mapping (surjective bounded operators have open images), and Banach-Alaoglu (dual unit ball is weak-* compact). These underpin RKHS theory, optimization in function spaces, and duality.

L0B · Tier 2 · ~70m

Information Theory Foundations

The core of information theory for ML: entropy, cross-entropy, KL divergence, mutual information, data processing inequality, and the chain rules that connect them. The language of variational inference, generalization bounds, and representation learning.

L3 · Tier 2 · ~50m

Ito's Lemma

The chain rule of stochastic calculus: if X_t follows an SDE, then f(X_t) follows a modified SDE with an extra second-order correction term that has no analogue in ordinary calculus.

stochastic calculus for ml
L0B · Tier 2 · ~70m

Martingale Theory

Martingales and their convergence properties: Doob martingale, optional stopping theorem, martingale convergence, Azuma-Hoeffding inequality, and Freedman's inequality. The tools behind McDiarmid's inequality, online learning regret bounds, and stochastic approximation.

measure theoretic probability
L3 · Tier 3 · ~60m

Information Geometry

Riemannian geometry on the space of probability distributions: the Fisher information metric, natural gradient descent, exponential families as dually flat manifolds, and the connection to mirror descent.

fisher information · convex duality
L0B · Tier 3 · ~65m

Spectral Theory of Operators

Spectral theorem for compact self-adjoint operators on Hilbert spaces: every such operator has a countable orthonormal eigenbasis with real eigenvalues accumulating only at zero. This is the infinite-dimensional backbone of PCA, kernel methods, and neural tangent kernel theory.

eigenvalues and eigenvectors
L3 · Tier 3 · ~65m

Stochastic Calculus for ML

Brownian motion, Ito integrals, Ito's lemma, and stochastic differential equations: the mathematical machinery behind diffusion models, score-based generative models, and Langevin dynamics.

martingale theory · measure theoretic probability

Statistical Estimation (Layer 0B)

MLE, Fisher information, Cramér-Rao, LLN, CLT — the estimation core.

L0B · Tier 1 · ~55m

Central Limit Theorem

The CLT: the sample mean is approximately Gaussian for large n, regardless of the original distribution. Berry-Esseen rate, multivariate CLT, and why CLT explains asymptotic normality of MLE, confidence intervals, and the ubiquity of the Gaussian.

law of large numbers · common probability distributions
L0B · Tier 1 · ~50m

Cramér-Rao Bound

The fundamental lower bound on the variance of any unbiased estimator: no unbiased estimator can have variance smaller than the reciprocal of the Fisher information.

fisher information
L0B · Tier 1 · ~55m

Fisher Information

The Fisher information quantifies how much a sample tells you about an unknown parameter: it measures the curvature of the log-likelihood, sets the Cramér-Rao lower bound on estimator variance, and serves as a natural Riemannian metric on parameter space.

maximum likelihood estimation
L0B · Tier 1 · ~50m

Law of Large Numbers

The weak and strong laws of large numbers: the sample mean converges to the population mean. Kolmogorov's conditions, the rate of convergence from the CLT, and why LLN justifies using empirical risk as a proxy for population risk.

common probability distributions
L0B · Tier 1 · ~65m

Maximum Likelihood Estimation

MLE: find the parameter that maximizes the likelihood of observed data. Consistency, asymptotic normality, Fisher information, Cramér-Rao efficiency, and when MLE fails.

common probability distributions · differentiation in rn
L0B · Tier 1 · ~55m

Shrinkage Estimation and the James-Stein Estimator

In three or more dimensions, the sample mean is inadmissible for estimating a multivariate normal mean. The James-Stein estimator shrinks toward zero and dominates the MLE in total MSE, a result that shocked the statistics world.

maximum likelihood estimation
L0B · Tier 2 · ~60m

Bayesian Estimation

The Bayesian approach to parameter estimation: encode prior beliefs, update with data via Bayes rule, and obtain a full posterior distribution over parameters. Conjugate priors, MAP estimation, and the Bernstein-von Mises theorem showing that the posterior concentrates around the true parameter.

maximum likelihood estimation · common probability distributions
L1 · Tier 2 · ~50m

Goodness-of-Fit Tests

KS test, Shapiro-Wilk, Anderson-Darling, and the probability integral transform: testing whether data follows a specified distribution, with exact power comparisons and when each test fails.

hypothesis testing for ml · common probability distributions
L0B · Tier 2 · ~40m

Method of Moments

Match sample moments to population moments to estimate parameters. Simpler than MLE but less efficient. Covers classical MoM, generalized method of moments (GMM), and when MoM is the better choice.

common probability distributions
L0B · Tier 2 · ~60m

Sufficient Statistics and Exponential Families

Sufficient statistics compress data without losing information about the parameter. The Neyman-Fisher factorization theorem, exponential families, completeness, and Rao-Blackwell improvement of estimators.

maximum likelihood estimation
L0B · Tier 3 · ~65m

Asymptotic Statistics

The large-sample toolbox: delta method, Slutsky's theorem, asymptotic normality of MLE, local asymptotic normality, and Fisher efficiency. These results justify nearly every confidence interval and hypothesis test used in practice.

central limit theorem · maximum likelihood estimation
L0B · Tier 3 · ~35m

Basu's Theorem

A complete sufficient statistic is independent of every ancillary statistic. This provides the cleanest method for proving independence between statistics without computing joint distributions.

sufficient statistics and exponential families

Learning Theory Core (Layer 1-2)

ERM, uniform convergence, VC dimension, Rademacher complexity.

L3 · Tier 1 · ~65m

Algorithmic Stability

Algorithmic stability provides generalization bounds by analyzing how much a learning algorithm's output changes when a single training example is replaced: a structurally different lens from complexity-based approaches.

empirical risk minimization · vc dimension · concentration inequalities
L2 · Tier 1 · ~60m

Empirical Risk Minimization

The foundational principle of statistical learning: minimize average loss on training data as a proxy for minimizing true population risk.

concentration inequalities · common probability distributions
L2 · Tier 1 · ~45m

Hypothesis Classes and Function Spaces

What is a hypothesis class, why the choice of hypothesis class determines what ERM can learn, and the approximation-estimation tradeoff: bigger classes reduce approximation error but increase estimation error.

empirical risk minimization
L1 · Tier 1 · ~55m

PAC Learning Framework

The foundational formalization of what it means to learn from data: a concept is PAC-learnable if an algorithm can, with high probability, find a hypothesis that is approximately correct, using a polynomial number of samples.

vc dimension · concentration inequalities
L3 · Tier 1 · ~80m

Rademacher Complexity

A data-dependent measure of hypothesis class complexity that gives tighter generalization bounds than VC dimension by measuring how well the class can fit random noise on the actual data.

empirical risk minimization · vc dimension · concentration inequalities
L2 · Tier 1 · ~50m

Sample Complexity Bounds

How many samples do you need to learn? Tight answers for finite hypothesis classes, VC classes, and Rademacher-bounded classes, plus matching lower bounds via Fano and Le Cam.

vc dimension · rademacher complexity
L2 · Tier 1 · ~65m

Uniform Convergence

Uniform convergence of empirical risk to population risk over an entire hypothesis class: the key property that makes ERM provably work.

empirical risk minimization
L2 · Tier 1 · ~75m

VC Dimension

The Vapnik-Chervonenkis dimension: a combinatorial measure of hypothesis class complexity that characterizes learnability in binary classification.

empirical risk minimization · concentration inequalities
L2 · Tier 2 · ~55m

Kolmogorov Complexity and MDL

Kolmogorov complexity measures the shortest program that produces a string. The Minimum Description Length principle selects models that compress data best, providing a computable approximation to an incomputable ideal.

p vs np

Concentration & Probability (Layer 1-3)

Hoeffding through matrix Bernstein. The workhorse inequality family.

L1 · Tier 1 · ~45m

Chernoff Bounds

The Chernoff method: the universal technique for deriving exponential tail bounds by optimizing over the moment generating function, yielding the tightest possible exponential concentration.

concentration inequalities
L1 · Tier 1 · ~70m

Concentration Inequalities

Bounds on how far random variables deviate from their expectations: Markov, Chebyshev, Hoeffding, and Bernstein. Used throughout generalization theory, bandits, and sample complexity.

common probability distributions · expectation variance covariance moments
L3 · Tier 1 · ~60m

Epsilon-Nets and Covering Numbers

Discretizing infinite sets for concentration arguments: epsilon-nets, covering numbers, packing numbers, the Dudley integral, and the connection to Rademacher complexity.

subgaussian random variables · concentration inequalities
L3 · Tier 1 · ~70m

Matrix Concentration

Matrix Bernstein, Matrix Hoeffding, Weyl's inequality, and Davis-Kahan: the operator-norm concentration tools needed for covariance estimation, dimensionality reduction, and spectral analysis in high dimensions.

subgaussian random variables · subexponential random variables · concentration inequalities
L3 · Tier 1 · ~55m

McDiarmid's Inequality

The bounded-differences inequality: if changing one input to a function changes the output by at most c_i, the function concentrates around its mean with sub-Gaussian tails.

concentration inequalities · subgaussian random variables
L2 · Tier 1 · ~55m

Sub-Exponential Random Variables

The distributional class between sub-Gaussian and heavy-tailed: heavier tails than Gaussian, the psi_1 norm, Bernstein condition, and the two-regime concentration bound.

subgaussian random variables · concentration inequalities
L2 · Tier 1 · ~75m

Sub-Gaussian Random Variables

Sub-Gaussian random variables: the precise characterization of 'light-tailed' behavior that underpins every concentration inequality in learning theory.

concentration inequalities
L3 · Tier 1 · ~50m

Symmetrization Inequality

The symmetrization technique: the proof template that connects the generalization gap to Rademacher complexity by introducing a ghost sample and random signs.

rademacher complexity · concentration inequalities
L3 · Tier 2 · ~50m

Contraction Inequality

The Ledoux-Talagrand contraction principle: applying an L-Lipschitz function with phi(0)=0 to a function class can only contract Rademacher complexity, letting you bound the complexity of the loss class from the hypothesis class.

rademacher complexity
L3 · Tier 2 · ~65m

Empirical Processes and Chaining

Bounding the supremum of empirical processes via covering numbers and chaining: Dudley's entropy integral and Talagrand's generic chaining, the sharpest tools in classical learning theory.

rademacher complexity · epsilon nets and covering numbers
L3 · Tier 2 · ~55m

Hanson-Wright Inequality

Concentration of quadratic forms X^T A X for sub-Gaussian random vectors: the two-term bound involving the Frobenius norm (Gaussian regime) and operator norm (extreme regime).

subgaussian random variables · matrix concentration
L3 · Tier 2 · ~55m

Measure Concentration and Geometric Functional Analysis

High-dimensional geometry is counterintuitive: Lipschitz functions concentrate, random projections preserve distances, and most of a sphere's measure sits near the equator. Johnson-Lindenstrauss, Gaussian concentration, and Levy's lemma.

subgaussian random variables · epsilon nets and covering numbers
L3 · Tier 2 · ~50m

Restricted Isometry Property

The restricted isometry property (RIP): when a measurement matrix approximately preserves norms of sparse vectors, enabling exact sparse recovery via L1 minimization. Random Gaussian matrices satisfy RIP with O(s log(n/s)) rows.

sparse recovery and compressed sensing · subgaussian random variables

Optimization & Function Classes (Layer 1-3)

Convex optimization, regularization, kernels, RKHS.

L1 · Tier 1 · ~60m

Convex Optimization Basics

Convex sets, convex functions, gradient descent convergence, strong convexity, and duality: the optimization foundation that every learning-theoretic result silently depends on.

differentiation in rn · matrix operations and properties
L1 · Tier 1 · ~45m

Gradient Descent Variants

From full-batch to stochastic to mini-batch gradient descent, plus momentum, Nesterov acceleration, AdaGrad, RMSProp, and Adam. Why mini-batch SGD with momentum is the practical default.

convex optimization basics · differentiation in rn
L2 · Tier 1 · ~50m

Gradient Flow and Vanishing Gradients

Why deep networks are hard to train: gradients shrink or explode as they propagate through layers. The Jacobian product chain, sigmoid saturation, ReLU dead neurons, skip connections, normalization, and gradient clipping.

feedforward networks and backpropagation · the jacobian matrix
L2 · Tier 1 · ~55m

Stochastic Gradient Descent Convergence

SGD convergence rates for convex and strongly convex functions, the role of noise as both curse and blessing, mini-batch variance reduction, learning rate schedules, and the Robbins-Monro conditions.

gradient descent variants · concentration inequalities
L2 · Tier 2 · ~55m

Bias-Variance Tradeoff

The classical decomposition of mean squared error into bias squared, variance, and irreducible noise. The U-shaped test error curve, why it breaks in modern ML (double descent), and the connection to regularization.

expectation variance covariance moments · empirical risk minimization
L2 · Tier 2 · ~45m

Cross-Validation Theory

The theory behind cross-validation as a model selection tool: LOO-CV, K-fold, the bias-variance tradeoff of the CV estimator itself, and why CV estimates generalization error.

empirical risk minimization · bias variance tradeoff
L3 · Tier 2 · ~70m

Kernels and Reproducing Kernel Hilbert Spaces

Kernel functions, Mercer's theorem, the RKHS reproducing property, and the representer theorem: the mathematical framework that enables learning in infinite-dimensional function spaces via finite-dimensional computations.

convex optimization basics · rademacher complexity
L3 · Tier 2 · ~55m

Preconditioned Optimizers: Shampoo, K-FAC, and Natural Gradient

Optimizers that use curvature information to precondition gradients: the natural gradient via Fisher information, K-FAC's Kronecker approximation, and Shampoo's full-matrix preconditioning. How they connect to Riemannian optimization and why they outperform Adam on certain architectures.

convex optimization basics · fisher information · the hessian matrix
L2 · Tier 2 · ~60m

Regularization Theory

Why unconstrained ERM overfits and how regularization controls complexity: Tikhonov (L2), sparsity (L1), elastic net, early stopping, dropout, the Bayesian prior connection, and the link to algorithmic stability.

convex optimization basics · bias variance tradeoff
L3 · Tier 2 · ~55m

Riemannian Optimization and Manifold Constraints

Optimization on curved spaces: the Stiefel manifold for orthogonal matrices, symmetric positive definite matrices, Riemannian gradient descent, retractions, and why flat-space intuitions break on manifolds. The geometric backbone of Shampoo, Muon, and constrained neural network training.

convex optimization basics · the hessian matrix · eigenvalues and eigenvectors
L2 · Tier 2 · ~60m

Stability and Optimization Dynamics

Convergence of gradient descent for smooth and strongly convex objectives, the descent lemma, gradient flow as a continuous-time limit, Lyapunov stability analysis, and the edge of stability phenomenon.

convex optimization basics · algorithmic stability
L2 · Tier 2 · ~60m

Stochastic Approximation Theory

The Robbins-Monro framework, ODE method, and Polyak-Ruppert averaging: the unified theory behind why SGD, Q-learning, and TD-learning converge.

convex optimization basics · martingale theory

Statistical Foundations (Layer 2-3)

Minimax, Fano, information-theoretic lower bounds, random matrix theory.

L2 · Tier 2 · ~45m

Design-Based vs. Model-Based Inference

Two philosophies of statistical inference from survey data: design-based inference where randomness comes from the sampling design, and model-based inference where randomness comes from a statistical model, with the model-assisted hybrid approach.

survey sampling methods
L2 · Tier 2 · ~55m

Detection Theory

Binary hypothesis testing, the Neyman-Pearson lemma (likelihood ratio tests are most powerful), ROC curves, Bayesian detection, and sequential testing. Classification IS detection theory. ROC/AUC comes directly from here.

hypothesis testing for ml · bayesian estimation
L3 · Tier 2 · ~60m

Fano Inequality

Fano inequality as the standard tool for information-theoretic lower bounds: if X -> Y -> X_hat, then error probability is bounded below by conditional entropy and alphabet size.

minimax lower bounds
L3 · Tier 2 · ~55m

High-Dimensional Covariance Estimation

When dimension d is comparable to sample size n, the sample covariance matrix fails. Shrinkage estimators (Ledoit-Wolf), banding and tapering for structured covariance, and Graphical Lasso for sparse precision matrices.

matrix concentration · random matrix theory overview
L3 · Tier 2 · ~50m

Kernel Two-Sample Tests

Maximum Mean Discrepancy (MMD): a kernel-based nonparametric test for whether two samples come from the same distribution, with unbiased estimation, permutation testing, and applications to GAN evaluation.

kernels and rkhs
L3 · Tier 2 · ~60m

Minimax Lower Bounds

Why upper bounds are not enough: minimax risk, Le Cam two-point method, Fano inequality, and Assouad lemma for proving that no estimator can beat a given rate.

concentration inequalities · maximum likelihood estimation
L2 · Tier 2 · ~50m

Neyman-Pearson and Hypothesis Testing Theory

The likelihood ratio test is the most powerful test for simple hypotheses (Neyman-Pearson lemma), UMP tests extend this to one-sided composites, and the power function characterizes a test's behavior across the parameter space.

common probability distributions · maximum likelihood estimation
L2 · Tier 2 · ~50m

Nonresponse and Missing Data

The taxonomy of missingness mechanisms (MCAR, MAR, MNAR), their consequences for inference, and the major correction methods: multiple imputation, inverse probability weighting, and the EM algorithm.

common probability distributions
L1 · Tier 2 · ~45m

Order Statistics

Order statistics are the sorted values of a random sample. Their distributions govern quantile estimation, confidence intervals for medians, and the behavior of extremes.

common probability distributions
L4 · Tier 2 · ~75m

Random Matrix Theory Overview

Why the spectra of random matrices matter for ML: Marchenko-Pastur law, Wigner semicircle, spiked models, and their applications to covariance estimation, PCA, and overparameterization.

matrix concentration · epsilon nets and covering numbers
L3 · Tier 2 · ~55m

Robust Statistics and M-Estimators

When data has outliers or model assumptions are wrong, classical estimators break. M-estimators generalize MLE to handle contamination gracefully.

maximum likelihood estimation
L2 · Tier 2 · ~45m

Sample Size Determination

How to compute the number of observations needed to estimate means, proportions, and treatment effects with specified precision and power, including corrections for finite populations and complex designs.

hypothesis testing for ml · common probability distributions
L2 · Tier 2 · ~60m

Survey Sampling Methods

The major probability sampling designs used in survey statistics: simple random, stratified, cluster, systematic, multi-stage, and multi-phase sampling, with their variance properties and estimators.

common probability distributions · expectation variance covariance moments
L3 · Tier 2 · ~55m

Survival Analysis

Modeling time-to-event data with censoring: Kaplan-Meier curves, hazard functions, and the Cox proportional hazards model.

maximum likelihood estimation
L3 · Tier 3 · ~50m

Copulas

Copulas separate the dependence structure of a multivariate distribution from its marginals. Sklar's theorem guarantees that any joint CDF can be decomposed into marginals and a copula, making dependence modeling modular.

common probability distributions
L3 · Tier 3 · ~50m

Longitudinal Surveys and Panel Data

Analysis of data where the same units are measured repeatedly over time: fixed effects, random effects, difference-in-differences, and the problems of attrition and time-varying confounding.

linear regression
L3Tier 3~55m

Small Area Estimation

Methods for producing reliable estimates in domains where direct survey estimates have too few observations for useful precision, using Fay-Herriot and unit-level models that borrow strength across areas.

bayesian estimation, linear regression

Modern Generalization Theory (Layer 3-4)

Implicit bias, double descent, NTK, mean field — where classical theory fails.

L4Tier 1~90m

Implicit Bias and Modern Generalization

Why classical generalization theory breaks for overparameterized models: the random labels experiment, the interpolation threshold, implicit bias of gradient descent, double descent, and the frontier of understanding why deep learning works.

gradient descent variants, linear regression, vc dimension +1
L4Tier 2~65m

Benign Overfitting

When interpolation (zero training error) does not hurt generalization: the min-norm interpolator fits noise in harmless directions while preserving signal. Bartlett et al. 2020, effective rank conditions, and why benign overfitting happens in overparameterized but not classical regimes.

implicit bias and modern generalization, random matrix theory overview
L4Tier 2~65m

Double Descent

Test error follows a double-descent curve: it decreases, peaks at the interpolation threshold, then decreases again in the overparameterized regime, defying classical bias-variance intuition.

implicit bias and modern generalization, random matrix theory overview
L4Tier 2~50m

Grokking

Models can memorize training data quickly, then generalize much later after continued training. This delayed generalization, called grokking, breaks the assumption that overfitting is a terminal state and connects to weight decay, implicit regularization, and phase transitions in learning.

regularization theory, stochastic gradient descent convergence, implicit bias and modern generalization
L4Tier 2~55m

Lazy vs Feature Learning

The fundamental dichotomy in neural network training: lazy regime (NTK, kernel-like, weights barely move) versus rich/feature learning regime (weights move substantially, representations emerge).

neural tangent kernel, mean field theory
L4Tier 2~65m

Mean Field Theory

The mean field limit of neural networks: as width goes to infinity under the right scaling, neurons become independent particles whose weight distribution evolves by Wasserstein gradient flow, capturing feature learning that the NTK regime misses.

neural tangent kernel
L4Tier 2~55m

Neural Network Optimization Landscape

Loss surface geometry of neural networks: saddle points dominate in high dimensions, mode connectivity, flat vs sharp minima, Sharpness-Aware Minimization, and the edge of stability phenomenon.

training dynamics and loss landscapes, the hessian matrix
L4Tier 2~70m

Neural Tangent Kernel

In the infinite-width limit, neural networks trained with gradient descent behave like kernel regression with a specific kernel, the Neural Tangent Kernel, connecting deep learning to classical kernel theory.

kernels and rkhs, implicit bias and modern generalization
L3Tier 2~55m

Optimal Transport and Earth Mover's Distance

The Monge and Kantorovich formulations of optimal transport, the linear programming dual, Sinkhorn regularization, and applications to WGANs, domain adaptation, and fairness.

wasserstein distances, convex duality
L3Tier 2~60m

PAC-Bayes Bounds

Generalization bounds that depend on the KL divergence between a learned posterior and a prior over hypotheses. PAC-Bayes gives non-vacuous bounds for overparameterized networks where VC and Rademacher bounds fail.

rademacher complexity, bayesian estimation
L3Tier 2~60m

Representation Learning Theory

What makes a good learned representation: the information bottleneck, contrastive learning, sufficient statistics, rate-distortion theory, and why representation learning is the central unsolved problem of deep learning.

information theory foundations, variational autoencoders
L4Tier 3~65m

Gaussian Processes for Machine Learning

A distribution over functions specified by a mean and kernel: closed-form posterior predictions with uncertainty, connection to kernel ridge regression, marginal likelihood for model selection, and the cubic cost bottleneck.

kernels and rkhs
L3Tier 3~50m

Information Bottleneck

The information bottleneck principle: compress the input X into a representation T that preserves information about the target Y. The Lagrangian formulation, connection to deep learning, Shwartz-Ziv and Tishby claims, and why the compression story may not hold for ReLU networks.

information theory foundations
L5Tier 3~70m

Open Problems in ML Theory

A curated list of genuinely open problems in machine learning theory: why overparameterized networks generalize, the right complexity measure for deep learning, feature learning beyond NTK, why scaling laws hold, emergent abilities, transformer-specific theory, and post-training theory.

implicit bias and modern generalization, scaling laws
L4Tier 3~60m

Sparse Recovery and Compressed Sensing

Recover a sparse signal from far fewer measurements than its ambient dimension: the restricted isometry property, basis pursuit via L1 minimization, random measurement matrices, and applications from MRI to single-pixel cameras.

lasso regression, subgaussian random variables
L4Tier 3~55m

Wasserstein Distances

The Wasserstein (earth mover's) distance measures the minimum cost of transporting one probability distribution to another, with deep connections to optimal transport, GANs, and distributional robustness.

common probability distributions
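In one dimension the Wasserstein-1 distance has a closed form: couple the distributions by matching quantiles. A small sketch (NumPy assumed; sample sizes and locations are illustrative) estimates W1 between two shifted Gaussians from sorted samples.

```python
import numpy as np

# For 1-D empirical distributions with equal sample counts, W1 is the mean
# absolute difference of sorted samples (the quantile coupling is optimal).
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, size=1000)
y = rng.normal(loc=2.0, size=1000)
w1 = np.abs(np.sort(x) - np.sort(y)).mean()
print(round(w1, 2))  # close to 2.0: the cost of shifting one Gaussian onto the other
```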

LLM Construction (Layer 4-5)

Transformer math, attention, KV cache, optimizers, scaling laws, RLHF.

L3Tier 1~55m

Fine-Tuning and Adaptation

Methods for adapting pretrained models to new tasks: full fine-tuning, feature extraction, LoRA, QLoRA, adapter layers, and prompt tuning. When to use each, and how catastrophic forgetting constrains adaptation.

feedforward networks and backpropagation, transformer architecture
L4Tier 1~55m

Hallucination Theory

Why large language models confabulate, the mathematical frameworks for understanding when model outputs are unreliable, and what current theory says about mitigation.

empirical risk minimization, transformer architecture
L3Tier 1~70m

Optimizer Theory: SGD, Adam, and Muon

Convergence theory of SGD (convex and strongly convex), momentum methods (Polyak and Nesterov), Adam as adaptive + momentum, why SGD can generalize better, the Muon optimizer, and learning rate schedules.

convex optimization basics, adam optimizer
L5Tier 1~70m

Reinforcement Learning from Human Feedback: Deep Dive

The full RLHF pipeline: supervised fine-tuning, Bradley-Terry reward modeling, PPO with KL penalty, reward hacking via Goodhart, and the post-RLHF landscape of DPO, GRPO, and RLVR.

policy gradient theorem, rlhf and alignment
L4Tier 2~60m

Attention Mechanism Theory

Mathematical formulation of attention: scaled dot-product attention as soft dictionary lookup, why sqrt(d_k) scaling prevents softmax saturation, multi-head attention, and the connection to kernel methods.

matrix operations and properties, softmax and numerical stability
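The scaled dot-product formulation above fits in a few lines. A minimal sketch (NumPy assumed; shapes are illustrative) showing the sqrt(d_k) scaling and the max-subtraction trick that keeps softmax numerically stable.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Soft dictionary lookup: weights = softmax(QK^T / sqrt(d_k)), output = weights @ V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # scaling keeps score variance near 1
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one output row per query
```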
L3Tier 2~45m

Attention Mechanisms History

The evolution of attention from Bahdanau (2014) additive alignment to Luong dot-product attention to self-attention in transformers. How attention solved the fixed-length bottleneck of seq2seq models.

recurrent neural networks
L4Tier 2~45m

Attention Sinks and Retrieval Decay

Why transformers disproportionately attend to initial tokens (attention sinks), how StreamingLLM exploits this for infinite-length inference, and how retrieval accuracy degrades with distance and position within the context window.

attention mechanism theory, kv cache
L4Tier 2~55m

Attention Variants and Efficiency

Multi-head, multi-query, grouped-query, linear, and sparse attention: how each variant trades expressivity for efficiency, and when to use which.

attention mechanism theory, flash attention
L3Tier 2~35m

Bits, Nats, Perplexity, and BPB

The four units people use to measure language model quality, how they relate to each other, when to use each one, and how mixing them up leads to wrong conclusions.

information theory foundations, kl divergence
L5Tier 2~50m

Chain-of-Thought and Reasoning

Chain-of-thought prompting, why intermediate reasoning steps improve LLM performance, self-consistency, tree-of-thought, and the connection to inference-time compute scaling.

prompt engineering and in context learning
L5Tier 2~55m

Context Engineering

The discipline of building, routing, compressing, retrieving, and persisting context for LLMs: beyond prompt design into systems engineering for what the model sees.

kv cache, attention mechanism theory
L3Tier 2~45m

Decoding Strategies

How language models select output tokens: greedy decoding, beam search, temperature scaling, top-k sampling, and nucleus (top-p) sampling. The tradeoffs between coherence, diversity, and quality.

transformer architecture, softmax and numerical stability
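The sampling strategies above compose naturally: temperature rescales logits, then top-k or top-p masks the tail before sampling. A minimal sketch (NumPy assumed; the function name and logits are illustrative).

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Temperature, top-k, and nucleus (top-p) sampling over a single logit vector."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # tokens by descending probability
    keep = np.ones(len(probs), dtype=bool)
    if top_k is not None:                      # keep only the k most likely tokens
        keep[order[top_k:]] = False
    if top_p is not None:                      # smallest prefix with cumulative mass >= p
        cum = np.cumsum(probs[order])
        cutoff = np.searchsorted(cum, top_p) + 1
        keep[order[cutoff:]] = False
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()                       # renormalize over surviving tokens
    return int(rng.choice(len(probs), p=probs))

token = sample([2.0, 1.0, 0.5, -1.0], top_k=2, rng=np.random.default_rng(0))
print(token)  # 0 or 1: only the two highest-logit tokens survive top_k=2
```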
L5Tier 2~50m

Document Intelligence

Beyond OCR: understanding document layout, tables, figures, and structure using models that combine text, spatial position, and visual features to extract structured information from PDFs, invoices, and contracts.

multimodal rag
L5Tier 2~60m

DPO vs GRPO vs RL for Reasoning

Head-to-head comparison of preference optimization methods: DPO's implicit reward, GRPO's group-relative comparisons, and RL with verifier feedback for reasoning. When each works and when each fails.

rlhf and alignment, policy gradient theorem
L5Tier 2~45m

Edge and On-Device ML

Running models on phones, embedded devices, and edge servers: pruning, distillation, quantization, TinyML, and hardware-aware neural architecture search under memory, compute, and power constraints.

speculative decoding and quantization
L4Tier 2~55m

Efficient Transformers Survey

Sub-quadratic attention variants (linear attention, Linformer, Performer, Longformer, BigBird) and why FlashAttention, a hardware-aware exact method, made most of them unnecessary in practice.

attention variants and efficiency, flash attention
L5Tier 2~55m

Flash Attention

IO-aware exact attention: tile QKV matrices into SRAM-sized blocks, compute attention without materializing the full attention matrix in HBM, reducing memory reads/writes from quadratic to linear.

attention mechanism theory, softmax and numerical stability
L4Tier 2~40m

Forgetting Transformer (FoX)

FoX adds a data-dependent forget gate to softmax attention. The gate down-weights unnormalized attention scores between past and present positions, giving the transformer a learned, recency-biased decay. FoX is FlashAttention-compatible, works without positional embeddings, and improves long-context language modeling and length extrapolation.

attention mechanism theory, recurrent neural networks, transformer architecture
L5Tier 2~40m

Fused Kernels

Combine multiple GPU operations into a single kernel launch to eliminate intermediate HBM reads and writes. Why kernel fusion is the primary optimization technique for memory-bound ML operations.

gpu compute model
L5Tier 2~50m

GPU Compute Model

How GPUs execute ML workloads: streaming multiprocessors, warps, memory hierarchy (registers, SRAM, L2, HBM), arithmetic intensity, the roofline model, and why most ML operations are memory-bound.

L4Tier 2~50m

Induction Heads

Induction heads are attention head circuits that implement pattern completion: given a sequence like [A][B]...[A], they predict [B]. They are a leading candidate mechanism for in-context learning, with strong causal evidence in small attention-only models and correlational evidence in large transformers. They emerge through a phase transition during training.

attention mechanism theory, transformer architecture
L5Tier 2~60m

Inference Systems Overview

The modern LLM inference stack: batching strategies, scheduling, memory management with paged attention, model parallelism for serving, and why FLOPs do not equal latency when memory bandwidth is the bottleneck.

kv cache, speculative decoding and quantization
L5Tier 2~50m

Inference-Time Scaling Laws

How additional compute at inference time (repeated sampling, search, verification) improves output quality, why gains are task-dependent, and why verifier quality matters more than raw sample count.

scaling laws, test time compute and search
L3Tier 2~45m

Knowledge Distillation

Training a small student model to mimic a large teacher: soft targets, temperature scaling, dark knowledge, and why the teacher's mistakes carry useful information about class structure.

feedforward networks and backpropagation
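Soft targets and temperature are easy to see numerically: raising the temperature flattens the teacher distribution, exposing the relative probabilities of wrong classes. A minimal sketch (NumPy assumed; logits and the loss helper are illustrative).

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over a logit vector."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

teacher_logits = np.array([5.0, 2.0, 1.0])

# Higher temperature exposes "dark knowledge": how the teacher ranks the
# wrong classes, information a one-hot label throws away.
hard = softmax(teacher_logits, T=1.0)
soft = softmax(teacher_logits, T=4.0)
print(np.round(hard, 3), np.round(soft, 3))

def distill_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy of student vs teacher soft targets, scaled by T^2
    to keep gradient magnitudes comparable across temperatures."""
    p, q = softmax(teacher_logits, T), softmax(student_logits, T)
    return -T * T * float(p @ np.log(q))
```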
L5Tier 2~45m

KV Cache Optimization

Advanced techniques for managing the KV cache memory bottleneck: paged attention for fragmentation-free allocation, prefix caching for shared prompts, token eviction for long sequences, and quantized KV cache for reduced footprint.

kv cache
L5Tier 2~45m

KV Cache

Why autoregressive generation recomputes attention at every step, how caching past key-value pairs makes it linear, and the memory bottleneck that drives MQA, GQA, and paged attention.

attention mechanism theory
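The caching idea in one loop: append each new token's key and value once, then attend the new query against the whole cache, so step t costs O(t) instead of recomputing all pairs. A minimal sketch (NumPy assumed; the identity "projections" stand in for the real learned W_k, W_v, W_q matrices).

```python
import numpy as np

def attend(q, K, V):
    """Single-query attention against all cached keys/values."""
    s = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

d = 8
rng = np.random.default_rng(0)
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for step in range(5):                      # autoregressive decoding loop
    x = rng.normal(size=d)                 # hidden state of the newest token
    k, v, q = x, x, x                      # stand-in projections (real models use W_k x, etc.)
    K_cache = np.vstack([K_cache, k])      # append once; never recompute old K, V
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)      # O(t) work at step t instead of O(t^2)
print(K_cache.shape)  # (5, 8): one cached key per generated token
```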
L5Tier 2~50m

Latent Reasoning

Reasoning in hidden state space instead of generating chain-of-thought tokens: recurrent computation and continuous thought for scaling inference compute without scaling output length.

test time compute and search
L5Tier 2~50m

Memory Systems for LLMs

Taxonomy of LLM memory: short-term (KV cache), working (scratchpad), long-term (retrieval), and parametric (weights). Why extending context alone is insufficient and how memory consolidation works.

context engineering, kv cache
L4Tier 2~55m

Mixture of Experts

Sparse computation via learned routing: replace dense FFN layers with multiple expert networks, activate only a subset per token, and scale capacity without proportional compute.

transformer architecture
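Top-k routing is the core mechanism: a learned gate scores every expert, only the k best run, and their outputs are mixed by renormalized gate weights. A minimal sketch (NumPy assumed; the gate matrix and tanh experts are illustrative stand-ins for FFN experts).

```python
import numpy as np

def moe_forward(x, gate_W, experts, k=2):
    """Top-k routing: score all experts, run only the k best, mix their outputs."""
    logits = x @ gate_W                         # router score per expert
    top = np.argsort(logits)[::-1][:k]          # indices of the k highest-scoring experts
    g = np.exp(logits[top] - logits[top].max())
    g /= g.sum()                                # renormalized gate weights over the top-k
    return sum(w * experts[i](x) for w, i in zip(g, top))

rng = np.random.default_rng(0)
d, n_exp = 8, 4
gate_W = rng.normal(size=(d, n_exp))
Ws = [rng.normal(size=(d, d)) for _ in range(n_exp)]
experts = [(lambda W: (lambda x: np.tanh(x @ W)))(W) for W in Ws]
y = moe_forward(rng.normal(size=d), gate_W, experts, k=2)
print(y.shape)  # (8,): same width as a dense FFN, but only 2 of 4 experts ran
```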
L3Tier 2~50m

Model Compression and Pruning

Reducing model size without proportional accuracy loss: unstructured and structured pruning, magnitude pruning, the lottery ticket hypothesis, entropy coding for compressed weights, and knowledge distillation as compression.

feedforward networks and backpropagation
L5Tier 2~45m

Multi-Token Prediction

Predicting k future tokens simultaneously using auxiliary prediction heads: forces planning, improves code generation, and connects to speculative decoding.

transformer architecture
L5Tier 2~55m

Multimodal RAG

RAG beyond text: retrieving images, tables, charts, and PDFs alongside text. Document parsing, multimodal chunking, vision-language retrievers, agentic RAG, and reasoning RAG with chain-of-thought retrieval.

context engineering
L5Tier 2~45m

PaddleOCR and Practical OCR

A practitioner's guide to modern OCR toolkits: PaddleOCR's three-stage pipeline, TrOCR's transformer approach, EasyOCR, and Tesseract. When to use which, and what accuracy to expect.

document intelligence
L5Tier 2~50m

Parallel Processing Fundamentals

Data, tensor, pipeline, expert, and sequence parallelism: the five strategies for distributing model training and inference across multiple GPUs, and how frontier labs combine all of them.

distributed training theory
L3Tier 2~40m

Perplexity and Language Model Evaluation

Perplexity as exp(cross-entropy): the standard intrinsic metric for language models, its information-theoretic interpretation, connection to bits-per-byte, and why low perplexity alone does not guarantee useful generation.

information theory foundations
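The relationships above are pure arithmetic: cross-entropy in nats exponentiates to perplexity, and dividing by ln 2 converts to bits. A tiny worked example (the per-token probabilities are made up for illustration).

```python
import math

# Probabilities the model assigned to each observed token in a 4-token sequence
token_probs = [0.25, 0.5, 0.125, 0.25]

# Cross-entropy in nats: average negative log-likelihood per token
ce_nats = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(ce_nats)             # perplexity = exp(cross-entropy)
bits_per_token = ce_nats / math.log(2)     # nats -> bits

print(round(perplexity, 3), round(bits_per_token, 3))  # 4.0 2.0
```

A perplexity of 4 means the model is, on average, as uncertain as a uniform choice among 4 tokens.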
L5Tier 2~70m

Post-Training Overview

The full post-training stack in 2026: SFT, RLHF, DPO, GRPO, constitutional AI, verifier-guided training, and self-improvement loops. Why post-training is now its own discipline.

rlhf and alignment, transformer architecture
L5Tier 2~35m

Prefix Caching

Share computed KV cache entries across requests that share the same prefix. Radix trees over token sequences (as in RadixAttention) enable efficient prefix lookup. Significant latency savings for prefix-heavy production workloads.

kv cache, kv cache optimization
L5Tier 2~50m

Prompt Engineering and In-Context Learning

In-context learning allows LLMs to adapt to new tasks from examples in the prompt without weight updates. Theories for why it works, prompting strategies, and why prompt engineering is configuring inference-time computation.

transformer architecture
L5Tier 2~45m

Reasoning Data Curation

How to build training data for reasoning models: math with verified solutions, code with test cases, rejection sampling, external verification, self-play for problem generation, and the connection to RLVR.

post training overview
L4Tier 2~50m

Residual Stream and Transformer Internals

The residual stream as the central computational highway in transformers: attention and FFN blocks read from and write to it. Pre-norm vs post-norm, FFN as key-value memory, and the logit lens for inspecting intermediate representations.

transformer architecture
L4Tier 2~65m

RLHF and Alignment

The RLHF pipeline for aligning language models with human preferences: reward modeling, PPO fine-tuning, KL penalties, DPO, and why none of it guarantees truthfulness.

policy gradient theorem, markov decision processes
L5Tier 2~55m

Scaling Compute-Optimal Training

Chinchilla scaling: how to optimally allocate a fixed compute budget between model size and training data, why many models were undertrained, and the post-Chinchilla reality of data quality and inference cost.

scaling laws
L4Tier 2~65m

Scaling Laws

Power-law relationships between loss and compute, parameters, and data: Kaplan scaling, Chinchilla-optimal training, emergent abilities, and whether scaling laws are fundamental or empirical.

convex optimization basics
L4Tier 2~55m

Sparse Attention and Long Context

Standard attention is O(n^2). Sparse patterns (Longformer, Sparse Transformer, Reformer), ring attention for distributed sequences, streaming with attention sinks, and why extending context is harder than it sounds.

attention mechanism theory, flash attention
L4Tier 2~50m

Sparse Autoencoders for Interpretability

Sparse autoencoders decompose neural network activations into interpretable feature directions by learning an overcomplete dictionary with sparsity constraints. They are the primary tool for extracting monosemantic features from polysemantic neurons.

autoencoders, mechanistic interpretability
L5Tier 2~50m

Speculative Decoding and Quantization

Two core inference optimizations: speculative decoding for latency (draft-verify parallelism) and quantization for memory and throughput (reducing weight precision without destroying quality).

transformer architecture, kv cache
L5Tier 2~45m

Structured Output and Constrained Generation

Forcing LLM output to conform to schemas via constrained decoding: logit masking for valid tokens, grammar-guided generation, and why constraining the search space can improve both reliability and quality.

transformer architecture
L5Tier 2~65m

Test-Time Compute and Search

One of the biggest frontier shifts: spending more compute at inference through repeated sampling, verifier-guided search, MCTS for reasoning, chain-of-thought as compute, and latent reasoning.

scaling laws
L3Tier 2~45m

Token Prediction and Language Modeling

Language modeling as probability assignment over sequences. Autoregressive and masked prediction objectives, perplexity evaluation, and the connection between prediction and compression.

information theory foundations
L5Tier 2~50m

Tool-Augmented Reasoning

LLMs that call external tools during reasoning: Toolformer for learning when to invoke APIs, ReAct for interleaving thought and action, and code-as-thought for replacing verbal arithmetic with executed programs.

agentic rl and tool use, chain of thought and reasoning
L4Tier 2~60m

Training Dynamics and Loss Landscapes

The geometry of neural network loss surfaces: why saddle points dominate over local minima in high dimensions, how flat minima relate to generalization, and why SGD finds solutions that generalize.

convex optimization basics, the hessian matrix
L4Tier 2~70m

Transformer Architecture

The mathematical formulation of the transformer block: self-attention, multi-head attention, layer normalization, FFN blocks, positional encoding, and parameter counting.

attention mechanism theory, feedforward networks and backpropagation, softmax and numerical stability
L5Tier 3~35m

AMD Competition Landscape

AMD's MI300X and MI325X GPUs compete with NVIDIA on memory bandwidth and capacity but lag on software ecosystem. Competition matters because pricing, supply diversity, and vendor lock-in determine who can train and serve models.

gpu compute model
L5Tier 3~40m

ASML and Chip Manufacturing

ASML is the sole manufacturer of EUV lithography machines used to produce every advanced AI chip. Understanding the semiconductor supply chain reveals a critical concentration risk for AI compute.

L4Tier 3~45m

Attention as Kernel Regression

Softmax attention viewed as Nadaraya-Watson kernel regression: the output at each position is a kernel-weighted average of values, with the softmax kernel K(q,k) = exp(q^T k / sqrt(d)). Connects attention to classical nonparametric statistics and motivates linear attention via random features.

attention mechanism theory, kernels and rkhs
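The equivalence stated above can be verified directly: Nadaraya-Watson regression with the kernel K(q,k) = exp(q^T k / sqrt(d)) produces exactly the softmax-attention output. A minimal sketch (NumPy assumed; dimensions are illustrative).

```python
import numpy as np

def nadaraya_watson(q, keys, values, kernel):
    """Kernel-weighted average: sum_i K(q, k_i) v_i / sum_i K(q, k_i)."""
    w = np.array([kernel(q, k) for k in keys])
    return (w / w.sum()) @ values

d = 4
rng = np.random.default_rng(0)
q = rng.normal(size=d)
keys = rng.normal(size=(6, d))
values = rng.normal(size=(6, d))

softmax_kernel = lambda q, k: np.exp(q @ k / np.sqrt(d))   # K(q,k) = exp(q^T k / sqrt(d))
out_nw = nadaraya_watson(q, keys, values, softmax_kernel)

# Standard softmax attention for the same query gives the identical output
s = keys @ q / np.sqrt(d)
w = np.exp(s - s.max())
w /= w.sum()
out_attn = w @ values
print(np.allclose(out_nw, out_attn))  # True
```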
L5Tier 3~55m

Distributed Training Theory

Training frontier models requires thousands of GPUs. Data parallelism, model parallelism, and communication-efficient methods make this possible.

optimizer theory sgd adam muon
L5Tier 3~40m

Donut and OCR-Free Document Understanding

End-to-end document understanding without OCR: Donut reads document images directly and generates structured output, bypassing the error-prone OCR pipeline. Nougat extends this to academic paper parsing.

transformer architecture, document intelligence
L5Tier 3~45m

Model Merging and Weight Averaging

Combining trained models by averaging or interpolating their weights: SWA, SLERP, TIES-Merging, DARE. Why it works (loss landscape mode connectivity), when it fails, and applications to combining specialized models.

transformer architecture
L4Tier 3~50m

Neural Architecture Search

Automating network architecture design: search spaces, search strategies (RL, evolutionary, differentiable), performance estimation via weight sharing, and the gap between NAS hype and practical gains.

feedforward networks and backpropagation
L5Tier 3~45m

NVIDIA GPU Architectures

A concrete reference for the GPU hardware that determines what ML models can be trained and served: H100, H200, B200, GB200, and Rubin architectures, with emphasis on memory bandwidth as the primary bottleneck.

gpu compute model
L5Tier 3~45m

Plan-then-Generate

Beyond strict left-to-right token generation: plan what to say before producing tokens. Outline-then-fill, hierarchical generation, and multi-token prediction. Planning reduces incoherence in long outputs and enables structural editing.

transformer architecture
L4Tier 3~50m

Positional Encoding

Why attention needs position information, sinusoidal encoding, learned positions, RoPE (rotary position encoding via 2D rotations), ALiBi, and why RoPE became the default for modern LLMs.

attention mechanism theory
L5Tier 3~55m

Quantization Theory

Reduce model weight precision from FP32 to FP16, INT8, or INT4. Post-training quantization, quantization-aware training, GPTQ, AWQ, and GGUF. Quantization is how large language models actually get deployed.

softmax and numerical stability
L5Tier 3~40m

Table Extraction and Structure Recognition

Detecting tables in documents, recognizing row and column structure, and extracting cell content. Why tables are hard: merged cells, borderless layouts, nested headers, and cascading pipeline errors.

document intelligence
L4Tier 3~50m

Tokenization and Information Theory

Tokenization determines an LLM's vocabulary and shapes everything from compression efficiency to multilingual ability. Information theory explains what good tokenization looks like.

information theory foundations

Methodology & Experimental Design

Hypothesis testing, ablations, significance, reproducibility.

L3Tier 1~30m

The Bitter Lesson

Sutton's meta-principle: scalable general methods that exploit computation tend to beat hand-crafted domain-specific approaches in the long run. Search and learning win; brittle cleverness loses.

L3Tier 1~50m

Causal Inference and the Ladder of Causation

Pearl's three-level hierarchy: association, intervention, counterfactual. Structural causal models, do-calculus, the adjustment formula, and why prediction is not causation.

common probability distributions, bayesian estimation
L1Tier 1~45m

Confusion Matrices and Classification Metrics

The complete framework for evaluating binary classifiers: confusion matrix entries, precision, recall, F1, specificity, ROC curves, AUC, and precision-recall curves. When to use each metric.

common probability distributions
L1Tier 1~40m

Confusion Matrix Deep Dive

Beyond binary classification: multi-class confusion matrices, micro vs macro averaging, Matthews correlation coefficient, Cohen's kappa, cost-sensitive evaluation, and how to read a confusion matrix without getting the axes wrong.

L4Tier 1~35m

The Era of Experience

Sutton and Silver's thesis: the next phase of AI moves beyond imitation from human data toward agents that learn predominantly from their own experience. Text is not enough for general intelligence.

bitter lesson, markov decision processes
L1Tier 1~45m

Model Evaluation Best Practices

Train/validation/test splits, cross-validation, stratified and temporal splits, data leakage, reporting with confidence intervals, and statistical tests for model comparison. Why single-number comparisons are misleading.

confusion matrices and classification metrics
L1Tier 1~35m

Train-Test Split and Data Leakage

Why you need three data splits, how to construct them correctly, and the common ways information leaks from test data into training. Temporal splits, stratified splits, and leakage detection.

L1Tier 1~50m

Types of Bias in Statistics

A taxonomy of the major biases that corrupt statistical inference: selection bias, measurement bias, survivorship bias, response bias, attrition bias, publication bias, and their consequences for ML.

L3Tier 2~45m

Ablation Study Design

How to properly design ablation studies: remove one component at a time, measure the effect against proper baselines, and report results with statistical rigor.

hypothesis testing for ml
L1Tier 2~40m

Class Imbalance and Resampling

When class frequencies differ dramatically, standard accuracy is misleading. Resampling, cost-sensitive learning, and threshold tuning restore meaningful evaluation and training.

confusion matrices and classification metrics
L2Tier 2~40m

Convex Tinkering

Taleb's concept applied to ML research: designing small experiments with bounded downside and unbounded upside, and why this strategy dominates scale-first approaches under uncertainty.

common inequalities
L2Tier 2~50m

Evaluation Metrics and Properties

The metrics that determine whether a model is good: accuracy, precision, recall, F1, AUC-ROC, log loss, RMSE, calibration, and proper scoring rules. Why choosing the right metric matters more than improving the wrong one.

L1Tier 2~40m

Exploratory Data Analysis

The disciplined practice of looking at data before modeling: summary statistics, distributions, correlations, missing values, outliers, and class balance. You cannot model what you do not understand.

L2Tier 2~50m

Feature Importance and Interpretability

Methods for attributing model predictions to input features: permutation importance, SHAP values, LIME, partial dependence, and why none of these imply causality.

random forests, gradient boosting
L3Tier 2~50m

Federated Learning

Train a global model without centralizing data. FedAvg, communication efficiency, non-IID convergence challenges, differential privacy integration, and applications in healthcare and mobile computing.

distributed training theory
L1Tier 2~35m

Hardware for ML Practitioners

Practical hardware guidance for ML work: GPU memory as the real bottleneck, when local GPUs make sense, cloud options compared, and why you should not spend $5000 before knowing what you need.

L2Tier 2~55m

Hypothesis Testing for ML

Null and alternative hypotheses, p-values, confidence intervals, error types, multiple comparisons, and proper statistical tests for comparing ML models.

L2Tier 2~50m

Meta-Analysis

Combining results from multiple studies: fixed-effect and random-effects models, heterogeneity measures, publication bias, and why this matters for ML benchmarking.

hypothesis testing for ml, bayesian estimation
L1Tier 2~45m

ML Project Lifecycle

The full ML project workflow from problem definition through deployment and monitoring. Why most projects fail at data quality, not model architecture. Cross-functional requirements and MLOps basics.

L2Tier 2~45m

P-Hacking and Multiple Testing

How selective reporting and multiple comparisons inflate false positive rates, and how Bonferroni and Benjamini-Hochberg corrections control them. Why hyperparameter tuning is multiple testing and benchmark shopping is p-hacking.

hypothesis testing for ml
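Both corrections mentioned above fit in a few lines: Bonferroni divides the significance level by the number of tests, while Benjamini-Hochberg steps up through the sorted p-values. A minimal sketch (p-values are made up for illustration).

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up procedure: find the largest k with p_(k) <= (k/m) * alpha and
    reject the k smallest p-values (controls the false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

pvals = [0.001, 0.012, 0.03, 0.04, 0.20]
print(bonferroni(pvals))          # only p = 0.001 clears alpha/m = 0.01
print(benjamini_hochberg(pvals))  # FDR control rejects more: less conservative
```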
L2Tier 2~40m

Proper Scoring Rules

A scoring rule is proper if the expected score is maximized when the forecaster reports their true belief. Log score and Brier score are strictly proper. Accuracy is not. Why this matters for calibrated probability estimates.

evaluation metrics and properties
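The defining property is easy to check numerically: if your true belief is q, the report p that maximizes your expected log score or expected (negated) Brier score is p = q. A minimal sketch (the belief q = 0.7 and the grid search are illustrative).

```python
import math

q = 0.7  # the forecaster's true belief that the event occurs

def expected_log_score(p):
    """Expected log score of reporting p when the event occurs with probability q."""
    return q * math.log(p) + (1 - q) * math.log(1 - p)

def expected_brier(p):
    """Expected negated Brier score (higher is better)."""
    return -(q * (1 - p) ** 2 + (1 - q) * p ** 2)

grid = [i / 100 for i in range(1, 100)]
best_log = max(grid, key=expected_log_score)
best_brier = max(grid, key=expected_brier)
print(best_log, best_brier)  # both 0.7: honest reporting maximizes a strictly proper score
```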
L2Tier 2~50m

Reproducibility and Experimental Rigor

What it takes to make ML experiments truly reproducible: seeds, variance reporting, data hygiene, configuration management, and the discipline of multi-run evaluation.

L2Tier 2~50m

Statistical Significance and Multiple Comparisons

p-values, significance levels, confidence intervals, and the multiple comparisons problem. Bonferroni correction, Benjamini-Hochberg FDR control, and why these matter for model selection and benchmark evaluation.

hypothesis testing for ml
L3Tier 2~45m

Synthetic Data Generation

Using models to generate training data: LLM-generated instructions, diffusion-based image augmentation, code synthesis. When synthetic data helps (low-resource, privacy) and when it hurts (model collapse).

common probability distributions
L3Tier 3~45m

Benchmarking Methodology

What makes a good benchmark, how benchmarks fail (contamination, leaderboard gaming, single-number comparisons), and how to report results honestly with variance, seeds, and proper statistical practice.

evaluation metrics and properties, reproducibility and experimental rigor
L3Tier 3~55m

Causal Inference Basics

Correlation is not causation. The potential outcomes framework, average treatment effects, confounders, and the methods that let you estimate causal effects from data.

hypothesis testing for ml
L5Tier 3~40m

Energy Efficiency and Green AI

The compute cost of training frontier models, carbon footprint, FLOPs vs wall-clock time vs dollars, and why reporting efficiency matters. Efficient alternatives: distillation, pruning, quantization, and scaling laws for optimal compute allocation.

L2Tier 3~35m

Experiment Tracking and Tooling

MLflow, Weights and Biases, TensorBoard, Hydra, and the discipline of recording everything. What to log, how to compare experiments, and why you cannot reproduce what you did not record.

reproducibility and experimental rigor
L3Tier 3~45m

Official Statistics and National Surveys

How government statistical agencies produce population, economic, and social data through censuses and surveys, with quality frameworks and implications for ML practitioners using these datasets.

survey sampling methods

Training Techniques & Regularization

Adam, dropout, batch norm, data augmentation, learning rate schedules.

L2Tier 1~55m

Adam Optimizer

Adaptive moment estimation: first and second moment tracking, bias correction, AdamW vs Adam+L2, learning rate warmup, and when Adam fails compared to SGD.

gradient descent variants · stochastic gradient descent convergence
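The moment tracking and bias correction described above can be sketched in a few lines. This is a minimal scalar illustration, not a production optimizer; the hyperparameter defaults follow common Adam settings, and the toy quadratic objective is just for demonstration.

```python
import math

def adam_step(w, g, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter w with gradient g at step t."""
    m = b1 * m + (1 - b1) * g        # first moment: running mean of gradients
    v = b2 * v + (1 - b2) * g * g    # second moment: running mean of squared gradients
    m_hat = m / (1 - b1 ** t)        # bias correction: moments initialize at zero
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimize f(w) = w^2 (gradient 2w) starting from w = 5.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 3001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
```

Note that weight decay added to `g` here would be scaled by the adaptive denominator; AdamW instead applies decay directly to `w`, which is the distinction the blurb refers to.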
L2Tier 1~50m

Batch Normalization

Normalize activations within mini-batches to stabilize and accelerate training, then scale and shift with learnable parameters.

feedforward networks and backpropagation · expectation variance covariance moments
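The normalize-then-affine computation can be sketched for a single feature across a mini-batch; inference-time running statistics are omitted for brevity, and the toy batch values are illustrative.

```python
def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale by gamma and shift by beta."""
    n = len(x)
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / n
    x_hat = [(xi - mean) / (var + eps) ** 0.5 for xi in x]   # zero mean, unit variance
    return [gamma * xh + beta for xh in x_hat]               # learnable scale and shift

out = batchnorm_forward([1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0)
```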
L2Tier 1~45m

Dropout

Dropout as stochastic regularization: Bernoulli masking, inverted dropout scaling, implicit ensemble interpretation, Bayesian connection via MC dropout, and equivalence to L2 regularization in linear models.

feedforward networks and backpropagation · common probability distributions
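The Bernoulli masking with inverted scaling mentioned above can be sketched as follows; the scaling by 1/(1-p) at training time is what lets inference run with no dropout at all.

```python
import random

def inverted_dropout(x, p, training=True, rng=random):
    """Zero each unit with probability p; scale survivors by 1/(1-p)
    so the expected activation matches the no-dropout value."""
    if not training or p == 0.0:
        return list(x)
    keep = 1.0 - p
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]

rng = random.Random(0)
acts = [1.0] * 10000
dropped = inverted_dropout(acts, p=0.5, rng=rng)
mean_act = sum(dropped) / len(dropped)   # close to 1.0 in expectation
```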
L2Tier 1~45m

Learning Rate Scheduling

Why the learning rate affects the final loss more than any other hyperparameter, and how warmup, cosine decay, step decay, cyclic schedules, and the 1cycle policy control training dynamics.

stochastic gradient descent convergence
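One widely used combination from the list above, linear warmup followed by cosine decay, can be sketched as a pure function of the step count; the specific step counts and learning rates below are illustrative.

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps          # warmup phase
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```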
L1Tier 1~45m

Regularization in Practice

Practical guide to regularization: L2 (weight decay), L1 (sparsity), dropout, early stopping, data augmentation, and batch normalization. How each constrains model complexity and how to choose between them.

regularization theory
L2Tier 1~40m

Weight Initialization

Why initialization determines whether gradients vanish or explode, and how Xavier and He initialization preserve variance across layers.

feedforward networks and backpropagation · eigenvalues and eigenvectors
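He initialization, one of the two schemes named above, is just a variance choice: a minimal sketch, with the layer sizes picked only to make the empirical check meaningful.

```python
import math
import random

def he_init(fan_in, fan_out, rng):
    """He initialization: W_ij ~ N(0, 2/fan_in), chosen so ReLU layers
    preserve activation variance in expectation."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)
W = he_init(fan_in=500, fan_out=200, rng=rng)
flat = [w for row in W for w in row]
emp_var = sum(w * w for w in flat) / len(flat)   # close to 2/500
```

Xavier initialization replaces 2/fan_in with 2/(fan_in + fan_out), suited to symmetric activations like tanh.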
L3Tier 2~35m

Activation Checkpointing

Trade compute for memory by recomputing activations during the backward pass instead of storing them all. Reduces memory from O(L) to O(sqrt(L)) for L layers.

feedforward networks and backpropagation
L2Tier 2~45m

Batch Size and Learning Dynamics

How batch size affects what SGD finds: gradient noise, implicit regularization, the linear scaling rule, sharp vs flat minima, and the gradient noise scale as the key quantity governing the tradeoff.

stochastic gradient descent convergence · adam optimizer
L2Tier 2~45m

Data Augmentation Theory

Why data augmentation works as a regularizer: invariance injection, effective sample size, MixUp, CutMix, and the connection to Vicinal Risk Minimization.

L2Tier 2~35m

Label Smoothing and Regularization

Replace hard targets with soft targets to prevent overconfidence: the label smoothing formula, its connection to maximum entropy regularization, calibration effects, and when it hurts.

logistic regression
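The label smoothing formula mentioned above, in one common convention (mass 1-eps on the true class, the rest spread uniformly over the other K-1 classes); the original formulation instead mixes the one-hot target with the full uniform distribution, which differs only slightly.

```python
def smooth_labels(true_class, num_classes, eps=0.1):
    """Soft target: 1 - eps on the true class, eps spread over the others."""
    off = eps / (num_classes - 1)
    return [1.0 - eps if k == true_class else off for k in range(num_classes)]

target = smooth_labels(true_class=2, num_classes=4, eps=0.1)
# still a valid probability distribution, but never fully confident
```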
L3Tier 2~45m

Mixed Precision Training

Train with FP16 or BF16 for speed while keeping FP32 master weights for accuracy. Loss scaling, overflow prevention, and when mixed precision fails.

floating point arithmetic
L2Tier 3~35m

Curriculum Learning

Train on easy examples first, gradually increase difficulty. Curriculum learning can speed convergence and improve generalization, but defining difficulty is the hard part. Self-paced learning, anti-curriculum, and the connection to importance sampling.

AI Safety & Alignment

RLHF failure modes, hallucination theory, interpretability, reward hacking.

L4Tier 2~55m

Adversarial Machine Learning

Attacks across the ML lifecycle: evasion (adversarial examples), poisoning (corrupt training data), model extraction (steal via queries), privacy leakage (membership inference), and LLM jailbreaks. Why defenses are hard.

feedforward networks and backpropagation
L3Tier 2~50m

Calibration and Uncertainty Quantification

When a model says 70% confidence, is it right 70% of the time? Calibration measures the alignment between predicted probabilities and actual outcomes. Temperature scaling, Platt scaling, conformal prediction, and MC dropout provide practical tools for trustworthy uncertainty.

logistic regression
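Temperature scaling, the simplest tool in the list above, divides logits by a single scalar T fit on held-out data; a minimal sketch with illustrative logits.

```python
import math

def temperature_scale(logits, T):
    """Softmax over logits / T. T > 1 softens overconfident predictions;
    T is typically fit on a validation set by minimizing NLL."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

sharp = temperature_scale([2.0, 0.0, 0.0], T=1.0)   # plain softmax
soft = temperature_scale([2.0, 0.0, 0.0], T=2.0)    # lower max confidence
```

Because T rescales all logits equally, the argmax (and hence accuracy) is unchanged; only the confidence is recalibrated.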
L4Tier 2~50m

Catastrophic Forgetting

Fine-tuning a neural network on new data destroys knowledge of old data. Understanding the stability-plasticity dilemma and mitigation strategies (EWC, progressive networks, replay) is essential for continual learning and safe LLM fine-tuning.

L5Tier 2~50m

Constitutional AI

Anthropic's approach to alignment: replace human harmlessness labels with a constitution of principles that the model uses to self-evaluate, enabling scalable AI feedback.

rlhf and alignment
L3Tier 2~50m

Continual Learning and Forgetting

Learning sequentially without destroying previous knowledge: Elastic Weight Consolidation, progressive networks, replay methods, and the stability-plasticity tradeoff in deployed systems.

catastrophic forgetting
L5Tier 2~55m

Data Contamination and Evaluation

When training data overlaps test benchmarks, model scores become meaningless. Types of contamination, detection methods, dynamic benchmarks, and how to read evaluation claims skeptically.

hypothesis testing for ml
L3Tier 2~55m

Differential Privacy

Formal privacy guarantees for algorithms: epsilon-delta DP, Laplace and Gaussian mechanisms, composition theorems, DP-SGD for training neural networks, and the privacy-utility tradeoff.

common probability distributions
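The Laplace mechanism named above is the simplest of the listed mechanisms: add noise scaled to sensitivity/epsilon. A minimal sketch, using the fact that the difference of two iid exponentials is Laplace-distributed; the counting-query example is illustrative.

```python
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value + Laplace(0, sensitivity/epsilon) noise.
    Satisfies epsilon-DP for a query with the given L1 sensitivity."""
    scale = sensitivity / epsilon
    # difference of two iid Exp(mean=scale) draws is Laplace(0, scale)
    return true_value + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

rng = random.Random(0)
# counting query: one person changes the count by at most 1, so sensitivity = 1
releases = [laplace_mechanism(42, sensitivity=1.0, epsilon=1.0, rng=rng)
            for _ in range(20000)]
avg = sum(releases) / len(releases)   # unbiased: centers on the true count
```

Smaller epsilon means larger noise scale, which is the privacy-utility tradeoff in its rawest form.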
L3Tier 2~55m

Ethics and Fairness in ML

Fairness definitions (demographic parity, equalized odds, calibration), the impossibility theorem showing they cannot all hold simultaneously, bias sources, and mitigation strategies at each stage of the pipeline.

L5Tier 2~50m

LLM Application Security

The OWASP LLM Top 10: prompt injection, insecure output handling, training data poisoning, model denial of service, supply chain vulnerabilities, sensitive information disclosure, insecure plugin design, excessive agency, overreliance, and model theft. Standard application security for the GenAI era.

adversarial machine learning · rlhf and alignment
L4Tier 2~55m

Mechanistic Interpretability

Understanding what individual neurons and circuits compute inside neural networks: sparse autoencoders, superposition, induction heads, probing, and the limits of interpretability.

transformer architecture · principal component analysis
L5Tier 2~45m

Model Collapse and Data Quality

When models train on their own outputs, the learned distribution narrows, tails disappear, and quality degrades across generations. Why synthetic data feedback loops threaten pretraining data quality and how to mitigate them.

synthetic data generation
L3Tier 2~50m

Out-of-Distribution Detection

Methods for detecting when test inputs differ from training data, where naive softmax confidence fails and principled alternatives based on energy, Mahalanobis distance, and typicality succeed.

calibration and uncertainty
L5Tier 2~45m

Red-Teaming and Adversarial Evaluation

Systematically trying to make models produce harmful or incorrect outputs: manual and automated red-teaming, jailbreaks, prompt injection, adversarial suffixes, and why adversarial evaluation is necessary before deployment.

rlhf and alignment
L5Tier 2~50m

Reward Hacking

Goodhart's law for AI: when models exploit reward model weaknesses instead of being genuinely helpful, with examples (verbosity hacking, sycophancy) and structured mitigation strategies.

reward models and verifiers · rlhf and alignment
L5Tier 2~55m

Reward Models and Verifiers

Reward models trained on human preferences vs verifiers that check output correctness. Bradley-Terry models, process vs outcome rewards, Goodhart's law, and why verifiers are more robust.

rlhf and alignment
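The Bradley-Terry model named above is a logistic link on score differences; a minimal sketch of the preference probability a reward model is trained to maximize over labeled pairs.

```python
import math

def bt_prob(r_chosen, r_rejected):
    """Bradley-Terry: P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(r_rejected - r_chosen))
```

Reward-model training maximizes log bt_prob over human preference pairs; Goodhart failures arise because the scores r are only proxies for what humans actually value.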
L5Tier 2~55m

Verifier Design and Process Reward

Detailed treatment of verifier types, process vs outcome reward models, verifier-guided search, self-verification, and the connection to test-time compute scaling. How to design reward signals for reasoning models.

reward models and verifiers

Reinforcement Learning Theory

MDPs, Bellman, policy gradients, multi-agent, game theory.

L2Tier 1~55m

Kalman Filter

Optimal state estimation for linear Gaussian systems via recursive prediction and update steps using the Kalman gain.

common probability distributions · eigenvalues and eigenvectors
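The recursive predict-update cycle with the Kalman gain reduces, in the scalar random-walk case, to a few lines; the noise parameters and measurements below are illustrative.

```python
def kalman_1d(z_seq, x0, p0, q, r):
    """Scalar random-walk Kalman filter: alternate predict and update steps.
    q is process-noise variance, r is measurement-noise variance."""
    x, p = x0, p0
    for z in z_seq:
        p = p + q                  # predict: uncertainty grows by process noise
        k = p / (p + r)            # Kalman gain: how much to trust the measurement
        x = x + k * (z - x)        # update: move estimate toward the measurement
        p = (1.0 - k) * p          # updated uncertainty shrinks
    return x, p

# noisy measurements of a constant true state of 5.0
x, p = kalman_1d([5.1, 4.9, 5.2, 4.8, 5.0], x0=0.0, p0=100.0, q=1e-4, r=0.1)
```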
L2Tier 1~70m

Markov Decision Processes

The mathematical framework for sequential decision-making under uncertainty: states, actions, transitions, rewards, and the Bellman equations that make solving them possible.

convex optimization basics · concentration inequalities
L3Tier 1~65m

Policy Gradient Theorem

The fundamental result enabling gradient-based optimization of parameterized policies: the policy gradient theorem and the algorithms it spawns.

markov decision processes · convex optimization basics
L2Tier 1~55m

Value Iteration and Policy Iteration

The two foundational algorithms for solving MDPs exactly: value iteration applies the Bellman optimality operator until convergence, while policy iteration alternates between exact evaluation and greedy improvement.

markov decision processes
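Value iteration's repeated application of the Bellman optimality operator can be sketched directly; the two-state MDP below is a toy example whose optimal values can be checked by hand.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-10):
    """P[s][a] = list of (prob, next_state); R[s][a] = expected immediate reward.
    Apply the Bellman optimality operator until values stop changing."""
    V = [0.0] * len(P)
    while True:
        V_new = [max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                     for a in range(len(P[s])))
                 for s in range(len(P))]
        if max(abs(a - b) for a, b in zip(V, V_new)) < tol:
            return V_new
        V = V_new

# from s0: stay (reward 0) or move to s1 (reward 0); s1 self-loops with reward 1
# so V*(s1) = 1/(1-gamma) = 10 and V*(s0) = gamma * 10 = 9
P = [[[(1.0, 0)], [(1.0, 1)]], [[(1.0, 1)]]]
R = [[0.0, 0.0], [1.0]]
V = value_iteration(P, R)
```

Convergence is geometric because the Bellman operator is a gamma-contraction in the sup norm.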
L3Tier 2~55m

Actor-Critic Methods

The dominant paradigm for deep RL and LLM training: an actor (policy network) guided by a critic (value network), with advantage estimation, PPO clipping, and entropy regularization.

policy gradient theorem · q learning
L5Tier 2~60m

Agentic RL and Tool Use

The shift from passive sequence generation to autonomous multi-turn decision making. LLMs as RL policies, tool use as actions, ReAct, AgentRL, and why agentic RL differs from chat RLHF.

markov decision processes · policy gradient theorem
L2Tier 2~45m

Bayesian State Estimation

The filtering problem: recursively estimate a hidden state from noisy observations using predict-update cycles. Kalman filter for linear Gaussian systems, particle filters for the general case.

bayesian estimation · common probability distributions
L2Tier 2~45m

Exploration vs Exploitation

The fundamental tradeoff in sequential decision-making: exploit known good actions to collect reward now, or explore uncertain actions to discover potentially better strategies. Epsilon-greedy, Boltzmann exploration, UCB, count-based methods, and intrinsic motivation.

multi armed bandits theory · markov decision processes
L3Tier 2~55m

GraphSLAM and Factor Graphs

SLAM as graph optimization: poses as nodes, constraints as edges, factor graph representation, MAP estimation via nonlinear least squares, and the sparsity structure that makes large-scale mapping tractable.

L3Tier 2~50m

Markov Games and Self-Play

Multi-agent extensions of MDPs where multiple agents with separate rewards interact. Nash equilibria, minimax values in zero-sum games, and self-play as a training method.

markov decision processes
L2Tier 2~50m

Minimax and Saddle Points

Minimax theorems characterize when max-min equals min-max. Saddle points arise in zero-sum games, duality theory, GANs, and robust optimization.

convex optimization basics · convex duality
L4Tier 2~50m

Multi-Agent Collaboration

Multiple LLM agents working together on complex tasks: debate for improving reasoning, division of labor across specialist agents, structured communication protocols, and when multi-agent outperforms single-agent systems.

agentic rl and tool use · agent protocols mcp a2a
L2Tier 2~55m

Multi-Armed Bandits Theory

The exploration-exploitation tradeoff formalized: K arms, regret as the cost of not knowing the best arm, and algorithms (UCB, Thompson sampling) that achieve near-optimal regret bounds.

common probability distributions
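The UCB algorithm named above can be sketched in its classic UCB1 form: play the arm maximizing empirical mean plus an exploration bonus. The Bernoulli arm means and horizon below are illustrative.

```python
import math
import random

def ucb1(pull, n_arms, horizon, rng):
    """UCB1: after playing each arm once, pick the arm maximizing
    empirical mean + sqrt(2 ln t / n_pulls). Returns per-arm pull counts."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1                      # initialization: play each arm once
        else:
            a = max(range(n_arms),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        counts[a] += 1
        sums[a] += pull(a, rng)
    return counts

rng = random.Random(0)
means = [0.2, 0.5, 0.8]   # Bernoulli reward probabilities; arm 2 is best
counts = ucb1(lambda a, r: 1.0 if r.random() < means[a] else 0.0, 3, 5000, rng)
```

The logarithmic bonus is what yields the O(log T) regret bound: suboptimal arms are pulled only about (2 ln T)/gap² times.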
L3Tier 2~55m

No-Regret Learning

Online learning against adversarial losses: regret as cumulative loss minus the best fixed action in hindsight, multiplicative weights, follow the regularized leader, and why no-regret dynamics converge to Nash equilibria in zero-sum games.

L3Tier 2~55m

Offline Reinforcement Learning

Learning policies from a fixed dataset without environment interaction: distributional shift as the core challenge, conservative Q-learning (CQL) as the standard fix, and Decision Transformer as an alternative sequence modeling approach.

q learning
L3Tier 2~60m

Online Learning and Bandits

Sequential decision making with adversarial or stochastic feedback: the bandit setting, explore-exploit tradeoff, UCB, Thompson sampling, and regret bounds. Connections to RL and A/B testing.

no regret learning
L3Tier 2~55m

Policy Optimization: PPO and TRPO

Trust region and clipped surrogate methods for stable policy optimization: TRPO enforces a KL constraint for monotonic improvement, PPO replaces it with a simpler clipped objective, and PPO became the default optimizer for LLM post-training.

policy gradient theorem · actor critic methods
L3Tier 2~40m

Policy Representations

How to parameterize policies in reinforcement learning: categorical for discrete actions, Gaussian for continuous actions, and why the choice affects gradient variance and exploration.

markov decision processes
L2Tier 2~50m

Q-Learning

Model-free, off-policy value learning: the Q-learning update rule, convergence under Robbins-Monro conditions, and the deep Q-network revolution that introduced function approximation, experience replay, and the deadly triad.

value iteration and policy iteration
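The Q-learning update rule mentioned above fits in one line; the deterministic two-step chain below is a toy environment whose optimal Q-values can be verified by hand.

```python
def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Q-learning: move Q(s,a) toward r + gamma * max_a' Q(s', a').
    The max (rather than the action actually taken) is what makes it off-policy."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])

# chain s0 -> s1 -> s2 (terminal), rewards 0 then 1; one action per state
# optimal values: Q(s1) = 1, Q(s0) = 0 + 0.9 * 1 = 0.9
Q = [[0.0], [0.0], [0.0]]
for _ in range(200):
    q_update(Q, 0, 0, 0.0, 1)
    q_update(Q, 1, 0, 1.0, 2)
```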
L3Tier 2~55m

Self-Play and Multi-Agent RL

Self-play as a training paradigm for competitive games, fictitious play convergence, AlphaGo/AlphaZero, and the challenges of multi-agent reinforcement learning: non-stationarity, partial observability, and centralized training.

markov decision processes
L2Tier 2~50m

Temporal Difference Learning

Temporal difference methods bootstrap value estimates from other value estimates, enabling online, incremental learning without waiting for episode termination. TD(0), SARSA, and TD(lambda) with eligibility traces.

markov decision processes · value iteration and policy iteration
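The TD(0) bootstrap described above is a one-line update; the deterministic chain below (with gamma and rewards chosen for easy hand-checking) shows it converging to the true state values without waiting for episode returns.

```python
def td0_update(V, s, r, s_next, alpha=0.2, gamma=0.5):
    """TD(0): nudge V(s) toward the bootstrapped target r + gamma * V(s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

# deterministic chain s0 -> s1 -> s2 (terminal), rewards 1 then 2
# true values: V(s1) = 2, V(s0) = 1 + 0.5 * 2 = 2
V = [0.0, 0.0, 0.0]
for _ in range(500):
    td0_update(V, 0, 1.0, 1)
    td0_update(V, 1, 2.0, 2)
```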
L4Tier 3~50m

Active SLAM and POMDPs

Choosing robot actions to simultaneously map an environment and localize, formulated as a partially observable Markov decision process over belief states.

graphslam and factor graphs · markov decision processes
L5Tier 3~45m

Agent Protocols: MCP and A2A

The protocol layer for AI agents: MCP (Model Context Protocol) for tool access, A2A (Agent-to-Agent) for inter-agent communication, and why standardized interfaces matter for the agent ecosystem.

agentic rl and tool use
L4Tier 3~55m

Mean-Field Games

The many-agent limit of strategic interactions: as the number of agents goes to infinity, each agent solves an MDP against the population distribution, and equilibrium becomes a fixed-point condition on the mean field.

markov decision processes · mean field theory
L3Tier 3~50m

Options and Temporal Abstraction

The options framework for hierarchical RL: temporally extended actions with initiation sets, internal policies, and termination conditions. Semi-MDPs and learning options end-to-end.

markov decision processes · value iteration and policy iteration
L3Tier 3~50m

Particle Filters

Sequential Monte Carlo: represent the posterior over hidden states as a set of weighted particles, propagate through dynamics, reweight by likelihood, and resample to combat degeneracy.

metropolis hastings · importance sampling
L3Tier 3~40m

Reinforcement Learning Environments and Benchmarks

The standard RL evaluation stack: Gymnasium API, classic control tasks, Atari, MuJoCo, ProcGen, the sim-to-real gap, and why benchmark performance is a poor predictor of real-world RL capability.

markov decision processes
L4Tier 3~45m

Robust Adversarial Policies

Robust MDPs optimize against worst-case transition dynamics within an uncertainty set. Adversarial policies formalize distribution shift in RL as a game between agent and environment.

markov decision processes · minimax lower bounds
L4Tier 3~50m

Visual and Semantic SLAM

Replacing laser range finders with cameras for SLAM, and enriching maps with semantic labels to improve data association and planning.

graphslam and factor graphs

Beyond LLMs

JEPA, world models, vision-first AI, diffusion, state-space models.

L4Tier 2~50m

CLIP and OpenCLIP in Practice

CLIP learns a shared embedding space for images and text via contrastive learning on 400M pairs. Practical guide to zero-shot classification, image search, OpenCLIP variants, embedding geometry, and known limitations.

contrastive learning · vision transformer lineage
L4Tier 2~70m

Diffusion Models

Generative models that learn to reverse a noise-adding process: the math of score matching, denoising, SDEs, and why diffusion dominates image generation.

variational autoencoders
L4Tier 2~45m

Equilibrium and Implicit-Layer Models

Deep Equilibrium Models (DEQ) replace explicit depth with a fixed-point equation: instead of stacking L layers, solve for the equilibrium state where one more layer would not change the output. This enables infinite-depth networks with constant memory, using implicit differentiation for backprop.

skip connections and resnets · implicit differentiation
L4Tier 2~50m

Equivariant Deep Learning

Networks that respect symmetry: if the input transforms under a group action, the output transforms predictably. Equivariance generalizes translation equivariance in CNNs to rotations, permutations, and gauge symmetries, reducing sample complexity and improving generalization on structured data.

convolutional neural networks · graph neural networks
L5Tier 2~50m

Florence and Vision Foundation Models

Vision foundation models that unify classification, detection, segmentation, captioning, and VQA under a single pretrained backbone. Florence, Florence-2, and the path toward GPT-4V-style multimodal understanding.

vision transformer lineage · self supervised vision
L4Tier 2~55m

Flow Matching

Learn a velocity field that transports noise to data along straight-line paths. Simpler training than diffusion, faster sampling, and cleaner math.

diffusion models
L4Tier 2~55m

JEPA and Joint Embedding

LeCun's Joint Embedding Predictive Architecture: learning by predicting abstract representations rather than pixels, energy-based formulation, I-JEPA and V-JEPA implementations, and the connection to contrastive learning and world models.

autoencoders · variational autoencoders
L4Tier 2~60m

Mamba and State-Space Models

Linear-time sequence modeling via structured state spaces: S4, HiPPO initialization, selective state-space models (Mamba), and the architectural fork from transformers.

recurrent neural networks · attention mechanism theory
L4Tier 2~50m

Neural ODEs and Continuous-Depth Networks

Treating neural network depth as a continuous variable: the ODE formulation of residual networks, the adjoint method for memory-efficient backpropagation, connections to dynamical systems theory, and practical limitations.

skip connections and resnets · gradient flow and vanishing gradients · automatic differentiation
L4Tier 2~50m

Self-Supervised Vision

Learning visual representations without labels: contrastive methods (SimCLR, MoCo), self-distillation (DINO/DINOv2), and masked image modeling (MAE). Why self-supervised vision matters for transfer learning and label-scarce domains.

vision transformer lineage
L5Tier 2~40m

Test-Time Training and Adaptive Inference

Updating model parameters at inference time using self-supervised objectives on the test input itself. TTT layers replace fixed linear recurrences (as in Mamba) with learned update rules, blurring the boundary between training and inference.

stochastic gradient descent convergence · recurrent neural networks
L5Tier 2~50m

Video World Models

Turning pretrained video diffusion models into interactive world simulators: condition on actions to generate future frames, enabling RL agent training, robot planning, and game AI without physical environments.

world models and planning · diffusion models
L4Tier 2~55m

Vision Transformer Lineage

The evolution of visual representation learning: from CNNs (AlexNet, ResNet) to ViT (pure attention for images), Swin (hierarchical attention), and DINOv2 (self-supervised ViT with self-distillation), with connections to CLIP.

transformer architecture · convolutional neural networks
L4Tier 2~60m

World Models and Planning

Learning a model of the environment and planning inside it: Dreamer's latent dynamics, MuZero's learned model with MCTS, video world models, and why planning in imagination is the path to sample-efficient and safe AI.

markov decision processes
L5Tier 3~50m

Audio Language Models

Models that process and generate speech alongside text: audio tokenization, Whisper for transcription, end-to-end voice models, music generation, and the audio-language frontier.

speech and audio ml · transformer architecture
L5Tier 3~35m

Continuous Thought Machines

Neural networks that process information through continuous-time internal dynamics rather than discrete layer-by-layer computation. Inspired by neural ODEs and dynamical systems, these architectures let the network 'think' for a variable amount of internal time before producing an output.

neural odes · equilibrium and implicit models
L4Tier 3~45m

3D Gaussian Splatting

Represent a 3D scene as millions of 3D Gaussians, each with position, covariance, opacity, and color. Render by projecting to 2D and alpha-compositing. Real-time, high-quality novel view synthesis without neural networks at render time.

L4Tier 3~45m

Occupancy Networks and Neural Fields

Representing 3D geometry and appearance as continuous functions parameterized by neural networks: NeRF, occupancy networks, DeepSDF, volume rendering, and the connection to Gaussian splatting.

feedforward networks and backpropagation
L5Tier 3~40m

World Model Evaluation

How to measure whether a learned world model is useful: prediction accuracy, controllability (sim-to-real transfer), planning quality, and why long-horizon evaluation is hard.

world models and planning