Learning Track
Engineer to ML Theory
You can train models and ship code. This path helps you understand why things work, when they break, and what the theorems actually say. No unnecessary math. Just the theory that changes how you think about models.
Phase 1: Why Your Model Works (or Doesn't)
You already use these tools. Now understand what they're doing mathematically.
Phase 2: The Optimizer You're Actually Using
Adam, SGD, learning rate schedules. What the math says about convergence.
Phase 3: Transformers from the Inside
Not just how to use them. How attention actually computes, what the math says, and why the architecture works.
Phase 4: Scaling and Training at Scale
The math behind training large models. Scaling laws, RLHF, and post-training.
Phase 5: Why Things Break
The theory behind failure modes like double descent and grokking, and the interpretability work that explains them.
Already know some theory?
Skip to whatever phase matches your current understanding. Each page shows prerequisites in the sidebar so you can always check what you might be missing.
Or ask a specific question →