Curated Track
LLM From Scratch
This path is for readers who want a clean route from token prediction to a tiny decoder-only GPT, then onward to the mechanics of modern LLM inference. Stage 1 ends at a small trainable model. Stage 2 explains why production systems care about KV cache, GQA, FlashAttention, prefix caching, and prefill versus decode.
Scope is intentionally narrow: decoder-only models only. RLHF, tools, agents, and post-training recipes are out of scope for this first pass. TheoremPath text can be a toy corpus for tiny examples, but not a serious pretraining dataset.
What This Path Is Optimizing For
Minimal end-to-end build path
No giant detours. The goal is to reach a tiny GPT-style model without skipping the causal-loss and decoder-block details.
Modern inference intuition
The second stage exists so "KV cache" and "FlashAttention" are not just words you memorize for interviews.
Honest gap marking
Existing pages are linked. Missing bridge pages are flagged explicitly so the conceptual jumps are visible instead of buried.
Stage 1
Stage 1: From Tokens to a Minimal GPT
This is the shortest credible route from language-model basics to a tiny decoder-only model. The main gap today is not transformer theory; it is the missing bridge material between count baselines, causal training, and a minimal end-to-end GPT build.
End State
I can explain and build a tiny decoder-only language model.
token-prediction-and-language-modeling
Why decoder-only models are trained as next-token predictors.
tokenization-and-information-theory
How text becomes a sequence and what a token budget really means.
Bigram Language Models and Count Baselines
bigram-language-models-and-count-baselines
Missing bridge page. The simplest trainable baseline before neural language models.
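As a preview of what that bridge page would cover, here is a minimal count-based bigram model in plain Python. The toy corpus and the `<s>`/`</s>` boundary markers are illustrative choices, not a fixed recipe:

```python
from collections import Counter, defaultdict

# Toy corpus: a handful of short "sentences" over a tiny vocabulary.
corpus = ["the cat sat", "the dog sat", "the cat ran"]

# Count token-to-token transitions.
counts = defaultdict(Counter)
for line in corpus:
    tokens = ["<s>"] + line.split() + ["</s>"]
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1

# Normalize counts into next-token probabilities P(next | prev).
def next_token_probs(prev):
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

print(next_token_probs("the"))   # cat: 2/3, dog: 1/3
print(next_token_probs("cat"))   # sat: 1/2, ran: 1/2
```

The point of the baseline is that this lookup table and a GPT share the exact same interface: given a prefix, emit a distribution over the next token.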
Causal Masking and Shifted Next-Token Loss
causal-masking-and-shifted-next-token-loss
Missing bridge page. Makes the decoder-only training setup explicit.
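The core of that setup fits in a few lines. A sketch of shifted labels and a causal mask on a toy sequence (token ids are arbitrary):

```python
# A token sequence of length T yields T-1 (input, target) pairs:
# at position t the model sees tokens[:t+1] and must predict tokens[t+1].
tokens = [5, 2, 7, 9]           # toy token ids

inputs  = tokens[:-1]           # [5, 2, 7]
targets = tokens[1:]            # [2, 7, 9]  <- labels shifted left by one

# Causal mask: position i may attend only to positions j <= i.
T = len(inputs)
mask = [[1 if j <= i else 0 for j in range(T)] for i in range(T)]
for row in mask:
    print(row)
# [1, 0, 0]
# [1, 1, 0]
# [1, 1, 1]
```

Teacher forcing is just this: every position trains in parallel against its shifted label, with the mask guaranteeing no position can peek at its own answer.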
softmax-and-numerical-stability
Cross-entropy and logits without hand-waving over overflow issues.
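The overflow issue is concrete: a naive `exp(1000.0)` overflows a float. A minimal stable softmax in plain Python:

```python
import math

def stable_softmax(logits):
    # Subtracting the max leaves the result unchanged (softmax is
    # shift-invariant) but keeps exp() from overflowing.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# The naive version would compute exp(1000.0) and overflow here.
probs = stable_softmax([1000.0, 999.0, 998.0])
print([round(p, 4) for p in probs])   # [0.6652, 0.2447, 0.09]
```

The same max-subtraction trick underlies the log-sum-exp form of cross-entropy used in every serious training loop.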
word-embeddings
How tokens become vectors before attention does anything useful.
Embedding, Unembedding, and Weight Tying
embedding-unembedding-and-weight-tying
Missing bridge page. Connects token embeddings to the LM head cleanly.
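The clean connection is that one matrix plays both roles. A sketch with toy shapes (the vocab size, width, and random hidden state are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                      # tiny vocab size and model width
E = rng.normal(size=(V, d))       # one shared matrix

# Embedding: token id -> vector is a row lookup into E.
token_id = 3
x = E[token_id]                   # shape (d,)

# Unembedding (LM head): hidden state -> vocab logits reuses E transposed.
h = rng.normal(size=(d,))         # pretend this came out of the decoder stack
logits = h @ E.T                  # shape (V,)

print(x.shape, logits.shape)      # (4,) (10,)
```

Weight tying halves the parameter count spent on the vocabulary and forces the input and output spaces to agree, which is why small GPT builds usually tie by default.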
attention-mechanism-theory
Single-head causal attention before full transformer stacks.
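A minimal single-head version, written out with NumPy so every tensor shape is visible (weights and inputs are random placeholders):

```python
import numpy as np

def causal_attention(X, Wq, Wk, Wv):
    """Single-head causal self-attention over a (T, d) sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    T, dk = Q.shape
    scores = Q @ K.T / np.sqrt(dk)                 # (T, T) similarity scores
    mask = np.triu(np.ones((T, T)), k=1)           # 1s strictly above diagonal
    scores = np.where(mask == 1, -np.inf, scores)  # block future positions
    # Row-wise softmax (stable: subtract the row max first).
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # (T, dk) mixed values

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = causal_attention(X, Wq, Wk, Wv)
print(out.shape)   # (5, 8)
```

Note that the first row of the attention weights is forced to [1, 0, 0, ...], so position 0's output is just its own value vector, exactly what causality demands.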
transformer-architecture
The decoder block, residual stream, and MLP in one place.
Minimal Decoder-Only Transformer From Scratch
minimal-decoder-only-transformer-from-scratch
Missing capstone page. Turns the theory pages into one tiny GPT-style build.
Training Small Language Models on Tiny Corpora
training-small-language-models-on-tiny-corpora
Missing capstone page. Covers batching, overfitting, and what "working" looks like.
perplexity-and-language-model-evaluation
How to tell whether the tiny model is actually learning anything.
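The headline metric is one line of math: perplexity is the exponential of the average per-token negative log-likelihood. A quick sketch:

```python
import math

def perplexity(token_log_probs):
    # Perplexity = exp(average negative log-likelihood per token).
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns every token probability 1/4 has perplexity 4:
# it is "as confused as" a uniform choice among 4 options.
print(round(perplexity([math.log(0.25)] * 10), 6))
```

A tiny model that drives perplexity below the vocabulary size is already beating the uniform baseline; a bigram-level perplexity is the next bar to clear.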
decoding-strategies
Sampling, temperature, and why generation quality can change at inference time.
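Temperature and top-k are small transformations applied to the logits before sampling. A hedged sketch (the function name and example logits are illustrative, not a reference API):

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None):
    """Sample a token id from logits with temperature and optional top-k."""
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        # Keep only the top_k largest logits; mask the rest out entirely.
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= cutoff else -math.inf for x in scaled]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

random.seed(0)
logits = [2.0, 1.0, 0.1, -1.0]
# Low temperature sharpens toward the argmax; top_k=2 zeroes the tail.
print(sample_next(logits, temperature=0.1))
print(sample_next(logits, temperature=1.0, top_k=2))
```

The takeaway: the trained model is frozen at this point; all of the generation-quality variation comes from how you turn its distribution into a choice.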
Stage 2
Stage 2: Modern Decoder Internals and Inference
Once the tiny GPT is clear, the real question becomes why modern deployments look the way they do. This stage is the modernization layer: cache growth, attention variants, memory bandwidth, prefill/decode separation, and the serving tradeoffs that matter in practice.
End State
I can explain why modern LLM inference uses RoPE, KV cache, GQA or MQA, FlashAttention, and prefill/decode separation.
residual-stream-and-transformer-internals
The right entry point once the tiny GPT is no longer mysterious.
GPT-2 as a Reference Decoder-Only Model
gpt-2-as-a-reference-decoder-only-model
Missing bridge page. A concrete small model to anchor the architecture discussion.
kv-cache
Why decoder inference is dominated by past keys and values, not just weights.
Prefill vs. Decode and Why KV Cache Matters
prefill-vs-decode-and-why-kv-cache-matters
Missing bridge page. The clean conceptual jump from model math to serving behavior.
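The jump becomes concrete once you see that cached decoding and full recomputation give identical outputs. A sketch with one attention "layer" and random toy weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    # Single-query attention over all cached positions.
    s = q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

prompt = rng.normal(size=(5, d))    # 5 "prompt token" hidden states

# Prefill: process the whole prompt at once, filling the cache.
K_cache, V_cache = prompt @ Wk, prompt @ Wv

# Decode: each new token does ONE projection and attends over the cache.
x_new = rng.normal(size=(d,))
K_cache = np.vstack([K_cache, x_new @ Wk])
V_cache = np.vstack([V_cache, x_new @ Wv])
out_cached = attend(x_new @ Wq, K_cache, V_cache)

# Recomputing K and V for all 6 positions gives the same answer.
full = np.vstack([prompt, x_new])
out_full = attend(x_new @ Wq, full @ Wk, full @ Wv)
print(np.allclose(out_cached, out_full))   # True
```

Prefill is one big compute-bound matmul over the prompt; decode is many small memory-bound steps that mostly reread the cache. That asymmetry is the whole serving story.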
kv-cache-optimization
MQA, GQA, quantized caches, and what memory pressure does to systems design.
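The memory pressure is easy to put numbers on. A back-of-envelope sketch, using illustrative dimensions for a Llama-2-7B-like decoder (32 layers, 128-dim heads, fp16 cache; treat the exact figures as assumptions, not a spec):

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2):   # 2 bytes per element for fp16
    # Factor of 2 is for keys AND values, per layer, per KV head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

gib = 1024 ** 3
print(kv_cache_bytes(4096) / gib)                  # full MHA cache: 2.0 GiB
print(kv_cache_bytes(4096, n_kv_heads=8) / gib)    # GQA, 8 KV heads: 0.5 GiB
```

That 4x shrink per sequence is why GQA exists: it multiplies directly into how many concurrent requests fit on one GPU.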
attention-variants-and-efficiency
MQA, GQA, sparse patterns, and the deployment trade space.
sparse-attention-and-long-context
How longer context changes architecture rather than just scaling constants.
memory-systems-for-llms
Distinguishes parametric memory, retrieval, scratchpad, and KV cache.
inference-systems-overview
Continuous batching, paged attention, TTFT, and throughput tradeoffs.
speculative-decoding-and-quantization
The main speed knobs once the base decoder story is clear.
multi-token-prediction
A modern extension once standard next-token decoding is already solid.
BPE vs. Byte vs. Character-Level Modeling
bpe-vs-byte-vs-character-level-modeling
Missing compare page. Makes tokenizer choice concrete for from-scratch readers.
Build Checkpoints
These checkpoints are the real spine of the path. If a checkpoint still feels magical, the pages above it are not yet doing enough work.
Checkpoint A: bigram baseline on a toy corpus
You can count transitions, estimate next-token probabilities, and explain why even a crude language model already uses the same factorization as GPT.
Checkpoint B: causal next-token training setup
You can explain teacher forcing, shifted labels, causal masking, and why cross-entropy is the default objective.
Checkpoint C: one decoder block with embeddings, attention, norm, residual, and MLP
You can draw the minimal decoder block and say what each tensor is doing at each step.
Checkpoint D: minimal GPT-style model that trains and samples
You can build and train a tiny decoder-only model, then sample from it without mystifying what the code is doing.
Checkpoint E: modern inference and systems intuition
You can explain why modern LLM serving talks about KV cache, GQA, FlashAttention, prefix caching, and prefill/decode separation.
Compare Pages Around This Series
A few compare pages already help around the edges of this path. Two more are still missing and should be written once the core decoder path lands.
autoregressive-vs-diffusion
Useful once you want to contrast decoder-only LMs with non-AR generative models.
encoder-only-vs-decoder-only-vs-encoder-decoder
Good architectural orientation before diving into GPT-specific details.
rope-vs-alibi-vs-sinusoidal
Positional encoding choices after the base mechanism already makes sense.
multi-head-vs-multi-query-vs-gqa
The cleanest comparison page for KV-cache-aware attention variants.
rmsnorm-vs-layernorm
Helpful once you are reading modern decoder model cards.
Temperature vs. Top-k vs. Top-p
temperature-vs-top-k-vs-top-p
Missing compare page. The most obvious decoding comparison gap.
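The distinction that page would draw is mechanical: temperature rescales logits, top-k keeps a fixed count, and top-p keeps the smallest set whose probability mass reaches p. A sketch of just the top-p filter (function name is illustrative):

```python
def top_p_filter(probs, p=0.9):
    # Keep the smallest prefix of tokens, in descending-probability order,
    # whose cumulative probability reaches p; renormalize what survives.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

print(top_p_filter([0.4, 0.3, 0.2, 0.1], p=0.6))
```

Unlike top-k, the number of surviving tokens adapts to the shape of the distribution, which is the entire argument for nucleus sampling.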
Greedy vs. Beam vs. Nucleus
greedy-vs-beam-vs-nucleus
Missing compare page. Useful once the decoding-strategies page is in the reader's head.
Practical Next Writing Targets
If we want the path to become truly self-contained, the highest-value additions are:
- Stage 1 first: bigram baselines, causal masking, embedding-unembedding, and the minimal decoder-only transformer build page.
- Stage 2 second: prefill versus decode, GPT-2 as a reference decoder, and tokenizer comparison as an explicit modeling choice.
- Compare cleanup: temperature versus top-k versus top-p, then greedy versus beam versus nucleus so decoding questions have a direct home.