Curated Track

LLM From Scratch

This path is for readers who want a clean route from token prediction to a tiny decoder-only GPT, then onward to the mechanics of modern LLM inference. Stage 1 ends at a small trainable model. Stage 2 explains why production systems care about KV cache, GQA, FlashAttention, prefix caching, and prefill versus decode.

Scope is intentionally narrow: decoder-only models only. RLHF, tool use, agents, and post-training recipes are out of scope for this first pass. TheoremPath text can serve as a toy corpus for tiny examples, but not as a serious pretraining dataset.

What This Path Is Optimizing For

Minimal end-to-end build path

No giant detours. The goal is to reach a tiny GPT-style model without skipping the causal-loss and decoder-block details.

Modern inference intuition

The second stage exists so "KV cache" and "FlashAttention" are not just words you memorize for interviews.

Honest gap marking

Existing pages are linked. Missing bridge pages are flagged explicitly so the conceptual jumps are visible instead of buried.

Stage 1: From Tokens to a Minimal GPT

This is the shortest credible route from language-model basics to a tiny decoder-only model. The main gap today is not transformer theory; it is the missing bridge material between count baselines, causal training, and a minimal end-to-end GPT build.

End State

I can explain and build a tiny decoder-only language model.

  • Existing topic: token-prediction-and-language-modeling — Why decoder-only models are trained as next-token predictors.
  • Existing topic: tokenization-and-information-theory — How text becomes a sequence and what a token budget really means.
  • Bridge topic: Bigram Language Models and Count Baselines (bigram-language-models-and-count-baselines) — Missing bridge page. The simplest trainable baseline before neural language models.
  • Bridge topic: Causal Masking and Shifted Next-Token Loss (causal-masking-and-shifted-next-token-loss) — Missing bridge page. Makes the decoder-only training setup explicit.
  • Existing topic: softmax-and-numerical-stability — Cross-entropy and logits without hand-waving over overflow issues.
  • Existing topic: word-embeddings — How tokens become vectors before attention does anything useful.
  • Bridge topic: Embedding, Unembedding, and Weight Tying (embedding-unembedding-and-weight-tying) — Missing bridge page. Connects token embeddings to the LM head cleanly.
  • Existing topic: attention-mechanism-theory — Single-head causal attention before full transformer stacks.
  • Existing topic: positional-encoding — Why token order is added rather than assumed.
  • Existing topic: transformer-architecture — The decoder block, residual stream, and MLP in one place.
  • New required topic: Minimal Decoder-Only Transformer From Scratch (minimal-decoder-only-transformer-from-scratch) — Missing capstone page. Turns the theory pages into one tiny GPT-style build.
  • New required topic: Training Small Language Models on Tiny Corpora (training-small-language-models-on-tiny-corpora) — Missing capstone page. Covers batching, overfitting, and what 'working' looks like.
  • Existing topic: perplexity-and-language-model-evaluation — How to tell whether the tiny model is actually learning anything.
  • Existing topic: decoding-strategies — Sampling, temperature, and why generation quality can change at inference time.

Stage 2: Modern Decoder Internals and Inference

Once the tiny GPT is clear, the real question becomes why modern deployments look the way they do. This stage is the modernization layer: cache growth, attention variants, memory bandwidth, prefill/decode separation, and the serving tradeoffs that matter in practice.

End State

I can explain why modern LLM inference uses RoPE, KV cache, GQA or MQA, FlashAttention, and prefill/decode separation.

  • Existing topic: residual-stream-and-transformer-internals — The right entry point once the tiny GPT is no longer mysterious.
  • Bridge topic: GPT-2 as a Reference Decoder-Only Model (gpt-2-as-a-reference-decoder-only-model) — Missing bridge page. A concrete small model to anchor the architecture discussion.
  • Existing topic: kv-cache — Why decoder inference is dominated by past keys and values, not just weights.
  • Bridge topic: Prefill vs. Decode and Why KV Cache Matters (prefill-vs-decode-and-why-kv-cache-matters) — Missing bridge page. The clean conceptual jump from model math to serving behavior.
  • Existing topic: kv-cache-optimization — MQA, GQA, quantized caches, and what memory pressure does to systems design.
  • Existing topic: flash-attention — Why faster attention is mostly a memory-traffic story.
  • Existing topic: attention-variants-and-efficiency — MQA, GQA, sparse patterns, and the deployment trade space.
  • Existing topic: sparse-attention-and-long-context — How longer context changes architecture rather than just scaling constants.
  • Existing topic: memory-systems-for-llms — Distinguishes parametric memory, retrieval, scratchpad, and KV cache.
  • Existing topic: prefix-caching — How repeated prompts change the economics of serving.
  • Existing topic: inference-systems-overview — Continuous batching, paged attention, TTFT, and throughput tradeoffs.
  • Existing topic: speculative-decoding-and-quantization — The main speed knobs once the base decoder story is clear.
  • Existing topic: multi-token-prediction — A modern extension once standard next-token decoding is already solid.
  • Bridge compare: BPE vs. Byte vs. Character-Level Modeling (bpe-vs-byte-vs-character-level-modeling) — Missing compare page. Makes tokenizer choice concrete for from-scratch readers.

Build Checkpoints

These checkpoints are the real spine of the path. If a checkpoint still feels magical, the pages above it are not yet doing enough work.

Checkpoint A: bigram baseline on a toy corpus

You can count transitions, estimate next-token probabilities, and explain why even a crude language model already uses the same factorization as GPT.
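This checkpoint can be sketched in a few lines of plain Python. The toy corpus and the helper name `next_token_probs` are illustrative choices, not from any path page; the point is that counting transitions already implements the same chain-rule factorization a GPT optimizes:

```python
from collections import Counter, defaultdict

# A made-up toy corpus, pre-tokenized on whitespace.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count transitions w1 -> w2.
counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    counts[w1][w2] += 1

def next_token_probs(w):
    """Maximum-likelihood estimate of P(next | w) from raw counts."""
    c = counts[w]
    total = sum(c.values())
    return {w2: n / total for w2, n in c.items()}

print(next_token_probs("the"))  # each successor of "the" occurs once -> 0.25 each
print(next_token_probs("sat"))  # "sat" is always followed by "on" -> 1.0
```

Smoothing for unseen transitions is exactly the kind of detail the missing bigram bridge page would cover.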

Checkpoint B: causal next-token training setup

You can explain teacher forcing, shifted labels, causal masking, and why cross-entropy is the default objective.
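A minimal sketch of that training setup: shifted labels for teacher forcing, a causal mask, and a numerically stable cross-entropy. The token ids and the vocabulary size of 8 are arbitrary illustration values:

```python
import math

tokens = [5, 1, 4, 2, 7]                   # a toy token-id sequence
inputs, targets = tokens[:-1], tokens[1:]  # teacher forcing: labels are inputs shifted by one

T = len(inputs)
# Causal mask: position t may attend only to positions k <= t.
mask = [[1 if k <= t else 0 for k in range(T)] for t in range(T)]

def cross_entropy(logits, target):
    """-log softmax(logits)[target], stabilized by subtracting the max logit."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

# One fake logit row per position over a vocab of 8; the loss averages positions.
logits = [[0.0] * 8 for _ in range(T)]
loss = sum(cross_entropy(row, tgt) for row, tgt in zip(logits, targets)) / T
print(loss)  # uniform logits over 8 tokens -> ln(8) ≈ 2.079
```

The ln(vocab size) value of an untrained model is also the natural sanity check for the training capstone: loss should start near it and fall.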

Linked pages: Causal Masking and Shifted Next-Token Loss · Softmax and Numerical Stability


Checkpoint C: one decoder block with embeddings, attention, norm, residual, and MLP

You can draw the minimal decoder block and say what each tensor is doing at each step.
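One way to see what each tensor is doing is a single pre-norm decoder block in NumPy. The widths, initialization scale, and ReLU MLP here are illustrative simplifications, not a prescription from the capstone pages:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8                                     # toy sequence length and model width

def layernorm(x):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + 1e-5)

def causal_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv            # (T, d) each
    scores = q @ k.T / np.sqrt(d)               # (T, T) pairwise scores
    scores = np.where(np.tril(np.ones((T, T))) == 1, scores, -1e9)  # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)               # row-wise softmax
    return w @ v                                # (T, d) mixed values

Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(d, 4 * d)) * 0.1          # MLP expands 4x ...
W2 = rng.normal(size=(4 * d, d)) * 0.1          # ... then projects back

x = rng.normal(size=(T, d))                     # stand-in for token + position embeddings
x = x + causal_attention(layernorm(x), Wq, Wk, Wv)  # pre-norm attention + residual
x = x + np.maximum(layernorm(x) @ W1, 0.0) @ W2     # pre-norm ReLU MLP + residual
print(x.shape)  # (4, 8): the residual stream keeps its shape through the block
```

Pre-norm (LayerNorm before each sublayer) is used here because it is the common modern convention; the transformer-architecture page may order the pieces differently.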

Checkpoint D: minimal GPT-style model that trains and samples

You can build and train a tiny decoder-only model, then sample from it without mystifying what the code is doing.

Linked pages: Minimal Decoder-Only Transformer From Scratch · Training Small Language Models on Tiny Corpora · Decoding Strategies


Checkpoint E: modern inference and systems intuition

You can explain why modern LLM serving talks about KV cache, GQA, FlashAttention, prefix caching, and prefill/decode separation.
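The prefill/decode split can be made concrete with a single-head NumPy sketch: prefill fills the cache once over the whole prompt, and each decode step appends one row of K and one of V instead of recomputing them for the entire prefix. All weights and shapes are toy placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

def attend(q, K, V):
    s = (K @ q) / np.sqrt(d)          # score the query against every cached position
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

def append(cache, row):
    return np.vstack([cache, row[None]])

K_cache, V_cache = np.empty((0, d)), np.empty((0, d))

# Prefill: run the whole prompt once, filling the cache in one pass.
prompt = rng.normal(size=(5, d))
for x in prompt:
    K_cache = append(K_cache, x @ Wk)
    V_cache = append(V_cache, x @ Wv)

# Decode: each new token adds one K row and one V row; past entries are never recomputed.
x_new = rng.normal(size=d)
K_cache = append(K_cache, x_new @ Wk)
V_cache = append(V_cache, x_new @ Wv)
out = attend(x_new @ Wq, K_cache, V_cache)
print(K_cache.shape)  # the cache grows one row per token: (6, 8)
```

The linear cache growth per layer and per head is exactly the memory pressure that motivates MQA, GQA, and quantized caches.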

Linked pages: KV Cache · Prefill vs. Decode and Why KV Cache Matters · FlashAttention · Inference Systems Overview

Compare Pages Around This Series

A few compare pages already help around the edges of this path. Two more are still missing and should be written once the core decoder path lands.

  • Existing compare: autoregressive-vs-diffusion — Useful once you want to contrast decoder-only LMs with non-AR generative models.
  • Existing compare: encoder-only-vs-decoder-only-vs-encoder-decoder — Good architectural orientation before diving into GPT-specific details.
  • Existing compare: rope-vs-alibi-vs-sinusoidal — Positional encoding choices after the base mechanism already makes sense.
  • Existing compare: multi-head-vs-multi-query-vs-gqa — The cleanest comparison page for KV-cache-aware attention variants.
  • Existing compare: rmsnorm-vs-layernorm — Helpful once you are reading modern decoder model cards.
  • New required compare: Temperature vs. Top-k vs. Top-p (temperature-vs-top-k-vs-top-p) — Missing compare page. The most obvious decoding comparison gap.
  • New required compare: Greedy vs. Beam vs. Nucleus (greedy-vs-beam-vs-nucleus) — Missing compare page. Useful once the decoding-strategies page is in the reader's head.
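Until the missing decoding compare pages land, here is a sketch of how the three knobs compose: temperature rescales logits, then top-k and top-p prune the candidate set before renormalizing. The function name and defaults are mine, not from any existing page:

```python
import math, random

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    rng = rng or random.Random(0)
    # Temperature rescales the logits before softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(l - m) for l in scaled]
    z = sum(probs)
    # Rank candidates by probability, highest first.
    items = sorted(((p / z, i) for i, p in enumerate(probs)), reverse=True)
    if top_k is not None:                 # top-k: keep only the k best tokens
        items = items[:top_k]
    if top_p is not None:                 # top-p: smallest prefix with mass >= top_p
        kept, mass = [], 0.0
        for p, i in items:
            kept.append((p, i))
            mass += p
            if mass >= top_p:
                break
        items = kept
    # Renormalize over the survivors and draw one index.
    z = sum(p for p, _ in items)
    r, acc = rng.random() * z, 0.0
    for p, i in items:
        acc += p
        if r <= acc:
            return i
    return items[-1][1]

print(sample([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_k=2))
```

Note that top_k=1 collapses to greedy decoding, which is one of the comparisons the planned Greedy vs. Beam vs. Nucleus page would make explicit.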

Practical Next Writing Targets

If we want the path to become truly self-contained, the highest-value additions are:

  • Stage 1 first: bigram baselines, causal masking, embedding-unembedding, and the minimal decoder-only transformer build page.
  • Stage 2 second: prefill versus decode, GPT-2 as a reference decoder, and tokenizer comparison as an explicit modeling choice.
  • Compare cleanup: temperature versus top-k versus top-p, then greedy versus beam versus nucleus so decoding questions have a direct home.