Systems Track

Deep Learning Systems From Scratch

A focused rebuild path from neural networks to transformer inference. The rule is strict: every concept becomes code, a shape ledger, a memory estimate, a test, and a short site note.

CUDA, TPUs, and RunPod come later. First we need the habit that makes those tools useful: looking at a forward pass and predicting shapes, memory movement, arithmetic cost, and likely bottlenecks.

Dependency Chain

1. tensor layout
2. matmul
3. linear layer
4. backprop
5. normalization
6. attention
7. transformer block
8. autoregressive inference
9. KV cache
10. memory bandwidth
11. roofline
12. CUDA / TPU
13. parallelism

Week 1 ready

Linear Layers, Shapes, and Memory

Implement a linear layer, track every tensor shape, estimate parameter memory and FLOPs, and verify gradients with finite differences.

Week 1 mental model

A linear layer is a shape-changing machine: the middle dimensions must match.

Read the layer from left to right: the input batch has B examples, the weight matrix converts each example from D_in features to D_out features, and the bias shifts each output feature.

Input batch
X: B by D_in
times

Weights
W: D_in by D_out
plus

Bias
b: D_out
gives

Output
Y: B by D_out

Shape rule: in X @ W, the shared D_in dimension cancels. The outside dimensions remain, so the result has shape B by D_out.
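The forward pass above can be sketched in a few lines of NumPy. The specific sizes (B=4, D_in=3, D_out=2) are illustrative, not prescribed by the text:

```python
import numpy as np

# Illustrative shapes: B=4 examples, D_in=3 input features, D_out=2 output features.
B, D_in, D_out = 4, 3, 2
rng = np.random.default_rng(0)

X = rng.standard_normal((B, D_in))      # input batch
W = rng.standard_normal((D_in, D_out))  # weights
b = rng.standard_normal(D_out)          # bias, broadcast across the batch

# The shared D_in dimension cancels; the outer dimensions B and D_out survive.
Y = X @ W + b
assert Y.shape == (B, D_out)
```

The bias add relies on NumPy broadcasting: b has shape (D_out,), and it is added to every row of the (B, D_out) matmul result.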

Backward pass

The gradient tells each object how to change.

Weight gradient

dW = X.T @ dY

Match each input feature with each output-gradient signal.

Bias gradient

db = dY.sum(axis=0)

Add the output-gradient signal across all B examples.

Input gradient

dX = dY @ W.T

Send the loss signal back to the previous layer.

Here .T means transpose: flip rows and columns so the matrix multiply shapes line up.
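The three gradient formulas can be checked directly against the shape ledger. This is a minimal sketch with the same illustrative sizes as before; the asserts confirm that each gradient has the shape of the object it updates:

```python
import numpy as np

B, D_in, D_out = 4, 3, 2
rng = np.random.default_rng(1)
X = rng.standard_normal((B, D_in))
W = rng.standard_normal((D_in, D_out))
dY = rng.standard_normal((B, D_out))  # upstream loss signal from the next layer

dW = X.T @ dY        # (D_in, B) @ (B, D_out) -> (D_in, D_out), matches W
db = dY.sum(axis=0)  # sum over the B examples            -> (D_out,), matches b
dX = dY @ W.T        # (B, D_out) @ (D_out, D_in) -> (B, D_in), matches X

assert dW.shape == W.shape
assert db.shape == (D_out,)
assert dX.shape == X.shape
```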

Shape ledger

What each symbol means

Symbol | Shape | Meaning
X | B by D_in | The mini-batch: B examples, each with D_in input features.
W | D_in by D_out | The trainable weight matrix that maps input features to output features.
b | D_out | The bias: one learned offset for each output feature.
Y | B by D_out | The output: one D_out-dimensional result for each example.
dY | B by D_out | The upstream loss signal arriving from the next layer.
dW | D_in by D_out | How the loss changes with each weight.
db | D_out | How the loss changes with each bias value.
dX | B by D_in | The signal sent backward to the previous layer.

Deliverables

  • NumPy linear forward pass with explicit shape checks
  • Shape ledger for the input batch, weights, bias, output, upstream gradient, weight gradient, bias gradient, and input gradient
  • Parameter count, byte estimate, and matmul FLOP estimate
  • Finite-difference gradient tests for weights, bias, and input
  • Short site note with review questions
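The parameter-count and FLOP deliverables reduce to a small formula. A sketch, assuming float32 parameters (4 bytes each) and counting one multiply plus one add per output element of the matmul:

```python
def linear_cost(B, D_in, D_out, bytes_per_param=4):
    """Parameter count, parameter bytes, and matmul FLOPs for one linear layer."""
    params = D_in * D_out + D_out           # weight matrix plus bias vector
    param_bytes = params * bytes_per_param  # float32 assumed
    flops = 2 * B * D_in * D_out            # multiply-add per output element
    return params, param_bytes, flops
```

For example, `linear_cost(1, 2, 3)` gives 9 parameters (a 2-by-3 weight matrix plus a 3-element bias), 36 bytes, and 12 FLOPs.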

Review questions

  • Why is there one bias value per output feature?
  • Why does the bias gradient add up contributions from every example in the batch?
  • If the input is B by D_in and the weights are D_in by D_out, what shape is the output?
  • Approximately how many FLOPs does the matmul cost?
  • How would a finite-difference test catch a wrong gradient?
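The last question has a concrete answer in code: perturb one parameter at a time, measure the change in the loss, and compare against the analytic gradient. A minimal sketch using a toy scalar loss (the sum of all outputs), for which the upstream gradient dY is all ones:

```python
import numpy as np

def loss(W, X):
    # Toy scalar loss: sum of every output element.
    return (X @ W).sum()

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))

# Analytic gradient: with dY = ones, dW = X.T @ dY.
dW = X.T @ np.ones((4, 2))

# Numerical gradient via central differences, one entry at a time.
eps = 1e-5
dW_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dW_num[i, j] = (loss(Wp, X) - loss(Wm, X)) / (2 * eps)

assert np.allclose(dW, dW_num, atol=1e-6)
```

A wrong analytic gradient (a missing transpose, a dropped sum over the batch) shows up immediately as a mismatch between dW and dW_num.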

Planned Sequence

Week 1 ready

Linear Layers, Shapes, and Memory

Implement a linear layer, track every tensor shape, estimate parameter memory and FLOPs, and verify gradients with finite differences.

Week 2 planned

Manual Backprop and Toy Training

Add ReLU, softmax or MSE, and a tiny training loop whose loss decreases under tests.

Week 3 planned

Attention Shape Ledger

Build single-head attention and explain why scores have shape [B, T, T].
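The [B, T, T] shape falls directly out of the same cancellation rule as the linear layer. A sketch with illustrative sizes: queries and keys share a head dimension d, that dimension cancels in the batched matmul, and the two sequence axes remain:

```python
import numpy as np

B, T, d = 2, 5, 8  # illustrative: batch, sequence length, head dimension
rng = np.random.default_rng(3)
Q = rng.standard_normal((B, T, d))
K = rng.standard_normal((B, T, d))

# Every query position scores against every key position:
# (B, T, d) @ (B, d, T) -> (B, T, T), scaled by sqrt(d).
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)
assert scores.shape == (B, T, T)
```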

Week 4 planned

Transformer Forward Pass and KV Cache

Connect a tiny decoder block to autoregressive inference and the KV-cache memory formula.
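The KV-cache memory formula previewed here is simple enough to state now. A sketch of the standard accounting, assuming keys and values are each cached with shape (B, T, n_heads, d_head) per layer; the example numbers in the comment are illustrative, not tied to any specific model in this course:

```python
def kv_cache_bytes(B, T, n_layers, n_heads, d_head, bytes_per_elem=2):
    """Total KV-cache size: keys and values (factor of 2), per layer, per position."""
    return 2 * B * T * n_layers * n_heads * d_head * bytes_per_elem

# e.g. kv_cache_bytes(1, 4096, 32, 32, 128) for a 7B-class decoder at fp16
```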