Linear Layers, Shapes, and Memory
Implement a linear layer, track every tensor shape, estimate parameter memory and FLOPs, and verify gradients with finite differences.
Systems Track
A focused rebuild path from neural networks to transformer inference. The rule is strict: every concept becomes code, a shape ledger, a memory estimate, a test, and a short site note.
CUDA, TPUs, and RunPod come later. First we need the habit that makes those tools useful: looking at a forward pass and predicting shapes, memory movement, arithmetic cost, and likely bottlenecks.
Week 1 mental model
Read the layer from left to right: the input batch has B examples, the weight matrix converts each example from D_in features to D_out features, and the bias shifts each output feature.
Forward pass
In Y = X @ W + b, the shared D_in dimension cancels in the X @ W product: the input batch (B by D_in) meets the weights (D_in by D_out), the outside dimensions remain, and the result has shape B by D_out. Adding the bias (shape D_out) to every row gives the output.
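A minimal NumPy sketch of this, assuming fp32 tensors; the sizes B = 32, D_in = 64, D_out = 16 are illustrative, not prescribed above.

```python
import numpy as np

B, D_in, D_out = 32, 64, 16  # illustrative sizes, not fixed by the text

rng = np.random.default_rng(0)
X = rng.standard_normal((B, D_in)).astype(np.float32)      # input batch
W = rng.standard_normal((D_in, D_out)).astype(np.float32)  # weights
b = np.zeros(D_out, dtype=np.float32)                      # bias

Y = X @ W + b  # (B, D_in) @ (D_in, D_out) -> (B, D_out); b broadcasts over rows
assert Y.shape == (B, D_out)
```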
Backward pass
Weight gradient
dW = X.T @ dY
Match each input feature with each output-gradient signal.
Bias gradient
db = dY.sum(axis=0)
Add the output-gradient signal across all B examples.
Input gradient
dX = dY @ W.T
Send the loss signal back to the previous layer.
Here .T means transpose: flip rows and columns so the matrix multiply shapes line up.
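The three rules translate directly into NumPy; the sketch below continues the variables from the forward snippet. The finite-difference spot check at the end matches the Week 1 goal; the scalar loss L = (Y * dY).sum() is an assumed convenience, chosen because its gradient with respect to W is exactly X.T @ dY.

```python
dY = rng.standard_normal((B, D_out)).astype(np.float32)  # upstream loss signal

dW = X.T @ dY        # (D_in, B) @ (B, D_out) -> (D_in, D_out), matches W
db = dY.sum(axis=0)  # sum over the B examples  -> (D_out,),      matches b
dX = dY @ W.T        # (B, D_out) @ (D_out, D_in) -> (B, D_in),   matches X
assert dW.shape == W.shape and db.shape == b.shape and dX.shape == X.shape

# Finite-difference spot check on one weight entry, in float64 so rounding
# does not swamp the nudge. For L = (Y * dY).sum() with dY held fixed,
# dL/dW is exactly X.T @ dY.
eps = 1e-4
i, j = 0, 0
X64, b64, dY64 = X.astype(np.float64), b.astype(np.float64), dY.astype(np.float64)
W_plus, W_minus = W.astype(np.float64), W.astype(np.float64)
W_plus[i, j] += eps
W_minus[i, j] -= eps
L_plus = ((X64 @ W_plus + b64) * dY64).sum()
L_minus = ((X64 @ W_minus + b64) * dY64).sum()
numeric = (L_plus - L_minus) / (2 * eps)
assert np.isclose(numeric, dW[i, j], rtol=1e-4, atol=1e-4)
```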
Shape ledger
| Symbol | Shape | Meaning |
|---|---|---|
| X | B by D_in | The mini-batch: B examples, each with D_in input features. |
| W | D_in by D_out | The trainable weight matrix that maps input features to output features. |
| b | D_out | The bias: one learned offset for each output feature. |
| Y | B by D_out | The output: one D_out-dimensional result for each example. |
| dY | B by D_out | The upstream loss signal arriving from the next layer. |
| dW | D_in by D_out | How the loss changes with each weight. |
| db | D_out | How the loss changes with each bias value. |
| dX | B by D_in | The signal sent backward to the previous layer. |
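The ledger also fixes the Week 1 cost estimates. A sketch, assuming fp32 storage (4 bytes per value) and the common convention that one multiply-accumulate counts as 2 FLOPs; linear_costs is an illustrative helper, not an established API.

```python
def linear_costs(B: int, D_in: int, D_out: int, bytes_per_value: int = 4):
    """Parameter count, parameter memory, and forward FLOPs for one linear layer."""
    params = D_in * D_out + D_out             # W has D_in*D_out entries, b has D_out
    param_bytes = params * bytes_per_value    # fp32 -> 4 bytes per value
    flops = 2 * B * D_in * D_out + B * D_out  # multiply-adds in X @ W, plus bias adds
    return params, param_bytes, flops

print(linear_costs(32, 64, 16))  # -> (1040, 4160, 66048)
```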
Track roadmap
Week 1: Implement a linear layer, track every tensor shape, estimate parameter memory and FLOPs, and verify gradients with finite differences.
Week 2: Add ReLU, softmax or MSE, and a tiny training loop whose loss decreases under tests.
Week 3: Build single-head attention and explain why scores have shape [B, T, T].
Week 4: Connect a tiny decoder block to autoregressive inference and the KV-cache memory formula.