Linear Layers, Shapes, and Memory
Implement a linear layer, track every tensor shape, estimate parameter memory and FLOPs, and verify gradients with finite differences.
Systems Track
A focused rebuild path from neural networks to transformer inference. The rule is strict: every concept becomes code, a shape ledger, a memory estimate, a test, and a short site note.
CUDA, TPUs, and RunPod come later. First we need the habit that makes those tools useful: looking at a forward pass and predicting shapes, memory movement, arithmetic cost, and likely bottlenecks.
Week 1 mental model
Read the layer from left to right: the input batch has B examples, the weight matrix converts each example from D_in features to D_out features, and the bias shifts each output feature.
Forward pass
In Y = X @ W + b, the shared D_in dimension cancels in the X @ W product: the input batch (B by D_in) meets the weights (D_in by D_out), the outside dimensions remain, and the result has shape B by D_out. Adding the bias (shape D_out) to every row gives the output.
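A minimal NumPy sketch of this, assuming fp32 tensors; the sizes B = 32, D_in = 64, D_out = 16 are illustrative, not prescribed above.

```python
import numpy as np

B, D_in, D_out = 32, 64, 16  # illustrative sizes, not fixed by the text

rng = np.random.default_rng(0)
X = rng.standard_normal((B, D_in)).astype(np.float32)      # input batch
W = rng.standard_normal((D_in, D_out)).astype(np.float32)  # weights
b = np.zeros(D_out, dtype=np.float32)                      # bias

Y = X @ W + b  # (B, D_in) @ (D_in, D_out) -> (B, D_out); b broadcasts over rows
assert Y.shape == (B, D_out)
```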
Backward pass
Weight gradient
dW = X.T @ dY
Match each input feature with each output-gradient signal.
Bias gradient
db = dY.sum(axis=0)
Add the output-gradient signal across all B examples.
Input gradient
dX = dY @ W.T
Send the loss signal back to the previous layer.
Here .T means transpose: flip rows and columns so the matrix multiply shapes line up.
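The three rules translate directly into NumPy; the sketch below continues the variables from the forward snippet. The finite-difference spot check at the end matches the Week 1 goal; the scalar loss L = (Y * dY).sum() is an assumed convenience, chosen because its gradient with respect to W is exactly X.T @ dY.

```python
dY = rng.standard_normal((B, D_out)).astype(np.float32)  # upstream loss signal

dW = X.T @ dY        # (D_in, B) @ (B, D_out) -> (D_in, D_out), matches W
db = dY.sum(axis=0)  # sum over the B examples  -> (D_out,),      matches b
dX = dY @ W.T        # (B, D_out) @ (D_out, D_in) -> (B, D_in),   matches X
assert dW.shape == W.shape and db.shape == b.shape and dX.shape == X.shape

# Finite-difference spot check on one weight entry, in float64 so rounding
# does not swamp the nudge. For L = (Y * dY).sum() with dY held fixed,
# dL/dW is exactly X.T @ dY.
eps = 1e-4
i, j = 0, 0
X64, b64, dY64 = X.astype(np.float64), b.astype(np.float64), dY.astype(np.float64)
W_plus, W_minus = W.astype(np.float64), W.astype(np.float64)
W_plus[i, j] += eps
W_minus[i, j] -= eps
L_plus = ((X64 @ W_plus + b64) * dY64).sum()
L_minus = ((X64 @ W_minus + b64) * dY64).sum()
numeric = (L_plus - L_minus) / (2 * eps)
assert np.isclose(numeric, dW[i, j], rtol=1e-4, atol=1e-4)
```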
Shape ledger
| Symbol | Shape | Meaning |
|---|---|---|
| X | B by D_in | The mini-batch: B examples, each with D_in input features. |
| W | D_in by D_out | The trainable weight matrix that maps input features to output features. |
| b | D_out | The bias: one learned offset for each output feature. |
| Y | B by D_out | The output: one D_out-dimensional result for each example. |
| dY | B by D_out | The upstream loss signal arriving from the next layer. |
| dW | D_in by D_out | How the loss changes with each weight. |
| db | D_out | How the loss changes with each bias value. |
| dX | B by D_in | The signal sent backward to the previous layer. |
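The ledger also fixes the Week 1 cost estimates. A sketch, assuming fp32 storage (4 bytes per value) and the common convention that one multiply-accumulate counts as 2 FLOPs; linear_costs is an illustrative helper, not an established API.

```python
def linear_costs(B: int, D_in: int, D_out: int, bytes_per_value: int = 4):
    """Parameter count, parameter memory, and forward FLOPs for one linear layer."""
    params = D_in * D_out + D_out             # W has D_in*D_out entries, b has D_out
    param_bytes = params * bytes_per_value    # fp32 -> 4 bytes per value
    flops = 2 * B * D_in * D_out + B * D_out  # multiply-adds in X @ W, plus bias adds
    return params, param_bytes, flops

print(linear_costs(32, 64, 16))  # -> (1040, 4160, 66048)
```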
Track roadmap
Week 1: Implement a linear layer, track every tensor shape, estimate parameter memory and FLOPs, and verify gradients with finite differences.
Week 2: Add ReLU, softmax or MSE, and a tiny training loop whose loss decreases under tests.
Week 3: Build single-head attention and explain why scores have shape [B, T, T].
Week 4: Connect a tiny decoder block to autoregressive inference and the KV-cache memory formula.