Linear Layer: Shapes, Bias, and Memory
A systems-first note on the linear layer: tensor shapes, the bias term, forward pass, backward pass, parameter memory, FLOPs, and finite-difference gradient tests.
Why This Matters
The linear layer is the first place where deep learning becomes a systems object instead of a drawing of neurons. It has tensor shapes, parameter memory, activation memory, FLOPs, and a backward pass that can be tested. If you cannot predict these quantities for a single layer, attention, transformer blocks, KV cache, and accelerator performance will feel like vocabulary rather than engineering.
It is also the first useful bridge from matrix calculus to attention, transformer architecture, and KV-cache inference.
This note is Week 1 of the Deep Learning Systems From Scratch track. The rule is simple: every concept becomes code, a shape ledger, a memory estimate, a test, and review questions.
Core Object
Linear layer
For a mini-batch $X \in \mathbb{R}^{B \times d_{\text{in}}}$, weights $W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$, and bias $b \in \mathbb{R}^{d_{\text{out}}}$, a linear layer computes

$$Y = XW + b,$$

where $Y \in \mathbb{R}^{B \times d_{\text{out}}}$.

The bias broadcasts across the batch dimension. It gives each output coordinate an intercept. Without $b$, the layer is a purely linear map forced through the origin of feature space, which is unnecessarily restrictive.
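A minimal sketch of the broadcast, with toy numbers chosen only for illustration:

```python
import numpy as np

# Toy sizes chosen only for illustration: B = 2 rows, d_out = 3 outputs.
XW = np.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])    # shape (2, 3): the product X @ W
b = np.array([10.0, 20.0, 30.0])    # shape (3,): one intercept per output

Y = XW + b                          # b is broadcast across the batch dimension
# Y == [[11., 22., 33.],
#       [14., 25., 36.]]
```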
Shape Ledger
| Quantity | Shape | Meaning |
|---|---|---|
| $X$ | $(B, d_{\text{in}})$ | mini-batch input |
| $W$ | $(d_{\text{in}}, d_{\text{out}})$ | trainable weights |
| $b$ | $(d_{\text{out}},)$ | trainable bias, broadcast over the batch dimension |
| $Y$ | $(B, d_{\text{out}})$ | layer output |
| $dY = \partial L / \partial Y$ | $(B, d_{\text{out}})$ | upstream gradient |
| $dW$ | $(d_{\text{in}}, d_{\text{out}})$ | weight gradient |
| $db$ | $(d_{\text{out}},)$ | bias gradient |
| $dX$ | $(B, d_{\text{in}})$ | input gradient |
The safest shape habit is to write the ledger before writing code. If the ledger is wrong, the implementation will only be accidentally right.
Forward Pass
```python
import numpy as np

def linear_forward(X: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Compute Y = X @ W + b for a mini-batch X of shape (B, d_in)."""
    assert X.ndim == 2                 # (B, d_in)
    assert W.ndim == 2                 # (d_in, d_out)
    assert b.ndim == 1                 # (d_out,)
    assert X.shape[1] == W.shape[0]    # shared d_in dimension
    assert W.shape[1] == b.shape[0]    # bias matches d_out
    return X @ W + b                   # b broadcasts across the batch dimension
```
The matrix product contracts the shared dimension $d_{\text{in}}$:

$$Y_{ij} = \sum_{k=1}^{d_{\text{in}}} X_{ik} W_{kj} + b_j.$$
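To connect the index formula to the vectorized call, a deliberately naive triple-loop version (the name `linear_forward_loops` is just for this sketch) can be checked against `X @ W + b`:

```python
def linear_forward_loops(X, W, b):
    # Explicit-loop version of Y[i, j] = sum_k X[i, k] * W[k, j] + b[j].
    B, d_in = X.shape
    d_out = W.shape[1]
    Y = np.zeros((B, d_out))
    for i in range(B):
        for j in range(d_out):
            acc = 0.0
            for k in range(d_in):
                acc += X[i, k] * W[k, j]
            Y[i, j] = acc + b[j]
    return Y

# Arbitrary small shapes, only to exercise the comparison.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 5))
b = rng.standard_normal(5)
assert np.allclose(linear_forward_loops(X, W, b), X @ W + b)
```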
Backward Pass
Linear Layer Backward Pass
Statement
For the linear layer $Y = XW + b$, the gradients are

$$dW = X^\top dY, \qquad db = \sum_{i=1}^{B} dY_{i,:},$$

and

$$dX = dY\, W^\top.$$
Intuition
The weight gradient pairs each input coordinate with each output-gradient coordinate. The bias gradient sums over the batch because the same bias is used for every row. The input gradient sends the upstream signal backward through the transpose of the weights.
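In index form, with batch index $i$, input index $k$, and output index $j$, the same three gradients read

$$dW_{kj} = \sum_{i=1}^{B} X_{ik}\, dY_{ij}, \qquad db_j = \sum_{i=1}^{B} dY_{ij}, \qquad dX_{ik} = \sum_{j=1}^{d_{\text{out}}} dY_{ij}\, W_{kj}.$$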
Why It Matters
These three equations are the smallest useful backpropagation testbed. If they are wrong, deeper networks silently learn the wrong update.
```python
def linear_backward(X: np.ndarray, W: np.ndarray, dY: np.ndarray):
    """Given upstream gradient dY = dL/dY, return (dX, dW, db)."""
    assert X.ndim == 2 and W.ndim == 2 and dY.ndim == 2
    assert X.shape[0] == dY.shape[0]   # same batch size B
    assert W.shape[1] == dY.shape[1]   # same d_out
    dW = X.T @ dY          # (d_in, d_out): pairs inputs with output gradients
    db = dY.sum(axis=0)    # (d_out,): sum over the batch dimension
    dX = dY @ W.T          # (B, d_in): route the signal back through W^T
    return dX, dW, db
```
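A quick shape check, with arbitrary example sizes, confirms that each gradient matches the tensor it updates:

```python
rng = np.random.default_rng(1)
B, d_in, d_out = 8, 16, 4
X = rng.standard_normal((B, d_in))
W = rng.standard_normal((d_in, d_out))
b = rng.standard_normal(d_out)
dY = rng.standard_normal((B, d_out))   # stand-in upstream gradient

dX, dW, db = linear_backward(X, W, dY)
assert dX.shape == X.shape
assert dW.shape == W.shape
assert db.shape == b.shape
```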
Parameter Memory and FLOPs
The parameter count is

$$d_{\text{in}}\, d_{\text{out}} + d_{\text{out}}.$$

For fp32 parameters, parameter memory is roughly

$$4 \left( d_{\text{in}}\, d_{\text{out}} + d_{\text{out}} \right) \ \text{bytes}.$$

The forward matmul costs approximately

$$2\, B\, d_{\text{in}}\, d_{\text{out}} \ \text{FLOPs},$$

counting one multiply and one add for each input-output pair. A rough lower bound on forward memory traffic is

$$B\, d_{\text{in}} + d_{\text{in}}\, d_{\text{out}} + d_{\text{out}} + B\, d_{\text{out}}$$

elements read or written (read $X$, $W$, $b$; write $Y$), times the number of bytes per element.
This estimate is intentionally rough. Real hardware uses caches, tiling, kernel fusion, vectorization, and mixed precision. The point of the ledger is to know the first-order cost before profiling.
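The ledger is easy to automate. A small helper along these lines (the function name and the example sizes are illustrative, not part of this note's API) turns the formulas above into numbers:

```python
def linear_cost_ledger(B, d_in, d_out, bytes_per_element=4):
    # First-order cost model for Y = X @ W + b, following the formulas above.
    params = d_in * d_out + d_out                                 # weights + bias
    param_bytes = params * bytes_per_element                      # 4 bytes/param for fp32
    flops = 2 * B * d_in * d_out                                  # one multiply + one add per pair
    traffic_elems = B * d_in + d_in * d_out + d_out + B * d_out   # read X, W, b; write Y
    return {
        "params": params,
        "param_bytes": param_bytes,
        "forward_flops": flops,
        "traffic_bytes": traffic_elems * bytes_per_element,
    }

# Example: a 4096 -> 4096 projection at batch size 32, fp32.
print(linear_cost_ledger(B=32, d_in=4096, d_out=4096))
```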
Finite-Difference Test
A finite-difference test checks whether the analytical gradient predicts a small perturbation in the loss:

$$\frac{\partial L}{\partial \theta} \approx \frac{L(\theta + \epsilon) - L(\theta - \epsilon)}{2\epsilon}.$$
```python
def finite_difference_param(loss_fn, param, index, eps=1e-5):
    """Central-difference estimate of dL/d(param[index])."""
    old = param[index]
    param[index] = old + eps
    plus = loss_fn()            # L(theta + eps)
    param[index] = old - eps
    minus = loss_fn()           # L(theta - eps)
    param[index] = old          # restore the original value
    return (plus - minus) / (2 * eps)
```
Use this against selected entries of $W$, $b$, and $X$. It is slower than backprop, but it catches sign errors, missing batch sums, and transpose bugs.
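A minimal comparison against `linear_backward`, using a stand-in squared-error loss and an arbitrary target `T` so that `loss_fn` returns a scalar, might look like this:

```python
rng = np.random.default_rng(2)
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))
b = rng.standard_normal(2)
T = rng.standard_normal((4, 2))        # arbitrary target for the toy loss

def loss_fn():
    # Toy loss: L = 0.5 * ||X W + b - T||^2, reading the current W in place.
    Y = linear_forward(X, W, b)
    return 0.5 * np.sum((Y - T) ** 2)

Y = linear_forward(X, W, b)
dY = Y - T                              # dL/dY for the toy loss
_, dW, _ = linear_backward(X, W, dY)

# Spot-check one weight entry against the numerical estimate.
numeric = finite_difference_param(loss_fn, W, (1, 0))
assert abs(numeric - dW[1, 0]) < 1e-6
```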
Common Bugs
- Treating $W$ as $(d_{\text{in}}, d_{\text{out}})$ in one place and $(d_{\text{out}}, d_{\text{in}})$ in another.
- Forgetting that $db$ sums over the batch dimension.
- Letting broadcasting hide a wrong bias shape (see the sketch after this list).
- Testing only output values and never testing gradients.
- Counting parameters but forgetting activation memory.
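A minimal sketch of the broadcasting bug, with arbitrary toy shapes: a bias of shape `(1, d_out)` or even `(B, 1)` may broadcast without error, so the forward pass "works" while the layer no longer computes the intended affine map.

```python
rng = np.random.default_rng(3)
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 5))

b_bad = rng.standard_normal((4, 1))    # wrong: one value per row, not per output
Y_bad = X @ W + b_bad                  # broadcasts silently across columns
print(Y_bad.shape)                     # (4, 5), no error raised

b_good = rng.standard_normal(5)        # right: shape (d_out,)
Y_good = X @ W + b_good
```

The `b.ndim == 1` assert in `linear_forward` above rejects both bad variants, which is one argument for routing every call through the function instead of writing the matmul inline.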
Review Questions
- Why does the bias $b$ have shape $(d_{\text{out}},)$?
- Why does $db$ sum over the batch dimension?
- What are the parameter counts for $W$ and $b$?
- What is the output shape of $XW + b$?
- How many approximate FLOPs does the matmul cost?
- What memory must be read and written?
- Why is this operation usually better suited to batching?
- How would a finite-difference test catch a wrong gradient?
References
Canonical:
- Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press, Chapters 6-8.
- Petersen, K. B. and Pedersen, M. S. (2012). "The Matrix Cookbook."
- Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning representations by back-propagating errors." Nature.
Current:
- PyTorch documentation. torch.nn.Linear.
- NumPy documentation. numpy.matmul.
- Stanford CS231n notes. "Backpropagation, Intuitions."
Exercises
Problem
Let $X$ have shape $(B, d_{\text{in}})$, $W$ have shape $(d_{\text{in}}, d_{\text{out}})$, and $b$ have shape $(d_{\text{out}},)$. What is the output shape of $XW + b$, and how many trainable parameters does the layer have?
Problem
Explain why $dW = X^\top dY$ instead of $dY^\top X$.
Last reviewed: May 1, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Matrix Operations and Properties (layer 0 · tier 1)
- Matrix Calculus (layer 1 · tier 1)
- Feedforward Networks and Backpropagation (layer 2 · tier 1)
Derived topics
No published topic currently declares this as a prerequisite.