
Linear Layer: Shapes, Bias, and Memory

A systems-first note on the linear layer: tensor shapes, the bias term, forward pass, backward pass, parameter memory, FLOPs, and finite-difference gradient tests.

Core · Tier 1 · Stable · Core spine · ~35 min

Why This Matters

The linear layer is the first place where deep learning becomes a systems object instead of a drawing of neurons. It has tensor shapes, parameter memory, activation memory, FLOPs, and a backward pass that can be tested. If you cannot predict these quantities for a single layer, attention, transformer blocks, KV cache, and accelerator performance will feel like vocabulary rather than engineering.

It is also the first useful bridge from matrix calculus to attention, transformer architecture, and KV-cache inference.

This note is Week 1 of the Deep Learning Systems From Scratch track. The rule is simple: every concept becomes code, a shape ledger, a memory estimate, a test, and review questions.

Core Object

Definition

Linear layer

For a mini-batch $X \in \mathbb{R}^{B \times D_{\mathrm{in}}}$, weights $W \in \mathbb{R}^{D_{\mathrm{in}} \times D_{\mathrm{out}}}$, and bias $b \in \mathbb{R}^{D_{\mathrm{out}}}$, a linear layer computes

$$Y = XW + b$$

where $Y \in \mathbb{R}^{B \times D_{\mathrm{out}}}$.

The bias broadcasts across the batch dimension. It gives each output coordinate an intercept. Without $b$, the affine map is forced through the origin of feature space, which is unnecessarily restrictive.
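
A minimal NumPy sketch of the broadcast; the shapes here are arbitrary illustrations, not values from the text:

import numpy as np

B, D_out = 4, 3
pre_bias = np.zeros((B, D_out))   # stand-in for XW, shape [B, D_out]
b = np.array([0.1, 0.2, 0.3])     # bias, shape [D_out]
Y = pre_bias + b                  # b is added to every row of the batch
assert Y.shape == (B, D_out)
assert np.allclose(Y, np.tile(b, (B, 1)))  # same result as explicitly repeating b over the batch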

Shape Ledger

| Quantity | Shape | Meaning |
| --- | --- | --- |
| $X$ | $[B, D_{\mathrm{in}}]$ | mini-batch input |
| $W$ | $[D_{\mathrm{in}}, D_{\mathrm{out}}]$ | trainable weights |
| $b$ | $[D_{\mathrm{out}}]$ | trainable bias, broadcast over $B$ |
| $Y$ | $[B, D_{\mathrm{out}}]$ | layer output |
| $dY$ | $[B, D_{\mathrm{out}}]$ | upstream gradient |
| $dW$ | $[D_{\mathrm{in}}, D_{\mathrm{out}}]$ | weight gradient |
| $db$ | $[D_{\mathrm{out}}]$ | bias gradient |
| $dX$ | $[B, D_{\mathrm{in}}]$ | input gradient |

The safest shape habit is to write the ledger before writing code. If the ledger is wrong, the implementation will only be accidentally right.

Forward Pass

import numpy as np

def linear_forward(X: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    # X is [B, D_in], W is [D_in, D_out], b is [D_out]; returns Y = XW + b with shape [B, D_out].
    assert X.ndim == 2
    assert W.ndim == 2
    assert b.ndim == 1
    assert X.shape[1] == W.shape[0]  # shared D_in dimension must match
    assert W.shape[1] == b.shape[0]  # one bias entry per output coordinate
    return X @ W + b                 # b broadcasts across the batch dimension

The matrix product contracts the shared $D_{\mathrm{in}}$ dimension:

$$[B, D_{\mathrm{in}}] \cdot [D_{\mathrm{in}}, D_{\mathrm{out}}] = [B, D_{\mathrm{out}}].$$
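
A quick usage sketch for linear_forward; the batch and feature sizes are chosen only for illustration:

rng = np.random.default_rng(0)
B, D_in, D_out = 32, 128, 64
X = rng.standard_normal((B, D_in))
W = rng.standard_normal((D_in, D_out))
b = np.zeros(D_out)

Y = linear_forward(X, W, b)
assert Y.shape == (B, D_out)  # [32, 128] @ [128, 64] -> [32, 64]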

Backward Pass

Proposition

Linear Layer Backward Pass

Statement

For the linear layer $Y = XW + b$, the gradients are

$$dW = X^\top dY,$$

$$db = \sum_{i=1}^{B} dY_i,$$

and

$$dX = dY W^\top.$$

Intuition

The weight gradient pairs each input coordinate with each output-gradient coordinate. The bias gradient sums over the batch because the same bias is used for every row. The input gradient sends the upstream signal backward through the transpose of the weights.
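
For readers who want the chain rule written out, here is the weight gradient in index form for a scalar loss $L$, using $Y_{ij} = \sum_k X_{ik} W_{kj} + b_j$:

$$\frac{\partial L}{\partial W_{kj}} = \sum_{i=1}^{B} \frac{\partial L}{\partial Y_{ij}} \frac{\partial Y_{ij}}{\partial W_{kj}} = \sum_{i=1}^{B} dY_{ij} X_{ik} = (X^\top dY)_{kj},$$

and likewise $\partial L / \partial b_j = \sum_{i=1}^{B} dY_{ij}$, which is the batch sum in $db$.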

Why It Matters

These three equations are the smallest useful backpropagation testbed. If they are wrong, deeper networks silently learn the wrong update.

def linear_backward(X: np.ndarray, W: np.ndarray, dY: np.ndarray):
    # X is [B, D_in], W is [D_in, D_out], dY is the upstream gradient with shape [B, D_out].
    assert X.ndim == 2 and W.ndim == 2 and dY.ndim == 2
    assert X.shape[0] == dY.shape[0]  # same batch size
    assert W.shape[1] == dY.shape[1]  # same D_out
    dW = X.T @ dY        # [D_in, D_out]
    db = dY.sum(axis=0)  # sum over the batch dimension -> [D_out]
    dX = dY @ W.T        # [B, D_in]
    return dX, dW, db
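
Continuing the earlier sketch with the same arbitrary shapes, the gradient shapes should reproduce the ledger:

dY = rng.standard_normal((B, D_out))   # pretend upstream gradient from the next layer
dX, dW, db = linear_backward(X, W, dY)
assert dX.shape == (B, D_in)
assert dW.shape == (D_in, D_out)
assert db.shape == (D_out,)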

Parameter Memory and FLOPs

The parameter count is

$$D_{\mathrm{in}} D_{\mathrm{out}} + D_{\mathrm{out}}.$$

For fp32 parameters, parameter memory is roughly

$$4 (D_{\mathrm{in}} D_{\mathrm{out}} + D_{\mathrm{out}})\ \text{bytes}.$$

The forward matmul costs approximately

$$2 B D_{\mathrm{in}} D_{\mathrm{out}}\ \text{FLOPs},$$

counting one multiply and one add for each of the $B D_{\mathrm{in}} D_{\mathrm{out}}$ input-output pairings. A rough lower bound on forward memory traffic is

$$B D_{\mathrm{in}} + D_{\mathrm{in}} D_{\mathrm{out}} + D_{\mathrm{out}} + B D_{\mathrm{out}}$$

elements read or written, times bytes per element.

This estimate is intentionally rough. Real hardware uses caches, tiling, kernel fusion, vectorization, and mixed precision. The point of the ledger is to know the first-order cost before profiling.
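
A small helper that turns these formulas into numbers. The function name and layout are illustrative, and the counts use the same rough conventions as above (fp32 by default, two FLOPs per multiply-add, no reuse of cached operands):

def linear_cost_ledger(B: int, d_in: int, d_out: int, bytes_per_el: int = 4):
    params = d_in * d_out + d_out                                # W plus b
    param_bytes = bytes_per_el * params
    flops = 2 * B * d_in * d_out                                 # one multiply and one add per pairing
    traffic_els = B * d_in + d_in * d_out + d_out + B * d_out    # read X, W, b; write Y
    return params, param_bytes, flops, traffic_els * bytes_per_el

# For B=32, D_in=128, D_out=64:
# params = 8256, param_bytes = 33024, flops = 524288, traffic_bytes = 57600
print(linear_cost_ledger(32, 128, 64))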

Finite-Difference Test

A finite-difference test checks whether the analytical gradient predicts the change in the loss under a small perturbation of a single parameter:

$$\frac{L(\theta + \epsilon e_j) - L(\theta - \epsilon e_j)}{2\epsilon} \approx \frac{\partial L}{\partial \theta_j}.$$

def finite_difference_param(loss_fn, param, index, eps=1e-5):
    # Central difference on one entry of param; loss_fn must read param by reference.
    old = param[index]
    param[index] = old + eps
    plus = loss_fn()
    param[index] = old - eps
    minus = loss_fn()
    param[index] = old  # restore the original value
    return (plus - minus) / (2 * eps)

Use this against selected entries of $W$, $b$, and $X$. It is slower than backprop, but it catches sign errors, missing batch sums, and transpose bugs.
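
Continuing the earlier sketch, a spot-check of one entry of $dW$ against the analytic gradient, using the sum of the outputs as a stand-in scalar loss:

def loss_fn():
    return linear_forward(X, W, b).sum()

# For L = sum(Y), the upstream gradient is a matrix of ones.
dY = np.ones((B, D_out))
_, dW, _ = linear_backward(X, W, dY)

idx = (0, 0)                      # an arbitrary entry of W
numeric = finite_difference_param(loss_fn, W, idx)
assert np.isclose(numeric, dW[idx], rtol=1e-4)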

Common Bugs

  • Treating $W$ as $[D_{\mathrm{out}}, D_{\mathrm{in}}]$ in one place and $[D_{\mathrm{in}}, D_{\mathrm{out}}]$ in another.
  • Forgetting that $db$ sums over the batch dimension.
  • Letting broadcasting hide a wrong bias shape (see the sketch after this list).
  • Testing only output values and never testing gradients.
  • Counting parameters but forgetting activation memory.
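
Continuing the earlier sketch, a minimal illustration of the broadcasting pitfall; the wrong shapes are deliberately constructed:

bad_b = np.zeros((1, D_out))      # shape [1, D_out] instead of [D_out]
Y_bad = X @ W + bad_b             # broadcasts anyway, so no error is raised
assert Y_bad.shape == (B, D_out)

worse_b = np.zeros((B, 1))        # one value per row instead of one per output coordinate
Y_worse = X @ W + worse_b         # also broadcasts silently, now with the wrong semantics
assert Y_worse.shape == (B, D_out)

The b.ndim == 1 assertion in linear_forward exists to reject exactly these silently broadcastable shapes.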

Review Questions

  • Why does the bias have shape $[D_{\mathrm{out}}]$?
  • Why does $db$ sum over the batch dimension?
  • What are the parameter counts for $W$ and $b$?
  • What is the output shape of $XW + b$?
  • Approximately how many FLOPs does the matmul cost?
  • What memory must be read and written?
  • Why does this operation benefit from batching?
  • How would a finite-difference test catch a wrong gradient?

References

Canonical:

  • Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press, Chapters 6-8.
  • Petersen, K. B. and Pedersen, M. S. (2012). "The Matrix Cookbook."
  • Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning representations by back-propagating errors." Nature.

Current:

  • PyTorch documentation. torch.nn.Linear.
  • NumPy documentation. numpy.matmul.
  • Stanford CS231n notes. "Backpropagation, Intuitions."

Exercises

Exercise (Core)

Problem

Let $X$ have shape $[32, 128]$, $W$ have shape $[128, 64]$, and $b$ have shape $[64]$. What is the output shape of $Y = XW + b$, and how many trainable parameters does the layer have?

Exercise (Advanced)

Problem

Explain why db = dY.sum(axis=0) rather than dY.sum(axis=1).

Last reviewed: May 1, 2026
