
Linear Layer: Shapes, Bias, and Memory

A systems-first note on the linear layer: tensor shapes, the bias term, forward pass, backward pass, parameter memory, FLOPs, and finite-difference gradient tests.

Core · Tier 1 · Stable · Core spine · ~35 min

Why This Matters

The linear layer is the first place where deep learning becomes a systems object instead of a drawing of neurons. It has tensor shapes, parameter memory, activation memory, FLOPs, and a backward pass that can be tested. If you cannot predict these quantities for a single layer, attention, transformer blocks, KV cache, and accelerator performance will feel like vocabulary rather than engineering.

It is also the first useful bridge from matrix calculus to attention, transformer architecture, and KV-cache inference.

This note is Week 1 of the Deep Learning Systems From Scratch track. The rule is simple: every concept becomes code, a shape ledger, a memory estimate, a test, and review questions.

Core Object

Definition

Linear layer

For a mini-batch $X \in \mathbb{R}^{B \times D_{\mathrm{in}}}$, weights $W \in \mathbb{R}^{D_{\mathrm{in}} \times D_{\mathrm{out}}}$, and bias $b \in \mathbb{R}^{D_{\mathrm{out}}}$, a linear layer computes

$$Y = XW + b$$

where $Y \in \mathbb{R}^{B \times D_{\mathrm{out}}}$.

The bias broadcasts across the batch dimension. It gives each output coordinate an intercept. Without $b$, the affine map is forced through the origin of feature space, which is unnecessarily restrictive.
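
A minimal NumPy sketch of the broadcast; the shapes here are arbitrary illustrations, not values from the text:

import numpy as np

B, D_out = 4, 3
pre_bias = np.zeros((B, D_out))   # stand-in for XW, shape [B, D_out]
b = np.array([0.1, 0.2, 0.3])     # bias, shape [D_out]
Y = pre_bias + b                  # b is added to every row of the batch
assert Y.shape == (B, D_out)
assert np.allclose(Y, np.tile(b, (B, 1)))  # same result as explicitly repeating b over the batch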

Shape Ledger

| Quantity | Shape | Meaning |
| --- | --- | --- |
| $X$ | $[B, D_{\mathrm{in}}]$ | mini-batch input |
| $W$ | $[D_{\mathrm{in}}, D_{\mathrm{out}}]$ | trainable weights |
| $b$ | $[D_{\mathrm{out}}]$ | trainable bias, broadcast over $B$ |
| $Y$ | $[B, D_{\mathrm{out}}]$ | layer output |
| $dY$ | $[B, D_{\mathrm{out}}]$ | upstream gradient |
| $dW$ | $[D_{\mathrm{in}}, D_{\mathrm{out}}]$ | weight gradient |
| $db$ | $[D_{\mathrm{out}}]$ | bias gradient |
| $dX$ | $[B, D_{\mathrm{in}}]$ | input gradient |

The safest shape habit is to write the ledger before writing code. If the ledger is wrong, the implementation will only be accidentally right.

Forward Pass

import numpy as np

def linear_forward(X: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    # X is [B, D_in], W is [D_in, D_out], b is [D_out]; returns Y = XW + b with shape [B, D_out].
    assert X.ndim == 2
    assert W.ndim == 2
    assert b.ndim == 1
    assert X.shape[1] == W.shape[0]  # shared D_in dimension must match
    assert W.shape[1] == b.shape[0]  # one bias entry per output coordinate
    return X @ W + b                 # b broadcasts across the batch dimension

The matrix product contracts the shared $D_{\mathrm{in}}$ dimension:

$$[B, D_{\mathrm{in}}] \cdot [D_{\mathrm{in}}, D_{\mathrm{out}}] = [B, D_{\mathrm{out}}].$$
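
A quick usage sketch for linear_forward; the batch and feature sizes are chosen only for illustration:

rng = np.random.default_rng(0)
B, D_in, D_out = 32, 128, 64
X = rng.standard_normal((B, D_in))
W = rng.standard_normal((D_in, D_out))
b = np.zeros(D_out)

Y = linear_forward(X, W, b)
assert Y.shape == (B, D_out)  # [32, 128] @ [128, 64] -> [32, 64]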

Backward Pass

Proposition

Linear Layer Backward Pass

Statement

For the linear layer $Y = XW + b$, the gradients are

$$dW = X^\top dY,$$

$$db = \sum_{i=1}^{B} dY_i,$$

and

$$dX = dY W^\top.$$

Intuition

The weight gradient pairs each input coordinate with each output-gradient coordinate. The bias gradient sums over the batch because the same bias is used for every row. The input gradient sends the upstream signal backward through the transpose of the weights.
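
For readers who want the chain rule written out, here is the weight gradient in index form for a scalar loss $L$, using $Y_{ij} = \sum_k X_{ik} W_{kj} + b_j$:

$$\frac{\partial L}{\partial W_{kj}} = \sum_{i=1}^{B} \frac{\partial L}{\partial Y_{ij}} \frac{\partial Y_{ij}}{\partial W_{kj}} = \sum_{i=1}^{B} dY_{ij} X_{ik} = (X^\top dY)_{kj},$$

and likewise $\partial L / \partial b_j = \sum_{i=1}^{B} dY_{ij}$, which is the batch sum in $db$.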

Why It Matters

These three equations are the smallest useful backpropagation testbed. If they are wrong, deeper networks silently learn the wrong update.

def linear_backward(X: np.ndarray, W: np.ndarray, dY: np.ndarray):
    # X is [B, D_in], W is [D_in, D_out], dY is the upstream gradient with shape [B, D_out].
    assert X.ndim == 2 and W.ndim == 2 and dY.ndim == 2
    assert X.shape[0] == dY.shape[0]  # same batch size
    assert W.shape[1] == dY.shape[1]  # same D_out
    dW = X.T @ dY        # [D_in, D_out]
    db = dY.sum(axis=0)  # sum over the batch dimension -> [D_out]
    dX = dY @ W.T        # [B, D_in]
    return dX, dW, db
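
Continuing the earlier sketch with the same arbitrary shapes, the gradient shapes should reproduce the ledger:

dY = rng.standard_normal((B, D_out))   # pretend upstream gradient from the next layer
dX, dW, db = linear_backward(X, W, dY)
assert dX.shape == (B, D_in)
assert dW.shape == (D_in, D_out)
assert db.shape == (D_out,)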

Parameter Memory and FLOPs

The parameter count is

$$D_{\mathrm{in}} D_{\mathrm{out}} + D_{\mathrm{out}}.$$

For fp32 parameters, parameter memory is roughly

$$4 (D_{\mathrm{in}} D_{\mathrm{out}} + D_{\mathrm{out}})\ \text{bytes}.$$

The forward matmul costs approximately

$$2 B D_{\mathrm{in}} D_{\mathrm{out}}\ \text{FLOPs},$$

counting one multiply and one add for each of the $B D_{\mathrm{in}} D_{\mathrm{out}}$ input-output pairings. A rough lower bound on forward memory traffic is

$$B D_{\mathrm{in}} + D_{\mathrm{in}} D_{\mathrm{out}} + D_{\mathrm{out}} + B D_{\mathrm{out}}$$

elements read or written, times bytes per element.

This estimate is intentionally rough. Real hardware uses caches, tiling, kernel fusion, vectorization, and mixed precision. The point of the ledger is to know the first-order cost before profiling.
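
A small helper that turns these formulas into numbers. The function name and layout are illustrative, and the counts use the same rough conventions as above (fp32 by default, two FLOPs per multiply-add, no reuse of cached operands):

def linear_cost_ledger(B: int, d_in: int, d_out: int, bytes_per_el: int = 4):
    params = d_in * d_out + d_out                                # W plus b
    param_bytes = bytes_per_el * params
    flops = 2 * B * d_in * d_out                                 # one multiply and one add per pairing
    traffic_els = B * d_in + d_in * d_out + d_out + B * d_out    # read X, W, b; write Y
    return params, param_bytes, flops, traffic_els * bytes_per_el

# For B=32, D_in=128, D_out=64:
# params = 8256, param_bytes = 33024, flops = 524288, traffic_bytes = 57600
print(linear_cost_ledger(32, 128, 64))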

Finite-Difference Test

A finite-difference test checks whether the analytical gradient predicts the change in the loss under a small perturbation of a single parameter:

$$\frac{L(\theta + \epsilon e_j) - L(\theta - \epsilon e_j)}{2\epsilon} \approx \frac{\partial L}{\partial \theta_j}.$$

def finite_difference_param(loss_fn, param, index, eps=1e-5):
    # Central difference on one entry of param; loss_fn must read param by reference.
    old = param[index]
    param[index] = old + eps
    plus = loss_fn()
    param[index] = old - eps
    minus = loss_fn()
    param[index] = old  # restore the original value
    return (plus - minus) / (2 * eps)

Use this against selected entries of $W$, $b$, and $X$. It is slower than backprop, but it catches sign errors, missing batch sums, and transpose bugs.
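
Continuing the earlier sketch, a spot-check of one entry of $dW$ against the analytic gradient, using the sum of the outputs as a stand-in scalar loss:

def loss_fn():
    return linear_forward(X, W, b).sum()

# For L = sum(Y), the upstream gradient is a matrix of ones.
dY = np.ones((B, D_out))
_, dW, _ = linear_backward(X, W, dY)

idx = (0, 0)                      # an arbitrary entry of W
numeric = finite_difference_param(loss_fn, W, idx)
assert np.isclose(numeric, dW[idx], rtol=1e-4)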

Common Bugs

  • Treating $W$ as $[D_{\mathrm{out}}, D_{\mathrm{in}}]$ in one place and $[D_{\mathrm{in}}, D_{\mathrm{out}}]$ in another.
  • Forgetting that $db$ sums over the batch dimension.
  • Letting broadcasting hide a wrong bias shape (see the sketch after this list).
  • Testing only output values and never testing gradients.
  • Counting parameters but forgetting activation memory.
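
Continuing the earlier sketch, a minimal illustration of the broadcasting pitfall; the wrong shapes are deliberately constructed:

bad_b = np.zeros((1, D_out))      # shape [1, D_out] instead of [D_out]
Y_bad = X @ W + bad_b             # broadcasts anyway, so no error is raised
assert Y_bad.shape == (B, D_out)

worse_b = np.zeros((B, 1))        # one value per row instead of one per output coordinate
Y_worse = X @ W + worse_b         # also broadcasts silently, now with the wrong semantics
assert Y_worse.shape == (B, D_out)

The b.ndim == 1 assertion in linear_forward exists to reject exactly these silently broadcastable shapes.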

Review Questions

  • Why does the bias have shape $[D_{\mathrm{out}}]$?
  • Why does $db$ sum over the batch dimension?
  • What are the parameter counts for $W$ and $b$?
  • What is the output shape of $XW + b$?
  • Approximately how many FLOPs does the matmul cost?
  • What memory must be read and written?
  • Why does this operation benefit from batching?
  • How would a finite-difference test catch a wrong gradient?

References

Canonical:

  • Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press, Chapters 6-8.
  • Petersen, K. B. and Pedersen, M. S. (2012). "The Matrix Cookbook."
  • Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning representations by back-propagating errors." Nature.

Current:

  • PyTorch documentation. torch.nn.Linear.
  • NumPy documentation. numpy.matmul.
  • Stanford CS231n notes. "Backpropagation, Intuitions."

Exercises

Exercise (Core)

Problem

Let $X$ have shape $[32, 128]$, $W$ have shape $[128, 64]$, and $b$ have shape $[64]$. What is the output shape of $Y = XW + b$, and how many trainable parameters does the layer have?

Exercise (Advanced)

Problem

Explain why db = dY.sum(axis=0) rather than dY.sum(axis=1).

Last reviewed: May 1, 2026
