

Transformer Architecture

The mathematical formulation of the transformer block: self-attention, multi-head attention, layer normalization, FFN blocks, positional encoding, and parameter counting.


Why This Matters

[Figure: residual stream view of a decoder-only transformer. Each layer's multi-head attention and LayerNorm + FFN sub-blocks read from the residual stream, transform, and write back via skip connections: $x^{(0)} = \text{embed} + \text{pos}$, $x^{(l)} = x^{(l-1)} + \text{Attn}(\text{LN}(x^{(l-1)})) + \text{FFN}(\text{LN}(x^{(l-1)} + \text{Attn}(\cdots)))$, and $x^{(L)}$ is mapped to logits.]

The transformer is the architecture behind every modern large language model: GPT-4, Claude, Gemini, Llama. It replaced recurrent and convolutional architectures for sequence modeling because of one key property: self-attention allows every token to attend to every other token in parallel, enabling the model to learn long-range dependencies without the vanishing gradient problem of RNNs.

Understanding the transformer mathematically (not just as a diagram, but as a sequence of matrix operations with specific dimensions, costs, and properties) is essential for understanding everything built on top of it: RLHF, mechanistic interpretability, scaling laws, and efficiency research.

Mental Model

A transformer processes a sequence of tokens by passing them through a stack of identical blocks. Each block has two sub-layers: a self-attention layer (which lets tokens communicate with each other) and a feed-forward network (which processes each token independently). Residual connections and layer normalization stabilize training.

Self-attention is the key innovation. Each token creates a query ("what am I looking for?"), a key ("what do I contain?"), and a value ("what do I contribute?"). Tokens attend to each other based on query-key similarity, and the output is a weighted sum of values.

Formal Setup and Notation

Let the input sequence have $n$ tokens, each represented as a $d$-dimensional vector. The input is a matrix $X \in \mathbb{R}^{n \times d}$.

Self-Attention

Definition

Scaled Dot-Product Attention

Given an input $X \in \mathbb{R}^{n \times d}$, compute queries, keys, and values:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

where $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d_v}$ are learned weight matrices.

The attention output is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

where the softmax is applied row-wise (each row sums to 1).

Proposition

Attention Dimensions

Statement

For input $X \in \mathbb{R}^{n \times d}$:

  • $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$, $V \in \mathbb{R}^{n \times d_v}$
  • $QK^\top \in \mathbb{R}^{n \times n}$: the attention matrix
  • $\text{softmax}(QK^\top / \sqrt{d_k}) \in \mathbb{R}^{n \times n}$: each row is a probability distribution
  • $\text{Attention}(Q, K, V) \in \mathbb{R}^{n \times d_v}$: the output

The output of attention for token $i$ is a weighted average of value vectors:

$$\text{output}_i = \sum_{j=1}^{n} \alpha_{ij} \, v_j \quad \text{where } \alpha_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d_k})}{\sum_{\ell} \exp(q_i \cdot k_\ell / \sqrt{d_k})}$$

Intuition

Each token $i$ computes a query $q_i$ and compares it against all keys $k_1, \ldots, k_n$ via dot products. The softmax converts these similarity scores into attention weights $\alpha_{ij}$. The output is a weighted sum of value vectors, where tokens with similar query-key pairs get higher weight. The $\sqrt{d_k}$ scaling prevents the dot products from becoming too large (which would cause the softmax to saturate).

Why It Matters

Tracking dimensions through the transformer is the single most useful exercise for understanding the architecture. Every research paper assumes you can do this fluently. The $n \times n$ attention matrix is both the source of the transformer's power (global context) and its main computational bottleneck.

Why scale by $\sqrt{d_k}$? If the entries of $q$ and $k$ are independent with zero mean and unit variance, then $q \cdot k$ has variance $d_k$. Without scaling, large $d_k$ causes the dot products to have large magnitude, pushing the softmax into regions with near-zero gradients. Dividing by $\sqrt{d_k}$ normalizes the variance to 1, keeping the softmax in a useful range.
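To make the dimensions and the scaling argument concrete, here is a minimal NumPy sketch (illustrative, not from any particular library) of scaled dot-product attention, with a check that dividing by $\sqrt{d_k}$ brings the score variance from roughly $d_k$ back to roughly 1:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention. Q, K: (n, d_k); V: (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                     # (n, d_v) output, (n, n) weights

rng = np.random.default_rng(0)
n, d_k = 256, 64
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out, A = attention(Q, K, V)

assert out.shape == (n, d_k) and A.shape == (n, n)
assert np.allclose(A.sum(axis=-1), 1.0)   # each row is a probability distribution

# Unit-variance q, k: raw scores have variance ~d_k; scaled scores ~1.
raw = Q @ K.T
assert raw.var() > 10 * (raw / np.sqrt(d_k)).var()
```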

Multi-Head Attention

Definition

Multi-Head Attention

Instead of computing a single attention function, use $h$ heads in parallel:

$$\text{head}_i = \text{Attention}(XW_Q^{(i)}, XW_K^{(i)}, XW_V^{(i)})$$

where $W_Q^{(i)}, W_K^{(i)} \in \mathbb{R}^{d \times d_k}$ and $W_V^{(i)} \in \mathbb{R}^{d \times d_v}$ with $d_k = d_v = d/h$.

Concatenate the heads and project:

$$\text{MHA}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W_O$$

where $W_O \in \mathbb{R}^{d \times d}$.

Why multiple heads? Each head can attend to different aspects of the input: one head might focus on syntactic relationships, another on semantic similarity, another on positional proximity. Multi-head attention allows the model to jointly attend to information from different representation subspaces. Mechanistic interpretability work gives concrete examples of specialization: previous-token heads that copy information from position $t-1$ to position $t$ (Elhage et al., 2021), induction heads that implement in-context pattern completion (Olsson et al., 2022), and name mover heads that move subject tokens to the final position in the indirect object identification circuit (Wang et al., 2022).

Parameter count for MHA: Each head has $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}$, each of size $d \times (d/h)$. With $h$ heads, total QKV parameters are $3 \times h \times d \times (d/h) = 3d^2$. The output projection $W_O$ adds $d^2$. Total: $4d^2$ parameters (ignoring biases).
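As a sketch of how heads are implemented in practice, the per-head matrices can be stored as a single $d \times d$ projection per role and split by reshaping (a common convention; the function name is illustrative). The assertion at the end checks the $4d^2$ parameter count derived above:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d). W_q, W_k, W_v, W_o: (d, d), each packing all h heads."""
    n, d = X.shape
    d_k = d // h
    # Project once, then split each (n, d) result into h heads of width d/h.
    Q, K, V = ((X @ W).reshape(n, h, d_k).transpose(1, 0, 2)
               for W in (W_q, W_k, W_v))                 # each (h, n, d_k)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)     # (h, n, n)
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)                   # row-wise softmax per head
    heads = A @ V                                        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)      # concatenate heads
    return concat @ W_o                                  # (n, d)

rng = np.random.default_rng(1)
n, d, h = 6, 32, 4
X = rng.standard_normal((n, d))
W_q, W_k, W_v, W_o = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
assert out.shape == (n, d)
# QKV + output projection: 4 * d^2 parameters, matching the count above.
assert sum(W.size for W in (W_q, W_k, W_v, W_o)) == 4 * d * d
```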

Residual Connections and Layer Normalization

Definition

Transformer Sub-Layer with Residual Connection

Each sub-layer (attention or FFN) is wrapped with a residual connection:

$$\text{output} = x + \text{SubLayer}(x)$$

This allows gradients to flow directly through the network and enables training of deep transformers.

Definition

Layer Normalization

For a vector $x \in \mathbb{R}^d$, layer normalization computes:

$$\text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \odot \gamma + \beta$$

where $\mu = \frac{1}{d}\sum_i x_i$, $\sigma = \sqrt{\frac{1}{d}\sum_i (x_i - \mu)^2 + \epsilon}$, and $\gamma, \beta \in \mathbb{R}^d$ are learned scale and shift parameters.

Pre-norm vs. post-norm. The original transformer (Vaswani et al., 2017) uses post-norm: $\text{LayerNorm}(x + \text{SubLayer}(x))$. Most modern LLMs use pre-norm: $x + \text{SubLayer}(\text{LayerNorm}(x))$. Pre-norm is more stable for training deep networks because the residual path is unobstructed.
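A minimal sketch of the layer norm formula above, assuming the $\epsilon$-stabilized variance from the definition; with $\gamma = 1$ and $\beta = 0$ the output has zero mean and (almost exactly) unit standard deviation per token:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token vector to zero mean / unit variance, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    return (x - mu) / sigma * gamma + beta

d = 8
x = np.arange(d, dtype=float)            # an arbitrary un-normalized activation
gamma, beta = np.ones(d), np.zeros(d)    # identity scale/shift
y = layer_norm(x, gamma, beta)
assert abs(y.mean()) < 1e-9
assert abs(y.std() - 1.0) < 1e-3         # slightly below 1 because of eps
```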

Feed-Forward Network

Definition

Position-wise Feed-Forward Network

The original 2017 transformer FFN applies two linear transformations with a nonlinearity:

$$\text{FFN}(x) = W_2 \, \sigma(W_1 x + b_1) + b_2$$

where $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d}$, $W_2 \in \mathbb{R}^{d \times d_{\text{ff}}}$, and $\sigma$ is ReLU (Vaswani et al., 2017) or GELU in later variants. The standard choice is $d_{\text{ff}} = 4d$.

Parameter count for the 2017 FFN. $W_1$ has $d \cdot d_{\text{ff}}$ parameters, $W_2$ has $d_{\text{ff}} \cdot d$ parameters. With $d_{\text{ff}} = 4d$: total is $8d^2$ parameters (ignoring biases).

Definition

Gated FFN (SwiGLU / GeGLU)

Since 2023, essentially every frontier LLM (Llama 2, Llama 3, Mistral, Mixtral, DeepSeek, Qwen, Gemma) replaces the two-matrix FFN with a gated variant that has three projection matrices $W_1, W_2, W_3$:

$$\text{FFN}_{\text{SwiGLU}}(x) = W_2 \left( \sigma(W_1 x) \odot W_3 x \right)$$

where $\odot$ is the elementwise product, $W_1, W_3 \in \mathbb{R}^{d_{\text{ff}} \times d}$, $W_2 \in \mathbb{R}^{d \times d_{\text{ff}}}$, and $\sigma$ is SiLU (SwiGLU) or GELU (GeGLU). The gate $W_3 x$ modulates the activation elementwise.

Parameter count for gated FFN. Three projections give $3 \cdot d \cdot d_{\text{ff}}$ parameters. To keep the parameter budget comparable to the 2017 FFN, Shazeer (2020) recommends $d_{\text{ff}} = \frac{2}{3} \cdot 4d = \frac{8d}{3}$, which yields $3 \cdot d \cdot \frac{8d}{3} = 8d^2$ parameters. This matches the legacy $12d^2$ per-layer formula used below. Llama 2 7B ($d = 4096$) picks $d_{\text{ff}} = 11008 \approx 8 \cdot 4096 / 3 = 10922$, rounded up to a multiple of 256 for kernel alignment.
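The gated FFN and its parameter budget can be checked directly. This sketch uses an illustrative $d = 96$, chosen so that $d_{\text{ff}} = 8d/3 = 256$ is exact and the $8d^2$ budget matches to the parameter:

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def ffn_swiglu(x, W1, W2, W3):
    """Gated FFN: W2 (silu(W1 x) ⊙ W3 x), with W1, W3: (d_ff, d) and W2: (d, d_ff)."""
    return W2 @ (silu(W1 @ x) * (W3 @ x))

d = 96
d_ff = 8 * d // 3          # Shazeer's 2/3 * 4d; real models round up to a multiple of 256
rng = np.random.default_rng(2)
W1, W3 = rng.standard_normal((2, d_ff, d)) * 0.05
W2 = rng.standard_normal((d, d_ff)) * 0.05
y = ffn_swiglu(rng.standard_normal(d), W1, W2, W3)
assert y.shape == (d,)
# Three projections at d_ff = 8d/3 match the classic two-matrix 8 d^2 budget.
assert 3 * d * d_ff == 8 * d * d
```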

The role of the FFN. Geva et al. (2021) proposed that FFN layers act as key-value memories: $W_1$ maps inputs to a high-dimensional space where patterns are detected, and $W_2$ maps back to the residual stream with the associated information. Under this view, the FFN is where factual knowledge is primarily stored. This is an empirical interpretation from mechanistic interpretability, not a settled fact. Hase et al. (2023) show that the location where Causal Tracing localizes a fact is not a reliable predictor of which layer is best to edit, which complicates the simple "localization equals storage" reading.

Positional Encoding

Self-attention is permutation-equivariant: shuffling the input tokens shuffles the output tokens identically. Without positional information, the model cannot distinguish "the dog bit the man" from "the man bit the dog."

Definition

Sinusoidal Positional Encoding

The original transformer uses fixed sinusoidal encodings added to the input embeddings:

$$\text{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad \text{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

for position $pos$ and dimension $i$. This allows the model to attend to relative positions because $\text{PE}(pos + k)$ is a linear function of $\text{PE}(pos)$.

Definition

Rotary Position Embedding (RoPE)

RoPE (Su et al., 2021) encodes position by rotating the query and key vectors in $d/2$ independent 2D subspaces:

$$\tilde{q}_m = R_m q, \quad \tilde{k}_n = R_n k$$

where $R_m$ is a block-diagonal matrix of $d/2$ planar rotations, with the $i$-th block a $2 \times 2$ rotation by angle $m\theta_i$, using the base frequencies

$$\theta_i = 10000^{-2i/d}, \quad i = 0, 1, \ldots, d/2 - 1.$$

Rotations that act on the same 2D subspace commute and satisfy $R_m^\top R_n = R_{n-m}$ block-by-block. Since $R_m$ is block-diagonal, the full matrix product also satisfies $R_m^\top R_n = R_{n-m}$, so the attention score becomes:

$$\tilde{q}_m^\top \tilde{k}_n = q^\top R_m^\top R_n k = q^\top R_{n-m} k.$$

This depends only on the relative position $n - m$, giving the model translation-invariant attention.

Why RoPE dominates. RoPE naturally encodes relative positions (not absolute), extrapolates better to longer sequences than seen during training, and does not add parameters. It is used in Llama, Mistral, and most modern open-source LLMs.
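The relative-position property can be verified numerically. This sketch applies the $d/2$ planar rotations to interleaved dimension pairs (one common layout; implementations differ in how they pair dimensions), then checks that the score depends only on the offset $n - m$:

```python
import numpy as np

def rope(x, m):
    """Rotate vector x (even length d) at position m in d/2 independent 2D subspaces."""
    d = x.shape[-1]
    theta = 10000.0 ** (-2 * np.arange(d // 2) / d)  # base frequencies theta_i
    angles = m * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                        # pair up (even, odd) dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                  # 2x2 rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(3)
d = 64
q, k = rng.standard_normal((2, d))
# Same offset n - m = 4 at two different absolute positions gives the same score.
s1 = rope(q, m=5) @ rope(k, m=9)
s2 = rope(q, m=105) @ rope(k, m=109)
assert np.isclose(s1, s2)
```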

Computational Complexity

Proposition

Attention is Quadratic in Sequence Length

Statement

The computational cost of self-attention is:

$$O(n^2 d)$$

The $n^2$ factor comes from computing the $n \times n$ attention matrix $QK^\top$. The memory cost for storing attention weights is $O(n^2)$ per head.

For a full transformer with $L$ layers and $h$ heads:

  • Attention cost per layer: $O(n^2 d)$
  • FFN cost per layer: $O(n d \cdot d_{\text{ff}}) = O(nd^2)$ with $d_{\text{ff}} = 4d$
  • Total cost: $O(L(n^2 d + n d^2))$

Intuition

Every token must attend to every other token, producing an $n \times n$ matrix. For short sequences ($n \ll d$), the FFN dominates. For long sequences ($n \gg d$), attention dominates. This is why extending context length is hard: doubling $n$ quadruples the attention cost.
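A back-of-the-envelope comparison makes the crossover concrete (constant factors below are illustrative, using the 2-FLOPs-per-multiply-add convention, not exact kernel counts); with these constants attention overtakes the FFN around $n \approx 4d$:

```python
def attn_flops(n, d):
    """Per-layer attention matmuls: QK^T and A @ V, each ~2 n^2 d FLOPs."""
    return 2 * n * n * d * 2

def ffn_flops(n, d, d_ff=None):
    """Per-layer FFN: up- and down-projection, each ~2 n d d_ff FLOPs."""
    d_ff = d_ff or 4 * d
    return 2 * n * d * d_ff * 2

d = 4096
# Doubling n quadruples attention cost but only doubles FFN cost.
assert attn_flops(2048, d) == 4 * attn_flops(1024, d)
assert ffn_flops(2048, d) == 2 * ffn_flops(1024, d)
# Short sequences: FFN dominates. Long sequences: attention dominates.
assert attn_flops(1024, d) < ffn_flops(1024, d)
assert attn_flops(100_000, d) > ffn_flops(100_000, d)
```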

Why It Matters

The quadratic cost is the fundamental bottleneck for long-context models. A model processing 100K tokens needs attention matrices with $10^{10}$ entries per layer. This has motivated extensive research into efficient attention: sparse attention, linear attention, FlashAttention (which reduces memory but not FLOPs), and sub-quadratic architectures like Mamba.

Failure Mode

The $O(n^2)$ scaling is for standard dense attention. Methods like FlashAttention reduce the memory cost from $O(n^2)$ to $O(n)$ by computing attention in tiles, but the compute cost remains $O(n^2 d)$. True sub-quadratic compute requires architectural changes (sparse or linear attention), which can reduce model quality.

Parameter Counting

Proposition

Transformer Parameter Count

Statement

A decoder-only transformer with $L$ layers has approximately:

$$\text{Total params} \approx V \cdot d + L \cdot 12d^2 + V \cdot d$$

Breaking this down:

  • Token embedding: $V \times d$ parameters
  • Per layer:
    • Multi-head attention (QKV + output): $4d^2$
    • FFN (two linear layers): $8d^2$
    • Layer norm (2 per layer): $4d$ (negligible)
    • Subtotal: $\approx 12d^2$ per layer
  • Output projection (often tied with embedding): $V \times d$

For GPT-3 scale ($L = 96$, $d = 12288$, $V = 50257$): approximately $12288 \times 50257 + 96 \times 12 \times 12288^2 \approx 175$B parameters.

This is an order-of-magnitude approximation. It ignores biases, layer norm scale and shift, the final output projection head, and absolute positional embeddings, and it double-counts in the presence of weight tying. Many modern implementations tie the input embedding $V \times d$ with the output unembedding matrix, so only one $V \times d$ block is counted. The headline 175B figure works out because the omitted and double-counted terms approximately cancel against the neglected layer norm and bias parameters, not because $12d^2 L$ plus two embedding copies is exact.
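With those caveats, the approximation is one line of arithmetic; this sketch (function name illustrative) reproduces the GPT-3 headline number with untied embeddings:

```python
def approx_params(L, d, V, tied=True):
    """12 d^2 per layer plus embedding (and unembedding if untied); biases/LN ignored."""
    emb = V * d if tied else 2 * V * d
    return emb + L * 12 * d * d

gpt3 = approx_params(L=96, d=12288, V=50257, tied=False)
print(f"{gpt3 / 1e9:.1f}B")   # prints 175.2B
```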

Intuition

The vast majority of parameters are in the transformer layers, not the embeddings (unless the vocabulary is very large). Within each layer, the FFN contains $2/3$ of the parameters ($8d^2$ vs. $4d^2$ for attention). This is why the FFN layers are where most of the model's knowledge capacity resides.

Why It Matters

Parameter counting is essential for: (1) estimating compute costs for training and inference, (2) understanding scaling laws, (3) comparing architectures, and (4) estimating memory requirements. A model with $N$ parameters in fp16 requires $2N$ bytes of memory just for weights, plus additional memory for activations and optimizer states. Techniques like speculative decoding and quantization reduce these costs at serving time.

A Complete Transformer Block

Putting it all together, one transformer block computes (using pre-norm):

$$h = x + \text{MHA}(\text{LayerNorm}(x))$$

$$\text{output} = h + \text{FFN}(\text{LayerNorm}(h))$$

The full model stacks LL such blocks, preceded by token embedding + positional encoding and followed by a final layer norm and linear output projection to vocabulary logits.
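The two equations above can be sketched end-to-end in NumPy (a minimal illustration: the learned LayerNorm scale/shift is omitted, the FFN uses the 2017 ReLU form, and weights are stored for right-multiplication, unlike the $(d_{\text{ff}} \times d)$ convention earlier):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, h, d_ff = 5, 32, 4, 128

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def ln(x):
    """LayerNorm without the learned gamma/beta, for brevity."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + 1e-5)

def mha(X, Wq, Wk, Wv, Wo):
    d_k = d // h
    Q, K, V = ((X @ W).reshape(n, h, d_k).transpose(1, 0, 2) for W in (Wq, Wk, Wv))
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))   # (h, n, n)
    return (A @ V).transpose(1, 0, 2).reshape(n, d) @ Wo   # concat heads, project

def block(X, Wq, Wk, Wv, Wo, W1, W2):
    hid = X + mha(ln(X), Wq, Wk, Wv, Wo)            # pre-norm attention sub-layer
    return hid + np.maximum(ln(hid) @ W1, 0) @ W2   # pre-norm ReLU FFN sub-layer

Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
W1 = rng.standard_normal((d, d_ff)) * 0.1
W2 = rng.standard_normal((d_ff, d)) * 0.1
X = rng.standard_normal((n, d))
out = block(X, Wq, Wk, Wv, Wo, W1, W2)
assert out.shape == (n, d)   # each block maps (n, d) -> (n, d), so blocks stack
```

Because each block maps $(n, d)$ to $(n, d)$, stacking $L$ of them is just repeated application.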

Common Confusions

Watch Out

Attention is not a learned weight matrix

The attention weights $\alpha_{ij}$ are computed dynamically from the input. They change for every input sequence. The learned parameters are $W_Q, W_K, W_V, W_O$, which determine how attention is computed. This input-dependence is what gives transformers their flexibility compared to fixed-weight architectures.

Watch Out

Multi-head attention does not multiply the cost by h

Each head operates on $d/h$ dimensions, so the total computation across all heads is the same as a single head with full $d$ dimensions. Multi-head attention is a reorganization of computation, not a multiplication.

Watch Out

FlashAttention reduces memory, not FLOPs

FlashAttention computes the same mathematical operation as standard attention. It reduces memory from $O(n^2)$ to $O(n)$ by computing attention in blocks and never materializing the full $n \times n$ matrix. But the number of floating-point operations is unchanged. True compute savings require architectural changes.

Summary

  • Self-attention: $\text{Attention}(Q,K,V) = \text{softmax}(QK^\top / \sqrt{d_k})\,V$
  • Multi-head attention: $h$ parallel heads with $d_k = d/h$, concatenated and projected
  • Each transformer block: attention + residual + LayerNorm + FFN + residual + LayerNorm
  • Attention cost is $O(n^2 d)$: quadratic in sequence length
  • FFN cost is $O(nd^2)$: dominates for short sequences
  • Per-layer parameters: $\approx 12d^2$ (attention $4d^2$ + FFN $8d^2$). Modern LLMs replace the two-matrix FFN with a SwiGLU or GeGLU gated FFN that has three projections and $d_{\text{ff}} \approx 8d/3$, which preserves the $8d^2$ budget. Mixture-of-experts variants sparsely activate a subset of FFN parameters.
  • RoPE gives relative position encoding via rotation of $Q$ and $K$
  • Pre-norm (LayerNorm before the sub-layer) is standard in modern LLMs

Exercises

ExerciseCore

Problem

For a transformer with $d = 512$, $h = 8$ heads, and $d_{\text{ff}} = 2048$, compute the number of parameters in one transformer block (ignoring biases and layer norm parameters).

ExerciseCore

Problem

If the sequence length doubles from $n$ to $2n$, by what factor does the attention computation cost increase? By what factor does the FFN computation cost increase?

ExerciseAdvanced

Problem

Show that without positional encoding, self-attention is permutation-equivariant: if you permute the input tokens by a permutation $\sigma$, the output tokens are permuted by the same $\sigma$.

ExerciseAdvanced

Problem

A transformer model has $L = 32$ layers, $d = 4096$, $h = 32$ heads, $d_{\text{ff}} = 11008$ (as in Llama 2 7B, which uses a SwiGLU-gated FFN with $d_{\text{ff}} \approx 8d/3$, rounded to a multiple of 256 for kernel alignment), and vocabulary $V = 32000$. Estimate the total parameter count and the memory required to store weights in fp16.


References

Canonical:

  • Vaswani et al., "Attention Is All You Need" (2017). The original transformer paper, Sections 3.1-3.3 and 3.5 for architecture and positional encoding.

Current:

  • Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021). RoPE, Sections 3.1-3.4 for the $\theta_i = 10000^{-2i/d}$ construction.
  • Shazeer, "GLU Variants Improve Transformer" (2020). Motivates SwiGLU and GeGLU, and recommends shrinking dffd_{\text{ff}} by 2/32/3 to match the 2017 parameter budget.
  • Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models" (2023). SwiGLU-gated FFN with dff=11008d_{\text{ff}} = 11008 at d=4096d = 4096, Section 2.
  • Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022).

Mechanistic interpretability:

  • Elhage et al., "A Mathematical Framework for Transformer Circuits" (Anthropic, 2021). Previous-token heads and the residual-stream view.
  • Olsson et al., "In-context Learning and Induction Heads" (Anthropic, 2022). Induction heads as a mechanism for in-context learning.
  • Wang et al., "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small" (2022). Name mover heads.
  • Geva et al., "Transformer Feed-Forward Layers Are Key-Value Memories" (EMNLP 2021). FFN-as-memory hypothesis.
  • Hase et al., "Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models" (NeurIPS 2023). Evidence that Causal Tracing localizations do not predict editable layers.

Textbooks:

  • Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12.
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapters 10-12.

Next Topics

The natural next steps from transformer architecture:

  • Mechanistic interpretability: what do the attention heads and FFN layers actually compute?
  • Hallucination theory: why the next-token prediction objective leads to confabulation
  • RLHF and alignment: fine-tuning the transformer for human preferences
  • Vision transformer lineage: how the transformer was adapted for computer vision (ViT, Swin, DINO, CLIP)

Last reviewed: April 2026
