

Transformer Architecture

The mathematical formulation of the transformer block: self-attention, multi-head attention, layer normalization, FFN blocks, positional encoding, and parameter counting.


Why This Matters

[Figure: residual stream view of a decoder-only transformer. Each layer's multi-head attention and LayerNorm + FFN sub-blocks read from the residual stream, transform, and write back via skip connections: $x^{(0)} = \text{embed} + \text{pos}$, $x^{(l)} = x^{(l-1)} + \text{Attn}(\text{LN}(x^{(l-1)})) + \text{FFN}(\text{LN}(x^{(l-1)} + \text{Attn}(\cdots)))$, and $x^{(L)}$ is mapped to logits.]

The transformer is the architecture behind every modern large language model: GPT-4, Claude, Gemini, Llama. It replaced recurrent and convolutional architectures for sequence modeling because of one key property: self-attention allows every token to attend to every other token in parallel, enabling the model to learn long-range dependencies without the vanishing gradient problem of RNNs.

Understanding the transformer mathematically (not just as a diagram, but as a sequence of matrix operations with specific dimensions, costs, and properties) is essential for understanding everything built on top of it: RLHF, mechanistic interpretability, scaling laws, and efficiency research.

Mental Model

A transformer processes a sequence of tokens by passing them through a stack of identical blocks. Each block has two sub-layers: a self-attention layer (which lets tokens communicate with each other) and a feed-forward network (which processes each token independently). Residual connections and layer normalization stabilize training.

Self-attention is the key innovation. Each token creates a query ("what am I looking for?"), a key ("what do I contain?"), and a value ("what do I contribute?"). Tokens attend to each other based on query-key similarity, and the output is a weighted sum of values.

Formal Setup and Notation

Let the input sequence have $n$ tokens, each represented as a $d$-dimensional vector. The input is a matrix $X \in \mathbb{R}^{n \times d}$.

Self-Attention

Definition

Scaled Dot-Product Attention

Given an input $X \in \mathbb{R}^{n \times d}$, compute queries, keys, and values:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

where $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d_v}$ are learned weight matrices.

The attention output is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

where the softmax is applied row-wise (each row sums to 1).

Proposition

Attention Dimensions

Statement

For input $X \in \mathbb{R}^{n \times d}$:

  • $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$, $V \in \mathbb{R}^{n \times d_v}$
  • $QK^\top \in \mathbb{R}^{n \times n}$: the attention matrix
  • $\text{softmax}(QK^\top / \sqrt{d_k}) \in \mathbb{R}^{n \times n}$: each row is a probability distribution
  • $\text{Attention}(Q, K, V) \in \mathbb{R}^{n \times d_v}$: the output

The output of attention for token $i$ is a weighted average of value vectors:

$$\text{output}_i = \sum_{j=1}^{n} \alpha_{ij} \, v_j \quad \text{where } \alpha_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d_k})}{\sum_{\ell} \exp(q_i \cdot k_\ell / \sqrt{d_k})}$$

Intuition

Each token $i$ computes a query $q_i$ and compares it against all keys $k_1, \ldots, k_n$ via dot products. The softmax converts these similarity scores into attention weights $\alpha_{ij}$. The output is a weighted sum of value vectors, where tokens with similar query-key pairs get higher weight. The $\sqrt{d_k}$ scaling prevents the dot products from becoming too large (which would cause the softmax to saturate).

Why It Matters

Tracking dimensions through the transformer is the single most useful exercise for understanding the architecture. Every research paper assumes you can do this fluently. The $n \times n$ attention matrix is both the source of the transformer's power (global context) and its main computational bottleneck.

Why scale by $\sqrt{d_k}$? If the entries of $q$ and $k$ are independent with zero mean and unit variance, then $q \cdot k$ has variance $d_k$. Without scaling, large $d_k$ causes the dot products to have large magnitude, pushing the softmax into regions with near-zero gradients. Dividing by $\sqrt{d_k}$ normalizes the variance to 1, keeping the softmax in a useful range.
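To make the dimensions and the scaling argument concrete, here is a minimal NumPy sketch (illustrative, not from any particular library) of scaled dot-product attention, with a check that dividing by $\sqrt{d_k}$ brings the score variance from roughly $d_k$ back to roughly 1:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention. Q, K: (n, d_k); V: (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                     # (n, d_v) output, (n, n) weights

rng = np.random.default_rng(0)
n, d_k = 256, 64
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out, A = attention(Q, K, V)

assert out.shape == (n, d_k) and A.shape == (n, n)
assert np.allclose(A.sum(axis=-1), 1.0)   # each row is a probability distribution

# Unit-variance q, k: raw scores have variance ~d_k; scaled scores ~1.
raw = Q @ K.T
assert raw.var() > 10 * (raw / np.sqrt(d_k)).var()
```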

Multi-Head Attention

Definition

Multi-Head Attention

Instead of computing a single attention function, use $h$ heads in parallel:

$$\text{head}_i = \text{Attention}(XW_Q^{(i)}, XW_K^{(i)}, XW_V^{(i)})$$

where $W_Q^{(i)}, W_K^{(i)} \in \mathbb{R}^{d \times d_k}$ and $W_V^{(i)} \in \mathbb{R}^{d \times d_v}$ with $d_k = d_v = d/h$.

Concatenate the heads and project:

$$\text{MHA}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W_O$$

where $W_O \in \mathbb{R}^{d \times d}$.

Why multiple heads? Each head can attend to different aspects of the input: one head might focus on syntactic relationships, another on semantic similarity, another on positional proximity. Multi-head attention allows the model to jointly attend to information from different representation subspaces. Mechanistic interpretability work gives concrete examples of specialization: previous-token heads that copy information from position $t-1$ to position $t$ (Elhage et al., 2021), induction heads that implement in-context pattern completion (Olsson et al., 2022), and name mover heads that move subject tokens to the final position in the indirect object identification circuit (Wang et al., 2022).

Parameter count for MHA: Each head has $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}$, each of size $d \times (d/h)$. With $h$ heads, total QKV parameters are $3 \times h \times d \times (d/h) = 3d^2$. The output projection $W_O$ adds $d^2$. Total: $4d^2$ parameters (ignoring biases).
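As a sketch of how heads are implemented in practice, the per-head matrices can be stored as a single $d \times d$ projection per role and split by reshaping (a common convention; the function name is illustrative). The assertion at the end checks the $4d^2$ parameter count derived above:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d). W_q, W_k, W_v, W_o: (d, d), each packing all h heads."""
    n, d = X.shape
    d_k = d // h
    # Project once, then split each (n, d) result into h heads of width d/h.
    Q, K, V = ((X @ W).reshape(n, h, d_k).transpose(1, 0, 2)
               for W in (W_q, W_k, W_v))                 # each (h, n, d_k)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)     # (h, n, n)
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)                   # row-wise softmax per head
    heads = A @ V                                        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)      # concatenate heads
    return concat @ W_o                                  # (n, d)

rng = np.random.default_rng(1)
n, d, h = 6, 32, 4
X = rng.standard_normal((n, d))
W_q, W_k, W_v, W_o = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
assert out.shape == (n, d)
# QKV + output projection: 4 * d^2 parameters, matching the count above.
assert sum(W.size for W in (W_q, W_k, W_v, W_o)) == 4 * d * d
```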

Residual Connections and Layer Normalization

Definition

Transformer Sub-Layer with Residual Connection

Each sub-layer (attention or FFN) is wrapped with a residual connection:

$$\text{output} = x + \text{SubLayer}(x)$$

This allows gradients to flow directly through the network and enables training of deep transformers.

Definition

Layer Normalization

For a vector $x \in \mathbb{R}^d$, layer normalization computes:

$$\text{LayerNorm}(x) = \frac{x - \mu}{\sigma} \odot \gamma + \beta$$

where $\mu = \frac{1}{d}\sum_i x_i$, $\sigma = \sqrt{\frac{1}{d}\sum_i (x_i - \mu)^2 + \epsilon}$, and $\gamma, \beta \in \mathbb{R}^d$ are learned scale and shift parameters.

Pre-norm vs. post-norm. The original transformer (Vaswani et al., 2017) uses post-norm: $\text{LayerNorm}(x + \text{SubLayer}(x))$. Most modern LLMs use pre-norm: $x + \text{SubLayer}(\text{LayerNorm}(x))$. Pre-norm is more stable for training deep networks because the residual path is unobstructed.
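A minimal sketch of the layer norm formula above, assuming the $\epsilon$-stabilized variance from the definition; with $\gamma = 1$ and $\beta = 0$ the output has zero mean and (almost exactly) unit standard deviation per token:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token vector to zero mean / unit variance, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    return (x - mu) / sigma * gamma + beta

d = 8
x = np.arange(d, dtype=float)            # an arbitrary un-normalized activation
gamma, beta = np.ones(d), np.zeros(d)    # identity scale/shift
y = layer_norm(x, gamma, beta)
assert abs(y.mean()) < 1e-9
assert abs(y.std() - 1.0) < 1e-3         # slightly below 1 because of eps
```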

Feed-Forward Network

Definition

Position-wise Feed-Forward Network

The original 2017 transformer FFN applies two linear transformations with a nonlinearity:

$$\text{FFN}(x) = W_2 \, \sigma(W_1 x + b_1) + b_2$$

where $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d}$, $W_2 \in \mathbb{R}^{d \times d_{\text{ff}}}$, and $\sigma$ is ReLU (Vaswani et al., 2017) or GELU in later variants. The standard choice is $d_{\text{ff}} = 4d$.

Parameter count for the 2017 FFN. $W_1$ has $d \cdot d_{\text{ff}}$ parameters, $W_2$ has $d_{\text{ff}} \cdot d$ parameters. With $d_{\text{ff}} = 4d$: total is $8d^2$ parameters (ignoring biases).

Definition

Gated FFN (SwiGLU / GeGLU)

Since 2023, essentially every frontier LLM (Llama 2, Llama 3, Mistral, Mixtral, DeepSeek, Qwen, Gemma) replaces the two-matrix FFN with a gated variant that has three projection matrices $W_1, W_2, W_3$:

$$\text{FFN}_{\text{SwiGLU}}(x) = W_2 \left( \sigma(W_1 x) \odot W_3 x \right)$$

where $\odot$ is the elementwise product, $W_1, W_3 \in \mathbb{R}^{d_{\text{ff}} \times d}$, $W_2 \in \mathbb{R}^{d \times d_{\text{ff}}}$, and $\sigma$ is SiLU (SwiGLU) or GELU (GeGLU). The gate $W_3 x$ modulates the activation elementwise.

Parameter count for gated FFN. Three projections give $3 \cdot d \cdot d_{\text{ff}}$ parameters. To keep the parameter budget comparable to the 2017 FFN, Shazeer (2020) recommends $d_{\text{ff}} = \frac{2}{3} \cdot 4d = \frac{8d}{3}$, which yields $3 \cdot d \cdot \frac{8d}{3} = 8d^2$ parameters. This matches the legacy $12d^2$ per-layer formula used below. Llama 2 7B ($d = 4096$) picks $d_{\text{ff}} = 11008 \approx 8 \cdot 4096 / 3 = 10922$, rounded up to a multiple of 256 for kernel alignment.
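The gated FFN and its parameter budget can be checked directly. This sketch uses an illustrative $d = 96$, chosen so that $d_{\text{ff}} = 8d/3 = 256$ is exact and the $8d^2$ budget matches to the parameter:

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def ffn_swiglu(x, W1, W2, W3):
    """Gated FFN: W2 (silu(W1 x) ⊙ W3 x), with W1, W3: (d_ff, d) and W2: (d, d_ff)."""
    return W2 @ (silu(W1 @ x) * (W3 @ x))

d = 96
d_ff = 8 * d // 3          # Shazeer's 2/3 * 4d; real models round up to a multiple of 256
rng = np.random.default_rng(2)
W1, W3 = rng.standard_normal((2, d_ff, d)) * 0.05
W2 = rng.standard_normal((d, d_ff)) * 0.05
y = ffn_swiglu(rng.standard_normal(d), W1, W2, W3)
assert y.shape == (d,)
# Three projections at d_ff = 8d/3 match the classic two-matrix 8 d^2 budget.
assert 3 * d * d_ff == 8 * d * d
```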

The role of the FFN. Geva et al. (2021) proposed that FFN layers act as key-value memories: $W_1$ maps inputs to a high-dimensional space where patterns are detected, and $W_2$ maps back to the residual stream with the associated information. Under this view, the FFN is where factual knowledge is primarily stored. This is an empirical interpretation from mechanistic interpretability, not a settled fact. Hase et al. (2023) show that the location where Causal Tracing localizes a fact is not a reliable predictor of which layer is best to edit, which complicates the simple "localization equals storage" reading.

Positional Encoding

Self-attention is permutation-equivariant: shuffling the input tokens shuffles the output tokens identically. Without positional information, the model cannot distinguish "the dog bit the man" from "the man bit the dog."

Definition

Sinusoidal Positional Encoding

The original transformer uses fixed sinusoidal encodings added to the input embeddings:

$$\text{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad \text{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

for position $pos$ and dimension $i$. This allows the model to attend to relative positions because $\text{PE}(pos + k)$ is a linear function of $\text{PE}(pos)$.

Definition

Rotary Position Embedding (RoPE)

RoPE (Su et al., 2021) encodes position by rotating the query and key vectors in $d/2$ independent 2D subspaces:

$$\tilde{q}_m = R_m q, \quad \tilde{k}_n = R_n k$$

where $R_m$ is a block-diagonal matrix of $d/2$ planar rotations, with the $i$-th block a $2 \times 2$ rotation by angle $m\theta_i$, using the base frequencies

$$\theta_i = 10000^{-2i/d}, \quad i = 0, 1, \ldots, d/2 - 1.$$

Rotations that act on the same 2D subspace commute and satisfy $R_m^\top R_n = R_{n-m}$ block-by-block. Since $R_m$ is block-diagonal, the full matrix product also satisfies $R_m^\top R_n = R_{n-m}$, so the attention score becomes:

$$\tilde{q}_m^\top \tilde{k}_n = q^\top R_m^\top R_n k = q^\top R_{n-m} k.$$

This depends only on the relative position $n - m$, giving the model translation-invariant attention.

Why RoPE dominates. RoPE naturally encodes relative positions (not absolute), extrapolates better to longer sequences than seen during training, and does not add parameters. It is used in Llama, Mistral, and most modern open-source LLMs.
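The relative-position property can be verified numerically. This sketch applies the $d/2$ planar rotations to interleaved dimension pairs (one common layout; implementations differ in how they pair dimensions), then checks that the score depends only on the offset $n - m$:

```python
import numpy as np

def rope(x, m):
    """Rotate vector x (even length d) at position m in d/2 independent 2D subspaces."""
    d = x.shape[-1]
    theta = 10000.0 ** (-2 * np.arange(d // 2) / d)  # base frequencies theta_i
    angles = m * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                        # pair up (even, odd) dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                  # 2x2 rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(3)
d = 64
q, k = rng.standard_normal((2, d))
# Same offset n - m = 4 at two different absolute positions gives the same score.
s1 = rope(q, m=5) @ rope(k, m=9)
s2 = rope(q, m=105) @ rope(k, m=109)
assert np.isclose(s1, s2)
```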

Computational Complexity

Proposition

Attention is Quadratic in Sequence Length

Statement

The computational cost of self-attention is:

$$O(n^2 d)$$

The $n^2$ factor comes from computing the $n \times n$ attention matrix $QK^\top$. The memory cost for storing attention weights is $O(n^2)$ per head.

For a full transformer with $L$ layers and $h$ heads:

  • Attention cost per layer: $O(n^2 d)$
  • FFN cost per layer: $O(n d \cdot d_{\text{ff}}) = O(nd^2)$ with $d_{\text{ff}} = 4d$
  • Total cost: $O(L(n^2 d + n d^2))$

Intuition

Every token must attend to every other token, producing an $n \times n$ matrix. For short sequences ($n \ll d$), the FFN dominates. For long sequences ($n \gg d$), attention dominates. This is why extending context length is hard: doubling $n$ quadruples the attention cost.
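A back-of-the-envelope comparison makes the crossover concrete (constant factors below are illustrative, using the 2-FLOPs-per-multiply-add convention, not exact kernel counts); with these constants attention overtakes the FFN around $n \approx 4d$:

```python
def attn_flops(n, d):
    """Per-layer attention matmuls: QK^T and A @ V, each ~2 n^2 d FLOPs."""
    return 2 * n * n * d * 2

def ffn_flops(n, d, d_ff=None):
    """Per-layer FFN: up- and down-projection, each ~2 n d d_ff FLOPs."""
    d_ff = d_ff or 4 * d
    return 2 * n * d * d_ff * 2

d = 4096
# Doubling n quadruples attention cost but only doubles FFN cost.
assert attn_flops(2048, d) == 4 * attn_flops(1024, d)
assert ffn_flops(2048, d) == 2 * ffn_flops(1024, d)
# Short sequences: FFN dominates. Long sequences: attention dominates.
assert attn_flops(1024, d) < ffn_flops(1024, d)
assert attn_flops(100_000, d) > ffn_flops(100_000, d)
```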

Why It Matters

The quadratic cost is the fundamental bottleneck for long-context models. A model processing 100K tokens needs attention matrices with $10^{10}$ entries per layer. This has motivated extensive research into efficient attention: sparse attention, linear attention, FlashAttention (which reduces memory but not FLOPs), and sub-quadratic architectures like Mamba.

Failure Mode

The $O(n^2)$ scaling is for standard dense attention. Methods like FlashAttention reduce the memory cost from $O(n^2)$ to $O(n)$ by computing attention in tiles, but the compute cost remains $O(n^2 d)$. True sub-quadratic compute requires architectural changes (sparse or linear attention), which can reduce model quality.

Parameter Counting

Proposition

Transformer Parameter Count

Statement

A decoder-only transformer with $L$ layers has approximately:

$$\text{Total params} \approx V \cdot d + L \cdot 12d^2 + V \cdot d$$

Breaking this down:

  • Token embedding: $V \times d$ parameters
  • Per layer:
    • Multi-head attention (QKV + output): $4d^2$
    • FFN (two linear layers): $8d^2$
    • Layer norm (2 per layer): $4d$ (negligible)
    • Subtotal: $\approx 12d^2$ per layer
  • Output projection (often tied with embedding): $V \times d$

For GPT-3 scale ($L = 96$, $d = 12288$, $V = 50257$): approximately $12288 \times 50257 + 96 \times 12 \times 12288^2 \approx 175$B parameters.

This is an order-of-magnitude approximation. It ignores biases, layer norm scale and shift, the final output projection head, and absolute positional embeddings, and it double-counts in the presence of weight tying. Many modern implementations tie the input embedding $V \times d$ with the output unembedding matrix, so only one $V \times d$ block is counted. The headline 175B figure works out because the omitted and double-counted terms approximately cancel against the neglected layer norm and bias parameters, not because $12d^2 L$ plus two embedding copies is exact.
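With those caveats, the approximation is one line of arithmetic; this sketch (function name illustrative) reproduces the GPT-3 headline number with untied embeddings:

```python
def approx_params(L, d, V, tied=True):
    """12 d^2 per layer plus embedding (and unembedding if untied); biases/LN ignored."""
    emb = V * d if tied else 2 * V * d
    return emb + L * 12 * d * d

gpt3 = approx_params(L=96, d=12288, V=50257, tied=False)
print(f"{gpt3 / 1e9:.1f}B")   # prints 175.2B
```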

Intuition

The vast majority of parameters are in the transformer layers, not the embeddings (unless the vocabulary is very large). Within each layer, the FFN contains $2/3$ of the parameters ($8d^2$ vs. $4d^2$ for attention). This is why the FFN layers are where most of the model's knowledge capacity resides.

Why It Matters

Parameter counting is essential for: (1) estimating compute costs for training and inference, (2) understanding scaling laws, (3) comparing architectures, and (4) estimating memory requirements. A model with $N$ parameters in fp16 requires $2N$ bytes of memory just for weights, plus additional memory for activations and optimizer states. Techniques like speculative decoding and quantization reduce these costs at serving time.

A Complete Transformer Block

Putting it all together, one transformer block computes (using pre-norm):

$$h = x + \text{MHA}(\text{LayerNorm}(x))$$

$$\text{output} = h + \text{FFN}(\text{LayerNorm}(h))$$

The full model stacks LL such blocks, preceded by token embedding + positional encoding and followed by a final layer norm and linear output projection to vocabulary logits.
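The two equations above can be sketched end-to-end in NumPy (a minimal illustration: the learned LayerNorm scale/shift is omitted, the FFN uses the 2017 ReLU form, and weights are stored for right-multiplication, unlike the $(d_{\text{ff}} \times d)$ convention earlier):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, h, d_ff = 5, 32, 4, 128

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def ln(x):
    """LayerNorm without the learned gamma/beta, for brevity."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + 1e-5)

def mha(X, Wq, Wk, Wv, Wo):
    d_k = d // h
    Q, K, V = ((X @ W).reshape(n, h, d_k).transpose(1, 0, 2) for W in (Wq, Wk, Wv))
    A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))   # (h, n, n)
    return (A @ V).transpose(1, 0, 2).reshape(n, d) @ Wo   # concat heads, project

def block(X, Wq, Wk, Wv, Wo, W1, W2):
    hid = X + mha(ln(X), Wq, Wk, Wv, Wo)            # pre-norm attention sub-layer
    return hid + np.maximum(ln(hid) @ W1, 0) @ W2   # pre-norm ReLU FFN sub-layer

Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
W1 = rng.standard_normal((d, d_ff)) * 0.1
W2 = rng.standard_normal((d_ff, d)) * 0.1
X = rng.standard_normal((n, d))
out = block(X, Wq, Wk, Wv, Wo, W1, W2)
assert out.shape == (n, d)   # each block maps (n, d) -> (n, d), so blocks stack
```

Because each block maps $(n, d)$ to $(n, d)$, stacking $L$ of them is just repeated application.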

Common Confusions

Watch Out

Attention is not a learned weight matrix

The attention weights $\alpha_{ij}$ are computed dynamically from the input. They change for every input sequence. The learned parameters are $W_Q, W_K, W_V, W_O$, which determine how attention is computed. This input-dependence is what gives transformers their flexibility compared to fixed-weight architectures.

Watch Out

Multi-head attention does not multiply the cost by h

Each head operates on $d/h$ dimensions, so the total computation across all heads is the same as a single head with full $d$ dimensions. Multi-head attention is a reorganization of computation, not a multiplication.

Watch Out

FlashAttention reduces memory, not FLOPs

FlashAttention computes the same mathematical operation as standard attention. It reduces memory from $O(n^2)$ to $O(n)$ by computing attention in blocks and never materializing the full $n \times n$ matrix. But the number of floating-point operations is unchanged. True compute savings require architectural changes.

Summary

  • Self-attention: $\text{Attention}(Q,K,V) = \text{softmax}(QK^\top / \sqrt{d_k})\,V$
  • Multi-head attention: $h$ parallel heads with $d_k = d/h$, concatenated and projected
  • Each transformer block: attention + residual + LayerNorm + FFN + residual + LayerNorm
  • Attention cost is $O(n^2 d)$: quadratic in sequence length
  • FFN cost is $O(nd^2)$: dominates for short sequences
  • Per-layer parameters: $\approx 12d^2$ (attention $4d^2$ + FFN $8d^2$). Modern LLMs replace the two-matrix FFN with a SwiGLU or GeGLU gated FFN that has three projections and $d_{\text{ff}} \approx 8d/3$, which preserves the $8d^2$ budget. Mixture-of-experts variants sparsely activate a subset of FFN parameters.
  • RoPE gives relative position encoding via rotation of $Q$ and $K$
  • Pre-norm (LayerNorm before the sub-layer) is standard in modern LLMs

Exercises

ExerciseCore

Problem

For a transformer with $d = 512$, $h = 8$ heads, and $d_{\text{ff}} = 2048$, compute the number of parameters in one transformer block (ignoring biases and layer norm parameters).

ExerciseCore

Problem

If the sequence length doubles from $n$ to $2n$, by what factor does the attention computation cost increase? By what factor does the FFN computation cost increase?

ExerciseAdvanced

Problem

Show that without positional encoding, self-attention is permutation-equivariant: if you permute the input tokens by a permutation $\sigma$, the output tokens are permuted by the same $\sigma$.

ExerciseAdvanced

Problem

A transformer model has $L = 32$ layers, $d = 4096$, $h = 32$ heads, $d_{\text{ff}} = 11008$ (as in Llama 2 7B, which uses a SwiGLU-gated FFN with $d_{\text{ff}} \approx 8d/3$, rounded to a multiple of 256 for kernel alignment), and vocabulary $V = 32000$. Estimate the total parameter count and the memory required to store weights in fp16.


References

Canonical:

  • Vaswani et al., "Attention Is All You Need" (2017). The original transformer paper, Sections 3.1-3.3 and 3.5 for architecture and positional encoding.

Current:

  • Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021). RoPE, Sections 3.1-3.4 for the $\theta_i = 10000^{-2i/d}$ construction.
  • Shazeer, "GLU Variants Improve Transformer" (2020). Motivates SwiGLU and GeGLU, and recommends shrinking dffd_{\text{ff}} by 2/32/3 to match the 2017 parameter budget.
  • Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models" (2023). SwiGLU-gated FFN with dff=11008d_{\text{ff}} = 11008 at d=4096d = 4096, Section 2.
  • Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022).

Mechanistic interpretability:

  • Elhage et al., "A Mathematical Framework for Transformer Circuits" (Anthropic, 2021). Previous-token heads and the residual-stream view.
  • Olsson et al., "In-context Learning and Induction Heads" (Anthropic, 2022). Induction heads as a mechanism for in-context learning.
  • Wang et al., "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small" (2022). Name mover heads.
  • Geva et al., "Transformer Feed-Forward Layers Are Key-Value Memories" (EMNLP 2021). FFN-as-memory hypothesis.
  • Hase et al., "Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models" (NeurIPS 2023). Evidence that Causal Tracing localizations do not predict editable layers.

Textbooks:

  • Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12.
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapters 10-12.

Next Topics

The natural next steps from transformer architecture:

  • Mechanistic interpretability: what do the attention heads and FFN layers actually compute?
  • Hallucination theory: why the next-token prediction objective leads to confabulation
  • RLHF and alignment: fine-tuning the transformer for human preferences
  • Vision transformer lineage: how the transformer was adapted for computer vision (ViT, Swin, DINO, CLIP)

Last reviewed: April 2026
