
Systems

Computer Architecture for ML

Memory movement often matters more than FLOPs. Memory hierarchy, bandwidth vs. compute, roofline model, GPU architecture, kernel fusion, and why parameter count alone does not tell the computational story.


Why This Matters

The gap between peak FLOPs and actual training throughput can be 10x or more. The bottleneck is usually memory bandwidth, not arithmetic. A model with fewer FLOPs can be slower than one with more FLOPs if the smaller model makes more memory round-trips. Understanding the hardware stack (memory hierarchy, bandwidth limits, GPU execution model) lets you predict which operations are slow, why certain optimizations work, and where to look when training is not saturating the hardware.

FlashAttention, kernel fusion, quantization, and KV-cache management are all motivated by the same principle: minimize data movement through the memory hierarchy. All of this builds on floating-point arithmetic fundamentals.

The Memory Hierarchy

Every computation involves moving data from storage to processing units. Closer memory is faster but smaller:

| Level | Capacity (typical) | Bandwidth | Latency |
| --- | --- | --- | --- |
| Registers | ~few KB per SM | tens of TB/s | < 1 ns |
| L1 cache / shared memory | 128-228 KB per SM | ~20 TB/s | ~5 ns |
| L2 cache | 40-50 MB total | ~5 TB/s | ~50 ns |
| HBM (global memory) | 40-80 GB | 1-3 TB/s | ~200 ns |
| CPU DRAM | 256 GB-2 TB | 100-400 GB/s | ~100 ns |
| NVMe SSD | TB-scale | 5-15 GB/s | ~10 µs |

For an A100 GPU: HBM bandwidth is 2 TB/s, and peak FP16 compute is 312 TFLOPS. For an H100 with HBM3: bandwidth is 3.35 TB/s, and peak FP16 compute is 989 TFLOPS. The ratio of compute to bandwidth keeps increasing, making memory bandwidth the tighter constraint for more operations.

Arithmetic Intensity and the Roofline Model

Definition

Arithmetic Intensity

The arithmetic intensity of an operation is the ratio of floating-point operations performed to bytes of memory transferred:

I = \frac{\text{FLOPs}}{\text{bytes moved}}

An operation with low arithmetic intensity (e.g., elementwise addition: 1 FLOP per 12 bytes for FP32 read-read-write) is memory-bound. An operation with high arithmetic intensity (e.g., large matrix multiply) is compute-bound.

Definition

Roofline Model

The roofline model predicts the attainable performance of an operation as:

\text{Performance} = \min(\text{peak compute}, \; \text{bandwidth} \times I)

where $I$ is the arithmetic intensity. The attainable performance is a function of $I$ that increases linearly with $I$ until it hits the compute ceiling. The crossover point $I^* = \text{peak compute} / \text{bandwidth}$ is the ridge point. Operations with $I < I^*$ are memory-bound; operations with $I > I^*$ are compute-bound.
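The two-regime structure can be sketched numerically. A minimal Python sketch, using the A100 figures quoted earlier (312 TFLOPS FP16 peak compute, 2 TB/s HBM bandwidth); `roofline` is an illustrative helper, not a profiler:

```python
def roofline(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Attainable throughput (FLOPs/s) under the roofline model:
    min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_flops, bandwidth * intensity)

# A100: 312 TFLOPS FP16 peak compute, 2 TB/s HBM bandwidth
P, B = 312e12, 2e12
ridge = P / B  # ridge point I* in FLOPs/byte

print(f"ridge point: {ridge:.0f} FLOPs/byte")
print(f"I = 1:    {roofline(1.0, P, B):.3g} FLOPs/s (memory-bound)")
print(f"I = 2048: {roofline(2048.0, P, B):.3g} FLOPs/s (compute-bound)")
```

Below the ridge point, throughput scales linearly with intensity; above it, the compute ceiling is the binding constraint.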

Proposition

Roofline Performance Bound

Statement

The maximum attainable throughput $T$ (in FLOPs/s) of an operation with arithmetic intensity $I$ on hardware with peak compute $P$ and peak bandwidth $B$ satisfies:

T \leq \min(P, \; B \cdot I)

The operation is memory-bound when $I < P/B$ and compute-bound when $I > P/B$.

Intuition

You cannot compute faster than the hardware allows ($P$), and you cannot compute faster than you can feed data to the compute units ($B \cdot I$). The tighter constraint wins. For most elementwise operations and small matrix multiplies, memory bandwidth is the bottleneck. For large matrix multiplies (the dominant operation in transformer training), compute is the bottleneck, but only if batch size is large enough.

Proof Sketch

The operation requires $F$ FLOPs and $M$ bytes of memory transfer, so $I = F/M$. The time is at least $F/P$ (compute-limited) and at least $M/B$ (bandwidth-limited). Since $M = F/I$, the bandwidth-limited time is $F/(B \cdot I)$. The actual time is $\max(F/P, F/(B \cdot I)) = F / \min(P, B \cdot I)$, giving throughput $F/\text{time} = \min(P, B \cdot I)$.

Why It Matters

The roofline model explains why the same operation can be fast or slow depending on dimensions. A matrix multiply $C = AB$ with $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$ in FP16 performs $2mkn$ FLOPs and moves $2(mk + kn + mn)$ bytes, so $I = mkn/(mk + kn + mn)$, which scales with $\min(m, n, k)$. For $m = n = k = 4096$, $I \approx 1365$ FLOPs/byte, well above the ridge point. For $m = 1$ (single-vector inference), $I \approx 1$ FLOP/byte, deeply memory-bound. This is why batched inference is so much faster per token than single-token inference.
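The intensity calculation is easy to check directly. A minimal sketch under the same idealized assumption (each matrix crosses HBM exactly once); `matmul_intensity` is an illustrative helper:

```python
def matmul_intensity(m: int, k: int, n: int, bytes_per_elt: int = 2) -> float:
    """Arithmetic intensity of C = A @ B with A (m,k), B (k,n).

    FLOPs: 2*m*k*n (one multiply and one add per inner-product term).
    Bytes: read A and B, write C, each touching HBM exactly once
    (an idealization that ignores cache reuse and re-reads).
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elt * (m * k + k * n + m * n)
    return flops / bytes_moved

print(matmul_intensity(4096, 4096, 4096))  # ~1365 FLOPs/byte: compute-bound
print(matmul_intensity(1, 4096, 4096))     # ~1 FLOP/byte: memory-bound
```

The square case sits far above an A100's ridge point of 156 FLOPs/byte; the single-row case sits far below it.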

Failure Mode

The roofline model assumes perfect parallelism and no cache effects. In practice, cache misses, bank conflicts, warp divergence, and imperfect occupancy reduce actual throughput below the roofline. The model gives an upper bound, not a prediction. It also ignores instruction overhead and assumes the operation can saturate either compute or memory independently.

GPU Architecture

A modern GPU (e.g., NVIDIA A100, H100) consists of:

Streaming Multiprocessors (SMs). The A100 has 108 SMs, the H100 has 132. Each SM contains CUDA cores (for scalar FP32/FP64), Tensor Cores (for matrix multiply-accumulate in FP16/BF16/FP8/INT8), registers (64K 32-bit registers per SM), and shared memory (up to 228KB on H100, configurable with L1 cache).

Warp scheduling. Threads are organized in groups of 32 called warps. All threads in a warp execute the same instruction (SIMT: Single Instruction, Multiple Threads). When one warp stalls on a memory access, the scheduler switches to another ready warp. High occupancy (many warps per SM) hides memory latency.

Memory hierarchy on GPU. Registers are fastest but per-thread. Shared memory is shared within a thread block (on a single SM), with ~20 TB/s bandwidth. Global memory (HBM) is accessible to all SMs but has ~100x higher latency than shared memory.

Tensor Cores. Specialized matrix multiply units that compute $D = A \times B + C$ on small tiles (e.g., 16x16x16 in FP16). On H100, Tensor Cores provide 989 TFLOPS in FP16 vs. ~60 TFLOPS for FP32 CUDA cores. Matrix multiplies that map onto Tensor Core tile sizes are dramatically faster. Dimensions should be multiples of 8 (FP16) or 16 (INT8) for optimal Tensor Core utilization.

Kernel Fusion

A kernel is a function launched on the GPU. Each kernel reads inputs from global memory, computes, and writes outputs to global memory. A sequence of operations like LayerNorm followed by a linear layer followed by GELU activation would naively require 3 kernel launches and 3 round-trips to global memory for intermediate results.

Kernel fusion combines multiple operations into a single kernel. The intermediate results stay in registers or shared memory, avoiding global memory round-trips. The savings can be substantial: if each operation is memory-bound, fusing $k$ operations reduces memory traffic by up to a factor of $k$.

Examples of fusion in practice:

  • FlashAttention fuses the attention computation ($QK^T$, softmax, $AV$) into a single kernel that tiles the computation to fit in SRAM, reducing HBM reads from $O(N^2)$ to $O(N)$.
  • Fused optimizer kernels apply weight update rules (Adam update, weight decay, gradient unscaling) in one pass over parameters.
  • Fused attention + LayerNorm avoids materializing attention output to HBM before LayerNorm reads it back.
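The traffic savings can be estimated with a back-of-the-envelope sketch, assuming each unfused elementwise op does one full read and one full write of the tensor (a simplification; real kernels may read several inputs):

```python
def elementwise_traffic_bytes(n_elements: int, n_ops: int,
                              bytes_per_elt: int = 2,
                              fused: bool = False) -> int:
    """HBM traffic for a chain of unary elementwise ops.

    Unfused: each op reads and writes the full tensor (n_ops round-trips).
    Fused: one read and one write total; intermediates stay in
    registers or shared memory.
    """
    round_trips = 1 if fused else n_ops
    return round_trips * 2 * n_elements * bytes_per_elt

n = 32 * 2048 * 4096  # activations of shape (B, N, d) in FP16
unfused = elementwise_traffic_bytes(n, n_ops=3)
fused = elementwise_traffic_bytes(n, n_ops=3, fused=True)
print(unfused / fused)  # 3.0: fusing k memory-bound ops cuts traffic k-fold
```

Since each op here is memory-bound, the ~3x reduction in bytes moved translates almost directly into a ~3x reduction in wall-clock time.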

Why Transformers Are Often Memory-Bound

The dominant operation in a transformer forward pass is matrix multiplication (QKV projections, attention, FFN layers). For large batch sizes, these are compute-bound and use Tensor Cores well. But several components are memory-bound:

  • Softmax: reads and writes $O(N^2)$ values with minimal computation per element.
  • LayerNorm: elementwise operations with low arithmetic intensity.
  • Activation functions (GELU, SiLU): elementwise, one FLOP per element.
  • Residual connections: elementwise addition, one FLOP per three memory operations (read two inputs, write one output).

For autoregressive inference (batch size 1, generating one token at a time), even the matrix multiplies are memory-bound because $I \approx 1$ FLOP/byte. The entire forward pass becomes a memory bandwidth exercise.
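The bandwidth bound on generation speed can be estimated directly. A minimal sketch, assuming every weight must be streamed from HBM once per generated token and ignoring KV-cache reads and activation compute (the 7B-on-A100 numbers are a hypothetical example):

```python
def min_time_per_token_s(n_params: float, bytes_per_param: int,
                         hbm_bandwidth: float) -> float:
    """Lower bound on per-token latency at batch size 1: the time to
    stream all weights from HBM once (ignores KV-cache and compute)."""
    return n_params * bytes_per_param / hbm_bandwidth

# Hypothetical example: 7B parameters in FP16 on an A100 (2 TB/s HBM)
t = min_time_per_token_s(7e9, 2, 2e12)
print(f"{t * 1e3:.1f} ms/token")  # 7.0 ms -> at most ~143 tokens/s
```

No amount of extra compute helps here; only higher bandwidth, fewer bytes per parameter (quantization), or batching changes the bound.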

Quantization

Definition

Quantization

Quantization represents model parameters or activations in lower precision than FP32, reducing memory footprint and (on supported hardware) increasing compute throughput:

  • FP16/BF16: 2 bytes per value. Tensor Cores achieve 2x throughput vs FP32. BF16 has the same exponent range as FP32 (8 bits), avoiding overflow issues that plague FP16. See mixed precision training for details on training with reduced precision.
  • INT8: 1 byte per value. 4x memory reduction vs FP32. Requires calibration to map FP32 ranges to INT8 without excessive quantization error.
  • FP8 (E4M3, E5M2): H100 Tensor Cores support FP8, achieving up to 2x throughput vs FP16. E4M3 has more mantissa bits (inference); E5M2 has more exponent range (training gradients).
  • INT4/NF4: 0.5 bytes per value. Used in QLoRA and GPTQ for inference. Significant accuracy loss without careful calibration or mixed-precision approaches.

The memory bandwidth savings from quantization directly translate to faster memory-bound operations. An INT8 model reads half the bytes of an FP16 model, approaching 2x speedup on memory-bound operations.
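The calibration step mentioned for INT8 can be illustrated with the simplest scheme, symmetric per-tensor quantization (a minimal sketch; production schemes use per-channel scales, clipping, and outlier handling, and all names here are illustrative):

```python
def quantize_int8_symmetric(xs: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor INT8: map [-max|x|, +max|x|] onto [-127, 127]."""
    scale = max(abs(x) for x in xs) / 127.0
    q = [round(x / scale) for x in xs]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [qi * scale for qi in q]

xs = [0.1, -0.52, 0.33, 0.97, -0.05]
q, scale = quantize_int8_symmetric(xs)
x_hat = dequantize(q, scale)
# Round-trip error is bounded by scale/2 per element
assert all(abs(a - b) <= scale / 2 for a, b in zip(xs, x_hat))
```

The quantization error is proportional to the scale, which is why a single outlier that inflates max|x| degrades accuracy for every other value in the tensor.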

The Memory Wall

The memory wall refers to the growing gap between processor speed and memory bandwidth. From 2010 to 2024, peak GPU FLOPs grew roughly 60x (Fermi to H100) while HBM bandwidth grew roughly 10x. This means the ridge point $I^* = P/B$ has increased from roughly 10 to roughly 300 FLOPs/byte. More operations fall on the memory-bound side of the roofline. Every generation of hardware makes memory-aware algorithm design more important, not less.

Connections to Model Serving

The architecture constraints directly determine serving performance:

  • KV-cache memory: autoregressive generation stores key/value tensors for all previous tokens. For a model with $L$ layers, hidden dimension $d$, and context length $n$, the KV-cache uses $2 \cdot L \cdot d \cdot n$ values (in the serving precision). At $L = 80$, $d = 8192$, $n = 8192$, and FP16, this is ~20 GB per sequence. Batching multiple sequences multiplies this.
  • Batch size and throughput: larger batches increase arithmetic intensity, moving matrix multiplies toward the compute-bound regime. This is why serving systems like vLLM batch aggressively.
  • Latency vs. throughput: single-user latency is dominated by sequential token generation (each token requires a full forward pass). Throughput is maximized by batching, which increases latency per request. The tradeoff is governed by the roofline model.
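The KV-cache formula is easy to sanity-check (a minimal sketch; `kv_cache_bytes` is an illustrative helper):

```python
def kv_cache_bytes(n_layers: int, d_model: int, seq_len: int,
                   batch: int = 1, bytes_per_val: int = 2) -> int:
    """KV-cache size: 2 (K and V) * layers * hidden dim * tokens,
    per sequence, in the serving precision."""
    return 2 * n_layers * d_model * seq_len * batch * bytes_per_val

# L = 80, d = 8192, n = 8192, FP16 -> ~20 GiB per sequence
print(kv_cache_bytes(80, 8192, 8192) / 2**30)  # 20.0 GiB
```

A batch of 4 such sequences already exceeds the 80 GB HBM of a single H100 before counting the weights themselves.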

Common Confusions

Watch Out

FLOPs do not determine wall-clock time

Two operations with the same FLOP count can have vastly different run times. An elementwise operation over $10^9$ elements (~$10^9$ FLOPs) may take longer than a matrix multiply with $10^{10}$ FLOPs because the elementwise operation is memory-bound while the matrix multiply is compute-bound and uses Tensor Cores. Always consider arithmetic intensity, not just FLOP count.

Watch Out

Parameter count does not equal memory cost

A 7B-parameter model in FP16 uses 14 GB for parameters, but training requires optimizer states (Adam: two FP32 copies, for $m$ and $v$), gradients, and activations. The total training memory is typically 4-6x the parameter memory. Inference memory is smaller but includes KV-cache, which grows with context length and batch size.
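A rough per-component breakdown, assuming mixed-precision Adam as described above (activations are excluded since they depend on batch size, sequence length, and checkpointing; some implementations also keep an FP32 master copy of the weights, which this sketch omits):

```python
def training_memory_bytes(n_params: float) -> dict[str, float]:
    """Per-component training memory for mixed-precision Adam
    (activations excluded)."""
    return {
        "fp16 weights":   2 * n_params,
        "fp16 gradients": 2 * n_params,
        "fp32 adam m":    4 * n_params,
        "fp32 adam v":    4 * n_params,
    }

mem = training_memory_bytes(7e9)
print({k: f"{v / 1e9:.0f} GB" for k, v in mem.items()})
print(f"total: {sum(mem.values()) / 1e9:.0f} GB")  # 84 GB
```

For 7B parameters this gives 84 GB, 6x the 14 GB of FP16 weights alone, before any activation memory.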

Watch Out

Tensor Cores require specific dimension alignment

Tensor Cores operate on fixed tile sizes (e.g., 16x16x16 for FP16). If matrix dimensions are not multiples of these tile sizes, padding is required, wasting compute. Hidden dimensions of 768, 1024, 2048, 4096, 8192 are chosen partly because they divide evenly by Tensor Core tile sizes and warp sizes.
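Frameworks handle this padding automatically, but the wasted fraction is easy to estimate (an illustrative sketch; `pad_to_multiple` is a hypothetical helper):

```python
def pad_to_multiple(dim: int, tile: int = 16) -> int:
    """Round dim up to the next multiple of the Tensor Core tile size."""
    return ((dim + tile - 1) // tile) * tile

# Hypothetical example: a hidden dimension of 1000 padded to 1008
dim = 1000
padded = pad_to_multiple(dim, 16)
waste = 1 - dim / padded
print(f"padded {dim} -> {padded}, wasting {waste:.1%} of compute")

# Aligned dimensions like 4096 need no padding at all
assert pad_to_multiple(4096, 16) == 4096
```

The waste per dimension is small, but it compounds across the three dimensions of a matrix multiply, which is one reason standard hidden sizes are powers of two or their small multiples.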

Key Takeaways

  • The roofline model: performance = min(peak compute, bandwidth x arithmetic intensity)
  • Most elementwise operations in transformers are memory-bound, not compute-bound
  • Kernel fusion reduces memory round-trips and is the primary optimization lever for memory-bound operations
  • Quantization reduces memory footprint and bandwidth requirements, directly speeding up memory-bound operations
  • The memory wall is getting worse: the ridge point increases with each hardware generation
  • Autoregressive inference at batch size 1 is almost entirely memory-bandwidth limited
  • KV-cache memory dominates serving cost for long-context models

Exercises

ExerciseCore

Problem

An A100 GPU has 312 TFLOPS FP16 compute and 2 TB/s HBM bandwidth. Compute the ridge point $I^*$ of the roofline model. Is a matrix multiply $C = AB$ with $A \in \mathbb{R}^{4096 \times 4096}$ and $B \in \mathbb{R}^{4096 \times 4096}$ (in FP16) compute-bound or memory-bound?

ExerciseCore

Problem

You apply LayerNorm to a tensor of shape $(B, N, d) = (32, 2048, 4096)$ in FP16. Estimate the arithmetic intensity. Is this operation compute-bound or memory-bound on an A100?

ExerciseAdvanced

Problem

A 70B-parameter model is served in FP16 for autoregressive generation at batch size 1 on an H100 (3.35 TB/s HBM bandwidth). Each token generation requires a full forward pass through all parameters. Estimate the minimum time per token (ignoring KV-cache reads and activation computation). How does this change at batch size 32?

References

Canonical:

  • Hennessy and Patterson, Computer Architecture: A Quantitative Approach (6th ed., 2019), Chapters 2-4
  • Williams, Waterman, Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures" (2009)

GPU Architecture:

  • NVIDIA, CUDA C++ Programming Guide, Chapters on Memory Hierarchy and Execution Model
  • Jia et al., "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking" (2018)

ML-Specific:

  • Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022), Section 2
  • Ivanov et al., "Data Movement Is All You Need: A Case Study on Optimizing Transformers" (2021)
  • Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (2022)

Next Topics

  • FlashAttention: the canonical example of architecture-aware algorithm design for transformers

Last reviewed: April 2026
