Systems
Computer Architecture for ML
Memory movement often matters more than FLOPs. Memory hierarchy, bandwidth vs. compute, roofline model, GPU architecture, kernel fusion, and why parameter count alone does not tell the computational story.
Why This Matters
The gap between peak FLOPs and actual training throughput can be 10x or more. The bottleneck is usually memory bandwidth, not arithmetic. A model with fewer FLOPs can be slower than one with more FLOPs if the smaller model makes more memory round-trips. Understanding the hardware stack (memory hierarchy, bandwidth limits, GPU execution model) lets you predict which operations are slow, why certain optimizations work, and where to look when training is not saturating the hardware.
FlashAttention, kernel fusion, quantization, and KV-cache management are all motivated by the same principle: minimize data movement through the memory hierarchy. All of this builds on floating-point arithmetic fundamentals.
The Memory Hierarchy
Every computation involves moving data from storage to processing units. Closer memory is faster but smaller:
| Level | Capacity (typical) | Bandwidth | Latency |
|---|---|---|---|
| Registers | ~256 KB per SM (64K 32-bit registers) | tens of TB/s | < 1 ns |
| L1 cache / shared memory | 128-228 KB per SM | ~20 TB/s | ~5 ns |
| L2 cache | 40-50 MB total | ~5 TB/s | ~50 ns |
| HBM (global memory) | 40-80 GB | 1-3 TB/s | ~200 ns |
| CPU DRAM | 256 GB-2 TB | 100-400 GB/s | ~100 ns |
| NVMe SSD | TB-scale | 5-15 GB/s | ~10 us |
For an A100 GPU: HBM bandwidth is 2 TB/s, and peak FP16 compute is 312 TFLOPS. For an H100 with HBM3: bandwidth is 3.35 TB/s, and peak FP16 compute is 989 TFLOPS. The ratio of compute to bandwidth keeps increasing, making memory bandwidth the tighter constraint for more operations.
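These compute-to-bandwidth ratios are easy to check with a back-of-envelope calculation; the sketch below uses the spec numbers quoted above:

```python
# Ridge point I* = peak_compute / peak_bandwidth: the arithmetic intensity
# (FLOPs per byte) at which an operation shifts from memory-bound to
# compute-bound. Spec numbers are the ones quoted in the text.

def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOPs/byte) at the roofline crossover."""
    return peak_flops / bandwidth_bytes_per_s

a100 = ridge_point(312e12, 2.0e12)    # FP16 TFLOPS / HBM2e bandwidth
h100 = ridge_point(989e12, 3.35e12)   # FP16 TFLOPS / HBM3 bandwidth

print(f"A100 ridge point: {a100:.0f} FLOPs/byte")   # ~156
print(f"H100 ridge point: {h100:.0f} FLOPs/byte")   # ~295
```

The H100's ridge point is nearly double the A100's: an operation needs almost twice the arithmetic intensity to stay compute-bound on the newer part.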
Arithmetic Intensity and the Roofline Model
Arithmetic Intensity
The arithmetic intensity $I$ of an operation is the ratio of floating-point operations performed to bytes of memory transferred:

$$I = \frac{\text{FLOPs}}{\text{bytes transferred}}$$
An operation with low arithmetic intensity (e.g., elementwise addition: 1 FLOP per 12 bytes for FP32 read-read-write) is memory-bound. An operation with high arithmetic intensity (e.g., large matrix multiply) is compute-bound.
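The two examples above can be made concrete with small helper functions (an FP32 elementwise add and an FP16 matrix multiply $C = AB$):

```python
# Arithmetic intensity I = FLOPs / bytes moved, for the two cases in the
# text: FP32 elementwise add (read a, read b, write c) and a large FP16
# matrix multiply C = A @ B with A of shape (m, k) and B of shape (k, n).

def intensity_elementwise_add_fp32(n: int) -> float:
    flops = n                        # one add per element
    bytes_moved = 3 * 4 * n          # two reads + one write, 4 bytes each
    return flops / bytes_moved

def intensity_matmul_fp16(m: int, k: int, n: int) -> float:
    flops = 2 * m * k * n                    # multiply-accumulate pairs
    bytes_moved = 2 * (m*k + k*n + m*n)      # A, B read once; C written once
    return flops / bytes_moved

print(intensity_elementwise_add_fp32(1_000_000))   # 1/12 ≈ 0.083 FLOPs/byte
print(intensity_matmul_fp16(4096, 4096, 4096))     # 4096/3 ≈ 1365 FLOPs/byte
```

The four-orders-of-magnitude gap between the two results is the whole story: the add sits far below any GPU's ridge point, the large matmul far above it.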
Roofline Model
The roofline model predicts the attainable performance of an operation as:

$$P = \min(P_{\text{peak}},\; B \cdot I)$$

where $I$ is the arithmetic intensity, $P_{\text{peak}}$ is peak compute throughput, and $B$ is peak memory bandwidth. The attainable performance increases linearly with $I$ until it hits the compute ceiling. The crossover point $I^* = P_{\text{peak}} / B$ is the ridge point. Operations with $I < I^*$ are memory-bound; operations with $I > I^*$ are compute-bound.
Roofline Performance Bound
Statement
The maximum attainable throughput $P$ (in FLOPs/s) of an operation with arithmetic intensity $I$ on hardware with peak compute $P_{\text{peak}}$ and peak bandwidth $B$ satisfies:

$$P \le \min(P_{\text{peak}},\; B \cdot I)$$

The operation is memory-bound when $I < P_{\text{peak}} / B$ and compute-bound when $I > P_{\text{peak}} / B$.
Intuition
You cannot compute faster than the hardware allows ($P \le P_{\text{peak}}$), and you cannot compute faster than you can feed data to the compute units ($P \le B \cdot I$). The tighter constraint wins. For most elementwise operations and small matrix multiplies, memory bandwidth is the bottleneck. For large matrix multiplies (the dominant operation in transformer training), compute is the bottleneck, but only if batch size is large enough.
Proof Sketch
The operation requires $W$ FLOPs and $Q$ bytes of memory transfer, so $I = W/Q$. The execution time is at least $W / P_{\text{peak}}$ (compute limited) and at least $Q / B$ (bandwidth limited). Since $Q = W / I$, the bandwidth-limited time is $W / (B \cdot I)$. The actual time is $t \ge \max(W / P_{\text{peak}},\; W / (B \cdot I))$, giving throughput $P = W / t \le \min(P_{\text{peak}},\; B \cdot I)$.
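The bound itself is one line of code; a minimal sketch using the A100 numbers quoted earlier:

```python
# Roofline bound: attainable throughput is min(peak compute, bandwidth * I).

def roofline_flops(intensity: float,
                   peak_flops: float = 312e12,     # A100 FP16 peak
                   bandwidth: float = 2.0e12) -> float:  # A100 HBM
    return min(peak_flops, bandwidth * intensity)

# Elementwise FP32 add (I = 1/12): bandwidth-limited, far below peak.
print(roofline_flops(1/12))    # ≈ 1.7e11 FLOPs/s, ~0.05% of peak
# Large FP16 matmul (I ≈ 1365): hits the compute ceiling.
print(roofline_flops(1365))    # 3.12e14 FLOPs/s
```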
Why It Matters
The roofline model explains why the same operation can be fast or slow depending on dimensions. A matrix multiply $C = AB$ with $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$ has $I = \frac{2mkn}{2(mk + kn + mn)}$ in FP16, which grows with the dimensions. For $m = n = k = 4096$, $I \approx 1365$ FLOPs/byte, well above the ridge point. For $m = 1$ (single-vector inference), $I \approx 1$ FLOP/byte, deeply memory-bound. This is why batched inference is so much faster per token than single-token inference.
Failure Mode
The roofline model assumes perfect parallelism and no cache effects. In practice, cache misses, bank conflicts, warp divergence, and imperfect occupancy reduce actual throughput below the roofline. The model gives an upper bound, not a prediction. It also ignores instruction overhead and assumes the operation can saturate either compute or memory independently.
GPU Architecture
A modern GPU (e.g., NVIDIA A100, H100) consists of:
Streaming Multiprocessors (SMs). The A100 has 108 SMs, the H100 has 132. Each SM contains CUDA cores (for scalar FP32/FP64), Tensor Cores (for matrix multiply-accumulate in FP16/BF16/FP8/INT8), registers (64K 32-bit registers per SM), and shared memory (up to 228KB on H100, configurable with L1 cache).
Warp scheduling. Threads are organized in groups of 32 called warps. All threads in a warp execute the same instruction (SIMT: Single Instruction, Multiple Threads). When one warp stalls on a memory access, the scheduler switches to another ready warp. High occupancy (many warps per SM) hides memory latency.
Memory hierarchy on GPU. Registers are fastest but per-thread. Shared memory is shared within a thread block (on a single SM), with ~20 TB/s bandwidth. Global memory (HBM) is accessible to all SMs but has ~100x higher latency than shared memory.
Tensor Cores. Specialized matrix multiply units that compute on small tiles (e.g., 16x16x16 in FP16). On H100, Tensor Cores provide 989 TFLOPS in FP16 vs. ~60 TFLOPS for FP32 CUDA cores. Matrix multiplies that map onto Tensor Core tile sizes are dramatically faster. Dimensions should be multiples of 8 (FP16) or 16 (INT8) for optimal Tensor Core utilization.
Kernel Fusion
A kernel is a function launched on the GPU. Each kernel reads inputs from global memory, computes, and writes outputs to global memory. A sequence of operations like LayerNorm followed by a linear layer followed by GELU activation would naively require 3 kernel launches and 3 round-trips to global memory for intermediate results.
Kernel fusion combines multiple operations into a single kernel. The intermediate results stay in registers or shared memory, avoiding global memory round-trips. The savings can be substantial: if each operation is memory-bound, fusing $k$ operations reduces memory traffic by up to $k\times$.
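The traffic accounting behind that factor-of-$k$ claim can be sketched directly (a simplified model that counts only reads of the chain's input and writes of each op's output):

```python
# Rough HBM traffic for a chain of k elementwise ops over n FP16 values.
# Unfused: every op reads its input from HBM and writes its result back.
# Fused: one read of the input, one write of the final output; everything
# in between stays in registers or shared memory.

def traffic_bytes(n: int, k: int, fused: bool, elem_bytes: int = 2) -> int:
    if fused:
        return 2 * n * elem_bytes       # one read + one write total
    return k * 2 * n * elem_bytes       # one read + one write per op

n = 4096 * 4096
print(traffic_bytes(n, k=3, fused=False))  # 3x the fused traffic
print(traffic_bytes(n, k=3, fused=True))
```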
Examples of fusion in practice:
- FlashAttention fuses the attention computation ($QK^T$, softmax, multiplication by $V$) into a single kernel that tiles the computation to fit in SRAM, reducing HBM accesses from $\Theta(Nd + N^2)$ to $\Theta(N^2 d^2 / M)$, where $N$ is sequence length, $d$ is head dimension, and $M$ is SRAM size.
- Fused optimizer kernels apply weight update rules (Adam update, weight decay, gradient unscaling) in one pass over parameters.
- Fused attention + LayerNorm avoids materializing attention output to HBM before LayerNorm reads it back.
Why Transformers Are Often Memory-Bound
The dominant operation in a transformer forward pass is matrix multiplication (QKV projections, attention, FFN layers). For large batch sizes, these are compute-bound and use Tensor Cores well. But several components are memory-bound:
- Softmax: reads and writes $n$ values with minimal computation per element.
- LayerNorm: elementwise operations with low arithmetic intensity.
- Activation functions (GELU, SiLU): elementwise, one FLOP per element.
- Residual connections: elementwise addition, one FLOP per three memory operations (read two inputs, write one output).
For autoregressive inference (batch size 1, generating one token at a time), even the matrix multiplies are memory-bound: each FP16 weight is read once from HBM for only two FLOPs, giving $I \approx 1$ FLOP/byte. The entire forward pass becomes a memory bandwidth exercise.
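To see how batching changes this, consider the intensity of an activation-by-weight multiply $x W$ as a function of batch size (an illustrative calculation; the $8192$ dimensions are an arbitrary example):

```python
# Arithmetic intensity of x @ W for a (b, k) FP16 activation and a (k, n)
# FP16 weight matrix, as a function of batch size b. At b = 1 the weight
# read dominates the traffic and I ≈ 1 FLOP/byte; larger batches amortize
# the weight traffic across more useful FLOPs.

def matmul_intensity_fp16(b: int, k: int, n: int) -> float:
    flops = 2 * b * k * n
    bytes_moved = 2 * (b*k + k*n + b*n)   # activations in, weights in, out
    return flops / bytes_moved

for b in (1, 8, 32, 256):
    print(b, round(matmul_intensity_fp16(b, 8192, 8192), 1))
```

Intensity grows roughly linearly with batch size until it crosses the ridge point, which is exactly why serving systems batch requests.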
Quantization
Quantization
Quantization represents model parameters or activations in lower precision than FP32, reducing memory footprint and (on supported hardware) increasing compute throughput:
- FP16/BF16: 2 bytes per value. Tensor Cores achieve 2x throughput vs FP32. BF16 has the same exponent range as FP32 (8 bits), avoiding overflow issues that plague FP16. See mixed precision training for details on training with reduced precision.
- INT8: 1 byte per value. 4x memory reduction vs FP32. Requires calibration to map FP32 ranges to INT8 without excessive quantization error.
- FP8 (E4M3, E5M2): H100 Tensor Cores support FP8, achieving up to 2x throughput vs FP16. E4M3 has more mantissa bits (inference); E5M2 has more exponent range (training gradients).
- INT4/NF4: 0.5 bytes per value. Used in QLoRA and GPTQ for inference. Significant accuracy loss without careful calibration or mixed-precision approaches.
The memory bandwidth savings from quantization directly translate to faster memory-bound operations. An INT8 model reads half the bytes of an FP16 model, approaching 2x speedup on memory-bound operations.
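The footprint arithmetic for the precisions listed above is straightforward; a small sketch:

```python
# Parameter memory at the precisions discussed above (bytes per value).
BYTES_PER_VALUE = {
    "fp32": 4.0, "fp16": 2.0, "bf16": 2.0,
    "int8": 1.0, "fp8": 1.0, "int4": 0.5,
}

def param_memory_gb(n_params: float, dtype: str) -> float:
    return n_params * BYTES_PER_VALUE[dtype] / 1e9

print(param_memory_gb(7e9, "fp16"))   # 14.0 GB
print(param_memory_gb(7e9, "int8"))   # 7.0 GB
print(param_memory_gb(7e9, "int4"))   # 3.5 GB
```

For a memory-bound forward pass, these bytes are roughly proportional to time: halving the bytes read approaches halving the latency.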
The Memory Wall
The memory wall refers to the growing gap between processor speed and memory bandwidth. From 2010 to 2024, peak GPU FLOPs grew roughly 60x (Fermi to H100) while HBM bandwidth grew roughly 10x. This means the ridge point has increased from roughly 10 to roughly 300 FLOPs/byte. More operations fall on the memory-bound side of the roofline. Every generation of hardware makes memory-aware algorithm design more important, not less.
Connections to Model Serving
The architecture constraints directly determine serving performance:
- KV-cache memory: autoregressive generation stores key/value tensors for all previous tokens. For a model with $L$ layers, hidden dimension $d$, and context length $n$, the KV-cache uses $2 \cdot L \cdot n \cdot d$ values (in the serving precision). At $L = 80$, $d = 8192$, $n = 8192$, and FP16, this is ~20 GB per sequence. Batching multiple sequences multiplies this.
- Batch size and throughput: larger batches increase arithmetic intensity, moving matrix multiplies toward the compute-bound regime. This is why serving systems like vLLM batch aggressively.
- Latency vs. throughput: single-user latency is dominated by sequential token generation (each token requires a full forward pass). Throughput is maximized by batching, which increases latency per request. The tradeoff is governed by the roofline model.
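The KV-cache formula above is worth coding up; the example dimensions ($L = 80$, $d = 8192$) are the ones used in the bullet list:

```python
# KV-cache size: 2 (K and V) * layers * tokens * hidden_dim values,
# times bytes per value, times batch size.

def kv_cache_bytes(layers: int, hidden: int, context: int,
                   bytes_per_val: int = 2,   # FP16 serving precision
                   batch: int = 1) -> int:
    return 2 * layers * context * hidden * bytes_per_val * batch

gb = kv_cache_bytes(80, 8192, 8192) / 1e9
print(f"{gb:.1f} GB per sequence")   # ≈ 21.5 GB
```

Note how the cache scales linearly in context length and batch size; at batch 4 the same configuration already exceeds an 80 GB HBM budget.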
Common Confusions
FLOPs do not determine wall-clock time
Two operations with the same FLOP count can have vastly different run times. An elementwise operation over $n$ elements (~$n$ FLOPs) may take longer than a matrix multiply performing many times as many FLOPs, because the elementwise operation is memory-bound while the matrix multiply is compute-bound and uses Tensor Cores. Always consider arithmetic intensity, not just FLOP count.
Parameter count does not equal memory cost
A 7B-parameter model in FP16 uses 14 GB for parameters, but training requires optimizer states (Adam: two FP32 copies for the moments $m$ and $v$), gradients, and activations. The total training memory is typically 4-6x the parameter memory. Inference memory is smaller but includes KV-cache, which grows with context length and batch size.
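One possible accounting, following the breakdown in the paragraph above (FP16 parameters and gradients, FP32 Adam moments; activations excluded because they depend on batch size and sequence length):

```python
# Training-memory breakdown for a model trained with Adam in FP16,
# per the accounting in the text. Activation memory is not included.

def training_memory_gb(n_params: float) -> dict:
    billions = n_params / 1e9
    return {
        "params_fp16": 2 * billions,   # 2 bytes per parameter
        "grads_fp16":  2 * billions,
        "adam_m_fp32": 4 * billions,   # first moment, FP32
        "adam_v_fp32": 4 * billions,   # second moment, FP32
    }

mem = training_memory_gb(7e9)
print(sum(mem.values()))   # 84.0 GB, i.e. 6x the 14 GB of FP16 parameters
```

Schemes that also keep an FP32 master copy of the weights push the multiplier higher still, which is why optimizer-state sharding (ZeRO-style) matters at scale.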
Tensor Cores require specific dimension alignment
Tensor Cores operate on fixed tile sizes (e.g., 16x16x16 for FP16). If matrix dimensions are not multiples of these tile sizes, padding is required, wasting compute. Hidden dimensions of 768, 1024, 2048, 4096, 8192 are chosen partly because they divide evenly by Tensor Core tile sizes and warp sizes.
Key Takeaways
- The roofline model: performance = min(peak compute, bandwidth x arithmetic intensity)
- Most elementwise operations in transformers are memory-bound, not compute-bound
- Kernel fusion reduces memory round-trips and is the primary optimization lever for memory-bound operations
- Quantization reduces memory footprint and bandwidth requirements, directly speeding up memory-bound operations
- The memory wall is getting worse: the ridge point increases with each hardware generation
- Autoregressive inference at batch size 1 is almost entirely memory-bandwidth limited
- KV-cache memory dominates serving cost for long-context models
Exercises
Problem
An A100 GPU has 312 TFLOPS FP16 compute and 2 TB/s HBM bandwidth. Compute the ridge point of the roofline model. Is a matrix multiply with $m = n = 4096$ and $k = 4096$ (in FP16) compute-bound or memory-bound?
Problem
You apply LayerNorm to a tensor of shape $(B, T, d)$ in FP16. Estimate the arithmetic intensity. Is this operation compute-bound or memory-bound on an A100?
Problem
A 70B-parameter model is served in FP16 for autoregressive generation at batch size 1 on an H100 (3.35 TB/s HBM bandwidth). Each token generation requires a full forward pass through all parameters. Estimate the minimum time per token (ignoring KV-cache reads and activation computation). How does this change at batch size 32?
References
Canonical:
- Hennessy and Patterson, Computer Architecture: A Quantitative Approach (6th ed., 2019), Chapters 2-4
- Williams, Waterman, Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures" (2009)
GPU Architecture:
- NVIDIA, CUDA C++ Programming Guide, Chapters on Memory Hierarchy and Execution Model
- Jia et al., "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking" (2018)
ML-Specific:
- Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022), Section 2
- Ivanov et al., "Data Movement Is All You Need: A Case Study on Optimizing Transformers" (2021)
- Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (2022)
Next Topics
- FlashAttention: the canonical example of architecture-aware algorithm design for transformers
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Floating-Point Arithmetic (Layer 0A)