LLM Construction
GPU Compute Model
How GPUs execute ML workloads: streaming multiprocessors, warps, memory hierarchy (registers, SRAM, L2, HBM), arithmetic intensity, the roofline model, and why most ML operations are memory-bound.
Why This Matters
You cannot optimize what you do not understand. Most ML practitioners treat the GPU as a black box that runs matrix multiplies. This works until you need to understand why Flash Attention is fast, why small batch sizes underutilize hardware, or why fusing two operations together can yield a 2-3x speedup.
The key insight: modern GPUs have far more compute throughput than memory bandwidth. Most ML operations spend more time moving data than computing on it. Understanding this asymmetry is the foundation for every systems-level optimization in ML.
Mental Model
A GPU is a massively parallel processor optimized for throughput, not latency. It has thousands of small cores organized into groups, with a multi-level memory hierarchy. Fast memory is small; large memory is slow. The programmer's job is to keep data in fast memory as long as possible and minimize traffic to slow memory.
Execution Model
Streaming Multiprocessor
The basic compute unit on an NVIDIA GPU. Each SM contains multiple CUDA cores (execution units for arithmetic), a register file, shared memory (SRAM), and warp schedulers. An A100 has 108 SMs; an H100 has 132 SMs.
Warp
A group of 32 threads that execute the same instruction simultaneously (SIMT: single instruction, multiple threads). The warp is the fundamental scheduling unit. All 32 threads in a warp execute in lockstep. Divergent branches cause serialization within the warp.
Thread Block
A group of threads (up to 1024) assigned to a single SM. Threads within a block can communicate via shared memory and synchronize with barriers. Threads in different blocks cannot directly communicate.
Memory Hierarchy
From fastest to slowest:
| Level | Size (A100) | Bandwidth | Latency |
|---|---|---|---|
| Registers | 256 KB per SM | ~19 TB/s | 1 cycle |
| Shared Memory (SRAM) | 164 KB per SM | ~19 TB/s | ~20 cycles |
| L2 Cache | 40 MB | ~5 TB/s | ~200 cycles |
| HBM (Global Memory) | 80 GB | 2 TB/s | ~400 cycles |
The ratio of HBM capacity to SRAM capacity is roughly $4000:1$ (total SRAM across all SMs is about 18 MB on A100, against 80 GB of HBM). The bandwidth ratio between registers and HBM is roughly $10:1$ (~19 TB/s vs 2 TB/s). These ratios are the reason IO-aware algorithms exist.
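A quick back-of-envelope check of these ratios, using the A100 figures from the table above (a sketch; exact totals vary by SKU):

```python
# Sanity-check the hierarchy ratios quoted above (A100 figures from the table).

HBM_BYTES = 80e9          # 80 GB global memory
SRAM_BYTES = 164e3 * 108  # 164 KB shared memory per SM x 108 SMs ~ 17.7 MB
SRAM_BW = 19e12           # ~19 TB/s on-chip
HBM_BW = 2e12             # 2 TB/s off-chip

print(f"Total SRAM: {SRAM_BYTES / 1e6:.1f} MB")
print(f"Capacity ratio (HBM : SRAM): {HBM_BYTES / SRAM_BYTES:.0f} : 1")
print(f"Bandwidth ratio (SRAM : HBM): {SRAM_BW / HBM_BW:.1f} : 1")
```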
Arithmetic Intensity and the Roofline Model
Arithmetic Intensity
The number of floating-point operations performed per byte of data transferred between HBM and the compute units. This is the single most important metric for predicting whether an operation is compute-bound or memory-bound.
Roofline Performance Bound
Statement
The achievable throughput of any operation is bounded by:

$$\text{Throughput} \le \min(\pi_{\text{peak}},\; I \times \beta)$$

where $\pi_{\text{peak}}$ is peak compute throughput (FLOP/s), $\beta$ is memory bandwidth (bytes/s), and $I$ is arithmetic intensity (FLOP/byte). The crossover point $I^* = \pi_{\text{peak}} / \beta$ is the ridge point. Operations with $I < I^*$ are memory-bound; operations with $I > I^*$ are compute-bound.
Intuition
If arithmetic intensity is low, the compute units finish their work before the next batch of data arrives from HBM. The GPU sits idle waiting for memory. If arithmetic intensity is high, the memory system delivers data fast enough to keep the compute units busy, and performance hits the compute ceiling.
Proof Sketch
In time $t$, the memory system transfers at most $\beta t$ bytes, which supports at most $I \beta t$ FLOPs. The compute units perform at most $\pi_{\text{peak}} t$ FLOPs. Actual FLOPs $\le \min(\pi_{\text{peak}} t,\; I \beta t)$. Dividing by $t$ gives the throughput bound.
Why It Matters
For an A100 with $\pi_{\text{peak}} = 312$ TFLOP/s (FP16) and $\beta = 2$ TB/s, the ridge point is $I^* = 312/2 = 156$ FLOP/byte. Element-wise operations (ReLU, layer norm, softmax) have $I \approx 0.25$-$1$ FLOP/byte: deeply memory-bound. Only large matrix multiplies ($I$ in the hundreds of FLOP/byte) approach compute-bound territory.
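The ridge point and the resulting bounds can be computed directly. A minimal sketch in Python, using the A100 FP16 figures above (the listed intensities are illustrative examples, not measurements):

```python
# Roofline model: attainable throughput is min(peak compute, AI * bandwidth).
# A100 specs from the text; arithmetic intensities are illustrative.

PEAK_FP16 = 312e12          # FLOP/s
HBM_BW = 2e12               # bytes/s
RIDGE = PEAK_FP16 / HBM_BW  # 156 FLOP/byte

def attainable_tflops(intensity_flop_per_byte):
    """Roofline bound on throughput for a given arithmetic intensity."""
    return min(PEAK_FP16, intensity_flop_per_byte * HBM_BW) / 1e12

for name, ai in [("ReLU (elementwise)", 0.25),
                 ("softmax", 0.75),
                 ("large matmul", 300.0)]:
    regime = "memory-bound" if ai < RIDGE else "compute-bound"
    print(f"{name}: AI={ai} FLOP/B -> <= {attainable_tflops(ai):.2f} "
          f"TFLOP/s ({regime})")
```

Note how far below peak the element-wise operations sit: ReLU is bounded at 0.5 TFLOP/s, about 0.16% of the 312 TFLOP/s compute ceiling.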
Failure Mode
The roofline model assumes perfect overlap of compute and memory access, no cache effects, and no kernel launch overhead. Real performance is always below the roofline. The model identifies the bottleneck but does not predict exact throughput.
Why Batch Size Matters
A single matrix-vector multiply $y = Wx$ with $W \in \mathbb{R}^{N \times N}$ performs $2N^2$ FLOPs and loads roughly $N^2 + N$ elements (the matrix and vector; the matrix dominates). Arithmetic intensity: approximately 2 FLOP/element, or about 1 FLOP/byte in FP16. This is deeply memory-bound.
A batched matrix multiply $Y = WX$ with $X \in \mathbb{R}^{N \times B}$ performs $2N^2B$ FLOPs and loads roughly $N^2 + NB$ elements. As $B$ grows, the matrix is loaded once and reused across all $B$ columns. For $B \ll N$, arithmetic intensity is approximately $2B$ FLOP/element, or $B$ FLOP/byte in FP16. With batch size $B \gtrsim 156$, the operation becomes compute-bound on an A100.
This is why training (large batches) achieves much higher GPU utilization than inference (often batch size 1).
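The crossover can be sketched numerically. The helper below is illustrative (the function name and $N = 4096$ are assumptions, not from the text); it counts HBM traffic for $W$, $X$, and the output $Y$ in FP16:

```python
# Arithmetic intensity of Y = W @ X for W (N x N), X (N x B) in FP16.
# Sketch, assuming W, X, and Y each move through HBM exactly once.

def matmul_intensity_flop_per_byte(N, B):
    flops = 2 * N * N * B                      # multiply-accumulate count
    bytes_moved = 2 * (N * N + N * B + N * B)  # W, X, and Y at 2 bytes/elem
    return flops / bytes_moved

RIDGE = 156  # A100 FP16 ridge point, FLOP/byte
for B in [1, 8, 64, 256, 1024]:
    ai = matmul_intensity_flop_per_byte(4096, B)
    regime = "compute" if ai > RIDGE else "memory"
    print(f"B={B:5d}: AI ~ {ai:6.1f} FLOP/byte ({regime}-bound)")
```

Because this count includes activation and output traffic, the crossover lands slightly above the idealized $B \approx 156$ figure (around $B \approx 169$ for $N = 4096$), but the qualitative picture is the same: batch size 1 sits at ~1 FLOP/byte, two orders of magnitude below the ridge point.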
Kernel Launch Overhead
Every GPU operation is submitted as a kernel: a function that runs on the GPU. Each kernel launch incurs overhead:
- CPU-side dispatch: 5-20 microseconds
- GPU scheduling: a few microseconds
- Memory allocation and argument setup
For large matrix multiplies taking milliseconds, this overhead is negligible. For small element-wise operations taking microseconds, the launch overhead can dominate. This is why kernel fusion (combining multiple operations into one kernel) matters: you pay the launch cost once instead of multiple times.
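A rough model of when launch overhead dominates a memory-bound element-wise kernel (the 10 us launch cost below is an assumed midpoint of the 5-20 us range above, and the function name is illustrative):

```python
# Back-of-envelope: when does kernel launch overhead dominate?
# Assumes a 10 us launch cost (midpoint of the 5-20 us range in the text).

LAUNCH_US = 10.0
HBM_BW = 2e12  # bytes/s (A100)

def elementwise_kernel_us(n_elems, bytes_per_elem=2):
    """Time for a memory-bound elementwise op: read + write each element."""
    bytes_moved = 2 * n_elems * bytes_per_elem
    return bytes_moved / HBM_BW * 1e6

for n in [10**4, 10**6, 10**8]:
    t = elementwise_kernel_us(n)
    regime = "launch-dominated" if t < LAUNCH_US else "kernel-dominated"
    print(f"n={n:>9}: kernel {t:8.2f} us vs launch {LAUNCH_US} us ({regime})")
```

Even a million-element FP16 op takes only ~2 us of memory time, less than a single launch; fusing several such ops into one kernel pays the launch cost once and also keeps intermediates out of HBM.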
Common Confusions
More CUDA cores does not always mean faster
If an operation is memory-bound (most ML operations are), adding more compute cores does not help. The bottleneck is memory bandwidth. An A100 and an H100 have similar HBM bandwidth (~2-3 TB/s), so memory-bound operations run at similar speeds despite the H100 having far more compute.
GPU memory size and memory bandwidth are different things
An 80 GB GPU does not read 80 GB per second. Memory size determines what fits; memory bandwidth determines how fast you can read/write. The A100 has 80 GB of HBM at 2 TB/s bandwidth. These are independent specifications that constrain different aspects of performance.
SRAM is not L2 cache
Shared memory (SRAM) is explicitly managed by the programmer and local to each SM. L2 cache is hardware-managed and shared across all SMs. They occupy different levels of the hierarchy with different trade-offs. Flash Attention uses shared memory, not L2 cache, because it needs explicit control over what data resides in fast memory.
Summary
- GPUs have far more compute than memory bandwidth; most ML operations are memory-bound
- The roofline model: throughput $\le \min(\pi_{\text{peak}},\; I \times \beta)$, with ridge point $I^* = \pi_{\text{peak}}/\beta$
- Element-wise operations are deeply memory-bound ($I \approx 0.25$-$1$ FLOP/byte); large matmuls approach compute-bound ($I \gg 156$ on A100)
- Batch size amortizes the cost of loading weight matrices, converting memory-bound operations to compute-bound
- Kernel launch overhead matters for small operations; fusion eliminates it
Exercises
Problem
An A100 GPU has peak FP16 throughput of 312 TFLOP/s and HBM bandwidth of 2 TB/s. A softmax operation on a vector of length $N$ performs approximately $3N$ FLOPs (exponentiation, sum, division) and reads/writes $2N$ FP16 elements ($4N$ bytes). What is the arithmetic intensity, and is this operation compute-bound or memory-bound?
Problem
You are performing inference with a model that has weight matrices of size $N \times N$ in FP16. At batch size 1, what fraction of peak A100 FP16 compute do you expect to achieve? At what batch size does the operation cross the ridge point?
References
Canonical:
- NVIDIA CUDA Programming Guide, Chapters 2-5 (execution model and memory hierarchy)
- Williams, Waterman, Patterson, Roofline: An Insightful Visual Performance Model (2009)
Current:
- Dao et al., FlashAttention (2022), Section 2 (GPU memory hierarchy background)
- NVIDIA, A100 Tensor Core GPU Architecture Whitepaper (2020)
Next Topics
- Flash Attention: the most important application of IO-aware algorithm design on GPUs
- Fused kernels: combining operations to reduce kernel launch overhead and memory traffic
Last reviewed: April 2026
Builds on This
- AMD Competition Landscape (Layer 5)
- Fused Kernels (Layer 5)
- NVIDIA GPU Architectures (Layer 5)