
LLM Construction

GPU Compute Model

How GPUs execute ML workloads: streaming multiprocessors, warps, memory hierarchy (registers, SRAM, L2, HBM), arithmetic intensity, the roofline model, and why most ML operations are memory-bound.

Advanced · Tier 2 · ~50 min

Why This Matters

You cannot optimize what you do not understand. Most ML practitioners treat the GPU as a black box that runs matrix multiplies. This works until you need to understand why Flash Attention is fast, why small batch sizes underutilize hardware, or why fusing two operations together can yield a 2-3x speedup.

The key insight: modern GPUs have far more compute throughput than memory bandwidth. Most ML operations spend more time moving data than computing on it. Understanding this asymmetry is the foundation for every systems-level optimization in ML.

Mental Model

A GPU is a massively parallel processor optimized for throughput, not latency. It has thousands of small cores organized into groups, with a multi-level memory hierarchy. Fast memory is small; large memory is slow. The programmer's job is to keep data in fast memory as long as possible and minimize traffic to slow memory.

Execution Model

Definition

Streaming Multiprocessor

The basic compute unit on an NVIDIA GPU. Each SM contains multiple CUDA cores (execution units for arithmetic), a register file, shared memory (SRAM), and warp schedulers. An A100 has 108 SMs; an H100 has 132 SMs.

Definition

Warp

A group of 32 threads that execute the same instruction simultaneously (SIMT: single instruction, multiple threads). The warp is the fundamental scheduling unit. All 32 threads in a warp execute in lockstep. Divergent branches cause serialization within the warp.

Definition

Thread Block

A group of threads (up to 1024) assigned to a single SM. Threads within a block can communicate via shared memory and synchronize with barriers. Threads in different blocks cannot directly communicate.
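To make these units concrete, here is a minimal sketch of the grid arithmetic a 1-D element-wise kernel launch performs: how many thread blocks cover the input, and how many warps each block decomposes into. The function name and block size are illustrative, not part of any API.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def launch_config(n_elements: int, threads_per_block: int = 256):
    """1-D launch configuration: number of thread blocks needed to cover
    n_elements, and the number of warps each block is split into."""
    # Ceiling division: enough blocks so every element gets a thread.
    n_blocks = (n_elements + threads_per_block - 1) // threads_per_block
    warps_per_block = threads_per_block // WARP_SIZE
    return n_blocks, warps_per_block

# A 1M-element vector with 256-thread blocks: 3907 blocks of 8 warps each.
blocks, warps = launch_config(1_000_000, 256)
print(blocks, warps)  # → 3907 8
```

Each of those blocks is assigned to one SM, and its 8 warps are what the SM's warp schedulers actually issue instructions to.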

Memory Hierarchy

From fastest to slowest:

| Level | Size (A100) | Bandwidth | Latency |
| --- | --- | --- | --- |
| Registers | 256 KB per SM | ~19 TB/s | 1 cycle |
| Shared Memory (SRAM) | 164 KB per SM | ~19 TB/s | ~20 cycles |
| L2 Cache | 40 MB | ~5 TB/s | ~200 cycles |
| HBM (Global Memory) | 80 GB | 2 TB/s | ~400 cycles |

The ratio of HBM capacity to SRAM capacity is roughly $80{,}000 / 18 \approx 4{,}400\times$ (total SRAM across all SMs is about 18 MB on an A100). The bandwidth ratio between registers and HBM is roughly $10\times$. These ratios are the reason IO-aware algorithms exist.
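These ratios follow directly from the table above; a quick back-of-the-envelope check:

```python
# A100 figures from the table above.
sram_per_sm_kb = 164
num_sms = 108
hbm_gb = 80

total_sram_mb = sram_per_sm_kb * num_sms / 1024   # ~17.3 MB ("about 18 MB")
capacity_ratio = hbm_gb * 1024 / total_sram_mb    # ~4,700x: same ballpark as
                                                  # the ~4,400x in the text,
                                                  # which rounds SRAM to 18 MB

bandwidth_ratio = 19 / 2  # on-chip (~19 TB/s) vs HBM (2 TB/s): ~10x
```

The exact figure depends on rounding conventions; the orders of magnitude (thousands of times more HBM capacity, an order of magnitude more on-chip bandwidth) are what matter.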

Arithmetic Intensity and the Roofline Model

Definition

Arithmetic Intensity

The number of floating-point operations performed per byte of data transferred between HBM and the compute units. This is the single most important metric for predicting whether an operation is compute-bound or memory-bound.

Proposition

Roofline Performance Bound

Statement

The achievable throughput of any operation is bounded by:

$$\text{Throughput} \leq \min(P, \; I \cdot B) \text{ FLOP/s}$$

where $P$ is the peak compute throughput (FLOP/s), $B$ is the memory bandwidth (bytes/s), and $I$ is the arithmetic intensity (FLOP/byte).

The crossover point $I^* = P / B$ is the ridge point. Operations with $I < I^*$ are memory-bound; operations with $I > I^*$ are compute-bound.

Intuition

If arithmetic intensity is low, the compute units finish their work before the next batch of data arrives from HBM. The GPU sits idle waiting for memory. If arithmetic intensity is high, the memory system delivers data fast enough to keep the compute units busy, and performance hits the compute ceiling.

Proof Sketch

In time $T$, the memory system transfers at most $B \cdot T$ bytes, which supports at most $I \cdot B \cdot T$ FLOPs. The compute units perform at most $P \cdot T$ FLOPs. Actual FLOPs $\leq \min(P \cdot T, \; I \cdot B \cdot T)$. Dividing by $T$ gives the throughput bound.

Why It Matters

For an A100 with $P = 312$ TFLOP/s (FP16) and $B = 2$ TB/s, the ridge point is $I^* = 156$ FLOP/byte. Element-wise operations (ReLU, layer norm, softmax) have $I \approx 1$-$4$ FLOP/byte: deeply memory-bound. Only large matrix multiplies ($I \approx 128$-$256$) approach compute-bound territory.

Failure Mode

The roofline model assumes perfect overlap of compute and memory access, no cache effects, and no kernel launch overhead. Real performance is always below the roofline. The model identifies the bottleneck but does not predict exact throughput.

Why Batch Size Matters

A single matrix-vector multiply $Wx$ where $W \in \mathbb{R}^{m \times n}$ performs $2mn$ FLOPs and loads $mn + n$ elements (the matrix and the vector). Arithmetic intensity: approximately $2$ FLOP/element, or about $1$ FLOP/byte in FP16. This is deeply memory-bound.

A batched matrix multiply $WX$ where $X \in \mathbb{R}^{n \times B}$ performs $2mnB$ FLOPs and loads $mn + nB$ elements. As $B$ grows, the matrix $W$ is loaded once and reused across all $B$ columns. For $B \ll m$, arithmetic intensity approaches $2B$ FLOP/element, or about $B$ FLOP/byte in FP16. By batch size $B = 128$ the operation is close to the A100's ridge point of 156 FLOP/byte, and modestly larger batches push it into compute-bound territory.

This is why training (large batches) achieves much higher GPU utilization than inference (often batch size 1).
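The batch-size effect can be sketched directly from the FLOP and byte counts above (FP16, counting only the loads of $W$ and $X$, as in the text):

```python
def matmul_intensity(m: int, n: int, batch: int,
                     bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity (FLOP/byte) of W @ X with W (m x n) and
    X (n x batch), counting only the loads of W and X."""
    flops = 2 * m * n * batch
    bytes_loaded = (m * n + n * batch) * bytes_per_elem
    return flops / bytes_loaded

# 4096 x 4096 weight matrix; A100 ridge point is ~156 FLOP/byte.
print(matmul_intensity(4096, 4096, 1))    # ≈ 1: deeply memory-bound
print(matmul_intensity(4096, 4096, 128))  # ≈ 124: near the ridge
print(matmul_intensity(4096, 4096, 256))  # ≈ 241: compute-bound
```

With these counts the intensity simplifies to $mB/(m+B)$ FLOP/byte, so it grows roughly linearly in $B$ until $B$ approaches $m$, after which reloading $X$ itself becomes the bottleneck.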

Kernel Launch Overhead

Every GPU operation is submitted as a kernel: a function that runs on the GPU. Each kernel launch incurs overhead:

  • CPU-side dispatch: 5-20 microseconds
  • GPU scheduling: a few microseconds
  • Memory allocation and argument setup

For large matrix multiplies taking milliseconds, this overhead is negligible. For small element-wise operations taking microseconds, the launch overhead can dominate. This is why kernel fusion (combining multiple operations into one kernel) matters: you pay the launch cost once instead of multiple times.
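A toy cost model makes the fusion argument quantitative. The 10 µs launch cost is an assumed figure in the 5-20 µs range quoted above, and the model deliberately ignores the (often larger) memory-traffic savings of keeping intermediates in registers:

```python
LAUNCH_US = 10.0  # assumed per-kernel launch overhead, microseconds

def elapsed_us(kernel_work_us: list[float]) -> float:
    """Total time for a sequence of kernels, each paying launch overhead."""
    return sum(LAUNCH_US + work for work in kernel_work_us)

# Three tiny element-wise ops (5 us of real work each), unfused vs fused:
unfused = elapsed_us([5.0, 5.0, 5.0])  # 45 us: overhead is 2/3 of the total
fused = elapsed_us([15.0])             # 25 us: one launch, same work
```

For a single 5 ms matmul the same 10 µs launch is 0.2% of runtime, which is why fusion matters for small kernels and is irrelevant for large ones.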

Common Confusions

Watch Out

More CUDA cores does not always mean faster

If an operation is memory-bound (most ML operations are), adding more compute cores does not help. The bottleneck is memory bandwidth. An A100 and an H100 have similar HBM bandwidth (~2-3 TB/s), so memory-bound operations run at similar speeds despite the H100 having far more compute.

Watch Out

GPU memory size and memory bandwidth are different things

An 80 GB GPU does not read 80 GB per second. Memory size determines what fits; memory bandwidth determines how fast you can read/write. The A100 has 80 GB of HBM at 2 TB/s bandwidth. These are independent specifications that constrain different aspects of performance.
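One consequence worth internalizing: at 2 TB/s, merely streaming the full 80 GB of HBM once takes tens of milliseconds.

```python
hbm_bytes = 80e9   # A100 HBM capacity, bytes
bandwidth = 2e12   # A100 HBM bandwidth, bytes/s

# Minimum time to read every byte of HBM exactly once.
full_sweep_ms = hbm_bytes / bandwidth * 1e3
print(full_sweep_ms)  # → 40.0
```

So a memory-bound workload that must touch 80 GB of weights per step cannot finish in under ~40 ms, no matter how much compute the chip has.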

Watch Out

SRAM is not L2 cache

Shared memory (SRAM) is explicitly managed by the programmer and local to each SM. L2 cache is hardware-managed and shared across all SMs. They occupy different levels of the hierarchy with different trade-offs. Flash Attention uses shared memory, not L2 cache, because it needs explicit control over what data resides in fast memory.

Summary

  • GPUs have far more compute than memory bandwidth; most ML operations are memory-bound
  • The roofline model: throughput $\leq \min(P, I \cdot B)$ with ridge point $I^* = P/B$
  • Element-wise operations are deeply memory-bound ($I \approx 1$-$4$); large matmuls approach compute-bound ($I > 100$)
  • Batch size amortizes the cost of loading weight matrices, converting memory-bound operations to compute-bound
  • Kernel launch overhead matters for small operations; fusion eliminates it

Exercises

Exercise (Core)

Problem

An A100 GPU has peak FP16 throughput of 312 TFLOP/s and HBM bandwidth of 2 TB/s. A softmax operation on a vector of length $N$ performs approximately $5N$ FLOPs (exponentiation, sum, division) and reads/writes $2N$ FP16 elements ($4N$ bytes). What is the arithmetic intensity, and is this operation compute-bound or memory-bound?

Exercise (Advanced)

Problem

You are performing inference with a model that has weight matrices of size $4096 \times 4096$ in FP16. At batch size 1, what fraction of peak A100 FP16 compute do you expect to achieve? At what batch size does the operation cross the ridge point?

References

Canonical:

  • NVIDIA CUDA Programming Guide, Chapters 2-5 (execution model and memory hierarchy)
  • Williams, Waterman, Patterson, Roofline: An Insightful Visual Performance Model (2009)

Current:

  • Dao et al., FlashAttention (2022), Section 2 (GPU memory hierarchy background)
  • NVIDIA, A100 Tensor Core GPU Architecture Whitepaper (2020)

Next Topics

  • Flash Attention: the most important application of IO-aware algorithm design on GPUs
  • Fused kernels: combining operations to reduce kernel launch overhead and memory traffic

Last reviewed: April 2026
