Why This Matters
A GPU with 100 TFLOP/s of peak FP16 throughput and 1 TB/s of HBM bandwidth has a ridge point of 100 FLOP per byte. A kernel doing 1 FLOP per byte cannot exceed 1 TFLOP/s on that machine, no matter how well it is tuned; the arithmetic units simply sit idle most of the time, waiting on memory.
Many ML kernels move more bytes than they compute on. Elementwise activations, layer normalization, KV-cache reads during decoding, and unfused attention steps often hit the bandwidth ceiling. Dense matrix multiplication sits on the other side only when tile reuse is high enough that each loaded element feeds many fused multiply-adds.
Core Definitions
Memory Traffic
Memory traffic is the number of bytes transferred across the limiting memory interface during a kernel. For GPU kernels this usually means HBM traffic, not register or shared-memory traffic. Loads and stores both count; write-allocate and eviction traffic may add extra bytes.
Arithmetic Intensity
The arithmetic intensity of a kernel is $I = W / Q$, where $W$ is the number of floating-point operations and $Q$ is memory traffic in bytes. Units are FLOP per byte.
Compute-Bound Kernel
A kernel is compute-bound on a specific machine when its runtime lower bound is set by arithmetic throughput. In roofline terms, $I > P_{\text{peak}} / B_{\text{peak}}$.
Memory-Bound Kernel
A kernel is memory-bound on a specific machine when its runtime lower bound is set by memory bandwidth. In roofline terms, $I < P_{\text{peak}} / B_{\text{peak}}$.
Ridge Point
The ridge point is $I^* = P_{\text{peak}} / B_{\text{peak}}$. Kernels below this intensity are bandwidth-limited in the basic roofline model. Kernels above it are throughput-limited.
Bytes Before FLOPs
The roofline model starts with two independent lower bounds on runtime. If a kernel performs $W$ FLOPs and moves $Q$ bytes, then

$$T \ge \frac{W}{P_{\text{peak}}} \quad \text{and} \quad T \ge \frac{Q}{B_{\text{peak}}}.$$

Combining them gives a performance cap

$$\text{Attainable FLOP/s} \le \min\!\left(P_{\text{peak}},\; I \cdot B_{\text{peak}}\right),$$

where $I = W / Q$ is the arithmetic intensity.
Use a concrete machine model with $P_{\text{peak}} = 100$ TFLOP/s and $B_{\text{peak}} = 1$ TB/s. Its ridge point is

$$I^* = \frac{P_{\text{peak}}}{B_{\text{peak}}} = \frac{100 \text{ TFLOP/s}}{1 \text{ TB/s}} = 100 \text{ FLOP/byte}.$$
A kernel at 0.25 FLOP/byte is capped at 0.25 TFLOP/s. A kernel at 20 FLOP/byte is capped at 20 TFLOP/s. A kernel at 200 FLOP/byte is capped at 100 TFLOP/s.
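Those caps follow mechanically from the min expression. As a minimal sketch, the C program below evaluates $\min(P_{\text{peak}},\; I \cdot B_{\text{peak}})$ at the three intensities just listed; the constants describe the illustrative 100 TFLOP/s, 1 TB/s machine model above, not any particular GPU.

```c
#include <stdio.h>

int main(void) {
    const double peak_flops = 100e12; /* P_peak: 100 TFLOP/s (illustrative) */
    const double peak_bw    = 1e12;   /* B_peak: 1 TB/s (illustrative)      */
    const double intensities[] = { 0.25, 20.0, 200.0 };  /* FLOP per byte */

    for (int i = 0; i < 3; ++i) {
        double bw_ceiling = intensities[i] * peak_bw;      /* bandwidth roof */
        double cap = bw_ceiling < peak_flops ? bw_ceiling : peak_flops;
        printf("I = %6.2f FLOP/B -> cap = %6.2f TFLOP/s\n",
               intensities[i], cap / 1e12);
    }
    return 0;
}
```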
On a log-log roofline plot, the x-axis is arithmetic intensity and the y-axis is attained FLOP/s. The bandwidth line has slope 1 because attained performance there equals intensity times bandwidth: doubling intensity doubles performance. The compute ceiling is horizontal. The knee between them is the ridge point.
This is a bound, not a performance prediction with cycle accuracy. Bad memory coalescing, branch divergence, unaligned vector loads, bank conflicts, low occupancy, or instruction mix can put a kernel far below the roofline. The model still tells you which resource you must stop wasting first.
SAXPY as a Bandwidth Kernel
SAXPY computes $y[i] \leftarrow a \cdot x[i] + y[i]$ over arrays of floats.
void saxpy(int n, float a, const float *x, float *y) {
for (int i = 0; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
}
For FP32, one element needs these transfers under a simple streaming model.
| item | bytes | direction |
|---|---|---|
| x[i] | 4 | load |
| y[i] old value | 4 | load |
| y[i] new value | 4 | store |
The operation count is 2 FLOPs if a multiply and add are counted separately. The traffic is 12 bytes, so

$$I = \frac{2 \text{ FLOP}}{12 \text{ B}} \approx 0.167 \text{ FLOP/byte}.$$
Some older BLAS discussions count the fused multiply-add as one operation and ignore the store, giving about 1 operation per 8 input bytes. Both conventions point to the same place: far left of the ridge.
At cache-line granularity, a 64-byte line holds 16 FP32 values. For 16 SAXPY elements, the kernel loads one line of x, loads one line of y, and stores one line of y.
x line: 16 floats = 64 B
y load: 16 floats = 64 B
y store: 16 floats = 64 B
work: 16 * 2 FLOPs = 32 FLOPs
AI = 32 / 192 = 0.167 FLOP/B
If the store triggers a write-allocate read on a write-back cache, traffic can rise to 256 bytes per 16 elements, dropping intensity to 0.125 FLOP/byte. Streaming stores or GPU global stores change that detail, but not the conclusion.
A fused elementwise chain raises intensity by avoiding intermediate stores. Suppose a model applies z = gelu(a*x + b) and then w = z + r. If implemented as three kernels, intermediate arrays are written to HBM and read back. A fused kernel can keep temporaries in registers.
__global__ void fused(float *w, const float *x, const float *r,
                      float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a * x[i] + b;   // intermediate stays in a register
        // tanh-form GELU approximation; 0.79788456f is sqrt(2/pi)
        float g = 0.5f * t * (1.0f + tanhf(0.79788456f * (t + 0.044715f*t*t*t)));
        w[i] = g + r[i];          // only x, r, and w touch HBM
    }
}
The fused version still loads x and r and stores w, but it avoids writing and rereading the intermediate t or g arrays. That saves 8 or more bytes per element for FP32 temporaries.
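As a rough FP32 accounting, assuming a three-kernel split into the affine step, the GELU, and the residual add, and ignoring write-allocate traffic, the per-element bytes look like this:

affine kernel: read x (4 B) + write t (4 B) = 8 B
gelu kernel: read t (4 B) + write z (4 B) = 8 B
add kernel: read z (4 B) + read r (4 B) + write w (4 B) = 12 B
unfused total: 28 B per element
fused kernel: read x (4 B) + read r (4 B) + write w (4 B) = 12 B

The arithmetic is identical in both versions, so intensity rises by the ratio of bytes moved.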
GEMM and Reuse
Square matrix multiplication with $N \times N$ matrices performs about $2N^3$ FLOPs. If each input and output element crossed HBM exactly once, the traffic for FP16 would be

$$Q = 3N^2 \cdot 2 \text{ B} = 6N^2 \text{ bytes}.$$

So the ideal arithmetic intensity is

$$I_{\text{ideal}} = \frac{2N^3}{6N^2} = \frac{N}{3} \text{ FLOP/byte}.$$

For $N = 1024$, this is about 341 FLOP/byte. On the 100 TFLOP/s, 1 TB/s machine, that sits to the right of the ridge point and can be compute-bound.
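One consequence worth noting: with a ridge point of 100 FLOP/byte, even the ideal traffic pattern needs $N/3 > 100$, that is $N > 300$, before the GEMM can be compute-bound. The small-GEMM caveat under Common Confusions below comes from this threshold.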
The word "ideal" does real work. Naive GEMM does not get this intensity because it reloads A[i,k] and B[k,j] for many output elements.
for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j) {
float sum = 0.0f;
for (int k = 0; k < N; ++k)
sum += A[i*N + k] * B[k*N + j];
C[i*N + j] = sum;
}
Without blocking, the inner loop streams through one row of A and one column of B for each output element. The same data is fetched again for nearby outputs. Tiled GEMM loads blocks of A and B into shared memory or cache, then reuses them for many multiply-adds.
A tile sketch for CUDA is below. It omits edge checks and tensor cores so the reuse pattern stays visible.
__global__ void tiled_gemm(const float *A, const float *B, float *C, int N) {
    // 16x16 tiles staged in shared memory; assumes N is a multiple of 16
    // because edge checks are omitted for clarity.
    __shared__ float As[16][16];
    __shared__ float Bs[16][16];
    int row = blockIdx.y * 16 + threadIdx.y;
    int col = blockIdx.x * 16 + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N; t += 16) {
        // Each thread loads one element of the A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row*N + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y)*N + col];
        __syncthreads();   // tile fully loaded before anyone reads it
        for (int k = 0; k < 16; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();   // everyone done reading before the next load
    }
    C[row*N + col] = acc;
}
Each loaded tile element participates in 16 multiply-adds before the next HBM round trip. Larger tiles, vectorized loads, tensor core fragments, and register blocking increase reuse further. This is why large GEMMs often sit close to peak arithmetic throughput.
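A rough per-block check of that claim, under the 16-wide FP32 tiling above: each trip through the t loop loads $2 \cdot 16^2 = 512$ floats (2048 bytes) from HBM and performs $2 \cdot 16^3 = 8192$ FLOPs, about 4 FLOP/byte, compared with well under 1 FLOP/byte for the unblocked loop when the reused rows and columns miss in cache. Doubling the tile edge roughly doubles this per-block intensity. The whole-kernel HBM total also depends on how many blocks reload the same tiles, so treat this as a per-block estimate rather than a full traffic model.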
Attention on the Roofline
Scaled dot-product attention computes

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V.$$

For one head with sequence length $L$ and head dimension $d$, the two matrix products perform about $4L^2 d$ FLOPs. A per-query view gives about $4Ld$ FLOPs for one output row. During autoregressive decoding, each new token reads the existing K and V cache, so one step has about $4Ld$ FLOPs and reads about $2Ld$ FP16 elements.

Take $L = 4096$ and $d = 128$. For one decode token in FP16, reading K and V costs

$$2 \cdot 4096 \cdot 128 \cdot 2 \text{ B} \approx 2.1 \text{ MB}.$$

The dot products and weighted value sum cost about

$$4 \cdot 4096 \cdot 128 \approx 2.1 \text{ MFLOP}.$$
So the intensity is about 1 FLOP/byte before counting masks, positional terms, and softmax bookkeeping. That is far below a ridge point of 100 FLOP/byte.
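On the illustrative 100 TFLOP/s, 1 TB/s machine, those numbers give runtime lower bounds of roughly $2.1 \text{ MB} / 1 \text{ TB/s} \approx 2.1\,\mu\text{s}$ for the cache reads versus $2.1 \text{ MFLOP} / 100 \text{ TFLOP/s} \approx 0.02\,\mu\text{s}$ for the arithmetic, so the memory bound dominates by about two orders of magnitude per head per token.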
For full prefill attention at $L = 4096$, the arithmetic count is much larger:

$$4L^2 d = 4 \cdot 4096^2 \cdot 128 \approx 8.6 \text{ GFLOP}.$$

The Q, K, V, and output tensors together are only about 4.19 MB in FP16. But a naive implementation materializes the $L \times L$ score matrix and often a probability matrix too. Two FP16 $L \times L$ matrices cost

$$2 \cdot 4096^2 \cdot 2 \text{ B} \approx 67 \text{ MB}.$$
That extra HBM traffic pushes attention toward the bandwidth side, especially as softmax reads and writes the score matrix. The traffic grows as $O(L^2)$ elements whenever scores or probabilities are materialized.
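One plausible accounting at $L = 4096$: if the score matrix is written once, read back for softmax, written again as probabilities, and read again for the value product, that alone is about $4 \cdot 33.5 \text{ MB} \approx 134 \text{ MB}$ of traffic on top of the 4.19 MB of inputs and outputs, giving roughly $8.6 \text{ GFLOP} / 138 \text{ MB} \approx 62$ FLOP/byte, below the 100 FLOP/byte ridge of the machine model used here. The exact figure depends on how an implementation stages softmax and masking.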
FlashAttention changes the traffic pattern. It tiles Q, K, and V so that score blocks live in SRAM and registers, then uses an online softmax to avoid writing the score matrix to HBM.
for Q_block in Q:                       # one block of query rows
    m = -inf                            # running row-wise max of scores
    l = 0                               # running row-wise softmax denominator
    out = 0                             # running (unnormalized) output block
    for K_block, V_block in tiles:
        scores = Q_block @ K_block.T
        m_new = max(m, rowmax(scores))
        l = exp(m - m_new) * l + rowsum(exp(scores - m_new))
        out = exp(m - m_new) * out + exp(scores - m_new) @ V_block
        m = m_new
    O_block = out / l                   # normalize once per query block
The FLOP count stays close to standard attention. The HBM traffic falls because the intermediate $L \times L$ score matrix is not stored and reread. This is a kernel-fusion and tiling fix, not a change to the mathematical attention function.
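A best-case bound, assuming Q, K, V, and the output each cross HBM roughly once: traffic for the $L = 4096$, $d = 128$ example falls back toward the 4.19 MB of inputs and outputs, which would put intensity on the order of $8.6 \text{ GFLOP} / 4.2 \text{ MB} \approx 2000$ FLOP/byte. Real kernels re-read K and V tiles when on-chip memory cannot hold a full pass, so the achieved number is lower, but it stays far above the naive path.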
The Model
The roofline model is the pair of ceilings below.

$$\text{FLOP/s} \le I \cdot B_{\text{peak}}, \qquad \text{FLOP/s} \le P_{\text{peak}}$$

Equivalently,

$$\text{Attainable FLOP/s} = \min\!\left(P_{\text{peak}},\; I \cdot B_{\text{peak}}\right).$$

The ridge point $I^* = P_{\text{peak}} / B_{\text{peak}}$ separates the two regimes.
Operationally, the model gives four invariants for ML systems work.
- If $I < I^*$, reducing bytes is worth more than reducing FLOPs.
- If $I > I^*$, increasing arithmetic throughput or tensor-core occupancy matters more than reducing bytes.
- Increasing batch size or tile size can raise intensity by reusing loaded weights and activations.
- Fusing elementwise kernels raises intensity by removing HBM round trips for intermediate tensors.
The model is machine-specific. The same kernel can be memory-bound on a GPU with high FLOP/s per byte and compute-bound on a CPU with lower arithmetic throughput per byte.
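A small helper makes the machine dependence concrete. This is a sketch, not a profiler: the FLOP and byte counts are the ideal $N = 128$ FP16 GEMM numbers from earlier, the GPU constants are this section's illustrative machine model, and the CPU constants (2 TFLOP/s, 200 GB/s) are assumptions chosen only to show the classification flipping.

```c
#include <stdio.h>

/* Compare a kernel's intensity I = W/Q against a machine's ridge point. */
static void classify(const char *label, double flops, double bytes,
                     double peak_flops, double peak_bw) {
    double intensity = flops / bytes;        /* I  = W / Q            */
    double ridge     = peak_flops / peak_bw; /* I* = P_peak / B_peak  */
    printf("%s: I = %.1f FLOP/B, I* = %.1f -> %s\n", label, intensity, ridge,
           intensity < ridge ? "memory-bound, cut bytes first"
                             : "compute-bound, raise throughput");
}

int main(void) {
    /* Ideal FP16 square GEMM at N = 128: W = 2*N^3 FLOPs, Q = 6*N^2 bytes. */
    double W = 2.0 * 128 * 128 * 128;
    double Q = 6.0 * 128 * 128;
    classify("GPU model (100 TFLOP/s, 1 TB/s)", W, Q, 100e12, 1e12);
    classify("assumed CPU (2 TFLOP/s, 200 GB/s)", W, Q, 2e12, 200e9);
    return 0;
}
```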
Common Confusions
Counting tensor size instead of traffic
A tensor of 100 MB is not the same as 100 MB of traffic. If a kernel writes the tensor, another kernel reads it, and a third kernel overwrites it, HBM traffic can be 300 MB or more. Roofline arithmetic intensity uses traffic across the limiting memory interface.
Calling GEMM compute-bound without checking N
Small GEMMs may be launch-limited or memory-bound. For an FP16 ideal square GEMM, $I \approx N/3$ FLOP/byte. With a ridge point of 100 FLOP/byte, $N = 128$ gives about 42.7 FLOP/byte and sits below the ridge.
Treating attention as one roofline point
Prefill attention and decode attention have different traffic. Prefill can reuse Q, K, and V through matrix products, but naive score materialization adds $O(L^2)$ bytes. Decode reads a large KV cache for one new token and often has intensity near 1 FLOP/byte.
Exercises
Problem
A SAXPY kernel processes 100 million FP32 elements. Count 2 FLOPs per element and 12 bytes per element. On a machine with 20 TFLOP/s peak and 800 GB/s bandwidth, compute the roofline runtime lower bounds and decide the limiting resource.
Problem
For an ideal FP16 square GEMM, use $I \approx N/3$ FLOP/byte. A GPU has 120 TFLOP/s peak and 1.5 TB/s bandwidth. Find the ridge point and the matrix size at which $N/3$ reaches it. Then classify one $N$ on each side of that crossing.
Problem
A decoder attention step uses one head with sequence length $L$, head dimension $d$, and an FP16 K and V cache. Estimate the bytes read, the FLOPs, and the arithmetic intensity for one new token, using the approximations of $4Ld$ FLOPs and $2Ld$ FP16 elements read.
References
Canonical:
- Williams, Waterman, and Patterson, Roofline: An Insightful Visual Performance Model for Multicore Architectures (2009), §§1-4 (the original roofline model, ridge point, and bandwidth ceiling)
- NVIDIA, CUDA C++ Programming Guide (current online edition), ch. 5 and ch. 8 (CUDA execution model, memory hierarchy, shared memory, and global memory behavior)
- Hennessy and Patterson, Computer Architecture: A Quantitative Approach (6th ed., 2017), §3.3 and ch. 4 (memory hierarchy and vector/SIMD performance limits)
- Vaswani et al., Attention Is All You Need (2017), §3.2 (scaled dot-product attention and multi-head attention)
- Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention (2023), §§2-3 (KV-cache memory pressure during LLM serving)
Accessible:
- Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022), §§2-3 (tiling attention to reduce HBM traffic)
- Horace He, Making Deep Learning Go Brrrr From First Principles (2022) (practical roofline-style thinking for ML kernels)
- Olah et al., A Mathematical Framework for Transformer Circuits (2021), introductory sections (attention patterns and transformer computation structure)
Next Topics
- /computationpath/gpu-memory-hierarchy
- /computationpath/kernel-fusion-and-operator-overheads
- /computationpath/flashattention-and-io-aware-attention
- /computationpath/kv-cache-and-llm-serving