Systems
Computer Architecture for ML
Memory movement often matters more than FLOPs. Memory hierarchy, bandwidth vs. compute, roofline model, GPU architecture, kernel fusion, and why parameter count alone does not tell the computational story.
Why This Matters
The gap between peak FLOPs and actual training throughput can be 10x or more. The bottleneck is usually memory bandwidth, not arithmetic. A model with fewer FLOPs can be slower than one with more FLOPs if the smaller model makes more memory round-trips. Understanding the hardware stack (memory hierarchy, bandwidth limits, GPU execution model) lets you predict which operations are slow, why certain optimizations work, and where to look when training is not saturating the hardware.
FlashAttention, kernel fusion, quantization, and KV-cache management are all motivated by the same principle: minimize data movement through the memory hierarchy. All of this builds on floating-point arithmetic fundamentals.
The Memory Hierarchy
Every computation involves moving data from storage to processing units. Closer memory is faster but smaller:
| Level | Capacity (typical) | Bandwidth | Latency |
|---|---|---|---|
| Registers | ~256 KB per SM (64K 32-bit registers) | tens of TB/s | < 1 ns |
| L1 cache / shared memory | 128-228 KB per SM | ~20 TB/s | ~5 ns |
| L2 cache | 40-50 MB total | ~5 TB/s | ~50 ns |
| HBM (global memory) | 40-80 GB | 1-3 TB/s | ~200 ns |
| CPU DRAM | 256 GB-2 TB | 100-400 GB/s | ~100 ns |
| NVMe SSD | TB-scale | 5-15 GB/s | ~10 us |
For an A100 GPU: HBM bandwidth is 2 TB/s, and peak FP16 compute is 312 TFLOPS. For an H100 with HBM3: bandwidth is 3.35 TB/s, and peak FP16 compute is 989 TFLOPS. The ratio of compute to bandwidth keeps increasing, making memory bandwidth the tighter constraint for more operations.
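These compute-to-bandwidth ratios are easy to check with a back-of-envelope calculation; the sketch below uses the spec numbers quoted above:

```python
# Ridge point I* = peak_compute / peak_bandwidth: the arithmetic intensity
# (FLOPs per byte) at which an operation shifts from memory-bound to
# compute-bound. Spec numbers are the ones quoted in the text.

def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOPs/byte) at the roofline crossover."""
    return peak_flops / bandwidth_bytes_per_s

a100 = ridge_point(312e12, 2.0e12)    # FP16 TFLOPS / HBM2e bandwidth
h100 = ridge_point(989e12, 3.35e12)   # FP16 TFLOPS / HBM3 bandwidth

print(f"A100 ridge point: {a100:.0f} FLOPs/byte")   # ~156
print(f"H100 ridge point: {h100:.0f} FLOPs/byte")   # ~295
```

The H100's ridge point is nearly double the A100's: an operation needs almost twice the arithmetic intensity to stay compute-bound on the newer part.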
Arithmetic Intensity and the Roofline Model
Arithmetic Intensity
The arithmetic intensity $I$ of an operation is the ratio of floating-point operations performed to bytes of memory transferred:

$$I = \frac{\text{FLOPs}}{\text{bytes transferred}}$$
An operation with low arithmetic intensity (e.g., elementwise addition: 1 FLOP per 12 bytes for FP32 read-read-write) is memory-bound. An operation with high arithmetic intensity (e.g., large matrix multiply) is compute-bound.
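The two examples above can be made concrete with small helper functions (an FP32 elementwise add and an FP16 matrix multiply $C = AB$):

```python
# Arithmetic intensity I = FLOPs / bytes moved, for the two cases in the
# text: FP32 elementwise add (read a, read b, write c) and a large FP16
# matrix multiply C = A @ B with A of shape (m, k) and B of shape (k, n).

def intensity_elementwise_add_fp32(n: int) -> float:
    flops = n                        # one add per element
    bytes_moved = 3 * 4 * n          # two reads + one write, 4 bytes each
    return flops / bytes_moved

def intensity_matmul_fp16(m: int, k: int, n: int) -> float:
    flops = 2 * m * k * n                    # multiply-accumulate pairs
    bytes_moved = 2 * (m*k + k*n + m*n)      # A, B read once; C written once
    return flops / bytes_moved

print(intensity_elementwise_add_fp32(1_000_000))   # 1/12 ≈ 0.083 FLOPs/byte
print(intensity_matmul_fp16(4096, 4096, 4096))     # 4096/3 ≈ 1365 FLOPs/byte
```

The four-orders-of-magnitude gap between the two results is the whole story: the add sits far below any GPU's ridge point, the large matmul far above it.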
Roofline Model
The roofline model predicts the attainable performance of an operation as:

$$P = \min(P_{\text{peak}},\; B \cdot I)$$

where $I$ is the arithmetic intensity, $P_{\text{peak}}$ is peak compute throughput, and $B$ is peak memory bandwidth. The attainable performance increases linearly with $I$ until it hits the compute ceiling. The crossover point $I^* = P_{\text{peak}} / B$ is the ridge point. Operations with $I < I^*$ are memory-bound; operations with $I > I^*$ are compute-bound.
Roofline Performance Bound
Statement
The maximum attainable throughput $P$ (in FLOPs/s) of an operation with arithmetic intensity $I$ on hardware with peak compute $P_{\text{peak}}$ and peak bandwidth $B$ satisfies:

$$P \le \min(P_{\text{peak}},\; B \cdot I)$$

The operation is memory-bound when $I < P_{\text{peak}} / B$ and compute-bound when $I > P_{\text{peak}} / B$.
Intuition
You cannot compute faster than the hardware allows ($P \le P_{\text{peak}}$), and you cannot compute faster than you can feed data to the compute units ($P \le B \cdot I$). The tighter constraint wins. For most elementwise operations and small matrix multiplies, memory bandwidth is the bottleneck. For large matrix multiplies (the dominant operation in transformer training), compute is the bottleneck, but only if batch size is large enough.
Proof Sketch
The operation requires $W$ FLOPs and $Q$ bytes of memory transfer, so $I = W/Q$. The execution time is at least $W / P_{\text{peak}}$ (compute limited) and at least $Q / B$ (bandwidth limited). Since $Q = W / I$, the bandwidth-limited time is $W / (B \cdot I)$. The actual time is $t \ge \max(W / P_{\text{peak}},\; W / (B \cdot I))$, giving throughput $P = W / t \le \min(P_{\text{peak}},\; B \cdot I)$.
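The bound itself is one line of code; a minimal sketch using the A100 numbers quoted earlier:

```python
# Roofline bound: attainable throughput is min(peak compute, bandwidth * I).

def roofline_flops(intensity: float,
                   peak_flops: float = 312e12,     # A100 FP16 peak
                   bandwidth: float = 2.0e12) -> float:  # A100 HBM
    return min(peak_flops, bandwidth * intensity)

# Elementwise FP32 add (I = 1/12): bandwidth-limited, far below peak.
print(roofline_flops(1/12))    # ≈ 1.7e11 FLOPs/s, ~0.05% of peak
# Large FP16 matmul (I ≈ 1365): hits the compute ceiling.
print(roofline_flops(1365))    # 3.12e14 FLOPs/s
```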
Why It Matters
The roofline model explains why the same operation can be fast or slow depending on dimensions. A matrix multiply $C = AB$ with $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{k \times n}$ has $I = \frac{2mkn}{2(mk + kn + mn)}$ in FP16, which grows with the dimensions. For $m = n = k = 4096$, $I \approx 1365$ FLOPs/byte, well above the ridge point. For $m = 1$ (single-vector inference), $I \approx 1$ FLOP/byte, deeply memory-bound. This is why batched inference is so much faster per token than single-token inference.
Failure Mode
The roofline model assumes perfect parallelism and no cache effects. In practice, cache misses, bank conflicts, warp divergence, and imperfect occupancy reduce actual throughput below the roofline. The model gives an upper bound, not a prediction. It also ignores instruction overhead and assumes the operation can saturate either compute or memory independently.
GPU Architecture
A modern GPU (e.g., NVIDIA A100, H100) consists of:
Streaming Multiprocessors (SMs). The A100 has 108 SMs, the H100 has 132. Each SM contains CUDA cores (for scalar FP32/FP64), Tensor Cores (for matrix multiply-accumulate in FP16/BF16/FP8/INT8), registers (64K 32-bit registers per SM), and shared memory (up to 228KB on H100, configurable with L1 cache).
Warp scheduling. Threads are organized in groups of 32 called warps. All threads in a warp execute the same instruction (SIMT: Single Instruction, Multiple Threads). When one warp stalls on a memory access, the scheduler switches to another ready warp. High occupancy (many warps per SM) hides memory latency.
Memory hierarchy on GPU. Registers are fastest but per-thread. Shared memory is shared within a thread block (on a single SM), with ~20 TB/s bandwidth. Global memory (HBM) is accessible to all SMs but has ~100x higher latency than shared memory.
Tensor Cores. Specialized matrix multiply units that compute on small tiles (e.g., 16x16x16 in FP16). On H100, Tensor Cores provide 989 TFLOPS in FP16 vs. ~60 TFLOPS for FP32 CUDA cores. Matrix multiplies that map onto Tensor Core tile sizes are dramatically faster. Dimensions should be multiples of 8 (FP16) or 16 (INT8) for optimal Tensor Core utilization.
Kernel Fusion
A kernel is a function launched on the GPU. Each kernel reads inputs from global memory, computes, and writes outputs to global memory. A sequence of operations like LayerNorm followed by a linear layer followed by GELU activation would naively require 3 kernel launches and 3 round-trips to global memory for intermediate results.
Kernel fusion combines multiple operations into a single kernel. The intermediate results stay in registers or shared memory, avoiding global memory round-trips. The savings can be substantial: if each operation is memory-bound, fusing $k$ operations reduces memory traffic by up to $k\times$.
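The traffic accounting behind that factor-of-$k$ claim can be sketched directly (a simplified model that counts only reads of the chain's input and writes of each op's output):

```python
# Rough HBM traffic for a chain of k elementwise ops over n FP16 values.
# Unfused: every op reads its input from HBM and writes its result back.
# Fused: one read of the input, one write of the final output; everything
# in between stays in registers or shared memory.

def traffic_bytes(n: int, k: int, fused: bool, elem_bytes: int = 2) -> int:
    if fused:
        return 2 * n * elem_bytes       # one read + one write total
    return k * 2 * n * elem_bytes       # one read + one write per op

n = 4096 * 4096
print(traffic_bytes(n, k=3, fused=False))  # 3x the fused traffic
print(traffic_bytes(n, k=3, fused=True))
```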
Examples of fusion in practice:
- FlashAttention fuses the attention computation ($QK^T$, softmax, multiplication by $V$) into a single kernel that tiles the computation to fit in SRAM, reducing HBM accesses from $\Theta(Nd + N^2)$ to $\Theta(N^2 d^2 / M)$, where $N$ is sequence length, $d$ is head dimension, and $M$ is SRAM size.
- Fused optimizer kernels apply weight update rules (Adam update, weight decay, gradient unscaling) in one pass over parameters.
- Fused attention + LayerNorm avoids materializing attention output to HBM before LayerNorm reads it back.
Why Transformers Are Often Memory-Bound
The dominant operation in a transformer forward pass is matrix multiplication (QKV projections, attention, FFN layers). For large batch sizes, these are compute-bound and use Tensor Cores well. But several components are memory-bound:
- Softmax: reads and writes $n$ values with minimal computation per element.
- LayerNorm: elementwise operations with low arithmetic intensity.
- Activation functions (GELU, SiLU): elementwise, one FLOP per element.
- Residual connections: elementwise addition, one FLOP per three memory operations (read two inputs, write one output).
For autoregressive inference (batch size 1, generating one token at a time), even the matrix multiplies are memory-bound: each FP16 weight is read once from HBM for only two FLOPs, giving $I \approx 1$ FLOP/byte. The entire forward pass becomes a memory bandwidth exercise.
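To see how batching changes this, consider the intensity of an activation-by-weight multiply $x W$ as a function of batch size (an illustrative calculation; the $8192$ dimensions are an arbitrary example):

```python
# Arithmetic intensity of x @ W for a (b, k) FP16 activation and a (k, n)
# FP16 weight matrix, as a function of batch size b. At b = 1 the weight
# read dominates the traffic and I ≈ 1 FLOP/byte; larger batches amortize
# the weight traffic across more useful FLOPs.

def matmul_intensity_fp16(b: int, k: int, n: int) -> float:
    flops = 2 * b * k * n
    bytes_moved = 2 * (b*k + k*n + b*n)   # activations in, weights in, out
    return flops / bytes_moved

for b in (1, 8, 32, 256):
    print(b, round(matmul_intensity_fp16(b, 8192, 8192), 1))
```

Intensity grows roughly linearly with batch size until it crosses the ridge point, which is exactly why serving systems batch requests.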
Quantization
Quantization
Quantization represents model parameters or activations in lower precision than FP32, reducing memory footprint and (on supported hardware) increasing compute throughput:
- FP16/BF16: 2 bytes per value. Tensor Cores achieve 2x throughput vs FP32. BF16 has the same exponent range as FP32 (8 bits), avoiding overflow issues that plague FP16. See mixed precision training for details on training with reduced precision.
- INT8: 1 byte per value. 4x memory reduction vs FP32. Requires calibration to map FP32 ranges to INT8 without excessive quantization error.
- FP8 (E4M3, E5M2): H100 Tensor Cores support FP8, achieving up to 2x throughput vs FP16. E4M3 has more mantissa bits (inference); E5M2 has more exponent range (training gradients).
- INT4/NF4: 0.5 bytes per value. Used in QLoRA and GPTQ for inference. Significant accuracy loss without careful calibration or mixed-precision approaches.
The memory bandwidth savings from quantization directly translate to faster memory-bound operations. An INT8 model reads half the bytes of an FP16 model, approaching 2x speedup on memory-bound operations.
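The footprint arithmetic for the precisions listed above is straightforward; a small sketch:

```python
# Parameter memory at the precisions discussed above (bytes per value).
BYTES_PER_VALUE = {
    "fp32": 4.0, "fp16": 2.0, "bf16": 2.0,
    "int8": 1.0, "fp8": 1.0, "int4": 0.5,
}

def param_memory_gb(n_params: float, dtype: str) -> float:
    return n_params * BYTES_PER_VALUE[dtype] / 1e9

print(param_memory_gb(7e9, "fp16"))   # 14.0 GB
print(param_memory_gb(7e9, "int8"))   # 7.0 GB
print(param_memory_gb(7e9, "int4"))   # 3.5 GB
```

For a memory-bound forward pass, these bytes are roughly proportional to time: halving the bytes read approaches halving the latency.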
The Memory Wall
The memory wall refers to the growing gap between processor speed and memory bandwidth. From 2010 to 2024, peak GPU FLOPs grew roughly 60x (Fermi to H100) while HBM bandwidth grew roughly 10x. This means the ridge point has increased from roughly 10 to roughly 300 FLOPs/byte. More operations fall on the memory-bound side of the roofline. Every generation of hardware makes memory-aware algorithm design more important, not less.
Connections to Model Serving
The architecture constraints directly determine serving performance:
- KV-cache memory: autoregressive generation stores key/value tensors for all previous tokens. For a model with $L$ layers, hidden dimension $d$, and context length $n$, the KV-cache uses $2 \cdot L \cdot n \cdot d$ values (in the serving precision). At $L = 80$, $d = 8192$, $n = 8192$, and FP16, this is ~20 GB per sequence. Batching multiple sequences multiplies this.
- Batch size and throughput: larger batches increase arithmetic intensity, moving matrix multiplies toward the compute-bound regime. This is why serving systems like vLLM batch aggressively.
- Latency vs. throughput: single-user latency is dominated by sequential token generation (each token requires a full forward pass). Throughput is maximized by batching, which increases latency per request. The tradeoff is governed by the roofline model.
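The KV-cache formula above is worth coding up; the example dimensions ($L = 80$, $d = 8192$) are the ones used in the bullet list:

```python
# KV-cache size: 2 (K and V) * layers * tokens * hidden_dim values,
# times bytes per value, times batch size.

def kv_cache_bytes(layers: int, hidden: int, context: int,
                   bytes_per_val: int = 2,   # FP16 serving precision
                   batch: int = 1) -> int:
    return 2 * layers * context * hidden * bytes_per_val * batch

gb = kv_cache_bytes(80, 8192, 8192) / 1e9
print(f"{gb:.1f} GB per sequence")   # ≈ 21.5 GB
```

Note how the cache scales linearly in context length and batch size; at batch 4 the same configuration already exceeds an 80 GB HBM budget.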
Common Confusions
FLOPs do not determine wall-clock time
Two operations with the same FLOP count can have vastly different run times. An elementwise operation over $n$ elements (~$n$ FLOPs) may take longer than a matrix multiply performing many times as many FLOPs, because the elementwise operation is memory-bound while the matrix multiply is compute-bound and uses Tensor Cores. Always consider arithmetic intensity, not just FLOP count.
Parameter count does not equal memory cost
A 7B-parameter model in FP16 uses 14 GB for parameters, but training requires optimizer states (Adam: two FP32 copies for the moments $m$ and $v$), gradients, and activations. The total training memory is typically 4-6x the parameter memory. Inference memory is smaller but includes KV-cache, which grows with context length and batch size.
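One possible accounting, following the breakdown in the paragraph above (FP16 parameters and gradients, FP32 Adam moments; activations excluded because they depend on batch size and sequence length):

```python
# Training-memory breakdown for a model trained with Adam in FP16,
# per the accounting in the text. Activation memory is not included.

def training_memory_gb(n_params: float) -> dict:
    billions = n_params / 1e9
    return {
        "params_fp16": 2 * billions,   # 2 bytes per parameter
        "grads_fp16":  2 * billions,
        "adam_m_fp32": 4 * billions,   # first moment, FP32
        "adam_v_fp32": 4 * billions,   # second moment, FP32
    }

mem = training_memory_gb(7e9)
print(sum(mem.values()))   # 84.0 GB, i.e. 6x the 14 GB of FP16 parameters
```

Schemes that also keep an FP32 master copy of the weights push the multiplier higher still, which is why optimizer-state sharding (ZeRO-style) matters at scale.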
Tensor Cores require specific dimension alignment
Tensor Cores operate on fixed tile sizes (e.g., 16x16x16 for FP16). If matrix dimensions are not multiples of these tile sizes, padding is required, wasting compute. Hidden dimensions of 768, 1024, 2048, 4096, 8192 are chosen partly because they divide evenly by Tensor Core tile sizes and warp sizes.
Key Takeaways
- The roofline model: performance = min(peak compute, bandwidth x arithmetic intensity)
- Most elementwise operations in transformers are memory-bound, not compute-bound
- Kernel fusion reduces memory round-trips and is the primary optimization lever for memory-bound operations
- Quantization reduces memory footprint and bandwidth requirements, directly speeding up memory-bound operations
- The memory wall is getting worse: the ridge point increases with each hardware generation
- Autoregressive inference at batch size 1 is almost entirely memory-bandwidth limited
- KV-cache memory dominates serving cost for long-context models
Exercises
Problem
An A100 GPU has 312 TFLOPS FP16 compute and 2 TB/s HBM bandwidth. Compute the ridge point of the roofline model. Is a matrix multiply with $m = n = 4096$ and $k = 4096$ (in FP16) compute-bound or memory-bound?
Problem
You apply LayerNorm to a tensor of shape $(B, T, d)$ in FP16. Estimate the arithmetic intensity. Is this operation compute-bound or memory-bound on an A100?
Problem
A 70B-parameter model is served in FP16 for autoregressive generation at batch size 1 on an H100 (3.35 TB/s HBM bandwidth). Each token generation requires a full forward pass through all parameters. Estimate the minimum time per token (ignoring KV-cache reads and activation computation). How does this change at batch size 32?
References
Canonical:
- Hennessy and Patterson, Computer Architecture: A Quantitative Approach (6th ed., 2019), Chapters 2-4
- Williams, Waterman, Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures" (2009)
GPU Architecture:
- NVIDIA, CUDA C++ Programming Guide, Chapters on Memory Hierarchy and Execution Model
- Jia et al., "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking" (2018)
ML-Specific:
- Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (2022), Section 2
- Ivanov et al., "Data Movement Is All You Need: A Case Study on Optimizing Transformers" (2021)
- Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (2022)
Next Topics
- FlashAttention: the canonical example of architecture-aware algorithm design for transformers
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Floating-Point Arithmetic (Layer 0A)