Why This Matters
A single 4096 by 4096 matrix multiply performs about 137 billion floating-point operations. A desktop CPU near 3 TFLOP/s FP32 takes at least 46 ms at peak arithmetic throughput. An H100-class GPU at 67 TFLOP/s FP32 takes at least 2.1 ms, and TF32 tensor cores push the advertised matmul peak toward 989 TFLOP/s for compatible operations, a figure that assumes structured sparsity; the dense rate is roughly half.
The other gap is bytes per second. DDR4 memory bandwidth is around 50 GB/s on many CPU systems. HBM2e and HBM3 GPUs deliver about 2 to 3 TB/s. Transformer training and batched inference repeatedly stream activations, weights, KV-cache pages, and attention scores. That traffic, not only FLOPs, is why the CPU became the host and the GPU became the main arithmetic device.
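To make the arithmetic concrete, here is a short check of the two compute-time floors, using only the round peak figures quoted above:
#include <stdio.h>
// Compute-time floors for one 4096 x 4096 FP32 matmul at the peak rates quoted above.
int main(void) {
    double n = 4096.0;
    double flops = 2.0 * n * n * n;               // about 137.4 billion FLOPs
    double cpu_peak = 3e12;                       // 3 TFLOP/s FP32 desktop CPU
    double gpu_peak = 67e12;                      // 67 TFLOP/s FP32 H100-class GPU
    printf("CPU compute floor: %.1f ms\n", 1e3 * flops / cpu_peak);
    printf("GPU compute floor: %.2f ms\n", 1e3 * flops / gpu_peak);
    return 0;
}
Both times are floors; cache misses, launch overhead, and imperfect utilization only push real runs higher.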
Core Definitions
Latency-Optimized Core
A CPU core designed to reduce the time for one instruction stream. It spends transistors on branch prediction, out-of-order execution, speculative execution, large private caches, and low-latency access paths. The target workload has branches, pointer chasing, system calls, and small working sets.
Throughput-Optimized Core
A GPU execution lane or small core designed to complete many independent operations per second across many threads. It accepts higher latency for one thread and hides that latency by switching among many resident threads while memory requests are in flight.
SIMD
Single instruction, multiple data. One CPU instruction, such as the fused multiply-add behind the _mm256_fmadd_ps intrinsic, applies the same operation to multiple lanes inside one architectural thread. AVX2 processes eight FP32 lanes in a 256-bit register.
SIMT
Single instruction, multiple threads. A GPU warp runs one instruction stream across a group of threads, commonly 32 threads on NVIDIA GPUs. Each thread has its own registers and program state, but the warp issues one instruction for active lanes.
Host and Device
The host is the CPU process that allocates GPU work, launches kernels, and handles I/O. The device is the GPU and its memory. Data crossing PCIe or another CPU-GPU interconnect pays transfer latency and consumes interconnect bandwidth.
CPU Execution Low Latency First
A modern CPU core is optimized for the hard case where the next instruction depends on the previous few instructions. It predicts branches, issues independent micro-operations out of order, and keeps recent data in a deep cache hierarchy. A typical layout has private L1 data cache, private or shared L2, and shared L3. The point is low latency for a single thread, not the largest count of FP32 multipliers.
Consider a branch-heavy tokenization loop:
// Irregular control flow and data-dependent branches
int count_digits(const unsigned char *s, int n) {
int c = 0;
for (int i = 0; i < n; i++) {
unsigned char x = s[i];
if (x >= '0' && x <= '9') c++;
else if (x == '\n') break;
}
return c;
}
A CPU branch predictor learns the common path, an L1 cache serves the next byte in a few cycles, and one core returns a result with low startup cost. Launching a GPU kernel for ten kilobytes of such work usually loses outright: launch and transfer overhead exceed the time one CPU core needs to finish the loop.
SIMD extends a CPU thread with wider registers. Eight FP32 values fit in one AVX2 vector:
#include <immintrin.h>
void axpy8(float *y, const float *x, float a) {
__m256 vx = _mm256_loadu_ps(x); // x[0] ... x[7]
__m256 vy = _mm256_loadu_ps(y); // y[0] ... y[7]
__m256 va = _mm256_set1_ps(a);
__m256 out = _mm256_fmadd_ps(va, vx, vy);
_mm256_storeu_ps(y, out);
}
The byte layout for float x[4] = {1, 2, 3, 4} on a little-endian machine is:
address     value   bytes (little-endian)
base+0      1.0     00 00 80 3f
base+4      2.0     00 00 00 40
base+8      3.0     00 00 40 40
base+12     4.0     00 00 80 40
A 128-bit SIMD load reads those 16 bytes into one register. A 256-bit load reads eight floats, 32 bytes. SIMD width is inside one thread; the programmer still has a loop over chunks.
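The eight-lane kernel above handles exactly one vector of data; a full axpy over n elements still needs that loop over chunks plus a scalar tail. A minimal sketch using the same AVX2 intrinsics:
#include <immintrin.h>
// y[i] = a * x[i] + y[i] over n elements: 8-wide AVX2 body plus a scalar tail.
void axpy_avx2(float *y, const float *x, float a, int n) {
    __m256 va = _mm256_set1_ps(a);
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(y + i, _mm256_fmadd_ps(va, vx, vy));
    }
    for (; i < n; i++) {                          // remainder shorter than one vector
        y[i] = a * x[i] + y[i];
    }
}
Compilers can often auto-vectorize this loop on their own, but the chunk-plus-tail structure is the same either way.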
GPU Execution Throughput First
A GPU trades single-thread latency for many resident threads. If one warp stalls on global memory, the scheduler issues another warp. Instead of a few large cores, a GPU has many streaming multiprocessors with registers, shared memory, warp schedulers, and vector-like execution lanes.
A CUDA kernel exposes this structure:
__global__ void axpy(float *y, const float *x, float a, int n) {
int t = blockIdx.x * blockDim.x + threadIdx.x;
if (t < n) {
y[t] = a * x[t] + y[t];
}
}
For blockDim.x = 256, one block contains eight 32-thread warps. Thread t loads x[t] and y[t]. If warp 0 starts at t = 0, its x addresses are base + 4*t for t = 0..31, a 128-byte contiguous span. The hardware coalesces those per-thread loads into memory transactions when alignment and cache-line boundaries permit it.
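For contrast, a hypothetical strided variant (the stride parameter is illustrative, not part of the kernel above) shows what coalescing loses:
// Hypothetical strided variant: with a stride of 32 floats or more, each lane's
// 4-byte load lands in a different 128-byte line, so one warp touches 32 lines
// instead of one contiguous span.
__global__ void axpy_strided(float *y, const float *x, float a, int n, int stride) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int i = t * stride;                           // adjacent threads are 4*stride bytes apart
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}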
SIMT differs from SIMD at the programming boundary. SIMD code says one CPU thread operates on vector lanes. SIMT code says many GPU threads operate on their own scalar values, while hardware groups them into warps.
Divergence is the tax. Suppose a warp executes:
if (threadIdx.x & 1) {
y[t] = x[t] + 1.0f;
} else {
y[t] = x[t] - 1.0f;
}
For the first warp, even lanes and odd lanes take different paths. Conceptually the warp runs the subtract instruction with active mask 0x55555555, then the add instruction with active mask 0xaaaaaaaa. Only half the lanes do useful arithmetic on each path. CPUs also suffer from branch mispredicts, but a CPU core does not normally serialize 32 logical threads through both sides of one branch.
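When both paths are this cheap, the branch can often be removed entirely so every lane stays active. A minimal sketch of the same update written branchlessly; whether the ternary becomes a select rather than a branch is ultimately up to the compiler:
// Branchless form of the same update: every lane stays active and the ternary
// usually lowers to a select instruction rather than a divergent branch.
__global__ void add_or_sub(float *y, const float *x, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) {
        float delta = (threadIdx.x & 1) ? 1.0f : -1.0f;
        y[t] = x[t] + delta;
    }
}
The gain is real only when both branch bodies are cheap; for expensive, genuinely exclusive work, divergence or warp-level restructuring remains the right tool.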
The Host Device Split
A CUDA program has two address spaces in the common mental model: CPU memory and GPU memory. The host allocates, transfers, launches, and synchronizes. The device runs kernels.
float *h_x = new float[n];                                         // host buffers
float *h_y = new float[n];
float *d_x, *d_y;
cudaMalloc(&d_x, n * sizeof(float));                               // device buffers
cudaMalloc(&d_y, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);   // host -> device
cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);
int block = 256;
int grid = (n + block - 1) / block;                                // enough blocks to cover n
axpy<<<grid, block>>>(d_y, d_x, 2.0f, n);                          // launch is asynchronous
cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);   // device -> host result
For n = 1,000,000, each FP32 array is 4 MB. Copying x and y to the device and y back transfers 12 MB. At 16 GB/s effective PCIe bandwidth, the copies take at least 12 MB / 16 GB/s = 0.75 ms, before launch overhead and kernel time. The kernel performs 2 million FLOPs. At 10 TFLOP/s sustained device arithmetic, arithmetic time is 0.0002 ms. The copies dominate.
This is why ML systems keep weights, activations, optimizer state, and KV caches resident on the device. Host-device copies are paid at batch boundaries, data-loader boundaries, checkpoint boundaries, and serving boundaries. In vLLM, the KV cache layout matters because attention reads previous keys and values for every generated token. Moving those bytes across PCIe per token would erase the GPU arithmetic advantage.
Memory Bandwidth and Transformer Work
Dense neural-network layers are usually regular enough to keep a GPU's arithmetic units busy. The transformer block in Vaswani et al. uses matrix multiplications for projections, attention score computation, and feed-forward layers. Those operations expose thousands of independent multiply-adds.
The Roofline model separates peak arithmetic rate from memory bandwidth. For arithmetic intensity I, measured in FLOPs per byte of memory traffic, attainable performance is bounded by:
attainable FLOP/s = min(peak FLOP/s, I * memory bandwidth)
Vector addition has low intensity. For z[i] = x[i] + y[i] in FP32, each element reads 8 bytes and writes 4 bytes for one FLOP, so I is about 0.083 FLOP per byte. With DDR4 at 50 GB/s, the bandwidth roof is about 4.17 GFLOP/s. With HBM at 3 TB/s, the bandwidth roof is 250 GFLOP/s. Both are far below tensor-core dense matmul peaks.
A square FP16 GEMM C = A B with n = 4096 performs 2n^3, about 137.4 billion FLOPs. Reading A and B and writing C once costs 3 * 4096 * 4096 * 2 = 100,663,296 bytes. The ideal intensity is about 1365 FLOPs per byte. Real kernels move tiles multiple times through cache and shared memory, but the operation still has high reuse. That is the shape GPUs were built to run.
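To check the two intensities against the roofline bound, a short sketch using the illustrative 67 TFLOP/s peak and 3 TB/s HBM figures from this section:
#include <stdio.h>
// Roofline bound: attainable FLOP/s = min(peak, intensity * bandwidth).
static double roofline(double peak, double bw, double intensity) {
    double mem_roof = intensity * bw;
    return mem_roof < peak ? mem_roof : peak;
}
int main(void) {
    double peak = 67e12, hbm = 3e12;                              // illustrative peak and bandwidth
    double i_vec_add = 1.0 / 12.0;                                // 1 FLOP per 12 bytes
    double n = 4096.0;
    double i_gemm = (2.0 * n * n * n) / (3.0 * n * n * 2.0);      // ~1365 FLOPs per byte
    printf("vector add bound: %.0f GFLOP/s\n", roofline(peak, hbm, i_vec_add) / 1e9);
    printf("4096 GEMM bound:  %.0f TFLOP/s\n", roofline(peak, hbm, i_gemm) / 1e12);
    return 0;
}
The vector add pins to the bandwidth roof at 250 GFLOP/s; the GEMM clears it and lands on the compute roof.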
CPU wins remain common. A small MLP with batch size 1 can finish on a CPU before a GPU launch and transfer complete. Tree search, dynamic programming with irregular dependencies, sparse pointer-heavy graph traversal, and request routing often prefer CPU cores. A production inference server often uses both: CPU threads parse requests and schedule batches; GPU kernels run dense linear algebra and attention.
Key Result
The practical invariant is a lower bound on wall time for GPU offload:
T_offload >= T_PCIe + max(T_compute, T_memory)
The compute and memory terms are bounded by:
T_compute >= F / P_peak and T_memory >= M / B_HBM
where F is the FLOPs the kernel executes, M is the bytes it streams through device memory, P_peak is the device arithmetic peak, B_HBM is the device memory bandwidth, and T_PCIe is the bytes copied over the interconnect divided by its bandwidth.
For the axpy example with n = 1,000,000, take F = 2,000,000 FLOPs, M = 12,000,000 device bytes, and HBM bandwidth 3 TB/s. Device memory time is at least 0.004 ms. Compute time at 67 TFLOP/s is about 0.00003 ms. PCIe copies at 16 GB/s cost at least 0.75 ms. The offload is transfer-bound.
For the 4096 GEMM, use 137.4 GFLOPs and about 100.7 MB of ideal matrix traffic. On a 67 TFLOP/s FP32 GPU, compute lower bound is 2.05 ms. HBM traffic lower bound at 3 TB/s is 0.034 ms. PCIe transfer of all three matrices at 16 GB/s is about 6.3 ms. If matrices already live on the GPU as part of a transformer layer, PCIe is absent from the inner loop and the GPU wins by arithmetic throughput.
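A small host-side sketch of this bound applied to both examples, using the same illustrative rates (67 TFLOP/s, 3 TB/s HBM, 16 GB/s PCIe):
#include <stdio.h>
// Offload lower bound: PCIe copies plus the larger of compute and device-memory time.
// All rates are the illustrative peaks used in this section.
static double offload_ms(double flops, double dev_bytes, double pcie_bytes,
                         double peak, double hbm_bw, double pcie_bw) {
    double compute = flops / peak;
    double memory = dev_bytes / hbm_bw;
    double copies = pcie_bytes / pcie_bw;
    double inner = compute > memory ? compute : memory;
    return 1e3 * (copies + inner);
}
int main(void) {
    // axpy: 2 MFLOP, 12 MB of device traffic, 12 MB crossing PCIe
    printf("axpy offload: %.3f ms\n", offload_ms(2e6, 12e6, 12e6, 67e12, 3e12, 16e9));
    // 4096 GEMM: 137.4 GFLOP, ~100.7 MB of ideal traffic, same bytes crossing PCIe
    printf("gemm offload: %.3f ms\n", offload_ms(137.4e9, 100.7e6, 100.7e6, 67e12, 3e12, 16e9));
    return 0;
}
The printed bounds reproduce the numbers above: roughly 0.75 ms for the axpy offload and a little over 8 ms for the GEMM when all three matrices cross PCIe.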
Common Confusions
SIMT is more than just wider SIMD
SIMD exposes vector lanes inside one CPU thread. SIMT exposes many scalar-looking threads and groups them into warps in hardware. The difference matters for control flow. In SIMD, the programmer or compiler packs data into vector registers. In SIMT, a branch inside one warp creates active masks and path serialization.
Peak FLOPs are not end-to-end inference throughput
An H100 TF32 number near 989 TFLOP/s applies to specific tensor-core matrix operations with compatible shapes and data types. Tokenization, sampling, CPU request handling, memory allocation, KV-cache reads, and PCIe copies are outside that number.
GPU memory bandwidth does not remove all memory bottlenecks
HBM at 2 to 3 TB/s is much higher than DDR4 around 50 GB/s, but attention can still be memory-bound during decoding. Batch size, sequence length, KV-cache layout, and paging determine whether the device streams bytes faster than it performs arithmetic.
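As a rough illustration, consider the bytes of KV cache a single decode step must read. The model shape below is a hypothetical 7B-class configuration (32 layers, 32 heads, head dimension 128, FP16), not taken from the text:
#include <stdio.h>
// Bytes of KV cache one decode step reads for a single sequence.
// Hypothetical 7B-class shape: 32 layers, 32 heads, head_dim 128, FP16 values.
int main(void) {
    double layers = 32, heads = 32, head_dim = 128, bytes_per_elem = 2;
    double seq_len = 4096;                        // tokens already in the context
    double kv_bytes = 2 * layers * heads * head_dim * seq_len * bytes_per_elem;  // K and V
    double hbm = 3e12;                            // HBM bandwidth figure from the text
    printf("KV read per generated token: %.2f GB, >= %.2f ms of HBM time\n",
           kv_bytes / 1e9, 1e3 * kv_bytes / hbm);
    return 0;
}
At these assumed shapes, a 4096-token context implies roughly 2 GB of KV reads per generated token, a memory-time floor that is paid again for every token and every sequence in the batch.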
Exercises
Problem
A CPU system has 50 GB/s memory bandwidth. A GPU has 3 TB/s HBM bandwidth and receives data across PCIe at 16 GB/s. For FP32 vector addition over 10 million elements, compute the array size, total bytes for two reads and one write, and the bandwidth lower bound on CPU, GPU device memory, and PCIe transfer if all arrays move host to device and result returns.
Problem
A warp of 32 threads executes if (threadIdx.x < 8) A(); else B();. Give the active masks for A and B using bit 0 for thread 0. What fraction of lanes are active on each path?
Problem
For a 4096 by 4096 FP16 matrix multiply, compute the ideal FLOPs, ideal bytes for reading A, reading B, and writing C, and arithmetic intensity. Then decide whether the Roofline bound is compute-bound or bandwidth-bound on a GPU with 67 TFLOP/s FP32 peak and 3 TB/s HBM bandwidth.
References
Canonical:
- Hennessy and Patterson, Computer Architecture: A Quantitative Approach (6th ed., 2017), §2.1-§2.4 and §3.3, CPU pipelines, caches, SIMD, and GPU architecture
- NVIDIA, CUDA C++ Programming Guide (2025), §2, §5.4, and §6.2, CUDA programming model, SIMT execution, memory hierarchy
- Williams, Waterman, and Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures” (CACM 2009), §1-§3, arithmetic intensity and bandwidth ceilings
- Vaswani et al., “Attention Is All You Need” (NeurIPS 2017), §3.2 and §5.4, transformer computation dominated by dense attention and feed-forward layers
- Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention” (SOSP 2023), §2-§4, KV-cache memory pressure in LLM serving
- Olah et al., “A Mathematical Framework for Transformer Circuits” (Transformer Circuits, 2021), attention heads and residual stream framing
Accessible:
- Mark Harris, “An Even Easier Introduction to CUDA,” NVIDIA Developer Blog
- Robert van de Geijn and Maggie Myers, Advanced Linear Algebra: Foundations to Frontiers, chapters on high-performance matrix multiplication
- Stanford CS149 lecture notes, “Parallel Computing,” lectures on SIMD, SIMT, and memory bandwidth
Next Topics
- /computationpath/simd-and-vectorization
- /computationpath/cuda-mental-model
- /computationpath/roofline-model
- /computationpath/inference-serving-memory