NVIDIA GPU Architectures
A concrete reference for the GPU hardware that determines what ML models can be trained and served: H100, H200, B200, GB200, and Rubin architectures, with emphasis on memory bandwidth as the primary bottleneck.
Why This Matters
Hardware dictates what is possible. The decision to train a 405B parameter model or a 70B model is not purely algorithmic. It depends on how many GPUs you have, how much memory each one holds, and how fast they can communicate. Understanding GPU architectures lets you reason about training costs, inference latency, and what techniques (quantization, tensor parallelism, offloading) are necessary for a given model at a given scale.
The key insight: for most LLM workloads, memory bandwidth is the bottleneck, not compute. Autoregressive decoding reads the entire model from memory for every single token. A GPU with more FLOPS but the same memory bandwidth does not decode faster.
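A quick back-of-envelope sketch of this floor (the model size and bandwidth figures below are illustrative inputs, not measurements):

```python
def decode_floor_ms(params: float, bytes_per_param: float, bw_bytes_per_s: float) -> float:
    """Lower bound on per-token decode latency: each generated token must
    stream every weight from HBM at least once."""
    return params * bytes_per_param / bw_bytes_per_s * 1e3

# Example: an 8B-parameter model in FP16 on H100-class bandwidth (3.35 TB/s)
print(decode_floor_ms(8e9, 2, 3.35e12))  # ≈ 4.78 ms per token
```

No amount of extra compute lowers this floor; only more bandwidth or fewer bytes per parameter does.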
Architecture Timeline
The relevant NVIDIA data center GPU generations for ML:
- A100 (Ampere, 2020): 80GB HBM2e, 2 TB/s bandwidth, TF32 tensor cores
- H100 (Hopper, 2023): 80GB HBM3, 3.35 TB/s bandwidth, transformer engine, FP8
- H200 (Hopper refresh, 2024): 141GB HBM3e, 4.8 TB/s bandwidth, same compute as H100
- B200 (Blackwell, 2025): 192GB HBM3e, 8 TB/s bandwidth, dual-die design, FP4, second-gen transformer engine
- GB200 NVL72 (Blackwell, 2025): 72 GPUs in a rack with NVLink domain
Key Specifications
HBM (High Bandwidth Memory)
HBM stacks DRAM dies vertically, connected by through-silicon vias. Each generation increases bandwidth. HBM2e (A100): 2 TB/s. HBM3 (H100): 3.35 TB/s. HBM3e (H200): 4.8 TB/s. Memory bandwidth determines inference throughput for memory-bound operations like attention and large matrix reads.
Transformer Engine
A hardware unit on Hopper and Blackwell GPUs that dynamically selects between FP8 and FP16 precision per layer, during both training and inference. It monitors tensor statistics and adjusts precision and scaling factors to maintain accuracy while roughly doubling throughput compared to FP16 on supported operations.
NVLink
A high-bandwidth interconnect between GPUs. NVLink on H100 provides 900 GB/s bidirectional bandwidth per GPU. On Blackwell NVL72, a fifth-generation NVLink connects all 72 GPUs in a single domain at 1.8 TB/s per GPU, enabling tensor parallelism across the full rack without PCIe bottlenecks.
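To get a feel for why interconnect bandwidth matters for tensor parallelism, here is a rough ring all-reduce cost model for one activation sync (a sketch under a bus-bandwidth approximation; the tensor shapes and the 450 GB/s per-direction figure derived from 900 GB/s bidirectional are assumptions):

```python
def allreduce_seconds(message_bytes: float, n_gpus: int, link_bw_bytes_per_s: float) -> float:
    """Ring all-reduce: each GPU sends and receives ~2*(n-1)/n of the
    message over its own link (bus-bandwidth approximation)."""
    return 2 * (n_gpus - 1) / n_gpus * message_bytes / link_bw_bytes_per_s

# Hypothetical per-layer activation sync for 8-way tensor parallelism:
# batch 8, sequence 4096, hidden 8192, FP16
msg_bytes = 8 * 4096 * 8192 * 2
# 900 GB/s bidirectional NVLink ~ 450 GB/s per direction (assumption)
print(round(allreduce_seconds(msg_bytes, 8, 450e9) * 1e3, 2))  # ≈ 2.09 ms
```

Over PCIe-class bandwidth the same sync would take an order of magnitude longer, which is why tensor parallelism is normally confined to an NVLink domain.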
Main Theorems
Memory Bandwidth Bottleneck for Inference
Statement
For autoregressive LLM inference at batch size $B$, the time per token is approximately:

$$t_{\text{token}} \approx \frac{P \cdot s}{\text{BW}} + \frac{2 P B}{\text{FLOPS}}$$

where $P$ is the number of parameters, $s$ is bytes per parameter (set by precision), BW is memory bandwidth in bytes/sec, and FLOPS is peak compute. For small $B$, the first term dominates. The crossover batch size where compute becomes the bottleneck is $B^* = s \cdot \text{FLOPS} / (2 \cdot \text{BW})$, which for FP16 ($s = 2$) equals FLOPS/BW, the arithmetic intensity ceiling in FLOPs per byte.
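The two-term model can be sketched numerically (the GPU figures are approximate datasheet values used for illustration):

```python
def token_time_s(params, bytes_per_param, batch, bw, flops):
    """Two-term model: weight-streaming time plus compute time."""
    mem = params * bytes_per_param / bw      # read all weights once per token
    compute = 2 * params * batch / flops     # ~2 FLOPs per parameter per token
    return mem + compute

def crossover_batch(bytes_per_param, bw, flops):
    """Batch size where compute time equals the weight-read time."""
    return bytes_per_param * flops / (2 * bw)

# H100-like figures: 3.35 TB/s HBM, 989 TFLOPS dense FP16
print(round(crossover_batch(2, 3.35e12, 989e12)))  # ≈ 295
```

Below a batch size of roughly 300, an H100 serving an FP16 model is spending most of its time waiting on HBM, not computing.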
Intuition
Each token generation reads the entire model from HBM (approximately $2P$ bytes for FP16). If the GPU can read at BW bytes per second, decoding one token takes at least $2P/\text{BW}$ seconds. More FLOPS do not help unless you increase batch size enough to reuse the weights you already loaded.
Proof Sketch
This follows from the roofline model. The arithmetic intensity of a matrix-vector product $y = Wx$ (where $W$ is $d \times d$) is about 1 FLOP per byte in FP16: $2d^2$ FLOPs against $2d^2$ bytes of weights read. The GPU's compute-to-bandwidth ratio is typically 100-500 FLOPs/byte. So a single matrix-vector product is far below the roofline, making it bandwidth bound.
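A minimal check of this roofline argument (H100-class dense FP16 numbers, used for illustration):

```python
def matvec_intensity(d: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for y = W @ x with W of shape (d, d):
    2*d*d FLOPs against d*d*bytes_per_elem bytes of weight traffic."""
    return (2 * d * d) / (d * d * bytes_per_elem)

machine_balance = 989e12 / 3.35e12  # H100 dense FP16 FLOPs per HBM byte
print(matvec_intensity(8192), round(machine_balance))  # 1.0 vs ~295: memory bound
```

The gap between 1 FLOP/byte and the machine balance of ~295 is the headroom that batching exists to fill.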
Why It Matters
This explains why the H200 (same FLOPS as H100, 43% more bandwidth) is meaningfully faster for inference despite having identical compute units. It also explains why quantization to FP8 or INT4 helps inference: it reduces the bytes per parameter, directly increasing effective bandwidth.
Failure Mode
At large batch sizes, compute becomes the bottleneck. For training (large batch matmuls), the workload is compute-bound, not memory-bound. The analysis also changes for MoE models where only a fraction of parameters are active per token.
Architecture Details
H100 (Hopper)
- 80 GB HBM3, 3.35 TB/s bandwidth
- 989 TFLOPS FP16 tensor core, 1979 TFLOPS FP8 (dense, without structured sparsity)
- Transformer engine: dynamic FP8/FP16 switching
- 900 GB/s NVLink (4th gen, 18 links)
- The baseline GPU for 2023-2024 frontier training
H200
- Same Hopper architecture and compute as H100
- 141 GB HBM3e, 4.8 TB/s bandwidth
- 76% more memory, 43% more bandwidth
- Directly improves inference latency for bandwidth-bound workloads
- Enables serving larger models without tensor parallelism
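The bandwidth difference translates directly into the per-token weight-streaming floor (an idealized sketch that ignores KV-cache and activation traffic):

```python
def weight_stream_ms(model_bytes: float, bw_bytes_per_s: float) -> float:
    """Idealized floor: stream every weight once per decoded token."""
    return model_bytes / bw_bytes_per_s * 1e3

model_bytes = 70e9 * 2  # a 70B-parameter model in FP16 (illustrative)
for name, bw in [("H100", 3.35e12), ("H200", 4.8e12)]:
    print(name, round(weight_stream_ms(model_bytes, bw), 1), "ms/token")
# H100 ≈ 41.8, H200 ≈ 29.2: the 43% bandwidth gain shows up directly
```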
B200 (Blackwell)
- Two dies connected by a 10 TB/s chip-to-chip link
- 192 GB HBM3e, 8 TB/s bandwidth
- FP4 tensor cores: double the throughput of FP8
- Second-generation transformer engine
- Approximately 2.5x the H100 inference throughput per GPU
GB200 NVL72
- 72 Blackwell GPUs in a single rack
- NVLink connects all 72 into one domain (1.8 TB/s per GPU)
- 13.5 TB aggregate HBM across the rack
- Designed for training and serving trillion-parameter models
- Eliminates the need for InfiniBand within the rack
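A crude capacity check for whether a model's weights fit in the rack's aggregate HBM (a sketch that ignores activations, optimizer state, and framework overhead; the 13.5 TB figure is the rack total quoted above):

```python
def fits_in_rack(params: float, bytes_per_param: float,
                 kv_cache_bytes: float = 0.0, rack_hbm_bytes: float = 13.5e12) -> bool:
    """Rough check: do weights plus KV cache fit in the rack's aggregate HBM?
    Ignores activations and framework overhead."""
    return params * bytes_per_param + kv_cache_bytes <= rack_hbm_bytes

print(fits_in_rack(1e12, 2))                       # 1T params, FP16: 2 TB -> True
print(fits_in_rack(1e12, 2, kv_cache_bytes=2e12))  # plus 2 TB of KV cache -> True
```

A trillion-parameter model in FP16 uses only ~15% of the rack's HBM for weights, leaving the rest for KV cache and activations.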
Rubin (Next Generation)
- Announced architecture following Blackwell
- HBM4 memory expected
- Details are preliminary as of early 2026
Common Confusions
More FLOPS does not always mean faster inference
H200 has the same FLOPS as H100 but is faster for LLM inference because it has more memory bandwidth. For batch-size-1 autoregressive decoding, the GPU spends most of its time reading weights from HBM. Additional compute units sit idle. Only at large batch sizes does compute become the bottleneck.
FP8 and FP4 are not just about saving memory
Lower precision halves (FP8) or quarters (FP4) the bytes read per parameter. Since inference is bandwidth-bound, this directly translates to proportionally faster decoding. The memory savings are a bonus. The primary benefit is throughput.
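The proportionality can be sketched directly (an idealized model; real quantized kernels add some dequantization overhead):

```python
def quantized_floor_ms(params: float, bytes_per_param: float, bw: float) -> float:
    """Per-token weight-read floor; precision sets bytes per parameter."""
    return params * bytes_per_param / bw * 1e3

base = quantized_floor_ms(70e9, 2.0, 3.35e12)  # FP16 baseline, 70B on H100-class BW
for fmt, b in [("FP8", 1.0), ("INT4", 0.5)]:
    print(fmt, round(base / quantized_floor_ms(70e9, b, 3.35e12), 1), "x lower floor")
# FP8 2.0x, INT4 4.0x: the speedup tracks bytes-per-parameter exactly
```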
Summary
- Memory bandwidth, not FLOPS, determines LLM inference speed at low batch sizes
- H100: 80GB, 3.35 TB/s. H200: 141GB, 4.8 TB/s. B200: 192GB, 8 TB/s
- The transformer engine switches between precisions dynamically per layer
- NVLink bandwidth determines how efficiently you can do tensor parallelism
- GB200 NVL72 puts 72 GPUs in a single NVLink domain for trillion-parameter models
Exercises
Problem
A 70B parameter model in FP16 (2 bytes per parameter) is served on a single H100 (3.35 TB/s bandwidth). Estimate the minimum time per token for batch-size-1 autoregressive decoding.
Problem
Compare the time per token for serving the same 70B FP16 model on H100 vs H200 vs B200 (assuming the model fits in memory in all cases). At what batch size does the H100 become compute-bound?
References
Canonical:
- NVIDIA H100 Tensor Core GPU Architecture Whitepaper (2022)
- NVIDIA Blackwell Architecture Technical Brief (2024)
Current:
- NVIDIA GTC 2025 Keynote, Blackwell and Rubin announcements
Next Topics
- Flash Attention: algorithmic optimization that reduces HBM reads
- Fused kernels: combining operations to minimize memory round-trips
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- GPU Compute Model (Layer 5)