
NVIDIA GPU Architectures

A concrete reference for the GPU hardware that determines what ML models can be trained and served: H100, H200, B200, GB200, and Rubin architectures, with emphasis on memory bandwidth as the primary bottleneck.


Why This Matters

Hardware dictates what is possible. The decision to train a 405B parameter model or a 70B model is not purely algorithmic. It depends on how many GPUs you have, how much memory each one holds, and how fast they can communicate. Understanding GPU architectures lets you reason about training costs, inference latency, and what techniques (quantization, tensor parallelism, offloading) are necessary for a given model at a given scale.

The key insight: for most LLM workloads, memory bandwidth is the bottleneck, not compute. Autoregressive decoding reads the entire model from memory for every single token. A GPU with more FLOPS but the same memory bandwidth does not decode faster.

Architecture Timeline

The relevant NVIDIA data center GPU generations for ML:

  • A100 (Ampere, 2020): 80GB HBM2e, 2 TB/s bandwidth, TF32 tensor cores
  • H100 (Hopper, 2023): 80GB HBM3, 3.35 TB/s bandwidth, transformer engine, FP8
  • H200 (Hopper refresh, 2024): 141GB HBM3e, 4.8 TB/s bandwidth, same compute as H100
  • B200 (Blackwell, 2025): two dies connected, FP4, second-gen transformer engine
  • GB200 NVL72 (Blackwell, 2025): 72 GPUs in a rack with NVLink domain

Key Specifications

Definition

HBM (High Bandwidth Memory)

HBM stacks DRAM dies vertically, connected by through-silicon vias. Each generation increases bandwidth. HBM2e (A100): 2 TB/s. HBM3 (H100): 3.35 TB/s. HBM3e (H200): 4.8 TB/s. Memory bandwidth determines inference throughput for memory-bound operations like attention and large matrix reads.
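To make the bandwidth floor concrete, here is a minimal sketch of the per-token lower bound (the 8B model size is an illustrative assumption, not from the text):

```python
# Rough lower bound on per-token decode latency: every generated token
# requires streaming all weights from HBM once (batch size 1).
def min_token_latency_s(n_params: float, bytes_per_param: float,
                        hbm_bw_bytes_per_s: float) -> float:
    return (n_params * bytes_per_param) / hbm_bw_bytes_per_s

# Illustrative numbers: an 8B-parameter model in FP16 on H100-class
# bandwidth (3.35 TB/s) cannot decode faster than ~209 tokens/s.
latency = min_token_latency_s(8e9, 2, 3.35e12)
print(f"{latency * 1e3:.2f} ms/token")  # ~4.78 ms/token
```

Halving bytes per parameter (FP8) halves this floor, which is why precision and bandwidth trade off directly.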

Definition

Transformer Engine

A hardware unit on Hopper and Blackwell GPUs that dynamically selects between FP8 and FP16 precision per layer during training. It monitors tensor statistics and switches precision to maintain accuracy while doubling throughput compared to FP16 on supported operations.

Definition

NVLink

A high-bandwidth interconnect between GPUs. NVLink on H100 provides 900 GB/s bidirectional bandwidth per GPU. On Blackwell NVL72, a fifth-generation NVLink connects all 72 GPUs in a single domain at 1.8 TB/s per GPU, enabling tensor parallelism across the full rack without PCIe bottlenecks.
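As a rough sketch of why NVLink bandwidth matters for tensor parallelism, a simple ring all-reduce cost model (the message size and GPU count are illustrative assumptions; real collectives add latency and protocol overhead):

```python
# Ring all-reduce moves ~2*(N-1)/N of the message over each GPU's links,
# so per-layer synchronization time scales inversely with link bandwidth.
def ring_allreduce_time_s(msg_bytes: float, n_gpus: int,
                          link_bw_bytes_per_s: float) -> float:
    return 2 * (n_gpus - 1) / n_gpus * msg_bytes / link_bw_bytes_per_s

# Illustrative: a 16 MB activation all-reduced across 8 H100s over
# 900 GB/s NVLink.
t = ring_allreduce_time_s(16e6, 8, 900e9)
print(f"{t * 1e6:.1f} us per all-reduce")
```

Running the same message over PCIe-class bandwidth (~64 GB/s) would be more than an order of magnitude slower, which is what the NVL72 design avoids within the rack.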

Main Theorems

Proposition

Memory Bandwidth Bottleneck for Inference

Statement

For autoregressive LLM inference at batch size $B$, the time per token is approximately:

$$t_{\text{token}} \approx \max\left(\frac{2P}{B \cdot \text{BW}},\ \frac{2P}{\text{FLOPS}}\right)$$

where $P$ is the number of parameters (so $2P$ bytes of weights in FP16 and roughly $2P$ FLOPs per token), BW is memory bandwidth in bytes/sec, and FLOPS is peak compute. The memory term is amortized over the batch because one read of the weights serves all $B$ sequences; the compute term is per token and does not shrink with batch size. For small $B$, the memory term dominates. The crossover batch size where compute becomes the bottleneck is $B^* = \text{FLOPS} / \text{BW}$, which is the arithmetic intensity ceiling.
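The proposition can be sketched in a few lines (the GPU numbers are the H100 specs quoted later on this page; the 8B model size is an illustrative assumption):

```python
# Roofline-style decode model: max of weight-streaming time (amortized
# over the batch) and per-token compute time (batch-independent).
def token_time_s(n_params, batch, bw_bytes_per_s, flops, bytes_per_param=2):
    t_mem = bytes_per_param * n_params / (batch * bw_bytes_per_s)
    t_compute = 2 * n_params / flops
    return max(t_mem, t_compute)

# Crossover batch size where compute overtakes bandwidth (for FP16,
# 2 bytes/param and ~2 FLOPs/param, this reduces to FLOPS / BW).
def crossover_batch(bw_bytes_per_s, flops, bytes_per_param=2):
    return bytes_per_param * flops / (2 * bw_bytes_per_s)

# H100: 3.35 TB/s, 989 TFLOPS FP16.
print(crossover_batch(3.35e12, 989e12))        # ~295
print(token_time_s(8e9, 1, 3.35e12, 989e12))   # bandwidth-bound at B=1
```

Below the crossover batch size, adding FLOPS changes nothing; above it, adding bandwidth changes nothing.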

Intuition

Each token generation reads the entire model from HBM (approximately $2P$ bytes in FP16). If the GPU can read at BW bytes per second, decoding one token takes at least $2P / \text{BW}$ seconds. More FLOPS do not help unless you increase the batch size enough to reuse the weights you already loaded.

Proof Sketch

This follows from the roofline model. The arithmetic intensity of a matrix-vector product $Wx$ (where $W$ is $m \times n$, stored in FP16) is $2mn / (2(mn + n + m)) \approx 1$ FLOP per byte. The GPU's compute-to-bandwidth ratio (its ridge point) is typically 100-500 FLOPs per byte, so a single matrix-vector product sits far below the roofline and is bandwidth bound.
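The arithmetic-intensity claim is easy to check numerically (the 8192x8192 weight shape is an illustrative assumption):

```python
# Arithmetic intensity of y = Wx for an m x n FP16 weight matrix:
# 2mn FLOPs against 2(mn + n + m) bytes of weights, input, and output.
def matvec_intensity(m: int, n: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * m * n
    bytes_moved = bytes_per_elem * (m * n + n + m)
    return flops / bytes_moved

print(matvec_intensity(8192, 8192))  # ~1.0 FLOP/byte
# H100 ridge point: 989e12 FLOPS / 3.35e12 B/s ~ 295 FLOPs/byte, so a
# lone matrix-vector product is deep in the bandwidth-bound region.
```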

Why It Matters

This explains why the H200 (same FLOPS as H100, 43% more bandwidth) is meaningfully faster for inference despite having identical compute units. It also explains why quantization to FP8 or INT4 helps inference: it reduces the bytes per parameter, directly increasing effective bandwidth.
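Both effects fall directly out of the bandwidth-bound model; a minimal sketch using the spec numbers from this page (the 8B model size is an illustrative assumption):

```python
# Bandwidth-bound decode throughput at batch size 1: tokens/s is just
# bandwidth divided by the bytes of weights read per token.
def decode_tokens_per_s(n_params, bytes_per_param, bw_bytes_per_s):
    return bw_bytes_per_s / (n_params * bytes_per_param)

h100_fp16 = decode_tokens_per_s(8e9, 2, 3.35e12)
h200_fp16 = decode_tokens_per_s(8e9, 2, 4.8e12)
h100_fp8 = decode_tokens_per_s(8e9, 1, 3.35e12)
print(f"H200 vs H100 (FP16): {h200_fp16 / h100_fp16:.2f}x")  # ~1.43x
print(f"FP8 vs FP16 on H100: {h100_fp8 / h100_fp16:.2f}x")   # 2.00x
```

The H200 speedup tracks the bandwidth ratio (4.8 / 3.35), and quantization to FP8 doubles throughput by halving the bytes read per parameter.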

Failure Mode

At large batch sizes, compute becomes the bottleneck. For training (large batch matmuls), the workload is compute-bound, not memory-bound. The analysis also changes for MoE models where only a fraction of parameters are active per token.

Architecture Details

H100 (Hopper)

  • 80 GB HBM3, 3.35 TB/s bandwidth
  • 989 TFLOPS FP16 tensor core, 1978 TFLOPS FP8
  • Transformer engine: dynamic FP8/FP16 switching
  • 900 GB/s NVLink (4th gen, 18 links)
  • The baseline GPU for 2023-2024 frontier training

H200

  • Same Hopper architecture and compute as H100
  • 141 GB HBM3e, 4.8 TB/s bandwidth
  • 76% more memory, 43% more bandwidth
  • Directly improves inference latency for bandwidth-bound workloads
  • Enables serving larger models without tensor parallelism

B200 (Blackwell)

  • Two dies connected by a 10 TB/s chip-to-chip link
  • 192 GB HBM3e, 8 TB/s bandwidth
  • FP4 tensor cores: double the throughput of FP8
  • Second-generation transformer engine
  • Approximately 2.5x the H100 inference throughput per GPU

GB200 NVL72

  • 72 Blackwell GPUs in a single rack
  • NVLink connects all 72 into one domain (1.8 TB/s per GPU)
  • 13.5 TB aggregate HBM across the rack
  • Designed for training and serving trillion-parameter models
  • Eliminates the need for InfiniBand within the rack
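A back-of-the-envelope check of the trillion-parameter claim (the 20% overhead factor for KV cache and activations is an illustrative assumption):

```python
# Does a model's weight footprint (plus overhead) fit in a given HBM pool?
def fits_in_hbm(n_params, bytes_per_param, hbm_bytes, overhead=1.2):
    return n_params * bytes_per_param * overhead <= hbm_bytes

rack_hbm = 72 * 192e9   # raw aggregate; NVIDIA quotes ~13.5 TB usable
node_hbm = 8 * 192e9    # a single 8-GPU B200 node

print(fits_in_hbm(1e12, 2, rack_hbm))  # FP16 trillion-param model: True
print(fits_in_hbm(1e12, 2, node_hbm))  # single 8-GPU node: False
```

A 1T-parameter model in FP16 fits comfortably in the rack's pooled HBM but not in a single eight-GPU node, which is the point of the unified NVLink domain.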

Rubin (Next Generation)

  • Announced architecture following Blackwell
  • HBM4 memory expected
  • Details are preliminary as of early 2026

Common Confusions

Watch Out

More FLOPS does not always mean faster inference

H200 has the same FLOPS as H100 but is faster for LLM inference because it has more memory bandwidth. For batch-size-1 autoregressive decoding, the GPU spends most of its time reading weights from HBM. Additional compute units sit idle. Only at large batch sizes does compute become the bottleneck.

Watch Out

FP8 and FP4 are not just about saving memory

Lower precision halves (FP8) or quarters (FP4) the bytes read per parameter. Since inference is bandwidth-bound, this directly translates to proportionally faster decoding. The memory savings are a bonus. The primary benefit is throughput.

Summary

  • Memory bandwidth, not FLOPS, determines LLM inference speed at low batch sizes
  • H100: 80GB, 3.35 TB/s. H200: 141GB, 4.8 TB/s. B200: 192GB, 8 TB/s
  • The transformer engine switches between precisions dynamically per layer
  • NVLink bandwidth determines how efficiently you can do tensor parallelism
  • GB200 NVL72 puts 72 GPUs in a single NVLink domain for trillion-parameter models

Exercises

Exercise (Core)

Problem

A 70B parameter model in FP16 (2 bytes per parameter) is served on a single H100 (3.35 TB/s bandwidth). Estimate the minimum time per token for batch-size-1 autoregressive decoding.

Exercise (Advanced)

Problem

Compare the time per token for serving the same 70B FP16 model on H100 vs H200 vs B200 (assuming the model fits in memory in all cases). At what batch size does the H100 become compute-bound?

References

Canonical:

  • NVIDIA H100 Tensor Core GPU Architecture Whitepaper (2022)
  • NVIDIA Blackwell Architecture Technical Brief (2024)

Current:

  • NVIDIA GTC 2025 Keynote, Blackwell and Rubin announcements


Last reviewed: April 2026
