Why This Matters
An NVIDIA H100 SXM5 GPU delivers roughly 2,000 TFLOPS at FP8 through its tensor cores, compared with about 67 TFLOPS from its FP32 CUDA cores. That is a 30x ratio on the same chip. The tensor cores are physically separate matrix-multiply-accumulate (MMA) units; a single warp-level instruction computes a complete tile, not one multiply at a time. Without understanding this architecture, profiling output from ncu looks like noise and you cannot reason about why your training job is or isn't saturating hardware.
Mixed precision is the training-time answer to the question: if FP16 and BF16 reduce memory bandwidth and increase tensor core throughput, why do models diverge when naively cast to low precision? The answer is that gradients span a dynamic range that low-precision formats cannot represent without intervention. Loss scaling and FP32 master weights are the two mechanisms that close this gap. Getting them wrong causes silent divergence, not a crash.
Core Definitions
Tensor Core
A dedicated hardware unit inside an SM (streaming multiprocessor) that computes the matrix fused multiply-add D = A·B + C over a fixed tile in a single instruction. The operands A and B are in a low-precision format (FP16, BF16, TF32, FP8, or INT8); the accumulator C and result D are typically FP32. One warp-level MMA instruction on Ampere computes a 16×8 output tile, with all 32 threads of the warp cooperating on the single instruction.
Mixed-Precision Training
A training protocol in which forward and backward passes operate on low-precision copies of weights (FP16 or BF16), while a full-precision (FP32) master copy of weights and the optimizer state are stored separately. After each backward pass, the FP32 master weights are updated; fresh low-precision copies are cast from them before the next forward pass. Micikevicius et al. (ICLR 2018) formalized this workflow.
Loss Scaling
A technique to prevent FP16 gradient underflow. Before the backward pass, multiply the scalar loss by a large constant S (e.g., S = 2¹⁵). By the chain rule this multiplies every gradient by S, shifting the gradient distribution up by log₂(S) binades into FP16's representable range. After the backward pass, divide accumulated gradients by S before applying them to the FP32 master weights. If gradients overflow (producing inf or nan), skip the weight update and halve S.
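As a concrete illustration, here is a minimal hand-rolled dynamic loss-scaling step in PyTorch-style Python. The growth interval of 2000 steps is an illustrative assumption; torch.cuda.amp.GradScaler implements the production version of this logic:

```python
import torch

scale = 2.0 ** 15          # initial scale S
growth_interval = 2000     # assumed: steps without overflow before doubling S
steps_since_overflow = 0

def training_step(model, optimizer, batch):
    global scale, steps_since_overflow
    optimizer.zero_grad()
    loss = model(batch)
    (loss * scale).backward()            # scaled loss => gradients scaled by S

    # Unscale gradients back to true magnitude and check for overflow.
    found_inf = False
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(scale)
            if not torch.isfinite(p.grad).all():
                found_inf = True

    if found_inf:
        scale /= 2.0                     # overflow: halve S, skip the update
        steps_since_overflow = 0
    else:
        optimizer.step()                 # apply update to FP32 master weights
        steps_since_overflow += 1
        if steps_since_overflow >= growth_interval:
            scale *= 2.0                 # stable for a while: try a larger S
            steps_since_overflow = 0
```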
Precision Formats on Tensor Cores
FP16 vs BF16
Both formats occupy 16 bits, but the bit allocation differs:
| Format | Sign | Exponent | Mantissa | Max value | Notes |
|---|---|---|---|---|---|
| FP32 | 1 | 8 | 23 | ~3.4 × 10³⁸ | IEEE 754 |
| FP16 | 1 | 5 | 10 | 65504 | IEEE 754 |
| BF16 | 1 | 8 | 7 | ~3.4 × 10³⁸ | Google Brain float |
BF16 truncates the FP32 mantissa to 7 bits while keeping all 8 exponent bits. This means BF16 has the same dynamic range as FP32 but only about 2-3 decimal digits of precision. FP16 has 10-bit mantissa precision, but its 5-bit exponent caps values at 65504 and flushes magnitudes below 2⁻²⁴ to zero; it is this underflow of small gradients during training that loss scaling exists to fix.
In practice: use BF16 for training on Ampere or later (A100, H100). You can drop loss scaling entirely because the exponent range matches FP32. Use FP16 when targeting Volta/Turing hardware or when ONNX export compatibility matters.
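A quick way to see the range difference is to round the same FP32 values through each 16-bit format; a minimal sketch using PyTorch CPU tensors (any recent PyTorch with bfloat16 support should behave this way, with small rounding differences in the exact values):

```python
import torch

for x in (70000.0, 1e-8):
    fp16 = torch.tensor(x, dtype=torch.float16)
    bf16 = torch.tensor(x, dtype=torch.bfloat16)
    print(x, fp16.item(), bf16.item())

# 70000.0 -> fp16: inf (exceeds 65504); bf16: ~7.0e4 (in range, coarse mantissa)
# 1e-08   -> fp16: 0.0 (below 2**-24);  bf16: ~1.0e-8 (in range)
```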
TF32
TF32 is a 19-bit internal format used inside Ampere-and-later tensor core MMA instructions. It takes FP32 inputs, internally rounds the mantissa to 10 bits (matching FP16 precision) while keeping the 8-bit FP32 exponent, then accumulates into FP32. From the programmer's perspective, inputs and outputs remain FP32; the precision reduction is invisible unless you inspect numerical residuals. cuBLAS uses TF32 for FP32 GEMMs only when opted in (for example via cublasSetMathMode with CUBLAS_TF32_TENSOR_OP_MATH, or torch.backends.cuda.matmul.allow_tf32 = True in PyTorch); setting the environment variable NVIDIA_TF32_OVERRIDE=0 disables TF32 globally.
TF32 peak throughput on an H100 SXM5 is approximately 500 TFLOPS, versus BF16 at roughly 990 and FP8 at roughly 2000 (all dense, non-sparse numbers from the H100 whitepaper, §3.1; structured sparsity doubles each figure).
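The precision reduction is easy to observe in PyTorch by toggling the documented TF32 switch around the same FP32 matmul and comparing residuals against a float64 reference; a minimal sketch (requires an Ampere-or-later GPU):

```python
import torch

a = torch.randn(4096, 4096, device='cuda')
b = torch.randn(4096, 4096, device='cuda')
ref = a.double() @ b.double()                 # FP64 reference result

torch.backends.cuda.matmul.allow_tf32 = False
err_fp32 = (a @ b - ref).abs().max().item()   # true FP32 CUDA-core GEMM

torch.backends.cuda.matmul.allow_tf32 = True
err_tf32 = (a @ b - ref).abs().max().item()   # TF32 tensor-core GEMM

print(f"max |error| fp32={err_fp32:.3e}  tf32={err_tf32:.3e}")
# Expect the TF32 residual to be orders of magnitude larger, reflecting
# the 10-bit mantissa of the TF32 inputs.
```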
FP8: E4M3 and E5M2
Hopper (H100) added FP8 tensor core support with two variants:
- E4M3: 4-bit exponent, 3-bit mantissa. Max value 448. Used for weights and activations in forward pass, where values are bounded.
- E5M2: 5-bit exponent, 2-bit mantissa. Max value 57344. Used for gradients in backward pass, where you need more range and can tolerate less precision.
The split exists because weights cluster near zero with bounded magnitude, while gradients can be large and sparse. Using E5M2 for gradients avoids overflow without loss scaling. Using E4M3 for activations preserves more precision where the model is sensitive.
A complete H100 FP8 matmul uses cublasLtMatmul with CUDA_R_8F_E4M3 input types and FP32 accumulation (compute type CUBLAS_COMPUTE_32F). The throughput ratio: FP8 ≈ 2× BF16 ≈ 4× TF32 (H100 SXM5, dense).
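The max values quoted above follow directly from the bit layouts; a short sketch that recomputes them, assuming the E4M3 "FN" convention used on Hopper (the all-ones exponent still encodes finite values, and only the all-ones mantissa pattern is NaN):

```python
def fp_max(exp_bits: int, man_bits: int, fn: bool) -> float:
    """Largest finite value: (max mantissa) * 2**(max exponent)."""
    bias = 2 ** (exp_bits - 1) - 1
    if fn:
        # E4M3FN: top exponent kept for finite values; one NaN encoding.
        max_exp = (2 ** exp_bits - 1) - bias
        max_man = 1 + (2 ** man_bits - 2) / 2 ** man_bits
    else:
        # IEEE-style: top exponent reserved entirely for inf/NaN.
        max_exp = (2 ** exp_bits - 2) - bias
        max_man = 2 - 2 ** -man_bits
    return max_man * 2.0 ** max_exp

print(fp_max(4, 3, fn=True))    # E4M3FN -> 448.0
print(fp_max(5, 2, fn=False))   # E5M2   -> 57344.0
print(fp_max(5, 10, fn=False))  # FP16   -> 65504.0
```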
Tile Geometry Across Generations
A tensor core instruction operates on fixed-size tiles held in registers, not individual elements. The CUDA C++ WMMA API (nvcuda::wmma::mma_sync) and the underlying PTX mma/wgmma instructions expose these shapes:
| Architecture | PTX family | Representative shape (m×n×k) | Supported input types |
|---|---|---|---|
| Volta (V100) | wmma | 16×16×16 | FP16 (FP32 accumulate) |
| Turing (T4) | wmma | 16×16×16 | FP16, INT8, INT4 |
| Ampere (A100) | mma | 16×8×16 | FP16, BF16, TF32, INT8 |
| Hopper (H100) | wgmma | 64×N×16 (N up to 256) | FP8, FP16, BF16, TF32 |
Hopper's wgmma (warpgroup MMA) operates at the warpgroup level (128 threads = 4 warps), not per-warp. This reduces scheduling overhead and feeds the tensor cores more continuously. The raw instruction-level difference between Volta and Hopper is not clock speed; it is that one wgmma instruction computes a much larger tile in a narrower format: even the smallest 64×16 wgmma output tile holds 4× the elements of Volta's 16×16 wmma tile, and wgmma shapes extend to 64×256.
A minimal CUDA C++ wmma kernel for Ampere BF16 (compile for sm_80 or later):
```cpp
#include <mma.h>
#include <cuda_bf16.h>   // __nv_bfloat16, required for BF16 fragments
using namespace nvcuda::wmma;

// Each warp computes one 16×16 output tile; all 32 threads must participate.
// Shapes: A is 16×16 BF16, B is 16×16 BF16, C/D are 16×16 FP32.
__global__ void bf16_tile_mma(const __nv_bfloat16* a_ptr,
                              const __nv_bfloat16* b_ptr,
                              float* d_ptr) {
    fragment<matrix_a, 16, 16, 16, __nv_bfloat16, row_major> a_frag;
    fragment<matrix_b, 16, 16, 16, __nv_bfloat16, col_major> b_frag;
    fragment<accumulator, 16, 16, 16, float> c_frag;

    fill_fragment(c_frag, 0.0f);              // zero the accumulator
    load_matrix_sync(a_frag, a_ptr, 16);      // load tile, leading dim = 16
    load_matrix_sync(b_frag, b_ptr, 16);
    mma_sync(c_frag, a_frag, b_frag, c_frag); // D = A*B + C in one instruction
    store_matrix_sync(d_ptr, c_frag, 16, mem_row_major);
}
```
The 32 threads in the warp each hold a disjoint subset of the tile registers. The mma_sync call coordinates across all 32 threads; calling it without synchronizing all threads in the warp is undefined behavior.
Mixed-Precision Training in Practice
The training loop under mixed precision looks like this:
```python
# PyTorch AMP pseudo-code (mirrors torch.cuda.amp internals)
scaler = GradScaler(init_scale=2**15)
for batch in dataloader:
    optimizer.zero_grad()
    with autocast(device_type='cuda', dtype=torch.bfloat16):
        # Forward pass: weights cast to BF16 on the fly.
        # Activations stored in BF16.
        loss = model(batch)
    # Backward pass runs in BF16/FP32 mixed mode.
    # With BF16 the scaler is unnecessary (pass enabled=False to disable it).
    scaler.scale(loss).backward()
    # step() unscales gradients back to true magnitude and checks for inf/nan.
    scaler.step(optimizer)  # skips the update if gradients contain inf/nan
    scaler.update()         # adjusts the scale factor
    # optimizer.step() updates the FP32 master weights internally.
```
With BF16 on Ampere+, the GradScaler can be omitted or left at scale=1.0 because the exponent range matches FP32. With FP16, the scaler is necessary. The FP32 master weights and the Adam optimizer states (first moment m, second moment v) stay in full precision throughout; only the forward-backward path uses low precision.
Memory budget for one parameter with Adam under mixed precision:
- Master weight: 4 bytes (FP32)
- FP16/BF16 working copy: 2 bytes
- FP16/BF16 gradient: 2 bytes
- Adam first moment m: 4 bytes (FP32)
- Adam second moment v: 4 bytes (FP32)
- Total: 16 bytes per parameter (vs 16 bytes for pure FP32 training with Adam: weight + gradient + m + v)
The per-parameter footprint barely changes because the FP32 master copy and optimizer state dominate. Mixed precision's primary benefit is throughput from tensor cores and reduced activation memory, not optimizer-state reduction.
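To make the accounting concrete, a back-of-the-envelope sketch; the 7-billion-parameter model size is an assumption for illustration, and activations are excluded:

```python
# Persistent per-parameter training state, in bytes.
N = 7e9  # hypothetical 7B-parameter model

mixed = {"fp32 master weight": 4, "bf16 working copy": 2,
         "bf16 gradient": 2, "adam m (fp32)": 4, "adam v (fp32)": 4}
fp32 = {"fp32 weight": 4, "fp32 gradient": 4,
        "adam m (fp32)": 4, "adam v (fp32)": 4}

for name, budget in (("mixed precision", mixed), ("pure FP32", fp32)):
    total = sum(budget.values())
    print(f"{name}: {total} B/param -> {total * N / 2**30:.0f} GiB")
# Both come to 16 bytes/param, ~104 GiB of persistent state for 7B params;
# the win from mixed precision is speed and activation memory, not this.
```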
Key Result: Roofline Bound and Tensor Core Ceiling
Tensor Core Arithmetic Intensity Threshold
Statement
For a square GEMM of shape N×N×N in BF16, the arithmetic intensity is I = 2N³ / (6N²) = N/3 FLOP/byte. The roofline ridge point occurs at I* = P/B ≈ 990×10¹² / 3.35×10¹² ≈ 295 FLOP/byte on an H100 SXM5, giving N* ≈ 885. For N ≥ N*, the operation is compute-bound and runs at the tensor core ceiling; for N < N*, it is memory-bandwidth-bound.
Intuition
Each BF16 element is 2 bytes. An N×N×N matmul moves 3N² elements (two inputs, one output), i.e. 6N² bytes, and performs 2N³ FLOPs. So intensity = N/3 FLOP/byte for a square matmul; this uses the simplified no-cache-reuse ratio as a rough bound. Small batch sizes (e.g., batch=1 inference) drop far below the ~295 FLOP/byte ridge, making tensor cores irrelevant; bandwidth is the bottleneck.
Proof Sketch
Williams et al. (CACM 2009) define the roofline ceiling as min(P, I·B), where P is peak compute, I is arithmetic intensity, and B is peak memory bandwidth. Setting P = I·B gives the ridge point I* = P/B. For H100 SXM5 BF16: I* ≈ 990×10¹² FLOP/s ÷ 3.35×10¹² byte/s ≈ 295 FLOP/byte.
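A small helper makes the classification mechanical; a sketch using the H100 numbers above (the helper name and the demo shapes are illustrative, and you can swap in your own P and B):

```python
def gemm_roofline(m, n, k, bytes_per_el=2, peak_flops=990e12, bw=3.35e12):
    """Classify an m×n×k GEMM against the roofline ridge point."""
    flops = 2 * m * n * k
    traffic = bytes_per_el * (m * k + k * n + m * n)  # A + B + C, no reuse
    intensity = flops / traffic                        # FLOP/byte
    ridge = peak_flops / bw                            # I* = P/B
    bound = "compute-bound" if intensity >= ridge else "bandwidth-bound"
    return intensity, ridge, bound

print(gemm_roofline(4096, 4096, 4096))  # large square GEMM: compute-bound
print(gemm_roofline(1, 4096, 4096))     # batch-1 GEMV-like: bandwidth-bound
```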
Why It Matters
This explains why serving a language model at batch size 1 does not saturate tensor cores regardless of precision. Increasing batch size or using continuous batching (as in vLLM) is necessary to reach the compute-bound regime where FP8 beats BF16.
Failure Mode
Assuming FP8 is always 2× faster than BF16 in production. At small batch sizes (batch ≤ 16 for typical transformer layers), both are bandwidth-bound and deliver similar wall-clock throughput. The 2× ratio applies only when the operation is compute-bound.
Common Confusions
BF16 'same range as FP32' means same accuracy as FP32
BF16 matches FP32 exponent range, so it cannot overflow where FP32 would not. But its 7-bit mantissa means relative precision is about 2⁻⁸ ≈ 0.4% per number, versus 2⁻²⁴ ≈ 6×10⁻⁸ for FP32. Summing many BF16 values accumulates rounding error proportional to the count. This is why tensor cores accumulate into FP32 even when inputs are BF16: the accumulator runs in full precision, and only the inputs are narrow.
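A sketch of the accumulation effect with PyTorch CPU tensors; the explicit loop forces a BF16 round after every addition, which an FP32-accumulating sum avoids:

```python
import torch

n = 4096
ones = torch.ones(n)                       # FP32 source values

acc = torch.tensor(0.0, dtype=torch.bfloat16)
for v in ones:
    acc = acc + v.to(torch.bfloat16)       # round to BF16 after every add

print(ones.sum().item())    # 4096.0  (FP32 accumulation)
print(acc.float().item())   # 256.0   (BF16 running sum stalls)
# Once acc reaches 256, acc + 1 lies exactly halfway between the BF16
# neighbors 256 and 258, and round-to-nearest-even rounds it back to 256.
```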
torch.autocast casts all operations to low precision
autocast maintains an allowlist of operations that benefit from low precision (GEMM, convolution, attention) and a denylist of operations that require FP32 (softmax, layer norm, loss computation, reductions). Operations outside the allowlist run in FP32 or the input's dtype. The full allowlist is documented in the PyTorch AMP docs; assuming all ops are cast leads to unexpected precision loss in normalization layers.
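You can observe the two lists directly by checking output dtypes inside an autocast region; a minimal sketch (requires a CUDA device; the dtypes reflect the documented CUDA autocast lists, where matmul is BF16-listed and layer_norm is FP32-listed):

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 16, device='cuda')
w = torch.randn(16, 16, device='cuda')

with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    y = x @ w                        # allowlisted GEMM: runs in BF16
    z = F.layer_norm(y, (16,))       # FP32-listed: runs in full precision

print(y.dtype)  # torch.bfloat16
print(z.dtype)  # torch.float32
```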
Exercises
Problem
An A100 SXM4 has BF16 tensor core peak performance of 312 TFLOPS and HBM2e bandwidth of 2 TB/s. A transformer attention layer runs a GEMM of shape M×N×K. Derive the arithmetic intensity as a function of M, N, and K, compute the A100's ridge point, and determine for which shapes this GEMM is compute-bound versus memory-bandwidth-bound.
Problem
A training run uses FP16 mixed precision with a GradScaler initialized at scale S = 2¹⁵. After a particular batch, all gradient norms are normal, but one parameter's gradient has a true FP32 magnitude just below FP16's smallest subnormal, 2⁻²⁴. (a) Would this gradient underflow to zero without loss scaling? (b) What is the stored FP16 value after scaling by S? (c) After the scaler calls unscale_(), what FP32 value is recovered, and what is the relative rounding error?
References
Canonical:
- Micikevicius et al., "Mixed Precision Training" (ICLR 2018); the primary reference defining FP32 master weights, loss scaling, and the FP16 mixed-precision training protocol.
- NVIDIA, CUDA C++ Programming Guide (12.x), §7.24 "Warp Matrix Functions (wmma)"; documents fragment types, tile shapes per architecture, and mma_sync semantics.
- Williams, Waterman, Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," CACM 52(4), 2009; defines arithmetic intensity, the roofline ceiling, and the ridge-point calculation used above.
- NVIDIA, H100 Tensor Core GPU Architecture Whitepaper (2022), §3.1 "Tensor Core"; specifies the FP8 E4M3/E5M2 split, wgmma tile sizes, and peak TFLOPS figures for each precision.
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention," SOSP 2023; explains why small-batch inference is bandwidth-bound and motivates continuous batching to reach tensor core utilization.
Accessible:
- Horace He, "Making Deep Learning Go Brrrr From First Principles," Horace's Blog (2022) ; walks through the roofline model and arithmetic intensity for transformer layers with concrete numbers.
- Tim Dettmers, "A Gentle Introduction to 8-bit Matrix Multiplication," Tim's Blog (2022) ; covers INT8 and FP8 quantization with worked examples, accessible without reading the CUDA Programming Guide first.
Next Topics
- /computationpath/inference-quantization; INT8, GPTQ, and AWQ: quantizing weights post-training without FP32 master weights.
- /computationpath/flash-attention; memory-efficient attention that fuses softmax and matmul to stay in SRAM, reducing HBM bandwidth demand.
- /computationpath/roofline-model; full treatment of arithmetic intensity analysis, including how to profile with ncu to determine whether your kernel is compute- or bandwidth-bound.
- /computationpath/gpu-memory-hierarchy; HBM, L2 cache, shared memory, and register file: the memory levels that determine whether tensor cores can stay fed.