

Quantization And Model Compression

How LLM weights, activations, and caches shrink through INT8, INT4, FP8, sparsity, and distillation, with the accuracy and throughput costs made explicit.

Why This Matters

A 70-billion-parameter transformer stored in FP16 needs about 140 GB for weights alone. Weight-only INT4 stores the same 70 billion weights in about 35 GB, plus scales. With group size 128 and one FP16 scale per group, scale overhead is $70 \cdot 10^9 / 128 \cdot 2 \approx 1.09$ GB. Add zero-points and metadata, and a consumer machine with 48 GB of GPU memory can hold weights that would not fit in FP16.

The speed story is not automatic. If a kernel dequantizes INT4 weights into FP16 and then calls an FP16 matrix multiply, the saved memory traffic pays off only when the operation is bandwidth-bound. If the GPU spends most of its time in tensor core compute, a compact format without matching hardware instructions can lose.

Core Definitions

Definition

Affine quantization

A real tensor value $x$ is approximated by an integer $q$ through $x \approx s(q - z)$, where $s$ is a positive scale and $z$ is an integer zero point. Quantization maps $x$ to $q = \operatorname{clip}(\operatorname{round}(x/s) + z,\ q_{\min},\ q_{\max})$.
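As a minimal sketch of these two formulas in C (the helper names quant_affine and dequant_affine are illustrative, not from any library):

#include <math.h>
#include <stdint.h>

/* q = clip(round(x / s) + z, qmin, qmax) */
static int32_t quant_affine(float x, float s, int32_t z,
                            int32_t qmin, int32_t qmax) {
    int32_t q = (int32_t)roundf(x / s) + z;
    if (q < qmin) q = qmin;
    if (q > qmax) q = qmax;
    return q;
}

/* x is approximated by s * (q - z) */
static float dequant_affine(int32_t q, float s, int32_t z) {
    return s * (float)(q - z);
}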

Definition

Symmetric quantization

Symmetric quantization fixes $z = 0$ and uses signed integer codes, often $[-127, 127]$ for INT8 or $[-7, 7]$ for signed INT4. It is common for weights because transformer weight distributions are usually centered near zero.

Definition

Asymmetric quantization

Asymmetric quantization learns or computes $z$ so that nonzero-centered ranges such as ReLU activations use the integer code space better. UINT8 activations with $[0, 255]$ are the usual example.

Definition

Granularity

Per-tensor quantization uses one scale for a whole tensor. Per-channel quantization uses one scale per output channel, often one per row of a linear layer weight matrix. Per-group quantization divides a row into blocks, such as 128 weights per scale.

Definition

Fake quantization

Fake quantization is the training-time operation $\hat{x} = s\,(\operatorname{clip}(\operatorname{round}(x/s) + z,\ q_{\min},\ q_{\max}) - z)$ computed in floating point. The forward pass sees quantization error while the backward pass often uses a straight-through estimator for rounding.
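A one-function transcription of this definition in C; the name fake_quant is ours:

#include <math.h>

/* Fake quantization: s * (clip(round(x/s) + z, qmin, qmax) - z).
   Everything stays in floating point, so the rest of the graph
   sees the rounding and clipping error but never sees integers. */
static float fake_quant(float x, float s, int z, int qmin, int qmax) {
    float q = roundf(x / s) + (float)z;
    if (q < (float)qmin) q = (float)qmin;
    if (q > (float)qmax) q = (float)qmax;
    return s * (q - (float)z);
}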

Numeric Formats And Byte Layouts

A dense FP16 matrix with shape $4096 \times 4096$ has $16{,}777{,}216$ elements and uses 32 MiB. INT4 weight-only storage uses 4 bits per weight, so the raw weight array uses 8 MiB. If scales are FP16 and group size is 128, there are $16{,}777{,}216 / 128 = 131{,}072$ scales, or 256 KiB. The memory reduction is close to 4x even after scales.

Two INT4 weights are packed into one byte. A common signed layout stores each quantized value as a 4-bit two's-complement nibble.

#include <stdint.h>

/* Pack two signed 4-bit values into one byte: lo in bits 0..3, hi in bits 4..7. */
static inline uint8_t pack_i4(int8_t lo, int8_t hi) {
    uint8_t a = (uint8_t)(lo & 0x0f);
    uint8_t b = (uint8_t)(hi & 0x0f);
    return (uint8_t)(a | (b << 4));
}

/* Extract the low nibble and sign-extend from 4-bit two's complement. */
static inline int8_t unpack_i4_low(uint8_t byte) {
    int8_t x = (int8_t)(byte & 0x0f);
    return (x >= 8) ? (int8_t)(x - 16) : x;
}

/* Extract the high nibble and sign-extend from 4-bit two's complement. */
static inline int8_t unpack_i4_high(uint8_t byte) {
    int8_t x = (int8_t)((byte >> 4) & 0x0f);
    return (x >= 8) ? (int8_t)(x - 16) : x;
}

Example: the pair $(-3, 6)$ becomes low nibble 1101 and high nibble 0110, so the byte is 0x6d. With scale $s = 0.05$, the dequantized values are $-0.15$ and $0.30$.
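A quick roundtrip check of that example, assuming the pack and unpack helpers above are in scope:

#include <assert.h>
#include <stdio.h>

int main(void) {
    uint8_t byte = pack_i4(-3, 6);
    assert(byte == 0x6d);
    assert(unpack_i4_low(byte) == -3);
    assert(unpack_i4_high(byte) == 6);
    printf("dequantized: %.2f %.2f\n",
           0.05f * unpack_i4_low(byte), 0.05f * unpack_i4_high(byte));
    return 0;  /* prints: dequantized: -0.15 0.30 */
}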

FP8 is different. It is a floating-point format, not an integer plus external scale, although practical systems still use tensor-level or block-level scaling. E4M3 has 1 sign bit, 4 exponent bits, and 3 mantissa bits. E5M2 has 1 sign bit, 5 exponent bits, and 2 mantissa bits. E4M3 gives more precision near one; E5M2 gives wider range. Hopper tensor cores support FP8 matrix operations with accumulation into wider formats.

A byte-level E4M3 sketch for positive values is:

bit 7       bits 6..3        bits 2..0
sign        exponent         fraction
0           eeee             fff

The exact handling of infinities, NaNs, and finite maxima depends on the FP8 variant. In ML kernels, the more visible control knob is often the scaling policy around the FP8 conversion.
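For concreteness, here is a sketch that decodes one E4M3 byte under the OCP-style convention (exponent bias 7, no infinities, S.1111.111 reserved for NaN). Treat it as one plausible decoding, since special-value rules differ across FP8 variants:

#include <math.h>
#include <stdint.h>

/* Decode one E4M3 byte: sign in bit 7, 4 exponent bits with bias 7,
   3 fraction bits. Assumes the OCP-style variant with no infinities
   and S.1111.111 as the only NaN pattern. */
static float decode_e4m3(uint8_t b) {
    int sign = (b >> 7) & 1;
    int e    = (b >> 3) & 0x0f;
    int f    = b & 0x07;
    float v;
    if (e == 0x0f && f == 0x07)
        return NAN;                                 /* S.1111.111 */
    if (e == 0)
        v = ldexpf((float)f / 8.0f, -6);            /* subnormal: f/8 * 2^-6 */
    else
        v = ldexpf(1.0f + (float)f / 8.0f, e - 7);  /* normal */
    return sign ? -v : v;
}

/* Largest finite value under this convention: e = 15, f = 6
   gives 1.75 * 2^8 = 448. */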

Calibration And Scale Selection

Calibration runs representative inputs through the unquantized model and records tensor statistics. The dataset is small compared with training. A few hundred to a few thousand sequences often suffice for PTQ experiments, but the sample must cover the deployment distribution.

For symmetric per-tensor INT8, a simple max calibration chooses:

$$s = \frac{\max_i |x_i|}{127}.$$

If an activation tensor has observed range $[-2.8, 3.1]$, then $s = 3.1/127 \approx 0.0244$. The value $x = 0.70$ maps to $q = \operatorname{round}(0.70 / 0.0244) = 29$ and dequantizes to $0.7076$.
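A direct transcription of max calibration in C (the helper name is illustrative):

#include <math.h>
#include <stddef.h>

/* s = max_i |x_i| / 127 for symmetric per-tensor INT8.
   For the range [-2.8, 3.1] above, amax = 3.1 and s ~ 0.0244. */
static float calib_scale_symmetric(const float *x, size_t n) {
    float amax = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    return amax / 127.0f;
}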

Asymmetric UINT8 min-max calibration for range [a,b][a,b] uses:

$$s = \frac{b - a}{255}, \qquad z = \operatorname{round}\!\left(-\frac{a}{s}\right).$$

For $[-0.4, 5.7]$, $s = 6.1/255 \approx 0.0239$ and $z \approx 17$. The real zero maps to integer 17. This matters for activations because attention and MLP intermediate tensors can be offset from zero after layer norm, residual addition, or nonlinearities.
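And the min-max counterpart, which reproduces $s \approx 0.0239$ and $z = 17$ for the range above; a sketch that assumes the observed range is nondegenerate:

#include <math.h>
#include <stddef.h>

/* UINT8 min-max calibration: s = (b - a) / 255, z = round(-a / s).
   Assumes the buffer contains at least two distinct values. */
static void calib_minmax_u8(const float *x, size_t n, float *s, int *z) {
    float lo = x[0], hi = x[0];
    for (size_t i = 1; i < n; ++i) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    *s = (hi - lo) / 255.0f;
    *z = (int)roundf(-lo / *s);
}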

Per-channel and per-group scales reduce error when one row has outliers. Suppose one row contains weights in $[-0.2, 0.2]$ and another in $[-2.0, 2.0]$. Per-tensor INT4 symmetric scaling uses $s = 2/7 \approx 0.286$, so most values in the small row round to $-1$, $0$, or $1$. Per-row scaling for the small row uses $s = 0.2/7 \approx 0.0286$, preserving more levels.
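A per-group symmetric INT4 quantizer sketch makes the granularity point concrete: each block of group weights gets its own scale, so a small-range block no longer shares an outlier block's scale. Names are illustrative:

#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Quantize one row of n weights with a symmetric INT4 scale per group. */
static void quant_row_i4_grouped(const float *w, int8_t *q, float *scales,
                                 size_t n, size_t group) {
    for (size_t g0 = 0; g0 < n; g0 += group) {
        size_t g1 = (g0 + group < n) ? g0 + group : n;
        float amax = 0.0f;
        for (size_t i = g0; i < g1; ++i) {
            float a = fabsf(w[i]);
            if (a > amax) amax = a;
        }
        float s = (amax > 0.0f) ? amax / 7.0f : 1.0f;  /* avoid s = 0 */
        scales[g0 / group] = s;
        for (size_t i = g0; i < g1; ++i) {
            int v = (int)roundf(w[i] / s);
            if (v < -7) v = -7;
            if (v > 7) v = 7;
            q[i] = (int8_t)v;
        }
    }
}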

Post-Training Quantization For LLM Weights

Post-training quantization converts a trained model without full retraining. For current LLM deployment, weight-only PTQ dominates. Weights become INT8 or INT4, while activations and accumulators remain FP16, BF16, or FP32. The linear layer computes roughly:

$$Y = X \cdot \operatorname{dequant}(Q_W, S_W).$$

GPTQ and AWQ are PTQ families. GPTQ quantizes weights with a local second-order correction objective, but the operational view is simpler: process weight blocks, choose integer codes, and update the remaining unquantized weights to compensate for the block error. AWQ searches for activation-aware per-channel rescaling, preserving channels that matter more for observed activations. Neither requires the full pretraining corpus.
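To show that operational shape without the second-order machinery, here is a deliberately simplified toy: quantize weights left to right and push each rounding error onto the next unquantized weight. This is not GPTQ's actual update, which weights the compensation by inverse-Hessian information and works blockwise; it only shows the shape of the loop:

#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Toy greedy rounding with error feedback, NOT the GPTQ update:
   each weight absorbs the quantization error of its predecessor. */
static void quant_with_compensation(const float *w, int8_t *q,
                                    float *err_out, size_t n, float s) {
    float carry = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float target = w[i] + carry;      /* compensate previous error */
        int v = (int)roundf(target / s);
        if (v < -7) v = -7;
        if (v > 7) v = 7;
        q[i] = (int8_t)v;
        carry = target - s * (float)v;    /* residual passed forward */
    }
    if (err_out) *err_out = carry;        /* leftover error at row end */
}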

A minimal weight-only kernel has three phases: load packed integers, dequantize into registers or shared memory, then run multiply-accumulate. Real kernels fuse these phases to avoid writing a full FP16 copy of weights.

// One group of eight signed INT4 weights stored in four bytes.
// Dequantize with one scale and multiply by FP16/FP32 activations.
float dot_i4_group8(const uint8_t *packed, const float *x, float scale) {
    float acc = 0.0f;
    for (int b = 0; b < 4; ++b) {
        int8_t w0 = unpack_i4_low(packed[b]);
        int8_t w1 = unpack_i4_high(packed[b]);
        acc += (scale * (float)w0) * x[2*b + 0];
        acc += (scale * (float)w1) * x[2*b + 1];
    }
    return acc;
}

The code is pedagogical, not a fast GPU kernel. A production kernel tiles the matrix, coalesces packed loads, and arranges dequantization so tensor core instructions receive fragments in the expected layout.

Quantization-Aware Training And FP8

QAT inserts fake-quant nodes during fine-tuning. A linear layer sees $\hat{W}$ and sometimes $\hat{X}$ in the forward pass, while optimizer state remains higher precision. The standard straight-through estimator treats rounding as identity in the backward pass within the unclipped range:

$$\frac{\partial \hat{x}}{\partial x} \approx 1 \quad \text{if } q_{\min} < \operatorname{round}(x/s) + z < q_{\max}.$$
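In code, the STE backward is a masked pass-through. A sketch under the strict-inequality convention above; this is illustrative, not any framework's implementation:

#include <math.h>

/* Straight-through estimator for fake quantization:
   d x_hat / d x ~ 1 inside the unclipped range, 0 outside. */
static float ste_grad(float x, float grad_out, float s, int z,
                      int qmin, int qmax) {
    float q = roundf(x / s) + (float)z;
    return (q > (float)qmin && q < (float)qmax) ? grad_out : 0.0f;
}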

QAT costs more than PTQ but often recovers accuracy for activation quantization, small models, and aggressive bit widths. It is also common when the final target is integer arithmetic on fixed hardware rather than a GPU dequantization path.

FP8 weight-plus-activation paths are a separate regime. Hopper GPUs expose tensor core support for FP8 matrix multiplication. Training systems keep master weights, gradients, or optimizer states in FP16, BF16, or FP32, then cast selected tensors to FP8 for matrix operations. In inference, both weights and activations can be FP8 when calibration or runtime scaling keeps values in range.

The invariant is that FP8 is a compute format, while INT4 weight-only is usually a storage format. INT4 saves memory bandwidth for weights but still multiplies in FP16 or similar after dequantization. FP8 can reduce both memory traffic and tensor core input bandwidth when hardware accepts FP8 fragments directly.

Sparsity, Pruning, And Distillation

Structured 2:4 sparsity means that in each group of four weights, exactly two are nonzero. Ampere and Hopper tensor cores have sparse matrix instructions for this pattern. The raw data contains two nonzero values plus metadata selecting their positions.

Example group:

dense weights       [ 0.0, -1.25, 0.0, 0.50 ]
stored values       [ -1.25, 0.50 ]
position metadata   [ 1, 3 ]

The theoretical weight multiply count halves for the sparse operand. The realized gain depends on layout, metadata handling, and whether the layer maps to supported sparse tensor core instructions.
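A sketch of producing that stored-values-plus-metadata layout for one group, assuming exactly two nonzeros per four weights; hardware metadata is a packed 2-bit-per-value encoding rather than this byte-per-index form:

#include <stdint.h>

/* Compress a group of 4 values with exactly two nonzeros into
   two stored values plus their positions (illustrative layout). */
static int compress_2of4(const float w[4], float vals[2], uint8_t idx[2]) {
    int k = 0;
    for (int i = 0; i < 4; ++i) {
        if (w[i] != 0.0f) {
            if (k == 2) return -1;   /* more than two nonzeros: not 2:4 */
            vals[k] = w[i];
            idx[k] = (uint8_t)i;
            ++k;
        }
    }
    return (k == 2) ? 0 : -1;        /* require exactly two nonzeros */
}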

Unstructured pruning sets arbitrary weights to zero. It compresses well only if the runtime has a sparse format and kernels whose overhead is lower than skipped arithmetic. For LLM inference on GPUs, unstructured sparsity is less common than INT4 weight-only quantization because dense GEMM kernels are heavily tuned.

Distillation compresses by training a smaller student model to match a larger teacher. The loss often combines ground-truth cross entropy with a KL term between teacher and student token distributions:

$$L = (1 - \alpha)\, L_{\text{CE}} + \alpha T^2\, \operatorname{KL}\!\left(p_T^{\text{teacher}} \,\|\, p_T^{\text{student}}\right).$$
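A scalar sketch of this loss for one token position, with the temperature-scaled softmax written out; the function names and the use of C99 variable-length arrays are our own simplifications:

#include <math.h>
#include <stddef.h>

/* Softmax of logits / T into p, stabilized by subtracting the max. */
static void softmax_t(const float *logits, float *p, size_t n, float T) {
    float m = logits[0];
    for (size_t i = 1; i < n; ++i) if (logits[i] > m) m = logits[i];
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) { p[i] = expf((logits[i] - m) / T); sum += p[i]; }
    for (size_t i = 0; i < n; ++i) p[i] /= sum;
}

/* (1 - alpha) * CE(label, student) + alpha * T^2 * KL(teacher || student). */
static float distill_loss(const float *t_logits, const float *s_logits,
                          size_t n, size_t label, float alpha, float T) {
    float pt[n], ps[n], s1[n];          /* C99 VLAs, fine for a sketch */
    softmax_t(t_logits, pt, n, T);      /* teacher at temperature T */
    softmax_t(s_logits, ps, n, T);      /* student at temperature T */
    softmax_t(s_logits, s1, n, 1.0f);   /* student at T = 1 for CE */
    float ce = -logf(s1[label] + 1e-12f);
    float kl = 0.0f;
    for (size_t i = 0; i < n; ++i)
        if (pt[i] > 0.0f) kl += pt[i] * logf(pt[i] / (ps[i] + 1e-12f));
    return (1.0f - alpha) * ce + alpha * T * T * kl;
}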

Distillation changes the architecture size, so it is not a drop-in replacement for quantizing a checkpoint; it is a model development path rather than a checkpoint conversion step.

The Model

The useful mental model has two equations and one boundary.

First, storage per parameter:

$$\text{bytes/param} = \frac{k}{8} + \frac{\text{scale bytes} + \text{zero bytes}}{\text{group size}}.$$

For INT4 ($k = 4$) with group size 128, an FP16 scale, and no zero point, this is $0.5 + 2/128 = 0.515625$ bytes per parameter. A 70B model needs about $70 \cdot 10^9 \cdot 0.515625 \approx 36.1$ GB for weights and scales.
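The formula as a tiny helper, reproducing the 70B arithmetic in a comment; the name is ours:

/* bytes per parameter = k/8 + (scale_bytes + zero_bytes) / group_size */
static double bytes_per_param(int k_bits, double scale_bytes,
                              double zero_bytes, int group_size) {
    return (double)k_bits / 8.0
         + (scale_bytes + zero_bytes) / (double)group_size;
}

/* Example: bytes_per_param(4, 2, 0, 128) == 0.515625, and
   70e9 * 0.515625 bytes is about 36.1 GB. */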

Second, Roofline throughput is bounded by compute and memory traffic:

$$\text{performance} \leq \min\left(\text{peak compute},\ \text{memory bandwidth} \cdot \text{arithmetic intensity}\right).$$

Quantization increases arithmetic intensity when it reduces bytes loaded per multiply. But dequantization adds instructions, and small batch decoding can be dominated by memory reads of weights and KV cache. PagedAttention in vLLM attacks KV-cache fragmentation and allocation, which is separate from weight quantization.
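A back-of-envelope comparison of arithmetic intensity for a batch-1 GEMV, under the simplifying assumption that weight bytes dominate traffic and using the 0.515625 bytes-per-parameter figure from above:

#include <stdio.h>

int main(void) {
    double n = 8192, k = 8192;           /* output and input features */
    double flops = 2.0 * n * k;          /* one multiply-add per weight */
    double fp16_bytes = 2.0 * n * k;     /* FP16 weights, batch 1 */
    double int4_bytes = 0.515625 * n * k;
    printf("AI fp16: %.2f flop/byte\n", flops / fp16_bytes);  /* 1.00 */
    printf("AI int4: %.2f flop/byte\n", flops / int4_bytes);  /* ~3.88 */
    return 0;
}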

Typical perplexity patterns are empirical, model-dependent, and benchmark-dependent. For LLaMA-like decoder models on common PTQ reports, INT8 weight-only is often within measurement noise of FP16. Good 4-bit PTQ is commonly in the range of about +0.1 to +1.0 perplexity on WikiText-style evaluations. Three-bit and naive activation quantization fail more often, especially on outlier-heavy layers and long-tail domains.

Common Confusions

Watch Out

INT4 weights do not mean INT4 arithmetic

Most LLM INT4 deployments store weights in 4-bit form, then dequantize to FP16 or BF16 inside the matmul kernel. The accumulator is not 4-bit. A claim of “4-bit model” usually describes checkpoint storage and weight bandwidth, not every arithmetic operation.

Watch Out

Perplexity changes are not portable across prompts

A small perplexity increase on WikiText does not prove equal quality on code, math, chat formatting, or tool-call JSON. Calibration and evaluation must use sequences that resemble the served workload.

Watch Out

Sparsity and quantization savings do not multiply automatically

A 2:4 sparse INT4 weight matrix has fewer nonzeros and fewer bits per stored value, but metadata, alignment, and kernel availability decide the speed. Storage math alone overpredicts throughput.

Exercises

Exercise · Core

Problem

A linear layer has weight shape $8192 \times 4096$. Compute storage for FP16, INT8 per-channel with one FP16 scale per output channel, and INT4 per-group with group size 128 and one FP16 scale per group. Use bytes, then MiB.

Exercise · Core

Problem

Quantize the vector $[-0.33, -0.10, 0.02, 0.41]$ using symmetric signed INT4 with scale chosen by max absolute value. Pack the first two quantized values into one byte with the low nibble holding the first value.

Exercise · Advanced

Problem

A decoder layer reads 1.2 GB of FP16 weights per generated token at batch size 1. Assume the kernel is memory-bandwidth bound and the GPU sustains 900 GB/s for this access pattern. Estimate the lower bound on time for FP16 weights and for INT4 weights with 3 percent metadata overhead. Ignore KV cache traffic and dequantization instructions.

References

Canonical:

  • Vaswani et al., Attention Is All You Need (2017), §§3.2, 5.4, transformer layer structure and training cost context
  • NVIDIA, CUDA C++ Programming Guide (v12.6), §§10.20, 10.24, low-precision arithmetic types and warp matrix operations
  • Williams, Waterman, and Patterson, Roofline: An Insightful Visual Performance Model for Multicore Architectures (2009), §§2-4, compute and bandwidth ceilings
  • Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention (2023), §§2-4, serving memory pressure and KV-cache paging
  • Jacob et al., Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference (CVPR 2018), §§2-3, affine quantization and fake quant training
  • Frantar et al., GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (ICLR 2023), §§2-4, blockwise PTQ for transformer weights
  • Lin et al., AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (MLSys 2024), §§3-4, activation-aware scaling for weight quantization
  • Micikevicius et al., FP8 Formats for Deep Learning (2022), §§2-4, E4M3 and E5M2 formats for training and inference

Accessible:

  • Hugging Face, Quantization Methods documentation, practical overview of bitsandbytes, GPTQ, AWQ, and FP8 deployment paths
  • NVIDIA, Transformer Engine User Guide, FP8 scaling recipes and Hopper-oriented training examples
  • Olah et al., A Mathematical Framework for Transformer Circuits (2021), attention and residual-stream framing useful for locating quantization-sensitive tensors

Next Topics

  • /computationpath/inference-serving
  • /computationpath/gpu-kernels-and-tensor-cores
  • /computationpath/memory-hierarchy-for-ml
  • /topics/attention-mechanism