
LLM Construction

Quantization Theory

Reduce model weight precision from FP32 to FP16, INT8, or INT4. Post-training quantization, quantization-aware training, GPTQ, AWQ, and GGUF. Quantization is how large language models actually get deployed.


Why This Matters

A 70-billion-parameter model in FP32 requires 280 GB of memory, which does not fit on any single GPU. In FP16 it is 140 GB, still too large for most setups. In INT4 it is 35 GB, which fits on a single 40 GB A100 or even two consumer GPUs.
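The memory arithmetic above can be checked with a small helper (bytes = params × bits / 8). The group-wise scale overhead term is an assumption added for illustration; it is not part of the figures quoted above.

```python
def weight_gb(params, bits, group_size=None, scale_bits=16):
    """Rough weight-memory estimate in GB. With group-wise quantization,
    each group of `group_size` weights stores one FP16 scale (an assumed
    overhead model, for illustration only)."""
    total_bits = params * bits
    if group_size:
        total_bits += (params / group_size) * scale_bits
    return total_bits / 8 / 1e9

assert weight_gb(70e9, 32) == 280.0   # FP32, as quoted above
assert weight_gb(70e9, 16) == 140.0   # FP16
assert weight_gb(70e9, 4) == 35.0     # INT4, ignoring scale overhead
```

With a typical group size of 128 and FP16 scales, the INT4 overhead is small: `weight_gb(7e9, 4, group_size=128)` gives about 3.61 GB instead of 3.5 GB.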

Quantization is not an optional optimization. It is the primary mechanism by which large models become deployable. Almost every LLM you have interacted with in production is running quantized. Understanding quantization means understanding the gap between what researchers train and what users actually run.

Mental Model

Quantization replaces high-precision floating point numbers with lower-precision integers. A weight stored as a 32-bit float has about 7 decimal digits of precision. An 8-bit integer has 256 possible values. A 4-bit integer has only 16 possible values. The question is: can you map a continuous range of weights to these few discrete values without destroying model quality?

The answer is mostly yes, with careful engineering. Weights in neural networks are remarkably robust to precision reduction because the function computed by the network depends on collective behavior of many weights, not on any single weight being exact.

Formal Setup and Notation

Definition

Uniform Quantization

Uniform quantization maps a real-valued weight $w$ to a $b$-bit integer:

$$Q(w) = \mathrm{clamp}\left(\mathrm{round}\left(\frac{w}{s}\right) + z,\ 0,\ 2^b - 1\right)$$

where $s$ is the scale (step size) and $z$ is the zero point (integer offset). The dequantized value is $\hat{w} = s(Q(w) - z)$. The quantization error satisfies $|w - \hat{w}| \leq s/2$.
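As a concrete illustration, the mapping above can be sketched in a few lines of NumPy; the range $[-1, 3]$ and the sample weights are made up for the example.

```python
import numpy as np

def quantize(w, s, z, b):
    """Uniform b-bit quantization: Q(w) = clamp(round(w/s) + z, 0, 2^b - 1)."""
    q = np.round(w / s) + z
    return np.clip(q, 0, 2**b - 1).astype(np.int64)

def dequantize(q, s, z):
    """Recover an approximation: w_hat = s * (Q(w) - z)."""
    return s * (q.astype(np.float64) - z)

# Asymmetric 8-bit example over the (assumed) range [-1.0, 3.0].
w_min, w_max, b = -1.0, 3.0, 8
s = (w_max - w_min) / (2**b - 1)   # scale (step size)
z = round(-w_min / s)              # zero point: w_min maps to integer 0

w = np.array([-1.0, -0.2, 0.0, 0.7, 3.0])
q = quantize(w, s, z, b)
w_hat = dequantize(q, s, z)
# Every reconstruction error is within half a step size:
assert float(np.max(np.abs(w - w_hat))) <= s / 2 + 1e-12
```

Note that the endpoints of the range map to the extreme integer codes 0 and 255, and the zero point makes the representable grid cover the asymmetric range exactly.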

Definition

Symmetric vs Asymmetric Quantization

Symmetric quantization sets $z = 0$ and maps $[-\alpha, \alpha]$ to $[-2^{b-1}, 2^{b-1}-1]$ with $s = \alpha / 2^{b-1}$. It is simpler and faster but wastes part of the integer range when the weight distribution is asymmetric.

Asymmetric quantization allows $z \neq 0$ and maps $[w_{\min}, w_{\max}]$ to $[0, 2^b - 1]$ with $s = (w_{\max} - w_{\min}) / (2^b - 1)$. It uses the full integer range but requires storing the zero point.
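To see the trade-off, here is a small sketch comparing the two schemes on an assumed skewed, all-positive weight distribution, where symmetric quantization wastes its entire negative range:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(0.0, 1.0, size=10_000)   # skewed: all weights non-negative
b = 8

# Symmetric: z = 0, range [-alpha, alpha] with alpha = max|w|.
alpha = np.max(np.abs(w))
s_sym = alpha / 2**(b - 1)
q_sym = np.clip(np.round(w / s_sym), -2**(b - 1), 2**(b - 1) - 1)
err_sym = np.max(np.abs(w - s_sym * q_sym))

# Asymmetric: use the observed [w_min, w_max] range.
s_asym = (w.max() - w.min()) / (2**b - 1)
z = round(-w.min() / s_asym)
q_asym = np.clip(np.round(w / s_asym) + z, 0, 2**b - 1)
err_asym = np.max(np.abs(w - s_asym * (q_asym - z)))

# The asymmetric step size is roughly half here, so the error is roughly half.
assert err_asym < err_sym
```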

Definition

Per-Tensor vs Per-Channel Quantization

Per-tensor quantization uses a single scale $s$ and zero point $z$ for an entire weight matrix. Simple but coarse.

Per-channel quantization uses separate $s_c, z_c$ for each output channel (row of the weight matrix). This handles the common case where different channels have very different magnitude ranges.
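A quick numerical sketch of why per-channel scales help; the 50x row is an artificial stand-in for an outlier channel:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 256))
W[0] *= 50.0                      # one channel with much larger magnitude
b = 8

def rtn(w, s):
    """Symmetric round-to-nearest at scale s (scalar or per-row)."""
    q = np.clip(np.round(w / s), -2**(b - 1), 2**(b - 1) - 1)
    return s * q

# Per-tensor: one scale for the whole matrix, set by the outlier row.
s_tensor = np.max(np.abs(W)) / 2**(b - 1)
err_tensor = np.linalg.norm(W - rtn(W, s_tensor))

# Per-channel: one scale per row, each adapted to that row's range.
s_chan = np.max(np.abs(W), axis=1, keepdims=True) / 2**(b - 1)
err_chan = np.linalg.norm(W - rtn(W, s_chan))

assert err_chan < err_tensor
```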

Main Theorems

Proposition

Quantization Error Bound

Statement

For uniform $b$-bit quantization of a weight matrix $W \in \mathbb{R}^{m \times n}$ with range $[-\alpha, \alpha]$ and round-to-nearest:

$$\|W - \hat{W}\|_F \leq \frac{\alpha}{2^b} \sqrt{mn}$$

The per-element error is bounded by $|w_{ij} - \hat{w}_{ij}| \leq \alpha / 2^b$. The output error for input $x$ satisfies $\|Wx - \hat{W}x\| \leq \frac{\alpha \sqrt{mn}}{2^b} \|x\|$.

Intuition

Each weight incurs at most half a step size of error. The total error scales with the number of weights and the range. Reducing from FP32 to INT8 (256 levels) means the per-weight error is roughly $\alpha/256$, which is small enough that outputs barely change. Reducing to INT4 (16 levels) gives $\alpha/16$, which requires more careful handling.

Proof Sketch

Round-to-nearest has error at most $s/2 = \alpha/2^b$ per element. The Frobenius norm sums the squared errors: $\|W - \hat{W}\|_F^2 \leq mn \cdot (\alpha/2^b)^2$; take the square root. For the output bound, apply Cauchy-Schwarz.
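The bound can be checked numerically. This sketch follows the proof's simplification of ignoring endpoint clipping (pure round-to-nearest with step $s = \alpha/2^{b-1}$); the matrix size and bit-width are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, b, alpha = 64, 64, 4, 1.0
W = rng.uniform(-alpha, alpha, size=(m, n))

s = alpha / 2**(b - 1)                   # symmetric b-bit step size
W_hat = s * np.round(W / s)              # ignore endpoint clipping, as the proof does

bound = alpha / 2**b * np.sqrt(m * n)    # ||W - W_hat||_F <= alpha/2^b * sqrt(mn)
assert np.linalg.norm(W - W_hat) <= bound
assert float(np.max(np.abs(W - W_hat))) <= alpha / 2**b + 1e-12
```

The Frobenius bound is loose in practice: for uniformly distributed weights the typical error is closer to $s/\sqrt{12}$ per element than the worst-case $s/2$.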

Why It Matters

This bound explains the precision hierarchy: FP32 to FP16 loses almost nothing (error $\sim 10^{-4}$). FP16 to INT8 is usually safe (error $\sim 10^{-2}$). INT8 to INT4 is where quality visibly degrades unless you use smarter methods than round-to-nearest.

Failure Mode

The bound assumes uniform weight distribution. Real neural network weights are not uniform: they often have outlier channels with magnitudes 10-100x larger than typical channels. Uniform quantization wastes most of its integer range on the empty space around outliers. This is why per-channel quantization and outlier-aware methods are necessary.

Proposition

Outlier Channels Dominate Quantization Error

Statement

In transformer models, a small fraction of channels (often fewer than 1%) contain activation outliers with magnitudes 10-100x larger than typical channels. Under per-tensor quantization, these outlier channels force the scale $s$ to be large, reducing effective precision for all other channels. The quantization error is dominated by the non-outlier channels receiving fewer effective bits.

Intuition

If one channel has weights in $[-100, 100]$ and the rest are in $[-1, 1]$, per-tensor INT8 quantization uses scale $s = 200/255 \approx 0.78$. The typical weights get quantized to just 2-3 distinct values instead of using the full 256 levels. The outlier channel is fine, but everything else is destroyed.

Proof Sketch

Let $\alpha_{\max}$ be the outlier range and $\alpha_{\mathrm{typ}}$ the typical range. The per-tensor scale is $s = 2\alpha_{\max}/(2^b - 1)$. The effective number of bits for typical channels is $\log_2(2\alpha_{\mathrm{typ}}/s) \approx b + \log_2(\alpha_{\mathrm{typ}}/\alpha_{\max})$. When $\alpha_{\max}/\alpha_{\mathrm{typ}} = 100$, the effective bit count drops by $\log_2(100) \approx 6.6$, leaving fewer than 2 effective bits at INT8.
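The arithmetic in the sketch can be verified directly: with a 100x outlier setting the per-tensor scale, typical weights collapse onto just three integer levels.

```python
import numpy as np

b = 8
a_max, a_typ = 100.0, 1.0
s = 2 * a_max / (2**b - 1)             # per-tensor scale, set by the outlier

# Typical weights in [-1, 1] land on only a handful of integer codes.
w_typ = np.linspace(-a_typ, a_typ, 1000)
levels = np.unique(np.round(w_typ / s))
assert len(levels) == 3                # only {-1, 0, 1} survive

eff_bits = b + np.log2(a_typ / a_max)  # b + log2(a_typ / a_max)
assert eff_bits < 2
```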

Why It Matters

This observation motivates every modern quantization method: LLM.int8() handles outliers in FP16, GPTQ uses second-order information to compensate, and AWQ protects salient channels. Understanding the outlier problem is essential for understanding why naive quantization fails for LLMs.

Failure Mode

The outlier pattern varies across layers and models. A fixed strategy (e.g., always quantizing the top 1% in FP16) may not be optimal. Calibration data is needed to identify which channels are critical.

Quantization Methods

Definition

Post-Training Quantization (PTQ)

PTQ quantizes a pretrained model without any retraining. Steps:

  1. Choose quantization scheme (per-tensor/per-channel, symmetric/asymmetric)
  2. Run calibration data through the model to determine optimal scales
  3. Quantize weights (and optionally activations)

Advantages: fast, no training required. Disadvantage: quality degrades at low bit-widths (INT4) without compensation.
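Step 2 above can be sketched as a minimal calibration loop that tracks the largest activation magnitude seen; the toy layer, batch shapes, and max-based criterion are assumptions for illustration (real PTQ tools also offer percentile or MSE-based calibration).

```python
import numpy as np

def calibrate_activation_scale(layer_fn, calib_batches, b=8):
    """PTQ step 2 (sketch): choose a symmetric activation scale by tracking
    the largest activation magnitude over the calibration data."""
    a_max = 0.0
    for x in calib_batches:
        a_max = max(a_max, float(np.max(np.abs(layer_fn(x)))))
    return a_max / 2**(b - 1)

# Toy "layer": a fixed linear map standing in for part of a pretrained model.
rng = np.random.default_rng(3)
W = rng.standard_normal((16, 16))
layer = lambda x: x @ W.T
batches = [rng.standard_normal((8, 16)) for _ in range(10)]

s_act = calibrate_activation_scale(layer, batches)
assert s_act > 0
```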

Definition

Quantization-Aware Training (QAT)

QAT simulates quantization during training. The forward pass uses quantized weights (via straight-through estimator for gradients). The model learns to be robust to quantization noise during training.

Advantages: consistently better quality than PTQ at low bit-widths. Disadvantage: requires full retraining, which is prohibitively expensive for large models.
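A minimal sketch of the fake-quantization op at the heart of QAT (the training loop and gradient machinery are omitted; the straight-through estimator is described in the comment):

```python
import numpy as np

def fake_quant(w, s, b=8):
    """QAT forward pass: quantize then immediately dequantize, so the model
    trains against quantization noise. In the backward pass the
    straight-through estimator treats this op as the identity, i.e. the
    gradient of round() is taken to be 1 inside the clipping range."""
    q = np.clip(np.round(w / s), -2**(b - 1), 2**(b - 1) - 1)
    return s * q

w = np.array([0.013, -0.49, 0.2501])
out = fake_quant(w, s=0.25)
assert out.tolist() == [0.0, -0.5, 0.25]
```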

Definition

GPTQ

GPTQ (Frantar et al., 2022) quantizes weights one column at a time, using second-order (Hessian) information to optimally adjust the remaining weights to compensate for quantization error. It builds on the OBQ (Optimal Brain Quantization) framework. The key point: the Hessian $H = 2X^\top X$ comes from calibration data, making the compensation data-dependent.
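A heavily simplified, single-row sketch of the idea. For clarity it recomputes the inverse Hessian at each step instead of using GPTQ's efficient Cholesky-based updates, and the calibration data is invented; this is an illustration of the compensation rule, not the real algorithm.

```python
import numpy as np

def gptq_row(w, X, s):
    """GPTQ-style sketch for one weight row: quantize entries left to right,
    using the inverse Hessian of the not-yet-quantized coordinates
    (H = 2 X^T X from calibration data) to push each entry's rounding
    error onto later entries."""
    n = len(w)
    w = w.astype(np.float64).copy()
    H = 2.0 * X.T @ X + 1e-6 * np.eye(n)   # small damping keeps H invertible
    q = np.zeros(n)
    for j in range(n):
        Hinv = np.linalg.inv(H[j:, j:])    # inverse Hessian over entries j..n-1
        q[j] = s * np.round(w[j] / s)      # round-to-nearest for entry j
        err = (w[j] - q[j]) / Hinv[0, 0]
        w[j:] -= err * Hinv[0, :]          # spread the error onto later entries
    return q

rng = np.random.default_rng(4)
n = 32
X = rng.standard_normal((256, n)) @ rng.standard_normal((n, n))  # correlated data
w = rng.standard_normal(n)
s = np.max(np.abs(w)) / 2**(4 - 1)         # 4-bit symmetric step size

q_gptq = gptq_row(w, X, s)
q_rtn = s * np.round(w / s)
err_gptq = float(np.linalg.norm(X @ (w - q_gptq)))
err_rtn = float(np.linalg.norm(X @ (w - q_rtn)))
# On correlated calibration data, the compensated rounding typically
# achieves lower output error ||X(w - q)|| than plain round-to-nearest.
```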

Definition

AWQ (Activation-Aware Weight Quantization)

AWQ (Lin et al., 2023) observes that a small fraction of weight channels are disproportionately important (those multiplied by large activations). It scales these salient channels up before quantization and scales them back during inference, effectively giving important channels more quantization precision.
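The core identity behind AWQ can be verified directly: per-channel scaling leaves the layer's output mathematically unchanged while shrinking the salient channel's effective step size. The scale factor 4 and the choice of channel 0 are arbitrary assumptions here; AWQ searches for the scales using activation statistics.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 64
w = rng.standard_normal(n) * 0.1
x = rng.standard_normal((100, n))
c = np.ones(n)
c[0] = 4.0                                 # up-scale the (assumed) salient channel

# Scaling weights up and activations down by the same per-channel factor
# leaves the product unchanged:
assert np.allclose(x @ w, (x / c) @ (w * c))

# But channel 0's effective quantization step shrinks by up to c[0]:
b = 4
s_plain = np.max(np.abs(w)) / 2**(b - 1)
s_scaled = np.max(np.abs(w * c)) / 2**(b - 1)
assert s_scaled / c[0] <= s_plain          # finer steps for the salient weight
```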

Definition

GGUF Format

GGUF (GPT-Generated Unified Format) is a file format for storing quantized models, used by llama.cpp and related inference engines. It supports mixed-precision quantization (different layers at different bit-widths), metadata, and multiple quantization types (Q4_0, Q4_K_M, Q5_K_S, etc.). The K-quant variants use importance-based mixed precision within each layer.

Canonical Examples

Example

INT8 quantization of a 7B model

A 7B parameter model in FP16 requires 14 GB. In INT8 it requires 7 GB. With per-channel symmetric quantization and round-to-nearest, the perplexity increase on typical benchmarks is less than 0.5%. The memory savings are 2x with negligible quality loss. This is the easy case.

Example

INT4 quantization requires compensation

The same 7B model in INT4 (3.5 GB) with naive round-to-nearest shows 5-10% perplexity degradation. GPTQ reduces this to about 1% by using Hessian-based weight adjustment. AWQ achieves similar quality by protecting salient channels. The lesson: below INT8, naive quantization is not enough.

Common Confusions

Watch Out

Quantization is not the same as pruning

Quantization reduces the precision of every weight. Pruning removes weights entirely (sets them to zero). They are complementary: you can quantize a pruned model. Quantization preserves the model architecture while pruning changes the effective architecture.

Watch Out

Lower bits does not always mean faster inference

INT4 operations are not natively supported on all hardware. On GPUs without INT4 tensor cores, INT4 weights are dequantized to FP16 on the fly. The benefit is memory savings (fitting the model on fewer GPUs), not necessarily faster computation per token.
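A sketch of weight-only inference with on-the-fly dequantization. The 4-bit codes are held one per int8 element for simplicity (real kernels pack two codes per byte), and the shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
W = rng.standard_normal((64, 64)).astype(np.float32)

# Offline: 4-bit symmetric, per-row quantization. Only Q and s are stored.
s = np.max(np.abs(W), axis=1, keepdims=True) / 8
Q = np.clip(np.round(W / s), -8, 7).astype(np.int8)

# Online: without INT4 tensor cores, the kernel dequantizes to floating
# point on the fly and runs an ordinary matmul. Memory shrinks; the
# arithmetic is still floating point.
x = rng.standard_normal(64).astype(np.float32)
y = (s * Q.astype(np.float32)) @ x
assert y.shape == (64,)
```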

Summary

  • Quantization reduces precision: FP32 -> FP16 -> INT8 -> INT4
  • Per-element error bounded by half the step size
  • Outlier channels dominate error under per-tensor quantization
  • PTQ is fast but degrades at low bits; QAT is better but expensive
  • GPTQ uses Hessian information to compensate for quantization error
  • AWQ protects salient channels via activation-aware scaling
  • GGUF is the standard format for deploying quantized LLMs locally

Exercises

ExerciseCore

Problem

A weight matrix has values in $[-2, 2]$. You apply symmetric INT8 quantization. What is the step size $s$? What is the maximum per-element quantization error? If the matrix is $4096 \times 4096$, what is the bound on $\|W - \hat{W}\|_F$?

ExerciseAdvanced

Problem

Explain why GPTQ quantizes weights column by column rather than all at once. What role does the Hessian $H = X^\top X$ play in compensating for quantization error?

References

Canonical:

  • Jacob et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference" (2018)
  • Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (2022)

Current:

  • Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers" (2022)
  • Lin et al., "AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration" (2023)
  • Tseng et al., "QTIP: Quantization with Trellises and Incoherence Processing" (NeurIPS 2024 Spotlight). Uses trellis coded quantization to overcome the dimensional limitations of vector quantization. Achieves state-of-the-art quality and speed by separating codebook size from bitrate via a stateful decoder.

Next Topics

Natural extensions from quantization:

  • Mixture of experts: another approach to reducing inference compute
  • [KV cache optimization](/topics/kv-cache-optimization): quantizing the key-value cache for memory savings

Last reviewed: April 2026
