LLM Construction
Quantization Theory
Reduce model weight precision from FP32 to FP16, INT8, or INT4. Post-training quantization, quantization-aware training, GPTQ, AWQ, and GGUF. Quantization is how large language models actually get deployed.
Prerequisites
Why This Matters
A 70-billion parameter model in FP32 requires 280 GB of memory. That does not fit on any single GPU. In FP16 it is 140 GB, still too large for most setups. In INT4 it is 35 GB, which fits on a single 40 GB A100 or even two consumer GPUs.
Quantization is not an optional optimization. It is the primary mechanism by which large models become deployable. Almost every LLM you have interacted with in production is running quantized. Understanding quantization means understanding the gap between what researchers train and what users actually run.
Mental Model
Quantization replaces high-precision floating point numbers with lower-precision integers. A weight stored as a 32-bit float has about 7 decimal digits of precision. An 8-bit integer has 256 possible values. A 4-bit integer has only 16 possible values. The question is: can you map a continuous range of weights to these few discrete values without destroying model quality?
The answer is mostly yes, with careful engineering. Weights in neural networks are remarkably robust to precision reduction because the function computed by the network depends on collective behavior of many weights, not on any single weight being exact.
Formal Setup and Notation
Uniform Quantization
Uniform quantization maps a real-valued weight w to a b-bit integer:

q = clamp(round(w / s) + z, 0, 2^b - 1)

where s is the scale (step size) and z is the zero point (integer offset). The dequantized value is ŵ = s (q - z). The quantization error is e = w - ŵ.
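These formulas translate directly into a short NumPy sketch (illustrative only, not an optimized kernel; the function names are mine):

```python
import numpy as np

def quantize(w, bits=8):
    """Asymmetric uniform quantization onto the integer grid [0, 2^b - 1]."""
    qmax = 2**bits - 1
    w_min, w_max = w.min(), w.max()
    s = (w_max - w_min) / qmax             # scale (step size)
    z = int(np.round(-w_min / s))          # zero point (integer offset)
    q = np.clip(np.round(w / s) + z, 0, qmax).astype(np.int32)
    return q, s, z

def dequantize(q, s, z):
    return s * (q - z)

w = np.random.default_rng(0).normal(size=(4, 4))
q, s, z = quantize(w)
w_hat = dequantize(q, s, z)
# Round-to-nearest error is at most half a step: |w - w_hat| <= s / 2
assert np.abs(w - w_hat).max() <= s / 2 + 1e-12
```

Note that the zero point cancels in quantize-then-dequantize, so the reconstruction error comes entirely from rounding onto the step-size grid.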
Symmetric vs Asymmetric Quantization
Symmetric quantization sets z = 0 and maps w in [-r, r] to integers in [-(2^(b-1) - 1), 2^(b-1) - 1] with s = r / (2^(b-1) - 1). It is simpler and faster but wastes part of the integer range if the weight distribution is asymmetric.
Asymmetric quantization allows z != 0 and maps w in [w_min, w_max] to integers in [0, 2^b - 1] with s = (w_max - w_min) / (2^b - 1). It uses the full integer range but requires storing the zero point.
Per-Tensor vs Per-Channel Quantization
Per-tensor quantization uses a single scale and zero point for an entire weight matrix. Simple but coarse.
Per-channel quantization uses a separate scale s_i (and zero point z_i) for each output channel (row of the weight matrix). This handles the common case where different channels have very different magnitude ranges.
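A small NumPy experiment (with a synthetic large-magnitude channel, chosen for illustration) shows the difference:

```python
import numpy as np

def dequant_error(w, s):
    """Max reconstruction error after symmetric round-to-nearest with scale s."""
    return np.abs(w - s * np.round(w / s)).max()

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64))
w[0] *= 50.0                                  # one channel with much larger magnitudes

qmax = 2**(8 - 1) - 1                         # 127 for symmetric INT8

# Per-tensor: a single scale, forced up by the largest channel
s_tensor = np.abs(w).max() / qmax
err_tensor = dequant_error(w[1:], s_tensor)   # error on the *typical* channels

# Per-channel: one scale per output channel (row)
s_channel = np.abs(w).max(axis=1, keepdims=True) / qmax
err_channel = dequant_error(w[1:], s_channel[1:])

assert err_channel < err_tensor               # per-channel error is far smaller
```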
Main Theorems
Quantization Error Bound
Statement
For uniform b-bit quantization of an m x n weight matrix W with range [w_min, w_max], step size s = (w_max - w_min) / (2^b - 1), and round-to-nearest:
The per-element error is bounded by |w - ŵ| <= s/2, so ||W - Ŵ||_F <= sqrt(mn) s/2. The output error for input x satisfies ||(W - Ŵ)x|| <= ||W - Ŵ||_F ||x||.
Intuition
Each weight incurs at most half a step size of error. The total error scales with the number of weights and the range. Reducing from FP32 to INT8 (256 levels) means the per-weight error is at most range/510, which is small enough that outputs barely change. Reducing to INT4 (16 levels) gives range/30, which requires more careful handling.
Proof Sketch
Round-to-nearest has error at most s/2 per element. The Frobenius norm sums the squares: ||W - Ŵ||_F^2 <= mn (s/2)^2. Take the square root. For the output bound, apply Cauchy-Schwarz row by row to get ||(W - Ŵ)x|| <= ||W - Ŵ||_F ||x||.
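The bounds are easy to check numerically; this sketch quantizes a random matrix with round-to-nearest and verifies all three inequalities:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, bits = 64, 64, 4
w = rng.uniform(-1.0, 1.0, size=(m, n))

s = (w.max() - w.min()) / (2**bits - 1)     # step size over the full range
w_hat = s * np.round(w / s)                 # round-to-nearest onto the grid

# Per-element bound: |w - w_hat| <= s / 2
assert np.abs(w - w_hat).max() <= s / 2 + 1e-12

# Frobenius bound: ||W - W_hat||_F <= sqrt(mn) * s / 2
fro = np.linalg.norm(w - w_hat)
assert fro <= np.sqrt(m * n) * s / 2

# Output bound via Cauchy-Schwarz: ||(W - W_hat) x|| <= ||W - W_hat||_F ||x||
x = rng.normal(size=n)
assert np.linalg.norm((w - w_hat) @ x) <= fro * np.linalg.norm(x) + 1e-12
```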
Why It Matters
This bound explains the precision hierarchy: FP32 to FP16 loses almost nothing (relative error around 10^-3). FP16 to INT8 is usually safe (error around 0.2% of the weight range). INT8 to INT4 is where quality visibly degrades unless you use smarter methods than round-to-nearest.
Failure Mode
The bound assumes uniform weight distribution. Real neural network weights are not uniform: they often have outlier channels with magnitudes 10-100x larger than typical channels. Uniform quantization wastes most of its integer range on the empty space around outliers. This is why per-channel quantization and outlier-aware methods are necessary.
Outlier Channels Dominate Quantization Error
Statement
In transformer models, a small fraction of channels (often fewer than 1%) contain activation outliers with magnitudes 10-100x larger than typical channels. Under per-tensor quantization, these outlier channels force the scale to be large, reducing effective precision for all other channels. The quantization error is dominated by the non-outlier channels receiving fewer effective bits.
Intuition
If one channel has weights in [-100, 100] and the rest are in [-1, 1], per-tensor INT8 quantization uses scale s = 100/127, roughly 0.8. The typical weights get quantized to just 2-3 distinct values instead of using the full 256 levels. The outlier channel is fine, but everything else is destroyed.
Proof Sketch
Let R be the outlier range and r the typical range. The per-tensor scale is s = R / (2^(b-1) - 1). The effective number of bits for the typical channels is b_eff = log2(2r/s + 1), which is approximately b - log2(R/r). When R/r = 100, effective bits drop by log2(100), about 6.6, leaving fewer than 2 effective bits for INT8.
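Plugging in the illustrative ratio R/r = 100 makes the loss concrete:

```python
import numpy as np

def effective_bits(typical_range, scale):
    """log2 of the number of quantization levels the typical weights can reach."""
    return np.log2(2 * typical_range / scale + 1)

R, r = 100.0, 1.0            # outlier range vs typical range (illustrative)
s = R / (2**(8 - 1) - 1)     # per-tensor INT8 scale forced by the outlier channel
print(effective_bits(r, s))  # roughly 1.8 effective bits out of 8
```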
Why It Matters
This observation motivates every modern quantization method: LLM.int8() handles outliers in FP16, GPTQ uses second-order information to compensate, and AWQ protects salient channels. Understanding the outlier problem is essential for understanding why naive quantization fails for LLMs.
Failure Mode
The outlier pattern varies across layers and models. A fixed strategy (e.g., always quantizing the top 1% in FP16) may not be optimal. Calibration data is needed to identify which channels are critical.
Quantization Methods
Post-Training Quantization (PTQ)
PTQ quantizes a pretrained model without any retraining. Steps:
- Choose quantization scheme (per-tensor/per-channel, symmetric/asymmetric)
- Run calibration data through the model to determine optimal scales
- Quantize weights (and optionally activations)
Advantages: fast, no training required. Disadvantage: quality degrades at low bit-widths (INT4) without compensation.
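The calibration step can be sketched as a running min/max over activations (a minimal version; real toolkits also offer percentile- and entropy-based calibrators, and the batches here are synthetic stand-ins):

```python
import numpy as np

def calibrate_minmax(batches, bits=8):
    """Track a running min/max over calibration activations, then set s and z."""
    lo, hi = np.inf, -np.inf
    for x in batches:
        lo, hi = min(lo, x.min()), max(hi, x.max())
    s = (hi - lo) / (2**bits - 1)       # asymmetric scale over the observed range
    z = int(np.round(-lo / s))          # zero point
    return s, z

rng = np.random.default_rng(2)
batches = [rng.normal(size=(32, 16)) for _ in range(10)]  # stand-in calibration data
s, z = calibrate_minmax(batches)
```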
Quantization-Aware Training (QAT)
QAT simulates quantization during training. The forward pass uses quantized weights (via straight-through estimator for gradients). The model learns to be robust to quantization noise during training.
Advantages: consistently better quality than PTQ at low bit-widths. Disadvantage: requires full retraining, which is prohibitively expensive for large models.
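The simulated quantization ("fake quantization") in the forward pass can be sketched as follows; the straight-through estimator itself lives in a framework's backward pass, which this NumPy sketch only describes in a comment:

```python
import numpy as np

def fake_quant(w, bits=8):
    """Quantize-then-dequantize, as used in the QAT forward pass."""
    qmax = 2**(bits - 1) - 1
    s = np.abs(w).max() / qmax
    return s * np.clip(np.round(w / s), -qmax, qmax)

# The forward pass uses fake_quant(w); in the backward pass a QAT framework
# treats round() as the identity (straight-through estimator), so gradients
# flow to the underlying full-precision weights.
w = np.random.default_rng(4).normal(size=(16, 16))
w_q = fake_quant(w)
```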
GPTQ
GPTQ (Frantar et al., 2022) quantizes weights one column at a time, using second-order (Hessian) information to optimally adjust remaining weights to compensate for quantization error. Based on the OBQ (Optimal Brain Quantization) framework. Key: the Hessian comes from calibration data, making the compensation data-dependent.
AWQ (Activation-Aware Weight Quantization)
AWQ (Lin et al., 2023) observes that a small fraction of weight channels are disproportionately important (those multiplied by large activations). It scales these salient channels up before quantization and scales them back during inference, effectively giving important channels more quantization precision.
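A simplified illustration of the idea (not the actual AWQ algorithm: AWQ searches for the per-channel scales using calibration statistics, whereas here the salient channel and its scale are hand-picked):

```python
import numpy as np

def quantize_sym(w, bits=4):
    """Per-tensor symmetric round-to-nearest (quantize + dequantize)."""
    qmax = 2**(bits - 1) - 1
    s = np.abs(w).max() / qmax
    return s * np.clip(np.round(w / s), -qmax, qmax)

rng = np.random.default_rng(3)
W = rng.normal(size=(8, 8))
W[:, 0] *= 0.05                 # salient channel: small weights...
x = rng.normal(size=8)
x[0] = 50.0                     # ...multiplied by large activations

# Naive INT4: column 0 rounds to near zero and its contribution is lost
err_plain = np.linalg.norm(W @ x - quantize_sym(W) @ x)

# AWQ-style: scale the salient column up before quantization, fold 1/c back in
c = np.ones(8); c[0] = 10.0     # hand-picked here; AWQ finds c from calibration data
err_awq = np.linalg.norm(W @ x - (quantize_sym(W * c) / c) @ x)

assert err_awq < err_plain
```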
GGUF Format
GGUF (GPT-Generated Unified Format) is a file format for storing quantized models, used by llama.cpp and related inference engines. It supports mixed-precision quantization (different layers at different bit-widths), metadata, and multiple quantization types (Q4_0, Q4_K_M, Q5_K_S, etc.). The K-quant variants use importance-based mixed precision within each layer.
Canonical Examples
INT8 quantization of a 7B model
A 7B parameter model in FP16 requires 14 GB. In INT8 it requires 7 GB. With per-channel symmetric quantization and round-to-nearest, the perplexity increase on typical benchmarks is less than 0.5%. The memory savings are 2x with negligible quality loss. This is the easy case.
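The memory arithmetic behind these numbers (weights only; the KV cache and activations add more on top):

```python
def weight_memory_gb(params_billion, bits):
    """Memory for the weights alone: parameters * bits / 8, in GB."""
    return params_billion * 1e9 * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits:2d}-bit: {weight_memory_gb(7, bits):5.1f} GB")
# 28.0 GB (FP32), 14.0 GB (FP16), 7.0 GB (INT8), 3.5 GB (INT4)
```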
INT4 quantization requires compensation
The same 7B model in INT4 (3.5 GB) with naive round-to-nearest shows 5-10% perplexity degradation. GPTQ reduces this to about 1% by using Hessian-based weight adjustment. AWQ achieves similar quality by protecting salient channels. The lesson: below INT8, naive quantization is not enough.
Common Confusions
Quantization is not the same as pruning
Quantization reduces the precision of every weight. Pruning removes weights entirely (sets them to zero). They are complementary: you can quantize a pruned model. Quantization preserves the model architecture while pruning changes the effective architecture.
Lower bits does not always mean faster inference
INT4 operations are not natively supported on all hardware. On GPUs without INT4 tensor cores, INT4 weights are dequantized to FP16 on the fly. The benefit is memory savings (fitting the model on fewer GPUs), not necessarily faster computation per token.
Summary
- Quantization reduces precision: FP32 -> FP16 -> INT8 -> INT4
- Per-element error bounded by half the step size
- Outlier channels dominate error under per-tensor quantization
- PTQ is fast but degrades at low bits; QAT is better but expensive
- GPTQ uses Hessian information to compensate for quantization error
- AWQ protects salient channels via activation-aware scaling
- GGUF is the standard format for deploying quantized LLMs locally
Exercises
Problem
A weight matrix W has values in the range [-r, r]. You apply symmetric INT8 quantization. What is the step size s? What is the maximum per-element quantization error? If W is n x n, what is the bound on ||W - Ŵ||_F?
Problem
Explain why GPTQ quantizes weights column by column rather than all at once. What role does the Hessian play in compensating for quantization error?
References
Canonical:
- Jacob et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference" (2018)
- Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (2022)
Current:
- Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers" (2022)
- Lin et al., "AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration" (2023)
- Tseng et al., "QTIP: Quantization with Trellises and Incoherence Processing" (NeurIPS 2024 Spotlight). Uses trellis coded quantization to overcome the dimensional limitations of vector quantization. Achieves state-of-the-art quality and speed by separating codebook size from bitrate via a stateful decoder.
Next Topics
Natural extensions from quantization:
- Mixture of experts: another approach to reducing inference compute
- [KV cache optimization](/topics/kv-cache-optimization): quantizing the key-value cache for memory savings
Last reviewed: April 2026