
Comparison

Post-Training Quantization vs. Quantization-Aware Training

PTQ quantizes a pretrained model with no retraining. QAT simulates quantization during training to recover quality. GPTQ and AWQ are modern PTQ methods that close much of the gap. The tradeoff is compute cost vs. model quality at low bit widths.

What Each Does

Both PTQ and QAT reduce model precision from floating-point (FP16/BF16) to lower bit widths (INT8, INT4, etc.) to reduce memory, bandwidth, and compute costs at inference time. They differ in when and how quantization is applied.

Post-Training Quantization (PTQ): Take a fully trained FP16 model and convert weights (and optionally activations) to lower precision. No retraining.

Quantization-Aware Training (QAT): Insert fake quantization operators into the forward pass during training. The model learns to be robust to quantization noise. Requires retraining or fine-tuning.

Side-by-Side Statement

Definition

Post-Training Quantization

Given a trained model with weights $W \in \mathbb{R}^{m \times n}$, PTQ finds a quantized representation $\hat{W}$ that minimizes some reconstruction objective:

$$\hat{W} = \arg\min_{\hat{W} \in \mathcal{Q}} \|WX - \hat{W}X\|_F^2$$

where $\mathcal{Q}$ is the set of quantized weight matrices and $X$ is a matrix of activations collected from a small calibration dataset. No gradient updates are made to the original model.
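As a concrete illustration of this objective, here is a minimal NumPy sketch. It assumes symmetric per-row round-to-nearest quantization; the granularity and the synthetic calibration data are illustrative choices, not prescribed by the definition:

```python
import numpy as np

def quantize_rtn(W, bits=4):
    """Naive round-to-nearest (RTN) PTQ with symmetric per-row scales."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for INT4
    s = np.abs(W).max(axis=1, keepdims=True) / qmax  # per-row scale
    q = np.clip(np.round(W / s), -qmax - 1, qmax)    # integer codes
    return s * q                                     # dequantized weights

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
X = rng.normal(size=(64, 256))        # stand-in calibration activations
W_hat = quantize_rtn(W, bits=4)

# Reconstruction objective ||WX - W_hat X||_F^2 on the calibration set
err = np.linalg.norm(W @ X - W_hat @ X) ** 2
print(f"relative output error: {err / np.linalg.norm(W @ X) ** 2:.4f}")
```

Naive RTN ignores $X$ entirely when choosing the codes; methods like GPTQ improve on it precisely by using the calibration activations.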

Definition

Quantization-Aware Training

QAT modifies the training forward pass by inserting quantization-dequantization ("fake quant") operations:

$$\hat{W} = \text{dequant}(\text{quant}(W)) = s \cdot \text{clamp}(\lfloor W/s \rceil, q_{\min}, q_{\max})$$

The forward pass uses $\hat{W}$ (quantized), but gradients flow through the quantization via the straight-through estimator (STE), updating the full-precision "shadow" weights $W$.
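A sketch of the fake-quant forward pass and its STE backward in plain NumPy. The clipped-STE variant shown here (gradient zeroed where the value is clamped) is one common choice, not the only one:

```python
import numpy as np

def fake_quant(w, s, qmin=-8, qmax=7):
    """Forward: quantize-dequantize, so the model sees quantized values."""
    return s * np.clip(np.round(w / s), qmin, qmax)

def fake_quant_grad(w, s, qmin=-8, qmax=7):
    """Backward (clipped STE): pretend d(fake_quant)/dw = 1 inside the
    clamp range and 0 outside, instead of the true zero-a.e. gradient."""
    q = np.round(w / s)
    return ((q >= qmin) & (q <= qmax)).astype(w.dtype)

w = np.array([-1.3, -0.2, 0.05, 0.7, 3.0])   # shadow weights (FP)
s = 0.25                                      # quantization step
print(fake_quant(w, s))        # values snapped to the INT4 grid
print(fake_quant_grad(w, s))   # 1 inside the range, 0 where clamped
```

During QAT, the optimizer applies this mask-shaped gradient to the full-precision shadow weights; only the forward pass ever sees the snapped values.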

Where Each Is Stronger

PTQ wins on speed and accessibility

PTQ takes minutes to hours. You need only a small calibration set (128-1024 examples) and no training infrastructure. For INT8 quantization of most models, PTQ introduces negligible quality loss. It is the default choice for production deployment when 8-bit precision suffices.

QAT wins on quality at low bit widths

At 4-bit and below, PTQ quality degrades significantly for many models. QAT recovers most of this loss because the model adapts its weight distribution during training to be quantization-friendly. Weights cluster around quantization grid points, reducing the effective quantization error.

The cost is substantial: QAT requires training infrastructure, a training dataset, and typically 5-20% of the original training compute budget.

Modern PTQ Methods

Recent PTQ methods have narrowed the gap between PTQ and QAT at 4-bit precision.

Example

GPTQ: one-shot weight quantization

GPTQ quantizes weights one column at a time, using the inverse Hessian to compensate for the quantization error of each column by adjusting the remaining unquantized columns. This is based on the Optimal Brain Quantization (OBQ) framework.

For a weight matrix $W$ and calibration data $X$, GPTQ minimizes $\|WX - \hat{W}X\|_F^2$ greedily. The Hessian $H = 2XX^\top$ captures the sensitivity of each weight to the output. Insensitive weights are quantized aggressively; sensitive weights are quantized carefully with error compensation.

GPTQ quantizes a 175B parameter model in a few GPU-hours.
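The column-by-column idea can be sketched as a toy NumPy loop. This assumes per-row scales and a plain (damped) matrix inverse; real GPTQ uses a Cholesky factorization of $H^{-1}$, block processing, and further numerical tricks:

```python
import numpy as np

def gptq_like(W, X, bits=4):
    """Toy GPTQ-style quantization: quantize columns left-to-right and
    fold each column's error into the not-yet-quantized columns via
    the inverse Hessian of the reconstruction loss, H = 2 X X^T."""
    n = W.shape[1]
    H = 2.0 * X @ X.T
    H += 1e-2 * np.mean(np.diag(H)) * np.eye(n)  # damping for stability
    Hinv = np.linalg.inv(H)
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(W).max(axis=1, keepdims=True) / qmax   # per-row scales
    W = W.copy()
    Q = np.zeros_like(W)
    for i in range(n):
        Q[:, i] = s[:, 0] * np.clip(np.round(W[:, i] / s[:, 0]),
                                    -qmax - 1, qmax)
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])  # compensate
    return Q

rng = np.random.default_rng(1)
W = rng.normal(size=(32, 32))
X = rng.normal(size=(32, 32)) @ rng.normal(size=(32, 128))  # correlated acts
Q = gptq_like(W, X, bits=3)

qmax = 3
s = np.abs(W).max(axis=1, keepdims=True) / qmax
rtn = s * np.clip(np.round(W / s), -qmax - 1, qmax)
print("RTN output error:      ", np.linalg.norm(W @ X - rtn @ X))
print("GPTQ-like output error:", np.linalg.norm(W @ X - Q @ X))
```

The compensation step is what separates this from naive RTN: rounding error in one column is absorbed by small adjustments to columns not yet quantized.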

Example

AWQ: activation-aware weight quantization

AWQ observes that a small fraction of weights (1-2%) are much more important than others, because they correspond to large activation magnitudes. Instead of treating all weights equally, AWQ scales important weight channels before quantization to reduce their relative quantization error.

The key insight: multiplying a weight channel by $s > 1$ and dividing the corresponding activation by $s$ preserves the output but reduces the quantization error on that weight channel (because the quantization grid is finer relative to the scaled weights).
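The scaling trick can be demonstrated directly. In this hypothetical setup, input channel 3 carries outlier activations and is protected with a hand-picked scale of 4; actual AWQ searches for per-channel scales on calibration data rather than hand-picking them:

```python
import numpy as np

def quant_rows(W, bits=4):
    """Symmetric per-row round-to-nearest quantize-dequantize."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(W).max(axis=1, keepdims=True) / qmax
    return s * np.clip(np.round(W / s), -qmax - 1, qmax)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
X = rng.normal(size=(16, 64))
X[3] *= 20.0                   # input channel 3 has outlier activations

scale = np.ones(16)
scale[3] = 4.0                 # protect the salient channel (hand-picked)

# Baseline: quantize W as-is
err_base = np.linalg.norm(W @ X - quant_rows(W) @ X)

# AWQ-style: scale weight column up, activation row down (output-preserving)
err_awq = np.linalg.norm(W @ X
                         - quant_rows(W * scale) @ (X / scale[:, None]))

print(f"baseline error: {err_base:.3f}")
print(f"AWQ-style error: {err_awq:.3f}")
```

The transform $W \mapsto W \cdot \mathrm{diag}(s)$, $X \mapsto \mathrm{diag}(s)^{-1} X$ leaves $WX$ unchanged in full precision, so any error reduction comes purely from the salient channel sitting more favorably on the quantization grid.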

Quality Comparison at Different Bit Widths

| Bit width | PTQ (naive) | PTQ (GPTQ/AWQ) | QAT |
|---|---|---|---|
| INT8 | Negligible loss | Negligible loss | Negligible loss |
| INT4 | 1-5% perplexity increase | 0.5-1% perplexity increase | 0.1-0.5% perplexity increase |
| INT3 | Often broken | 2-5% perplexity increase | 1-2% perplexity increase |
| INT2 | Broken | Usually broken | 3-10% perplexity increase |

These numbers are approximate and model-dependent. Larger models tolerate quantization better than smaller ones.

When to Use Each

Use naive PTQ (round-to-nearest) when:

- INT8 precision suffices; quality loss is negligible for most models and the conversion takes minutes.
- You want the simplest possible deployment pipeline, with little or no calibration data.

Use GPTQ/AWQ when:

- You need INT4 (or INT3) weights and can supply a small calibration set.
- You lack training infrastructure or the training dataset, but can spend a few GPU-hours.

Use QAT when:

- You are targeting very low bit widths where even GPTQ/AWQ degrade noticeably.
- Quality is critical and you can afford the training pipeline, data, and roughly 5-20% of the original training compute.

Key Assumptions That Differ

| | PTQ | QAT |
|---|---|---|
| When applied | After training | During training |
| Training data needed | Small calibration set or none | Full training dataset |
| Compute cost | Minutes to hours | Days to weeks |
| INT8 quality | Excellent | Excellent |
| INT4 quality | Good (with GPTQ/AWQ) | Better |
| INT2 quality | Poor | Acceptable for some models |
| Gradient method | None | Straight-through estimator |

Common Confusions

Watch Out

GPTQ is not QAT

GPTQ is a PTQ method. It does not retrain the model. It uses second-order information (the Hessian) to minimize reconstruction error, but all computation happens post-training with a small calibration set. The "one-shot" in its name refers to single-pass processing, not single-epoch training.

Watch Out

The straight-through estimator is an approximation

In QAT, the quantization function $\lfloor \cdot \rceil$ (round-to-nearest) has zero gradient almost everywhere. The STE pretends the gradient of the quantization step is 1, which is mathematically incorrect but works well in practice. There is no strong theoretical justification for why STE works as well as it does; it is an empirical success, not a theorem.

Watch Out

Quantization is not just about weights

Weight-only quantization (W4A16, W8A16) is simpler and more common for LLMs because weights dominate memory. But activation quantization (W8A8, W4A4) gives additional speedups because matrix multiplications run in lower precision. Activation quantization is harder because activations have more dynamic range and outliers, especially in transformer models.
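Why activation outliers hurt, in one sketch: with a per-tensor symmetric scale, a single large value stretches the INT8 grid and inflates the rounding error on every other element. The tensors here are synthetic stand-ins for transformer activations:

```python
import numpy as np

def quant_sym(x, bits=8):
    """Per-tensor symmetric round-to-nearest quantize-dequantize."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(x).max() / qmax    # one scale for the whole tensor
    return s * np.clip(np.round(x / s), -qmax - 1, qmax)

def rel_err(x, bits=8):
    return np.linalg.norm(x - quant_sym(x, bits)) / np.linalg.norm(x)

rng = np.random.default_rng(0)
act = rng.normal(size=4096)       # well-behaved activation tensor
out = act.copy()
out[0] = 100.0                    # a single outlier value

print(f"no outliers:  {rel_err(act):.4f}")   # small INT8 error
print(f"with outlier: {rel_err(out):.4f}")   # scale stretched, error jumps
```

Per-channel scales, outlier-aware schemes, or keeping outlier channels in higher precision are the usual mitigations; weights avoid the problem largely because their distributions are much tamer.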

References

Canonical:

Current: