What Each Does
Both PTQ and QAT reduce model precision from floating-point (FP16/BF16) to lower bit widths (INT8, INT4, etc.) to reduce memory, bandwidth, and compute costs at inference time. They differ in when and how quantization is applied.
Post-Training Quantization (PTQ): Take a fully trained FP16 model and convert weights (and optionally activations) to lower precision. No retraining.
Quantization-Aware Training (QAT): Insert fake quantization operators into the forward pass during training. The model learns to be robust to quantization noise. Requires retraining or fine-tuning.
Side-by-Side Statement
Post-Training Quantization
Given a trained model with weights $W$, PTQ finds a quantized representation $\hat{W}$ that minimizes a reconstruction objective:

$$\hat{W} = \arg\min_{\hat{W} \in \mathcal{Q}} \left\| WX - \hat{W}X \right\|_2^2$$

where $\mathcal{Q}$ is the set of representable quantized weight matrices and $X$ is a small calibration dataset. No gradient updates are made to the original model.
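To make the objective concrete, here is a minimal round-to-nearest PTQ sketch in NumPy. It is illustrative only: real toolchains store integer codes plus scales rather than dequantized floats, and the matrix shapes here are arbitrary toy values.

```python
import numpy as np

def quantize_rtn(w, n_bits=8):
    """Symmetric per-channel round-to-nearest quantization (a sketch;
    real deployments keep integer codes plus scales, not dequantized floats)."""
    qmax = 2 ** (n_bits - 1) - 1                         # 127 for INT8
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax  # one scale per row
    return scale * np.clip(np.round(w / scale), -qmax - 1, qmax)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64))   # toy weight matrix
X = rng.normal(size=(64, 32))   # toy calibration batch

W_hat = quantize_rtn(W)
err = np.linalg.norm(W @ X - W_hat @ X) / np.linalg.norm(W @ X)
print(f"relative output error at INT8: {err:.4f}")
```

Measuring the error on the layer *output* (not on the weights directly) is exactly the reconstruction objective above: what matters is how much the quantized layer's behavior deviates on calibration inputs.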
Quantization-Aware Training
QAT modifies the training forward pass by inserting quantization-dequantization ("fake quant") operations:

$$\hat{W} = s \cdot \mathrm{clip}\left(\mathrm{round}\left(\frac{W}{s}\right),\; q_{\min},\; q_{\max}\right)$$

The forward pass uses $\hat{W}$ (quantized), but gradients flow through the quantization via the straight-through estimator (STE), updating the full-precision "shadow" weights $W$.
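A minimal NumPy sketch of the fake-quant forward pass and the STE backward rule (illustrative; in practice frameworks implement this inside autograd, and the step size and clip range here are arbitrary toy choices):

```python
import numpy as np

def fake_quant(w, step, qmin=-8, qmax=7):
    """INT4 fake quantization: quantize, then immediately dequantize."""
    return step * np.clip(np.round(w / step), qmin, qmax)

def ste_grad(upstream, w, step, qmin=-8, qmax=7):
    """STE backward pass: treat round() as the identity, so the upstream
    gradient passes straight through, except where the weight saturated."""
    inside = (w / step >= qmin) & (w / step <= qmax)
    return upstream * inside

w = np.array([0.30, -0.12, 5.0])   # the last weight saturates the INT4 range
step = 0.1
g = np.ones_like(w)                # pretend the upstream gradient is 1

print(fake_quant(w, step))         # [ 0.3 -0.1  0.7]
print(ste_grad(g, w, step))        # [1. 1. 0.]
```

Note that the gradient is zeroed only in the saturated region; everywhere else it ignores the rounding step entirely, which is exactly the approximation discussed under "Common Confusions" below.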
Where Each Is Stronger
PTQ wins on speed and accessibility
PTQ takes minutes to hours. You need only a small calibration set (128-1024 examples) and no training infrastructure. For INT8 quantization of most models, PTQ introduces negligible quality loss. It is the default choice for production deployment when 8-bit precision suffices.
QAT wins on quality at low bit widths
At 4-bit and below, PTQ quality degrades significantly for many models. QAT recovers most of this loss because the model adapts its weight distribution during training to be quantization-friendly. Weights cluster around quantization grid points, reducing the effective quantization error.
The cost is substantial: QAT requires training infrastructure, a training dataset, and typically 5-20% of the original training compute budget.
Modern PTQ Methods
Recent PTQ methods have narrowed the gap between PTQ and QAT at 4-bit precision.
GPTQ: one-shot weight quantization
GPTQ quantizes weights one column at a time, using the inverse Hessian to compensate for the quantization error of each column by adjusting the remaining unquantized columns. This is based on the Optimal Brain Quantization (OBQ) framework.
For a weight matrix $W$ and calibration inputs $X$, GPTQ greedily minimizes $\| WX - \hat{W}X \|_2^2$. The Hessian $H = 2XX^\top$ captures the sensitivity of each weight to the layer output. Insensitive weights are quantized aggressively; sensitive weights are quantized carefully with error compensation.
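A toy NumPy sketch of the compensation idea, heavily simplified from real GPTQ: a single weight row, an unbounded quantization grid, and the inverse Hessian recomputed at every step (the actual algorithm uses a Cholesky factorization and block updates to stay fast):

```python
import numpy as np

def gptq_row_sketch(w, X, step):
    """Quantize one weight row left to right with OBQ-style error
    compensation: each rounding error is spread onto the weights
    that have not been quantized yet, weighted by the inverse Hessian."""
    w = w.astype(np.float64).copy()
    d = w.size
    H = 2.0 * X @ X.T + 1e-6 * np.eye(d)   # damped layer Hessian
    q = np.zeros(d)
    for j in range(d):
        Hinv = np.linalg.inv(H[j:, j:])    # Hessian over remaining weights
        q[j] = step * np.round(w[j] / step)          # quantize weight j
        # Spread the rounding error onto not-yet-quantized weights.
        w[j:] -= (w[j] - q[j]) / Hinv[0, 0] * Hinv[:, 0]
    return q

rng = np.random.default_rng(1)
d, n, step = 8, 64, 0.25
X = rng.normal(size=(d, n))    # toy calibration inputs
w = rng.normal(size=d)         # toy weight row

q_rtn = step * np.round(w / step)          # plain round-to-nearest
q_gptq = gptq_row_sketch(w, X, step)
e_rtn = np.linalg.norm(w @ X - q_rtn @ X)
e_gptq = np.linalg.norm(w @ X - q_gptq @ X)
print(f"RTN output error {e_rtn:.3f}, compensated output error {e_gptq:.3f}")
```

The key mechanism is the last line of the loop: quantizing one weight changes the layer output, and later weights are nudged to cancel that change before they are themselves quantized.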
GPTQ quantizes a 175B parameter model in a few GPU-hours.
AWQ: activation-aware weight quantization
AWQ observes that a small fraction of weights (1-2%) are much more important than others, because they correspond to large activation magnitudes. Instead of treating all weights equally, AWQ scales important weight channels before quantization to reduce their relative quantization error.
The key insight: multiplying a weight channel by a scale $s > 1$ and dividing the corresponding activation by $s$ preserves the output, since $(ws) \cdot (x/s) = wx$, but reduces the quantization error on that weight channel (because the quantization grid is finer relative to the scaled weights).
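A toy NumPy demonstration of this equivalence. The weights, activations, and the scale factor are hypothetical hand-picked values (real AWQ searches per-channel scales over a calibration set); channel 3 plays the role of the salient channel, with a modest weight paired with a huge activation:

```python
import numpy as np

def quantize_pertensor(w, n_bits=4):
    """Symmetric per-tensor round-to-nearest quantization (sketch)."""
    qmax = 2 ** (n_bits - 1) - 1
    step = np.abs(w).max() / qmax
    return step * np.clip(np.round(w / step), -qmax - 1, qmax)

# Hypothetical toy values: channel 3 has a modest weight but a huge activation.
w = np.array([0.12, -0.03, 0.01, 0.3 / 7])
x = np.array([1.0, 1.0, 1.0, 50.0])
y_ref = w @ x

# Scale the salient weight channel up, and its activation down by the same factor.
s = np.array([1.0, 1.0, 1.0, 2.0])
assert np.isclose((w * s) @ (x / s), y_ref)   # output is exactly preserved

e_plain  = abs(quantize_pertensor(w) @ x - y_ref)
e_scaled = abs(quantize_pertensor(w * s) @ (x / s) - y_ref)
print(e_plain, e_scaled)   # the scaled version has much lower output error
```

Because the salient weight is enlarged before rounding, its rounding error, once divided back by $s$, is smaller in effective terms, and that error was the dominant one since it gets multiplied by the large activation.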
Quality Comparison at Different Bit Widths
| Bit width | PTQ (naive) | PTQ (GPTQ/AWQ) | QAT |
|---|---|---|---|
| INT8 | Negligible loss | Negligible loss | Negligible loss |
| INT4 | 1-5% perplexity increase | 0.5-1% perplexity increase | 0.1-0.5% perplexity increase |
| INT3 | Often broken | 2-5% perplexity increase | 1-2% perplexity increase |
| INT2 | Broken | Usually broken | 3-10% perplexity increase |
These numbers are approximate and model-dependent. Larger models tolerate quantization better than smaller ones.
When to Use Each
Use naive PTQ (round-to-nearest) when:
- Target precision is INT8 or higher.
- The model is large enough that per-channel quantization works well.
- You have no calibration data.
Use GPTQ/AWQ when:
- Target precision is INT4.
- You need fast quantization (no training loop).
- A small calibration set (128-1024 examples) is available.
- You are deploying an open-weight LLM for inference.
Use QAT when:
- Target precision is INT4 or lower.
- Quality at low bit widths is critical.
- Training compute is available.
- The model will be deployed at scale, justifying the upfront training cost.
Key Assumptions That Differ
| | PTQ | QAT |
|---|---|---|
| When applied | After training | During training |
| Training data needed | Small calibration set or none | Full training dataset |
| Compute cost | Minutes to hours | Days to weeks |
| INT8 quality | Excellent | Excellent |
| INT4 quality | Good (with GPTQ/AWQ) | Better |
| INT2 quality | Poor | Acceptable for some models |
| Gradient method | None | Straight-through estimator |
Common Confusions
GPTQ is not QAT
GPTQ is a PTQ method. It does not retrain the model. It uses second-order information (the Hessian) to minimize reconstruction error, but all computation happens post-training with a small calibration set. The "one-shot" in its name refers to single-pass processing, not single-epoch training.
The straight-through estimator is an approximation
In QAT, the quantization function (round-to-nearest) has zero gradient almost everywhere. The STE pretends the gradient of the quantization step is 1, which is mathematically incorrect but works well in practice. There is no strong theoretical justification for why STE works as well as it does. It is an empirical success, not a theorem.
Quantization is not just about weights
Weight-only quantization (W4A16, W8A16) is simpler and more common for LLMs because weights dominate memory. But activation quantization (W8A8, W4A4) gives additional speedups because matrix multiplications run in lower precision. Activation quantization is harder because activations have more dynamic range and outliers, especially in transformer models.
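The outlier problem can be seen directly with a toy NumPy experiment (synthetic data; the single injected outlier stands in for the large activation channels observed in transformers):

```python
import numpy as np

def quantize_pertensor(t, n_bits=8):
    """Symmetric per-tensor round-to-nearest quantization (sketch)."""
    qmax = 2 ** (n_bits - 1) - 1
    step = np.abs(t).max() / qmax
    return step * np.clip(np.round(t / step), -qmax - 1, qmax)

def rel_err(t):
    return np.linalg.norm(t - quantize_pertensor(t)) / np.linalg.norm(t)

rng = np.random.default_rng(3)
a = rng.normal(size=1024)       # well-behaved activations
a_out = a.copy()
a_out[0] = 100.0                # one transformer-style outlier

print(rel_err(a), rel_err(a_out))  # the outlier stretches the grid for everyone
```

A single extreme value inflates the per-tensor scale, so every other activation is rounded on a much coarser grid. This is why activation quantization schemes resort to per-token or per-channel scales, or handle outlier channels separately.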
References
Canonical:
- Jacob et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference" (CVPR 2018)
- Nagel et al., "A White Paper on Neural Network Quantization" (Qualcomm, 2021)
Current:
- Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers" (ICLR 2023)
- Lin et al., "AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration" (MLSys 2024)