
Comparison

Post-Training Quantization vs. Quantization-Aware Training

PTQ quantizes a pretrained model with no retraining. QAT simulates quantization during training to recover quality. GPTQ and AWQ are modern PTQ methods that close much of the gap. The tradeoff is compute cost vs. model quality at low bit widths.

What Each Does

Both PTQ and QAT reduce model precision from floating-point (FP16/BF16) to lower bit widths (INT8, INT4, etc.) to reduce memory, bandwidth, and compute costs at inference time. They differ in when and how quantization is applied.

Post-Training Quantization (PTQ): Take a fully trained FP16 model and convert weights (and optionally activations) to lower precision. No retraining.

Quantization-Aware Training (QAT): Insert fake quantization operators into the forward pass during training. The model learns to be robust to quantization noise. Requires retraining or fine-tuning.

Side-by-Side Statement

Definition

Post-Training Quantization

Given a trained model with weights $W \in \mathbb{R}^{m \times n}$, PTQ finds a quantized representation $\hat{W}$ that minimizes some reconstruction objective:

$$\hat{W} = \arg\min_{\hat{W} \in \mathcal{Q}} \|WX - \hat{W}X\|_F^2$$

where $\mathcal{Q}$ is the set of quantized weight matrices and $X$ is a matrix of activations collected from a small calibration dataset. No gradient updates are made to the original model.
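As a concrete illustration of this objective, here is a minimal NumPy sketch. It assumes symmetric per-row round-to-nearest quantization; the granularity and the synthetic calibration data are illustrative choices, not prescribed by the definition:

```python
import numpy as np

def quantize_rtn(W, bits=4):
    """Naive round-to-nearest (RTN) PTQ with symmetric per-row scales."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for INT4
    s = np.abs(W).max(axis=1, keepdims=True) / qmax  # per-row scale
    q = np.clip(np.round(W / s), -qmax - 1, qmax)    # integer codes
    return s * q                                     # dequantized weights

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
X = rng.normal(size=(64, 256))        # stand-in calibration activations
W_hat = quantize_rtn(W, bits=4)

# Reconstruction objective ||WX - W_hat X||_F^2 on the calibration set
err = np.linalg.norm(W @ X - W_hat @ X) ** 2
print(f"relative output error: {err / np.linalg.norm(W @ X) ** 2:.4f}")
```

Naive RTN ignores $X$ entirely when choosing the codes; methods like GPTQ improve on it precisely by using the calibration activations.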

Definition

Quantization-Aware Training

QAT modifies the training forward pass by inserting quantization-dequantization ("fake quant") operations:

$$\hat{W} = \text{dequant}(\text{quant}(W)) = s \cdot \text{clamp}(\lfloor W/s \rceil, q_{\min}, q_{\max})$$

The forward pass uses $\hat{W}$ (quantized), but gradients flow through the quantization via the straight-through estimator (STE), updating the full-precision "shadow" weights $W$.
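A sketch of the fake-quant forward pass and its STE backward in plain NumPy. The clipped-STE variant shown here (gradient zeroed where the value is clamped) is one common choice, not the only one:

```python
import numpy as np

def fake_quant(w, s, qmin=-8, qmax=7):
    """Forward: quantize-dequantize, so the model sees quantized values."""
    return s * np.clip(np.round(w / s), qmin, qmax)

def fake_quant_grad(w, s, qmin=-8, qmax=7):
    """Backward (clipped STE): pretend d(fake_quant)/dw = 1 inside the
    clamp range and 0 outside, instead of the true zero-a.e. gradient."""
    q = np.round(w / s)
    return ((q >= qmin) & (q <= qmax)).astype(w.dtype)

w = np.array([-1.3, -0.2, 0.05, 0.7, 3.0])   # shadow weights (FP)
s = 0.25                                      # quantization step
print(fake_quant(w, s))        # values snapped to the INT4 grid
print(fake_quant_grad(w, s))   # 1 inside the range, 0 where clamped
```

During QAT, the optimizer applies this mask-shaped gradient to the full-precision shadow weights; only the forward pass ever sees the snapped values.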

Where Each Is Stronger

PTQ wins on speed and accessibility

PTQ takes minutes to hours. You need only a small calibration set (128-1024 examples) and no training infrastructure. For INT8 quantization of most models, PTQ introduces negligible quality loss. It is the default choice for production deployment when 8-bit precision suffices.

QAT wins on quality at low bit widths

At 4-bit and below, PTQ quality degrades significantly for many models. QAT recovers most of this loss because the model adapts its weight distribution during training to be quantization-friendly. Weights cluster around quantization grid points, reducing the effective quantization error.

The cost is substantial: QAT requires training infrastructure, a training dataset, and typically 5-20% of the original training compute budget.

Modern PTQ Methods

Recent PTQ methods have narrowed the gap between PTQ and QAT at 4-bit precision.

Example

GPTQ: one-shot weight quantization

GPTQ quantizes weights one column at a time, using the inverse Hessian to compensate for the quantization error of each column by adjusting the remaining unquantized columns. This is based on the Optimal Brain Quantization (OBQ) framework.

For a weight matrix $W$ and calibration data $X$, GPTQ minimizes $\|WX - \hat{W}X\|_F^2$ greedily. The Hessian $H = 2XX^\top$ captures the sensitivity of each weight to the output. Insensitive weights are quantized aggressively; sensitive weights are quantized carefully with error compensation.

GPTQ quantizes a 175B parameter model in a few GPU-hours.
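The column-by-column idea can be sketched as a toy NumPy loop. This assumes per-row scales and a plain (damped) matrix inverse; real GPTQ uses a Cholesky factorization of $H^{-1}$, block processing, and further numerical tricks:

```python
import numpy as np

def gptq_like(W, X, bits=4):
    """Toy GPTQ-style quantization: quantize columns left-to-right and
    fold each column's error into the not-yet-quantized columns via
    the inverse Hessian of the reconstruction loss, H = 2 X X^T."""
    n = W.shape[1]
    H = 2.0 * X @ X.T
    H += 1e-2 * np.mean(np.diag(H)) * np.eye(n)  # damping for stability
    Hinv = np.linalg.inv(H)
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(W).max(axis=1, keepdims=True) / qmax   # per-row scales
    W = W.copy()
    Q = np.zeros_like(W)
    for i in range(n):
        Q[:, i] = s[:, 0] * np.clip(np.round(W[:, i] / s[:, 0]),
                                    -qmax - 1, qmax)
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])  # compensate
    return Q

rng = np.random.default_rng(1)
W = rng.normal(size=(32, 32))
X = rng.normal(size=(32, 32)) @ rng.normal(size=(32, 128))  # correlated acts
Q = gptq_like(W, X, bits=3)

qmax = 3
s = np.abs(W).max(axis=1, keepdims=True) / qmax
rtn = s * np.clip(np.round(W / s), -qmax - 1, qmax)
print("RTN output error:      ", np.linalg.norm(W @ X - rtn @ X))
print("GPTQ-like output error:", np.linalg.norm(W @ X - Q @ X))
```

The compensation step is what separates this from naive RTN: rounding error in one column is absorbed by small adjustments to columns not yet quantized.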

Example

AWQ: activation-aware weight quantization

AWQ observes that a small fraction of weights (1-2%) are much more important than others, because they correspond to large activation magnitudes. Instead of treating all weights equally, AWQ scales important weight channels before quantization to reduce their relative quantization error.

The key insight: multiplying a weight channel by $s > 1$ and dividing the corresponding activation by $s$ preserves the output but reduces the quantization error on that weight channel (because the quantization grid is finer relative to the scaled weights).
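The scaling trick can be demonstrated directly. In this hypothetical setup, input channel 3 carries outlier activations and is protected with a hand-picked scale of 4; actual AWQ searches for per-channel scales on calibration data rather than hand-picking them:

```python
import numpy as np

def quant_rows(W, bits=4):
    """Symmetric per-row round-to-nearest quantize-dequantize."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(W).max(axis=1, keepdims=True) / qmax
    return s * np.clip(np.round(W / s), -qmax - 1, qmax)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
X = rng.normal(size=(16, 64))
X[3] *= 20.0                   # input channel 3 has outlier activations

scale = np.ones(16)
scale[3] = 4.0                 # protect the salient channel (hand-picked)

# Baseline: quantize W as-is
err_base = np.linalg.norm(W @ X - quant_rows(W) @ X)

# AWQ-style: scale weight column up, activation row down (output-preserving)
err_awq = np.linalg.norm(W @ X
                         - quant_rows(W * scale) @ (X / scale[:, None]))

print(f"baseline error: {err_base:.3f}")
print(f"AWQ-style error: {err_awq:.3f}")
```

The transform $W \mapsto W \cdot \mathrm{diag}(s)$, $X \mapsto \mathrm{diag}(s)^{-1} X$ leaves $WX$ unchanged in full precision, so any error reduction comes purely from the salient channel sitting more favorably on the quantization grid.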

Quality Comparison at Different Bit Widths

| Bit width | PTQ (naive) | PTQ (GPTQ/AWQ) | QAT |
|---|---|---|---|
| INT8 | Negligible loss | Negligible loss | Negligible loss |
| INT4 | 1-5% perplexity increase | 0.5-1% perplexity increase | 0.1-0.5% perplexity increase |
| INT3 | Often broken | 2-5% perplexity increase | 1-2% perplexity increase |
| INT2 | Broken | Usually broken | 3-10% perplexity increase |

These numbers are approximate and model-dependent. Larger models tolerate quantization better than smaller ones.

When to Use Each

Use naive PTQ (round-to-nearest) when:

- INT8 precision suffices; quality loss is negligible for most models and the conversion takes minutes.
- You want the simplest possible deployment pipeline, with little or no calibration data.

Use GPTQ/AWQ when:

- You need INT4 (or INT3) weights and can supply a small calibration set.
- You lack training infrastructure or the training dataset, but can spend a few GPU-hours.

Use QAT when:

- You are targeting very low bit widths where even GPTQ/AWQ degrade noticeably.
- Quality is critical and you can afford the training pipeline, data, and roughly 5-20% of the original training compute.

Key Assumptions That Differ

| | PTQ | QAT |
|---|---|---|
| When applied | After training | During training |
| Training data needed | Small calibration set or none | Full training dataset |
| Compute cost | Minutes to hours | Days to weeks |
| INT8 quality | Excellent | Excellent |
| INT4 quality | Good (with GPTQ/AWQ) | Better |
| INT2 quality | Poor | Acceptable for some models |
| Gradient method | None | Straight-through estimator |

Common Confusions

Watch Out

GPTQ is not QAT

GPTQ is a PTQ method. It does not retrain the model. It uses second-order information (the Hessian) to minimize reconstruction error, but all computation happens post-training with a small calibration set. The "one-shot" in its name refers to single-pass processing, not single-epoch training.

Watch Out

The straight-through estimator is an approximation

In QAT, the quantization function $\lfloor \cdot \rceil$ (round-to-nearest) has zero gradient almost everywhere. The STE pretends the gradient of the quantization step is 1, which is mathematically incorrect but works well in practice. There is no strong theoretical justification for why STE works as well as it does; it is an empirical success, not a theorem.

Watch Out

Quantization is not just about weights

Weight-only quantization (W4A16, W8A16) is simpler and more common for LLMs because weights dominate memory. But activation quantization (W8A8, W4A4) gives additional speedups because matrix multiplications run in lower precision. Activation quantization is harder because activations have more dynamic range and outliers, especially in transformer models.
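Why activation outliers hurt, in one sketch: with a per-tensor symmetric scale, a single large value stretches the INT8 grid and inflates the rounding error on every other element. The tensors here are synthetic stand-ins for transformer activations:

```python
import numpy as np

def quant_sym(x, bits=8):
    """Per-tensor symmetric round-to-nearest quantize-dequantize."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(x).max() / qmax    # one scale for the whole tensor
    return s * np.clip(np.round(x / s), -qmax - 1, qmax)

def rel_err(x, bits=8):
    return np.linalg.norm(x - quant_sym(x, bits)) / np.linalg.norm(x)

rng = np.random.default_rng(0)
act = rng.normal(size=4096)       # well-behaved activation tensor
out = act.copy()
out[0] = 100.0                    # a single outlier value

print(f"no outliers:  {rel_err(act):.4f}")   # small INT8 error
print(f"with outlier: {rel_err(out):.4f}")   # scale stretched, error jumps
```

Per-channel scales, outlier-aware schemes, or keeping outlier channels in higher precision are the usual mitigations; weights avoid the problem largely because their distributions are much tamer.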

References

Canonical:

Current: