
LLM Construction

Speculative Decoding and Quantization

Two core inference optimizations: speculative decoding for latency (draft-verify parallelism) and quantization for memory and throughput (reducing weight precision without destroying quality).


Why This Matters

A 70B-parameter model in FP16 requires 140 GB of memory just for the weights, more than any single GPU provides. Even if the model fits, autoregressive generation is memory-bandwidth-bound: each token requires reading every weight from memory, and the arithmetic intensity is low.

Quantization and speculative decoding are the two techniques with the largest measured speedup for making LLM inference practical. Quantization reduces memory and increases throughput by using lower-precision weights. Speculative decoding reduces latency by generating multiple tokens per large-model forward pass.

These are not optional optimizations. They are how models actually get deployed in production.

Speculative Decoding

Mental Model

Autoregressive generation with a large model is slow because each token requires a full forward pass. The key observation: verifying that a sequence of tokens is correct is much cheaper than generating them one by one, because verification of k tokens can be done in a single forward pass (just like processing a prompt).

Speculative decoding exploits this: use a small, fast draft model to propose k candidate tokens, then use the large target model to verify them all in one parallel forward pass. Accept the tokens that the target model agrees with; reject and resample from the point of disagreement.

Definition

Speculative Decoding

Speculative decoding uses a draft model M_q (small, fast) and a target model M_p (large, accurate). At each step:

  1. The draft model generates K candidate tokens x_1, \ldots, x_K autoregressively
  2. The target model computes p(x_i \mid x_{<i}) for all K tokens in a single forward pass
  3. Each token is accepted or rejected using a modified rejection sampling scheme
  4. Generation continues from the first rejected position (or past all K if all accepted)
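The steps above can be sketched in a few lines. This is a toy illustration, not a serving implementation: it assumes context-independent categorical distributions p (target) and q (draft) over a tiny vocabulary, whereas real models condition on the prefix. The control flow is the same.

```python
# Toy sketch of one draft-then-verify step of speculative decoding.
# p and q are assumed context-independent for simplicity.
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p, q, K):
    """One draft-then-verify step. Returns the tokens emitted this step."""
    drafts = rng.choice(len(q), size=K, p=q)      # 1. draft model proposes K tokens
    accepted = []
    for x in drafts:
        if rng.random() < min(1.0, p[x] / q[x]):  # 3. accept with prob min(1, p/q)
            accepted.append(int(x))
        else:
            residual = np.maximum(p - q, 0.0)     # resample from normalized max(0, p - q)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            return accepted                       # 4. stop at the first rejection
    # All K drafts accepted: sample one bonus token from the target distribution.
    accepted.append(int(rng.choice(len(p), p=p)))
    return accepted

p = np.array([0.5, 0.3, 0.2])   # target distribution (toy)
q = np.array([0.4, 0.4, 0.2])   # draft distribution (toy)
tokens = speculative_step(p, q, K=4)
print(tokens)  # between 1 and K+1 tokens per verification step
```

Note that each verification step yields at least one token (the residual sample on rejection, or a bonus target sample when all drafts pass), which is why the method never runs slower than one target token per target forward pass.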

The Correctness Guarantee

Theorem

Speculative Decoding Preserves Target Distribution

Statement

The speculative decoding algorithm with modified rejection sampling produces tokens distributed exactly according to the target model p, regardless of the quality of the draft model q. Specifically, at each position, the accepted token x satisfies:

\Pr[x = v] = p(v \mid x_{<\text{pos}}) \quad \forall\, v \in \mathcal{V}

The draft model affects only the speed (expected number of accepted tokens per verification step), not the distribution of the output.

Intuition

The acceptance criterion is: accept draft token x with probability \min(1, p(x)/q(x)). If q proposes a token that p also likes (p(x) \geq q(x)), it is always accepted. If q proposes a token that p dislikes (p(x) < q(x)), it is accepted with probability p(x)/q(x). On rejection, resample from the residual distribution \max(0, p - q), normalized. This is standard rejection sampling, and the output distribution is exactly p.
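The identity is easy to check numerically. The sketch below, under the same toy assumption of fixed categorical distributions, draws many single-token proposals from a deliberately poor draft q, applies the accept/resample rule, and compares the empirical output distribution to p.

```python
# Numerical check: min(1, p/q) acceptance plus residual resampling
# yields tokens distributed exactly as p, for any toy p and q.
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.6, 0.3, 0.1])   # target (toy)
q = np.array([0.2, 0.3, 0.5])   # deliberately poor draft (toy)

residual = np.maximum(p - q, 0.0)
residual /= residual.sum()

N = 200_000
draws = rng.choice(3, size=N, p=q)                 # draft proposals
u = rng.random(N)
accept = u < np.minimum(1.0, p[draws] / q[draws])  # accept w.p. min(1, p/q)
resampled = rng.choice(3, size=N, p=residual)      # residual sample on rejection
out = np.where(accept, draws, resampled)

empirical = np.bincount(out, minlength=3) / N
print(np.round(empirical, 3))  # close to p = [0.6, 0.3, 0.1]
```

The empirical distribution matches p to within sampling noise even though q is badly miscalibrated; only the acceptance rate (and hence speed) suffers.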

Why It Matters

This is the key theoretical property: speculative decoding is not an approximation. It produces exactly the same distribution as the target model. You get speed for free with zero quality degradation. The draft model is purely a proposal mechanism: a bad draft model just means fewer tokens are accepted per step, not that the output quality changes.

Failure Mode

The guarantee requires exact probability computation. In practice, numerical precision differences between draft and target models can cause slight distributional deviations. Also, the speedup depends on the acceptance rate: if the draft model is very different from the target, most tokens are rejected and the overhead of running the draft model is wasted. The draft model must be a reasonable approximation of the target for practical speedups.

Speedup Analysis

The expected number of tokens generated per verification step is:

\mathbb{E}[\text{accepted tokens}] = \sum_{i=1}^{K} \prod_{j=1}^{i} \alpha_j

where \alpha_j is the acceptance probability at position j. If the draft model closely matches the target (high \alpha), the speedup approaches K. In practice, speculative decoding with a 10x smaller draft model achieves 2-3x speedup on typical text generation tasks.
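If the acceptance probability is assumed constant across positions (\alpha_j = \alpha), the products collapse to powers and the sum can be evaluated directly. A toy calculation, not a measurement:

```python
# Expected accepted draft tokens per verification step, assuming a
# constant per-position acceptance probability alpha (so the product
# over j <= i is just alpha**i).
def expected_accepted(alpha: float, K: int) -> float:
    return sum(alpha**i for i in range(1, K + 1))

for alpha in (0.7, 0.9):
    print(alpha, round(expected_accepted(alpha, K=5), 3))
# alpha=0.7 gives about 1.941; alpha=0.9 gives about 3.686
```

Note the diminishing returns in K: later draft positions only count if every earlier one was accepted, so pushing K far beyond 1/(1-\alpha) buys little.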

Quantization

Mental Model

Neural network weights are typically stored in FP16 (16-bit floating point) or BF16. Quantization maps them to lower precision: INT8 (8-bit integer), INT4 (4-bit), or even lower. Fewer bits mean less memory, faster memory reads, and often specialized hardware support for low-precision arithmetic.

The challenge: quantization introduces error. The question is whether this error matters for the task you care about.

Definition

Post-Training Quantization (PTQ)

PTQ quantizes a pretrained model without retraining. Given a weight tensor \mathbf{W} \in \mathbb{R}^{m \times n}, quantize to b bits via:

\hat{W}_{ij} = s \cdot \text{clamp}\!\left(\text{round}\!\left(\frac{W_{ij}}{s}\right),\; -2^{b-1},\; 2^{b-1}-1\right)

where s is a learned or calibrated scale factor. PTQ is fast (minutes to hours on a calibration dataset) but introduces quantization noise that is not compensated by training.
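The definition maps directly to code. A minimal symmetric per-tensor quantize/dequantize round trip, with the scale calibrated from the weight range as in the formula above:

```python
# Minimal symmetric per-tensor PTQ: s = max|W| / (2^(b-1) - 1),
# then round, clamp, and dequantize back to floating point.
import numpy as np

def quantize_dequantize(W: np.ndarray, bits: int) -> np.ndarray:
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(W).max() / qmax                   # calibrated scale factor
    Wq = np.clip(np.round(W / s), -qmax - 1, qmax)
    return s * Wq                                # dequantized approximation of W

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)
for b in (8, 4):
    err = np.abs(quantize_dequantize(W, b) - W).max()
    print(b, err)  # max per-element error is at most s/2 for each bit-width
```

In a real deployment the INT values Wq would be stored and the multiplication by s fused into the matmul; the round trip here is just to expose the error.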

Definition

Quantization-Aware Training (QAT)

QAT simulates quantization during training using the straight-through estimator: the forward pass uses quantized weights, but gradients are computed as if the rounding did not occur. This allows the model to adapt its weights to be robust to quantization noise. QAT produces better results than PTQ but requires a full training run.
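The straight-through estimator can be sketched without an autograd framework. The toy below (a single linear layer, squared loss, hand-written gradient; all names and sizes are illustrative) uses quantized weights in the forward pass but applies the gradient to the full-precision "shadow" weights as if the rounding were the identity:

```python
# Sketch of QAT with the straight-through estimator on a toy
# linear-regression problem. Forward uses quantized weights; the
# gradient treats d(quantize(w))/dw as 1 and updates the FP weights.
import numpy as np

rng = np.random.default_rng(0)

def quantize(w, bits=4):
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(w).max() / qmax
    return s * np.clip(np.round(w / s), -qmax - 1, qmax)

# toy regression data
X = rng.normal(size=(256, 8))
w_true = rng.normal(size=8)
y = X @ w_true

w = 0.1 * rng.normal(size=8)         # full-precision "shadow" weights
lr = 0.05
for _ in range(200):
    w_q = quantize(w)                # forward pass uses quantized weights
    err = X @ w_q - y
    grad = X.T @ err / len(X)        # STE: rounding treated as identity
    w -= lr * grad                   # update the full-precision weights

print(np.abs(X @ quantize(w) - y).mean())  # residual near the quantization floor
```

The trained quantized weights fit the data down to roughly the 4-bit quantization floor, which is the point of QAT: the shadow weights settle where their rounded values work well.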

Quantization Error

Proposition

Per-Layer Quantization Error

Statement

For a weight matrix \mathbf{W} quantized to b bits with optimal scale factor s^* = \|\mathbf{W}\|_\infty / (2^{b-1} - 1), the per-element mean squared quantization error is bounded by:

\mathbb{E}\left[\left(\hat{W}_{ij} - W_{ij}\right)^2\right] \leq \frac{(s^*)^2}{4} = \frac{\|\mathbf{W}\|_\infty^2}{4(2^{b-1}-1)^2}

The error scales as O(2^{-2b}): removing one bit of precision roughly quadruples the quantization error per weight.

Intuition

Quantization maps each weight to the nearest representable value. The maximum rounding error is half the step size, s/2. The step size is set by the dynamic range of the weights divided by the number of quantization levels (2^b). Wider dynamic range or fewer bits means larger steps and more error.

Why It Matters

This explains why outlier weights are so problematic: a single large weight forces a large scale factor s, increasing quantization error for all weights in that channel. This is why methods like LLM.int8() isolate outlier channels and quantize them separately, and why per-channel quantization outperforms per-tensor quantization.
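The outlier effect is easy to demonstrate. In the sketch below (toy weights, one planted outlier; per-channel here means one scale per row), a single large weight inflates the shared per-tensor scale so that every small weight rounds to zero, while per-channel scales confine the damage to one row:

```python
# One outlier weight ruins per-tensor quantization; per-channel
# (per-row) scales contain the damage.
import numpy as np

def qdq(W, bits, axis=None):
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(W).max(axis=axis, keepdims=axis is not None) / qmax
    return s * np.clip(np.round(W / s), -qmax - 1, qmax)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(8, 256))
W[0, 0] = 5.0                                  # a single planted outlier

mse_tensor = np.mean((qdq(W, 4) - W) ** 2)            # one scale for the tensor
mse_channel = np.mean((qdq(W, 4, axis=1) - W) ** 2)   # one scale per row
print(mse_tensor, mse_channel)  # per-channel error is much smaller
```

With the shared scale, the 4-bit step size is set by the outlier (about 5/7), so every weight of magnitude 0.02 quantizes to zero; per-row scales keep the seven clean rows at their natural, much finer resolution.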

Failure Mode

The per-element MSE bound does not capture the downstream effect on model quality. Some weights are more important than others: quantization error in attention projection weights affects the model differently than error in FFN weights. Methods like GPTQ and AWQ use Hessian information to minimize the effect of quantization on the output (not just the weights themselves).

Quantization Methods in Practice

GPTQ (Generative Pre-Trained Transformer Quantization): Applies layer-wise quantization using approximate second-order information (the Hessian of the layer output with respect to the weights). Quantizes weights one column at a time, updating remaining columns to compensate for the quantization error of each column. Achieves good INT4 results.

AWQ (Activation-Aware Weight Quantization): Identifies "salient" weight channels (those multiplied by large activations on calibration data) and protects them by scaling before quantization. Simple and effective.

GGUF: A file format (not a quantization method) for storing quantized models, widely used in local inference (llama.cpp). Supports mixed-precision quantization: different layers can use different bit-widths.

Combining Both Techniques

Speculative decoding and quantization are complementary. A typical production pipeline:

  1. Quantize the target model to INT8 or INT4 to fit in available GPU memory
  2. Use a small draft model (possibly also quantized) for speculative decoding
  3. Serve with speculative verification to reduce latency
Build It This Way by Default

For inference, start with INT8 quantization (AWQ or GPTQ). This halves memory with minimal quality loss and is the lowest-risk optimization. For latency-sensitive applications, add speculative decoding with a draft model that is roughly 10x smaller than the target. Only go to INT4 if memory is the binding constraint, and validate on your specific eval suite because INT4 can degrade rare-token generation and calibration.

Common Fake Understanding

Quantization does not just "make things smaller." It introduces specific distributional errors that disproportionately affect: (1) outlier weights and activations, (2) calibration (confidence estimates become unreliable before accuracy degrades), (3) rare-token generation (low-frequency tokens have less gradient signal during QAT and suffer more from PTQ rounding). A model that "looks fine" on perplexity benchmarks may have degraded tail behavior that matters for your application.

Common Confusions

Watch Out

Speculative decoding is not approximate

A common misunderstanding is that speculative decoding trades quality for speed. It does not. The output distribution is mathematically identical to sampling from the target model. The speedup comes from parallelizing verification, not from approximating the distribution.

Watch Out

INT4 is not half as good as INT8

The relationship between bit-width and quality is nonlinear and task-dependent. INT8 quantization is nearly lossless for most models. INT4 introduces measurable degradation but is often acceptable. INT3 and INT2 typically cause significant quality loss. The transition from "fine" to "broken" is often sharp rather than gradual.

Summary

  • Speculative decoding: draft model proposes, target model verifies in parallel. Output distribution is exactly the target model's distribution.
  • Speedup depends on draft-target agreement; typically 2-3x for a 10x smaller draft model
  • Quantization: reduce weight precision from FP16 to INT8/INT4 to save memory and increase throughput
  • PTQ is fast but noisy; QAT is expensive but robust
  • Outlier weights are the main challenge; per-channel scaling and Hessian-aware methods address this
  • Start with INT8 quantization; add speculative decoding for latency; go to INT4 only if memory is the bottleneck

Exercises

ExerciseCore

Problem

A 70B-parameter model in FP16 requires 140 GB of memory. How much memory does it require after INT8 quantization? After INT4?

ExerciseAdvanced

Problem

In speculative decoding, suppose the draft model's per-token acceptance rate is \alpha = 0.7 and you generate K = 5 draft tokens per step. What is the expected number of tokens accepted per verification step? What if \alpha = 0.9?

ExerciseResearch

Problem

Quantization-aware training uses the straight-through estimator (STE): the forward pass uses quantized weights \hat{\mathbf{W}}, but the backward pass computes gradients as if \hat{\mathbf{W}} = \mathbf{W} (ignoring the rounding). Why does this biased gradient estimator work in practice? When might it fail?


References

Canonical:

  • Leviathan, Kalman, Matias, "Fast Inference from Transformers via Speculative Decoding" (ICML 2023)
  • Chen, Borgeaud, et al., "Accelerating Large Language Model Decoding with Speculative Sampling" (2023)

Current:

  • Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers" (ICLR 2023)
  • Lin et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration" (MLSys 2024)

Next Topics

The natural next steps from speculative decoding and quantization:

  • Context engineering: designing the full inference pipeline around these efficiency constraints
  • Mixture of experts: another route to efficiency; MoE models themselves need quantization and speculative decoding to serve at scale

Last reviewed: April 2026
