LLM Construction
Speculative Decoding and Quantization
Two core inference optimizations: speculative decoding for latency (draft-verify parallelism) and quantization for memory and throughput (reducing weight precision without destroying quality).
Why This Matters
A 70B-parameter model in FP16 requires 140 GB of memory just for the weights. That is more than fits on a single GPU. Even if it fits, autoregressive generation is memory-bandwidth-bound: each token requires reading every weight from memory, and the arithmetic intensity is low.
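The arithmetic is simple enough to sanity-check (weights only; the KV cache and activations add more on top):

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory for the model weights alone at a given precision."""
    return n_params * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"70B @ {bits:>2}-bit: {weight_memory_gb(70e9, bits):6.0f} GB")
# 16-bit: 140 GB, 8-bit: 70 GB, 4-bit: 35 GB
```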
Quantization and speculative decoding are the two techniques with the largest measured speedup for making LLM inference practical. Quantization reduces memory and increases throughput by using lower-precision weights. Speculative decoding reduces latency by generating multiple tokens per large-model forward pass.
These are not optional optimizations. They are how models actually get deployed in production.
Speculative Decoding
Mental Model
Autoregressive generation with a large model is slow because each token requires a full forward pass. The key observation: verifying that a sequence of tokens is correct is much cheaper than generating them one by one, because verification of tokens can be done in a single forward pass (just like processing a prompt).
Speculative decoding exploits this: use a small, fast draft model to propose candidate tokens, then use the large target model to verify them all in one parallel forward pass. Accept the tokens that the target model agrees with; reject and resample from the point of disagreement.
Speculative Decoding
Speculative decoding uses a draft model (small, fast) and a target model (large, accurate). At each step:
- The draft model generates $\gamma$ candidate tokens autoregressively
- The target model computes its output distributions for all candidate positions in a single forward pass
- Each token is accepted or rejected using a modified rejection sampling scheme
- Generation continues from the first rejected position (or past all if all accepted)
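The steps above can be sketched end to end. This is a toy simulation: `p_fn` and `q_fn` are hypothetical callables standing in for the target and draft models, and the "parallel" verification is simulated per prefix (a real implementation gets all target distributions from one batched forward pass):

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_fn, q_fn, ctx, gamma=4):
    """One draft-then-verify step. p_fn/q_fn map a token-id context to a
    probability vector over the vocabulary. Returns the tokens emitted
    by this step: between 1 and gamma + 1 of them."""
    # 1) Draft model proposes gamma tokens autoregressively.
    draft, q_dists, c = [], [], list(ctx)
    for _ in range(gamma):
        q = q_fn(c)
        t = int(rng.choice(len(q), p=q))
        draft.append(t); q_dists.append(q); c.append(t)
    # 2) Target model scores all gamma + 1 prefixes (batched in practice).
    p_dists = [p_fn(list(ctx) + draft[:i]) for i in range(gamma + 1)]
    # 3) Modified rejection sampling: accept token t with prob min(1, p/q).
    out = []
    for i, t in enumerate(draft):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[t] / q[t]):
            out.append(t)
            continue
        # Rejected: resample from the normalized residual max(p - q, 0)
        # and stop; drafting resumes from this position next step.
        residual = np.maximum(p - q, 0.0)
        out.append(int(rng.choice(len(p), p=residual / residual.sum())))
        return out
    # 4) All gamma tokens accepted: take one free token from the target's
    #    distribution at the next position.
    out.append(int(rng.choice(len(p_dists[-1]), p=p_dists[-1])))
    return out
```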
The Correctness Guarantee
Speculative Decoding Preserves Target Distribution
Statement
The speculative decoding algorithm with modified rejection sampling produces tokens distributed exactly according to the target model $p$, regardless of the quality of the draft model $q$. Specifically, at each position $t$, the accepted token $x_t$ satisfies:

$$x_t \sim p(\cdot \mid x_{<t})$$
The draft model affects only the speed (expected number of accepted tokens per verification step), not the distribution of the output.
Intuition
The acceptance criterion is: accept draft token $x$ with probability $\min(1, p(x)/q(x))$. If $q$ proposes a token that $p$ also likes ($p(x) \ge q(x)$), it is always accepted. If $q$ proposes a token that $p$ dislikes ($p(x) < q(x)$), it is accepted with probability $p(x)/q(x)$. On rejection, resample from the residual distribution $\max(0, p - q)$, normalized. This is standard rejection sampling. The output distribution is exactly $p$.
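The acceptance/residual rule can be checked empirically on a toy three-token vocabulary: sampling proposals from a deliberately mismatched draft distribution and applying the rule recovers the target distribution (a simulation sketch, not a production implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.6, 0.3, 0.1])   # target distribution (toy 3-token vocab)
q = np.array([0.3, 0.3, 0.4])   # deliberately mismatched draft distribution

def spec_sample_one(p, q):
    """Sample one token via draft-propose / accept-or-resample."""
    t = rng.choice(3, p=q)
    if rng.random() < min(1.0, p[t] / q[t]):
        return t
    residual = np.maximum(p - q, 0.0)
    return rng.choice(3, p=residual / residual.sum())

samples = np.array([spec_sample_one(p, q) for _ in range(100_000)])
empirical = np.bincount(samples, minlength=3) / len(samples)
# empirical ≈ p, even though every proposal came from q
```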
Why It Matters
This is the key theoretical property: speculative decoding is not an approximation. It produces exactly the same distribution as the target model. You get speed for free with zero quality degradation. The draft model is purely a proposal mechanism: a bad draft model just means fewer tokens are accepted per step, not that the output quality changes.
Failure Mode
The guarantee requires exact probability computation. In practice, numerical precision differences between draft and target models can cause slight distributional deviations. Also, the speedup depends on the acceptance rate: if the draft model is very different from the target, most tokens are rejected and the overhead of running the draft model is wasted. The draft model must be a reasonable approximation of the target for practical speedups.
Speedup Analysis
The expected number of tokens generated per verification step is:

$$\mathbb{E}[\text{tokens}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}$$

where $\alpha$ is the acceptance probability at each position and $\gamma$ is the number of draft tokens per step. If the draft model closely matches the target (high $\alpha$), the speedup approaches $\gamma + 1$. In practice, speculative decoding with a 10x smaller draft model achieves 2-3x speedup on typical text generation tasks.
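Under the simplifying assumption that each draft token is accepted independently with the same probability, the formula is easy to evaluate (function name is illustrative):

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification step, assuming each of the
    gamma draft tokens is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return gamma + 1  # limit: every draft token plus the bonus token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# e.g. 80% acceptance with 4 draft tokens yields ~3.36 tokens
# per target-model forward pass
print(expected_tokens(0.8, 4))
```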
Quantization
Mental Model
Neural network weights are typically stored in FP16 (16-bit floating point) or BF16. Quantization maps these to lower precision: INT8 (8-bit integer), INT4 (4-bit), or even lower. Fewer bits means less memory, faster memory reads, and often specialized hardware support for low-precision arithmetic.
The challenge: quantization introduces error. The question is whether this error matters for the task you care about.
Post-Training Quantization (PTQ)
PTQ quantizes a pretrained model without retraining. Given a weight tensor $W$, quantize to $b$ bits via:

$$Q(W) = s \cdot \mathrm{round}(W / s)$$

where $s$ is a learned or calibrated scale factor. PTQ is fast (minutes to hours on a calibration dataset) but introduces quantization noise that is not compensated by training.
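A minimal PTQ sketch, assuming simple symmetric per-tensor quantization with the scale calibrated from the tensor's max magnitude (one common calibration choice among several):

```python
import numpy as np

def quantize_int(w: np.ndarray, bits: int = 8):
    """Symmetric PTQ: w ≈ s * round(w / s), scale s from max magnitude."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for INT8
    s = np.abs(w).max() / qmax                 # calibrated scale factor
    dtype = np.int8 if bits == 8 else np.int32
    w_int = np.clip(np.round(w / s), -qmax, qmax).astype(dtype)
    return w_int, s

def dequantize(w_int, s):
    return w_int.astype(np.float32) * s

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
w_int, s = quantize_int(w, bits=8)
err = np.abs(dequantize(w_int, s) - w).max()   # bounded by s / 2
```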
Quantization-Aware Training (QAT)
QAT simulates quantization during training using the straight-through estimator: the forward pass uses quantized weights, but gradients are computed as if the rounding did not occur. This allows the model to adapt its weights to be robust to quantization noise. QAT produces better results than PTQ but requires a full training run.
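A scalar toy version of the STE idea, assuming a uniform quantizer with fixed step and a least-squares objective (illustrative, not a real QAT recipe): the forward pass uses the quantized weight, while the update treats the quantizer's derivative as 1.

```python
import numpy as np

def quantize(w, s=0.1):
    """Uniform quantizer with fixed step s (toy choice)."""
    return s * np.round(w / s)

# Fit a scalar weight so that quantize(w) * x matches targets y.
rng = np.random.default_rng(0)
x = rng.normal(size=256)
y = quantize(0.73) * x            # targets generated by a quantized weight

w, lr = 0.0, 0.1
for _ in range(100):
    w_q = quantize(w)                         # forward: quantized weight
    grad_wq = np.mean(2 * (w_q * x - y) * x)  # exact gradient w.r.t. w_q
    w -= lr * grad_wq                         # STE: pretend dQ/dw = 1
# w settles in the quantization bin whose level minimizes the loss
```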
Quantization Error
Per-Layer Quantization Error
Statement
For a weight matrix $W$ quantized to $b$ bits with optimal scale factor $s$, the per-element mean squared quantization error is bounded by:

$$\mathbb{E}\left[(Q(w) - w)^2\right] \le \frac{\Delta^2}{12}, \qquad \Delta = \frac{w_{\max} - w_{\min}}{2^b - 1}$$

The error scales as $2^{-2b}$: each bit removed roughly quadruples the quantization error per weight.
Intuition
Quantization maps each weight to the nearest representable value. The maximum rounding error is half the step size, $\Delta/2$. The step size is set by the dynamic range of the weights divided by the number of quantization levels ($\Delta = (w_{\max} - w_{\min})/(2^b - 1)$). Wider dynamic range or fewer bits means larger steps and more error.
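The $\Delta^2/12$ behavior is easy to verify numerically for uniformly distributed weights (a sketch under that distributional assumption; for uniform data the rounding error is itself approximately uniform, so the bound is essentially tight):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, size=100_000)

def quant_mse(w, bits):
    """Empirical MSE of round-to-nearest quantization vs the Δ²/12 bound."""
    levels = 2 ** bits - 1
    step = (w.max() - w.min()) / levels            # step size Δ
    w_hat = w.min() + step * np.round((w - w.min()) / step)
    return np.mean((w_hat - w) ** 2), step ** 2 / 12

for bits in (8, 4, 3):
    mse, bound = quant_mse(w, bits)
    print(f"{bits}-bit: MSE {mse:.2e}  vs  Δ²/12 {bound:.2e}")
```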
Why It Matters
This explains why outlier weights are so problematic: a single large weight forces a large scale factor $s$, increasing quantization error for every weight that shares that scale. This is why methods like LLM.int8() isolate outlier channels and quantize them separately, and why per-channel quantization outperforms per-tensor quantization.
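A small experiment illustrating the outlier effect, assuming symmetric INT8 quantization and a synthetic weight matrix with one planted outlier:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(64, 256))
W[0, 0] = 8.0                      # one planted outlier weight

def quantize(W, s, qmax=127):
    """Symmetric round-to-nearest with scale s (per-tensor or per-row)."""
    return s * np.clip(np.round(W / s), -qmax, qmax)

# Per-tensor: a single scale, blown up by the outlier, hurts every weight.
s_tensor = np.abs(W).max() / 127
mse_tensor = np.mean((quantize(W, s_tensor) - W) ** 2)

# Per-channel: one scale per row; the outlier only inflates row 0's scale.
s_chan = np.abs(W).max(axis=1, keepdims=True) / 127
mse_chan = np.mean((quantize(W, s_chan) - W) ** 2)
# mse_chan is dramatically smaller than mse_tensor
```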
Failure Mode
The per-element MSE bound does not capture the downstream effect on model quality. Some weights are more important than others: quantization error in attention projection weights affects the model differently than error in FFN weights. Methods like GPTQ and AWQ use Hessian information to minimize the effect of quantization on the output (not just the weights themselves).
Quantization Methods in Practice
GPTQ (Generative Pre-Trained Transformer Quantization): Applies layer-wise quantization using approximate second-order information (the Hessian of the layer output with respect to the weights). Quantizes weights one column at a time, updating remaining columns to compensate for the quantization error of each column. Achieves good INT4 results.
AWQ (Activation-Aware Weight Quantization): Identifies "salient" weight channels (those multiplied by large activations on calibration data) and protects them by scaling before quantization. Simple and effective.
GGUF: A file format (not a quantization method) for storing quantized models, widely used in local inference (llama.cpp). Supports mixed-precision quantization: different layers can use different bit-widths.
Combining Both Techniques
Speculative decoding and quantization are complementary. A typical production pipeline:
- Quantize the target model to INT8 or INT4 to fit in available GPU memory
- Use a small draft model (possibly also quantized) for speculative decoding
- Serve with speculative verification to reduce latency
For inference, start with INT8 quantization (AWQ or GPTQ). This halves memory with minimal quality loss and is the lowest-risk optimization. For latency-sensitive applications, add speculative decoding with a draft model that is roughly 10x smaller than the target. Only go to INT4 if memory is the binding constraint, and validate on your specific eval suite because INT4 can degrade rare-token generation and calibration.
Quantization does not just "make things smaller." It introduces specific distributional errors that disproportionately affect: (1) outlier weights and activations, (2) calibration (confidence estimates become unreliable before accuracy degrades), (3) rare-token generation (low-frequency tokens have less gradient signal during QAT and suffer more from PTQ rounding). A model that "looks fine" on perplexity benchmarks may have degraded tail behavior that matters for your application.
Common Confusions
Speculative decoding is not approximate
A common misunderstanding is that speculative decoding trades quality for speed. It does not. The output distribution is mathematically identical to sampling from the target model. The speedup comes from parallelizing verification, not from approximating the distribution.
INT4 is not half as good as INT8
The relationship between bit-width and quality is nonlinear and task-dependent. INT8 quantization is nearly lossless for most models. INT4 introduces measurable degradation but is often acceptable. INT3 and INT2 typically cause significant quality loss. The transition from "fine" to "broken" is often sharp rather than gradual.
Summary
- Speculative decoding: draft model proposes, target model verifies in parallel. Output distribution is exactly the target model's distribution.
- Speedup depends on draft-target agreement; typically 2-3x for a 10x smaller draft model
- Quantization: reduce weight precision from FP16 to INT8/INT4 to save memory and increase throughput
- PTQ is fast but noisy; QAT is expensive but robust
- Outlier weights are the main challenge; per-channel scaling and Hessian-aware methods address this
- Start with INT8 quantization; add speculative decoding for latency; go to INT4 only if memory is the bottleneck
Exercises
Problem
A 70B-parameter model in FP16 requires 140 GB of memory. How much memory does it require after INT8 quantization? After INT4?
Problem
In speculative decoding, suppose the draft model's per-token acceptance rate is $\alpha$ and you generate $\gamma$ draft tokens per step. What is the expected number of tokens accepted per verification step? What happens as $\alpha \to 1$?
Problem
Quantization-aware training uses the straight-through estimator (STE): the forward pass uses quantized weights $Q(w)$, but the backward pass computes gradients as if $Q$ were the identity (ignoring the rounding). Why does this biased gradient estimator work in practice? When might it fail?
Related Comparisons
References
Canonical:
- Leviathan, Kalman, Matias, "Fast Inference from Transformers via Speculative Decoding" (ICML 2023)
- Chen, Borgeaud, et al., "Accelerating Large Language Model Decoding with Speculative Sampling" (2023)
Current:
- Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers" (ICLR 2023)
- Lin et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration" (MLSys 2024)
Next Topics
The natural next steps from speculative decoding and quantization:
- Context engineering: designing the full inference pipeline around these efficiency constraints
- Mixture of experts: another approach to efficiency; MoE models themselves need quantization and speculative decoding to serve
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in Rn (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- KV Cache (Layer 5)
Builds on This
- Edge and On-Device ML (Layer 5)
- Inference Systems Overview (Layer 5)