LLM Construction
Inference-Time Scaling Laws
How additional compute at inference time (repeated sampling, search, verification) improves output quality, why gains are task-dependent, and why verifier quality matters more than raw sample count.
Why This Matters
Training scaling laws (Chinchilla, Kaplan) describe how model quality improves with more training compute. Inference-time scaling laws describe a complementary axis: how output quality improves with more compute at test time. This is a distinct phenomenon. A fixed model can produce better outputs by generating more candidates and selecting the best one.
This matters because inference compute is cheaper and more flexible than training compute. You can allocate more inference compute to harder problems and less to easy ones. Training compute is a one-time fixed cost; inference compute scales with demand and difficulty.
Mental Model
Given a fixed model, there are two ways to use extra inference compute:
- Generate more candidates and select the best one (best-of-N, majority voting)
- Search more deeply within a single generation (beam search, tree search, chain-of-thought with backtracking)
Both approaches improve output quality, but they hit diminishing returns at different rates and for different reasons.
Formal Setup and Notation
Best-of-N Sampling
Given a model $\pi$ and a verifier (reward model) $r$, best-of-N sampling generates $N$ independent completions $y_1, \dots, y_N \sim \pi(\cdot \mid x)$ and selects:

$$y^* = \arg\max_{i} \; r(x, y_i)$$

The quality of $y^*$ depends on $N$, the quality of $\pi$, and the quality of $r$.
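The selection rule can be sketched in a few lines of Python; `generate` and `verifier` are assumed stand-ins for a real sampler and reward model, not any particular API:

```python
import random

def best_of_n(generate, verifier, prompt, n):
    """Generate n independent candidates and return the one the
    verifier scores highest (argmax of r(x, y_i) over candidates)."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: verifier(prompt, y))

# Toy stand-ins: a "model" that samples digits, a "verifier" that
# prefers candidates closer to a hidden target answer.
random.seed(0)
target = 7
gen = lambda prompt: random.randint(0, 9)
ver = lambda prompt, y: -abs(y - target)
print(best_of_n(gen, ver, "find the target digit", n=50))
```

With enough samples, the selected candidate is almost surely the target, because the verifier here is perfect: the limiting factor is only whether any candidate is correct.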
Process Reward Model (PRM)
A process reward model assigns scores to intermediate steps, not just final answers. For a chain of reasoning steps $s_1, \dots, s_T$:

$$r_{\text{PRM}}(x, s_{1:T}) = \prod_{t=1}^{T} r(s_t \mid x, s_{1:t-1})$$

or equivalently $\sum_{t=1}^{T} \log r(s_t \mid x, s_{1:t-1})$ in log space. This enables step-level search: prune branches where early steps score poorly, and allocate compute to promising branches.
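A sketch of step-level scoring in log space; the per-step scores here are made-up numbers standing in for a real PRM's outputs:

```python
import math

def prm_score(step_probs):
    """Score a partial or complete reasoning chain: product of
    per-step scores, computed in log space for numerical stability.
    step_probs[t] plays the role of r(s_t | x, s_{1:t-1})."""
    return sum(math.log(p) for p in step_probs)

def should_prune(step_probs, log_threshold=math.log(0.1)):
    """Prune a branch as soon as its running log-score falls below
    a threshold (here an assumed cutoff of log 0.1)."""
    return prm_score(step_probs) < log_threshold

good_chain = [0.9, 0.8, 0.95]  # every step looks sound
bad_chain = [0.9, 0.1, 0.95]   # step 2 looks wrong -> prune early
print(should_prune(good_chain), should_prune(bad_chain))  # → False True
```

The bad chain is rejected after scoring, without ever being completed; that early exit is where the compute savings of step-level search come from.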
Core Definitions
The pass@N metric measures the probability that at least one of $N$ independent samples solves a problem correctly:

$$\text{pass@}N = 1 - (1 - p)^N$$

where $p$ is the per-sample success probability. Using $(1-p)^N \approx e^{-pN}$ for small $p$, this is approximately $1 - e^{-pN}$, which itself reduces to $pN$ when $pN \ll 1$ and saturates near 1 once $pN \gg 1$.
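The two regimes are easy to see numerically; a quick sketch comparing the exact curve with the exponential approximation:

```python
import math

def pass_at_n(p, n):
    """Probability that at least one of n independent samples succeeds."""
    return 1 - (1 - p) ** n

# For small p, pass@N ~= 1 - exp(-p*N): roughly p*N while p*N << 1,
# saturating near 1 once p*N >> 1.
p = 0.02
for n in (1, 10, 100, 500):
    exact = pass_at_n(p, n)
    approx = 1 - math.exp(-p * n)
    print(f"N={n:4d}  exact={exact:.3f}  approx={approx:.3f}")
```

At $N = 100$ ($pN = 2$) the curve is already well into the saturating regime; the last 400 samples buy almost nothing.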
An outcome reward model (ORM) scores only the final answer. A process reward model (PRM) scores each intermediate step. The distinction matters for scaling: PRMs enable tree search over reasoning steps, while ORMs can only rank complete solutions.
The verifier tax is the compute cost of evaluating the reward model on each candidate. For large reward models, this tax can be substantial. Doubling $N$ doubles both generation and verification cost.
Main Theorems
Diminishing Returns of Best-of-N
Statement
For best-of-N with a perfect verifier and $N$ independent samples, each with success probability $p$:

$$\text{pass@}N = 1 - (1 - p)^N$$

The marginal improvement from the $N$-th sample is:

$$\text{pass@}N - \text{pass@}(N-1) = p(1 - p)^{N-1}$$

This decays exponentially in $N$. The number of samples needed to reach pass@$N \geq 1 - \epsilon$ is:

$$N \approx \frac{\ln(1/\epsilon)}{p}$$

for small $p$. Halving the failure probability requires $\approx (\ln 2)/p$ additional samples, regardless of the current $N$.
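A numerical check of the statement (pure arithmetic, no model involved):

```python
import math

def marginal_gain(p, n):
    """Improvement from adding the n-th sample: p * (1-p)^(n-1)."""
    return p * (1 - p) ** (n - 1)

def samples_needed(p, target):
    """Smallest N with pass@N >= target; ~ ln(1/(1-target)) / p for small p."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

p = 0.2
# Exponential decay of the marginal gain in N:
print([round(marginal_gain(p, n), 4) for n in (1, 2, 5, 10)])
# Exact sample count for 99.9%, vs. the ln(1000)/p approximation (~34.5):
print(samples_needed(p, 0.999))  # → 31
```

The exact count is slightly below the small-$p$ approximation because at $p = 0.2$ each sample removes a bit more than a factor $e^{-p}$ of the failure mass.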
Intuition
Each new sample has the same independent probability $p$ of being correct. As $N$ grows, you have most likely already found a correct answer, so additional samples are wasted. Best-of-N helps most when $p$ is moderate (5-30%): high enough that a modest number of samples suffices, but low enough that a single sample often fails.
Proof Sketch
Direct computation. The probability that all $N$ samples fail is $(1-p)^N$, so the success probability is the complement, $1 - (1-p)^N$. Differentiating with respect to $N$ gives the marginal improvement.
Why It Matters
This bound shows that best-of-N scaling is limited by the base model's per-sample success probability. If $p = 0.01$, you need roughly $\ln(1000)/0.01 \approx 690$ samples to reach 99.9% success. At $p = 0.1$, you need only about 66. Inference-time compute cannot compensate for a weak base model. Improving $p$ (via better training) and improving how efficiently inference compute is spent (via better search) are complementary.
Failure Mode
The bound assumes a perfect verifier and independent samples. Imperfect verifiers select confidently wrong answers over hesitantly correct ones. Correlated samples (e.g., from temperature-reduced sampling) reduce the effective diversity and make the formula overly optimistic.
Key Phenomena
Verifier Quality Dominates Sample Count
Empirical results consistently show that improving the verifier (reward model) yields larger gains than increasing $N$. A perfect verifier with a small sample budget outperforms a mediocre verifier with an order of magnitude more samples on math and code tasks. This is because a bad verifier has a failure mode that sampling cannot fix: it systematically prefers wrong answers.
Formally, if the verifier has accuracy $v$ (the probability of correctly ranking a correct answer above an incorrect one), the effective pass rate becomes approximately $v \cdot (1 - (1-p)^N)$. When $v < 1$, no amount of increasing $N$ can push the success rate above $v$.
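A small Monte Carlo check of this approximation, under the simplifying assumption that whenever both correct and incorrect candidates exist, the verifier picks a correct one with probability $v$ (a toy model, not a claim about any specific reward model):

```python
import random

def simulate_selection(p, v, n, trials=20000, seed=0):
    """Estimate the probability that an imperfect verifier selects a
    correct answer from n candidates, each correct with probability p."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        correct = [rng.random() < p for _ in range(n)]
        if all(correct):
            wins += 1                   # any selection is correct
        elif any(correct):
            wins += rng.random() < v    # verifier ranks correctly w.p. v
        # no correct candidate -> selection must fail
    return wins / trials

# Compare the simulation with the v * pass@N approximation.
p, v, n = 0.3, 0.8, 20
approx = v * (1 - (1 - p) ** n)
print(simulate_selection(p, v, n), round(approx, 3))
```

With $p = 0.3$ and $N = 20$, pass@N is essentially 1, so the success rate sits right at the verifier ceiling $v = 0.8$.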
Task Dependence of Scaling
Inference-time scaling works well for tasks with:
- Verifiable answers (math, code, formal proofs)
- High diversity in solution approaches
- Moderate base success probability (5-50%)
It works poorly for tasks with:
- No good verifier (open-ended writing, ethical judgments)
- Low diversity in completions (the model makes the same error every time)
- Very low base probability (the model cannot solve the task)
Process Reward Models Enable Tree Search
With a PRM, you can evaluate partial solutions and prune bad branches early. This converts best-of-N (flat sampling) into tree search (structured exploration). Tree search uses compute more efficiently: it avoids completing solutions that are already off-track.
The improvement from tree search over flat sampling depends on how early errors can be detected. If errors are detectable after the first step, tree search is dramatically more efficient. If errors are only visible at the final answer, tree search reduces to best-of-N.
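The pruning idea can be sketched as a beam search over reasoning steps; `expand` (propose next steps) and `prm` (score a partial chain) are assumed stand-ins for a real model and process reward model:

```python
import heapq

def prm_beam_search(expand, prm, prompt, beam_width=4, max_steps=5):
    """Structured exploration: score partial chains with a PRM, keep the
    top beam_width, and never finish branches that are already off-track."""
    beams = [([], 0.0)]  # (steps_so_far, cumulative log PRM score)
    for _ in range(max_steps):
        candidates = []
        for steps, score in beams:
            for next_step in expand(prompt, steps):
                new_steps = steps + [next_step]
                candidates.append((new_steps, score + prm(prompt, new_steps)))
        # Prune: keep only the beam_width highest-scoring partial chains.
        beams = heapq.nlargest(beam_width, candidates, key=lambda b: b[1])
    return max(beams, key=lambda b: b[1])
```

Flat best-of-N would complete every branch before scoring; here, a branch whose early steps score poorly is dropped at the next pruning pass, freeing its compute for promising branches.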
Canonical Examples
Math problem solving with best-of-N
On the MATH benchmark, a model with 30% single-sample accuracy can reach 80% accuracy with best-of-100 and a trained verifier. But the same model with a perfect verifier (using the ground-truth answer to select) reaches 95% with best-of-100. The gap between 80% and 95% is entirely due to verifier quality, not sample count.
Common Confusions
Inference-time scaling is not just generating more tokens
Generating a longer chain-of-thought is not the same as inference-time scaling. The key mechanism is selection: generating multiple candidates and choosing the best. Simply generating more tokens in a single completion can help (more reasoning steps), but the scaling properties are different. Token-level scaling saturates quickly; candidate-level scaling with verification follows the pass@N curve.
Training and inference scaling are not substitutes
You cannot fully compensate for weak training by spending more at inference time. If the base model has per-sample accuracy $p$, you need $\approx \ln(1000)/p$ samples for 99.9% success. It is far cheaper to improve $p$ to 0.1 through better training and then use $N \approx 70$. Training scaling improves the base rate $p$; inference scaling amplifies it. Both are needed.
Majority voting is not optimal selection
Majority voting selects the most common answer. This is optimal only when the verifier is the identity function (no reward model, just frequency). A trained verifier that scores solution quality outperforms majority voting because it uses more information than answer frequency. The exception: when the verifier is poorly calibrated, majority voting can be more robust.
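Both selection rules in a few lines, with assumed toy answers and scores rather than a real reward model:

```python
from collections import Counter

def majority_vote(answers):
    """Select the most frequent final answer among candidates."""
    return Counter(answers).most_common(1)[0][0]

def verifier_select(candidates, verifier):
    """Select the candidate with the highest verifier score."""
    return max(candidates, key=verifier)

# Illustrative failure of frequency: the model most often produces the
# wrong answer "41", so voting fails, but a verifier that scores
# solution quality can still pick out "42".
answers = ["41", "41", "41", "42", "40"]
scores = {"40": 0.1, "41": 0.3, "42": 0.9}
print(majority_vote(answers))                # → 41
print(verifier_select(answers, scores.get))  # → 42
```

The design trade-off from the paragraph above shows up directly: `verifier_select` uses strictly more information, but inherits every miscalibration of `scores`, while `majority_vote` only requires that correct answers be more frequent than any single wrong one.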
Exercises
Problem
A model has 20% per-sample accuracy on a coding benchmark. How many independent samples do you need for pass@N to exceed 90%?
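A brute-force check to compare your answer against, assuming independent samples:

```python
# Smallest N with pass@N = 1 - (1-p)^N exceeding the target, at p = 0.2.
p, target = 0.2, 0.9
n = 1
while 1 - (1 - p) ** n <= target:
    n += 1
print(n)  # → 11
```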
Problem
Suppose your verifier has accuracy $v = 0.8$ (it correctly identifies the best solution 80% of the time when comparing a correct and an incorrect solution). With per-sample accuracy $p$ and $N$ samples, what is the approximate probability that the verifier selects a correct answer?
References
Canonical:
- Cobbe et al., Training Verifiers to Solve Math Word Problems (2021), arXiv:2110.14168
- Lightman et al., Let's Verify Step by Step (2023), arXiv:2305.20050
Current:
- Snell et al., Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters (2024), arXiv:2408.03314
- Brown et al., Large Language Monkeys: Scaling Inference Compute with Repeated Sampling (2024), arXiv:2407.21787
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Scaling Laws (Layer 4)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Test-Time Compute and Search (Layer 5)