Beta. Content is under active construction and has not been peer-reviewed.

LLM Construction

Inference-Time Scaling Laws

How additional compute at inference time (repeated sampling, search, verification) improves output quality, why gains are task-dependent, and why verifier quality matters more than raw sample count.


Why This Matters

Training scaling laws (Chinchilla, Kaplan) describe how model quality improves with more training compute. Inference-time scaling laws describe a complementary axis: how output quality improves with more compute at test time. This is a distinct phenomenon. A fixed model can produce better outputs by generating more candidates and selecting the best one.

This matters because inference compute is cheaper and more flexible than training compute. You can allocate more inference compute to harder problems and less to easy ones. Training compute is a one-time fixed cost; inference compute scales with demand and difficulty.

Mental Model

Given a fixed model, there are two ways to use extra inference compute:

  1. Generate more candidates and select the best one (best-of-N, majority voting)
  2. Search more deeply within a single generation (beam search, tree search, chain-of-thought with backtracking)

Both approaches improve output quality, but they hit diminishing returns at different rates and for different reasons.

Formal Setup and Notation

Definition

Best-of-N Sampling

Given a model $p_\theta$ and a verifier (reward model) $r_\phi$, best-of-N sampling generates $N$ independent completions $y_1, \ldots, y_N \sim p_\theta(\cdot \mid x)$ and selects:

$$\hat{y} = \arg\max_{i \in \{1, \ldots, N\}} r_\phi(x, y_i)$$

The quality of $\hat{y}$ depends on $N$, the quality of $p_\theta$, and the quality of $r_\phi$.
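A minimal sketch of this selection rule, assuming black-box callables `generate` and `score` as stand-ins for $p_\theta$ and $r_\phi$ (both names are hypothetical, not from any library):

```python
def best_of_n(prompt, generate, score, n):
    # Sample N completions independently from the model...
    candidates = [generate(prompt) for _ in range(n)]
    # ...and keep the candidate the verifier scores highest.
    return max(candidates, key=lambda y: score(prompt, y))

# Toy stand-ins: completions are numbers and the verifier prefers larger ones.
answers = iter([0.2, 0.9, 0.5])
pick = best_of_n("2+2=?", lambda _: next(answers), lambda _, y: y, n=3)
# pick == 0.9
```

In practice `generate` would call the model with nonzero temperature; with greedy decoding all $N$ candidates collapse to one.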

Definition

Process Reward Model (PRM)

A process reward model assigns scores to intermediate steps, not just final answers. For a chain of reasoning steps $s_1, s_2, \ldots, s_T$:

$$r_{\text{PRM}}(s_1, \ldots, s_T) = \prod_{t=1}^T r_\phi(s_t \mid s_{<t}, x)$$

or equivalently, $\sum_t \log r_\phi(s_t \mid s_{<t}, x)$ in log space. This enables step-level search: prune branches where early steps score poorly, allocate compute to promising branches.
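The log-space form can be sketched as a running sum over a possibly partial chain; `step_prob`, a hypothetical stand-in for the PRM's per-step score, is assumed to return a value in $(0, 1]$:

```python
import math

def prm_logscore(x, steps, step_prob):
    # Cumulative log score sum_t log r_phi(s_t | s_<t, x); works on
    # prefixes, which is what makes step-level pruning possible.
    return sum(
        math.log(step_prob(x, steps[:t], steps[t]))
        for t in range(len(steps))
    )
```

Because the score is a running sum, a branch can be discarded as soon as its partial score falls below the search's current cutoff, before any compute is spent completing it.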

Core Definitions

The pass@N metric measures the probability that at least one of $N$ independent samples solves a problem correctly:

$$\text{pass@}N = 1 - (1 - p)^N$$

where $p$ is the per-sample success probability. Using $\log(1 - p) \approx -p$ for small $p$, this is approximately $1 - e^{-pN}$, which itself reduces to $pN$ when $pN \ll 1$ and saturates near $1$ once $pN \gtrsim 3$.
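A quick numeric check of the exact formula against the small-$p$ approximation:

```python
import math

def pass_at_n(p, n):
    # Exact: probability that at least one of n independent samples succeeds.
    return 1 - (1 - p) ** n

# The approximation 1 - exp(-p*n) tracks the exact curve closely for small p.
exact = pass_at_n(0.05, 20)        # 1 - 0.95**20, about 0.64
approx = 1 - math.exp(-0.05 * 20)  # 1 - e^{-1}, about 0.63
```

Here $pN = 1$, squarely in the transition region: well past the linear $pN$ regime but not yet saturated.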

An outcome reward model (ORM) scores only the final answer. A process reward model (PRM) scores each intermediate step. The distinction matters for scaling: PRMs enable tree search over reasoning steps, while ORMs can only rank complete solutions.

The verifier tax is the compute cost of evaluating the reward model on each candidate. For large reward models, this tax can be substantial. Doubling $N$ doubles both generation and verification cost.

Main Theorems

Theorem

Diminishing Returns of Best-of-N

Statement

For best-of-N with a perfect verifier and independent samples with success probability $p$:

$$\text{pass@}N = 1 - (1 - p)^N$$

The marginal improvement from the $N$-th sample is:

$$\Delta_N = (1 - p)^{N-1} \cdot p$$

This decays exponentially in $N$. The number of samples needed to achieve $\text{pass@}N = 1 - \delta$ is:

$$N = \frac{\log(1/\delta)}{\log(1/(1-p))} \approx \frac{\log(1/\delta)}{p}$$

for small $p$. Halving the failure probability requires $O(1/p)$ additional samples, regardless of the current $N$.
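Both quantities in the theorem are straightforward to compute; a sketch with hypothetical helper names:

```python
import math

def marginal_gain(p, n):
    # Delta_N: probability that the n-th sample is the first correct one.
    return (1 - p) ** (n - 1) * p

def samples_for_target(p, delta):
    # Smallest N with pass@N >= 1 - delta (exact form, rounded up).
    return math.ceil(math.log(1 / delta) / math.log(1 / (1 - p)))

# Matching the text: at p = 0.1, reaching 99.9% success (delta = 0.001)
# takes about 66 samples.
n_needed = samples_for_target(0.1, 0.001)  # 66
```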

Intuition

Each new sample has the same independent probability $p$ of being correct. As $N$ grows, you have most likely already found a correct answer, so additional samples are wasted. The regime where best-of-N helps most is when $p$ is moderate (5-30%): high enough that $N = 10$-$100$ samples suffice, but low enough that a single sample often fails.

Proof Sketch

Direct computation. The probability that all $N$ samples fail is $(1-p)^N$; the complement is the success probability. Taking the difference $\text{pass@}N - \text{pass@}(N-1)$ gives the marginal improvement $\Delta_N$.

Why It Matters

This bound shows that best-of-N scaling is limited by the base model's per-sample success probability. If $p = 0.001$, you need $N \approx 7000$ samples to reach 99.9% success. At $p = 0.1$, you need only $N \approx 66$. Inference-time compute cannot compensate for a weak base model. Improving $p$ (via better training) and improving $N$ scaling (via better search) are complementary.

Failure Mode

The bound assumes a perfect verifier and independent samples. Imperfect verifiers select confidently wrong answers over hesitantly correct ones. Correlated samples (e.g., from temperature-reduced sampling) reduce the effective diversity and make the $(1-p)^N$ formula overly optimistic.

Key Phenomena

Verifier Quality Dominates Sample Count

Empirical results consistently show that improving the verifier (reward model) yields larger gains than increasing $N$. A perfect verifier with $N = 10$ outperforms a mediocre verifier with $N = 1000$ on math and code tasks. This is because a bad verifier has a failure mode that sampling cannot fix: it systematically prefers wrong answers.

Formally, if the verifier has accuracy $q$ (the probability of correctly ranking a correct answer above an incorrect one), the effective pass rate becomes approximately $q \cdot \text{pass@}N$. When $q < 1$, no amount of additional sampling lifts the success rate above roughly $q$: increasing $N$ saturates at the verifier's accuracy.
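One way to see the cap is a small Monte Carlo under a deliberately simple noise model (an assumption of this sketch, not a claim from the text): when at least one candidate is correct, the verifier surfaces a correct one with probability $q$; with no correct candidate, selection must fail.

```python
import random

def simulated_success(p, q, n, trials=100_000, seed=0):
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # Each of n candidates is independently correct with probability p.
        any_correct = any(rng.random() < p for _ in range(n))
        # The verifier succeeds only when a correct candidate exists,
        # and even then only with probability q (the assumed noise model).
        if any_correct and rng.random() < q:
            wins += 1
    return wins / trials

# At p = 0.1, q = 0.9, N = 20: roughly q * pass@20, about 0.9 * 0.88 = 0.79.
estimate = simulated_success(0.1, 0.9, 20)
```

Pushing $N$ higher moves $\text{pass@}N$ toward 1, but the estimate never climbs past $q$ under this model.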

Task Dependence of Scaling

Inference-time scaling works well for tasks with:

  • Verifiable answers (math, code, formal proofs)
  • High diversity in solution approaches
  • Moderate base success probability (5-50%)

It works poorly for tasks with:

  • No good verifier (open-ended writing, ethical judgments)
  • Low diversity in completions (the model makes the same error every time)
  • Very low base probability (the model cannot solve the task)

Process Reward Models Enable Tree Search

With a PRM, you can evaluate partial solutions and prune bad branches early. This converts best-of-N (flat sampling) into tree search (structured exploration). Tree search uses compute more efficiently: it avoids completing solutions that are already off-track.

The improvement from tree search over flat sampling depends on how early errors can be detected. If errors are detectable after the first step, tree search is dramatically more efficient. If errors are only visible at the final answer, tree search reduces to best-of-N.
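A sketch of PRM-guided beam search under these assumptions: `expand` and `step_logscore` are hypothetical callables for the proposal step and the per-step log score $\log r_\phi(s_t \mid s_{<t}, x)$.

```python
import math

def prm_beam_search(x, expand, step_logscore, width, depth):
    # Each beam is (steps so far, cumulative log score).
    beams = [((), 0.0)]
    for _ in range(depth):
        grown = [
            (steps + (s,), score + step_logscore(x, steps, s))
            for steps, score in beams
            for s in expand(x, steps)
        ]
        # Prune: keep only the `width` highest-scoring prefixes, so
        # off-track branches are never extended to completion.
        beams = sorted(grown, key=lambda b: b[1], reverse=True)[:width]
    return beams[0]

# Toy problem: steps are bits, and the PRM strongly prefers 1s.
best, score = prm_beam_search(
    None,
    expand=lambda x, steps: [0, 1],
    step_logscore=lambda x, steps, s: math.log(0.9 if s == 1 else 0.1),
    width=2,
    depth=3,
)
# best == (1, 1, 1)
```

With `width` equal to the full branching factor, no pruning occurs and the search degenerates to exhaustive flat sampling, mirroring the point above: early detectability of errors is what makes the tree pay off.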

Canonical Examples

Example

Math problem solving with best-of-N

On the MATH benchmark, a model with 30% single-sample accuracy can reach 80% accuracy with best-of-100 and a trained verifier. But the same model with a perfect verifier (using the ground-truth answer to select) reaches 95% with best-of-100. The gap between 80% and 95% is entirely due to verifier quality, not sample count.

Common Confusions

Watch Out

Inference-time scaling is not just generating more tokens

Generating a longer chain-of-thought is not the same as inference-time scaling. The key mechanism is selection: generating multiple candidates and choosing the best. Simply generating more tokens in a single completion can help (more reasoning steps), but the scaling properties are different. Token-level scaling saturates quickly; candidate-level scaling with verification follows the pass@N curve.

Watch Out

Training and inference scaling are not substitutes

You cannot fully compensate for weak training by spending more at inference time. If the base model has per-sample accuracy $p = 0.001$, you need $N \approx 7000$ samples for 99.9% success. It is far cheaper to improve $p$ to 0.1 through better training and then use $N \approx 66$. Training scaling improves the base rate $p$; inference scaling amplifies it. Both are needed.

Watch Out

Majority voting is not optimal selection

Majority voting selects the most common final answer. It is optimal only when answer frequency is the sole available signal, i.e., when there is no reward model at all. A trained verifier that scores solution quality outperforms majority voting because it uses more information than answer frequency. The exception: when the verifier is poorly calibrated, majority voting can be more robust.
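Majority voting itself is only a few lines; this sketch picks the most frequent final answer with no reward model involved:

```python
from collections import Counter

def majority_vote(final_answers):
    # Most frequent final answer wins; ties resolve by first-seen order.
    return Counter(final_answers).most_common(1)[0][0]

majority_vote(["12", "15", "12", "12", "7"])  # "12"
```

Note that voting requires answers to be comparable for equality (e.g., a boxed number), which is part of why it suits math benchmarks better than open-ended generation.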

Exercises

ExerciseCore

Problem

A model has 20% per-sample accuracy on a coding benchmark. How many independent samples $N$ do you need for $\text{pass@}N$ to exceed 90%?

ExerciseAdvanced

Problem

Suppose your verifier has accuracy $q = 0.8$ (it correctly identifies the best solution 80% of the time when comparing a correct and an incorrect solution). With per-sample accuracy $p = 0.2$ and $N = 50$, what is the approximate probability that the verifier selects a correct answer?

References

Canonical:

  • Cobbe et al., Training Verifiers to Solve Math Word Problems (2021), arXiv:2110.14168
  • Lightman et al., Let's Verify Step by Step (2023), arXiv:2305.20050

Current:

  • Snell et al., Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters (2024), arXiv:2408.03314
  • Brown et al., Large Language Monkeys: Scaling Inference Compute with Repeated Sampling (2024), arXiv:2407.21787

Last reviewed: April 2026
