Beta. Content is under active construction and has not been peer-reviewed.

LLM Construction

Inference-Time Scaling Laws

How additional compute at inference time (repeated sampling, search, verification) improves output quality, why gains are task-dependent, and why verifier quality matters more than raw sample count.


Why This Matters

Training scaling laws (Chinchilla, Kaplan) describe how model quality improves with more training compute. Inference-time scaling laws describe a complementary axis: how output quality improves with more compute at test time. This is a distinct phenomenon. A fixed model can produce better outputs by generating more candidates and selecting the best one.

This matters because inference compute is cheaper and more flexible than training compute. You can allocate more inference compute to harder problems and less to easy ones. Training compute is a one-time fixed cost; inference compute scales with demand and difficulty.

Mental Model

Given a fixed model, there are two ways to use extra inference compute:

  1. Generate more candidates and select the best one (best-of-N, majority voting)
  2. Search more deeply within a single generation (beam search, tree search, chain-of-thought with backtracking)

Both approaches improve output quality, but they hit diminishing returns at different rates and for different reasons.

Formal Setup and Notation

Definition

Best-of-N Sampling

Given a model $p_\theta$ and a verifier (reward model) $r_\phi$, best-of-N sampling generates $N$ independent completions $y_1, \ldots, y_N \sim p_\theta(\cdot \mid x)$ and selects:

$$\hat{y} = \arg\max_{i \in \{1, \ldots, N\}} r_\phi(x, y_i)$$

The quality of $\hat{y}$ depends on $N$, the quality of $p_\theta$, and the quality of $r_\phi$.
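A minimal sketch of this selection rule, assuming black-box callables `generate` and `score` as stand-ins for $p_\theta$ and $r_\phi$ (both names are hypothetical, not from any library):

```python
def best_of_n(prompt, generate, score, n):
    # Sample N completions independently from the model...
    candidates = [generate(prompt) for _ in range(n)]
    # ...and keep the candidate the verifier scores highest.
    return max(candidates, key=lambda y: score(prompt, y))

# Toy stand-ins: completions are numbers and the verifier prefers larger ones.
answers = iter([0.2, 0.9, 0.5])
pick = best_of_n("2+2=?", lambda _: next(answers), lambda _, y: y, n=3)
# pick == 0.9
```

In practice `generate` would call the model with nonzero temperature; with greedy decoding all $N$ candidates collapse to one.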

Definition

Process Reward Model (PRM)

A process reward model assigns scores to intermediate steps, not just final answers. For a chain of reasoning steps $s_1, s_2, \ldots, s_T$:

$$r_{\text{PRM}}(s_1, \ldots, s_T) = \prod_{t=1}^T r_\phi(s_t \mid s_{<t}, x)$$

or equivalently, $\sum_t \log r_\phi(s_t \mid s_{<t}, x)$ in log space. This enables step-level search: prune branches where early steps score poorly, allocate compute to promising branches.
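The log-space form can be sketched as a running sum over a possibly partial chain; `step_prob`, a hypothetical stand-in for the PRM's per-step score, is assumed to return a value in $(0, 1]$:

```python
import math

def prm_logscore(x, steps, step_prob):
    # Cumulative log score sum_t log r_phi(s_t | s_<t, x); works on
    # prefixes, which is what makes step-level pruning possible.
    return sum(
        math.log(step_prob(x, steps[:t], steps[t]))
        for t in range(len(steps))
    )
```

Because the score is a running sum, a branch can be discarded as soon as its partial score falls below the search's current cutoff, before any compute is spent completing it.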

Core Definitions

The pass@N metric measures the probability that at least one of $N$ independent samples solves a problem correctly:

$$\text{pass@}N = 1 - (1 - p)^N$$

where $p$ is the per-sample success probability. Using $\log(1 - p) \approx -p$ for small $p$, this is approximately $1 - e^{-pN}$, which itself reduces to $pN$ when $pN \ll 1$ and saturates near $1$ once $pN \gtrsim 3$.
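A quick numeric check of the exact formula against the small-$p$ approximation:

```python
import math

def pass_at_n(p, n):
    # Exact: probability that at least one of n independent samples succeeds.
    return 1 - (1 - p) ** n

# The approximation 1 - exp(-p*n) tracks the exact curve closely for small p.
exact = pass_at_n(0.05, 20)        # 1 - 0.95**20, about 0.64
approx = 1 - math.exp(-0.05 * 20)  # 1 - e^{-1}, about 0.63
```

Here $pN = 1$, squarely in the transition region: well past the linear $pN$ regime but not yet saturated.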

An outcome reward model (ORM) scores only the final answer. A process reward model (PRM) scores each intermediate step. The distinction matters for scaling: PRMs enable tree search over reasoning steps, while ORMs can only rank complete solutions.

The verifier tax is the compute cost of evaluating the reward model on each candidate. For large reward models, this tax can be substantial. Doubling $N$ doubles both generation and verification cost.

Main Theorems

Theorem

Diminishing Returns of Best-of-N

Statement

For best-of-N with a perfect verifier and independent samples with success probability $p$:

$$\text{pass@}N = 1 - (1 - p)^N$$

The marginal improvement from the $N$-th sample is:

$$\Delta_N = (1 - p)^{N-1} \cdot p$$

This decays exponentially in $N$. The number of samples needed to achieve $\text{pass@}N = 1 - \delta$ is:

$$N = \frac{\log(1/\delta)}{\log(1/(1-p))} \approx \frac{\log(1/\delta)}{p}$$

for small $p$. Halving the failure probability requires $O(1/p)$ additional samples, regardless of the current $N$.
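Both quantities in the theorem are straightforward to compute; a sketch with hypothetical helper names:

```python
import math

def marginal_gain(p, n):
    # Delta_N: probability that the n-th sample is the first correct one.
    return (1 - p) ** (n - 1) * p

def samples_for_target(p, delta):
    # Smallest N with pass@N >= 1 - delta (exact form, rounded up).
    return math.ceil(math.log(1 / delta) / math.log(1 / (1 - p)))

# Matching the text: at p = 0.1, reaching 99.9% success (delta = 0.001)
# takes about 66 samples.
n_needed = samples_for_target(0.1, 0.001)  # 66
```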

Intuition

Each new sample has the same independent probability $p$ of being correct. As $N$ grows, you have most likely already found a correct answer, so additional samples are wasted. The regime where best-of-N helps most is when $p$ is moderate (5-30%): high enough that $N = 10$-$100$ samples suffice, but low enough that a single sample often fails.

Proof Sketch

Direct computation. The probability that all $N$ samples fail is $(1-p)^N$; the complement is the success probability. Taking the difference $\text{pass@}N - \text{pass@}(N-1)$ gives the marginal improvement $\Delta_N$.

Why It Matters

This bound shows that best-of-N scaling is limited by the base model's per-sample success probability. If $p = 0.001$, you need $N \approx 7000$ samples to reach 99.9% success. At $p = 0.1$, you need only $N \approx 66$. Inference-time compute cannot compensate for a weak base model. Improving $p$ (via better training) and improving $N$ scaling (via better search) are complementary.

Failure Mode

The bound assumes a perfect verifier and independent samples. Imperfect verifiers select confidently wrong answers over hesitantly correct ones. Correlated samples (e.g., from temperature-reduced sampling) reduce the effective diversity and make the $(1-p)^N$ formula overly optimistic.

Key Phenomena

Verifier Quality Dominates Sample Count

Empirical results consistently show that improving the verifier (reward model) yields larger gains than increasing $N$. A perfect verifier with $N = 10$ outperforms a mediocre verifier with $N = 1000$ on math and code tasks. This is because a bad verifier has a failure mode that sampling cannot fix: it systematically prefers wrong answers.

Formally, if the verifier has accuracy $q$ (the probability of correctly ranking a correct answer above an incorrect one), the effective pass rate becomes approximately $q \cdot \text{pass@}N$. When $q < 1$, no amount of additional sampling lifts the success rate above roughly $q$: increasing $N$ saturates at the verifier's accuracy.
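One way to see the cap is a small Monte Carlo under a deliberately simple noise model (an assumption of this sketch, not a claim from the text): when at least one candidate is correct, the verifier surfaces a correct one with probability $q$; with no correct candidate, selection must fail.

```python
import random

def simulated_success(p, q, n, trials=100_000, seed=0):
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # Each of n candidates is independently correct with probability p.
        any_correct = any(rng.random() < p for _ in range(n))
        # The verifier succeeds only when a correct candidate exists,
        # and even then only with probability q (the assumed noise model).
        if any_correct and rng.random() < q:
            wins += 1
    return wins / trials

# At p = 0.1, q = 0.9, N = 20: roughly q * pass@20, about 0.9 * 0.88 = 0.79.
estimate = simulated_success(0.1, 0.9, 20)
```

Pushing $N$ higher moves $\text{pass@}N$ toward 1, but the estimate never climbs past $q$ under this model.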

Task Dependence of Scaling

Inference-time scaling works well for tasks with:

  • Verifiable answers (math, code, formal proofs)
  • High diversity in solution approaches
  • Moderate base success probability (5-50%)

It works poorly for tasks with:

  • No good verifier (open-ended writing, ethical judgments)
  • Low diversity in completions (the model makes the same error every time)
  • Very low base probability (the model cannot solve the task)

Process Reward Models Enable Tree Search

With a PRM, you can evaluate partial solutions and prune bad branches early. This converts best-of-N (flat sampling) into tree search (structured exploration). Tree search uses compute more efficiently: it avoids completing solutions that are already off-track.

The improvement from tree search over flat sampling depends on how early errors can be detected. If errors are detectable after the first step, tree search is dramatically more efficient. If errors are only visible at the final answer, tree search reduces to best-of-N.
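A sketch of PRM-guided beam search under these assumptions: `expand` and `step_logscore` are hypothetical callables for the proposal step and the per-step log score $\log r_\phi(s_t \mid s_{<t}, x)$.

```python
import math

def prm_beam_search(x, expand, step_logscore, width, depth):
    # Each beam is (steps so far, cumulative log score).
    beams = [((), 0.0)]
    for _ in range(depth):
        grown = [
            (steps + (s,), score + step_logscore(x, steps, s))
            for steps, score in beams
            for s in expand(x, steps)
        ]
        # Prune: keep only the `width` highest-scoring prefixes, so
        # off-track branches are never extended to completion.
        beams = sorted(grown, key=lambda b: b[1], reverse=True)[:width]
    return beams[0]

# Toy problem: steps are bits, and the PRM strongly prefers 1s.
best, score = prm_beam_search(
    None,
    expand=lambda x, steps: [0, 1],
    step_logscore=lambda x, steps, s: math.log(0.9 if s == 1 else 0.1),
    width=2,
    depth=3,
)
# best == (1, 1, 1)
```

With `width` equal to the full branching factor, no pruning occurs and the search degenerates to exhaustive flat sampling, mirroring the point above: early detectability of errors is what makes the tree pay off.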

Canonical Examples

Example

Math problem solving with best-of-N

On the MATH benchmark, a model with 30% single-sample accuracy can reach 80% accuracy with best-of-100 and a trained verifier. But the same model with a perfect verifier (using the ground-truth answer to select) reaches 95% with best-of-100. The gap between 80% and 95% is entirely due to verifier quality, not sample count.

Common Confusions

Watch Out

Inference-time scaling is not just generating more tokens

Generating a longer chain-of-thought is not the same as inference-time scaling. The key mechanism is selection: generating multiple candidates and choosing the best. Simply generating more tokens in a single completion can help (more reasoning steps), but the scaling properties are different. Token-level scaling saturates quickly; candidate-level scaling with verification follows the pass@N curve.

Watch Out

Training and inference scaling are not substitutes

You cannot fully compensate for weak training by spending more at inference time. If the base model has per-sample accuracy $p = 0.001$, you need $N \approx 7000$ samples for 99.9% success. It is far cheaper to improve $p$ to 0.1 through better training and then use $N \approx 66$. Training scaling improves the base rate $p$; inference scaling amplifies it. Both are needed.

Watch Out

Majority voting is not optimal selection

Majority voting selects the most common final answer. It is optimal only when answer frequency is the sole available signal, i.e., when there is no reward model at all. A trained verifier that scores solution quality outperforms majority voting because it uses more information than answer frequency. The exception: when the verifier is poorly calibrated, majority voting can be more robust.
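Majority voting itself is only a few lines; this sketch picks the most frequent final answer with no reward model involved:

```python
from collections import Counter

def majority_vote(final_answers):
    # Most frequent final answer wins; ties resolve by first-seen order.
    return Counter(final_answers).most_common(1)[0][0]

majority_vote(["12", "15", "12", "12", "7"])  # "12"
```

Note that voting requires answers to be comparable for equality (e.g., a boxed number), which is part of why it suits math benchmarks better than open-ended generation.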

Exercises

ExerciseCore

Problem

A model has 20% per-sample accuracy on a coding benchmark. How many independent samples $N$ do you need for $\text{pass@}N$ to exceed 90%?

ExerciseAdvanced

Problem

Suppose your verifier has accuracy $q = 0.8$ (it correctly identifies the best solution 80% of the time when comparing a correct and an incorrect solution). With per-sample accuracy $p = 0.2$ and $N = 50$, what is the approximate probability that the verifier selects a correct answer?

References

Canonical:

  • Cobbe et al., Training Verifiers to Solve Math Word Problems (2021), arXiv:2110.14168
  • Lightman et al., Let's Verify Step by Step (2023), arXiv:2305.20050

Current:

  • Snell et al., Scaling LLM Test-Time Compute Optimally Can Be More Effective than Scaling Model Parameters (2024), arXiv:2408.03314
  • Brown et al., Large Language Monkeys: Scaling Inference Compute with Repeated Sampling (2024), arXiv:2407.21787

Last reviewed: April 2026
