
LLM Construction

Test-Time Compute and Search

One of the biggest frontier shifts: spending more compute at inference through repeated sampling, verifier-guided search, MCTS for reasoning, chain-of-thought as compute, and latent reasoning.

Advanced · Tier 2 · Frontier · ~65 min


Why This Matters

Training scaling laws are well understood: more compute at training time predictably improves loss. But a parallel revolution is happening at inference time. Instead of generating a single answer, you generate many candidates and select the best one. Instead of producing tokens left-to-right, you search over reasoning paths. Instead of thinking in tokens, you think in hidden states.

This shift, from "make the model smarter" to "let the model think harder", is one of the defining developments of 2025-2026. Models like o1 and DeepSeek-R1 spend orders of magnitude more compute per query than standard models, and the extra compute translates into dramatically better performance on reasoning tasks.

The key finding is nuanced: more tokens alone do not reliably produce better answers. Good verifiers and structured search matter more than raw generation volume.

Mental Model

Think of a student taking an exam. A well-prepared student writes one answer and moves on. A student who is uncertain can:

  1. Attempt multiple solutions and pick the one that checks out (best-of-N)
  2. Show their work step by step and verify each step (process reward)
  3. Explore different approaches and backtrack when stuck (tree search)
  4. Think silently before writing (latent reasoning)

Each strategy trades more time (inference compute) for better answers. The question is: how much better, and when does the improvement plateau?

Inference Compute: The Core Idea

Definition

Test-Time Compute

Test-time compute (or inference compute) is the total computation spent generating a response to a single query. For a standard autoregressive model generating $T$ tokens with $N$ parameters, inference compute is approximately $2NT$ FLOPs. Test-time compute scaling increases this by generating multiple candidates, performing search, or extending the reasoning chain.

The standard approach generates one response: compute $\approx 2NT$. Test-time scaling methods spend $k \times 2NT$ or more, where $k$ can be 10-1000x, to produce a better response. The central question: does the quality improvement justify the cost?
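As a back-of-envelope check, the $2NT$ approximation can be coded directly. This is a rough sketch that ignores attention's quadratic term and KV-cache effects; the model size and token counts below are illustrative, not tied to any specific model.

```python
def inference_flops(n_params: float, n_tokens: int, k_candidates: int = 1) -> float:
    """Approximate inference compute: ~2 * N * T FLOPs per candidate, times k candidates."""
    return 2.0 * n_params * n_tokens * k_candidates

# One 1k-token response from a hypothetical 7B-parameter model:
single = inference_flops(7e9, 1_000)
# Best-of-64 spends 64x the compute on the same query:
scaled = inference_flops(7e9, 1_000, k_candidates=64)
print(f"single: {single:.2e} FLOPs, best-of-64: {scaled:.2e} FLOPs ({scaled / single:.0f}x)")
```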

Best-of-N Sampling

The simplest test-time scaling method: generate $N$ independent responses and select the best one using a verifier or reward model.

Proposition

Best-of-N Scaling with a Perfect Verifier

Statement

If the model produces a correct answer with probability $p$ per sample, and the verifier perfectly identifies correctness, then the probability of at least one correct answer among $N$ samples is:

$$P(\text{at least one correct}) = 1 - (1-p)^N$$

For small $p$, this scales approximately linearly: $P \approx Np$ when $Np \ll 1$. To achieve success probability $1 - \delta$, you need:

$$N \geq \frac{\log(1/\delta)}{\log(1/(1-p))} \approx \frac{\log(1/\delta)}{p}$$

samples. The cost scales as $O(1/p)$ to achieve high confidence.
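Inverting the bound numerically gives the sample budget for a target confidence. A minimal sketch; the per-sample success rates below are illustrative:

```python
import math

def samples_needed(p: float, delta: float) -> int:
    """Smallest N with 1 - (1-p)^N >= 1 - delta, i.e. (1-p)^N <= delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p))

# e.g. a 10% per-sample success rate, targeting 99% overall success:
print(samples_needed(0.1, 0.01))
# The loose 1/p approximation would give log(1/0.01)/0.1 ~ 46 samples.
```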

Intuition

If the model has even a small chance of producing the correct answer, generating enough samples makes success near-certain. The verifier acts as a filter: it cannot make the model smarter, but it can select the model's best work from many attempts. The bottleneck shifts from generation quality to verifier quality.

Why It Matters

Best-of-N is the workhorse of test-time compute scaling. It requires no additional training, just a good verifier. For math and code, where verifiers are reliable (run the code, check the proof), best-of-N gives substantial improvements. For open-ended text where verification is hard, the gains are smaller.

Failure Mode

When the verifier is imperfect (a reward model rather than a ground-truth checker), best-of-N suffers from reward hacking at inference time. The selected sample maximizes the verifier's score, not actual quality. As $N$ increases, the selected sample increasingly exploits verifier weaknesses. This is the inference-time analog of reward model overoptimization.
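The gap between a perfect and an imperfect verifier shows up even in a toy simulation. Here true and proxy rewards are jointly Gaussian with correlation $\rho$, which is an assumed toy model of an imperfect reward model, not the exact setup from the overoptimization literature:

```python
import math
import random

def best_of_n_true_reward(n: int, rho: float, trials: int = 2000, seed: int = 0) -> float:
    """Mean TRUE reward of the sample that maximizes the PROXY reward,
    when true and proxy rewards have correlation rho (toy Gaussian model)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        best_proxy, best_true = -float("inf"), 0.0
        for _ in range(n):
            true = rng.gauss(0, 1)
            proxy = rho * true + math.sqrt(1 - rho**2) * rng.gauss(0, 1)
            if proxy > best_proxy:
                best_proxy, best_true = proxy, true
        total += best_true
    return total / trials

for n in (1, 4, 16, 64):
    print(n, round(best_of_n_true_reward(n, rho=0.5), 2))
```

With $\rho = 1$ (a perfect verifier) the selected sample's true reward grows like the max of $N$ Gaussians; with $\rho < 1$ it grows only a fraction as fast, which is the flattening described above.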

Verifier-Guided Search

Best-of-N generates complete responses independently. More sophisticated methods use the verifier to guide generation during the reasoning process.

Process Reward Models (PRMs) vs Outcome Reward Models (ORMs)

  • ORM: Scores the final answer. Binary: correct or not.
  • PRM: Scores each intermediate reasoning step. Guides the search by pruning bad reasoning paths early.

With a PRM, you can perform step-level beam search: at each reasoning step, generate multiple continuations, score them with the PRM, and keep only the top-$k$. This avoids wasting compute on reasoning paths that go wrong early.
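The step-level loop is short enough to sketch. `propose_steps` and `prm_score` are hypothetical stand-ins for sampling continuations from a model and scoring a partial chain with a trained PRM; the termination convention (steps ending in "ANSWER") is likewise an assumption for illustration.

```python
def beam_search_with_prm(question, propose_steps, prm_score,
                         beam_width=4, expansions=8, max_steps=10):
    """Step-level beam search: extend each kept chain, rescore, prune."""
    beams = [[]]  # each beam is a list of reasoning steps so far
    for _ in range(max_steps):
        candidates = []
        for chain in beams:
            for step in propose_steps(question, chain, n=expansions):
                candidates.append(chain + [step])
        # Score each extended chain with the PRM and keep only the top beams.
        candidates.sort(key=lambda c: prm_score(question, c), reverse=True)
        beams = candidates[:beam_width]
        if all(chain and chain[-1].endswith("ANSWER") for chain in beams):
            break
    return beams[0]
```

Pruning at every step is what distinguishes this from best-of-N: a chain that scores poorly after two steps is discarded before any more compute is spent on it.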

Monte Carlo Tree Search (MCTS) for Reasoning

MCTS treats reasoning as a sequential decision problem:

  • State: The reasoning chain so far
  • Action: The next reasoning step
  • Reward: Correctness of the final answer (or PRM score)
  • Rollout: Complete the reasoning chain from the current state

The standard MCTS loop:

  1. Select a promising node using UCB (upper confidence bound)
  2. Expand by generating a new reasoning step
  3. Simulate by completing the chain (or using a value estimate)
  4. Backpropagate the outcome to update node statistics

This is how AlphaGo-style search has been adapted for mathematical reasoning. The reasoning trace is the "game tree," and each step is a "move."
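The four-step loop above can be sketched generically. `expand` (propose next reasoning steps) and `rollout` (complete the chain and return a reward) are hypothetical stand-ins for model calls; the UCB constant is the conventional default, not a tuned value.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    """Upper confidence bound: exploit mean value, explore rarely-visited nodes."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def mcts(root_state, expand, rollout, iters=200):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # 1. Select: descend to a leaf, always following the highest-UCB child.
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expand: add a child per candidate next reasoning step.
        if node.visits > 0:
            node.children = [Node(s, parent=node) for s in expand(node.state)]
            if node.children:
                node = random.choice(node.children)
        # 3. Simulate: complete the reasoning chain from this node.
        reward = rollout(node.state)
        # 4. Backpropagate: update visit counts and values up to the root.
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited first step (the standard MCTS move choice).
    return max(root.children, key=lambda n: n.visits).state if root.children else root_state
```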

Chain-of-Thought as Inference Compute

Definition

Chain-of-Thought Scaling

Chain-of-thought (CoT) prompting increases inference compute by generating intermediate reasoning tokens before the final answer. A model generating $T$ reasoning tokens plus $T_a$ answer tokens spends approximately $2N(T + T_a)$ FLOPs: more reasoning tokens means more compute.

The hypothesis: the transformer's fixed per-token computation is insufficient for hard reasoning, and generating intermediate tokens provides additional "serial computation" that enables multi-step inference.

Proposition

Inference Compute Scaling

Statement

Empirically, for reasoning tasks with reliable verifiers, accuracy scales approximately log-linearly with inference compute:

$$\text{accuracy}(C_{\text{inf}}) \approx a \cdot \log(C_{\text{inf}}) + b$$

where $C_{\text{inf}}$ is the total inference FLOPs (including all candidates and reasoning tokens). This scaling holds over 1-2 orders of magnitude of compute before saturating.

The critical finding: the rate of improvement depends heavily on verifier quality. With a perfect verifier, scaling is steep. With a weak reward model, scaling flattens quickly due to reward hacking.

Intuition

More inference compute helps, but with diminishing returns. Doubling compute gives a fixed additive accuracy improvement (log scaling), not a fixed multiplicative one. Eventually the model hits a ceiling set by its training: no amount of search can find a correct solution if the model never learned the relevant knowledge or reasoning pattern.
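The "fixed additive improvement per doubling" claim follows directly from the log-linear form, since $a \log(2C) - a \log(C) = a \log 2$ for any $C$. A tiny numeric illustration (the fit constants $a$, $b$ and the saturation cap are made-up values, not measured ones):

```python
import math

a, b = 0.01, 0.10  # illustrative fit constants, not from any real benchmark

def accuracy(c_inf: float) -> float:
    """Log-linear scaling with a hard cap to mimic eventual saturation."""
    return min(a * math.log(c_inf) + b, 0.9)

for c in (1e12, 2e12, 4e12, 8e12):
    print(f"{c:.0e} FLOPs -> accuracy {accuracy(c):.4f}")
# Each doubling adds the same increment a*ln(2), regardless of starting compute.
```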

Why It Matters

This result establishes that training compute and inference compute are partially substitutable. A smaller model with more inference compute can match a larger model with less inference compute on reasoning tasks. This has profound implications for deployment: you can trade expensive training for cheaper (per-parameter) inference, especially when queries are infrequent but high-stakes.

Failure Mode

The log-linear scaling saturates. Beyond a certain point, additional inference compute yields negligible improvement because the model lacks the knowledge or reasoning capability to solve the problem regardless of how many attempts it makes. Inference compute cannot compensate for fundamental training deficiencies.

Latent Reasoning

The most frontier direction: reasoning in hidden states rather than in tokens.

Standard chain-of-thought reasoning forces the model to express all reasoning as text tokens. This has limitations:

  • Bandwidth: Natural language is a low-bandwidth communication channel for complex reasoning
  • Faithfulness: The written chain-of-thought may not reflect the model's actual computation
  • Efficiency: Generating tokens is expensive; internal computation is cheaper

Latent reasoning models use special "thinking tokens" that are processed by the transformer but not decoded into text. The model can perform many steps of internal computation (many forward passes through "thinking" positions) before producing an output token. This increases the model's effective depth for hard problems.

This is an active research area. The theoretical framework is developing: latent reasoning can be viewed as giving the model a variable-depth computation graph rather than the fixed depth of a standard transformer.
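The variable-depth idea can be caricatured in a few lines: apply an internal "thinking" update to the hidden state several times before decoding anything. Everything here is a toy with random weights, not a trained model; `W_think` and `W_out` are hypothetical placeholders for a transformer's latent-step and decoder computations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_think = rng.normal(0, 0.3, size=(d, d))  # hypothetical latent-step weights
W_out = rng.normal(0, 0.3, size=(d, 4))    # hypothetical decoder head (4-token vocab)

def respond(h: np.ndarray, thinking_steps: int) -> int:
    # Silent computation: refine the hidden state without decoding any token.
    # More thinking steps = more effective depth, at zero visible-token cost.
    for _ in range(thinking_steps):
        h = np.tanh(W_think @ h)
    # Only now decode a visible output token.
    return int(np.argmax(W_out.T @ h))

h0 = rng.normal(size=d)
print(respond(h0, thinking_steps=0), respond(h0, thinking_steps=8))
```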

Build It This Way by Default

For reasoning tasks with verifiable answers (math, code, constrained generation), best-of-N with a verifier is the default approach. Generate 16-64 candidates, score with a verifier, return the best. This is simple to implement, requires no additional training beyond the verifier, and provides 10-30% accuracy improvements on hard benchmarks. For tasks without reliable verifiers, a single extended chain-of-thought with a reward model reranker is the practical fallback.
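The default recipe above reduces to a few lines. `generate` and `verify` are hypothetical stand-ins for sampling from a model and scoring with a verifier or reward model:

```python
def best_of_n(prompt, generate, verify, n=32):
    """Generate n candidates, return the one the verifier scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: verify(prompt, c))
```

In practice the candidates are sampled in parallel at nonzero temperature; the selection step is cheap relative to generation.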

When Test-Time Compute Helps (and When It Doesn't)

Test-time compute is most effective when:

  • The task has a reliable verifier (code execution, math provers)
  • The model has a non-zero pass rate (it sometimes gets the answer right)
  • The task requires multi-step reasoning (search helps explore paths)

Test-time compute is least effective when:

  • Verification is as hard as generation (open-ended writing, subjective tasks)
  • The model never produces correct answers (fundamental capability gap)
  • The task is easy (single-pass generation already succeeds reliably)

Common Confusions

Watch Out

More tokens does not always mean better reasoning

Generating a longer chain-of-thought is not the same as reasoning more deeply. Models can produce long, fluent chains of text that arrive at the wrong answer. The quality of reasoning depends on the model's training, not just the length of the output. Structured search with verification matters more than raw token count.

Watch Out

Best-of-N is not the same as temperature sampling

High temperature increases output diversity but does not select for quality. Best-of-N generates diverse candidates and selects the best one. The selection step is what matters. Without a good verifier, high-temperature sampling just produces more varied wrong answers.

Watch Out

Test-time and train-time compute are not fully substitutable

A small model with massive test-time compute cannot match a large, well-trained model on all tasks. Test-time compute helps the model use knowledge it already has more effectively. It cannot inject knowledge the model never learned. The substitution works for reasoning tasks where the knowledge is present but the inference path is hard to find.

Summary

  • Test-time compute scaling: generate multiple candidates and select the best
  • Best-of-N with a verifier: success probability $= 1 - (1-p)^N$
  • PRMs guide search at each step; ORMs score only the final answer
  • MCTS adapts game-playing search to reasoning traces
  • Chain-of-thought provides serial compute via intermediate tokens
  • Latent reasoning trades visible tokens for internal computation
  • Verifier quality is the bottleneck: poor verifiers lead to inference-time reward hacking
  • Accuracy scales log-linearly with inference compute, then saturates

Exercises

ExerciseCore

Problem

A model solves a math problem correctly with probability $p = 0.05$ per attempt. Using best-of-N with a perfect verifier, how many samples do you need to achieve a 95% probability of finding at least one correct solution?

ExerciseAdvanced

Problem

Compare the compute efficiency of two strategies: (A) Train a 2x larger model and generate one response. (B) Keep the original model and use best-of-$N$ with a perfect verifier. Assuming training scaling gives accuracy $\propto N_{\text{params}}^{0.1}$ and best-of-N gives accuracy $1 - (1-p)^N$ with fixed per-sample cost, under what conditions does strategy B dominate?

ExerciseResearch

Problem

The inference-time reward hacking problem: as $N$ increases in best-of-N with an imperfect reward model, the selected response increasingly exploits the reward model rather than improving in true quality. Formalize this using the Gao et al. (2023) framework: if the proxy reward $\hat{r}$ and the true reward $r^*$ have correlation $\rho < 1$, how does the expected true reward of the best-of-N selection behave as $N \to \infty$?

References

Canonical:

  • Cobbe et al., "Training Verifiers to Solve Math Word Problems" (2021): early verifier-guided approach
  • Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2022)

Current:

  • Snell et al., "Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters" (2024)
  • DeepSeek-AI, "DeepSeek-R1" (2025): inference-time reasoning at scale
  • Gao et al., "Scaling Laws for Reward Model Overoptimization" (2023): proxy reward hacking


Last reviewed: April 2026
