LLM Construction
Test-Time Compute and Search
One of the biggest frontier shifts: spending more compute at inference through repeated sampling, verifier-guided search, MCTS for reasoning, chain-of-thought as compute, and latent reasoning.
Why This Matters
Training scaling laws are well understood: more compute at training time predictably improves loss. But a parallel revolution is happening at inference time. Instead of generating a single answer, you generate many candidates and select the best one. Instead of producing tokens left-to-right, you search over reasoning paths. Instead of thinking in tokens, you think in hidden states.
This shift, from "make the model smarter" to "let the model think harder", is one of the defining developments of 2025-2026. Models like o1 and DeepSeek-R1 spend orders of magnitude more compute per query than standard models, and the extra compute translates into dramatically better performance on reasoning tasks.
The key finding is nuanced: more tokens alone do not reliably produce better answers. Good verifiers and structured search matter more than raw generation volume.
Mental Model
Think of a student taking an exam. A well-prepared student writes one answer and moves on. A student who is uncertain can:
- Attempt multiple solutions and pick the one that checks out (best-of-N)
- Show their work step by step and verify each step (process reward)
- Explore different approaches and backtrack when stuck (tree search)
- Think silently before writing (latent reasoning)
Each strategy trades more time (inference compute) for better answers. The question is: how much better, and when does the improvement plateau?
Inference Compute: The Core Idea
Test-Time Compute
Test-time compute (or inference compute) is the total computation spent generating a response to a single query. For a standard autoregressive model with P parameters generating T tokens, inference compute is approximately 2PT FLOPs. Test-time compute scaling increases this by generating multiple candidates, performing search, or extending the reasoning chain.
The standard approach generates one response at compute cost C. Test-time scaling methods spend kC, where k can be 10-1000x, to produce a better response. The central question: does the quality improvement justify the cost?
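To get a feel for these magnitudes, a back-of-envelope calculation (the 7B parameter count, 500-token response, and N = 64 below are illustrative assumptions, not figures from the text):

```python
# Rough inference-cost model: ~2 FLOPs per parameter per generated token.
# All numbers are illustrative assumptions.
params = 7_000_000_000     # a 7B-parameter model
tokens_per_answer = 500

single = 2 * params * tokens_per_answer   # one response
best_of_64 = 64 * single                  # best-of-N with N = 64

print(f"{single:.1e} FLOPs single, {best_of_64:.1e} FLOPs best-of-64")
```

The multiplier from test-time scaling shows up directly: best-of-64 costs 64x the single-response baseline.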
Best-of-N Sampling
The simplest test-time scaling method: generate N independent responses and select the best one using a verifier or reward model.
Best-of-N Scaling with a Perfect Verifier
Statement
If the model produces a correct answer with probability p per sample, and the verifier perfectly identifies correctness, then the probability of at least one correct answer among N samples is:
P(success) = 1 - (1 - p)^N
For small p, this scales approximately linearly: P(success) ≈ Np when Np << 1. To achieve success probability 1 - δ, you need:
N ≥ log(δ) / log(1 - p) ≈ (1/p) · log(1/δ)
samples. The cost scales as O((1/p) · log(1/δ)) to achieve high confidence.
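A quick numeric check of the required sample count, using an illustrative per-sample success rate of p = 0.05 and a 95% target (these values are assumptions, not from the text):

```python
import math

# Best-of-N with a perfect verifier: P(success) = 1 - (1 - p)^N.
p = 0.05       # illustrative per-sample success rate
target = 0.95  # desired probability of at least one correct answer

# Smallest N with 1 - (1-p)^N >= target, i.e. N >= log(1-target)/log(1-p).
n = math.ceil(math.log(1 - target) / math.log(1 - p))
print(n)                           # 59 samples needed
print(1 - (1 - p) ** n >= target)  # True
```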
Intuition
If the model has even a small chance of producing the correct answer, generating enough samples makes success near-certain. The verifier acts as a filter: it cannot make the model smarter, but it can select the model's best work from many attempts. The bottleneck shifts from generation quality to verifier quality.
Why It Matters
Best-of-N is the workhorse of test-time compute scaling. It requires no additional training, just a good verifier. For math and code, where verifiers are reliable (run the code, check the proof), best-of-N gives substantial improvements. For open-ended text where verification is hard, the gains are smaller.
Failure Mode
When the verifier is imperfect (a reward model rather than a ground-truth checker), best-of-N suffers from reward hacking at inference time. The selected sample maximizes the verifier's score, not actual quality. As N increases, the selected sample increasingly exploits verifier weaknesses. This is the inference-time analog of reward model overoptimization.
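A minimal best-of-N sketch; `generate` and `verifier_score` are hypothetical stand-ins for model sampling and reward scoring, and the toy usage at the bottom is purely illustrative:

```python
import random

def best_of_n(generate, verifier_score, prompt, n=16):
    """Generate n candidates independently and return the one the
    verifier scores highest. The verifier filters; it cannot improve
    any single sample."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verifier_score)

# Toy stand-ins: the "model" guesses integers, the "verifier" rewards
# closeness to the true answer (42). Purely illustrative.
random.seed(0)
generate = lambda prompt: random.randint(0, 100)
verifier_score = lambda ans: -abs(ans - 42)

print(best_of_n(generate, verifier_score, "What is 6*7?", n=32))
```

With 32 uniform guesses, the selected answer is usually close to 42: the verifier does the work, not the generator.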
Verifier-Guided Search
Best-of-N generates complete responses independently. More sophisticated methods use the verifier to guide generation during the reasoning process.
Process Reward Models (PRMs) vs Outcome Reward Models (ORMs)
- ORM: Scores the final answer. Binary: correct or not.
- PRM: Scores each intermediate reasoning step. Guides the search by pruning bad reasoning paths early.
With a PRM, you can perform step-level beam search: at each reasoning step, generate multiple continuations, score them with the PRM, and keep only the top-k. This avoids wasting compute on reasoning paths that go wrong early.
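Step-level beam search can be sketched as follows; `propose_steps`, `prm_score`, and `is_done` are hypothetical callables, and the toy usage at the bottom replaces real reasoning steps with numbers:

```python
def prm_beam_search(propose_steps, prm_score, is_done, prompt,
                    beam_width=4, expansions=4, max_steps=10):
    """Keep the beam_width highest-PRM-scoring partial reasoning chains,
    extending each by `expansions` candidate next steps per round."""
    beams = [[]]  # each beam is a list of reasoning steps
    for _ in range(max_steps):
        candidates = []
        for chain in beams:
            if is_done(chain):
                candidates.append(chain)  # keep finished chains as-is
                continue
            for step in propose_steps(prompt, chain, k=expansions):
                candidates.append(chain + [step])
        # Prune early: score every partial chain, keep only the top beams.
        candidates.sort(key=lambda c: prm_score(prompt, c), reverse=True)
        beams = candidates[:beam_width]
        if all(is_done(c) for c in beams):
            break
    return beams[0]

# Toy usage: "steps" are numbers, the "PRM" prefers chains summing near 10,
# and a chain is done after 3 steps. Purely illustrative stand-ins.
propose = lambda prompt, chain, k: list(range(1, k + 1))
score = lambda prompt, chain: -abs(sum(chain) - 10)
done = lambda chain: len(chain) >= 3
print(prm_beam_search(propose, score, done, "toy"))  # [4, 4, 2]
```

The pruning step is where the compute saving comes from: chains that score badly after one or two steps never get extended further.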
Monte Carlo Tree Search (MCTS) for Reasoning
MCTS treats reasoning as a sequential decision problem:
- State: The reasoning chain so far
- Action: The next reasoning step
- Reward: Correctness of the final answer (or PRM score)
- Rollout: Complete the reasoning chain from the current state
The standard MCTS loop:
- Select a promising node using UCB (upper confidence bound)
- Expand by generating a new reasoning step
- Simulate by completing the chain (or using a value estimate)
- Backpropagate the outcome to update node statistics
This is how AlphaGo-style search has been adapted for mathematical reasoning. The reasoning trace is the "game tree," and each step is a "move."
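The four-step loop above can be sketched as follows; `next_steps` and `rollout_reward` are hypothetical stand-ins for step generation and chain completion, and the toy problem at the bottom is purely illustrative:

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    # Upper confidence bound: exploit high-value children, explore rare ones.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root_state, next_steps, rollout_reward, iters=200):
    root = Node(root_state)
    for _ in range(iters):
        # 1. Select: descend to a leaf via UCB.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expand: add children for each candidate next reasoning step.
        if node.visits > 0:
            for s in next_steps(node.state):
                node.children.append(Node(s, parent=node))
            if node.children:
                node = node.children[0]
        # 3. Simulate: complete the chain and score the outcome.
        reward = rollout_reward(node.state)
        # 4. Backpropagate: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).state

# Toy problem: states are partial sums, a "step" adds 1 or 2, and the
# reward is 1 if a random completion lands exactly on 5.
random.seed(0)
steps = lambda s: [s + 1, s + 2] if s < 5 else []
def reward(s):
    while s < 5:
        s += random.choice([1, 2])
    return 1.0 if s == 5 else 0.0
print(mcts(0, steps, reward, iters=300))
```

In a real reasoning setting, `next_steps` would sample candidate reasoning steps from the model and `rollout_reward` would complete the chain and check the answer (or query a PRM/value estimate instead of rolling out).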
Chain-of-Thought as Inference Compute
Chain-of-Thought Scaling
Chain-of-thought (CoT) prompting increases inference compute by generating intermediate reasoning tokens before the final answer. A model with P parameters generating R reasoning tokens plus A answer tokens spends approximately 2P(R + A) FLOPs: more reasoning tokens means more compute.
The hypothesis: the transformer's fixed per-token computation is insufficient for hard reasoning, and generating intermediate tokens provides additional "serial computation" that enables multi-step inference.
Inference Compute Scaling
Statement
Empirically, for reasoning tasks with reliable verifiers, accuracy scales approximately log-linearly with inference compute:
Accuracy(C) ≈ a + b · log(C)
where C is the total inference FLOPs (including all candidates and reasoning tokens). This scaling holds over 1-2 orders of magnitude of compute before saturating.
The critical finding: the rate of improvement depends heavily on verifier quality. With a perfect verifier, scaling is steep. With a weak reward model, scaling flattens quickly due to reward hacking.
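A quick simulation of best-of-N with a perfect verifier (p = 0.05 is an illustrative per-sample success rate) shows the diminishing-returns shape: each doubling of N buys less accuracy, and the curve saturates:

```python
# Accuracy of best-of-N with a perfect verifier: 1 - (1-p)^N.
# Doubling N (i.e. doubling compute) adds less and less accuracy.
p = 0.05  # illustrative assumption
prev = 0.0
for n in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    acc = 1 - (1 - p) ** n
    print(f"N={n:4d}  accuracy={acc:.3f}  gain={acc - prev:+.3f}")
    prev = acc
```

The gain column shrinks toward zero: past N ≈ 64 here, extra samples buy almost nothing, which is the saturation described above.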
Intuition
More inference compute helps, but with diminishing returns. Doubling compute gives a fixed additive accuracy improvement (log scaling), not a fixed multiplicative one. Eventually the model hits a ceiling set by its training: no amount of search can find a correct solution if the model never learned the relevant knowledge or reasoning pattern.
Why It Matters
This result establishes that training compute and inference compute are partially substitutable. A smaller model with more inference compute can match a larger model with less inference compute on reasoning tasks. This has profound implications for deployment: you can trade expensive one-time training for cheaper, smaller-model inference, especially when queries are infrequent but high-stakes.
Failure Mode
The log-linear scaling saturates. Beyond a certain point, additional inference compute yields negligible improvement because the model lacks the knowledge or reasoning capability to solve the problem regardless of how many attempts it makes. Inference compute cannot compensate for fundamental training deficiencies.
Latent Reasoning
The most frontier direction: reasoning in hidden states rather than in tokens.
Standard chain-of-thought reasoning forces the model to express all reasoning as text tokens. This has limitations:
- Bandwidth: Natural language is a low-bandwidth communication channel for complex reasoning
- Faithfulness: The written chain-of-thought may not reflect the model's actual computation
- Efficiency: Generating tokens is expensive; internal computation is cheaper
Latent reasoning models use special "thinking tokens" that are processed by the transformer but not decoded into text. The model can perform many steps of internal computation (many forward passes through "thinking" positions) before producing an output token. This increases the model's effective depth for hard problems.
This is an active research area. The theoretical framework is developing: latent reasoning can be viewed as giving the model a variable-depth computation graph rather than the fixed depth of a standard transformer.
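As a conceptual toy (not a real transformer; the update rule and dimensions are arbitrary illustrative choices), variable-depth latent computation can be mimicked by iterating an internal update on a hidden state before decoding any output:

```python
import math

# Conceptual toy: "thinking" as repeated internal updates of a hidden
# state, with no tokens emitted until the end.
def step(h):
    # One round of internal computation over a 3-dim hidden state.
    a, b, c = h
    return [math.tanh(0.5 * a + 0.2 * b),
            math.tanh(0.3 * b - 0.1 * c),
            math.tanh(0.4 * c + 0.2 * a)]

def think(h, n_steps):
    # n_steps forward passes over latent positions: variable effective
    # depth for the same parameters, before any output is decoded.
    for _ in range(n_steps):
        h = step(h)
    return h

h0 = [1.0, -1.0, 0.5]
print(think(h0, 1))   # shallow: one internal step
print(think(h0, 16))  # deep: 16 internal steps, same weights
```

The point is only the control flow: the same weights are applied a variable number of times, so hard inputs can get a deeper computation graph than easy ones without emitting any tokens.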
For reasoning tasks with verifiable answers (math, code, constrained generation), best-of-N with a verifier is the default approach. Generate 16-64 candidates, score with a verifier, return the best. This is simple to implement, requires no additional training beyond the verifier, and provides 10-30% accuracy improvements on hard benchmarks. For tasks without reliable verifiers, a single extended chain-of-thought with a reward model reranker is the practical fallback.
When Test-Time Compute Helps (and When It Doesn't)
Test-time compute is most effective when:
- The task has a reliable verifier (code execution, math provers)
- The model has a non-zero pass rate (it sometimes gets the answer right)
- The task requires multi-step reasoning (search helps explore paths)
Test-time compute is least effective when:
- Verification is as hard as generation (open-ended writing, subjective tasks)
- The model never produces correct answers (fundamental capability gap)
- The task is easy (single-pass generation already succeeds reliably)
Common Confusions
More tokens do not always mean better reasoning
Generating a longer chain-of-thought is not the same as reasoning more deeply. Models can produce long, fluent chains of text that arrive at the wrong answer. The quality of reasoning depends on the model's training, not just the length of the output. Structured search with verification matters more than raw token count.
Best-of-N is not the same as temperature sampling
High temperature increases output diversity but does not select for quality. Best-of-N generates diverse candidates and selects the best one. The selection step is what matters. Without a good verifier, high-temperature sampling just produces more varied wrong answers.
Test-time and train-time compute are not fully substitutable
A small model with massive test-time compute cannot match a large, well-trained model on all tasks. Test-time compute helps the model use knowledge it already has more effectively. It cannot inject knowledge the model never learned. The substitution works for reasoning tasks where the knowledge is present but the inference path is hard to find.
Summary
- Test-time compute scaling: generate multiple candidates and select the best
- Best-of-N with a verifier: success probability 1 - (1 - p)^N with N samples
- PRMs guide search at each step; ORMs score only the final answer
- MCTS adapts game-playing search to reasoning traces
- Chain-of-thought provides serial compute via intermediate tokens
- Latent reasoning trades visible tokens for internal computation
- Verifier quality is the bottleneck: poor verifiers lead to inference-time reward hacking
- Accuracy scales log-linearly with inference compute, then saturates
Exercises
Problem
A model solves a math problem correctly with probability p per attempt. Using best-of-N with a perfect verifier, how many samples N do you need to achieve a 95% probability of finding at least one correct solution?
Problem
Compare the compute efficiency of two strategies: (A) train a 2x larger model and generate one response, or (B) keep the original model and use best-of-N with a perfect verifier. Assuming the larger model's accuracy follows a training scaling law and best-of-N gives accuracy 1 - (1 - p)^N at a fixed per-sample cost, under what conditions does strategy B dominate?
Problem
The inference-time reward hacking problem: as N increases in best-of-N with an imperfect reward model, the selected response increasingly exploits the reward model rather than improving in true quality. Formalize this using the Gao et al. (2023) framework: if the proxy reward and the true reward have correlation ρ, how does the expected true reward of the best-of-N selection behave as N → ∞?
References
Canonical:
- Cobbe et al., "Training Verifiers to Solve Math Word Problems" (2021): early verifier-guided approach
- Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2022)
Current:
- Snell et al., "Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters" (2024)
- DeepSeek-AI, "DeepSeek-R1" (2025): inference-time reasoning at scale
- Gao et al., "Scaling Laws for Reward Model Overoptimization" (2023): proxy reward hacking
Next Topics
The natural next steps from test-time compute:
- Reward models and verifiers: the signals that guide test-time search
- Post-training overview: how models are trained to support test-time reasoning
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Scaling Laws (Layer 4)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
Builds on This
- Inference-Time Scaling Laws (Layer 5)
- Latent Reasoning (Layer 5)