
Verifier Design and Process Reward

Detailed treatment of verifier types, process vs outcome reward models, verifier-guided search, self-verification, and the connection to test-time compute scaling. How to design reward signals for reasoning models.

Advanced · Tier 2 · Frontier · ~55 min

Why This Matters

The reward-models-and-verifiers page introduced the distinction between learned reward models and objective verifiers. This page goes deeper into the design of verification systems for reasoning: how to build process reward models, when to use outcome vs. process rewards, how verifiers guide search at inference time, and why this connects to the scaling of test-time compute.

Every frontier reasoning model in 2025-2026 (OpenAI o-series, DeepSeek-R1, Claude with extended thinking) uses some form of verifier-guided generation. The design of the verifier is often more important than the design of the generator.

Outcome Reward Models (ORM)

Definition

Outcome Reward Model

An outcome reward model evaluates only the final answer:

r_{\text{ORM}}(x, y) = f(x, y_{\text{final}})

where $x$ is the problem, $y$ is the full solution (including reasoning chain), and $y_{\text{final}}$ is the extracted final answer.

For math problems: check if the numerical answer matches the reference. For code: run the code against test cases and check pass/fail.

The ORM provides a binary or scalar signal at the end of the reasoning chain and says nothing about the intermediate steps.
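For domains with checkable answers, an ORM can be as simple as answer extraction plus comparison. A minimal sketch for numeric math answers (the extraction regex and tolerance are illustrative assumptions, not any particular system's implementation):

```python
# Minimal sketch of an execution-style outcome reward model for math.
# The answer-extraction regex and tolerance are illustrative choices.
import re

def extract_final_answer(solution):
    """Take the last number in the solution text as the final answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", solution)
    return float(matches[-1]) if matches else None

def orm_score(solution, reference, tol=1e-6):
    """Binary outcome reward: 1 if the extracted answer matches, else 0.
    The intermediate reasoning steps are never inspected."""
    answer = extract_final_answer(solution)
    if answer is None:
        return 0
    return int(abs(answer - reference) <= tol)

print(orm_score("Step 1: 6*7 = 42. So the answer is 42.", 42.0))  # 1
print(orm_score("Step 1: 6*7 = 44. So the answer is 44.", 42.0))  # 0
```

Note that the reward is identical for a solution with 19 correct steps and one with 19 wrong ones, as long as the final answers match: this is exactly the sparsity discussed below.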

Strengths of ORM:

  • No bias from a learned step evaluator; the answer is either correct or not
  • Simple to implement for domains with verifiable answers (math, code, logic)
  • Scales with compute: generate many solutions, check each, select the best

Weaknesses of ORM:

  • Sparse signal: a 20-step reasoning chain gets a single bit of feedback
  • No credit assignment: which step caused the error?
  • High variance: with a fixed compute budget, you might not sample any correct solution for hard problems

Process Reward Models (PRM)

Proposition

PRM Credit Assignment Advantage

Statement

Consider a reasoning chain of $T$ steps where step $k$ introduces an error (all steps before $k$ are correct, step $k$ is wrong). An ORM assigns the same negative signal to all $T$ steps (it cannot distinguish which step caused the failure). A PRM assigns positive signal to steps $1, \ldots, k-1$ and negative signal to step $k$.

For RL training, the effective gradient signal per step is:

  • ORM: Each step receives gradient proportional to $\frac{1}{T} \cdot r_{\text{final}}$. The signal is diluted across all steps, including correct ones.
  • PRM: Step $k$ receives a targeted negative gradient. Steps before $k$ receive positive gradients. The error is localized.

The PRM's advantage grows with chain length $T$: for a chain of 20 steps with one error, the ORM dilutes the error signal by a factor of 20, while the PRM targets it directly.

Intuition

Imagine a teacher grading a math exam. The ORM teacher only checks the final answer: "wrong." The student does not know which of 20 steps was the mistake. The PRM teacher checks each line: "steps 1-14 correct, step 15 wrong, steps 16-20 built on the error." The student knows exactly where to improve.

Why It Matters

This credit assignment advantage is why PRMs are preferred for training reasoning models. DeepSeek-R1's GRPO training and OpenAI's process-supervised approaches both exploit dense step-level signal to train chains of 10-100 reasoning steps. With ORM alone, training such long chains would require enormous sample sizes to overcome the variance from sparse rewards.

Failure Mode

The PRM is itself a learned model. If the PRM incorrectly labels a wrong step as correct, the generator learns to reproduce that error. PRM errors compound across steps: a single false-positive can derail the entire remaining chain. The total accumulated PRM error scales as $O(T \cdot \epsilon_{\text{step}})$ where $\epsilon_{\text{step}}$ is the per-step PRM error rate. For very long chains, this accumulated error can exceed the ORM's variance.
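The accumulation argument can be made concrete with a quick calculation. Assuming independent per-step errors at an illustrative rate of $\epsilon_{\text{step}} = 0.02$, the chance that at least one step in a chain is mislabeled is $1 - (1 - \epsilon_{\text{step}})^T$:

```python
# Illustration of PRM error accumulation over chain length, assuming
# independent per-step errors (a simplifying assumption) at rate eps.
eps = 0.02  # illustrative per-step PRM error rate
for T in (5, 20, 100):
    p_any_error = 1 - (1 - eps) ** T
    print(f"T={T:3d}: P(at least one mislabeled step) = {p_any_error:.3f}")
# T=  5: 0.096
# T= 20: 0.332
# T=100: 0.867
```

At $T = 100$ the PRM almost certainly mislabels something in the chain, which is why long-chain training is sensitive to PRM quality.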

PRM Training Methods

Human annotation. Annotators label each reasoning step as correct or incorrect. This produces high-quality labels but is expensive ($10-50 per problem for detailed step-level annotation) and does not scale.

Monte Carlo estimation (Math-Shepherd). For each intermediate step $s_t$ in a reasoning chain:

  1. Generate $N$ completions from step $s_t$ to the final answer
  2. Check each completion against the ground truth
  3. Estimate $P(\text{correct} | s_1, \ldots, s_t) \approx \frac{\text{correct completions}}{N}$
  4. Label step $s_t$ as "correct" if this probability exceeds a threshold

This converts an ORM signal into step-level labels via rollouts. The labels are noisy (variance scales as $1/N$) but the method scales because it requires only a correctness checker for final answers, not human step-level annotation.
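The rollout-labeling loop is short enough to sketch directly. `complete_from` and `is_correct` are stand-ins for the real generator and final-answer checker; the demo stubs them with a toy random model:

```python
# Sketch of Monte Carlo step labeling in the style of Math-Shepherd.
# complete_from and is_correct are placeholders for the generator and
# the final-answer checker; here they are stubbed with a toy model.
import random

def mc_step_label(prefix_steps, complete_from, is_correct,
                  n_rollouts=16, threshold=0.5):
    """Label the last step of prefix_steps by rollout success rate:
    estimate P(correct final answer | prefix) with n_rollouts completions."""
    successes = sum(
        is_correct(complete_from(prefix_steps)) for _ in range(n_rollouts)
    )
    p_correct = successes / n_rollouts
    label = "correct" if p_correct >= threshold else "incorrect"
    return label, p_correct

# Toy demo: completions from a sound prefix reach the right answer 80%
# of the time; completions from a flawed prefix only 10% of the time.
random.seed(0)
complete = lambda pre: random.random() < (0.8 if "sound" in pre[0] else 0.1)
label, p = mc_step_label(["sound step"], complete, lambda ok: ok,
                         n_rollouts=200)
print(label)  # correct
```

Raising `n_rollouts` reduces label noise at linear cost, which is the practical tradeoff behind the $1/N$ variance mentioned above.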

Self-supervised PRM training. Use the model itself to generate both the reasoning chains and the step-level evaluations. Train the PRM on a mixture of Monte Carlo labels and model-generated critiques. This creates a bootstrap loop: the PRM improves, which improves the generator, which produces better training data for the PRM.

Verifier Taxonomy

Definition

Verifier Types

Verifiers can be classified by their mechanism:

  1. Execution-based verifiers. Run code, check mathematical equalities, query databases. The result is objective. Example: running Python code against test cases.

  2. Discriminative verifiers. A trained model that takes (problem, solution) as input and outputs a correctness score. This is a reward model specialized for correctness rather than preference. Example: a classifier trained to predict whether a math solution is correct.

  3. Generative verifiers. A language model that generates a critique or verification of the solution. The verification itself is a text output that reasons about correctness. Example: prompting a model with "Check this solution step by step and identify any errors."

  4. Formal verifiers. Proof assistants (Lean, Isabelle, Coq) that check mathematical proofs against axioms. The check is sound: if the proof is accepted, the statement follows from the axioms. Currently limited to domains with formal specifications.

The verifier types form a spectrum of reliability:

| Verifier Type | Reliability | Coverage | Cost |
|---|---|---|---|
| Formal | Perfect (sound and complete) | Very narrow (formal proofs only) | High setup cost |
| Execution-based | Very high (if tests are good) | Narrow (code, computable math) | Low per query |
| Discriminative | Moderate (learned, can err) | Broad (any domain) | Low per query |
| Generative | Low-moderate (depends on model) | Very broad (any text output) | Moderate per query |

Verifier-Guided Search

Proposition

Best-of-N Scaling with Verifiers

Statement

Generate $N$ independent solutions to a problem. Use a verifier to score each solution and select the highest-scoring one. If the generator has probability $p$ of producing a correct solution on a single attempt, then the probability of finding at least one correct solution among $N$ attempts is:

P(\text{at least one correct}) = 1 - (1-p)^N

For $p = 0.1$ and $N = 50$: $P \approx 0.995$. For $p = 0.01$ and $N = 100$: $P \approx 0.634$.

The number of samples needed to achieve success probability $1 - \delta$ is:

N \geq \frac{\log(1/\delta)}{\log(1/(1-p))} \approx \frac{\log(1/\delta)}{p}

for small $p$.
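These numbers are easy to check directly, assuming a perfect verifier:

```python
# Verify the best-of-N numbers quoted above (perfect verifier assumed).
import math

def p_success(p, n):
    """P(at least one correct among n independent attempts)."""
    return 1 - (1 - p) ** n

def n_needed(p, delta):
    """Smallest N with failure probability at most delta."""
    return math.ceil(math.log(1 / delta) / -math.log(1 - p))

print(round(p_success(0.10, 50), 3))   # 0.995
print(round(p_success(0.01, 100), 3))  # 0.634
print(n_needed(0.10, 0.01))            # 44 samples for 99% success at p=0.1
```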

Intuition

If you can check answers cheaply, you can compensate for a weak generator by sampling many times. A model that solves a problem 10% of the time can achieve 99.5% success with 50 attempts and a reliable verifier. This is the core insight behind test-time compute scaling: spend more compute at inference by generating and verifying multiple attempts.

Why It Matters

Best-of-N with a verifier is the simplest form of test-time compute scaling. It explains why reasoning models appear to get smarter when given more time: they generate more candidate solutions and use verifiers to select the best one. The ceiling is the verifier's ability to distinguish correct from incorrect solutions.

Failure Mode

This analysis assumes a perfect verifier. In practice, the verifier is imperfect: it may score an incorrect solution highly (false positive) or reject a correct solution (false negative). With an imperfect verifier, increasing $N$ past a certain point stops helping because the bottleneck becomes verifier accuracy, not generator coverage. If the verifier accepts incorrect solutions at rate $q$ and correct ones at rate $1 - \beta$, the pool of accepted candidates fills with false positives in proportion to $N(1-p)q$, so for large $N$ the probability of selecting a correct solution plateaus near $\frac{p(1-\beta)}{p(1-\beta) + (1-p)q}$ instead of approaching 1.
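A small simulation makes the plateau visible. The sketch below assumes a binary accept/reject verifier with toy error rates and uniform selection among accepted candidates; all numbers are illustrative, not measured from any real system:

```python
# Monte Carlo sketch of best-of-N with an imperfect binary verifier.
# p (generator accuracy), q (false positive rate) and beta (false
# negative rate) are toy values; selection among accepted samples
# is uniform, a simplifying assumption.
import random

def best_of_n_success(p=0.05, q=0.10, beta=0.05, n=10, trials=4000, seed=0):
    """Fraction of trials where the candidate chosen from the
    verifier-approved pool is actually correct."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        approved = []
        for _ in range(n):
            correct = rng.random() < p
            accepted = rng.random() < ((1 - beta) if correct else q)
            if accepted:
                approved.append(correct)
        if approved and rng.choice(approved):
            wins += 1
    return wins / trials

for n in (1, 10, 100, 500):
    print(n, round(best_of_n_success(n=n), 3))
```

With these toy rates the success probability climbs with $n$ but levels off around one third, far below the near-certainty a perfect verifier would deliver.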

Beyond Best-of-N: Tree Search

Best-of-N generates complete solutions independently. More sophisticated search methods use the verifier to guide partial solutions:

Beam search with PRM. Generate step by step. At each step, generate $B$ candidate next steps, score each with the PRM, keep the top $K$, and continue. This prunes bad reasoning paths early instead of completing them.
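A minimal sketch of PRM-guided beam search. `propose_steps`, `prm_score`, and `is_done` are placeholders for the generator, the process reward model, and a termination check; the demo runs them on a toy string-building task rather than real reasoning:

```python
# Sketch of step-level beam search guided by a PRM score.
import heapq

def beam_search(problem, propose_steps, prm_score, is_done,
                beam_width=4, expand=8, max_depth=10):
    """Keep the beam_width highest-scoring partial chains at each depth."""
    beam = [(0.0, [])]  # (cumulative PRM score, steps so far)
    for _ in range(max_depth):
        candidates = []
        for score, steps in beam:
            if is_done(steps):          # finished chains are carried forward
                candidates.append((score, steps))
                continue
            for step in propose_steps(problem, steps, k=expand):
                candidates.append((score + prm_score(problem, steps + [step]),
                                   steps + [step]))
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        if all(is_done(s) for _, s in beam):
            break
    return max(beam, key=lambda c: c[0])

# Toy demo: build the string "abc" one character at a time; the "PRM"
# rewards prefixes of the target and penalizes everything else.
target = "abc"
propose = lambda problem, steps, k: list("abcx")[:k]
score = lambda problem, steps: 1.0 if target.startswith("".join(steps)) else -1.0
done = lambda steps: "".join(steps) == target
best_score, best_steps = beam_search(target, propose, score, done, max_depth=3)
print("".join(best_steps))  # abc
```

Bad prefixes ("x…", "b…") accumulate negative scores and fall out of the beam after one or two steps, which is exactly the early-pruning behavior described above.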

Monte Carlo Tree Search (MCTS). Build a tree of reasoning steps. Use the PRM as a value function to estimate the probability of reaching a correct answer from each node. Balance exploration (trying new steps) and exploitation (extending promising paths). AlphaGo used MCTS for Go; the same principle applies to mathematical reasoning.

Sequential refinement. Generate a complete solution, use a verifier to identify errors, then generate a revised solution conditioned on the critique. Repeat until the verifier is satisfied or a compute budget is exhausted.

Self-Verification

Proposition

Limitations of Self-Verification

Statement

When a model verifies its own output, systematic biases present during generation persist during verification. If the model consistently makes a specific type of reasoning error (e.g., sign errors in algebra, off-by-one errors in counting), it will also fail to detect that error during self-verification.

Formally: let $e$ be a systematic error pattern. If $P(\text{generate } e | \text{problem type}) > 0$, then typically $P(\text{detect } e | \text{verify own output}) < P(\text{detect } e | \text{external verifier})$ because the same learned heuristics that produce the error also fail to flag it during review.

Intuition

Asking a model to check its own work is like asking a student who made an algebra mistake to recheck by doing the same algebra. The student will likely make the same mistake again. An independent checker (a different model, a calculator, a proof assistant) does not share the same blind spots.

Why It Matters

Self-verification is appealing because it requires no external tools or additional models. But its limitations are important: for hard problems where the model's error rate is high, self-verification provides diminishing returns. The best systems combine self-verification (cheap, broad) with external verification (reliable, narrow) where available.

Failure Mode

Self-verification can create a false sense of confidence. If the model generates an incorrect answer and then "verifies" it as correct (because it has the same blind spot), the user receives a wrong answer that the model explicitly claims to have checked. This is worse than no verification because it increases trust in an incorrect result.

Making Self-Verification Work Better

Several techniques improve self-verification despite its inherent limitations:

Diverse sampling. Generate the solution multiple times with different prompts, temperatures, or chain-of-thought structures. Check whether the answers agree. Inconsistency signals potential errors. This works because different generation paths activate different heuristics, breaking the correlation between generation and verification errors.
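The agreement check itself is a few lines. The sampler below is a stand-in for calling the model with varied temperature or prompt; the agreement threshold is an illustrative choice:

```python
# Sketch of the diverse-sampling consistency check: draw several answers
# and flag the problem when they disagree too much. sample_answer is a
# placeholder for a model call with varied temperature/prompt.
from collections import Counter

def consistency_check(sample_answer, n=8, agreement_threshold=0.75):
    """Return (modal answer, agreement fraction, passed-threshold flag)."""
    answers = [sample_answer(i) for i in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    agreement = count / n
    return answer, agreement, agreement >= agreement_threshold

# Toy demo: 6 of 8 diverse samples agree on 42.
samples = [42, 42, 42, 41, 42, 42, 40, 42]
ans, agree, ok = consistency_check(lambda i: samples[i])
print(ans, agree, ok)  # 42 0.75 True
```

Low agreement does not say which answer is wrong, only that the problem deserves more compute or an external check.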

Role separation. Prompt the model differently for generation and verification. The generator is told to solve the problem. The verifier is told to critique a solution written by "someone else." This can partially decouple the biases, though the underlying model weights are the same.

Formal grounding. When the model generates a mathematical claim, ask it to verify by computation rather than by re-reading the argument. For example: "compute the left-hand side and the right-hand side of this equation separately and check if they are equal." This converts verification from pattern matching to computation, which is more reliable.
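The "verify by computation" idea can be sketched numerically: evaluate both sides of a claimed identity at random points instead of re-reading the derivation. Agreement at random points is strong evidence, not a proof; the identity below is just an example:

```python
# Numeric spot-check of a claimed two-variable identity: evaluate both
# sides at random points and compare within a relative tolerance.
import random

def check_identity(lhs, rhs, n_points=20, tol=1e-9, seed=0):
    rng = random.Random(seed)
    for _ in range(n_points):
        a, b = rng.uniform(-10, 10), rng.uniform(-10, 10)
        if abs(lhs(a, b) - rhs(a, b)) > tol * max(1.0, abs(lhs(a, b))):
            return False
    return True

# (a+b)^2 = a^2 + 2ab + b^2 passes; dropping the cross term fails.
print(check_identity(lambda a, b: (a + b) ** 2,
                     lambda a, b: a * a + 2 * a * b + b * b))  # True
print(check_identity(lambda a, b: (a + b) ** 2,
                     lambda a, b: a * a + b * b))              # False
```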

Connection to Test-Time Compute

The verifier-guided search framework explains the test-time compute scaling observed in reasoning models:

  1. More compute = more samples. With budget $C$, generate $C/c$ solutions (where $c$ is the cost per solution). More solutions mean higher probability of finding a correct one.

  2. More compute = deeper search. With budget $C$, run beam search or MCTS with wider beams or deeper trees. The PRM guides the search toward promising paths.

  3. Compute-optimal allocation. Given a fixed budget, how should you split between generation (more samples) and verification (better scoring)? Empirically, the optimal split depends on problem difficulty: easy problems benefit more from a few high-quality samples with a simple verifier; hard problems benefit from many samples with a strong verifier.

The key insight: verifier quality determines the ceiling of test-time compute scaling. No amount of sampling helps if the verifier cannot distinguish correct from incorrect solutions. Improving verifiers (better PRMs, more reliable execution environments, formal verification coverage) directly increases the returns from test-time compute.

Common Confusions

Watch Out

PRM does not guarantee step-level correctness

A PRM predicts whether each step is correct. It is a learned model with nonzero error rate. A PRM score of 0.95 for a step means the PRM thinks the step is probably correct, not that it is correct. Only formal verifiers and execution-based verifiers can guarantee correctness.

Watch Out

Best-of-N is not the same as majority voting

Best-of-N selects the highest-scoring solution according to a verifier. Majority voting selects the most common answer among $N$ samples. They differ when the verifier disagrees with the majority. For math, majority voting is a weak baseline because many incorrect solutions can agree on the same wrong answer (e.g., common computation errors that multiple samples reproduce).
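A toy contrast: several samples reproduce the same common computational error, so the majority answer is wrong, while a verifier (stubbed here as a ground-truth check for illustration) picks out the correct minority answer:

```python
# Toy contrast between majority voting and verifier-based best-of-N.
# The "verifier" is stubbed as a ground-truth comparison purely to
# illustrate the divergence between the two selection rules.
from collections import Counter

samples = ["12", "12", "12", "13", "12", "13"]  # a common error yields 12
reference = "13"

majority = Counter(samples).most_common(1)[0][0]
verifier_pick = next((s for s in samples if s == reference), None)

print("majority vote :", majority)       # 12 (wrong)
print("best-of-N pick:", verifier_pick)  # 13 (correct)
```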

Watch Out

Self-verification is not verification

When papers report "the model verified its answer," this means the model generated text asserting correctness. It does not mean correctness was checked against ground truth. Self-verification reduces but does not eliminate errors, and the reduction is smallest for the hardest problems where verification matters most.

Summary

  • ORM: scores final answer only. Unbiased but sparse, no credit assignment
  • PRM: scores each step. Dense signal, better credit assignment, but biased by PRM errors
  • PRM error accumulates as $O(T \cdot \epsilon_{\text{step}})$ over $T$ steps
  • Monte Carlo PRM training: use rollout success rates to label step correctness
  • Verifier types: formal (perfect, narrow), execution (reliable, narrow), discriminative (moderate, broad), generative (weak, very broad)
  • Best-of-N with verifier: $P(\text{success}) = 1 - (1-p)^N$. Sample many, verify, select
  • Tree search (beam, MCTS) uses PRM to prune bad paths early
  • Self-verification is limited by shared biases between generator and verifier
  • Verifier quality is the ceiling on test-time compute scaling returns

Exercises

ExerciseCore

Problem

A model solves a particular math problem correctly with probability $p = 0.05$ per attempt. You have a perfect verifier (it always correctly identifies correct solutions). How many independent attempts $N$ do you need to find a correct solution with probability at least 0.99?

ExerciseAdvanced

Problem

A PRM has per-step error rate $\epsilon = 0.02$ (probability of labeling a correct step as incorrect, or an incorrect step as correct). For a reasoning chain of $T = 30$ steps, estimate the probability that the PRM makes zero errors across all steps. Compare this to a chain of $T = 5$ steps.

ExerciseResearch

Problem

Design a hybrid verification system for a math reasoning model that combines three verification methods: (a) self-verification, (b) a trained PRM, and (c) a symbolic math engine (e.g., SymPy). Specify when each method is used, how conflicts are resolved, and what the system does when the symbolic engine cannot parse the model's reasoning.

References

Canonical:

  • Cobbe et al., "Training Verifiers to Solve Math Word Problems" (2021). ORM for math.
  • Lightman et al., "Let's Verify Step by Step" (2023). Process reward models.

Current:

  • Wang et al., "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations" (2024). Monte Carlo PRM training.
  • Snell et al., "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" (2024). Test-time compute scaling.
  • OpenAI, "Learning to Reason with LLMs" (2024). o1 system description.

Last reviewed: April 2026