
Verifier Design and Process Reward

Detailed treatment of verifier types, process vs outcome reward models, verifier-guided search, self-verification, and the connection to test-time compute scaling. How to design reward signals for reasoning models.

Advanced · Tier 2 · Frontier · ~55 min

Why This Matters

The reward-models-and-verifiers page introduced the distinction between learned reward models and objective verifiers. This page goes deeper into the design of verification systems for reasoning: how to build process reward models, when to use outcome vs. process rewards, how verifiers guide search at inference time, and why this connects to the scaling of test-time compute.

Every frontier reasoning model in 2025-2026 (OpenAI o-series, DeepSeek-R1, Claude with extended thinking) uses some form of verifier-guided generation. The design of the verifier is often more important than the design of the generator.

Outcome Reward Models (ORM)

Definition

Outcome Reward Model

An outcome reward model evaluates only the final answer:

r_{\text{ORM}}(x, y) = f(x, y_{\text{final}})

where $x$ is the problem, $y$ is the full solution (including reasoning chain), and $y_{\text{final}}$ is the extracted final answer.

For math problems: check if the numerical answer matches the reference. For code: run the code against test cases and check pass/fail.

The ORM provides a binary or scalar signal at the end of the reasoning chain and says nothing about the intermediate steps.
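For domains with checkable answers, an ORM can be as simple as answer extraction plus comparison. A minimal sketch for numeric math answers (the extraction regex and tolerance are illustrative assumptions, not any particular system's implementation):

```python
# Minimal sketch of an execution-style outcome reward model for math.
# The answer-extraction regex and tolerance are illustrative choices.
import re

def extract_final_answer(solution):
    """Take the last number in the solution text as the final answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", solution)
    return float(matches[-1]) if matches else None

def orm_score(solution, reference, tol=1e-6):
    """Binary outcome reward: 1 if the extracted answer matches, else 0.
    The intermediate reasoning steps are never inspected."""
    answer = extract_final_answer(solution)
    if answer is None:
        return 0
    return int(abs(answer - reference) <= tol)

print(orm_score("Step 1: 6*7 = 42. So the answer is 42.", 42.0))  # 1
print(orm_score("Step 1: 6*7 = 44. So the answer is 44.", 42.0))  # 0
```

Note that the reward is identical for a solution with 19 correct steps and one with 19 wrong ones, as long as the final answers match: this is exactly the sparsity discussed below.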

Strengths of ORM:

  • No bias from a learned step evaluator; the answer is either correct or not
  • Simple to implement for domains with verifiable answers (math, code, logic)
  • Scales with compute: generate many solutions, check each, select the best

Weaknesses of ORM:

  • Sparse signal: a 20-step reasoning chain gets a single bit of feedback
  • No credit assignment: which step caused the error?
  • High variance: with a fixed compute budget, you might not sample any correct solution for hard problems

Process Reward Models (PRM)

Proposition

PRM Credit Assignment Advantage

Statement

Consider a reasoning chain of $T$ steps where step $k$ introduces an error (all steps before $k$ are correct, step $k$ is wrong). An ORM assigns the same negative signal to all $T$ steps (it cannot distinguish which step caused the failure). A PRM assigns positive signal to steps $1, \ldots, k-1$ and negative signal to step $k$.

For RL training, the effective gradient signal per step is:

  • ORM: Each step receives gradient proportional to $\frac{1}{T} \cdot r_{\text{final}}$. The signal is diluted across all steps, including correct ones.
  • PRM: Step $k$ receives a targeted negative gradient. Steps before $k$ receive positive gradients. The error is localized.

The PRM's advantage grows with chain length $T$: for a chain of 20 steps with one error, the ORM dilutes the error signal by a factor of 20, while the PRM targets it directly.

Intuition

Imagine a teacher grading a math exam. The ORM teacher only checks the final answer: "wrong." The student does not know which of 20 steps was the mistake. The PRM teacher checks each line: "steps 1-14 correct, step 15 wrong, steps 16-20 built on the error." The student knows exactly where to improve.

Why It Matters

This credit assignment advantage is why PRMs are preferred for training reasoning models. DeepSeek-R1's GRPO training and OpenAI's process-supervised approaches both exploit dense step-level signal to train chains of 10-100 reasoning steps. With ORM alone, training such long chains would require enormous sample sizes to overcome the variance from sparse rewards.

Failure Mode

The PRM is itself a learned model. If the PRM incorrectly labels a wrong step as correct, the generator learns to reproduce that error. PRM errors compound across steps: a single false-positive can derail the entire remaining chain. The total accumulated PRM error scales as $O(T \cdot \epsilon_{\text{step}})$ where $\epsilon_{\text{step}}$ is the per-step PRM error rate. For very long chains, this accumulated error can exceed the ORM's variance.
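The accumulation argument can be made concrete with a quick calculation. Assuming independent per-step errors at an illustrative rate of $\epsilon_{\text{step}} = 0.02$, the chance that at least one step in a chain is mislabeled is $1 - (1 - \epsilon_{\text{step}})^T$:

```python
# Illustration of PRM error accumulation over chain length, assuming
# independent per-step errors (a simplifying assumption) at rate eps.
eps = 0.02  # illustrative per-step PRM error rate
for T in (5, 20, 100):
    p_any_error = 1 - (1 - eps) ** T
    print(f"T={T:3d}: P(at least one mislabeled step) = {p_any_error:.3f}")
# T=  5: 0.096
# T= 20: 0.332
# T=100: 0.867
```

At $T = 100$ the PRM almost certainly mislabels something in the chain, which is why long-chain training is sensitive to PRM quality.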

PRM Training Methods

Human annotation. Annotators label each reasoning step as correct or incorrect. This produces high-quality labels but is expensive ($10-50 per problem for detailed step-level annotation) and does not scale.

Monte Carlo estimation (Math-Shepherd). For each intermediate step $s_t$ in a reasoning chain:

  1. Generate $N$ completions from step $s_t$ to the final answer
  2. Check each completion against the ground truth
  3. Estimate $P(\text{correct} | s_1, \ldots, s_t) \approx \frac{\text{correct completions}}{N}$
  4. Label step $s_t$ as "correct" if this probability exceeds a threshold

This converts an ORM signal into step-level labels via rollouts. The labels are noisy (variance scales as $1/N$) but the method scales because it requires only a correctness checker for final answers, not human step-level annotation.
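The rollout-labeling loop is short enough to sketch directly. `complete_from` and `is_correct` are stand-ins for the real generator and final-answer checker; the demo stubs them with a toy random model:

```python
# Sketch of Monte Carlo step labeling in the style of Math-Shepherd.
# complete_from and is_correct are placeholders for the generator and
# the final-answer checker; here they are stubbed with a toy model.
import random

def mc_step_label(prefix_steps, complete_from, is_correct,
                  n_rollouts=16, threshold=0.5):
    """Label the last step of prefix_steps by rollout success rate:
    estimate P(correct final answer | prefix) with n_rollouts completions."""
    successes = sum(
        is_correct(complete_from(prefix_steps)) for _ in range(n_rollouts)
    )
    p_correct = successes / n_rollouts
    label = "correct" if p_correct >= threshold else "incorrect"
    return label, p_correct

# Toy demo: completions from a sound prefix reach the right answer 80%
# of the time; completions from a flawed prefix only 10% of the time.
random.seed(0)
complete = lambda pre: random.random() < (0.8 if "sound" in pre[0] else 0.1)
label, p = mc_step_label(["sound step"], complete, lambda ok: ok,
                         n_rollouts=200)
print(label)  # correct
```

Raising `n_rollouts` reduces label noise at linear cost, which is the practical tradeoff behind the $1/N$ variance mentioned above.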

Self-supervised PRM training. Use the model itself to generate both the reasoning chains and the step-level evaluations. Train the PRM on a mixture of Monte Carlo labels and model-generated critiques. This creates a bootstrap loop: the PRM improves, which improves the generator, which produces better training data for the PRM.

Verifier Taxonomy

Definition

Verifier Types

Verifiers can be classified by their mechanism:

  1. Execution-based verifiers. Run code, check mathematical equalities, query databases. The result is objective. Example: running Python code against test cases.

  2. Discriminative verifiers. A trained model that takes (problem, solution) as input and outputs a correctness score. This is a reward model specialized for correctness rather than preference. Example: a classifier trained to predict whether a math solution is correct.

  3. Generative verifiers. A language model that generates a critique or verification of the solution. The verification itself is a text output that reasons about correctness. Example: prompting a model with "Check this solution step by step and identify any errors."

  4. Formal verifiers. Proof assistants (Lean, Isabelle, Coq) that check mathematical proofs against axioms. The check is sound: if the proof is accepted, the statement follows from the axioms. Currently limited to domains with formal specifications.

The verifier types form a spectrum of reliability:

| Verifier Type | Reliability | Coverage | Cost |
|---|---|---|---|
| Formal | Perfect (sound and complete) | Very narrow (formal proofs only) | High setup cost |
| Execution-based | Very high (if tests are good) | Narrow (code, computable math) | Low per query |
| Discriminative | Moderate (learned, can err) | Broad (any domain) | Low per query |
| Generative | Low-moderate (depends on model) | Very broad (any text output) | Moderate per query |

Verifier-Guided Search

Proposition

Best-of-N Scaling with Verifiers

Statement

Generate $N$ independent solutions to a problem. Use a verifier to score each solution and select the highest-scoring one. If the generator has probability $p$ of producing a correct solution on a single attempt, then the probability of finding at least one correct solution among $N$ attempts is:

P(\text{at least one correct}) = 1 - (1-p)^N

For $p = 0.1$ and $N = 50$: $P \approx 0.995$. For $p = 0.01$ and $N = 100$: $P \approx 0.634$.

The number of samples needed to achieve success probability $1 - \delta$ is:

N \geq \frac{\log(1/\delta)}{\log(1/(1-p))} \approx \frac{\log(1/\delta)}{p}

for small $p$.
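These numbers are easy to check directly, assuming a perfect verifier:

```python
# Verify the best-of-N numbers quoted above (perfect verifier assumed).
import math

def p_success(p, n):
    """P(at least one correct among n independent attempts)."""
    return 1 - (1 - p) ** n

def n_needed(p, delta):
    """Smallest N with failure probability at most delta."""
    return math.ceil(math.log(1 / delta) / -math.log(1 - p))

print(round(p_success(0.10, 50), 3))   # 0.995
print(round(p_success(0.01, 100), 3))  # 0.634
print(n_needed(0.10, 0.01))            # 44 samples for 99% success at p=0.1
```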

Intuition

If you can check answers cheaply, you can compensate for a weak generator by sampling many times. A model that solves a problem 10% of the time can achieve 99.5% success with 50 attempts and a reliable verifier. This is the core insight behind test-time compute scaling: spend more compute at inference by generating and verifying multiple attempts.

Why It Matters

Best-of-N with a verifier is the simplest form of test-time compute scaling. It explains why reasoning models appear to get smarter when given more time: they generate more candidate solutions and use verifiers to select the best one. The ceiling is the verifier's ability to distinguish correct from incorrect solutions.

Failure Mode

This analysis assumes a perfect verifier. In practice, the verifier is imperfect: it may score an incorrect solution highly (false positive) or reject a correct solution (false negative). With an imperfect verifier, increasing $N$ past a certain point stops helping because the bottleneck becomes verifier accuracy, not generator coverage. If the verifier accepts incorrect solutions at rate $q$ and correct ones at rate $1 - \beta$, the pool of accepted candidates fills with false positives in proportion to $N(1-p)q$, so for large $N$ the probability of selecting a correct solution plateaus near $\frac{p(1-\beta)}{p(1-\beta) + (1-p)q}$ instead of approaching 1.
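A small simulation makes the plateau visible. The sketch below assumes a binary accept/reject verifier with toy error rates and uniform selection among accepted candidates; all numbers are illustrative, not measured from any real system:

```python
# Monte Carlo sketch of best-of-N with an imperfect binary verifier.
# p (generator accuracy), q (false positive rate) and beta (false
# negative rate) are toy values; selection among accepted samples
# is uniform, a simplifying assumption.
import random

def best_of_n_success(p=0.05, q=0.10, beta=0.05, n=10, trials=4000, seed=0):
    """Fraction of trials where the candidate chosen from the
    verifier-approved pool is actually correct."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        approved = []
        for _ in range(n):
            correct = rng.random() < p
            accepted = rng.random() < ((1 - beta) if correct else q)
            if accepted:
                approved.append(correct)
        if approved and rng.choice(approved):
            wins += 1
    return wins / trials

for n in (1, 10, 100, 500):
    print(n, round(best_of_n_success(n=n), 3))
```

With these toy rates the success probability climbs with $n$ but levels off around one third, far below the near-certainty a perfect verifier would deliver.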

Beyond Best-of-N: Tree Search

Best-of-N generates complete solutions independently. More sophisticated search methods use the verifier to guide partial solutions:

Beam search with PRM. Generate step by step. At each step, generate $B$ candidate next steps, score each with the PRM, keep the top $K$, and continue. This prunes bad reasoning paths early instead of completing them.
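A minimal sketch of PRM-guided beam search. `propose_steps`, `prm_score`, and `is_done` are placeholders for the generator, the process reward model, and a termination check; the demo runs them on a toy string-building task rather than real reasoning:

```python
# Sketch of step-level beam search guided by a PRM score.
import heapq

def beam_search(problem, propose_steps, prm_score, is_done,
                beam_width=4, expand=8, max_depth=10):
    """Keep the beam_width highest-scoring partial chains at each depth."""
    beam = [(0.0, [])]  # (cumulative PRM score, steps so far)
    for _ in range(max_depth):
        candidates = []
        for score, steps in beam:
            if is_done(steps):          # finished chains are carried forward
                candidates.append((score, steps))
                continue
            for step in propose_steps(problem, steps, k=expand):
                candidates.append((score + prm_score(problem, steps + [step]),
                                   steps + [step]))
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        if all(is_done(s) for _, s in beam):
            break
    return max(beam, key=lambda c: c[0])

# Toy demo: build the string "abc" one character at a time; the "PRM"
# rewards prefixes of the target and penalizes everything else.
target = "abc"
propose = lambda problem, steps, k: list("abcx")[:k]
score = lambda problem, steps: 1.0 if target.startswith("".join(steps)) else -1.0
done = lambda steps: "".join(steps) == target
best_score, best_steps = beam_search(target, propose, score, done, max_depth=3)
print("".join(best_steps))  # abc
```

Bad prefixes ("x…", "b…") accumulate negative scores and fall out of the beam after one or two steps, which is exactly the early-pruning behavior described above.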

Monte Carlo Tree Search (MCTS). Build a tree of reasoning steps. Use the PRM as a value function to estimate the probability of reaching a correct answer from each node. Balance exploration (trying new steps) and exploitation (extending promising paths). AlphaGo used MCTS for Go; the same principle applies to mathematical reasoning.

Sequential refinement. Generate a complete solution, use a verifier to identify errors, then generate a revised solution conditioned on the critique. Repeat until the verifier is satisfied or a compute budget is exhausted.

Self-Verification

Proposition

Limitations of Self-Verification

Statement

When a model verifies its own output, systematic biases present during generation persist during verification. If the model consistently makes a specific type of reasoning error (e.g., sign errors in algebra, off-by-one errors in counting), it will also fail to detect that error during self-verification.

Formally: let $e$ be a systematic error pattern. If $P(\text{generate } e | \text{problem type}) > 0$, then typically $P(\text{detect } e | \text{verify own output}) < P(\text{detect } e | \text{external verifier})$ because the same learned heuristics that produce the error also fail to flag it during review.

Intuition

Asking a model to check its own work is like asking a student who made an algebra mistake to recheck by doing the same algebra. The student will likely make the same mistake again. An independent checker (a different model, a calculator, a proof assistant) does not share the same blind spots.

Why It Matters

Self-verification is appealing because it requires no external tools or additional models. But its limitations are important: for hard problems where the model's error rate is high, self-verification provides diminishing returns. The best systems combine self-verification (cheap, broad) with external verification (reliable, narrow) where available.

Failure Mode

Self-verification can create a false sense of confidence. If the model generates an incorrect answer and then "verifies" it as correct (because it has the same blind spot), the user receives a wrong answer that the model explicitly claims to have checked. This is worse than no verification because it increases trust in an incorrect result.

Making Self-Verification Work Better

Several techniques improve self-verification despite its inherent limitations:

Diverse sampling. Generate the solution multiple times with different prompts, temperatures, or chain-of-thought structures. Check whether the answers agree. Inconsistency signals potential errors. This works because different generation paths activate different heuristics, breaking the correlation between generation and verification errors.
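The agreement check itself is a few lines. The sampler below is a stand-in for calling the model with varied temperature or prompt; the agreement threshold is an illustrative choice:

```python
# Sketch of the diverse-sampling consistency check: draw several answers
# and flag the problem when they disagree too much. sample_answer is a
# placeholder for a model call with varied temperature/prompt.
from collections import Counter

def consistency_check(sample_answer, n=8, agreement_threshold=0.75):
    """Return (modal answer, agreement fraction, passed-threshold flag)."""
    answers = [sample_answer(i) for i in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    agreement = count / n
    return answer, agreement, agreement >= agreement_threshold

# Toy demo: 6 of 8 diverse samples agree on 42.
samples = [42, 42, 42, 41, 42, 42, 40, 42]
ans, agree, ok = consistency_check(lambda i: samples[i])
print(ans, agree, ok)  # 42 0.75 True
```

Low agreement does not say which answer is wrong, only that the problem deserves more compute or an external check.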

Role separation. Prompt the model differently for generation and verification. The generator is told to solve the problem. The verifier is told to critique a solution written by "someone else." This can partially decouple the biases, though the underlying model weights are the same.

Formal grounding. When the model generates a mathematical claim, ask it to verify by computation rather than by re-reading the argument. For example: "compute the left-hand side and the right-hand side of this equation separately and check if they are equal." This converts verification from pattern matching to computation, which is more reliable.
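The "verify by computation" idea can be sketched numerically: evaluate both sides of a claimed identity at random points instead of re-reading the derivation. Agreement at random points is strong evidence, not a proof; the identity below is just an example:

```python
# Numeric spot-check of a claimed two-variable identity: evaluate both
# sides at random points and compare within a relative tolerance.
import random

def check_identity(lhs, rhs, n_points=20, tol=1e-9, seed=0):
    rng = random.Random(seed)
    for _ in range(n_points):
        a, b = rng.uniform(-10, 10), rng.uniform(-10, 10)
        if abs(lhs(a, b) - rhs(a, b)) > tol * max(1.0, abs(lhs(a, b))):
            return False
    return True

# (a+b)^2 = a^2 + 2ab + b^2 passes; dropping the cross term fails.
print(check_identity(lambda a, b: (a + b) ** 2,
                     lambda a, b: a * a + 2 * a * b + b * b))  # True
print(check_identity(lambda a, b: (a + b) ** 2,
                     lambda a, b: a * a + b * b))              # False
```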

Connection to Test-Time Compute

The verifier-guided search framework explains the test-time compute scaling observed in reasoning models:

  1. More compute = more samples. With budget $C$, generate $C/c$ solutions (where $c$ is the cost per solution). More solutions mean higher probability of finding a correct one.

  2. More compute = deeper search. With budget $C$, run beam search or MCTS with wider beams or deeper trees. The PRM guides the search toward promising paths.

  3. Compute-optimal allocation. Given a fixed budget, how should you split between generation (more samples) and verification (better scoring)? Empirically, the optimal split depends on problem difficulty: easy problems benefit more from a few high-quality samples with a simple verifier; hard problems benefit from many samples with a strong verifier.

The key insight: verifier quality determines the ceiling of test-time compute scaling. No amount of sampling helps if the verifier cannot distinguish correct from incorrect solutions. Improving verifiers (better PRMs, more reliable execution environments, formal verification coverage) directly increases the returns from test-time compute.

Common Confusions

Watch Out

PRM does not guarantee step-level correctness

A PRM predicts whether each step is correct. It is a learned model with nonzero error rate. A PRM score of 0.95 for a step means the PRM thinks the step is probably correct, not that it is correct. Only formal verifiers and execution-based verifiers can guarantee correctness.

Watch Out

Best-of-N is not the same as majority voting

Best-of-N selects the highest-scoring solution according to a verifier. Majority voting selects the most common answer among $N$ samples. They differ when the verifier disagrees with the majority. For math, majority voting is a weak baseline because many incorrect solutions can agree on the same wrong answer (e.g., common computation errors that multiple samples reproduce).
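A toy contrast: several samples reproduce the same common computational error, so the majority answer is wrong, while a verifier (stubbed here as a ground-truth check for illustration) picks out the correct minority answer:

```python
# Toy contrast between majority voting and verifier-based best-of-N.
# The "verifier" is stubbed as a ground-truth comparison purely to
# illustrate the divergence between the two selection rules.
from collections import Counter

samples = ["12", "12", "12", "13", "12", "13"]  # a common error yields 12
reference = "13"

majority = Counter(samples).most_common(1)[0][0]
verifier_pick = next((s for s in samples if s == reference), None)

print("majority vote :", majority)       # 12 (wrong)
print("best-of-N pick:", verifier_pick)  # 13 (correct)
```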

Watch Out

Self-verification is not verification

When papers report "the model verified its answer," this means the model generated text asserting correctness. It does not mean correctness was checked against ground truth. Self-verification reduces but does not eliminate errors, and the reduction is smallest for the hardest problems where verification matters most.

Summary

  • ORM: scores final answer only. Unbiased but sparse, no credit assignment
  • PRM: scores each step. Dense signal, better credit assignment, but biased by PRM errors
  • PRM error accumulates as $O(T \cdot \epsilon_{\text{step}})$ over $T$ steps
  • Monte Carlo PRM training: use rollout success rates to label step correctness
  • Verifier types: formal (perfect, narrow), execution (reliable, narrow), discriminative (moderate, broad), generative (weak, very broad)
  • Best-of-N with verifier: $P(\text{success}) = 1 - (1-p)^N$. Sample many, verify, select
  • Tree search (beam, MCTS) uses PRM to prune bad paths early
  • Self-verification is limited by shared biases between generator and verifier
  • Verifier quality is the ceiling on test-time compute scaling returns

Exercises

ExerciseCore

Problem

A model solves a particular math problem correctly with probability $p = 0.05$ per attempt. You have a perfect verifier (it always correctly identifies correct solutions). How many independent attempts $N$ do you need to find a correct solution with probability at least 0.99?

ExerciseAdvanced

Problem

A PRM has per-step error rate $\epsilon = 0.02$ (probability of labeling a correct step as incorrect, or an incorrect step as correct). For a reasoning chain of $T = 30$ steps, estimate the probability that the PRM makes zero errors across all steps. Compare this to a chain of $T = 5$ steps.

ExerciseResearch

Problem

Design a hybrid verification system for a math reasoning model that combines three verification methods: (a) self-verification, (b) a trained PRM, and (c) a symbolic math engine (e.g., SymPy). Specify when each method is used, how conflicts are resolved, and what the system does when the symbolic engine cannot parse the model's reasoning.

References

Canonical:

  • Cobbe et al., "Training Verifiers to Solve Math Word Problems" (2021). ORM for math.
  • Lightman et al., "Let's Verify Step by Step" (2023). Process reward models.

Current:

  • Wang et al., "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations" (2024). Monte Carlo PRM training.
  • Snell et al., "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" (2024). Test-time compute scaling.
  • OpenAI, "Learning to Reason with LLMs" (2024). o1 system description.

Last reviewed: April 2026