Verifier Design and Process Reward
Detailed treatment of verifier types, process vs outcome reward models, verifier-guided search, self-verification, and the connection to test-time compute scaling. How to design reward signals for reasoning models.
Why This Matters
The reward-models-and-verifiers page introduced the distinction between learned reward models and objective verifiers. This page goes deeper into the design of verification systems for reasoning: how to build process reward models, when to use outcome vs. process rewards, how verifiers guide search at inference time, and why this connects to the scaling of test-time compute.
Every frontier reasoning model in 2025-2026 (OpenAI o-series, DeepSeek-R1, Claude with extended thinking) uses some form of verifier-guided generation. The design of the verifier is often more important than the design of the generator.
Outcome Reward Models (ORM)
Outcome Reward Model
An outcome reward model evaluates only the final answer:

$$R_{\text{ORM}}(x, y) = \mathbb{1}[\hat{a}(y) = a^*(x)]$$

where $x$ is the problem, $y$ is the full solution (including reasoning chain), and $\hat{a}(y)$ is the extracted final answer, compared against the reference answer $a^*(x)$.
For math problems: check if the numerical answer matches the reference. For code: run the code against test cases and check pass/fail.
The ORM provides a binary or scalar signal at the end of the reasoning chain and says nothing about the intermediate steps.
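As a concrete sketch, an ORM for math can be as simple as extracting the final number and comparing it to the reference. The extraction heuristic below (last number in the text) is an illustrative assumption, not a standard:

```python
import re

def extract_final_answer(solution: str):
    """Pull the last number from a solution string (a crude, illustrative heuristic)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution)
    return numbers[-1] if numbers else None

def orm_score(solution: str, reference: str) -> int:
    """Outcome reward: 1 if the extracted final answer matches the reference,
    else 0. Intermediate steps are never inspected."""
    answer = extract_final_answer(solution)
    return int(answer is not None and float(answer) == float(reference))

# Step 2 is wrong (36 + 5 = 41, not 40), but the final answer is right,
# so the ORM still returns 1 -- it cannot see the step-level error.
chain = "Step 1: 12 * 3 = 36. Step 2: 36 + 5 = 40. Step 3: final answer is 41."
print(orm_score(chain, "41"))  # 1
```

The deliberately wrong intermediate step is invisible to the ORM: only the final answer is checked.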
Strengths of ORM:
- No bias from a learned step evaluator; the answer is either correct or not
- Simple to implement for domains with verifiable answers (math, code, logic)
- Scales with compute: generate many solutions, check each, select the best
Weaknesses of ORM:
- Sparse signal: a 20-step reasoning chain gets a single bit of feedback
- No credit assignment: which step caused the error?
- High variance: with a fixed compute budget, you might not sample any correct solution for hard problems
Process Reward Models (PRM)
PRM Credit Assignment Advantage
Statement
Consider a reasoning chain of $n$ steps where step $k$ introduces an error (all steps before $k$ are correct, step $k$ is wrong). An ORM assigns the same negative signal to all $n$ steps (it cannot distinguish which step caused the failure). A PRM assigns positive signal to steps $1, \dots, k-1$ and negative signal to step $k$.
For RL training, the effective gradient signal per step is:
- ORM: Each step receives gradient proportional to $R/n$, where $R$ is the single terminal reward. The signal is diluted across all $n$ steps, including correct ones.
- PRM: Step $k$ receives a targeted negative gradient. Steps before $k$ receive positive gradients. The error is localized.
The PRM's advantage grows with chain length $n$: for a chain of 20 steps with one error, the ORM dilutes the error signal by a factor of 20, while the PRM targets it directly.
Intuition
Imagine a teacher grading a math exam. The ORM teacher only checks the final answer: "wrong." The student does not know which of 20 steps was the mistake. The PRM teacher checks each line: "steps 1-14 correct, step 15 wrong, steps 16-20 built on the error." The student knows exactly where to improve.
Why It Matters
This credit assignment advantage is why dense step-level signal is valuable for training reasoning models whose chains run to 10-100 steps, and OpenAI's process-supervised approaches exploit exactly this signal. (DeepSeek-R1's GRPO training, by contrast, uses rule-based outcome rewards and compensates for the sparse signal with many rollouts per prompt.) With ORM alone, training such long chains requires large sample sizes to overcome the variance from sparse rewards.
Failure Mode
The PRM is itself a learned model. If the PRM incorrectly labels a wrong step as correct, the generator learns to reproduce that error. PRM errors compound across steps: a single false positive can derail the entire remaining chain. The probability of at least one PRM error in a chain grows as $1 - (1 - \epsilon)^n \approx n\epsilon$, where $\epsilon$ is the per-step PRM error rate and $n$ the chain length. For very long chains, this accumulated error can exceed the ORM's variance.
PRM Training Methods
Human annotation. Annotators label each reasoning step as correct or incorrect. This produces high-quality labels but is expensive ($10-50 per problem for detailed step-level annotation) and does not scale.
Monte Carlo estimation (Math-Shepherd). For each intermediate step $s_i$ in a reasoning chain:
- Generate $N$ completions from step $i$ to the final answer
- Check each completion against the ground truth
- Estimate $\hat{p}_i = \frac{\#\{\text{correct completions}\}}{N} \approx P(\text{correct final answer} \mid s_1, \dots, s_i)$
- Label step $i$ as "correct" if this probability exceeds a threshold
This converts an ORM signal into step-level labels via rollouts. The labels are noisy (the variance of $\hat{p}_i$ scales as $1/N$) but the method scales because it requires only a correctness checker for final answers, not human step-level annotation.
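A minimal sketch of this labeling loop, with `rollout_fn` and `check_fn` as hypothetical stand-ins for the actual generator and final-answer checker:

```python
import random

def mc_step_label(prefix_steps, rollout_fn, check_fn, n_rollouts=8, threshold=0.5):
    """Math-Shepherd-style Monte Carlo label for one intermediate step.

    prefix_steps: reasoning steps s_1..s_i taken so far
    rollout_fn:   samples a final answer given the prefix
    check_fn:     returns True if a final answer matches the ground truth
    """
    successes = sum(check_fn(rollout_fn(prefix_steps)) for _ in range(n_rollouts))
    p_hat = successes / n_rollouts  # estimate of P(correct | s_1..s_i)
    return ("correct" if p_hat >= threshold else "incorrect"), p_hat

# Toy demo: a prefix from which rollouts reach the right answer ~75% of the time.
random.seed(0)
label, p_hat = mc_step_label(
    prefix_steps=["step 1", "step 2"],
    rollout_fn=lambda prefix: 42 if random.random() < 0.75 else 0,
    check_fn=lambda ans: ans == 42,
)
print(label, p_hat)
```

With more rollouts per step the label noise shrinks, at linear cost in verifier calls.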
Self-supervised PRM training. Use the model itself to generate both the reasoning chains and the step-level evaluations. Train the PRM on a mixture of Monte Carlo labels and model-generated critiques. This creates a bootstrap loop: the PRM improves, which improves the generator, which produces better training data for the PRM.
Verifier Taxonomy
Verifier Types
Verifiers can be classified by their mechanism:
- Execution-based verifiers. Run code, check mathematical equalities, query databases. The result is objective. Example: running Python code against test cases.
- Discriminative verifiers. A trained model that takes (problem, solution) as input and outputs a correctness score. This is a reward model specialized for correctness rather than preference. Example: a classifier trained to predict whether a math solution is correct.
- Generative verifiers. A language model that generates a critique or verification of the solution. The verification itself is a text output that reasons about correctness. Example: prompting a model with "Check this solution step by step and identify any errors."
- Formal verifiers. Proof assistants (Lean, Isabelle, Coq) that check mathematical proofs against axioms. The verification is logically complete: if the proof checks, it is correct. Currently limited to domains with formal specifications.
The verifier types form a spectrum of reliability:
| Verifier Type | Reliability | Coverage | Cost |
|---|---|---|---|
| Formal | Perfect (sound and complete) | Very narrow (formal proofs only) | High setup cost |
| Execution-based | Very high (if tests are good) | Narrow (code, computable math) | Low per query |
| Discriminative | Moderate (learned, can err) | Broad (any domain) | Low per query |
| Generative | Low-moderate (depends on model) | Very broad (any text output) | Moderate per query |
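The execution-based row of this table can be made concrete in a few lines: run a candidate function against test cases and report pass/fail. The helper below is an illustrative sketch:

```python
def execution_verify(candidate, test_cases):
    """Execution-based verification: run the candidate against (args, expected)
    pairs. Any wrong output or raised exception counts as a failure."""
    for args, expected in test_cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False
    return True

# Candidate solution for "add two integers":
add = lambda a, b: a + b
print(execution_verify(add, [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]))  # True
print(execution_verify(lambda a, b: a - b, [((1, 2), 3)]))              # False
```

Reliability here is only as good as the test cases: a solution that passes weak tests can still be wrong, which is why the table qualifies this row with "if tests are good."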
Verifier-Guided Search
Best-of-N Scaling with Verifiers
Statement
Generate $N$ independent solutions to a problem. Use a verifier to score each solution and select the highest-scoring one. If the generator has probability $p$ of producing a correct solution on a single attempt, then the probability of finding at least one correct solution among $N$ attempts is:

$$P(\text{at least one correct}) = 1 - (1 - p)^N$$

For $p = 0.1$ and $N = 10$: $1 - 0.9^{10} \approx 0.65$. For $p = 0.1$ and $N = 50$: $1 - 0.9^{50} \approx 0.995$.
The number of samples needed to achieve success probability $1 - \delta$ is:

$$N = \frac{\ln \delta}{\ln(1 - p)} \approx \frac{\ln(1/\delta)}{p}$$

for small $p$.
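These quantities are easy to check numerically (function names are illustrative):

```python
import math

def coverage(p: float, n: int) -> float:
    """P(at least one correct among n independent samples): 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n

def samples_needed(p: float, q: float) -> int:
    """Smallest n with coverage(p, n) >= q."""
    return math.ceil(math.log(1 - q) / math.log(1 - p))

print(round(coverage(0.1, 10), 3))  # 0.651
print(round(coverage(0.1, 50), 3))  # 0.995
print(samples_needed(0.1, 0.99))    # 44
```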
Intuition
If you can check answers cheaply, you can compensate for a weak generator by sampling many times. A model that solves a problem 10% of the time can achieve 99.5% success with 50 attempts and a reliable verifier. This is the core insight behind test-time compute scaling: spend more compute at inference by generating and verifying multiple attempts.
Why It Matters
Best-of-N with a verifier is the simplest form of test-time compute scaling. It explains why reasoning models appear to get smarter when given more time: they generate more candidate solutions and use verifiers to select the best one. The ceiling is the verifier's ability to distinguish correct from incorrect solutions.
Failure Mode
This analysis assumes a perfect verifier. In practice, the verifier is imperfect: it may score an incorrect solution highly (false positive) or reject a correct solution (false negative). With an imperfect verifier, increasing $N$ past a certain point stops helping because the bottleneck becomes verifier accuracy, not generator coverage. If the verifier has false positive rate $\epsilon_{fp}$, then among $N$ samples roughly $N(1-p)\epsilon_{fp}$ incorrect solutions are scored as correct alongside roughly $Np$ correct ones, so the error rate approaches a floor of approximately $\frac{(1-p)\epsilon_{fp}}{p + (1-p)\epsilon_{fp}}$ rather than zero, creating a practical plateau.
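A toy Monte Carlo simulation makes the plateau visible. The model below assumes a binary verifier with no false negatives and a fixed false positive rate; all rates are illustrative:

```python
import random

def best_of_n_accuracy(p, fp_rate, n, trials=5000, rng=None):
    """Simulate best-of-n with a noisy binary verifier.

    Assumptions (illustrative): correct solutions always score 1 (no false
    negatives); incorrect solutions score 1 with probability fp_rate; ties
    among top-scoring samples are broken uniformly at random.
    """
    rng = rng or random.Random(0)
    wins = 0
    for _ in range(trials):
        correct = [rng.random() < p for _ in range(n)]
        scores = [1 if c or rng.random() < fp_rate else 0 for c in correct]
        best = max(scores)
        pool = [c for c, s in zip(correct, scores) if s == best]
        wins += rng.choice(pool)  # True counts as 1
    return wins / trials

# Accuracy rises with n but plateaus near p / (p + (1-p)*fp_rate) ~ 0.69,
# far below the ~1.0 that a perfect verifier would reach.
for n in (1, 10, 50, 200):
    print(n, round(best_of_n_accuracy(p=0.1, fp_rate=0.05, n=n), 3))
```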
Beyond Best-of-N: Tree Search
Best-of-N generates complete solutions independently. More sophisticated search methods use the verifier to guide partial solutions:
Beam search with PRM. Generate step by step. At each step, generate candidate next steps, score each with the PRM, keep the top , and continue. This prunes bad reasoning paths early instead of completing them.
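A schematic sketch of this procedure, with `propose_fn` and `prm_score_fn` as hypothetical stand-ins for the generator and the PRM:

```python
def beam_search(problem, propose_fn, prm_score_fn, beam_width=4, expand=4, max_steps=10):
    """Step-level beam search guided by a PRM (schematic sketch).

    propose_fn(problem, steps)   -> candidate next steps (a "DONE" step terminates)
    prm_score_fn(problem, steps) -> score for a partial chain, higher is better
    """
    beams = [[]]  # each beam is the list of steps taken so far
    for _ in range(max_steps):
        candidates = []
        for steps in beams:
            if steps and steps[-1] == "DONE":
                candidates.append(steps)  # finished chains stay in the pool
                continue
            for step in propose_fn(problem, steps)[:expand]:
                candidates.append(steps + [step])
        # Keep only the top-scoring chains: bad paths are pruned early
        # instead of being completed and rejected at the end.
        candidates.sort(key=lambda s: prm_score_fn(problem, s), reverse=True)
        beams = candidates[:beam_width]
        if all(s and s[-1] == "DONE" for s in beams):
            break
    return beams[0]

# Toy demo: the "PRM" rewards "good" steps and penalizes "bad" ones.
best = beam_search(
    "toy problem",
    propose_fn=lambda prob, steps: ["good", "bad", "DONE"],
    prm_score_fn=lambda prob, steps: sum(
        +1 if s == "good" else -1 if s == "bad" else 0 for s in steps
    ),
)
print(best)
```

Even in this toy, "bad" steps never survive the pruning, which is the whole point: wasted compute on doomed chains is reclaimed.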
Monte Carlo Tree Search (MCTS). Build a tree of reasoning steps. Use the PRM as a value function to estimate the probability of reaching a correct answer from each node. Balance exploration (trying new steps) and exploitation (extending promising paths). AlphaGo used MCTS for Go; the same principle applies to mathematical reasoning.
Sequential refinement. Generate a complete solution, use a verifier to identify errors, then generate a revised solution conditioned on the critique. Repeat until the verifier is satisfied or a compute budget is exhausted.
Self-Verification
Limitations of Self-Verification
Statement
When a model verifies its own output, systematic biases present during generation persist during verification. If the model consistently makes a specific type of reasoning error (e.g., sign errors in algebra, off-by-one errors in counting), it will also fail to detect that error during self-verification.
Formally: let $e$ be a systematic error pattern. If the generator produces $e$ with high probability, $P_\theta(e \mid x) \gg 0$, then the probability that the same model flags $e$ during review, $P_\theta(\text{detect } e \mid x, e)$, is typically low, because the same learned heuristics that produce the error also fail to flag it during review.
Intuition
Asking a model to check its own work is like asking a student who made an algebra mistake to recheck by doing the same algebra. The student will likely make the same mistake again. An independent checker (a different model, a calculator, a proof assistant) does not share the same blind spots.
Why It Matters
Self-verification is appealing because it requires no external tools or additional models. But its limitations are important: for hard problems where the model's error rate is high, self-verification provides diminishing returns. The best systems combine self-verification (cheap, broad) with external verification (reliable, narrow) where available.
Failure Mode
Self-verification can create a false sense of confidence. If the model generates an incorrect answer and then "verifies" it as correct (because it has the same blind spot), the user receives a wrong answer that the model explicitly claims to have checked. This is worse than no verification because it increases trust in an incorrect result.
Making Self-Verification Work Better
Several techniques improve self-verification despite its inherent limitations:
Diverse sampling. Generate the solution multiple times with different prompts, temperatures, or chain-of-thought structures. Check whether the answers agree. Inconsistency signals potential errors. This works because different generation paths activate different heuristics, breaking the correlation between generation and verification errors.
Role separation. Prompt the model differently for generation and verification. The generator is told to solve the problem. The verifier is told to critique a solution written by "someone else." This can partially decouple the biases, though the underlying model weights are the same.
Formal grounding. When the model generates a mathematical claim, ask it to verify by computation rather than by re-reading the argument. For example: "compute the left-hand side and the right-hand side of this equation separately and check if they are equal." This converts verification from pattern matching to computation, which is more reliable.
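The last two techniques can be sketched in a few lines; the numeric sample-point check below is an illustrative substitute for full symbolic verification:

```python
from collections import Counter

def agreement(sample_fn, k=5):
    """Diverse sampling: draw k answers; low agreement flags a suspect answer."""
    answers = [sample_fn() for _ in range(k)]
    top, count = Counter(answers).most_common(1)[0]
    return top, count / k

def numeric_equation_check(lhs_fn, rhs_fn, points, tol=1e-9):
    """Formal grounding by computation: evaluate both sides at sample points
    instead of re-reading the argument. (A numeric stand-in for symbolic checking.)"""
    return all(abs(lhs_fn(x) - rhs_fn(x)) < tol for x in points)

# Claimed identity: (x + 1)^2 = x^2 + 2x + 1
print(numeric_equation_check(lambda x: (x + 1) ** 2,
                             lambda x: x * x + 2 * x + 1,
                             points=[-2.0, 0.0, 1.5, 3.0]))  # True

# A sign-error version fails the computed check even if it "reads" plausibly:
print(numeric_equation_check(lambda x: (x + 1) ** 2,
                             lambda x: x * x - 2 * x + 1,
                             points=[-2.0, 0.0, 1.5, 3.0]))  # False
```

Passing at a handful of points does not prove an identity, but failing at any point refutes it, which is already far stronger than re-reading the derivation.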
Connection to Test-Time Compute
The verifier-guided search framework explains the test-time compute scaling observed in reasoning models:
- More compute = more samples. With budget $B$, generate $N = B / c$ solutions (where $c$ is the cost per solution). More solutions mean higher probability of finding a correct one.
- More compute = deeper search. With budget $B$, run beam search or MCTS with wider beams or deeper trees. The PRM guides the search toward promising paths.
- Compute-optimal allocation. Given a fixed budget, how should you split between generation (more samples) and verification (better scoring)? Empirically, the optimal split depends on problem difficulty: easy problems benefit more from a few high-quality samples with a simple verifier; hard problems benefit from many samples with a strong verifier.
The key insight: verifier quality determines the ceiling of test-time compute scaling. No amount of sampling helps if the verifier cannot distinguish correct from incorrect solutions. Improving verifiers (better PRMs, more reliable execution environments, formal verification coverage) directly increases the returns from test-time compute.
Common Confusions
PRM does not guarantee step-level correctness
A PRM predicts whether each step is correct. It is a learned model with nonzero error rate. A PRM score of 0.95 for a step means the PRM thinks the step is probably correct, not that it is correct. Only formal verifiers and execution-based verifiers can guarantee correctness.
Best-of-N is not the same as majority voting
Best-of-N selects the highest-scoring solution according to a verifier. Majority voting selects the most common answer among samples. They differ when the verifier disagrees with the majority. For math, majority voting is a weak baseline because many incorrect solutions can agree on the same wrong answer (e.g., common computation errors that multiple samples reproduce).
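The difference is easy to see on a toy example (answers and scores are made up): four samples reproduce the same computational slip, one sample is correct, and only the verifier-based selection recovers it:

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common final answer among samples."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, verifier_scores):
    """Pick the answer from the sample the verifier scores highest."""
    return max(zip(answers, verifier_scores), key=lambda pair: pair[1])[0]

# Four samples reproduce the same wrong answer (40); one reaches the
# correct answer (41), and the verifier scores that sample highest.
answers = [40, 40, 40, 41, 40]
scores = [0.2, 0.3, 0.1, 0.9, 0.2]
print(majority_vote(answers))      # 40 -- the common wrong answer wins the vote
print(best_of_n(answers, scores))  # 41 -- the verifier rescues the correct one
```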
Self-verification is not verification
When papers report "the model verified its answer," this means the model generated text asserting correctness. It does not mean correctness was checked against ground truth. Self-verification reduces but does not eliminate errors, and the reduction is smallest for the hardest problems where verification matters most.
Summary
- ORM: scores final answer only. Unbiased but sparse, no credit assignment
- PRM: scores each step. Dense signal, better credit assignment, but biased by PRM errors
- PRM error accumulates as $1 - (1-\epsilon)^n \approx n\epsilon$ over $n$ steps
- Monte Carlo PRM training: use rollout success rates to label step correctness
- Verifier types: formal (perfect, narrow), execution (reliable, narrow), discriminative (moderate, broad), generative (weak, very broad)
- Best-of-N with verifier: $P(\text{at least one correct}) = 1 - (1-p)^N$. Sample many, verify, select
- Tree search (beam, MCTS) uses PRM to prune bad paths early
- Self-verification is limited by shared biases between generator and verifier
- Verifier quality is the ceiling on test-time compute scaling returns
Exercises
Problem
A model solves a particular math problem correctly with probability $p$ per attempt. You have a perfect verifier (it always correctly identifies correct solutions). How many independent attempts $N$ do you need to find a correct solution with probability at least 0.99? Express $N$ in terms of $p$.
Problem
A PRM has per-step error rate $\epsilon$ (the probability of labeling a correct step as incorrect, or an incorrect step as correct). For a reasoning chain of $n$ steps, estimate the probability that the PRM makes zero errors across all $n$ steps. How does this probability change as the chain grows longer?
Problem
Design a hybrid verification system for a math reasoning model that combines three verification methods: (a) self-verification, (b) a trained PRM, and (c) a symbolic math engine (e.g., SymPy). Specify when each method is used, how conflicts are resolved, and what the system does when the symbolic engine cannot parse the model's reasoning.
References
Canonical:
- Cobbe et al., "Training Verifiers to Solve Math Word Problems" (2021). ORM for math.
- Lightman et al., "Let's Verify Step by Step" (2023). Process reward models.
Current:
- Wang et al., "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations" (2024). Monte Carlo PRM training.
- Snell et al., "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" (2024). Test-time compute scaling.
- OpenAI, "Learning to Reason with LLMs" (2024). o1 system description.
Next Topics
- Test-time compute and search: the broader framework for trading inference compute for accuracy
- Reward models and verifiers: the prerequisite covering the reward model side
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Reward Models and Verifiers (Layer 5)
- RLHF and Alignment (Layer 4)
- Policy Gradient Theorem (Layer 3)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)