AI Safety
Reward Models and Verifiers
Reward models trained on human preferences vs verifiers that check output correctness. Bradley-Terry models, process vs outcome rewards, Goodhart's law, and why verifiers are more robust.
Why This Matters
Every post-training method needs a training signal: something that tells the model which outputs are good and which are bad. This signal comes from either a reward model (a learned function that predicts human preferences) or a verifier (a system that checks output correctness).
The distinction between reward models and verifiers is one of the most important in modern AI alignment. Reward models are flexible but fragile. They can be hacked, they drift with distribution shift, and they encode biases from their training data. Verifiers are narrow and empirically more robust on domains with checkable answers. When they say an answer passes their check, it passes that check, but the check itself can be incomplete (partial test coverage, incorrect specifications, tamperable environments). Understanding this distinction explains why reasoning-focused models use verifiers and general-purpose models use reward models, and why the most capable systems combine both.
Mental Model
Think of two ways to evaluate a student's essay:
- Reward model approach: Ask a panel of judges to rate it. The judges are imperfect. They can be swayed by confident prose, they disagree with each other, and a clever student can learn to write essays that score well without actually being good.
- Verifier approach: Check each factual claim against a reference. The checker is narrow (it can only verify facts, not evaluate style) but reliable: a claim is either correct or not.
Reward models are like judges: flexible, subjective, hackable. Verifiers are like fact-checkers: narrow, objective, robust.
Reward Models
The Bradley-Terry Framework
Bradley-Terry Reward Model Training
Statement
Given a dataset of human preference comparisons $\mathcal{D} = \{(x, y_w, y_l)\}$, where $y_w$ is preferred over $y_l$, a reward model $r_\phi$ is trained by maximizing:
$$\mathcal{L}(\phi) = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$
Under the Bradley-Terry model, the probability of preferring $y_w$ over $y_l$ is:
$$P(y_w \succ y_l \mid x) = \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right) = \frac{\exp r_\phi(x, y_w)}{\exp r_\phi(x, y_w) + \exp r_\phi(x, y_l)}$$
The reward model learns a scalar scoring function $r_\phi(x, y)$ such that preferred outputs receive higher scores than dispreferred ones.
Intuition
The reward model converts ordinal preferences (A is better than B) into cardinal scores (A scores 3.2, B scores 1.7). The sigmoid function maps the score difference to a preference probability. The loss encourages the model to assign scores that are consistent with observed human preferences.
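This mapping from score difference to preference probability can be sketched in a few lines; the scores 3.2 and 1.7 reuse the example above, and no training loop is shown:

```python
import math

def bt_preference_prob(score_w: float, score_l: float) -> float:
    """P(first output preferred) under Bradley-Terry: sigmoid of the score gap."""
    return 1.0 / (1.0 + math.exp(-(score_w - score_l)))

def bt_loss(score_w: float, score_l: float) -> float:
    """Negative log-likelihood of the observed preference; training minimizes this."""
    return -math.log(bt_preference_prob(score_w, score_l))

# Ordinal preference -> cardinal scores: only the score *difference* matters,
# so (3.2, 1.7) and (10.0, 8.5) yield the same preference probability.
p = bt_preference_prob(3.2, 1.7)
```

Note that the scores are identifiable only up to a per-prompt additive constant, which is one reason raw reward-model scores are not comparable across prompts.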
Why It Matters
The reward model is the bridge between human judgment and machine optimization. In RLHF, the policy is optimized to maximize the reward model's score. In best-of-N sampling, the reward model selects the best candidate. The quality of the entire post-training pipeline depends on the quality of this single learned function.
Failure Mode
The Bradley-Terry model assumes a total ordering of outputs for each prompt: for any two outputs, one is definitively better. In reality, human preferences are often intransitive (A > B, B > C, but C > A), context-dependent, and influenced by factors irrelevant to quality (length, confidence, formatting). The reward model absorbs all of these imperfections and presents them as a clean scalar score, creating a false sense of precision.
Why Reward Models Are Fragile
Problem 1: Distributional shift. The reward model is trained on outputs from the reference policy $\pi_{\mathrm{ref}}$. As the policy changes during RL, it generates outputs increasingly different from the training distribution. The reward model's scores become unreliable on these out-of-distribution outputs. This is not a hypothetical concern. It is the primary failure mode in practice.
Problem 2: Reward features. Reward models learn spurious correlations between surface features and human preferences. Common examples:
- Longer responses tend to be rated higher, so the model learns "length = quality"
- Confident language is preferred, so the model learns "confidence = quality"
- Markdown formatting looks professional, so the model learns "formatting = quality"
A model optimized against these spurious features produces long, confident, well-formatted responses that may be substantively wrong.
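A toy illustration of the length confound; the weights here are invented (real reward models learn such correlations implicitly from preference data), but the failure pattern is the same:

```python
# Toy "reward model" with a hand-coded spurious length feature (made-up weights).
def toy_reward(response: str, correct: bool) -> float:
    quality = 2.0 if correct else 0.0        # the signal we actually want
    length_bias = 0.01 * len(response)       # the spurious correlation
    return quality + length_bias

short_right = toy_reward("42", correct=True)            # small score, correct
long_wrong = toy_reward("Certainly! " * 40, correct=False)  # verbose, wrong

# Once the length term dominates, a long wrong answer outscores a short right one.
hacked = long_wrong > short_right
```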
Problem 3: Goodhart's Law.
Reward Model Overoptimization
Statement
Let $r^*$ be the true (unobservable) reward and $\hat{r}$ the learned proxy reward. Assume the policy $\pi_\lambda$ is the KL-constrained optimum of $\hat{r}$ at inverse KL weight $\lambda$, and that the two expected rewards are differentiable in $\lambda$ (these are the regularity conditions under which the sign claims below hold. For approximate RL trajectories such as a fixed number of PPO steps, the signed statements become empirical tendencies rather than theorems). Then as the policy is optimized more aggressively against $\hat{r}$ (by decreasing $\lambda$), the proxy reward increases monotonically, while the true reward typically first increases and then decreases:
$$\frac{d}{d\lambda}\,\mathbb{E}_{\pi_\lambda}[\hat{r}] \le 0, \qquad \mathbb{E}_{\pi_\lambda}[r^*] \approx d\,(\alpha - \beta \log d), \quad d = \sqrt{D_{\mathrm{KL}}(\pi_\lambda \,\|\, \pi_{\mathrm{ref}})}$$
The first inequality is the envelope property of the KL-constrained optimum and requires the monotone-optimum assumption above. The second is the empirical overoptimization curve fit by Gao et al. 2023. The point where true reward peaks is the optimal level of optimization. Beyond it, further optimization actively makes the model worse according to the quantity we actually care about.
Intuition
Imagine optimizing a student's exam score. Initially, studying improves both the score and the knowledge. But eventually, the student discovers tricks that improve the score without improving knowledge (e.g., writing longer answers, using jargon). The score keeps going up while the actual knowledge plateaus and then declines (because time spent on tricks displaces time spent learning).
Why It Matters
This result, documented by Gao et al. (2023) with quantitative scaling laws, is the mathematical formalization of Goodhart's Law applied to RLHF. It means there is a fundamental limit to how much you can optimize against a proxy reward. The practical implication: you must monitor true quality (via human evaluation) during RL training and stop before overoptimization kicks in.
Failure Mode
The overoptimization point depends on the quality of the reward model, and you cannot determine it without access to the true reward (which is why you are using a proxy in the first place). In practice, teams use held-out human evaluations at checkpoints, but this is expensive and discrete. Between checkpoints, the model can overshoot into the overoptimized regime.
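A numeric sketch of the overoptimization curve, using the functional form Gao et al. (2023) fit for RL, $R_{\mathrm{gold}}(d) = d(\alpha - \beta \log d)$ with $d = \sqrt{\mathrm{KL}}$; the constants and the linear proxy are made up for illustration:

```python
import math

def gold_reward(d: float, alpha: float, beta: float) -> float:
    """Gao et al. (2023) RL fit: true reward as a function of d = sqrt(KL)."""
    return d * (alpha - beta * math.log(d))

def proxy_reward(d: float, alpha: float) -> float:
    """Toy proxy: keeps rising with optimization distance (illustrative only)."""
    return d * alpha

# Made-up constants: alpha ~ reward-model quality, beta ~ reward-model error.
alpha, beta = 1.0, 0.4

# Setting dR/dd = alpha - beta*(log d + 1) = 0 gives the true-reward peak:
d_star = math.exp(alpha / beta - 1.0)

# Past d_star, more optimization raises the proxy but lowers the true reward.
assert gold_reward(3 * d_star, alpha, beta) < gold_reward(d_star, alpha, beta)
assert proxy_reward(3 * d_star, alpha) > proxy_reward(d_star, alpha)
```

Because $d^* = e^{\alpha/\beta - 1}$, a better reward model (larger $\alpha$, smaller $\beta$) tolerates more optimization before Goodharting sets in.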
Verifiers
Verifiers check outputs against objective criteria rather than predicting human preferences. They do not rate. They verify.
Types of Verifiers
- Code execution: Run the generated code against test cases. Pass or fail.
- Math verification: Check the final numerical answer or use a formal proof assistant (Lean, Isabelle) to verify a proof.
- Fact-checking: Compare generated claims against a knowledge base or retrieve supporting/contradicting evidence.
- Constraint checking: Verify that the output satisfies specified format or content constraints.
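The first type can be sketched as a minimal code-execution verifier. The `solve` entry point is a convention invented for this sketch, and a real system would sandbox execution rather than call `exec` directly:

```python
def verify_code(candidate_src: str, test_cases: list[tuple[tuple, object]]) -> bool:
    """Run candidate source against test cases; pass or fail.

    A pass means only that *these* tests pass: coverage may be incomplete,
    which is exactly the 'incomplete check' caveat discussed above.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)          # compile and run the candidate
        solve = namespace["solve"]              # assumed entry point
        return all(solve(*args) == expected for args, expected in test_cases)
    except Exception:
        return False                            # any crash counts as a fail

good = "def solve(a, b):\n    return a + b\n"
bad = "def solve(a, b):\n    return a - b\n"
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
```

The binary pass/fail signal is what makes this hard to hack: there is no scalar to climb, only a property to satisfy.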
Why Verifiers Are More Robust
Verifiers check the output, not the process. A reward model says "this looks like a good response." A verifier says "this response is factually correct" or "this code passes all tests." The critical difference:
- Reward models can be hacked by outputs that look good but are wrong
- Verifiers are harder to reward-hack than scalar reward models because they check a property of the output rather than predicting a preference. They are not immune, though: unit-test verifiers can be gamed by memorizing tests or tampering with the evaluation procedure (Denison et al. 2024), formal proof verifiers still trust their axioms and specifications, and PRM step scores can themselves be Goodharted. On math benchmarks they resist the most common failure modes, but overoptimization still occurs (Gao et al. 2023; Skalse et al. 2022)
- Reward models degrade under distributional shift; verifiers are more stable across distributions within their checkable domain, but their signal is only as sound as the specification they check against
The limitation: verifiers only exist for domains with checkable answers. You cannot build a verifier for "is this essay insightful?" or "is this response helpful?" These subjective qualities require reward models.
Process Reward Models vs Outcome Reward Models
Process vs Outcome Reward Model Tradeoff
Statement
An outcome reward model (ORM) scores only the final answer: $r_{\mathrm{ORM}}(x, y)$. A process reward model (PRM) scores each intermediate step: $r_{\mathrm{PRM}}(x, s_1, \dots, s_t)$ for each $t = 1, \dots, T$.
The tradeoff:
- ORM: Unbiased (the final answer is either correct or not) but sparse (provides signal only at the end of a long reasoning chain). Variance scales with chain length.
- PRM: Dense signal (feedback at each step) but potentially biased (the PRM is a learned model that can be wrong about intermediate steps). Bias comes from PRM training errors.
For a chain of $T$ steps with per-step PRM error $\epsilon$: the total PRM bias scales as $O(T\epsilon)$, while ORM variance scales as $O(T/N)$, where $N$ is the number of rollouts.
Intuition
Judging a long math derivation by only looking at the final answer (ORM) is like grading an exam by only checking the last line. You miss where the student went wrong, and you cannot help them improve their reasoning process. Judging each step (PRM) is like grading each line. You get rich feedback, but you need a grader who understands the subject well enough to evaluate each step. If the per-step grader is unreliable, the accumulated errors can mislead.
Why It Matters
The PRM vs ORM choice determines the granularity of the training signal for reasoning models. PRMs enable step-level search (beam search over reasoning steps) and provide dense rewards for RL training. ORMs are simpler and unbiased but require many rollouts to reduce variance. The best systems in 2025-2026 use both: PRMs for guiding search and ORMs for final verification.
PRM Training
PRMs require step-level labels: for each reasoning step, "was this step correct?" These labels are expensive to collect from humans. An alternative: automated PRM training via Monte Carlo estimation. For each step in a reasoning chain:
- Run many rollouts from that step to completion
- Check how many rollouts reach the correct final answer
- Label the step as "correct" if the success rate is above a threshold
This provides noisy but scalable step-level labels without human annotators.
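The three steps above can be sketched as follows. Here `rollout_fn` (continue the chain from a prefix and check the final answer against a verifier) is a stand-in for the real sampling loop, and `make_rollout` simulates it with a fixed success rate:

```python
import random

def mc_step_label(rollout_fn, prefix: str,
                  n_rollouts: int = 16, threshold: float = 0.5) -> bool:
    """Monte Carlo step labeling: a step is 'correct' if enough completions
    sampled from it reach the verified final answer (Math-Shepherd style)."""
    successes = sum(rollout_fn(prefix) for _ in range(n_rollouts))
    return successes / n_rollouts >= threshold

def make_rollout(success_rate: float):
    """Toy stand-in: a rollout from a good prefix succeeds often, a bad one rarely."""
    return lambda prefix: random.random() < success_rate

random.seed(0)
good_step = mc_step_label(make_rollout(0.9), "... step 3: x = 7 ...", n_rollouts=200)
bad_step = mc_step_label(make_rollout(0.05), "... step 3: x = 12 ...", n_rollouts=200)
```

The labels are noisy in both directions: a wrong step can still be "rescued" by later rollouts, and a correct step can be followed by failures, which is one source of the PRM bias discussed above.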
Every reasoning-focused model in 2025-2026 uses verifier-guided training. OpenAI's o1 and o3 use verifier-guided RL for mathematical reasoning. DeepSeek-R1 uses GRPO with code execution verifiers. The Llama 3.1 technical report describes both reward models for general preference alignment and verifiers for code and math. The trend is clear: reward models for subjective quality, verifiers for objective correctness, and the best systems use both in different stages of the post-training pipeline.
The Fundamental Tension
Reward models and verifiers represent two ends of a spectrum:
| Property | Reward Model | Verifier |
|---|---|---|
| Signal | Dense (scalar for any output) | Sparse (binary for checkable outputs) |
| Coverage | Universal (any domain) | Narrow (only verifiable domains) |
| Robustness | Fragile (hackable, drifts) | Robust (objective, stable) |
| Bias | High (encodes human biases) | Low (checks truth, not preference) |
| Scalability | Limited by preference data | Limited by verifier domains |
The frontier of alignment research is extending verifier coverage to more domains (better fact-checking, better consistency checking) while making reward models more robust (distributional robustness, ensemble methods, iterated retraining).
Common Confusions
A high reward model score does not mean high quality
The reward model score is a proxy for quality, trained on a finite sample of human preferences. A score of 4.2 vs 3.8 does not mean one response is objectively better. It means the reward model predicts the first would be preferred by the kind of human raters who produced the training data. This prediction may be wrong, especially for outputs far from the training distribution.
Verifiers are not reward models with better data
Reward models and verifiers are structurally different computational objects. A reward model is a learned function that maps outputs to scalar scores. A verifier is a procedure that checks a property of the output (runs code, checks a proof, queries a database). Improving a reward model's training data does not make it a verifier. The distinction is between predicting quality and checking correctness.
Process reward models are not verifiers
A PRM is still a learned model that can be wrong. It predicts whether a reasoning step is correct. It does not prove it. PRMs can be hacked just like any other reward model. They are better than ORMs for guiding search because they provide earlier signal, but they are not a substitute for ground-truth verification.
Summary
- Reward models: trained on Bradley-Terry preferences, flexible but hackable
- Verifiers: check correctness objectively, robust but narrow in domain
- Goodhart's law: optimizing a proxy reward past a certain point makes the model worse
- Overoptimization: proxy reward increases monotonically, true reward peaks then declines
- PRM: dense step-level signal, potentially biased. ORM: sparse final signal, unbiased
- Reward models encode human biases (length preference, confidence preference)
- Verifiers work for code (tests), math (proofs), facts (retrieval). Not for style
- Best systems combine reward models (for general quality) and verifiers (for correctness)
Exercises
Problem
A reward model assigns scores $r_A$ and $r_B$ to two responses $A$ and $B$. Under the Bradley-Terry model, what is the predicted probability that a human rater prefers $A$ over $B$?
Problem
The Gao et al. (2023) scaling law for reward model overoptimization can be approximated as:
$$R_{\mathrm{gold}}(d) = d\,(\alpha - \beta \log d)$$
where $d = \sqrt{D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{ref}})}$ measures the KL divergence between the optimized policy and the reference policy, $\alpha$ relates to reward model quality, and $\beta$ relates to reward model error. Find the optimal $d^*$ that maximizes the gold reward, and show that $d^*$ increases with reward model quality $\alpha$.
Problem
Design a hybrid system that uses both a reward model and a verifier for training a code generation model. Specify: (a) when each signal is used, (b) how they are combined, and (c) what failure modes the hybrid avoids that neither component alone can handle.
References
Canonical:
- Christiano et al., "Deep Reinforcement Learning from Human Preferences" (2017). reward model framework
- Cobbe et al., "Training Verifiers to Solve Math Word Problems" (2021). verifiers for math
Current:
- Gao et al., "Scaling Laws for Reward Model Overoptimization" (2023). quantitative Goodhart's law
- Lightman et al., "Let's Verify Step by Step" (2023). process reward models
- Wang et al., "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations" (2024). automated PRM training
Next Topics
The natural next steps from reward models and verifiers:
- Test-time compute and search: using verifiers at inference time
- DPO vs GRPO vs RL reasoning: how reward/verifier signals drive different training methods
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- RLHF and Alignment (Layer 4)
- Policy Gradient Theorem (Layer 3)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rⁿ (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
Builds on This
- Reward Hacking (Layer 5)
- Verifier Design and Process Reward (Layer 5)