
AI Safety

Reward Models and Verifiers

Reward models trained on human preferences vs verifiers that check output correctness. Bradley-Terry models, process vs outcome rewards, Goodhart's law, and why verifiers are more robust.

Advanced · Tier 2 · Frontier · ~55 min


Why This Matters

Every post-training method needs a training signal: something that tells the model which outputs are good and which are bad. This signal comes from either a reward model (a learned function that predicts human preferences) or a verifier (a system that checks output correctness).

The distinction between reward models and verifiers is one of the most important in modern AI alignment. Reward models are flexible but fragile. They can be hacked, they drift with distribution shift, and they encode biases from their training data. Verifiers are narrow and empirically more robust on domains with checkable answers. When they say an answer passes their check, it passes that check, but the check itself can be incomplete (partial test coverage, incorrect specifications, tamperable environments). Understanding this distinction explains why reasoning-focused models use verifiers and general-purpose models use reward models, and why the most capable systems combine both.

Mental Model

Think of two ways to evaluate a student's essay:

  1. Reward model approach: Ask a panel of judges to rate it. The judges are imperfect. They can be swayed by confident prose, they disagree with each other, and a clever student can learn to write essays that score well without actually being good.
  2. Verifier approach: Check each factual claim against a reference. The checker is narrow (it can only verify facts, not evaluate style) but reliable: a claim is either correct or not.

Reward models are like judges: flexible, subjective, hackable. Verifiers are like fact-checkers: narrow, objective, robust.

Reward Models

The Bradley-Terry Framework

Theorem

Bradley-Terry Reward Model Training

Statement

Given a dataset of human preference comparisons $\mathcal{D} = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}_{i=1}^N$, a reward model $r_\phi(x, y)$ is trained by maximizing:

$$\mathcal{L}(\phi) = \sum_{i=1}^{N} \log \sigma\left(r_\phi(x^{(i)}, y_w^{(i)}) - r_\phi(x^{(i)}, y_l^{(i)})\right)$$

Under the Bradley-Terry model, the probability of preferring $y_w$ over $y_l$ is:

$$P(y_w \succ y_l \mid x) = \sigma\left(r(x, y_w) - r(x, y_l)\right)$$

The reward model learns a scalar scoring function such that preferred outputs receive higher scores than dispreferred ones.
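The Bradley-Terry objective can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions: `bradley_terry_loss` and the score arrays are hypothetical names, and a real reward model would backpropagate this loss through a transformer rather than score fixed numbers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the Bradley-Terry preference model.

    r_chosen, r_rejected: arrays of reward-model scores for the
    preferred (y_w) and dispreferred (y_l) responses in a batch.
    """
    # P(y_w > y_l) = sigmoid(r_w - r_l); minimize -log of that probability.
    return -np.mean(np.log(sigmoid(r_chosen - r_rejected)))

# Toy batch: the model scores preferred responses higher on two of three pairs.
r_w = np.array([2.5, 1.2, 0.8])
r_l = np.array([1.8, 1.5, -0.3])
loss = bradley_terry_loss(r_w, r_l)
```

Note the loss only depends on score *differences*: the Bradley-Terry model is invariant to adding a constant to all scores, which is why reward-model scores have no absolute meaning.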

Intuition

The reward model converts ordinal preferences (A is better than B) into cardinal scores (A scores 3.2, B scores 1.7). The sigmoid function maps the score difference to a preference probability. The loss encourages the model to assign scores that are consistent with observed human preferences.

Why It Matters

The reward model is the bridge between human judgment and machine optimization. In RLHF, the policy is optimized to maximize the reward model's score. In best-of-N sampling, the reward model selects the best candidate. The quality of the entire post-training pipeline depends on the quality of this single learned function.
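Best-of-N selection is a one-liner given any scorer. The sketch below is hypothetical (`best_of_n` and `toy_rm` are illustrative names) and deliberately uses a length-biased toy scorer to show that best-of-N inherits whatever biases the reward model has.

```python
def best_of_n(prompt, candidates, reward_model):
    """Return the candidate the reward model scores highest.

    reward_model(prompt, response) -> float is any learned scorer,
    standing in here for r_phi(x, y).
    """
    scores = [reward_model(prompt, c) for c in candidates]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]

# Toy scorer that (spuriously) prefers longer answers.
toy_rm = lambda x, y: len(y)
best = best_of_n("q", ["short", "a much longer answer"], toy_rm)
```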

Failure Mode

The Bradley-Terry model assumes a total ordering of outputs for each prompt: for any two outputs, one is definitively better. In reality, human preferences are often intransitive (A > B, B > C, but C > A), context-dependent, and influenced by factors irrelevant to quality (length, confidence, formatting). The reward model absorbs all of these imperfections and presents them as a clean scalar score, creating a false sense of precision.

Why Reward Models Are Fragile

Problem 1: Distributional shift. The reward model is trained on outputs from $\pi_{\text{SFT}}$. As the policy changes during RL, it generates outputs increasingly different from the training distribution. The reward model's scores become unreliable on these out-of-distribution outputs. This is not a hypothetical concern. It is the primary failure mode in practice.

Problem 2: Spurious reward features. Reward models learn spurious correlations between surface features and human preferences. Common examples:

  • Longer responses tend to be rated higher ⇒ the model learns "length = quality"
  • Confident language is preferred ⇒ the model learns "confidence = quality"
  • Markdown formatting looks professional ⇒ the model learns "formatting = quality"

A model optimized against these spurious features produces long, confident, well-formatted responses that may be substantively wrong.
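One cheap audit for such spurious features is to correlate a reward model's scores with a surface statistic such as response length. A minimal sketch (the function name and toy data are hypothetical):

```python
import numpy as np

def length_score_correlation(responses, scores):
    """Pearson correlation between response length and reward-model score.

    A strong positive value is a red flag that the reward model may
    have learned 'length = quality' as a spurious feature.
    """
    lengths = np.array([len(r) for r in responses], dtype=float)
    scores = np.asarray(scores, dtype=float)
    return float(np.corrcoef(lengths, scores)[0, 1])

# Toy data where score tracks length almost perfectly.
responses = ["ok", "a longer reply", "an even longer, padded reply"]
scores = [0.5, 1.4, 2.9]
corr = length_score_correlation(responses, scores)
```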

Problem 3: Goodhart's Law.

Proposition

Reward Model Overoptimization

Statement

Let $r^*$ be the true (unobservable) reward and $r_\phi$ the learned proxy reward. Assume the policy $\pi_\theta$ is the KL-constrained optimum of $\mathbb{E}[r_\phi]$ at inverse KL weight $\beta^{-1}$, and that both expected rewards are differentiable in $\beta^{-1}$ (these are the regularity conditions under which the sign claims below hold; for approximate RL trajectories, such as a fixed number of PPO steps, the signed statements become empirical tendencies rather than theorems). Then as the policy is optimized more aggressively against $r_\phi$ (by decreasing $\beta$), the proxy reward $\mathbb{E}_{\pi_\theta}[r_\phi]$ increases monotonically, while the true reward $\mathbb{E}_{\pi_\theta}[r^*]$ typically first increases and then decreases:

$$\frac{d}{d\beta^{-1}} \mathbb{E}_{\pi_\theta}[r_\phi] \geq 0$$

$$\frac{d}{d\beta^{-1}} \mathbb{E}_{\pi_\theta}[r^*] > 0 \;\text{ initially}, \qquad \frac{d}{d\beta^{-1}} \mathbb{E}_{\pi_\theta}[r^*] < 0 \;\text{ eventually}$$

The first inequality is the envelope property of the KL-constrained optimum and requires the monotone-optimum assumption above. The second is the empirical overoptimization curve fit by Gao et al. 2023. The point where true reward peaks is the optimal level of optimization. Beyond it, further optimization actively makes the model worse according to the quantity we actually care about.
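The shape of the overoptimization curve can be seen numerically with an illustrative $\alpha\sqrt{d} - \beta d$ form for the gold reward (the constants and the proxy curve below are made up for illustration; Gao et al. fit richer functional forms to real training runs):

```python
import numpy as np

# Illustrative constants: alpha ~ reward model quality, beta ~ reward model error.
alpha, beta = 1.0, 0.1
d = np.linspace(0.0, 100.0, 1001)  # KL divergence from the reference policy

proxy_reward = alpha * np.sqrt(d)             # keeps rising as we optimize harder
gold_reward = alpha * np.sqrt(d) - beta * d   # rises, peaks, then declines

peak_d = d[np.argmax(gold_reward)]  # optimization level where true quality peaks
```

Past `peak_d`, the proxy still climbs while the gold reward falls: the gap between the two curves is exactly the Goodhart effect.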

Intuition

Imagine optimizing a student's exam score. Initially, studying improves both the score and the knowledge. But eventually, the student discovers tricks that improve the score without improving knowledge (e.g., writing longer answers, using jargon). The score keeps going up while the actual knowledge plateaus and then declines (because time spent on tricks displaces time spent learning).

Why It Matters

This result, documented by Gao et al. (2023) with quantitative scaling laws, is the mathematical formalization of Goodhart's Law applied to RLHF. It means there is a fundamental limit to how much you can optimize against a proxy reward. The practical implication: you must monitor true quality (via human evaluation) during RL training and stop before overoptimization kicks in.

Failure Mode

The overoptimization point depends on the quality of the reward model, and you cannot determine it without access to the true reward (which is why you are using a proxy in the first place). In practice, teams use held-out human evaluations at checkpoints, but this is expensive and discrete. Between checkpoints, the model can overshoot into the overoptimized regime.

Verifiers

Verifiers check outputs against objective criteria rather than predicting human preferences. They do not rate. They verify.

Types of Verifiers

  • Code execution: Run the generated code against test cases. Pass or fail.
  • Math verification: Check the final numerical answer or use a formal proof assistant (Lean, Isabelle) to verify a proof.
  • Fact-checking: Compare generated claims against a knowledge base or retrieve supporting/contradicting evidence.
  • Constraint checking: Verify that the output satisfies specified format or content constraints.
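A minimal code-execution verifier is essentially a test runner that returns a binary pass/fail rather than a score. The sketch below uses hypothetical names; a production verifier would also sandbox execution and enforce timeouts.

```python
def run_tests(candidate_fn, test_cases):
    """Minimal code-execution verifier: pass iff every test case passes.

    candidate_fn: the generated function under test.
    test_cases: list of (args, expected_output) pairs.
    """
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) != expected:
                return False
        except Exception:
            # Crashing on a test input counts as failure.
            return False
    return True

# A candidate "generated" implementation and its unit tests.
def candidate_add(a, b):
    return a + b

passed = run_tests(candidate_add, [((1, 2), 3), ((-1, 1), 0)])
```

The binary outcome is the point: the verifier's signal is only as good as the test cases, but within them, a pass is a pass.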

Why Verifiers Are More Robust

Verifiers check the output, not the process. A reward model says "this looks like a good response." A verifier says "this response is factually correct" or "this code passes all tests." The critical difference:

  • Reward models can be hacked by outputs that look good but are wrong
  • Verifiers are harder to reward-hack than scalar reward models because they check a property of the output rather than predict preference, but they are not immune. Unit-test verifiers can be gamed by memorizing tests or by tampering with the evaluation procedure (Denison et al. 2024), formal proof verifiers still trust the axioms and specification, and PRM step-scores can themselves be Goodharted. On math benchmarks they resist the most common failure modes, but overoptimization still occurs (Gao et al. 2023, Skalse et al. 2022)
  • Reward models degrade under distributional shift; verifiers are more stable across distributions within their checkable domain, but their signal is only as sound as the specification they check against

The limitation: verifiers only exist for domains with checkable answers. You cannot build a verifier for "is this essay insightful?" or "is this response helpful?" These subjective qualities require reward models.

Process Reward Models vs Outcome Reward Models

Proposition

Process vs Outcome Reward Model Tradeoff

Statement

An outcome reward model (ORM) scores only the final answer: $r_{\text{ORM}}(x, y) = f(x, y_{\text{final}})$. A process reward model (PRM) scores each intermediate step: $r_{\text{PRM}}(x, y) = \sum_{t=1}^{T} f_t(x, y_{\leq t})$.

The tradeoff:

  • ORM: Unbiased (the final answer is either correct or not) but sparse (provides signal only at the end of a long reasoning chain). Variance scales with chain length.
  • PRM: Dense signal (feedback at each step) but potentially biased (the PRM is a learned model that can be wrong about intermediate steps). Bias comes from PRM training errors.

For a chain of $T$ steps with per-step PRM error $\epsilon_{\text{step}}$, the total PRM bias scales as $O(T \epsilon_{\text{step}})$, while ORM variance scales as $O(1/\sqrt{n})$, where $n$ is the number of rollouts.

Intuition

Judging a long math derivation by only looking at the final answer (ORM) is like grading an exam by only checking the last line. You miss where the student went wrong, and you cannot help them improve their reasoning process. Judging each step (PRM) is like grading each line. You get rich feedback, but you need a grader who understands the subject well enough to evaluate each step. If the per-step grader is unreliable, the accumulated errors can mislead.

Why It Matters

The PRM vs ORM choice determines the granularity of the training signal for reasoning models. PRMs enable step-level search (beam search over reasoning steps) and provide dense rewards for RL training. ORMs are simpler and unbiased but require many rollouts to reduce variance. The best systems in 2025-2026 use both: PRMs for guiding search and ORMs for final verification.

PRM Training

PRMs require step-level labels: for each reasoning step, "was this step correct?" These labels are expensive to collect from humans. An alternative: automated PRM training via Monte Carlo estimation. For each step in a reasoning chain:

  1. Run many rollouts from that step to completion
  2. Check how many rollouts reach the correct final answer
  3. Label the step as "correct" if the success rate is above a threshold

This provides noisy but scalable step-level labels without human annotators.
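The Monte Carlo labeling loop above can be sketched as follows. This is a stub under stated assumptions: `mc_step_label` is a hypothetical name, and `rollout_fn` stands in for an actual model completing the reasoning chain and checking the final answer.

```python
import random

def mc_step_label(rollout_fn, step_prefix, n_rollouts=32, threshold=0.5, seed=0):
    """Monte Carlo step labeling for PRM training data.

    rollout_fn(prefix, rng) -> bool: completes the chain from this
    step and reports whether the final answer was correct.
    Returns (label, success_rate); label is 1 if the empirical
    success rate clears the threshold, else 0.
    """
    rng = random.Random(seed)
    successes = sum(rollout_fn(step_prefix, rng) for _ in range(n_rollouts))
    rate = successes / n_rollouts
    return (1 if rate >= threshold else 0), rate

# Stub rollout: a "good" prefix reaches the right answer ~80% of the time.
good_rollout = lambda prefix, rng: rng.random() < 0.8
label, rate = mc_step_label(good_rollout, "step 3: x = 4", n_rollouts=200)
```

The threshold and rollout count trade label noise against compute: more rollouts tighten the success-rate estimate, at roughly $O(1/\sqrt{n})$ in the standard error.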

Where This Shows Up in Current Papers

Every reasoning-focused model in 2025-2026 uses verifier-guided training. OpenAI's o1 and o3 use verifier-guided RL for mathematical reasoning. DeepSeek-R1 uses GRPO with code execution verifiers. The Llama 3.1 technical report describes both reward models for general preference alignment and verifiers for code and math. The trend is clear: reward models for subjective quality, verifiers for objective correctness, and the best systems use both in different stages of the post-training pipeline.

The Fundamental Tension

Reward models and verifiers represent two ends of a spectrum:

| Property | Reward Model | Verifier |
| --- | --- | --- |
| Signal | Dense (scalar for any output) | Sparse (binary for checkable outputs) |
| Coverage | Universal (any domain) | Narrow (only verifiable domains) |
| Robustness | Fragile (hackable, drifts) | Robust (objective, stable) |
| Bias | High (encodes human biases) | Low (checks truth, not preference) |
| Scalability | Limited by preference data | Limited by verifier domains |

The frontier of alignment research is extending verifier coverage to more domains (better fact-checking, better consistency checking) while making reward models more robust (distributional robustness, ensemble methods, iterated retraining).

Common Confusions

Watch Out

A high reward model score does not mean high quality

The reward model score is a proxy for quality, trained on a finite sample of human preferences. A score of 4.2 vs 3.8 does not mean one response is objectively better. It means the reward model predicts the first would be preferred by the kind of human raters who produced the training data. This prediction may be wrong, especially for outputs far from the training distribution.

Watch Out

Verifiers are not reward models with better data

Reward models and verifiers are structurally different computational objects. A reward model is a learned function that maps outputs to scalar scores. A verifier is a procedure that checks a property of the output (runs code, checks a proof, queries a database). Improving a reward model's training data does not make it a verifier. The distinction is between predicting quality and checking correctness.

Watch Out

Process reward models are not verifiers

A PRM is still a learned model that can be wrong. It predicts whether a reasoning step is correct. It does not prove it. PRMs can be hacked just like any other reward model. They are better than ORMs for guiding search because they provide earlier signal, but they are not a substitute for ground-truth verification.

Summary

  • Reward models: trained on Bradley-Terry preferences, flexible but hackable
  • Verifiers: check correctness objectively, robust but narrow in domain
  • Goodhart's law: optimizing a proxy reward past a certain point makes the model worse
  • Overoptimization: proxy reward increases monotonically, true reward peaks then declines
  • PRM: dense step-level signal, potentially biased. ORM: sparse final signal, unbiased
  • Reward models encode human biases (length preference, confidence preference)
  • Verifiers work for code (tests), math (proofs), facts (retrieval). Not for style
  • Best systems combine reward models (for general quality) and verifiers (for correctness)

Exercises

ExerciseCore

Problem

A reward model assigns scores $r(x, y_A) = 2.5$ and $r(x, y_B) = 1.8$ to two responses. Under the Bradley-Terry model, what is the predicted probability that a human rater prefers $y_A$ over $y_B$?

ExerciseAdvanced

Problem

The Gao et al. (2023) scaling law for reward model overoptimization can be approximated as:

$$r^*_{\text{gold}}(d) = \alpha \sqrt{d} - \beta d$$

where $d$ is the KL divergence between the optimized policy and the reference policy, $\alpha$ relates to reward model quality, and $\beta$ relates to reward model error. Find the optimal $d^*$ that maximizes the gold reward, and show that $r^*_{\text{gold}}(d^*)$ increases with reward model quality $\alpha$.

ExerciseResearch

Problem

Design a hybrid system that uses both a reward model and a verifier for training a code generation model. Specify: (a) when each signal is used, (b) how they are combined, and (c) what failure modes the hybrid avoids that neither component alone can handle.

References

Canonical:

  • Christiano et al., "Deep Reinforcement Learning from Human Preferences" (2017): the original reward-model framework
  • Cobbe et al., "Training Verifiers to Solve Math Word Problems" (2021): verifiers for math

Current:

  • Gao et al., "Scaling Laws for Reward Model Overoptimization" (2023): quantitative Goodhart's law
  • Lightman et al., "Let's Verify Step by Step" (2023): process reward models
  • Wang et al., "Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations" (2024): automated PRM training


Last reviewed: April 2026
