LLM Construction
Reinforcement Learning from Human Feedback: Deep Dive
The full RLHF pipeline: supervised fine-tuning, Bradley-Terry reward modeling, PPO with KL penalty, reward hacking via Goodhart, and the post-RLHF landscape of DPO, GRPO, and RLVR.
Why This Matters
RLHF: from pretrained model to aligned model in three phases
RLHF is the technique that turned base language models into useful assistants. Every major chatbot, including GPT-4, Claude, and Gemini, uses some form of learning from human feedback. Understanding the full pipeline, including its failure modes, is necessary to evaluate claims about alignment, safety, and model behavior. The theory behind RLHF also explains why newer methods (DPO, GRPO) emerged and what tradeoffs they make.
The Three-Stage Pipeline
The canonical RLHF pipeline (Ouyang et al., 2022, InstructGPT) has three stages:
- Supervised Fine-Tuning (SFT): Fine-tune a pretrained LM on high-quality demonstration data
- Reward Model Training: Train a scalar reward model on human preference comparisons
- RL Optimization: Optimize the policy against the reward model using PPO with a KL penalty against the SFT policy
Stage 1: Supervised Fine-Tuning
Start with a pretrained language model $\pi_{\text{pre}}$. Fine-tune it on a dataset of (prompt, high-quality response) pairs using standard next-token cross-entropy loss. This produces the SFT policy $\pi_{\text{SFT}}$.
SFT alone already improves instruction-following substantially. The purpose of the next two stages is to go beyond what demonstration data can teach: to learn preferences over response quality rather than just response format.
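As a minimal sketch of the SFT objective (the function and variable names here are illustrative, not from any particular library): cross-entropy is computed only over response tokens, with prompt tokens masked out so the model conditions on them without being trained to reproduce them.

```python
def sft_loss(token_logprobs, loss_mask):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log-prob the model assigned to each target token.
    loss_mask: 1 for response tokens, 0 for prompt tokens (the prompt
    is conditioned on, not trained on).
    """
    masked = [-lp * m for lp, m in zip(token_logprobs, loss_mask)]
    return sum(masked) / max(sum(loss_mask), 1)

# Prompt tokens (mask 0) contribute nothing; response tokens (mask 1) do.
logprobs = [-0.1, -0.2, -1.0, -2.0]
mask     = [0,    0,    1,    1]
loss = sft_loss(logprobs, mask)  # (1.0 + 2.0) / 2 = 1.5
```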
Stage 2: Reward Model
Bradley-Terry Preference Model
Given two responses $y_w$ (preferred) and $y_l$ (dispreferred) to prompt $x$, the Bradley-Terry model assumes:

$$P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$

where $r_\phi(x, y)$ is a learned reward function and $\sigma$ is the sigmoid function. The reward model is trained by maximizing this likelihood over a dataset of human comparisons.
Bradley-Terry Reward Model Loss
Statement
The maximum likelihood objective for the reward model $r_\phi$ given a dataset $\mathcal{D} = \{(x, y_w, y_l)\}$ of comparisons is:

$$\mathcal{L}(r_\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]$$

This is binary cross-entropy where the reward difference predicts which response humans prefer.
Intuition
The reward model does not learn absolute scores. It learns to rank responses: preferred responses get higher reward than dispreferred ones. The sigmoid converts the reward gap into a probability, and we maximize the probability of the observed preference ordering.
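The per-pair loss is small when the preferred response already scores higher by a wide margin, and large when the ranking is inverted. A minimal sketch (using the numerically stable softplus form of $-\log\sigma$; names are illustrative):

```python
import math

def bt_loss(r_w, r_l):
    """Bradley-Terry reward-model loss for one comparison:
    -log(sigmoid(r_w - r_l)), computed as softplus(-(r_w - r_l))."""
    x = r_w - r_l
    # Numerically stable softplus(-x): avoids exp overflow for large |x|.
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

small = bt_loss(2.0, -1.0)   # correct ranking, wide margin -> ~0.049
large = bt_loss(-1.0, 2.0)   # inverted ranking -> ~3.049
```

Note that shifting both rewards by a constant leaves the loss unchanged: the model learns rankings, not absolute scores, exactly as described above.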
Proof Sketch
Write the log-likelihood of the data under the Bradley-Terry model: $\sum_{(x, y_w, y_l) \in \mathcal{D}} \log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$. Negate for a loss to minimize.
Why It Matters
This is the standard reward model objective used in InstructGPT, the original ChatGPT RLHF pipeline, and most subsequent work. The reward model serves as a proxy for human judgment: cheaper to query than humans, but susceptible to reward hacking.
Failure Mode
The Bradley-Terry model assumes a total ordering of responses. In practice, human preferences are intransitive (A > B, B > C, but C > A), context-dependent, and noisy. Labeler disagreement can be substantial (30%+ on subjective prompts). The reward model fits a smooth function to this noisy signal, which can produce systematic biases.
Stage 3: PPO with KL Penalty
Given the reward model $r_\phi$ and the SFT policy $\pi_{\text{SFT}}$, optimize:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathbb{D}_{\text{KL}}\big[\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x)\big]$$

where $\beta > 0$ controls the strength of the KL penalty. This is optimized using PPO (Proximal Policy Optimization) with the reward model providing the reward signal.
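In PPO-based implementations, one common construction (a sketch, not the only variant) folds the KL term into a per-token reward: each token pays $-\beta(\log \pi_\theta - \log \pi_{\text{SFT}})$, and the reward-model score is added on the final token of the response.

```python
def shaped_rewards(rm_score, logp_policy, logp_ref, beta):
    """Per-token rewards for PPO-based RLHF (one common construction).

    Each token gets a per-token KL estimate -beta*(log pi - log pi_ref);
    the scalar reward-model score is added on the last token only.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

# Policy more confident than reference on token 1, less on token 2:
r = shaped_rewards(1.0, [-1.0, -2.0], [-1.5, -1.5], beta=0.1)
# r == [-0.05, 1.05]: drift toward higher confidence is taxed,
# drift toward lower confidence is (locally) rewarded.
```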
KL Penalty Prevents Reward Hacking
Statement
Without the KL penalty ($\beta = 0$), the optimal policy maximizes $r_\phi$ directly and will exploit any systematic errors in the reward model, producing outputs that score high under $r_\phi$ but low under true human judgment. The KL penalty constrains the policy to stay near $\pi_{\text{SFT}}$, limiting the degree of reward exploitation:

$$\mathbb{D}_{\text{KL}}\big[\pi^* \,\|\, \pi_{\text{SFT}}\big] \le \frac{1}{\beta}\Big(\mathbb{E}_{\pi^*}\big[r_\phi\big] - \mathbb{E}_{\pi_{\text{SFT}}}\big[r_\phi\big]\Big)$$

Larger $\beta$ keeps the policy closer to SFT; smaller $\beta$ allows more reward optimization at the risk of more hacking.
Intuition
The reward model is a learned proxy, not the true objective. Optimizing a proxy too aggressively is Goodhart's Law: "when a measure becomes a target, it ceases to be a good measure." The KL penalty sets a trust region around the SFT policy where the reward model is still a reasonable proxy.
Proof Sketch
The KL-penalized objective has a closed-form optimal policy: $\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{SFT}}(y \mid x) \exp\big(r_\phi(x, y)/\beta\big)$, where $Z(x)$ normalizes over responses. As $\beta \to 0$, $\pi^*$ concentrates on the reward-maximizing response. As $\beta \to \infty$, $\pi^* \to \pi_{\text{SFT}}$. The bound on KL follows from substituting $\pi^*$ back into the objective and noting its value must be at least the objective's value at $\pi_{\text{SFT}}$, whose KL term is zero.
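Both limits can be checked numerically on a toy discrete response set (a sketch under the closed form stated above; the distributions and rewards are made up for illustration):

```python
import math

def optimal_policy(pi_sft, rewards, beta):
    """pi*(y) proportional to pi_sft(y) * exp(r(y)/beta), over a small
    discrete set of candidate responses."""
    w = [p * math.exp(r / beta) for p, r in zip(pi_sft, rewards)]
    z = sum(w)  # the normalizer Z
    return [x / z for x in w]

pi_sft = [0.5, 0.3, 0.2]
r = [0.0, 1.0, 2.0]

greedy   = optimal_policy(pi_sft, r, beta=0.01)  # concentrates on argmax r
cautious = optimal_policy(pi_sft, r, beta=100)   # stays near pi_sft
```

With `beta=0.01` nearly all mass lands on the highest-reward response; with `beta=100` the result is within a fraction of a percent of `pi_sft`.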
Why It Matters
This is the central design choice of RLHF. The KL penalty is not just regularization for stability; it is a safety mechanism. Empirically, reward model score increases monotonically with optimization, but true quality (measured by human evaluation) peaks and then decreases. The KL penalty stops optimization before this peak.
Failure Mode
Choosing $\beta$ is difficult. Too large, and the model barely improves over SFT; too small, and the model reward-hacks. In practice, $\beta$ is tuned by monitoring a held-out set of human evaluations during training, which is expensive. Work on reward model overoptimization (Gao et al., 2023) shows that the relationship between KL budget and true quality follows a roughly parabolic curve.
Reward Hacking and Goodhart's Law
Reward Hacking
Reward hacking occurs when the policy finds outputs that achieve high reward model scores through features that do not correspond to genuine quality. Examples include: generating longer responses (length bias in reward models), repeating confident-sounding phrases, or including irrelevant but impressive-sounding details. The reward model assigns high scores because these features correlate with quality in the training data, but the correlation breaks under optimization.
The reward model is not the objective
The true objective is human satisfaction. The reward model is a proxy. RLHF works only in the regime where the proxy and the true objective agree. The KL penalty keeps the policy in this regime. Going beyond this regime (overoptimization) degrades the actual output quality even as the proxy reward increases.
InstructGPT: The Canonical Example
InstructGPT (Ouyang et al., 2022) applied this three-stage pipeline to GPT-3 (175B parameters). Key findings:
- The 1.3B parameter InstructGPT was preferred by human labelers over the 175B base GPT-3, demonstrating that alignment training can compensate for a 100x reduction in model size
- The SFT stage alone captured most of the formatting improvement; RLHF added quality and safety on top
- Reward model accuracy was about 72% on held-out comparisons, well above the 50% random baseline but far from perfect
What Has Changed Since
DPO (Direct Preference Optimization)
DPO (Rafailov et al., 2023) eliminates the explicit reward model and PPO entirely. It reparameterizes the KL-constrained reward maximization as a classification loss on preference pairs:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
DPO is simpler (no RL training loop), but it optimizes on static preference data and cannot explore.
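A minimal sketch of the per-pair DPO loss (log-probabilities here are assumed to be summed over response tokens; the names are illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO loss for one preference pair:
    -log sigmoid(beta * (winner log-ratio - loser log-ratio)),
    where each log-ratio is log pi_theta(y|x) - log pi_ref(y|x)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable -log(sigmoid(margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy already favors the winner relative to the reference -> lower loss
good = dpo_loss(-5.0, -7.0, -6.0, -6.0, beta=0.1)
# Inverted preference -> higher loss
bad  = dpo_loss(-7.0, -5.0, -6.0, -6.0, beta=0.1)
```

The implicit reward in DPO is the scaled log-ratio $\beta \log(\pi_\theta/\pi_{\text{ref}})$, which is why no separate reward model is needed.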
GRPO
Group Relative Policy Optimization (GRPO)
GRPO (Shao et al., 2024) modifies PPO by removing the value function (critic) and estimating the advantage from a group of completions sampled from the same prompt. For prompt $x$, sample $y_1, \dots, y_G \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)$, score each with reward $r_i$, and define the advantage as the within-group standardized reward:

$$\hat{A}_i = \frac{r_i - \text{mean}(r_1, \dots, r_G)}{\text{std}(r_1, \dots, r_G)}$$

The policy is updated using the PPO-style clipped surrogate objective with $\hat{A}_i$ as the advantage, plus a KL penalty against a fixed reference policy (typically the SFT model), matching the regularization role that $\beta\, \mathbb{D}_{\text{KL}}[\pi \,\|\, \pi_{\text{SFT}}]$ plays in standard RLHF.
GRPO was introduced in DeepSeekMath (Shao et al., 2024) and is the RL algorithm used to train DeepSeek-R1 (DeepSeek-AI, 2025). Key properties:
- No critic network. Standard PPO trains a value function of comparable size to the policy to compute advantages. GRPO drops it. Memory cost drops from holding two large models (policy and value) to one, which matters at LLM scale where each is tens of billions of parameters.
- Group-relative advantage. The baseline is the group mean, not a learned value estimate. Standardizing by the group standard deviation gives per-prompt advantages with zero mean and unit variance, which stabilizes updates across prompts of very different reward scales.
- Same KL regularization as RLHF. GRPO retains a per-token KL penalty against $\pi_{\text{ref}}$, preserving the trust-region role that prevents reward hacking.
- Pairs well with verifiable rewards. When $r_i$ is a rule-based check (correct answer, passing test, valid format), GRPO skips reward-model training entirely and the group-relative estimator has low variance.
Advantage computation for a group of $G = 4$
Prompt: a math question whose final answer can be checked automatically. Sample four completions and check final answers. Suppose two are correct and two are incorrect, so the rewards are $r = (1, 0, 1, 0)$.
Mean $\bar{r} = 0.5$, standard deviation $0.5$. Advantages: $\hat{A}_i = +1$ for the correct completions and $\hat{A}_i = -1$ for the incorrect ones. Each token in a correct completion receives advantage $+1$; each token in an incorrect completion receives $-1$. No value network is consulted.
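The arithmetic above can be verified in a few lines (a minimal sketch using the population standard deviation; the zero-variance guard is a common implementation detail, not part of the formula itself):

```python
def grpo_advantages(rewards):
    """Within-group standardized rewards, as used by GRPO."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    if std == 0:
        # All completions tied (all right or all wrong): no learning
        # signal from this group, so advantages are zero.
        return [0.0] * g
    return [(r - mean) / std for r in rewards]

# Two correct (r=1) and two incorrect (r=0) completions:
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])  # [1.0, -1.0, 1.0, -1.0]
```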
See the verifier design and process reward page for details on rule-based rewards, and DPO vs GRPO vs RL reasoning for a direct comparison.
Process vs Outcome Reward Models
Outcome Reward Model (ORM)
An outcome reward model scores only the final answer of a full solution, producing a single scalar $r(x, y)$ regardless of the reasoning steps that produced it. Supervision comes from labeling complete solutions as correct or incorrect.
Process Reward Model (PRM)
A process reward model scores each intermediate step of a reasoning trace. Given a solution broken into steps $s_1, \dots, s_K$, the PRM produces per-step scores $r(x, s_{1:k})$ for $k = 1, \dots, K$. Supervision comes from human or automated labels marking each step as correct, neutral, or incorrect.
Lightman et al. (2023) "Let's Verify Step by Step" trained a PRM on the PRM800K dataset (around 800k step-level labels on MATH solutions) and showed that PRMs outperform ORMs when used as verifiers for best-of-$N$ sampling. Uesato et al. (2022) established the earlier process-vs-outcome comparison on grade-school math, finding that process supervision yields better interpretability even when outcome supervision matches final accuracy.
Why PRMs help:
- Step-level credit assignment. An ORM can only say "this solution is wrong." A PRM localizes the first incorrect step, so downstream updates (or search procedures) do not penalize correct reasoning that happened to precede a later error.
- Test-time verification. Given $N$ sampled solutions, a PRM can rank them by minimum step score or product of step scores, enabling verifier-guided best-of-$N$. This is one of the main levers behind test-time compute scaling in reasoning models.
- Robustness to false positives. ORMs can assign high reward to a solution that reaches the correct answer through flawed reasoning. PRMs penalize such traces at the step level.
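Verifier-guided ranking is simple to sketch. Below, candidates are ranked by their weakest step under a hypothetical PRM (min-aggregation is one common choice mentioned above; product of step scores is another):

```python
def rank_by_min_step_score(candidates):
    """Rank sampled solutions by their weakest step under a PRM.

    `candidates` maps a solution id to its list of per-step PRM scores.
    Returns ids sorted best-first.
    """
    return sorted(candidates, key=lambda c: min(candidates[c]), reverse=True)

candidates = {
    "A": [0.9, 0.9, 0.2],   # one bad step sinks the whole trace
    "B": [0.7, 0.8, 0.75],  # uniformly decent reasoning
}
best = rank_by_min_step_score(candidates)[0]  # "B"
```

This illustrates the step-level credit point: solution A has higher scores on most steps, but its single weak step makes it the riskier pick.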
Costs:
- Annotation burden. Step-level labels are far more expensive than outcome labels. PRM800K required heavy human effort. Automated step labeling (e.g., Math-Shepherd, rollouts from intermediate states) partially mitigates this but introduces its own noise.
- Step segmentation. "What counts as a step" is not well-defined for free-form text. Segmentation choices affect both training and inference.
- PRM hacking. Like any learned reward, PRMs are proxies. A policy optimized against a PRM can produce traces that score well step by step without being globally correct.
PRM does not replace the outcome signal
Using a PRM does not mean the final answer stops mattering. Most pipelines combine a PRM (for step-level verification and search guidance) with an ORM or ground-truth checker (for final correctness). The PRM narrows the search space; the outcome check is the terminal objective.
Constitutional AI
Instead of collecting human preference labels, generate preference data by having the model critique its own outputs against a set of principles. This scales the feedback generation process but still requires careful principle design.
RLVR (RL with Verifiable Rewards)
For tasks with objective correctness criteria (math, code, factual QA), skip the reward model entirely and use the correctness signal as the reward. This eliminates Goodhart concerns for the verifiable component but does not help with subjective quality.
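A rule-based reward for math can be as simple as extracting and comparing a final answer. The `####` answer-marker convention below is illustrative (it resembles but need not match any particular dataset's format):

```python
def verifiable_reward(completion, gold_answer):
    """Binary rule-based reward: 1.0 if the extracted final answer
    matches the reference exactly, else 0.0. The '#### <answer>'
    marker is an assumed output convention for this sketch."""
    marker = "####"
    if marker not in completion:
        return 0.0  # malformed output earns nothing
    answer = completion.split(marker)[-1].strip()
    return 1.0 if answer == gold_answer else 0.0

r1 = verifiable_reward("2x = 4, so x = 2 #### 2", "2")  # 1.0
r0 = verifiable_reward("2x = 4, so x = 3 #### 3", "2")  # 0.0
```

Because the check is exact, there is no learned proxy to hack; the remaining risks are format gaming (matching the answer without valid reasoning) and checker brittleness (equivalent answers written differently).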
Common Confusions
RLHF does not teach the model new knowledge
RLHF steers the model toward responses that humans prefer. It cannot teach facts that were absent from pretraining data. If the base model does not know a fact, RLHF will not inject it. RLHF changes the distribution over existing capabilities, not the capabilities themselves.
DPO is not strictly better than PPO-based RLHF
DPO is simpler and avoids reward model training, but it optimizes on a fixed dataset of preferences. PPO-based RLHF can explore: the policy generates new responses during training, which the reward model evaluates. This exploration can find good responses not present in the preference dataset. The best choice depends on data availability and computational budget.
Key Takeaways
- RLHF has three stages: SFT for format, reward model for preferences, PPO for optimization against the reward model
- The reward model uses Bradley-Terry to convert pairwise comparisons into a scalar reward function
- The KL penalty against the SFT policy is a safety mechanism that prevents reward hacking, not just a regularizer
- Reward overoptimization (Goodhart) is the central failure mode: proxy reward increases while true quality decreases
- DPO removes the reward model and RL loop by reparameterizing as a classification loss; GRPO uses verifiable rewards with group-relative scoring
Exercises
Problem
A reward model assigns rewards $r_A$ and $r_B$ to the two responses in a preference pair. Under the Bradley-Terry model, what is the predicted probability that the first response is preferred, as a function of the gap $r_A - r_B$?
Problem
The optimal policy under the KL-penalized RLHF objective is $\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{SFT}}(y \mid x) \exp\big(r_\phi(x, y)/\beta\big)$. Derive this. Start from the objective and use calculus of variations or the known solution for KL-regularized optimization.
Problem
Gao et al. (2023) observed that as the KL budget increases, reward model score increases monotonically but true quality (gold reward) peaks and then decreases. Propose an experiment to estimate the optimal KL budget for a new reward model without access to a gold reward model.
References
Canonical:
- Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT, 2022), Sections 2-4
- Christiano et al., "Deep RL from Human Preferences" (2017), Sections 2-3
Current:
- Rafailov et al., "Direct Preference Optimization" (DPO, 2023), Section 4
- Gao et al., "Scaling Laws for Reward Model Overoptimization" (2023)
- Bai et al., "Constitutional AI" (2022), Section 3
GRPO and RL for reasoning:
- Shao, Wang, Zhu et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (2024), arXiv:2402.03300, Section 4 (GRPO)
- DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025), arXiv:2501.12948
Process vs outcome supervision:
- Lightman, Kosaraju, Burda et al., "Let's Verify Step by Step" (2023), arXiv:2305.20050
- Uesato et al., "Solving math word problems with process- and outcome-based feedback" (2022), arXiv:2211.14275
Next Topics
- DPO vs GRPO vs RL reasoning: detailed comparison of post-RLHF methods
- Constitutional AI: self-supervised alignment without human labels
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Policy Gradient Theorem (Layer 3)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in Rn (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- RLHF and Alignment (Layer 4)