Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.


Reinforcement Learning from Human Feedback: Deep Dive

The full RLHF pipeline: supervised fine-tuning, Bradley-Terry reward modeling, PPO with KL penalty, reward hacking via Goodhart, and the post-RLHF landscape of DPO, GRPO, and RLVR.


Why This Matters

RLHF: from pretrained model to aligned model in three phases

[Figure: the three-phase pipeline. Phase 1 (SFT): supervised fine-tuning on human-written demonstration data. Phase 2 (RM): train a reward model on comparison data (A > B rankings). Phase 3 (RL): PPO against the reward model on prompts from the target distribution, optimizing $R(x, y) - \beta \cdot \mathrm{KL}(\pi_{\text{RL}} \,\|\, \pi_{\text{SFT}})$, i.e. reward minus a KL penalty that prevents reward hacking. Output: an aligned model that is helpful, harmless, and honest.]

RLHF is the technique that turned base language models into useful assistants. GPT-4, Claude, Gemini, and every other major chatbot use some form of learning from human feedback. Understanding the full pipeline, including its failure modes, is necessary for evaluating claims about alignment, safety, and model behavior. The theory behind RLHF also explains why newer methods (DPO, GRPO) emerged and what tradeoffs they make.

The Three-Stage Pipeline

The canonical RLHF pipeline (Ouyang et al., 2022, InstructGPT) has three stages:

  1. Supervised Fine-Tuning (SFT): Fine-tune a pretrained LM on high-quality demonstration data
  2. Reward Model Training: Train a scalar reward model on human preference comparisons
  3. RL Optimization: Optimize the policy against the reward model using PPO with a KL penalty against the SFT policy

Stage 1: Supervised Fine-Tuning

Start with a pretrained language model $\pi_{\text{pre}}$. Fine-tune it on a dataset of (prompt, high-quality response) pairs using the standard next-token cross-entropy loss. This produces $\pi_{\text{SFT}}$.

SFT alone already improves instruction-following substantially. The purpose of the next two stages is to go beyond what demonstration data can teach: to learn preferences over response quality rather than just response format.

Stage 2: Reward Model

Definition

Bradley-Terry Preference Model

Given two responses $y_w$ (preferred) and $y_l$ (dispreferred) to prompt $x$, the Bradley-Terry model assumes:

$$P(y_w \succ y_l \mid x) = \sigma(r(x, y_w) - r(x, y_l))$$

where $r: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a learned reward function and $\sigma$ is the sigmoid function. The reward model is trained by maximizing this likelihood over a dataset of human comparisons.

Proposition

Bradley-Terry Reward Model Loss

Statement

The maximum likelihood objective for the reward model given a dataset $\mathcal{D} = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}_{i=1}^N$ is:

$$\mathcal{L}_{\text{RM}} = -\frac{1}{N}\sum_{i=1}^{N} \log \sigma\left(r(x^{(i)}, y_w^{(i)}) - r(x^{(i)}, y_l^{(i)})\right)$$

This is binary cross-entropy where the reward difference predicts which response humans prefer.

Intuition

The reward model does not learn absolute scores. It learns to rank responses: preferred responses get higher reward than dispreferred ones. The sigmoid converts the reward gap into a probability, and we maximize the probability of the observed preference ordering.
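The loss above can be sketched in a few lines of plain Python. Toy scalar rewards stand in for a learned reward network here; a real reward model would be a transformer scoring (prompt, response) pairs.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bt_loss(rewards_chosen: list[float], rewards_rejected: list[float]) -> float:
    """Negative log-likelihood of preferences under Bradley-Terry.

    Each pair contributes -log sigmoid(r_w - r_l); we average over pairs.
    """
    n = len(rewards_chosen)
    return -sum(
        math.log(sigmoid(rw - rl))
        for rw, rl in zip(rewards_chosen, rewards_rejected)
    ) / n

# One pair with a reward gap of 1.5: loss = -log sigmoid(1.5) ≈ 0.201
print(bt_loss([2.0], [0.5]))
```

A larger reward gap in the preferred direction drives the per-pair loss toward zero; a gap of zero gives the chance-level loss $\log 2$.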

Proof Sketch

Write the log-likelihood of the data under the Bradley-Terry model: $\sum_i \log P(y_w^{(i)} \succ y_l^{(i)}) = \sum_i \log \sigma(r(x^{(i)}, y_w^{(i)}) - r(x^{(i)}, y_l^{(i)}))$. Negate for a loss to minimize.

Why It Matters

This is the standard reward model objective used in InstructGPT, the original ChatGPT RLHF pipeline, and most subsequent work. The reward model serves as a proxy for human judgment: cheaper to query than humans, but susceptible to reward hacking.

Failure Mode

The Bradley-Terry model assumes a total ordering of responses. In practice, human preferences are intransitive (A > B, B > C, but C > A), context-dependent, and noisy. Labeler disagreement can be substantial (30%+ on subjective prompts). The reward model fits a smooth function to this noisy signal, which can produce systematic biases.

Stage 3: PPO with KL Penalty

Given the reward model $r$ and the SFT policy $\pi_{\text{SFT}}$, optimize:

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\left[r(x, y) - \beta \cdot D_{\text{KL}}(\pi(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x))\right]$$

where $\beta > 0$ controls the strength of the KL penalty. This objective is optimized with PPO (Proximal Policy Optimization), with the reward model providing the reward signal.
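In practice, implementations commonly fold the KL term into the per-token reward: each generated token is penalized by $\beta$ times the log-probability ratio between the policy and the SFT model, and the scalar reward-model score is added at the final token. A minimal sketch (the function name and the toy numbers are illustrative, not from any particular codebase):

```python
def shaped_rewards(logp_policy: list[float], logp_sft: list[float],
                   rm_score: float, beta: float) -> list[float]:
    """Per-token reward: -beta * (log pi(y_t | ...) - log pi_SFT(y_t | ...)),
    with the scalar reward-model score added on the final token."""
    rewards = [-beta * (lp - ls) for lp, ls in zip(logp_policy, logp_sft)]
    rewards[-1] += rm_score
    return rewards

# The second token is more likely under the policy than under SFT,
# so it pays a small KL penalty that offsets part of the RM score.
print(shaped_rewards([-1.0, -0.5], [-1.0, -1.5], rm_score=1.0, beta=0.1))
```

This per-token shaping is what lets a sequence-level KL constraint be handled by a standard token-level RL algorithm.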

Proposition

KL Penalty Prevents Reward Hacking

Statement

Without the KL penalty ($\beta = 0$), the optimal policy $\pi^*$ maximizes $r(x, y)$ and will exploit any systematic errors in the reward model, producing outputs that score high under $r$ but low under true human judgment. The KL penalty constrains the policy to stay near $\pi_{\text{SFT}}$, limiting the degree of reward exploitation:

$$D_{\text{KL}}(\pi^* \,\|\, \pi_{\text{SFT}}) \leq \frac{r_{\max}}{\beta}$$

Larger $\beta$ keeps the policy closer to SFT; smaller $\beta$ allows more reward optimization at the risk of more hacking.

Intuition

The reward model is a learned proxy, not the true objective. Optimizing a proxy too aggressively is Goodhart's Law: "when a measure becomes a target, it ceases to be a good measure." The KL penalty sets a trust region around the SFT policy where the reward model is still a reasonable proxy.

Proof Sketch

The KL-penalized objective has a closed-form optimal policy: $\pi^*(y \mid x) \propto \pi_{\text{SFT}}(y \mid x) \exp(r(x,y)/\beta)$. As $\beta \to 0$, $\pi^*$ concentrates on the reward-maximizing response. As $\beta \to \infty$, $\pi^* \to \pi_{\text{SFT}}$. The bound on the KL divergence follows from substituting $\pi^*$ back into the objective.
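The limiting behavior is easy to check numerically on a toy three-response space (the SFT distribution and rewards below are made up for illustration):

```python
import math

def optimal_policy(pi_sft: list[float], rewards: list[float],
                   beta: float) -> list[float]:
    """pi*(y) proportional to pi_SFT(y) * exp(r(y)/beta), normalized
    over a finite response set."""
    unnorm = [p * math.exp(r / beta) for p, r in zip(pi_sft, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

pi_sft = [0.7, 0.2, 0.1]
rewards = [0.0, 1.0, 2.0]

# Small beta: probability mass collapses onto the reward-maximizing response.
print(optimal_policy(pi_sft, rewards, beta=0.1))
# Large beta: the optimal policy stays close to pi_SFT.
print(optimal_policy(pi_sft, rewards, beta=100.0))
```

The same exponential-tilting form is what DPO later exploits to eliminate the explicit reward model.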

Why It Matters

This is the central design choice of RLHF. The KL penalty is not just regularization for stability; it is a safety mechanism. Empirically, reward model score increases monotonically with optimization, but true quality (measured by human evaluation) peaks and then decreases. The KL penalty stops optimization before this peak.

Failure Mode

Choosing $\beta$ is difficult. Too large, and the model barely improves over SFT; too small, and the model reward-hacks. In practice, $\beta$ is tuned by monitoring a held-out set of human evaluations during training, which is expensive. The reward model overoptimization study (Gao et al., 2023) shows that as the KL budget grows, true quality follows an inverted-U: it rises, peaks, and then degrades even as proxy reward keeps increasing.

Reward Hacking and Goodhart's Law

Definition

Reward Hacking

Reward hacking occurs when the policy finds outputs that achieve high reward model scores through features that do not correspond to genuine quality. Examples include: generating longer responses (length bias in reward models), repeating confident-sounding phrases, or including irrelevant but impressive-sounding details. The reward model assigns high scores because these features correlate with quality in the training data, but the correlation breaks under optimization.

Watch Out

The reward model is not the objective

The true objective is human satisfaction. The reward model is a proxy. RLHF works only in the regime where the proxy and the true objective agree. The KL penalty keeps the policy in this regime. Going beyond this regime (overoptimization) degrades the actual output quality even as the proxy reward increases.

InstructGPT: The Canonical Example

InstructGPT (Ouyang et al., 2022) applied this three-stage pipeline to GPT-3 (175B parameters). Key findings:

  • The 1.3B-parameter InstructGPT model was preferred by human labelers over the 175B base GPT-3, demonstrating that alignment training can compensate for a more than 100x reduction in model size
  • The SFT stage alone captured most of the formatting improvement; RLHF added quality and safety on top
  • Reward model accuracy was about 72% on held-out comparisons, well above the 50% random baseline but far from perfect

What Has Changed Since

DPO (Direct Preference Optimization)

DPO (Rafailov et al., 2023) eliminates the explicit reward model and PPO entirely. It reparameterizes the KL-constrained reward maximization as a classification loss on preference pairs:

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$$

DPO is simpler (no RL training loop), but it optimizes on static preference data and cannot explore.
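Because the loss needs only sequence log-probabilities under the policy and the frozen reference, one preference pair reduces to a few arithmetic operations. A sketch with made-up log-probabilities; a real implementation would sum per-token log-probs from two forward passes:

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair, given sequence log-probs log pi_theta(y|x)
    and log pi_ref(y|x) for the chosen (w) and rejected (l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# The policy already prefers the chosen response relative to the reference,
# so the implicit-reward margin is positive and the loss is below log 2.
print(dpo_loss(logp_w=-5.0, logp_l=-10.0, ref_logp_w=-6.0, ref_logp_l=-9.0))
```

Note the correspondence to the Bradley-Terry loss: the implicit reward is $\beta \log \frac{\pi_\theta}{\pi_{\text{ref}}}$, plugged into the same $-\log\sigma(\cdot)$ form.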

GRPO

Definition

Group Relative Policy Optimization (GRPO)

GRPO (Shao et al., 2024) modifies PPO by removing the value function (critic) and estimating the advantage from a group of $G$ completions sampled from the same prompt. For prompt $x$, sample $\{y_1, \ldots, y_G\} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)$, score each with reward $r_i$, and define the advantage as the within-group standardized reward:

$$A_i = \frac{r_i - \text{mean}(\{r_1, \ldots, r_G\})}{\text{std}(\{r_1, \ldots, r_G\})}$$

The policy is updated using the PPO-style clipped surrogate objective with $A_i$ as the advantage, plus a KL penalty against a fixed reference policy $\pi_{\text{ref}}$ (typically the SFT model), matching the regularization role that $\pi_{\text{SFT}}$ plays in standard RLHF.

GRPO was introduced in DeepSeekMath (Shao et al., 2024) and is the RL algorithm used to train DeepSeek-R1 (DeepSeek-AI, 2025). Key properties:

  • No critic network. Standard PPO trains a value function $V_\phi(s)$ of comparable size to the policy to compute advantages. GRPO drops it. Memory cost drops from holding two large models (policy and value) to one, which matters at LLM scale where each is tens of billions of parameters.
  • Group-relative advantage. The baseline is the group mean, not a learned value estimate. Standardizing by the group standard deviation gives per-prompt advantages with zero mean and unit variance, which stabilizes updates across prompts with very different reward scales.
  • Same KL regularization as RLHF. GRPO retains a per-token KL penalty against $\pi_{\text{ref}}$, preserving the trust-region role that prevents reward hacking.
  • Pairs well with verifiable rewards. When $r_i$ is a rule-based check (correct answer, passing test, valid format), GRPO skips reward-model training entirely, and the group-relative estimator has low variance.
Example

Advantage computation for a group of G=4

Prompt: "Solve $2x + 3 = 11$." Sample four completions and check final answers. Suppose the rewards are $r_1 = 1, r_2 = 0, r_3 = 1, r_4 = 0$.

Mean $= 0.5$, standard deviation $= 0.5$. Advantages: $A_1 = A_3 = 1.0$ and $A_2 = A_4 = -1.0$. Each token in a correct completion receives advantage $+1$; each token in an incorrect completion receives $-1$. No value network is consulted.
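The same computation in code, as a direct transcription of the advantage formula (real implementations also guard against zero within-group variance, e.g. by adding a small epsilon or dropping groups where every completion gets the same reward):

```python
import math

def group_advantages(rewards: list[float]) -> list[float]:
    """Within-group standardized rewards, used as the per-token advantage
    for every token of the corresponding completion."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / std for r in rewards]

print(group_advantages([1, 0, 1, 0]))  # [1.0, -1.0, 1.0, -1.0]
```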

See the verifier design and process reward page for details on rule-based rewards, and DPO vs GRPO vs RL reasoning for a direct comparison.

Process vs Outcome Reward Models

Definition

Outcome Reward Model (ORM)

An outcome reward model scores only the final answer of a full solution, $r_{\text{ORM}}(x, y) \in \mathbb{R}$, regardless of the reasoning steps that produced it. Supervision comes from labeling complete solutions as correct or incorrect.

Definition

Process Reward Model (PRM)

A process reward model scores each intermediate step of a reasoning trace. Given a solution broken into steps $y = (s_1, s_2, \ldots, s_T)$, the PRM produces per-step scores $r_{\text{PRM}}(x, s_{1:t}) \in \mathbb{R}$ for $t = 1, \ldots, T$. Supervision comes from human or automated labels marking each step correct, neutral, or incorrect.

Lightman et al. (2023), "Let's Verify Step by Step," trained a PRM on the PRM800K dataset (around 800k step-level labels on MATH solutions) and showed that PRMs outperform ORMs when used as verifiers for best-of-$N$ sampling. Uesato et al. (2022) established the earlier process-vs-outcome comparison on grade-school math, finding that process supervision yields better interpretability even when outcome supervision matches final accuracy.

Why PRMs help:

  • Step-level credit assignment. An ORM can only say "this solution is wrong." A PRM localizes the first incorrect step, so downstream updates (or search procedures) do not penalize correct reasoning that happened to precede a later error.
  • Test-time verification. Given $N$ sampled solutions, a PRM can rank them by minimum step score or product of step scores, enabling verifier-guided best-of-$N$. This is one of the main levers behind test-time compute scaling in reasoning models.
  • Robustness to false positives. ORMs can assign high reward to a solution that reaches the correct answer through flawed reasoning. PRMs penalize such traces at the step level.
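Verifier-guided best-of-$N$ with a PRM can be sketched as a ranking rule over per-step scores. The scores below are invented for illustration; a real PRM would produce them by scoring each step of the sampled reasoning traces:

```python
def best_of_n(step_scores: list[list[float]]) -> int:
    """Pick the solution whose weakest step is strongest:
    rank candidates by their minimum PRM step score."""
    return max(range(len(step_scores)), key=lambda i: min(step_scores[i]))

# Solution 0 finishes strongly but has one weak step (0.2);
# solution 1 is uniformly solid, so the min-score rule prefers it.
print(best_of_n([[0.9, 0.2, 0.95], [0.8, 0.8, 0.8]]))  # index 1
```

Replacing `min` with the product of step scores gives the other common aggregation; which works better is an empirical question per task.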

Costs:

  • Annotation burden. Step-level labels are far more expensive than outcome labels. PRM800K required heavy human effort. Automated step labeling (e.g., Math-Shepherd, rollouts from intermediate states) partially mitigates this but introduces its own noise.
  • Step segmentation. "What counts as a step" is not well-defined for free-form text. Segmentation choices affect both training and inference.
  • PRM hacking. Like any learned reward, PRMs are proxies. A policy optimized against a PRM can produce traces that score well step by step without being globally correct.
Watch Out

PRM does not replace the outcome signal

Using a PRM does not mean the final answer stops mattering. Most pipelines combine a PRM (for step-level verification and search guidance) with an ORM or ground-truth checker (for final correctness). The PRM narrows the search space; the outcome check is the terminal objective.

Constitutional AI

Instead of collecting human preference labels, generate preference data by having the model critique its own outputs against a set of principles. This scales the feedback generation process but still requires careful principle design.

RLVR (RL with Verifiable Rewards)

For tasks with objective correctness criteria (math, code, factual QA), skip the reward model entirely and use the correctness signal as the reward. This eliminates Goodhart concerns for the verifiable component but does not help with subjective quality.
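A verifiable reward is just a deterministic check. A toy exact-match checker for math answers (the "Answer:" line convention here is an assumption for illustration, not a standard format):

```python
def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Rule-based reward: 1.0 if the stated final answer matches the
    ground truth exactly, else 0.0. No reward model involved."""
    # Assumed convention: the final answer follows "Answer:" on the last line.
    last_line = completion.strip().splitlines()[-1]
    predicted = last_line.split("Answer:")[-1].strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0

print(verifiable_reward("2x = 8\nAnswer: 4", "4"))  # 1.0
```

Real pipelines use more robust extraction and equivalence checking (e.g. symbolic comparison for math, test execution for code), but the Goodhart-resistance comes from the same property: the reward is computed by a rule, not learned from preferences.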

Common Confusions

Watch Out

RLHF does not teach the model new knowledge

RLHF steers the model toward responses that humans prefer. It cannot teach facts that were absent from pretraining data. If the base model does not know a fact, RLHF will not inject it. RLHF changes the distribution over existing capabilities, not the capabilities themselves.

Watch Out

DPO is not strictly better than PPO-based RLHF

DPO is simpler and avoids reward model training, but it optimizes on a fixed dataset of preferences. PPO-based RLHF can explore: the policy generates new responses during training, which the reward model evaluates. This exploration can find good responses not present in the preference dataset. The best choice depends on data availability and computational budget.

Key Takeaways

  • RLHF has three stages: SFT for format, reward model for preferences, PPO for optimization against the reward model
  • The reward model uses Bradley-Terry to convert pairwise comparisons into a scalar reward function
  • The KL penalty against the SFT policy is a safety mechanism that prevents reward hacking, not just a regularizer
  • Reward overoptimization (Goodhart) is the central failure mode: proxy reward increases while true quality decreases
  • DPO removes the reward model and RL loop by reparameterizing as a classification loss; GRPO uses verifiable rewards with group-relative scoring

Exercises

ExerciseCore

Problem

A reward model assigns $r(x, y_w) = 2.0$ and $r(x, y_l) = 0.5$ for a preference pair. What is the predicted probability that $y_w$ is preferred under the Bradley-Terry model?

ExerciseAdvanced

Problem

The optimal policy under the KL-penalized RLHF objective is $\pi^*(y \mid x) \propto \pi_{\text{SFT}}(y \mid x) \exp(r(x,y)/\beta)$. Derive this. Start from the objective $\max_\pi \mathbb{E}_\pi[r(x,y)] - \beta D_{\text{KL}}(\pi \,\|\, \pi_{\text{SFT}})$ and use calculus of variations or the known solution for KL-regularized optimization.

ExerciseResearch

Problem

Gao et al. (2023) observed that as the KL budget increases, reward model score increases monotonically but true quality (gold reward) peaks and then decreases. Propose an experiment to estimate the optimal KL budget for a new reward model without access to a gold reward model.

References

Canonical:

  • Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT, 2022), Sections 2-4
  • Christiano et al., "Deep RL from Human Preferences" (2017), Sections 2-3

Current:

  • Rafailov et al., "Direct Preference Optimization" (DPO, 2023), Section 4
  • Gao et al., "Scaling Laws for Reward Model Overoptimization" (2023)
  • Bai et al., "Constitutional AI" (2022), Section 3

GRPO and RL for reasoning:

  • Shao, Wang, Zhu et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (2024), arXiv:2402.03300, Section 4 (GRPO)
  • DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025), arXiv:2501.12948

Process vs outcome supervision:

  • Lightman, Kosaraju, Burda et al., "Let's Verify Step by Step" (2023), arXiv:2305.20050
  • Uesato et al., "Solving math word problems with process- and outcome-based feedback" (2022), arXiv:2211.14275

Next Topics

  • DPO vs GRPO vs RL reasoning: detailed comparison of post-RLHF methods
  • Constitutional AI: self-supervised alignment without human labels

Last reviewed: April 2026
