
Comparison

RLHF vs. DPO vs. GRPO

Three approaches to aligning language models with human preferences. RLHF trains a separate reward model and optimizes via PPO. DPO eliminates the reward model by reparameterizing the preference objective. GRPO uses group-relative scoring without a reward model, suited for reasoning tasks with verifiable answers.

What Each Does

All three methods fine-tune a pretrained language model $\pi_\theta$ to align with human preferences or task objectives. They differ in whether they use an explicit reward model and how they optimize.

RLHF (Reinforcement Learning from Human Feedback) trains a reward model $r_\phi$ on preference data, then optimizes $\pi_\theta$ using PPO to maximize $r_\phi$ while staying close to a reference policy $\pi_{\text{ref}}$.

DPO (Direct Preference Optimization) shows that the RLHF objective has a closed-form optimal policy, allowing direct optimization on preference pairs without a separate reward model or RL loop.

GRPO (Group Relative Policy Optimization) generates multiple outputs per prompt, scores them (with a verifier or simple correctness check), and uses group-relative advantages for policy gradient updates.

Side-by-Side Objectives

Definition

RLHF Objective

The reward model $r_\phi$ is trained on preference pairs $(y_w, y_l)$ where $y_w$ is preferred over $y_l$:

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]$$

The policy is then optimized via PPO:

$$\max_{\pi_\theta} \; \mathbb{E}_{x,\, y \sim \pi_\theta}\big[r_\phi(x, y)\big] - \beta \, D_{\text{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)$$
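As a minimal numerical sketch of the reward-model objective above (NumPy only; the score arrays are hypothetical reward-model outputs, not a real model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reward_model_loss(r_w, r_l):
    """Bradley-Terry pairwise loss: mean of -log sigma(r(x, y_w) - r(x, y_l)).

    r_w, r_l: reward-model scores for the preferred and dispreferred
    responses in each pair (hypothetical values for illustration).
    """
    margin = np.asarray(r_w, dtype=float) - np.asarray(r_l, dtype=float)
    return float(np.mean(-np.log(sigmoid(margin))))

# The loss shrinks as the score margin between winner and loser grows,
# and is large when the ordering is violated.
print(reward_model_loss([2.0], [0.0]))  # small: preferred response scores higher
print(reward_model_loss([0.0], [2.0]))  # large: ordering violated
```

In practice the scores come from a trained network and this loss is minimized by gradient descent over $\phi$; the sketch only shows the shape of the objective.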

Definition

DPO Objective

DPO reparameterizes the reward as $r(x,y) = \beta \log\frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$ and optimizes directly:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$

No reward model. No RL loop. Standard cross-entropy-style training on preference pairs.
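The DPO loss can be sketched directly from the formula, assuming the per-response log-probabilities are already computed (the inputs below are hypothetical numbers, not outputs of an actual model):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss on a batch of preference pairs.

    logp_w, logp_l: log pi_theta(y|x) for preferred / dispreferred responses.
    ref_logp_w, ref_logp_l: the same quantities under the frozen reference policy.
    """
    # Implicit reward margin: beta * (winner log-ratio - loser log-ratio).
    margin = beta * ((np.asarray(logp_w) - np.asarray(ref_logp_w))
                     - (np.asarray(logp_l) - np.asarray(ref_logp_l)))
    # -log sigmoid(margin), written stably via log1p.
    return float(np.mean(np.log1p(np.exp(-margin))))

# At initialization (policy == reference) the margin is 0 and the loss is log 2;
# up-weighting winners relative to the reference lowers it.
print(dpo_loss([-1.0], [-1.0], [-1.0], [-1.0]))
print(dpo_loss([-0.5], [-1.5], [-1.0], [-1.0]))
```

Note the loss depends only on log-probability ratios, so it trains with ordinary backprop, exactly the "cross-entropy-style" property the text describes.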

Definition

GRPO Objective

For each prompt $x$, sample a group of $G$ responses $\{y_1, \ldots, y_G\}$ from $\pi_\theta$. Score each with a verifier or reward function $r(x, y_i)$. Compute group-relative advantages:

$$\hat{A}_i = \frac{r(x, y_i) - \operatorname{mean}(\{r(x, y_j)\})}{\operatorname{std}(\{r(x, y_j)\})}$$

Update $\pi_\theta$ with a clipped policy gradient using these advantages, plus a KL penalty to $\pi_{\text{ref}}$.
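The advantage computation is small enough to sketch in full (NumPy; the binary rewards below stand in for a verifier's output, and the epsilon guarding against a zero standard deviation is an assumed implementation detail):

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one prompt's group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical binary verifier rewards for G = 4 sampled answers to one prompt:
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers receive positive advantage, incorrect ones negative,
# and the advantages sum to (approximately) zero within the group.
```

Because the baseline is the group mean, no learned value function is needed; the group itself provides the variance reduction that PPO gets from a critic.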

Where Each Is Stronger

RLHF wins on flexibility and established results

RLHF with PPO was used to train ChatGPT, Claude, and other early aligned models. The separate reward model can capture complex, nuanced preferences that are hard to express as binary comparisons. The reward model can also be reused across multiple policy training runs and analyzed independently.

DPO wins on simplicity and stability

DPO removes two complex components: the reward model and the RL optimizer. Training reduces to supervised learning on preference pairs. There are no reward model training instabilities, no PPO hyperparameter tuning (clip ratio, GAE lambda, number of epochs), and no need to maintain separate reward model and policy infrastructure. DPO is simpler to implement and debug.

GRPO wins on reasoning tasks with verifiable answers

For math, coding, and logical reasoning, correctness is objectively verifiable. GRPO exploits this: sample multiple answers, check which are correct, and use the group statistics as advantages. No human preference data is needed. No reward model is needed. The policy learns from its own successes and failures on verifiable tasks.

Where Each Fails

RLHF fails on computational cost and complexity

RLHF requires three models in memory during training: the policy $\pi_\theta$, the reference policy $\pi_{\text{ref}}$, and the reward model $r_\phi$. PPO itself requires computing advantages, managing rollout buffers, and tuning several hyperparameters. The training pipeline is brittle: reward model quality directly bounds alignment quality, and reward hacking (the policy exploiting reward model errors) is a persistent failure mode.

DPO fails on distribution shift

DPO trains on fixed preference pairs from a static dataset. As $\pi_\theta$ moves away from the distribution that generated the preference data, the implicit reward becomes unreliable. Online DPO (generating new completions and collecting preferences iteratively) mitigates this but reintroduces much of the complexity DPO was designed to avoid.

GRPO fails when correctness is not verifiable

GRPO requires a scoring function that can evaluate each response. For open-ended tasks (creative writing, summarization, nuanced advice), there is no simple verifier. The method is specialized for domains where answers are checkable: math proofs, code execution, factual questions with known answers.

Key Properties Compared

| | RLHF | DPO | GRPO |
|---|---|---|---|
| Reward model | Explicit, separately trained | Implicit (reparameterized) | Not needed (verifier instead) |
| Training loop | RL (PPO) | Supervised (cross-entropy-like) | Policy gradient with group advantages |
| Data | Preference pairs | Preference pairs | Prompts + verifier |
| Models in memory | 3 (policy, ref, reward) | 2 (policy, ref) | 2 (policy, ref) |
| Best domain | General alignment | General alignment | Verifiable reasoning |
| Distribution shift | Online generation helps | Offline data is a limitation | Online by design |

The Implicit Reward Connection

Theorem

DPO as Implicit Reward Maximization

Statement

The optimal policy under the RLHF objective $\max_\pi \mathbb{E}[r(x,y)] - \beta \, D_{\text{KL}}(\pi \,\|\, \pi_{\text{ref}})$ satisfies:

$$\pi^*(y|x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y|x) \exp\left(\frac{r(x,y)}{\beta}\right)$$

Inverting this gives the implicit reward: $r(x,y) = \beta \log\frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$.

DPO substitutes this into the Bradley-Terry preference model and optimizes directly over the policy parameters, eliminating the need to learn $r$ separately.
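The inversion can be verified numerically on a toy discrete response space (the reference probabilities and rewards below are arbitrary made-up values):

```python
import numpy as np

beta = 0.5
pi_ref = np.array([0.5, 0.3, 0.2])   # hypothetical reference policy over 3 responses
r = np.array([1.0, -0.5, 0.3])       # hypothetical rewards

# Optimal policy of the KL-regularized objective:
# pi*(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta).
unnorm = pi_ref * np.exp(r / beta)
Z = unnorm.sum()
pi_star = unnorm / Z

# Inverting recovers the reward exactly, up to the beta * log Z constant.
r_recovered = beta * np.log(pi_star / pi_ref) + beta * np.log(Z)
print(np.allclose(r_recovered, r))  # True
```

The $\beta \log Z(x)$ term is the same for both responses in a preference pair, which is why it cancels inside the DPO sigmoid and never needs to be computed during training.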

Intuition

The reward model and the policy encode the same information: if you know one, you can recover the other (up to a constant). DPO exploits this redundancy by parameterizing the reward through the policy directly.

Failure Mode

The equivalence holds at the optimum. During training, the policy is not at the optimum, so the implicit reward is an approximation. If the policy drifts far from $\pi_{\text{ref}}$, the implicit reward can become poorly calibrated, leading to worse alignment than RLHF with a well-trained explicit reward model.

When a Researcher Would Use Each

Example

General-purpose LLM alignment with human preference data

Start with DPO for its simplicity. If alignment quality is insufficient or distribution shift is a concern, switch to online DPO or RLHF with PPO. The complexity of RLHF is justified only when simpler methods fall short.

Example

Training a math reasoning model

Use GRPO. Generate multiple solutions per problem, verify correctness (e.g., check final numerical answer), and update the policy using group-relative advantages. This was the approach used by DeepSeek-R1, which demonstrated strong math reasoning through GRPO-style training.

Example

Aligning a model for safety and helpfulness simultaneously

Use RLHF. Safety and helpfulness are complex, sometimes conflicting objectives that benefit from a nuanced reward model. A well-trained reward model can capture tradeoffs that binary preference pairs alone may not express. Constitutional AI methods build on RLHF for this reason.

Common Confusions

Watch Out

DPO is not reward-model-free in spirit

DPO still requires preference data, which encodes the same information a reward model would learn. The difference is computational, not conceptual: DPO avoids training and maintaining a separate reward model, but it still depends on human preference judgments during data collection.

Watch Out

GRPO is not limited to math

GRPO works for any task with a verifiable scoring function. Code generation (run tests), factual QA (check against ground truth), and constrained generation (verify format compliance) are all amenable. The limitation is the need for automated verification, not the specific domain.
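A verifier can be as small as a regex, as in this toy format-compliance check (illustrative only; the `Answer: <number>` convention is an assumption, and real verifiers range from unit tests to exact-match graders):

```python
import re

def format_reward(response: str) -> float:
    """Toy GRPO-style verifier: 1.0 if the response ends with a final
    answer of the form 'Answer: <number>', else 0.0."""
    return 1.0 if re.search(r"Answer:\s*-?\d+(\.\d+)?\s*$", response) else 0.0

print(format_reward("The result is 12. Answer: 12"))  # 1.0
print(format_reward("I think it's around twelve."))   # 0.0
```

Any deterministic function from responses to scores can slot into the group-relative advantage computation in the same way.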

Watch Out

More RL does not always mean better alignment

RLHF with PPO can overfit to the reward model, producing outputs that score highly on $r_\phi$ but are not genuinely preferred by humans. This reward hacking is a real failure mode. Simpler methods (DPO, GRPO) sometimes achieve better alignment because they are less prone to exploiting a learned reward signal.
