What Each Does
All three methods fine-tune a pretrained language model to align with human preferences or task objectives. They differ in whether they use an explicit reward model and how they optimize.
RLHF (Reinforcement Learning from Human Feedback) trains a reward model $r_\phi$ on preference data, then optimizes the policy with PPO to maximize $r_\phi(x, y)$ while staying close to a reference policy $\pi_{\text{ref}}$.
DPO (Direct Preference Optimization) shows that the RLHF objective has a closed-form optimal policy, allowing direct optimization on preference pairs without a separate reward model or RL loop.
GRPO (Group Relative Policy Optimization) generates multiple outputs per prompt, scores them (with a verifier or simple correctness check), and uses group-relative advantages for policy gradient updates.
Side-by-Side Objectives
RLHF Objective
The reward model $r_\phi$ is trained on preference pairs $(x, y_w, y_l)$ where $y_w$ is preferred over $y_l$:

$$\mathcal{L}_R(\phi) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]$$

The policy is then optimized via PPO:

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big]$$
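The reward-model stage reduces to a Bradley-Terry logistic loss on score differences. A minimal sketch in plain Python (scores stand in for the reward model's scalar outputs on chosen and rejected responses):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def reward_model_loss(chosen_scores, rejected_scores):
    """Bradley-Terry loss: mean of -log sigma(r(x, y_w) - r(x, y_l)) over pairs."""
    losses = [-math.log(sigmoid(rw - rl))
              for rw, rl in zip(chosen_scores, rejected_scores)]
    return sum(losses) / len(losses)

# The loss is small when the model already ranks the preferred response
# higher, and large when it ranks it lower.
print(reward_model_loss([2.0], [0.0]) < reward_model_loss([0.0], [2.0]))  # True
```

In practice the scores come from a learned head on a transformer and the loss is minimized with gradient descent; the comparison structure is the same.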
DPO Objective
DPO reparameterizes the reward as $r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$ and optimizes directly:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
No reward model. No RL loop. Standard cross-entropy-style training on preference pairs.
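The per-pair loss is simple enough to compute by hand. A sketch, assuming sequence-level log-probabilities under the policy and the frozen reference are already available:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigma(beta * [(logp_w - ref_logp_w)
    - (logp_l - ref_logp_l)]). The log Z(x) terms cancel in the margin."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy identical to the reference has zero margin, giving loss log 2;
# raising the chosen response's relative likelihood drives the loss down.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # log(2) ~ 0.693
```

Note the gradient flows only through `logp_w` and `logp_l`; the reference log-probs are constants, which is why no second trainable model is needed.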
GRPO Objective
For each prompt $x$, sample a group of $G$ responses $\{y_1, \dots, y_G\}$ from the current policy $\pi_{\theta_{\text{old}}}$. Score each with a verifier or reward function to obtain rewards $r_1, \dots, r_G$. Compute group-relative advantages:

$$A_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_G\})}{\mathrm{std}(\{r_1, \dots, r_G\})}$$

Update with a clipped policy-gradient objective using these advantages, plus a KL penalty to $\pi_{\text{ref}}$.
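The advantage computation is just within-group standardization. A sketch (implementations differ on sample vs. population std and on zero-variance handling; population std and a zero fallback are assumed here):

```python
import statistics

def group_relative_advantages(rewards):
    """A_i = (r_i - mean(group)) / std(group), with a guard for
    zero-variance groups (all answers equally right or wrong)."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards)
    if sd == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sd for r in rewards]

# Four sampled answers to one prompt, scored 1 if correct, 0 otherwise.
print(group_relative_advantages([1, 0, 0, 1]))  # [1.0, -1.0, -1.0, 1.0]
```

The zero-variance case matters in practice: a group where every sample gets the same reward carries no learning signal, so its advantages are all zero.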
Where Each Is Stronger
RLHF wins on flexibility and established results
RLHF with PPO was used to train ChatGPT, Claude, and other early aligned models. The separate reward model can capture complex, nuanced preferences that are hard to express as binary comparisons. The reward model can also be reused across multiple policy training runs and analyzed independently.
DPO wins on simplicity and stability
DPO removes two complex components: the reward model and the RL optimizer. Training reduces to supervised learning on preference pairs. There are no reward model training instabilities, no PPO hyperparameter tuning (clip ratio, GAE lambda, number of epochs), and no need to maintain separate reward model and policy infrastructure. DPO is simpler to implement and debug.
GRPO wins on reasoning tasks with verifiable answers
For math, coding, and logical reasoning, correctness is objectively verifiable. GRPO exploits this: sample multiple answers, check which are correct, and use the group statistics as advantages. No human preference data is needed. No reward model is needed. The policy learns from its own successes and failures on verifiable tasks.
Where Each Fails
RLHF fails on computational cost and complexity
RLHF requires three models in memory during training: the policy $\pi_\theta$, the reference policy $\pi_{\text{ref}}$, and the reward model $r_\phi$. PPO itself requires computing advantages, managing rollout buffers, and tuning several hyperparameters. The training pipeline is brittle: reward model quality directly bounds alignment quality, and reward hacking (the policy exploiting reward model errors) is a persistent failure mode.
DPO fails on distribution shift
DPO trains on fixed preference pairs from a static dataset. As $\pi_\theta$ moves away from the distribution that generated the preference data, the implicit reward becomes unreliable. Online DPO (generating new completions and collecting preferences iteratively) mitigates this but reintroduces much of the complexity DPO was designed to avoid.
GRPO fails when correctness is not verifiable
GRPO requires a scoring function that can evaluate each response. For open-ended tasks (creative writing, summarization, nuanced advice), there is no simple verifier. The method is specialized for domains where answers are checkable: math proofs, code execution, factual questions with known answers.
Key Properties Compared
| | RLHF | DPO | GRPO |
|---|---|---|---|
| Reward model | Explicit, separately trained | Implicit (reparameterized) | Not needed (verifier instead) |
| Training loop | RL (PPO) | Supervised (cross-entropy-like) | Policy gradient with group advantages |
| Data | Preference pairs | Preference pairs | Prompts + verifier |
| Models in memory | 3 (policy, ref, reward) | 2 (policy, ref) | 2 (policy, ref) |
| Best domain | General alignment | General alignment | Verifiable reasoning |
| Distribution shift | Online generation helps | Offline data is a limitation | Online by design |
The Implicit Reward Connection
DPO as Implicit Reward Maximization
Statement
The optimal policy under the RLHF objective satisfies:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$$

Inverting this gives the implicit reward: $r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$.
DPO substitutes this into the Bradley-Terry preference model and optimizes directly over the policy parameters, eliminating the need to learn $r_\phi$ separately.
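The cancellation of the intractable $\log Z(x)$ term is the key computational trick, and it is easy to see numerically. A sketch of the implied Bradley-Terry preference probability for two responses to the same prompt:

```python
import math

def implicit_preference_prob(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """P(y_w beats y_l) under Bradley-Terry with the implicit reward
    beta * log(pi/pi_ref). The beta * log Z(x) term is identical for both
    responses, so it cancels in the difference and is never computed."""
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

# A policy identical to the reference is indifferent between the responses.
print(implicit_preference_prob(-10.0, -12.0, -10.0, -12.0))  # 0.5
```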
Intuition
The reward model and the policy encode the same information: if you know one, you can recover the other (up to a constant). DPO exploits this redundancy by parameterizing the reward through the policy directly.
Failure Mode
The equivalence holds at the optimum. During training, the policy is not at the optimum, so the implicit reward is an approximation. If the policy drifts far from $\pi_{\text{ref}}$, the implicit reward can become poorly calibrated, leading to worse alignment than RLHF with a well-trained explicit reward model.
When a Researcher Would Use Each
General-purpose LLM alignment with human preference data
Start with DPO for its simplicity. If alignment quality is insufficient or distribution shift is a concern, switch to online DPO or RLHF with PPO. The complexity of RLHF is justified only when simpler methods fall short.
Training a math reasoning model
Use GRPO. Generate multiple solutions per problem, verify correctness (e.g., check final numerical answer), and update the policy using group-relative advantages. This was the approach used by DeepSeek-R1, which demonstrated strong math reasoning through GRPO-style training.
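A verifier for this setting can be very simple. A deliberately crude sketch that extracts the last number in a response and compares it to the known answer (real pipelines typically parse a structured format such as a boxed final answer instead):

```python
import re

def verify_final_answer(response: str, target: float, tol: float = 1e-6) -> float:
    """Reward 1.0 if the last number in the response matches the target,
    else 0.0. Hypothetical helper for illustration, not a robust parser."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - target) <= tol else 0.0

print(verify_final_answer("The answer is 42.", 42))  # 1.0
print(verify_final_answer("So x = 7", 42))           # 0.0
```

These binary rewards feed directly into the group-relative advantage computation: sample a group, score each response, standardize within the group, update.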
Aligning a model for safety and helpfulness simultaneously
Use RLHF. Safety and helpfulness are complex, sometimes conflicting objectives that benefit from a nuanced reward model. A well-trained reward model can capture tradeoffs that binary preference pairs alone may not express. Constitutional AI methods build on RLHF for this reason.
Common Confusions
DPO is not reward-model-free in spirit
DPO still requires preference data, which encodes the same information a reward model would learn. The difference is computational, not conceptual: DPO avoids training and maintaining a separate reward model, but it still depends on human preference judgments during data collection.
GRPO is not limited to math
GRPO works for any task with a verifiable scoring function. Code generation (run tests), factual QA (check against ground truth), and constrained generation (verify format compliance) are all amenable. The limitation is the need for automated verification, not the specific domain.
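As a non-math illustration, a format-compliance verifier can be a few lines. A sketch assuming a hypothetical schema requiring valid JSON with an `"answer"` key:

```python
import json

def format_reward(response: str) -> float:
    """Constrained-generation verifier sketch: reward 1.0 if the response is
    valid JSON containing an 'answer' key (hypothetical schema), else 0.0."""
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(obj, dict) and "answer" in obj else 0.0

print(format_reward('{"answer": 7}'))  # 1.0
print(format_reward("answer: 7"))      # 0.0
```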
More RL does not always mean better alignment
RLHF with PPO can overfit to the reward model, producing outputs that score highly on $r_\phi$ but are not genuinely preferred by humans. This reward hacking is a real failure mode. Simpler methods (DPO, GRPO) sometimes achieve better alignment because they are less prone to exploiting a learned reward signal.
References
Canonical:
- Ouyang et al., Training Language Models to Follow Instructions with Human Feedback (NeurIPS 2022)
- Rafailov et al., Direct Preference Optimization: Your Language Model is Secretly a Reward Model (NeurIPS 2023)
Current:
- Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2024)
- Guo et al., DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025)