
Comparison

RLHF vs. DPO vs. GRPO

Three approaches to aligning language models with human preferences. RLHF trains a separate reward model and optimizes via PPO. DPO eliminates the reward model by reparameterizing the preference objective. GRPO uses group-relative scoring without a reward model, suited for reasoning tasks with verifiable answers.

What Each Does

All three methods fine-tune a pretrained language model $\pi_\theta$ to align with human preferences or task objectives. They differ in whether they use an explicit reward model and how they optimize.

RLHF (Reinforcement Learning from Human Feedback) trains a reward model $r_\phi$ on preference data, then optimizes $\pi_\theta$ using PPO to maximize $r_\phi$ while staying close to a reference policy $\pi_{\text{ref}}$.

DPO (Direct Preference Optimization) shows that the RLHF objective has a closed-form optimal policy, allowing direct optimization on preference pairs without a separate reward model or RL loop.

GRPO (Group Relative Policy Optimization) generates multiple outputs per prompt, scores them (with a verifier or simple correctness check), and uses group-relative advantages for policy gradient updates.

Side-by-Side Objectives

Definition

RLHF Objective

The reward model $r_\phi$ is trained on preference pairs $(y_w, y_l)$ where $y_w$ is preferred over $y_l$:

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]$$

The policy is then optimized via PPO:

$$\max_{\pi_\theta} \; \mathbb{E}_{x,\, y \sim \pi_\theta}\big[r_\phi(x, y)\big] - \beta \, D_{\text{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)$$
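As a minimal numerical sketch of the reward-model objective above (NumPy only; the score arrays are hypothetical reward-model outputs, not a real model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reward_model_loss(r_w, r_l):
    """Bradley-Terry pairwise loss: mean of -log sigma(r(x, y_w) - r(x, y_l)).

    r_w, r_l: reward-model scores for the preferred and dispreferred
    responses in each pair (hypothetical values for illustration).
    """
    margin = np.asarray(r_w, dtype=float) - np.asarray(r_l, dtype=float)
    return float(np.mean(-np.log(sigmoid(margin))))

# The loss shrinks as the score margin between winner and loser grows,
# and is large when the ordering is violated.
print(reward_model_loss([2.0], [0.0]))  # small: preferred response scores higher
print(reward_model_loss([0.0], [2.0]))  # large: ordering violated
```

In practice the scores come from a trained network and this loss is minimized by gradient descent over $\phi$; the sketch only shows the shape of the objective.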

Definition

DPO Objective

DPO reparameterizes the reward as $r(x,y) = \beta \log\frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$ and optimizes directly:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$

No reward model. No RL loop. Standard cross-entropy-style training on preference pairs.
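The DPO loss can be sketched directly from the formula, assuming the per-response log-probabilities are already computed (the inputs below are hypothetical numbers, not outputs of an actual model):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss on a batch of preference pairs.

    logp_w, logp_l: log pi_theta(y|x) for preferred / dispreferred responses.
    ref_logp_w, ref_logp_l: the same quantities under the frozen reference policy.
    """
    # Implicit reward margin: beta * (winner log-ratio - loser log-ratio).
    margin = beta * ((np.asarray(logp_w) - np.asarray(ref_logp_w))
                     - (np.asarray(logp_l) - np.asarray(ref_logp_l)))
    # -log sigmoid(margin), written stably via log1p.
    return float(np.mean(np.log1p(np.exp(-margin))))

# At initialization (policy == reference) the margin is 0 and the loss is log 2;
# up-weighting winners relative to the reference lowers it.
print(dpo_loss([-1.0], [-1.0], [-1.0], [-1.0]))
print(dpo_loss([-0.5], [-1.5], [-1.0], [-1.0]))
```

Note the loss depends only on log-probability ratios, so it trains with ordinary backprop, exactly the "cross-entropy-style" property the text describes.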

Definition

GRPO Objective

For each prompt $x$, sample a group of $G$ responses $\{y_1, \ldots, y_G\}$ from $\pi_\theta$. Score each with a verifier or reward function $r(x, y_i)$. Compute group-relative advantages:

$$\hat{A}_i = \frac{r(x, y_i) - \operatorname{mean}(\{r(x, y_j)\})}{\operatorname{std}(\{r(x, y_j)\})}$$

Update $\pi_\theta$ with a clipped policy gradient using these advantages, plus a KL penalty to $\pi_{\text{ref}}$.
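The advantage computation is small enough to sketch in full (NumPy; the binary rewards below stand in for a verifier's output, and the epsilon guarding against a zero standard deviation is an assumed implementation detail):

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one prompt's group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical binary verifier rewards for G = 4 sampled answers to one prompt:
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers receive positive advantage, incorrect ones negative,
# and the advantages sum to (approximately) zero within the group.
```

Because the baseline is the group mean, no learned value function is needed; the group itself provides the variance reduction that PPO gets from a critic.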

Where Each Is Stronger

RLHF wins on flexibility and established results

RLHF with PPO was used to train ChatGPT, Claude, and other early aligned models. The separate reward model can capture complex, nuanced preferences that are hard to express as binary comparisons. The reward model can also be reused across multiple policy training runs and analyzed independently.

DPO wins on simplicity and stability

DPO removes two complex components: the reward model and the RL optimizer. Training reduces to supervised learning on preference pairs. There are no reward model training instabilities, no PPO hyperparameter tuning (clip ratio, GAE lambda, number of epochs), and no need to maintain separate reward model and policy infrastructure. DPO is simpler to implement and debug.

GRPO wins on reasoning tasks with verifiable answers

For math, coding, and logical reasoning, correctness is objectively verifiable. GRPO exploits this: sample multiple answers, check which are correct, and use the group statistics as advantages. No human preference data is needed. No reward model is needed. The policy learns from its own successes and failures on verifiable tasks.

Where Each Fails

RLHF fails on computational cost and complexity

RLHF requires three models in memory during training: the policy $\pi_\theta$, the reference policy $\pi_{\text{ref}}$, and the reward model $r_\phi$. PPO itself requires computing advantages, managing rollout buffers, and tuning several hyperparameters. The training pipeline is brittle: reward model quality directly bounds alignment quality, and reward hacking (the policy exploiting reward model errors) is a persistent failure mode.

DPO fails on distribution shift

DPO trains on fixed preference pairs from a static dataset. As $\pi_\theta$ moves away from the distribution that generated the preference data, the implicit reward becomes unreliable. Online DPO (generating new completions and collecting preferences iteratively) mitigates this but reintroduces much of the complexity DPO was designed to avoid.

GRPO fails when correctness is not verifiable

GRPO requires a scoring function that can evaluate each response. For open-ended tasks (creative writing, summarization, nuanced advice), there is no simple verifier. The method is specialized for domains where answers are checkable: math proofs, code execution, factual questions with known answers.

Key Properties Compared

| | RLHF | DPO | GRPO |
|---|---|---|---|
| Reward model | Explicit, separately trained | Implicit (reparameterized) | Not needed (verifier instead) |
| Training loop | RL (PPO) | Supervised (cross-entropy-like) | Policy gradient with group advantages |
| Data | Preference pairs | Preference pairs | Prompts + verifier |
| Models in memory | 3 (policy, ref, reward) | 2 (policy, ref) | 2 (policy, ref) |
| Best domain | General alignment | General alignment | Verifiable reasoning |
| Distribution shift | Online generation helps | Offline data is a limitation | Online by design |

The Implicit Reward Connection

Theorem

DPO as Implicit Reward Maximization

Statement

The optimal policy under the RLHF objective $\max_\pi \mathbb{E}[r(x,y)] - \beta \, D_{\text{KL}}(\pi \,\|\, \pi_{\text{ref}})$ satisfies:

$$\pi^*(y|x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y|x) \exp\left(\frac{r(x,y)}{\beta}\right)$$

Inverting this gives the implicit reward: $r(x,y) = \beta \log\frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$.

DPO substitutes this into the Bradley-Terry preference model and optimizes directly over the policy parameters, eliminating the need to learn $r$ separately.
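The inversion can be verified numerically on a toy discrete response space (the reference probabilities and rewards below are arbitrary made-up values):

```python
import numpy as np

beta = 0.5
pi_ref = np.array([0.5, 0.3, 0.2])   # hypothetical reference policy over 3 responses
r = np.array([1.0, -0.5, 0.3])       # hypothetical rewards

# Optimal policy of the KL-regularized objective:
# pi*(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta).
unnorm = pi_ref * np.exp(r / beta)
Z = unnorm.sum()
pi_star = unnorm / Z

# Inverting recovers the reward exactly, up to the beta * log Z constant.
r_recovered = beta * np.log(pi_star / pi_ref) + beta * np.log(Z)
print(np.allclose(r_recovered, r))  # True
```

The $\beta \log Z(x)$ term is the same for both responses in a preference pair, which is why it cancels inside the DPO sigmoid and never needs to be computed during training.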

Intuition

The reward model and the policy encode the same information: if you know one, you can recover the other (up to a constant). DPO exploits this redundancy by parameterizing the reward through the policy directly.

Failure Mode

The equivalence holds at the optimum. During training, the policy is not at the optimum, so the implicit reward is an approximation. If the policy drifts far from $\pi_{\text{ref}}$, the implicit reward can become poorly calibrated, leading to worse alignment than RLHF with a well-trained explicit reward model.

When a Researcher Would Use Each

Example

General-purpose LLM alignment with human preference data

Start with DPO for its simplicity. If alignment quality is insufficient or distribution shift is a concern, switch to online DPO or RLHF with PPO. The complexity of RLHF is justified only when simpler methods fall short.

Example

Training a math reasoning model

Use GRPO. Generate multiple solutions per problem, verify correctness (e.g., check final numerical answer), and update the policy using group-relative advantages. This was the approach used by DeepSeek-R1, which demonstrated strong math reasoning through GRPO-style training.

Example

Aligning a model for safety and helpfulness simultaneously

Use RLHF. Safety and helpfulness are complex, sometimes conflicting objectives that benefit from a nuanced reward model. A well-trained reward model can capture tradeoffs that binary preference pairs alone may not express. Constitutional AI methods build on RLHF for this reason.

Common Confusions

Watch Out

DPO is not reward-model-free in spirit

DPO still requires preference data, which encodes the same information a reward model would learn. The difference is computational, not conceptual: DPO avoids training and maintaining a separate reward model, but it still depends on human preference judgments during data collection.

Watch Out

GRPO is not limited to math

GRPO works for any task with a verifiable scoring function. Code generation (run tests), factual QA (check against ground truth), and constrained generation (verify format compliance) are all amenable. The limitation is the need for automated verification, not the specific domain.
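A verifier can be as small as a regex, as in this toy format-compliance check (illustrative only; the `Answer: <number>` convention is an assumption, and real verifiers range from unit tests to exact-match graders):

```python
import re

def format_reward(response: str) -> float:
    """Toy GRPO-style verifier: 1.0 if the response ends with a final
    answer of the form 'Answer: <number>', else 0.0."""
    return 1.0 if re.search(r"Answer:\s*-?\d+(\.\d+)?\s*$", response) else 0.0

print(format_reward("The result is 12. Answer: 12"))  # 1.0
print(format_reward("I think it's around twelve."))   # 0.0
```

Any deterministic function from responses to scores can slot into the group-relative advantage computation in the same way.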

Watch Out

More RL does not always mean better alignment

RLHF with PPO can overfit to the reward model, producing outputs that score highly on $r_\phi$ but are not genuinely preferred by humans. This reward hacking is a real failure mode. Simpler methods (DPO, GRPO) sometimes achieve better alignment because they are less prone to exploiting a learned reward signal.
