LLM Construction
DPO vs GRPO vs RL for Reasoning
Head-to-head comparison of preference optimization methods: DPO's implicit reward, GRPO's group-relative comparisons, and RL with verifier feedback for reasoning. When each works and when each fails.
Why This Matters
After supervised fine-tuning, the next stage of post-training optimizes the model on preference data: which outputs are better than others. Three methods dominate in 2026, each with distinct tradeoffs:
- DPO (Direct Preference Optimization): no separate reward model, stable training, widely adopted
- GRPO (Group Relative Policy Optimization): group-level comparisons, used by DeepSeek for reasoning
- RL with verifiers: PPO/REINFORCE with ground-truth feedback from code execution or math provers
Choosing the wrong method wastes compute and produces worse models. Understanding the mathematical differences, not just the acronyms, is essential for making informed decisions about post-training pipelines.
Mental Model
Three ways to teach a model which outputs are "better":
- DPO: Show the model pairs of outputs (one preferred, one not) and directly adjust probabilities. No intermediary. Simple but limited to pairwise comparisons.
- GRPO: Show the model a group of outputs, rank them by reward, and adjust probabilities relative to the group average. More signal per batch than pairwise comparisons.
- RL with verifiers: Let the model generate solutions, check them with an objective verifier (run the code, check the math), and use the binary correctness signal as reward in a standard RL loop.
DPO: Direct Preference Optimization
DPO Implicit Reward Equivalence
Statement
DPO reparameterizes the reward model through the policy itself. The implicit reward under DPO is:

$$r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

The DPO loss directly optimizes on preference pairs $(x, y_w, y_l)$:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$

At convergence, this recovers the same optimal policy as RLHF with a Bradley-Terry reward model and KL penalty coefficient $\beta$.
Intuition
DPO says: instead of training a reward model and then doing RL, observe that the optimal policy defines a reward function. So skip the reward model and optimize the policy directly on preference data. The log-ratio $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ plays the role of the reward: increase the log-ratio for preferred outputs, decrease it for dispreferred ones.
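The loss above reduces to a simple supervised objective. A minimal sketch for a single preference pair, assuming the sequence log-probabilities under the policy and the frozen reference model have already been computed (the function name and example values are illustrative):

```python
import math

def dpo_loss_pair(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, given summed token log-probs."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    margin = reward_w - reward_l
    # -log sigmoid(margin): small when the preferred response wins by a wide margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At initialization the policy equals the reference, so the margin is zero
# and the loss is log 2 regardless of the raw log-probs.
loss_init = dpo_loss_pair(-5.0, -6.0, -5.0, -6.0)
# After the policy shifts toward the preferred response, the loss drops.
loss_later = dpo_loss_pair(-4.0, -8.0, -5.0, -6.0)
```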
Why It Matters
DPO eliminates the reward model training, the PPO loop, and the associated hyperparameter tuning. This makes it simpler to implement, more stable to train, and cheaper in compute. It became the default preference optimization method for many research groups in 2023-2024.
Failure Mode
DPO is an offline algorithm: it optimizes on a fixed dataset of preference pairs. It does not generate new outputs during training. This means it cannot explore. It only learns from the preferences in the training data. If the training data does not cover a failure mode, DPO cannot fix it. Additionally, the implicit reward can still be hacked: the model learns to increase the implicit-reward margin between preferred and dispreferred outputs, which can be achieved by driving the probability of the dispreferred output toward zero (rather than by genuinely improving the quality of the preferred one).
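This hack can be seen with a toy calculation (all probabilities hypothetical): the margin $\beta\left(\log\frac{\pi_\theta(y_w)}{\pi_{\text{ref}}(y_w)} - \log\frac{\pi_\theta(y_l)}{\pi_{\text{ref}}(y_l)}\right)$ can grow even while the policy lowers the probability of both outputs.

```python
import math

beta = 0.1
# Reference-model probabilities for the preferred / dispreferred outputs.
ref_w, ref_l = math.log(0.20), math.log(0.10)
# After training: BOTH probabilities dropped; the rejected one far more.
pol_w, pol_l = math.log(0.10), math.log(0.001)

# DPO margin: beta * (log-ratio of preferred minus log-ratio of dispreferred).
margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
```

Here the preferred output became less likely in absolute terms, yet the margin is positive, so the DPO objective registers an improvement.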
DPO strengths:
- Simple pipeline: one-stage supervised loss
- Stable training: no RL instabilities
- Well-understood theoretically (equivalence to KL-regularized RLHF)
DPO weaknesses:
- Offline: no exploration, limited by training data distribution
- Pairwise only: uses one comparison per training example
- Can degrade with noisy preferences or low-quality pairs
GRPO: Group Relative Policy Optimization
GRPO Objective
Statement
For a prompt $x$, GRPO generates a group of $G$ outputs $\{y_1, \dots, y_G\}$ from the current policy $\pi_{\theta_{\text{old}}}$. Each output receives a reward $r_i$. The group-relative advantage is:

$$\hat{A}_i = \frac{r_i - \text{mean}(r_1, \dots, r_G)}{\text{std}(r_1, \dots, r_G)}$$

The GRPO objective is:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\left( \rho_i \hat{A}_i,\ \text{clip}(\rho_i, 1 - \epsilon, 1 + \epsilon)\, \hat{A}_i \right) \right] - \beta\, \mathbb{D}_{\text{KL}}\!\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right)$$

where $\rho_i = \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)}$ is the importance ratio (as in PPO). The key innovation: advantages are computed relative to the group, not against an external baseline or value function.
Intuition
GRPO eliminates the need for a learned value function (critic). Instead of asking "how good is this output in absolute terms?" it asks "how good is this output compared to the other outputs the model just generated?" This group-relative comparison provides a natural baseline that adapts to the difficulty of each prompt.
For a hard prompt where all outputs are bad, the best-of-bad gets a positive advantage. For an easy prompt where all outputs are good, the worst-of-good gets a negative advantage. This automatic calibration stabilizes training.
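The automatic calibration described above can be sketched directly from the advantage formula (the function name is illustrative):

```python
def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r_i - mean) / std, as in GRPO."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hard prompt: one success among failures gets a positive advantage,
# even though its absolute reward would look unremarkable elsewhere.
hard = group_advantages([0.0, 0.0, 0.0, 1.0])
# Easy prompt: the lone failure among successes gets a negative advantage.
easy = group_advantages([1.0, 1.0, 1.0, 0.0])
```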
Why It Matters
GRPO was introduced by DeepSeek and used to train DeepSeek-R1 for mathematical and code reasoning. By eliminating the critic network, GRPO reduces memory requirements and removes a source of approximation error. The group-relative advantage provides richer signal than DPO's pairwise comparisons: from $G$ outputs you get $G$ advantage estimates, not just one pairwise comparison.
Failure Mode
GRPO requires generating $G$ outputs per prompt during training, which increases compute cost. If $G$ is too small, the advantage estimates are noisy. If the reward model is unreliable, the group-relative advantage amplifies reward model errors: the model learns to distinguish between "slightly more hackable" and "slightly less hackable" outputs rather than between genuinely good and bad ones.
GRPO strengths:
- Online: generates fresh outputs during training, enabling exploration
- No critic network: reduces memory and eliminates value function errors
- Group-relative advantages: automatic baseline calibration per prompt
- Richer signal: $G$ advantage estimates per prompt vs DPO's single pair
GRPO weaknesses:
- Higher compute cost per step (generating $G$ samples)
- Requires a reward model or verifier for scoring
- Advantage normalization can reduce gradient signal when group variance is low
RL with Verifier Feedback
Statement
For reasoning tasks with a binary verifier $v(x, y) \in \{0, 1\}$ (e.g., code passes all tests, math answer is correct), the RL objective is:

$$\mathcal{J}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\left[ v(x, y) \right]$$

The gradient (via REINFORCE with baseline $b(x)$) is:

$$\nabla_\theta \mathcal{J}(\theta) = \mathbb{E}\left[ \left( v(x, y) - b(x) \right) \nabla_\theta \log \pi_\theta(y \mid x) \right]$$

where $b(x)$ is typically the running average pass rate for prompt $x$.
Intuition
This is the purest form of RL for reasoning: generate a solution, check if it is correct, reinforce correct solutions and suppress incorrect ones. The verifier provides ground truth rather than a learned proxy, eliminating reward model errors. The binary signal is sparse but honest.
Why It Matters
Verifier-guided RL is how models learn to reason about math and code. Unlike preference-based methods (DPO, RLHF), the reward signal comes from objective verification, not human judgment. This eliminates reward hacking. You cannot "hack" a unit test suite or a mathematical proof checker. The limitation is that verifiers only exist for domains with checkable answers.
Failure Mode
Binary rewards are sparse: most solutions to hard problems are wrong, giving reward 0 for most samples. This makes learning slow and high-variance. Process reward models (PRMs) address this by providing intermediate rewards for correct reasoning steps, but PRMs reintroduce the proxy reward problem. The fundamental tension: dense rewards from learned models are hackable; sparse rewards from verifiers are honest but noisy.
RL with verifiers strengths:
- Ground-truth signal: no reward model to hack
- Effective for domains with checkable answers (math, code, formal logic)
- Can improve beyond human-level (verifier can check solutions humans cannot)
RL with verifiers weaknesses:
- Sparse binary reward: slow learning, high variance
- Only works for verifiable domains
- Requires infrastructure for code execution, proof checking, etc.
Early claims that DPO "solves" RLHF were premature. When DPO was published in 2023, it was presented as a simpler alternative that achieves the same results without RL instabilities. In practice, DPO's limitations became clear by 2024-2025. Its offline nature means it cannot improve beyond the quality of its training data. Reward hacking still occurs through the implicit reward. And for reasoning tasks where verifier-guided RL excels, DPO consistently underperforms. DPO remains useful for general preference alignment, but the claim that it makes RL unnecessary was wrong. The most capable models in 2026 all use RL at some stage.
"DPO is RLHF without RL." This is wrong and reveals a shallow understanding. DPO is RL. It optimizes the exact same KL-regularized objective as RLHF. The implicit reward is a reward function parameterized by the policy. DPO merely reparameterizes the optimization to avoid explicit reward model training and PPO. The RL objective is still there; it is absorbed into the supervised loss. Saying "DPO eliminates RL" is like saying "substituting variables eliminates the equation". the mathematical content is identical, only the computational procedure changed.
When to Use Each Method
| Criterion | DPO | GRPO | RL + Verifier |
|---|---|---|---|
| Training data | Offline preference pairs | Online generation | Online generation |
| Reward signal | Implicit (from preferences) | Explicit (RM or verifier) | Explicit (verifier) |
| Exploration | None | Yes (generates new outputs) | Yes (generates new outputs) |
| Compute per step | Low | Medium ($G$ samples) | Medium-High |
| Best domain | General preference alignment | Reasoning with RM | Math, code, formal verification |
| Reward hacking risk | Medium (implicit reward) | Medium (RM-dependent) | Low (ground-truth verifier) |
Common Confusions
DPO and RLHF target the same optimal policy
Under the Bradley-Terry model with KL regularization, the theoretical optimum is identical for DPO and RLHF. The differences are purely algorithmic: convergence speed, sensitivity to hyperparameters, exploration capability. In practice, these algorithmic differences matter enormously. RLHF with PPO can explore beyond the training data while DPO cannot.
GRPO is not just PPO without a critic
GRPO shares PPO's clipped objective but differs in how advantages are computed. PPO uses a learned value function as baseline; GRPO uses the group mean reward. This changes the gradient dynamics: GRPO's advantages are always centered at zero within each group, while PPO's advantages depend on the accuracy of the value function. When the value function is inaccurate (common for long reasoning tasks), GRPO's group-relative approach can be more stable.
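The difference in baselines can be seen numerically. A sketch with hypothetical rewards, assuming a biased learned value estimate for PPO:

```python
rewards = [0.0, 1.0, 1.0, 0.0]   # verifier scores for one prompt's samples

# PPO: baseline comes from a learned value function, which may be inaccurate.
v_estimate = 0.8                  # hypothetical overestimate of V(x)
ppo_advs = [r - v_estimate for r in rewards]

# GRPO: baseline is the group mean, so advantages always center at zero
# within each group, regardless of the prompt's absolute difficulty.
group_mean = sum(rewards) / len(rewards)
grpo_advs = [r - group_mean for r in rewards]
```

With the overestimated value, PPO's advantages are systematically shifted negative; GRPO's sum to zero by construction, so the bias of a learned critic never enters the gradient.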
Verifier-guided RL is not limited to final-answer checking
While the simplest setup uses a binary outcome reward, process reward models provide step-level feedback within the reasoning chain. The choice between outcome and process rewards is a bias-variance tradeoff: outcome rewards are unbiased but sparse; process rewards are dense but potentially biased (the PRM is a learned proxy).
Summary
- DPO: implicit reward via policy log-ratio, offline, no RM needed, no exploration
- GRPO: group-relative advantages, online, no critic, used for DeepSeek-R1
- RL + verifier: ground-truth reward from code execution or provers, sparse but honest
- DPO is RL. It optimizes the same objective as RLHF, just reparameterized
- Offline vs online is the biggest practical difference: DPO cannot explore
- For reasoning tasks with verifiers, RL consistently outperforms DPO
- For general preference alignment without verifiers, DPO is simpler and competitive
- GRPO occupies the middle ground: online exploration with simpler infrastructure than PPO
Exercises
Problem
Show that the DPO gradient for a single preference pair $(x, y_w, y_l)$ increases $\log \pi_\theta(y_w \mid x)$ and decreases $\log \pi_\theta(y_l \mid x)$. What determines the magnitude of the update?
Problem
GRPO computes advantages as $\hat{A}_i = (r_i - \mu) / \sigma$, where $\mu$ and $\sigma$ are the group mean and standard deviation. Show that when all outputs receive the same reward (e.g., all correct or all incorrect), the GRPO gradient is zero. Explain why this is both a feature and a limitation.
Problem
DPO is offline (fixed dataset), while GRPO and RL with verifiers are online (generate new outputs during training). Formalize the advantage of online methods: consider a policy that has learned to avoid one failure mode but now exhibits a new one not present in the original preference data. Why can GRPO/RL address this but DPO cannot?
References
Canonical:
- Rafailov et al., "Direct Preference Optimization" (2023). DPO
- Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (2024). GRPO
Current:
- DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025). GRPO at scale
- Xu et al., "Some Things Are More CRINGE Than Others" (2024). DPO failure modes
- Ahmadian et al., "Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback" (2024)
Next Topics
The natural next steps from preference optimization:
- Reward models and verifiers: the signals that drive these methods
- Post-training overview: how these methods fit into the full pipeline
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- RLHF and Alignment (Layer 4)
- Policy Gradient Theorem (Layer 3)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)