

DPO vs GRPO vs RL for Reasoning

Head-to-head comparison of preference optimization methods: DPO's implicit reward, GRPO's group-relative comparisons, and RL with verifier feedback for reasoning. When each works and when each fails.

Advanced · Tier 2 · Frontier · ~60 min

Why This Matters

After supervised fine-tuning, the next stage of post-training optimizes the model on preference data: which outputs are better than others. Three methods dominate in 2026, each with distinct tradeoffs:

  • DPO (Direct Preference Optimization): no separate reward model, stable training, widely adopted
  • GRPO (Group Relative Policy Optimization): group-level comparisons, used by DeepSeek for reasoning
  • RL with verifiers: PPO/REINFORCE with ground-truth feedback from code execution or math provers

Choosing the wrong method wastes compute and produces worse models. Understanding the mathematical differences, not just the acronyms, is essential for making informed decisions about post-training pipelines.

Mental Model

Three ways to teach a model which outputs are "better":

  1. DPO: Show the model pairs of outputs (one preferred, one not) and directly adjust probabilities. No intermediary. Simple but limited to pairwise comparisons.
  2. GRPO: Show the model a group of outputs, rank them by reward, and adjust probabilities relative to the group average. More signal per batch than pairwise comparisons.
  3. RL with verifiers: Let the model generate solutions, check them with an objective verifier (run the code, check the math), and use the binary correctness signal as reward in a standard RL loop.

DPO: Direct Preference Optimization

Theorem

DPO Implicit Reward Equivalence

Statement

DPO reparameterizes the reward model through the policy itself. The implicit reward under DPO is:

r_{\text{DPO}}(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)

The DPO loss directly optimizes on preference pairs (y_w, y_l):

\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]

At convergence, this recovers the same optimal policy as RLHF with a Bradley-Terry reward model and KL penalty β.

Intuition

DPO says: instead of training a reward model and then doing RL, observe that the optimal policy defines a reward function. So skip the reward model and optimize the policy directly on preference data. The log-ratio log(π_θ/π_ref) plays the role of the reward: increase the log-ratio for preferred outputs, decrease it for dispreferred ones.
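The log-ratio mechanics can be sketched in a few lines. Below is a minimal, illustrative implementation of the per-pair DPO loss from sequence log-probabilities; the function name and scalar interface are my own simplification (real implementations batch this over tensors):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss from sequence log-probs (illustrative sketch)."""
    r_w = beta * (logp_w - ref_logp_w)  # implicit reward of preferred output y_w
    r_l = beta * (logp_l - ref_logp_l)  # implicit reward of dispreferred output y_l
    margin = r_w - r_l
    return math.log(1.0 + math.exp(-margin))  # -log sigmoid(margin)
```

At initialization the policy equals the reference, so the margin is 0 and the loss is log 2 ≈ 0.693; the gradient then pushes the log-ratio up for y_w and down for y_l.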

Why It Matters

DPO eliminates the reward model training, the PPO loop, and the associated hyperparameter tuning. This makes it simpler to implement, more stable to train, and cheaper in compute. It became the default preference optimization method for many research groups in 2023-2024.

Failure Mode

DPO is an offline algorithm: it optimizes on a fixed dataset of preference pairs and does not generate new outputs during training. This means it cannot explore; it only learns from the preferences in the training data. If the training data does not cover a failure mode, DPO cannot fix it. Additionally, the implicit reward can still be hacked: the loss only requires the log-ratio margin log(π_θ/π_ref) between preferred and dispreferred outputs to grow, which the model can achieve by pushing down the probability of both outputs (the dispreferred one faster than the preferred one) rather than by genuinely improving quality.

DPO strengths:

  • Simple pipeline: one-stage supervised loss
  • Stable training: no RL instabilities
  • Well-understood theoretically (equivalence to KL-regularized RLHF)

DPO weaknesses:

  • Offline: no exploration, limited by training data distribution
  • Pairwise only: uses one comparison per training example
  • Can degrade with noisy preferences or low-quality pairs

GRPO: Group Relative Policy Optimization

Proposition

GRPO Objective

Statement

For a prompt x, GRPO generates a group of K outputs y_1, …, y_K from the current policy π_θ. Each output receives a reward r_i. The group-relative advantage is:

\hat{A}_i = \frac{r_i - \text{mean}(\{r_j\})}{\text{std}(\{r_j\})}

The GRPO objective is:

\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}_x \left[\frac{1}{K} \sum_{i=1}^{K} \min\left(\rho_i \hat{A}_i, \, \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon) \hat{A}_i\right)\right] + \beta \, \text{KL}(\pi_\theta \| \pi_{\text{ref}})

where ρ_i = π_θ(y_i|x) / π_old(y_i|x) is the importance ratio (as in PPO). The key innovation: advantages are computed relative to the group, not against an external baseline or value function.

Intuition

GRPO eliminates the need for a learned value function (critic). Instead of asking "how good is this output in absolute terms?" it asks "how good is this output compared to the other outputs the model just generated?" This group-relative comparison provides a natural baseline that adapts to the difficulty of each prompt.

For a hard prompt where all outputs are bad, the best-of-bad gets a positive advantage. For an easy prompt where all outputs are good, the worst-of-good gets a negative advantage. This automatic calibration stabilizes training.
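This calibration is easy to verify numerically. Here is a sketch of the group-relative advantage computation; the small epsilon guard against zero variance is a common implementation detail I have added, not part of the formula itself:

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of K samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std, matching the formula
    return [(r - mean) / (std + eps) for r in rewards]

# Hard prompt: the single correct sample out of four gets a positive advantage.
hard = group_advantages([0.0, 0.0, 0.0, 1.0])
# Easy prompt: the lone failure gets a negative advantage.
easy = group_advantages([1.0, 1.0, 1.0, 0.0])
```

In both cases the advantages are centered on the group, so the learning signal adapts to prompt difficulty without any learned critic.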

Why It Matters

GRPO was introduced by DeepSeek and used to train DeepSeek-R1 for mathematical and code reasoning. By eliminating the critic network, GRPO reduces memory requirements and removes a source of approximation error. The group-relative advantage provides richer signal than DPO's pairwise comparisons: from K outputs you get K advantage estimates, not just one pairwise comparison.

Failure Mode

GRPO requires generating K outputs per prompt during training, which increases compute cost. If K is too small, the advantage estimates are noisy. If the reward model is unreliable, the group-relative advantage amplifies reward model errors: the model learns to distinguish between "slightly more hackable" and "slightly less hackable" outputs rather than between genuinely good and bad ones.

GRPO strengths:

  • Online: generates fresh outputs during training, enabling exploration
  • No critic network: reduces memory and eliminates value function errors
  • Group-relative advantages: automatic baseline calibration per prompt
  • Richer signal: K comparisons per prompt vs DPO's single pair

GRPO weaknesses:

  • Higher compute cost per step (generating K samples)
  • Requires a reward model or verifier for scoring
  • Advantage normalization can reduce gradient signal when group variance is low

RL with Verifier Feedback

Proposition

RL with Verifier Feedback

Statement

For reasoning tasks with a binary verifier V(x, y) ∈ {0, 1} (e.g., code passes all tests, math answer is correct), the RL objective is:

\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}, \, y \sim \pi_\theta(\cdot|x)}\left[V(x, y)\right] - \beta \, \text{KL}(\pi_\theta \| \pi_{\text{ref}})

The gradient (via REINFORCE with baseline b(x)) is:

\nabla_\theta J = \mathbb{E}\left[(V(x,y) - b(x)) \nabla_\theta \log \pi_\theta(y|x)\right]

where b(x) is typically the running average pass rate for prompt x.

Intuition

This is the purest form of RL for reasoning: generate a solution, check if it is correct, reinforce correct solutions and suppress incorrect ones. The verifier provides ground truth rather than a learned proxy, eliminating reward model errors. The binary signal is sparse but honest.
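The scoring side of this loop can be sketched with a per-prompt running-average baseline. Class and function names below are hypothetical, and the actual gradient step would multiply each returned weight by ∇ log π_θ(y|x):

```python
class RunningBaseline:
    """Tracks the running average pass rate b(x) for each prompt."""
    def __init__(self):
        self.totals = {}
        self.counts = {}

    def update(self, prompt_id, reward):
        self.totals[prompt_id] = self.totals.get(prompt_id, 0.0) + reward
        self.counts[prompt_id] = self.counts.get(prompt_id, 0) + 1

    def value(self, prompt_id):
        n = self.counts.get(prompt_id, 0)
        return self.totals[prompt_id] / n if n else 0.0

def reinforce_weights(verifier_results, baseline):
    """Weight on grad log pi for each sampled solution: V(x, y) - b(x)."""
    return [v - baseline for v in verifier_results]
```

With a 50% pass rate as baseline, correct solutions get weight +0.5 and incorrect ones -0.5: correct solutions are reinforced, incorrect ones suppressed, exactly as the gradient formula prescribes.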

Why It Matters

Verifier-guided RL is how models learn to reason about math and code. Unlike preference-based methods (DPO, RLHF), the reward signal comes from objective verification, not human judgment. This largely eliminates reward hacking: a mathematical proof checker cannot be fooled, and a thorough unit test suite is far harder to game than a learned reward model (though incomplete test coverage can still be exploited). The limitation is that verifiers only exist for domains with checkable answers.

Failure Mode

Binary rewards are sparse: most solutions to hard problems are wrong, giving reward 0 for most samples. This makes learning slow and high-variance. Process reward models (PRMs) address this by providing intermediate rewards for correct reasoning steps, but PRMs reintroduce the proxy reward problem. The fundamental tension: dense rewards from learned models are hackable; sparse rewards from verifiers are honest but noisy.

RL with verifiers strengths:

  • Ground-truth signal: no reward model to hack
  • Effective for domains with checkable answers (math, code, formal logic)
  • Can improve beyond human-level (verifier can check solutions humans cannot)

RL with verifiers weaknesses:

  • Sparse binary reward: slow learning, high variance
  • Only works for verifiable domains
  • Requires infrastructure for code execution, proof checking, etc.

What Aged Badly

Early claims that DPO "solves" RLHF were premature. When DPO was published in 2023, it was presented as a simpler alternative that achieves the same results without RL instabilities. In practice, DPO's limitations became clear by 2024-2025. Its offline nature means it cannot improve beyond the quality of its training data. Reward hacking still occurs through the implicit reward. And for reasoning tasks where verifier-guided RL excels, DPO consistently underperforms. DPO remains useful for general preference alignment, but the claim that it makes RL unnecessary was wrong. The most capable models in 2026 all use RL at some stage.

Common Fake Understanding

"DPO is RLHF without RL." This is wrong and reveals a shallow understanding. DPO is RL. It optimizes the exact same KL-regularized objective as RLHF. The implicit reward r(x,y)=βlog(πθ(yx)/πref(yx))r(x,y) = \beta \log(\pi_\theta(y|x) / \pi_{\text{ref}}(y|x)) is a reward function parameterized by the policy. DPO merely reparameterizes the optimization to avoid explicit reward model training and PPO. The RL objective is still there; it is absorbed into the supervised loss. Saying "DPO eliminates RL" is like saying "substituting variables eliminates the equation". the mathematical content is identical, only the computational procedure changed.

When to Use Each Method

| Criterion | DPO | GRPO | RL + Verifier |
| --- | --- | --- | --- |
| Training data | Offline preference pairs | Online generation | Online generation |
| Reward signal | Implicit (from preferences) | Explicit (RM or verifier) | Explicit (verifier) |
| Exploration | None | Yes (generates new outputs) | Yes (generates new outputs) |
| Compute per step | Low | Medium (K samples) | Medium-High |
| Best domain | General preference alignment | Reasoning with RM | Math, code, formal verification |
| Reward hacking risk | Medium (implicit reward) | Medium (RM-dependent) | Low (ground-truth verifier) |

Common Confusions

Watch Out

DPO and RLHF target the same optimal policy

Under the Bradley-Terry model with KL regularization, the theoretical optimum is identical for DPO and RLHF. The differences are purely algorithmic: convergence speed, sensitivity to hyperparameters, exploration capability. In practice, these algorithmic differences matter enormously. RLHF with PPO can explore beyond the training data while DPO cannot.

Watch Out

GRPO is not just PPO without a critic

GRPO shares PPO's clipped objective but differs in how advantages are computed. PPO uses a learned value function as baseline; GRPO uses the group mean reward. This changes the gradient dynamics: GRPO's advantages are always centered at zero within each group, while PPO's advantages depend on the accuracy of the value function. When the value function is inaccurate (common for long reasoning tasks), GRPO's group-relative approach can be more stable.
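The difference in baselines is small in code but consequential for gradient dynamics. A sketch, unnormalized to isolate the centering effect (function names are illustrative):

```python
def ppo_advantages(rewards, value_estimate):
    """PPO-style: advantages depend on a learned value estimate V(x)."""
    return [r - value_estimate for r in rewards]

def grpo_raw_advantages(rewards):
    """GRPO-style: the group mean is the baseline, so advantages sum to zero."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

If the critic overestimates, say V(x) = 0.9 when only one of four samples passes, PPO penalizes nearly every sample and barely rewards the correct one; the group-relative version stays centered regardless of how miscalibrated any external estimate would be.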

Watch Out

Verifier-guided RL is not limited to final-answer checking

While the simplest setup uses a binary outcome reward, process reward models provide step-level feedback within the reasoning chain. The choice between outcome and process rewards is a bias-variance tradeoff: outcome rewards are unbiased but sparse; process rewards are dense but potentially biased (the PRM is a learned proxy).

Summary

  • DPO: implicit reward via policy log-ratio, offline, no RM needed, no exploration
  • GRPO: group-relative advantages, online, no critic, used for DeepSeek-R1
  • RL + verifier: ground-truth reward from code execution or provers, sparse but honest
  • DPO is RL. It optimizes the same objective as RLHF, just reparameterized
  • Offline vs online is the biggest practical difference: DPO cannot explore
  • For reasoning tasks with verifiers, RL consistently outperforms DPO
  • For general preference alignment without verifiers, DPO is simpler and competitive
  • GRPO occupies the middle ground: online exploration with simpler infrastructure than PPO

Exercises

ExerciseCore

Problem

Show that the DPO gradient for a single preference pair (x, y_w, y_l) increases log π_θ(y_w|x) and decreases log π_θ(y_l|x). What determines the magnitude of the update?

ExerciseAdvanced

Problem

GRPO computes advantages as \hat{A}_i = (r_i - \bar{r}) / s, where \bar{r} and s are the group mean and standard deviation. Show that when all K outputs receive the same reward (e.g., all correct or all incorrect), the GRPO gradient is zero. Explain why this is both a feature and a limitation.

ExerciseResearch

Problem

DPO is offline (fixed dataset), while GRPO and RL with verifiers are online (generate new outputs during training). Formalize the advantage of online methods: consider a policy that has learned to avoid one failure mode but now exhibits a new one not present in the original preference data. Why can GRPO/RL address this but DPO cannot?


References

Canonical:

  • Rafailov et al., "Direct Preference Optimization" (2023). DPO
  • Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (2024). GRPO

Current:

  • DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025). GRPO at scale
  • Xu et al., "Some Things Are More CRINGE Than Others" (2024). DPO failure modes
  • Ahmadian et al., "Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback" (2024)


Last reviewed: April 2026
