Paper breakdown

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov et al. · 2023 · NeurIPS 2023

Derives the closed form of the KL-regularised RLHF policy in terms of the reward function, then inverts the relation: the policy is the reward model, up to a partition function that cancels in pairwise comparisons. Reduces preference fine-tuning to a single supervised log-likelihood objective with no reward model, no rollouts, and no PPO.

Overview

Rafailov et al. (2023) showed that the standard RLHF pipeline — reward modelling followed by KL-regularised PPO — has a closed-form solution that lets you skip the RL step entirely. The KL-regularised reward-maximising policy admits an analytic form in terms of the reward and the reference policy. Inverting that form expresses the reward as a log-ratio between the optimal policy and the reference. Substituting this expression into the Bradley-Terry preference likelihood gives an objective that depends only on policy probabilities — no separate reward model, no rollouts, no PPO.

The result is that preference fine-tuning becomes a supervised problem on pairwise comparison data $(x, y_w, y_l)$. The loss is a single binary cross-entropy on the policy log-ratios, and training reduces to standard cross-entropy fine-tuning with a slightly more complicated per-example computation. The trained model itself plays the role of the reward, which is the meaning of "your language model is secretly a reward model".

Empirically, DPO matches or beats PPO-based RLHF on the 2023-era benchmarks the paper considers (sentiment-controlled generation, summarisation, single-turn dialogue) at a fraction of the engineering cost. The recipe is now standard in the open-weight ecosystem (Llama-3-Instruct, Zephyr, Mistral-Instruct, Qwen-2.5-Instruct, DeepSeek-Chat), where the cost and instability of running PPO at scale were a real obstacle. PPO and reward-model-based RL have not gone away — they are still needed for online learning, multi-turn objectives, and verifiable-reward RL — but for one-shot preference fine-tuning DPO is the default.

Mathematical Contributions

The KL-regularised reward objective

InstructGPT-style RLHF maximises

$$\max_\pi \;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big] \;-\; \beta\, D_{\mathrm{KL}}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)$$

over policies $\pi$, with a fixed reward $r$ and a reference policy $\pi_{\mathrm{ref}}$ (the SFT model). The KL anchor is what stops the policy from drifting onto adversarial tokens that score well on $r$ but are off the reward model's training distribution.

The closed-form optimal policy

The optimisation has a known analytic solution (the standard derivation from the calculus of variations, also recoverable from the soft-Bellman equation):

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\Big(\frac{1}{\beta}\, r(x, y)\Big), \qquad Z(x) = \sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\, \exp\Big(\frac{1}{\beta}\, r(x, y')\Big).$$

This is the Boltzmann tilt of the reference policy by the exponentiated reward. The intractable piece is $Z(x)$ — a sum over all completions — which is why the original RLHF pipeline needs an iterative method (PPO) to optimise the reward expectation rather than computing $\pi^*$ in closed form.
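
A minimal sketch of the closed-form solution on a toy problem where $Z(x)$ can be computed exactly: a single prompt with three enumerable completions. The rewards, reference probabilities, and the value of $\beta$ below are made-up illustrative numbers, not from the paper.

```python
import numpy as np

beta = 0.5
pi_ref = np.array([0.5, 0.3, 0.2])   # reference policy over 3 toy completions
r = np.array([1.0, 2.0, 0.5])        # reward assigned to each completion

tilt = pi_ref * np.exp(r / beta)     # unnormalised Boltzmann tilt of pi_ref
Z = tilt.sum()                       # partition function Z(x), tractable only because |y| = 3
pi_star = tilt / Z                   # closed-form KL-regularised optimal policy

print(pi_star)                       # mass shifts toward the high-reward completion
print(pi_star.sum())                 # 1.0
```

As $\beta \to 0$ the tilted policy collapses onto the argmax-reward completion; as $\beta \to \infty$ it stays at $\pi_{\mathrm{ref}}$.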

Inverting the relation

Take logs of the closed-form solution and rearrange:

$$r(x, y) = \beta\, \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta\, \log Z(x).$$

The reward is recovered, up to the prompt-dependent log-partition $\beta \log Z(x)$, from the log-ratio between the optimal policy and the reference. This is the paper's "reward-policy duality": any KL-regularised optimal policy implicitly defines a reward, and any reward together with $\pi_{\mathrm{ref}}$ implicitly defines an optimal policy.

The Bradley-Terry trick

The Bradley-Terry preference likelihood for a comparison $(x, y_w, y_l)$ depends only on the difference in rewards:

$$\Pr(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big).$$

When you substitute the inverted relation, the prompt-dependent $\beta \log Z(x)$ terms cancel:

$$r(x, y_w) - r(x, y_l) = \beta\, \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta\, \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}.$$

The intractable partition function is gone, because it does not depend on the completion.
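
A quick numerical check of the cancellation, continuing the toy setup above (same illustrative numbers): the implicit reward $\beta \log(\pi^*/\pi_{\mathrm{ref}})$ differs from the true reward only by $\beta \log Z(x)$, so pairwise differences agree exactly.

```python
# Implicit reward recovered from the toy optimal policy; missing the beta*log Z(x) offset.
r_hat = beta * np.log(pi_star / pi_ref)

w, l = 1, 0                        # indices of the preferred / dispreferred completions
print(r[w] - r[l])                 # true reward gap: 1.0
print(r_hat[w] - r_hat[l])         # same value: the Z(x) offset has cancelled
```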

The DPO loss

Plug this into the Bradley-Terry log-likelihood and parameterise $\pi^*$ as $\pi_\theta$ (the policy you are training). The DPO objective is (Equation 7 in the paper):

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\left( \beta\, \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta\, \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right].$$

This is a pure supervised loss. Each example needs four log-probabilities: $\pi_\theta(y_w)$, $\pi_\theta(y_l)$, $\pi_{\mathrm{ref}}(y_w)$, $\pi_{\mathrm{ref}}(y_l)$. The reference log-probs are computed once and cached; the policy log-probs are differentiated through. There are no rollouts, no value function, and no separate reward network.
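
A minimal PyTorch sketch of the loss, assuming the four per-sequence log-probabilities (summed over completion tokens) have already been computed; the function and tensor names are illustrative, not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Binary cross-entropy on the policy/reference log-ratio margins (Eq. 7)."""
    # Implicit rewards beta * log(pi_theta / pi_ref); the log Z(x) term has cancelled.
    chosen_margin = beta * (policy_logp_w - ref_logp_w)
    rejected_margin = beta * (policy_logp_l - ref_logp_l)
    # -log sigma(chosen - rejected), averaged over the batch.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Dummy sequence log-probs for a batch of two comparisons.
policy_w = torch.tensor([-12.3, -20.1], requires_grad=True)
policy_l = torch.tensor([-14.0, -19.5], requires_grad=True)
ref_w = torch.tensor([-13.0, -20.0])   # cached reference log-probs, no gradient
ref_l = torch.tensor([-13.5, -19.8])

loss = dpo_loss(policy_w, policy_l, ref_w, ref_l)
loss.backward()                        # gradients flow only through the policy log-probs
```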

Gradient

The gradient (Equation 8) is

$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l)}\Big[ \sigma\big(\hat r(x, y_l) - \hat r(x, y_w)\big)\, \big( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big) \Big]$$

where $\hat r(x, y) = \beta \log \big(\pi_\theta(y \mid x) / \pi_{\mathrm{ref}}(y \mid x)\big)$ is the implicit reward. The sigmoid factor is a confidence weighting: examples on which the implicit reward already prefers $y_w$ contribute small gradients; examples on which it does not contribute large ones. This is the same self-paced behaviour you get in margin-based losses.
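
A small illustration of that weighting, reusing the dummy tensors from the sketch above: $\sigma(\hat r(x, y_l) - \hat r(x, y_w))$ is close to 0 when the implicit reward already prefers $y_w$ by a wide margin and close to 1 when it gets the pair badly wrong, so hard pairs dominate the gradient.

```python
beta = 0.1
with torch.no_grad():
    r_hat_w = beta * (policy_w - ref_w)        # implicit reward of the chosen completion
    r_hat_l = beta * (policy_l - ref_l)        # implicit reward of the rejected completion
    print(torch.sigmoid(r_hat_l - r_hat_w))    # per-example gradient weights from Eq. 8
```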

DPO vs PPO at a glance

| Property | PPO-RLHF | DPO |
| --- | --- | --- |
| Reward model | Separate network, trained first | Implicit; the policy itself |
| Rollouts at training time | Yes (on-policy samples) | No (offline pairwise data) |
| Optimiser | Clipped policy-gradient | Standard SGD/AdamW on a log-likelihood |
| KL anchor | Per-token penalty | Built into the closed-form loss |
| Off-policy data | Hard | Native |
| Online preference collection | Native | Requires a regenerate-and-relabel loop |
| Engineering complexity | High (value head, generation, GAE, PPO clipping) | Low (one extra forward pass on $\pi_{\mathrm{ref}}$) |

What $\beta$ controls

$\beta$ in DPO plays the role of the KL coefficient in PPO: large $\beta$ keeps the policy near $\pi_{\mathrm{ref}}$, small $\beta$ lets it move further. Empirically, $\beta \in [0.1, 0.5]$ is the typical range used by Llama-3, Mistral, and other open-weight post-training recipes. If $\beta$ is too small, the loss can push the policy's log-ratios against $\pi_{\mathrm{ref}}$ to extreme values and the policy degenerates; if too large, the policy barely moves.

What It Gets Right

The first thing is that the derivation is honest and short. Two pages of paper get you from the KL-regularised RLHF objective to a supervised loss with no approximations along the way. The match between DPO and PPO at convergence on the same preference data is not coincidental — under the Bradley-Terry assumption and a sufficiently expressive policy class, both are optimising the same population objective, and DPO simply removes the variance introduced by sampling.

The second is the deployability story. PPO at GPT-3 scale needs a value network, a reference network, a frozen reward network, and on-policy generation; DPO needs only the policy and the frozen reference. The memory and engineering footprint dropped by enough that small-budget research groups and open-weight model labs could run preference fine-tuning end-to-end without building an RL stack.

The third is the empirical match on the benchmarks the paper considers. On IMDb sentiment-controlled generation, the TL;DR summarisation preference task, and Anthropic's helpful-harmless dialogue, DPO is at least as good as PPO with a tuned reward model. The paper does not overclaim — it is up-front that this is the regime in which they have evidence, and that the more demanding online and multi-turn settings are not addressed.

Common Misconceptions

DPO is not strictly equivalent to PPO at finite samples. Both are estimators of the same KL-regularised reward maximiser under Bradley-Terry preferences, but DPO's policy class is constrained to whatever the model can represent, and the implicit reward is just the log-ratio. When the data is sparse or the policy class is restricted, the two procedures find different local solutions. Subsequent work (Azar et al. 2023, IPO; Park et al. 2024, length-controlled DPO; Hong et al. 2024, ORPO) catalogues the gaps.

DPO does not learn a calibrated reward model. The implicit reward βlog(πθ/πref)\beta \log(\pi_\theta / \pi_{\mathrm{ref}}) ranks completions correctly for the prompts in the training distribution, but it is not a usable reward function on new prompts in the way an explicit reward model is. If you need a portable reward — for filtering, for ranking distillation outputs, or for online RL — you need to fit one separately.

DPO does not solve preference data quality problems. Garbage-in-garbage-out applies as much to DPO as to PPO. The Bradley-Terry assumption (that pairwise preferences come from independent comparisons of latent rewards) breaks under noisy labelers, length bias, or systematic disagreement, and DPO inherits whatever artifacts the reward model would have inherited. The cleaner pipeline does not buy cleaner data.

Further Reading

  • Azar, M. G., Rowland, M., Piot, B., Guo, D., Calandriello, D., Valko, M., & Munos, R. (2023). "A General Theoretical Paradigm to Understand Learning from Human Preferences." arXiv:2310.12036. IPO. Identifies an overfitting failure mode in DPO when preferences are deterministic, and proposes a regularised alternative.
  • Hong, J., Lee, N., & Thorne, J. (2024). "ORPO: Monolithic Preference Optimization without Reference Model." EMNLP. arXiv:2403.07691. Drops the reference model entirely by combining SFT and an odds-ratio preference loss in a single stage.
  • Meng, Y., Xia, M., & Chen, D. (2024). "SimPO: Simple Preference Optimization with a Reference-Free Reward." NeurIPS. arXiv:2405.14734. Uses the average log-probability as the implicit reward, which removes the need for the reference model at training time and improves length calibration.
  • Shao, Z. et al. (2024). "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv:2402.03300. Introduces GRPO, a group-relative on-policy alternative used for verifiable-reward RL where DPO does not apply.

References

Canonical:

  • Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS. arXiv:2305.18290.

Direct precursors:

  • Ouyang, L. et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS. arXiv:2203.02155. The PPO-based RLHF pipeline DPO replaces.
  • Ziegler, D. et al. (2019). "Fine-Tuning Language Models from Human Preferences." arXiv:1909.08593. The KL-regularised RLHF objective whose closed form DPO inverts.
  • Bradley, R. A., & Terry, M. E. (1952). "Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons." Biometrika 39(3/4). The pairwise-comparison likelihood the derivation rests on.

Critical refinements:

  • Azar, M. G. et al. (2023). "A General Theoretical Paradigm to Understand Learning from Human Preferences." arXiv:2310.12036. IPO.
  • Park, R., Rafailov, R., Ermon, S., & Finn, C. (2024). "Disentangling Length from Quality in Direct Preference Optimization." arXiv:2403.19159. DPO is biased toward longer responses; this paper explains and corrects it.
  • Hong, J., Lee, N., & Thorne, J. (2024). "ORPO: Monolithic Preference Optimization without Reference Model." EMNLP. arXiv:2403.07691.
  • Meng, Y., Xia, M., & Chen, D. (2024). "SimPO: Simple Preference Optimization with a Reference-Free Reward." NeurIPS. arXiv:2405.14734.

Open-weight applications:

  • Tunstall, L. et al. (2023). "Zephyr: Direct Distillation of LM Alignment." arXiv:2310.16944. One of the first open-weight chat models trained end-to-end with DPO.
  • Dubey, A. et al. (2024). "The Llama 3 Herd of Models." arXiv:2407.21783. Llama-3-Instruct uses DPO as one of its post-training stages.

Last reviewed: May 7, 2026