Paper breakdown

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov et al. · 2023 · NeurIPS 2023

Derives the closed form of the KL-regularised RLHF policy in terms of the reward function, then inverts the relation: the policy is the reward model, up to a partition function that cancels in pairwise comparisons. Reduces preference fine-tuning to a single supervised log-likelihood objective with no reward model, no rollouts, and no PPO.

Overview

Rafailov et al. (2023) showed that the standard RLHF pipeline — reward modelling followed by KL-regularised PPO — has a closed-form solution that lets you skip the RL step entirely. The KL-regularised reward-maximising policy admits an analytic form in terms of the reward and the reference policy. Inverting that form expresses the reward as a log-ratio between the optimal policy and the reference. Substituting this expression into the Bradley-Terry preference likelihood gives an objective that depends only on policy probabilities — no separate reward model, no rollouts, no PPO.

The result is that preference fine-tuning becomes a supervised problem on pairwise comparison data $(x, y_w, y_l)$. The loss is a single binary cross-entropy on the policy log-ratios, and training reduces to standard cross-entropy fine-tuning with a slightly more complicated per-example computation. The trained model itself plays the role of the reward, which is the meaning of "your language model is secretly a reward model".

Empirically, DPO matches or beats PPO-based RLHF on the 2023-era benchmarks the paper considers (sentiment-controlled generation, summarisation, single-turn dialogue) at a fraction of the engineering cost. The recipe is now standard in the open-weight ecosystem (Llama-3-Instruct, Zephyr, Mistral-Instruct, Qwen-2.5-Instruct, DeepSeek-Chat), where the cost and instability of running PPO at scale were a real obstacle. PPO and reward-model-based RL have not gone away — they are still needed for online learning, multi-turn objectives, and verifiable-reward RL — but for one-shot preference fine-tuning DPO is the default.

Mathematical Contributions

The KL-regularised reward objective

InstructGPT-style RLHF maximises

$$\max_\pi \;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big] \;-\; \beta\, D_{\mathrm{KL}}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)$$

over policies $\pi$, with a fixed reward $r$ and a reference policy $\pi_{\mathrm{ref}}$ (the SFT model). The KL anchor is what stops the policy from drifting onto adversarial tokens that score well on $r$ but are off the reward model's training distribution.

The closed-form optimal policy

The optimisation has a known analytic solution (the standard derivation from the calculus of variations, also recoverable from the soft-Bellman equation):

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\Big(\frac{1}{\beta}\, r(x, y)\Big), \qquad Z(x) = \sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\, \exp\Big(\frac{1}{\beta}\, r(x, y')\Big).$$

This is the Boltzmann tilt of the reference policy by the exponentiated reward. The intractable piece is $Z(x)$ — a sum over all completions — which is why the original RLHF pipeline needs an iterative method (PPO) to optimise the reward expectation rather than computing $\pi^*$ in closed form.
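
A minimal sketch of the closed-form solution on a toy problem where $Z(x)$ can be computed exactly: a single prompt with three enumerable completions. The rewards, reference probabilities, and the value of $\beta$ below are made-up illustrative numbers, not from the paper.

```python
import numpy as np

beta = 0.5
pi_ref = np.array([0.5, 0.3, 0.2])   # reference policy over 3 toy completions
r = np.array([1.0, 2.0, 0.5])        # reward assigned to each completion

tilt = pi_ref * np.exp(r / beta)     # unnormalised Boltzmann tilt of pi_ref
Z = tilt.sum()                       # partition function Z(x), tractable only because |y| = 3
pi_star = tilt / Z                   # closed-form KL-regularised optimal policy

print(pi_star)                       # mass shifts toward the high-reward completion
print(pi_star.sum())                 # 1.0
```

As $\beta \to 0$ the tilted policy collapses onto the argmax-reward completion; as $\beta \to \infty$ it stays at $\pi_{\mathrm{ref}}$.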

Inverting the relation

Take logs of the closed-form solution and rearrange:

$$r(x, y) = \beta\, \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta\, \log Z(x).$$

The reward is recovered, up to the prompt-dependent log-partition $\beta \log Z(x)$, from the log-ratio between the optimal policy and the reference. This is the paper's "reward-policy duality": any KL-regularised optimal policy implicitly defines a reward, and any reward together with $\pi_{\mathrm{ref}}$ implicitly defines an optimal policy.

The Bradley-Terry trick

The Bradley-Terry preference likelihood for a comparison $(x, y_w, y_l)$ depends only on the difference in rewards:

$$\Pr(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big).$$

When you substitute the inverted relation, the prompt-dependent $\beta \log Z(x)$ terms cancel:

$$r(x, y_w) - r(x, y_l) = \beta\, \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta\, \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}.$$

The intractable partition function is gone, because it does not depend on the completion.
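
A quick numerical check of the cancellation, continuing the toy setup above (same illustrative numbers): the implicit reward $\beta \log(\pi^*/\pi_{\mathrm{ref}})$ differs from the true reward only by $\beta \log Z(x)$, so pairwise differences agree exactly.

```python
# Implicit reward recovered from the toy optimal policy; missing the beta*log Z(x) offset.
r_hat = beta * np.log(pi_star / pi_ref)

w, l = 1, 0                        # indices of the preferred / dispreferred completions
print(r[w] - r[l])                 # true reward gap: 1.0
print(r_hat[w] - r_hat[l])         # same value: the Z(x) offset has cancelled
```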

The DPO loss

Plug this into the Bradley-Terry log-likelihood and parameterise $\pi^*$ as $\pi_\theta$ (the policy you are training). The DPO objective is (Equation 7 in the paper):

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\left( \beta\, \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta\, \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right].$$

This is a pure supervised loss. Each example needs four log-probabilities: $\pi_\theta(y_w)$, $\pi_\theta(y_l)$, $\pi_{\mathrm{ref}}(y_w)$, $\pi_{\mathrm{ref}}(y_l)$. The reference log-probs are computed once and cached; the policy log-probs are differentiated through. There are no rollouts, no value function, and no separate reward network.
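
A minimal PyTorch sketch of the loss, assuming the four per-sequence log-probabilities (summed over completion tokens) have already been computed; the function and tensor names are illustrative, not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Binary cross-entropy on the policy/reference log-ratio margins (Eq. 7)."""
    # Implicit rewards beta * log(pi_theta / pi_ref); the log Z(x) term has cancelled.
    chosen_margin = beta * (policy_logp_w - ref_logp_w)
    rejected_margin = beta * (policy_logp_l - ref_logp_l)
    # -log sigma(chosen - rejected), averaged over the batch.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Dummy sequence log-probs for a batch of two comparisons.
policy_w = torch.tensor([-12.3, -20.1], requires_grad=True)
policy_l = torch.tensor([-14.0, -19.5], requires_grad=True)
ref_w = torch.tensor([-13.0, -20.0])   # cached reference log-probs, no gradient
ref_l = torch.tensor([-13.5, -19.8])

loss = dpo_loss(policy_w, policy_l, ref_w, ref_l)
loss.backward()                        # gradients flow only through the policy log-probs
```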

Gradient

The gradient (Equation 8) is

$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l)}\Big[ \sigma\big(\hat r(x, y_l) - \hat r(x, y_w)\big)\, \big( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big) \Big]$$

where $\hat r(x, y) = \beta \log \big(\pi_\theta(y \mid x) / \pi_{\mathrm{ref}}(y \mid x)\big)$ is the implicit reward. The sigmoid factor is a confidence weighting: examples on which the implicit reward already prefers $y_w$ contribute small gradients; examples on which it does not contribute large ones. This is the same self-paced behaviour you get in margin-based losses.
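
A small illustration of that weighting, reusing the dummy tensors from the sketch above: $\sigma(\hat r(x, y_l) - \hat r(x, y_w))$ is close to 0 when the implicit reward already prefers $y_w$ by a wide margin and close to 1 when it gets the pair badly wrong, so hard pairs dominate the gradient.

```python
beta = 0.1
with torch.no_grad():
    r_hat_w = beta * (policy_w - ref_w)        # implicit reward of the chosen completion
    r_hat_l = beta * (policy_l - ref_l)        # implicit reward of the rejected completion
    print(torch.sigmoid(r_hat_l - r_hat_w))    # per-example gradient weights from Eq. 8
```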

DPO vs PPO at a glance

| Property | PPO-RLHF | DPO |
| --- | --- | --- |
| Reward model | Separate network, trained first | Implicit; the policy itself |
| Rollouts at training time | Yes (on-policy samples) | No (offline pairwise data) |
| Optimiser | Clipped policy-gradient | Standard SGD/AdamW on a log-likelihood |
| KL anchor | Per-token penalty | Built into the closed-form loss |
| Off-policy data | Hard | Native |
| Online preference collection | Native | Requires a regenerate-and-relabel loop |
| Engineering complexity | High (value head, generation, GAE, PPO clipping) | Low (one extra forward pass on $\pi_{\mathrm{ref}}$) |

What $\beta$ controls

$\beta$ in DPO plays the role of the KL coefficient in PPO: large $\beta$ keeps the policy near $\pi_{\mathrm{ref}}$, small $\beta$ lets it move further. Empirically, $\beta \in [0.1, 0.5]$ is the typical range used by Llama-3, Mistral, and other open-weight post-training recipes. If $\beta$ is too small, the loss can push the policy's log-ratios against $\pi_{\mathrm{ref}}$ to extreme values and the policy degenerates; if too large, the policy barely moves.

What It Gets Right

The first thing is that the derivation is honest and short. Two pages of paper get you from the KL-regularised RLHF objective to a supervised loss with no approximations along the way. The match between DPO and PPO at convergence on the same preference data is not coincidental — under the Bradley-Terry assumption and a sufficiently expressive policy class, both are optimising the same population objective, and DPO simply removes the variance introduced by sampling.

The second is the deployability story. PPO at GPT-3 scale needs a value network, a reference network, a frozen reward network, and on-policy generation; DPO needs only the policy and the frozen reference. The memory and engineering footprint dropped by enough that small-budget research groups and open-weight model labs could run preference fine-tuning end-to-end without building an RL stack.

The third is the empirical match on the benchmarks the paper considers. On IMDb sentiment-controlled generation, the TL;DR summarisation preference task, and Anthropic's helpful-harmless dialogue, DPO is at least as good as PPO with a tuned reward model. The paper does not overclaim — it is up-front that this is the regime in which they have evidence, and that the more demanding online and multi-turn settings are not addressed.

Common Misconceptions

DPO is not strictly equivalent to PPO at finite samples. Both are estimators of the same KL-regularised reward maximiser under Bradley-Terry preferences, but DPO's policy class is constrained to whatever the model can represent, and the implicit reward is just the log-ratio. When the data is sparse or the policy class is restricted, the two procedures find different local solutions. Subsequent work (Azar et al. 2023, IPO; Park et al. 2024, length-controlled DPO; Hong et al. 2024, ORPO) catalogues the gaps.

DPO does not learn a calibrated reward model. The implicit reward βlog(πθ/πref)\beta \log(\pi_\theta / \pi_{\mathrm{ref}}) ranks completions correctly for the prompts in the training distribution, but it is not a usable reward function on new prompts in the way an explicit reward model is. If you need a portable reward — for filtering, for ranking distillation outputs, or for online RL — you need to fit one separately.

DPO does not solve preference data quality problems. Garbage-in-garbage-out applies as much to DPO as to PPO. The Bradley-Terry assumption (that pairwise preferences come from independent comparisons of latent rewards) breaks under noisy labelers, length bias, or systematic disagreement, and DPO inherits whatever artifacts the reward model would have inherited. The cleaner pipeline does not buy cleaner data.

Further Reading

  • Azar, M. G., Rowland, M., Piot, B., Guo, D., Calandriello, D., Valko, M., & Munos, R. (2023). "A General Theoretical Paradigm to Understand Learning from Human Preferences." arXiv:2310.12036. IPO. Identifies an overfitting failure mode in DPO when preferences are deterministic, and proposes a regularised alternative.
  • Hong, J., Lee, N., & Thorne, J. (2024). "ORPO: Monolithic Preference Optimization without Reference Model." EMNLP. arXiv:2403.07691. Drops the reference model entirely by combining SFT and an odds-ratio preference loss in a single stage.
  • Meng, Y., Xia, M., & Chen, D. (2024). "SimPO: Simple Preference Optimization with a Reference-Free Reward." NeurIPS. arXiv:2405.14734. Uses the average log-probability as the implicit reward, which removes the need for the reference model at training time and improves length calibration.
  • Shao, Z. et al. (2024). "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv:2402.03300. Introduces GRPO, a group-relative on-policy alternative used for verifiable-reward RL where DPO does not apply.

References

Canonical:

  • Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS. arXiv:2305.18290.

Direct precursors:

  • Ouyang, L. et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS. arXiv:2203.02155. The PPO-based RLHF pipeline DPO replaces.
  • Ziegler, D. et al. (2019). "Fine-Tuning Language Models from Human Preferences." arXiv:1909.08593. The KL-regularised RLHF objective whose closed form DPO inverts.
  • Bradley, R. A., & Terry, M. E. (1952). "Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons." Biometrika 39(3/4). The pairwise-comparison likelihood the derivation rests on.

Critical refinements:

  • Azar, M. G. et al. (2023). "A General Theoretical Paradigm to Understand Learning from Human Preferences." arXiv:2310.12036. IPO.
  • Park, R., Rafailov, R., Ermon, S., & Finn, C. (2024). "Disentangling Length from Quality in Direct Preference Optimization." arXiv:2403.19159. DPO is biased toward longer responses; this paper explains and corrects it.
  • Hong, J., Lee, N., & Thorne, J. (2024). "ORPO: Monolithic Preference Optimization without Reference Model." EMNLP. arXiv:2403.07691.
  • Meng, Y., Xia, M., & Chen, D. (2024). "SimPO: Simple Preference Optimization with a Reference-Free Reward." NeurIPS. arXiv:2405.14734.

Open-weight applications:

  • Tunstall, L. et al. (2023). "Zephyr: Direct Distillation of LM Alignment." arXiv:2310.16944. One of the first open-weight chat models trained end-to-end with DPO.
  • Dubey, A. et al. (2024). "The Llama 3 Herd of Models." arXiv:2407.21783. Llama-3-Instruct uses DPO as one of its post-training stages.

Last reviewed: May 7, 2026