What Each Does
Both SFT and DPO are methods for aligning a pretrained language model to desired behavior. They differ in what data they consume and what objective they optimize.
Supervised Fine-Tuning (SFT) maximizes the log-likelihood of demonstration data. Given a dataset $\mathcal{D} = \{(x, y)\}$ of prompt-response pairs, where $y$ is a human-written (or curated) response:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})\right]$$
This is standard maximum likelihood estimation applied to conditional text generation. The model learns to imitate the demonstrations.
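As a concrete sketch, the objective reduces to a token-averaged negative log-likelihood. This assumes the per-token log-probabilities have already been computed by the model; `sft_loss` and the toy batch below are illustrative, not from any library:

```python
import numpy as np

def sft_loss(token_logps):
    """Token-averaged negative log-likelihood over a batch of responses.

    token_logps: one array per response, holding log pi_theta(y_t | x, y_<t)
    for each response token (assumed precomputed by the model).
    """
    total_nll = sum(-lp.sum() for lp in token_logps)
    n_tokens = sum(len(lp) for lp in token_logps)
    return total_nll / n_tokens

# Toy batch: two responses with made-up per-token probabilities.
batch = [np.log(np.array([0.5, 0.25])), np.log(np.array([0.8]))]
loss = sft_loss(batch)  # mean NLL over the 3 response tokens
```

Minimizing this pushes probability mass toward the demonstrated tokens, which is exactly the imitation behavior described above.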
Direct Preference Optimization (DPO) optimizes a policy directly from pairwise preferences. Given a dataset $\mathcal{D} = \{(x, y_w, y_l)\}$ of triples, where $y_w$ is preferred over $y_l$:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
where $\pi_{\text{ref}}$ is the reference policy (usually the SFT model), $\beta$ controls the strength of the KL constraint, and $\sigma$ is the sigmoid function.
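A minimal sketch of this loss for a single preference pair (function and variable names are illustrative; the sequence-level log-probabilities are assumed precomputed by the policy and the frozen reference):

```python
import numpy as np

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid of the scaled log-ratio margin between chosen and rejected.

    Inputs are sequence-level log-probs log pi(y|x) under the trainable
    policy and the frozen reference; only the policy receives gradients.
    """
    margin = beta * ((policy_logp_w - ref_logp_w)
                     - (policy_logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# If the policy still matches the reference, both log-ratios are 0, the
# margin is 0, and the loss is -log(0.5): no preference learned yet.
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the chosen response and lowering the rejected one (relative
# to the reference) increases the margin and decreases the loss.
loss_after_update = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```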
The Key Insight Behind DPO
The standard RLHF pipeline has three stages: (1) SFT on demonstrations, (2) fit a reward model on preferences, (3) optimize the policy against the reward model using PPO with a KL penalty. DPO collapses stages 2 and 3 into a single supervised loss by exploiting a closed-form solution to the KL-constrained reward maximization problem.
Under the RLHF objective, the optimal policy satisfies:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x) \exp\!\left(\frac{1}{\beta} r(x, y)\right)$$

where $Z(x)$ is the partition function.
Rearranging gives $r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$. Substituting this into the Bradley-Terry preference model and noting that $Z(x)$ cancels yields the DPO loss. No reward model is ever explicitly trained.
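A quick numeric check of the cancellation (the log-probabilities are arbitrary stand-ins, and `logZ` plays the role of the intractable partition term, which is shared by both responses to the same prompt):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta, logZ = 0.1, 3.7  # logZ is arbitrary: it is identical for y_w and y_l

# Rewards implied by the policy/reference log-ratio for each response.
r_w = beta * (-10.0 - (-11.0))   # preferred response y_w
r_l = beta * (-12.0 - (-11.5))   # dispreferred response y_l

# Bradley-Terry preference probability with and without the Z(x) term:
p_with_Z = sigmoid((r_w + beta * logZ) - (r_l + beta * logZ))
p_without_Z = sigmoid(r_w - r_l)
# The shared beta*log Z(x) shift drops out of the difference entirely,
# which is why the loss never needs to evaluate Z(x).
```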
Side-by-Side Comparison
| Property | SFT | DPO |
|---|---|---|
| Data format | demonstration pairs $(x, y)$ | preference triples $(x, y_w, y_l)$ |
| Objective | Maximum likelihood on demonstrations | Bradley-Terry preference likelihood |
| Reward model | Not needed | Not needed (implicit) |
| Reference policy | Not needed | Required ($\pi_{\text{ref}}$, usually the SFT model) |
| Training stability | Very stable (standard cross-entropy) | Moderately stable, but sensitive to $\beta$ |
| Hyperparameters | Learning rate, epochs | Learning rate, epochs, $\beta$ |
| Data collection cost | Demonstrations (expensive per sample) | Pairwise comparisons (cheaper per sample) |
| What it optimizes | Imitation of best behavior | Relative ranking between outputs |
| Typical pipeline position | Stage 1 of alignment | Stage 2 (replaces reward model + PPO) |
| Compute cost | Low (single model, standard forward/backward) | Moderate (two forward passes: policy + reference) |
| Failure mode | Distribution mismatch at test time | Reward hacking if preferences are noisy |
| When it plateaus | Limited by demonstration quality ceiling | Limited by preference data quality and coverage |
When Each Wins
SFT wins: abundant demonstrations, limited preference data
When you have a large corpus of high-quality demonstrations (expert-written responses, curated instruction-following data) and no preference annotations, SFT is the only option. It is also the standard first stage before any preference-based method. InstructGPT, LLaMA-2 Chat, and most open-source chat models begin with SFT.
SFT wins: simple instruction following
For tasks where the desired behavior is unambiguous (translation, summarization with a fixed style, structured extraction), SFT is sufficient. Preference optimization adds complexity without clear benefit when there is a single correct output distribution to imitate.
DPO wins: subjective quality judgments
When the task involves subjective quality (helpfulness, harmlessness, tone, creativity), demonstrations alone cannot capture the full preference structure. Two responses may both be acceptable but one is clearly better. DPO can learn from these relative rankings. RLHF with PPO can also learn from preferences, but DPO does it without the reward model and RL infrastructure.
DPO wins: simpler training infrastructure
DPO replaces the three-model setup of RLHF (policy, reward model, value function) with a two-model setup (policy, frozen reference). There is no RL loop, no advantage estimation, no clipping. The loss is a standard binary cross-entropy computed on batches of preference pairs. This makes DPO substantially easier to implement and debug.
SFT + DPO: the standard two-stage pipeline
In practice, most alignment pipelines use both. SFT first to get a capable instruction-following model, then DPO (or GRPO) to refine the model using preference data. The SFT model serves as the reference policy for DPO. Skipping SFT and running DPO directly from the pretrained base model generally performs worse because the base model produces outputs too far from the preference data distribution.
Limitations of Each
SFT suffers from exposure bias: during training the model sees ground-truth prefixes, but during generation it conditions on its own outputs. This can cause compounding errors in long sequences. SFT also has a hard ceiling set by the quality of demonstrations. The model cannot learn to be better than its training data.
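A back-of-envelope illustration of the compounding-error argument (the error rate and sequence length below are made up, and the independence assumption is a deliberate simplification of real error propagation):

```python
# If each generated token independently has probability eps of going
# off-distribution, the chance that a length-T generation stays entirely
# on-distribution decays geometrically, even when eps looks harmless.
eps, T = 0.02, 100
p_clean = (1 - eps) ** T  # roughly 0.13 for these values
```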
DPO assumes that preferences follow the Bradley-Terry model, which may not hold for complex human judgments. DPO can also suffer from length exploitation: if annotators systematically prefer longer responses, the model learns to be verbose rather than genuinely better. The $\beta$ parameter requires careful tuning: too small, and the model diverges far from the reference, losing coherence; too large, and the model barely moves from the reference.
Common Confusions
DPO does not eliminate the need for SFT
DPO needs a reference policy, and that reference policy is almost always an SFT model. Running DPO on a raw pretrained model without SFT typically fails because the base model's output distribution is too far from the preference data. SFT is a prerequisite, not an alternative.
SFT is not just behavioral cloning for LLMs
While SFT is maximum likelihood on demonstrations (the language-modeling analogue of behavioral cloning in RL), the objective provides dense supervision at every token position. This is much richer than the sparse, outcome-level reward signal typical of RL fine-tuning, where only the trajectory's final score matters.
DPO is not reward-free
DPO does not learn an explicit reward function, but the optimal policy implicitly defines one: $\hat{r}(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ (up to a term that depends only on $x$). You can extract this implicit reward for analysis. The advantage is not that rewards disappear but that you never need to fit a separate model to approximate them.
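Extracting that implicit reward is a one-liner (names illustrative; sequence log-probabilities assumed precomputed):

```python
def implicit_reward(policy_logp, ref_logp, beta=0.1):
    """DPO's implicit reward beta * log(pi_theta(y|x) / pi_ref(y|x)),
    defined up to an additive term that depends only on the prompt x."""
    return beta * (policy_logp - ref_logp)

# A response the trained policy upweights relative to the reference
# receives a higher implicit reward than one it downweights.
r_up = implicit_reward(-9.0, -11.0)     # policy raised this response's prob
r_down = implicit_reward(-13.0, -11.0)  # policy lowered this one's prob
```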
References
- Ouyang, L. et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS 2022. (InstructGPT: the SFT + reward model + PPO pipeline.)
- Rafailov, R. et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS 2023. (Original DPO paper, derivation of the closed-form loss.)
- Touvron, H. et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288. Section 3 (SFT and RLHF pipeline details).
- Tunstall, L. et al. (2023). "Zephyr: Direct Distillation of LM Alignment." arXiv:2310.16944. (DPO applied to distilled models, comparison with SFT-only baselines.)
- Azar, M. G. et al. (2023). "A General Theoretical Paradigm to Understand Learning from Human Feedback." arXiv:2310.12036. (IPO: analysis of DPO assumptions and Bradley-Terry limitations.)
- Tajwar, F. et al. (2024). "Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data." ICML 2024. (Analysis of when preference data is more informative than demonstrations.)