
Comparison

SFT vs. DPO

Supervised fine-tuning (SFT) learns from demonstration data by maximizing log-likelihood on human-written outputs. Direct preference optimization (DPO) learns from pairwise preference data by directly optimizing the policy without fitting a separate reward model. SFT is simpler and data-efficient for instruction following. DPO is preferred when preference signals are available and you want to skip the reward model stage of RLHF.

What Each Does

Both SFT and DPO are methods for aligning a pretrained language model to desired behavior. They differ in what data they consume and what objective they optimize.

Supervised Fine-Tuning (SFT) maximizes the log-likelihood of demonstration data. Given a dataset of prompt-response pairs (x, y^*) where y^* is a human-written (or curated) response:

\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x,y^*) \sim \mathcal{D}} \left[ \log \pi_\theta(y^* | x) \right]

This is standard maximum likelihood estimation applied to conditional text generation. The model learns to imitate the demonstrations.
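Concretely, the SFT objective reduces to token-level cross-entropy against the demonstration tokens. A minimal numpy sketch (function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def sft_loss(logits, target_ids):
    """Mean negative log-likelihood of the demonstration tokens.

    logits: (seq_len, vocab) unnormalized next-token scores.
    target_ids: (seq_len,) index of the demonstration token at each position.
    """
    # numerically stable log-softmax over the vocabulary
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # pick out log pi_theta(y*_t | x, y*_<t) at each position
    token_ll = log_probs[np.arange(len(target_ids)), target_ids]
    return -token_ll.mean()

# toy check: a model that puts nearly all its mass on the targets has ~0 loss
logits = np.full((3, 5), -10.0)
targets = np.array([1, 3, 0])
logits[np.arange(3), targets] = 10.0
print(sft_loss(logits, targets))  # close to 0
```

With uniform logits the loss is log(vocab_size), the usual sanity check for a cross-entropy implementation.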

Direct Preference Optimization (DPO) optimizes a policy directly from pairwise preferences. Given a dataset of triples (x, y_w, y_l) where y_w is preferred over y_l:

\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \left[ \log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right) \right]

where \pi_{\text{ref}} is the reference policy (usually the SFT model), \beta controls the strength of the KL constraint, and \sigma is the sigmoid function.
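In code, the DPO loss is a logistic loss on the margin between the two policy-vs-reference log-ratios. A minimal numpy sketch, assuming you already have sequence-level log-probs log pi(y|x) summed over tokens (names illustrative):

```python
import numpy as np

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss from sequence log-probs, each argument of shape (batch,)."""
    # implicit reward margin between chosen and rejected responses
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log sigmoid(margin) == log(1 + exp(-margin)): BCE with label "chosen wins"
    return np.mean(np.log1p(np.exp(-margin)))

# if the policy still equals the reference, the margin is 0 and the loss is log 2
pc, pr = np.array([-5.0]), np.array([-7.0])
print(dpo_loss(pc, pr, pc, pr))  # ~0.6931
```

Raising the policy's log-prob on the chosen response (or lowering it on the rejected one) increases the margin and drives the loss below log 2.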

The Key Insight Behind DPO

The standard RLHF pipeline has three stages: (1) SFT on demonstrations, (2) fit a reward model on preferences, (3) optimize the policy against the reward model using PPO with a KL penalty. DPO collapses stages 2 and 3 into a single supervised loss by exploiting a closed-form solution to the KL-constrained reward maximization problem.

Under the RLHF objective, the optimal policy satisfies:

\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)

Rearranging gives r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x). Substituting this into the Bradley-Terry preference model and noting that Z(x) cancels yields the DPO loss. No reward model is ever explicitly trained.
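Spelled out, the substitution step looks like this:

```latex
% Bradley-Terry model for the preference probability
p^*(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)

% substitute r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x);
% the \beta \log Z(x) terms are identical for y_w and y_l, so they cancel:
p^*(y_w \succ y_l \mid x)
  = \sigma\!\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)}
                - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)
```

Replacing \pi^* with the parameterized \pi_\theta and minimizing the negative log-likelihood of the observed preferences gives \mathcal{L}_{\text{DPO}}.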

Side-by-Side Comparison

| Property | SFT | DPO |
| --- | --- | --- |
| Data format | (x, y^*) demonstration pairs | (x, y_w, y_l) preference triples |
| Objective | Maximum likelihood on demonstrations | Bradley-Terry preference likelihood |
| Reward model | Not needed | Not needed (implicit) |
| Reference policy | Not needed | Required (\pi_{\text{ref}}, usually SFT model) |
| Training stability | Very stable (standard cross-entropy) | Moderately stable, but sensitive to \beta |
| Hyperparameters | Learning rate, epochs | Learning rate, epochs, \beta |
| Data collection cost | Demonstrations (expensive per sample) | Pairwise comparisons (cheaper per sample) |
| What it optimizes | Imitation of best behavior | Relative ranking between outputs |
| Typical pipeline position | Stage 1 of alignment | Stage 2 (replaces reward model + PPO) |
| Compute cost | Low (single model, standard forward/backward) | Moderate (two forward passes: policy + reference) |
| Failure mode | Distribution mismatch at test time | Reward hacking if preferences are noisy |
| When it plateaus | Limited by demonstration quality ceiling | Limited by preference data quality and coverage |

When Each Wins

SFT wins: abundant demonstrations, limited preference data

When you have a large corpus of high-quality demonstrations (expert-written responses, curated instruction-following data) and no preference annotations, SFT is the only option. It is also the standard first stage before any preference-based method. InstructGPT, LLaMA-2 Chat, and most open-source chat models begin with SFT.

SFT wins: simple instruction following

For tasks where the desired behavior is unambiguous (translation, summarization with a fixed style, structured extraction), SFT is sufficient. Preference optimization adds complexity without clear benefit when there is a single correct output distribution to imitate.

DPO wins: subjective quality judgments

When the task involves subjective quality (helpfulness, harmlessness, tone, creativity), demonstrations alone cannot capture the full preference structure. Two responses may both be acceptable but one is clearly better. DPO can learn from these relative rankings. RLHF with PPO can also learn from preferences, but DPO does it without the reward model and RL infrastructure.

DPO wins: simpler training infrastructure

DPO replaces the multi-model setup of RLHF with PPO (policy, frozen reference, reward model, value function) with a two-model setup (policy, frozen reference). There is no RL loop, no advantage estimation, no clipping. The loss is a standard binary cross-entropy computed on batches of preference pairs. This makes DPO substantially easier to implement and debug.

SFT + DPO: the standard two-stage pipeline

In practice, most alignment pipelines use both. SFT first to get a capable instruction-following model, then DPO (or GRPO) to refine the model using preference data. The SFT model serves as the reference policy \pi_{\text{ref}} for DPO. Skipping SFT and running DPO directly from the pretrained base model generally performs worse because the base model produces outputs too far from the preference data distribution.

Limitations of Each

SFT suffers from exposure bias: during training the model sees ground-truth prefixes, but during generation it conditions on its own outputs. This can cause compounding errors in long sequences. SFT also has a hard ceiling set by the quality of demonstrations. The model cannot learn to be better than its training data.

DPO assumes that preferences follow the Bradley-Terry model, which may not hold for complex human judgments. DPO can also suffer from length exploitation: if annotators systematically prefer longer responses, the model learns to be verbose rather than genuinely better. The \beta parameter requires careful tuning. Too small and the model diverges far from the reference, losing coherence. Too large and the model barely moves from the reference.
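The sensitivity to \beta is visible directly in the per-pair loss. A toy numpy sketch (values purely illustrative): for the same fixed policy-vs-reference log-ratio margin, both the loss and the scale of its gradient change substantially with \beta:

```python
import numpy as np

def pair_loss(logratio_margin, beta):
    """DPO loss for a single pair: -log sigmoid(beta * margin)."""
    return np.log1p(np.exp(-beta * logratio_margin))

def pair_grad_scale(logratio_margin, beta):
    """|d loss / d margin| = beta * sigmoid(-beta * margin)."""
    return beta / (1.0 + np.exp(beta * logratio_margin))

# same margin, three choices of beta: loss and gradient scale diverge
for beta in (0.01, 0.1, 0.5):
    print(beta, pair_loss(2.0, beta), pair_grad_scale(2.0, beta))
```

Small \beta keeps the loss near log 2 with tiny gradients (the KL anchor is weak but learning is slow); large \beta saturates quickly on easy pairs and penalizes deviation from the reference more aggressively.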

Common Confusions

Watch Out

DPO does not eliminate the need for SFT

DPO needs a reference policy, and that reference policy is almost always an SFT model. Running DPO on a raw pretrained model without SFT typically fails because the base model's output distribution is too far from the preference data. SFT is a prerequisite, not an alternative.

Watch Out

SFT is not just behavioral cloning for LLMs

While SFT is maximum likelihood on demonstrations (similar to behavioral cloning in RL), the language modeling objective provides dense supervision at every token position. This is much richer than the sparse reward signal in behavioral cloning, where only the trajectory outcome matters.

Watch Out

DPO is not reward-free

DPO does not learn an explicit reward function, but the optimal policy implicitly defines one, up to a prompt-dependent constant: r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}. You can extract this implicit reward for analysis. The advantage is not that rewards disappear but that you never need to fit a separate model to approximate them.
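Extracting the implicit reward needs nothing beyond the two sequence log-probs. A small illustrative sketch (plain Python, names hypothetical):

```python
def implicit_reward(policy_logprob, ref_logprob, beta=0.1):
    """DPO's implicit reward beta * log(pi_theta / pi_ref) for one response,
    from sequence-level log-probs. The omitted beta*log Z(x) term is constant
    per prompt, so it drops out of any within-prompt comparison."""
    return beta * (policy_logprob - ref_logprob)

# rank two responses to the same prompt by implicit reward
r_a = implicit_reward(-12.0, -15.0)  # policy raised this response's probability
r_b = implicit_reward(-20.0, -18.0)  # policy lowered this one's
print(r_a > r_b)  # True
```

Because the partition-function term cancels only within a prompt, these implicit rewards are comparable across responses to the same prompt, not across prompts.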

References

  1. Ouyang, L. et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS 2022. (InstructGPT: the SFT + reward model + PPO pipeline.)
  2. Rafailov, R. et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS 2023. (Original DPO paper, derivation of the closed-form loss.)
  3. Touvron, H. et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288. Section 3 (SFT and RLHF pipeline details).
  4. Tunstall, L. et al. (2023). "Zephyr: Direct Distillation of LM Alignment." arXiv:2310.16944. (DPO applied to distilled models, comparison with SFT-only baselines.)
  5. Azar, M. G. et al. (2023). "A General Theoretical Paradigm to Understand Learning from Human Preferences." arXiv:2310.12036. (IPO: analysis of DPO assumptions and Bradley-Terry limitations.)
  6. Tajwar, F. et al. (2024). "Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data." ICML 2024. (Analysis of when preference data is more informative than demonstrations.)