
Comparison

SFT vs. DPO

Supervised fine-tuning (SFT) learns from demonstration data by maximizing log-likelihood on human-written outputs. Direct preference optimization (DPO) learns from pairwise preference data by directly optimizing the policy without fitting a separate reward model. SFT is simpler and data-efficient for instruction following. DPO is preferred when preference signals are available and you want to skip the reward model stage of RLHF.

What Each Does

Both SFT and DPO are methods for aligning a pretrained language model to desired behavior. They differ in what data they consume and what objective they optimize.

Supervised Fine-Tuning (SFT) maximizes the log-likelihood of demonstration data. Given a dataset of prompt-response pairs (x, y^*) where y^* is a human-written (or curated) response:

\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x,y^*) \sim \mathcal{D}} \left[ \log \pi_\theta(y^* | x) \right]

This is standard maximum likelihood estimation applied to conditional text generation. The model learns to imitate the demonstrations.
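Concretely, the SFT objective reduces to token-level cross-entropy against the demonstration tokens. A minimal numpy sketch (function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def sft_loss(logits, target_ids):
    """Mean negative log-likelihood of the demonstration tokens.

    logits: (seq_len, vocab) unnormalized next-token scores.
    target_ids: (seq_len,) index of the demonstration token at each position.
    """
    # numerically stable log-softmax over the vocabulary
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # pick out log pi_theta(y*_t | x, y*_<t) at each position
    token_ll = log_probs[np.arange(len(target_ids)), target_ids]
    return -token_ll.mean()

# toy check: a model that puts nearly all its mass on the targets has ~0 loss
logits = np.full((3, 5), -10.0)
targets = np.array([1, 3, 0])
logits[np.arange(3), targets] = 10.0
print(sft_loss(logits, targets))  # close to 0
```

With uniform logits the loss is log(vocab_size), the usual sanity check for a cross-entropy implementation.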

Direct Preference Optimization (DPO) optimizes a policy directly from pairwise preferences. Given a dataset of triples (x, y_w, y_l) where y_w is preferred over y_l:

\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \left[ \log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right) \right]

where \pi_{\text{ref}} is the reference policy (usually the SFT model), \beta controls the strength of the KL constraint, and \sigma is the sigmoid function.
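In code, the DPO loss is a logistic loss on the margin between the two policy-vs-reference log-ratios. A minimal numpy sketch, assuming you already have sequence-level log-probs log pi(y|x) summed over tokens (names illustrative):

```python
import numpy as np

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss from sequence log-probs, each argument of shape (batch,)."""
    # implicit reward margin between chosen and rejected responses
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log sigmoid(margin) == log(1 + exp(-margin)): BCE with label "chosen wins"
    return np.mean(np.log1p(np.exp(-margin)))

# if the policy still equals the reference, the margin is 0 and the loss is log 2
pc, pr = np.array([-5.0]), np.array([-7.0])
print(dpo_loss(pc, pr, pc, pr))  # ~0.6931
```

Raising the policy's log-prob on the chosen response (or lowering it on the rejected one) increases the margin and drives the loss below log 2.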

The Key Insight Behind DPO

The standard RLHF pipeline has three stages: (1) SFT on demonstrations, (2) fit a reward model on preferences, (3) optimize the policy against the reward model using PPO with a KL penalty. DPO collapses stages 2 and 3 into a single supervised loss by exploiting a closed-form solution to the KL-constrained reward maximization problem.

Under the RLHF objective, the optimal policy satisfies:

\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)

Rearranging gives r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x). Substituting this into the Bradley-Terry preference model and noting that Z(x) cancels yields the DPO loss. No reward model is ever explicitly trained.
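Spelled out, the substitution step looks like this:

```latex
% Bradley-Terry model for the preference probability
p^*(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)

% substitute r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x);
% the \beta \log Z(x) terms are identical for y_w and y_l, so they cancel:
p^*(y_w \succ y_l \mid x)
  = \sigma\!\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)}
                - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)
```

Replacing \pi^* with the parameterized \pi_\theta and minimizing the negative log-likelihood of the observed preferences gives \mathcal{L}_{\text{DPO}}.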

Side-by-Side Comparison

| Property | SFT | DPO |
| --- | --- | --- |
| Data format | (x, y^*) demonstration pairs | (x, y_w, y_l) preference triples |
| Objective | Maximum likelihood on demonstrations | Bradley-Terry preference likelihood |
| Reward model | Not needed | Not needed (implicit) |
| Reference policy | Not needed | Required (\pi_{\text{ref}}, usually SFT model) |
| Training stability | Very stable (standard cross-entropy) | Moderately stable, but sensitive to \beta |
| Hyperparameters | Learning rate, epochs | Learning rate, epochs, \beta |
| Data collection cost | Demonstrations (expensive per sample) | Pairwise comparisons (cheaper per sample) |
| What it optimizes | Imitation of best behavior | Relative ranking between outputs |
| Typical pipeline position | Stage 1 of alignment | Stage 2 (replaces reward model + PPO) |
| Compute cost | Low (single model, standard forward/backward) | Moderate (two forward passes: policy + reference) |
| Failure mode | Distribution mismatch at test time | Reward hacking if preferences are noisy |
| When it plateaus | Limited by demonstration quality ceiling | Limited by preference data quality and coverage |

When Each Wins

SFT wins: abundant demonstrations, limited preference data

When you have a large corpus of high-quality demonstrations (expert-written responses, curated instruction-following data) and no preference annotations, SFT is the only option. It is also the standard first stage before any preference-based method. InstructGPT, LLaMA-2 Chat, and most open-source chat models begin with SFT.

SFT wins: simple instruction following

For tasks where the desired behavior is unambiguous (translation, summarization with a fixed style, structured extraction), SFT is sufficient. Preference optimization adds complexity without clear benefit when there is a single correct output distribution to imitate.

DPO wins: subjective quality judgments

When the task involves subjective quality (helpfulness, harmlessness, tone, creativity), demonstrations alone cannot capture the full preference structure. Two responses may both be acceptable but one is clearly better. DPO can learn from these relative rankings. RLHF with PPO can also learn from preferences, but DPO does it without the reward model and RL infrastructure.

DPO wins: simpler training infrastructure

DPO replaces the multi-model setup of RLHF with PPO (policy, frozen reference, reward model, value function) with a two-model setup (policy, frozen reference). There is no RL loop, no advantage estimation, no clipping. The loss is a standard binary cross-entropy computed on batches of preference pairs. This makes DPO substantially easier to implement and debug.

SFT + DPO: the standard two-stage pipeline

In practice, most alignment pipelines use both. SFT first to get a capable instruction-following model, then DPO (or GRPO) to refine the model using preference data. The SFT model serves as the reference policy \pi_{\text{ref}} for DPO. Skipping SFT and running DPO directly from the pretrained base model generally performs worse because the base model produces outputs too far from the preference data distribution.

Limitations of Each

SFT suffers from exposure bias: during training the model sees ground-truth prefixes, but during generation it conditions on its own outputs. This can cause compounding errors in long sequences. SFT also has a hard ceiling set by the quality of demonstrations. The model cannot learn to be better than its training data.

DPO assumes that preferences follow the Bradley-Terry model, which may not hold for complex human judgments. DPO can also suffer from length exploitation: if annotators systematically prefer longer responses, the model learns to be verbose rather than genuinely better. The \beta parameter requires careful tuning. Too small and the model diverges far from the reference, losing coherence. Too large and the model barely moves from the reference.
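The sensitivity to \beta is visible directly in the per-pair loss. A toy numpy sketch (values purely illustrative): for the same fixed policy-vs-reference log-ratio margin, both the loss and the scale of its gradient change substantially with \beta:

```python
import numpy as np

def pair_loss(logratio_margin, beta):
    """DPO loss for a single pair: -log sigmoid(beta * margin)."""
    return np.log1p(np.exp(-beta * logratio_margin))

def pair_grad_scale(logratio_margin, beta):
    """|d loss / d margin| = beta * sigmoid(-beta * margin)."""
    return beta / (1.0 + np.exp(beta * logratio_margin))

# same margin, three choices of beta: loss and gradient scale diverge
for beta in (0.01, 0.1, 0.5):
    print(beta, pair_loss(2.0, beta), pair_grad_scale(2.0, beta))
```

Small \beta keeps the loss near log 2 with tiny gradients (the KL anchor is weak but learning is slow); large \beta saturates quickly on easy pairs and penalizes deviation from the reference more aggressively.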

Common Confusions

Watch Out

DPO does not eliminate the need for SFT

DPO needs a reference policy, and that reference policy is almost always an SFT model. Running DPO on a raw pretrained model without SFT typically fails because the base model's output distribution is too far from the preference data. SFT is a prerequisite, not an alternative.

Watch Out

SFT is not just behavioral cloning for LLMs

While SFT is maximum likelihood on demonstrations (similar to behavioral cloning in RL), the language modeling objective provides dense supervision at every token position. This is much richer than the sparse reward signal in behavioral cloning, where only the trajectory outcome matters.

Watch Out

DPO is not reward-free

DPO does not learn an explicit reward function, but the optimal policy implicitly defines one, up to a prompt-dependent constant: r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}. You can extract this implicit reward for analysis. The advantage is not that rewards disappear but that you never need to fit a separate model to approximate them.
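Extracting the implicit reward needs nothing beyond the two sequence log-probs. A small illustrative sketch (plain Python, names hypothetical):

```python
def implicit_reward(policy_logprob, ref_logprob, beta=0.1):
    """DPO's implicit reward beta * log(pi_theta / pi_ref) for one response,
    from sequence-level log-probs. The omitted beta*log Z(x) term is constant
    per prompt, so it drops out of any within-prompt comparison."""
    return beta * (policy_logprob - ref_logprob)

# rank two responses to the same prompt by implicit reward
r_a = implicit_reward(-12.0, -15.0)  # policy raised this response's probability
r_b = implicit_reward(-20.0, -18.0)  # policy lowered this one's
print(r_a > r_b)  # True
```

Because the partition-function term cancels only within a prompt, these implicit rewards are comparable across responses to the same prompt, not across prompts.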

References

  1. Ouyang, L. et al. (2022). "Training language models to follow instructions with human feedback." NeurIPS 2022. (InstructGPT: the SFT + reward model + PPO pipeline.)
  2. Rafailov, R. et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS 2023. (Original DPO paper, derivation of the closed-form loss.)
  3. Touvron, H. et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288. Section 3 (SFT and RLHF pipeline details).
  4. Tunstall, L. et al. (2023). "Zephyr: Direct Distillation of LM Alignment." arXiv:2310.16944. (DPO applied to distilled models, comparison with SFT-only baselines.)
  5. Azar, M. G. et al. (2023). "A General Theoretical Paradigm to Understand Learning from Human Preferences." arXiv:2310.12036. (IPO: analysis of DPO assumptions and Bradley-Terry limitations.)
  6. Tajwar, F. et al. (2024). "Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data." ICML 2024. (Analysis of when preference data is more informative than demonstrations.)