
LLM Construction

Post-Training Overview

The full post-training stack in 2026: SFT, RLHF, DPO, GRPO, constitutional AI, verifier-guided training, and self-improvement loops. Why post-training is now its own discipline.

Advanced · Tier 2 · Frontier · ~70 min

Why This Matters

A pretrained language model is not a product. It is a text prediction engine that can produce toxic content, hallucinate confidently, and ignore user instructions. The transformation from a raw pretrained model to a useful, safe, steerable assistant happens entirely in the post-training phase.

Post-training was once a minor footnote: a few hundred steps of supervised fine-tuning. By 2026 it is a multi-stage engineering discipline that consumes months of effort, thousands of GPU-hours, and teams of specialists. The post-training pipeline now determines model character, safety properties, reasoning capabilities, and tool-use abilities more than the pretraining recipe.

Mental Model

Think of post-training as a series of increasingly precise shaping operations:

  1. Pretraining gives you raw material: a model that knows language but has no concept of "helpfulness" or "safety."
  2. SFT teaches the format: how to respond to instructions, when to refuse.
  3. Preference optimization (RLHF, DPO, GRPO) teaches the style: which responses are better than others.
  4. Safety training teaches the boundaries: what not to do.
  5. Verifier-guided training teaches reasoning: how to produce correct answers, not just fluent ones.
  6. Self-improvement loops teach the model to improve itself, using its own outputs as training signal.

Each stage has different objectives, different data requirements, and different failure modes. Understanding the full pipeline is necessary for understanding why frontier models behave the way they do.

The Post-Training Pipeline

Definition

Post-Training Pipeline

The post-training pipeline is the sequence of training stages applied after pretraining to transform a base language model into a deployable assistant. A typical 2026 pipeline:

\pi_{\text{base}} \xrightarrow{\text{SFT}} \pi_{\text{SFT}} \xrightarrow{\text{preference opt}} \pi_{\text{pref}} \xrightarrow{\text{safety}} \pi_{\text{safe}} \xrightarrow{\text{verifier RL}} \pi_{\text{deploy}}

Each arrow represents a distinct training stage with its own data, loss function, and optimization procedure.

Stage 1: Supervised Fine-Tuning (SFT)

SFT trains the model on high-quality (instruction, response) pairs using standard next-token prediction:

\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x,y) \sim \mathcal{D}_{\text{SFT}}} \left[\sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})\right]

This teaches the model format: how to follow instructions, use markdown, cite sources, refuse harmful requests. It does not teach quality: a model trained only with SFT will follow instructions but produce mediocre responses.
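The loss above can be sketched as a masked negative log-likelihood: only response tokens (not prompt tokens) contribute to the sum. The function name and toy batch below are illustrative, not from any particular codebase.

```python
import numpy as np

def sft_loss(token_logprobs, response_mask):
    """Masked NLL: sum -log pi(y_t | x, y_<t) over response positions
    only (prompt positions are masked out), averaged over the batch."""
    token_logprobs = np.asarray(token_logprobs, dtype=float)
    response_mask = np.asarray(response_mask, dtype=float)
    per_example = -(token_logprobs * response_mask).sum(axis=-1)
    return per_example.mean()

# Toy batch: 2 sequences of 4 tokens; the first 2 tokens are the prompt.
logps = [[-0.1, -0.2, -0.5, -0.7],
         [-0.3, -0.1, -1.0, -0.4]]
mask  = [[0, 0, 1, 1],
         [0, 0, 1, 1]]
print(round(sft_loss(logps, mask), 2))  # → 1.3
```

Masking matters: without it, the model would also be trained to reproduce the prompt, wasting capacity on tokens it never needs to generate.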

Proposition

SFT as Forward-KL Minimization (Identity)

Statement

This is an algebraic identity, not a substantive theorem. The expected SFT loss decomposes into the entropy of the data distribution plus the forward KL from data to model:

\mathbb{E}_{(x,y) \sim p_{\text{data}}}[-\log \pi_\theta(y \mid x)] = H(p_{\text{data}}) + \text{KL}(p_{\text{data}} \,\|\, \pi_\theta).

Since $H(p_{\text{data}})$ does not depend on $\theta$, minimizing the SFT loss is equivalent to minimizing $\text{KL}(p_{\text{data}} \,\|\, \pi_\theta)$. This is the standard cross-entropy-to-KL decomposition (Cover and Thomas, Chapter 2, Section 2.3). Forward KL is mode-covering: it penalizes assigning low probability to any point where $p_{\text{data}}$ has mass.
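The identity can be checked numerically on a toy categorical distribution; the two distributions below are arbitrary examples:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "data" distribution p_data
q = np.array([0.5, 0.3, 0.2])   # model distribution pi_theta

cross_entropy = -(p * np.log(q)).sum()     # expected SFT loss
entropy = -(p * np.log(p)).sum()           # H(p_data), independent of theta
forward_kl = (p * np.log(p / q)).sum()     # KL(p_data || pi_theta)

# The identity: CE(p, q) = H(p) + KL(p || q)
assert np.isclose(cross_entropy, entropy + forward_kl)
```

Because the entropy term is fixed, driving the cross-entropy down can only come from shrinking the forward KL term.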

Intuition

SFT tries to match the distribution of training responses. This means the model learns to produce average responses that cover the range of the training data. It does not learn to produce the best responses. This is why SFT alone produces adequate but rarely excellent outputs.

Why It Matters

The mode-covering property of forward KL explains why SFT models are often "bland": they hedge across the full range of training responses rather than committing to the best one. Preference optimization addresses this by pushing the model toward the preferred modes.

Stage 2: Preference Optimization

After SFT, the model is trained on human (or AI-generated) preference data. The three dominant approaches in 2026:

  • RLHF: train a reward model on preferences, then optimize with PPO
  • DPO: directly optimize on preference pairs without a separate reward model
  • GRPO: group relative policy optimization using group comparisons

Each optimizes some variant of the KL-regularized preference objective. See the DPO vs GRPO vs RL reasoning page for detailed comparison.
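As one concrete instance, the per-pair DPO loss can be sketched in a few lines. The value beta = 0.1 and the log-probabilities below are illustrative numbers, not taken from any specific model:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid of the
    beta-scaled difference in policy-vs-reference log-ratios."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# The policy already prefers the chosen response more than the
# reference does, so the loss is below log 2 (the logits-0 value).
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0)
print(round(loss, 4))
```

Note that no reward model appears anywhere: the log-ratio against the frozen reference policy plays that role implicitly, which is the core observation of the DPO paper.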

Proposition

Post-Training Error Decomposition

Statement

The gap between the deployed policy $\pi_{\text{deploy}}$ and the ideal policy $\pi^*$ can be decomposed as:

\text{KL}(\pi^* \,\|\, \pi_{\text{deploy}}) \leq \underbrace{\epsilon_{\text{SFT}}}_{\text{format gap}} + \underbrace{\epsilon_{\text{pref}}}_{\text{preference gap}} + \underbrace{\epsilon_{\text{safety}}}_{\text{safety gap}} + \underbrace{\epsilon_{\text{verify}}}_{\text{reasoning gap}}

where each $\epsilon$ term represents the error introduced or left unresolved by the corresponding training stage. These terms interact: reducing one can increase another (safety training can reduce helpfulness; reasoning training can increase verbosity).

Intuition

Each post-training stage addresses a different failure mode, but none is sufficient alone. SFT teaches format but not quality. Preference optimization teaches quality but not correctness. Safety training teaches refusal but can over-refuse. Verifier training teaches reasoning but can make the model verbose. The art of post-training is balancing these competing objectives.

Why It Matters

This decomposition explains why post-training is multi-stage rather than monolithic. A single training objective cannot simultaneously optimize for format compliance, response quality, safety, and reasoning correctness. The pipeline structure is a practical response to this multi-objective optimization problem.

Stage 3: Safety Training

Safety training prevents the model from producing harmful outputs. Methods:

  • Constitutional AI: the model critiques and revises its own outputs against a set of principles, generating synthetic preference data
  • Red-teaming: adversarial prompts reveal failure modes, which become training data for the next round
  • Rule-based reward models (RBRMs): classifier-based reward signals for specific safety categories (toxicity, PII, dangerous instructions)

The key tension: safety training and helpfulness training compete. A model that always refuses is perfectly safe but useless. The Pareto frontier between safety and helpfulness is where deployment decisions are made.
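One common way to operationalize this tradeoff is a scalarized reward: helpfulness minus a weighted safety penalty, a Lagrangian-style relaxation of the constrained problem. The weight lam and all the numbers below are purely illustrative:

```python
def combined_reward(helpfulness, safety_violation_prob, lam=5.0):
    """Scalarized reward: helpfulness minus a Lagrange-style penalty
    weighted by the probability of a safety violation (e.g. from an
    RBRM classifier). lam is a hypothetical tuning knob."""
    return helpfulness - lam * safety_violation_prob

# A refusal: perfectly safe but barely helpful.
refusal = combined_reward(helpfulness=0.1, safety_violation_prob=0.0)
# A helpful but borderline response.
borderline = combined_reward(helpfulness=0.9, safety_violation_prob=0.2)
print(refusal > borderline)  # → True: at lam=5.0, the refusal wins
```

Sweeping lam traces out the safety-helpfulness Pareto frontier: large lam pushes the policy toward over-refusal, small lam toward unsafe helpfulness.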

Stage 4: Verifier-Guided Training

For reasoning tasks (math, code, logic), human preferences are unreliable: even expert humans cannot consistently judge whether a long chain of reasoning is correct. Verifiers solve this by checking outputs objectively:

  • Code execution: run the code and check if it passes tests
  • Math verification: check against ground truth or use formal provers
  • Fact-checking: verify claims against a knowledge base

The model is then trained with RL using verifier feedback as the reward signal. This is how models like o1 and DeepSeek-R1 were trained to reason.
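A minimal code-execution verifier can be sketched as a pass/fail reward over test cases. The function name and toy tests below are hypothetical, for illustration only:

```python
def verify_solution(candidate_fn, test_cases):
    """Binary verifier reward: 1.0 if the candidate passes every
    test case, else 0.0. Exceptions count as failures."""
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) != expected:
                return 0.0
        except Exception:
            return 0.0
    return 1.0

# Toy task: implement addition.
tests = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]
good = lambda a, b: a + b
bad = lambda a, b: a - b
print(verify_solution(good, tests), verify_solution(bad, tests))  # → 1.0 0.0
```

The key property is objectivity: the reward does not depend on how fluent or confident the solution sounds, only on whether it is correct, which is exactly what preference-based rewards cannot guarantee.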

Stage 5: Self-Improvement Loops

The most frontier stage: the model generates its own training data and improves iteratively.

  • Generate many candidate solutions to a problem
  • Use a verifier to select the correct ones
  • Train on the correct solutions (rejection sampling fine-tuning)
  • Repeat with the improved model

This is a form of iterative distillation: the model distills its own best-case behavior into its average-case behavior. The risk is distribution collapse, where the model becomes increasingly narrow in its outputs.
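The loop above can be sketched as one round of rejection-sampling data collection. The toy "model" and verifier below stand in for a real sampler and checker; all names are illustrative:

```python
import random

def rejection_sampling_round(model_sample, verifier, problems, k=8):
    """One self-improvement round: sample k candidates per problem,
    keep those the verifier accepts, return them as new SFT data."""
    sft_data = []
    for problem in problems:
        candidates = [model_sample(problem) for _ in range(k)]
        correct = [c for c in candidates if verifier(problem, c)]
        sft_data.extend((problem, c) for c in correct)
    return sft_data

# Toy setup: the "model" guesses a digit; the verifier checks exact match.
random.seed(0)
model_sample = lambda p: random.randint(0, 9)
verifier = lambda p, c: c == p
data = rejection_sampling_round(model_sample, verifier, problems=[3, 7])
print(len(data))  # number of verified (problem, solution) pairs kept
```

Retraining on `data` and repeating is the full loop; the collapse risk mentioned above shows up here as the filtered set concentrating on an ever-narrower slice of the model's output distribution.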

Common Fake Understanding

"RLHF = alignment." RLHF makes the model produce outputs that score highly on a reward model trained on human preferences. This is not the same as alignment. A sycophantic model that tells users what they want to hear scores highly on preference data. A model that confidently confabulates scores higher than one that hedges honestly. RLHF optimizes for human approval, which correlates with but is not identical to truthfulness, safety, or genuine helpfulness. The entire post-training pipeline. including safety training, constitutional AI, and verifier-guided RL. exists because RLHF alone is insufficient.

Where This Shows Up in Current Papers

Every frontier model release in 2025-2026 describes a multi-stage post-training pipeline. The Llama 3.1 technical report (2024) devotes more pages to post-training than to pretraining. DeepSeek-R1 (2025) introduced verifier-guided RL as a core post-training stage. Claude 3 and GPT-4o describe constitutional AI and multi-stage RLHF respectively. The trend is clear: post-training complexity is increasing faster than pretraining complexity. New frontier models in 2026 typically describe 4-6 distinct post-training stages.

Common Confusions

Watch Out

Post-training is not just fine-tuning

Traditional fine-tuning adjusts a pretrained model on task-specific data. Modern post-training is a multi-stage pipeline with distinct objectives at each stage. Calling post-training "fine-tuning" obscures the complexity: SFT, preference optimization, safety training, and verifier-guided RL are structurally different training procedures with different data requirements and failure modes.

Watch Out

More post-training stages is not always better

Each additional stage risks degrading performance on earlier objectives. Safety training can make the model less helpful (over-refusal). Reasoning training can make the model verbose on simple questions. The pipeline must be carefully calibrated, and regressions on earlier stages must be monitored throughout.

Watch Out

SFT data quality matters more than quantity

A small dataset of expert-written responses often outperforms a large dataset of crowdsourced responses. The LIMA paper (2023) showed that 1,000 carefully curated examples can produce strong SFT results. The quality of SFT data determines the starting point for all subsequent stages.

Summary

  • Post-training pipeline: pretrain → SFT → preference optimization → safety → verifier RL
  • SFT teaches format (forward KL minimization, mode-covering)
  • Preference optimization teaches quality (RLHF, DPO, or GRPO)
  • Safety training teaches boundaries (constitutional AI, red-teaming, RBRMs)
  • Verifier-guided RL teaches reasoning (code execution, math provers)
  • Self-improvement loops use the model's own correct outputs as training data
  • Each stage can degrade performance on earlier objectives; balance is critical
  • RLHF alone is not alignment. The full pipeline is the alignment attempt

Exercises

ExerciseCore

Problem

Explain why SFT alone is insufficient for producing a high-quality assistant. What specific failure mode does preference optimization address that SFT cannot?

ExerciseAdvanced

Problem

The safety-helpfulness tradeoff can be modeled as a constrained optimization problem. Formalize this: write the objective (maximize helpfulness) and the constraint (maintain safety above a threshold), and explain why Lagrangian relaxation converts this into a weighted sum that maps onto the RLHF framework.

ExerciseResearch

Problem

Self-improvement loops (generate, verify, retrain) can suffer from distribution collapse. Formalize this risk: if the model generates solutions from $\pi_t$ and retrains on verified correct solutions, what happens to the entropy $H(\pi_t)$ over iterations? Under what conditions does the model converge to a useful fixed point versus a degenerate one?

References

Canonical:

  • Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback" (InstructGPT), NeurIPS 2022; arXiv:2203.02155.
  • Bai et al. (Anthropic), "Constitutional AI: Harmlessness from AI Feedback" (2022); arXiv:2212.08073.
  • Rafailov, Sharma, Mitchell, Ermon, Manning, Finn, "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", NeurIPS 2023; arXiv:2305.18290. Derivation of the DPO objective from KL-regularized preference RL.

Current pipeline and reasoning RL:

  • Dubey et al., "The Llama 3 Herd of Models" (2024); arXiv:2407.21783. Detailed multi-stage post-training pipeline in a frontier release.
  • DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025); arXiv:2501.12948. Verifier-guided RL with GRPO.
  • Zhou et al., "LIMA: Less Is More for Alignment", NeurIPS 2023; arXiv:2305.11206. Evidence that SFT data quality dominates quantity.

Critiques and limits of RLHF as alignment:

  • Casper et al., "Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback" (2023); arXiv:2307.15217. Reward hacking, distributional shift, and evaluator bias.
  • Perez et al., "Discovering Language Model Behaviors with Model-Written Evaluations" (2022); arXiv:2212.09251. Sycophancy and other emergent RLHF failure modes.


Last reviewed: April 2026
