

Reasoning Data Curation

How to build training data for reasoning models: math with verified solutions, code with test cases, rejection sampling, external verification, self-play for problem generation, and the connection to RLVR.

Advanced · Tier 2 · Frontier · ~45 min

Why This Matters

The models that reason well (o1, DeepSeek-R1, Claude's thinking models) did not get that ability from generic internet text. Their reasoning was trained on carefully curated data where each solution has been verified to be correct. The quality of reasoning training data determines the ceiling for reasoning capability: scaling incorrect or unverified reasoning data makes models more fluent at being wrong, not better at being right.

Mental Model

The core principle: for reasoning tasks, you can verify correctness even when you cannot generate correct solutions. A math proof checker can verify a proof without being able to produce one. A test suite can verify code without writing it. This asymmetry between generation and verification is what makes reasoning data curation possible at scale.

The pipeline: (1) source or generate problems, (2) generate many candidate solutions, (3) verify solutions with external tools, (4) keep only verified correct solutions, (5) train on the verified data.
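As a concrete reference, here is a minimal sketch of that pipeline in Python. The `sample_solutions` and `verify` arguments are hypothetical stand-ins for a real model sampler and an external verifier:

```python
# Minimal rejection-sampling curation pipeline (sketch).
# `sample_solutions` and `verify` are hypothetical stand-ins for a
# real model sampler and an external verifier (e.g., a test runner).

def curate(problems, sample_solutions, verify, n_samples=64):
    """Return (problem, solution) pairs whose solutions pass verification."""
    dataset = []
    for problem in problems:                                      # step 1
        candidates = sample_solutions(problem, n=n_samples)       # step 2
        verified = [s for s in candidates if verify(problem, s)]  # step 3
        dataset.extend((problem, s) for s in verified)            # step 4
    return dataset  # step 5: train on this
```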

Types of Reasoning Data

Definition

Verifiable Reasoning Data

Verifiable reasoning data consists of (problem, solution) pairs where the solution's correctness can be checked by an external system:

  1. Math with ground truth: competition problems with known numerical answers, theorem proving with formal proof assistants (Lean, Coq, Isabelle); a minimal answer checker for this case is sketched below
  2. Code with test cases: programming problems where solutions are checked by running test suites, including hidden test cases that the model never sees during training
  3. Science with known answers: physics, chemistry, and biology problems where the answer can be derived from first principles or looked up
  4. Logic puzzles: constraint satisfaction problems where a solution can be verified by checking all constraints

The key property: verification is cheap and reliable, even when generation is hard.
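For math with a known numerical answer (item 1), verification can be a symbolic equality check. A minimal sketch using SymPy; the `Answer:` extraction format is an assumption about how solutions are written:

```python
import re
import sympy

def verify_math_answer(solution_text: str, ground_truth: str) -> bool:
    """Check a solution's final answer against ground truth symbolically."""
    # Assumes solutions end with "Answer: <expression>"; real pipelines
    # use more robust extractors (e.g., \boxed{} parsing).
    match = re.search(r"Answer:\s*(.+)\s*$", solution_text)
    if match is None:
        return False
    try:
        candidate = sympy.sympify(match.group(1))
        truth = sympy.sympify(ground_truth)
        return sympy.simplify(candidate - truth) == 0
    except (sympy.SympifyError, TypeError):
        return False

print(verify_math_answer("2 + 2 = 4, so Answer: 4/2 + 2", "4"))  # True
```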

Rejection Sampling

Proposition

Rejection Sampling Improves Solution Quality

Statement

If a model generates $N$ independent solutions to a problem and each solution is correct with probability $p$, then the probability that at least one correct solution exists among the $N$ samples is:

$$P(\text{at least one correct}) = 1 - (1-p)^N$$

The expected number of correct solutions is $Np$. If we select uniformly at random from the correct solutions, the resulting training example is guaranteed correct (given a perfect verifier). To achieve $P(\text{at least one correct}) \geq 1 - \delta$:

$$N \geq \frac{\ln(1/\delta)}{\ln(1/(1-p))} \approx \frac{\ln(1/\delta)}{p} \quad \text{for small } p$$

Intuition

Generate many attempts, keep the ones that work. If the model solves a problem 10% of the time, generating 50 attempts gives a 99.5% chance of at least one correct solution. The verified correct solutions become high-quality training data. The model is effectively distilling its own best-case behavior into reliable behavior.
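The arithmetic is worth checking yourself; a small sketch computing both quantities:

```python
import math

def p_at_least_one(p: float, n: int) -> float:
    """P(at least one correct) = 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** n

def samples_needed(p: float, delta: float) -> int:
    """Smallest N with P(at least one correct) >= 1 - delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p))

print(p_at_least_one(0.10, 50))     # 0.9948... (~99.5%, as above)
print(samples_needed(0.10, 0.005))  # 51 samples for >= 99.5%
```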

Proof Sketch

Each sample is correct independently with probability $p$. The probability that all $N$ fail is $(1-p)^N$, so $P(\text{any correct}) = 1 - (1-p)^N$. Setting this $\geq 1 - \delta$ and solving: $(1-p)^N \leq \delta$, so $N \geq \ln\delta / \ln(1-p)$. Using $\ln(1-p) \approx -p$ for small $p$ gives the approximation.

Why It Matters

Rejection sampling is the primary method for generating reasoning training data at scale. DeepSeek-R1, OpenAI's o1, and other reasoning models use it extensively. The method works even when the base model's pass rate is low (say 1-5%), as long as you can afford to generate enough samples and verify them.

Failure Mode

If the verifier has false positives (accepts incorrect solutions), the training data is contaminated. If the base model's pass rate $p$ is extremely low (below 0.1%), the cost of generating enough samples becomes prohibitive. Also, rejection sampling only selects for final-answer correctness; it does not guarantee that the reasoning chain itself is valid.

Verification Methods

Definition

External Verification

External verification uses tools outside the model to check solution correctness:

  1. Code execution: run the generated code against test cases (both visible and hidden), checking for correctness, not just compilation; a minimal runner is sketched after this list
  2. Formal proof assistants: Lean 4, Coq, Isabelle can type-check proofs for mathematical theorems. If the proof compiles, it is correct by construction
  3. Symbolic math checkers: computer algebra systems (SymPy, Mathematica) can verify numerical answers and simplify expressions
  4. Unit tests for reasoning: for word problems, verify the final answer against known ground truth

External verifiers are the gold standard because they do not rely on the model's own judgment, which is unreliable for hard problems.
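A sketch of the first method, running a candidate solution against assert-style tests in a subprocess. The file layout and test format here are assumptions; production pipelines also sandbox execution, since model-generated code is untrusted:

```python
import subprocess
import sys
import tempfile

def verify_code(solution: str, tests: str, timeout: float = 5.0) -> bool:
    """Return True iff `solution` passes all assert-based `tests`."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0  # nonzero on AssertionError/crash
    except subprocess.TimeoutExpired:
        return False  # infinite loops count as failures

solution = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(verify_code(solution, tests))  # True
```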

Process Reward Models (PRMs) vs Outcome Reward Models (ORMs)

When external verification is unavailable, learned verifiers substitute:

  • ORMs score the final answer only. Binary signal: correct or incorrect
  • PRMs score each step in the reasoning chain. Richer signal but requires step-level correctness labels, which are expensive to collect

PRMs produce a better training signal because they can identify where the reasoning went wrong, enabling the model to learn which steps to avoid. The cost is that PRM training data requires human annotation of individual reasoning steps.
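The difference is easiest to see in the scoring interfaces. A sketch, where `orm` and `prm` are placeholders for learned scorers mapping text to a correctness probability:

```python
from typing import Callable, List

ScoreFn = Callable[[str], float]  # hypothetical learned scorer

def orm_score(problem: str, solution: str, orm: ScoreFn) -> float:
    """Outcome reward model: one score for the entire solution."""
    return orm(problem + "\n" + solution)

def prm_score(problem: str, steps: List[str], prm: ScoreFn) -> float:
    """Process reward model: score every step as it is appended.

    Aggregating with min is one common choice, since a single bad
    step invalidates the whole chain.
    """
    prefix = problem
    scores = []
    for step in steps:
        prefix += "\n" + step
        scores.append(prm(prefix))
    return min(scores)
```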

Self-Play for Problem Generation

Definition

Self-Play Problem Generation

Self-play generates new training problems using the model itself:

  1. The model generates a problem (e.g., a math question)
  2. The model generates a solution to the problem
  3. An external verifier checks the solution
  4. If verified, the (problem, solution) pair becomes training data
  5. If the model cannot solve its own problem (low pass rate), the problem is "hard" and particularly valuable for training

This creates a curriculum: the model generates problems at the frontier of its own capability. Problems it solves easily are not useful for training. Problems it fails on entirely cannot produce verified solutions. The sweet spot is problems with pass rate between 1% and 50%.

The self-play loop is inspired by AlphaZero, where the system improves by playing against itself. For reasoning, "playing against yourself" means generating problems you can barely solve, then training on the solutions you do find.
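A sketch of one round of this loop, assuming hypothetical `generate_problem`, `sample_solutions`, and `verify` helpers; the pass-rate band implements the frontier curriculum described above:

```python
def self_play_round(generate_problem, sample_solutions, verify,
                    n_samples=64, min_rate=0.01, max_rate=0.5):
    """Propose one problem; keep its verified solutions only if the
    problem sits at the capability frontier (pass rate in band)."""
    problem = generate_problem()                               # step 1
    candidates = sample_solutions(problem, n=n_samples)        # step 2
    verified = [s for s in candidates if verify(problem, s)]   # step 3
    pass_rate = len(verified) / n_samples
    if min_rate <= pass_rate <= max_rate:
        return [(problem, s) for s in verified]                # step 4
    return []  # too easy (> max) or effectively unsolved (< min)
```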

Connection to RLVR

Proposition

Verification as RL Reward Signal

Statement

RL with verifiable rewards (RLVR) uses the verifier output as a sparse reward signal:

$$r(x, y) = \begin{cases} 1 & \text{if verifier accepts solution } y \text{ to problem } x \\ 0 & \text{otherwise} \end{cases}$$

The RL objective is:

$$\max_\theta \, \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)} [r(x, y)] - \beta \, \text{KL}(\pi_\theta \| \pi_{\text{ref}})$$

where $\pi_{\text{ref}}$ is a reference policy (typically the SFT model) and $\beta$ controls the KL penalty. This is equivalent to rejection sampling followed by a KL-regularized policy update.
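A sketch of the reward and a naive per-sample surrogate loss. Here `logp` and `logp_ref` are the summed log-probabilities of the sampled solution under the policy and the frozen reference model; real implementations (e.g., GRPO in DeepSeek-R1) add group-based advantages and other stabilizers:

```python
import torch

def rlvr_reward(problem: str, solution: str, verify) -> float:
    """Binary verifiable reward: 1 if the verifier accepts, else 0."""
    return 1.0 if verify(problem, solution) else 0.0

def rlvr_loss(logp: torch.Tensor, logp_ref: torch.Tensor,
              reward: float, beta: float = 0.04) -> torch.Tensor:
    """REINFORCE-style surrogate for one sampled solution (sketch).

    logp - logp_ref is a simple single-sample estimate of the KL term;
    beta = 0.04 is an assumed value, not a universal default.
    """
    kl_estimate = logp - logp_ref
    return -reward * logp + beta * kl_estimate
```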

Intuition

Instead of training on pre-collected verified solutions (rejection sampling then SFT), RLVR trains the model online: generate, verify, update policy. The RL framework handles credit assignment automatically through the reward signal. The KL penalty prevents the model from collapsing to a narrow set of solution strategies.

Why It Matters

RLVR is how DeepSeek-R1 was trained. The advantage over rejection sampling + SFT is that the model can explore and discover new solution strategies during training, rather than being limited to strategies it already knew. The verifier provides a reliable training signal without requiring human preference labels.

Failure Mode

Sparse binary rewards make RL optimization difficult. If the model's pass rate is very low, the reward signal is almost always zero and learning stalls. Reward hacking is possible if the verifier has exploitable weaknesses (e.g., a test suite with insufficient coverage). The KL penalty is crucial: without it, the model collapses to generating only the simplest correct solutions.

Data Quality vs Data Quantity

For reasoning, quality dominates quantity. Key empirical findings:

  1. Correct > plentiful: training on 10,000 verified-correct math solutions outperforms training on 100,000 unverified solutions that include errors
  2. Hard > easy: training on problems the model finds difficult (pass rate 1-20%) produces more improvement than training on easy problems (pass rate > 80%); see the filter sketch after this list
  3. Diverse > repetitive: solutions that use different reasoning strategies for the same problem type produce more robust reasoning than many solutions using the same approach
  4. Process > outcome: when available, step-by-step verified chains produce better reasoners than outcome-only verification
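Finding (2) translates directly into a data filter. A minimal sketch, given measured pass rates per problem:

```python
def select_hard_problems(pass_rates: dict, low: float = 0.01,
                         high: float = 0.20) -> list:
    """Keep problems that are hard enough to teach something but
    solvable enough to yield verified solutions."""
    return [pid for pid, rate in pass_rates.items() if low <= rate <= high]

rates = {"p1": 0.90, "p2": 0.05, "p3": 0.00, "p4": 0.15}
print(select_hard_problems(rates))  # ['p2', 'p4']
```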

Common Confusions

Watch Out

Rejection sampling is not cherry-picking results

Rejection sampling generates training data, not evaluation results. You generate many solutions, keep the correct ones, and train on them. This is the training pipeline, not the evaluation protocol. At evaluation time, the model generates a single solution (or uses majority voting), and that solution is either correct or not.

Watch Out

A perfect verifier does not guarantee perfect reasoning

Even with a zero-false-positive verifier, the training data only contains solutions that happen to arrive at the correct final answer. The reasoning chain may contain errors that cancel out (e.g., two sign errors that compensate). This is why process reward models, which verify each step, produce stronger reasoners than outcome-only verification.

Watch Out

More compute for rejection sampling has diminishing returns

Generating $N = 1000$ samples when $p = 0.1$ gives a $1 - (0.9)^{1000} \approx 1$ probability of finding a correct solution. But generating 10,000 samples barely helps further. The marginal value of additional samples decreases rapidly once $Np$ is large. The bottleneck shifts to problem diversity, not sample count.
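The numbers are easy to reproduce:

```python
# Success probability saturates quickly at p = 0.1: going from
# N = 1,000 to N = 10,000 samples buys essentially nothing.
p = 0.1
for n in (10, 50, 100, 1000, 10000):
    print(f"N={n:>6}: P(at least one correct) = {1 - (1 - p) ** n:.6f}")
```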

Summary

  • Reasoning data requires verified correctness, not just fluency
  • Rejection sampling: generate many solutions, keep verified correct ones
  • External verifiers (code execution, proof assistants, symbolic checkers) are the gold standard
  • Self-play generates problems at the model's capability frontier
  • RLVR uses verifier output as RL reward, enabling online exploration
  • Data quality matters more than quantity: correct, hard, diverse solutions
  • Process verification (step-by-step) is stronger than outcome verification (final answer only)

Exercises

ExerciseCore

Problem

A model has a 5% pass rate on competition math problems. How many samples per problem do you need to generate to have at least a 95% chance of finding one correct solution?

ExerciseAdvanced

Problem

You are designing a self-play loop for math reasoning. The model currently solves 30% of problems at difficulty level $d$ and 2% at level $d+1$. Should you primarily train on level $d$ problems (high pass rate) or level $d+1$ problems (low pass rate)? Justify quantitatively by considering the cost per verified correct solution.

References

Canonical:

  • DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025)
  • Lightman et al., "Let's Verify Step by Step" (2023), on process reward models

Current:

  • Singh et al., "Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models" (2024)
  • Zelikman et al., "STaR: Bootstrapping Reasoning With Reasoning" (2022)
  • Polu and Sutskever, "Generative Language Modeling for Automated Theorem Proving" (2020)
