LLM Construction
Reasoning Data Curation
How to build training data for reasoning models: math with verified solutions, code with test cases, rejection sampling, external verification, self-play for problem generation, and the connection to RLVR.
Why This Matters
The models that reason well (o1, DeepSeek-R1, Claude thinking models) were not trained on generic internet text. They were trained on carefully curated data where each solution has been verified to be correct. The quality of reasoning training data determines the ceiling for reasoning capability. Scaling incorrect or unverified reasoning data makes models more fluent at being wrong, not better at being right.
Mental Model
The core principle: for reasoning tasks, you can verify correctness even when you cannot generate correct solutions. A math proof checker can verify a proof without being able to produce one. A test suite can verify code without writing it. This asymmetry between generation and verification is what makes reasoning data curation possible at scale.
The pipeline: (1) source or generate problems, (2) generate many candidate solutions, (3) verify solutions with external tools, (4) keep only verified correct solutions, (5) train on the verified data.
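The five steps above can be sketched as a single loop. This is a minimal illustration, not a production pipeline: `generate` and `verify` are hypothetical callables standing in for a model sampler and an external checker.

```python
import random

def curate(problems, generate, verify, k=64):
    """Rejection-sampling curation loop (hypothetical generate/verify
    callables): sample k candidates per problem, keep only externally
    verified solutions."""
    dataset = []
    for problem in problems:
        candidates = [generate(problem) for _ in range(k)]
        correct = [s for s in candidates if verify(problem, s)]
        if correct:
            # keep one verified solution per problem so easy problems
            # do not dominate the training mix
            dataset.append((problem, random.choice(correct)))
    return dataset
```

Problems with no verified solution simply drop out of the dataset, which is why the base model's pass rate bounds how much data a fixed sampling budget yields.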
Types of Reasoning Data
Verifiable Reasoning Data
Verifiable reasoning data consists of (problem, solution) pairs where the solution's correctness can be checked by an external system:
- Math with ground truth: competition problems with known numerical answers, theorem proving with formal proof assistants (Lean, Coq, Isabelle)
- Code with test cases: programming problems where solutions are checked by running test suites, including hidden test cases that the model never sees during training
- Science with known answers: physics, chemistry, and biology problems where the answer can be derived from first principles or looked up
- Logic puzzles: constraint satisfaction problems where a solution can be verified by checking all constraints
The key property: verification is cheap and reliable, even when generation is hard.
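As a toy instance of cheap, reliable verification, here is an outcome-only checker for math problems with known numerical answers. The `Answer:` line format is an illustrative assumption, not a standard.

```python
from fractions import Fraction

def verify_final_answer(solution_text, ground_truth):
    """Outcome-only verifier: compare the last 'Answer:' line against a
    known ground-truth value, parsed exactly as a fraction. The
    'Answer:' format is an assumption for illustration."""
    for line in reversed(solution_text.strip().splitlines()):
        if line.lower().startswith("answer:"):
            try:
                answer = Fraction(line.split(":", 1)[1].strip())
            except (ValueError, ZeroDivisionError):
                return False
            return answer == Fraction(ground_truth)
    return False  # no answer line at all counts as incorrect
```

Note the asymmetry in action: this function cannot solve any problem, yet it reliably verifies any candidate solution in the expected format.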
Rejection Sampling
Rejection Sampling Improves Solution Quality
Statement
If a model generates $n$ independent solutions to a problem and each solution is correct with probability $p$, then the probability that at least one correct solution exists among the $n$ samples is:

$$P(\text{at least one correct}) = 1 - (1 - p)^n$$

The expected number of correct solutions is $np$. If we select uniformly at random from the correct solutions, the resulting training example is guaranteed correct (given a perfect verifier). To achieve $P(\text{at least one correct}) \geq 1 - \delta$:

$$n \geq \frac{\log \delta}{\log(1 - p)} \approx \frac{1}{p} \log \frac{1}{\delta}$$
Intuition
Generate many attempts, keep the ones that work. If the model solves a problem 10% of the time, generating 50 attempts gives a 99.5% chance of at least one correct solution. The verified correct solutions become high-quality training data. The model is effectively distilling its own best-case behavior into reliable behavior.
Proof Sketch
Each sample is correct independently with probability $p$. The probability all $n$ fail is $(1-p)^n$. So $P(\text{at least one correct}) = 1 - (1-p)^n$. Setting this $\geq 1 - \delta$ and solving: $(1-p)^n \leq \delta$, so $n \geq \log \delta / \log(1-p)$. Using $\log(1-p) \approx -p$ for small $p$ gives the approximation $n \approx \frac{1}{p} \log \frac{1}{\delta}$.
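The statement is easy to check numerically, matching the 10%-pass-rate, 50-sample example above:

```python
import math

def p_at_least_one(p, n):
    # P(at least one correct in n samples) = 1 - (1 - p)^n
    return 1 - (1 - p) ** n

def samples_needed(p, delta):
    # smallest n with (1 - p)^n <= delta, i.e. failure prob <= delta
    return math.ceil(math.log(delta) / math.log(1 - p))
```

For example, `p_at_least_one(0.1, 50)` is about 0.995, and `samples_needed(0.1, 0.05)` shows that 29 samples suffice for a 95% chance at a 10% pass rate.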
Why It Matters
Rejection sampling is the primary method for generating reasoning training data at scale. DeepSeek-R1, OpenAI's o1, and other reasoning models use it extensively. The method works even when the base model's pass rate is low (say 1-5%), as long as you can afford to generate enough samples and verify them.
Failure Mode
If the verifier has false positives (accepts incorrect solutions), the training data is contaminated. If the base model's pass rate is extremely low (below 0.1%), the cost of generating enough samples becomes prohibitive. Also, rejection sampling only selects for final-answer correctness; it does not guarantee that the reasoning chain itself is valid.
Verification Methods
External Verification
External verification uses tools outside the model to check solution correctness:
- Code execution: run the generated code against test cases (both visible and hidden). Check for correctness, not just compilation
- Formal proof assistants: Lean 4, Coq, Isabelle can type-check proofs for mathematical theorems. If the proof compiles, it is correct by construction
- Symbolic math checkers: computer algebra systems (SymPy, Mathematica) can verify numerical answers and simplify expressions
- Unit tests for reasoning: for word problems, verify the final answer against known ground truth
External verifiers are the gold standard because they do not rely on the model's own judgment, which is unreliable for hard problems.
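Test-case verification for code can be sketched in a few lines. This in-process version is for illustration only: a real pipeline sandboxes execution (subprocesses, timeouts, memory limits) and holds out hidden test cases; the `solve` entry-point name is an assumption.

```python
def passes_tests(candidate_src, test_cases, func_name="solve"):
    """Run candidate code against (args, expected) pairs. Illustrative
    sketch only: production systems sandbox execution and use hidden
    test cases the model never sees."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # compile + run the candidate
        fn = namespace[func_name]
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False  # crashes, missing function, etc. count as incorrect
```

Treating any exception as failure matters: a candidate that crashes on some inputs must be rejected, not just one that returns wrong answers.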
Process Reward Models (PRMs) vs Outcome Reward Models (ORMs)
When external verification is unavailable, learned verifiers substitute:
- ORMs score the final answer only. Binary signal: correct or incorrect
- PRMs score each step in the reasoning chain. Richer signal but requires step-level correctness labels, which are expensive to collect
PRMs produce better training signal because they can identify where reasoning went wrong, enabling the model to learn which steps to avoid. The cost is that PRM training data requires human annotation of individual reasoning steps.
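Schematically, the two interfaces differ only in what they score. Both models below are hypothetical callables; reducing step scores to a solution score via `min` is one common choice (a chain is only as strong as its weakest step), not the only one.

```python
def orm_score(outcome_model, problem, solution):
    # outcome reward model: one scalar for the whole solution
    return outcome_model(problem, solution)

def prm_score(step_model, problem, steps):
    # process reward model: score each step given its prefix, then
    # aggregate; min is a common aggregation, not the only one
    return min(step_model(problem, steps[:i + 1])
               for i in range(len(steps)))
```

The ORM sees only the finished solution; the PRM pinpoints the first bad step, which is exactly the localization that makes its training signal richer.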
Self-Play for Problem Generation
Self-Play Problem Generation
Self-play generates new training problems using the model itself:
- The model generates a problem (e.g., a math question)
- The model generates a solution to the problem
- An external verifier checks the solution
- If verified, the (problem, solution) pair becomes training data
- If the model cannot solve its own problem (low pass rate), the problem is "hard" and particularly valuable for training
This creates a curriculum: the model generates problems at the frontier of its own capability. Problems it solves easily are not useful for training. Problems it fails on entirely cannot produce verified solutions. The sweet spot is problems with pass rate between 1% and 50%.
The self-play loop is inspired by AlphaZero, where the system improves by playing against itself. For reasoning, "playing against yourself" means generating problems you can barely solve, then training on the solutions you do find.
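One round of such a loop, using the 1%-50% pass-rate band from above, might be sketched as follows. All callables are hypothetical stand-ins for the proposer model, solver model, and external verifier.

```python
def self_play_round(propose, generate, verify, n_problems=100, k=32,
                    band=(0.01, 0.50)):
    """One self-play round (all callables hypothetical): propose
    problems, attempt each k times, and keep a verified solution only
    when the empirical pass rate lands in the target difficulty band."""
    new_data = []
    for _ in range(n_problems):
        problem = propose()
        attempts = [generate(problem) for _ in range(k)]
        correct = [a for a in attempts if verify(problem, a)]
        pass_rate = len(correct) / k
        # pass rate 0: no verified solution to train on
        # pass rate near 1: already mastered, little training signal
        if band[0] <= pass_rate <= band[1]:
            new_data.append((problem, correct[0]))
    return new_data
```

The empirical pass rate doubles as a free difficulty estimate, which is what turns the loop into an automatic curriculum.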
Connection to RLVR
Verification as RL Reward Signal
Statement
RL with verifiable rewards (RLVR) uses the verifier output as a sparse reward signal:

$$r(x, y) = \begin{cases} 1 & \text{if the verifier accepts } y \\ 0 & \text{otherwise} \end{cases}$$

The RL objective is:

$$\max_\pi \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\left[ r(x, y) \right] - \beta \, \mathbb{D}_{\mathrm{KL}}\!\left( \pi \,\|\, \pi_{\mathrm{ref}} \right)$$

where $\pi_{\mathrm{ref}}$ is a reference policy (typically the SFT model) and $\beta$ controls the KL penalty. This is equivalent to rejection sampling followed by a KL-regularized policy update.
Intuition
Instead of training on pre-collected verified solutions (rejection sampling then SFT), RLVR trains the model online: generate, verify, update policy. The RL framework handles credit assignment automatically through the reward signal. The KL penalty prevents the model from collapsing to a narrow set of solution strategies.
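The KL-regularized objective has a well-known closed-form optimum over a finite set of outputs, $\pi^*(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\,\exp(r(x, y)/\beta)$. A toy computation makes the role of $\beta$ concrete:

```python
import math

def kl_regularized_optimum(ref_probs, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta): the exact
    maximizer of E[r] - beta * KL(pi || pi_ref) over distributions on
    a finite set of candidate outputs y."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

# small beta -> mass concentrates on verified (r = 1) outputs;
# large beta -> the policy stays close to the reference distribution
```

With two equally likely reference outputs where only the first is verified correct, `beta=0.1` puts almost all mass on the correct one, while `beta=10` barely moves from uniform, which is the collapse-versus-anchoring trade-off described above.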
Why It Matters
RLVR is how DeepSeek-R1 was trained. The advantage over rejection sampling + SFT is that the model can explore and discover new solution strategies during training, rather than being limited to strategies it already knew. The verifier provides a reliable training signal without requiring human preference labels.
Failure Mode
Sparse binary rewards make RL optimization difficult. If the model's pass rate is very low, the reward signal is almost always zero and learning stalls. Reward hacking is possible if the verifier has exploitable weaknesses (e.g., a test suite with insufficient coverage). The KL penalty is crucial: without it, the model collapses to generating only the simplest correct solutions.
Data Quality vs Data Quantity
For reasoning, quality dominates quantity. Key empirical findings:
- Correct > plentiful: training on 10,000 verified-correct math solutions outperforms training on 100,000 unverified solutions that include errors
- Hard > easy: training on problems the model finds difficult (pass rate 1-20%) produces more improvement than training on easy problems (pass rate > 80%)
- Diverse > repetitive: solutions that use different reasoning strategies for the same problem type produce more robust reasoning than many solutions using the same approach
- Process > outcome: when available, step-by-step verified chains produce better reasoners than outcome-only verification
Common Confusions
Rejection sampling is not cherry-picking results
Rejection sampling generates training data, not evaluation results. You generate many solutions, keep the correct ones, and train on them. This is the training pipeline, not the evaluation protocol. At evaluation time, the model generates a single solution (or uses majority voting), and that solution is either correct or not.
A perfect verifier does not guarantee perfect reasoning
Even with a zero-false-positive verifier, the training data only contains solutions that happen to arrive at the correct final answer. The reasoning chain may contain errors that cancel out (e.g., two sign errors that compensate). This is why process reward models, which verify each step, produce stronger reasoners than outcome-only verification.
More compute for rejection sampling has diminishing returns
Generating $n = 50$ samples when $p = 0.1$ gives a $1 - 0.9^{50} \approx 99.5\%$ probability of finding a correct solution. But generating 10,000 samples barely helps further. The marginal value of additional samples decreases rapidly once $n \gg 1/p$. The bottleneck shifts to problem diversity, not sample count.
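The decay is geometric: the chance that sample $n+1$ is the first correct one is $(1-p)^n p$, which a two-line function makes explicit:

```python
def marginal_gain(p, n):
    # probability that sample n+1 is the first correct one:
    # (1 - p)^n * p, which decays geometrically in n
    return (1 - p) ** n * p
```

At $p = 0.1$, the first sample contributes 0.1 to the success probability; the 51st contributes under 0.001.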
Summary
- Reasoning data requires verified correctness, not just fluency
- Rejection sampling: generate many solutions, keep verified correct ones
- External verifiers (code execution, proof assistants, symbolic checkers) are the gold standard
- Self-play generates problems at the model's capability frontier
- RLVR uses verifier output as RL reward, enabling online exploration
- Data quality matters more than quantity: correct, hard, diverse solutions
- Process verification (step-by-step) is stronger than outcome verification (final answer only)
Exercises
Problem
A model has a 5% pass rate on competition math problems. How many samples per problem do you need to generate to have at least a 95% chance of finding one correct solution?
Problem
You are designing a self-play loop for math reasoning. The model currently solves 30% of problems at difficulty level $d$ and 2% at level $d+1$. Should you primarily train on level $d$ problems (high pass rate) or level $d+1$ problems (low pass rate)? Justify quantitatively by considering the cost per verified correct solution.
References
Canonical:
- DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025)
- Lightman et al., "Let's Verify Step by Step" (2023), which introduced process reward models
Current:
- Singh et al., "Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models" (2024)
- Zelikman et al., "STaR: Bootstrapping Reasoning With Reasoning" (2022)
- Polu and Sutskever, "Generative Language Modeling for Automated Theorem Proving" (2020)
Next Topics
- DPO vs GRPO vs RL reasoning: how preference optimization methods use reasoning data
- Reward models and verifiers: the systems that score and verify model outputs
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Post-Training Overview (Layer 5)
- RLHF and Alignment (Layer 4)
- Policy Gradient Theorem (Layer 3)
- Markov Decision Processes (Layer 2)
- Convex Optimization Basics (Layer 1)
- Differentiation in R^n (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Matrix Operations and Properties (Layer 0A)
- Concentration Inequalities (Layer 1)
- Common Probability Distributions (Layer 0A)
- Expectation, Variance, Covariance, and Moments (Layer 0A)
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)