What Each Claims
Kaplan et al. (2020) fitted power-law exponents for language model loss as a function of parameters N, data D, and compute C. Their fitted exponents suggested that loss is more sensitive to model size than to data, so for a fixed compute budget the optimal allocation grows parameters much faster than tokens: N_opt ∝ C^0.73 vs D_opt ∝ C^0.27. This led to the recommendation: scale parameters faster than data. Train large models on "enough" data.
Hoffmann et al. (Chinchilla, 2022) re-fitted the scaling laws with a key methodological change: they trained smaller models to near-convergence, not just for a fixed number of steps. Their conclusion was the opposite: for a fixed compute budget C (approximately 6ND training FLOPs), parameters N and training tokens D should be scaled roughly equally: N_opt ∝ C^0.5, D_opt ∝ C^0.5. The compute-optimal ratio is approximately 20 tokens per parameter.
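Under the two standard approximations — training compute C ≈ 6ND and the Chinchilla ratio D ≈ 20N — the compute-optimal split is a one-line calculation. A minimal sketch (the function name is mine, and 20 tokens per parameter is a rule of thumb, not a constant):

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (params, tokens) using the common
    approximation C ~= 6*N*D and the Chinchilla ratio D ~= 20*N.

    Solving C = 6 * N * (tokens_per_param * N) for N gives
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: roughly Chinchilla's own training budget (~5.8e23 FLOPs).
n, d = chinchilla_optimal(5.76e23)
print(f"params ~ {n/1e9:.0f}B, tokens ~ {d/1e12:.2f}T")  # ~69B, ~1.39T
```

Plugging in Chinchilla's approximate budget recovers roughly its published configuration of 70B parameters and 1.4T tokens, which is a useful sanity check on the 6ND and 20:1 approximations.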
Side-by-Side Statement
| Aspect | Kaplan (2020) | Chinchilla (2022) |
|---|---|---|
| Recommendation | Scale N faster than D | Scale N and D roughly equally |
| Optimal allocation | N ∝ C^0.73, D ∝ C^0.27 | N ∝ C^0.5, D ∝ C^0.5 |
| Tokens per parameter | Not explicit; implies far fewer than 20 at scale (GPT-3: ~1.7) | ~20 tokens per parameter |
| Key assumption | Fixed training budget per model size | Train each model to near-convergence |
| Implication | GPT-3: 175B params, 300B tokens | Chinchilla: 70B params, 1.4T tokens |
| Which performed better? | GPT-3 (175B, 300B tokens) | Chinchilla (70B, 1.4T tokens) outperformed GPT-3 and Gopher despite being far smaller |
Why They Disagreed
The disagreement came from a specific methodological issue in the Kaplan analysis. Kaplan et al. trained all model sizes for approximately the same number of tokens (or the same compute-per-parameter ratio). This meant smaller models were trained closer to convergence, while larger models were still far from convergence.
When you fit a power law using large models that are undertrained, the fitted exponent is biased upward: it looks like parameters matter more than they do, because the large models have not yet realized their full potential.
Chinchilla corrected this by training each model size with its own optimal token budget. When all models are properly trained, the data scaling exponent increases and the parameter scaling exponent decreases, converging toward equal allocation.
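The "converging toward equal allocation" conclusion can be checked numerically by minimizing the parametric loss Hoffmann et al. fit, L(N, D) = E + A/N^α + B/D^β, under the constraint C = 6ND. The sketch below uses the paper's published point estimates (quoted from memory, so treat them as approximate) and recovers an allocation exponent near 0.5 rather than Kaplan's 0.73:

```python
import numpy as np

# Chinchilla parametric loss fit, L = E + A/N^alpha + B/D^beta
# (published point estimates, quoted from memory -- approximate).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

def optimal_n(compute):
    # Minimize loss over N with D = C / (6N), on a log-spaced grid.
    n_grid = np.logspace(7, 13, 4000)
    d_grid = compute / (6.0 * n_grid)
    return n_grid[np.argmin(loss(n_grid, d_grid))]

budgets = np.logspace(20, 24, 9)
n_opts = np.array([optimal_n(c) for c in budgets])

# Slope of log N_opt vs log C is the allocation exponent a in N_opt ~ C^a.
a_fit = np.polyfit(np.log(budgets), np.log(n_opts), 1)[0]
print(f"fitted allocation exponent a ~ {a_fit:.2f}")  # near 0.5, not 0.73
```

The analytic answer for this loss form is a = β/(α+β) ≈ 0.45, consistent with the roughly equal N/D scaling the paper reports.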
When Each Is Right
Kaplan's analysis is correct as a description of the loss you get from a fixed compute budget if you do not retrain at each scale. If you have one shot at training and want the best model for a fixed wall-clock budget, Kaplan's curves describe what you will observe.
Chinchilla is correct as a prescription for compute-optimal training. If you can choose how to allocate your compute budget between model size and data, the Chinchilla allocation produces lower loss per FLOP.
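The "lower loss per FLOP" claim can be illustrated with the same parametric loss from Hoffmann et al. (constants quoted from memory; approximate). At GPT-3's rough training budget, the Chinchilla split predicts lower loss than GPT-3's actual 175B-parameter, 300B-token allocation:

```python
# Chinchilla parametric loss fit, L = E + A/N^alpha + B/D^beta
# (published point estimates, quoted from memory -- approximate).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n, d):
    return E + A / n**alpha + B / d**beta

C = 6 * 175e9 * 300e9                 # GPT-3's approximate training FLOPs
gpt3_style = loss(175e9, 300e9)       # 175B params, 300B tokens
n_opt = (C / 120) ** 0.5              # from C = 6*N*D with D = 20*N
chinchilla_style = loss(n_opt, 20 * n_opt)

print(f"GPT-3 split: {gpt3_style:.3f}, Chinchilla split: {chinchilla_style:.3f}")
```

Under these constants the reallocation shaves a few hundredths off the predicted loss at identical compute, which is the sense in which the Chinchilla allocation is "compute-optimal".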
What Has Changed Since 2022
Post-Chinchilla practice has shifted further toward data scaling:
- Llama (2023): 7B-65B models trained on 1-1.4T tokens (close to Chinchilla-optimal)
- Llama 2 (2023): 7B-70B models all trained on 2T tokens (over-trained relative to Chinchilla)
- Llama 3 (2024): 8B model trained on 15T tokens (massively over-trained by Chinchilla standards)
The trend is toward "over-training" smaller models on much more data than Chinchilla recommends. This makes sense once inference cost is counted: an over-trained small model ends up with slightly higher loss than a compute-optimal allocation of the same training budget, but it is far cheaper to serve, so over a deployment lifetime it can be the more practical choice than a compute-optimal larger model.
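The inference argument is simple FLOP accounting. Using the standard approximations of ~6ND FLOPs for training and ~2N FLOPs per generated token at inference (the lifetime serving volume below is a made-up assumption for illustration), an over-trained 8B model can undercut a Chinchilla-optimal 70B model on total compute:

```python
# Back-of-envelope FLOP accounting: training ~6*N*D, inference ~2*N per
# token (standard approximations). Serving volume is a hypothetical input.
def lifetime_flops(n_params, train_tokens, served_tokens):
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens

served = 1e13  # assumption: 10T tokens served over the model's lifetime

big   = lifetime_flops(70e9, 1.4e12, served)  # Chinchilla-optimal 70B
small = lifetime_flops(8e9, 15e12, served)    # over-trained 8B, Llama-3-style

print(f"70B total: {big:.2e} FLOPs, 8B total: {small:.2e} FLOPs")
```

At this serving volume the 8B model spends more FLOPs on training than the 70B model does, yet less than half as many in total; where the crossover falls depends entirely on how many tokens you expect to serve.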
Common Confusions
Chinchilla did not prove Kaplan wrong
Both analyses are fitting empirical scaling laws. Neither has a theoretical proof that their exponents are correct. Chinchilla showed that a different experimental methodology produces different exponents. The Chinchilla exponents are more useful for planning training runs because they account for convergence behavior that Kaplan's fixed-budget approach missed.
The 20 tokens per parameter rule is not universal
The Chinchilla ratio of ~20 tokens per parameter is specific to the loss function, tokenizer, data distribution, and model family used in the Hoffmann et al. experiments. Different architectures, data mixes, or training objectives may have different optimal ratios. The ratio is a useful rule of thumb, not a physical constant.
References
- Kaplan et al., "Scaling Laws for Neural Language Models" (2020)
- Hoffmann et al., "Training Compute-Optimal Large Language Models" (2022). The Chinchilla paper.
- Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models" (2023)