
Comparison

Kaplan vs. Chinchilla Scaling

Kaplan et al. (2020) recommended scaling parameters up faster than data. Chinchilla (Hoffmann et al., 2022) showed the opposite: many large models were undertrained. The disagreement came from a methodological flaw in how Kaplan et al. fitted the scaling exponents.

What Each Claims

Kaplan et al. (2020) fitted power-law exponents for language model loss as a function of parameters $N$, data $D$, and compute $C$, obtaining $\alpha_N \approx 0.076$ and $\alpha_D \approx 0.095$. Combined with their training setup, the fit implied that a fixed compute budget is best spent growing the model rather than the data. This led to the recommendation: scale parameters faster than data, and train large models on "enough" data.

Hoffmann et al. (Chinchilla, 2022) re-fitted the scaling laws with a key methodological change: they trained smaller models to near-convergence, not just for a fixed number of steps. Their conclusion was the opposite: for a fixed compute budget $C$, parameters and training tokens should be scaled roughly equally. The compute-optimal ratio is approximately 20 tokens per parameter.
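The 20-tokens-per-parameter rule, combined with the standard $C \approx 6ND$ estimate of training FLOPs, pins down both $N$ and $D$ for a given budget. A minimal sketch (the 6ND approximation and the 20:1 ratio are the only inputs; the function name is ours):

```python
import math

def chinchilla_allocation(C, tokens_per_param=20.0):
    """Split a training budget C (FLOPs) into parameters N and tokens D,
    assuming C ~ 6*N*D and the Chinchilla ratio D ~ 20*N."""
    # C = 6 * N * (ratio * N)  =>  N = sqrt(C / (6 * ratio))
    N = math.sqrt(C / (6.0 * tokens_per_param))
    D = tokens_per_param * N
    return N, D

# Chinchilla's own budget (~5.76e23 FLOPs, matched to Gopher) recovers
# roughly 70B parameters and 1.4T tokens, i.e. the published model.
N, D = chinchilla_allocation(5.76e23)
print(f"N = {N:.2e} params, D = {D:.2e} tokens")
```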

Side-by-Side Statement

| Aspect | Kaplan (2020) | Chinchilla (2022) |
|---|---|---|
| Recommendation | Scale $N$ faster than $D$ | Scale $N$ and $D$ roughly equally |
| Optimal ratio | $N \propto C^{0.73}$, $D \propto C^{0.27}$ | $N \propto C^{0.50}$, $D \propto C^{0.50}$ |
| Tokens per parameter | Not explicit; in practice far fewer than 20 | ~20 tokens per parameter |
| Key assumption | Fixed training budget per model size | Train each model to near-convergence |
| Implication | GPT-3: 175B params, 300B tokens | Chinchilla: 70B params, 1.4T tokens |
| Which performed better? | GPT-3 (175B, 300B tokens) | Chinchilla (70B, 1.4T tokens) matched GPT-3 |
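The last two rows can be checked with back-of-the-envelope arithmetic using the common $C \approx 6ND$ FLOP estimate (a sketch; the figures are the public parameter and token counts):

```python
def tokens_per_param(params, tokens):
    return tokens / params

def train_flops(params, tokens):
    # Standard rule of thumb: ~6 FLOPs per parameter per training token.
    return 6 * params * tokens

for name, n, d in [("GPT-3", 175e9, 300e9), ("Chinchilla", 70e9, 1.4e12)]:
    print(f"{name}: {tokens_per_param(n, d):.1f} tok/param, "
          f"{train_flops(n, d):.2e} FLOPs")
```

Note that by this estimate Chinchilla actually used somewhat more training compute than GPT-3 (its budget was matched to Gopher's); the comparison is about loss per FLOP, not a strictly equal budget.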

Why They Disagreed

The disagreement came from a specific methodological issue in the Kaplan analysis. Kaplan et al. trained all model sizes for approximately the same number of tokens (or the same compute-per-parameter ratio). This meant smaller models were trained closer to convergence, while larger models were still far from convergence.

When you fit a power law $L(N) = (N_c/N)^{\alpha_N}$ using large models that are undertrained, the fit is distorted: the observed losses of the partially trained large models overstate what those models would achieve if trained properly, so the benefit of training longer on more data is underestimated, and the compute-optimal allocation over-weights parameters relative to data.

Chinchilla corrected this by training each model size with its own optimal token budget. When all models are properly trained, the data scaling exponent increases and the parameter scaling exponent decreases, converging toward equal allocation.
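One way to see why roughly equal exponents imply equal allocation is through the parametric loss form Hoffmann et al. fit, $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$, minimized subject to $C = 6ND$. The sketch below uses illustrative constants (not the paper's fitted values) with $\alpha = \beta$, and checks numerically that the compute-optimal $N$ then grows as $C^{0.5}$:

```python
# Chinchilla-style parametric loss with ILLUSTRATIVE constants
# (assumptions for this demo, not Hoffmann et al.'s fitted values).
E, A, B = 1.7, 400.0, 400.0
alpha = beta = 0.30

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def optimal_N(C, steps=20000):
    """Grid-search the N that minimizes loss subject to C = 6*N*D."""
    best_N, best_L = None, float("inf")
    for i in range(1, steps):
        N = 10 ** (6 + 6 * i / steps)   # sweep N from 1e6 to 1e12
        D = C / (6 * N)                 # remaining budget goes to data
        L = loss(N, D)
        if L < best_L:
            best_N, best_L = N, L
    return best_N

# With alpha == beta, the compute-optimal exponent beta/(alpha+beta) = 0.5,
# so multiplying compute by 100 should multiply the optimal N by ~10.
n1, n2 = optimal_N(1e21), optimal_N(1e23)
print(n2 / n1)  # ~10
```

Changing $A/B$ shifts the optimal tokens-per-parameter ratio, but as long as $\alpha \approx \beta$ the *exponent* stays near 0.5, which is the Chinchilla "scale both equally" claim.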

When Each Is Right

Kaplan's analysis is correct as a description of the loss you observe when every model size is trained under the same fixed-length schedule, without re-tuning the training horizon at each scale. If you have one shot at training and cannot adjust the token budget per model size, Kaplan's curves describe what you will see.

Chinchilla is correct as a prescription for compute-optimal training. If you can choose how to allocate your compute budget between model size and data, the Chinchilla allocation produces lower loss per FLOP.
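To see how far the two prescriptions diverge at scale, one can extrapolate both allocation rules from a shared small-scale anchor. The anchor point ($C_0$, $N_0$) below is an arbitrary illustration, not a fitted constant from either paper:

```python
# Hypothetical shared anchor: both rules assumed to agree at this point.
C0, N0 = 1e19, 1e8

def kaplan_N(C):
    return N0 * (C / C0) ** 0.73      # N grows as C^0.73

def chinchilla_N(C):
    return N0 * (C / C0) ** 0.50      # N grows as C^0.50

C = 1e24  # five orders of magnitude above the anchor
ratio = kaplan_N(C) / chinchilla_N(C)
print(f"Kaplan-sized model is {ratio:.0f}x larger")  # ratio grows as C^0.23
```

Five decades of compute turn a modest difference in exponents into an order-of-magnitude difference in model size, which is why the two papers' recommended models look so different at GPT-3 scale.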

What Has Changed Since 2022

Post-Chinchilla practice has shifted further toward data scaling. The trend is toward "over-training" smaller models on far more data than Chinchilla recommends: Llama-family models, for example, have been trained on hundreds to thousands of tokens per parameter. This makes sense once inference cost is considered: a smaller model that falls slightly short of the compute-optimal frontier but is much cheaper to serve can be more practical than a compute-optimal larger model.
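As a concrete sketch of the trade-off (the 8B-parameter / 15T-token figures approximate Llama-3-scale training; the comparison logic and names are ours): at equal training compute, the Chinchilla-optimal alternative is a much larger model trained on far fewer tokens.

```python
import math

def flops(params, tokens):
    return 6 * params * tokens  # standard ~6ND estimate

# An "over-trained" small model, roughly Llama-3-8B-scale.
n_small, d_small = 8e9, 15e12
C = flops(n_small, d_small)

# Chinchilla-optimal allocation of the SAME budget (D = 20*N, C = 6*N*D).
n_opt = math.sqrt(C / 120)
d_opt = 20 * n_opt

print(f"over-trained : {n_small:.1e} params, {d_small / n_small:.0f} tok/param")
print(f"compute-opt  : {n_opt:.1e} params, {d_opt:.1e} tokens")
# The compute-optimal model comes out ~10x larger, hence roughly 10x the
# cost per generated token at inference, for identical training cost.
```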

Common Confusions

Watch Out

Chinchilla did not prove Kaplan wrong

Both analyses are fitting empirical scaling laws. Neither has a theoretical proof that their exponents are correct. Chinchilla showed that a different experimental methodology produces different exponents. The Chinchilla exponents are more useful for planning training runs because they account for convergence behavior that Kaplan's fixed-budget approach missed.

Watch Out

The 20 tokens per parameter rule is not universal

The Chinchilla ratio of ~20 tokens per parameter is specific to the loss function, tokenizer, data distribution, and model family used in the Hoffmann et al. experiments. Different architectures, data mixes, or training objectives may have different optimal ratios. The ratio is a useful rule of thumb, not a physical constant.

References