
Comparison

Kaplan vs. Chinchilla Scaling

Kaplan et al. (2020) recommended scaling parameters up faster than data. Chinchilla (Hoffmann et al., 2022) showed the opposite: many large models were undertrained. The disagreement came from a methodological flaw in how Kaplan et al. fitted the scaling exponents.

What Each Claims

Kaplan et al. (2020) fitted power-law exponents for language model loss as a function of parameters $N$, data $D$, and compute $C$, obtaining $\alpha_N \approx 0.076$ and $\alpha_D \approx 0.095$. Combined with their training setup, the fit implied that a fixed compute budget is best spent growing the model rather than the data. This led to the recommendation: scale parameters faster than data, and train large models on "enough" data.

Hoffmann et al. (Chinchilla, 2022) re-fitted the scaling laws with a key methodological change: they trained smaller models to near-convergence, not just for a fixed number of steps. Their conclusion was the opposite: for a fixed compute budget $C$, parameters and training tokens should be scaled roughly equally. The compute-optimal ratio is approximately 20 tokens per parameter.
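The 20-tokens-per-parameter rule, combined with the standard $C \approx 6ND$ estimate of training FLOPs, pins down both $N$ and $D$ for a given budget. A minimal sketch (the 6ND approximation and the 20:1 ratio are the only inputs; the function name is ours):

```python
import math

def chinchilla_allocation(C, tokens_per_param=20.0):
    """Split a training budget C (FLOPs) into parameters N and tokens D,
    assuming C ~ 6*N*D and the Chinchilla ratio D ~ 20*N."""
    # C = 6 * N * (ratio * N)  =>  N = sqrt(C / (6 * ratio))
    N = math.sqrt(C / (6.0 * tokens_per_param))
    D = tokens_per_param * N
    return N, D

# Chinchilla's own budget (~5.76e23 FLOPs, matched to Gopher) recovers
# roughly 70B parameters and 1.4T tokens, i.e. the published model.
N, D = chinchilla_allocation(5.76e23)
print(f"N = {N:.2e} params, D = {D:.2e} tokens")
```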

Side-by-Side Statement

| Aspect | Kaplan (2020) | Chinchilla (2022) |
|---|---|---|
| Recommendation | Scale $N$ faster than $D$ | Scale $N$ and $D$ roughly equally |
| Optimal ratio | $N \propto C^{0.73}$, $D \propto C^{0.27}$ | $N \propto C^{0.50}$, $D \propto C^{0.50}$ |
| Tokens per parameter | Not explicit; in practice far fewer than 20 | ~20 tokens per parameter |
| Key assumption | Fixed training budget per model size | Train each model to near-convergence |
| Implication | GPT-3: 175B params, 300B tokens | Chinchilla: 70B params, 1.4T tokens |
| Which performed better? | GPT-3 (175B, 300B tokens) | Chinchilla (70B, 1.4T tokens) matched GPT-3 |
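The last two rows can be checked with back-of-the-envelope arithmetic using the common $C \approx 6ND$ FLOP estimate (a sketch; the figures are the public parameter and token counts):

```python
def tokens_per_param(params, tokens):
    return tokens / params

def train_flops(params, tokens):
    # Standard rule of thumb: ~6 FLOPs per parameter per training token.
    return 6 * params * tokens

for name, n, d in [("GPT-3", 175e9, 300e9), ("Chinchilla", 70e9, 1.4e12)]:
    print(f"{name}: {tokens_per_param(n, d):.1f} tok/param, "
          f"{train_flops(n, d):.2e} FLOPs")
```

Note that by this estimate Chinchilla actually used somewhat more training compute than GPT-3 (its budget was matched to Gopher's); the comparison is about loss per FLOP, not a strictly equal budget.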

Why They Disagreed

The disagreement came from a specific methodological issue in the Kaplan analysis. Kaplan et al. trained all model sizes for approximately the same number of tokens (or the same compute-per-parameter ratio). This meant smaller models were trained closer to convergence, while larger models were still far from convergence.

When you fit a power law $L(N) = (N_c/N)^{\alpha_N}$ using large models that are undertrained, the fit is distorted: the observed losses of the partially trained large models overstate what those models would achieve if trained properly, so the benefit of training longer on more data is underestimated, and the compute-optimal allocation over-weights parameters relative to data.

Chinchilla corrected this by training each model size with its own optimal token budget. When all models are properly trained, the data scaling exponent increases and the parameter scaling exponent decreases, converging toward equal allocation.
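One way to see why roughly equal exponents imply equal allocation is through the parametric loss form Hoffmann et al. fit, $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$, minimized subject to $C = 6ND$. The sketch below uses illustrative constants (not the paper's fitted values) with $\alpha = \beta$, and checks numerically that the compute-optimal $N$ then grows as $C^{0.5}$:

```python
# Chinchilla-style parametric loss with ILLUSTRATIVE constants
# (assumptions for this demo, not Hoffmann et al.'s fitted values).
E, A, B = 1.7, 400.0, 400.0
alpha = beta = 0.30

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def optimal_N(C, steps=20000):
    """Grid-search the N that minimizes loss subject to C = 6*N*D."""
    best_N, best_L = None, float("inf")
    for i in range(1, steps):
        N = 10 ** (6 + 6 * i / steps)   # sweep N from 1e6 to 1e12
        D = C / (6 * N)                 # remaining budget goes to data
        L = loss(N, D)
        if L < best_L:
            best_N, best_L = N, L
    return best_N

# With alpha == beta, the compute-optimal exponent beta/(alpha+beta) = 0.5,
# so multiplying compute by 100 should multiply the optimal N by ~10.
n1, n2 = optimal_N(1e21), optimal_N(1e23)
print(n2 / n1)  # ~10
```

Changing $A/B$ shifts the optimal tokens-per-parameter ratio, but as long as $\alpha \approx \beta$ the *exponent* stays near 0.5, which is the Chinchilla "scale both equally" claim.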

When Each Is Right

Kaplan's analysis is correct as a description of the loss you observe when every model size is trained under the same fixed-length schedule, without re-tuning the training horizon at each scale. If you have one shot at training and cannot adjust the token budget per model size, Kaplan's curves describe what you will see.

Chinchilla is correct as a prescription for compute-optimal training. If you can choose how to allocate your compute budget between model size and data, the Chinchilla allocation produces lower loss per FLOP.
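To see how far the two prescriptions diverge at scale, one can extrapolate both allocation rules from a shared small-scale anchor. The anchor point ($C_0$, $N_0$) below is an arbitrary illustration, not a fitted constant from either paper:

```python
# Hypothetical shared anchor: both rules assumed to agree at this point.
C0, N0 = 1e19, 1e8

def kaplan_N(C):
    return N0 * (C / C0) ** 0.73      # N grows as C^0.73

def chinchilla_N(C):
    return N0 * (C / C0) ** 0.50      # N grows as C^0.50

C = 1e24  # five orders of magnitude above the anchor
ratio = kaplan_N(C) / chinchilla_N(C)
print(f"Kaplan-sized model is {ratio:.0f}x larger")  # ratio grows as C^0.23
```

Five decades of compute turn a modest difference in exponents into an order-of-magnitude difference in model size, which is why the two papers' recommended models look so different at GPT-3 scale.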

What Has Changed Since 2022

Post-Chinchilla practice has shifted further toward data scaling. The trend is toward "over-training" smaller models on far more data than Chinchilla recommends: Llama-family models, for example, have been trained on hundreds to thousands of tokens per parameter. This makes sense once inference cost is considered: a smaller model that falls slightly short of the compute-optimal frontier but is much cheaper to serve can be more practical than a compute-optimal larger model.
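As a concrete sketch of the trade-off (the 8B-parameter / 15T-token figures approximate Llama-3-scale training; the comparison logic and names are ours): at equal training compute, the Chinchilla-optimal alternative is a much larger model trained on far fewer tokens.

```python
import math

def flops(params, tokens):
    return 6 * params * tokens  # standard ~6ND estimate

# An "over-trained" small model, roughly Llama-3-8B-scale.
n_small, d_small = 8e9, 15e12
C = flops(n_small, d_small)

# Chinchilla-optimal allocation of the SAME budget (D = 20*N, C = 6*N*D).
n_opt = math.sqrt(C / 120)
d_opt = 20 * n_opt

print(f"over-trained : {n_small:.1e} params, {d_small / n_small:.0f} tok/param")
print(f"compute-opt  : {n_opt:.1e} params, {d_opt:.1e} tokens")
# The compute-optimal model comes out ~10x larger, hence roughly 10x the
# cost per generated token at inference, for identical training cost.
```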

Common Confusions

Watch Out

Chinchilla did not prove Kaplan wrong

Both analyses are fitting empirical scaling laws. Neither has a theoretical proof that their exponents are correct. Chinchilla showed that a different experimental methodology produces different exponents. The Chinchilla exponents are more useful for planning training runs because they account for convergence behavior that Kaplan's fixed-budget approach missed.

Watch Out

The 20 tokens per parameter rule is not universal

The Chinchilla ratio of ~20 tokens per parameter is specific to the loss function, tokenizer, data distribution, and model family used in the Hoffmann et al. experiments. Different architectures, data mixes, or training objectives may have different optimal ratios. The ratio is a useful rule of thumb, not a physical constant.

References