

Scaling Compute-Optimal Training

Chinchilla scaling: how to optimally allocate a fixed compute budget between model size and training data, why many models were undertrained, and the post-Chinchilla reality of data quality and inference cost.

Advanced · Tier 2 · Frontier · ~55 min


Why This Matters

Training a large language model costs millions of dollars. The question of how to spend that compute budget is not academic. Chinchilla (Hoffmann et al., 2022) showed that many existing models, including Gopher (280B parameters), were significantly undertrained: they used too many parameters relative to the amount of training data. The practical consequence was that a smaller, properly trained model (Chinchilla, 70B) outperformed a larger undertrained one (Gopher, 280B) while using the same total compute.

This reshaped how labs allocate training budgets and shifted attention toward data quality, data efficiency, and the previously underappreciated cost of inference.

Background: Kaplan Scaling

Kaplan et al. (2020) established power-law relationships between loss L and three quantities: parameters N, data D, and compute C.

Definition

Kaplan Scaling Laws

For a fixed architecture (decoder-only transformer), cross-entropy loss on held-out data follows:

L(N) \approx \alpha_N \cdot N^{-0.076}
L(D) \approx \alpha_D \cdot D^{-0.095}
L(C) \approx \alpha_C \cdot C^{-0.050}

where the exponents are fit empirically. Kaplan concluded that loss is more sensitive to N than to D: you should scale parameters faster than data.

The Kaplan recommendation: for a 10x increase in compute, increase N by 5.5x and D by 1.8x. This led to the trend of building very large models trained on relatively little data (e.g., GPT-3 at 175B parameters trained on 300B tokens).
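The split can be sketched numerically. A minimal example, using the fitted allocation exponents N ∝ C^0.73 and D ∝ C^0.27 discussed below (the function name is illustrative):

```python
# Sketch: how a Kaplan-style allocation splits extra compute between
# parameters (N) and data (D), using the fitted exponents
# N ∝ C^0.73 and D ∝ C^0.27.

def kaplan_allocation(compute_factor, exp_n=0.73, exp_d=0.27):
    """Return (parameter multiplier, data multiplier) for a compute multiplier."""
    return compute_factor ** exp_n, compute_factor ** exp_d

n_mult, d_mult = kaplan_allocation(10.0)
# 10**0.73 ≈ 5.4 and 10**0.27 ≈ 1.9 -- close to the quoted 5.5x / 1.8x
# (small differences come from rounding the exponents).
print(f"10x compute -> {n_mult:.1f}x parameters, {d_mult:.1f}x data")
```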

Chinchilla Optimal Allocation

Proposition

Chinchilla Compute-Optimal Allocation

Statement

For a fixed compute budget C, the loss-minimizing allocation scales parameters and data roughly equally:

N_{\text{opt}} \propto C^{0.50}, \quad D_{\text{opt}} \propto C^{0.50}

Equivalently, the optimal token-to-parameter ratio is approximately D/N \approx 20. A model with N parameters should be trained on roughly 20N tokens.
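One way to make the statement concrete is to back out (N, D) from a compute budget, assuming the approximations C ≈ 6ND and D ≈ 20N. Plugging in roughly Chinchilla's own budget (6 · 70e9 · 1.4e12 ≈ 5.9 × 10²³ FLOPs) recovers its 70B-parameter, 1.4T-token configuration:

```python
import math

# Chinchilla-optimal (N, D) from a budget C, using C ≈ 6*N*D and D ≈ 20*N:
#   C = 6 * N * (20 * N) = 120 * N^2   =>   N = sqrt(C / 120)

def chinchilla_optimal(C, flops_per_param_token=6.0, ratio=20.0):
    """Return (N_opt, D_opt) in (parameters, tokens) for C training FLOPs."""
    N = math.sqrt(C / (flops_per_param_token * ratio))
    return N, ratio * N

N, D = chinchilla_optimal(5.9e23)   # roughly Chinchilla's own budget
print(f"N ≈ {N:.2e} parameters, D ≈ {D:.2e} tokens")  # ≈ 7e10 and 1.4e12
```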

Intuition

If you have N parameters and very little data, you overfit. If you have vast data but a tiny model, you underfit. The optimal point balances these two sources of error. Chinchilla found this balance is roughly equal scaling, not the parameter-heavy allocation Kaplan suggested.

Proof Sketch

Model the loss as L(N, D) = E + A/N^\alpha + B/D^\beta, where E is the irreducible loss. Using C \approx 6ND, substitute D = C/(6N) and minimize over N. Setting dL/dN = 0 gives N_{\text{opt}} \propto C^{\beta/(\alpha + \beta)}. Chinchilla found \alpha \approx \beta \approx 0.34, giving an exponent of \approx 0.50 for both N and D.
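The sketch can be checked numerically: minimize L(N, C/(6N)) over N by grid search at two budgets, then read off the slope of log N_opt against log C. The constants E, A, and B below are illustrative placeholders (loosely styled on Chinchilla's parametric fit); with α = β = 0.34 the fitted exponent comes out near 0.50:

```python
import math

# Numeric check of the proof sketch: minimize L(N) = E + A/N^a + B/D^b
# with D = C/(6N), then estimate the exponent of N_opt vs C.
# E, A, B are illustrative constants; a = b = 0.34 as in the text.

def n_opt(C, E=1.69, A=406.4, B=410.7, a=0.34, b=0.34):
    """Grid-search the N that minimizes loss at fixed compute C."""
    best_N, best_L = None, float("inf")
    for i in range(2000):
        N = 10 ** (6 + 8 * i / 1999)        # sweep N over 1e6 .. 1e14
        D = C / (6 * N)
        L = E + A / N**a + B / D**b
        if L < best_L:
            best_N, best_L = N, L
    return best_N

# Slope of log N_opt between two budgets approximates the exponent.
C1, C2 = 1e20, 1e24
slope = (math.log(n_opt(C2)) - math.log(n_opt(C1))) / math.log(C2 / C1)
print(f"fitted exponent ≈ {slope:.2f}")   # ≈ 0.50 when a == b
```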

Why It Matters

This directly changed how labs train models. GPT-3 (175B, 300B tokens) had a tokens-per-parameter ratio of about 1.7, far below the Chinchilla-optimal 20. Llama 1 (65B, 1.4T tokens) and Llama 2 (70B, 2T tokens) adopted Chinchilla-like ratios. The result: smaller models that perform as well as or better than larger undertrained models, at lower inference cost.

Failure Mode

The Chinchilla result assumes that all tokens are equally valuable and that data is abundant. In practice, high-quality text data is finite. When you run out of quality data, the model cannot absorb more tokens effectively. This shifts the problem from "how many tokens" to "which tokens."

Kaplan vs Chinchilla

Proposition

Kaplan vs Chinchilla Exponent Discrepancy

Statement

Kaplan found optimal allocation exponents of N \propto C^{0.73} and D \propto C^{0.27}, favoring parameters over data. Chinchilla found N \propto C^{0.50} and D \propto C^{0.50}, favoring equal allocation. The discrepancy comes from differences in experimental methodology.

Intuition

Kaplan's experiments did not train each model to convergence. Models with more parameters appeared to improve loss more per FLOP because they had not yet reached their optimal loss for that amount of data. Chinchilla trained each model fully, revealing that data had been undervalued.

Proof Sketch

Not a formal proof. The key methodological differences: (1) Kaplan used a fixed number of training steps for models of different sizes; Chinchilla varied both model size and training duration. (2) Kaplan used a fixed learning rate schedule; Chinchilla tuned the schedule per run. (3) Chinchilla used three independent estimation approaches (fixed N, varying D; IsoFLOP profiles; and parametric loss fitting) that agreed on the equal-scaling result.

Why It Matters

This is a cautionary tale about empirical scaling research. Both groups fit power laws to real data and reached different conclusions because of experimental design choices. The lesson: scaling law exponents are not universal constants. They depend on the training protocol, architecture, and data distribution.

Failure Mode

Neither Kaplan nor Chinchilla accounts for data quality. A model trained on 20N tokens of noisy web scrapes may perform worse than one trained on 5N tokens of curated text. The Chinchilla ratio of 20 tokens per parameter is a rough guideline, not a physical law.

Post-Chinchilla Reality

Data quality dominates data quantity. Llama 3 (2024) trained on 15T tokens for a 70B model (ratio: ~214 tokens per parameter), far exceeding the Chinchilla-optimal 20. This works because the data was heavily filtered and deduplicated. The Chinchilla analysis assumed constant data quality; real training benefits from spending more compute on better data even past the "optimal" ratio.

Inference cost matters. Chinchilla minimizes training loss per training FLOP. But a model is trained once and served many times. Per-token serving cost scales roughly linearly with parameter count, so a 70B model costs about 3.5x more per token to serve than a 20B model. If your total lifetime cost is dominated by inference (which it usually is at scale), you may prefer a smaller model trained longer, even if it uses more training FLOPs than Chinchilla-optimal. This is sometimes called "inference-aware scaling."
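The trade-off can be illustrated with the parametric loss model from the proof sketch above: compare two model sizes that reach the same target loss, counting training FLOPs (≈ 6ND) plus inference FLOPs (≈ 2N per generated token). The constants, the target loss, and the lifetime volume of 10¹⁴ served tokens are all assumptions for illustration:

```python
# Loss model L(N, D) = E + A/N^a + B/D^b, with illustrative constants.
E, A, B, a, b = 1.69, 406.4, 410.7, 0.34, 0.34

def tokens_for_loss(N, L_target):
    """Tokens D needed for a model of size N to reach L_target."""
    gap = L_target - E - A / N**a
    assert gap > 0, "target loss unreachable at this model size"
    return (B / gap) ** (1 / b)

def lifetime_flops(N, L_target, T_inf):
    """Training (~6*N*D) plus lifetime inference (~2*N per token) FLOPs."""
    return 6 * N * tokens_for_loss(N, L_target) + 2 * N * T_inf

L_target = 1.82   # assumed reachable target loss
T_inf = 1e14      # assumed lifetime inference volume, in tokens

costs = {N: lifetime_flops(N, L_target, T_inf) for N in (70e9, 30e9)}
for N, cost in costs.items():
    print(f"N = {N:.0e}: lifetime ≈ {cost:.2e} FLOPs")
# The 30B model trains on far more tokens (higher training cost) but is
# cheaper over the full lifetime at this inference volume.
```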

Overtraining is common and deliberate. Llama models are deliberately "overtrained" relative to Chinchilla: more tokens per parameter than the compute-optimal ratio. This increases training cost but yields a smaller model that is cheaper to serve. The trade-off is rational when inference volume is high.

Repeating data degrades performance. Muennighoff et al. (2023) showed that repeating training data (training for more epochs) gives diminishing returns after about 4 epochs, and performance can degrade. This creates a hard constraint: if you run out of unique high-quality data, more compute does not help much.
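The saturation effect can be illustrated with a toy model in which each additional epoch over the same data is worth a fixed fraction of a fresh one. The geometric decay here is an illustrative assumption, not the fitted model from Muennighoff et al.:

```python
# Toy illustration (NOT the fitted model from Muennighoff et al.): treat
# the k-th epoch over the same data as worth decay**(k-1) of a fresh
# epoch, so "effective tokens" saturate as epochs grow.

def effective_tokens(unique_tokens, epochs, decay=0.6):
    """Geometric-decay toy model of the value of repeated data."""
    value = sum(decay ** (k - 1) for k in range(1, epochs + 1))
    return unique_tokens * value   # saturates at unique/(1 - decay)

for epochs in (1, 2, 4, 8, 16):
    print(epochs, f"{effective_tokens(1e12, epochs):.2e}")
# Effective tokens barely grow past ~4 epochs, mirroring the observed
# diminishing returns from repeating data.
```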

Common Confusions

Watch Out

Chinchilla optimal does not mean actually optimal

Chinchilla-optimal minimizes loss for a given training compute budget. It does not account for inference cost, data quality variation, or the finite supply of training data. A model that is "Chinchilla-optimal" may be suboptimal for actual deployment.

Watch Out

Scaling laws predict loss, not capability

A 0.01-nat improvement in cross-entropy loss can push a model across a qualitative capability threshold (like arithmetic) or have no noticeable effect on downstream tasks. Scaling laws describe smooth loss curves, but capabilities can emerge discontinuously (though the sharpness of emergence is debated).

Exercises

ExerciseCore

Problem

You have a compute budget of C = 6 \times 10^{21} FLOPs and use the approximation C \approx 6ND. Using the Chinchilla-optimal ratio D/N \approx 20, compute the optimal model size N and number of training tokens D.

ExerciseAdvanced

Problem

Suppose inference costs c_I per token and you will serve T_{\text{inf}} total inference tokens over the model's lifetime. Training costs c_T \cdot C = c_T \cdot 6ND total. The total cost is c_T \cdot 6ND + c_I \cdot N \cdot T_{\text{inf}} (inference cost scales linearly with N). For a fixed target loss L^*, how does the optimal N change compared to the pure Chinchilla-optimal allocation?


References

Canonical:

  • Hoffmann et al., "Training Compute-Optimal Large Language Models" (2022)
  • Kaplan et al., "Scaling Laws for Neural Language Models" (2020)

Current:

  • Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models" (2023)
  • Muennighoff et al., "Scaling Data-Constrained Language Models" (NeurIPS 2023)

Last reviewed: April 2026
