Beta. Content is under active construction and has not been peer-reviewed.

LLM Construction

Scaling Laws

Power-law relationships between loss and compute, parameters, and data: Kaplan scaling, Chinchilla-optimal training, emergent abilities, and whether scaling laws are fundamental or empirical.

Advanced · Tier 2 · Current · ~65 min

Why This Matters

Figure: empirical training runs against a power-law fit $L(C) \sim C^{-0.076}$. Axes: loss (cross-entropy, roughly 2.0 to 4.0) versus training compute ($10^{18}$ to $10^{24}$ FLOPs, log scale); undertrained runs sit above the fitted frontier.

Scaling laws are empirical power-law fits that relate a model's loss to the number of parameters $N$, the training tokens $D$, and the total compute $C$. They are regressions, not physical laws: they hold within the compute regime where they were fit, and they can break under architectural changes, data repetition, or distribution shift.

Within their regime they are useful. They have guided training decisions worth billions of dollars: how large to make a model, how much data to collect, and how to allocate a fixed compute budget between model size and training duration. They connect directly to compute-optimal training and to the transformer architecture that all modern LLMs share.

Mental Model

Imagine you have a fixed compute budget (say, $10^{24}$ FLOPs). You must decide: train a large model for fewer steps, or a smaller model for more steps? Scaling laws answer this question precisely. They tell you that loss decreases as a power law in each resource, and that there is an optimal way to split your budget between model size and training duration.

The key surprise: loss follows smooth, predictable power laws over many orders of magnitude. A model with 10x more parameters trained on the same data will have predictably lower loss. This predictability is what makes scaling laws practically useful: one can extrapolate from small experiments to forecast the performance of much larger runs.
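As a sketch of how such extrapolation works in practice, one can fit a straight line in log-log space to a handful of small runs and read off the exponent. The compute values and the generating constant below are invented purely for illustration:

```python
import numpy as np

# Hypothetical small-scale runs. The losses are generated from a known
# power law L(C) = (C_c / C)^0.05 so the fitting recipe can be shown;
# real runs would supply measured losses instead.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = (3e30 / compute) ** 0.05

# A power law is a straight line in log-log space:
# log10 L = -alpha * log10 C + const.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)
alpha = -slope

# Extrapolate three orders of magnitude beyond the largest fitted run.
predicted = 10 ** (slope * np.log10(1e24) + intercept)
print(f"fitted alpha_C = {alpha:.3f}")
print(f"predicted loss at 1e24 FLOPs = {predicted:.2f}")
```

On clean power-law data the recovered exponent is exact; on real runs, the quality of the extrapolation depends on how far the power-law regime extends.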

The Kaplan Scaling Laws (2020)

Definition

Kaplan Power-Law Scaling

Kaplan et al. (2020) empirically observed that cross-entropy loss $L$ on language modeling scales as a power law in parameters $N$, data $D$, and compute $C$, when each is varied independently with the others held sufficiently large:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$

where $N_c, D_c, C_c$ are scaling constants and the exponents are approximately:

  • $\alpha_N \approx 0.076$ (loss vs parameters)
  • $\alpha_D \approx 0.095$ (loss vs data)
  • $\alpha_C \approx 0.050$ (loss vs compute)

These power laws hold over at least 7 orders of magnitude in compute.

Proposition

Power-Law Scaling of Language Model Loss

Statement

When parameters $N$ and data $D$ are both potentially limiting, the loss follows an approximate decomposition:

$$L(N, D) \approx \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \beta} + \left(\frac{D_c}{D}\right)^{\alpha_D / \beta}\right]^{\beta}$$

for fitted constants. In the regime where one factor dominates, this reduces to the individual power laws above.

The key implication Kaplan et al. drew from their fits: along their compute-efficient frontier, model size should grow much faster than data (roughly $N^* \propto C^{0.73}$ in their analysis). This led to the recommendation to scale $N$ faster than $D$: train large models on relatively less data.

Intuition

A power law $L \propto N^{-\alpha}$ means that each 10x increase in $N$ gives a fixed percentage reduction in loss. The exponent $\alpha$ determines how fast loss improves. Since $\alpha_N < \alpha_D$, a 10x increase in parameters buys a smaller loss reduction than a 10x increase in data; equivalently, matching a given loss improvement requires a larger multiplicative increase in $N$ than in $D$. Kaplan nonetheless concluded that model size should be prioritized, because in their analysis large, briefly-trained models sat on the compute-efficient frontier.

Why It Matters

The Kaplan scaling laws directly influenced the training of GPT-3 (175B parameters trained on 300B tokens). The recommendation to favor large models with moderate data was the dominant paradigm from 2020 to 2022. It was overturned by the Chinchilla analysis.

Failure Mode

The Kaplan analysis trained all models for a relatively small number of tokens and extrapolated. Crucially, it did not train smaller models to full convergence, biasing the analysis toward large models. The Chinchilla paper corrected this methodological issue and reached the opposite conclusion about optimal allocation.

Chinchilla Scaling (2022)

Theorem

Chinchilla-Optimal Compute Allocation

Statement

Hoffmann et al. (2022) showed that for a fixed compute budget $C$, the optimal number of parameters $N^*$ and training tokens $D^*$ both scale proportionally to the square root of compute:

$$N^* \propto C^a, \qquad D^* \propto C^b$$

where $a \approx b \approx 0.5$. Since $C \approx 6ND$ (the compute for one training pass through $D$ tokens with an $N$-parameter model), this implies:

$$N^* \propto D^*$$

The optimal number of parameters and training tokens should scale equally with compute. For every doubling of parameters, you should also double the training data.

For Hoffmann et al.'s fitted exponents ($\alpha \approx 0.34$, $\beta \approx 0.28$), the Chinchilla-optimal ratio is approximately $D^*/N^* \approx 20$ tokens per parameter. The "20" is an empirical artifact of those specific exponents on that data, not a universal constant. Refits on different model families and corpora yield different ratios, and most modern open models intentionally over-train relative to this ratio because inference cost is dominated by $N$ alone.

Intuition

Kaplan said: make the model as large as possible. Chinchilla said: balance model size and data. The difference comes from how you define "optimal." If you fix compute and ask "what achieves the lowest loss?", Chinchilla shows that an overparameterized, undertrained model wastes compute. Training a smaller model on more data reaches the same loss with the same compute budget.

Proof Sketch

The analysis fits a parametric loss function:

$$L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E$$

where $E$ is the irreducible entropy of natural language. Subject to the constraint $C = 6ND$ (total compute), minimize $L$ over $(N, D)$.

Using Lagrange multipliers: at the optimum, $\frac{\partial L}{\partial N} / \frac{\partial L}{\partial D} = D/N$, which gives $\alpha A / N^{\alpha} = \beta B / D^{\beta}$. Hoffmann et al. fit $\alpha \approx 0.34$ and $\beta \approx 0.28$ (close, but not equal). Because the two exponents are nearly equal, the optimum scales as $N^* \propto C^{a}$ and $D^* \propto C^{b}$ with $a, b$ both close to $0.5$. The exact exponents depend on $(\alpha, \beta)$; the widely-quoted $N^* \propto C^{1/2}$ result is what you get in the $\alpha = \beta$ limit.

Why It Matters

Chinchilla (70B parameters, 1.4T tokens) matched the performance of Gopher (280B parameters, 300B tokens) with 4x fewer parameters and the same compute budget. This result reshaped the industry: Llama, DeepSeek, Mistral, and most subsequent models are "Chinchilla-optimal" or even "over-trained" (using more data than Chinchilla-optimal for a given size, because inference cost depends on NN while training cost depends on CC).

Failure Mode

Chinchilla-optimal minimizes loss per FLOP of training. But inference cost depends on $N$, not $D$. If you plan to serve a model to millions of users, it is cheaper to over-train a small model (use more data than Chinchilla recommends) than to serve a Chinchilla-optimal larger model. Llama 3 8B was trained on 15T tokens (roughly 1875 tokens per parameter, nearly 100x the Chinchilla ratio) because the inference savings from a smaller model outweigh the extra training compute.

The Compute Constraint: $C \approx 6ND$

Definition

Training Compute Estimate

For a decoder-only transformer with $N$ parameters, one forward pass on one token requires approximately $2N$ FLOPs (one multiply-add per parameter). The backward pass requires approximately $4N$ FLOPs (roughly 2x the forward pass). Training on $D$ tokens requires:

$$C \approx 6ND \text{ FLOPs}$$

This is the "6ND rule." It ignores attention cost ($O(n^2 d)$ per layer) but is accurate within a factor of 2 for typical model sizes and context lengths.

This formula is the bridge between the abstract scaling laws and concrete training decisions. Given a GPU cluster with a known FLOP budget, you can directly compute the Chinchilla-optimal $N$ and $D$.
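As a minimal sketch, assuming the 20-tokens-per-parameter ratio and the $6ND$ rule from above, the budget-to-allocation calculation reduces to solving a quadratic constraint:

```python
import math

def chinchilla_optimal(C_flops, tokens_per_param=20.0):
    """Given a training budget in FLOPs, return (N, D) under C = 6*N*D
    with a fixed tokens-per-parameter ratio D = r*N."""
    # C = 6*N*D = 6*r*N^2  =>  N = sqrt(C / (6*r))
    N = math.sqrt(C_flops / (6.0 * tokens_per_param))
    D = tokens_per_param * N
    return N, D

N, D = chinchilla_optimal(1e24)
print(f"N* ~ {N / 1e9:.0f}B parameters, D* ~ {D / 1e12:.1f}T tokens")
```

For $C = 10^{24}$ FLOPs this gives roughly 91B parameters and 1.8T tokens, consistent in scale with Chinchilla's own budget of about $6 \cdot 70\text{B} \cdot 1.4\text{T} \approx 5.9 \times 10^{23}$ FLOPs.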

Emergent Abilities at Scale

Definition

Emergent Abilities

An ability is called emergent if it is absent in smaller models but appears in larger models. Wei et al. (2022) documented several tasks where model performance was near random below a critical scale and then sharply improved:

  • Few-shot arithmetic: near zero below ~10B parameters, then rapid improvement
  • Multi-step reasoning: absent in small models, present in large ones
  • Code generation: qualitative jump at sufficient scale

The claim: some capabilities emerge discontinuously as a function of scale, rather than improving smoothly.

The debate. Schaeffer et al. (2023) argued that emergent abilities are a mirage caused by the choice of evaluation metric. When you use a discontinuous metric (exact-match accuracy), a smooth underlying improvement in log-probability can appear as a sudden jump. When you use a smooth metric (like Brier score or log-likelihood), the improvement is gradual and predictable.
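The metric argument can be seen in a toy simulation. Assume, purely for illustration, that per-token accuracy improves linearly in log-parameters; exact match on a multi-token answer then looks like a sudden jump even though the underlying log-likelihood improves steadily:

```python
import numpy as np

# Smoothly improving per-token accuracy: an assumed linear trend in
# log10(parameters), not a fit to any real model.
log_params = np.linspace(7, 11, 9)      # 10^7 .. 10^11 parameters
p = 0.5 + 0.1 * (log_params - 7)        # 0.5 .. 0.9 per-token accuracy
k = 10                                   # answer length in tokens

exact_match = p ** k                     # discontinuous-looking metric
token_log_lik = np.log(p)                # smooth metric (per token)

for lp, em, ll in zip(log_params, exact_match, token_log_lik):
    print(f"10^{lp:4.1f} params: exact-match {em:6.4f} | log-likelihood {ll:6.3f}")
```

The exact-match column sits near zero across most of the sweep and then rises steeply, while the log-likelihood column improves gradually throughout: the same underlying progress, read through two different metrics.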

This debate is unresolved. Some tasks genuinely seem to require a threshold capability (e.g., executing a multi-step algorithm correctly), while the statistical argument about metrics is also valid. Whether emergence is "real" depends on what you mean by the word.

The Decomposed Scaling Law

Definition

Decomposed Scaling Law (functional form)

Assume a transformer language model trained near convergence, with compute allocated compute-optimally per Hoffmann et al. (2022). The cross-entropy loss is modeled as:

$$L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$

where:

  • $N$ is the parameter count, $D$ is the training token count.
  • $E \geq 0$ is a loss floor for the given data distribution and tokenizer: the loss of an infinite model trained on infinite data.
  • $A, B > 0$ and $\alpha, \beta > 0$ are fitted constants.
  • $A / N^\alpha$ captures limited model capacity.
  • $B / D^\beta$ captures limited dataset information.

As $N \to \infty$ or $D \to \infty$, the corresponding term vanishes, but the loss cannot go below $E$.

Example

Empirical fits (Chinchilla, Kaplan)

Hoffmann et al. (2022) fit $\alpha \approx 0.34$, $\beta \approx 0.28$, $E \approx 1.69$ nats on their corpus. Kaplan et al. (2020) report a different irreducible floor and different exponents on their WebText2-style corpus. These values are observations, not universal constants: they vary with data distribution, language, tokenization, architecture, and fitting procedure. Treat them as calibrated priors for a given setup, not as fundamental parameters of nature.

Even so, the decomposition is practically predictive. Once $A$, $B$, $\alpha$, $\beta$, $E$ are fit on small-scale runs, the extrapolated loss of much larger runs (Chinchilla, Llama, GPT-4 class) has matched measured loss within a few percent. This is what justifies the capital expenditure on large training runs.
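Plugging the fitted constants into the decomposed form shows how a prediction is assembled; at Chinchilla scale (70B parameters, 1.4T tokens) both correction terms are already small relative to the floor $E$:

```python
# Hoffmann et al. (2022) fitted constants.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(N, D):
    """Decomposed scaling law L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / N**alpha + B / D**beta

capacity_term = A / (70e9) ** alpha      # finite-model penalty
data_term = B / (1.4e12) ** beta         # finite-data penalty
print(f"capacity term: {capacity_term:.3f} nats")
print(f"data term:     {data_term:.3f} nats")
print(f"total:         {predicted_loss(70e9, 1.4e12):.3f} nats")  # ~1.94
```

Note how close the total sits to $E$: at this scale, most of the remaining loss is the irreducible entropy of the data, not a capacity or data shortfall.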

The predictive power is restricted to pretraining loss. Downstream task accuracy can be nonlinear in loss: small loss improvements may produce large capability jumps or none at all.

Proposition

Compute-Optimal Allocation under the Decomposed Form

Statement

Minimize $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$ subject to the compute constraint $C = 6ND$. The optimal $(N^*, D^*)$ satisfies:

$$N^* \propto C^{\beta / (\alpha + \beta)}, \qquad D^* \propto C^{\alpha / (\alpha + \beta)}.$$

The two exponents sum to 1, consistent with $C = 6ND$.

Intuition

Each term $A N^{-\alpha}$ and $B D^{-\beta}$ is convex in its variable. The constraint $ND = C/6$ is a hyperbola in $(N, D)$. The minimum is where the marginal loss reduction per unit compute is equal across $N$ and $D$. Allocating more compute to the resource with the faster-decaying term gives diminishing returns sooner, so the optimum balances the two exponents.

Proof Sketch

Form the Lagrangian $\mathcal{L} = A N^{-\alpha} + B D^{-\beta} + \lambda(ND - C/6)$. First-order conditions:

$$-\alpha A N^{-\alpha - 1} + \lambda D = 0, \qquad -\beta B D^{-\beta - 1} + \lambda N = 0.$$

Eliminating $\lambda$ gives $\alpha A D^{\beta} = \beta B N^{\alpha}$, so $D^\beta \propto N^\alpha$. Combined with $ND = C/6$, solve for $N^*$ in terms of $C$:

$$N^{\alpha + \beta} \propto C^\beta \implies N^* \propto C^{\beta / (\alpha + \beta)},$$

and symmetrically $D^* \propto C^{\alpha / (\alpha + \beta)}$.
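The closed form can be checked numerically: a brute-force search along the constraint hyperbola recovers the exponent $\beta/(\alpha+\beta)$. The constants below are Hoffmann et al.'s fits; the grid bounds are arbitrary choices for the demo:

```python
import numpy as np

A, B, alpha, beta = 406.4, 410.7, 0.34, 0.28

def optimal_N(C):
    """Minimize A*N^-alpha + B*D^-beta over the hyperbola N*D = C/6
    by brute force on a log-spaced grid of N."""
    N = np.logspace(6, 14, 400_000)
    D = C / (6.0 * N)
    L = A * N**-alpha + B * D**-beta
    return N[np.argmin(L)]

# Measure the slope of log N* against log C between two budgets.
C1, C2 = 1e22, 1e24
measured_a = np.log10(optimal_N(C2) / optimal_N(C1)) / np.log10(C2 / C1)
print(f"measured exponent a = {measured_a:.3f}")
print(f"closed form beta/(alpha+beta) = {beta / (alpha + beta):.3f}")
```

With these exponents the predicted value is $0.28/0.62 \approx 0.452$: slightly below the $0.5$ of the $\alpha = \beta$ limit, which is why Chinchilla reports both exponents as close to but not exactly one half.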

Why It Matters

This is the provable skeleton behind Chinchilla-style allocation rules. Whatever the fitted $(\alpha, \beta)$, the Lagrangian argument fixes the functional form of optimal allocation. Only the specific exponent ratio depends on the empirical fit.

Failure Mode

The result assumes the functional form $L = E + A N^{-\alpha} + B D^{-\beta}$ holds globally. If the true loss deviates (data repetition, data-constrained regimes, curriculum effects, architectural phase transitions), the allocation prescription breaks. The $6ND$ FLOP estimate also ignores attention cost, which grows with context length and can dominate in long-context training.

Scaling for Downstream Tasks vs Pretraining Loss

Pretraining loss scales smoothly and predictably. Downstream task performance does not always follow the same smooth curve:

  • Smooth scaling: Tasks well-correlated with language modeling (text generation quality, perplexity) scale smoothly with loss.
  • Threshold scaling: Tasks requiring specific capabilities (multi-step reasoning, tool use) may show sharp transitions at particular loss levels.
  • Saturation: Some tasks saturate quickly (sentiment analysis) while others continue improving with scale (complex reasoning).

This means you cannot simply extrapolate a scaling curve for an individual task. You can predict the loss of a 10x larger model with high confidence, but predicting whether it will pass a specific evaluation benchmark requires understanding the relationship between loss and that specific capability.

Test-Time Compute Scaling

Kaplan and Chinchilla describe how loss scales with compute spent at training time. A separate axis has become central since 2024: how performance scales with compute spent at inference time on a fixed trained model. OpenAI o1 and DeepSeek R1 are the canonical examples: models trained with reinforcement learning to produce long chains of thought that consume far more tokens per query than standard decoding.

The headline empirical result: on reasoning-heavy tasks, adding inference compute to a smaller model can match or beat the accuracy of a much larger model decoded once. The training-compute-versus-inference-compute trade is not fixed by the Chinchilla analysis, which only accounts for pretraining loss.

Definition

Best-of-N Sampling

Given a prompt $x$, draw $N$ independent completions $y_1, \dots, y_N$ from the model at temperature $T > 0$, then select one using a scoring rule $s(x, y)$:

$$\hat{y} = \arg\max_{i \in \{1, \dots, N\}} s(x, y_i).$$

The score $s$ can be a learned verifier (process reward model, outcome reward model), a ground-truth checker for tasks with verifiable answers (math, code unit tests), or majority vote over final answers (self-consistency). Inference FLOPs scale as roughly $N$ times the cost of a single completion.
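In code, best-of-$N$ is a few lines once generation and scoring are abstracted. The `generate` and `score` callables below are stand-ins for a sampled LLM call and a verifier, and the toy model at the end is invented to make the sketch self-contained:

```python
import random
from collections import Counter

def best_of_n(generate, score, prompt, n=8):
    """Draw n completions and return the one the scoring rule ranks highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: score(prompt, y))

def self_consistency(generate, prompt, n=8):
    """Majority vote over final answers: a verifier-free selection rule."""
    answers = [generate(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy 'model': answers "4" with probability 0.6, "5" otherwise.
random.seed(0)
noisy_model = lambda prompt: "4" if random.random() < 0.6 else "5"
print(self_consistency(noisy_model, "2+2?", n=1001))
```

Even with a barely-better-than-chance sampler, majority vote over many draws recovers the modal answer with high probability; a verifier-based `best_of_n` replaces the vote with an explicit quality score.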

Coverage scales predictably with samples. Brown et al. (2024), "Large Language Monkeys," measured pass@$N$ (the fraction of problems solved by at least one of $N$ samples) across math and code benchmarks. Coverage follows an approximate power law in $N$ over several orders of magnitude: log pass@$N$ is roughly linear in $\log N$. This is a different object from pass@1: it isolates what the model can produce from what a selector actually selects. A large gap between pass@$N$ and best-of-$N$ accuracy indicates a verifier bottleneck, not a generation bottleneck.
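Under the simplest independence assumption (each sample solves a given problem with a fixed probability $p$), coverage is $1 - (1-p)^N$, and averaging over a spread of difficulties reproduces the qualitative shape of the coverage curves. The solve rates below are invented for illustration, not taken from the paper:

```python
def pass_at_n(p, n):
    """Expected coverage: probability that at least one of n independent
    samples succeeds when each succeeds with probability p."""
    return 1.0 - (1.0 - p) ** n

# Hypothetical per-problem solve rates spanning easy to very hard.
solve_rates = [0.5, 0.1, 0.02, 0.004]

for n in (1, 10, 100, 1000):
    coverage = sum(pass_at_n(p, n) for p in solve_rates) / len(solve_rates)
    print(f"pass@{n}: {coverage:.3f}")
```

Cheap extra samples keep converting hard problems into covered ones long after the easy problems have saturated, which is what stretches the coverage curve out over orders of magnitude in $N$.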

Test-time compute can substitute for parameters. Snell et al. (2024), "Scaling LLM Test-Time Compute Optimally," studied how a fixed inference budget should be spent: one large model decoded once, a small model sampled many times with a verifier, or tree search guided by a process reward model. On MATH-class problems, a smaller model equipped with an optimal test-time strategy matched a roughly $14\times$ larger model decoded greedily, at equal inference FLOPs. The optimal strategy depends on problem difficulty: easy problems favor more parallel samples, hard problems favor sequential revision and search.

Reasoning models as RL over chains of thought. OpenAI o1 and DeepSeek R1 are trained with RL against verifiable rewards to produce long internal reasoning traces before answering. The scaling axis here is not samples but tokens per answer. Reported curves show accuracy rising smoothly with thinking-token budget, analogous to a scaling law but with inference tokens rather than training tokens on the $x$-axis. The mechanism is not fully characterized: candidates include genuine search inside the trace, error correction, and in-context amortization of otherwise parametric knowledge.

Example

Inference budget trade-off

A team has a fixed per-query inference budget of $B$ FLOPs and wants to maximize accuracy on a reasoning benchmark. Two options:

  1. Deploy a 70B-parameter model and decode once. Cost per query is roughly $2 \cdot 70 \cdot 10^9 \cdot L \approx 1.4 \times 10^{11} L$ FLOPs for an output of length $L$.
  2. Deploy an 8B-parameter model and sample $N = 20$ candidates, then select with a verifier of comparable per-token cost. Generation cost per query is roughly $20 \cdot 2 \cdot 8 \cdot 10^9 \cdot L \approx 3.2 \times 10^{11} L$ FLOPs, within a small factor of option 1.
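The arithmetic above, under the $2N$-FLOPs-per-token forward-pass estimate and an assumed output length (verifier cost omitted):

```python
# Per-query inference FLOPs for the two hypothetical deployments,
# using the ~2N FLOPs-per-token forward-pass estimate.
L_tokens = 1000                           # assumed output length in tokens
big_once = 2 * 70e9 * L_tokens            # option 1: 70B model, one completion
small_bo20 = 20 * 2 * 8e9 * L_tokens      # option 2: 8B model, 20 samples

ratio = small_bo20 / big_once
print(f"option 1: {big_once:.2e} FLOPs")
print(f"option 2: {small_bo20:.2e} FLOPs ({ratio:.2f}x option 1)")
```

The two deployments land within a small constant factor of each other, which is what makes the accuracy comparison at matched budgets meaningful.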

Under Snell et al.'s empirical curves on MATH, the small-model-with-verifier option wins at comparable inference FLOPs when the verifier is well-calibrated and problems are not at the extreme tail of difficulty. On the hardest problems, the large model's single-shot reasoning can dominate because no amount of sampling from the weak model covers the correct solution.

The lesson is not "sampling always wins." It is that training compute and inference compute are two independent knobs, and Chinchilla only pins down the first. The optimal serving configuration depends on query distribution, verifier quality, and latency constraints.

Watch Out

Test-time scaling does not repeal training scaling

Reasoning models still benefit from larger base models. The RL-over-reasoning regime composes with pretraining scale rather than replacing it. A small model with a huge inference budget has a ceiling set by what its policy can represent. Test-time compute shifts the Pareto frontier of (training FLOPs, inference FLOPs, accuracy); it does not collapse it to a single axis.

Watch Out

Coverage is not accuracy

pass@$N$ measures whether any sample is correct. Best-of-$N$ accuracy measures whether the selected sample is correct. The two can diverge sharply when the verifier is weak, the reward model is miscalibrated, or the task has many plausible-looking wrong answers. Reporting pass@$N$ without a selector overstates deployable performance.

Common Confusions

Watch Out

Chinchilla-optimal is not always practically optimal

Chinchilla minimizes loss per training FLOP. But if you will serve the model to billions of requests, inference cost (which scales with $N$) dominates total cost. The practical optimum trains a smaller model on more data than Chinchilla suggests. Llama 3 70B was trained on 15T tokens (roughly 214 tokens per parameter), far beyond the Chinchilla ratio of 20. This is intentional: over-training reduces $N$ for a given quality level, cutting inference costs.

Watch Out

Power laws do not mean linear improvement

A power law $L \propto N^{-0.076}$ means that each 10x increase in parameters cuts loss by only about 16%. Improving from 3.0 to 2.5 nats requires roughly 10x scale; improving from 2.5 to 2.0 nats requires roughly another 20x, because it is a larger relative reduction (20% rather than 17%). The returns are diminishing in absolute terms, though constant in relative (percentage) terms per decade of scale. This is why training frontier models costs hundreds of millions of dollars for incremental improvements.
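Under the pure power-law assumption $L \propto N^{-\alpha}$, the scale factor needed for a given loss improvement is $(L_{\text{from}}/L_{\text{to}})^{1/\alpha}$; a quick check of the arithmetic:

```python
def scale_factor(L_from, L_to, alpha=0.076):
    """Multiplicative increase in N required to move loss from L_from
    down to L_to under L proportional to N^(-alpha)."""
    return (L_from / L_to) ** (1.0 / alpha)

print(f"3.0 -> 2.5 nats: ~{scale_factor(3.0, 2.5):.0f}x parameters")
print(f"2.5 -> 2.0 nats: ~{scale_factor(2.5, 2.0):.0f}x parameters")
```

Each successive half-nat costs more than the last, because a fixed absolute improvement is a growing relative one as loss falls.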

Watch Out

Scaling exponents are not universal constants

The exponents $\alpha$, $\beta$ depend on the architecture, data distribution, and tokenizer. Different studies report different values. The qualitative result (power-law scaling, with balanced compute allocation under Chinchilla) is robust, but the exact exponents should not be treated as fundamental constants of nature.

Summary

  • Loss scales as power laws in $N$, $D$, and $C$: $L \propto N^{-\alpha}$, $L \propto D^{-\beta}$
  • Kaplan (2020): favor large models with moderate data
  • Chinchilla (2022): scale $N$ and $D$ equally with compute; $D^*/N^* \approx 20$ tokens per parameter
  • Compute estimate: $C \approx 6ND$ FLOPs for training
  • Decomposed form: $L = E + A/N^\alpha + B/D^\beta$ with irreducible entropy $E$
  • Inference cost favors over-training smaller models beyond Chinchilla-optimal; knowledge distillation offers another path to efficient smaller models
  • Emergent abilities: possibly real, possibly metric artifacts; still debated
  • Scaling laws predict pretraining loss well but downstream task performance less reliably
  • Test-time compute is a second scaling axis: best-of-$N$ and RL-trained reasoning traces can substitute inference FLOPs for parameters on reasoning tasks (Snell et al. 2024, Brown et al. 2024)

Exercises

ExerciseCore

Problem

You have a compute budget of $C = 10^{23}$ FLOPs. Using the Chinchilla-optimal ratio of 20 tokens per parameter and the $C \approx 6ND$ rule, what are the optimal model size $N^*$ and data size $D^*$?

ExerciseAdvanced

Problem

Suppose the loss function is $L(N, D) = E + A/N^\alpha + B/D^\beta$ with $E = 1.69$, $A = 406.4$, $B = 410.7$, $\alpha = 0.34$, $\beta = 0.28$. Subject to the constraint $C = 6ND$, use Lagrange multipliers to show that the optimal allocation satisfies $N^* \propto C^{a}$ and find $a$ in terms of $\alpha$ and $\beta$.

ExerciseResearch

Problem

The emergent abilities debate centers on whether discontinuities in benchmark performance are real or artifacts of the evaluation metric. Design an experiment that would distinguish between these two hypotheses. What metric would you use, and what pattern in the results would support each hypothesis?


References

Canonical:

  • Kaplan et al., "Scaling Laws for Neural Language Models" (2020). arXiv:2001.08361
  • Hoffmann et al., "Training Compute-Optimal Large Language Models" (2022, Chinchilla). arXiv:2203.15556
  • Henighan et al., "Scaling Laws for Autoregressive Generative Modeling" (2020). arXiv:2010.14701

Current:

  • Wei et al., "Emergent Abilities of Large Language Models" (2022). arXiv:2206.07682
  • Schaeffer, Miranda, Koyejo, "Are Emergent Abilities of Large Language Models a Mirage?" (2023). arXiv:2304.15004
  • Muennighoff et al., "Scaling Data-Constrained Language Models" (2023). arXiv:2305.16264
  • Snell, Lee, Xu, Kumar, "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" (2024). arXiv:2408.03314
  • Brown et al., "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling" (2024). arXiv:2407.21787

Next Topics

Scaling laws connect to many other topics:

  • Optimizer theory: the optimization algorithms that make training at scale work
  • KV cache: the inference-side cost that makes Chinchilla-optimal models expensive to serve

Last reviewed: April 2026
