Beta. Content is under active construction and has not been peer-reviewed.

LLM Construction

Scaling Laws

Power-law relationships between loss and compute, parameters, and data: Kaplan scaling, Chinchilla-optimal training, emergent abilities, and whether scaling laws are fundamental or empirical.

Advanced · Tier 2 · Current · ~65 min

Why This Matters

Figure: empirical training runs against a power-law fit $L(C) \sim C^{-0.076}$. Axes: loss (cross-entropy, roughly 2.0 to 4.0) versus training compute ($10^{18}$ to $10^{24}$ FLOPs, log scale); undertrained runs sit above the fitted frontier.

Scaling laws are empirical power-law fits that relate a model's loss to the number of parameters $N$, the training tokens $D$, and the total compute $C$. They are regressions, not physical laws: they hold within the compute regime where they were fit, and they can break under architectural changes, data repetition, or distribution shift.

Within their regime they are useful. They have guided training decisions worth billions of dollars: how large to make a model, how much data to collect, and how to allocate a fixed compute budget between model size and training duration. They connect directly to compute-optimal training and to the transformer architecture that all modern LLMs share.

Mental Model

Imagine you have a fixed compute budget (say, $10^{24}$ FLOPs). You must decide: train a large model for fewer steps, or a smaller model for more steps? Scaling laws answer this question precisely. They tell you that loss decreases as a power law in each resource, and that there is an optimal way to split your budget between model size and training duration.

The key surprise: loss follows smooth, predictable power laws over many orders of magnitude. A model with 10x more parameters trained on the same data will have predictably lower loss. This predictability is what makes scaling laws practically useful: one can extrapolate from small experiments to forecast the performance of much larger runs.
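As a sketch of how such extrapolation works in practice, one can fit a straight line in log-log space to a handful of small runs and read off the exponent. The compute values and the generating constant below are invented purely for illustration:

```python
import numpy as np

# Hypothetical small-scale runs. The losses are generated from a known
# power law L(C) = (C_c / C)^0.05 so the fitting recipe can be shown;
# real runs would supply measured losses instead.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = (3e30 / compute) ** 0.05

# A power law is a straight line in log-log space:
# log10 L = -alpha * log10 C + const.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)
alpha = -slope

# Extrapolate three orders of magnitude beyond the largest fitted run.
predicted = 10 ** (slope * np.log10(1e24) + intercept)
print(f"fitted alpha_C = {alpha:.3f}")
print(f"predicted loss at 1e24 FLOPs = {predicted:.2f}")
```

On clean power-law data the recovered exponent is exact; on real runs, the quality of the extrapolation depends on how far the power-law regime extends.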

The Kaplan Scaling Laws (2020)

Definition

Kaplan Power-Law Scaling

Kaplan et al. (2020) empirically observed that cross-entropy loss $L$ on language modeling scales as a power law in parameters $N$, data $D$, and compute $C$, when each is varied independently with the others held sufficiently large:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$

where $N_c, D_c, C_c$ are scaling constants and the exponents are approximately:

  • $\alpha_N \approx 0.076$ (loss vs parameters)
  • $\alpha_D \approx 0.095$ (loss vs data)
  • $\alpha_C \approx 0.050$ (loss vs compute)

These power laws hold over at least 7 orders of magnitude in compute.

Proposition

Power-Law Scaling of Language Model Loss

Statement

When parameters $N$ and data $D$ are both potentially limiting, the loss follows an approximate decomposition:

$$L(N, D) \approx \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \beta} + \left(\frac{D_c}{D}\right)^{\alpha_D / \beta}\right]^{\beta}$$

for fitted constants. In the regime where one factor dominates, this reduces to the individual power laws above.

The key implication Kaplan et al. drew from their fits: along their compute-efficient frontier, model size should grow much faster than data (roughly $N^* \propto C^{0.73}$ in their analysis). This led to the recommendation to scale $N$ faster than $D$: train large models on relatively less data.

Intuition

A power law $L \propto N^{-\alpha}$ means that each 10x increase in $N$ gives a fixed percentage reduction in loss. The exponent $\alpha$ determines how fast loss improves. Since $\alpha_N < \alpha_D$, a 10x increase in parameters buys a smaller loss reduction than a 10x increase in data; equivalently, matching a given loss improvement requires a larger multiplicative increase in $N$ than in $D$. Kaplan nonetheless concluded that model size should be prioritized, because in their analysis large, briefly-trained models sat on the compute-efficient frontier.

Why It Matters

The Kaplan scaling laws directly influenced the training of GPT-3 (175B parameters trained on 300B tokens). The recommendation to favor large models with moderate data was the dominant paradigm from 2020 to 2022. It was overturned by the Chinchilla analysis.

Failure Mode

The Kaplan analysis trained all models for a relatively small number of tokens and extrapolated. Crucially, it did not train smaller models to full convergence, biasing the analysis toward large models. The Chinchilla paper corrected this methodological issue and reached the opposite conclusion about optimal allocation.

Chinchilla Scaling (2022)

Theorem

Chinchilla-Optimal Compute Allocation

Statement

Hoffmann et al. (2022) showed that for a fixed compute budget $C$, the optimal number of parameters $N^*$ and training tokens $D^*$ both scale proportionally to the square root of compute:

$$N^* \propto C^a, \qquad D^* \propto C^b$$

where $a \approx b \approx 0.5$. Since $C \approx 6ND$ (the compute for one training pass through $D$ tokens with an $N$-parameter model), this implies:

$$N^* \propto D^*$$

The optimal number of parameters and training tokens should scale equally with compute. For every doubling of parameters, you should also double the training data.

For Hoffmann et al.'s fitted exponents ($\alpha \approx 0.34$, $\beta \approx 0.28$), the Chinchilla-optimal ratio is approximately $D^*/N^* \approx 20$ tokens per parameter. The "20" is an empirical artifact of those specific exponents on that data, not a universal constant. Refits on different model families and corpora yield different ratios, and most modern open models intentionally over-train relative to this ratio because inference cost is dominated by $N$ alone.

Intuition

Kaplan said: make the model as large as possible. Chinchilla said: balance model size and data. The difference comes from how you define "optimal." If you fix compute and ask "what achieves the lowest loss?", Chinchilla shows that an overparameterized, undertrained model wastes compute. Training a smaller model on more data reaches the same loss with the same compute budget.

Proof Sketch

The analysis fits a parametric loss function:

$$L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E$$

where $E$ is the irreducible entropy of natural language. Subject to the constraint $C = 6ND$ (total compute), minimize $L$ over $(N, D)$.

Using Lagrange multipliers: at the optimum, $\frac{\partial L}{\partial N} / \frac{\partial L}{\partial D} = D/N$, which gives $\alpha A / N^{\alpha} = \beta B / D^{\beta}$. Hoffmann et al. fit $\alpha \approx 0.34$ and $\beta \approx 0.28$ (close, but not equal). Because the two exponents are nearly equal, the optimum scales as $N^* \propto C^{a}$ and $D^* \propto C^{b}$ with $a, b$ both close to $0.5$. The exact exponents depend on $(\alpha, \beta)$; the widely-quoted $N^* \propto C^{1/2}$ result is what you get in the $\alpha = \beta$ limit.

Why It Matters

Chinchilla (70B parameters, 1.4T tokens) matched the performance of Gopher (280B parameters, 300B tokens) with 4x fewer parameters and the same compute budget. This result reshaped the industry: Llama, DeepSeek, Mistral, and most subsequent models are "Chinchilla-optimal" or even "over-trained" (using more data than Chinchilla-optimal for a given size, because inference cost depends on NN while training cost depends on CC).

Failure Mode

Chinchilla-optimal minimizes loss per FLOP of training. But inference cost depends on $N$, not $D$. If you plan to serve a model to millions of users, it is cheaper to over-train a small model (use more data than Chinchilla recommends) than to serve a Chinchilla-optimal larger model. Llama 3 8B was trained on 15T tokens (roughly 1875 tokens per parameter, nearly 100x the Chinchilla ratio) because the inference savings from a smaller model outweigh the extra training compute.

The Compute Constraint: $C \approx 6ND$

Definition

Training Compute Estimate

For a decoder-only transformer with $N$ parameters, one forward pass on one token requires approximately $2N$ FLOPs (one multiply-add per parameter). The backward pass requires approximately $4N$ FLOPs (roughly 2x the forward pass). Training on $D$ tokens requires:

$$C \approx 6ND \text{ FLOPs}$$

This is the "6ND rule." It ignores attention cost ($O(n^2 d)$ per layer) but is accurate within a factor of 2 for typical model sizes and context lengths.

This formula is the bridge between the abstract scaling laws and concrete training decisions. Given a GPU cluster with a known FLOP budget, you can directly compute the Chinchilla-optimal $N$ and $D$.
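As a minimal sketch, assuming the 20-tokens-per-parameter ratio and the $6ND$ rule from above, the budget-to-allocation calculation reduces to solving a quadratic constraint:

```python
import math

def chinchilla_optimal(C_flops, tokens_per_param=20.0):
    """Given a training budget in FLOPs, return (N, D) under C = 6*N*D
    with a fixed tokens-per-parameter ratio D = r*N."""
    # C = 6*N*D = 6*r*N^2  =>  N = sqrt(C / (6*r))
    N = math.sqrt(C_flops / (6.0 * tokens_per_param))
    D = tokens_per_param * N
    return N, D

N, D = chinchilla_optimal(1e24)
print(f"N* ~ {N / 1e9:.0f}B parameters, D* ~ {D / 1e12:.1f}T tokens")
```

For $C = 10^{24}$ FLOPs this gives roughly 91B parameters and 1.8T tokens, consistent in scale with Chinchilla's own budget of about $6 \cdot 70\text{B} \cdot 1.4\text{T} \approx 5.9 \times 10^{23}$ FLOPs.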

Emergent Abilities at Scale

Definition

Emergent Abilities

An ability is called emergent if it is absent in smaller models but appears in larger models. Wei et al. (2022) documented several tasks where model performance was near random below a critical scale and then sharply improved:

  • Few-shot arithmetic: near zero below ~10B parameters, then rapid improvement
  • Multi-step reasoning: absent in small models, present in large ones
  • Code generation: qualitative jump at sufficient scale

The claim: some capabilities emerge discontinuously as a function of scale, rather than improving smoothly.

The debate. Schaeffer et al. (2023) argued that emergent abilities are a mirage caused by the choice of evaluation metric. When you use a discontinuous metric (exact-match accuracy), a smooth underlying improvement in log-probability can appear as a sudden jump. When you use a smooth metric (like Brier score or log-likelihood), the improvement is gradual and predictable.
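The metric argument can be seen in a toy simulation. Assume, purely for illustration, that per-token accuracy improves linearly in log-parameters; exact match on a multi-token answer then looks like a sudden jump even though the underlying log-likelihood improves steadily:

```python
import numpy as np

# Smoothly improving per-token accuracy: an assumed linear trend in
# log10(parameters), not a fit to any real model.
log_params = np.linspace(7, 11, 9)      # 10^7 .. 10^11 parameters
p = 0.5 + 0.1 * (log_params - 7)        # 0.5 .. 0.9 per-token accuracy
k = 10                                   # answer length in tokens

exact_match = p ** k                     # discontinuous-looking metric
token_log_lik = np.log(p)                # smooth metric (per token)

for lp, em, ll in zip(log_params, exact_match, token_log_lik):
    print(f"10^{lp:4.1f} params: exact-match {em:6.4f} | log-likelihood {ll:6.3f}")
```

The exact-match column sits near zero across most of the sweep and then rises steeply, while the log-likelihood column improves gradually throughout: the same underlying progress, read through two different metrics.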

This debate is unresolved. Some tasks genuinely seem to require a threshold capability (e.g., executing a multi-step algorithm correctly), while the statistical argument about metrics is also valid. Whether emergence is "real" depends on what you mean by the word.

The Decomposed Scaling Law

Definition

Decomposed Scaling Law (functional form)

Assume a transformer language model trained near convergence, with compute allocated compute-optimally per Hoffmann et al. (2022). The cross-entropy loss is modeled as:

$$L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$

where:

  • $N$ is the parameter count, $D$ is the training token count.
  • $E \geq 0$ is a loss floor for the given data distribution and tokenizer: the loss of an infinite model trained on infinite data.
  • $A, B > 0$ and $\alpha, \beta > 0$ are fitted constants.
  • $A / N^\alpha$ captures limited model capacity.
  • $B / D^\beta$ captures limited dataset information.

As $N \to \infty$ or $D \to \infty$, the corresponding term vanishes, but the loss cannot go below $E$.

Example

Empirical fits (Chinchilla, Kaplan)

Hoffmann et al. (2022) fit $\alpha \approx 0.34$, $\beta \approx 0.28$, $E \approx 1.69$ nats on their corpus. Kaplan et al. (2020) report a different irreducible floor and different exponents on their WebText2-style corpus. These values are observations, not universal constants: they vary with data distribution, language, tokenization, architecture, and fitting procedure. Treat them as calibrated priors for a given setup, not as fundamental parameters of nature.

Even so, the decomposition is practically predictive. Once $A$, $B$, $\alpha$, $\beta$, $E$ are fit on small-scale runs, the extrapolated loss of much larger runs (Chinchilla, Llama, GPT-4 class) has matched measured loss within a few percent. This is what justifies the capital expenditure on large training runs.
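Plugging the fitted constants into the decomposed form shows how a prediction is assembled; at Chinchilla scale (70B parameters, 1.4T tokens) both correction terms are already small relative to the floor $E$:

```python
# Hoffmann et al. (2022) fitted constants.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(N, D):
    """Decomposed scaling law L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / N**alpha + B / D**beta

capacity_term = A / (70e9) ** alpha      # finite-model penalty
data_term = B / (1.4e12) ** beta         # finite-data penalty
print(f"capacity term: {capacity_term:.3f} nats")
print(f"data term:     {data_term:.3f} nats")
print(f"total:         {predicted_loss(70e9, 1.4e12):.3f} nats")  # ~1.94
```

Note how close the total sits to $E$: at this scale, most of the remaining loss is the irreducible entropy of the data, not a capacity or data shortfall.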

The predictive power is restricted to pretraining loss. Downstream task accuracy can be nonlinear in loss: small loss improvements may produce large capability jumps or none at all.

Proposition

Compute-Optimal Allocation under the Decomposed Form

Statement

Minimize $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$ subject to the compute constraint $C = 6ND$. The optimal $(N^*, D^*)$ satisfies:

$$N^* \propto C^{\beta / (\alpha + \beta)}, \qquad D^* \propto C^{\alpha / (\alpha + \beta)}.$$

The two exponents sum to 1, consistent with $C = 6ND$.

Intuition

Each term $A N^{-\alpha}$ and $B D^{-\beta}$ is convex in its variable. The constraint $ND = C/6$ is a hyperbola in $(N, D)$. The minimum is where the marginal loss reduction per unit compute is equal across $N$ and $D$. Allocating more compute to the resource with the faster-decaying term gives diminishing returns sooner, so the optimum balances the two exponents.

Proof Sketch

Form the Lagrangian $\mathcal{L} = A N^{-\alpha} + B D^{-\beta} + \lambda(ND - C/6)$. First-order conditions:

$$-\alpha A N^{-\alpha - 1} + \lambda D = 0, \qquad -\beta B D^{-\beta - 1} + \lambda N = 0.$$

Eliminating $\lambda$ gives $\alpha A D^{\beta} = \beta B N^{\alpha}$, so $D^\beta \propto N^\alpha$. Combined with $ND = C/6$, solve for $N^*$ in terms of $C$:

$$N^{\alpha + \beta} \propto C^\beta \implies N^* \propto C^{\beta / (\alpha + \beta)},$$

and symmetrically $D^* \propto C^{\alpha / (\alpha + \beta)}$.
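The closed form can be checked numerically: a brute-force search along the constraint hyperbola recovers the exponent $\beta/(\alpha+\beta)$. The constants below are Hoffmann et al.'s fits; the grid bounds are arbitrary choices for the demo:

```python
import numpy as np

A, B, alpha, beta = 406.4, 410.7, 0.34, 0.28

def optimal_N(C):
    """Minimize A*N^-alpha + B*D^-beta over the hyperbola N*D = C/6
    by brute force on a log-spaced grid of N."""
    N = np.logspace(6, 14, 400_000)
    D = C / (6.0 * N)
    L = A * N**-alpha + B * D**-beta
    return N[np.argmin(L)]

# Measure the slope of log N* against log C between two budgets.
C1, C2 = 1e22, 1e24
measured_a = np.log10(optimal_N(C2) / optimal_N(C1)) / np.log10(C2 / C1)
print(f"measured exponent a = {measured_a:.3f}")
print(f"closed form beta/(alpha+beta) = {beta / (alpha + beta):.3f}")
```

With these exponents the predicted value is $0.28/0.62 \approx 0.452$: slightly below the $0.5$ of the $\alpha = \beta$ limit, which is why Chinchilla reports both exponents as close to but not exactly one half.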

Why It Matters

This is the provable skeleton behind Chinchilla-style allocation rules. Whatever the fitted $(\alpha, \beta)$, the Lagrangian argument fixes the functional form of optimal allocation. Only the specific exponent ratio depends on the empirical fit.

Failure Mode

The result assumes the functional form $L = E + A N^{-\alpha} + B D^{-\beta}$ holds globally. If the true loss deviates (data repetition, data-constrained regimes, curriculum effects, architectural phase transitions), the allocation prescription breaks. The $6ND$ FLOP estimate also ignores attention cost, which grows with context length and can dominate in long-context training.

Scaling for Downstream Tasks vs Pretraining Loss

Pretraining loss scales smoothly and predictably. Downstream task performance does not always follow the same smooth curve:

  • Smooth scaling: Tasks well-correlated with language modeling (text generation quality, perplexity) scale smoothly with loss.
  • Threshold scaling: Tasks requiring specific capabilities (multi-step reasoning, tool use) may show sharp transitions at particular loss levels.
  • Saturation: Some tasks saturate quickly (sentiment analysis) while others continue improving with scale (complex reasoning).

This means you cannot simply extrapolate a scaling curve for an individual task. You can predict the loss of a 10x larger model with high confidence, but predicting whether it will pass a specific evaluation benchmark requires understanding the relationship between loss and that specific capability.

Test-Time Compute Scaling

Kaplan and Chinchilla describe how loss scales with compute spent at training time. A separate axis has become central since 2024: how performance scales with compute spent at inference time on a fixed trained model. OpenAI o1 and DeepSeek R1 are the canonical examples: models trained with reinforcement learning to produce long chains of thought that consume far more tokens per query than standard decoding.

The headline empirical result: on reasoning-heavy tasks, adding inference compute to a smaller model can match or beat the accuracy of a much larger model decoded once. The training-compute-versus-inference-compute trade is not fixed by the Chinchilla analysis, which only accounts for pretraining loss.

Definition

Best-of-N Sampling

Given a prompt $x$, draw $N$ independent completions $y_1, \dots, y_N$ from the model at temperature $T > 0$, then select one using a scoring rule $s(x, y)$:

$$\hat{y} = \arg\max_{i \in \{1, \dots, N\}} s(x, y_i).$$

The score $s$ can be a learned verifier (process reward model, outcome reward model), a ground-truth checker for tasks with verifiable answers (math, code unit tests), or majority vote over final answers (self-consistency). Inference FLOPs scale as roughly $N$ times the cost of a single completion.
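In code, best-of-$N$ is a few lines once generation and scoring are abstracted. The `generate` and `score` callables below are stand-ins for a sampled LLM call and a verifier, and the toy model at the end is invented to make the sketch self-contained:

```python
import random
from collections import Counter

def best_of_n(generate, score, prompt, n=8):
    """Draw n completions and return the one the scoring rule ranks highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: score(prompt, y))

def self_consistency(generate, prompt, n=8):
    """Majority vote over final answers: a verifier-free selection rule."""
    answers = [generate(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Toy 'model': answers "4" with probability 0.6, "5" otherwise.
random.seed(0)
noisy_model = lambda prompt: "4" if random.random() < 0.6 else "5"
print(self_consistency(noisy_model, "2+2?", n=1001))
```

Even with a barely-better-than-chance sampler, majority vote over many draws recovers the modal answer with high probability; a verifier-based `best_of_n` replaces the vote with an explicit quality score.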

Coverage scales predictably with samples. Brown et al. (2024), "Large Language Monkeys," measured pass@$N$ (the fraction of problems solved by at least one of $N$ samples) across math and code benchmarks. Coverage follows an approximate power law in $N$ over several orders of magnitude: log pass@$N$ is roughly linear in $\log N$. This is a different object from pass@1: it isolates what the model can produce from what a selector actually selects. A large gap between pass@$N$ and best-of-$N$ accuracy indicates a verifier bottleneck, not a generation bottleneck.
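Under the simplest independence assumption (each sample solves a given problem with a fixed probability $p$), coverage is $1 - (1-p)^N$, and averaging over a spread of difficulties reproduces the qualitative shape of the coverage curves. The solve rates below are invented for illustration, not taken from the paper:

```python
def pass_at_n(p, n):
    """Expected coverage: probability that at least one of n independent
    samples succeeds when each succeeds with probability p."""
    return 1.0 - (1.0 - p) ** n

# Hypothetical per-problem solve rates spanning easy to very hard.
solve_rates = [0.5, 0.1, 0.02, 0.004]

for n in (1, 10, 100, 1000):
    coverage = sum(pass_at_n(p, n) for p in solve_rates) / len(solve_rates)
    print(f"pass@{n}: {coverage:.3f}")
```

Cheap extra samples keep converting hard problems into covered ones long after the easy problems have saturated, which is what stretches the coverage curve out over orders of magnitude in $N$.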

Test-time compute can substitute for parameters. Snell et al. (2024), "Scaling LLM Test-Time Compute Optimally," studied how a fixed inference budget should be spent: one large model decoded once, a small model sampled many times with a verifier, or tree search guided by a process reward model. On MATH-class problems, a smaller model equipped with an optimal test-time strategy matched a roughly $14\times$ larger model decoded greedily, at equal inference FLOPs. The optimal strategy depends on problem difficulty: easy problems favor more parallel samples, hard problems favor sequential revision and search.

Reasoning models as RL over chains of thought. OpenAI o1 and DeepSeek R1 are trained with RL against verifiable rewards to produce long internal reasoning traces before answering. The scaling axis here is not samples but tokens per answer. Reported curves show accuracy rising smoothly with thinking-token budget, analogous to a scaling law but with inference tokens rather than training tokens on the $x$-axis. The mechanism is not fully characterized: candidates include genuine search inside the trace, error correction, and in-context amortization of otherwise parametric knowledge.

Example

Inference budget trade-off

A team has a fixed per-query inference budget of $B$ FLOPs and wants to maximize accuracy on a reasoning benchmark. Two options:

  1. Deploy a 70B-parameter model and decode once. Cost per query is roughly $2 \cdot 70 \cdot 10^9 \cdot L \approx 1.4 \times 10^{11} L$ FLOPs for an output of length $L$.
  2. Deploy an 8B-parameter model and sample $N = 20$ candidates, then select with a verifier of comparable per-token cost. Generation cost per query is roughly $20 \cdot 2 \cdot 8 \cdot 10^9 \cdot L \approx 3.2 \times 10^{11} L$ FLOPs, within a small factor of option 1.
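The arithmetic above, under the $2N$-FLOPs-per-token forward-pass estimate and an assumed output length (verifier cost omitted):

```python
# Per-query inference FLOPs for the two hypothetical deployments,
# using the ~2N FLOPs-per-token forward-pass estimate.
L_tokens = 1000                           # assumed output length in tokens
big_once = 2 * 70e9 * L_tokens            # option 1: 70B model, one completion
small_bo20 = 20 * 2 * 8e9 * L_tokens      # option 2: 8B model, 20 samples

ratio = small_bo20 / big_once
print(f"option 1: {big_once:.2e} FLOPs")
print(f"option 2: {small_bo20:.2e} FLOPs ({ratio:.2f}x option 1)")
```

The two deployments land within a small constant factor of each other, which is what makes the accuracy comparison at matched budgets meaningful.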

Under Snell et al.'s empirical curves on MATH, the small-model-with-verifier option wins at comparable inference FLOPs when the verifier is well-calibrated and problems are not at the extreme tail of difficulty. On the hardest problems, the large model's single-shot reasoning can dominate because no amount of sampling from the weak model covers the correct solution.

The lesson is not "sampling always wins." It is that training compute and inference compute are two independent knobs, and Chinchilla only pins down the first. The optimal serving configuration depends on query distribution, verifier quality, and latency constraints.

Watch Out

Test-time scaling does not repeal training scaling

Reasoning models still benefit from larger base models. The RL-over-reasoning regime composes with pretraining scale rather than replacing it. A small model with a huge inference budget has a ceiling set by what its policy can represent. Test-time compute shifts the Pareto frontier of (training FLOPs, inference FLOPs, accuracy); it does not collapse it to a single axis.

Watch Out

Coverage is not accuracy

pass@$N$ measures whether any sample is correct. Best-of-$N$ accuracy measures whether the selected sample is correct. The two can diverge sharply when the verifier is weak, the reward model is miscalibrated, or the task has many plausible-looking wrong answers. Reporting pass@$N$ without a selector overstates deployable performance.

Common Confusions

Watch Out

Chinchilla-optimal is not always practically optimal

Chinchilla minimizes loss per training FLOP. But if you will serve the model to billions of requests, inference cost (which scales with $N$) dominates total cost. The practical optimum trains a smaller model on more data than Chinchilla suggests. Llama 3 70B was trained on 15T tokens (roughly 214 tokens per parameter), far beyond the Chinchilla ratio of 20. This is intentional: over-training reduces $N$ for a given quality level, cutting inference costs.

Watch Out

Power laws do not mean linear improvement

A power law $L \propto N^{-0.076}$ means that each 10x increase in parameters cuts loss by only about 16%. Improving from 3.0 to 2.5 nats requires roughly 10x scale; improving from 2.5 to 2.0 nats requires roughly another 20x, because it is a larger relative reduction (20% rather than 17%). The returns are diminishing in absolute terms, though constant in relative (percentage) terms per decade of scale. This is why training frontier models costs hundreds of millions of dollars for incremental improvements.
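Under the pure power-law assumption $L \propto N^{-\alpha}$, the scale factor needed for a given loss improvement is $(L_{\text{from}}/L_{\text{to}})^{1/\alpha}$; a quick check of the arithmetic:

```python
def scale_factor(L_from, L_to, alpha=0.076):
    """Multiplicative increase in N required to move loss from L_from
    down to L_to under L proportional to N^(-alpha)."""
    return (L_from / L_to) ** (1.0 / alpha)

print(f"3.0 -> 2.5 nats: ~{scale_factor(3.0, 2.5):.0f}x parameters")
print(f"2.5 -> 2.0 nats: ~{scale_factor(2.5, 2.0):.0f}x parameters")
```

Each successive half-nat costs more than the last, because a fixed absolute improvement is a growing relative one as loss falls.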

Watch Out

Scaling exponents are not universal constants

The exponents $\alpha$, $\beta$ depend on the architecture, data distribution, and tokenizer. Different studies report different values. The qualitative result (power-law scaling, with balanced compute allocation under Chinchilla) is robust, but the exact exponents should not be treated as fundamental constants of nature.

Summary

  • Loss scales as power laws in $N$, $D$, and $C$: $L \propto N^{-\alpha}$, $L \propto D^{-\beta}$
  • Kaplan (2020): favor large models with moderate data
  • Chinchilla (2022): scale $N$ and $D$ equally with compute; $D^*/N^* \approx 20$ tokens per parameter
  • Compute estimate: $C \approx 6ND$ FLOPs for training
  • Decomposed form: $L = E + A/N^\alpha + B/D^\beta$ with irreducible entropy $E$
  • Inference cost favors over-training smaller models beyond Chinchilla-optimal; knowledge distillation offers another path to efficient smaller models
  • Emergent abilities: possibly real, possibly metric artifacts; still debated
  • Scaling laws predict pretraining loss well but downstream task performance less reliably
  • Test-time compute is a second scaling axis: best-of-$N$ and RL-trained reasoning traces can substitute inference FLOPs for parameters on reasoning tasks (Snell et al. 2024, Brown et al. 2024)

Exercises

ExerciseCore

Problem

You have a compute budget of $C = 10^{23}$ FLOPs. Using the Chinchilla-optimal ratio of 20 tokens per parameter and the $C \approx 6ND$ rule, what are the optimal model size $N^*$ and data size $D^*$?

ExerciseAdvanced

Problem

Suppose the loss function is $L(N, D) = E + A/N^\alpha + B/D^\beta$ with $E = 1.69$, $A = 406.4$, $B = 410.7$, $\alpha = 0.34$, $\beta = 0.28$. Subject to the constraint $C = 6ND$, use Lagrange multipliers to show that the optimal allocation satisfies $N^* \propto C^{a}$ and find $a$ in terms of $\alpha$ and $\beta$.

ExerciseResearch

Problem

The emergent abilities debate centers on whether discontinuities in benchmark performance are real or artifacts of the evaluation metric. Design an experiment that would distinguish between these two hypotheses. What metric would you use, and what pattern in the results would support each hypothesis?


References

Canonical:

  • Kaplan et al., "Scaling Laws for Neural Language Models" (2020). arXiv:2001.08361
  • Hoffmann et al., "Training Compute-Optimal Large Language Models" (2022, Chinchilla). arXiv:2203.15556
  • Henighan et al., "Scaling Laws for Autoregressive Generative Modeling" (2020). arXiv:2010.14701

Current:

  • Wei et al., "Emergent Abilities of Large Language Models" (2022). arXiv:2206.07682
  • Schaeffer, Miranda, Koyejo, "Are Emergent Abilities of Large Language Models a Mirage?" (2023). arXiv:2304.15004
  • Muennighoff et al., "Scaling Data-Constrained Language Models" (2023). arXiv:2305.16264
  • Snell, Lee, Xu, Kumar, "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" (2024). arXiv:2408.03314
  • Brown et al., "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling" (2024). arXiv:2407.21787

Next Topics

Scaling laws connect to many other topics:

  • Optimizer theory: the optimization algorithms that make training at scale work
  • KV cache: the inference-side cost that makes Chinchilla-optimal models expensive to serve

Last reviewed: April 2026
