Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Methodology

Energy Efficiency and Green AI

The compute cost of training frontier models, carbon footprint, FLOPs vs wall-clock time vs dollars, and why reporting efficiency matters. Efficient alternatives: distillation, pruning, quantization, and scaling laws for optimal compute allocation.

Why This Matters

Training GPT-4 reportedly cost over 100 million USD in compute. Strubell et al. (2019) estimated that training a single large NLP model can emit as much carbon as five cars over their lifetimes (their worst case included neural architecture search on top of a BERT-scale transformer). The trend is accelerating: each generation of frontier models uses roughly 10x more compute than the previous one.

This is not only an environmental concern. It is an access concern. If only organizations with $100M+ budgets can train frontier models, research becomes concentrated in a few labs. Understanding compute efficiency is necessary for anyone who does not have unlimited resources.

Measuring Compute

Definition

FLOPs (Floating Point Operations)

FLOPs count the total number of floating point operations (additions and multiplications) in a computation. For a dense matrix multiplication of an m × k matrix by a k × n matrix, the cost is approximately 2mkn FLOPs (each of the mn output entries requires k multiplications and k − 1 additions, for (2k − 1)mn ≈ 2mkn in total).

For a transformer forward pass on a sequence of length s with model dimension d and L layers, the cost is approximately $2L(12d^2s + 2ds^2)$ FLOPs (from the attention and FFN layers). The backward pass costs roughly 2x the forward pass.

Definition

Compute Budget

The total training compute is:

C = 6ND

where N is the number of parameters and D is the number of training tokens (for dense transformer language models). The factor of 6 comes from 2 FLOPs per parameter per token for the forward pass, plus roughly twice that (4 FLOPs) for the backward pass, which computes gradients with respect to both activations and weights.

Three distinct measurements of cost:

  1. FLOPs: hardware-independent, measures algorithmic cost
  2. Wall-clock time: depends on hardware, parallelism, communication overhead
  3. Dollars: depends on hardware prices, cloud vs. owned, electricity costs

These can diverge. A method with fewer FLOPs can be slower in wall-clock time if it is hard to parallelize. A method that is fast on GPUs may be expensive in dollars if GPUs cost more than the alternative hardware.
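
As a concrete illustration, the C = 6ND estimate converts directly into GPU-hours and an approximate dollar cost. A minimal sketch; the sustained throughput and hourly price below are illustrative assumptions, not measured or vendor-quoted figures:

```python
def training_cost(n_params, n_tokens, flops_per_sec=150e12,
                  dollars_per_gpu_hour=2.0):
    """Rough training-cost estimate from C = 6ND.

    flops_per_sec is an assumed sustained per-GPU throughput;
    dollars_per_gpu_hour is an assumed cloud price.
    """
    flops = 6 * n_params * n_tokens                      # hardware-independent
    gpu_hours = flops / flops_per_sec / 3600             # hardware-dependent
    return flops, gpu_hours, gpu_hours * dollars_per_gpu_hour

# A hypothetical 1B-parameter model trained on 20B tokens:
flops, gpu_hours, cost = training_cost(1e9, 20e9)
```

The three return values correspond to the three cost measures above: FLOPs, single-GPU wall-clock hours, and dollars.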

Scaling Laws and Optimal Allocation

Proposition

Chinchilla Optimal Compute Allocation

Statement

Hoffmann et al. (2022) found empirically that for a fixed compute budget CC, the loss is minimized when parameters NN and training tokens DD are scaled proportionally:

$N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}$

The Chinchilla scaling law gives $N_{\text{opt}} \approx N_0 \cdot (C/C_0)^{0.50}$ and $D_{\text{opt}} \approx D_0 \cdot (C/C_0)^{0.50}$. For the specific constants fitted, this means roughly 20 tokens per parameter is optimal.

Intuition

If your model is too large for the data, you waste compute on parameters that cannot be properly trained (overfitting risk, underfitting due to insufficient data). If your data is too large for the model, you waste compute on training a model that has already saturated its capacity. The optimal point balances model capacity against data quantity.

Proof Sketch

Fit a parametric loss function $L(N, D) = E + A/N^\alpha + B/D^\beta$ to training runs at many scales, then minimize L subject to $C = 6ND$. Substituting $D = C/(6N)$ and setting the derivative to zero gives $N_{\text{opt}} = G \cdot (C/6)^{\beta/(\alpha+\beta)}$ with $G = (\alpha A / \beta B)^{1/(\alpha+\beta)}$. Hoffmann et al. found $\alpha \approx \beta$, giving $N_{\text{opt}} \propto C^{0.5}$.
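
Under the fitted rule of thumb D ≈ 20N, the constraint C = 6ND collapses to C = 120N², which pins down the compute-optimal sizes in closed form. A quick sketch, assuming only the 6ND cost model and the 20 tokens-per-parameter ratio quoted above:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20):
    """Compute-optimal (N, D) assuming C = 6ND and D = tokens_per_param * N.

    Substituting the ratio into the budget gives
    C = 6 * tokens_per_param * N^2, so N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params

# e.g. a hypothetical 1e23-FLOP budget:
n, d = chinchilla_optimal(1e23)   # roughly a 29B model on ~0.58T tokens
```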

Why It Matters

Before Chinchilla, the prevailing approach (Kaplan et al. 2020) suggested making models as large as possible and training for fewer tokens. Chinchilla showed this was suboptimal: GPT-3 (175B parameters, 300B tokens) used far more parameters than necessary for its compute budget. Chinchilla (70B parameters, 1.4T tokens) outperformed the 4x larger Gopher (280B parameters) using the same training compute, at roughly 4x lower inference cost. This shifted the field toward training smaller models on more data.

Failure Mode

Chinchilla scaling laws are fitted to specific model families (dense transformers) and training setups. They may not apply to: mixture-of-experts models, models trained with curriculum learning, models fine-tuned for downstream tasks, or models where inference cost (not training cost) is the binding constraint. When inference dominates total cost (which it does for widely deployed models), training a larger model on less data and then distilling may be more efficient overall.

Efficient Alternatives

Knowledge distillation: train a small student model to mimic the outputs of a large teacher model. Hinton et al. (2015) showed that the teacher's soft predictions (probability distributions over classes) transfer more information than hard labels alone. The student can match 90%+ of the teacher's accuracy at a fraction of the parameter count.
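
The soft-target idea can be sketched in a few lines of NumPy. The temperature and blending weight below are illustrative hyperparameters, not values prescribed by the paper:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence to the teacher's
    temperature-softened distribution (the Hinton et al. 2015 recipe;
    temperature and alpha here are illustrative choices)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    # KL(teacher || student); the T^2 factor keeps gradient scales comparable
    soft = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(alpha * hard + (1 - alpha) * temperature**2 * soft))
```

The KL term vanishes when the student exactly matches the teacher, leaving only the ordinary cross-entropy on hard labels.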

Pruning: remove redundant weights or neurons after training. Structured pruning (removing entire channels or attention heads) gives real speedups on standard hardware; unstructured pruning (zeroing individual weights) requires sparse matrix support.

Quantization: reduce weight precision from 32-bit to 16-bit, 8-bit, or 4-bit (see computer architecture for ML for hardware implications). Post-training quantization (PTQ) requires no retraining. Quantization-aware training (QAT) maintains accuracy better but costs more. 4-bit quantization reduces model size by 8x with modest accuracy loss for LLMs.
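
A minimal sketch of symmetric per-tensor post-training quantization to int8. Production PTQ pipelines use per-channel scales and calibration data; this shows only the core idea:

```python
import numpy as np

def quantize_int8(w):
    """Map float weights to int8 with a single symmetric scale."""
    scale = max(np.abs(w).max() / 127.0, 1e-12)   # guard against all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
error = np.abs(w - dequantize(q, scale)).max()    # rounding error <= scale / 2
```

Int8 storage is 4x smaller than float32; the 4-bit schemes mentioned above push this to 8x at the cost of a coarser grid.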

Architecture efficiency: mixture-of-experts (MoE) models activate only a fraction of parameters per token. A 1T-parameter MoE model might use only 100B parameters per forward pass, giving the capacity of a large model at the inference cost of a smaller one.

Reporting Efficiency

Proposition

Efficiency Reporting Principle

Statement

Schwartz et al. (2020, "Green AI") argue that AI papers should report efficiency alongside accuracy. Specifically, every result should include:

  1. Total training FLOPs (or GPU-hours and hardware type)
  2. Training wall-clock time
  3. CO2 emissions estimate (or at minimum, data center location and energy source)
  4. Cost in dollars (approximate)

A method that achieves 1% higher accuracy at 10x the compute is not unambiguously better.

Intuition

Without efficiency reporting, the field optimizes only for accuracy, creating an arms race where only the richest labs can compete. Reporting compute cost enables Pareto-optimal comparisons: Method A beats Method B if it is better on accuracy and cheaper, or equivalent on accuracy and cheaper, or better on accuracy at equal cost.
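
The Pareto comparison can be made precise with a small helper; the (accuracy, cost) tuples below are hypothetical:

```python
def dominates(a, b):
    """a = (accuracy, cost). a dominates b if it is no worse on both axes
    and strictly better on at least one."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

def pareto_front(methods):
    """Keep only methods not dominated by any other method."""
    return [m for m in methods
            if not any(dominates(o, m) for o in methods if o is not m)]

# Hypothetical (accuracy, training-FLOPs) results: none dominates another,
# so all three sit on the accuracy-vs-compute frontier.
methods = [(0.91, 1e21), (0.92, 1e22), (0.90, 5e20)]
front = pareto_front(methods)
```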

Proof Sketch

Not a mathematical proof. This is a methodological principle supported by the observation that omitting cost makes it impossible to evaluate efficiency.

Why It Matters

If a paper reports only accuracy, you cannot tell whether the improvement came from a better algorithm or simply more compute. Efficiency reporting makes this distinction visible and rewards algorithmic innovation over brute-force scaling.

Failure Mode

Efficiency reporting is hard to standardize. GPU-hours on an A100 are not comparable to GPU-hours on a V100. FLOPs do not account for memory bandwidth bottlenecks or communication overhead. Dollars depend on cloud pricing that changes monthly. Despite these limitations, approximate reporting is far better than no reporting.

Carbon Footprint

The carbon cost of training depends on three factors:

  1. Energy consumed (kWh) = power draw (kW) × training time (hours)
  2. Power Usage Effectiveness (PUE): ratio of total data center energy to compute energy. Typical values: 1.1-1.6.
  3. Carbon intensity of the electricity grid (gCO2/kWh): varies from ~20 gCO2/kWh (hydropower, nuclear) to ~900 gCO2/kWh (coal).

CO2 = Energy × PUE × Carbon Intensity

Training the same model in Quebec (hydropower, ~20 gCO2/kWh) produces roughly 40x less carbon than training it in West Virginia (coal, ~800 gCO2/kWh). Data center location is a first-order decision for carbon footprint.
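
The formula above in code, reproducing the clean-grid vs coal-grid comparison. The power draw and training time are hypothetical; the PUE and grid intensities use the ranges quoted above:

```python
def training_co2_kg(power_kw, hours, pue=1.2, grid_gco2_per_kwh=400):
    """CO2 (kg) = energy (kWh) * PUE * grid carbon intensity (gCO2/kWh) / 1000."""
    return power_kw * hours * pue * grid_gco2_per_kwh / 1000.0

# Same hypothetical run (300 kW for 1000 hours) on two grids:
hydro = training_co2_kg(300, 1000, grid_gco2_per_kwh=20)    # ~7.2 tonnes
coal = training_co2_kg(300, 1000, grid_gco2_per_kwh=800)    # ~288 tonnes, 40x more
```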

Common Confusions

Watch Out

FLOPs and FLOPS are different

FLOPs (lowercase s) = floating point operations (a count). FLOP/s or FLOPS = floating point operations per second (a rate). An A100 GPU has a peak throughput of ~312 TFLOP/s for FP16. Training for 10 hours at 50% utilization uses $312 \times 10^{12} \times 0.5 \times 3600 \times 10 \approx 5.6 \times 10^{18}$ FLOPs.

Watch Out

Inference cost often exceeds training cost

GPT-4 was trained once but serves millions of requests per day. Over the model's lifetime, inference FLOPs dwarf training FLOPs. Optimizing only training efficiency is insufficient; inference efficiency (smaller models, quantization, efficient serving) often matters more for total cost.

Exercises

ExerciseCore

Problem

A 7B parameter model is trained on 1.4T tokens. Estimate the total training FLOPs using the C = 6ND approximation. If an A100 GPU achieves 150 TFLOP/s in practice, how many GPU-hours is this?

ExerciseAdvanced

Problem

Chinchilla scaling says the optimal number of training tokens is $D_{\text{opt}} \approx 20N$. A lab has a budget of $C = 10^{24}$ FLOPs. What are the optimal model size and number of training tokens? If the lab instead trains a model 4x larger on fewer tokens, what is the efficiency loss?

References

Canonical:

  • Strubell, Ganesh, McCallum, "Energy and Policy Considerations for Deep Learning in NLP", ACL 2019
  • Schwartz et al., "Green AI", Communications of the ACM 2020

Current:

  • Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla), NeurIPS 2022

  • Kaplan et al., "Scaling Laws for Neural Language Models", 2020

  • Hinton, Vinyals, Dean, "Distilling the Knowledge in a Neural Network", NeurIPS Workshop 2015


Last reviewed: April 2026