Methodology
Energy Efficiency and Green AI
The compute cost of training frontier models, carbon footprint, FLOPs vs wall-clock time vs dollars, and why reporting efficiency matters. Efficient alternatives: distillation, pruning, quantization, and scaling laws for optimal compute allocation.
Why This Matters
Training GPT-4 reportedly cost over 100 million USD in compute. Training a single large language model can emit as much carbon as five cars over their lifetimes (Strubell et al. 2019; the five-car figure is for a large transformer trained with neural architecture search, not a single BERT-scale run). The trend is accelerating: each generation of frontier models uses roughly 10x more compute than the previous one.
This is not only an environmental concern. It is an access concern. If only organizations with $100M+ budgets can train frontier models, research becomes concentrated in a few labs. Understanding compute efficiency is necessary for anyone who does not have unlimited resources.
Measuring Compute
FLOPs (Floating Point Operations)
FLOPs count the total number of floating point operations (additions and multiplications) in a computation. For a dense matrix multiplication of an $m \times n$ matrix by an $n \times p$ matrix, the cost is approximately $2mnp$ FLOPs (each of the $mp$ output entries requires $n$ multiplications and $n-1$ additions).
For a transformer forward pass on a sequence of length $s$ with model dimension $d$ and $L$ layers, the cost is approximately $L(24 s d^2 + 4 s^2 d)$ FLOPs (the first term from the attention and FFN weight matmuls, the second from computing and applying the attention scores). The backward pass costs roughly twice the forward pass.
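These counts can be sketched in code. The helper names below are illustrative, and the transformer estimate uses the standard rough per-layer breakdown (weight matmuls $\approx 24 s d^2$, attention scores $\approx 4 s^2 d$), so treat it as an order-of-magnitude estimate, not an exact accounting.

```python
def matmul_flops(m: int, n: int, p: int) -> int:
    """FLOPs for an (m x n) @ (n x p) dense matmul: m*p outputs,
    each needing ~n multiplications and ~n additions."""
    return 2 * m * n * p

def transformer_forward_flops(seq_len: int, d_model: int, n_layers: int) -> int:
    """Approximate forward-pass FLOPs per sequence for a dense transformer.
    Per layer: ~24*s*d^2 from the attention/FFN weight matmuls (~12*d^2 params),
    plus ~4*s^2*d from the QK^T and attention-weighted-V matmuls."""
    per_layer = 24 * seq_len * d_model**2 + 4 * seq_len**2 * d_model
    return n_layers * per_layer

# Example with a GPT-2-small-like shape (s=1024, d=768, L=12): ~2.1e11 FLOPs
print(f"{transformer_forward_flops(1024, 768, 12):.3e}")
```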
Compute Budget
The total training compute is:
$$C \approx 6ND$$
where $N$ is the number of parameters and $D$ is the number of training tokens (for dense transformer language models). The factor of 6 comes from: 2 FLOPs per parameter per token for the forward pass, plus roughly twice that (4 FLOPs per parameter per token) for the backward pass, which computes gradients with respect to both activations and weights.
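A minimal sketch of the budget formula (the function name is ours, not from any library):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """C ~= 6 * N * D for a dense transformer:
    2 FLOPs/param/token forward + ~4 FLOPs/param/token backward."""
    return 6.0 * n_params * n_tokens

# GPT-3-scale check: 175B params on 300B tokens -> ~3.15e23 FLOPs
print(f"{training_flops(175e9, 300e9):.2e}")  # 3.15e+23
```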
Three distinct measurements of cost:
- FLOPs: hardware-independent, measures algorithmic cost
- Wall-clock time: depends on hardware, parallelism, communication overhead
- Dollars: depends on hardware prices, cloud vs. owned, electricity costs
These can diverge. A method with fewer FLOPs can be slower in wall-clock time if it is hard to parallelize. A method that is fast on GPUs may be expensive in dollars if GPUs cost more than the alternative hardware.
Scaling Laws and Optimal Allocation
Chinchilla Optimal Compute Allocation
Statement
Hoffmann et al. (2022) found empirically that for a fixed compute budget $C$, the loss is minimized when parameters and training tokens are scaled in equal proportion:
$$N_{\mathrm{opt}} \propto C^{a}, \qquad D_{\mathrm{opt}} \propto C^{b}$$
The Chinchilla scaling law gives $a \approx 0.5$ and $b \approx 0.5$. For the specific constants fitted, this means roughly 20 tokens per parameter is optimal.
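Under the 20-tokens-per-parameter heuristic, the split of a budget follows from $C = 6N(20N)$, so $N = \sqrt{C/120}$. A small sketch (the helper name is illustrative):

```python
import math

def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a compute budget C ~= 6*N*D under the D ~= 20*N heuristic.
    Solving C = 6 * N * (20 * N) gives N = sqrt(C / 120)."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla's own budget (6 * 70e9 * 1.4e12 ~= 5.88e23 FLOPs)
n, d = chinchilla_allocation(5.88e23)
print(f"N ~ {n:.2e} params, D ~ {d:.2e} tokens")  # ~70B params, ~1.4T tokens
```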
Intuition
If your model is too large for the data, you waste compute on parameters that never see enough tokens to be properly trained. If your data is too large for the model, you waste compute pushing tokens through a model whose capacity has already saturated. The optimal point balances model capacity against data quantity.
Proof Sketch
Fit a parametric loss function $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$ to training runs at many scales. Minimize $L(N, D)$ subject to $C = 6ND$. Substituting $D = C/(6N)$ and setting the derivative to zero (equivalently, the Lagrangian condition) gives $N_{\mathrm{opt}} \propto C^{\beta/(\alpha+\beta)}$ and $D_{\mathrm{opt}} \propto C^{\alpha/(\alpha+\beta)}$. Hoffmann et al. found $\alpha \approx 0.34$ and $\beta \approx 0.28$, giving exponents close to 0.5 for both.
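The exponent algebra can be checked numerically. The $\alpha$, $\beta$ values below are those reported by Hoffmann et al. (2022); everything else is a sketch.

```python
# Optimal-allocation exponents from a fitted loss L(N, D) = E + A/N^a + B/D^b.
# Substituting D = C/(6N) and setting dL/dN = 0 gives N_opt ~ C^(b/(a+b)).
alpha, beta = 0.34, 0.28        # fitted values from Hoffmann et al. (2022)
exp_N = beta / (alpha + beta)   # exponent for N_opt
exp_D = alpha / (alpha + beta)  # exponent for D_opt
print(exp_N, exp_D)  # ~0.45 and ~0.55: near-equal scaling of N and D
```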
Why It Matters
Before Chinchilla, the prevailing approach (Kaplan et al. 2020) suggested making models as large as possible and training them on relatively few tokens. Chinchilla showed this was suboptimal: GPT-3 (175B parameters, 300B tokens) and Gopher (280B parameters, 300B tokens) used far more parameters than necessary. Chinchilla (70B parameters, 1.4T tokens) was trained with the same compute budget as Gopher and outperformed it, at 4x lower inference cost per token. This shifted the field toward training smaller models on more data.
Failure Mode
Chinchilla scaling laws are fitted to specific model families (dense transformers) and training setups. They may not apply to: mixture-of-experts models, models trained with curriculum learning, models fine-tuned for downstream tasks, or models where inference cost (not training cost) is the binding constraint. When inference dominates total cost (which it does for widely deployed models), training a larger model on less data and then distilling may be more efficient overall.
Efficient Alternatives
Knowledge distillation: train a small student model to mimic the outputs of a large teacher model. Hinton et al. (2015) showed that the teacher's soft predictions (probability distributions over classes) transfer more information than hard labels alone. The student can match 90%+ of the teacher's accuracy at a fraction of the parameter count.
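A minimal numpy sketch of the soft-target loss: the $T^2$ scaling follows Hinton et al. (2015), but the function names are ours and a real training loop would combine this with a hard-label term.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.
    Scaled by T^2 so gradient magnitudes stay comparable across temperatures."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return temperature**2 * kl.mean()

student = np.array([[1.0, 0.5, -0.2]])
teacher = np.array([[2.0, 1.0, -1.0]])
print(distillation_loss(student, teacher))
```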
Pruning: remove redundant weights or neurons after training. Structured pruning (removing entire channels or attention heads) gives real speedups on standard hardware; unstructured pruning (zeroing individual weights) requires sparse matrix support.
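A toy sketch of magnitude-based structured pruning, assuming rows of a weight matrix correspond to output channels; the name and the L2-norm criterion are illustrative choices, not the only ones used in practice.

```python
import numpy as np

def prune_channels(weight: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Structured pruning sketch: drop output channels (rows) with the smallest
    L2 norm, yielding a genuinely smaller matrix -- a real speedup, unlike
    zeroing individual weights, which needs sparse kernels to pay off."""
    norms = np.linalg.norm(weight, axis=1)
    n_keep = max(1, int(round(keep_fraction * weight.shape[0])))
    keep = np.sort(np.argsort(norms)[-n_keep:])  # largest-norm rows, original order
    return weight[keep]

w = np.array([[0.1, 0.0], [3.0, 4.0], [0.2, 0.1], [1.0, 1.0]])
print(prune_channels(w, 0.5))  # keeps the two highest-norm rows
```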
Quantization: reduce weight precision from 32-bit to 16-bit, 8-bit, or 4-bit (see computer architecture for ML for hardware implications). Post-training quantization (PTQ) requires no retraining. Quantization-aware training (QAT) maintains accuracy better but costs more. 4-bit quantization reduces model size by 8x with modest accuracy loss for LLMs.
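A minimal sketch of symmetric per-tensor int8 PTQ; production schemes typically use per-channel scales, calibration data, and handle outlier channels separately.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric post-training quantization: map floats to int8 with a single
    per-tensor scale chosen so the largest weight maps to 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"max abs error: {err:.4f}")  # rounding error is at most scale/2
```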
Architecture efficiency: mixture-of-experts (MoE) models activate only a fraction of parameters per token. A 1T-parameter MoE model might use only 100B parameters per forward pass, giving the capacity of a large model at the inference cost of a smaller one.
Reporting Efficiency
Efficiency Reporting Principle
Statement
Schwartz et al. (2020, "Green AI") argue that AI papers should report efficiency alongside accuracy. Specifically, every result should include:
- Total training FLOPs (or GPU-hours and hardware type)
- Training wall-clock time
- CO2 emissions estimate (or at minimum, data center location and energy source)
- Cost in dollars (approximate)
A method that achieves 1% higher accuracy at 10x the compute is not unambiguously better.
Intuition
Without efficiency reporting, the field optimizes only for accuracy, creating an arms race where only the richest labs can compete. Reporting compute cost enables Pareto-optimal comparisons: Method A beats Method B if it is better on accuracy and cheaper, or equivalent on accuracy and cheaper, or better on accuracy at equal cost.
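The dominance rule can be written out directly; this is a hypothetical helper, with cost measured in any consistent unit (FLOPs, GPU-hours, dollars).

```python
def dominates(a, b):
    """Pareto dominance for (accuracy, cost) pairs: a beats b if it is at least
    as accurate and at least as cheap, and strictly better on one of the two."""
    acc_a, cost_a = a
    acc_b, cost_b = b
    at_least_as_good = acc_a >= acc_b and cost_a <= cost_b
    strictly_better = acc_a > acc_b or cost_a < cost_b
    return at_least_as_good and strictly_better

print(dominates((0.91, 100), (0.90, 100)))   # True: better accuracy, equal cost
print(dominates((0.91, 1000), (0.90, 100)))  # False: better accuracy at 10x cost
```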
Proof Sketch
Not a mathematical proof. This is a methodological principle supported by the observation that omitting cost makes it impossible to evaluate efficiency.
Why It Matters
If a paper reports only accuracy, you cannot tell whether the improvement came from a better algorithm or simply more compute. Efficiency reporting makes this distinction visible and rewards algorithmic innovation over brute-force scaling.
Failure Mode
Efficiency reporting is hard to standardize. GPU-hours on an A100 are not comparable to GPU-hours on a V100. FLOPs do not account for memory bandwidth bottlenecks or communication overhead. Dollars depend on cloud pricing that changes monthly. Despite these limitations, approximate reporting is far better than no reporting.
Carbon Footprint
The carbon cost of training depends on three factors:
- Energy consumed (kWh) = power draw (kW) × training time (hours)
- Power Usage Effectiveness (PUE): ratio of total data center energy to compute energy. Typical values: 1.1-1.6.
- Carbon intensity of the electricity grid (gCO2/kWh): varies from ~20 gCO2/kWh (hydropower, nuclear) to ~900 gCO2/kWh (coal).
Training the same model in Quebec (hydropower, ~20 gCO2/kWh) produces roughly 40x less carbon than training it in West Virginia (coal, ~800 gCO2/kWh). Data center location is a first-order decision for carbon footprint.
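The three factors multiply; a small sketch with illustrative numbers (the function name and the 100 kW / 1000 h job are assumptions, not measurements).

```python
def training_co2_kg(power_kw: float, hours: float, pue: float,
                    grid_gco2_per_kwh: float) -> float:
    """CO2 estimate: energy (kWh) * PUE * grid carbon intensity (gCO2/kWh).
    Real estimates need measured power draw, not nameplate figures."""
    energy_kwh = power_kw * hours
    return energy_kwh * pue * grid_gco2_per_kwh / 1000.0  # grams -> kg

# Same 100 kW job for 1000 h: hydro grid (~20 g/kWh) vs coal-heavy grid (~800 g/kWh)
print(training_co2_kg(100, 1000, 1.2, 20))   # ~2,400 kg CO2
print(training_co2_kg(100, 1000, 1.2, 800))  # ~96,000 kg CO2 -- 40x more
```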
Common Confusions
FLOPs and FLOPS are different
FLOPs (lowercase s) = floating point operations (a count). FLOP/s or FLOPS = floating point operations per second (a rate). An A100 GPU has a peak throughput of ~312 TFLOP/s for FP16. Training for 10 hours at 50% utilization uses $312 \times 10^{12} \times 0.5 \times 36{,}000 \approx 5.6 \times 10^{18}$ FLOPs.
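The rate-to-count conversion, as a sketch (function name is ours):

```python
def total_flops(peak_flops_per_s: float, utilization: float, hours: float) -> float:
    """Convert a sustained rate (FLOP/s) into a total count (FLOPs)."""
    return peak_flops_per_s * utilization * hours * 3600.0

# A100 at ~312 TFLOP/s FP16 peak, 50% utilization, 10 hours
print(f"{total_flops(312e12, 0.5, 10):.2e}")  # 5.62e+18
```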
Inference cost often exceeds training cost
GPT-4 was trained once but serves millions of requests per day. Over the model's lifetime, inference FLOPs dwarf training FLOPs. Optimizing only training efficiency is insufficient; inference efficiency (smaller models, quantization, efficient serving) often matters more for total cost.
Exercises
Problem
A 7B parameter model is trained on 1.4T tokens. Estimate the total training FLOPs using the $C \approx 6ND$ approximation. If an A100 GPU achieves 150 TFLOP/s in practice, how many GPU-hours is this?
Problem
Chinchilla scaling says the optimal number of training tokens is $D_{\mathrm{opt}} \approx 20N$. A lab has a compute budget of $C$ FLOPs. What are the optimal model size and number of training tokens in terms of $C$? If the lab instead trains a model 4x larger on correspondingly fewer tokens (same budget), what is the efficiency loss?
References
Canonical:
- Strubell, Ganesh, McCallum, "Energy and Policy Considerations for Deep Learning in NLP", ACL 2019
- Schwartz et al., "Green AI", Communications of the ACM 2020
Current:
- Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla), NeurIPS 2022
- Kaplan et al., "Scaling Laws for Neural Language Models", 2020
- Hinton, Vinyals, Dean, "Distilling the Knowledge in a Neural Network", NeurIPS Workshop 2015
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Chapters 7-8
Last reviewed: April 2026