Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Methodology

Energy Efficiency and Green AI

The compute cost of training frontier models, carbon footprint, FLOPs vs wall-clock time vs dollars, and why reporting efficiency matters. Efficient alternatives: distillation, pruning, quantization, and scaling laws for optimal compute allocation.

Why This Matters

Training GPT-4 reportedly cost over 100 million USD in compute. Strubell et al. (2019) estimated that training a single large NLP model can emit as much carbon as five cars over their lifetimes (their worst case included neural architecture search on top of a BERT-scale transformer). The trend is accelerating: each generation of frontier models uses roughly 10x more compute than the previous one.

This is not only an environmental concern. It is an access concern. If only organizations with $100M+ budgets can train frontier models, research becomes concentrated in a few labs. Understanding compute efficiency is necessary for anyone who does not have unlimited resources.

Measuring Compute

Definition

FLOPs (Floating Point Operations)

FLOPs count the total number of floating point operations (additions and multiplications) in a computation. For a dense matrix multiplication of an m × k matrix by a k × n matrix, the cost is approximately 2mkn FLOPs (each of the mn output entries requires k multiplications and k − 1 additions, for (2k − 1)mn ≈ 2mkn in total).

For a transformer forward pass on a sequence of length s with model dimension d and L layers, the cost is approximately $2L(12d^2s + 2ds^2)$ FLOPs (from the attention and FFN layers). The backward pass costs roughly 2x the forward pass.

Definition

Compute Budget

The total training compute is:

C = 6ND

where N is the number of parameters and D is the number of training tokens (for dense transformer language models). The factor of 6 comes from 2 FLOPs per parameter per token for the forward pass, plus roughly twice that (4 FLOPs) for the backward pass, which computes gradients with respect to both activations and weights.

Three distinct measurements of cost:

  1. FLOPs: hardware-independent, measures algorithmic cost
  2. Wall-clock time: depends on hardware, parallelism, communication overhead
  3. Dollars: depends on hardware prices, cloud vs. owned, electricity costs

These can diverge. A method with fewer FLOPs can be slower in wall-clock time if it is hard to parallelize. A method that is fast on GPUs may be expensive in dollars if GPUs cost more than the alternative hardware.
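
As a concrete illustration, the C = 6ND estimate converts directly into GPU-hours and an approximate dollar cost. A minimal sketch; the sustained throughput and hourly price below are illustrative assumptions, not measured or vendor-quoted figures:

```python
def training_cost(n_params, n_tokens, flops_per_sec=150e12,
                  dollars_per_gpu_hour=2.0):
    """Rough training-cost estimate from C = 6ND.

    flops_per_sec is an assumed sustained per-GPU throughput;
    dollars_per_gpu_hour is an assumed cloud price.
    """
    flops = 6 * n_params * n_tokens                      # hardware-independent
    gpu_hours = flops / flops_per_sec / 3600             # hardware-dependent
    return flops, gpu_hours, gpu_hours * dollars_per_gpu_hour

# A hypothetical 1B-parameter model trained on 20B tokens:
flops, gpu_hours, cost = training_cost(1e9, 20e9)
```

The three return values correspond to the three cost measures above: FLOPs, single-GPU wall-clock hours, and dollars.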

Scaling Laws and Optimal Allocation

Proposition

Chinchilla Optimal Compute Allocation

Statement

Hoffmann et al. (2022) found empirically that for a fixed compute budget CC, the loss is minimized when parameters NN and training tokens DD are scaled proportionally:

$N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}$

The Chinchilla scaling law gives $N_{\text{opt}} \approx N_0 \cdot (C/C_0)^{0.50}$ and $D_{\text{opt}} \approx D_0 \cdot (C/C_0)^{0.50}$. For the specific constants fitted, this means roughly 20 tokens per parameter is optimal.

Intuition

If your model is too large for the data, you waste compute on parameters that cannot be properly trained (overfitting risk, underfitting due to insufficient data). If your data is too large for the model, you waste compute on training a model that has already saturated its capacity. The optimal point balances model capacity against data quantity.

Proof Sketch

Fit a parametric loss function $L(N, D) = E + A/N^\alpha + B/D^\beta$ to training runs at many scales, then minimize L subject to $C = 6ND$. Substituting $D = C/(6N)$ and setting the derivative to zero gives $N_{\text{opt}} = G \cdot (C/6)^{\beta/(\alpha+\beta)}$ with $G = (\alpha A / \beta B)^{1/(\alpha+\beta)}$. Hoffmann et al. found $\alpha \approx \beta$, giving $N_{\text{opt}} \propto C^{0.5}$.
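
Under the fitted rule of thumb D ≈ 20N, the constraint C = 6ND collapses to C = 120N², which pins down the compute-optimal sizes in closed form. A quick sketch, assuming only the 6ND cost model and the 20 tokens-per-parameter ratio quoted above:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20):
    """Compute-optimal (N, D) assuming C = 6ND and D = tokens_per_param * N.

    Substituting the ratio into the budget gives
    C = 6 * tokens_per_param * N^2, so N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params

# e.g. a hypothetical 1e23-FLOP budget:
n, d = chinchilla_optimal(1e23)   # roughly a 29B model on ~0.58T tokens
```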

Why It Matters

Before Chinchilla, the prevailing approach (Kaplan et al. 2020) suggested making models as large as possible and training for fewer tokens. Chinchilla showed this was suboptimal: GPT-3 (175B parameters, 300B tokens) used far more parameters than necessary for its compute budget. Chinchilla (70B parameters, 1.4T tokens) outperformed the 4x larger Gopher (280B parameters) using the same training compute, at roughly 4x lower inference cost. This shifted the field toward training smaller models on more data.

Failure Mode

Chinchilla scaling laws are fitted to specific model families (dense transformers) and training setups. They may not apply to: mixture-of-experts models, models trained with curriculum learning, models fine-tuned for downstream tasks, or models where inference cost (not training cost) is the binding constraint. When inference dominates total cost (which it does for widely deployed models), training a larger model on less data and then distilling may be more efficient overall.

Efficient Alternatives

Knowledge distillation: train a small student model to mimic the outputs of a large teacher model. Hinton et al. (2015) showed that the teacher's soft predictions (probability distributions over classes) transfer more information than hard labels alone. The student can match 90%+ of the teacher's accuracy at a fraction of the parameter count.
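
The soft-target idea can be sketched in a few lines of NumPy. The temperature and blending weight below are illustrative hyperparameters, not values prescribed by the paper:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence to the teacher's
    temperature-softened distribution (the Hinton et al. 2015 recipe;
    temperature and alpha here are illustrative choices)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    # KL(teacher || student); the T^2 factor keeps gradient scales comparable
    soft = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(alpha * hard + (1 - alpha) * temperature**2 * soft))
```

The KL term vanishes when the student exactly matches the teacher, leaving only the ordinary cross-entropy on hard labels.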

Pruning: remove redundant weights or neurons after training. Structured pruning (removing entire channels or attention heads) gives real speedups on standard hardware; unstructured pruning (zeroing individual weights) requires sparse matrix support.

Quantization: reduce weight precision from 32-bit to 16-bit, 8-bit, or 4-bit (see computer architecture for ML for hardware implications). Post-training quantization (PTQ) requires no retraining. Quantization-aware training (QAT) maintains accuracy better but costs more. 4-bit quantization reduces model size by 8x with modest accuracy loss for LLMs.
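
A minimal sketch of symmetric per-tensor post-training quantization to int8. Production PTQ pipelines use per-channel scales and calibration data; this shows only the core idea:

```python
import numpy as np

def quantize_int8(w):
    """Map float weights to int8 with a single symmetric scale."""
    scale = max(np.abs(w).max() / 127.0, 1e-12)   # guard against all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
error = np.abs(w - dequantize(q, scale)).max()    # rounding error <= scale / 2
```

Int8 storage is 4x smaller than float32; the 4-bit schemes mentioned above push this to 8x at the cost of a coarser grid.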

Architecture efficiency: mixture-of-experts (MoE) models activate only a fraction of parameters per token. A 1T-parameter MoE model might use only 100B parameters per forward pass, giving the capacity of a large model at the inference cost of a smaller one.

Reporting Efficiency

Proposition

Efficiency Reporting Principle

Statement

Schwartz et al. (2020, "Green AI") argue that AI papers should report efficiency alongside accuracy. Specifically, every result should include:

  1. Total training FLOPs (or GPU-hours and hardware type)
  2. Training wall-clock time
  3. CO2 emissions estimate (or at minimum, data center location and energy source)
  4. Cost in dollars (approximate)

A method that achieves 1% higher accuracy at 10x the compute is not unambiguously better.

Intuition

Without efficiency reporting, the field optimizes only for accuracy, creating an arms race where only the richest labs can compete. Reporting compute cost enables Pareto-optimal comparisons: Method A beats Method B if it is better on accuracy and cheaper, or equivalent on accuracy and cheaper, or better on accuracy at equal cost.
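
The Pareto comparison can be made precise with a small helper; the (accuracy, cost) tuples below are hypothetical:

```python
def dominates(a, b):
    """a = (accuracy, cost). a dominates b if it is no worse on both axes
    and strictly better on at least one."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

def pareto_front(methods):
    """Keep only methods not dominated by any other method."""
    return [m for m in methods
            if not any(dominates(o, m) for o in methods if o is not m)]

# Hypothetical (accuracy, training-FLOPs) results: none dominates another,
# so all three sit on the accuracy-vs-compute frontier.
methods = [(0.91, 1e21), (0.92, 1e22), (0.90, 5e20)]
front = pareto_front(methods)
```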

Proof Sketch

Not a mathematical proof. This is a methodological principle supported by the observation that omitting cost makes it impossible to evaluate efficiency.

Why It Matters

If a paper reports only accuracy, you cannot tell whether the improvement came from a better algorithm or simply more compute. Efficiency reporting makes this distinction visible and rewards algorithmic innovation over brute-force scaling.

Failure Mode

Efficiency reporting is hard to standardize. GPU-hours on an A100 are not comparable to GPU-hours on a V100. FLOPs do not account for memory bandwidth bottlenecks or communication overhead. Dollars depend on cloud pricing that changes monthly. Despite these limitations, approximate reporting is far better than no reporting.

Carbon Footprint

The carbon cost of training depends on three factors:

  1. Energy consumed (kWh) = power draw (kW) × training time (hours)
  2. Power Usage Effectiveness (PUE): ratio of total data center energy to compute energy. Typical values: 1.1-1.6.
  3. Carbon intensity of the electricity grid (gCO2/kWh): varies from ~20 gCO2/kWh (hydropower, nuclear) to ~900 gCO2/kWh (coal).

CO2 = Energy × PUE × Carbon Intensity

Training the same model in Quebec (hydropower, ~20 gCO2/kWh) produces roughly 40x less carbon than training it in West Virginia (coal, ~800 gCO2/kWh). Data center location is a first-order decision for carbon footprint.
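
The formula above in code, reproducing the clean-grid vs coal-grid comparison. The power draw and training time are hypothetical; the PUE and grid intensities use the ranges quoted above:

```python
def training_co2_kg(power_kw, hours, pue=1.2, grid_gco2_per_kwh=400):
    """CO2 (kg) = energy (kWh) * PUE * grid carbon intensity (gCO2/kWh) / 1000."""
    return power_kw * hours * pue * grid_gco2_per_kwh / 1000.0

# Same hypothetical run (300 kW for 1000 hours) on two grids:
hydro = training_co2_kg(300, 1000, grid_gco2_per_kwh=20)    # ~7.2 tonnes
coal = training_co2_kg(300, 1000, grid_gco2_per_kwh=800)    # ~288 tonnes, 40x more
```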

Common Confusions

Watch Out

FLOPs and FLOPS are different

FLOPs (lowercase s) = floating point operations (a count). FLOP/s or FLOPS = floating point operations per second (a rate). An A100 GPU has a peak throughput of ~312 TFLOP/s for FP16. Training for 10 hours at 50% utilization uses $312 \times 10^{12} \times 0.5 \times 3600 \times 10 \approx 5.6 \times 10^{18}$ FLOPs.

Watch Out

Inference cost often exceeds training cost

GPT-4 was trained once but serves millions of requests per day. Over the model's lifetime, inference FLOPs dwarf training FLOPs. Optimizing only training efficiency is insufficient; inference efficiency (smaller models, quantization, efficient serving) often matters more for total cost.

Exercises

ExerciseCore

Problem

A 7B parameter model is trained on 1.4T tokens. Estimate the total training FLOPs using the C = 6ND approximation. If an A100 GPU achieves 150 TFLOP/s in practice, how many GPU-hours is this?

ExerciseAdvanced

Problem

Chinchilla scaling says the optimal number of training tokens is $D_{\text{opt}} \approx 20N$. A lab has a budget of $C = 10^{24}$ FLOPs. What are the optimal model size and number of training tokens? If the lab instead trains a model 4x larger on fewer tokens, what is the efficiency loss?

References

Canonical:

  • Strubell, Ganesh, McCallum, "Energy and Policy Considerations for Deep Learning in NLP", ACL 2019
  • Schwartz et al., "Green AI", Communications of the ACM 2020

Current:

  • Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla), NeurIPS 2022

  • Kaplan et al., "Scaling Laws for Neural Language Models", 2020

  • Hinton, Vinyals, Dean, "Distilling the Knowledge in a Neural Network", NeurIPS Workshop 2015


Last reviewed: April 2026