LLM Construction
Bits, Nats, Perplexity, and BPB
The four units people use to measure language model quality, how they relate to each other, when to use each one, and how mixing them up leads to wrong conclusions.
Why This Matters
Language model papers report performance using at least four different units: cross-entropy in nats, cross-entropy in bits, perplexity, and bits-per-byte (BPB). These measure the same underlying quantity (how well the model predicts the next token) but on different scales, with different conventions, and people mix them up constantly.
If you cannot convert between these units fluently, you will miscompare models, misinterpret scaling laws, and misunderstand benchmark results. This page gives you the exact conversions and tells you which unit to use when.
The Four Units
Cross-Entropy Loss (Nats)
The standard training loss for language models. For a model $q$ predicting tokens drawn from a true distribution $p$ over vocabulary $V$:

$$L_{\text{nats}} = H(p, q) = -\sum_{x \in V} p(x) \ln q(x)$$

where $\ln$ is the natural logarithm (base $e$). In practice $p$ is the one-hot empirical distribution of the observed tokens, so this reduces to the average negative log-likelihood of the next token. This is what PyTorch's CrossEntropyLoss returns. The unit is nats (natural units of information).

A perfect model achieves $L_{\text{nats}} = H(p)$, the entropy of the true distribution. For natural language, this is typically 2 to 5 nats per token depending on the tokenizer and domain.
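As a minimal illustration (pure Python, no PyTorch), the per-token loss with a one-hot empirical distribution is just the average negative log of the probability the model assigned to each observed token. The probabilities below are made up for the example:

```python
import math

def cross_entropy_nats(token_probs):
    """Average negative log-likelihood (base e) of the probabilities
    the model assigned to each observed next token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# A toy 4-token sequence: the model's probability for each true token.
probs = [0.5, 0.1, 0.25, 0.05]
loss = cross_entropy_nats(probs)  # ≈ 1.84 nats per token
```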
Cross-Entropy Loss (Bits)
The same cross-entropy, but using $\log_2$ instead of $\ln$:

$$L_{\text{bits}} = \frac{L_{\text{nats}}}{\ln 2}$$

The unit is bits (binary digits of information). One nat $\approx$ 1.4427 bits. One bit $\approx$ 0.6931 nats. Information theory traditionally uses bits; machine learning traditionally uses nats.
Perplexity
The exponential of the cross-entropy:

$$\text{PPL} = e^{L_{\text{nats}}} = 2^{L_{\text{bits}}}$$

Perplexity has an intuitive interpretation: a perplexity of $k$ means the model is "as confused as if it were choosing uniformly among $k$ equally likely tokens at each step."

PPL = 1 means perfect prediction. PPL = $|V|$ means uniform random guessing over the vocabulary. For modern language models on standard benchmarks, PPL is typically 5 to 30.
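The uniform-guessing claim can be checked directly: a uniform distribution over $|V|$ tokens has loss $\ln |V|$ nats, and exponentiating recovers the vocabulary size. A small sketch using GPT-2's vocabulary size:

```python
import math

def perplexity(nats_per_token):
    """Perplexity is the exponential of the cross-entropy in nats."""
    return math.exp(nats_per_token)

# Uniform guessing over a vocabulary of size V gives loss ln(V) nats,
# so the perplexity is exactly V.
V = 50257  # GPT-2's vocabulary size
uniform_loss = math.log(V)
ppl = perplexity(uniform_loss)  # ≈ 50257
```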
Bits Per Byte (BPB)
Cross-entropy in bits, normalized by the number of UTF-8 bytes rather than the number of tokens:

$$\text{BPB} = L_{\text{bits per token}} \times \frac{N_{\text{tokens}}}{N_{\text{bytes}}}$$
BPB is tokenizer-independent: two models with different tokenizers can be compared directly because the normalization is by bytes, not tokens.
Typical values: 0.5 to 1.0 BPB for strong language models on English text. Shannon's estimate of English entropy is roughly 0.8 to 1.3 bits per character.
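A minimal conversion helper, plus the byte count it depends on. Note the byte count comes from the raw text, not from the tokenizer:

```python
import math

def bits_per_byte(nats_per_token, bytes_per_token):
    """Convert per-token cross-entropy in nats to bits per UTF-8 byte."""
    bits_per_token = nats_per_token / math.log(2)
    return bits_per_token / bytes_per_token

# Byte counts are a property of the text itself:
text = "hello world"
n_bytes = len(text.encode("utf-8"))  # 11 bytes for this ASCII string
```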
Conversion Table
Unit Conversion Identities
| From | To | Formula |
|---|---|---|
| Nats | Bits | $L_{\text{bits}} = L_{\text{nats}} / \ln 2 \approx 1.4427 \times L_{\text{nats}}$ |
| Bits | Nats | $L_{\text{nats}} = L_{\text{bits}} \times \ln 2 \approx 0.6931 \times L_{\text{bits}}$ |
| Nats | Perplexity | $\text{PPL} = e^{L_{\text{nats}}}$ |
| Bits | Perplexity | $\text{PPL} = 2^{L_{\text{bits}}}$ |
| Perplexity | Nats | $L_{\text{nats}} = \ln \text{PPL}$ |
| Perplexity | Bits | $L_{\text{bits}} = \log_2 \text{PPL}$ |
| Bits (per token) | BPB | $\text{BPB} = L_{\text{bits}} \times N_{\text{tokens}} / N_{\text{bytes}}$ |
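The identities in the table are mechanical enough to write down as one-liners; a sketch with a round-trip consistency check:

```python
import math

LN2 = math.log(2)

def nats_to_bits(nats): return nats / LN2
def bits_to_nats(bits): return bits * LN2
def nats_to_ppl(nats): return math.exp(nats)
def bits_to_ppl(bits): return 2.0 ** bits
def ppl_to_nats(ppl): return math.log(ppl)
def ppl_to_bits(ppl): return math.log2(ppl)
def bits_per_token_to_bpb(bits_per_token, bytes_per_token):
    return bits_per_token / bytes_per_token

# Round trip: nats → PPL → bits must agree with nats → bits directly.
check = abs(ppl_to_bits(nats_to_ppl(3.2)) - nats_to_bits(3.2))
```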
Intuition
Nats and bits are the same quantity on different logarithmic scales (base $e$ vs base 2). Perplexity is the exponential transformation that converts from a logarithmic scale to a linear "effective vocabulary size" scale. BPB is a normalization that removes the tokenizer from the equation.
Why It Matters
Without these conversions, you cannot compare results across papers. A model reporting 3.2 nats is the same as 4.6 bits is the same as PPL 24.5. If you see one paper reporting "perplexity 25" and another reporting "3.2 nats per token," these are the same result.
Failure Mode
The conversion between bits-per-token and BPB depends on the tokenizer's compression ratio (tokens per byte). This ratio varies across tokenizers: GPT-2's BPE produces roughly 0.25 tokens per byte on English text; SentencePiece may differ. You cannot convert between bits-per-token and BPB without knowing the tokenizer.
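To make the dependence concrete: hold model quality fixed at 1.0 BPB and vary only the tokenizer's compression ratio, and the bits-per-token number changes with it. The byte-per-token figures below are hypothetical:

```python
# Same byte-level quality, two hypothetical tokenizers with different
# compression ratios → different per-token numbers.
bpb = 1.0
bytes_per_token_a = 4.0   # coarse BPE (hypothetical)
bytes_per_token_b = 2.5   # finer-grained tokenizer (hypothetical)
bits_per_token_a = bpb * bytes_per_token_a  # 4.0 bits per token
bits_per_token_b = bpb * bytes_per_token_b  # 2.5 bits per token
```

Comparing 4.0 vs 2.5 bits per token here would wrongly suggest the second model is far better, when both compress the underlying text equally well.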
Worked Example
A GPT-2 model achieves a cross-entropy of 3.4 nats per token on WikiText-103. GPT-2's tokenizer averages about 3.8 bytes per token on this dataset.
| Metric | Value | Calculation |
|---|---|---|
| Nats per token | 3.4 | (given) |
| Bits per token | 4.90 | $3.4 / \ln 2 = 3.4 \times 1.4427$ |
| Perplexity | 29.96 | $e^{3.4}$ |
| BPB | 1.29 | $4.90 / 3.8$ |
If a different model with a different tokenizer (averaging 4.2 bytes per token) achieves 3.1 nats per token, you cannot directly compare their nats or perplexity (different tokenizations). But you can compare BPB: $3.1 / \ln 2 = 4.47$ bits per token, and $4.47 / 4.2 \approx 1.06$ BPB versus 1.29 BPB. The second model is better (lower BPB = better compression).
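The whole worked example fits in a few lines of arithmetic:

```python
import math

# Model 1: GPT-2 on WikiText-103 (figures from the worked example above).
nats = 3.4
bytes_per_token = 3.8
bits = nats / math.log(2)      # ≈ 4.90 bits per token
ppl = math.exp(nats)           # ≈ 29.96
bpb = bits / bytes_per_token   # ≈ 1.29

# Model 2: 3.1 nats per token, 4.2 bytes per token.
bpb2 = (3.1 / math.log(2)) / 4.2  # ≈ 1.06 → lower BPB, better model
```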
When to Use Each Unit
| Unit | Use when | Avoid when |
|---|---|---|
| Nats | Training (it is the raw loss), internal monitoring | Comparing across tokenizers |
| Bits | Information theory context, compression discussion | When your audience expects nats |
| Perplexity | Paper reporting, intuitive communication | Averaging across datasets (arithmetic mean of PPL is misleading; average the log-PPL instead) |
| BPB | Cross-tokenizer comparison, scaling law analysis | When byte-level normalization obscures token-level behavior |
Common Confusions
You cannot average perplexities
If model A has PPL 20 on dataset 1 and PPL 40 on dataset 2, the "average perplexity" is not 30. Perplexity is an exponential quantity. The correct aggregation is: average the log-perplexities (cross-entropies), then exponentiate. The geometric mean of perplexities equals the exponential of the arithmetic mean of cross-entropies: $\left(\prod_{i=1}^{n} \text{PPL}_i\right)^{1/n} = \exp\left(\frac{1}{n}\sum_{i=1}^{n} L_i\right)$. In the example, the correct aggregate is $\sqrt{20 \times 40} \approx 28.3$, not 30.
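The two aggregations side by side (assuming, for simplicity, equal token counts in both datasets; otherwise weight by tokens):

```python
import math

ppls = [20.0, 40.0]

# Wrong: arithmetic mean of an exponential quantity.
wrong = sum(ppls) / len(ppls)  # 30.0

# Right: average the cross-entropies, then exponentiate
# (equivalently, the geometric mean of the perplexities).
logs = [math.log(p) for p in ppls]
right = math.exp(sum(logs) / len(logs))  # ≈ 28.28
```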
Lower perplexity does not always mean a better model
Perplexity measures how well the model predicts the specific test set. A model with lower perplexity on news articles may have higher perplexity on code. Perplexity is domain-specific. Also, a model can achieve low perplexity by being overconfident on easy tokens and terrible on hard tokens. Calibration is a separate and important property.
BPB depends on the text encoding
BPB normalizes by UTF-8 bytes. If your text is mostly ASCII (1 byte per character), BPB is close to bits-per-character. For text with many non-ASCII characters (Chinese, Arabic, emoji), UTF-8 uses 2 to 4 bytes per character, changing the BPB number without changing model quality. Always note the character set when comparing BPB.
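Python's `str.encode` shows the effect directly; character counts and UTF-8 byte counts diverge as soon as you leave ASCII:

```python
# UTF-8 byte counts per character vary by script, so identical model
# quality yields different BPB numbers on different alphabets.
samples = {}
for s in ["hello", "你好", "🙂"]:
    samples[s] = (len(s), len(s.encode("utf-8")))
# "hello" → (5, 5): ASCII is 1 byte per character
# "你好"  → (2, 6): CJK is 3 bytes per character
# "🙂"    → (1, 4): this emoji is 4 bytes
```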
Exercises
Problem
A language model achieves a cross-entropy of 2.1 nats per token. Convert this to bits per token and perplexity. If the tokenizer averages 4.0 bytes per token, what is the BPB?
Problem
Paper A reports PPL = 15.2 on WikiText-103 using a BPE tokenizer. Paper B reports PPL = 18.7 on WikiText-103 using a unigram tokenizer. Can you conclude that model A is better? What additional information do you need?
References
Canonical:
- Cover & Thomas, Elements of Information Theory (2006), Chapter 2
- Jelinek, Statistical Methods for Speech Recognition (1997). Early use of perplexity.
Current:
- Gao et al., "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" (2020). Uses BPB for cross-tokenizer comparison.
- Hoffmann et al., "Training Compute-Optimal Large Language Models" (2022). Chinchilla reports both nats and BPB.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Information Theory Foundations (Layer 0B)
- KL Divergence (Layer 1)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)