

Bits, Nats, Perplexity, and BPB

The four units people use to measure language model quality, how they relate to each other, when to use each one, and how mixing them up leads to wrong conclusions.


Why This Matters

Language model papers report performance using at least four different units: cross-entropy in nats, cross-entropy in bits, perplexity, and bits-per-byte (BPB). These measure the same underlying quantity (how well the model predicts the next token) but on different scales, with different conventions, and people mix them up constantly.

If you cannot convert between these units fluently, you will miscompare models, misinterpret scaling laws, and misunderstand benchmark results. This page gives you the exact conversions and tells you which unit to use when.

The Four Units

Definition

Cross-Entropy Loss (Nats)

The standard training loss for language models. For a model $q$ predicting tokens from a true distribution $p$ over vocabulary $V$:

$$H_{\text{nats}}(p, q) = -\sum_{v \in V} p(v) \log q(v)$$

where $\log$ is the natural logarithm (base $e$). This is what PyTorch's CrossEntropyLoss returns. The unit is nats (natural units of information).

A perfect model achieves $H_{\text{nats}} = H(p)$, the entropy of the true distribution. For natural language, this is typically 2 to 5 nats per token depending on the tokenizer and domain.
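As a sanity check, the nats loss can be computed in a few lines of plain Python. This is a minimal sketch: the logits and vocabulary size are invented for illustration, and a real training loop would call torch.nn.CrossEntropyLoss directly rather than reimplementing it.

```python
import math

def cross_entropy_nats(logits, target_index):
    """Cross-entropy in nats for a single next-token prediction:
    -log softmax(logits)[target_index], the same per-example quantity
    PyTorch's CrossEntropyLoss computes (here in plain Python)."""
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]

# A toy 4-token vocabulary where the model strongly favors the target token.
loss = cross_entropy_nats([2.0, 0.5, 0.1, -1.0], target_index=0)
print(round(loss, 4))
```

Because the model puts most of its probability on the correct token, the loss is well below $\ln 4 \approx 1.39$ nats, the uniform-guessing baseline for four tokens.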

Definition

Cross-Entropy Loss (Bits)

The same cross-entropy, but using $\log_2$ instead of $\ln$:

$$H_{\text{bits}} = \frac{H_{\text{nats}}}{\ln 2} \approx \frac{H_{\text{nats}}}{0.6931}$$

The unit is bits (binary digits of information). One nat ≈ 1.4427 bits; one bit ≈ 0.6931 nats. Information theory traditionally uses bits; machine learning traditionally uses nats.

Definition

Perplexity

The exponential of the cross-entropy:

$$\text{PPL} = e^{H_{\text{nats}}} = 2^{H_{\text{bits}}}$$

Perplexity has an intuitive interpretation: a perplexity of $k$ means the model is "as confused as if it were choosing uniformly among $k$ equally likely tokens at each step."

PPL = 1 means perfect prediction. PPL = $|V|$ means uniform random guessing over the vocabulary. For modern language models on standard benchmarks, PPL is typically 5 to 30.
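A quick way to verify the "effective vocabulary size" reading is to check that uniform guessing over $k$ tokens gives perplexity exactly $k$. A sketch; the GPT-2 vocabulary size appears only as a familiar number:

```python
import math

# Uniform guessing over a vocabulary of size k has cross-entropy
# ln(k) nats per token, so its perplexity is exactly k -- the
# "effective vocabulary size" interpretation of perplexity.
k = 50257  # GPT-2's vocabulary size, used purely for illustration
uniform_nats = math.log(k)
ppl = math.exp(uniform_nats)
print(ppl)  # recovers k, up to floating-point rounding
```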

Definition

Bits Per Byte (BPB)

Cross-entropy in bits, normalized by the number of UTF-8 bytes rather than the number of tokens:

$$\text{BPB} = \frac{H_{\text{bits}} \times (\text{number of tokens})}{\text{number of bytes in text}}$$

BPB is tokenizer-independent: two models with different tokenizers can be compared directly because the normalization is by bytes, not tokens.

Typical values: 0.5 to 1.0 BPB for strong language models on English text. Shannon's estimate of English entropy is roughly 0.8 to 1.3 bits per character.
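The definition translates directly into code. A sketch; the token and byte counts below are invented for illustration:

```python
def bits_per_byte(bits_per_token, n_tokens, n_bytes):
    """Total bits of cross-entropy divided by total UTF-8 bytes."""
    return bits_per_token * n_tokens / n_bytes

# Illustrative numbers: 4.9 bits/token on text the tokenizer splits
# into 1,000 tokens spanning 3,800 UTF-8 bytes.
bpb = bits_per_byte(4.9, n_tokens=1_000, n_bytes=3_800)
print(round(bpb, 3))
```

Note that only the ratio tokens/bytes matters, which is exactly why the tokenizer's compression ratio must be known to convert between per-token and per-byte units.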

Conversion Table

Proposition

Unit Conversion Identities

Statement

| From | To | Formula |
| --- | --- | --- |
| Nats | Bits | $\text{bits} = \text{nats} / \ln 2 \approx \text{nats} \times 1.4427$ |
| Bits | Nats | $\text{nats} = \text{bits} \times \ln 2 \approx \text{bits} \times 0.6931$ |
| Nats | Perplexity | $\text{PPL} = e^{\text{nats}}$ |
| Bits | Perplexity | $\text{PPL} = 2^{\text{bits}}$ |
| Perplexity | Nats | $\text{nats} = \ln(\text{PPL})$ |
| Perplexity | Bits | $\text{bits} = \log_2(\text{PPL})$ |
| Bits (per token) | BPB | $\text{BPB} = \text{bits} \times (\text{tokens} / \text{bytes})$ |

Intuition

Nats and bits are the same quantity on different logarithmic scales (base $e$ vs base 2). Perplexity is the exponential transformation that converts from a logarithmic scale to a linear "effective vocabulary size" scale. BPB is a normalization that removes the tokenizer from the equation.
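The identities above fit in a handful of one-line helpers (a sketch, not a library API):

```python
import math

LN2 = math.log(2)

def nats_to_bits(nats): return nats / LN2
def bits_to_nats(bits): return bits * LN2
def nats_to_ppl(nats):  return math.exp(nats)
def ppl_to_nats(ppl):   return math.log(ppl)
def bits_to_ppl(bits):  return 2.0 ** bits
def ppl_to_bits(ppl):   return math.log2(ppl)

# All paths through the identities agree: e^nats == 2^(nats / ln 2).
assert abs(nats_to_ppl(3.2) - bits_to_ppl(nats_to_bits(3.2))) < 1e-9
```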

Why It Matters

Without these conversions, you cannot compare results across papers. A cross-entropy of 3.2 nats is the same measurement as 4.6 bits or a perplexity of 24.5, so a paper reporting "perplexity 25" and another reporting "3.2 nats per token" are describing essentially the same result.

Failure Mode

The conversion between bits-per-token and BPB depends on the tokenizer's compression ratio (tokens per byte). This ratio varies across tokenizers: GPT-2's BPE produces roughly 0.25 tokens per byte on English text; SentencePiece may differ. You cannot convert between bits-per-token and BPB without knowing the tokenizer.

Worked Example

A GPT-2 model achieves a cross-entropy of 3.4 nats per token on WikiText-103. GPT-2's tokenizer averages about 3.8 bytes per token on this dataset.

| Metric | Value | Calculation |
| --- | --- | --- |
| Nats per token | 3.4 | (given) |
| Bits per token | 4.90 | $3.4 / 0.6931 = 4.90$ |
| Perplexity | 29.96 | $e^{3.4} = 29.96$ |
| BPB | 1.29 | $4.90 / 3.8 = 1.29$ |

If a different model with a different tokenizer (averaging 4.2 bytes per token) achieves 3.1 nats per token, you cannot directly compare their nats or perplexity (different tokenizations). But you can compare BPB: $3.1 / 0.6931 / 4.2 = 1.06$ BPB. The second model is better (lower BPB = better compression).
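The worked example can be checked mechanically. The bytes-per-token figures are the ones assumed above:

```python
import math

# Model A: 3.4 nats/token, tokenizer averaging 3.8 bytes/token.
nats_a = 3.4
bits_a = nats_a / math.log(2)   # ~4.90 bits per token
ppl_a = math.exp(nats_a)        # ~29.96
bpb_a = bits_a / 3.8            # ~1.29

# Model B: different tokenizer (4.2 bytes/token), so only BPB
# is directly comparable between the two models.
bpb_b = (3.1 / math.log(2)) / 4.2  # ~1.06

assert bpb_b < bpb_a  # model B compresses the text better
```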

When to Use Each Unit

| Unit | Use when | Avoid when |
| --- | --- | --- |
| Nats | Training (it is the raw loss), internal monitoring | Comparing across tokenizers |
| Bits | Information theory context, compression discussion | When your audience expects nats |
| Perplexity | Paper reporting, intuitive communication | Averaging across datasets (arithmetic mean of PPL is misleading; average the log-PPL instead) |
| BPB | Cross-tokenizer comparison, scaling law analysis | When byte-level normalization obscures token-level behavior |

Common Confusions

Watch Out

You cannot average perplexities

If model A has PPL 20 on dataset 1 and PPL 40 on dataset 2, the "average perplexity" is not 30. Perplexity is an exponential quantity. The correct aggregation is: average the log-perplexities (cross-entropies), then exponentiate. The geometric mean of perplexities equals the exponential of the arithmetic mean of the cross-entropies: $\text{PPL}_{\text{avg}} = \exp\left(\frac{1}{2}(\ln 20 + \ln 40)\right) = \exp(3.34) \approx 28.3$.
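The wrong and right aggregations, side by side (a sketch using the two perplexities above):

```python
import math

ppls = [20.0, 40.0]

# Wrong: arithmetic mean of perplexities overstates the aggregate.
wrong = sum(ppls) / len(ppls)  # 30.0

# Right: average the cross-entropies (log-perplexities), then
# exponentiate -- equivalently, the geometric mean of the PPLs.
right = math.exp(sum(math.log(p) for p in ppls) / len(ppls))  # ~28.28
```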

Watch Out

Lower perplexity does not always mean a better model

Perplexity measures how well the model predicts the specific test set. A model with lower perplexity on news articles may have higher perplexity on code. Perplexity is domain-specific. Also, a model can achieve low perplexity by being overconfident on easy tokens and terrible on hard tokens. Calibration is a separate and important property.

Watch Out

BPB depends on the text encoding

BPB normalizes by UTF-8 bytes. If your text is mostly ASCII (1 byte per character), BPB is close to bits-per-character. For text with many non-ASCII characters (Chinese, Arabic, emoji), UTF-8 uses 2 to 4 bytes per character, changing the BPB number without changing model quality. Always note the character set when comparing BPB.
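A quick check of how UTF-8 byte counts diverge from character counts (a sketch; the sample strings are chosen purely for illustration):

```python
samples = {
    "ascii": "hello world",  # 1 byte per character in UTF-8
    "cjk": "你好世界",         # 3 bytes per character in UTF-8
}
for name, text in samples.items():
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    print(name, n_chars, n_bytes)
# The ASCII string has as many bytes as characters; the CJK string
# has 3x as many. Identical model quality can therefore yield very
# different BPB numbers depending on the script of the test text.
```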

Exercises

ExerciseCore

Problem

A language model achieves a cross-entropy of 2.1 nats per token. Convert this to bits per token and perplexity. If the tokenizer averages 4.0 bytes per token, what is the BPB?

ExerciseCore

Problem

Paper A reports PPL = 15.2 on WikiText-103 using a BPE tokenizer. Paper B reports PPL = 18.7 on WikiText-103 using a unigram tokenizer. Can you conclude that model A is better? What additional information do you need?

References

Canonical:

  • Cover & Thomas, Elements of Information Theory (2006), Chapter 2
  • Jelinek, Statistical Methods for Speech Recognition (1997). Early use of perplexity.

Current:

  • Gao et al., "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" (2020). Uses BPB for cross-tokenizer comparison.
  • Hoffmann et al., "Training Compute-Optimal Large Language Models" (2022). Chinchilla reports both nats and BPB.

Last reviewed: April 2026
