LLM Construction
Bits, Nats, Perplexity, and BPB
The four units people use to measure language model quality, how they relate to each other, when to use each one, and how mixing them up leads to wrong conclusions.
Why This Matters
Language model papers report performance using at least four different units: cross-entropy in nats, cross-entropy in bits, perplexity, and bits-per-byte (BPB). These measure the same underlying quantity (how well the model predicts the next token) but on different scales, with different conventions, and people mix them up constantly.
If you cannot convert between these units fluently, you will miscompare models, misinterpret scaling laws, and misunderstand benchmark results. This page gives you the exact conversions and tells you which unit to use when.
The Four Units
Cross-Entropy Loss (Nats)
The standard training loss for language models. For a model $q$ predicting tokens drawn from a true distribution $p$ over vocabulary $V$:

$$L_{\text{nats}} = H(p, q) = -\sum_{x \in V} p(x) \ln q(x)$$

where $\ln$ is the natural logarithm (base $e$). In practice $p$ is the one-hot empirical distribution of the observed tokens, so this reduces to the average negative log-likelihood of the next token. This is what PyTorch's CrossEntropyLoss returns. The unit is nats (natural units of information).

A perfect model achieves $L_{\text{nats}} = H(p)$, the entropy of the true distribution. For natural language, this is typically 2 to 5 nats per token depending on the tokenizer and domain.
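As a minimal illustration (pure Python, no PyTorch), the per-token loss with a one-hot empirical distribution is just the average negative log of the probability the model assigned to each observed token. The probabilities below are made up for the example:

```python
import math

def cross_entropy_nats(token_probs):
    """Average negative log-likelihood (base e) of the probabilities
    the model assigned to each observed next token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# A toy 4-token sequence: the model's probability for each true token.
probs = [0.5, 0.1, 0.25, 0.05]
loss = cross_entropy_nats(probs)  # ≈ 1.84 nats per token
```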
Cross-Entropy Loss (Bits)
The same cross-entropy, but using $\log_2$ instead of $\ln$:

$$L_{\text{bits}} = \frac{L_{\text{nats}}}{\ln 2}$$

The unit is bits (binary digits of information). One nat $\approx$ 1.4427 bits. One bit $\approx$ 0.6931 nats. Information theory traditionally uses bits; machine learning traditionally uses nats.
Perplexity
The exponential of the cross-entropy:

$$\text{PPL} = e^{L_{\text{nats}}} = 2^{L_{\text{bits}}}$$

Perplexity has an intuitive interpretation: a perplexity of $k$ means the model is "as confused as if it were choosing uniformly among $k$ equally likely tokens at each step."

PPL = 1 means perfect prediction. PPL = $|V|$ means uniform random guessing over the vocabulary. For modern language models on standard benchmarks, PPL is typically 5 to 30.
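The uniform-guessing claim can be checked directly: a uniform distribution over $|V|$ tokens has loss $\ln |V|$ nats, and exponentiating recovers the vocabulary size. A small sketch using GPT-2's vocabulary size:

```python
import math

def perplexity(nats_per_token):
    """Perplexity is the exponential of the cross-entropy in nats."""
    return math.exp(nats_per_token)

# Uniform guessing over a vocabulary of size V gives loss ln(V) nats,
# so the perplexity is exactly V.
V = 50257  # GPT-2's vocabulary size
uniform_loss = math.log(V)
ppl = perplexity(uniform_loss)  # ≈ 50257
```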
Bits Per Byte (BPB)
Cross-entropy in bits, normalized by the number of UTF-8 bytes rather than the number of tokens:

$$\text{BPB} = L_{\text{bits per token}} \times \frac{N_{\text{tokens}}}{N_{\text{bytes}}}$$
BPB is tokenizer-independent: two models with different tokenizers can be compared directly because the normalization is by bytes, not tokens.
Typical values: 0.5 to 1.0 BPB for strong language models on English text. Shannon's estimate of English entropy is roughly 0.8 to 1.3 bits per character.
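A minimal conversion helper, plus the byte count it depends on. Note the byte count comes from the raw text, not from the tokenizer:

```python
import math

def bits_per_byte(nats_per_token, bytes_per_token):
    """Convert per-token cross-entropy in nats to bits per UTF-8 byte."""
    bits_per_token = nats_per_token / math.log(2)
    return bits_per_token / bytes_per_token

# Byte counts are a property of the text itself:
text = "hello world"
n_bytes = len(text.encode("utf-8"))  # 11 bytes for this ASCII string
```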
Conversion Table
Unit Conversion Identities
| From | To | Formula |
|---|---|---|
| Nats | Bits | $L_{\text{bits}} = L_{\text{nats}} / \ln 2 \approx 1.4427 \times L_{\text{nats}}$ |
| Bits | Nats | $L_{\text{nats}} = L_{\text{bits}} \times \ln 2 \approx 0.6931 \times L_{\text{bits}}$ |
| Nats | Perplexity | $\text{PPL} = e^{L_{\text{nats}}}$ |
| Bits | Perplexity | $\text{PPL} = 2^{L_{\text{bits}}}$ |
| Perplexity | Nats | $L_{\text{nats}} = \ln \text{PPL}$ |
| Perplexity | Bits | $L_{\text{bits}} = \log_2 \text{PPL}$ |
| Bits (per token) | BPB | $\text{BPB} = L_{\text{bits}} \times N_{\text{tokens}} / N_{\text{bytes}}$ |
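The identities in the table are mechanical enough to write down as one-liners; a sketch with a round-trip consistency check:

```python
import math

LN2 = math.log(2)

def nats_to_bits(nats): return nats / LN2
def bits_to_nats(bits): return bits * LN2
def nats_to_ppl(nats): return math.exp(nats)
def bits_to_ppl(bits): return 2.0 ** bits
def ppl_to_nats(ppl): return math.log(ppl)
def ppl_to_bits(ppl): return math.log2(ppl)
def bits_per_token_to_bpb(bits_per_token, bytes_per_token):
    return bits_per_token / bytes_per_token

# Round trip: nats → PPL → bits must agree with nats → bits directly.
check = abs(ppl_to_bits(nats_to_ppl(3.2)) - nats_to_bits(3.2))
```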
Intuition
Nats and bits are the same quantity on different logarithmic scales (base $e$ vs base 2). Perplexity is the exponential transformation that converts from a logarithmic scale to a linear "effective vocabulary size" scale. BPB is a normalization that removes the tokenizer from the equation.
Why It Matters
Without these conversions, you cannot compare results across papers. A model reporting 3.2 nats is the same as 4.6 bits is the same as PPL 24.5. If you see one paper reporting "perplexity 25" and another reporting "3.2 nats per token," these are the same result.
Failure Mode
The conversion between bits-per-token and BPB depends on the tokenizer's compression ratio (tokens per byte). This ratio varies across tokenizers: GPT-2's BPE produces roughly 0.25 tokens per byte on English text; SentencePiece may differ. You cannot convert between bits-per-token and BPB without knowing the tokenizer.
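To make the dependence concrete: hold model quality fixed at 1.0 BPB and vary only the tokenizer's compression ratio, and the bits-per-token number changes with it. The byte-per-token figures below are hypothetical:

```python
# Same byte-level quality, two hypothetical tokenizers with different
# compression ratios → different per-token numbers.
bpb = 1.0
bytes_per_token_a = 4.0   # coarse BPE (hypothetical)
bytes_per_token_b = 2.5   # finer-grained tokenizer (hypothetical)
bits_per_token_a = bpb * bytes_per_token_a  # 4.0 bits per token
bits_per_token_b = bpb * bytes_per_token_b  # 2.5 bits per token
```

Comparing 4.0 vs 2.5 bits per token here would wrongly suggest the second model is far better, when both compress the underlying text equally well.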
Worked Example
A GPT-2 model achieves a cross-entropy of 3.4 nats per token on WikiText-103. GPT-2's tokenizer averages about 3.8 bytes per token on this dataset.
| Metric | Value | Calculation |
|---|---|---|
| Nats per token | 3.4 | (given) |
| Bits per token | 4.90 | $3.4 / \ln 2 = 3.4 \times 1.4427$ |
| Perplexity | 29.96 | $e^{3.4}$ |
| BPB | 1.29 | $4.90 / 3.8$ |
If a different model with a different tokenizer (averaging 4.2 bytes per token) achieves 3.1 nats per token, you cannot directly compare their nats or perplexity (different tokenizations). But you can compare BPB: $3.1 / \ln 2 = 4.47$ bits per token, and $4.47 / 4.2 \approx 1.06$ BPB versus 1.29 BPB. The second model is better (lower BPB = better compression).
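The whole worked example fits in a few lines of arithmetic:

```python
import math

# Model 1: GPT-2 on WikiText-103 (figures from the worked example above).
nats = 3.4
bytes_per_token = 3.8
bits = nats / math.log(2)      # ≈ 4.90 bits per token
ppl = math.exp(nats)           # ≈ 29.96
bpb = bits / bytes_per_token   # ≈ 1.29

# Model 2: 3.1 nats per token, 4.2 bytes per token.
bpb2 = (3.1 / math.log(2)) / 4.2  # ≈ 1.06 → lower BPB, better model
```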
When to Use Each Unit
| Unit | Use when | Avoid when |
|---|---|---|
| Nats | Training (it is the raw loss), internal monitoring | Comparing across tokenizers |
| Bits | Information theory context, compression discussion | When your audience expects nats |
| Perplexity | Paper reporting, intuitive communication | Averaging across datasets (arithmetic mean of PPL is misleading; average the log-PPL instead) |
| BPB | Cross-tokenizer comparison, scaling law analysis | When byte-level normalization obscures token-level behavior |
Common Confusions
You cannot average perplexities
If model A has PPL 20 on dataset 1 and PPL 40 on dataset 2, the "average perplexity" is not 30. Perplexity is an exponential quantity. The correct aggregation is: average the log-perplexities (cross-entropies), then exponentiate. The geometric mean of perplexities equals the exponential of the arithmetic mean of cross-entropies: $\left(\prod_{i=1}^{n} \text{PPL}_i\right)^{1/n} = \exp\left(\frac{1}{n}\sum_{i=1}^{n} L_i\right)$. In the example, the correct aggregate is $\sqrt{20 \times 40} \approx 28.3$, not 30.
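The two aggregations side by side (assuming, for simplicity, equal token counts in both datasets; otherwise weight by tokens):

```python
import math

ppls = [20.0, 40.0]

# Wrong: arithmetic mean of an exponential quantity.
wrong = sum(ppls) / len(ppls)  # 30.0

# Right: average the cross-entropies, then exponentiate
# (equivalently, the geometric mean of the perplexities).
logs = [math.log(p) for p in ppls]
right = math.exp(sum(logs) / len(logs))  # ≈ 28.28
```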
Lower perplexity does not always mean a better model
Perplexity measures how well the model predicts the specific test set. A model with lower perplexity on news articles may have higher perplexity on code. Perplexity is domain-specific. Also, a model can achieve low perplexity by being overconfident on easy tokens and terrible on hard tokens. Calibration is a separate and important property.
BPB depends on the text encoding
BPB normalizes by UTF-8 bytes. If your text is mostly ASCII (1 byte per character), BPB is close to bits-per-character. For text with many non-ASCII characters (Chinese, Arabic, emoji), UTF-8 uses 2 to 4 bytes per character, changing the BPB number without changing model quality. Always note the character set when comparing BPB.
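Python's `str.encode` shows the effect directly; character counts and UTF-8 byte counts diverge as soon as you leave ASCII:

```python
# UTF-8 byte counts per character vary by script, so identical model
# quality yields different BPB numbers on different alphabets.
samples = {}
for s in ["hello", "你好", "🙂"]:
    samples[s] = (len(s), len(s.encode("utf-8")))
# "hello" → (5, 5): ASCII is 1 byte per character
# "你好"  → (2, 6): CJK is 3 bytes per character
# "🙂"    → (1, 4): this emoji is 4 bytes
```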
Exercises
Problem
A language model achieves a cross-entropy of 2.1 nats per token. Convert this to bits per token and perplexity. If the tokenizer averages 4.0 bytes per token, what is the BPB?
Problem
Paper A reports PPL = 15.2 on WikiText-103 using a BPE tokenizer. Paper B reports PPL = 18.7 on WikiText-103 using a unigram tokenizer. Can you conclude that model A is better? What additional information do you need?
References
Canonical:
- Cover & Thomas, Elements of Information Theory (2006), Chapter 2
- Jelinek, Statistical Methods for Speech Recognition (1997). Early use of perplexity.
Current:
- Gao et al., "The Pile: An 800GB Dataset of Diverse Text for Language Modeling" (2020). Uses BPB for cross-tokenizer comparison.
- Hoffmann et al., "Training Compute-Optimal Large Language Models" (2022). Chinchilla reports both nats and BPB.
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Information Theory Foundations (Layer 0B)
- KL Divergence (Layer 1)
- Common Probability Distributions (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)