LLM Construction
Perplexity and Language Model Evaluation
Perplexity as exp(cross-entropy): the standard intrinsic metric for language models, its information-theoretic interpretation, connection to bits-per-byte, and why low perplexity alone does not guarantee useful generation.
Why This Matters
Perplexity is the default metric for comparing language models, and it is rooted in information theory. When a paper reports that Model A has perplexity 15.2 and Model B has perplexity 18.7 on WikiText-103, this means Model A assigns higher probability to the test data on average. Nearly every language modeling paper, from n-grams to GPT-4, reports perplexity or a closely related quantity (bits-per-byte, bits-per-character).
Understanding what perplexity measures, and what it does not measure, is necessary for interpreting any language modeling result.
Formal Definitions
Cross-Entropy of a Language Model
Given a true data distribution $p$ over sequences and a model $q$, the cross-entropy is:

$$H(p, q) = -\mathbb{E}_{x \sim p}\left[\log q(x)\right]$$

In practice, we estimate this on a test corpus $w_1, w_2, \dots, w_N$:

$$H = -\frac{1}{N} \sum_{i=1}^{N} \log q(w_i \mid w_1, \dots, w_{i-1})$$

where $q(w_i \mid w_1, \dots, w_{i-1})$ is the model's conditional probability of token $w_i$ given the preceding tokens.
Perplexity
The perplexity of a language model on a test sequence $w_1, \dots, w_N$ is:

$$\mathrm{PPL} = \exp(H) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log q(w_i \mid w_{<i})\right)$$

Perplexity is the exponential of the cross-entropy. Lower is better.
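The definition translates directly into code. A minimal sketch in Python (the function name is illustrative): compute the average negative log-probability of the observed tokens, then exponentiate.

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probabilities of the observed tokens.

    token_probs[i] is q(w_i | w_<i), the probability the model assigned
    to the token that actually occurred at position i.
    """
    n = len(token_probs)
    cross_entropy = -sum(math.log(p) for p in token_probs) / n  # nats per token
    return math.exp(cross_entropy)

# A model that assigns probability 0.25 to every observed token
# is exactly as uncertain as a uniform choice among 4 options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ≈ 4.0
```

In practice the log-probabilities come straight from the model's output, so implementations average log-probs and exponentiate rather than multiplying raw probabilities, which would underflow on long sequences.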
Information-Theoretic Interpretation
Perplexity as Effective Vocabulary Size
Statement
The perplexity of a model on a test sequence equals the geometric mean of the inverse probabilities:

$$\mathrm{PPL} = \left(\prod_{i=1}^{N} \frac{1}{q(w_i \mid w_{<i})}\right)^{1/N}$$

A perplexity of $k$ means the model is, on average, as uncertain as if it were choosing uniformly among $k$ tokens at each position.
Intuition
If a model has perplexity 100 on English text with a vocabulary of 50,000 tokens, it has narrowed down its prediction to effectively 100 plausible next tokens on average. A perfect model that always predicts the correct next token with probability 1 has perplexity 1.
Proof Sketch
By definition, $\mathrm{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log q(w_i \mid w_{<i})\right) = \prod_{i=1}^{N} q(w_i \mid w_{<i})^{-1/N}$.
For a uniform distribution over $k$ tokens, $q(w_i \mid w_{<i}) = 1/k$ always, so $\mathrm{PPL} = \exp(\log k) = k$.
Why It Matters
This gives a concrete way to interpret perplexity numbers. A perplexity drop from 30 to 20 means the model went from choosing among 30 plausible tokens to 20 plausible tokens on average. This is a substantial improvement.
Failure Mode
The "effective vocabulary" interpretation assumes the uncertainty is spread uniformly, which is never true. A model might assign 0.9 probability to one token and split 0.1 across many tokens. The perplexity would be low, but the distribution is far from uniform over a small set.
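This failure mode is easy to see numerically. A quick check (the specific distribution is an illustrative assumption): a distribution putting 0.9 on one token and spreading the remaining 0.1 over 1,000 tokens has an "effective vocabulary" $e^H$ of under 3, even though 1,001 tokens have nonzero probability.

```python
import math

# Peaked distribution: 0.9 on one token, 0.1 spread over 1000 others.
probs = [0.9] + [0.1 / 1000] * 1000

# Entropy in nats; exp(entropy) is the "effective vocabulary size".
entropy = -sum(p * math.log(p) for p in probs)
print(math.exp(entropy))  # ≈ 2.76, despite a support of 1001 tokens
```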
Bits-Per-Character and Bits-Per-Byte
Cross-entropy with log base 2 gives bits. The conversion from nats is:

$$H_{\text{bits}} = \frac{H_{\text{nats}}}{\ln 2}, \qquad \mathrm{PPL} = 2^{H_{\text{bits}}}$$

Bits-per-byte (BPB) normalizes by the number of bytes rather than tokens, making it comparable across different tokenizers. If a model uses BPE with average token length $b$ bytes:

$$\mathrm{BPB} = \frac{\log_2 \mathrm{PPL}}{b}$$
This matters because tokenizer choice affects perplexity. A model with a larger vocabulary will have lower token-level perplexity simply because each token carries more information. BPB removes this confound.
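The conversion above can be sketched in a few lines (function names are illustrative):

```python
import math

def bits_per_byte(token_perplexity, avg_bytes_per_token):
    """Convert token-level perplexity to bits-per-byte (BPB)."""
    bits_per_token = math.log2(token_perplexity)
    return bits_per_token / avg_bytes_per_token

# Token-level perplexity 16 = 4 bits per token; with 4-byte tokens
# on average, that is 1 bit per byte.
print(bits_per_byte(16, 4.0))  # 1.0
```

Two models with different tokenizers can then be compared on the same byte stream: the one with lower BPB compresses the data better, regardless of vocabulary size.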
Perplexity on a Uniform Baseline
Uniform Model Perplexity
Statement
A model that assigns $q(w_i \mid w_{<i}) = 1/|V|$ for all $i$, where $|V|$ is the vocabulary size, has perplexity exactly $|V|$.
Intuition
A model that knows nothing about language and guesses uniformly is as confused as the size of the vocabulary. Any useful language model must have perplexity far below $|V|$.
Proof Sketch
$\mathrm{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log \frac{1}{|V|}\right) = \exp(\log |V|) = |V|$.
Why It Matters
This provides a sanity check. If your vocabulary has 50,000 tokens and your model's perplexity is 40,000, it is barely better than random guessing. State-of-the-art LLMs achieve perplexities in the range of 5-20 on standard benchmarks.
Failure Mode
This bound is not tight for any structured data. Even a unigram model (no context, just token frequencies) achieves much lower perplexity than $|V|$ on natural language because the token distribution is highly non-uniform (Zipf's law).
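An illustrative check of how loose the bound is, assuming an idealized Zipf distribution ($p_r \propto 1/r$) over a 50,000-token vocabulary:

```python
import math

V = 50_000

# Uniform baseline: perplexity equals the vocabulary size.
uniform_ppl = V

# Unigram model that exactly matches a Zipfian token distribution p_r ∝ 1/r.
z = sum(1 / r for r in range(1, V + 1))  # normalizing constant
probs = [1 / (r * z) for r in range(1, V + 1)]
entropy = -sum(p * math.log(p) for p in probs)  # nats per token
zipf_ppl = math.exp(entropy)

# The Zipf unigram's perplexity is orders of magnitude below 50,000.
print(uniform_ppl, round(zipf_ppl))
```

Real corpora are not exactly Zipfian, but the qualitative point holds: token frequencies alone already close most of the gap to the uniform baseline.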
Why Perplexity Is Not Enough
Perplexity measures how well a model predicts the next token on held-out data. It does not measure:
- Coherence over long sequences. A model can have low perplexity by being good at local predictions (next word) while producing incoherent paragraphs.
- Factual accuracy. A model can confidently predict plausible-sounding but false continuations.
- Instruction following. Perplexity on web text says nothing about a model's ability to follow instructions or answer questions.
- Safety and alignment. Low perplexity on toxic text is not desirable.
This is why modern LLM evaluation relies on downstream benchmarks (MMLU, HumanEval, GSM8K) and human evaluation following model evaluation best practices, not perplexity alone.
Common Confusions
Perplexity depends on the tokenizer
A model using character-level tokenization and a model using BPE tokenization cannot be compared by perplexity directly. The character model predicts one character at a time; the BPE model predicts one subword at a time. Use bits-per-byte to compare across tokenization schemes.
Lower perplexity does not always mean a better model
A model trained on a narrow domain (e.g., only legal text) will have lower perplexity on legal text than a general-purpose model. This does not make it a better language model overall. Perplexity is only comparable when measured on the same test set. The gap between model perplexity and the entropy of the data source equals the KL divergence between the true distribution and the model.
Perplexity is undefined for zero-probability events
If the model assigns $q(w_i \mid w_{<i}) = 0$ to any test token, perplexity is infinite. In practice, models use smoothing or a vocabulary that covers all test tokens via subword tokenization to avoid this.
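A minimal sketch of one classical fix, Laplace (add-one) smoothing for a count-based model (the toy counts and function name are illustrative):

```python
def add_one_probs(counts, vocab_size):
    """Laplace (add-one) smoothing: every token in the vocabulary,
    seen or unseen, receives nonzero probability."""
    total = sum(counts.values())
    return lambda tok: (counts.get(tok, 0) + 1) / (total + vocab_size)

counts = {"the": 5, "cat": 2, "sat": 1}
q = add_one_probs(counts, vocab_size=10)

print(q("cat"))  # (2 + 1) / (8 + 10) ≈ 0.167
print(q("dog"))  # unseen token still gets (0 + 1) / 18 > 0
```

Neural LLMs avoid the problem differently: a softmax output layer assigns strictly positive probability to every vocabulary item, and subword tokenization guarantees every byte sequence is representable.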
Canonical Examples
Perplexity calculation
A model predicts a 4-token sequence, assigning probabilities $0.5, 0.4, 0.2, 0.2$ to the observed tokens.
Cross-entropy: $H = -\frac{1}{4}\left(\log 0.5 + \log 0.4 + \log 0.2 + \log 0.2\right) \approx 1.207$ nats per token.
Perplexity: $\mathrm{PPL} = e^{1.207} \approx 3.34$.
The model is as confused as if choosing uniformly among 3.34 tokens per position.
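The arithmetic can be checked numerically; the probabilities $0.5, 0.4, 0.2, 0.2$ used below are one set consistent with the stated perplexity of 3.34:

```python
import math

# Per-token probabilities the model assigned to the observed 4-token sequence.
probs = [0.5, 0.4, 0.2, 0.2]

cross_entropy = -sum(math.log(p) for p in probs) / len(probs)
print(round(cross_entropy, 3))            # 1.207 nats per token
print(round(math.exp(cross_entropy), 2))  # 3.34
```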
Exercises
Problem
A language model has a vocabulary of 10,000 tokens. What is the perplexity of the uniform baseline? If the model achieves perplexity 50, by what factor has it reduced uncertainty compared to the uniform baseline?
Problem
Model A uses a character-level tokenizer (vocabulary size 256) and achieves perplexity 4.2 per character. Model B uses BPE (vocabulary size 50,000) and achieves perplexity 22.1 per token, with an average of 4.5 characters per token. Which model is better in bits-per-character?
References
Canonical:
- Jelinek et al., "Perplexity: a Measure of the Difficulty of Speech Recognition Tasks", JASA 1977
- Cover and Thomas, Elements of Information Theory, Chapters 2-4
Current:
- Radford et al., "Language Models are Unsupervised Multitask Learners" (GPT-2), 2019
- Merity et al., "Pointer Sentinel Mixture Models", ICLR 2017 (WikiText benchmark)
- Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Information Theory Foundations (Layer 0B)