
LLM Construction

Perplexity and Language Model Evaluation

Perplexity as exp(cross-entropy): the standard intrinsic metric for language models, its information-theoretic interpretation, connection to bits-per-byte, and why low perplexity alone does not guarantee useful generation.


Why This Matters

Perplexity is the default metric for comparing language models, and it is rooted in information theory. When a paper reports that Model A has perplexity 15.2 and Model B has perplexity 18.7 on WikiText-103, it means Model A assigns higher probability to the test data on average. Nearly every language modeling paper, from n-grams to GPT-4, reports perplexity or a closely related quantity (bits-per-byte, bits-per-character).

Understanding what perplexity measures, and what it does not measure, is necessary for interpreting any language modeling result.

Formal Definitions

Definition

Cross-Entropy of a Language Model

Given a true data distribution $p$ over sequences and a model $q$, the cross-entropy is:

$$H(p, q) = -\mathbb{E}_{x \sim p}[\log q(x)]$$

In practice, we estimate this on a test corpus $x_1, x_2, \ldots, x_N$:

$$\hat{H}(p, q) = -\frac{1}{N} \sum_{i=1}^{N} \log q(x_i \mid x_{<i})$$

where $q(x_i \mid x_{<i})$ is the model's conditional probability of token $x_i$ given the preceding tokens.

Definition

Perplexity

The perplexity of a language model $q$ on a test sequence is:

$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log q(x_i \mid x_{<i})\right) = \exp(\hat{H}(p, q))$$

Perplexity is the exponential of the cross-entropy. Lower is better.
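As a quick sketch (in Python, with made-up conditional probabilities), the estimator above and its exponential can be computed directly from the model's per-token probabilities:

```python
import math

def cross_entropy(token_probs):
    """Average negative log-probability (nats per token) over a sequence."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity is the exponential of the cross-entropy."""
    return math.exp(cross_entropy(token_probs))

# Illustrative conditional probabilities q(x_i | x_<i) for a 3-token sequence
probs = [0.25, 0.1, 0.5]
print(round(cross_entropy(probs), 3))  # 1.461
print(round(perplexity(probs), 2))     # 4.31
```

In a real evaluation the probabilities come from the model's softmax over the vocabulary at each position; everything else is unchanged.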

Information-Theoretic Interpretation

Proposition

Perplexity as Effective Vocabulary Size

Statement

The perplexity of a model on a test sequence equals the geometric mean of the inverse probabilities:

$$\text{PPL} = \left(\prod_{i=1}^{N} \frac{1}{q(x_i \mid x_{<i})}\right)^{1/N}$$

A perplexity of $k$ means the model is, on average, as uncertain as if it were choosing uniformly among $k$ tokens at each position.

Intuition

If a model has perplexity 100 on English text with a vocabulary of 50,000 tokens, it has narrowed down its prediction to effectively 100 plausible next tokens on average. A perfect model that always predicts the correct next token with probability 1 has perplexity 1.

Proof Sketch

By definition,

$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_i \log q(x_i \mid x_{<i})\right) = \exp\left(\frac{1}{N} \sum_i \log \frac{1}{q(x_i \mid x_{<i})}\right) = \left(\prod_i \frac{1}{q(x_i \mid x_{<i})}\right)^{1/N}.$$

For a uniform distribution over $k$ tokens, $q(x_i \mid x_{<i}) = 1/k$ always, so $\text{PPL} = (k^N)^{1/N} = k$.

Why It Matters

This gives a concrete way to interpret perplexity numbers. A perplexity drop from 30 to 20 means the model went from choosing among 30 plausible tokens to 20 plausible tokens on average. This is a substantial improvement.

Failure Mode

The "effective vocabulary" interpretation assumes the uncertainty is spread uniformly, which is never true. A model might assign 0.9 probability to one token and split 0.1 across many tokens. The perplexity would be low, but the distribution is far from uniform over a small set.
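The identity between the exponential form and the geometric-mean form can be checked numerically; a minimal sketch with illustrative probabilities:

```python
import math

# Illustrative conditional probabilities for a 5-token sequence
probs = [0.3, 0.05, 0.6, 0.9, 0.12]
n = len(probs)

# Perplexity as exp of the average negative log-probability
ppl_exp = math.exp(-sum(math.log(p) for p in probs) / n)

# Perplexity as the geometric mean of inverse probabilities
prod_inv = 1.0
for p in probs:
    prod_inv *= 1.0 / p
ppl_geo = prod_inv ** (1.0 / n)

# The two forms agree up to floating-point error
assert abs(ppl_exp - ppl_geo) < 1e-9
print(round(ppl_exp, 2))
```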

Bits-Per-Character and Bits-Per-Byte

Cross-entropy measured with log base 2 gives bits; equivalently, divide the natural-log cross-entropy by $\ln 2$:

$$\text{BPC} = \frac{\hat{H}(p, q)}{\ln 2} = \log_2(\text{PPL})$$

(This is bits-per-character when the prediction unit is a character, and bits-per-token otherwise.) Bits-per-byte (BPB) normalizes by the number of bytes rather than tokens, making it comparable across different tokenizers. If a model uses BPE with an average token length of $\bar{L}$ bytes:

$$\text{BPB} \approx \frac{\text{bits per token}}{\bar{L}}$$

This matters because tokenizer choice affects perplexity. A model with a larger vocabulary will have lower token-level perplexity simply because each token carries more information. BPB removes this confound.
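A sketch of the conversion, assuming we know the total token and byte counts of the test set (the numbers below are hypothetical, not from any benchmark):

```python
import math

def bits_per_byte(ppl_per_token, num_tokens, num_bytes):
    """Total bits = num_tokens * log2(PPL); normalize by the byte count."""
    return num_tokens * math.log2(ppl_per_token) / num_bytes

# Hypothetical numbers: perplexity 20 per token, tokens averaging 4 bytes
print(round(bits_per_byte(20.0, num_tokens=1000, num_bytes=4000), 2))  # 1.08
```

Two models with different tokenizers can then be compared on this single byte-normalized scale, even though their token-level perplexities are incommensurable.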

Perplexity on a Uniform Baseline

Proposition

Uniform Model Perplexity

Statement

A model that assigns $q(x_i \mid x_{<i}) = 1/V$ for all $i$ has perplexity exactly $V$.

Intuition

A model that knows nothing about language and guesses uniformly is as confused as the size of the vocabulary. Any useful language model must have perplexity far below $V$.

Proof Sketch

$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_i \log(1/V)\right) = \exp(\log V) = V.$$

Why It Matters

This provides a sanity check. If your vocabulary has 50,000 tokens and your model's perplexity is 40,000, it is barely better than random guessing. State-of-the-art LLMs achieve perplexities in the range of 5-20 on standard benchmarks.

Failure Mode

This bound is not tight for any structured data. Even a unigram model (no context, just token frequencies) achieves much lower perplexity than $V$ on natural language because the token distribution is highly non-uniform (Zipf's law).
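Both points can be illustrated on a toy corpus (the corpus and its counts are invented for illustration):

```python
import math
from collections import Counter

# Toy corpus with a highly skewed (Zipf-like) token distribution
corpus = ["the"] * 50 + ["of"] * 25 + ["cat"] * 15 + ["sat"] * 7 + ["mat"] * 3
vocab_size = 5

# Uniform baseline: q(x) = 1/V for every token, so PPL = V
uniform_ppl = math.exp(-sum(math.log(1 / vocab_size) for _ in corpus) / len(corpus))

# Unigram model: q(x) = empirical frequency of the token, ignoring context
counts = Counter(corpus)
n = len(corpus)
unigram_ppl = math.exp(-sum(math.log(counts[t] / n) for t in corpus) / n)

print(round(uniform_ppl, 2))  # 5.0
print(round(unigram_ppl, 2))  # 3.56 — already well below V
```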

Why Perplexity Is Not Enough

Perplexity measures how well a model predicts the next token on held-out data. It does not measure:

  1. Coherence over long sequences. A model can have low perplexity by being good at local predictions (next word) while producing incoherent paragraphs.
  2. Factual accuracy. A model can confidently predict plausible-sounding but false continuations.
  3. Instruction following. Perplexity on web text says nothing about a model's ability to follow instructions or answer questions.
  4. Safety and alignment. Low perplexity on toxic text is not desirable.

This is why modern LLM evaluation relies on downstream benchmarks (MMLU, HumanEval, GSM8K) and human evaluation following model evaluation best practices, not perplexity alone.

Common Confusions

Watch Out

Perplexity depends on the tokenizer

A model using character-level tokenization and a model using BPE tokenization cannot be compared by perplexity directly. The character model predicts one character at a time; the BPE model predicts one subword at a time. Use bits-per-byte to compare across tokenization schemes.
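The conversion to a common per-character scale can be sketched as follows; the models and numbers here are invented for illustration:

```python
import math

def bits_per_char(ppl, units_per_char):
    """Convert perplexity per prediction unit into bits per character.

    units_per_char: how many prediction units cover one character
    (1.0 for a character model, 1/avg_chars_per_token for a subword model).
    """
    return math.log2(ppl) * units_per_char

# Character-level model: perplexity 3.0 per character (one unit per char)
bpc_char = bits_per_char(3.0, 1.0)
# Subword model: perplexity 25.0 per token, 4 characters per token on average
bpc_subword = bits_per_char(25.0, 1 / 4)

print(round(bpc_char, 3))     # 1.585
print(round(bpc_subword, 3))  # 1.161
```

On this scale the subword model predicts the same text more efficiently, even though its raw token-level perplexity is much higher.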

Watch Out

Lower perplexity does not always mean a better model

A model trained on a narrow domain (e.g., only legal text) will have lower perplexity on legal text than a general-purpose model. This does not make it a better language model overall. Perplexity is only comparable when measured on the same test set. The gap between the model's cross-entropy and the entropy of the data source equals the KL divergence from the true distribution to the model.

Watch Out

Perplexity is undefined for zero-probability events

If the model assigns $q(x_i \mid x_{<i}) = 0$ to any test token, perplexity is infinite. In practice, models use smoothing or a vocabulary that covers all test tokens via subword tokenization to avoid this.
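One simple way to guarantee finite perplexity is to interpolate the model's distribution with a uniform one. This is a sketch of that idea with hypothetical probabilities, not any particular library's smoothing scheme:

```python
import math

def smoothed(p, vocab_size, alpha=1e-4):
    """Interpolate with a uniform distribution so no token has probability 0."""
    return (1 - alpha) * p + alpha / vocab_size

# Hypothetical model outputs: probability 0 assigned to the second token
raw_probs = [0.3, 0.0, 0.5]
probs = [smoothed(p, vocab_size=10_000) for p in raw_probs]
ppl = math.exp(-sum(math.log(p) for p in probs) / len(probs))
print(ppl)  # finite, but heavily penalized by the zero-probability token
```

Without smoothing, `math.log(0.0)` raises an error, which is the computational face of the infinite-perplexity problem.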

Canonical Examples

Example

Perplexity calculation

A model predicts a 4-token sequence with probabilities $q = (0.2, 0.5, 0.1, 0.8)$.

Cross-entropy: $-\frac{1}{4}(\log 0.2 + \log 0.5 + \log 0.1 + \log 0.8) = -\frac{1}{4}(-1.609 - 0.693 - 2.303 - 0.223) = 1.207$

Perplexity: $\exp(1.207) \approx 3.34$.

The model is as confused as if choosing uniformly among 3.34 tokens per position.
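The arithmetic can be verified directly:

```python
import math

# Probabilities from the worked example above
probs = [0.2, 0.5, 0.1, 0.8]
h = -sum(math.log(p) for p in probs) / len(probs)
print(round(h, 3))            # 1.207
print(round(math.exp(h), 2))  # 3.34
```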

Exercises

Exercise (Core)

Problem

A language model has a vocabulary of 10,000 tokens. What is the perplexity of the uniform baseline? If the model achieves perplexity 50, by what factor has it reduced uncertainty compared to the uniform baseline?

Exercise (Advanced)

Problem

Model A uses a character-level tokenizer (vocabulary size 256) and achieves perplexity 4.2 per character. Model B uses BPE (vocabulary size 50,000) and achieves perplexity 22.1 per token, with an average of 4.5 characters per token. Which model is better in bits-per-character?

References

Canonical:

  • Jelinek et al., "Perplexity: a Measure of the Difficulty of Speech Recognition Tasks", JASA 1977
  • Cover and Thomas, Elements of Information Theory, Chapters 2-4

Current:

  • Radford et al., "Language Models are Unsupervised Multitask Learners" (GPT-2), 2019

  • Merity et al., "Pointer Sentinel Mixture Models", ICLR 2017 (WikiText benchmark)

  • Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12

Last reviewed: April 2026
