LLM Construction
Perplexity and Language Model Evaluation
Perplexity as exp(cross-entropy): the standard intrinsic metric for language models, its information-theoretic interpretation, connection to bits-per-byte, and why low perplexity alone does not guarantee useful generation.
Why This Matters
Perplexity is the default metric for comparing language models, and it is rooted in information theory. When a paper reports that Model A has perplexity 15.2 and Model B has perplexity 18.7 on WikiText-103, this means Model A assigns higher probability to the test data on average. Nearly every language modeling paper, from n-grams to GPT-4, reports perplexity or a closely related quantity (bits-per-byte, bits-per-character).
Understanding what perplexity measures, and what it does not measure, is necessary for interpreting any language modeling result.
Formal Definitions
Cross-Entropy of a Language Model
Given a true data distribution $p$ over sequences and a model $q$, the cross-entropy is:

$$H(p, q) = -\mathbb{E}_{x \sim p}\left[\log q(x)\right]$$

In practice, we estimate this on a test corpus $w_1, w_2, \dots, w_N$:

$$H = -\frac{1}{N} \sum_{i=1}^{N} \log q(w_i \mid w_1, \dots, w_{i-1})$$

where $q(w_i \mid w_1, \dots, w_{i-1})$ is the model's conditional probability of token $w_i$ given the preceding tokens.
Perplexity
The perplexity of a language model on a test sequence $w_1, \dots, w_N$ is:

$$\mathrm{PPL} = \exp(H) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log q(w_i \mid w_{<i})\right)$$

Perplexity is the exponential of the cross-entropy. Lower is better.
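The definition translates directly into code. A minimal sketch in Python (the function name is illustrative): compute the average negative log-probability of the observed tokens, then exponentiate.

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probabilities of the observed tokens.

    token_probs[i] is q(w_i | w_<i), the probability the model assigned
    to the token that actually occurred at position i.
    """
    n = len(token_probs)
    cross_entropy = -sum(math.log(p) for p in token_probs) / n  # nats per token
    return math.exp(cross_entropy)

# A model that assigns probability 0.25 to every observed token
# is exactly as uncertain as a uniform choice among 4 options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ≈ 4.0
```

In practice the log-probabilities come straight from the model's output, so implementations average log-probs and exponentiate rather than multiplying raw probabilities, which would underflow on long sequences.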
Information-Theoretic Interpretation
Perplexity as Effective Vocabulary Size
Statement
The perplexity of a model on a test sequence equals the geometric mean of the inverse probabilities:

$$\mathrm{PPL} = \left(\prod_{i=1}^{N} \frac{1}{q(w_i \mid w_{<i})}\right)^{1/N}$$

A perplexity of $k$ means the model is, on average, as uncertain as if it were choosing uniformly among $k$ tokens at each position.
Intuition
If a model has perplexity 100 on English text with a vocabulary of 50,000 tokens, it has narrowed down its prediction to effectively 100 plausible next tokens on average. A perfect model that always predicts the correct next token with probability 1 has perplexity 1.
Proof Sketch
By definition, $\mathrm{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log q(w_i \mid w_{<i})\right) = \prod_{i=1}^{N} q(w_i \mid w_{<i})^{-1/N}$.
For a uniform distribution over $k$ tokens, $q(w_i \mid w_{<i}) = 1/k$ always, so $\mathrm{PPL} = \exp(\log k) = k$.
Why It Matters
This gives a concrete way to interpret perplexity numbers. A perplexity drop from 30 to 20 means the model went from choosing among 30 plausible tokens to 20 plausible tokens on average. This is a substantial improvement.
Failure Mode
The "effective vocabulary" interpretation assumes the uncertainty is spread uniformly, which is never true. A model might assign 0.9 probability to one token and split 0.1 across many tokens. The perplexity would be low, but the distribution is far from uniform over a small set.
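This failure mode is easy to see numerically. A quick check (the specific distribution is an illustrative assumption): a distribution putting 0.9 on one token and spreading the remaining 0.1 over 1,000 tokens has an "effective vocabulary" $e^H$ of under 3, even though 1,001 tokens have nonzero probability.

```python
import math

# Peaked distribution: 0.9 on one token, 0.1 spread over 1000 others.
probs = [0.9] + [0.1 / 1000] * 1000

# Entropy in nats; exp(entropy) is the "effective vocabulary size".
entropy = -sum(p * math.log(p) for p in probs)
print(math.exp(entropy))  # ≈ 2.76, despite a support of 1001 tokens
```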
Bits-Per-Character and Bits-Per-Byte
Cross-entropy with log base 2 gives bits. The conversion from nats is:

$$H_{\text{bits}} = \frac{H_{\text{nats}}}{\ln 2}, \qquad \mathrm{PPL} = 2^{H_{\text{bits}}}$$

Bits-per-byte (BPB) normalizes by the number of bytes rather than tokens, making it comparable across different tokenizers. If a model uses BPE with average token length $b$ bytes:

$$\mathrm{BPB} = \frac{\log_2 \mathrm{PPL}}{b}$$
This matters because tokenizer choice affects perplexity. A model with a larger vocabulary will have lower token-level perplexity simply because each token carries more information. BPB removes this confound.
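The conversion above can be sketched in a few lines (function names are illustrative):

```python
import math

def bits_per_byte(token_perplexity, avg_bytes_per_token):
    """Convert token-level perplexity to bits-per-byte (BPB)."""
    bits_per_token = math.log2(token_perplexity)
    return bits_per_token / avg_bytes_per_token

# Token-level perplexity 16 = 4 bits per token; with 4-byte tokens
# on average, that is 1 bit per byte.
print(bits_per_byte(16, 4.0))  # 1.0
```

Two models with different tokenizers can then be compared on the same byte stream: the one with lower BPB compresses the data better, regardless of vocabulary size.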
Perplexity on a Uniform Baseline
Uniform Model Perplexity
Statement
A model that assigns $q(w_i \mid w_{<i}) = 1/|V|$ for all $i$, where $|V|$ is the vocabulary size, has perplexity exactly $|V|$.
Intuition
A model that knows nothing about language and guesses uniformly is as confused as the size of the vocabulary. Any useful language model must have perplexity far below $|V|$.
Proof Sketch
$\mathrm{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log \frac{1}{|V|}\right) = \exp(\log |V|) = |V|$.
Why It Matters
This provides a sanity check. If your vocabulary has 50,000 tokens and your model's perplexity is 40,000, it is barely better than random guessing. State-of-the-art LLMs achieve perplexities in the range of 5-20 on standard benchmarks.
Failure Mode
This bound is not tight for any structured data. Even a unigram model (no context, just token frequencies) achieves much lower perplexity than $|V|$ on natural language because the token distribution is highly non-uniform (Zipf's law).
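An illustrative check of how loose the bound is, assuming an idealized Zipf distribution ($p_r \propto 1/r$) over a 50,000-token vocabulary:

```python
import math

V = 50_000

# Uniform baseline: perplexity equals the vocabulary size.
uniform_ppl = V

# Unigram model that exactly matches a Zipfian token distribution p_r ∝ 1/r.
z = sum(1 / r for r in range(1, V + 1))  # normalizing constant
probs = [1 / (r * z) for r in range(1, V + 1)]
entropy = -sum(p * math.log(p) for p in probs)  # nats per token
zipf_ppl = math.exp(entropy)

# The Zipf unigram's perplexity is orders of magnitude below 50,000.
print(uniform_ppl, round(zipf_ppl))
```

Real corpora are not exactly Zipfian, but the qualitative point holds: token frequencies alone already close most of the gap to the uniform baseline.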
Why Perplexity Is Not Enough
Perplexity measures how well a model predicts the next token on held-out data. It does not measure:
- Coherence over long sequences. A model can have low perplexity by being good at local predictions (next word) while producing incoherent paragraphs.
- Factual accuracy. A model can confidently predict plausible-sounding but false continuations.
- Instruction following. Perplexity on web text says nothing about a model's ability to follow instructions or answer questions.
- Safety and alignment. Low perplexity on toxic text is not desirable.
This is why modern LLM evaluation relies on downstream benchmarks (MMLU, HumanEval, GSM8K) and human evaluation following model evaluation best practices, not perplexity alone.
Common Confusions
Perplexity depends on the tokenizer
A model using character-level tokenization and a model using BPE tokenization cannot be compared by perplexity directly. The character model predicts one character at a time; the BPE model predicts one subword at a time. Use bits-per-byte to compare across tokenization schemes.
Lower perplexity does not always mean a better model
A model trained on a narrow domain (e.g., only legal text) will have lower perplexity on legal text than a general-purpose model. This does not make it a better language model overall. Perplexity is only comparable when measured on the same test set. The gap between model perplexity and the entropy of the data source equals the KL divergence between the true distribution and the model.
Perplexity is undefined for zero-probability events
If the model assigns $q(w_i \mid w_{<i}) = 0$ to any test token, perplexity is infinite. In practice, models use smoothing or a vocabulary that covers all test tokens via subword tokenization to avoid this.
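A minimal sketch of one classical fix, Laplace (add-one) smoothing for a count-based model (the toy counts and function name are illustrative):

```python
def add_one_probs(counts, vocab_size):
    """Laplace (add-one) smoothing: every token in the vocabulary,
    seen or unseen, receives nonzero probability."""
    total = sum(counts.values())
    return lambda tok: (counts.get(tok, 0) + 1) / (total + vocab_size)

counts = {"the": 5, "cat": 2, "sat": 1}
q = add_one_probs(counts, vocab_size=10)

print(q("cat"))  # (2 + 1) / (8 + 10) ≈ 0.167
print(q("dog"))  # unseen token still gets (0 + 1) / 18 > 0
```

Neural LLMs avoid the problem differently: a softmax output layer assigns strictly positive probability to every vocabulary item, and subword tokenization guarantees every byte sequence is representable.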
Canonical Examples
Perplexity calculation
A model predicts a 4-token sequence, assigning probabilities $0.5, 0.4, 0.2, 0.2$ to the observed tokens.
Cross-entropy: $H = -\frac{1}{4}\left(\log 0.5 + \log 0.4 + \log 0.2 + \log 0.2\right) \approx 1.207$ nats per token.
Perplexity: $\mathrm{PPL} = e^{1.207} \approx 3.34$.
The model is as confused as if choosing uniformly among 3.34 tokens per position.
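The arithmetic can be checked numerically; the probabilities $0.5, 0.4, 0.2, 0.2$ used below are one set consistent with the stated perplexity of 3.34:

```python
import math

# Per-token probabilities the model assigned to the observed 4-token sequence.
probs = [0.5, 0.4, 0.2, 0.2]

cross_entropy = -sum(math.log(p) for p in probs) / len(probs)
print(round(cross_entropy, 3))            # 1.207 nats per token
print(round(math.exp(cross_entropy), 2))  # 3.34
```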
Exercises
Problem
A language model has a vocabulary of 10,000 tokens. What is the perplexity of the uniform baseline? If the model achieves perplexity 50, by what factor has it reduced uncertainty compared to the uniform baseline?
Problem
Model A uses a character-level tokenizer (vocabulary size 256) and achieves perplexity 4.2 per character. Model B uses BPE (vocabulary size 50,000) and achieves perplexity 22.1 per token, with an average of 4.5 characters per token. Which model is better in bits-per-character?
References
Canonical:
- Jelinek et al., "Perplexity: a Measure of the Difficulty of Speech Recognition Tasks", JASA 1977
- Cover and Thomas, Elements of Information Theory, Chapters 2-4
Current:
- Radford et al., "Language Models are Unsupervised Multitask Learners" (GPT-2), 2019
- Merity et al., "Pointer Sentinel Mixture Models", ICLR 2017 (WikiText benchmark)
- Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Information Theory Foundations (Layer 0B)