Beta: Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

LLM Construction

Tokenization and Information Theory

Tokenization determines an LLM's vocabulary and shapes everything from compression efficiency to multilingual ability. Information theory explains what good tokenization looks like.

Advanced · Tier 3 · ~50 min

Why This Matters

Before a language model sees any text, a tokenizer chops it into pieces. This seemingly mundane preprocessing step has profound consequences. The tokenizer determines what the model can represent, how efficiently it compresses text, whether it handles non-English languages well, and even whether it can do arithmetic.

A model with a 32K-token context window is limited not by characters or words but by tokens. If your tokenizer wastes tokens on inefficient encodings (e.g., splitting every Chinese character into 3 byte-level tokens), your effective context shrinks by 3x for that language. Information theory gives us the tools to reason about what "efficient" means.

Mental Model

Think of tokenization as building a codebook. You have a source language (raw text) and you want to compress it into a shorter sequence of symbols (tokens). Shannon's source coding theorem says the best you can do is the entropy of the source. BPE approximately achieves this by greedily building a codebook that captures frequent patterns.

A good tokenizer compresses common text heavily (short token sequences) while gracefully handling rare text (falling back to character or byte-level tokens). The vocabulary size $V$ is a design choice that trades off between compression (large $V$ means shorter sequences) and the difficulty of learning token embeddings (large $V$ means more parameters and sparser training signal).

Formal Setup and Notation

Let $\Sigma$ be an alphabet (e.g., UTF-8 bytes) and $\Sigma^*$ the set of all strings over $\Sigma$. A tokenizer is a function $\text{tok}: \Sigma^* \to \mathcal{V}^*$ that maps strings to sequences of tokens from a vocabulary $\mathcal{V}$ with $|\mathcal{V}| = V$.

Definition

Byte Pair Encoding (BPE)

BPE builds a vocabulary by iteratively merging the most frequent adjacent pair of symbols:

  1. Start with a vocabulary of individual bytes (or characters)
  2. Count all adjacent pairs in the training corpus
  3. Merge the most frequent pair into a single new token
  4. Repeat steps 2-3 until the vocabulary reaches the desired size $V$

At inference time, apply the learned merges greedily in order of priority to tokenize new text.
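The four steps above can be sketched in a few lines of Python. This is a toy character-level trainer for illustration only (no byte fallback, no pre-tokenization); `train_bpe` and its corpus are illustrative names, not any library's API.

```python
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges from a list of words (toy character-level sketch)."""
    # Step 1: represent each word as a tuple of single-character symbols.
    word_freqs = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count all adjacent symbol pairs across the corpus.
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Step 3: merge the most frequent pair into a single new symbol.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_freqs = Counter()
        for word, freq in word_freqs.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_freqs[tuple(out)] += freq
        word_freqs = new_freqs  # Step 4: repeat on the re-segmented corpus
    return merges

merges = train_bpe(["abab", "abab", "cdcd"], num_merges=2)
print(merges[0])  # the first learned merge is ('a', 'b')
```

Real implementations add byte-level fallback and cache pair counts incrementally, but the greedy loop is the same.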

Definition

Compression Ratio

The compression ratio of a tokenizer on a text corpus is:

$$\text{CR} = \frac{\text{number of bytes in raw text}}{\text{number of tokens after tokenization}}$$

A higher compression ratio means each token carries more information on average. English text typically achieves CR around 3.5-4.5 with modern tokenizers.

Definition

Fertility

The fertility of a tokenizer for a language or domain is the average number of tokens per word (or per character):

$$\text{fertility} = \frac{\text{number of tokens}}{\text{number of words}}$$

Lower fertility means more efficient representation. English has fertility around 1.3 with GPT-4's tokenizer; some languages exceed 3.0.
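Both metrics are one-liners. The stand-in "tokenizer" below just chops text into fixed 4-character chunks, a deliberately crude device used only to exercise the two formulas; the function names are illustrative, not from any library.

```python
def compression_ratio(text: str, tokens: list[str]) -> float:
    """Bytes of raw UTF-8 text per token produced."""
    return len(text.encode("utf-8")) / len(tokens)

def fertility(tokens: list[str], words: list[str]) -> float:
    """Average number of tokens per word."""
    return len(tokens) / len(words)

text = "tokenization shapes compression"
words = text.split()
# Stand-in tokenizer: pretend every 4-character chunk is one token.
tokens = [text[i:i + 4] for i in range(0, len(text), 4)]
print(compression_ratio(text, tokens))  # 31 bytes / 8 tokens = 3.875
print(fertility(tokens, words))         # 8 tokens / 3 words ≈ 2.67
```

Swapping in a real tokenizer's output for `tokens` gives the numbers quoted above (CR around 3.5-4.5 for English with modern tokenizers).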

Core Definitions

SentencePiece is a widely used tokenization library that treats the input as a raw stream of Unicode characters, with no whitespace-based pre-tokenization (spaces are handled as ordinary symbols). It supports both BPE and unigram language model tokenization. The unigram approach assigns probabilities to candidate subword units and picks the segmentation that maximizes the likelihood; this is more principled than greedy BPE but gives similar results in practice.

The information-theoretic view: let $H$ be the per-character entropy of the source text. A tokenizer with vocabulary size $V$ and average token length $L$ characters produces sequences of $n/L$ tokens for an input of $n$ characters. Each token carries at most $\log_2 V$ bits. The minimum achievable sequence length is bounded by the entropy:

$$\text{tokens per character} \geq \frac{H}{\log_2 V}$$

Good tokenization pushes toward this bound. In practice, BPE gets surprisingly close for the language it was trained on.
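Plugging in rough numbers makes the bound concrete (the $H \approx 1.2$ bits/char figure for English is an approximate estimate, as in the exercise below):

```python
import math

def min_tokens_per_char(entropy_bits_per_char: float, vocab_size: int) -> float:
    """Shannon lower bound: H / log2(V) tokens per character."""
    return entropy_bits_per_char / math.log2(vocab_size)

# English entropy estimate H ≈ 1.2 bits/char, vocabulary V = 50,000
bound = min_tokens_per_char(1.2, 50_000)
print(f"{bound:.4f} tokens/char, i.e. at best {1 / bound:.1f} chars/token")
```

The bound says a 50K-vocabulary tokenizer could in principle average roughly 13 characters per token on English; real tokenizers average around 4, so much of the gap is the cost of a finite, greedy codebook.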

Main Theorems

Proposition

Tokenization Length and Source Entropy

Statement

For a stationary ergodic source with entropy rate $H$ (bits per character), any tokenizer with vocabulary size $V$ satisfies:

$$\mathbb{E}\!\left[\frac{\text{number of tokens}}{\text{number of characters}}\right] \geq \frac{H}{\log_2 V}$$

Equality is achieved by an optimal variable-length code. BPE approximates this bound for frequent patterns in the training distribution.

Intuition

Each token is one of $V$ symbols, so it carries at most $\log_2 V$ bits of information. The source produces $H$ bits per character. You need at least $H / \log_2 V$ tokens per character to avoid losing information, just as you cannot compress a file below its entropy.

Proof Sketch

This follows from Shannon's source coding theorem. The total information in $n$ characters is approximately $nH$ bits. Encoding this into a sequence of $m$ tokens from an alphabet of size $V$ provides at most $m \log_2 V$ bits. For lossless encoding, $m \log_2 V \geq nH$, giving $m/n \geq H / \log_2 V$.

Why It Matters

This bound tells you the theoretical limit of tokenization efficiency. If your tokenizer is far from this bound for some language, it is wasting context window on that language. This directly explains the multilingual tokenization gap: tokenizers trained mostly on English are near-optimal for English but far from optimal for underrepresented languages.

Proposition

BPE Approximates Optimal Compression

Statement

BPE with vocabulary size $V$ achieves a compression ratio that approaches the optimal (entropy) rate as $V$ grows. Specifically, the redundancy (excess tokens per character beyond the entropy bound) of BPE is bounded by:

$$\text{redundancy} \leq O\!\left(\frac{1}{\log V}\right)$$

for a memoryless source, and empirically scales similarly for natural language.

Intuition

BPE greedily captures the most frequent patterns first. As the vocabulary grows, it captures longer and longer common substrings, approaching the entropy rate. Each merge step removes the most redundancy per added vocabulary entry, which is a greedy approximation to optimal dictionary coding.

Proof Sketch

For a memoryless source, the entropy bound is $H / \log_2 V$ tokens per character. Each BPE merge takes the most probable adjacent pair, with probability $p$, and replaces its two tokens with one wherever it occurs, shortening the sequence by a fraction on the order of $p$. Because the greedy rule always removes the largest available redundancy per added vocabulary entry, the cumulative savings after $V$ merges approach the entropy rate. The formal proof relies on connecting BPE to Lempel-Ziv-style dictionary methods and using the known redundancy bounds for those algorithms.

Why It Matters

This justifies BPE as a practical choice: it is simple, fast, and approximately optimal. You do not need sophisticated coding theory to build a good tokenizer; greedy merging gets you most of the way there.

Failure Mode

BPE is optimized for its training distribution. On out-of-distribution text (rare languages, code, mathematical notation), it can produce very long token sequences. The compression bound holds in expectation over the training distribution, not worst-case.

Canonical Examples

Example

BPE merge example

Starting vocabulary: the individual characters {a, b, c, d, space}. Corpus: "abab cdcd abab".

Step 1: The most frequent pair is (a, b), with count 4. Merge it into a new token "ab"; the corpus becomes "ab ab cdcd ab ab". Step 2: The most frequent pair is now (c, d), with count 2. Merge it into "cd", giving "ab ab cd cd ab ab". Step 3: The next merge combines ("ab", "ab"), or a token with the space, depending on the counts.

After enough merges, common words become single tokens. The string "abab" that initially took 4 tokens now takes 2 (or 1 with another merge).
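Tokenizing new text with the learned merges can be sketched as a greedy pass per merge, in priority order (a toy illustration; real BPE implementations resolve merge priority within the string more carefully, but the idea is the same):

```python
def apply_merges(text: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize text by applying learned merges in priority order (toy sketch)."""
    tokens = list(text)  # start from individual characters
    for a, b in merges:  # earlier merges have higher priority
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)  # collapse the pair into one token
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# With the two merges from the example, "abab" collapses to a single token.
print(apply_merges("abab", [("a", "b"), ("ab", "ab")]))  # ['abab']
```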

Example

Tokenization failure: arithmetic

GPT-style tokenizers merge digit sequences inconsistently. The number "123456" might be tokenized as ["123", "456"] or ["12", "3456"] depending on context. The model must learn that "123" followed by "456" represents one hundred twenty-three thousand four hundred fifty-six, not two separate numbers. This is why LLMs struggle with arithmetic: the tokenizer imposes an unnatural decomposition of numbers.
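Some recent tokenizers reduce this inconsistency by always splitting digit runs left-to-right into groups of at most three digits. The regex sketch below mimics that grouping policy; it is an illustration of the scheme, not any specific tokenizer's code:

```python
import re

def chunk_digits(number: str, group: int = 3) -> list[str]:
    """Split a digit string left-to-right into groups of up to `group` digits,
    mimicking (not reproducing) how some tokenizers segment numbers."""
    return re.findall(rf"\d{{1,{group}}}", number)

print(chunk_digits("123456"))   # ['123', '456']
print(chunk_digits("1234567"))  # ['123', '456', '7']
```

Note that left-to-right grouping is consistent but still misaligned with place value: "1234567" becomes 123|456|7 rather than 1|234|567, so the model must still learn to realign digits when adding or comparing numbers.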

Common Confusions

Watch Out

BPE is not word-level tokenization

BPE operates on subword units. Common words become single tokens, but rare words are split into pieces. This is not a bug but a feature: it gives the model a fallback for any string (down to individual bytes) while keeping common text compact, so the model never encounters an out-of-vocabulary failure.

Watch Out

Larger vocabulary is not always better

Increasing $V$ improves compression (fewer tokens per text) but increases the embedding matrix size ($V \times d$ parameters) and makes each token's training signal sparser. There is an optimal $V$ that balances compression, parameter count, and training efficiency. Most modern LLMs use $V$ between 32K and 256K.
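A quick back-of-the-envelope shows how fast the embedding matrix grows with $V$. The width $d = 4096$ is an assumed value for illustration, and the count covers the input embedding only; an untied output head doubles it:

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in the token embedding matrix (V x d), input side only."""
    return vocab_size * d_model

d = 4096  # assumed model width, for illustration
for V in (32_000, 128_000, 256_000):
    print(f"V={V:>7,}: {embedding_params(V, d) / 1e6:.0f}M embedding parameters")
```

At 256K vocabulary and this width, embeddings alone exceed a billion parameters, which is why small models tend toward smaller vocabularies.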

Watch Out

Tokenization is not language-neutral

A tokenizer trained on 90% English text learns English-optimal merges. Non-Latin scripts may be split at the byte level, tripling token counts. This means the model's effective context length and per-token information content vary by language. Multilingual tokenizers address this by balancing the training corpus or using character-aware methods.
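The byte-level penalty is easy to measure, since a pure byte-level fallback pays one token per UTF-8 byte; bytes-per-character is therefore the fertility floor for scripts absent from the merge table (the sample strings below are arbitrary):

```python
samples = {"English": "hello", "Chinese": "你好", "Hindi": "नमस्ते"}
for lang, text in samples.items():
    chars, nbytes = len(text), len(text.encode("utf-8"))
    # A byte-level fallback spends one token per byte of UTF-8.
    print(f"{lang}: {chars} chars -> {nbytes} byte-level tokens "
          f"({nbytes / chars:.1f} tokens/char)")
```

Latin script costs 1 byte per character, while Chinese and Devanagari cost 3, which is exactly the 3x context shrinkage mentioned in the introduction.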

Summary

  • BPE builds a vocabulary by greedily merging frequent character pairs; it approximates entropy-optimal compression
  • Good tokenization minimizes expected sequence length, approaching the information-theoretic bound of $H / \log_2 V$ tokens per character
  • Tokenizer quality varies by language: underrepresented languages get worse compression and shorter effective context
  • Tokenization failures (numbers, rare scripts, code) directly cause model failures because the model only sees the tokenized representation
  • Vocabulary size $V$ is a tradeoff: larger $V$ means better compression but more parameters and sparser training

Exercises

ExerciseCore

Problem

English text has approximately $H \approx 1.2$ bits per character. A tokenizer with $V = 50{,}000$ tokens achieves an average token length of 4.1 characters. How close is this to the entropy bound?

ExerciseAdvanced

Problem

Explain why byte-level BPE (starting from 256 byte tokens) is strictly more general than character-level BPE (starting from Unicode characters). What does it gain and what does it lose?

ExerciseResearch

Problem

The unigram language model tokenizer (used in SentencePiece) finds the tokenization that maximizes the log-likelihood $\sum_i \log p(t_i)$, where the $p(t_i)$ are learned token probabilities. How does this relate to the information-theoretic view of tokenization as compression?

References

Canonical:

  • Sennrich, Haddow, Birch, Neural Machine Translation of Rare Words with Subword Units (ACL 2016)
  • Kudo & Richardson, SentencePiece: A Simple and Language Independent Subword Tokenizer (EMNLP 2018)

Current:

  • Rust et al., How Good is Your Tokenizer? (ACL 2021)

  • Petrov et al., Language Model Tokenizers Introduce Unfairness Between Languages (NeurIPS 2023)

  • Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12


Last reviewed: April 2026
