LLM Construction
Tokenization and Information Theory
Tokenization determines an LLM's vocabulary and shapes everything from compression efficiency to multilingual ability. Information theory explains what good tokenization looks like.
Why This Matters
Before a language model sees any text, a tokenizer chops it into pieces. This seemingly mundane preprocessing step has profound consequences. The tokenizer determines what the model can represent, how efficiently it compresses text, whether it handles non-English languages well, and even whether it can do arithmetic.
A model with a 32K-token context window is limited by tokens, not by characters or words. If your tokenizer wastes tokens on inefficient encodings (e.g., splitting every Chinese character into 3 byte-level tokens), your effective context shrinks by 3x for that language. Information theory gives us the tools to reason about what "efficient" means.
Mental Model
Think of tokenization as building a codebook. You have a source language (raw text) and you want to compress it into a shorter sequence of symbols (tokens). Shannon's source coding theorem says the best you can do is the entropy of the source. BPE approximately achieves this by greedily building a codebook that captures frequent patterns.
A good tokenizer compresses common text heavily (short token sequences) while gracefully handling rare text (falling back to character or byte-level tokens). The vocabulary size is a design choice that trades off between compression (large means shorter sequences) and the difficulty of learning token embeddings (large means more parameters and sparser training signal).
Formal Setup and Notation
Let $\Sigma$ be an alphabet (e.g., the 256 possible UTF-8 bytes) and $\Sigma^*$ the set of all strings over $\Sigma$. A tokenizer is a function $T : \Sigma^* \to V^*$ that maps strings to sequences of tokens from a vocabulary $V$ with $|V| \gg |\Sigma|$.
Byte Pair Encoding (BPE)
BPE builds a vocabulary by iteratively merging the most frequent adjacent pair of symbols:
- Start with a vocabulary of individual bytes (or characters)
- Count all adjacent pairs in the training corpus
- Merge the most frequent pair into a single new token
- Repeat the counting and merging steps until the vocabulary reaches the desired size
At inference time, apply the learned merges greedily in order of priority to tokenize new text.
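The training loop above can be sketched in a few lines of Python. This is a minimal illustration (whitespace-split words, no byte fallback, arbitrary tie-breaking), not a production implementation:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)   # the new vocabulary entry
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merges from a whitespace-split corpus."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = counts.most_common(1)[0][0]   # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges
```

For example, `train_bpe("abab cdcd abab", 2)` learns `("a", "b")` as its first merge, since that pair occurs four times.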
Compression Ratio
The compression ratio of a tokenizer $T$ on a text corpus $x$ is:

$$\mathrm{CR}(T, x) = \frac{\text{number of characters in } x}{\text{number of tokens in } T(x)}$$

A higher compression ratio means each token carries more information on average. English text typically achieves CR around 3.5-4.5 with modern tokenizers.
Fertility
The fertility of a tokenizer for a language or domain is the average number of tokens per word (or per character):

$$\text{fertility} = \frac{\text{number of tokens}}{\text{number of words}}$$

Lower fertility means more efficient representation. English has fertility around 1.3 with GPT-4's tokenizer; some languages exceed 3.0.
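Both metrics are one-liners given a tokenization. A sketch (the example tokenization is hypothetical, with leading spaces folded into tokens as many BPE vocabularies do):

```python
def compression_ratio(text, tokens):
    """Characters of input per token produced: higher = better compression."""
    return len(text) / len(tokens)

def fertility(words_tokenized):
    """Average tokens per word; `words_tokenized` maps a word to its tokens."""
    total_tokens = sum(len(toks) for toks in words_tokenized.values())
    return total_tokens / len(words_tokenized)

text = "the cat sat"                 # 11 characters
tokens = ["the", " cat", " sat"]     # 3 tokens, so CR = 11/3 ≈ 3.67
print(compression_ratio(text, tokens))
print(fertility({"the": ["the"], "cat": [" cat"], "sat": [" sat"]}))
```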
Core Definitions
SentencePiece is a widely used tokenization library that treats the input as a raw character stream, with whitespace as just another symbol (no language-specific pre-tokenization rules). It supports both BPE and unigram language model tokenization. The unigram approach assigns probabilities to candidate tokens and picks the segmentation that maximizes the likelihood, which is more principled than greedy BPE but gives similar results in practice.
The information-theoretic view: let $H$ be the per-character entropy of the source text. A tokenizer with vocabulary size $|V|$ and average token length $\ell$ characters produces sequences of about $n/\ell$ tokens for an input of $n$ characters. Each token carries at most $\log_2 |V|$ bits. The minimum achievable sequence length is bounded by the entropy:

$$\frac{n}{\ell} \ge \frac{nH}{\log_2 |V|}$$

Good tokenization pushes the actual token count toward this bound. In practice, BPE gets surprisingly close for the language it was trained on.
Main Theorems
Tokenization Length and Source Entropy
Statement
For a stationary ergodic source with entropy rate $H$ (bits per character), any lossless tokenizer with vocabulary size $|V|$ satisfies:

$$\mathbb{E}\left[\frac{\text{tokens}}{\text{character}}\right] \ge \frac{H}{\log_2 |V|}$$
Equality is achieved by an optimal variable-length code. BPE approximates this bound for frequent patterns in the training distribution.
Intuition
Each token is one of $|V|$ symbols, so it carries at most $\log_2 |V|$ bits of information. The source produces $H$ bits per character. You need at least $H / \log_2 |V|$ tokens per character to avoid losing information, just as you cannot compress a file below its entropy.
Proof Sketch
This follows from Shannon's source coding theorem. The total information in $n$ characters is approximately $nH$ bits. Encoding this into a sequence of $m$ tokens from an alphabet of size $|V|$ provides at most $m \log_2 |V|$ bits. For lossless encoding, $m \log_2 |V| \ge nH$, giving $m/n \ge H / \log_2 |V|$.
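Plugging illustrative numbers into the bound makes it concrete. Both values below are assumptions for the sake of arithmetic, not measurements of any particular tokenizer:

```python
import math

H = 1.0      # assumed entropy of English, bits per character
V = 50_000   # assumed vocabulary size

min_tokens_per_char = H / math.log2(V)        # lower bound from the theorem
max_chars_per_token = 1 / min_tokens_per_char # equivalent upper bound

print(f"minimum tokens per character: {min_tokens_per_char:.3f}")
print(f"maximum characters per token: {max_chars_per_token:.1f}")
```

Under these assumptions a 50K vocabulary could in principle pack about 15.6 characters into each token; real tokenizers average closer to 4, partly because a fixed vocabulary cannot exploit long-range structure in the text.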
Why It Matters
This bound tells you the theoretical limit of tokenization efficiency. If your tokenizer is far from this bound for some language, it is wasting context window on that language. This directly explains the multilingual tokenization gap: tokenizers trained mostly on English are near-optimal for English but far from optimal for underrepresented languages.
BPE Approximates Optimal Compression
Statement
BPE with vocabulary size $|V|$ achieves a compression ratio that approaches the optimal (entropy) rate as $|V|$ grows. Specifically, the redundancy (excess tokens per character beyond the entropy bound) of BPE is bounded by:

$$O\!\left(\frac{\log \log |V|}{\log |V|}\right)$$

for a memoryless source, and empirically scales similarly for natural language.
Intuition
BPE greedily captures the most frequent patterns first. As the vocabulary grows, it captures longer and longer common substrings, approaching the entropy rate. Each merge step removes the most redundancy per added vocabulary entry, which is a greedy approximation to optimal dictionary coding.
Proof Sketch
For a memoryless source, the optimal code length is $H / \log_2 |V|$ tokens per character. Each BPE merge targets the most frequent adjacent pair; if that pair occurs with probability $p$, replacing its two tokens with one saves roughly $p$ tokens (about $p \log_2 |V|$ bits of encoding) per position asymptotically. Summed over merges, the savings approach the entropy bound. The formal proof relies on connecting BPE to Lempel-Ziv-style dictionary methods and using the known redundancy bounds for those algorithms.
Why It Matters
This justifies BPE as a practical choice: it is simple, fast, and approximately optimal. You do not need sophisticated coding theory to build a good tokenizer; greedy merging gets you most of the way there.
Failure Mode
BPE is optimized for its training distribution. On out-of-distribution text (rare languages, code, mathematical notation), it can produce very long token sequences. The compression bound holds in expectation over the training distribution, not worst-case.
Canonical Examples
BPE merge example
Starting vocabulary: $\{a, b, c, d, \text{space}\}$. Corpus: "abab cdcd abab".
Step 1: Most frequent pair is (a, b) with count 4. Merge into token "ab". The corpus becomes: "ab ab cd cd ab ab". Step 2: Most frequent pair is (c, d) with count 2 (tied with (ab, ab); ties are broken arbitrarily). Merge into "cd". Step 3: Most frequent pair is now (ab, ab) with count 2; merging it makes "abab" a single token.
After enough merges, common words become single tokens. The string "abab" that initially took 4 tokens now takes 2 (or 1 with another merge).
Tokenization failure: arithmetic
GPT-style tokenizers merge digit sequences inconsistently. The number "123456" might be tokenized as ["123", "456"] or ["12", "3456"] depending on context. The model must learn that "123" followed by "456" represents one hundred twenty-three thousand four hundred fifty-six, not two separate numbers. This is why LLMs struggle with arithmetic: the tokenizer imposes an unnatural decomposition of numbers.
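A toy greedy longest-match tokenizer makes the effect visible. This is a simplification of real BPE inference, and the vocabulary below is invented to mimic digit merges; the point is that surrounding context changes how the same digit string is split:

```python
def greedy_tokenize(text, vocab, max_len=4):
    """Greedy longest-match tokenization against a toy vocabulary.
    Single characters are always available as a fallback."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens

# Hypothetical merged tokens, chosen to mimic digit merges in real vocabularies
vocab = {"x1", "123", "456", "23", "3456"}

print(greedy_tokenize("123456", vocab))   # ['123', '456']
print(greedy_tokenize("x123456", vocab))  # ['x1', '23', '456']
```

The same six digits decompose as 123|456 in isolation but 23|456 after a letter: the model sees two unrelated token sequences for one number.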
Common Confusions
BPE is not word-level tokenization
BPE operates on subword units. Common words become single tokens, but rare words are split into pieces. This is not a bug but a feature: it gives the model a fallback for any string (down to individual bytes) while keeping common text compact. The model never encounters an out-of-vocabulary string.
Larger vocabulary is not always better
Increasing $|V|$ improves compression (fewer tokens per text) but increases the embedding matrix size ($|V| \times d_{\text{model}}$ parameters) and makes each token's training signal sparser. There is an optimal $|V|$ that balances compression, parameter count, and training efficiency. Most modern LLMs use vocabularies between 32K and 256K tokens.
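A back-of-the-envelope calculation shows the parameter cost. The model dimension and vocabulary sizes here are illustrative, not taken from any specific model:

```python
# Embedding matrix cost |V| x d (doubled again if output weights are untied).
d = 4096  # assumed model dimension
for V in [32_000, 128_000, 256_000]:
    params = V * d
    print(f"|V| = {V:>7,}: {params / 1e6:,.0f}M embedding parameters")
```

Going from 32K to 256K tokens adds nearly a billion embedding parameters at this dimension, which is why vocabulary size is a real budget decision rather than a free win.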
Tokenization is not language-neutral
A tokenizer trained on 90% English text learns English-optimal merges. Non-Latin scripts may be split at the byte level, tripling token counts. This means the model's effective context length and per-token information content vary by language. Multilingual tokenizers address this by balancing the training corpus or using character-aware methods.
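The byte-level penalty is easy to verify: non-Latin scripts cost 2-3 UTF-8 bytes per character, so a tokenizer that falls back to bytes pays that factor in sequence length.

```python
# UTF-8 byte counts per character: a byte-level fallback pays this
# multiplier in tokens for scripts whose bytes were never merged.
for s in ["hello", "héllo", "你好", "नमस्ते"]:
    print(f"{s!r}: {len(s)} chars -> {len(s.encode('utf-8'))} bytes")
```

ASCII is 1 byte per character, accented Latin 2, and CJK or Devanagari 3, so a purely byte-level encoding triples the token count for the latter scripts.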
Summary
- BPE builds a vocabulary by greedily merging frequent character pairs; it approximates entropy-optimal compression
- Good tokenization minimizes expected sequence length, approaching the information-theoretic bound of $H / \log_2 |V|$ tokens per character
- Tokenizer quality varies by language: underrepresented languages get worse compression and shorter effective context
- Tokenization failures (numbers, rare scripts, code) directly cause model failures because the model only sees the tokenized representation
- Vocabulary size is a tradeoff: larger means better compression but more parameters and sparser training
Exercises
Problem
English text has entropy of approximately $H \approx 1$ bit per character. A tokenizer with $|V| = 50{,}000$ tokens achieves an average token length of 4.1 characters. How close is this to the entropy bound?
Problem
Explain why byte-level BPE (starting from 256 byte tokens) is strictly more general than character-level BPE (starting from Unicode characters). What does it gain and what does it lose?
Problem
The unigram language model tokenizer (used in SentencePiece) finds the segmentation $(t_1, \ldots, t_m)$ that maximizes the log-likelihood $\sum_{i=1}^{m} \log p(t_i)$, where the $p(t_i)$ are learned token probabilities. How does this relate to the information-theoretic view of tokenization as compression?
References
Canonical:
- Sennrich, Haddow, Birch, Neural Machine Translation of Rare Words with Subword Units (ACL 2016)
- Kudo & Richardson, SentencePiece: A Simple and Language Independent Subword Tokenizer (EMNLP 2018)
Current:
- Rust et al., How Good is Your Tokenizer? (ACL 2021)
- Petrov et al., Language Model Tokenizers Introduce Unfairness Between Languages (NeurIPS 2023)
- Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12
Next Topics
Tokenization feeds directly into:
- Distributed training theory: how token sequence length affects batch size and communication
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Information Theory Foundations (Layer 0B)