LLM Construction
Token Prediction and Language Modeling
Language modeling as probability assignment over sequences. Autoregressive and masked prediction objectives, perplexity evaluation, and the connection between prediction and compression.
Why This Matters
Every large language model is, at its foundation, a next-token predictor. GPT models are trained to predict the next token given all previous tokens. BERT predicts randomly masked tokens given their context. This seemingly simple objective turns out to be extraordinarily powerful: to predict the next word well, a model must represent syntax, semantics, facts, reasoning patterns, and even aspects of common sense. Understanding language modeling is prerequisite to understanding everything built on top of it.
Mental Model
A language model assigns a probability to every possible sequence of tokens. Good sequences ("The cat sat on the mat") get high probability. Bad sequences ("Mat the on sat cat the") get low probability. The model does not need explicit rules for grammar or meaning. It learns them implicitly by being trained to predict what comes next.
Formal Setup
Let $V$ be a vocabulary of tokens. A language model defines a probability distribution $p(x_1, \dots, x_T)$ over sequences $x_1, \dots, x_T$ where each $x_t \in V$.
Language Model
A language model is a probability distribution over token sequences. For any sequence $x_1, \dots, x_T$ of length $T$, it assigns a probability $p(x_1, \dots, x_T) \ge 0$, with the constraint that probabilities over all possible sequences sum to 1.
Autoregressive Factorization
Autoregressive Language Model
By the chain rule of probability, any joint distribution over sequences can be factored as:

$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$

An autoregressive language model parameterizes each conditional $p(x_t \mid x_1, \dots, x_{t-1})$ with a neural network. The model processes the prefix $x_1, \dots, x_{t-1}$ and outputs a distribution over $V$ for the next token.
This is the factorization used by GPT and all decoder-only transformers. It is causal: each token can only depend on previous tokens, not future ones. Generation is straightforward: sample $x_1$ from $p(x_1)$, then $x_2$ from $p(x_2 \mid x_1)$, and so on.
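The sampling loop described above can be sketched in a few lines. This is a minimal illustration, assuming a toy bigram lookup table in place of a neural network; the loop structure (condition on the prefix, get a distribution, sample the next token) is the same for a real transformer.

```python
import random

# Toy "model": conditional distributions p(x_t | previous token).
# A real autoregressive model conditions on the entire prefix.
BIGRAM = {
    "<s>":  {"the": 0.6, "a": 0.4},
    "the":  {"cat": 0.5, "mat": 0.5},
    "a":    {"cat": 1.0},
    "cat":  {"sat": 1.0},
    "sat":  {"</s>": 1.0},
    "mat":  {"</s>": 1.0},
}

def sample_sequence(max_len=10, seed=0):
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_len):
        dist = BIGRAM[tokens[-1]]        # distribution over the next token
        nxt = rng.choices(list(dist), weights=list(dist.values()))[0]
        tokens.append(nxt)
        if nxt == "</s>":                # end-of-sequence token stops generation
            break
    return tokens

print(sample_sequence())
```

Swapping the lookup table for a network that maps the full prefix to logits turns this sketch into real decoding.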
Main Theorems
Chain Rule Factorization for Language Models
Statement
For any probability distribution $p$ over sequences with $p(x_1, \dots, x_{t-1}) > 0$ for each prefix:

$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})$$
This factorization is exact with no approximation. The autoregressive decomposition is left-to-right by convention, but any permutation of the conditioning order yields a valid factorization.
Intuition
This is just the chain rule of probability applied sequentially. The content is not in the identity itself but in the modeling choice: parameterize each conditional with a shared neural network. This weight sharing across positions is what makes the approach tractable.
Proof Sketch
Direct application of the definition of conditional probability: $p(x_t \mid x_1, \dots, x_{t-1}) = p(x_1, \dots, x_t) / p(x_1, \dots, x_{t-1})$. Multiplying the factors telescopes to the joint. Each factor is well-defined as long as the conditioning event has positive probability.
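The telescoping can be checked numerically. Below is a small verification, not from the source, that the product of conditionals reproduces an arbitrary joint distribution exactly, using a random joint over length-3 sequences from a 2-token vocabulary.

```python
import random
from itertools import product

# Build a random normalized joint p(x1, x2, x3) over a 2-token vocabulary.
rng = random.Random(42)
V = [0, 1]
raw = {seq: rng.random() for seq in product(V, repeat=3)}
Z = sum(raw.values())
joint = {seq: w / Z for seq, w in raw.items()}

def marginal(prefix):
    """p(x_1..x_k = prefix), summing the joint over all continuations."""
    k = len(prefix)
    return sum(p for seq, p in joint.items() if seq[:k] == prefix)

for seq in joint:
    # p(x1) * p(x2 | x1) * p(x3 | x1, x2), each from the definition
    prod = 1.0
    for t in range(3):
        prod *= marginal(seq[:t + 1]) / marginal(seq[:t])
    assert abs(prod - joint[seq]) < 1e-12
print("chain rule factorization is exact")
```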
Why It Matters
This factorization means a language model only needs to solve one type of problem: predict the next token given a prefix. The training objective (next-token cross-entropy) directly optimizes this. No separate objectives for syntax, semantics, or world knowledge are needed.
Failure Mode
Left-to-right factorization is arbitrary. For tasks like fill-in-the-blank or text editing, bidirectional context is more natural. This motivates masked language modeling (BERT), which conditions on both left and right context but sacrifices the clean autoregressive factorization.
Perplexity Equals Exponentiated Cross-Entropy
Statement
The perplexity of a language model $p$ on a sequence $x_1, \dots, x_T$ is:

$$\mathrm{PPL}(x_1, \dots, x_T) = \exp\!\left(-\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_{<t})\right) = \exp(\mathcal{L})$$

where $\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_{<t})$ is the per-token cross-entropy loss. Perplexity can be interpreted as the effective vocabulary size the model is choosing from at each step.
Intuition
If the model assigns probability $1/k$ to the correct next token on average, the perplexity is $k$. A perplexity of 20 means the model is "as uncertain" as if it were choosing uniformly among 20 options at each step. Lower perplexity means better prediction.
Proof Sketch
By definition, cross-entropy $\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_{<t})$. Perplexity is $\exp(\mathcal{L})$. If $p(x_t \mid x_{<t}) = 1/k$ for all $t$, then $\mathcal{L} = \log k$ and $\mathrm{PPL} = k$.
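This definition is a one-liner in code. A minimal sketch, with the uniform-guessing case above used as a sanity check:

```python
import math

def perplexity(logprobs):
    """Exponentiated average negative log-probability.

    logprobs: natural-log probabilities assigned to each correct next token.
    """
    avg_nll = -sum(logprobs) / len(logprobs)
    return math.exp(avg_nll)

# Uniform guessing over k options gives log-prob -log(k) per token,
# so perplexity recovers k exactly:
k = 20
print(perplexity([-math.log(k)] * 5))   # recovers k
```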
Why It Matters
Perplexity is the standard evaluation metric for language models. It is directly comparable across models with the same tokenizer and test set. The connection to cross-entropy means minimizing the training loss is equivalent to minimizing perplexity.
Failure Mode
Perplexity depends on the tokenizer. A model using byte-pair encoding with 32K tokens and a model using character-level tokenization with 256 tokens have incomparable perplexities. Always compare perplexity with the same tokenizer or convert to bits-per-character for fair comparison.
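The bits-per-character conversion mentioned above is a short computation: total cross-entropy in bits divided by the character count. A sketch with hypothetical numbers (the loss value and token/character counts are assumptions for illustration):

```python
import math

def bits_per_char(token_nll_nats, n_tokens, n_chars):
    """Convert per-token cross-entropy (in nats) to bits-per-character.

    Total code length in bits = nats * n_tokens / ln(2); dividing by the
    character count gives a tokenizer-neutral metric.
    """
    total_bits = token_nll_nats * n_tokens / math.log(2)
    return total_bits / n_chars

# Hypothetical: a BPE model at 2.5 nats/token on text averaging
# 4 characters per token.
bpc = bits_per_char(2.5, n_tokens=1000, n_chars=4000)
print(round(bpc, 3))
```

Two models with different tokenizers can be compared fairly on this scale, since both are measured against the same character stream.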
Masked Language Modeling
Masked Language Model (MLM)
Instead of predicting left-to-right, a masked language model randomly masks a fraction (typically 15%) of input tokens and predicts the original token at each masked position given all unmasked tokens:

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{t \in M} \log p(x_t \mid x_{\setminus M})$$

where $M$ is the set of masked positions and $x_{\setminus M}$ denotes the sequence with the positions in $M$ masked. This is the training objective of BERT.
MLM uses bidirectional context: both left and right tokens inform the prediction. This is an advantage for understanding tasks (classification, question answering) but a disadvantage for generation, since the model does not define a proper autoregressive distribution.
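The input-corruption step can be sketched as follows. This is a simplified illustration assuming a 15% mask rate and a generic `[MASK]` token; BERT's actual recipe additionally splits masked positions 80/10/10 between `[MASK]`, a random token, and the original token, which is omitted here for brevity.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace ~mask_rate of tokens with [MASK]; return inputs and targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok             # the model must predict this original token
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens("the cat sat on the mat".split(), seed=1)
print(masked, targets)
```

Note that the loss is computed only at the positions in `targets`, which is why MLM gets training signal from only ~15% of tokens per example.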
Causal vs. Bidirectional
| Property | Autoregressive (GPT) | Masked (BERT) |
|---|---|---|
| Context | Left-to-right only | Full bidirectional |
| Defines $p(x_1, \dots, x_T)$ | Yes, via chain rule | No |
| Natural for generation | Yes | No |
| Natural for understanding | Weaker | Stronger |
| Training signal | Every token is predicted | Only masked tokens (~15%) |
Prediction and Compression
Good prediction implies good compression. A language model that assigns probability $p(x_t \mid x_{<t})$ to each token can encode that token using $-\log_2 p(x_t \mid x_{<t})$ bits (arithmetic coding). The total message length for a sequence is:

$$\ell(x_1, \dots, x_T) = -\sum_{t=1}^{T} \log_2 p(x_t \mid x_{<t}) \;\text{bits}$$
A model with lower perplexity compresses text more efficiently. This connection to information theory explains why next-token prediction is so powerful: to compress well, the model must capture all regularities in the data, including grammar, facts, reasoning patterns, and style.
Common Confusions
Next-token prediction is not shallow pattern matching
A common objection is that predicting the next word is too simple to produce genuine understanding. But consider: to predict the next token in a math proof, the model must understand the proof. To predict the next token in a code function, the model must understand the algorithm. The simplicity is in the objective, not in what the model must learn to achieve it.
MLM does not define a generative model
BERT's masked language modeling does not give a consistent joint probability $p(x_1, \dots, x_T)$. The conditionals for different masks may be inconsistent (they do not correspond to any single joint distribution in general). This is why BERT cannot generate text as cleanly as GPT.
Perplexity does not measure downstream task quality
A model with lower perplexity predicts text better but may not perform better on specific tasks like question answering or classification. Perplexity measures the average prediction quality across all tokens, while downstream tasks may depend on specific types of knowledge or reasoning.
Canonical Examples
Perplexity calculation
A language model with vocabulary size 50,000 evaluates a test sentence of $T = 5$ tokens. The log-probabilities assigned to each correct token are: $-2.3, -1.6, -3.0, -2.7, -2.9$. The average negative log-probability is $\mathcal{L} = 12.5 / 5 = 2.5$. Perplexity $= e^{2.5} \approx 12.2$. The model behaves as if choosing among about 12 equally likely options at each step, far better than the uniform baseline perplexity of 50,000.
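The worked perplexity calculation can be verified in a few lines; the per-token log-probabilities below are illustrative values chosen to give a perplexity near 12.

```python
import math

# Illustrative natural-log probabilities for 5 test tokens.
logprobs = [-2.3, -1.6, -3.0, -2.7, -2.9]
avg_nll = -sum(logprobs) / len(logprobs)    # average negative log-probability
ppl = math.exp(avg_nll)                     # perplexity = exp(cross-entropy)
print(round(avg_nll, 2), round(ppl, 1))
```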
Exercises
Problem
A language model with vocabulary size $|V|$ assigns uniform probability $1/|V|$ to all tokens at every position. What is its perplexity? If another model achieves perplexity 50, by what factor is it better at compression?
Problem
Prove that for an autoregressive language model $q$, the perplexity on sequences drawn from the true data distribution $p$ cannot be lower than the perplexity of the true distribution itself (up to finite-sample effects). That is, show $\mathbb{E}_{x \sim p}[-\log q(x)] \ge H(p)$, where $H(p)$ is the entropy of the true distribution.
References
Canonical:
- Shannon, "A Mathematical Theory of Communication" (1948), the foundation
- Bengio et al., "A Neural Probabilistic Language Model" (2003), JMLR
- Mikolov et al., "Recurrent Neural Network Based Language Model" (2010)
Current:
- Radford et al., "Language Models are Unsupervised Multitask Learners" (2019), GPT-2 paper
- Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers" (2019), NAACL
- Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12
Next Topics
- BERT and the pretrain-finetune paradigm: masked LM applied at scale
- Transformer architecture: the architecture that makes large-scale language modeling practical
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Information Theory Foundations (Layer 0B)