
LLM Construction

Token Prediction and Language Modeling

Language modeling as probability assignment over sequences. Autoregressive and masked prediction objectives, perplexity evaluation, and the connection between prediction and compression.


Why This Matters

Every large language model is, at its foundation, a next-token predictor. GPT models are trained to predict the next token given all previous tokens. BERT predicts randomly masked tokens given their context. This seemingly simple objective turns out to be extraordinarily powerful: to predict the next word well, a model must represent syntax, semantics, facts, reasoning patterns, and even aspects of common sense. Understanding language modeling is prerequisite to understanding everything built on top of it.

Mental Model

A language model assigns a probability to every possible sequence of tokens. Good sequences ("The cat sat on the mat") get high probability. Bad sequences ("Mat the on sat cat the") get low probability. The model does not need explicit rules for grammar or meaning. It learns them implicitly by being trained to predict what comes next.

Formal Setup

Let $V$ be a vocabulary of tokens. A language model defines a probability distribution over sequences $x_1, x_2, \ldots, x_T$ where each $x_t \in V$.

Definition

Language Model

A language model is a probability distribution over token sequences. For any sequence of length $T$, it assigns a probability $p(x_1, \ldots, x_T) \geq 0$ with the constraint that probabilities over all possible sequences sum to 1.

Autoregressive Factorization

Definition

Autoregressive Language Model

By the chain rule of probability, any joint distribution over sequences can be factored as:

$$p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$

An autoregressive language model parameterizes each conditional $p(x_t \mid x_{<t})$ with a neural network. The model processes the prefix $x_1, \ldots, x_{t-1}$ and outputs a distribution over $V$ for the next token.

This is the factorization used by GPT and all decoder-only transformers. It is causal: each token can only depend on previous tokens, not future ones. Generation is straightforward: sample $x_1$ from $p(x_1)$, then $x_2$ from $p(x_2 \mid x_1)$, and so on.
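The generation procedure above (ancestral sampling) can be sketched with a toy model. The bigram count model below stands in for the neural network; it is a hypothetical stand-in, since a real model conditions on the full prefix, not just the last token.

```python
import random

# Toy bigram "language model": estimate p(x_t | x_{t-1}) from counts.
# A real autoregressive model conditions on the entire prefix x_{<t}.
corpus = "the cat sat on the mat . the cat ran .".split()
counts = {}
for prev, nxt in zip(corpus, corpus[1:]):
    counts.setdefault(prev, {}).setdefault(nxt, 0)
    counts[prev][nxt] += 1

def next_token_dist(prev):
    """Conditional distribution p(x_t | x_{t-1}) from counts."""
    c = counts[prev]
    total = sum(c.values())
    return {tok: n / total for tok, n in c.items()}

def sample(start, max_len=8, seed=0):
    """Ancestral sampling: draw x_2 from p(.|x_1), x_3 from p(.|x_2), ..."""
    rng = random.Random(seed)
    seq = [start]
    for _ in range(max_len - 1):
        dist = next_token_dist(seq[-1])
        toks, probs = zip(*dist.items())
        seq.append(rng.choices(toks, weights=probs)[0])
        if seq[-1] == ".":  # stop at end-of-sentence marker
            break
    return seq

print(" ".join(sample("the")))
```

Swapping the count table for a transformer over the whole prefix recovers the GPT-style setup; the sampling loop itself is unchanged.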

Main Theorems

Theorem

Chain Rule Factorization for Language Models

Statement

For any probability distribution over sequences $x_1, \ldots, x_T$ with $x_t \in V$:

$$p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$$

This factorization is exact with no approximation. The autoregressive decomposition is left-to-right by convention, but any permutation of the conditioning order yields a valid factorization.

Intuition

This is just the chain rule of probability applied sequentially. The content is not in the identity itself but in the modeling choice: parameterize each conditional with a shared neural network. This weight sharing across positions is what makes the approach tractable.

Proof Sketch

Direct application of the definition of conditional probability: $p(x_1, \ldots, x_T) = p(x_1) \cdot p(x_2 \mid x_1) \cdot p(x_3 \mid x_1, x_2) \cdots p(x_T \mid x_{<T})$. Each factor is well-defined as long as the conditioning event has positive probability.
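The exactness of the factorization can be checked numerically on a toy joint distribution. The distribution below is made up for illustration; any joint over length-2 sequences would do.

```python
# Tiny joint distribution over length-2 sequences from vocabulary {a, b}.
joint = {("a", "a"): 0.4, ("a", "b"): 0.2, ("b", "a"): 0.1, ("b", "b"): 0.3}

def marginal_x1(x1):
    """p(x1): sum the joint over all continuations."""
    return sum(p for (a, _), p in joint.items() if a == x1)

def cond_x2(x2, x1):
    """p(x2 | x1) = p(x1, x2) / p(x1)."""
    return joint[(x1, x2)] / marginal_x1(x1)

# Chain rule: p(x1, x2) = p(x1) * p(x2 | x1), exactly, for every sequence.
for (x1, x2), p in joint.items():
    assert abs(p - marginal_x1(x1) * cond_x2(x2, x1)) < 1e-12
```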

Why It Matters

This factorization means a language model only needs to solve one type of problem: predict the next token given a prefix. The training objective (next-token cross-entropy) directly optimizes this. No separate objectives for syntax, semantics, or world knowledge are needed.

Failure Mode

Left-to-right factorization is arbitrary. For tasks like fill-in-the-blank or text editing, bidirectional context is more natural. This motivates masked language modeling (BERT), which conditions on both left and right context but sacrifices the clean autoregressive factorization.

Proposition

Perplexity Equals Exponentiated Cross-Entropy

Statement

The perplexity of a language model $p$ on a sequence $x_1, \ldots, x_T$ is:

$$\text{PPL} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T}\log p(x_t \mid x_{<t})\right) = \exp(H_{\text{CE}})$$

where $H_{\text{CE}}$ is the per-token cross-entropy loss. Perplexity can be interpreted as the effective vocabulary size the model is choosing from at each step.

Intuition

If the model assigns probability $1/k$ to the correct next token on average, the perplexity is $k$. A perplexity of 20 means the model is "as uncertain" as if it were choosing uniformly among 20 options at each step. Lower perplexity means better prediction.

Proof Sketch

By definition, cross-entropy $H_{\text{CE}} = -\frac{1}{T}\sum_t \log p(x_t \mid x_{<t})$. Perplexity is $\exp(H_{\text{CE}})$. If $p(x_t \mid x_{<t}) = 1/k$ for all $t$, then $H_{\text{CE}} = \log k$ and $\text{PPL} = k$.
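The definition translates directly into a few lines of code. This is a minimal sketch taking per-token log-probabilities in nats; it also checks the uniform case, where perplexity equals $k$ exactly.

```python
import math

def perplexity(log_probs):
    """Perplexity = exp(per-token cross-entropy); log_probs are in nats."""
    h_ce = -sum(log_probs) / len(log_probs)
    return math.exp(h_ce)

# A uniform model over k tokens assigns log-prob -log(k) everywhere,
# so its perplexity is exactly k.
k = 20
uniform = [math.log(1 / k)] * 5
print(perplexity(uniform))  # 20.0 (up to floating-point error)
```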

Why It Matters

Perplexity is the standard evaluation metric for language models. It is directly comparable across models with the same tokenizer and test set. The connection to cross-entropy means minimizing the training loss is equivalent to minimizing perplexity.

Failure Mode

Perplexity depends on the tokenizer. A model using byte-pair encoding with 32K tokens and a model using character-level tokenization with 256 tokens have incomparable perplexities. Always compare perplexity with the same tokenizer or convert to bits-per-character for fair comparison.
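One way to make the comparison tokenizer-independent, as suggested above, is to convert to bits-per-character. A minimal sketch of the conversion, assuming per-token log-probabilities in nats and a known character count for the evaluated text:

```python
import math

def bits_per_char(token_log_probs, num_chars):
    """Convert token-level log-probs (nats) to bits-per-character.

    Total code length is -sum(log2 p); dividing by the character count
    of the underlying text removes the tokenizer's granularity, so
    models with different vocabularies become comparable.
    """
    total_bits = -sum(lp / math.log(2) for lp in token_log_probs)
    return total_bits / num_chars

# Sanity check: a uniform byte-level model (256 symbols, one per char)
# costs exactly 8 bits per character.
print(bits_per_char([math.log(1 / 256)] * 4, num_chars=4))  # 8.0
```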

Masked Language Modeling

Definition

Masked Language Model (MLM)

Instead of predicting left-to-right, a masked language model randomly masks a fraction (typically 15%) of input tokens and predicts the original token at each masked position given all unmasked tokens:

$$\mathcal{L}_{\text{MLM}} = -\sum_{t \in \text{masked}} \log p(x_t \mid x_{\setminus t})$$

where $x_{\setminus t}$ denotes the sequence with position $t$ masked. This is the training objective of BERT.

MLM uses bidirectional context: both left and right tokens inform the prediction. This is an advantage for understanding tasks (classification, question answering) but a disadvantage for generation, since the model does not define a proper autoregressive distribution.
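The input-corruption step can be sketched as follows. This is a simplified version: BERT's full recipe replaces 80% of selected positions with `[MASK]`, 10% with a random token, and leaves 10% unchanged, whereas this sketch masks every selected position.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style corruption sketch: mask ~mask_prob of positions and
    record the originals that must be predicted. The loss is computed
    only at the masked positions."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append("[MASK]")
            targets[i] = tok  # original token to predict at position i
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "the cat sat on the mat".split()
corrupted, targets = mask_tokens(tokens, mask_prob=0.5, seed=0)
print(corrupted, targets)
```

The prediction model itself is omitted; at each masked position it would read the full corrupted sequence (both sides of the mask) and output a distribution over the vocabulary.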

Causal vs. Bidirectional

| Property | Autoregressive (GPT) | Masked (BERT) |
| --- | --- | --- |
| Context | Left-to-right only | Full bidirectional |
| Defines $p(x_1, \ldots, x_T)$ | Yes, via chain rule | No |
| Natural for generation | Yes | No |
| Natural for understanding | Weaker | Stronger |
| Training signal | Every token is predicted | Only masked tokens (~15%) |

Prediction and Compression

Good prediction implies good compression. A language model that assigns probability $p(x_t \mid x_{<t})$ to each token can encode that token using $-\log_2 p(x_t \mid x_{<t})$ bits (arithmetic coding). The total message length for a sequence is:

$$L = -\sum_{t=1}^{T}\log_2 p(x_t \mid x_{<t}) = T \cdot H_{\text{CE}} \cdot \log_2 e$$

A model with lower perplexity compresses text more efficiently. This connection to information theory explains why next-token prediction is so powerful: to compress well, the model must capture all regularities in the data, including grammar, facts, reasoning patterns, and style.
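The code-length formula above is easy to evaluate directly. A minimal sketch, assuming per-token log-probabilities in nats: a model that puts more probability on the tokens that actually occur produces a shorter message.

```python
import math

def message_bits(log_probs_nats):
    """Arithmetic-coding message length: -sum(log2 p(x_t | x_<t)) bits."""
    return -sum(lp / math.log(2) for lp in log_probs_nats)

# A sharper model spends fewer bits on the same 10-token sequence.
good = [math.log(0.5)] * 10   # 1 bit per token -> 10 bits total
weak = [math.log(0.25)] * 10  # 2 bits per token -> 20 bits total
print(message_bits(good), message_bits(weak))  # 10.0 20.0
```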

Common Confusions

Watch Out

Next-token prediction is not shallow pattern matching

A common objection is that predicting the next word is too simple to produce genuine understanding. But consider: to predict the next token in a math proof, the model must understand the proof. To predict the next token in a code function, the model must understand the algorithm. The simplicity is in the objective, not in what the model must learn to achieve it.

Watch Out

MLM does not define a generative model

BERT's masked language modeling does not give a consistent joint probability $p(x_1, \ldots, x_T)$. The conditionals $p(x_t \mid x_{\setminus t})$ for different masks may be inconsistent (they do not correspond to any single joint distribution in general). This is why BERT cannot generate text as cleanly as GPT.

Watch Out

Perplexity does not measure downstream task quality

A model with lower perplexity predicts text better but may not perform better on specific tasks like question answering or classification. Perplexity measures the average prediction quality across all tokens, while downstream tasks may depend on specific types of knowledge or reasoning.

Canonical Examples

Example

Perplexity calculation

A language model with vocabulary size 50,000 evaluates a test sentence of $T = 10$ tokens. The log-probabilities assigned to each correct token are: $-2.3, -1.1, -4.5, -0.8, -3.2, -1.5, -2.0, -5.1, -1.8, -2.7$. The average negative log-probability is $\frac{25.0}{10} = 2.5$. Perplexity $= \exp(2.5) \approx 12.18$. The model behaves as if choosing among about 12 equally likely options at each step, far better than the uniform baseline perplexity of 50,000.
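The same arithmetic, checked in code:

```python
import math

# Log-probabilities from the worked example above (nats).
log_probs = [-2.3, -1.1, -4.5, -0.8, -3.2, -1.5, -2.0, -5.1, -1.8, -2.7]

h_ce = -sum(log_probs) / len(log_probs)  # average negative log-prob = 2.5
ppl = math.exp(h_ce)
print(round(ppl, 2))  # 12.18
```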

Exercises

ExerciseCore

Problem

A language model with vocabulary size $|V| = 10{,}000$ assigns uniform probability to all tokens at every position. What is its perplexity? If another model achieves perplexity 50, by what factor is it better at compression?

ExerciseAdvanced

Problem

Prove that for an autoregressive language model, the perplexity on a sequence cannot be lower than the perplexity of the true data distribution (up to finite-sample effects). That is, show $H_{\text{CE}}(p_{\text{true}}, p_{\text{model}}) \geq H(p_{\text{true}})$ where $H$ is the entropy of the true distribution.

References

Canonical:

  • Shannon, "A Mathematical Theory of Communication" (1948), the foundation
  • Bengio et al., "A Neural Probabilistic Language Model" (2003), JMLR
  • Mikolov et al., "Recurrent Neural Network Based Language Model" (2010)

Current:

  • Radford et al., "Language Models are Unsupervised Multitask Learners" (2019), GPT-2 paper

  • Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers" (2019), NAACL

  • Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12

Last reviewed: April 2026