
LLM Construction

Token Prediction and Language Modeling

Language modeling as probability assignment over sequences. Autoregressive and masked prediction objectives, perplexity evaluation, and the connection between prediction and compression.


Why This Matters

Every large language model is, at its foundation, a next-token predictor. GPT models are trained to predict the next token given all previous tokens. BERT predicts randomly masked tokens given their context. This seemingly simple objective turns out to be extraordinarily powerful: to predict the next word well, a model must represent syntax, semantics, facts, reasoning patterns, and even aspects of common sense. Understanding language modeling is prerequisite to understanding everything built on top of it.

Mental Model

A language model assigns a probability to every possible sequence of tokens. Good sequences ("The cat sat on the mat") get high probability. Bad sequences ("Mat the on sat cat the") get low probability. The model does not need explicit rules for grammar or meaning. It learns them implicitly by being trained to predict what comes next.

Formal Setup

Let $V$ be a vocabulary of tokens. A language model defines a probability distribution over sequences $x_1, x_2, \ldots, x_T$ where each $x_t \in V$.

Definition

Language Model

A language model is a probability distribution over token sequences. For any sequence of length $T$, it assigns a probability $p(x_1, \ldots, x_T) \geq 0$ with the constraint that probabilities over all possible sequences sum to 1.

Autoregressive Factorization

Definition

Autoregressive Language Model

By the chain rule of probability, any joint distribution over sequences can be factored as:

$$p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$

An autoregressive language model parameterizes each conditional $p(x_t \mid x_{<t})$ with a neural network. The model processes the prefix $x_1, \ldots, x_{t-1}$ and outputs a distribution over $V$ for the next token.

This is the factorization used by GPT and all decoder-only transformers. It is causal: each token can only depend on previous tokens, not future ones. Generation is straightforward: sample $x_1$ from $p(x_1)$, then $x_2$ from $p(x_2 \mid x_1)$, and so on.
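The generation procedure above (ancestral sampling) can be sketched with a toy model. The bigram count model below stands in for the neural network; it is a hypothetical stand-in, since a real model conditions on the full prefix, not just the last token.

```python
import random

# Toy bigram "language model": estimate p(x_t | x_{t-1}) from counts.
# A real autoregressive model conditions on the entire prefix x_{<t}.
corpus = "the cat sat on the mat . the cat ran .".split()
counts = {}
for prev, nxt in zip(corpus, corpus[1:]):
    counts.setdefault(prev, {}).setdefault(nxt, 0)
    counts[prev][nxt] += 1

def next_token_dist(prev):
    """Conditional distribution p(x_t | x_{t-1}) from counts."""
    c = counts[prev]
    total = sum(c.values())
    return {tok: n / total for tok, n in c.items()}

def sample(start, max_len=8, seed=0):
    """Ancestral sampling: draw x_2 from p(.|x_1), x_3 from p(.|x_2), ..."""
    rng = random.Random(seed)
    seq = [start]
    for _ in range(max_len - 1):
        dist = next_token_dist(seq[-1])
        toks, probs = zip(*dist.items())
        seq.append(rng.choices(toks, weights=probs)[0])
        if seq[-1] == ".":  # stop at end-of-sentence marker
            break
    return seq

print(" ".join(sample("the")))
```

Swapping the count table for a transformer over the whole prefix recovers the GPT-style setup; the sampling loop itself is unchanged.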

Main Theorems

Theorem

Chain Rule Factorization for Language Models

Statement

For any probability distribution over sequences $x_1, \ldots, x_T$ with $x_t \in V$:

$$p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$$

This factorization is exact with no approximation. The autoregressive decomposition is left-to-right by convention, but any permutation of the conditioning order yields a valid factorization.

Intuition

This is just the chain rule of probability applied sequentially. The content is not in the identity itself but in the modeling choice: parameterize each conditional with a shared neural network. This weight sharing across positions is what makes the approach tractable.

Proof Sketch

Direct application of the definition of conditional probability: $p(x_1, \ldots, x_T) = p(x_1) \cdot p(x_2 \mid x_1) \cdot p(x_3 \mid x_1, x_2) \cdots p(x_T \mid x_{<T})$. Each factor is well-defined as long as the conditioning event has positive probability.
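The exactness of the factorization can be checked numerically on a toy joint distribution. The distribution below is made up for illustration; any joint over length-2 sequences would do.

```python
# Tiny joint distribution over length-2 sequences from vocabulary {a, b}.
joint = {("a", "a"): 0.4, ("a", "b"): 0.2, ("b", "a"): 0.1, ("b", "b"): 0.3}

def marginal_x1(x1):
    """p(x1): sum the joint over all continuations."""
    return sum(p for (a, _), p in joint.items() if a == x1)

def cond_x2(x2, x1):
    """p(x2 | x1) = p(x1, x2) / p(x1)."""
    return joint[(x1, x2)] / marginal_x1(x1)

# Chain rule: p(x1, x2) = p(x1) * p(x2 | x1), exactly, for every sequence.
for (x1, x2), p in joint.items():
    assert abs(p - marginal_x1(x1) * cond_x2(x2, x1)) < 1e-12
```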

Why It Matters

This factorization means a language model only needs to solve one type of problem: predict the next token given a prefix. The training objective (next-token cross-entropy) directly optimizes this. No separate objectives for syntax, semantics, or world knowledge are needed.

Failure Mode

Left-to-right factorization is arbitrary. For tasks like fill-in-the-blank or text editing, bidirectional context is more natural. This motivates masked language modeling (BERT), which conditions on both left and right context but sacrifices the clean autoregressive factorization.

Proposition

Perplexity Equals Exponentiated Cross-Entropy

Statement

The perplexity of a language model $p$ on a sequence $x_1, \ldots, x_T$ is:

$$\text{PPL} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T}\log p(x_t \mid x_{<t})\right) = \exp(H_{\text{CE}})$$

where $H_{\text{CE}}$ is the per-token cross-entropy loss. Perplexity can be interpreted as the effective vocabulary size the model is choosing from at each step.

Intuition

If the model assigns probability $1/k$ to the correct next token on average, the perplexity is $k$. A perplexity of 20 means the model is "as uncertain" as if it were choosing uniformly among 20 options at each step. Lower perplexity means better prediction.

Proof Sketch

By definition, cross-entropy $H_{\text{CE}} = -\frac{1}{T}\sum_t \log p(x_t \mid x_{<t})$. Perplexity is $\exp(H_{\text{CE}})$. If $p(x_t \mid x_{<t}) = 1/k$ for all $t$, then $H_{\text{CE}} = \log k$ and $\text{PPL} = k$.
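The definition translates directly into a few lines of code. This is a minimal sketch taking per-token log-probabilities in nats; it also checks the uniform case, where perplexity equals $k$ exactly.

```python
import math

def perplexity(log_probs):
    """Perplexity = exp(per-token cross-entropy); log_probs are in nats."""
    h_ce = -sum(log_probs) / len(log_probs)
    return math.exp(h_ce)

# A uniform model over k tokens assigns log-prob -log(k) everywhere,
# so its perplexity is exactly k.
k = 20
uniform = [math.log(1 / k)] * 5
print(perplexity(uniform))  # 20.0 (up to floating-point error)
```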

Why It Matters

Perplexity is the standard evaluation metric for language models. It is directly comparable across models with the same tokenizer and test set. The connection to cross-entropy means minimizing the training loss is equivalent to minimizing perplexity.

Failure Mode

Perplexity depends on the tokenizer. A model using byte-pair encoding with 32K tokens and a model using character-level tokenization with 256 tokens have incomparable perplexities. Always compare perplexity with the same tokenizer or convert to bits-per-character for fair comparison.
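One way to make the comparison tokenizer-independent, as suggested above, is to convert to bits-per-character. A minimal sketch of the conversion, assuming per-token log-probabilities in nats and a known character count for the evaluated text:

```python
import math

def bits_per_char(token_log_probs, num_chars):
    """Convert token-level log-probs (nats) to bits-per-character.

    Total code length is -sum(log2 p); dividing by the character count
    of the underlying text removes the tokenizer's granularity, so
    models with different vocabularies become comparable.
    """
    total_bits = -sum(lp / math.log(2) for lp in token_log_probs)
    return total_bits / num_chars

# Sanity check: a uniform byte-level model (256 symbols, one per char)
# costs exactly 8 bits per character.
print(bits_per_char([math.log(1 / 256)] * 4, num_chars=4))  # 8.0
```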

Masked Language Modeling

Definition

Masked Language Model (MLM)

Instead of predicting left-to-right, a masked language model randomly masks a fraction (typically 15%) of input tokens and predicts the original token at each masked position given all unmasked tokens:

$$\mathcal{L}_{\text{MLM}} = -\sum_{t \in \text{masked}} \log p(x_t \mid x_{\setminus t})$$

where $x_{\setminus t}$ denotes the sequence with position $t$ masked. This is the training objective of BERT.

MLM uses bidirectional context: both left and right tokens inform the prediction. This is an advantage for understanding tasks (classification, question answering) but a disadvantage for generation, since the model does not define a proper autoregressive distribution.
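The input-corruption step can be sketched as follows. This is a simplified version: BERT's full recipe replaces 80% of selected positions with `[MASK]`, 10% with a random token, and leaves 10% unchanged, whereas this sketch masks every selected position.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style corruption sketch: mask ~mask_prob of positions and
    record the originals that must be predicted. The loss is computed
    only at the masked positions."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append("[MASK]")
            targets[i] = tok  # original token to predict at position i
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "the cat sat on the mat".split()
corrupted, targets = mask_tokens(tokens, mask_prob=0.5, seed=0)
print(corrupted, targets)
```

The prediction model itself is omitted; at each masked position it would read the full corrupted sequence (both sides of the mask) and output a distribution over the vocabulary.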

Causal vs. Bidirectional

| Property | Autoregressive (GPT) | Masked (BERT) |
| --- | --- | --- |
| Context | Left-to-right only | Full bidirectional |
| Defines $p(x_1, \ldots, x_T)$ | Yes, via chain rule | No |
| Natural for generation | Yes | No |
| Natural for understanding | Weaker | Stronger |
| Training signal | Every token is predicted | Only masked tokens (~15%) |

Prediction and Compression

Good prediction implies good compression. A language model that assigns probability $p(x_t \mid x_{<t})$ to each token can encode that token using $-\log_2 p(x_t \mid x_{<t})$ bits (arithmetic coding). The total message length for a sequence is:

$$L = -\sum_{t=1}^{T}\log_2 p(x_t \mid x_{<t}) = T \cdot H_{\text{CE}} \cdot \log_2 e$$

A model with lower perplexity compresses text more efficiently. This connection to information theory explains why next-token prediction is so powerful: to compress well, the model must capture all regularities in the data, including grammar, facts, reasoning patterns, and style.
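The code-length formula above is easy to evaluate directly. A minimal sketch, assuming per-token log-probabilities in nats: a model that puts more probability on the tokens that actually occur produces a shorter message.

```python
import math

def message_bits(log_probs_nats):
    """Arithmetic-coding message length: -sum(log2 p(x_t | x_<t)) bits."""
    return -sum(lp / math.log(2) for lp in log_probs_nats)

# A sharper model spends fewer bits on the same 10-token sequence.
good = [math.log(0.5)] * 10   # 1 bit per token -> 10 bits total
weak = [math.log(0.25)] * 10  # 2 bits per token -> 20 bits total
print(message_bits(good), message_bits(weak))  # 10.0 20.0
```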

Common Confusions

Watch Out

Next-token prediction is not shallow pattern matching

A common objection is that predicting the next word is too simple to produce genuine understanding. But consider: to predict the next token in a math proof, the model must understand the proof. To predict the next token in a code function, the model must understand the algorithm. The simplicity is in the objective, not in what the model must learn to achieve it.

Watch Out

MLM does not define a generative model

BERT's masked language modeling does not give a consistent joint probability $p(x_1, \ldots, x_T)$. The conditionals $p(x_t \mid x_{\setminus t})$ for different masks may be inconsistent (they do not correspond to any single joint distribution in general). This is why BERT cannot generate text as cleanly as GPT.

Watch Out

Perplexity does not measure downstream task quality

A model with lower perplexity predicts text better but may not perform better on specific tasks like question answering or classification. Perplexity measures the average prediction quality across all tokens, while downstream tasks may depend on specific types of knowledge or reasoning.

Canonical Examples

Example

Perplexity calculation

A language model with vocabulary size 50,000 evaluates a test sentence of $T = 10$ tokens. The log-probabilities assigned to each correct token are: $-2.3, -1.1, -4.5, -0.8, -3.2, -1.5, -2.0, -5.1, -1.8, -2.7$. The average negative log-probability is $\frac{25.0}{10} = 2.5$. Perplexity $= \exp(2.5) \approx 12.18$. The model behaves as if choosing among about 12 equally likely options at each step, far better than the uniform baseline perplexity of 50,000.
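The same arithmetic, checked in code:

```python
import math

# Log-probabilities from the worked example above (nats).
log_probs = [-2.3, -1.1, -4.5, -0.8, -3.2, -1.5, -2.0, -5.1, -1.8, -2.7]

h_ce = -sum(log_probs) / len(log_probs)  # average negative log-prob = 2.5
ppl = math.exp(h_ce)
print(round(ppl, 2))  # 12.18
```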

Exercises

ExerciseCore

Problem

A language model with vocabulary size $|V| = 10{,}000$ assigns uniform probability to all tokens at every position. What is its perplexity? If another model achieves perplexity 50, by what factor is it better at compression?

ExerciseAdvanced

Problem

Prove that for an autoregressive language model, the perplexity on a sequence cannot be lower than the perplexity of the true data distribution (up to finite-sample effects). That is, show $H_{\text{CE}}(p_{\text{true}}, p_{\text{model}}) \geq H(p_{\text{true}})$ where $H$ is the entropy of the true distribution.

References

Canonical:

  • Shannon, "A Mathematical Theory of Communication" (1948), the foundation
  • Bengio et al., "A Neural Probabilistic Language Model" (2003), JMLR
  • Mikolov et al., "Recurrent Neural Network Based Language Model" (2010)

Current:

  • Radford et al., "Language Models are Unsupervised Multitask Learners" (2019), GPT-2 paper

  • Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers" (2019), NAACL

  • Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12

Last reviewed: April 2026