
ML Methods

Natural Language Processing Foundations

The progression from bag-of-words to transformers: tokenization, language modeling, TF-IDF, sequence-to-sequence, attention, and why the pre-train then fine-tune paradigm replaced task-specific architectures.



Why This Matters

NLP has converged to a single paradigm: train a large language model on text, then adapt it to downstream tasks. Understanding the foundations, from tokenization through language modeling to the attention mechanism, explains why this paradigm works and what its limitations are.

The mathematical core is surprisingly simple: predict the next token. The complexity comes from the scale of the model and data, not from the objective.

Mental Model

Text is a sequence of discrete symbols. Every NLP system must (1) convert text to numbers (tokenization), (2) represent the sequence in a way that captures dependencies (the model), and (3) define an objective that drives useful representations (the loss). The history of NLP is the history of increasingly powerful solutions to step (2).

Tokenization

Definition

Byte Pair Encoding

A subword tokenization algorithm. Start with individual characters as the vocabulary. Repeatedly merge the most frequent adjacent pair of tokens into a new token until the vocabulary reaches a target size. This gives a vocabulary that balances character-level and word-level granularity.

BPE handles rare words by decomposing them into known subwords. The word "unhappiness" might tokenize as ["un", "happiness"] or ["un", "happ", "iness"] depending on the learned merges.

SentencePiece treats the input as a raw character stream, including whitespace, making tokenization language-agnostic; byte-level BPE goes further and operates directly on UTF-8 bytes. GPT-2, GPT-3, and GPT-4 all use variants of byte-level BPE.
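The merge loop described above can be sketched in a few lines. This is a toy illustration, not a production tokenizer: it operates on whole words rather than a byte stream, and the corpus and `num_merges` are made up.

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from a list of words (toy sketch)."""
    # Represent each word as a tuple of single-character symbols.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the vocabulary.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = learn_bpe_merges(["low", "lower", "lowest", "low"], num_merges=3)
```

On this tiny corpus the first learned merge is ("l", "o"), since that pair occurs in every word; each subsequent merge builds on the previous one.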

The vocabulary size |V| is a design choice. Larger vocabularies mean shorter sequences (fewer tokens per text) but larger embedding matrices. Typical values: 30,000 to 100,000 tokens.

Language Modeling

Definition

Autoregressive Language Model

Given a sequence of tokens x_1, x_2, \ldots, x_T, model the joint probability using the chain rule:

p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})

Each conditional p(x_t \mid x_{<t}) is parameterized by a neural network that outputs a distribution over the vocabulary.
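As a concrete illustration of the chain rule, the joint probability of a short sequence is just the product of the per-step conditionals. The probabilities below are made up for the example:

```python
import math

# Toy conditionals p(x_t | x_{<t}) for the sequence ["the", "cat", "sat"]:
# p(the), p(cat | the), p(sat | the, cat) -- made-up values.
conditionals = [0.1, 0.05, 0.2]

# Chain rule: the joint probability is the product of the conditionals.
joint = math.prod(conditionals)  # 0.1 * 0.05 * 0.2 = 0.001
```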

Proposition

Cross-Entropy Loss and Perplexity

Statement

The per-token cross-entropy loss of a language model p on a sequence x_1, \ldots, x_T drawn from the true distribution q is:

L = -\frac{1}{T}\sum_{t=1}^{T} \log p(x_t \mid x_{<t})

Perplexity is \text{PPL} = e^L. A perplexity of k means the model is on average as uncertain as a uniform distribution over k tokens. Lower perplexity means better prediction.

Intuition

Cross-entropy measures how many nats (or bits, with base-2 logarithms) the model needs to encode text from the true distribution. Perplexity converts this to an "effective vocabulary size" the model is choosing from at each step. A perfect model's perplexity equals the exponentiated entropy of natural language, estimated at roughly 20-50 per word for English.

Proof Sketch

By definition, cross-entropy H(q, p) = -\mathbb{E}_q[\log p] \geq H(q) with equality when p = q. The empirical estimate replaces the expectation with an average over the observed sequence. Perplexity is the exponential of cross-entropy, converting from log-space to a linear scale.

Why It Matters

This is the training objective for GPT, LLaMA, and every autoregressive language model. Minimizing cross-entropy is equivalent to maximizing the log-likelihood of the training data. Perplexity provides a scale-invariant way to compare models across different datasets.

Failure Mode

Low perplexity does not guarantee useful generation. A model can achieve low perplexity by memorizing frequent patterns while failing on rare but important constructions. Perplexity also does not measure factual accuracy, coherence over long passages, or instruction-following ability.
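The loss-to-perplexity conversion is a one-liner. A minimal sketch, assuming we already have the model's probability for each observed token:

```python
import math

def perplexity(token_probs):
    """Perplexity from the model's probability of each observed token."""
    # Per-token cross-entropy in nats: average negative log-probability.
    loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(loss)

# A model that assigns probability 1/25 to every observed token has
# perplexity 25, matching the "effective vocabulary size" reading.
ppl = perplexity([1 / 25] * 10)  # 25.0
```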

The NLP Pipeline

Every NLP system, from classical to modern, implements some version of the following pipeline:

1. Text normalization. Convert raw text into a standard form: lowercasing, removing punctuation, handling Unicode, expanding contractions ("don't" to "do not"). Modern subword tokenizers (BPE, SentencePiece) absorb much of this, but explicit normalization still matters for retrieval and matching tasks.

2. Tokenization. Split text into discrete units. Character-level tokenization has a small vocabulary (~256 for UTF-8 bytes) but produces long sequences. Word-level tokenization has a large vocabulary (~100k+) and cannot handle unseen words. Subword tokenization (BPE) balances these tradeoffs. The tokenizer is trained separately from the model and frozen during model training.

3. Representation. Map token sequences to vectors. Classical: bag-of-words count vectors (TF-IDF). Neural: learned embedding vectors, then contextualized representations via RNNs or transformers. The key shift in modern NLP was from static word embeddings (Word2Vec, GloVe: one vector per word regardless of context) to contextual embeddings (BERT, GPT: one vector per word per sentence).

4. Task head. Map representations to outputs. Classification: linear layer + softmax. Sequence labeling (NER, POS tagging): per-token classification. Generation: autoregressive next-token prediction. The task head is often trivially simple; the representation does the heavy lifting.

Example

Classical NLP vs modern NLP on sentiment analysis

Classical pipeline (pre-2018):

  1. Tokenize: split on whitespace and punctuation
  2. Remove stopwords ("the", "is", "at")
  3. Compute TF-IDF vectors (sparse, ~50k dimensions)
  4. Train a logistic regression or SVM classifier

Modern pipeline (post-2018):

  1. Tokenize with BPE (subword level)
  2. Encode with a pretrained transformer (dense, 768 dimensions)
  3. Fine-tune a linear head on top of the [CLS] token representation

The modern pipeline typically achieves 3-5 percentage points higher accuracy on standard benchmarks (SST-2, IMDB) while requiring far less feature engineering. The pretrained transformer has already learned syntactic structure, semantic similarity, and sentiment-related patterns from the pretraining corpus.

Bag-of-Words and TF-IDF

Before neural methods, the standard text representation was a count vector.

Definition

TF-IDF

For term t in document d:

\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}

where \text{tf}(t, d) is the term frequency of t in document d, N is the total number of documents, and \text{df}(t) is the number of documents containing term t. TF-IDF upweights terms that are frequent in a document but rare across the corpus.

TF-IDF representations are sparse, high-dimensional, and ignore word order. They remain competitive for document retrieval and as baselines for classification tasks. The mathematical simplicity (linear algebra on count matrices) makes them interpretable in ways neural models are not.
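The definition translates almost directly into code. A toy sketch with raw-count term frequency and the log N/df weighting (real implementations such as scikit-learn's TfidfVectorizer add smoothing and normalization):

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF vectors for tokenized documents (raw tf, log N/df idf)."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = [["cat", "sat", "mat"], ["cat", "ran"], ["dog", "ran"]]
vecs = tfidf(docs)
```

Note the intended behavior: "sat" appears in only one of three documents, so it gets the full log(3) weight, while "cat" appears in two and is downweighted to log(3/2).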

Sequence-to-Sequence Models

Map an input sequence to an output sequence of potentially different length. The encoder reads the input into a fixed-length context vector; the decoder generates the output token by token, conditioned on this context.

The bottleneck problem: the entire input sequence must be compressed into a single vector. For long sequences, this vector loses information about early tokens. This motivated the attention mechanism.

Attention in NLP

Definition

Scaled Dot-Product Attention

Given queries Q, keys K, and values V:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

The scaling by \sqrt{d_k} prevents the dot products from growing large in magnitude, which would push the softmax into regions of tiny gradients.

Attention lets the decoder look at all encoder hidden states, not just the final one. Multi-head attention runs h parallel attention functions with different learned projections, allowing the model to attend to different types of relationships simultaneously.

The transformer architecture (Vaswani et al., 2017) replaces recurrence entirely with self-attention. Each layer attends to all positions in the previous layer's output. This makes the transformer O(T^2 d) per layer but removes the sequential bottleneck of RNNs.
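The attention formula is only a few lines of NumPy. A single-head sketch without masking, with made-up shapes (4 queries, 6 keys, dimension 8):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (single head, no masking)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (T_q, T_k) similarity scores
    # Row-wise softmax: each query's weights over all keys sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted average of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out = attention(Q, K, V)  # shape (4, 8): one output per query
```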

Why Transformers Replaced RNNs

RNNs process tokens sequentially: position t depends on position t-1. This creates two problems. First, training cannot be parallelized across time steps. Second, gradients must flow through T steps, leading to vanishing or exploding gradients for long sequences.

Transformers compute all positions in parallel. Attention connects every pair of positions directly, so information flows in O(1) layers rather than O(T) steps. The quadratic cost in sequence length is the tradeoff, addressed by sparse attention and linear attention variants for very long sequences.

Pre-training and Fine-tuning

BERT (2018): pre-train a bidirectional transformer by masking random tokens and predicting them. Fine-tune on downstream tasks by adding a task-specific head.

GPT (2018-present): pre-train an autoregressive transformer on next-token prediction. Adapt to tasks via fine-tuning, few-shot prompting, or instruction tuning.

The insight: pretraining on large corpora learns general linguistic and world knowledge. Fine-tuning (or prompting) steers this knowledge toward specific tasks. This is more sample-efficient than training task-specific models from scratch.

Common Confusions

Watch Out

Language modeling is not just text generation

The next-token prediction objective learns representations useful for classification, retrieval, translation, and reasoning, not just for generating text. The power of language modeling is as a pretraining objective that produces general-purpose features.

Watch Out

BERT and GPT solve different problems

BERT uses bidirectional context and is designed for understanding tasks (classification, extraction). GPT uses left-to-right context and is designed for generation. BERT cannot generate text autoregressively. GPT cannot natively use right context for prediction. The architectures reflect different intended use cases.

Watch Out

Tokenization affects model behavior

A token boundary is a hard information boundary. The model cannot look inside a token. If "New York" is two tokens, the model must learn their association. If a rare word is split into subwords, the model must compose meaning from pieces. Tokenization decisions propagate into model behavior in non-obvious ways.

Key Takeaways

  • BPE tokenization balances vocabulary size against sequence length
  • Language modeling (next-token prediction) is the dominant pretraining objective
  • Cross-entropy loss trains the model; perplexity evaluates it
  • TF-IDF remains a strong baseline for document-level tasks
  • Attention solved the bottleneck of fixed-length context vectors
  • Transformers replaced RNNs by enabling parallel training and direct long-range connections
  • Pre-train then fine-tune (or prompt) is more sample-efficient than task-specific training

Exercises

ExerciseCore

Problem

A language model with vocabulary size 50,000 achieves perplexity 25 on a test set. What is the per-token cross-entropy loss in nats? In bits?

ExerciseAdvanced

Problem

Explain why the scaling factor \sqrt{d_k} in scaled dot-product attention is necessary. What happens to the softmax output without it when d_k is large?

References

Canonical:

  • Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)
  • Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers" (NAACL 2019)

Current:

  • Jurafsky and Martin, Speech and Language Processing (3rd edition draft), Chapters 3, 6, 10

  • Radford et al., "Language Models are Unsupervised Multitask Learners" (GPT-2, 2019)

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapters 1-14

Last reviewed: April 2026
