
LLM Construction

BERT and the Pretrain-Finetune Paradigm

BERT introduced bidirectional pretraining with masked language modeling. The pretrain-finetune paradigm it established (train once on a large corpus, then adapt to many tasks) became the default approach for NLP and beyond.


Why This Matters

Before BERT (2018), NLP practitioners trained separate models for each task: one for sentiment analysis, one for named entity recognition, one for question answering. Each required task-specific architectures and labeled data. BERT showed that a single pretrained model could be fine-tuned for dozens of tasks with small amounts of labeled data, often surpassing task-specific models. This pretrain-finetune paradigm is now standard across NLP, vision, and multimodal systems.

Mental Model

Think of pretraining as building a general-purpose understanding of language. The model reads billions of words and learns syntax, semantics, factual associations, and reasoning patterns. Fine-tuning is adaptation: take this general-purpose representation and slightly adjust it for a specific task using a small labeled dataset. The key insight is that language understanding is mostly shared across tasks. A model that understands English well can be adapted to detect sentiment, answer questions, or classify documents with minimal additional training.

BERT Architecture

BERT uses the encoder portion of the transformer architecture. The key difference from GPT: BERT uses bidirectional self-attention. Each token attends to all other tokens in the sequence, both left and right. GPT uses causal (left-to-right only) attention.

BERT-Base: 12 layers, 768 hidden dimension, 12 attention heads, 110M parameters. BERT-Large: 24 layers, 1024 hidden dimension, 16 attention heads, 340M parameters.
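As a rough sanity check on these numbers, the BERT-Base parameter count can be reconstructed from the architecture. This is a sketch: biases and LayerNorm parameters are omitted, which accounts for the small gap from the quoted 110M.

```python
h = 768          # hidden dimension
L = 12           # transformer layers
V = 30522        # WordPiece vocabulary size
P = 512          # maximum sequence positions

per_layer = 4 * h * h + 8 * h * h     # attention (Q, K, V, output) + FFN (h -> 4h -> h)
embeddings = (V + P + 2) * h          # token + position + segment embeddings
total = L * per_layer + embeddings    # biases and LayerNorms omitted
print(round(total / 1e6))             # ~109; extras bring it to the quoted ~110M
```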

Pretraining Objectives

BERT is pretrained with two objectives simultaneously.

Definition

Masked Language Modeling (MLM)

Randomly select 15% of input tokens. Of these, replace 80% with the [MASK] token, 10% with a random token, and leave 10% unchanged. The model predicts the original token at each selected position using bidirectional context:

\mathcal{L}_{\text{MLM}} = -\sum_{t \in \text{masked}} \log p(x_t \mid x_{\setminus \text{masked}})

The 80/10/10 split prevents the model from learning that [MASK] is the only signal for prediction, since [MASK] never appears at inference time.
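The corruption procedure above can be sketched in a few lines. This is an illustrative simulation over word strings, not BERT's actual implementation, which operates on WordPiece token ids; the toy `vocab` is an assumption for the random-replacement branch.

```python
import random

def mlm_mask(tokens, mask_token="[MASK]", vocab=None, seed=0):
    """BERT-style MLM corruption: select ~15% of positions; of those,
    80% -> [MASK], 10% -> random token, 10% left unchanged.
    Returns (corrupted tokens, list of (position, original) loss targets)."""
    rng = random.Random(seed)
    vocab = vocab or ["the", "cat", "sat", "on", "mat"]
    corrupted = list(tokens)
    targets = []
    for i, tok in enumerate(tokens):
        if rng.random() < 0.15:          # select ~15% of tokens
            targets.append((i, tok))     # loss is computed only at these positions
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token         # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: random token
            # else: 10% left unchanged
    return corrupted, targets
```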

Definition

Next Sentence Prediction (NSP)

Given two segments A and B, predict whether B follows A in the original text (positive) or is a random segment (negative). A binary classification head on the [CLS] token produces the prediction.

NSP was intended to help with tasks requiring cross-sentence reasoning (e.g., question answering, natural language inference). Later work (RoBERTa, 2019) showed NSP provides minimal benefit and can be removed.
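Constructing NSP training pairs can be sketched as follows. This toy version draws negatives from the same sentence list; the original BERT samples negatives from a different document, so treat this as a simplification.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build NSP pairs from an ordered list of sentences:
    positive = (A, actual next sentence) with label 1 (IsNext);
    negative = (A, random sentence) with label 0 (NotNext); 50/50 mix."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))       # IsNext
        else:
            # BERT samples negatives from a different document;
            # here we simplify and sample from the same list.
            pairs.append((sentences[i], rng.choice(sentences), 0))  # NotNext
    return pairs
```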

The Pretrain-Finetune Paradigm

Stage 1: Pretraining. Train the model on a large unlabeled corpus (BooksCorpus + English Wikipedia, approximately 3.3 billion words) using MLM and NSP. This is computationally expensive: BERT-Large required 4 days on 64 TPU chips.

Stage 2: Fine-tuning. Add a task-specific output layer on top of the pretrained model. Train the entire model (pretrained weights + new layer) on task-specific labeled data for a few epochs with a small learning rate. Fine-tuning is cheap: a few hours on a single GPU.

This works because the pretrained representations transfer. The lower layers capture general linguistic features (syntax, word similarity). The higher layers capture more abstract features (semantic roles, reasoning). Fine-tuning adjusts these representations for the specific task.

Main Theorems

Proposition

Transfer Learning via Pretrained Representations

Statement

Let $\hat{h}_{\text{pretrain}}$ be a pretrained representation and $\hat{h}_{\text{scratch}}$ be a model trained from scratch on $n$ task-specific examples. If the pretrained representation captures features relevant to the target task, then the sample complexity of fine-tuning is determined by the complexity of the task-specific head (often a single linear layer), not by the full model complexity. Concretely, fine-tuning requires $O(d_{\text{head}} / \epsilon^2)$ samples for $\epsilon$-excess risk on the target task, where $d_{\text{head}} \ll d_{\text{full}}$ is the dimension of the task-specific head.

Intuition

Pretraining "freezes in" a good feature extractor. Fine-tuning only needs to learn how to use these features for the new task. Learning a linear classifier on good features requires far fewer examples than learning both the features and the classifier from scratch.

Proof Sketch

Standard generalization bounds depend on the complexity of the function class being optimized. If the pretrained features are fixed (or nearly fixed with small learning rate), the effective hypothesis class during fine-tuning is the class of linear functions on top of the features, which has complexity $O(d_{\text{head}})$ rather than $O(d_{\text{full}})$.
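The gap between $d_{\text{head}}$ and $d_{\text{full}}$ is concrete for BERT-Base with a binary classification head, as a quick back-of-envelope check:

```python
# Parameters the fine-tuning head must learn "from scratch":
hidden = 768          # BERT-Base hidden dimension
num_classes = 2       # e.g. positive / negative sentiment
d_head = hidden * num_classes + num_classes   # linear weights + biases
d_full = 110_000_000  # approximate BERT-Base parameter count

print(d_head)             # 1538 parameters in the head
print(d_full // d_head)   # the head is tens of thousands of times smaller
```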

Why It Matters

This explains BERT's practical impact: tasks with only a few thousand labeled examples can benefit from knowledge acquired during pretraining on billions of tokens. The pretrain-finetune paradigm converts the problem of limited labeled data into the problem of sufficient unlabeled data, which is far easier to obtain.

Failure Mode

Transfer fails when the pretraining distribution and target task distribution are too different. A model pretrained on English Wikipedia may not transfer well to medical radiology reports or legal contracts without domain-adaptive pretraining. The "shared structure" assumption is necessary and can be violated.

Encoder-Only vs. Decoder-Only

BERT (encoder-only) and GPT (decoder-only) represent two design choices for transformers.

| Property | BERT (Encoder-Only) | GPT (Decoder-Only) |
| --- | --- | --- |
| Attention | Bidirectional | Causal (left-to-right) |
| Pretraining | MLM (predict masked tokens) | Next-token prediction |
| Strength | Understanding tasks | Generation tasks |
| Fine-tuning | Add classification head | Prompt-based or fine-tune |
| Context usage | Full context for each token | Only preceding context |

Why decoder-only won for generation. Autoregressive models define a proper probability distribution over sequences and generate text naturally by sampling one token at a time. BERT's bidirectional attention makes it poorly suited for generation because it does not define $p(x_t \mid x_{<t})$ cleanly.

Why encoder-only was initially better for understanding. Bidirectional context gives BERT more information for each prediction. For classifying a sentence or extracting an answer span, seeing the whole input is strictly more informative than seeing only the left context.

The convergence: with sufficient scale, decoder-only models (GPT-3 and beyond) match or exceed BERT-style models on understanding tasks too, by using in-context learning or instruction tuning. The decoder-only architecture became dominant because it handles both generation and understanding.
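The structural difference between the two attention patterns is just the shape of the attention mask, which can be sketched directly (boolean matrices here; real implementations use additive float masks over batched tensors):

```python
def attention_mask(seq_len, causal):
    """Boolean mask: mask[i][j] is True if position i may attend to
    position j. Bidirectional (BERT): all True. Causal (GPT): only
    j <= i, so each token sees itself and the tokens to its left."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

bert_mask = attention_mask(4, causal=False)  # every token sees all 4 positions
gpt_mask = attention_mask(4, causal=True)    # lower-triangular pattern
```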

BERT's Impact and Legacy

BERT established several patterns that persist today:

  1. Pretrain on unlabeled data, fine-tune on labeled data. This is the standard recipe for NLP, vision (ViT), speech (wav2vec 2.0), and multimodal systems (CLIP).

  2. WordPiece tokenization. BERT used a 30,522-token vocabulary built with WordPiece, a subword tokenization algorithm. Subword tokenization handles rare and out-of-vocabulary words and became standard in all subsequent models.

  3. [CLS] token for classification. A special token whose final hidden state is used as the sequence representation for classification tasks.

  4. The fine-tuning recipe. Learning rate $2 \times 10^{-5}$ to $5 \times 10^{-5}$, 2-4 epochs, batch size 16-32, with warm-up. This recipe was remarkably consistent across tasks.
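The warm-up in the recipe above is typically a linear ramp followed by linear decay, as in BERT's training. A minimal sketch; `peak_lr` and `warmup_frac` defaults here are illustrative choices, not values fixed by the paper:

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_frac=0.1):
    """Linear warm-up to peak_lr over the first warmup_frac of training,
    then linear decay to zero at total_steps."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # decay linearly from peak_lr (end of warm-up) to 0 (end of training)
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```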

Key Successors

RoBERTa (2019): removed NSP, trained longer with more data and larger batches. Showed BERT was significantly undertrained.

ALBERT (2019): parameter sharing across layers and factored embedding to reduce model size.

ELECTRA (2020): replaced MLM with a "replaced token detection" objective. More sample-efficient because every token position provides a training signal, not just the 15% that are masked.

Common Confusions

Watch Out

BERT does not generate text

BERT is not a generative model. It can fill in blanks ([MASK] prediction) but cannot generate coherent sequences because its masked language modeling objective does not define a consistent joint distribution over sequences. For text generation, use autoregressive models like GPT.

Watch Out

Fine-tuning updates ALL weights, not just the new layer

A common misconception is that fine-tuning freezes the pretrained weights and only trains the task-specific head. Standard BERT fine-tuning updates all parameters with a small learning rate. The pretrained weights shift slightly to better serve the target task. Freezing all pretrained weights (feature extraction) works for some tasks but typically underperforms full fine-tuning.

Watch Out

NSP was not the key innovation

RoBERTa showed that removing NSP and training longer with more data gives better results. The true contributions of BERT were (1) bidirectional pretraining with MLM and (2) demonstrating that a single pretrained model transfers to many tasks. NSP was a design choice that turned out to be unnecessary.

Canonical Examples

Example

Fine-tuning for sentiment classification

Take BERT-Base (110M parameters). Add a linear layer mapping the [CLS] token's 768-dimensional representation to 2 classes (positive/negative). Fine-tune on SST-2's labeled movie-review sentences (about 67,000 training examples) for 3 epochs with learning rate $2 \times 10^{-5}$. Achieves 93.5% accuracy, compared to previous state of the art of 90.7% using task-specific architectures. The entire fine-tuning takes about 1 hour on a single GPU.
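The task-specific head in this example is just a linear map plus softmax over the [CLS] vector. A dependency-free sketch of that forward pass (a real implementation would use a framework such as PyTorch; the tiny dimensions in the usage test stand in for 768):

```python
import math

def cls_head(cls_vector, W, b):
    """Linear layer + softmax over the [CLS] representation.
    cls_vector: hidden-size list (768 for BERT-Base);
    W: one weight row per class; b: one bias per class."""
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, cls_vector)) + bias
              for row, bias in zip(W, b)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]          # class probabilities
```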

Exercises

ExerciseCore

Problem

In BERT's MLM objective, 15% of tokens are selected for prediction. Of these, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged. If a sentence has 100 tokens, how many tokens contribute to the MLM loss? Why use the 80/10/10 split instead of masking 100% of selected tokens?

ExerciseAdvanced

Problem

BERT's MLM trains on only 15% of tokens per sequence, while GPT's autoregressive objective trains on 100% of tokens. Assuming equal computational budgets (same number of sequences processed), how many more tokens of training signal does GPT get? What are the implications for sample efficiency?


References

Canonical:

  • Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (2019), NAACL
  • Vaswani et al., "Attention Is All You Need" (2017), NeurIPS

Current:

  • Liu et al., "RoBERTa: A Robustly Optimized BERT Pretraining Approach" (2019)

  • Clark et al., "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators" (2020), ICLR

  • Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12

  • Goodfellow, Bengio, Courville, Deep Learning (2016), Chapters 10-12


Last reviewed: April 2026
