LLM Construction
BERT and the Pretrain-Finetune Paradigm
BERT introduced bidirectional pretraining with masked language modeling. The pretrain-finetune paradigm it established (train once on a large corpus, then adapt to many tasks) became the default approach for NLP and beyond.
Why This Matters
Before BERT (2018), NLP practitioners trained separate models for each task: one for sentiment analysis, one for named entity recognition, one for question answering. Each required task-specific architectures and labeled data. BERT showed that a single pretrained model could be fine-tuned for dozens of tasks with small amounts of labeled data, often surpassing task-specific models. This pretrain-finetune paradigm is now standard across NLP, vision, and multimodal systems.
Mental Model
Think of pretraining as building a general-purpose understanding of language. The model reads billions of words and learns syntax, semantics, factual associations, and reasoning patterns. Fine-tuning is adaptation: take this general-purpose representation and slightly adjust it for a specific task using a small labeled dataset. The key insight is that language understanding is mostly shared across tasks. A model that understands English well can be adapted to detect sentiment, answer questions, or classify documents with minimal additional training.
BERT Architecture
BERT uses the encoder portion of the transformer architecture. The key difference from GPT: BERT uses bidirectional self-attention. Each token attends to all other tokens in the sequence, both left and right. GPT uses causal (left-to-right only) attention.
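The attention difference can be sketched in a few lines of NumPy (a toy single-head illustration, not BERT's actual implementation):

```python
import numpy as np

def attention_weights(q, k, causal=False):
    """Scaled dot-product attention weights over one sequence.

    causal=False: every token attends to all tokens (BERT-style, bidirectional).
    causal=True:  token i attends only to tokens <= i (GPT-style).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (seq, seq)
    if causal:
        seq = scores.shape[0]
        future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)     # block attention to the future
    # numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                            # 5 tokens, dimension 8
bidir = attention_weights(x, x, causal=False)
causal = attention_weights(x, x, causal=True)
print(np.allclose(np.triu(causal, k=1), 0.0))          # True: no weight on future tokens
print((bidir > 0).all())                               # True: every token sees every token
```

The only difference between the two regimes is the mask applied before the softmax; the rest of the computation is identical.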
BERT-Base: 12 layers, 768 hidden dimension, 12 attention heads, 110M parameters. BERT-Large: 24 layers, 1024 hidden dimension, 16 attention heads, 340M parameters.
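The parameter count can be sanity-checked from the configuration. This sketch assumes the standard BERT-Base details (30,522-token WordPiece vocabulary, 512 positions, 2 segment types, feedforward width 4x hidden) and ignores the small pretraining heads:

```python
# Rough parameter count for BERT-Base from the numbers above.
L, H = 12, 768                       # layers, hidden dimension
vocab, max_pos, n_types = 30_522, 512, 2

# word + position + segment-type embeddings, plus the embedding LayerNorm
embeddings = vocab * H + max_pos * H + n_types * H + 2 * H

per_layer = (
    4 * (H * H + H)                  # Q, K, V, and output projections (+ biases)
    + (H * 4 * H + 4 * H)            # FFN up-projection to 4H
    + (4 * H * H + H)                # FFN down-projection back to H
    + 2 * 2 * H                      # two LayerNorms (scale + shift each)
)

pooler = H * H + H                   # [CLS] pooler used for classification
total = embeddings + L * per_layer + pooler
print(f"{total / 1e6:.1f}M parameters")   # ~109.5M, i.e. the quoted ~110M
```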
Pretraining Objectives
BERT is pretrained with two objectives simultaneously.
Masked Language Modeling (MLM)
Randomly select 15% of input tokens. Of these, replace 80% with a [MASK] token, replace 10% with a random token, and leave 10% unchanged. The model predicts the original token at each selected position using bidirectional context.
The 80/10/10 split prevents the model from learning that [MASK] is the only signal for prediction, since [MASK] never appears at inference time.
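A minimal sketch of this corruption procedure, using a toy word-level vocabulary (real BERT operates on WordPiece token ids):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mlm_mask(tokens, mask_prob=0.15, seed=1):
    """BERT-style MLM corruption: select ~15% of positions; of those,
    80% -> [MASK], 10% -> a random token, 10% -> left unchanged.
    Returns the corrupted sequence and a dict of position -> original token
    (the positions the MLM loss is computed on)."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK              # 80%: mask
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)  # 10%: random token
            # else 10%: leave unchanged, but still predict it
    return corrupted, targets

corrupted, targets = mlm_mask(["the", "cat", "sat", "on", "the", "mat"])
```

Only the positions in `targets` contribute to the loss; unselected positions are copied through untouched.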
Next Sentence Prediction (NSP)
Given two segments A and B, predict whether B follows A in the original text (positive) or is a random segment (negative). A binary classification head on the [CLS] token produces the prediction.
NSP was intended to help with tasks requiring cross-sentence reasoning (e.g., question answering, natural language inference). Later work (RoBERTa, 2019) showed NSP provides minimal benefit and can be removed.
The Pretrain-Finetune Paradigm
Stage 1: Pretraining. Train the model on a large unlabeled corpus (BooksCorpus + English Wikipedia, approximately 3.3 billion words) using MLM and NSP. This is computationally expensive: BERT-Large required 4 days on 64 TPUs.
Stage 2: Fine-tuning. Add a task-specific output layer on top of the pretrained model. Train the entire model (pretrained weights + new layer) on task-specific labeled data for a few epochs with a small learning rate. Fine-tuning is cheap: a few hours on a single GPU.
This works because the pretrained representations transfer. The lower layers capture general linguistic features (syntax, word similarity). The higher layers capture more abstract features (semantic roles, reasoning). Fine-tuning adjusts these representations for the specific task.
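The two-stage mechanics can be sketched in PyTorch. A tiny randomly initialized encoder stands in for pretrained BERT-Base here (in practice one would load real pretrained weights, e.g. via the Hugging Face `transformers` library); the point is the shape of fine-tuning, not the model:

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained encoder."""
    def __init__(self, vocab=100, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, ids):
        h = self.encoder(self.embed(ids))   # (batch, seq, hidden)
        return h[:, 0]                      # [CLS]-style: first token's final state

encoder = TinyEncoder()                     # pretend these weights are pretrained
head = nn.Linear(32, 2)                     # new task-specific layer (2 classes)

# Fine-tuning updates ALL parameters (encoder + head) with a small learning rate.
params = list(encoder.parameters()) + list(head.parameters())
opt = torch.optim.AdamW(params, lr=2e-5)

ids = torch.randint(0, 100, (4, 10))        # toy batch: 4 sequences of 10 tokens
labels = torch.tensor([0, 1, 0, 1])
loss = nn.functional.cross_entropy(head(encoder(ids)), labels)
loss.backward()                             # gradients flow into every layer
opt.step()
```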
Main Theorems
Transfer Learning via Pretrained Representations
Statement
Let φ be a pretrained representation and f be a model trained from scratch on task-specific examples. If the pretrained representation captures features relevant to the target task, then the sample complexity of fine-tuning is determined by the complexity of the task-specific head (often a single linear layer), not by the full model complexity. Concretely, fine-tuning requires O(d/ε²) samples for ε-excess risk on the target task, where d is the dimension of the task-specific head.
Intuition
Pretraining "freezes in" a good feature extractor. Fine-tuning only needs to learn how to use these features for the new task. Learning a linear classifier on good features requires far fewer examples than learning both the features and the classifier from scratch.
Proof Sketch
Standard generalization bounds depend on the complexity of the function class being optimized. If the pretrained features are fixed (or nearly fixed with small learning rate), the effective hypothesis class during fine-tuning is the class of linear functions on top of the features, which has complexity O(d) rather than O(D), where D is the number of parameters in the full model.
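A toy illustration of the argument, with a fixed random feature map standing in for the frozen pretrained encoder (an assumption of this sketch, not a claim about BERT's features):

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained" feature extractor: fixed during fine-tuning.
D_in, d_feat, n = 50, 16, 200
W_pre = rng.normal(size=(D_in, d_feat))
phi = lambda X: np.tanh(X @ W_pre)

# Synthetic task that is linear in the pretrained features.
X = rng.normal(size=(n, D_in))
w_true = rng.normal(size=d_feat)
y = phi(X) @ w_true + 0.01 * rng.normal(size=n)

# Fine-tuning the head alone = least squares in d_feat dimensions:
# only d_feat = 16 parameters are estimated, regardless of how big
# the feature extractor itself is.
w_hat, *_ = np.linalg.lstsq(phi(X), y, rcond=None)
print(w_hat.shape)   # (16,): head-sized, not full-model-sized
```

With 200 examples and only 16 head parameters, the estimate recovers the true head weights closely; learning the 50x16 feature map as well from the same data would be far harder.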
Why It Matters
This explains BERT's practical impact: tasks with only a few thousand labeled examples can benefit from knowledge acquired during pretraining on billions of tokens. The pretrain-finetune paradigm converts the problem of limited labeled data into the problem of sufficient unlabeled data, which is far easier to obtain.
Failure Mode
Transfer fails when the pretraining distribution and target task distribution are too different. A model pretrained on English Wikipedia may not transfer well to medical radiology reports or legal contracts without domain-adaptive pretraining. The "shared structure" assumption is necessary and can be violated.
Encoder-Only vs. Decoder-Only
BERT (encoder-only) and GPT (decoder-only) represent two design choices for transformers.
| Property | BERT (Encoder-Only) | GPT (Decoder-Only) |
|---|---|---|
| Attention | Bidirectional | Causal (left-to-right) |
| Pretraining | MLM (predict masked tokens) | Next-token prediction |
| Strength | Understanding tasks | Generation tasks |
| Fine-tuning | Add classification head | Prompt-based or fine-tune |
| Context usage | Full context for each token | Only preceding context |
Why decoder-only won for generation. Autoregressive models define a proper probability distribution over sequences and generate text naturally by sampling one token at a time. BERT's bidirectional attention makes it poorly suited for generation because it does not define p(x_1, ..., x_T) cleanly.
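The "proper probability distribution" point can be stated precisely: an autoregressive model factorizes the joint distribution over a sequence by the chain rule,

```latex
p(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}),
```

and each factor is exactly the model's next-token output, so sampling left to right draws from this joint distribution. BERT instead models conditionals of the form p(x_t | x_{\neq t}) at masked positions, and these conditionals need not be mutually consistent with any single joint distribution.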
Why encoder-only was initially better for understanding. Bidirectional context gives BERT more information for each prediction. For classifying a sentence or extracting an answer span, seeing the whole input is strictly more informative than seeing only the left context.
The convergence: with sufficient scale, decoder-only models (GPT-3 and beyond) match or exceed BERT-style models on understanding tasks too, by using in-context learning or instruction tuning. The decoder-only architecture became dominant because it handles both generation and understanding.
BERT's Impact and Legacy
BERT established several patterns that persist today:
- Pretrain on unlabeled data, fine-tune on labeled data. This is the standard recipe for NLP, vision (ViT), speech (wav2vec 2.0), and multimodal systems (CLIP).
- WordPiece tokenization. BERT used a 30,522-token vocabulary built with WordPiece, a subword tokenization algorithm. Subword tokenization handles rare and out-of-vocabulary words and became standard in subsequent models.
- [CLS] token for classification. A special token whose final hidden state is used as the sequence representation for classification tasks.
- The fine-tuning recipe. Learning rate 2e-5 to 5e-5, 2-4 epochs, batch size 16 or 32, with warm-up. This recipe was remarkably consistent across tasks.
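The recipe's learning-rate schedule (linear warm-up followed by linear decay) can be sketched as follows. The 10% warm-up fraction is a common fine-tuning choice, an assumption of this sketch rather than a value fixed by the paper:

```python
import torch

model = torch.nn.Linear(768, 2)                     # stand-in for BERT + head
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)  # peak LR from the recipe

total_steps, warmup_frac = 1000, 0.1
warmup_steps = int(total_steps * warmup_frac)

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)          # linear warm-up to the peak LR
    # linear decay from the peak back down to zero
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
```

Each training step calls `opt.step()` then `sched.step()`; the learning rate rises to 2e-5 over the first 100 steps and decays to zero by step 1000.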
Key Successors
RoBERTa (2019): removed NSP, trained longer with more data and larger batches. Showed BERT was significantly undertrained.
ALBERT (2019): parameter sharing across layers and a factorized embedding parameterization to reduce model size.
ELECTRA (2020): replaced MLM with a "replaced token detection" objective. More sample-efficient because every token position provides a training signal, not just the 15% that are masked.
Common Confusions
BERT does not generate text
BERT is not a generative model. It can fill in blanks ([MASK] prediction) but cannot generate coherent sequences because its masked language modeling objective does not define a consistent joint distribution over sequences. For text generation, use autoregressive models like GPT.
Fine-tuning updates ALL weights, not just the new layer
A common misconception is that fine-tuning freezes the pretrained weights and only trains the task-specific head. Standard BERT fine-tuning updates all parameters with a small learning rate. The pretrained weights shift slightly to better serve the target task. Freezing all pretrained weights (feature extraction) works for some tasks but typically underperforms full fine-tuning.
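The distinction in code, with a generic module standing in for BERT:

```python
import torch.nn as nn

encoder = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 32))  # "pretrained"
head = nn.Linear(32, 2)                                            # new task head

# Feature extraction: freeze every pretrained weight, train only the head.
for p in encoder.parameters():
    p.requires_grad = False

# Standard BERT fine-tuning: everything trainable; all parameters receive
# gradients and are updated (with a small learning rate, e.g. 2e-5).
for p in encoder.parameters():
    p.requires_grad = True

trainable = sum(p.numel()
                for p in list(encoder.parameters()) + list(head.parameters())
                if p.requires_grad)
print(trainable)
```

In the feature-extraction regime only the head's parameters would count as trainable; after unfreezing, every parameter does.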
NSP was not the key innovation
RoBERTa showed that removing NSP and training longer with more data gives better results. The true contributions of BERT were (1) bidirectional pretraining with MLM and (2) demonstrating that a single pretrained model transfers to many tasks. NSP was a design choice that turned out to be unnecessary.
Canonical Examples
Fine-tuning for sentiment classification
Take BERT-Base (110M parameters). Add a linear layer mapping the [CLS] token's 768-dimensional representation to 2 classes (positive/negative). Fine-tune on the roughly 67,000 labeled movie-review sentences of SST-2 for 3 epochs with learning rate 2e-5. This achieves 93.5% accuracy, compared to the previous state of the art of 90.7% using task-specific architectures. The entire fine-tuning takes about 1 hour on a single GPU.
Exercises
Problem
In BERT's MLM objective, 15% of tokens are selected for prediction. Of these, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged. If a sentence has 100 tokens, how many tokens contribute to the MLM loss? Why use the 80/10/10 split instead of masking 100% of selected tokens?
Problem
BERT's MLM trains on only 15% of tokens per sequence, while GPT's autoregressive objective trains on 100% of tokens. Assuming equal computational budgets (same number of sequences processed), how many more tokens of training signal does GPT get? What are the implications for sample efficiency?
References
Canonical:
- Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (2019), NAACL
- Vaswani et al., "Attention Is All You Need" (2017), NeurIPS
Current:
- Liu et al., "RoBERTa: A Robustly Optimized BERT Pretraining Approach" (2019)
- Clark et al., "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators" (2020), ICLR
- Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapters 7-12
- Goodfellow, Bengio, Courville, Deep Learning (2016), Chapters 10-12
Next Topics
- GPT series evolution: the decoder-only path that ultimately dominated
- Post-training overview: what happens after pretraining (RLHF, instruction tuning)
Last reviewed: April 2026
Prerequisites
Foundations this topic depends on.
- Transformer Architecture (Layer 4)
- Attention Mechanism Theory (Layer 4)
- Matrix Operations and Properties (Layer 0A)
- Sets, Functions, and Relations (Layer 0A)
- Basic Logic and Proof Techniques (Layer 0A)
- Softmax and Numerical Stability (Layer 1)
- Feedforward Networks and Backpropagation (Layer 2)
- Differentiation in R^n (Layer 0A)
- Matrix Calculus (Layer 1)
- The Jacobian Matrix (Layer 0A)
- The Hessian Matrix (Layer 0A)
- Activation Functions (Layer 1)
- Convex Optimization Basics (Layer 1)
- Token Prediction and Language Modeling (Layer 3)
- Information Theory Foundations (Layer 0B)
Builds on This
- GPT Series Evolution (Layer 5)