Paper breakdown

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova · 2018 · NAACL 2019

Combines a transformer encoder with a masked-language-model objective to learn deep bidirectional context, then fine-tunes on downstream NLP tasks. Establishes pre-train-then-fine-tune as the dominant NLP paradigm and provides the technical scaffolding for much of what followed.

Overview

Devlin et al. (2018) showed that a transformer encoder pretrained on a token-level fill-in-the-blank task could, after light task-specific fine-tuning, beat the previous state of the art on eleven standard NLP benchmarks. The paper makes two design decisions that had not been combined before. The model is a bidirectional transformer encoder, so each token's representation conditions on the entire input rather than a left-to-right prefix. The pretraining objective is a masked-language-model (MLM) loss: a subset of input tokens is replaced with a [MASK] symbol, and the network is trained to predict the original tokens.

The bidirectionality is the technical hinge. Left-to-right language models (the original GPT) and shallow concatenations of separately trained left-to-right and right-to-left models (ELMo) both condition each token's representation on one direction at a time. BERT's encoder lets every token attend to every other token without leakage, because the masked tokens are not present in the input the model conditions on. The paper combines this with a sentence-pair classification head used in a "next-sentence prediction" auxiliary task.

Mathematical Contributions

The masked-language-model objective

Given an input sequence $x_1, \ldots, x_n$, BERT samples a random subset $M \subset \{1, \ldots, n\}$ of size roughly $0.15n$. For each $i \in M$, the input token $x_i$ is replaced as follows: with probability 0.8, by [MASK]; with probability 0.1, by a random token; with probability 0.1, left unchanged. The encoder produces hidden states $h_1, \ldots, h_n$, and the loss is the cross-entropy of predicting $x_i$ from $h_i$ for $i \in M$:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log p(x_i \mid h_i)$$

The 80/10/10 split addresses a mismatch with fine-tuning: no [MASK] token appears at fine-tuning time, so always masking would teach the encoder to lean on a feature that is absent at downstream evaluation. Replacing some selected tokens with random tokens and leaving others unchanged injects noise and forces every position to maintain a useful representation.
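As a concrete illustration, here is a minimal sketch of that corruption step in PyTorch. The function name, the -100 ignore-index convention, and the omission of special-token handling are assumptions for illustration, not the released implementation.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style MLM corruption: select ~15% of positions; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged.
    Returns (corrupted_ids, labels); labels are -100 where no loss is taken.
    A full implementation would also exclude [CLS]/[SEP]/padding positions."""
    labels = input_ids.clone()
    corrupted = input_ids.clone()

    # Choose the prediction set M (~15% of positions).
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100  # unselected positions do not contribute to the loss

    # 80% of selected positions -> [MASK]
    masked = selected & (torch.rand(input_ids.shape) < 0.8)
    corrupted[masked] = mask_token_id

    # 10% of selected positions -> a random vocabulary token
    # (half of the remaining 20%); the final 10% are left unchanged.
    random_tok = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)
    corrupted[random_tok] = torch.randint(vocab_size, input_ids.shape)[random_tok]

    return corrupted, labels

# The MLM loss is then an ordinary cross-entropy that skips ignored positions:
# loss = torch.nn.functional.cross_entropy(
#     logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
```

The `ignore_index=-100` convention reproduces the sum over $i \in M$ in the loss above: only corrupted positions are scored.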

Next-sentence prediction

Each pretraining example is a pair of segments $A$ and $B$ separated by [SEP], prepended with a [CLS] token. With probability 0.5, $B$ is the actual next segment in the document; otherwise $B$ is sampled from another document. A binary classifier on the [CLS] representation predicts which case applies:

$$\mathcal{L}_{\text{NSP}} = -\log p(\text{IsNext} \mid h_{\text{[CLS]}})$$

The total pretraining loss is $\mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}$. Subsequent work (RoBERTa, ALBERT) showed that NSP is largely unnecessary if MLM is run for long enough; the paper's contribution is in making the protocol work, not in NSP being optimal.
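A minimal sketch of how one such pair could be assembled, assuming documents are already split into tokenized segments; the function name and data layout are illustrative, and truncation to the 512-token limit is omitted.

```python
import random

def make_nsp_example(doc_segments, all_docs):
    """Build one segment pair for next-sentence prediction.
    doc_segments: list of token lists for consecutive segments of one document.
    all_docs: list of such documents, used to draw negative segments."""
    i = random.randrange(len(doc_segments) - 1)
    segment_a = doc_segments[i]
    if random.random() < 0.5:
        segment_b, is_next = doc_segments[i + 1], 1      # true continuation
    else:
        other_doc = random.choice(all_docs)              # negative case
        segment_b, is_next = random.choice(other_doc), 0
        # (a full implementation would ensure other_doc differs from the source document)
    tokens = ["[CLS]"] + segment_a + ["[SEP]"] + segment_b + ["[SEP]"]
    return tokens, is_next
```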

Why bidirectionality is non-trivial

A standard left-to-right language model trains $p(x_i \mid x_{<i})$. Making it bidirectional by simply attending to all positions would let each token's prediction trivially copy itself: position $i$ can attend to itself and decode $x_i$ from $h_i$. The MLM workaround replaces the token at position $i$ with [MASK] so the model cannot see it in the input. This is what unlocks unrestricted attention.
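To make the distinction concrete, the sketch below contrasts the causal attention mask of a left-to-right LM with the unrestricted mask BERT can afford once the input itself is corrupted; the tensor layout is an assumption for illustration.

```python
import torch

seq_len = 5

# Left-to-right LM: position i may attend only to positions j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Naive bidirectional LM: every position attends everywhere, including itself,
# so predicting x_i from h_i degenerates into copying the input token.
full_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# BERT keeps the full mask but corrupts the input instead: the token at a
# selected position is replaced by [MASK], so h_i cannot simply read x_i back.
```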

Architecture and scale

The paper releases two configurations:

Model       Layers  Hidden  Heads  Params
BERT-Base   12      768     12     110M
BERT-Large  24      1024    16     340M

Pretraining uses the BookCorpus (800M words) and English Wikipedia (2.5B words), about 3.3B words in total. Training runs roughly 1 million steps at batch size 256; BERT-Large took four days on 64 TPU chips. The release of pretrained checkpoints under a permissive licence is, as much as the technical content, why the paper propagated so quickly.
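As a sanity check on the table above, a back-of-the-envelope parameter count for BERT-Base, assuming the released ~30K WordPiece vocabulary and 512 learned position embeddings; embedding LayerNorm and the pooler head are ignored, so the figure is approximate.

```python
# Approximate parameter count for BERT-Base (12 layers, hidden 768, 12 heads).
vocab, positions, segments = 30522, 512, 2
hidden, ffn, layers = 768, 4 * 768, 12

embeddings = (vocab + positions + segments) * hidden         # token + position + segment
attention  = 4 * (hidden * hidden + hidden)                  # Q, K, V, output projections
ffn_block  = (hidden * ffn + ffn) + (ffn * hidden + hidden)  # two dense layers
layer_norm = 4 * hidden                                      # two LayerNorms (gain + bias)
per_layer  = attention + ffn_block + layer_norm

total = embeddings + layers * per_layer
print(f"~{total / 1e6:.0f}M parameters")                     # ~109M, i.e. the 110M in the table
```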

Fine-tuning protocol

Downstream tasks fine-tune all parameters. For sentence classification (GLUE, MNLI), a linear head reads the [CLS] representation. For span tasks (SQuAD), two pointer heads predict start and end positions. For sequence labelling (NER), a per-token classifier reads each position. Fine-tuning runs 2-4 epochs at small learning rates (typically 2e-5 to 5e-5).
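A minimal PyTorch sketch of the two most common heads, assuming `hidden_states` comes from a pretrained encoder with [CLS] at position 0; module and variable names are illustrative. During fine-tuning, the encoder and the head are updated jointly at the small learning rates quoted above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationHead(nn.Module):
    """Sentence-level tasks (GLUE, MNLI): linear layer over the [CLS] state."""
    def __init__(self, hidden_size, num_labels, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states, labels=None):
        # hidden_states: (batch, seq_len, hidden); position 0 is [CLS]
        logits = self.classifier(self.dropout(hidden_states[:, 0]))
        if labels is None:
            return logits
        return logits, F.cross_entropy(logits, labels)

class SpanHead(nn.Module):
    """Span tasks (SQuAD): per-token start and end logits."""
    def __init__(self, hidden_size):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)

    def forward(self, hidden_states):
        start_logits, end_logits = self.qa_outputs(hidden_states).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)
```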

Why It Matters Now

BERT itself is no longer a frontier model; it has been displaced by larger autoregressive language models for most generative tasks. But the paper still matters for three reasons.

First, the pre-train-then-fine-tune protocol is now the standard ML workflow. It is so dominant that the alternative — training a fresh model on each downstream task — is treated as a baseline rather than the default. Almost every modern model release follows the BERT recipe, even when the model itself is autoregressive: pretrain on a large unlabelled corpus, fine-tune on a smaller labelled or instruction-following dataset.

Second, BERT-style encoders remain the right choice for retrieval, reranking, classification, and any setting where a dense representation of a fixed input is more useful than a generative posterior. The two-tower retrieval models that power search at scale, including many production embedding APIs, descend directly from BERT-style sentence encoders.

Third, the masked-language-model objective generalised: span-corruption and infilling variants of it appear in later text pretraining recipes, masked-image modelling powers MAE and BEiT in vision, and protein language models such as ESM apply BERT-style masking to amino-acid sequences. The MLM idea is the longest-living technical contribution of the paper.

References

Canonical:

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL. arXiv:1810.04805.

Direct precursors:

  • Peters, M. E. et al. (2018). "Deep contextualized word representations." NAACL. arXiv:1802.05365. ELMo.
  • Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training." OpenAI report. GPT-1.
  • Howard, J., & Ruder, S. (2018). "Universal Language Model Fine-tuning for Text Classification." ACL. arXiv:1801.06146. ULMFiT.

Refinements and ablations:

  • Liu, Y. et al. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv:1907.11692.
  • Lan, Z. et al. (2020). "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations." ICLR. arXiv:1909.11942.
  • Clark, K., Luong, M.-T., Le, Q. V., & Manning, C. D. (2020). "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators." ICLR. arXiv:2003.10555.
  • He, P. et al. (2021). "DeBERTa: Decoding-enhanced BERT with Disentangled Attention." ICLR. arXiv:2006.03654.

Probing and analysis:

  • Tenney, I., Das, D., & Pavlick, E. (2019). "BERT Rediscovers the Classical NLP Pipeline." ACL. arXiv:1905.05950.
  • Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). "A Primer in BERTology: What We Know About How BERT Works." TACL. arXiv:2002.12327.

Last reviewed: May 5, 2026