
Comparison

Encoder-Only vs. Decoder-Only vs. Encoder-Decoder

Encoder-only models (BERT) use bidirectional attention for classification and extraction. Decoder-only models (GPT) use causal masking for autoregressive generation. Encoder-decoder models (T5) use cross-attention to condition generation on a fully encoded input. The architecture choice determines what tasks the model can perform natively.

What Each Architecture Does

All three are built from the same transformer building blocks (multi-head attention, feedforward layers, residual connections, layer normalization) but differ in how attention is masked and how input-output flows are structured.

Encoder-only (BERT, RoBERTa, DeBERTa): Every token attends to every other token in the input (bidirectional attention). The model produces contextual representations of the input. No generation capability without additional decoder machinery.

Decoder-only (GPT, LLaMA, PaLM): Token $t$ can only attend to tokens $1, \ldots, t$ (causal masking). The model is trained to predict the next token autoregressively. Input and output share the same sequence.

Encoder-decoder (T5, BART, mBART): An encoder processes the input with bidirectional attention. A decoder generates output autoregressively, attending both to previous output tokens (causal self-attention) and to the encoder output (cross-attention).

Attention Masking: The Core Difference

For a sequence of length $n$, define the attention mask $M \in \{0, -\infty\}^{n \times n}$ applied to the attention scores before the softmax:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

Bidirectional (encoder): $M_{ij} = 0$ for all $i, j$. Every position attends to every other position. Token representations incorporate full context.

Causal (decoder): $M_{ij} = 0$ if $j \leq i$, else $M_{ij} = -\infty$. Position $i$ only sees positions $1$ through $i$. This enables autoregressive generation: each token's probability depends only on preceding tokens.

Cross-attention (encoder-decoder): The decoder has two attention layers per block. Self-attention uses causal masking over the output. Cross-attention uses the decoder's queries with the encoder's keys and values, with no masking (the decoder sees the full encoded input at every step).
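The masking difference can be made concrete in a few lines of NumPy. This is a single-head, unbatched toy (dimensions and data are arbitrary), not a real model: the same attention function produces encoder-style or decoder-style behavior depending solely on the additive mask passed in.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask):
    """Scaled dot-product attention with an additive mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores + mask) @ V

n, d = 4, 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Bidirectional (encoder): M_ij = 0 everywhere.
bidir_mask = np.zeros((n, n))

# Causal (decoder): M_ij = -inf wherever j > i.
causal_mask = np.where(np.tril(np.ones((n, n))) == 1, 0.0, -np.inf)

out_enc = attention(Q, K, V, bidir_mask)
out_dec = attention(Q, K, V, causal_mask)

# Under causal masking, position 0 attends only to itself,
# so its output is exactly V[0]. Bidirectionally, it is a mix.
assert np.allclose(out_dec[0], V[0])
assert not np.allclose(out_enc[0], V[0])
```

The sanity check at the end captures the core asymmetry: the first token of a causal model carries no information about later tokens, while the same position in an encoder already mixes in the full sequence.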

Pretraining Objectives

| Architecture | Pretraining Objective | What It Learns |
|---|---|---|
| Encoder-only | Masked language modeling (MLM): predict randomly masked tokens | Bidirectional representations, good for understanding |
| Decoder-only | Next-token prediction: predict $P(x_t \mid x_1, \ldots, x_{t-1})$ | Autoregressive generation, in-context learning |
| Encoder-decoder | Span corruption or denoising: reconstruct corrupted spans | Sequence-to-sequence mapping, conditional generation |

BERT's MLM objective masks 15% of tokens and predicts them from surrounding context. This is not a valid generative model, because the masked positions are predicted independently rather than autoregressively.

Next-token prediction decomposes the joint probability as $P(x_1, \ldots, x_n) = \prod_{t=1}^n P(x_t \mid x_{<t})$. This is a valid generative model. Scaling this objective to billions of parameters produces in-context learning.
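The chain-rule factorization can be checked numerically with a toy model. The sketch below uses a hand-built bigram model (a hypothetical three-symbol vocabulary, not any real tokenizer): the joint probability of a sequence is just the product of the per-step conditionals, which is exactly what next-token training optimizes.

```python
import numpy as np

# Toy vocabulary and a hand-built bigram model: row = current token,
# column = next token, each row sums to 1.
vocab = ["<s>", "a", "b"]
P = np.array([
    [0.0, 0.6, 0.4],   # after <s>
    [0.0, 0.3, 0.7],   # after a
    [0.0, 0.5, 0.5],   # after b
])

def joint_prob(token_ids):
    """P(x_1..x_n) = prod_t P(x_t | x_{t-1}) under the bigram assumption."""
    prob = 1.0
    prev = 0  # start from the <s> symbol
    for t in token_ids:
        prob *= P[prev, t]
        prev = t
    return prob

# Sequence "a b": P(a|<s>) * P(b|a) = 0.6 * 0.7 = 0.42
assert abs(joint_prob([1, 2]) - 0.42) < 1e-12
```

A full decoder-only transformer replaces the bigram table with a neural network conditioned on the entire prefix, but the decomposition being optimized is the same.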

T5's span corruption randomly selects contiguous spans of text, replaces them with sentinel tokens, and trains the decoder to output the missing spans. This teaches the model to map from corrupted input to clean output.
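A minimal sketch of the corruption step, using the example sentence from the T5 paper. T5's actual sentinel tokens are named `<extra_id_0>`, `<extra_id_1>`, and so on; the span positions here are chosen by hand rather than sampled, and the helper function is illustrative, not T5's implementation.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) token span with a sentinel; return the
    corrupted input and the target the decoder must produce."""
    corrupted, target = [], []
    pos = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted.extend(tokens[pos:start])  # keep text before the span
        corrupted.append(sentinel)           # replace the span itself
        target.append(sentinel)              # target: sentinel + missing text
        target.extend(tokens[start:end])
        pos = end
    corrupted.extend(tokens[pos:])
    return corrupted, target

tokens = "thank you for inviting me to your party last week".split()
corrupted, target = corrupt_spans(tokens, [(1, 2), (6, 8)])
# corrupted: thank <extra_id_0> for inviting me to <extra_id_1> last week
# target:    <extra_id_0> you <extra_id_1> your party
```

The encoder sees the corrupted sequence with bidirectional attention; the decoder is trained to emit the target autoregressively, so the objective exercises both halves of the architecture.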

Side-by-Side Comparison

| Property | Encoder-Only | Decoder-Only | Encoder-Decoder |
|---|---|---|---|
| Attention | Bidirectional | Causal (left-to-right) | Bidirectional (enc) + causal (dec) + cross |
| Representative models | BERT, RoBERTa, DeBERTa | GPT-2/3/4, LLaMA, PaLM | T5, BART, mBART, Flan-T5 |
| Pretraining | MLM (+ NSP for BERT) | Next-token prediction | Span denoising |
| Native generation | No | Yes | Yes |
| Input-output structure | Same sequence | Input prefix + generated continuation | Separate input and output sequences |
| KV cache for inference | Not applicable | Caches all prior tokens | Encoder output cached once; decoder caches incrementally |
| Parameter count (same quality) | Smallest for understanding tasks | Largest, but most versatile | Middle, but requires two stacks (encoder and decoder) |
| In-context learning | Weak | Strong (emergent at scale) | Moderate |
| Best for | Classification, NER, retrieval, extraction | Open-ended generation, chat, reasoning | Translation, summarization, structured generation |
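The KV-cache row can be made concrete. During incremental decoding, each new token's key and value vectors are appended to a cache, so earlier tokens are never re-processed; each step attends over the cached entries. This is a single-head, unbatched sketch with toy dimensions, not a real inference engine:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

class KVCache:
    """Minimal key/value cache for autoregressive decoding."""
    def __init__(self):
        self.K = []  # one key vector per decoded position
        self.V = []  # one value vector per decoded position

    def step(self, q, k, v):
        # Append this step's key/value, then attend over everything cached.
        # Cost per step is O(current length), not O(length^2) re-encoding.
        self.K.append(k)
        self.V.append(v)
        K = np.stack(self.K)
        V = np.stack(self.V)
        weights = softmax(q @ K.T / np.sqrt(q.shape[-1]))
        return weights @ V

d = 8
rng = np.random.default_rng(1)
cache = KVCache()
for t in range(5):
    q, k, v = rng.standard_normal((3, d))
    out = cache.step(q, k, v)

assert len(cache.K) == 5      # one cached key per generated token
assert out.shape == (d,)      # one output vector per step
```

Causal masking is what makes this cache valid: past keys and values never change once computed, so they can be stored and reused. In an encoder-decoder model the same trick applies to the decoder, while the encoder output is computed once and reused at every step.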

When Each Wins

Encoder-only: classification and extraction

When the task is to assign a label or extract a span from existing text, bidirectional attention is strictly more informative than causal attention. BERT-style models dominate in token classification (NER), sentence classification (sentiment), and semantic similarity. The representation at position $i$ incorporates information from both left and right context.

Decoder-only: generation and in-context learning

The GPT scaling trajectory showed that decoder-only models exhibit emergent capabilities at scale that encoder-only models do not: few-shot learning, chain-of-thought reasoning, instruction following. The causal architecture naturally supports open-ended generation. Every major LLM since GPT-3 (PaLM, LLaMA, Claude, Gemini) is decoder-only.

Encoder-decoder: structured input-output mapping

Translation, summarization, and any task with a clear input/output separation benefits from the encoder-decoder split. The encoder can build a full bidirectional representation of the input. The decoder can attend to this representation at every generation step via cross-attention. T5 cast all NLP tasks as text-to-text and showed competitive performance across the board.

Why Decoder-Only Dominates in 2024+

Three factors drove convergence toward decoder-only:

  1. Simplicity. One architecture, one objective (next-token prediction), one training pipeline. Encoder-decoder requires managing two parameter stacks, and cross-attention adds implementation complexity.

  2. Scaling behavior. Decoder-only models show smoother scaling curves. The scaling laws of Kaplan et al. and Hoffmann et al. were derived for decoder-only architectures.

  3. Unification. With sufficient scale, decoder-only models handle classification (generate the label), extraction (generate the span), translation (generate the target), and reasoning (generate the chain). Specializing the architecture to the task becomes unnecessary.

The cost is that decoder-only models are less parameter-efficient for pure understanding tasks. A BERT-base (110M parameters) fine-tuned for sentiment analysis outperforms a GPT-2-base (124M parameters) on the same task, because bidirectional attention extracts more information per parameter for classification.

Common Confusions

Watch Out

Encoder-only does not mean no generation is possible

You can build a generative model on top of BERT-style encoders. Masked language models can iteratively predict masked tokens (as in non-autoregressive generation). BERT itself is not autoregressive, but the encoder architecture is not inherently incompatible with generation. It is simply less natural and less effective than causal generation at scale.

Watch Out

Decoder-only models still encode the input

A decoder-only model processing a prompt is encoding that prompt into internal representations via causal attention. The distinction is not that GPT lacks an encoder. The distinction is that the encoding is causal (each token only sees prior tokens), whereas a dedicated encoder uses bidirectional attention. The prompt prefix in a decoder-only model serves as a pseudo-encoder.

Watch Out

Cross-attention is not the same as concatenating encoder and decoder inputs

Cross-attention uses the decoder's queries and the encoder's keys and values. This is distinct from simply concatenating the input and output sequences and running a single decoder. Cross-attention allows the decoder to attend to the full encoder representation at every layer, while concatenation forces the model to route information through causal attention over the concatenated sequence.
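The asymmetry is visible in the shapes alone. In this sketch (single-head, unbatched, arbitrary dimensions), queries come from the decoder's $n_{tgt}$ positions while keys and values come from the encoder's $n_{src}$ positions, and no mask is applied:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_Q, enc_K, enc_V):
    """Queries from the decoder; keys/values from the encoder.
    No mask: every decoder position sees the full encoded input."""
    d_k = dec_Q.shape[-1]
    scores = dec_Q @ enc_K.T / np.sqrt(d_k)  # (n_tgt, n_src)
    return softmax(scores) @ enc_V           # (n_tgt, d)

rng = np.random.default_rng(0)
n_src, n_tgt, d = 6, 3, 8
enc_K = rng.standard_normal((n_src, d))
enc_V = rng.standard_normal((n_src, d))
dec_Q = rng.standard_normal((n_tgt, d))

out = cross_attention(dec_Q, enc_K, enc_V)
# One output vector per decoder position, each a weighted
# mix of all n_src encoder value vectors.
assert out.shape == (n_tgt, d)
```

Contrast this with concatenation: there, input and output would live in one sequence of length $n_{src} + n_{tgt}$ under a single causal mask, and information from the input could only reach the output through left-to-right attention rather than through a dedicated, unmasked pathway at every layer.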

References

  1. Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS 2017. (Original transformer with encoder-decoder architecture.)
  2. Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019. (Encoder-only, masked language modeling.)
  3. Radford, A. et al. (2019). "Language Models are Unsupervised Multitask Learners." (GPT-2, decoder-only autoregressive pretraining.)
  4. Raffel, C. et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." JMLR, 21(140), 1-67. (T5, encoder-decoder, text-to-text framework.)
  5. Brown, T. B. et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020. (GPT-3, establishing decoder-only in-context learning.)
  6. Wang, A. et al. (2019). "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems." NeurIPS 2019. (Benchmark comparing encoder-only models on understanding tasks.)
  7. Tay, Y. et al. (2023). "UL2: Unifying Language Learning Paradigms." ICLR 2023. (Systematic comparison of pretraining objectives across architectures.)