What Each Architecture Does
All three are built from the same transformer building blocks (multi-head attention, feedforward layers, residual connections, layer normalization) but differ in how attention is masked and how input-output flows are structured.
Encoder-only (BERT, RoBERTa, DeBERTa): Every token attends to every other token in the input (bidirectional attention). The model produces contextual representations of the input. No generation capability without additional decoder machinery.
Decoder-only (GPT, LLaMA, PaLM): Token $i$ can only attend to tokens $j \le i$ (causal masking). The model is trained to predict the next token autoregressively. Input and output share the same sequence.
Encoder-decoder (T5, BART, mBART): An encoder processes the input with bidirectional attention. A decoder generates output autoregressively, attending both to previous output tokens (causal self-attention) and to the encoder output (cross-attention).
Attention Masking: The Core Difference
For a sequence of length $n$, define the attention mask $M \in \mathbb{R}^{n \times n}$ added to the attention logits before the softmax: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$.
Bidirectional (encoder): $M_{ij} = 0$ for all $i, j$. Every position attends to every other position. Token representations incorporate full context.
Causal (decoder): $M_{ij} = 0$ if $j \le i$, else $M_{ij} = -\infty$. Position $i$ only sees positions $1$ through $i$. This enables autoregressive generation: each token's probability depends only on preceding tokens.
Cross-attention (encoder-decoder): The decoder has two attention layers per block. Self-attention uses causal masking over the output. Cross-attention uses the decoder's queries with the encoder's keys and values, with no masking (the decoder sees the full encoded input at every step).
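The two masks above can be made concrete in a short NumPy sketch (illustrative only, not a real model): the mask is built as an additive matrix of $0$ and $-\infty$ and applied to the attention scores before the softmax.

```python
import numpy as np

def build_mask(n, causal):
    """Additive attention mask: 0 where attention is allowed, -inf where blocked."""
    if not causal:
        return np.zeros((n, n))            # bidirectional: every pair allowed
    mask = np.full((n, n), -np.inf)
    return np.triu(mask, k=1)              # strictly upper triangle (j > i) blocked

def masked_softmax(scores, mask):
    """Row-wise softmax over scores + mask."""
    z = scores + mask
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    w = np.exp(z)                          # exp(-inf) = 0, so blocked positions get weight 0
    return w / w.sum(axis=-1, keepdims=True)

n = 4
scores = np.zeros((n, n))                  # uniform scores, just to expose the mask's effect
causal = masked_softmax(scores, build_mask(n, causal=True))
# Row i spreads its attention uniformly over positions 0..i only.
```

With uniform scores, row $i$ of the causal result is uniform over the first $i{+}1$ positions (row 0 is `[1, 0, 0, 0]`), which is exactly the "position $i$ sees positions $1$ through $i$" rule stated above.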
Pretraining Objectives
| Architecture | Pretraining Objective | What It Learns |
|---|---|---|
| Encoder-only | Masked language modeling (MLM): predict randomly masked tokens | Bidirectional representations, good for understanding |
| Decoder-only | Next-token prediction: predict $x_t$ given $x_1, \ldots, x_{t-1}$ | Autoregressive generation, in-context learning |
| Encoder-decoder | Span corruption or denoising: reconstruct corrupted spans | Sequence-to-sequence mapping, conditional generation |
MLM for BERT masks 15% of tokens and predicts them from surrounding context. This is not a valid generative model because the masked positions are predicted independently, not autoregressively.
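A minimal sketch of MLM corruption (the `MASK_ID` value and the 80/10/10 replacement rule from the BERT paper are simplified here: every selected token is replaced with the mask token):

```python
import random

MASK_ID = 103  # assumed [MASK] token id; 103 in BERT's standard vocabulary

def mlm_corrupt(token_ids, mask_prob=0.15, seed=0):
    """Return (corrupted input, labels). Labels are -100 except at masked positions,
    the usual convention for positions ignored by the loss."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tid in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)   # full BERT also uses random/unchanged replacements
            labels.append(tid)       # the model must predict the original token here
        else:
            inputs.append(tid)
            labels.append(-100)      # not a prediction target
    return inputs, labels
```

Note that the loss is computed only at the masked positions, and each masked position is predicted independently given the bidirectional context, which is why this objective does not define a left-to-right generative model.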
Next-token prediction decomposes the joint probability as $p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$. This is a valid generative model. Scaling this objective to billions of parameters produces in-context learning.
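The factorization means a sequence's log-probability is a sum of per-step log-probabilities, which is also how perplexity is computed. A toy numeric check with hypothetical per-step conditionals:

```python
import math

# Hypothetical conditional probabilities p(x_t | x_<t) for a 3-token sequence.
step_probs = [0.5, 0.25, 0.8]

log_prob = sum(math.log(p) for p in step_probs)        # log of the joint probability
joint = math.exp(log_prob)                             # 0.5 * 0.25 * 0.8 = 0.1
perplexity = math.exp(-log_prob / len(step_probs))     # geometric mean of inverse probs
```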
T5's span corruption randomly selects contiguous spans of text, replaces them with sentinel tokens, and trains the decoder to output the missing spans. This teaches the model to map from corrupted input to clean output.
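Span corruption can be sketched on a token list. The sentinel strings below mirror T5's `<extra_id_k>` special tokens; the spans are passed in explicitly rather than sampled, to keep the sketch deterministic:

```python
def span_corrupt(tokens, spans):
    """spans: sorted, non-overlapping (start, length) pairs.
    Returns (corrupted encoder input, decoder target) in the T5 style."""
    inp, tgt, prev_end = [], [], 0
    for k, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inp += tokens[prev_end:start] + [sentinel]        # span replaced by one sentinel
        tgt += [sentinel] + tokens[start:start + length]  # target: sentinel + missing span
        prev_end = start + length
    inp += tokens[prev_end:]
    return inp, tgt

toks = ["The", "cute", "dog", "walks", "in", "the", "park"]
inp, tgt = span_corrupt(toks, [(1, 1), (4, 2)])
# inp: ['The', '<extra_id_0>', 'dog', 'walks', '<extra_id_1>', 'park']
# tgt: ['<extra_id_0>', 'cute', '<extra_id_1>', 'in', 'the']
```

The encoder sees the corrupted input; the decoder is trained to emit each sentinel followed by the tokens it replaced.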
Side-by-Side Comparison
| Property | Encoder-Only | Decoder-Only | Encoder-Decoder |
|---|---|---|---|
| Attention | Bidirectional | Causal (left-to-right) | Bidirectional (enc) + causal (dec) + cross |
| Representative models | BERT, RoBERTa, DeBERTa | GPT-2/3/4, LLaMA, PaLM | T5, BART, mBART, Flan-T5 |
| Pretraining | MLM (+ NSP for BERT) | Next-token prediction | Span denoising |
| Native generation | No | Yes | Yes |
| Input-output structure | Same sequence | Input prefix + generated continuation | Separate input and output sequences |
| KV cache for inference | Not applicable | Caches all prior tokens | Encoder output cached, decoder caches incrementally |
| Parameter count (same quality) | Smallest for understanding tasks | Largest, but most versatile | Intermediate; maintains two stacks (encoder + decoder) |
| In-context learning | Weak | Strong (emergent at scale) | Moderate |
| Best for | Classification, NER, retrieval, extraction | Open-ended generation, chat, reasoning | Translation, summarization, structured generation |
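The KV-cache row in the table can be made concrete. At each decoding step, a decoder-only model appends the newest token's key and value to a cache and attends over everything cached so far, instead of recomputing attention for the whole prefix. A single-head illustrative NumPy sketch (not an optimized implementation):

```python
import numpy as np

class KVCache:
    """Append-only key/value store for one attention head (illustrative)."""
    def __init__(self, d_k):
        self.keys = np.empty((0, d_k))
        self.values = np.empty((0, d_k))

    def step(self, q, k, v):
        """q, k, v: (d_k,) vectors for the newest token. Returns its attention output."""
        self.keys = np.vstack([self.keys, k])      # cache grows by one row per step
        self.values = np.vstack([self.values, v])
        scores = self.keys @ q / np.sqrt(len(q))   # attend over all cached positions
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ self.values
```

Causal masking is what makes this work: because token $t$ never attends to tokens after $t$, earlier keys and values never change and can be cached once. In an encoder-decoder model the encoder's output plays the role of a fixed cache for cross-attention, while the decoder caches its own self-attention keys and values incrementally, as the table notes.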
When Each Wins
Encoder-only: classification and extraction
When the task is to assign a label or extract a span from existing text, bidirectional attention is strictly more informative than causal attention. BERT-style models dominate in token classification (NER), sentence classification (sentiment), and semantic similarity. The representation at position $i$ incorporates information from both left and right context.
Decoder-only: generation and in-context learning
The GPT scaling trajectory showed that decoder-only models exhibit emergent capabilities at scale that encoder-only models do not: few-shot learning, chain-of-thought reasoning, instruction following. The causal architecture naturally supports open-ended generation. Every major LLM since GPT-3 (PaLM, LLaMA, Claude, Gemini) is decoder-only.
Encoder-decoder: structured input-output mapping
Translation, summarization, and any task with a clear input/output separation benefits from the encoder-decoder split. The encoder can build a full bidirectional representation of the input. The decoder can attend to this representation at every generation step via cross-attention. T5 cast all NLP tasks as text-to-text and showed competitive performance across the board.
Why Decoder-Only Dominates in 2024+
Three factors drove convergence toward decoder-only:
- Simplicity. One architecture, one objective (next-token prediction), one training pipeline. Encoder-decoder requires managing two sets of parameters, and cross-attention adds complexity.
- Scaling behavior. Decoder-only models show smoother scaling curves. The scaling laws of Kaplan et al. and Hoffmann et al. were derived for decoder-only architectures.
- Unification. With sufficient scale, decoder-only models handle classification (generate the label), extraction (generate the span), translation (generate the target), and reasoning (generate the chain). Specializing the architecture to the task becomes unnecessary.
The cost is that decoder-only models are less parameter-efficient for pure understanding tasks. A BERT-base (110M parameters) fine-tuned for sentiment analysis outperforms a GPT-2-base (124M parameters) on the same task, because bidirectional attention extracts more information per parameter for classification.
Common Confusions
Encoder-only does not mean no generation is possible
You can build a generative model on top of BERT-style encoders. Masked language models can iteratively predict masked tokens (as in non-autoregressive generation). BERT itself is not autoregressive, but the encoder architecture is not inherently incompatible with generation. It is simply less natural and less effective than causal generation at scale.
Decoder-only models still encode the input
A decoder-only model processing a prompt is encoding that prompt into internal representations via causal attention. The distinction is not that GPT lacks an encoder. The distinction is that the encoding is causal (each token only sees prior tokens), whereas a dedicated encoder uses bidirectional attention. The prompt prefix in a decoder-only model serves as a pseudo-encoder.
Cross-attention is not the same as concatenating encoder and decoder inputs
Cross-attention uses the decoder's queries and the encoder's keys and values. This is distinct from simply concatenating the input and output sequences and running a single decoder. Cross-attention allows the decoder to attend to the full encoder representation at every layer, while concatenation forces the model to route information through causal attention over the concatenated sequence.
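A single cross-attention step can be sketched in NumPy to make the asymmetry explicit: queries come from the decoder states, keys and values from the encoder output, and no mask is applied (shapes, random weights, and inputs here are illustrative):

```python
import numpy as np

def cross_attention(dec_states, enc_states, d_k=8, seed=0):
    """dec_states: (T_dec, d), enc_states: (T_enc, d). Single head, no mask."""
    rng = np.random.default_rng(seed)
    d = dec_states.shape[-1]
    W_q, W_k, W_v = (rng.standard_normal((d, d_k)) for _ in range(3))
    Q = dec_states @ W_q                 # queries from the DECODER
    K = enc_states @ W_k                 # keys from the ENCODER
    V = enc_states @ W_v                 # values from the ENCODER
    scores = Q @ K.T / np.sqrt(d_k)      # (T_dec, T_enc): decoder x encoder positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # every decoder position sees ALL encoder positions
    return w @ V                         # (T_dec, d_k)

out = cross_attention(np.ones((3, 16)), np.ones((5, 16)))
```

Note the score matrix is rectangular, `(T_dec, T_enc)`, and unmasked; in the concatenation alternative, the same information would have to pass through a square causal attention over the combined sequence.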
References
- Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS 2017. (Original transformer with encoder-decoder architecture.)
- Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019. (Encoder-only, masked language modeling.)
- Radford, A. et al. (2019). "Language Models are Unsupervised Multitask Learners." (GPT-2, decoder-only autoregressive pretraining.)
- Raffel, C. et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." JMLR, 21(140), 1-67. (T5, encoder-decoder, text-to-text framework.)
- Brown, T. B. et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020. (GPT-3, establishing decoder-only in-context learning.)
- Wang, A. et al. (2019). "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems." NeurIPS 2019. (Benchmark comparing encoder-only models on understanding tasks.)
- Tay, Y. et al. (2023). "UL2: Unifying Language Learning Paradigms." ICLR 2023. (Systematic comparison of pretraining objectives across architectures.)