
Comparison

Encoder-Only vs. Decoder-Only vs. Encoder-Decoder

Encoder-only models (BERT) use bidirectional attention for classification and extraction. Decoder-only models (GPT) use causal masking for autoregressive generation. Encoder-decoder models (T5) use cross-attention to condition generation on a fully encoded input. The architecture choice determines what tasks the model can perform natively.

What Each Architecture Does

All three are built from the same transformer building blocks (multi-head attention, feedforward layers, residual connections, layer normalization) but differ in how attention is masked and how input-output flows are structured.

Encoder-only (BERT, RoBERTa, DeBERTa): Every token attends to every other token in the input (bidirectional attention). The model produces contextual representations of the input. No generation capability without additional decoder machinery.

Decoder-only (GPT, LLaMA, PaLM): Token $t$ can only attend to tokens $1, \ldots, t$ (causal masking). The model is trained to predict the next token autoregressively. Input and output share the same sequence.

Encoder-decoder (T5, BART, mBART): An encoder processes the input with bidirectional attention. A decoder generates output autoregressively, attending both to previous output tokens (causal self-attention) and to the encoder output (cross-attention).

Attention Masking: The Core Difference

For a sequence of length $n$, define the attention mask $M \in \{0, -\infty\}^{n \times n}$ applied to the attention scores before the softmax:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

Bidirectional (encoder): $M_{ij} = 0$ for all $i, j$. Every position attends to every other position. Token representations incorporate full context.

Causal (decoder): $M_{ij} = 0$ if $j \leq i$, else $M_{ij} = -\infty$. Position $i$ only sees positions $1$ through $i$. This enables autoregressive generation: each token's probability depends only on preceding tokens.

Cross-attention (encoder-decoder): The decoder has two attention layers per block. Self-attention uses causal masking over the output. Cross-attention uses the decoder's queries with the encoder's keys and values, with no masking (the decoder sees the full encoded input at every step).
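The masking difference can be made concrete in a few lines of NumPy. This is a single-head, unbatched toy (dimensions and data are arbitrary), not a real model: the same attention function produces encoder-style or decoder-style behavior depending solely on the additive mask passed in.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask):
    """Scaled dot-product attention with an additive mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores + mask) @ V

n, d = 4, 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Bidirectional (encoder): M_ij = 0 everywhere.
bidir_mask = np.zeros((n, n))

# Causal (decoder): M_ij = -inf wherever j > i.
causal_mask = np.where(np.tril(np.ones((n, n))) == 1, 0.0, -np.inf)

out_enc = attention(Q, K, V, bidir_mask)
out_dec = attention(Q, K, V, causal_mask)

# Under causal masking, position 0 attends only to itself,
# so its output is exactly V[0]. Bidirectionally, it is a mix.
assert np.allclose(out_dec[0], V[0])
assert not np.allclose(out_enc[0], V[0])
```

The sanity check at the end captures the core asymmetry: the first token of a causal model carries no information about later tokens, while the same position in an encoder already mixes in the full sequence.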

Pretraining Objectives

| Architecture | Pretraining Objective | What It Learns |
|---|---|---|
| Encoder-only | Masked language modeling (MLM): predict randomly masked tokens | Bidirectional representations, good for understanding |
| Decoder-only | Next-token prediction: predict $P(x_t \mid x_1, \ldots, x_{t-1})$ | Autoregressive generation, in-context learning |
| Encoder-decoder | Span corruption or denoising: reconstruct corrupted spans | Sequence-to-sequence mapping, conditional generation |

BERT's MLM objective masks 15% of tokens and predicts them from surrounding context. This is not a valid generative model, because the masked positions are predicted independently rather than autoregressively.

Next-token prediction decomposes the joint probability as $P(x_1, \ldots, x_n) = \prod_{t=1}^n P(x_t \mid x_{<t})$. This is a valid generative model. Scaling this objective to billions of parameters produces in-context learning.
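The chain-rule factorization can be checked numerically with a toy model. The sketch below uses a hand-built bigram model (a hypothetical three-symbol vocabulary, not any real tokenizer): the joint probability of a sequence is just the product of the per-step conditionals, which is exactly what next-token training optimizes.

```python
import numpy as np

# Toy vocabulary and a hand-built bigram model: row = current token,
# column = next token, each row sums to 1.
vocab = ["<s>", "a", "b"]
P = np.array([
    [0.0, 0.6, 0.4],   # after <s>
    [0.0, 0.3, 0.7],   # after a
    [0.0, 0.5, 0.5],   # after b
])

def joint_prob(token_ids):
    """P(x_1..x_n) = prod_t P(x_t | x_{t-1}) under the bigram assumption."""
    prob = 1.0
    prev = 0  # start from the <s> symbol
    for t in token_ids:
        prob *= P[prev, t]
        prev = t
    return prob

# Sequence "a b": P(a|<s>) * P(b|a) = 0.6 * 0.7 = 0.42
assert abs(joint_prob([1, 2]) - 0.42) < 1e-12
```

A full decoder-only transformer replaces the bigram table with a neural network conditioned on the entire prefix, but the decomposition being optimized is the same.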

T5's span corruption randomly selects contiguous spans of text, replaces them with sentinel tokens, and trains the decoder to output the missing spans. This teaches the model to map from corrupted input to clean output.
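A minimal sketch of the corruption step, using the example sentence from the T5 paper. T5's actual sentinel tokens are named `<extra_id_0>`, `<extra_id_1>`, and so on; the span positions here are chosen by hand rather than sampled, and the helper function is illustrative, not T5's implementation.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) token span with a sentinel; return the
    corrupted input and the target the decoder must produce."""
    corrupted, target = [], []
    pos = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted.extend(tokens[pos:start])  # keep text before the span
        corrupted.append(sentinel)           # replace the span itself
        target.append(sentinel)              # target: sentinel + missing text
        target.extend(tokens[start:end])
        pos = end
    corrupted.extend(tokens[pos:])
    return corrupted, target

tokens = "thank you for inviting me to your party last week".split()
corrupted, target = corrupt_spans(tokens, [(1, 2), (6, 8)])
# corrupted: thank <extra_id_0> for inviting me to <extra_id_1> last week
# target:    <extra_id_0> you <extra_id_1> your party
```

The encoder sees the corrupted sequence with bidirectional attention; the decoder is trained to emit the target autoregressively, so the objective exercises both halves of the architecture.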

Side-by-Side Comparison

| Property | Encoder-Only | Decoder-Only | Encoder-Decoder |
|---|---|---|---|
| Attention | Bidirectional | Causal (left-to-right) | Bidirectional (enc) + causal (dec) + cross |
| Representative models | BERT, RoBERTa, DeBERTa | GPT-2/3/4, LLaMA, PaLM | T5, BART, mBART, Flan-T5 |
| Pretraining | MLM (+ NSP for BERT) | Next-token prediction | Span denoising |
| Native generation | No | Yes | Yes |
| Input-output structure | Same sequence | Input prefix + generated continuation | Separate input and output sequences |
| KV cache for inference | Not applicable | Caches all prior tokens | Encoder output cached once; decoder caches incrementally |
| Parameter count (same quality) | Smallest for understanding tasks | Largest, but most versatile | Middle, but requires two stacks (encoder and decoder) |
| In-context learning | Weak | Strong (emergent at scale) | Moderate |
| Best for | Classification, NER, retrieval, extraction | Open-ended generation, chat, reasoning | Translation, summarization, structured generation |
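The KV-cache row can be made concrete. During incremental decoding, each new token's key and value vectors are appended to a cache, so earlier tokens are never re-processed; each step attends over the cached entries. This is a single-head, unbatched sketch with toy dimensions, not a real inference engine:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

class KVCache:
    """Minimal key/value cache for autoregressive decoding."""
    def __init__(self):
        self.K = []  # one key vector per decoded position
        self.V = []  # one value vector per decoded position

    def step(self, q, k, v):
        # Append this step's key/value, then attend over everything cached.
        # Cost per step is O(current length), not O(length^2) re-encoding.
        self.K.append(k)
        self.V.append(v)
        K = np.stack(self.K)
        V = np.stack(self.V)
        weights = softmax(q @ K.T / np.sqrt(q.shape[-1]))
        return weights @ V

d = 8
rng = np.random.default_rng(1)
cache = KVCache()
for t in range(5):
    q, k, v = rng.standard_normal((3, d))
    out = cache.step(q, k, v)

assert len(cache.K) == 5      # one cached key per generated token
assert out.shape == (d,)      # one output vector per step
```

Causal masking is what makes this cache valid: past keys and values never change once computed, so they can be stored and reused. In an encoder-decoder model the same trick applies to the decoder, while the encoder output is computed once and reused at every step.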

When Each Wins

Encoder-only: classification and extraction

When the task is to assign a label or extract a span from existing text, bidirectional attention is strictly more informative than causal attention. BERT-style models dominate in token classification (NER), sentence classification (sentiment), and semantic similarity. The representation at position $i$ incorporates information from both left and right context.

Decoder-only: generation and in-context learning

The GPT scaling trajectory showed that decoder-only models exhibit emergent capabilities at scale that encoder-only models do not: few-shot learning, chain-of-thought reasoning, instruction following. The causal architecture naturally supports open-ended generation. Every major LLM since GPT-3 (PaLM, LLaMA, Claude, Gemini) is decoder-only.

Encoder-decoder: structured input-output mapping

Translation, summarization, and any task with a clear input/output separation benefits from the encoder-decoder split. The encoder can build a full bidirectional representation of the input. The decoder can attend to this representation at every generation step via cross-attention. T5 cast all NLP tasks as text-to-text and showed competitive performance across the board.

Why Decoder-Only Dominates in 2024+

Three factors drove convergence toward decoder-only:

  1. Simplicity. One architecture, one objective (next-token prediction), one training pipeline. Encoder-decoder requires managing two parameter stacks, and cross-attention adds implementation complexity.

  2. Scaling behavior. Decoder-only models show smoother scaling curves. The scaling laws of Kaplan et al. and Hoffmann et al. were derived for decoder-only architectures.

  3. Unification. With sufficient scale, decoder-only models handle classification (generate the label), extraction (generate the span), translation (generate the target), and reasoning (generate the chain). Specializing the architecture to the task becomes unnecessary.

The cost is that decoder-only models are less parameter-efficient for pure understanding tasks. A BERT-base (110M parameters) fine-tuned for sentiment analysis outperforms a GPT-2-base (124M parameters) on the same task, because bidirectional attention extracts more information per parameter for classification.

Common Confusions

Watch Out

Encoder-only does not mean no generation is possible

You can build a generative model on top of BERT-style encoders. Masked language models can iteratively predict masked tokens (as in non-autoregressive generation). BERT itself is not autoregressive, but the encoder architecture is not inherently incompatible with generation. It is simply less natural and less effective than causal generation at scale.

Watch Out

Decoder-only models still encode the input

A decoder-only model processing a prompt is encoding that prompt into internal representations via causal attention. The distinction is not that GPT lacks an encoder. The distinction is that the encoding is causal (each token only sees prior tokens), whereas a dedicated encoder uses bidirectional attention. The prompt prefix in a decoder-only model serves as a pseudo-encoder.

Watch Out

Cross-attention is not the same as concatenating encoder and decoder inputs

Cross-attention uses the decoder's queries and the encoder's keys and values. This is distinct from simply concatenating the input and output sequences and running a single decoder. Cross-attention allows the decoder to attend to the full encoder representation at every layer, while concatenation forces the model to route information through causal attention over the concatenated sequence.
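The asymmetry is visible in the shapes alone. In this sketch (single-head, unbatched, arbitrary dimensions), queries come from the decoder's $n_{tgt}$ positions while keys and values come from the encoder's $n_{src}$ positions, and no mask is applied:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_Q, enc_K, enc_V):
    """Queries from the decoder; keys/values from the encoder.
    No mask: every decoder position sees the full encoded input."""
    d_k = dec_Q.shape[-1]
    scores = dec_Q @ enc_K.T / np.sqrt(d_k)  # (n_tgt, n_src)
    return softmax(scores) @ enc_V           # (n_tgt, d)

rng = np.random.default_rng(0)
n_src, n_tgt, d = 6, 3, 8
enc_K = rng.standard_normal((n_src, d))
enc_V = rng.standard_normal((n_src, d))
dec_Q = rng.standard_normal((n_tgt, d))

out = cross_attention(dec_Q, enc_K, enc_V)
# One output vector per decoder position, each a weighted
# mix of all n_src encoder value vectors.
assert out.shape == (n_tgt, d)
```

Contrast this with concatenation: there, input and output would live in one sequence of length $n_{src} + n_{tgt}$ under a single causal mask, and information from the input could only reach the output through left-to-right attention rather than through a dedicated, unmasked pathway at every layer.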

References

  1. Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS 2017. (Original transformer with encoder-decoder architecture.)
  2. Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019. (Encoder-only, masked language modeling.)
  3. Radford, A. et al. (2019). "Language Models are Unsupervised Multitask Learners." (GPT-2, decoder-only autoregressive pretraining.)
  4. Raffel, C. et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." JMLR, 21(140), 1-67. (T5, encoder-decoder, text-to-text framework.)
  5. Brown, T. B. et al. (2020). "Language Models are Few-Shot Learners." NeurIPS 2020. (GPT-3, establishing decoder-only in-context learning.)
  6. Wang, A. et al. (2019). "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems." NeurIPS 2019. (Benchmark comparing encoder-only models on understanding tasks.)
  7. Tay, Y. et al. (2023). "UL2: Unifying Language Learning Paradigms." ICLR 2023. (Systematic comparison of pretraining objectives across architectures.)