
Paper breakdown

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy et al. · 2020 · ICLR 2021

Splits an image into a sequence of fixed-size patches, embeds each patch linearly, and feeds them into a standard transformer encoder. Matches or exceeds CNN accuracy on ImageNet given enough pretraining data, overturning the assumption that vision needs convolution.

Overview

Dosovitskiy and collaborators (2020) took a standard transformer encoder, the same one Vaswani et al. (2017) wrote down for translation, and applied it directly to images. The image is divided into fixed-size patches (16×16 pixels for ViT-Base on 224×224 input), each patch is flattened and linearly projected to the transformer's working dimension, a learned positional embedding is added, and the resulting sequence is fed through a stack of transformer encoder blocks (twelve for ViT-Base). A classification token ([class], à la BERT) is prepended; its final-layer representation is the input to the image classifier.

The paper's headline is that this works as long as you have enough data. ViT trained on ImageNet-1k from scratch underperforms a comparable ResNet. ViT trained on JFT-300M (Google's internal 300-million-image dataset) and fine-tuned on ImageNet-1k outperforms state-of-the-art CNNs at the same parameter count and FLOPs. The implicit inductive bias of convolution — local connectivity, translation equivariance — pays off below a data threshold; above it, the transformer's flexibility wins.

This is the architectural unification paper. It made plausible the picture that one architecture (the transformer) could subsume what previously needed task-specific designs. By 2024 transformers are the default in vision, language, audio, code, and robotics; the paper's claim is the empirical turning point that motivated that consolidation.

Mathematical Contributions

Patchify

Given an image $I \in \mathbb{R}^{H \times W \times C}$ with $C = 3$ for RGB, divide it into non-overlapping patches of size $P \times P$. The number of patches is $N = HW/P^2$; for $H = W = 224$ and $P = 16$, $N = 196$. Each patch is flattened to $\mathbb{R}^{P^2 C} = \mathbb{R}^{768}$ (for $P = 16$, $C = 3$), then linearly projected:

$$z_0^{(i)} = E\, x^{(i)} + e^{(i)}_{\text{pos}}$$

where $E \in \mathbb{R}^{D \times P^2 C}$ is the patch projection, $x^{(i)}$ is the flattened $i$-th patch, and $e^{(i)}_{\text{pos}}$ is a learned position embedding. The transformer's hidden dimension $D$ is typically 768 (ViT-Base), 1024 (ViT-Large), or 1280 (ViT-Huge).
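
As a concrete sketch (PyTorch assumed; ViT-Base sizes; the module and variable names are illustrative, not the paper's code), the patchify-and-project step can be written as a convolution whose kernel and stride both equal $P$, which is equivalent to flattening each patch and applying a shared linear map:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patchify + linear projection + learned position embedding (ViT-Base sizes assumed)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2           # N = 196
        # A PxP convolution with stride P equals flatten-then-linearly-project per patch.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))  # learned e_pos

    def forward(self, x):                   # x: (B, 3, 224, 224)
        x = self.proj(x)                    # (B, D, 14, 14)
        x = x.flatten(2).transpose(1, 2)    # (B, N, D), one row per patch
        return x + self.pos_embed           # patch tokens z_0; the class token is added later
```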

Class token

A learnable vector $z^{(0)}_0 \in \mathbb{R}^D$ is prepended to the patch sequence, giving inputs $z_0 = [z^{(0)}_0; z^{(1)}_0; \ldots; z^{(N)}_0] \in \mathbb{R}^{(N+1) \times D}$. This is the BERT trick: after $L$ encoder blocks, $z^{(0)}_L$ is a single vector summarizing the image, and the classifier is a linear layer on top of it. The patch tokens are not pooled; the model is forced to write the classification information into the class-token position.
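
Continuing the sketch (names still illustrative), prepending the class token and reading it out after the encoder looks like this:

```python
import torch
import torch.nn as nn

B, N, D, num_classes = 8, 196, 768, 1000
cls_token = nn.Parameter(torch.zeros(1, 1, D))      # learnable z_0^(0)
head = nn.Linear(D, num_classes)                     # linear classifier on the class token

patches = torch.randn(B, N, D)                       # stand-in for the PatchEmbed output above
z0 = torch.cat([cls_token.expand(B, -1, -1), patches], dim=1)   # (B, N+1, D)

# ... z0 would pass through L encoder blocks, producing zL of the same shape ...
zL = z0                                              # placeholder for the encoder output
logits = head(zL[:, 0])                              # only the class-token slot feeds the head
```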

Encoder block

Each block applies multi-head self-attention then a position-wise MLP, both with residual connections and pre-norm:

$$z'_\ell = \text{MSA}(\text{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad z_\ell = \text{MLP}(\text{LN}(z'_\ell)) + z'_\ell$$

The MSA uses $h = 12$ heads (ViT-Base) with per-head dimension $D/h$. The MLP is two linear layers with a GELU in between and hidden dimension $4D$. Apart from the pre-norm placement (BERT-Base applies LayerNorm after each sublayer), this is the BERT-Base encoder block; the only thing that changes is the input.
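
A minimal pre-norm block in the same sketch style (ViT-Base sizes assumed; torch.nn.MultiheadAttention stands in for the paper's MSA):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm encoder block: MSA then MLP, each behind LayerNorm with a residual connection."""
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, z):                                   # z: (B, N+1, D)
        y = self.norm1(z)
        z = z + self.attn(y, y, y, need_weights=False)[0]   # z' = MSA(LN(z)) + z
        z = z + self.mlp(self.norm2(z))                      # z  = MLP(LN(z')) + z'
        return z
```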

Inductive bias inventory

The paper is careful about what bias is and is not present:

  • Translation equivariance: lost. A patch at position $i$ has its own learned $e^{(i)}_{\text{pos}}$; the same patch content at position $j \ne i$ produces a different input. CNNs have this for free via weight sharing.
  • Locality: lost in the same way. Attention sees every patch from layer 1; there is no built-in preference for nearby patches.
  • Two-dimensional structure: lost. The position embedding is learned per index, with no architectural knowledge that index $i$ and index $i + W/P$ are vertical neighbors (see the sketch after this list). The paper experiments with explicit 2D-aware positional encodings and reports no improvement.
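
To make the missing 2D structure concrete, here is a hypothetical helper that recovers a patch's grid position from its flattened index; nothing like this exists inside the model, which only ever sees the flat index:

```python
def patch_row_col(i, img_size=224, patch_size=16):
    """Map flattened patch index i to (row, col) on the patch grid (illustrative only)."""
    per_row = img_size // patch_size        # W / P = 14 patches per row
    return divmod(i, per_row)               # the vertical neighbor of i sits at i + per_row

print(patch_row_col(0))    # (0, 0)
print(patch_row_col(14))   # (1, 0): directly below patch 0, yet 14 indices away
```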

The argument is that with enough data, the model learns these invariances from examples. Pretraining on JFT-300M is what supplies the examples; ImageNet-1k alone does not.

Data-scale crossover

Section 4.2 reports the central empirical result: ViT-L/16 trained on ImageNet-1k from scratch reaches 76.5% top-1; on ImageNet-21k it reaches 85.3%; on JFT-300M and fine-tuned to ImageNet, 87.8%. A comparable ResNet-152x4 BiT model is at 87.5%. Below a few hundred million pretraining images, the CNN wins; above, the transformer does. The exact threshold has come down since 2020 (ViT trained from scratch on ImageNet-1k now matches ResNets with better recipes) but the directional claim survives.

Hybrid variant

Section 3.1 also tests a hybrid that replaces the linear patch embedding with the early layers of a ResNet, then feeds the resulting feature map into the transformer. At small scale the hybrid outperforms the pure ViT; at large scale the gap closes. It anticipates the broader pattern of adding local convolutional bias back where it helps, later pursued by Swin's windowed attention (Liu et al., 2021) and ConvNeXt's modernized convolutions (Liu et al., 2022).


Why It Matters Now

The vision-transformer line of work reset the architectural defaults across vision. ViT and its descendants (Swin, DeiT, BEiT, MAE, DINO, DINOv2) replaced ResNet as the standard image backbone for new model families. Multimodal models (CLIP, Flamingo, GPT-4V, Gemini) all use ViT-style image encoders.

The paper's "no inductive bias" argument did not survive cleanly. Pure ViT at small data is dominated by hierarchical/local-attention variants (Swin, MaxViT) and by recipes that re-introduce convolution (ConvNeXt). The 2024 vision picture is mixed: the dominant architectures are ViTs with some local-attention or convolutional structure, not pure-attention ViTs. The paper's stronger claim — that scale alone subsumes architectural design — is true at the largest scales (JFT-, LAION-class) and false at the academic-benchmark scales most papers run at.

The methodological lesson is more enduring. The paper is a clean demonstration that pretraining-scale matters more than architectural prior, in the regime where pretraining-scale is available. That observation, repeated across language, vision, and now multimodal models, is the through-line of the foundation-model era.

The other thing to note: the paper hardly does anything novel architecturally. Its contribution is empirical clarity — at this scale, this thing happens — and the willingness to train at JFT-300M scale to demonstrate it. That kind of work is harder than algorithmic novelty and is often what moves the field.

References

Canonical:

  • Dosovitskiy, A. et al. (2021). "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR. arXiv:2010.11929.

Direct precursors:

  • Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS. arXiv:1706.03762. The transformer encoder ViT reuses.
  • Devlin, J. et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL. arXiv:1810.04805. The class-token + transformer-encoder + classifier-head template.
  • Carion, N. et al. (2020). "End-to-End Object Detection with Transformers." ECCV. arXiv:2005.12872. DETR — earlier vision-transformer for detection that ViT is the classification analog of.
  • Kolesnikov, A. et al. (2020). "Big Transfer (BiT): General Visual Representation Learning." ECCV. arXiv:1912.11370. The ResNet baseline ViT was compared against, also from the same group.

Direct descendants:

  • Liu, Z. et al. (2021). "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows." ICCV. arXiv:2103.14030. Local attention windows; restores some convolutional inductive bias.
  • Touvron, H. et al. (2021). "Training data-efficient image transformers & distillation through attention." ICML. arXiv:2012.12877. DeiT — ViT trained from scratch on ImageNet-1k via distillation, no JFT.
  • He, K. et al. (2022). "Masked Autoencoders Are Scalable Vision Learners." CVPR. arXiv:2111.06377. MAE — self-supervised pretraining for ViT.
  • Liu, Z. et al. (2022). "A ConvNet for the 2020s." CVPR. arXiv:2201.03545. ConvNeXt — CNNs modernized with ViT-style training recipes; matches ViT.

Self-supervised pretraining:

  • Caron, M. et al. (2021). "Emerging Properties in Self-Supervised Vision Transformers." ICCV. arXiv:2104.14294. DINO.
  • Oquab, M. et al. (2024). "DINOv2: Learning Robust Visual Features without Supervision." TMLR. arXiv:2304.07193. DINOv2.

Multimodal applications:

  • Radford, A. et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision." ICML. arXiv:2103.00020. CLIP — pairs a ViT image encoder with a text encoder.

Standard textbook:

  • Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. Chapter 12.5 — vision transformers.


Last reviewed: May 5, 2026