Beta. Content is under active construction and has not been peer-reviewed. Report errors on GitHub.

Beyond LLMs

Vision Transformer Lineage

The evolution of visual representation learning: from CNNs (AlexNet, ResNet) to ViT (pure attention for images), Swin (hierarchical attention), and DINOv2 (self-supervised ViT with self-distillation), with connections to CLIP.


Why This Matters

[Figure: an input image is split into a 4x4 grid of 16 patches, which are flattened and linearly embedded into a token sequence (CLS, P1...P16) with positional embeddings added, then processed by a standard transformer (L layers of self-attention + FFN); the CLS output maps to class logits. The only change from a language transformer is that patch embedding replaces token embedding.]

For a decade, convolutional neural networks dominated computer vision. ResNet (2015) was the default backbone for classification, detection, and segmentation. Then the Vision Transformer (ViT, 2020) showed that a pure transformer, the same architecture used for language, could match or exceed CNNs on image tasks, given enough data.

This triggered a rapid lineage of architectures: ViT for global attention, Swin for efficient hierarchical attention, DINO/DINOv2 for self-supervised visual features, and CLIP for connecting vision to language. Understanding this lineage is essential because these models serve as the vision backbone in nearly every modern multimodal system, from image generation (Stable Diffusion uses CLIP's vision encoder) to robotics (ViT features for manipulation).

Mental Model

A CNN processes an image by sliding small filters across the spatial grid, building up features from local to global through successive layers. A ViT takes a different approach: chop the image into patches, treat each patch as a token (like a word), and let self-attention figure out how patches relate to each other. Every patch can attend to every other patch from the first layer. There is no built-in locality bias.

The lineage is a series of design decisions: how much locality bias to include, how to handle multiple scales, and whether to use labels or self-supervision.

The CNN Foundation

Definition

CNN Design Principles

The core principles that made CNNs successful for vision:

  1. Local connectivity: each neuron connects only to a small spatial neighborhood
  2. Weight sharing: the same filter is applied at every position (translation equivariance)
  3. Hierarchy: early layers detect edges, middle layers detect textures and parts, late layers detect objects
  4. Downsampling: pooling or strided convolutions reduce spatial resolution while increasing channel depth
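The first two principles are easy to see directly in code: the same kernel slides over every position, so shifting the input shifts the output by the same amount. A minimal NumPy sketch (a naive loop, not an optimized implementation):

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2D convolution: the SAME kernel is applied at every position (weight sharing)."""
    H, W = img.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Local connectivity: each output depends only on a k x k neighborhood
            out[i, j] = (img[i:i + k, j:j + k] * kernel).sum()
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
edge = np.array([[1.0, 0.0, -1.0]] * 3)   # a simple vertical-edge filter
y = conv2d(img, edge)                     # (6, 6) feature map

# Translation equivariance: shifting the input shifts the output identically
y_shift = conv2d(np.roll(img, 1, axis=0), edge)
assert np.allclose(y_shift[1:], y[:-1])
```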

Key architectures: AlexNet (2012, 8 layers), VGGNet (2014, 19 layers), ResNet (2015, 50-152 layers with skip connections), EfficientNet (2019, compound scaling).

ResNet's skip connections solved the degradation problem (deeper networks performing worse than shallower ones) and enabled training of 100+ layer networks. For five years, ResNet variants were the default vision backbone.

ViT: Pure Transformer for Images

Proposition

ViT Patch Embedding and Complexity

Statement

Given an image of resolution $H \times W$ with $C$ channels, ViT creates $N = HW/P^2$ non-overlapping patches of size $P \times P \times C$. Each patch is linearly projected to a $d$-dimensional embedding:

$$\mathbf{z}_i^0 = \mathbf{E} \cdot \text{flatten}(\text{patch}_i) + \mathbf{e}_i^{\text{pos}}, \quad i = 1, \ldots, N$$

where $\mathbf{E} \in \mathbb{R}^{d \times (P^2 C)}$ is the patch embedding matrix and $\mathbf{e}_i^{\text{pos}}$ is a learnable positional embedding. A learnable [CLS] token is prepended, giving $N+1$ tokens.

The self-attention complexity is $O(N^2 d) = O(H^2 W^2 d / P^4)$, quadratic in the number of patches and therefore quartic in image resolution for a fixed patch size.
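The patch embedding is just a reshape plus a matrix multiply. A dependency-free NumPy sketch with the shapes from the statement above (the zero-initialized CLS token and the 0.02 scale factors are illustrative choices, not from the paper):

```python
import numpy as np

def patchify(img, P):
    """Split an (H, W, C) image into N = H*W/P^2 flattened P x P x C patches."""
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0
    x = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, P * P * C)                 # (N, P^2 * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
P, d = 16, 768
patches = patchify(img, P)                          # N = (224/16)^2 = 196 patches
E = rng.standard_normal((d, P * P * 3)) * 0.02      # E: d x (P^2 C), here 768 x 768
cls = np.zeros((1, d))                              # learnable [CLS] token (zero-init here)
e_pos = rng.standard_normal((len(patches) + 1, d)) * 0.02
z0 = np.vstack([cls, patches @ E.T]) + e_pos        # (N+1, d) transformer input
```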

Intuition

ViT's insight is radical simplicity: use the standard transformer architecture without modification for images. The only vision-specific component is the patch embedding; everything else (multi-head attention, FFN, layer norm) is identical to a language transformer. The bet is that with enough data, the transformer's flexibility will outperform the CNN's inductive biases.

Why It Matters

ViT demonstrated that the transformer architecture is not specific to language. It is a general-purpose sequence processor. With sufficient pretraining data (ImageNet-21k or JFT-300M), ViT-Large matches or exceeds ResNet-152 on ImageNet classification. This opened the door to unified architectures for vision and language, which ultimately led to multimodal models like GPT-4V and Gemini.

ViT's key properties:

  • Global receptive field from layer 1: every patch can attend to every other patch, unlike CNNs where the receptive field grows gradually
  • No built-in translation equivariance: the positional embeddings must learn spatial structure from data
  • Data-hungry: ViT underperforms ResNet when trained on ImageNet-1k alone; it requires large-scale pretraining or strong data augmentation (DeiT)
  • Resolution flexibility: at inference time, the patch size is fixed but the number of patches scales with image resolution (positional embeddings can be interpolated)
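The resolution-flexibility bullet can be made concrete: when the patch count changes at inference time, the positional-embedding grid is resampled. A crude nearest-neighbor sketch (production code, e.g. in timm, uses bicubic interpolation and handles the CLS embedding separately):

```python
import numpy as np

def resize_pos_embed(pos, new_side):
    """Resample an (n*n, d) positional-embedding grid to (new_side^2, d).

    Nearest-neighbor keeps the sketch dependency-free; real code interpolates.
    """
    n = int(np.sqrt(len(pos)))
    grid = pos.reshape(n, n, -1)
    idx = np.arange(new_side) * n // new_side       # nearest source row/column per target
    return grid[idx][:, idx].reshape(new_side * new_side, -1)

pos = np.random.default_rng(0).standard_normal((14 * 14, 768))  # 224/16 = 14 per side
pos_384 = resize_pos_embed(pos, 24)                             # 384/16 = 24 per side
```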

Swin Transformer: Bringing Back Hierarchy

Proposition

Swin Shifted Window Attention

Statement

Swin Transformer computes self-attention within local windows of $M \times M$ patches, giving linear complexity in image size:

$$\text{Complexity}(\text{W-MSA}) = O(M^2 N d) \quad \text{vs.} \quad \text{Complexity}(\text{global MSA}) = O(N^2 d)$$

where $N$ is the total number of patches. To enable cross-window communication, alternating layers use shifted windows: the window partition is shifted by $(M/2, M/2)$ patches, so patches that were at window boundaries now share a window.

Swin builds a hierarchical feature pyramid by merging patches across stages:

  • Stage 1: $H/4 \times W/4$ patches with $C$ channels
  • Stage 2: $H/8 \times W/8$ patches with $2C$ channels
  • Stage 3: $H/16 \times W/16$ patches with $4C$ channels
  • Stage 4: $H/32 \times W/32$ patches with $8C$ channels
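Window partitioning and the cyclic shift come down to reshapes and np.roll. A NumPy sketch with stage-1 shapes for a 224-px input (sizes are illustrative; the real implementation also masks attention across wrapped-around window boundaries):

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, d) feature map into non-overlapping M x M windows."""
    H, W, d = x.shape
    x = x.reshape(H // M, M, W // M, M, d).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, d)                 # (num_windows, M*M, d)

H = W = 56; d = 96; M = 7                          # stage-1 sizes for a 224-px input
x = np.random.default_rng(0).standard_normal((H, W, d))
windows = window_partition(x, M)                   # (64, 49, 96): attention runs per window

# Shifted layer: cyclically shift by (M//2, M//2) before partitioning, so patches
# that sat on window boundaries now share a window with their former neighbors.
x_shifted = np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
shifted_windows = window_partition(x_shifted, M)

# Cost ratio vs. global attention: (M^2 * N * d) / (N^2 * d) = M^2 / N
ratio = M * M / (H * W)                            # 49 / 3136 = 0.015625
```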

Intuition

ViT's quadratic attention is wasteful for dense prediction tasks (detection, segmentation) where you need high-resolution feature maps. Swin reintroduces CNN-like design principles (local attention, hierarchy, downsampling) while keeping the attention mechanism. Shifted windows are the key innovation: they provide cross-window connections without global attention, similar to how convolutions provide local connectivity with translational structure.

Why It Matters

Swin Transformer became the dominant vision backbone for dense prediction: object detection (with Cascade Mask R-CNN), semantic segmentation (with UPerNet), and instance segmentation. Its hierarchical multi-scale features are a direct replacement for CNN feature pyramids. Swin showed that the "ViT vs. CNN" debate was a false dichotomy. The best design combines attention with hierarchical structure.

DINO and DINOv2: Self-Supervised ViT

Proposition

DINO Self-Distillation Objective

Statement

DINO trains a Vision Transformer via self-distillation: a student network $f_{\theta_s}$ is trained to match the output of a teacher network $f_{\theta_t}$, where the teacher is an exponential moving average of the student.

Given an image, generate two global crops $x_1, x_2$ and several local crops $x_3, \ldots, x_K$. The loss matches student and teacher output distributions:

$$\mathcal{L} = \sum_{x \in \{x_1, x_2\}} \; \sum_{\substack{x' \in \text{all crops} \\ x' \neq x}} H(P_t(x), P_s(x'))$$

where $P_t(x) = \text{softmax}(f_{\theta_t}(x) / \tau_t)$, $P_s(x') = \text{softmax}(f_{\theta_s}(x') / \tau_s)$, $\tau_t < \tau_s$ (sharper teacher), and $H$ is the cross-entropy.

The teacher is updated via EMA: $\theta_t \leftarrow \tau \theta_t + (1-\tau)\theta_s$. Centering (subtracting the running mean of teacher outputs) prevents collapse onto a single dominant dimension, while the sharper teacher temperature prevents collapse to the uniform distribution.
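The loss and the collapse-avoidance machinery fit in a few lines. A NumPy sketch in which random arrays stand in for projection-head outputs (temperatures follow the paper's defaults; the EMA update is shown only as comments since there is no real network here):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(teacher_logits, student_logits, center, tau_t=0.04, tau_s=0.1):
    """Cross-entropy H(P_t, P_s): centered, sharply-tempered teacher vs. student."""
    p_t = softmax((teacher_logits - center) / tau_t)   # centering + sharpening
    p_s = softmax(student_logits / tau_s)
    return -(p_t * np.log(p_s + 1e-9)).sum(axis=-1).mean()

rng = np.random.default_rng(0)
K = 4096                                               # projection-head output dim
teacher_out = rng.standard_normal((2, K))              # teacher sees global crops only
student_out = rng.standard_normal((2, K))              # student outputs for other crops
center = np.zeros(K)
loss = dino_loss(teacher_out, student_out, center)

# EMA updates (no gradients ever flow to the teacher):
#   theta_t <- tau * theta_t + (1 - tau) * theta_s
#   center  <- m * center + (1 - m) * teacher_out.mean(axis=0)
```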

Intuition

DINO discovers that self-supervised ViTs learn remarkable features without any labels. The self-distillation forces the student to produce consistent representations across different crops of the same image. The EMA teacher provides stable targets. The asymmetric crop strategy (student sees local crops, teacher sees global crops) forces the student to infer global structure from local information.

The resulting attention maps reveal that DINO ViTs learn to segment objects without any segmentation supervision. The [CLS] token's attention map highlights the main object in the image.

Why It Matters

DINOv2 (2023) scales this approach to produce general-purpose visual features that match or exceed supervised pretraining on a wide range of tasks (classification, segmentation, depth estimation, and retrieval), all without fine-tuning. DINOv2 features are used as frozen visual representations in multimodal systems, making them the "foundation features" of computer vision, analogous to how pretrained LLMs provide foundation features for language.

CLIP: Connecting Vision to Language

CLIP (Contrastive Language-Image Pretraining, OpenAI 2021) trains a ViT image encoder and a text transformer jointly on 400M image-text pairs using a contrastive objective:

$$\mathcal{L}_{\text{CLIP}} = -\frac{1}{2N}\sum_{i=1}^N \left[ \log \frac{\exp(\text{sim}(v_i, t_i)/\tau)}{\sum_j \exp(\text{sim}(v_i, t_j)/\tau)} + \log \frac{\exp(\text{sim}(v_i, t_i)/\tau)}{\sum_j \exp(\text{sim}(v_j, t_i)/\tau)} \right]$$

where $v_i$ is the image embedding and $t_i$ is the text embedding of the $i$-th pair.
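The symmetric contrastive objective above, sketched in NumPy with random embeddings standing in for encoder outputs:

```python
import numpy as np

def clip_loss(v, t, tau=0.07):
    """Symmetric InfoNCE over N matched image/text embedding pairs."""
    v = v / np.linalg.norm(v, axis=1, keepdims=True)   # cosine similarity via
    t = t / np.linalg.norm(t, axis=1, keepdims=True)   # L2-normalized embeddings
    logits = v @ t.T / tau                             # (N, N); diagonal = matched pairs

    def ce_diag(lg):                                   # cross-entropy, targets on diagonal
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    return 0.5 * (ce_diag(logits) + ce_diag(logits.T))

rng = np.random.default_rng(0)
v, t = rng.standard_normal((8, 64)), rng.standard_normal((8, 64))
loss_random = clip_loss(v, t)
loss_aligned = clip_loss(v, v)     # identical pairs: much lower loss than random pairs
```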

CLIP enables zero-shot classification: to classify an image, compute its similarity to text prompts like "a photo of a dog" for each class and pick the highest. The ViT encoder from CLIP is the vision component in Stable Diffusion, DALL-E, and many multimodal LLMs. Florence and related vision foundation models extend this paradigm to unified architectures covering detection, segmentation, and captioning in a single model.
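Zero-shot classification then reduces to a nearest-prompt lookup. A sketch where random vectors stand in for the encoder outputs of the image and of prompts like "a photo of a dog":

```python
import numpy as np

def zero_shot_classify(image_emb, prompt_embs):
    """Return the index of the class prompt most similar to the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))                   # cosine similarity, argmax over classes

rng = np.random.default_rng(0)
prompts = rng.standard_normal((3, 64))                 # placeholder text embeddings, 3 classes
image = prompts[1] + 0.1 * rng.standard_normal(64)     # image embedding near class 1's prompt
pred = zero_shot_classify(image, prompts)              # -> 1
```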

The Lineage at a Glance

| Model | Year | Key Innovation | Supervision | Primary Use |
| --- | --- | --- | --- | --- |
| ResNet | 2015 | Skip connections, 100+ layers | Supervised | Backbone for all vision tasks |
| ViT | 2020 | Patches as tokens, pure attention | Supervised (large-scale) | Classification |
| DeiT | 2021 | Distillation + augmentation for ViT | Supervised (ImageNet only) | Efficient ViT training |
| Swin | 2021 | Shifted windows, hierarchical features | Supervised | Detection, segmentation |
| CLIP | 2021 | Contrastive image-text pretraining | Self-supervised (text) | Zero-shot, multimodal |
| DINO | 2021 | Self-distillation, EMA teacher | Self-supervised | Feature extraction |
| DINOv2 | 2023 | Scaled DINO + curated data | Self-supervised | Universal vision features |

Why ViT Works

The standard explanation: attention can model global dependencies that convolutions miss. A convolutional layer with a $3 \times 3$ kernel can only see a $3 \times 3$ neighborhood; global information requires stacking many layers. A single attention layer can relate any patch to any other patch.
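The receptive-field arithmetic behind that claim: a stack of L stride-1 3x3 convolutions sees 1 + 2L pixels per side, so covering a 224-pixel side takes about 112 layers, while one attention layer is already global.

```python
def receptive_field(num_layers, k=3):
    """Receptive field (per side) of num_layers stacked stride-1 k x k convolutions."""
    return 1 + (k - 1) * num_layers

assert receptive_field(1) == 3      # one 3x3 layer sees 3 pixels
# Smallest depth at which one output pixel 'sees' the whole 224-px side:
layers_needed = next(L for L in range(1, 1000) if receptive_field(L) >= 224)
# layers_needed == 112; a single self-attention layer has global reach at any depth
```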

The deeper explanation: ViTs have a weaker inductive bias than CNNs (no hard-coded locality, no translation equivariance). With limited data, this is a disadvantage. CNNs generalize better from small datasets because their architecture encodes the right priors. With abundant data, the weaker bias becomes an advantage. The model can learn task-appropriate structure rather than being constrained by architectural assumptions.

This is the bias-variance tradeoff applied to architecture: CNNs have strong bias (good for small data), ViTs have weak bias (good for large data).

Common Confusions

Watch Out

ViT is not inherently better than CNNs

ViT's advantage appears primarily with large-scale pretraining. On ImageNet-1k alone (without extra data or heavy augmentation), a well-tuned ResNet-50 can match ViT-B. The key finding is not that attention beats convolution, but that attention scales better with data and compute. For resource-constrained settings, CNNs or efficient hybrids (EfficientNet, ConvNeXt) remain competitive.

Watch Out

Swin is not just ViT with smaller windows

Swin adds hierarchical downsampling (patch merging), creating multi-scale feature maps that ViT does not produce. This hierarchical structure is essential for dense prediction tasks where you need features at multiple resolutions. ViT produces single-scale features from the last layer, which must be adapted for detection and segmentation. Swin natively produces the feature pyramid that these tasks require.

Watch Out

DINOv2 features are not 'just' ImageNet features

DINOv2 features capture visual concepts that supervised ImageNet training misses. Supervised models overfit to the 1000 ImageNet classes: their features are optimized for distinguishing dog breeds, not for understanding spatial layout or material properties. DINOv2's self-supervised objective produces features that transfer broadly because they are not biased toward any specific label set. This is analogous to how pretrained LLMs produce better features than task-specific classifiers.

Summary

  • ViT: split image into patches, embed each as a token, apply standard transformer
  • Complexity: $O(N^2 d)$ where $N = HW/P^2$ is the number of patches; quadratic in patch count, quartic in resolution
  • Swin: local windowed attention + shifted windows + hierarchical downsampling = linear complexity
  • DINO/DINOv2: self-distillation with EMA teacher, no labels needed, produces universal features
  • CLIP: contrastive image-text pretraining, enables zero-shot classification and multimodal systems
  • The lineage: CNN (strong bias) -> ViT (weak bias, needs data) -> Swin (balanced) -> DINOv2 (self-supervised)

Exercises

ExerciseCore

Problem

For a 224x224 image with 3 color channels and a ViT with patch size $P = 16$ and embedding dimension $d = 768$, compute: (a) the number of patches $N$, (b) the size of the patch embedding matrix, and (c) the self-attention complexity per layer.

ExerciseAdvanced

Problem

Swin Transformer with window size $M = 7$ has attention complexity $O(M^2 N d)$ instead of ViT's $O(N^2 d)$. For a 224x224 image with $P = 4$ (Swin's typical patch size at stage 1), compute the ratio of Swin's attention cost to ViT's attention cost (assuming ViT uses the same patch size).

ExerciseResearch

Problem

DINOv2 produces strong visual features without labels, while CLIP produces strong features with text supervision. Under what conditions would you prefer one over the other as a frozen vision backbone, and what are the failure modes of each?


References

Canonical:

  • Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (ICLR 2021) [ViT]
  • Liu et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (ICCV 2021)

Current:

  • Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision" (2023)
  • Radford et al., "Learning Transferable Visual Models From Natural Language Supervision" (ICML 2021) [CLIP]
  • Caron et al., "Emerging Properties in Self-Supervised Vision Transformers" (ICCV 2021) [DINO]


Last reviewed: April 2026
