What Each Does
All three architectures (CNN, ViT, and Swin Transformer) process images to extract features for classification, detection, or segmentation. They differ in how they aggregate spatial information.
CNNs apply learned local filters (kernels) in a sliding-window fashion. Each layer has a fixed receptive field that grows with depth.
ViT splits an image into non-overlapping patches, projects them to token embeddings, and applies standard transformer self-attention across all patches.
Swin Transformer partitions the image into local windows, applies self-attention within each window, and shifts the window partition between layers to enable cross-window information flow.
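As a concrete illustration of the ViT tokenization step, here is a minimal NumPy sketch of patch extraction and linear embedding. The image size, patch size, and embedding dimension are illustrative, and the random projection matrix stands in for a learned embedding layer.

```python
import numpy as np

# Illustrative sizes (not tied to any specific model config)
H = W = 224          # image height/width
P = 16               # patch size
C = 3                # input channels
d = 768              # embedding dimension

rng = np.random.default_rng(0)
image = rng.standard_normal((H, W, C))

# Split into non-overlapping P x P patches and flatten each one
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)      # (N, P*P*C)

# Linear projection to token embeddings (weights would be learned)
W_embed = rng.standard_normal((P * P * C, d)) * 0.02
tokens = patches @ W_embed                    # (N, d)

print(tokens.shape)  # (196, 768): a 14 x 14 grid of patch tokens
```

From here, standard transformer self-attention operates on the `(N, d)` token matrix with no further notion of 2D structure beyond the positional embeddings.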
Side-by-Side Computation
CNN Convolution Layer
For a kernel of size $k \times k$ on a feature map of spatial size $H \times W$ with $C_{\text{in}}$ input channels and $C_{\text{out}}$ output channels:

$$\text{FLOPs} \approx H \cdot W \cdot k^2 \cdot C_{\text{in}} \cdot C_{\text{out}}$$

The cost is linear in spatial size $H \cdot W$.
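The linear scaling can be checked directly; the layer sizes below are illustrative, not taken from a specific network.

```python
def conv_flops(H, W, k, c_in, c_out):
    """Approximate multiply-accumulates for one k x k convolution layer,
    stride 1, same padding: H * W output positions, each costing
    k * k * c_in multiplies per output channel."""
    return H * W * k * k * c_in * c_out

# Doubling spatial resolution quadruples the cost: linear in H * W
base = conv_flops(56, 56, 3, 64, 64)
big = conv_flops(112, 112, 3, 64, 64)
print(big / base)  # 4.0
```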
ViT Self-Attention
For $N$ patches (patch size $P \times P$, so $N = HW/P^2$) with embedding dimension $d$:

$$\text{FLOPs} \approx 4 N d^2 + 2 N^2 d$$

The $2N^2 d$ term for the attention maps makes the cost quadratic in the number of patches.
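A small sketch of how the quadratic term comes to dominate as the token count grows (the embedding dimension 768 is illustrative):

```python
def vit_attention_flops(N, d):
    """Approximate FLOPs for one self-attention layer over N tokens:
    4*N*d^2 for the Q/K/V/output projections,
    2*N^2*d for the attention matrix and the weighted sum."""
    return 4 * N * d * d + 2 * N * N * d

# Fraction of cost spent on the quadratic (attention-map) term
d = 768
for N in (196, 1024, 4096):
    quad = 2 * N * N * d
    share = quad / vit_attention_flops(N, d)
    print(N, round(share, 2))
```

At 196 tokens (a 224x224 image with 16x16 patches) the projections dominate; by a few thousand tokens the attention maps do.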
Swin Window Attention
For windows of $M \times M$ patches on a feature map with $N$ total patches:

$$\text{FLOPs} \approx 4 N d^2 + 2 M^2 N d$$

The cost is linear in $N$ because attention is computed within fixed-size $M \times M$ windows.
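A quick numeric comparison of the global and windowed attention costs; the window size M = 7 matches common Swin configs, and the embedding dimension is illustrative.

```python
def global_attention_flops(N, d):
    # ViT-style global attention: quadratic in token count N
    return 4 * N * d * d + 2 * N * N * d

def window_attention_flops(N, d, M=7):
    # Swin-style attention restricted to M x M windows: linear in N
    return 4 * N * d * d + 2 * (M * M) * N * d

d = 96
for N in (3136, 12544):  # 56x56 and 112x112 patch grids
    g = global_attention_flops(N, d)
    w = window_attention_flops(N, d)
    print(N, round(g / w, 1))
```

The gap widens as the patch grid grows: the global cost keeps accelerating while the windowed cost scales in lockstep with $N$.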
Where Each Is Stronger
CNNs win on data efficiency
Convolutional inductive biases (locality, translation equivariance, weight sharing) encode strong priors about images. A CNN can learn useful representations from thousands of images. ViT, lacking these priors, needs millions of images (or ImageNet-21K scale pretraining) to match CNN performance. This makes CNNs the better choice when labeled data is scarce.
ViT wins on large-scale performance
With sufficient data and compute, ViT surpasses CNNs. The global self-attention mechanism can capture long-range dependencies from the first layer, which CNNs can only achieve through many stacked layers. On ImageNet-21K and JFT-300M pretraining, ViT-Large outperforms the best CNNs.
Swin wins on dense prediction tasks
Object detection and segmentation require multi-scale features. Swin's hierarchical structure naturally produces feature maps at multiple resolutions (like a CNN's feature pyramid) while retaining transformer-level representational power. Swin achieved state-of-the-art results on COCO detection and ADE20K segmentation, outperforming both pure CNNs and vanilla (unmodified) ViT on these tasks.
Where Each Fails
CNNs fail on global context
A CNN with kernel size 3 needs many layers to build a large receptive field. In practice, the effective receptive field is much smaller than the theoretical one. This limits the ability to reason about distant image regions, which matters for tasks like understanding scene layout or counting objects spread across the image.
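The theoretical receptive field of stacked 3x3 convolutions grows by only 2 pixels per layer (stride 1, no dilation), which makes the depth requirement concrete:

```python
def receptive_field(num_layers, k=3):
    """Theoretical receptive field of num_layers stacked k x k convs,
    stride 1: each layer adds k - 1 pixels of context."""
    rf = 1
    for _ in range(num_layers):
        rf += k - 1
    return rf

print(receptive_field(10))   # 21: ten layers see only a 21-pixel extent
print(receptive_field(112))  # 225: covering a 224-pixel image takes 112 layers
```

In practice strided or pooled stages grow the receptive field faster, but the effective receptive field still lags well behind even this theoretical bound.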
ViT fails on small datasets and high-resolution inputs
Without large-scale pretraining, ViT underperforms CNNs. The quadratic cost in patch count also makes ViT expensive at high resolution. A 1024x1024 image with 16x16 patches gives 4096 tokens, and the attention matrix is $4096 \times 4096$, requiring roughly 16.8M entries per head per layer.
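The attention-map size grows with the fourth power of image side length (quadratic in token count, which is itself quadratic in resolution). A quick check of the numbers above:

```python
side = 1024
patch = 16
n_tokens = (side // patch) ** 2        # 64 x 64 patch grid
attn_entries = n_tokens ** 2           # one attention map, one head, one layer
print(n_tokens, attn_entries)          # 4096 16777216

# Stored in float32, a single head's attention map at one layer:
print(attn_entries * 4 / 2**20, "MiB")  # 64.0 MiB
```

Multiplied across heads and layers, this is why high-resolution ViT inference quickly becomes memory-bound.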
Swin fails on simplicity
Swin adds complexity: window partitioning, shifted windows, relative position biases, and a multi-stage hierarchical design. This makes implementation, debugging, and modification harder than a plain CNN or vanilla ViT. The engineering overhead is non-trivial for custom applications.
Key Properties Compared
| Property | CNN | ViT | Swin |
|---|---|---|---|
| Receptive field | Local, grows with depth | Global from layer 1 | Local per window, global via shifting |
| Translation | Equivariant by construction | Learned from data | Locally equivariant |
| Complexity in spatial size | Linear | Quadratic | Linear |
| Multi-scale features | Natural (pooling/stride) | Single scale | Natural (hierarchical stages) |
| Data requirement | Low (thousands) | High (millions) | Medium |
The Inductive Bias Tradeoff
Inductive Bias vs. Data Requirement
Statement
Stronger inductive biases (locality, translation equivariance) reduce the data needed to reach a given performance level but impose a ceiling on asymptotic performance. Weaker inductive biases (raw attention) require more data but achieve higher asymptotic performance given sufficient data.
Concretely, on ImageNet-1K (1.28M images):
- ResNet-50 (CNN): ~76-78% top-1 without tricks
- ViT-B/16 (from scratch): ~74-76% top-1, worse than CNN
- ViT-B/16 (pretrained on ImageNet-21K): ~84% top-1, surpasses CNN
- Swin-B: ~83-84% top-1 with ImageNet-1K only
Intuition
Inductive biases are correct priors. When you have little data, correct priors help enormously. When you have massive data, the data itself teaches the model the right priors, and hard-coded ones can become constraints that prevent learning better representations.
Failure Mode
This observation is specific to standard supervised training. With modern self-supervised pretraining (MAE, DINOv2), ViT can be pretrained on unlabeled data and then fine-tuned with limited labels, partially breaking the tradeoff. The data-efficiency gap has narrowed with better pretraining methods.
When a Researcher Would Use Each
Mobile deployment with limited training data
Use a CNN (MobileNet, EfficientNet). CNNs are well-optimized for mobile hardware, and their inductive biases allow strong performance with limited data. ViT and Swin are too large and data-hungry for this setting.
Large-scale pretraining for a vision foundation model
Use ViT. With billions of images (or self-supervised pretraining on large unlabeled datasets), ViT's lack of restrictive inductive biases allows it to learn the most flexible representations. Most vision foundation models (DINOv2, EVA, SigLIP) use ViT backbones.
Object detection or instance segmentation
Use Swin. Dense prediction tasks need multi-scale features, which Swin provides natively. The linear complexity in spatial size allows processing high-resolution images. Swin-based detectors consistently outperform ViT-based ones on COCO without requiring special adaptations.
Common Confusions
ViT does not lack all inductive bias
ViT still has inductive biases: patch-based tokenization assumes local pixel grouping matters, positional embeddings encode spatial structure, and the architecture assumes a fixed sequence of patches. It has weaker spatial bias than CNNs, but it is not bias-free.
Swin is not just ViT with smaller attention
Swin's shifted window mechanism creates cross-window connections that vanilla local attention does not have. The shifting pattern ensures that every patch can attend to patches in adjacent windows after two consecutive layers. This is a deliberate design for hierarchical feature extraction, not just a computational shortcut.
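The shift itself is just a cyclic roll of the patch grid by half a window before partitioning. Here is a sketch of the core idea on a toy grid of token ids; it omits the attention masks the real implementation applies to the wrapped-around regions.

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W) grid of token ids into non-overlapping M x M windows."""
    H, W = x.shape
    return x.reshape(H // M, M, W // M, M).transpose(0, 2, 1, 3).reshape(-1, M, M)

M = 2
ids = np.arange(16).reshape(4, 4)  # token ids on a toy 4x4 patch grid

regular = window_partition(ids, M)
shifted = window_partition(np.roll(ids, shift=(-M // 2, -M // 2), axis=(0, 1)), M)

# Tokens 5 and 10 share no regular window, but do share a shifted one,
# so information crosses window boundaries after two consecutive layers.
print(any({5, 10} <= set(w.ravel()) for w in regular))  # False
print(any({5, 10} <= set(w.ravel()) for w in shifted))  # True
```

Alternating regular and shifted partitions is what lets every patch eventually attend across the whole feature map while each individual layer stays linear in cost.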
References
Canonical:
- LeCun et al., Gradient-Based Learning Applied to Document Recognition (1998)
- Dosovitskiy et al., An Image is Worth 16x16 Words (ICLR 2021)
- Liu et al., Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV 2021)
Current:
- Oquab et al., DINOv2: Learning Robust Visual Features without Supervision (2023)
- Liu et al., Swin Transformer V2: Scaling Up Capacity and Resolution (CVPR 2022)