What Each Does
All three architectures (CNN, ViT, and Swin Transformer) process images to extract features for classification, detection, or segmentation. They differ in how they aggregate spatial information.
CNNs apply learned local filters (kernels) in a sliding-window fashion. Each layer has a fixed receptive field that grows with depth.
ViT splits an image into non-overlapping patches, projects them to token embeddings, and applies standard transformer self-attention across all patches.
Swin Transformer partitions the image into local windows, applies self-attention within each window, and shifts the window partition between layers to enable cross-window information flow.
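As a concrete illustration of the ViT tokenization step, here is a minimal NumPy sketch of patch extraction and linear embedding. The image size, patch size, and embedding dimension are illustrative, and the random projection matrix stands in for a learned embedding layer.

```python
import numpy as np

# Illustrative sizes (not tied to any specific model config)
H = W = 224          # image height/width
P = 16               # patch size
C = 3                # input channels
d = 768              # embedding dimension

rng = np.random.default_rng(0)
image = rng.standard_normal((H, W, C))

# Split into non-overlapping P x P patches and flatten each one
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)      # (N, P*P*C)

# Linear projection to token embeddings (weights would be learned)
W_embed = rng.standard_normal((P * P * C, d)) * 0.02
tokens = patches @ W_embed                    # (N, d)

print(tokens.shape)  # (196, 768): a 14 x 14 grid of patch tokens
```

From here, standard transformer self-attention operates on the `(N, d)` token matrix with no further notion of 2D structure beyond the positional embeddings.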
Side-by-Side Computation
CNN Convolution Layer
For a kernel of size $k \times k$ on a feature map of spatial size $H \times W$ with $C_{\text{in}}$ input channels and $C_{\text{out}}$ output channels:

$$\text{FLOPs} \approx H \cdot W \cdot k^2 \cdot C_{\text{in}} \cdot C_{\text{out}}$$

The cost is linear in spatial size $H \cdot W$.
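The linear scaling can be checked directly; the layer sizes below are illustrative, not taken from a specific network.

```python
def conv_flops(H, W, k, c_in, c_out):
    """Approximate multiply-accumulates for one k x k convolution layer,
    stride 1, same padding: H * W output positions, each costing
    k * k * c_in multiplies per output channel."""
    return H * W * k * k * c_in * c_out

# Doubling spatial resolution quadruples the cost: linear in H * W
base = conv_flops(56, 56, 3, 64, 64)
big = conv_flops(112, 112, 3, 64, 64)
print(big / base)  # 4.0
```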
ViT Self-Attention
For $N$ patches (patch size $P \times P$, so $N = HW/P^2$) with embedding dimension $d$:

$$\text{FLOPs} \approx 4 N d^2 + 2 N^2 d$$

The $2N^2 d$ term for the attention maps makes the cost quadratic in the number of patches.
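A small sketch of how the quadratic term comes to dominate as the token count grows (the embedding dimension 768 is illustrative):

```python
def vit_attention_flops(N, d):
    """Approximate FLOPs for one self-attention layer over N tokens:
    4*N*d^2 for the Q/K/V/output projections,
    2*N^2*d for the attention matrix and the weighted sum."""
    return 4 * N * d * d + 2 * N * N * d

# Fraction of cost spent on the quadratic (attention-map) term
d = 768
for N in (196, 1024, 4096):
    quad = 2 * N * N * d
    share = quad / vit_attention_flops(N, d)
    print(N, round(share, 2))
```

At 196 tokens (a 224x224 image with 16x16 patches) the projections dominate; by a few thousand tokens the attention maps do.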
Swin Window Attention
For windows of $M \times M$ patches on a feature map with $N$ total patches:

$$\text{FLOPs} \approx 4 N d^2 + 2 M^2 N d$$

The cost is linear in $N$ because attention is computed within fixed-size $M \times M$ windows.
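A quick numeric comparison of the global and windowed attention costs; the window size M = 7 matches common Swin configs, and the embedding dimension is illustrative.

```python
def global_attention_flops(N, d):
    # ViT-style global attention: quadratic in token count N
    return 4 * N * d * d + 2 * N * N * d

def window_attention_flops(N, d, M=7):
    # Swin-style attention restricted to M x M windows: linear in N
    return 4 * N * d * d + 2 * (M * M) * N * d

d = 96
for N in (3136, 12544):  # 56x56 and 112x112 patch grids
    g = global_attention_flops(N, d)
    w = window_attention_flops(N, d)
    print(N, round(g / w, 1))
```

The gap widens as the patch grid grows: the global cost keeps accelerating while the windowed cost scales in lockstep with $N$.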
Where Each Is Stronger
CNNs win on data efficiency
Convolutional inductive biases (locality, translation equivariance, weight sharing) encode strong priors about images. A CNN can learn useful representations from thousands of images. ViT, lacking these priors, needs millions of images (or ImageNet-21K scale pretraining) to match CNN performance. This makes CNNs the better choice when labeled data is scarce.
ViT wins on large-scale performance
With sufficient data and compute, ViT surpasses CNNs. The global self-attention mechanism can capture long-range dependencies from the first layer, which CNNs can only achieve through many stacked layers. On ImageNet-21K and JFT-300M pretraining, ViT-Large outperforms the best CNNs.
Swin wins on dense prediction tasks
Object detection and segmentation require multi-scale features. Swin's hierarchical structure naturally produces feature maps at multiple resolutions (like a CNN's feature pyramid) while retaining transformer-level representational power. Swin achieved state-of-the-art results on COCO detection and ADE20K segmentation, outperforming both pure CNNs and vanilla (unmodified) ViT on these tasks.
Where Each Fails
CNNs fail on global context
A CNN with kernel size 3 needs many layers to build a large receptive field. In practice, the effective receptive field is much smaller than the theoretical one. This limits the ability to reason about distant image regions, which matters for tasks like understanding scene layout or counting objects spread across the image.
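The theoretical receptive field of stacked 3x3 convolutions grows by only 2 pixels per layer (stride 1, no dilation), which makes the depth requirement concrete:

```python
def receptive_field(num_layers, k=3):
    """Theoretical receptive field of num_layers stacked k x k convs,
    stride 1: each layer adds k - 1 pixels of context."""
    rf = 1
    for _ in range(num_layers):
        rf += k - 1
    return rf

print(receptive_field(10))   # 21: ten layers see only a 21-pixel extent
print(receptive_field(112))  # 225: covering a 224-pixel image takes 112 layers
```

In practice strided or pooled stages grow the receptive field faster, but the effective receptive field still lags well behind even this theoretical bound.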
ViT fails on small datasets and high-resolution inputs
Without large-scale pretraining, ViT underperforms CNNs. The quadratic cost in patch count also makes ViT expensive at high resolution. A 1024x1024 image with 16x16 patches gives 4096 tokens, and the attention matrix is $4096 \times 4096$, requiring roughly 16.8M entries per head per layer.
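The attention-map size grows with the fourth power of image side length (quadratic in token count, which is itself quadratic in resolution). A quick check of the numbers above:

```python
side = 1024
patch = 16
n_tokens = (side // patch) ** 2        # 64 x 64 patch grid
attn_entries = n_tokens ** 2           # one attention map, one head, one layer
print(n_tokens, attn_entries)          # 4096 16777216

# Stored in float32, a single head's attention map at one layer:
print(attn_entries * 4 / 2**20, "MiB")  # 64.0 MiB
```

Multiplied across heads and layers, this is why high-resolution ViT inference quickly becomes memory-bound.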
Swin fails on simplicity
Swin adds complexity: window partitioning, shifted windows, relative position biases, and a multi-stage hierarchical design. This makes implementation, debugging, and modification harder than a plain CNN or vanilla ViT. The engineering overhead is non-trivial for custom applications.
Key Properties Compared
| Property | CNN | ViT | Swin |
|---|---|---|---|
| Receptive field | Local, grows with depth | Global from layer 1 | Local per window, global via shifting |
| Translation | Equivariant by construction | Learned from data | Locally equivariant |
| Complexity in spatial size | Linear | Quadratic | Linear |
| Multi-scale features | Natural (pooling/stride) | Single scale | Natural (hierarchical stages) |
| Data requirement | Low (thousands) | High (millions) | Medium |
The Inductive Bias Tradeoff
Inductive Bias vs. Data Requirement
Statement
Stronger inductive biases (locality, translation equivariance) reduce the data needed to reach a given performance level but impose a ceiling on asymptotic performance. Weaker inductive biases (raw attention) require more data but achieve higher asymptotic performance given sufficient data.
Concretely, on ImageNet-1K (1.28M images):
- ResNet-50 (CNN): ~76-78% top-1 without tricks
- ViT-B/16 (from scratch): ~74-76% top-1, worse than CNN
- ViT-B/16 (pretrained on ImageNet-21K): ~84% top-1, surpasses CNN
- Swin-B: ~83-84% top-1 with ImageNet-1K only
Intuition
Inductive biases are correct priors. When you have little data, correct priors help enormously. When you have massive data, the data itself teaches the model the right priors, and hard-coded ones can become constraints that prevent learning better representations.
Failure Mode
This observation is specific to standard supervised training. With modern self-supervised pretraining (MAE, DINOv2), ViT can be pretrained on unlabeled data and then fine-tuned with limited labels, partially breaking the tradeoff. The data-efficiency gap has narrowed with better pretraining methods.
When a Researcher Would Use Each
Mobile deployment with limited training data
Use a CNN (MobileNet, EfficientNet). CNNs are well-optimized for mobile hardware, and their inductive biases allow strong performance with limited data. ViT and Swin are too large and data-hungry for this setting.
Large-scale pretraining for a vision foundation model
Use ViT. With billions of images (or self-supervised pretraining on large unlabeled datasets), ViT's lack of restrictive inductive biases allows it to learn the most flexible representations. Most vision foundation models (DINOv2, EVA, SigLIP) use ViT backbones.
Object detection or instance segmentation
Use Swin. Dense prediction tasks need multi-scale features, which Swin provides natively. The linear complexity in spatial size allows processing high-resolution images. Swin-based detectors consistently outperform ViT-based ones on COCO without requiring special adaptations.
Common Confusions
ViT does not lack all inductive bias
ViT still has inductive biases: patch-based tokenization assumes local pixel grouping matters, positional embeddings encode spatial structure, and the architecture assumes a fixed sequence of patches. It has weaker spatial bias than CNNs, but it is not bias-free.
Swin is not just ViT with smaller attention
Swin's shifted window mechanism creates cross-window connections that vanilla local attention does not have. The shifting pattern ensures that every patch can attend to patches in adjacent windows after two consecutive layers. This is a deliberate design for hierarchical feature extraction, not just a computational shortcut.
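The shift itself is just a cyclic roll of the patch grid by half a window before partitioning. Here is a sketch of the core idea on a toy grid of token ids; it omits the attention masks the real implementation applies to the wrapped-around regions.

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W) grid of token ids into non-overlapping M x M windows."""
    H, W = x.shape
    return x.reshape(H // M, M, W // M, M).transpose(0, 2, 1, 3).reshape(-1, M, M)

M = 2
ids = np.arange(16).reshape(4, 4)  # token ids on a toy 4x4 patch grid

regular = window_partition(ids, M)
shifted = window_partition(np.roll(ids, shift=(-M // 2, -M // 2), axis=(0, 1)), M)

# Tokens 5 and 10 share no regular window, but do share a shifted one,
# so information crosses window boundaries after two consecutive layers.
print(any({5, 10} <= set(w.ravel()) for w in regular))  # False
print(any({5, 10} <= set(w.ravel()) for w in shifted))  # True
```

Alternating regular and shifted partitions is what lets every patch eventually attend across the whole feature map while each individual layer stays linear in cost.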
References
Canonical:
- LeCun et al., Gradient-Based Learning Applied to Document Recognition (1998)
- Dosovitskiy et al., An Image is Worth 16x16 Words (ICLR 2021)
- Liu et al., Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (ICCV 2021)
Current:
- Oquab et al., DINOv2: Learning Robust Visual Features without Supervision (2023)
- Liu et al., Swin Transformer V2: Scaling Up Capacity and Resolution (CVPR 2022)