
Comparison

CNN vs. ViT vs. Swin Transformer

CNNs bake in local inductive bias and translation equivariance. ViT applies global self-attention to image patches but needs large datasets. Swin Transformer uses hierarchical shifted windows to get the best of both: local efficiency with global reasoning.

What Each Does

All three process images to extract features for classification, detection, or segmentation. They differ in how they aggregate spatial information.

CNNs apply learned local filters (kernels) in a sliding-window fashion. Each layer has a fixed receptive field that grows with depth.
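The receptive-field growth can be sketched with a small helper. This is a minimal illustration, not from any library: for stride-1 convolutions, each $k \times k$ layer adds $k - 1$ pixels to the theoretical receptive field.

```python
def receptive_field(kernel_sizes):
    """Theoretical receptive field of a stack of stride-1 conv layers.

    Each k x k stride-1 layer expands the receptive field by (k - 1).
    """
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

# Ten stacked 3x3 layers still see only a 21x21 region:
print(receptive_field([3] * 10))  # 21
```

This is why deep stacks (or strided/pooling layers, which multiply the growth) are needed before a CNN can relate distant image regions.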

ViT splits an image into non-overlapping patches, projects them to token embeddings, and applies standard transformer self-attention across all patches.
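The patch-splitting step can be sketched in a few lines of NumPy. This is an illustrative reshape only (the learned linear projection to embeddings is omitted), assuming a channels-last image whose sides are divisible by the patch size:

```python
import numpy as np

def patchify(img, P):
    """Split an (H, W, C) image into non-overlapping P x P patch vectors."""
    H, W, C = img.shape
    x = img.reshape(H // P, P, W // P, P, C)
    # Group the two patch-grid axes together, then flatten each patch.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)

img = np.zeros((224, 224, 3))
tokens = patchify(img, 16)
print(tokens.shape)  # (196, 768): 14*14 patches, each 16*16*3 values
```

For ViT-B/16 at $224 \times 224$ this yields the familiar sequence of 196 tokens, which the model then linearly projects to the embedding dimension $d$.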

Swin Transformer partitions the image into local windows, applies self-attention within each window, and shifts the window partition between layers to enable cross-window information flow.
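The two mechanisms above can be sketched with NumPy: a reshape that groups patches into windows, and a cyclic shift (as in the Swin paper's efficient implementation) that displaces the grid by half a window so the next layer's windows straddle the previous boundaries. This is a shape-level sketch only; attention and masking are omitted, and divisibility of the grid by the window size is assumed.

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, C) patch grid into (num_windows, M*M, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, M * M, C)

def cyclic_shift(x, M):
    """Shift the grid by M//2 so shifted windows span old window boundaries."""
    return np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))

grid = np.zeros((56, 56, 96))          # stage-1 grid, Swin-T-like dimensions
windows = window_partition(grid, 7)
print(windows.shape)                    # (64, 49, 96): 8*8 windows of 7*7 patches
shifted = window_partition(cyclic_shift(grid, 7), 7)
```

Alternating regular and shifted partitions across consecutive layers is what lets information propagate between windows.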

Side-by-Side Computation

Definition

CNN Convolution Layer

For a kernel of size $k \times k$ on a feature map of spatial size $H \times W$ with $C_{\text{in}}$ input channels and $C_{\text{out}}$ output channels:

$$\text{FLOPs} = O(k^2 \cdot C_{\text{in}} \cdot C_{\text{out}} \cdot H \cdot W)$$

The cost is linear in the spatial size $H \times W$.

Definition

ViT Self-Attention

For $n = HW/P^2$ patches (patch size $P$) with embedding dimension $d$:

$$\text{FLOPs} = O(n^2 \cdot d + n \cdot d^2)$$

The $n^2 d$ term for the attention maps makes the cost quadratic in the number of patches.

Definition

Swin Window Attention

For windows of $M \times M$ patches on a feature map with $n$ total patches:

$$\text{FLOPs} = O(n \cdot M^2 \cdot d + n \cdot d^2)$$

The cost is linear in $n$ because attention is computed within fixed-size windows.
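The three cost formulas above can be compared back-of-envelope. This sketch just evaluates the asymptotic expressions (constant factors dropped), using illustrative values: ViT-B-like $d = 768$, Swin window $M = 7$, and a high-resolution input giving $n = 4096$ patches.

```python
def cnn_flops(k, c_in, c_out, H, W):
    """O(k^2 * C_in * C_out * H * W) for one conv layer."""
    return k**2 * c_in * c_out * H * W

def vit_attn_flops(n, d):
    """O(n^2 d + n d^2): quadratic in patch count."""
    return n**2 * d + n * d**2

def swin_attn_flops(n, M, d):
    """O(n M^2 d + n d^2): linear in patch count."""
    return n * M**2 * d + n * d**2

n, d, M = 4096, 768, 7   # e.g. a 1024x1024 image with 16x16 patches
print(vit_attn_flops(n, d) / swin_attn_flops(n, M, d))
```

At this resolution global attention costs several times more than windowed attention, and the gap widens as $n$ grows; at small $n$ (e.g. 196 patches) the $n d^2$ projection term dominates and the two are close.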

Where Each Is Stronger

CNNs win on data efficiency

Convolutional inductive biases (locality, translation equivariance, weight sharing) encode strong priors about images. A CNN can learn useful representations from thousands of images. ViT, lacking these priors, needs millions of images (or ImageNet-21K scale pretraining) to match CNN performance. This makes CNNs the better choice when labeled data is scarce.

ViT wins on large-scale performance

With sufficient data and compute, ViT surpasses CNNs. The global self-attention mechanism can capture long-range dependencies from the first layer, which CNNs can only achieve through many stacked layers. On ImageNet-21K and JFT-300M pretraining, ViT-Large outperforms the best CNNs.

Swin wins on dense prediction tasks

Object detection and segmentation require multi-scale features. Swin's hierarchical structure naturally produces feature maps at multiple resolutions (like a CNN's feature pyramid) while retaining transformer-level representational power. Swin achieved state-of-the-art results on COCO detection and ADE20K segmentation, outperforming both pure CNNs and vanilla (unmodified) ViT on these tasks.

Where Each Fails

CNNs fail on global context

A CNN with kernel size 3 needs many layers to build a large receptive field. In practice, the effective receptive field is much smaller than the theoretical one. This limits the ability to reason about distant image regions, which matters for tasks like understanding scene layout or counting objects spread across the image.

ViT fails on small datasets and high-resolution inputs

Without large-scale pretraining, ViT underperforms CNNs. The quadratic cost in patch count also makes ViT expensive at high resolution. A $1024 \times 1024$ image with $16 \times 16$ patches gives 4096 tokens, and the attention matrix is $4096 \times 4096$, roughly 16.8 million entries per head per layer.
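The token-count arithmetic above is worth checking explicitly:

```python
# Token count and attention-matrix size for a high-resolution ViT input.
H = W = 1024
P = 16
n = (H // P) * (W // P)
print(n)       # 4096 tokens
print(n * n)   # 16777216 attention entries per head per layer
```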

Swin fails on simplicity

Swin adds complexity: window partitioning, shifted windows, relative position biases, and a multi-stage hierarchical design. This makes implementation, debugging, and modification harder than a plain CNN or vanilla ViT. The engineering overhead is non-trivial for custom applications.

Key Properties Compared

| | CNN | ViT | Swin |
|---|---|---|---|
| Receptive field | Local, grows with depth | Global from layer 1 | Local per window, global via shifting |
| Translation | Equivariant by construction | Learned from data | Locally equivariant |
| Complexity in spatial size | Linear | Quadratic | Linear |
| Multi-scale features | Natural (pooling/stride) | Single scale | Natural (hierarchical stages) |
| Data requirement | Low (thousands) | High (millions) | Medium |

The Inductive Bias Tradeoff

Theorem

Inductive Bias vs. Data Requirement

Statement

Stronger inductive biases (locality, translation equivariance) reduce the data needed to reach a given performance level but impose a ceiling on asymptotic performance. Weaker inductive biases (raw attention) require more data but achieve higher asymptotic performance given sufficient data.

Concretely, on ImageNet-1K (1.28M images):

  • ResNet-50 (CNN): ~76-78% top-1 without tricks
  • ViT-B/16 (from scratch): ~74-76% top-1, worse than CNN
  • ViT-B/16 (pretrained on ImageNet-21K): ~84% top-1, surpasses CNN
  • Swin-B: ~83-84% top-1 with ImageNet-1K only

Intuition

Inductive biases are correct priors. When you have little data, correct priors help enormously. When you have massive data, the data itself teaches the model the right priors, and hard-coded ones can become constraints that prevent learning better representations.

Failure Mode

This observation is specific to standard supervised training. With modern self-supervised pretraining (MAE, DINOv2), ViT can be pretrained on unlabeled data and then fine-tuned with limited labels, partially breaking the tradeoff. The data-efficiency gap has narrowed with better pretraining methods.

When a Researcher Would Use Each

Example

Mobile deployment with limited training data

Use a CNN (MobileNet, EfficientNet). CNNs are well-optimized for mobile hardware, and their inductive biases allow strong performance with limited data. ViT and Swin are too large and data-hungry for this setting.

Example

Large-scale pretraining for a vision foundation model

Use ViT. With billions of images (or self-supervised pretraining on large unlabeled datasets), ViT's lack of restrictive inductive biases allows it to learn the most flexible representations. Most vision foundation models (DINOv2, EVA, SigLIP) use ViT backbones.

Example

Object detection or instance segmentation

Use Swin. Dense prediction tasks need multi-scale features, which Swin provides natively. The linear complexity in spatial size allows processing high-resolution images. Swin-based detectors consistently outperform ViT-based ones on COCO without requiring special adaptations.

Common Confusions

Watch Out

ViT does not lack all inductive bias

ViT still has inductive biases: patch-based tokenization assumes local pixel grouping matters, positional embeddings encode spatial structure, and the architecture assumes a fixed sequence of patches. It has weaker spatial bias than CNNs, but it is not bias-free.

Watch Out

Swin is not just ViT with smaller attention

Swin's shifted window mechanism creates cross-window connections that vanilla local attention does not have. The shifting pattern ensures that every patch can attend to patches in adjacent windows after two consecutive layers. This is a deliberate design for hierarchical feature extraction, not just a computational shortcut.
