Vision Transformer Lineage: ViT, DeiT, Swin, MAE, DINOv2, SAM

Question 1 of 4
120 s · intermediate (5/10) · compute
For a Vision Transformer with $N$ patches and embedding dimension $d$, what is the per-layer self-attention cost, and how does it scale with image side length $H$ at fixed patch size $P$?
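
One way to explore the question numerically is the rough sketch below. It assumes the symbol names N (number of patches), d (embedding dimension), H (image side length) and P (patch size), a square image with H divisible by P, and no class token. The function names `num_patches` and `attention_macs` are hypothetical helpers, not part of any library, and the counts are approximate multiply-accumulates that ignore softmax, biases and normalization.

```python
def num_patches(side: int, patch: int) -> int:
    """Patch count N = (H / P)^2 for a square image (assumes H divisible by P)."""
    if side % patch != 0:
        raise ValueError("side must be divisible by patch")
    return (side // patch) ** 2


def attention_macs(n: int, d: int) -> dict:
    """Approximate multiply-accumulates for one self-attention layer.

    projections: Q, K, V and output projections, each about n * d * d.
    mixing: the score matrix Q K^T plus the weighted sum A V, each about
            n * n * d (summed over heads, so the head count cancels).
    """
    projections = 4 * n * d * d
    mixing = 2 * n * n * d
    return {"projections": projections, "mixing": mixing,
            "total": projections + mixing}


if __name__ == "__main__":
    d, patch = 768, 16  # illustrative values, chosen for this sketch
    for side in (224, 448, 896):
        n = num_patches(side, patch)
        print(side, n, attention_macs(n, d))
```

Under these assumptions, doubling the side length from 224 to 448 at P = 16 takes N from 196 to 784, so the token-mixing term (quadratic in N) grows by a factor of 16 while the projection term (linear in N) grows by a factor of 4.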