Vision Transformers
ViT, DeiT, and attention-based vision architectures.
Attention Mechanisms in Vision
Applying self-attention to images requires careful handling of 2D spatial structure, patch size tradeoffs, and the quadratic cost of attention over thousands of visual tokens – design choices that fundamentally shape every vision Transformer.
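To make the patch-size tradeoff concrete, the following minimal sketch counts patch tokens and the entries of one attention score matrix for a few image and patch sizes. The function names and the example sizes are illustrative assumptions, not values taken from any particular model.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image cut into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_pairs(n_tokens: int) -> int:
    """Entries in one attention score matrix (per head, per layer)."""
    return n_tokens * n_tokens


# Illustrative (image_size, patch_size) combinations.
for image_size, patch_size in [(224, 16), (224, 8), (1024, 16)]:
    n = num_tokens(image_size, patch_size)
    print(f"{image_size}px / {patch_size}px patches: "
          f"{n} tokens, {attention_pairs(n)} score entries")
```

Halving the patch size quadruples the token count and multiplies the score-matrix size by sixteen, which is why patch size is one of the most consequential choices.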
Data-Efficient Image Transformers (DeiT)
DeiT demonstrates that Vision Transformers can be trained competitively on ImageNet-1K alone – without hundreds of millions of private images – by using knowledge distillation from a CNN teacher and aggressive data augmentation.
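As a rough illustration of distillation from a teacher, the sketch below computes one possible form of the loss: hard-label distillation, where a separate student output is trained to predict the teacher's top class. The function names, the two student outputs, and the equal 0.5 weighting are assumptions made for this example; other distillation variants exist.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of a single example, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def hard_distillation_loss(class_logits, distill_logits, teacher_logits, label):
    """Average of the label loss and the loss against the teacher's top class.

    class_logits:   student output trained on the ground-truth label
    distill_logits: student output trained on the teacher's prediction
    teacher_logits: output of the (frozen) teacher, assumed here to be a CNN
    """
    teacher_label = max(range(len(teacher_logits)), key=teacher_logits.__getitem__)
    return (0.5 * cross_entropy(class_logits, label)
            + 0.5 * cross_entropy(distill_logits, teacher_label))


loss = hard_distillation_loss(
    class_logits=[2.0, 0.5, -1.0],
    distill_logits=[0.1, 1.5, 0.0],
    teacher_logits=[0.2, 3.0, -0.5],  # teacher's top class is index 1
    label=0,
)
print(f"loss = {loss:.4f}")
```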
DINO (Self-Distillation with No Labels)
DINO trains a Vision Transformer through self-distillation – a student network learns to match the output of a momentum-updated teacher network on different augmented views of the same image – producing features that exhibit emergent object segmentation without any labels.
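The momentum-updated teacher can be sketched as an exponential moving average of the student's parameters. This is a minimal illustration on flat lists of floats; the function name and the momentum value are assumptions for the example, and a real implementation would apply the same update to every parameter tensor.

```python
def ema_update(teacher_params, student_params, momentum):
    """Return new teacher parameters as an exponential moving average.

    The teacher receives no gradients; it drifts slowly toward the student.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must be in [0, 1]")
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher_params, student_params)
    ]


teacher = [1.0, 0.0]
student = [0.0, 1.0]
for step in range(3):
    # Illustrative momentum; in practice it is typically close to 1.
    teacher = ema_update(teacher, student, momentum=0.9)
    print(step, teacher)
```

A momentum close to 1 makes the teacher change slowly, so it acts as a smoothed version of recent student states.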
Hybrid CNN-Transformer Architectures
Hybrid models use CNN layers for early-stage local feature extraction and Transformer layers for later-stage global reasoning, combining the inductive biases of convolutions with the flexibility of self-attention.
Masked Image Modeling
Masked Image Modeling (MIM) pre-trains vision Transformers by masking a large portion of image patches and training the model to reconstruct the missing content – either as discrete visual tokens (BEiT) or raw pixels (MAE).
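The masking step can be sketched as a random partition of patch indices into visible and masked sets. The function name, the 196-patch grid, and the 0.75 mask ratio are illustrative assumptions; uniform random masking is only one strategy, and the reconstruction target (tokens or pixels) is not shown.

```python
import random


def random_mask(num_patches, mask_ratio, seed=None):
    """Split patch indices into (visible, masked) lists by uniform random sampling."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# A 14 x 14 grid of patches with a high mask ratio (illustrative values).
visible, masked = random_mask(num_patches=196, mask_ratio=0.75, seed=0)
print(len(visible), "visible patches,", len(masked), "masked patches")
```

The model would then see only the visible patches and be trained to predict the content at the masked positions.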
Swin Transformer
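The Swin Transformer computes self-attention within local windows and shifts those windows between successive layers so that information flows across window boundaries. Merging patches between stages yields hierarchical feature maps, and restricting attention to fixed-size windows gives computational complexity that is linear in image size.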
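The difference between quadratic and linear cost can be checked with a small count of attention score entries. This sketch compares global attention with attention in non-overlapping windows; the function names, the 56 x 56 token grid, and the window size of 7 are illustrative assumptions, and window shifting is not modeled because it does not change the count.

```python
def global_attention_pairs(height, width):
    """Score entries when every token attends to every other token."""
    n = height * width
    return n * n


def windowed_attention_pairs(height, width, window):
    """Score entries when attention is limited to non-overlapping square windows."""
    if height % window or width % window:
        raise ValueError("grid must be divisible by the window size")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window * tokens_per_window


for side in (56, 112):
    g = global_attention_pairs(side, side)
    w = windowed_attention_pairs(side, side, window=7)
    print(f"{side}x{side} tokens: global={g}, windowed={w}")
```

Doubling the side of the token grid multiplies the global count by sixteen but the windowed count by only four, in proportion to the number of tokens.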
Vision Transformer (ViT)
The Vision Transformer splits an image into fixed-size patches, treats each patch as a token, and processes the sequence with a standard Transformer encoder to perform image classification.
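The patch-splitting step can be sketched in plain Python on a nested-list image. The `patchify` name and the toy 4 x 4 image are assumptions for illustration; a practical implementation would use array reshapes, and the later steps (projecting each patch to an embedding, adding position information, running the encoder) are not shown.

```python
def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch_size x patch_size x C block into one vector.
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


# Toy 4 x 4 image with 3 channels; each pixel stores (row, column, 0).
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
tokens = patchify(image, patch_size=2)
print(len(tokens), "tokens of length", len(tokens[0]))
```

Each flattened patch plays the role of one token in the input sequence.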
Vision Transformer Scaling
Vision Transformers follow predictable scaling trends: performance improves roughly log-linearly with compute and data. They need substantially more training data than CNNs to reach their potential, but once that threshold is crossed, ViTs can overtake comparable convolutional models.