Multimodal & Foundation Models
CLIP, vision-language models, and foundation models for vision.
CLIP (Contrastive Language-Image Pretraining)
CLIP learns a shared embedding space for images and text by training on 400 million image-text pairs with a contrastive objective, enabling zero-shot visual recognition without task-specific fine-tuning.
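The contrastive objective can be illustrated with a small sketch. It assumes the common symmetric cross-entropy formulation, in which image i and text i form the matching pair within a batch. The function names, the toy 2-D embeddings, and the fixed temperature of 0.07 are illustrative assumptions, not CLIP's actual training code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    temperature is a hypothetical fixed value chosen for illustration.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: two image-text pairs with made-up 2-D embeddings.
matched = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = clip_style_loss(matched, matched)        # correct pairing
loss_swapped = clip_style_loss(matched, matched[::-1])  # mismatched pairing
print(loss_aligned, loss_swapped)
```

The loss is near zero when each image is closest to its own caption and grows large when the pairs are mismatched, which is the signal that pulls matching pairs together in the shared space.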
DINOv2
DINOv2 is a family of self-supervised Vision Transformers trained by Meta with distillation at scale on 142 million curated images, producing visual features that match or surpass supervised pretraining across diverse downstream tasks without fine-tuning.
Grounding DINO
Grounding DINO combines the DINO detection Transformer with grounded language pretraining to perform open-set object detection, localizing objects in images from arbitrary text descriptions without being limited to predefined categories.
Image Captioning
Image captioning generates natural language descriptions of images. Approaches have evolved from CNN-LSTM encoder-decoder models that attend to visual regions to modern multimodal LLMs such as LLaVA and GPT-4V.
Open-Vocabulary Detection
Open-vocabulary detection extends object detection beyond fixed label sets by conditioning on arbitrary text queries, enabling detection of any object category described in natural language.
Text-to-Image Generation
Text-to-image generation synthesizes photorealistic or artistic images from natural language prompts, most commonly with diffusion models conditioned on text or vision-language embeddings; DALL-E, Stable Diffusion, and Midjourney are leading systems.
Vision Foundation Models
Vision foundation models are large-scale, general-purpose visual backbones – trained on broad data with self-supervised or language-supervised objectives – that transfer to a wide range of downstream tasks without task-specific architecture changes.
Visual Question Answering (VQA)
Visual question answering requires models to answer free-form natural language questions about images, demanding joint reasoning over visual content and linguistic structure.
Zero-Shot Classification
Zero-shot classification recognizes visual categories never seen during training by using natural language descriptions as class prototypes in a shared vision-language embedding space.
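As a rough sketch of the idea, the snippet below treats each class description as a prototype vector and assigns an image to the prototype with the highest cosine similarity. The embeddings are made-up 3-D vectors standing in for the output of real image and text encoders, and the function name and temperature value are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, class_prototypes, temperature=0.01):
    """Return the best-matching class description and a softmax over all classes.

    class_prototypes maps a text description to its (hypothetical) text embedding.
    """
    names = list(class_prototypes)
    scores = [cosine(image_emb, class_prototypes[n]) / temperature for n in names]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = {n: e / z for n, e in zip(names, exps)}
    return max(probs, key=probs.get), probs

# Hypothetical text embeddings for three class descriptions.
prototypes = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Hypothetical image embedding that lies closest to the "cat" prototype.
image_embedding = [0.8, 0.2, 0.1]

label, probs = zero_shot_classify(image_embedding, prototypes)
print(label)
```

Adding a new category only requires adding another description and its text embedding to the dictionary; no image of that category has to be seen during training.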