Multimodal & Foundation Models

CLIP, vision-language models, and foundation models for vision.

CLIP (Contrastive Language-Image Pretraining)

CLIP learns a shared embedding space for images and text by training on 400 million image-text pairs with a contrastive objective, enabling zero-shot visual recognition without task-specific fine-tuning.
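The contrastive objective can be illustrated with a small pure-Python sketch. It assumes a batch where image i and text i form a matched pair, and it uses a fixed temperature of 0.07 (CLIP learns this value during training; the fixed number here is a simplification). The function names and toy vectors are hypothetical and are not taken from the CLIP codebase.

```python
import math

def l2_normalize(v):
    # Assumes v is not the zero vector.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    image_embs[i] and text_embs[i] are assumed to be a matched pair.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: each image should select its own caption.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each caption should select its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy embeddings (made up): correctly paired vs. deliberately swapped captions.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each image is most similar to its own caption and large when the pairs are swapped, which is the pressure that pulls matched pairs together in the shared space.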

DINOv2

DINOv2 is a family of self-supervised Vision Transformers from Meta, trained with a self-distillation objective on 142 million curated images; the smaller models are distilled from the largest one. Its frozen features are reported to match or surpass supervised pretraining across diverse downstream tasks without fine-tuning.
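One simple way to use frozen features without fine-tuning is a k-nearest-neighbour classifier over the embeddings. The sketch below assumes features have already been extracted by a frozen backbone; the two-dimensional vectors, labels, and function names are hypothetical stand-ins, not real DINOv2 outputs or API calls.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def knn_predict(query, bank_features, bank_labels, k=3):
    """Label a query feature by majority vote among its k most similar bank features."""
    ranked = sorted(range(len(bank_features)),
                    key=lambda i: cosine(query, bank_features[i]),
                    reverse=True)
    votes = Counter(bank_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy stand-ins for features from a frozen backbone (made-up values).
bank_features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
bank_labels = ["cat", "cat", "dog", "dog"]
prediction = knn_predict([0.85, 0.15], bank_features, bank_labels, k=3)
```

Because the backbone is never updated, the only "training" is storing labelled features, which makes this a quick probe of how well the representation separates classes.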

Grounding DINO

Grounding DINO combines the DINO detection Transformer (a DETR-style detector, distinct from the self-supervised DINO and DINOv2 models) with grounded language pretraining to perform open-set object detection, localizing objects in images from arbitrary text descriptions rather than a predefined category list.

Image Captioning

Image captioning generates natural language descriptions of images using encoder-decoder architectures that attend to visual regions, evolving from CNN-LSTM models to modern multimodal LLMs like LLaVA and GPT-4V.

Open-Vocabulary Detection

Open-vocabulary detection extends object detection beyond fixed label sets by conditioning on arbitrary text queries, enabling detection of any object category described in natural language.
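A simplified view of text-conditioned detection is to score each candidate region against each text query and keep confident matches. The sketch below assumes region and query embeddings already live in a shared space; the boxes, vectors, threshold, and function names are hypothetical. Real detectors differ in where and how text is fused with image features, so this is only an illustration of the matching idea.

```python
import math

def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def open_vocab_detect(boxes, region_embs, queries, query_embs, threshold=0.5):
    """Keep each box whose best-matching text query scores at or above threshold."""
    detections = []
    for box, region in zip(boxes, region_embs):
        scores = [cosine(region, q) for q in query_embs]
        best = max(range(len(queries)), key=lambda j: scores[j])
        if scores[best] >= threshold:
            detections.append((box, queries[best], scores[best]))
    return detections

# Toy inputs (made up): three candidate boxes and two free-text queries.
boxes = [(10, 10, 50, 50), (60, 20, 120, 90), (0, 0, 5, 5)]
region_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
queries = ["a red umbrella", "a traffic cone"]
query_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
detections = open_vocab_detect(boxes, region_embs, queries, query_embs)
```

Changing the `queries` list changes what can be detected without retraining, which is the property that separates this setting from fixed-label detection.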

Text-to-Image Generation

Text-to-image generation synthesizes photorealistic or artistic images from natural language prompts, most commonly with diffusion models conditioned on text or vision-language embeddings. DALL-E (diffusion-based from DALL-E 2 onward; the original model was autoregressive), Stable Diffusion, and Midjourney are leading systems.

Vision Foundation Models

Vision foundation models are large-scale, general-purpose visual backbones – trained on broad data with self-supervised or language-supervised objectives – that transfer to a wide range of downstream tasks without task-specific architecture changes.

Visual Question Answering (VQA)

Visual question answering requires models to answer free-form natural language questions about images, demanding joint reasoning over visual content and linguistic structure.

Zero-Shot Classification

Zero-shot classification recognizes visual categories never seen during training by using natural language descriptions as class prototypes in a shared vision-language embedding space.
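The "class prototype" idea can be sketched as follows: embed one text prompt per class, compare the image embedding to each prompt, and take a softmax over the similarities. The embeddings, the temperature value, and the function names below are hypothetical; in practice both embeddings would come from a pretrained vision-language model.

```python
import math

def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def zero_shot_classify(image_emb, prompt_embs, class_names, temperature=0.01):
    """Return the best class name and a probability per class."""
    logits = [cosine(image_emb, p) / temperature for p in prompt_embs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(class_names)), key=lambda i: probs[i])
    return class_names[best], probs

# Toy embeddings (made up) for prompts such as "a photo of a dog".
class_names = ["dog", "cat", "car"]
prompt_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
label, probs = zero_shot_classify([0.2, 0.9, 0.1], prompt_embs, class_names)
```

Adding a new category only requires appending another prompt embedding, with no labelled images for that class.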