Brain Drip
Module 11: 9 concepts

Multimodal & Foundation Models

CLIP, vision-language models, and foundation models for vision.

01

CLIP (Contrastive Language-Image Pretraining)

CLIP learns a shared embedding space for images and text by training on 400 million image-text pairs with a contrastive objective, enabling zero-shot visual recognition without task-specific fine-tuning.
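To make the contrastive objective concrete, here is a minimal sketch of a symmetric image-text contrastive loss over one batch, written in plain Python. The embeddings are toy vectors and the function names are illustrative assumptions, not CLIP's actual code; the temperature is fixed at 0.07 here, whereas CLIP treats it as a learned parameter.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over a list of scores.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # image_embs[i] and text_embs[i] are assumed to be a matching pair.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: row i should put its mass on column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should put its mass on row j.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch (hypothetical values): correctly paired vs. swapped captions.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is low when each image is most similar to its own caption and high when the pairs are mismatched, which is the signal that pulls matching pairs together in the shared space.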

02

DINOv2

DINOv2 is a family of self-supervised Vision Transformers from Meta, pretrained on 142 million curated images and distilled from a large model into smaller ones, whose features match or surpass supervised pretraining across diverse downstream tasks without fine-tuning.

03

Grounding DINO

Grounding DINO combines the DINO detection Transformer with grounded language pretraining to perform open-set object detection, localizing objects in images from arbitrary text descriptions without being limited to predefined categories.

04

Image Captioning

Image captioning generates natural language descriptions of images using encoder-decoder architectures that attend to visual regions, evolving from CNN-LSTM models to modern multimodal LLMs like LLaVA and GPT-4V.

05

Open-Vocabulary Detection

Open-vocabulary detection extends object detection beyond fixed label sets by conditioning on arbitrary text queries, enabling detection of any object category described in natural language.

06

Text-to-Image Generation

Text-to-image generation synthesizes photorealistic or artistic images from natural language prompts, typically using diffusion models guided by text or vision-language embeddings, with DALL-E, Stable Diffusion, and Midjourney as leading systems.

07

Vision Foundation Models

Vision foundation models are large-scale, general-purpose visual backbones, trained on broad data with self-supervised or language-supervised objectives, that transfer to a wide range of downstream tasks without task-specific architecture changes.

08

Visual Question Answering (VQA)

Visual question answering requires models to answer free-form natural language questions about images, demanding joint reasoning over visual content and linguistic structure.

09

Zero-Shot Classification

Zero-shot classification recognizes visual categories for which the model saw no labeled training examples by using natural language descriptions as class prototypes in a shared vision-language embedding space.
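The prototype idea can be sketched in a few lines of plain Python: embed one text prompt per class, compare the image embedding to each by cosine similarity, and apply a softmax. The vectors below are hand-made stand-ins for real encoder outputs, and the function name, prompts, and temperature value are assumptions made for illustration.

```python
import math

def l2_normalize(v):
    # Unit-normalize so the dot product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_classify(image_emb, class_text_embs, temperature=0.07):
    # class_text_embs maps a text prompt to its (assumed precomputed) embedding.
    img = l2_normalize(image_emb)
    scores = {}
    for prompt, emb in class_text_embs.items():
        proto = l2_normalize(emb)
        scores[prompt] = sum(a * b for a, b in zip(img, proto)) / temperature
    # Softmax over classes, shifted by the max score for numerical stability.
    m = max(scores.values())
    exps = {k: math.exp(s - m) for k, s in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

# Hypothetical text embeddings acting as class prototypes.
prototypes = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
probs = zero_shot_classify([0.8, 0.2, 0.1], prototypes)
best = max(probs, key=probs.get)
print(best)
```

Adding a new category only requires adding another prompt and its embedding to the dictionary; no classifier weights are retrained.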