CLIP & native multimodal models
Before we begin
CLIP (Contrastive Language–Image Pre-training) taught models that images and captions should live in the same vector space. That idea powers search, zero-shot classification, and today’s native multimodal chat models.
Multimodal = one system reasons over more than one modality (text + image + audio/video).
What you will learn
- Explain contrastive image–text training.
- Use CLIP-style models for zero-shot tasks.
- Describe vision encoders inside GPT-4o / Gemini-style APIs.
- Preview video as frame sequences or native video tokens.
Before this lesson
CLIP training (intuition)
Pairs of (image, caption) from the web:
- Image encoder (ViT or ResNet) → vector v
- Text encoder (transformer) → vector t
- Loss: matching pairs should have high cosine similarity; non-matching pairs low.
No class labels — learning is contrastive across millions of pairs.
Result: you can classify "a photo of a dog" vs "a photo of a cat" by comparing text embedding to image embedding — zero-shot.
What CLIP enables
| Use | How |
|---|---|
| Semantic image search | Embed query text + gallery images; nearest neighbors |
| Moderation / routing | Compare image to policy phrases |
| RAG over images | Store image embeddings + captions in vector DB |
Native multimodality (GPT-4o-style)
Newer models fuse vision inside one stack:
- Image → vision encoder → patch tokens
- Patch tokens interleave with text tokens in the transformer
- Model generates text (and sometimes images/audio) in one session
vs CLIP-only pipeline: native models can reason about fine details ("count the red buttons") without a separate captioning step.
API pattern: send image_url or base64 in chat message content array alongside text.
Video
Common approaches:
| Approach | Trade-off |
|---|---|
| Sample frames | Cheap; may miss motion |
| Video encoder | Native in some APIs; higher cost |
| CLIP per frame + aggregate | Baseline for search |
For robotics and CV depth, see the Computer Vision Foundations track on this site.
Training data notes
Multimodal models inherit bias and copyright questions from web-scale image–text pairs. Production apps add filters, watermark detection, and usage policies.