← Back to curriculum

Module 9 — Multimodal & image models

CLIP & native multimodal models

Contrastive image–text training, zero-shot classification, vision encoders in GPT-4o-style models, and video understanding basics.

~80 min read + exercises

CLIP & native multimodal models

Before we begin

CLIP (Contrastive Language–Image Pre-training) taught models that images and captions should live in the same vector space. That idea powers search, zero-shot classification, and today’s native multimodal chat models.

Multimodal = one system reasons over more than one modality (text + image + audio/video).


What you will learn

  • Explain contrastive image–text training.
  • Use CLIP-style models for zero-shot tasks.
  • Describe vision encoders inside GPT-4o / Gemini-style APIs.
  • Preview video as frame sequences or native video tokens.

Before this lesson


CLIP training (intuition)

Pairs of (image, caption) from the web:

  1. Image encoder (ViT or ResNet) → vector v
  2. Text encoder (transformer) → vector t
  3. Loss: matching pairs should have high cosine similarity; non-matching pairs low.

No class labels — learning is contrastive across millions of pairs.

Result: you can classify "a photo of a dog" vs "a photo of a cat" by comparing text embedding to image embedding — zero-shot.


What CLIP enables

UseHow
Semantic image searchEmbed query text + gallery images; nearest neighbors
Moderation / routingCompare image to policy phrases
RAG over imagesStore image embeddings + captions in vector DB

Native multimodality (GPT-4o-style)

Newer models fuse vision inside one stack:

  • Image → vision encoder → patch tokens
  • Patch tokens interleave with text tokens in the transformer
  • Model generates text (and sometimes images/audio) in one session

vs CLIP-only pipeline: native models can reason about fine details ("count the red buttons") without a separate captioning step.

API pattern: send image_url or base64 in chat message content array alongside text.


Video

Common approaches:

ApproachTrade-off
Sample framesCheap; may miss motion
Video encoderNative in some APIs; higher cost
CLIP per frame + aggregateBaseline for search

For robotics and CV depth, see the Computer Vision Foundations track on this site.


Training data notes

Multimodal models inherit bias and copyright questions from web-scale image–text pairs. Production apps add filters, watermark detection, and usage policies.


What's next

Lesson 2 — Diffusion & image generation