CLIP & native multimodal models

Before we begin

CLIP (Contrastive Language–Image Pre-training) taught models that images and captions should live in the same vector space. That idea powers search, zero-shot classification, and today’s native multimodal chat models.

Multimodal = one system reasons over more than one modality (text + image + audio/video).

What you will learn

Explain contrastive image–text training.
Use CLIP-style models for zero-shot tasks.
Describe vision encoders inside GPT-4o / Gemini-style APIs.
Preview video as frame sequences or native video tokens.

Before this lesson

CLIP training (intuition)

Pairs of (image, caption) from the web:

Image encoder (ViT or ResNet) → vector v
Text encoder (transformer) → vector t
Loss: matching pairs should have high cosine similarity; non-matching pairs low.

No class labels — learning is contrastive across millions of pairs.

Result: you can classify "a photo of a dog" vs "a photo of a cat" by comparing text embedding to image embedding — zero-shot.

What CLIP enables

Use	How
Semantic image search	Embed query text + gallery images; nearest neighbors
Moderation / routing	Compare image to policy phrases
RAG over images	Store image embeddings + captions in vector DB

Native multimodality (GPT-4o-style)

Newer models fuse vision inside one stack:

Image → vision encoder → patch tokens
Patch tokens interleave with text tokens in the transformer
Model generates text (and sometimes images/audio) in one session

vs CLIP-only pipeline: native models can reason about fine details ("count the red buttons") without a separate captioning step.

API pattern: send image_url or base64 in chat message content array alongside text.

Video

Common approaches:

Approach	Trade-off
Sample frames	Cheap; may miss motion
Video encoder	Native in some APIs; higher cost
CLIP per frame + aggregate	Baseline for search

For robotics and CV depth, see the Computer Vision Foundations track on this site.

Training data notes

Multimodal models inherit bias and copyright questions from web-scale image–text pairs. Production apps add filters, watermark detection, and usage policies.

What's next

Lesson 2 — Diffusion & image generation