Welcome to Module 9 — multimodal & image models
Before we begin
Text-only LLMs are one slice of modern AI. Multimodal models read images and video; diffusion models generate pixels from text.
This module maps to Week 9 of the cohort syllabus: CLIP, native multimodality, and diffusion.
What Module 9 covers
| Topic | What you will understand |
|---|---|
| CLIP | Joint image–text training and zero-shot search |
| Native multimodality | Vision encoders inside GPT-4o-style models |
| Diffusion | How Stable Diffusion and similar models generate images |
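As a preview of the zero-shot search idea in the CLIP row, here is a minimal sketch. It assumes an image encoder and a text encoder have already mapped an image and some candidate captions into the same vector space; the 3-dimensional vectors and the helper names `cosine` and `rank_captions` are made up for illustration and do not come from any real model or library. Real CLIP embeddings have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: stand-ins for what an image encoder and a
# text encoder would output. These numbers are invented for the example.
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.2],
    "a photo of a car": [0.1, 0.2, 0.9],
}

def rank_captions(image_vec, captions):
    """Return (caption, score) pairs sorted from most to least similar."""
    scores = {text: cosine(image_vec, vec) for text, vec in captions.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_captions(image_embedding, caption_embeddings)
best_caption = ranking[0][0]

for text, score in ranking:
    print(f"{score:.3f}  {text}")
```

The caption whose vector points in the direction closest to the image vector ranks first. No classifier is trained on these particular labels, which is what "zero-shot" refers to here.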
Before you start
Helpful background:
- Module 4 — CNNs
- Module 5 — Segmentation (dense prediction intuition)
- Module 7 — LLM basics
There is no new project in this module. The skills you learn here feed into your Module 10 capstone if you add image features.