Welcome to Module 9 — multimodal & image models

Before we begin

Text-only LLMs are one slice of modern AI. Multimodal models read images and video; diffusion models generate pixels from text.

This module covers Week 9 of the cohort syllabus: CLIP, native multimodality, and diffusion.

Topic	What you will understand
CLIP	Joint image–text training and zero-shot search
Native multimodality	Vision encoders inside GPT-4o-style models
Diffusion	How Stable Diffusion and similar models generate images

Helpful background:

There is no new project in this module. The skills feed into your Module 10 capstone if you add image features.