Module 9 — Multimodal & image models

Welcome to Module 9

Why multimodal models matter, how they connect to your LLM and CV foundations, and what this module covers.

~25 min read + exercises

Before we begin

Text-only LLMs are one slice of modern AI. Multimodal models read images and video; diffusion models generate pixels from text.

This module maps to Week 9 of the cohort syllabus: CLIP, native multimodality, and diffusion.


What Module 9 covers

| Topic | What you will understand |
| --- | --- |
| CLIP | Joint image–text training and zero-shot search |
| Native multimodality | Vision encoders inside GPT-4o-style models |
| Diffusion | How Stable Diffusion and similar models generate images |

Before you start

Helpful background:

There is no new project in this module. The skills you learn here feed into your Module 10 capstone if you add image features.


Ready?

Lesson 1 — CLIP & multimodal models