Diffusion models & image generation
Before we begin
Diffusion models generate images by learning to remove noise step by step — the backbone of Stable Diffusion, DALL·E 3, and Midjourney-class systems.
Start from random noise → gradually denoise → coherent image matching your prompt.
What you will learn
- Explain the forward / reverse diffusion intuition.
- Map the Stable Diffusion pipeline (VAE, U-Net, text encoder).
- Know ControlNet and conditioning basics.
- Set expectations for API vs self-host image generation.
Before this lesson
- Lesson 1 — CLIP & multimodal
- Module 5 — U-Net (encoder–decoder shapes)
Forward process (training)
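Gradually add Gaussian noise to an image over T steps until only pure noise remains.
The model is trained to predict the noise that was added (or, in some parameterizations, the clean image) at a randomly chosen step. The added noise is itself the training target, so no manual labels are needed; the captions in captioned image datasets supply the text conditioning.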
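The forward process has a closed form, so a training example at any step t can be produced in one shot rather than by looping. A minimal sketch, assuming a linear beta schedule with commonly used values (the schedule, the helper names `make_alpha_bar` and `add_noise`, and the toy 8×8 "image" are illustrative assumptions, not part of this lesson's codebase):

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction alpha_bar[t] for a linear beta schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Jump straight to step t: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps  # eps is the regression target for the denoiser

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))   # toy "image" scaled to [-1, 1]
x_early, eps_early = add_noise(x0, 10, alpha_bar, rng)    # still mostly image
x_late, eps_late = add_noise(x0, 999, alpha_bar, rng)     # almost pure noise
```

Early steps keep most of the image because `alpha_bar[t]` is close to 1; by the last step it is close to 0, so `x_late` is nearly indistinguishable from Gaussian noise.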
Reverse process (generation)
- Sample a random noise latent.
- Condition on the text embedding from a text encoder (often CLIP or T5).
- The U-Net predicts the noise to remove and a scheduler applies the update; repeat for 20–50 steps.
- The VAE decoder maps the final latent → RGB image.
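The loop in the steps above can be sketched with a deterministic DDIM-style update. This is a toy under stated assumptions: `ddim_sample` and `oracle` are hypothetical names, the "U-Net" is replaced by an oracle that already knows the clean latent (so the loop can be checked exactly), and text conditioning and the VAE decode are left out.

```python
import numpy as np

def ddim_sample(predict_noise, x_T, alpha_bar, num_steps=20):
    """Deterministic DDIM-style reverse loop over a subset of the training steps."""
    timesteps = np.linspace(len(alpha_bar) - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(timesteps):
        eps = predict_noise(x, t)  # the U-Net's job in a real model
        ab_t = alpha_bar[t]
        # alpha_bar of the next, less noisy step; 1.0 means "fully clean"
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x0_pred = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
        x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps
    return x

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed linear schedule
target = rng.uniform(-1.0, 1.0, size=(4, 8, 8))              # pretend clean latent

def oracle(x_t, t):
    # Toy stand-in for a trained denoiser: it knows the target,
    # so it can return the exact noise contained in x_t.
    return (x_t - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])

x_T = rng.standard_normal(target.shape)
result = ddim_sample(oracle, x_T, alpha_bar)
```

With a perfect noise predictor the loop lands on `target`; a trained U-Net only approximates that prediction, which is why step count and scheduler choice affect quality in practice.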
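Classifier-free guidance: at each step, run the denoiser once with the prompt and once without it, then move the prediction away from the unconditional result toward the conditional one by a guidance scale. Higher scales follow the prompt more closely, at the cost of diversity.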
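The guidance step itself is a one-line combination of two noise predictions. A minimal sketch (the function name `guided_noise` and the two-element arrays are illustrative; in a real pipeline both inputs come from the U-Net, once with the prompt embedding and once with an empty prompt):

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])   # prediction without the prompt
eps_cond = np.array([1.0, 1.0])     # prediction with the prompt

no_guidance = guided_noise(eps_uncond, eps_cond, 0.0)   # same as eps_uncond
plain_cond = guided_noise(eps_uncond, eps_cond, 1.0)    # same as eps_cond
strong = guided_noise(eps_uncond, eps_cond, 7.5)        # pushed past eps_cond
```

A scale of 1 reproduces the conditional prediction, and values above 1 exaggerate whatever the prompt changed. Values around 7.5 are a common default in Stable Diffusion tooling, though the best setting depends on the model and prompt.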
Stable Diffusion components
| Part | Role |
|---|---|
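| VAE | Encoder compresses the image into a smaller latent (in Stable Diffusion v1, 512×512×3 pixels → 64×64×4) so denoising is faster; decoder maps the final latent back to RGB |
| U-Net | Denoiser in latent space, from the same family as segmentation U-Nets |
| Text encoder | Prompt → sequence of token embeddings |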
Module 5 taught U-Net for segmentation; here U-Net predicts noise, not class masks.
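To see why working in latent space is cheaper, it helps to compute the tensor sizes. A small sketch, assuming the Stable Diffusion v1 settings of an 8× spatial downsample and 4 latent channels (other model versions may differ; `latent_shape` and `compression_ratio` are illustrative helper names):

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent tensor shape (C, H, W), assuming SD v1-style VAE settings."""
    if height % downsample or width % downsample:
        raise ValueError("height and width must be multiples of the downsample factor")
    return (latent_channels, height // downsample, width // downsample)

def compression_ratio(height, width, downsample=8, latent_channels=4):
    """How many RGB values the VAE packs into each latent value."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (3 * height * width) / (c * h * w)

print(latent_shape(512, 512))        # (4, 64, 64)
print(compression_ratio(512, 512))   # 48.0
```

Under these assumptions the U-Net processes 48 times fewer values per step than it would on raw pixels.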
ControlNet & conditioning
ControlNet adds a trainable branch alongside the frozen U-Net that takes an extra structural input (an edge map, depth map, or pose skeleton), so generation follows that layout.
Other conditioning: inpainting masks, image-to-image strength, IP-Adapter for style reference.
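Image-to-image strength is easiest to read as "how much of the denoising schedule to run". A sketch of one common convention, in which the input image is noised part of the way and only the remaining steps are denoised (the function name `img2img_schedule` is hypothetical, and specific libraries may round or clamp differently):

```python
def img2img_schedule(num_inference_steps, strength):
    """Return (start_index, steps_to_run) for a given img2img strength in [0, 1]."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    start_index = num_inference_steps - steps_to_run
    return start_index, steps_to_run

print(img2img_schedule(50, 0.5))   # (25, 25): skip the first half, denoise the rest
print(img2img_schedule(50, 1.0))   # (0, 50): full schedule, input image mostly ignored
print(img2img_schedule(50, 0.0))   # (50, 0): nothing to denoise, input returned as-is
```

Low strength keeps the input's composition and changes details; high strength behaves more like plain text-to-image.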
Production considerations
| Topic | Note |
|---|---|
| Latency | Each image needs many denoising steps; distilled models or fewer steps reduce latency, possibly at some cost in quality |
| Safety | NSFW filters; policies on depicting real people such as celebrities |
| Copyright | Training-data disputes; some enterprise APIs offer indemnity tiers |
| Cost | GPU seconds per image; an API is often cheaper than self-hosting until volume is high |
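The cost row comes down to simple arithmetic: a self-hosted GPU is billed whether or not it is busy, so the per-image cost depends heavily on utilization. A back-of-the-envelope sketch in which every number is a hypothetical placeholder, not a real price or benchmark:

```python
def self_host_cost_per_image(gpu_seconds_per_image, gpu_hourly_usd, utilization=1.0):
    """Per-image cost on a rented GPU; utilization < 1 charges idle time to each image."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return gpu_seconds_per_image * gpu_hourly_usd / 3600.0 / utilization

# Hypothetical placeholders: 3 GPU-seconds per image, $2 per GPU-hour, $0.04 per API image.
api_price = 0.04
cost_busy = self_host_cost_per_image(3.0, 2.0, utilization=0.9)    # high volume
cost_idle = self_host_cost_per_image(3.0, 2.0, utilization=0.02)   # low volume

print(f"busy GPU: ${cost_busy:.4f} per image")
print(f"idle GPU: ${cost_idle:.4f} per image")
```

With these placeholder inputs the busy GPU undercuts the API price while the mostly idle one costs more than the API, which is the pattern the table describes. Engineering and operations time are not included here and would push the break-even point higher.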
Connect to capstone
Module 10 capstone can combine RAG + agents (required) with optional image generation for marketing or diagram drafts — only if evals cover quality and safety.
Module 9 complete
Continue to Module 10 — Production & scaling for the course capstone.