Diffusion models & image generation

Before we begin

Diffusion models generate images by learning to remove noise step by step — the backbone of Stable Diffusion, DALL·E 3, and Midjourney-class systems.

Start from random noise → gradually denoise → coherent image matching your prompt.

What you will learn

Explain the forward / reverse diffusion intuition.
Map the Stable Diffusion pipeline (VAE, U-Net, text encoder).
Know ControlNet and conditioning basics.
Set expectations for API vs self-host image generation.

Before this lesson

Lesson 1 — CLIP & multimodal
Module 5 — U-Net (encoder–decoder shapes)

Forward process (training)

Gradually add Gaussian noise to an image over T steps until only pure noise remains.

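The model learns to predict the noise that was added (or the clean image) at each step. The added noise is known during training, so it serves as the supervised target; text-to-image models train on captioned image datasets.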
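
The noising step above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training recipe: the linear schedule, its endpoints (1e-4 to 0.02), T = 1000, and all function names are assumptions chosen for the example, and the "image" is a toy list of four values.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, evenly spaced (assumed schedule)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def cumulative_alpha_bars(betas):
    """Running product of (1 - beta): share of the original signal left at each step."""
    bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        bars.append(prod)
    return bars

def add_noise(x0, t, alpha_bars, rng):
    """Jump directly to step t: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps  # eps is the regression target for a noise-predicting model

rng = random.Random(0)
alpha_bars = cumulative_alpha_bars(linear_beta_schedule(T=1000))
x0 = [0.5, -0.25, 1.0, 0.0]  # toy 4-value "image"
x_mid, eps_mid = add_noise(x0, t=500, alpha_bars=alpha_bars, rng=rng)
```

With these assumed values, almost all of the signal survives at step 0 and almost none at step 999, which is the "until only noise remains" behaviour described above.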

Reverse process (generation)

Sample a random noise latent.
Encode the prompt with a text encoder (often CLIP or T5) and condition on its embeddings.
The U-Net predicts a denoising update; repeat for roughly 20–50 steps.
The VAE decoder maps the final latent → RGB image.
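
The loop in the steps above can be sketched as a toy sampler. Everything here is an assumption made for illustration: the deterministic DDIM-style update, the linear schedule, and especially `oracle_denoiser`, a hypothetical stand-in that computes the exact noise from a known target instead of calling a trained U-Net. Text conditioning and the VAE decode are left out.

```python
import math
import random

def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for an assumed linear noise schedule."""
    bars, prod = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        prod *= 1.0 - beta
        bars.append(prod)
    return bars

def oracle_denoiser(x_t, a_t, target):
    """Stand-in for the U-Net: the exact noise that turns `target` into `x_t`.
    A real model only approximates this from data."""
    return [(x - math.sqrt(a_t) * y) / math.sqrt(1.0 - a_t) for x, y in zip(x_t, target)]

def ddim_sample(target, num_steps=20, T=1000, seed=0):
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(T)
    # Evenly spaced timesteps from T-1 down to 0.
    timesteps = [round(i * (T - 1) / (num_steps - 1)) for i in reversed(range(num_steps))]
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    for i, t in enumerate(timesteps):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps = oracle_denoiser(x, a_t, target)  # predict the noise in x
        # Estimate the clean sample implied by that noise prediction.
        x0_pred = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for xi, e in zip(x, eps)]
        # Re-noise the estimate to the previous (lower) noise level.
        x = [math.sqrt(a_prev) * p + math.sqrt(1.0 - a_prev) * e for p, e in zip(x0_pred, eps)]
    return x

target = [0.5, -0.25, 1.0, 0.0]  # toy "latent" the oracle steers toward
sample = ddim_sample(target)
```

Because the oracle is exact, the loop lands on `target`; with a learned denoiser each step is only an estimate, which is why the number of steps affects quality.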

Classifier-free guidance: extrapolate from the unconditional noise prediction toward the text-conditioned one by a guidance scale, so outputs follow the prompt more strongly (at the cost of diversity).
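
As a rough sketch, classifier-free guidance is usually written as one line of arithmetic on two noise predictions. The function name and the toy numbers below are made up for illustration; in a real pipeline both predictions come from the same U-Net, run once with the prompt and once with an empty prompt.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Move from the unconditional prediction toward the conditional one.

    scale = 0 ignores the prompt, scale = 1 is plain conditional sampling,
    scale > 1 pushes past the conditional prediction.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy noise predictions for a 3-value latent (illustrative numbers only).
eps_uncond = [0.10, -0.20, 0.00]
eps_cond = [0.30, -0.10, 0.40]

ignore_prompt = classifier_free_guidance(eps_uncond, eps_cond, 0.0)
plain_conditional = classifier_free_guidance(eps_uncond, eps_cond, 1.0)
strong_guidance = classifier_free_guidance(eps_uncond, eps_cond, 7.5)
```

Larger scales amplify the difference between the two predictions, which is where the loss of diversity comes from.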

Stable Diffusion components

Part	Role
VAE	Compress the 512×512 image → smaller latent (64×64×4 in Stable Diffusion 1.x) for faster denoising
U-Net	Denoiser in latent space — same family as segmentation U-Nets
Text encoder	Prompt → sequence of token embeddings, fed to the U-Net via cross-attention

Module 5 taught U-Net for segmentation; here U-Net predicts noise, not class masks.
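
To see why denoising in latent space is cheaper, the shape arithmetic can be worked out directly. The 8× spatial downsampling and 4 latent channels below are the Stable Diffusion 1.x values and are assumptions of this sketch; other models use different factors, and the function names are hypothetical.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (channels, height, width) of the latent the U-Net works on."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be multiples of the downsample factor")
    return (latent_channels, height // downsample, width // downsample)

def compression_ratio(height, width, rgb_channels=3, downsample=8, latent_channels=4):
    """How many times fewer values the latent holds than the RGB image."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (rgb_channels * height * width) / (c * h * w)

shape_512 = latent_shape(512, 512)
ratio_512 = compression_ratio(512, 512)
```

Under these assumptions a 512×512 RGB image becomes a 4×64×64 latent, so each denoising step touches 48 times fewer values than it would in pixel space.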

ControlNet & conditioning

ControlNet feeds extra structure — edges, depth map, pose — so generation follows layout.

Other conditioning: inpainting masks, image-to-image strength, IP-Adapter for style reference.
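
Image-to-image strength is easiest to read as "how much of the denoising schedule to run". The sketch below follows a convention used by some open-source pipelines, where the input image is noised to an intermediate level and only the remaining steps are denoised. Treat the exact rounding and the function name as assumptions; check your own library before relying on them.

```python
def img2img_schedule(num_inference_steps, strength):
    """Return (start_index, steps_to_run) for an image-to-image call.

    strength = 0.0 keeps the input image untouched (no steps run);
    strength = 1.0 runs the full schedule and largely discards the input.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    start_index = num_inference_steps - steps_to_run
    return start_index, steps_to_run

half = img2img_schedule(50, 0.5)      # skip the first 25 steps, run the last 25
full = img2img_schedule(50, 1.0)      # run all 50 steps
untouched = img2img_schedule(50, 0.0) # run none
```

A side effect of this convention is that latency scales with strength: lower strength means fewer denoising steps actually execute.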

Production considerations

Topic	Note
Latency	Many denoising steps per image; reduce with distilled models or fewer steps
Safety	NSFW filters, celebrity policies
Copyright	Training-data disputes; some enterprise APIs offer indemnification tiers
Cost	GPU seconds per image; often cheaper via API than self-hosting unless volume is high
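
The API vs self-host trade-off in the Cost row can be framed as a break-even calculation. Every number below (GPU price, seconds per image, utilization, fixed monthly overhead, API price) is a hypothetical placeholder, not a quote from any provider; substitute your own figures.

```python
import math

def self_host_cost_per_image(gpu_hourly_usd, seconds_per_image, utilization):
    """Marginal GPU cost per image; utilization is the busy fraction of each hour."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    images_per_hour = 3600.0 / seconds_per_image * utilization
    return gpu_hourly_usd / images_per_hour

def break_even_images_per_month(fixed_monthly_usd, api_price_per_image, marginal_per_image):
    """Monthly volume at which self-hosting starts to beat the API, or None if never."""
    saving = api_price_per_image - marginal_per_image
    if saving <= 0:
        return None
    return math.ceil(fixed_monthly_usd / saving)

# Hypothetical inputs for illustration only.
marginal = self_host_cost_per_image(gpu_hourly_usd=1.20, seconds_per_image=3.0, utilization=0.5)
volume = break_even_images_per_month(
    fixed_monthly_usd=500.0,   # assumed engineering and ops overhead
    api_price_per_image=0.04,  # assumed API price
    marginal_per_image=marginal,
)
```

With these placeholder numbers the break-even sits at roughly thirteen thousand images per month; below that, the fixed overhead makes the API the cheaper option.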

Connect to capstone

The Module 10 capstone can combine RAG + agents (required) with optional image generation for marketing or diagram drafts, but only if your evals cover quality and safety.

Module 9 complete

Continue to Module 10 — Production & scaling for the course capstone.