
Module 9 — Multimodal & image models

Diffusion models & image generation

Noise schedules, U-Net denoisers, Stable Diffusion pipeline, ControlNet, and practical limits of generative image APIs.

~85 min read + exercises


Before we begin

Diffusion models generate images by learning to remove noise step by step — the backbone of Stable Diffusion, DALL·E 3, and Midjourney-class systems.

Start from random noise → gradually denoise → coherent image matching your prompt.


What you will learn

  • Explain the forward / reverse diffusion intuition.
  • Map the Stable Diffusion pipeline (VAE, U-Net, text encoder).
  • Know ControlNet and conditioning basics.
  • Set expectations for API vs self-host image generation.



Forward process (training)

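Gradually add Gaussian noise to an image over T steps until only pure noise remains.

The model learns to predict the noise that was added (or, in some parameterizations, the clean image) at each step. Training is supervised: the target is known because the noise was added deliberately, and captions from the image dataset supply the text conditioning.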
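
A minimal sketch of the forward process, assuming a DDPM-style linear beta schedule (the lesson does not name a schedule, so `T = 1000`, `beta_start`, and `beta_end` are illustrative values). It uses the closed form x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε to jump straight to step t, and treats an image as a flat list of normalized pixel values.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T (illustrative values)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bar(betas):
    """Cumulative product of (1 - beta): the fraction of signal variance kept."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from x_0 in one shot; also return the noise (the training target)."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

T = 1000
alpha_bars = alpha_bar(linear_beta_schedule(T))
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]   # stand-in for normalized pixel values
xt, eps = add_noise(x0, T - 1, alpha_bars, rng)
```

At the last step almost none of the original signal survives, which is why generation can start from pure noise. A training loop would pick a random `t`, call `add_noise`, and regress the model's output onto `eps`.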


Reverse process (generation)

  1. Sample a random noise latent.
  2. Condition on the text embedding from a text encoder (often CLIP or T5).
  3. The U-Net predicts a denoising update; repeat for 20–50 steps.
  4. The VAE decoder maps the final latent → RGB image.
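
The loop in steps 1 and 3 can be sketched as follows. This is a toy under stated assumptions: it uses a deterministic DDIM-style update (one of several possible samplers), a linear beta schedule with illustrative values, and a hypothetical `oracle_denoiser` that stands in for a trained U-Net by returning the exact noise relative to a known target. Text conditioning (step 2) and VAE decoding (step 4) are left out.

```python
import math
import random

def make_alpha_bars(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for a linear beta schedule (illustrative values)."""
    bars, prod = [], 1.0
    for t in range(T):
        prod *= 1.0 - (beta_start + (beta_end - beta_start) * t / (T - 1))
        bars.append(prod)
    return bars

def oracle_denoiser(target):
    """Hypothetical stand-in for a trained U-Net: returns the exact noise in x_t
    relative to a known clean target. A real model only estimates this."""
    def predict_noise(xt, a_bar):
        return [(x - math.sqrt(a_bar) * x0) / math.sqrt(1.0 - a_bar)
                for x, x0 in zip(xt, target)]
    return predict_noise

def sample(predict_noise, dim, alpha_bars, num_steps, rng):
    """Deterministic (DDIM-style) sampling over evenly spaced timesteps; num_steps >= 2."""
    T = len(alpha_bars)
    timesteps = [round(i * (T - 1) / (num_steps - 1)) for i in range(num_steps)][::-1]
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]          # step 1: pure-noise latent
    for i, t in enumerate(timesteps):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = predict_noise(x, a_t)                         # step 3: denoiser call
        # Estimate the clean latent, then re-noise it to the next (lower) noise level.
        x0_pred = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
                   for xi, e in zip(x, eps)]
        x = [math.sqrt(a_prev) * p + math.sqrt(1.0 - a_prev) * e
             for p, e in zip(x0_pred, eps)]
    return x

target = [0.5, -0.25, 1.0, 0.0]
alpha_bars = make_alpha_bars(1000)
latent = sample(oracle_denoiser(target), len(target), alpha_bars,
                num_steps=30, rng=random.Random(0))
```

Because the oracle is perfect, the loop recovers `target` up to floating-point error. With a learned denoiser the result is only approximate, which is why the number of steps affects quality.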

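Classifier-free guidance: scale the text conditioning so outputs follow the prompt more strongly, at the cost of diversity. At each step the denoiser is run with and without the prompt, and the difference between the two predictions is amplified by a guidance scale.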
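
As a rough illustration of classifier-free guidance, the usual combination rule is eps = eps_uncond + s · (eps_cond − eps_uncond). The function name and the toy vectors below are made up for the example; in a real pipeline the two inputs would come from the U-Net.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and text-conditioned noise predictions.

    guidance_scale = 1.0 returns the conditioned prediction unchanged;
    larger values push further in the direction the prompt suggests.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 0.5, -0.5]   # toy prediction with an empty prompt
eps_cond = [0.25, 0.5, -1.0]    # toy prediction with the real prompt
guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=8.0)
# guided == [2.0, 0.5, -4.5]
```

Components where the two predictions agree are left alone, and only the prompt-dependent difference is amplified. The cost is two denoiser evaluations per step (often run as one batch of two).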


Stable Diffusion components

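Part         | Role
VAE          | Compresses a 512×512 image into a smaller latent (64×64×4 in Stable Diffusion 1.x), so denoising is faster
U-Net        | Denoiser in latent space; same family as segmentation U-Nets
Text encoder | Prompt → sequence of token embeddings that condition the U-Net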

Module 5 taught U-Net for segmentation; here U-Net predicts noise, not class masks.
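
To make the VAE's role concrete, here is a small sketch of the shape arithmetic. The defaults (8× spatial downsampling, 4 latent channels) are those of Stable Diffusion 1.x; other model families may use different values, and the function names are illustrative.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the VAE latent for an RGB image (defaults follow Stable Diffusion 1.x)."""
    if height % downsample or width % downsample:
        raise ValueError("image sides should be multiples of the downsample factor")
    return (latent_channels, height // downsample, width // downsample)

def compression_ratio(height, width, **kwargs):
    """How many times fewer values the U-Net processes compared with raw RGB pixels."""
    c, h, w = latent_shape(height, width, **kwargs)
    return (3 * height * width) / (c * h * w)

shape = latent_shape(512, 512)        # (4, 64, 64)
ratio = compression_ratio(512, 512)   # 48.0
```

The U-Net therefore works on 48 times fewer values per step than it would in pixel space, which is the main reason latent diffusion is cheaper to run.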


ControlNet & conditioning

ControlNet feeds extra structure (edge maps, depth maps, pose skeletons) into the denoiser so generation follows a given layout.

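Other conditioning: inpainting masks (regenerate only the masked region), image-to-image strength (how much noise is added to a starting image before denoising), and IP-Adapter for using a reference image as a style prompt.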
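
Image-to-image strength is easiest to see as arithmetic on the step schedule. The sketch below follows a convention used by some open-source pipelines, where strength decides how far into the noise schedule the starting image is pushed; the function name and exact rounding are assumptions for illustration.

```python
def img2img_schedule(num_inference_steps, strength):
    """Return (start_index, steps_actually_run) for an image-to-image request.

    strength = 0.0 keeps the input image untouched (no denoising steps);
    strength = 1.0 noises it fully and runs the whole schedule, like text-to-image.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    start_index = num_inference_steps - steps_to_run
    return start_index, steps_to_run

start, run = img2img_schedule(num_inference_steps=50, strength=0.5)
# start == 25, run == 25: skip the 25 noisiest steps, denoise over the remaining 25
```

Lower strength means fewer steps and an output closer to the input image; higher strength gives the prompt more freedom to change it.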


Production considerations

Topic     | Note
Latency   | Many denoising steps per image; use distilled models or fewer steps
Safety    | NSFW filters, celebrity-likeness policies
Copyright | Training-data disputes; some enterprise APIs offer indemnity tiers
Cost      | GPU-seconds per image; an API is often cheaper than self-hosting unless volume is high
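
The cost row can be turned into a back-of-the-envelope calculation. Every number below is a hypothetical placeholder, not a quote from any provider, and the model ignores engineering time, storage, and bandwidth.

```python
def self_host_cost_per_image(gpu_cost_per_hour, seconds_per_image, utilization=1.0):
    """GPU cost attributed to one image; idle time is paid for too, hence utilization."""
    return gpu_cost_per_hour / 3600.0 * seconds_per_image / utilization

def break_even_images_per_month(api_price_per_image, gpu_cost_per_hour, hours_per_month=730):
    """Monthly volume at which an always-on GPU costs the same as paying per image."""
    return gpu_cost_per_hour * hours_per_month / api_price_per_image

# All figures below are hypothetical placeholders.
api_price = 0.04     # dollars per image via an API
gpu_hourly = 1.20    # dollars per GPU-hour when self-hosting
per_image = self_host_cost_per_image(gpu_hourly, seconds_per_image=3.0, utilization=0.25)
volume = break_even_images_per_month(api_price, gpu_hourly)
```

With these placeholder inputs the always-on GPU only pays for itself above roughly 21,900 images per month. Below that volume the API is cheaper, which matches the rule of thumb in the table.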

Connect to capstone

Module 10 capstone can combine RAG + agents (required) with optional image generation for marketing or diagram drafts — only if evals cover quality and safety.


Module 9 complete

Continue to Module 10 — Production & scaling for the course capstone.