Welcome to Module 5 — image segmentation

Before we begin

In Module 3 you trained a network on MNIST — one label for the whole image. In Module 4 you learned CNNs: filters that scan local patches and build up edges, textures, and parts. Module 5 is a full vision module — not a quick detour. You will spend meaningful time on what segmentation is, how encoder–decoders work, U-Net, other major families (FCN, DeepLab, SegFormer), instance models (Mask R-CNN), metrics, and a hands-on project.

Classification: “What is in this photo?” → cat
Segmentation: “Which pixels belong to the cat?” → a mask the same size as the image
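
To make the difference in output shape concrete, here is a tiny sketch in plain Python. The 4×4 toy image, the brightness threshold standing in for a trained model, and the names `classification_output` and `segmentation_mask` are illustrative assumptions, not part of the course code.

```python
# Toy 4x4 grayscale "image": bright pixels (>= 0.5) stand in for the cat.
image = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.1],
    [0.0, 0.9, 0.9, 0.1],
    [0.1, 0.1, 0.2, 0.0],
]

# Classification: one label for the whole image.
classification_output = "cat"

# Segmentation: one label per pixel (1 = cat, 0 = background).
# A simple threshold stands in for a trained model here.
segmentation_mask = [[1 if px >= 0.5 else 0 for px in row] for row in image]

image_shape = (len(image), len(image[0]))
mask_shape = (len(segmentation_mask), len(segmentation_mask[0]))

print(classification_output)  # a single label
print(image_shape, mask_shape)  # the mask has the same height and width
for row in segmentation_mask:
    print(row)
```

The point is only the shapes: the classifier returns one value, while the mask has one entry per pixel.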

That mask powers portrait blur, background removal, medical outlines, and driving perception. This module is designed to feel complete, not rushed — budget 12–15 hours for lessons + quiz + project.

Figure: Module 5 at a glance. Eight lessons: foundations → U-Net → other models → instance segmentation → metrics → quiz → project.

What you will learn (by the end of this module)

Skill	You will be able to…
Vocabulary	Distinguish semantic, instance, and panoptic segmentation
Encoder–decoder	Trace spatial sizes and explain the bottleneck problem
U-Net	Implement skips and train on real mask labels
Model landscape	Compare FCN, DeepLab/ASPP, SegFormer, Mask R-CNN — when each fits
Metrics	Use CE, IoU, Dice; avoid the accuracy trap
Project	Train a U-Net on pet images; optionally compare it with a pretrained DeepLab
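
As a preview of the "trace spatial sizes" skill, the sketch below follows the side length of a square feature map through a symmetric encoder–decoder. It assumes every encoder stage halves the resolution (for example 2×2 pooling with stride 2) and every decoder stage doubles it; the function name `trace_sizes` and the depth of 4 are hypothetical choices for illustration, and the models in the lessons may differ.

```python
def trace_sizes(input_size: int, depth: int) -> dict:
    """Trace the side length of a square feature map through a symmetric
    encoder-decoder, assuming each encoder stage halves the resolution
    and each decoder stage doubles it."""
    if input_size % (2 ** depth) != 0:
        raise ValueError("input_size must be divisible by 2**depth")
    # Encoder: input resolution, then one halving per stage.
    encoder = [input_size // (2 ** i) for i in range(depth + 1)]
    # Decoder: start from the smallest map and double once per stage.
    decoder = [encoder[-1] * (2 ** i) for i in range(1, depth + 1)]
    return {"encoder": encoder, "bottleneck": encoder[-1], "decoder": decoder}


sizes = trace_sizes(256, depth=4)
print(sizes["encoder"])     # [256, 128, 64, 32, 16]
print(sizes["bottleneck"])  # 16
print(sizes["decoder"])     # [32, 64, 128, 256]
```

Under these assumptions the smallest map is 16×16, so only 256 spatial positions remain from which the decoder has to rebuild a 256×256 mask. That loss of fine spatial detail is one way to picture the bottleneck problem.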

Lesson path (read in order)

#	Lesson	Focus
1	What is segmentation?	Task ladder, types, portrait walkthrough
2	Encoder–decoder	Dense prediction, upsampling, alignment
3	U-Net	Skips, shapes, implementation map
4	Beyond U-Net	FCN, DeepLab, ASPP, SegFormer
5	Instance & Mask R-CNN	Two-stage instance masks
6	Losses & metrics	CE, IoU, Dice, logging
7	Quiz	25 questions — pass 19/25
8	Project	U-Net from scratch + optional DeepLab comparison

Why this module is harder (and worth it)

Project	Output size
MNIST (earlier)	1 digit label per image
Segmentation (this module)	H × W labels per image

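A 256×256 image means 65,536 predictions per forward pass, and often most of those pixels are background. A model can therefore score high pixel accuracy while missing the object entirely. That is why we teach IoU instead of raw accuracy, and why you look at mask overlays every epoch.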
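
The effect is easy to reproduce with plain Python. The sketch below uses a made-up 256×256 label mask with a 16×16 object and a "model" that predicts background everywhere; the function names and the convention that two empty masks score 1.0 are assumptions for this example.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels where prediction and label agree."""
    return sum(p == t for p, t in zip(pred, target)) / len(target)


def iou(pred, target):
    """Intersection over union for the foreground class (label 1)."""
    intersection = sum(p == 1 and t == 1 for p, t in zip(pred, target))
    union = sum(p == 1 or t == 1 for p, t in zip(pred, target))
    # Convention chosen for this sketch: two empty masks count as a match.
    return intersection / union if union else 1.0


def dice(pred, target):
    """Dice coefficient for the foreground class (label 1)."""
    intersection = sum(p == 1 and t == 1 for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2 * intersection / total if total else 1.0


H = W = 256
# Flattened label mask: a 16x16 object in the top-left corner, rest background.
target = [1 if (r < 16 and c < 16) else 0 for r in range(H) for c in range(W)]
# A useless "model" that predicts background for every pixel.
all_background = [0] * (H * W)

print(len(target))                             # 65536
print(pixel_accuracy(all_background, target))  # 0.99609375
print(iou(all_background, target))             # 0.0
print(dice(all_background, target))            # 0.0
```

Pixel accuracy is above 99% even though not a single object pixel was found, while IoU and Dice both report 0.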

How Module 5 connects to prior work

Prior lesson	Carries forward here
Module 1 — Image patches	Images as grids — now every cell gets a label
Module 4 — CNNs	Conv stacks in encoders; pretrained backbones in DeepLab
Module 3 — Training loop	Same `forward → loss → backward → step`
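
For reference, here is what one `forward → loss → backward → step` iteration can look like when the labels are masks. This is a minimal sketch, not the project code: the single 1×1 convolution is a hypothetical stand-in for a real segmentation network, and the random tensors stand in for a real batch.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in model: a 1x1 conv mapping 3 input channels
# to 2 class logits per pixel. A real project would use U-Net here.
model = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Random placeholder batch, not real data.
images = torch.rand(4, 3, 8, 8)         # (N, C, H, W)
masks = torch.randint(0, 2, (4, 8, 8))  # (N, H, W), one class index per pixel

logits = model(images)         # forward: (N, num_classes, H, W)
loss = loss_fn(logits, masks)  # loss: cross-entropy averaged over all pixels
optimizer.zero_grad()
loss.backward()                # backward
optimizer.step()               # step

print(logits.shape)  # torch.Size([4, 2, 8, 8])
```

The loop is the same as in Module 3. The differences are the shapes: the logits keep their height and width, and the target is an integer mask rather than one label per image.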

You do not need to have finished the sentiment LSTM project. You do need to be comfortable with CNNs and PyTorch.

Before you start

Required

Module 4 — CNNs
Module 3 project or equivalent

Install before the project

```bash
pip install torch torchvision matplotlib numpy
# optional stretch: pip install segmentation-models-pytorch
```

Optional: the CV track's object detection material, for more depth on mAP, NMS, and detector training.

How to study (avoid rushing)

Block 2–3 evenings for Lessons 1–5 before touching code.
After Lesson 3, sketch U-Net on paper with skip arrows.
After Lesson 4, write one sentence: “I would pick DeepLab over U-Net when ___.”
In the project, save overlays by epoch 3 — do not wait until training ends.
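
An overlay is just the predicted mask tinted on top of the input image. The sketch below shows the blending step in plain Python, assuming the image is a nested list of `(r, g, b)` tuples with values 0–255 and the mask holds 0/1 per pixel. The function name `overlay_mask`, the tint color, and the `alpha` value are hypothetical; the project may well use matplotlib or tensors instead.

```python
def overlay_mask(image, mask, color=(200, 0, 0), alpha=0.5):
    """Blend `color` into every pixel where mask == 1.

    image: rows of (r, g, b) tuples with values 0-255
    mask:  rows of 0/1 labels with the same height and width
    """
    blended = []
    for image_row, mask_row in zip(image, mask):
        out_row = []
        for pixel, label in zip(image_row, mask_row):
            if label:
                # Weighted average of the original pixel and the tint color.
                pixel = tuple(
                    round((1 - alpha) * p + alpha * c)
                    for p, c in zip(pixel, color)
                )
            out_row.append(pixel)
        blended.append(out_row)
    return blended


gray = (100, 100, 100)
image = [[gray, gray], [gray, gray]]
mask = [[1, 0], [0, 0]]

result = overlay_mask(image, mask)
print(result[0][0])  # (150, 50, 50): tinted toward red
print(result[0][1])  # (100, 100, 100): unchanged background
```

Saving one such overlay per epoch makes it easier to see early whether the mask is landing on the object at all.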

Ready?

Lesson 1 — What is image segmentation?