Welcome to Module 5 — image segmentation
Before we begin
In Module 3 you trained a network on MNIST — one label for the whole image. In Module 4 you learned CNNs: filters that scan local patches and build up edges, textures, and parts. Module 5 is a full vision module — not a quick detour. You will spend meaningful time on what segmentation is, how encoder–decoders work, U-Net, other major families (FCN, DeepLab, SegFormer), instance models (Mask R-CNN), metrics, and a hands-on project.
Classification: “What is in this photo?” → cat
Segmentation: “Which pixels belong to the cat?” → a mask the same size as the image
That mask powers portrait blur, background removal, medical outlines, and driving perception. This module is designed to feel complete, not rushed — budget 12–15 hours for lessons + quiz + project.
Figure: Module 5 at a glance
What you will learn (by the end of this module)
| Skill | You will be able to… |
|---|---|
| Vocabulary | Distinguish semantic, instance, and panoptic segmentation |
| Encoder–decoder | Trace spatial sizes and explain the bottleneck problem |
| U-Net | Implement skips and train on real mask labels |
| Model landscape | Compare FCN, DeepLab/ASPP, SegFormer, Mask R-CNN — when each fits |
| Metrics | Use CE, IoU, Dice; avoid the accuracy trap |
| Project | Train U-Net on pets; optional compare to pretrained DeepLab |
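As a preview of the encoder–decoder row above, here is a minimal sketch of what "tracing spatial sizes" looks like. It assumes a 256×256 input and four encoder stages that each halve height and width (for example with 2×2 max pooling). The function name `trace_spatial_sizes` and the stage count are illustrative choices, not taken from the lessons.

```python
def trace_spatial_sizes(height, width, num_stages=4):
    """Return the (H, W) size at the input and after each downsampling stage."""
    sizes = [(height, width)]
    for _ in range(num_stages):
        # Assumption: every encoder stage halves H and W (stride-2 downsampling).
        height, width = height // 2, width // 2
        sizes.append((height, width))
    return sizes


encoder_sizes = trace_spatial_sizes(256, 256)
# Assumption: the decoder mirrors the encoder, doubling H and W at each stage
# until it is back at the input resolution.
decoder_sizes = encoder_sizes[::-1][1:]

for h, w in encoder_sizes:
    print(f"encoder: {h} x {w}")
for h, w in decoder_sizes:
    print(f"decoder: {h} x {w}")
```

Under these assumptions the smallest feature map is 16×16, so each cell there summarizes a 16×16 patch of the input. Fine boundary detail is hard to recover from that alone, which is the bottleneck problem the U-Net skip connections are meant to address.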
Lesson path (read in order)
| # | Lesson | Focus |
|---|---|---|
| 1 | What is segmentation? | Task ladder, types, portrait walkthrough |
| 2 | Encoder–decoder | Dense prediction, upsampling, alignment |
| 3 | U-Net | Skips, shapes, implementation map |
| 4 | Beyond U-Net | FCN, DeepLab, ASPP, SegFormer |
| 5 | Instance & Mask R-CNN | Two-stage instance masks |
| 6 | Losses & metrics | CE, IoU, Dice, logging |
| 7 | Quiz | 25 questions — pass 19/25 |
| 8 | Project | U-Net from scratch + optional DeepLab compare |
Why this module is harder (and worth it)
| Project | Output size |
|---|---|
| MNIST | 1 digit label |
| Segmentation | H × W labels per image |
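A 256×256 image means 65,536 predictions per forward pass, and in many images most of those pixels are background. A model that predicts background everywhere can score high pixel accuracy while missing the object entirely. That is why we teach IoU instead of raw accuracy, and why you look at mask overlays every epoch.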
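The gap between accuracy and IoU is easy to see on a toy mask. The sketch below is an illustration only: the 64×64 foreground square, the helper names `pixel_accuracy` and `iou`, and the "predict background everywhere" model are all assumptions made for the example.

```python
import numpy as np


def pixel_accuracy(pred, target):
    """Fraction of pixels where the predicted label equals the true label."""
    return float((pred == target).mean())


def iou(pred, target, cls=1):
    """Intersection over union for one class; NaN if the class is absent in both."""
    pred_c = pred == cls
    target_c = target == cls
    union = np.logical_or(pred_c, target_c).sum()
    if union == 0:
        return float("nan")
    return float(np.logical_and(pred_c, target_c).sum() / union)


# Ground truth: a 256 x 256 mask with a 64 x 64 foreground square (4,096 pixels).
target = np.zeros((256, 256), dtype=np.int64)
target[96:160, 96:160] = 1

# A useless "model" that labels every pixel as background.
all_background = np.zeros_like(target)

print(target.size)                             # 65536
print(pixel_accuracy(all_background, target))  # 0.9375
print(iou(all_background, target))             # 0.0
```

The all-background prediction is right on 61,440 of 65,536 pixels, so accuracy reads 93.75%, yet it finds none of the foreground and its IoU for that class is 0.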
How Module 5 connects to prior work
| Prior lesson | Carries forward here |
|---|---|
| Module 1 — Image patches | Images as grids — now every cell gets a label |
| Module 4 — CNNs | Conv stacks in encoders; pretrained backbones in DeepLab |
| Module 3 — Training loop | Same forward → loss → backward → step |
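To make the last row concrete, here is a hedged sketch of one training step with per-pixel labels. The model is a single convolution used as a stand-in, not a U-Net, and the batch is random data; the shapes, learning rate, and variable names are assumptions for illustration. The point is that the loop from Module 3 is unchanged and only the tensor shapes differ.

```python
import torch
from torch import nn

torch.manual_seed(0)

num_classes = 2
# Stand-in model: one conv layer that preserves H x W (assumption, not a real U-Net).
model = nn.Conv2d(3, num_classes, kernel_size=3, padding=1)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Fake batch: 4 RGB images of 32 x 32, and one integer class label per pixel.
images = torch.randn(4, 3, 32, 32)
masks = torch.randint(0, num_classes, (4, 32, 32))

logits = model(images)         # forward: one score per class per pixel
loss = loss_fn(logits, masks)  # cross-entropy averaged over every pixel
optimizer.zero_grad()
loss.backward()                # backward
optimizer.step()               # step

print(tuple(logits.shape), tuple(masks.shape))
```

`nn.CrossEntropyLoss` accepts logits of shape (N, C, H, W) with integer targets of shape (N, H, W), so no reshaping is needed for this step.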
You do not need to have finished the sentiment LSTM project. You do need to be comfortable with CNNs and PyTorch.
Before you start
Required
- Module 4 — CNNs
- Module 3 project or equivalent
Install before the project
pip install torch torchvision matplotlib numpy
# optional stretch: pip install segmentation-models-pytorch
Optional: CV track on object detection, for depth on mAP, NMS, and detector training.
How to study (avoid rushing)
- Block 2–3 evenings for Lessons 1–5 before touching code.
- After Lesson 3, sketch U-Net on paper with skip arrows.
- After Lesson 4, write one sentence: “I would pick DeepLab over U-Net when ___.”
- In the project, save overlays by epoch 3 — do not wait until training ends.