Module 4 — Object detection

Welcome to Module 4

Why detection is harder than classification, COCO vs YOLO labels, module roadmap, and what to install before the project.

~35 min read + exercises

Before we begin

If you completed Module 3, you can train a model to answer: "Is there a dog in this photo?"
Production vision systems usually need more:

  • How many dogs?
  • Where are they (pixel coordinates)?
  • How confident is each prediction?

That is object detection — and it is one of the most deployed CV tasks (autonomous driving, retail analytics, security, robotics, document OCR regions).

This module goes deep. Beyond model names, it covers box math, how training assigns targets to predictions, evaluation (mAP), failure modes, and a full fine-tuning project.

Figure: Module 4 at a glance

[Diagram: Module 4 object detection path. Work top to bottom; each lesson builds on the previous one. 1 Welcome (you are here) → 2 Task (boxes) → 3 Models (FPN, YOLO) → 4 Train (losses) → 5 mAP (IoU, NMS) → 6 Edge (deploy) → 7 Quiz (check) → 8 Project (detector)]
Six steps from this welcome page through on-device deployment, then a quiz and a hands-on detector project with mAP and threshold tuning.

Key concepts (plain English)

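Bounding box: Rectangle (x1, y1, x2, y2), or center plus size (cx, cy, w, h), that tightly contains one object.

Confidence score: The model's estimated probability that the box contains class c (from a softmax or sigmoid output, depending on the detector's design).

Anchor: Template box of preset size and aspect ratio placed at each grid cell; the network predicts offsets relative to the anchor (Faster R-CNN / SSD / older YOLO).

Proposal: Candidate region that might contain an object (RPN output in two-stage detectors).

NMS (non-maximum suppression): Post-processing that removes duplicate boxes on the same object by keeping the highest-scoring box and discarding lower-scoring boxes that overlap it beyond an IoU (intersection over union) threshold.

mAP (mean Average Precision): Standard detection metric. For each class, Average Precision (AP) is the area under the precision-recall curve traced out as the score threshold varies; mAP is the mean of AP over classes. COCO additionally averages over IoU thresholds from 0.50 to 0.95.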
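To make the box, IoU, and NMS definitions above concrete, here is a minimal pure-Python sketch. The function names (`xyxy_to_cxcywh`, `cxcywh_to_xyxy`, `iou`, `nms`) and the example boxes are illustrative assumptions, not code from this course. The NMS shown is the simple greedy, class-agnostic variant; detectors usually apply it per class.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]


def xyxy_to_cxcywh(box: Box) -> Box:
    """Corner format (x1, y1, x2, y2) -> center format (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


def cxcywh_to_xyxy(box: Box) -> Box:
    """Center format (cx, cy, w, h) -> corner format (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so non-overlapping boxes get zero intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes on one object, plus one box elsewhere.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # kept == [0, 2]: the 0.8 duplicate is suppressed
```

For tensors, torchvision provides `torchvision.ops.box_convert`, `torchvision.ops.box_iou`, and `torchvision.ops.nms`, which are the better choice inside a training or inference pipeline.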

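The mAP definition above can be sketched for a single class as follows. This is a simplified illustration under stated assumptions: each detection has already been matched to ground truth at one fixed IoU threshold (so it arrives as a score plus a true/false-positive flag), and the area is computed with all-point interpolation rather than COCO's exact evaluation protocol. The names `average_precision` and `mean_average_precision` are hypothetical.

```python
from typing import Dict, List, Tuple


def average_precision(detections: List[Tuple[float, bool]], num_ground_truth: int) -> float:
    """AP for one class from (score, is_true_positive) pairs."""
    if num_ground_truth == 0:
        return 0.0
    # Lowering the score threshold admits detections in descending score order.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls: List[float] = []
    precisions: List[float] = []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Precision envelope: at each point use the best precision achievable
    # at that recall or higher (removes the zig-zag in the raw curve).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the curve: sum rectangles wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class: Dict[str, float]) -> float:
    """mAP is the plain mean of per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Three detections for a class with two ground-truth objects:
# a hit, a false alarm, then a second hit.
ap_dog = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2)
# ap_dog is 5/6: recall 0 to 0.5 at precision 1.0, then 0.5 to 1.0 at precision 2/3.
```

Because the false alarm outranks the second true positive, AP drops below 1.0 even though both objects are eventually found. That ranking sensitivity is what distinguishes AP from plain accuracy.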

What detection adds over classification

Figure: Four vision output types

[Diagram: one street scene shown four ways. Classification: 1 label per image. Detection: boxes + labels (person, person, car). Semantic segmentation: class per pixel. Instance segmentation: mask per object.]
Detection sits between whole-image labels and per-pixel masks.

| Task | Output | Example question |
|---|---|---|
| Classification | 1 label | "Is this a cat photo?" |
| Detection | N boxes + labels | "Where are all the pedestrians?" |
| Semantic segmentation | H×W class map | "Which pixels are road?" |
| Instance segmentation | Mask per object | "Which pixel belongs to person #2?" |

Module 5 covers masks. Module 4 makes you fluent in boxes.


What Module 4 covers

| # | Lesson | You will be able to… |
|---|---|---|
| 1 | Classification → detection | Convert between box formats; explain set outputs |
| 2 | Architectures | Trace Faster R-CNN and YOLO data flow |
| 3 | Training | Interpret loss dicts; load COCO/YOLO labels |
| 4 | IoU, NMS, mAP | Compute metrics; tune thresholds honestly |
| 5 | On-device | Budget latency; export quantized models |
| Quiz | 25 MCQs | Self-check with review links |
| Project | Faster R-CNN | Fine-tune, mAP, failure analysis, ONNX |

Estimated module time: ~18–22 hours (reading + project).


Before you start

Required:

Install before the project:

```bash
pip install torch torchvision matplotlib numpy
# optional for extensions:
pip install torchmetrics pycocotools onnx onnxruntime
```
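After installing, a quick sanity check can confirm the packages are importable and whether a GPU is visible. The script below is a minimal sketch, not part of the course materials; `find_missing` and `cuda_available` are hypothetical helper names. It relies only on the standard library plus `torch.cuda.is_available()` when torch is present.

```python
import importlib.util
from typing import List

REQUIRED = ["torch", "torchvision", "matplotlib", "numpy"]
OPTIONAL = ["torchmetrics", "pycocotools", "onnx", "onnxruntime"]


def find_missing(packages: List[str]) -> List[str]:
    """Return the package names that cannot be found in this environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


def cuda_available() -> bool:
    """True if torch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch  # imported lazily so the script still runs without torch

    return torch.cuda.is_available()


if __name__ == "__main__":
    print("Missing required packages:", find_missing(REQUIRED) or "none")
    print("Missing optional packages:", find_missing(OPTIONAL) or "none")
    print("CUDA GPU available:", cuda_available())
```

If any required package is listed as missing, rerun the `pip install` line in the same environment (virtualenv, conda env, or notebook kernel) that you will use for the project.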

GPU: strongly recommended for the project (a CPU works, but training will be slow).


How to read these lessons

  1. Do checkpoint questions before reading answer sketches.
  2. Run short code snippets in a notebook when suggested.
  3. Sketch boxes on paper for IoU exercises — muscle memory matters.
  4. After the project, keep a failure gallery — best way to learn detection.


What's next

Lesson 1 — From classification to detection