
Module 4 — Object detection

Detector architectures — two-stage, one-stage & FPN

Faster R-CNN stage-by-stage, YOLO grid intuition, RetinaNet focal loss, DETR matching, and how to pick a family for your product.

~100 min read + exercises


Before we begin

"Papers mention Faster R-CNN, YOLOv8, DETR — which one am I actually running?" This lesson traces data flow through major families so you can read configs, debug outputs, and pick models for latency vs accuracy.


What you will learn

  • Trace Faster R-CNN from image tensor to final boxes.
  • Explain FPN with a small-object vs large-object example.
  • Contrast YOLO one-pass design with two-stage proposal flow.
  • Describe RetinaNet focal loss and DETR Hungarian matching.
  • Pick a detector family for edge, server, or research constraints.


Shared components

Almost every detector has:

  1. Backbone — ResNet, MobileNet, EfficientNet → feature maps
  2. Neck (often FPN) — multi-scale fusion
  3. Heads — classification + box regression (+ mask in Module 5)
plaintext
Image → Backbone → FPN → {Head @ stride 4, 8, 16, 32, ...} → post-process (NMS)

| FPN level | Typical output stride | Sees |
| --- | --- | --- |
| P3 (fine) | 8 | Small objects, fine edges |
| P4 | 16 | Medium |
| P5 (coarse) | 32 | Large context |

Feature Pyramid Network (FPN) — why one scale fails

Scenario: 640×640 image, person 20px tall (distant) vs bus 300px wide (near).

  • Deep layer, stride 32 → the 20 px person covers about 0.6 of a cell (20 / 32), so its signal is easily lost
  • Shallow layer, stride 8 → the bus spans about 37 cells (300 / 8), but the features carry weak semantics
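
The arithmetic behind this scenario can be checked with a few lines of Python. This is only a sketch: `cells_spanned` is a hypothetical helper name, and it treats "cells covered" as plain size divided by stride along one axis.

```python
def cells_spanned(object_px: float, stride: int) -> float:
    """Approximate number of feature-map cells an object covers along one axis."""
    return object_px / stride

# Sizes from the scenario: a 20 px person and a 300 px bus.
for name, size_px in [("person", 20), ("bus", 300)]:
    for stride in (8, 16, 32):
        cells = cells_spanned(size_px, stride)
        print(f"{name:>6} @ stride {stride:>2}: {cells:.3f} cells")
```

At stride 8 the person still covers 2.5 cells, while at stride 32 it covers less than one; the bus covers 37.5 cells at stride 8.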

FPN: lateral connections merge shallow (detail) + deep (semantics).
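
To make the merge concrete, here is a minimal single-channel sketch of one top-down step: the coarse map is upsampled 2× (nearest neighbour) and added to the fine map. It deliberately omits the 1×1 lateral convolution and the 3×3 smoothing convolution that a real FPN applies, and the names `upsample2x` and `merge_top_down` are illustrative, not from any library.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D list of numbers."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out

def merge_top_down(fine, coarse):
    """Add the upsampled coarse map to the fine map, element-wise."""
    up = upsample2x(coarse)
    return [[f + u for f, u in zip(frow, urow)] for frow, urow in zip(fine, up)]

coarse = [[10, 20],
          [30, 40]]            # deep level: strong semantics, low resolution
fine = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]      # shallow level: fine detail, high resolution

merged = merge_top_down(fine, coarse)
for row in merged:
    print(row)
```

Every position in `merged` keeps the fine map's resolution while carrying the coarse map's value for its region, which is the intuition behind "detail + semantics".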

| Design | Small object AP | Large object AP |
| --- | --- | --- |
| Single top feature | Low | OK |
| FPN | Much better | Still strong |

Exercise: Why does resizing the input from 640 to 1280 help small objects but cost 4× the memory in early layers?

Activation memory scales with $H \times W$. A larger input spreads each tiny object over more pixels, but doubling the side length quadruples the pixel count, so compute and RAM grow linearly in pixel count and quadratically in side length.
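
A quick count makes the 4× figure tangible. The layer settings below (stride 2, 64 channels) are illustrative assumptions, not values from a specific backbone, and `activation_elements` is a hypothetical helper.

```python
def activation_elements(height: int, width: int, stride: int, channels: int) -> int:
    """Number of values in one feature map of shape (channels, H/stride, W/stride)."""
    return channels * (height // stride) * (width // stride)

# Hypothetical early layer: stride 2, 64 channels.
small = activation_elements(640, 640, stride=2, channels=64)
large = activation_elements(1280, 1280, stride=2, channels=64)

print(small, large, large / small)
```

The ratio is 4 regardless of the stride or channel count chosen, because both cancel out.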


Two-stage: Faster R-CNN (deep dive)

Figure: Faster R-CNN two-stage pipeline. Propose regions first, then classify and refine each region.

Image (H×W×3) → Backbone (ResNet) → FPN (multi-scale) → RPN (proposals) → RoI head (class + box)

RPN proposes; RoI head refines.

Stage A — Region Proposal Network (RPN)

On each FPN level, sliding anchors → predict:

  • Objectness (fg vs bg)
  • Box deltas to refine anchor → proposal

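Thousands of candidates per image → rank by objectness and keep only the top-scoring ones. The budget is a config value, not a constant: the original paper used about 2000 proposals for training and 300 at test time, while torchvision defaults to 2000 (training) and 1000 (inference) after NMS.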
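
The "deltas refine anchor → proposal" step and the top-k filter can be sketched as follows. The code assumes the common centre-size parameterization (shift the centre by a fraction of the anchor size, scale width and height in log space) and skips the clipping and NMS that a real RPN also applies. `decode_anchor` and `top_k_proposals` are hypothetical names.

```python
import math

def decode_anchor(anchor, deltas):
    """Apply (dx, dy, dw, dh) deltas to an anchor given as (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = deltas
    cx = ax + dx * aw          # shift centre by a fraction of the anchor size
    cy = ay + dy * ah
    w = aw * math.exp(dw)      # scale size in log space
    h = ah * math.exp(dh)
    return (cx, cy, w, h)

def top_k_proposals(anchors, deltas, scores, k):
    """Decode every anchor, then keep the k highest-objectness proposals."""
    decoded = [decode_anchor(a, d) for a, d in zip(anchors, deltas)]
    ranked = sorted(zip(scores, decoded), key=lambda pair: pair[0], reverse=True)
    return [box for _, box in ranked[:k]]

anchors = [(50, 50, 32, 32), (100, 100, 64, 64), (200, 80, 128, 128)]
deltas = [(0.0, 0.0, 0.0, 0.0), (0.25, -0.5, 0.0, 0.0), (0.0, 0.0, math.log(2), 0.0)]
scores = [0.10, 0.90, 0.60]   # objectness per anchor

proposals = top_k_proposals(anchors, deltas, scores, k=2)
print(proposals)
```

With `k=2`, the low-scoring first anchor is discarded and the remaining two come back in score order, already shifted and resized by their deltas.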

Stage B — RoI head

For each proposal:

  1. RoI Align — sample features at proposal location (bilinear, sub-pixel)
  2. FC layers — class logits + second box refinement

Why two stages help: background clutter filtered before expensive per-region classification.
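
The "bilinear, sub-pixel" part of RoI Align comes down to interpolating the feature map at a fractional coordinate. The sketch below shows only that single-point interpolation on a 2D list; a full RoI Align samples several such points per output bin and pools them. `bilinear_sample` is an illustrative helper and assumes non-negative coordinates inside the map.

```python
def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional (y, x) location."""
    y0, x0 = int(y), int(x)                  # top-left integer neighbour
    y1 = min(y0 + 1, len(fmap) - 1)          # clamp at the bottom border
    x1 = min(x0 + 1, len(fmap[0]) - 1)       # clamp at the right border
    wy, wx = y - y0, x - x0                  # fractional offsets in [0, 1)
    top = (1 - wx) * fmap[y0][x0] + wx * fmap[y0][x1]
    bottom = (1 - wx) * fmap[y1][x0] + wx * fmap[y1][x1]
    return (1 - wy) * top + wy * bottom

fmap = [[0.0, 1.0],
        [2.0, 3.0]]

value = bilinear_sample(fmap, 0.5, 0.5)      # midpoint of the four cells
print(value)  # 1.5
```

Because no coordinate is rounded to the nearest cell, a proposal that moves by a fraction of a pixel produces a smoothly changing feature, which is what the older RoI Pooling quantization lost.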

torchvision mental model

python
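import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # downloads COCO weights on first use
model.eval()
image_tensor = torch.rand(3, 480, 640)  # placeholder CHW float image in [0, 1]
with torch.no_grad():
    preds = model([image_tensor])[0]  # one dict per input image
print(preds.keys())  # boxes, labels, scores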

Training mode with targets returns loss_classifier, loss_box_reg, loss_objectness, loss_rpn_box_reg.


One-stage: YOLO family

Figure: YOLO grid (one-stage intuition). Each cell can predict boxes + class scores in one forward pass. Highlighted (green) cells are responsible for predicting the object in the dashed box.

Each cell is responsible for objects whose center falls inside it.

Core idea: divide image into $S \times S$ grid. Cell containing object center predicts:

  • Box (cx, cy, w, h) — often relative to cell
  • Objectness / confidence
  • Class probabilities
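
As a sketch of the grid idea, the code below finds the responsible cell for an object centre and decodes a cell-relative prediction back to pixels. It assumes one simple convention (centre offsets in [0, 1] within the cell, width and height as fractions of the image), similar in spirit to the original YOLO; later versions use different parameterizations. The function names are hypothetical.

```python
def responsible_cell(cx, cy, image_size, grid_size):
    """Return (row, col) of the grid cell containing the box centre."""
    cell = image_size / grid_size
    return int(cy // cell), int(cx // cell)

def decode_cell_box(row, col, tx, ty, tw, th, image_size, grid_size):
    """Map cell-relative predictions to an absolute (cx, cy, w, h) box.

    tx, ty are offsets in [0, 1] inside the cell; tw, th are fractions of
    the image size. This convention is assumed for illustration only.
    """
    cell = image_size / grid_size
    cx = (col + tx) * cell
    cy = (row + ty) * cell
    return (cx, cy, tw * image_size, th * image_size)

# 640 px image, 8 x 8 grid -> 80 px cells. Object centre at (200, 330).
row, col = responsible_cell(200, 330, image_size=640, grid_size=8)
box = decode_cell_box(row, col, tx=0.5, ty=0.125, tw=0.25, th=0.5,
                      image_size=640, grid_size=8)
print((row, col), box)
```

The centre (200, 330) falls in row 4, column 2, and decoding the offsets from that cell recovers the same centre in pixel coordinates.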

Single forward pass → decode boxes → NMS.
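
The final NMS step can be sketched as a greedy loop: keep the highest-scoring box, discard everything that overlaps it too much, repeat. This is a minimal single-class version in plain Python with an assumed IoU threshold of 0.5; production pipelines usually run it per class with a vectorized library routine.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes with positive area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (300, 300, 400, 400)]
scores = [0.9, 0.8, 0.7]

kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with IoU of about 0.68, so it is suppressed; the third box does not overlap and survives. In crowded scenes this same rule can delete true neighbours, which is why the threshold needs tuning.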

| Version trend | Notes |
| --- | --- |
| YOLOv3–v5 | Anchor-based, multi-scale heads |
| YOLOv8+ (Ultralytics) | Anchor-free decoupled head, strong defaults |

Pros: real-time on GPU/edge, simple deploy story.
Cons: crowded scenes, tiny objects — need tuning (resolution, aug, NMS).

bash
# Ultralytics quick train (extension)
yolo detect train data=coco128.yaml model=yolov8n.pt epochs=50

RetinaNet — one-stage + focal loss

Dense predictions on FPN levels like SSD, but focal loss fixes extreme foreground/background imbalance:

$$FL(p_t) = -\alpha (1-p_t)^\gamma \log(p_t)$$

Easy negatives (empty sky) down-weighted; hard examples dominate gradient.
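
A small numeric sketch shows how strongly the modulating factor separates easy from hard examples. It evaluates the formula above for a single prediction, using $\alpha = 0.25$ and $\gamma = 2$ (the values the RetinaNet paper reports as working best) as defaults; the probabilities 0.99 and 0.10 are made-up illustrations.

```python
import math

def focal_loss(p_t: float, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t) for one prediction."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p_t: float) -> float:
    """Plain cross-entropy, i.e. focal loss with alpha = 1 and gamma = 0."""
    return -math.log(p_t)

easy = 0.99   # confident, correct prediction (e.g. empty sky as background)
hard = 0.10   # badly misclassified example

for name, p_t in [("easy", easy), ("hard", hard)]:
    ce, fl = cross_entropy(p_t), focal_loss(p_t)
    print(f"{name}: CE={ce:.6f}  FL={fl:.8f}  FL/CE={fl / ce:.6f}")
```

With $\gamma = 2$, the factor $(1-p_t)^\gamma$ is about 0.0001 for the easy example and 0.81 for the hard one, so thousands of easy background anchors contribute far less to the total loss than a handful of hard ones.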

Result (at time of paper): one-stage matches two-stage mAP with faster inference.


DETR and query-based detectors

DEtection TRansformer:

  1. CNN backbone → flattened features + positional encoding
  2. Transformer encoder–decoder
  3. Fixed $N$ learned queries (e.g. 100) → each outputs class + box

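Hungarian matching assigns each ground-truth box to exactly one query during training; unmatched queries learn to predict "no object". The result: no anchors, and the original paper drops NMS because the one-to-one loss already penalizes duplicate predictions.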
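
The one-to-one assignment can be illustrated on a tiny example. The sketch below is not the Hungarian algorithm itself: it brute-forces every assignment with `itertools.permutations`, which reaches the same optimum for a handful of boxes but does not scale. It also uses only an L1 box cost, whereas DETR's matching cost combines class probability, L1 distance and generalized IoU. All names and numbers are illustrative.

```python
from itertools import permutations

def l1_cost(pred_box, gt_box):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))

def match_queries(pred_boxes, gt_boxes):
    """Return {gt_index: query_index} minimising the total matching cost.

    Brute force over permutations; only viable for tiny inputs. Assumes
    there are at least as many queries as ground-truth boxes.
    """
    best_cost, best_assignment = float("inf"), ()
    for queries in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(l1_cost(pred_boxes[q], gt_boxes[g]) for g, q in enumerate(queries))
        if cost < best_cost:
            best_cost, best_assignment = cost, queries
    return {g: q for g, q in enumerate(best_assignment)}

# Three query outputs, two ground-truth boxes, all as normalised (cx, cy, w, h).
pred_boxes = [(0.9, 0.9, 0.1, 0.1), (0.2, 0.2, 0.3, 0.3), (0.5, 0.5, 0.4, 0.4)]
gt_boxes = [(0.5, 0.5, 0.4, 0.5), (0.2, 0.25, 0.3, 0.3)]

matches = match_queries(pred_boxes, gt_boxes)
print(matches)  # {0: 2, 1: 1}
```

Ground truth 0 goes to query 2 and ground truth 1 to query 1; query 0 is left unmatched and would be trained toward the "no object" class.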

| Aspect | DETR | YOLO |
| --- | --- | --- |
| Inductive bias | Set transformer | Grid/local |
| Convergence | Slower, needs data | Fast |
| Edge deploy | Heavier | Lighter variants |

Modern variants (DINO, RT-DETR) improve speed and accuracy — same set prediction idea.


Architecture comparison table

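| Family | Stages | NMS | Typical use |
| --- | --- | --- | --- |
| Faster R-CNN | 2 | Yes | Research, high accuracy |
| Cascade R-CNN | 2+ cascaded | Yes | Best bbox tightness |
| RetinaNet | 1 | Yes | GPU balanced |
| YOLOv8 | 1 | Yes | Real-time |
| DETR | 1 | Optional | Set formulation |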

Choosing for your product

| Constraint | Start here |
| --- | --- |
| < 30 ms on phone | YOLO-nano, MobileNet-SSD, INT8 |
| Best mAP, server GPU | Cascade, DINO, large YOLO |
| Custom 1-class detector | Fine-tune Faster R-CNN or YOLO-small (this module's project) |
| Need masks too | Mask R-CNN (Module 5) |

Checkpoint questions

  1. What does RPN output that the RoI head consumes?
  2. Why does FPN attach heads to multiple strides?
  3. What problem does focal loss address in one-stage detectors?

Sketches: (1) region proposals. (2) objects of different pixel sizes. (3) extreme negative imbalance from dense sliding windows.


What's next

Lesson 3 — Training detectors