Detector architectures — two-stage, one-stage & FPN

Before we begin

"Papers mention Faster R-CNN, YOLOv8, DETR — which one am I actually running?" This lesson traces data flow through major families so you can read configs, debug outputs, and pick models for latency vs accuracy.

What you will learn

Trace Faster R-CNN from image tensor to final boxes.
Explain FPN with a small-object vs large-object example.
Contrast YOLO one-pass design with two-stage proposal flow.
Describe RetinaNet focal loss and DETR Hungarian matching.
Pick a detector family for edge, server, or research constraints.

Before this lesson

Lesson 1 — From classification to detection

Shared components

Almost every detector has:

Backbone — ResNet, MobileNet, EfficientNet → feature maps
Neck (often FPN) — multi-scale fusion
Heads — classification + box regression (+ mask in Module 5)

plaintext

Image → Backbone → FPN → {Head @ stride 4, 8, 16, 32, ...} → post-process (NMS)

FPN level	Typical output stride	Sees
P3 (fine)	8	Small objects, fine edges
P4	16	Medium
P5 (coarse)	32	Large context

Feature Pyramid Network (FPN) — why one scale fails

Scenario: 640×640 image, person 20px tall (distant) vs bus 300px wide (near).

Deep layer, stride 32 → distant person covers about 0.6 of a cell (20/32) → easily lost
Shallow layer, stride 8 → bus spans many cells (300/8 ≈ 37), but semantics are weak
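The arithmetic behind this scenario can be checked with a few lines of plain Python. This is a minimal sketch: `cells_spanned` is a hypothetical helper introduced here for illustration, and the object sizes are the ones from the scenario.

```python
# Hypothetical helper: how many feature-map cells an object covers at a given stride.
def cells_spanned(object_px: float, stride: int) -> float:
    return object_px / stride

STRIDES = (8, 16, 32)                  # P3, P4, P5
OBJECTS = {"person": 20, "bus": 300}   # object sizes in pixels, from the scenario

for name, size_px in OBJECTS.items():
    for stride in STRIDES:
        cells = cells_spanned(size_px, stride)
        print(f"{name:>6} @ stride {stride:>2}: {cells:.3f} cells")
```

At stride 32 the person covers less than one cell, while at stride 8 the same person covers 2.5 cells, which is why the fine level matters for small objects.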

FPN: lateral connections merge shallow (detail) + deep (semantics).
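The merge itself can be sketched in plain Python on single-channel maps. This is a simplified illustration, not a full FPN: it assumes both maps already have matching channels, so the 1×1 lateral convolution and the 3×3 smoothing convolution that FPN normally applies are left out. The function names are hypothetical.

```python
def upsample2x_nearest(fmap):
    """Repeat every value twice along both axes (nearest-neighbour upsampling)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def fpn_merge(shallow, deep):
    """Top-down step: upsample the deep map and add it element-wise to the shallow map."""
    up = upsample2x_nearest(deep)
    return [[s + u for s, u in zip(s_row, u_row)] for s_row, u_row in zip(shallow, up)]

deep = [[10, 20],
        [30, 40]]               # coarse map (strong semantics), e.g. stride 32
shallow = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]]    # finer map (more detail), e.g. stride 16

merged = fpn_merge(shallow, deep)
for row in merged:
    print(row)
```

Each output cell keeps the fine spatial layout of the shallow map while carrying the value of the deep cell that covers it.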

Design	Small object AP	Large object AP
Single top feature	Low	OK
FPN	Much better	Still strong

Exercise: Why does resizing input to 1280 help small objects but cost 4× memory in early layers?

Activation memory scales with $H \times W$. Doubling each side from 640 to 1280 quadruples the pixel count, so tiny objects cover more pixels, but compute and RAM in the early layers grow about 4× (quadratically in the side length).
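A quick count makes the 4× figure concrete. The channel count and stride below are assumptions chosen for illustration, not values from a specific backbone.

```python
def activation_elements(height, width, channels, stride):
    """Number of values in one feature map of the given stride."""
    return (height // stride) * (width // stride) * channels

CHANNELS = 64   # assumed channel count of an early layer
STRIDE = 2      # assumed stride of that layer

small = activation_elements(640, 640, CHANNELS, STRIDE)
large = activation_elements(1280, 1280, CHANNELS, STRIDE)
ratio = large / small
print(small, large, ratio)  # 6553600 26214400 4.0
```

The ratio is independent of the assumed channel count and stride, since both cancel out.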

Two-stage: Faster R-CNN (deep dive)

Figure: Faster R-CNN pipeline. RPN proposes; RoI head refines.

Stage A — Region Proposal Network (RPN)

On each FPN level, sliding anchors → predict:

Objectness (fg vs bg)
Box deltas to refine anchor → proposal

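Thousands of scored anchors → keep only the top-scoring proposals after NMS: about 2000 for training and 300 at inference in the original paper (torchvision defaults to 2000 and 1000). The exact counts are configurable.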
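The proposal step can be sketched as follows, assuming the common (dx, dy, dw, dh) delta parameterization, where the center shift is scaled by the anchor size and the width and height are scaled exponentially. The helper names are hypothetical, and a real RPN would also clip boxes to the image and run NMS before the final top-k.

```python
import math

def decode(anchor, deltas):
    """Apply (dx, dy, dw, dh) deltas to an (x1, y1, x2, y2) anchor."""
    x1, y1, x2, y2 = anchor
    dx, dy, dw, dh = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    pcx, pcy = cx + dx * w, cy + dy * h            # shift the center
    pw, ph = w * math.exp(dw), h * math.exp(dh)    # rescale width and height
    return (pcx - 0.5 * pw, pcy - 0.5 * ph, pcx + 0.5 * pw, pcy + 0.5 * ph)

def top_k_proposals(anchors, deltas, scores, k):
    """Keep the k highest-objectness anchors and decode them into proposals."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [(decode(anchors[i], deltas[i]), scores[i]) for i in order]

anchors = [(0, 0, 32, 32), (16, 16, 48, 48), (100, 100, 164, 164)]
deltas = [(0.0, 0.0, 0.0, 0.0), (0.25, 0.0, 0.0, 0.0), (0.0, 0.0, math.log(2), 0.0)]
scores = [0.10, 0.90, 0.60]   # objectness per anchor

proposals = top_k_proposals(anchors, deltas, scores, k=2)
for box, score in proposals:
    print(score, box)
```

The second anchor is shifted right by a quarter of its width, and the third is doubled in width around the same center.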

Stage B — RoI head

For each proposal:

RoI Align — sample features at proposal location (bilinear, sub-pixel)
FC layers — class logits + second box refinement
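The sub-pixel sampling in RoI Align comes down to bilinear interpolation. The sketch below is a simplified illustration with hypothetical function names: it takes one sample at the center of each output bin and assumes feature value `fmap[i][j]` sits at coordinate (i, j), whereas real implementations average several samples per bin and differ in their half-pixel conventions.

```python
import math

def bilinear_sample(fmap, y, x):
    """Sample a 2D feature map at a fractional (y, x) location."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = math.floor(y), math.floor(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
    bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
    return top * (1 - wy) + bottom * wy

def roi_align_simple(fmap, box, stride, out_size):
    """Pool an (x1, y1, x2, y2) image-space box into an out_size x out_size grid."""
    x1, y1, x2, y2 = (v / stride for v in box)   # image pixels -> feature coordinates
    bin_w, bin_h = (x2 - x1) / out_size, (y2 - y1) / out_size
    return [[bilinear_sample(fmap, y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w)
             for j in range(out_size)]
            for i in range(out_size)]

fmap = [[4 * i + j for j in range(4)] for i in range(4)]   # toy 4x4 feature map
roi = roi_align_simple(fmap, box=(8, 8, 24, 24), stride=8, out_size=2)
print(roi)  # [[7.5, 8.5], [11.5, 12.5]]
```

Every proposal, whatever its size, ends up as a fixed-size grid that the FC layers can consume.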

Why two stages help: the RPN filters out most background clutter before the more expensive per-region classification runs.

torchvision mental model

python

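import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # COCO-pretrained weights
model.eval()

image_tensor = torch.rand(3, 480, 640)  # placeholder image: float CHW in [0, 1]
with torch.no_grad():
    preds = model([image_tensor])[0]  # one dict per input image
print(preds.keys())  # boxes, labels, scores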

Training mode with targets returns loss_classifier, loss_box_reg, loss_objectness, loss_rpn_box_reg.

One-stage: YOLO family

Figure: YOLO grid. Each cell is responsible for objects whose center falls inside it.

Core idea: divide image into $S \times S$ grid. Cell containing object center predicts:

Box (cx, cy, w, h) — often relative to cell
Objectness / confidence
Class probabilities
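The cell assignment can be sketched in plain Python. This follows the early grid formulation, with the center expressed relative to the cell and the size relative to the image. Later YOLO versions use different parameterizations, so treat the function names and encoding here as illustrative assumptions.

```python
def encode_to_grid(cx, cy, w, h, img_size, S):
    """Find the responsible cell and express the box relative to it."""
    cell = img_size / S
    col = min(int(cx // cell), S - 1)
    row = min(int(cy // cell), S - 1)
    tx = cx / cell - col          # center offset inside the cell, in [0, 1)
    ty = cy / cell - row
    return row, col, (tx, ty, w / img_size, h / img_size)

def decode_from_grid(row, col, target, img_size, S):
    """Invert encode_to_grid back to pixel coordinates."""
    tx, ty, tw, th = target
    cell = img_size / S
    return ((col + tx) * cell, (row + ty) * cell, tw * img_size, th * img_size)

# A 64x128 box centered at (100, 200) in a 640x640 image with a 20x20 grid.
row, col, target = encode_to_grid(100, 200, 64, 128, img_size=640, S=20)
decoded = decode_from_grid(row, col, target, img_size=640, S=20)
print(row, col, target)
print(decoded)
```

With 32-pixel cells, the center (100, 200) lands in row 6, column 3, and decoding recovers the original box.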

Single forward pass → decode boxes → NMS.
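The NMS step at the end can be sketched as a greedy loop. This is a minimal, class-agnostic version with hypothetical function names. Production pipelines usually run it per class and use an optimized library routine.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed. In crowded scenes the same rule can remove true neighbours, which is why the threshold is a tuning knob.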

Version trend	Notes
YOLOv3–v5	Anchor-based, multi-scale heads
YOLOv8+ (Ultralytics)	Anchor-free decoupled head, strong defaults

Pros: real-time on GPU/edge, simple deploy story.
Cons: crowded scenes and tiny objects often need tuning (resolution, augmentation, NMS thresholds).

bash

# Ultralytics quick train (extension)
yolo detect train data=coco128.yaml model=yolov8n.pt epochs=50

RetinaNet — one-stage + focal loss

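Dense predictions on every FPN level, in the one-stage style of SSD, with focal loss to counter the extreme foreground/background imbalance:

$$\mathrm{FL}(p_t) = -\alpha (1 - p_t)^{\gamma} \log(p_t)$$

Here $p_t$ is the predicted probability of the true class, $\alpha$ is a class-balancing weight, and $\gamma$ controls how strongly easy examples are down-weighted.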

Easy negatives (empty sky) down-weighted; hard examples dominate gradient.
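A small numeric comparison shows the down-weighting. The sketch uses $\alpha = 0.25$ and $\gamma = 2$, the defaults reported in the RetinaNet paper, and applies a single $\alpha$ to every example for simplicity. The two probabilities are made-up illustrative values.

```python
import math

def cross_entropy(p_t):
    """Standard cross-entropy on the true-class probability."""
    return -math.log(p_t)

def focal_loss(p_t, alpha=0.25, gamma=2.0):
    """Focal loss: cross-entropy scaled by alpha * (1 - p_t) ** gamma."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

easy = 0.99   # confidently correct example, e.g. empty sky scored as background
hard = 0.10   # badly misclassified example

for name, p_t in (("easy", easy), ("hard", hard)):
    ce, fl = cross_entropy(p_t), focal_loss(p_t)
    print(f"{name}: CE={ce:.5f}  FL={fl:.7f}  FL/CE={fl / ce:.6f}")
```

The scaling factor is $\alpha (1 - p_t)^\gamma$: 0.000025 for the easy example versus 0.2025 for the hard one, so thousands of easy negatives no longer swamp the gradient.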

Result (at the time of the paper): a one-stage detector matched or exceeded two-stage mAP while keeping one-stage inference speed.

DETR and query-based detectors

DEtection TRansformer:

CNN backbone → flattened features + positional encoding
Transformer encoder–decoder
Fixed $N$ learned queries (e.g. 100) → each outputs class + box

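Hungarian matching assigns each ground-truth box to exactly one query during training, and the remaining queries learn to predict "no object". There are no anchors, and because each object gets a single prediction, the original paper does not need NMS.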
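The one-to-one assignment can be illustrated on a tiny cost matrix. The sketch below swaps the Hungarian algorithm for a brute-force search over permutations, which gives the same optimal assignment but only scales to toy sizes. In practice a polynomial-time solver such as SciPy's `linear_sum_assignment` is used. The cost values are made up, and the function name is hypothetical.

```python
from itertools import permutations

def match_queries(cost):
    """cost[q][g] is the cost of assigning query q to ground-truth box g.

    Returns (total_cost, assignment), where assignment[g] is the query matched
    to ground-truth g. Assumes there are at least as many queries as boxes.
    """
    num_queries, num_gt = len(cost), len(cost[0])
    best_total, best_assignment = float("inf"), None
    for queries in permutations(range(num_queries), num_gt):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best_total, best_assignment = total, queries
    return best_total, best_assignment

# 4 queries, 2 ground-truth boxes (illustrative class + box costs).
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
    [0.7, 0.3],
]
total, assignment = match_queries(cost)
unmatched = sorted(set(range(len(cost))) - set(assignment))
print(assignment, unmatched)  # (1, 0) [2, 3]
```

Query 1 is matched to the first box and query 0 to the second. Queries 2 and 3 are unmatched and would be trained toward the "no object" class.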

Aspect	DETR	YOLO
Inductive bias	Set prediction, global attention	Grid/local
Convergence	Slower, needs data	Fast
Edge deploy	Heavier	Lighter variants

Modern variants (DINO, RT-DETR) improve speed and accuracy — same set prediction idea.

Architecture comparison table

Family	Stages	NMS	Typical use
Faster R-CNN	2	Yes	Research, high accuracy
Cascade R-CNN	2+ cascaded	Yes	Tight box localization at high IoU
RetinaNet	1	Yes	Balanced speed and accuracy on GPU
YOLOv8	1	Yes	Real-time
DETR	1	Not required	Set formulation

Choosing for your product

Constraint	Start here
< 30 ms on phone	YOLO nano variants or MobileNet-SSD, with INT8 quantization
Best mAP, server GPU	Cascade, DINO, large YOLO
Custom 1-class detector	Fine-tune Faster R-CNN or YOLO-small (this module's project)
Need masks too	Mask R-CNN (Module 5)

Checkpoint questions

What does RPN output that the RoI head consumes?
Why does FPN attach heads to multiple strides?
What problem does focal loss address in one-stage detectors?

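Answer sketches: (1) Region proposals: scored, refined candidate boxes. (2) Objects appear at very different pixel sizes, and each stride suits a different size range. (3) Extreme foreground/background imbalance from dense predictions.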

What's next

Lesson 3 — Training detectors