Detector architectures — two-stage, one-stage & FPN
Before we begin
"Papers mention Faster R-CNN, YOLOv8, DETR — which one am I actually running?" This lesson traces data flow through major families so you can read configs, debug outputs, and pick models for latency vs accuracy.
What you will learn
- Trace Faster R-CNN from image tensor to final boxes.
- Explain FPN with a small-object vs large-object example.
- Contrast YOLO one-pass design with two-stage proposal flow.
- Describe RetinaNet focal loss and DETR Hungarian matching.
- Pick a detector family for edge, server, or research constraints.
Before this lesson
Shared components
Almost every detector has:
- Backbone — ResNet, MobileNet, EfficientNet → feature maps
- Neck (often FPN) — multi-scale fusion
- Heads — classification + box regression (+ mask in Module 5)
Image → Backbone → FPN → {Head @ stride 4, 8, 16, 32, ...} → post-process (NMS)
| FPN level | Typical output stride | Sees |
|---|---|---|
| P3 (fine) | 8 | Small objects, fine edges |
| P4 | 16 | Medium |
| P5 (coarse) | 32 | Large context |
Feature Pyramid Network (FPN) — why one scale fails
Scenario: 640×640 image, person 20px tall (distant) vs bus 300px wide (near).
- Deep layer (stride 32) → the distant person covers less than one cell → effectively invisible
- Shallow layer (stride 8) → the bus spans many cells, but the features carry weak semantics
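The scenario's numbers can be checked with a few lines of arithmetic. This is only a sketch: `cells_spanned` is a hypothetical helper that divides object size by stride along one axis, ignoring receptive-field effects.

```python
def cells_spanned(object_px: float, stride: int) -> float:
    """Feature-map cells an object covers along one axis at a given stride."""
    return object_px / stride


# The 20px person and the 300px bus from the scenario above.
for name, size_px in [("person", 20), ("bus", 300)]:
    for stride in (8, 16, 32):
        cells = cells_spanned(size_px, stride)
        print(f"{name:6s} {size_px:3d}px @ stride {stride:2d}: {cells:5.2f} cells")
```

At stride 32 the person falls below one cell, while at stride 8 the bus covers dozens of cells.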
FPN: a top-down pathway upsamples deep (semantic) features, and lateral connections add them to shallow (detailed) features, so every level carries both.
| Design | Small object AP | Large object AP |
|---|---|---|
| Single top feature | Low | OK |
| FPN | Much better | Still strong |
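The merge can be sketched in PyTorch as follows. `TinyFPN`, its channel widths, and the random "backbone outputs" are illustrative assumptions, not the torchvision implementation; the point is only the lateral 1x1 conv plus upsample-and-add pattern.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyFPN(nn.Module):
    """Hypothetical minimal FPN over three backbone maps (C3, C4, C5)."""

    def __init__(self, in_channels=(64, 128, 256), out_channels=32):
        super().__init__()
        # 1x1 lateral convs bring every level to the same channel width.
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )
        # 3x3 convs smooth each merged map.
        self.smooth = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
             for _ in in_channels]
        )

    def forward(self, c3, c4, c5):
        l3, l4, l5 = (lat(c) for lat, c in zip(self.lateral, (c3, c4, c5)))
        # Top-down pathway: upsample the coarser map and add the lateral one.
        p5 = l5
        p4 = l4 + F.interpolate(p5, size=l4.shape[-2:], mode="nearest")
        p3 = l3 + F.interpolate(p4, size=l3.shape[-2:], mode="nearest")
        return [s(p) for s, p in zip(self.smooth, (p3, p4, p5))]


# Fake backbone outputs for a 640x640 input at strides 8, 16, 32.
c3 = torch.rand(1, 64, 80, 80)
c4 = torch.rand(1, 128, 40, 40)
c5 = torch.rand(1, 256, 20, 20)
p3, p4, p5 = TinyFPN()(c3, c4, c5)
print(p3.shape, p4.shape, p5.shape)
```

Every output level keeps its own resolution but shares one channel width, which is what lets the same head run on all of them.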
Exercise: Why does resizing the input to 1280 help small objects but cost 4× the memory in early layers?
Answer sketch: activation memory scales with H × W, so doubling each side from 640 to 1280 quadruples it. The larger input spreads tiny objects across more pixels, but compute and RAM grow quadratically with the side length.
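A quick back-of-the-envelope check of the 4× figure. The layer shape (64 channels at stride 2, float32) is an assumption chosen to resemble an early backbone layer, and `activation_elems` is a hypothetical helper.

```python
def activation_elems(height: int, width: int, channels: int, stride: int) -> int:
    """Number of activation values in one feature map."""
    return channels * (height // stride) * (width // stride)


small = activation_elems(640, 640, channels=64, stride=2)
large = activation_elems(1280, 1280, channels=64, stride=2)

bytes_per_float32 = 4
print(f"640 input:  {small * bytes_per_float32 / 2**20:.0f} MiB")
print(f"1280 input: {large * bytes_per_float32 / 2**20:.0f} MiB")
print(f"ratio: {large / small}")
```

The ratio does not depend on the assumed channel count or stride, only on the doubled height and width.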
Two-stage: Faster R-CNN (deep dive)
Figure: Faster R-CNN pipeline
Stage A — Region Proposal Network (RPN)
On each FPN level, a small conv head slides over the feature map and, for every anchor at every location, predicts:
- Objectness (fg vs bg)
- Box deltas to refine anchor → proposal
Thousands of scored proposals → NMS → keep the top-scoring ones. The budget is a config value, typically 1000 to 2000 per image for training and 300 to 1000 at inference.
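The decode-then-rank step can be sketched in plain Python. This assumes the usual (dx, dy, dw, dh) delta parameterization and omits the clipping and NMS that a real RPN also applies; `decode` and `top_k_proposals` are hypothetical helper names.

```python
import math


def decode(anchor, delta):
    """Apply (dx, dy, dw, dh) deltas to an (x1, y1, x2, y2) anchor."""
    x1, y1, x2, y2 = anchor
    wa, ha = x2 - x1, y2 - y1
    cxa, cya = x1 + 0.5 * wa, y1 + 0.5 * ha
    dx, dy, dw, dh = delta
    # Center shifts are scaled by anchor size; sizes are scaled in log space.
    cx, cy = cxa + dx * wa, cya + dy * ha
    w, h = wa * math.exp(dw), ha * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def top_k_proposals(anchors, deltas, objectness, k):
    """Decode every anchor, then keep the k highest-scoring proposals."""
    proposals = [(score, decode(a, d))
                 for a, d, score in zip(anchors, deltas, objectness)]
    proposals.sort(key=lambda item: item[0], reverse=True)
    return proposals[:k]


anchors = [(0, 0, 32, 32), (16, 16, 48, 48), (100, 100, 164, 164)]
deltas = [(0.0, 0.0, 0.0, 0.0), (0.25, 0.0, 0.0, 0.0), (0.0, 0.0, math.log(2), 0.0)]
objectness = [0.10, 0.90, 0.60]

for score, box in top_k_proposals(anchors, deltas, objectness, k=2):
    print(score, box)
```

The low-objectness anchor is dropped before the RoI head ever sees it.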
Stage B — RoI head
For each proposal:
- RoI Align — sample features at proposal location (bilinear, sub-pixel)
- FC layers — class logits + second box refinement
Why two stages help: the RPN filters out background clutter before the expensive per-region classification runs.
torchvision mental model
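```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # COCO-pretrained weights
model.eval()

image_tensor = torch.rand(3, 480, 640)  # placeholder image: float CHW in [0, 1]
with torch.no_grad():
    preds = model([image_tensor])[0]
print(preds.keys())  # boxes, labels, scores
```
In training mode, calling the model with targets returns a dict of losses instead: loss_classifier, loss_box_reg, loss_objectness, loss_rpn_box_reg.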
One-stage: YOLO family
Figure: YOLO grid
Core idea: divide the image into a grid. The cell containing an object's center predicts:
- Box (cx, cy, w, h) — often relative to cell
- Objectness / confidence
- Class probabilities
Single forward pass → decode boxes → NMS.
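The decode step can be sketched as below. The encoding is an assumption in the style of the earliest YOLO: cx, cy are offsets in [0, 1] inside the cell and w, h are fractions of the image. Later versions use different parameterizations, and `decode_cell` is a hypothetical helper.

```python
def decode_cell(row, col, pred, grid_size, image_size):
    """Turn one cell's (cx, cy, w, h) prediction into pixel (x1, y1, x2, y2).

    Assumed encoding: cx, cy are offsets in [0, 1] inside the cell;
    w, h are fractions of the full (square) image.
    """
    cx_off, cy_off, w_frac, h_frac = pred
    cell = image_size / grid_size          # cell side length in pixels
    cx = (col + cx_off) * cell
    cy = (row + cy_off) * cell
    w, h = w_frac * image_size, h_frac * image_size
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


box = decode_cell(row=3, col=5, pred=(0.5, 0.25, 0.25, 0.5),
                  grid_size=10, image_size=640)
print(box)  # (272.0, 48.0, 432.0, 368.0)
```

The box may extend far beyond its cell: the cell only anchors the center.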
| Version trend | Notes |
|---|---|
| YOLOv3–v5 | Anchor-based, multi-scale heads |
| YOLOv8+ (Ultralytics) | Anchor-free decoupled head, strong defaults |
Pros: real-time on GPU/edge, simple deploy story.
Cons: crowded scenes, tiny objects — need tuning (resolution, aug, NMS).
```shell
# Ultralytics quick train (extension)
yolo detect train data=coco128.yaml model=yolov8n.pt epochs=50
```
RetinaNet — one-stage + focal loss
RetinaNet makes dense predictions on FPN levels, like SSD, but its focal loss addresses the extreme foreground/background imbalance: easy negatives (empty sky) are down-weighted, so hard examples dominate the gradient.
Result (at time of paper): one-stage matches two-stage mAP with faster inference.
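The down-weighting can be made concrete with the binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), where p_t is the probability the model assigns to the true class. The sketch below uses gamma = 2 and alpha = 0.25, the defaults reported in the RetinaNet paper; the two example probabilities are made up.

```python
import math


def cross_entropy(p: float, y: int) -> float:
    """Binary cross-entropy for foreground probability p and label y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


examples = {
    "easy negative": (0.01, 0),  # background, confidently classified
    "hard positive": (0.10, 1),  # object the model nearly missed
}
for name, (p, y) in examples.items():
    print(f"{name}: CE={cross_entropy(p, y):.5f}  FL={focal_loss(p, y):.7f}")
```

The easy negative's loss shrinks by several orders of magnitude while the hard positive's loss stays the same order of magnitude, so thousands of easy background locations no longer swamp the few objects.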
DETR and query-based detectors
DEtection TRansformer:
- CNN backbone → flattened features + positional encoding
- Transformer encoder–decoder
- Fixed learned queries (e.g. 100) → each outputs class + box
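Hungarian matching assigns each ground-truth box to exactly one query during training. Because the matching is one-to-one, the model needs no anchors and, in the original paper, no NMS post-processing step.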
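The matching objective can be illustrated on a toy case. This sketch swaps in a brute-force search over assignments, which finds the same optimum the Hungarian algorithm would but only scales to tiny inputs, and it uses an L1 box cost alone, whereas DETR's cost also includes class and GIoU terms. All boxes are made-up normalized (cx, cy, w, h) values.

```python
from itertools import permutations


def l1_cost(pred, gt):
    """L1 distance between two (cx, cy, w, h) boxes."""
    return sum(abs(p - g) for p, g in zip(pred, gt))


def match_queries(pred_boxes, gt_boxes):
    """Lowest-cost one-to-one assignment, as {gt index: query index}."""
    best_cost, best_ids = float("inf"), None
    # Try every way of giving each ground-truth box a distinct query.
    for query_ids in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(l1_cost(pred_boxes[q], gt_boxes[g])
                   for g, q in enumerate(query_ids))
        if cost < best_cost:
            best_cost, best_ids = cost, query_ids
    return dict(enumerate(best_ids)), best_cost


pred_boxes = [
    (0.10, 0.10, 0.05, 0.05),
    (0.52, 0.48, 0.30, 0.40),
    (0.80, 0.20, 0.10, 0.10),
    (0.50, 0.50, 0.20, 0.42),
]
gt_boxes = [(0.50, 0.50, 0.30, 0.40), (0.82, 0.21, 0.10, 0.12)]

assignment, cost = match_queries(pred_boxes, gt_boxes)
print(assignment)  # {0: 1, 1: 2}
```

Queries left unmatched (0 and 3 here) are trained to predict "no object", which is why duplicate boxes are suppressed by the loss rather than by NMS.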
| Aspect | DETR | YOLO |
|---|---|---|
| Inductive bias | Set prediction, global attention | Grid/local |
| Convergence | Slower, needs more data and epochs | Fast |
| Edge deploy | Heavier | Lighter variants |
Modern variants (DINO, RT-DETR) improve speed and accuracy — same set prediction idea.
Architecture comparison table
| Family | Stages | NMS | Typical use |
|---|---|---|---|
| Faster R-CNN | 2 | Yes | Research, high accuracy |
| Cascade R-CNN | 2+ cascaded | Yes | Best bbox tightness |
| RetinaNet | 1 | Yes | GPU balanced |
| YOLOv8 | 1 | Yes | Real-time |
| DETR | 1 | Optional | Set formulation |
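The NMS column refers to greedy non-maximum suppression. Here is a plain-Python sketch with made-up boxes and an assumed IoU threshold of 0.5; production code would normally call a library routine such as `torchvision.ops.nms`.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 260, 260)]
scores = [0.90, 0.85, 0.70]
print(nms(boxes, scores))  # [0, 2]
```

The second box is a near-duplicate of the first and is suppressed; the third does not overlap either and survives.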
Choosing for your product
| Constraint | Start here |
|---|---|
| < 30 ms on phone | YOLO-nano, MobileNet-SSD, INT8 |
| Best mAP, server GPU | Cascade, DINO, large YOLO |
| Custom 1-class detector | Fine-tune Faster R-CNN or YOLO-small (this module's project) |
| Need masks too | Mask R-CNN (Module 5) |
Checkpoint questions
- What does RPN output that the RoI head consumes?
- Why does FPN attach heads to multiple strides?
- What problem does focal loss address in one-stage detectors?
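Answer sketches: (1) Region proposals: scored, refined candidate boxes. (2) Objects appear at very different pixel sizes, so each stride handles its own size range. (3) The extreme foreground/background imbalance created by dense predictions at every location.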