Detection, segmentation, and on-device trade-offs
This lesson separates classification (what?) from localization (where?) and segmentation (which pixels?). It ends with the engineering constraints you hit when deploying vision models on phones and embedded hardware.
Figure: Four ways to ask "what's in this image?"
Learning objectives
- Contrast object detection, instance segmentation, and semantic segmentation.
- Explain anchors at a high level and why modern detectors moved toward anchor-free or query-based designs.
- List major on-device constraints: latency, memory, thermal, and numerical precision.
Prerequisites
- Convolutional networks lesson.
- Basic idea of GPU vs CPU (helpful but not required).
Step 1 — From one label per image to many objects
A classifier outputs a single distribution over classes for the whole image. A detector outputs a variable-size set of objects, each with:
- a bounding box (often parameterized as center, width, and height), and
- class scores (multi-label in some setups); one possible per-object record is sketched below.
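A minimal Python sketch of what that per-object output might look like; the field names and the normalized-coordinate convention are illustrative assumptions, not a fixed standard.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    # Box as (center_x, center_y, width, height), here in normalized
    # image coordinates in [0, 1]; pixel coordinates are also common.
    cx: float
    cy: float
    w: float
    h: float
    label: str
    score: float  # confidence for the predicted label

# A detector returns a variable-length list of these per image,
# which is what makes it a set-prediction problem:
predictions = [
    Detection(cx=0.42, cy=0.55, w=0.20, h=0.35, label="person", score=0.91),
    Detection(cx=0.71, cy=0.60, w=0.15, h=0.12, label="dog", score=0.78),
]
```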
Checkpoint: Why is “set prediction” harder than fixed-length outputs?
Step 2 — Two-stage vs one-stage (conceptual)
Historically:
- Two-stage detectors (e.g. R-CNN family): propose regions, then classify/refine.
- One-stage detectors (e.g. SSD, YOLO family): predict boxes densely in one forward pass.
Modern systems blur this line: improved training objectives and feature pyramids have largely closed the accuracy gap between the two families.
Exercise: Give one advantage of two-stage and one advantage of one-stage approaches.
Step 3 — Segmentation flavors
- Semantic segmentation: each pixel gets a class label (all “person” pixels share a category).
- Instance segmentation: separate object instances (person 1 vs person 2), often masks + boxes.
- Panoptic segmentation: combines "stuff" (sky, road) and "things" (people, cars) in one unified labeling (advanced topic). Output conventions for all three flavors are sketched below.
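A rough sketch of common output shapes using NumPy arrays; exact encodings vary by framework, so treat these conventions as illustrative assumptions.

```python
import numpy as np

H, W = 480, 640  # image height and width

# Semantic segmentation: one class id per pixel; instances are not separated.
semantic = np.zeros((H, W), dtype=np.int64)

# Instance segmentation: one boolean mask per detected object, so two
# people yield two separate (H, W) masks plus a class label each.
num_instances = 2
instance_masks = np.zeros((num_instances, H, W), dtype=bool)
instance_labels = ["person", "person"]

# Panoptic segmentation: one common encoding stores a (class id, instance id)
# pair per pixel, with a sentinel instance id for "stuff" classes like road.
panoptic = np.zeros((H, W, 2), dtype=np.int64)
```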
Checkpoint: For autonomous driving scene parsing, which failure is worse: confusing road vs sidewalk semantically, or merging two pedestrians into one instance mask?
Step 4 — Anchors and matching (intuition)
Many classical detectors tile the image with anchor boxes at multiple scales. Training matches ground-truth boxes to anchors using IoU thresholds.
IoU (intersection over union) measures overlap between two boxes: the area of their intersection divided by the area of their union, as computed in the sketch after the figure.
Figure: Intersection over Union
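A minimal, framework-free sketch of IoU and the threshold-matching idea; the corner-coordinate box format and the 0.5 threshold are common choices, not requirements.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Toy matching rule: an anchor becomes a positive training example when its
# IoU with a ground-truth box clears a threshold. Real detectors also use
# ignore ranges and force-match every ground truth to its best anchor.
anchor = (10, 10, 50, 50)
ground_truth = (20, 20, 60, 60)
print(iou(anchor, ground_truth))  # ~0.39, a negative at a 0.5 threshold
```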
Pain point: anchor tuning (scales, aspect ratios, matching thresholds) is fiddly, which is one reason anchor-free and query-based detectors predict boxes differently.
Step 5 — On-device deployment constraints
When you leave the datacenter:
- Latency: interactive apps need stable frame times; spikes feel janky.
- Memory: activations dominate at large input resolutions, so watch feature map sizes (estimated in the sketch after the figure below).
- Thermal / battery: sustained inference throttles CPUs/GPUs/NPUs.
- Numerical precision: INT8 quantization reduces memory bandwidth and can speed up compute, but can hurt calibration-sensitive layers.
Figure: Four constraints to budget on-device
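A back-of-envelope sketch of why memory climbs quickly with resolution, assuming a single full-resolution 64-channel fp32 feature map; real networks downsample early, but the quadratic growth in pixel count still drives the trend.

```python
def activation_mib(height, width, channels, bytes_per_value=4):
    """Rough memory for one feature map (fp32 by default), in MiB."""
    return height * width * channels * bytes_per_value / 2**20

# Doubling the input side quadruples pixel count, so activation memory
# grows quadratically with resolution, not linearly.
for side in (320, 640, 1280):
    print(side, f"{activation_mib(side, side, channels=64):.0f} MiB")
# Prints 25 MiB at 320, 100 MiB at 640, 400 MiB at 1280.
```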
Exercise: List two vision tasks where a 30 ms model is acceptable vs unacceptable.
Step 6 — Quantization and accuracy (high level)
Post-training quantization (PTQ) maps floating-point weights and activations to low-bit integers, typically picking the mapping from a small calibration dataset.
- Sometimes you need quantization-aware training (QAT) to recover accuracy.
- Some layers (e.g. certain attention patterns) are more sensitive than conv blocks; a minimal PTQ sketch follows.
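One concrete route, as a sketch: ONNX Runtime's quantization tools. The file names here are hypothetical, and this shows dynamic quantization (INT8 weights, activations quantized at runtime, no calibration set); calibration-based static PTQ uses quantize_static with a data reader instead.

```python
# Requires: pip install onnxruntime (quantization tools ship with it).
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model.onnx",        # hypothetical input model
    model_output="model_int8.onnx",  # hypothetical quantized output path
    weight_type=QuantType.QInt8,     # store weights as signed INT8
)
```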
Check your understanding
- What extra output does detection require compared to classification?
- Why does memory grow faster than linearly with input resolution in many architectures?
- What is one reason instance segmentation is more expensive than bounding-box detection?
Lab-style stretch goal (optional)
Export a small ONNX model and run it with ONNX Runtime on CPU; measure median latency at two different input resolutions. One possible starting point is sketched below.
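A sketch of one way to run the lab, assuming torch, torchvision, and onnxruntime are installed; the model choice, input name, resolutions, and repeat count are all assumptions you can change.

```python
import statistics
import time

import numpy as np
import onnxruntime as ort
import torch
import torchvision

# Export a small classifier with dynamic spatial axes so resolution can vary.
model = torchvision.models.mobilenet_v3_small(weights=None).eval()
torch.onnx.export(
    model, torch.randn(1, 3, 224, 224), "mobilenet.onnx",
    input_names=["input"], dynamic_axes={"input": {2: "height", 3: "width"}},
)

sess = ort.InferenceSession("mobilenet.onnx", providers=["CPUExecutionProvider"])

for side in (224, 448):
    x = np.random.rand(1, 3, side, side).astype(np.float32)
    sess.run(None, {"input": x})  # warm-up run, excluded from timing
    times_ms = []
    for _ in range(50):
        t0 = time.perf_counter()
        sess.run(None, {"input": x})
        times_ms.append((time.perf_counter() - t0) * 1e3)
    print(f"{side}x{side}: median {statistics.median(times_ms):.1f} ms")
```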