
Learning-based vision

Detection, segmentation, and on-device trade-offs

Anchors vs queries, instance vs semantic segmentation, latency, memory, and quantization at a high level.

~70 min read + exercises


This lesson separates classification (what?) from localization (where?) and segmentation (which pixels?). It ends with the engineering constraints you hit when deploying vision models on phones and embedded hardware.

Figure

Four ways to ask: what's in this image?

[Diagram: classification gives 1 label per image ("street scene"); detection gives boxes + labels (person, person, car); semantic segmentation labels every pixel; instance segmentation separates object 1 vs object 2.]
The output structure changes from a single label to per-pixel labels and finally per-instance masks. Each step costs more compute and more annotation effort.

Learning objectives

  • Contrast object detection, instance segmentation, and semantic segmentation.
  • Explain anchors at a high level and why modern detectors moved toward anchor-free or query-based designs.
  • List major on-device constraints: latency, memory, thermal, and numerical precision.

Prerequisites

  • Convolutional networks lesson.
  • Basic idea of GPU vs CPU (helpful but not required).

Step 1 — From one label per image to many objects

A classifier outputs a distribution over classes for the whole image. A detector outputs a set of objects, each with:

  • a bounding box (often parameterized as center, width, height), and
  • class scores (possibly multi-label in some setups).
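One way to see the difference is in the output types. The sketch below is illustrative, not any particular library's API: a classifier returns one distribution per image, while a detector returns a variable-length set of box-plus-scores records.

```python
from dataclasses import dataclass

# Hypothetical container for one detected object. A classifier emits a
# single class distribution per image; a detector emits a *set* of these.
@dataclass
class Detection:
    cx: float       # box center x (normalized to [0, 1])
    cy: float       # box center y
    w: float        # box width
    h: float        # box height
    scores: dict    # class name -> confidence

# Classifier output for one image: a single fixed-length distribution.
classifier_out = {"street scene": 0.9, "indoor scene": 0.1}

# Detector output for the same image: a set whose size varies per image.
detector_out = [
    Detection(0.3, 0.6, 0.1, 0.3, {"person": 0.95}),
    Detection(0.5, 0.6, 0.1, 0.3, {"person": 0.88}),
    Detection(0.8, 0.7, 0.3, 0.2, {"car": 0.91}),
]

# The number of objects is unknown in advance: this is "set prediction".
assert len(detector_out) == 3
```

The variable-length, unordered output is exactly what makes detection harder to train than classification: the loss must first decide which prediction corresponds to which ground-truth object.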

Checkpoint: Why is “set prediction” harder than fixed-length outputs?


Step 2 — Two-stage vs one-stage (conceptual)

Historically:

  • Two-stage detectors (e.g. R-CNN family): propose regions, then classify/refine.
  • One-stage detectors (e.g. SSD, YOLO family): predict boxes densely in one forward pass.

Modern systems blur these lines: improved training objectives and feature pyramids let one-stage detectors match two-stage accuracy in many settings.

Exercise: Give one advantage of two-stage and one advantage of one-stage approaches.


Step 3 — Segmentation flavors

  • Semantic segmentation: each pixel gets a class label (all “person” pixels share a category).
  • Instance segmentation: separate object instances (person 1 vs person 2), often masks + boxes.
  • Panoptic segmentation: combines “stuff” (sky, road) and “things” (people, cars) in one unified labeling (advanced topic).

Checkpoint: For autonomous driving scene parsing, which failure is worse: confusing road vs sidewalk semantically, or merging two pedestrians into one instance mask?


Step 4 — Anchors and matching (intuition)

Many classical detectors tile the image with anchor boxes at multiple scales. Training matches ground-truth boxes to anchors using IoU thresholds.

IoU (intersection over union) measures overlap between two boxes.

Figure

Intersection over Union

[Diagram: prediction box A overlapping ground-truth box B; IoU = area(A ∩ B) ÷ area(A ∪ B). Used to match predicted boxes to ground truth during training and evaluation.]
IoU compares the overlapping area against the combined area. Thresholds on IoU decide which anchors are 'positive' during training and which detections count as a hit at test time.
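The formula translates directly into a few lines of code. This is a minimal sketch assuming corner-format boxes `(x1, y1, x2, y2)`; the 0.5 threshold shown is a common but tunable choice, not a fixed rule.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Matching intuition: an anchor counts as "positive" for a ground-truth
# box when their IoU exceeds a threshold.
anchor = (0, 0, 10, 10)
ground_truth = (5, 5, 15, 15)
overlap = iou(anchor, ground_truth)   # 25 / 175, roughly 0.14
is_positive = overlap >= 0.5          # False: too little overlap
```

Note how quickly IoU drops: here the boxes share a quarter of each side's area, yet the score is only about 0.14, which is part of why threshold choices matter so much.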

Pain point: anchor tuning (scales, aspect ratios, IoU thresholds) is fiddly, which is why anchor-free and query-based (transformer-style) detectors predict boxes without hand-designed anchors.


Step 5 — On-device deployment constraints

When you leave the datacenter:

  1. Latency: interactive apps need stable frame times; spikes feel janky.
  2. Memory: activations dominate for large resolutions — watch feature map sizes.
  3. Thermal / battery: sustained inference throttles CPUs/GPUs/NPUs.
  4. Numerical precision: INT8 quantization reduces memory bandwidth and can speed up arithmetic, but can hurt accuracy in calibration-sensitive layers.

Figure

Four constraints to budget on-device

[Diagram: four budget axes: latency (frame-time stability), memory (activations dominate), thermal (sustained throttle), precision (INT8 quantization).]
On phones and embedded boards these axes compete: chasing one (e.g. latency via lower precision) usually costs another (accuracy, calibration headroom).
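The memory constraint in particular scales quadratically with input resolution, because activation memory grows with height × width. A back-of-envelope sketch (the 64-channel layer and float32 storage are illustrative assumptions, not a specific architecture):

```python
def activation_bytes(height, width, channels, bytes_per_value=4):
    """Memory for one feature map, assuming float32 (4 bytes/value)."""
    return height * width * channels * bytes_per_value

# Hypothetical early conv layer with 64 output channels:
small = activation_bytes(320, 320, 64)   # ~26 MB for one feature map
large = activation_bytes(640, 640, 64)   # ~105 MB for the same layer

# Doubling each spatial side quadruples activation memory.
assert large == 4 * small
```

This is why "just run at higher resolution" is rarely free on-device: a modest bump in input size can blow past a phone's memory budget before accuracy gains show up.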

Exercise: List two vision tasks where 30 ms of inference latency per frame is acceptable and two where it is unacceptable.


Step 6 — Quantization and accuracy (high level)

Post-training quantization maps floating weights/activations to low-bit integers using calibration data.

  • Sometimes you need quantization-aware training (QAT) to recover accuracy.
  • Some layers (e.g. certain attention patterns) are more sensitive than conv blocks.
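The core mapping can be sketched in a few lines. This is a simplified symmetric-quantization sketch, not a production scheme: real post-training quantization derives the scale from calibration data and often uses per-channel scales.

```python
def quantize_int8(values):
    """Symmetric quantization of floats to INT8 [-128, 127].

    Here the scale comes from the max magnitude of the values themselves;
    real PTQ estimates this statistic from a calibration set instead.
    """
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map INT8 values back to (approximate) floats."""
    return [v * scale for v in q]

weights = [0.50, -1.27, 0.031, 0.9]
q, scale = quantize_int8(weights)        # scale = 1.27 / 127 = 0.01
restored = dequantize(q, scale)

# Rounding error is bounded by half a quantization step per value.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
assert max_err <= scale / 2 + 1e-9
```

The bounded-but-nonzero rounding error is the whole trade: layers whose outputs are sensitive to small perturbations (the "calibration-sensitive" ones above) lose the most, which is when QAT becomes worth the extra training cost.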

Check your understanding

  1. What extra output does detection require compared to classification?
  2. Why does increasing input resolution nonlinearly affect memory in many architectures?
  3. What is one reason instance segmentation is more expensive than bounding-box detection?

Lab-style stretch goal (optional)

Export a small ONNX model and run it with ONNX Runtime on CPU; measure median latency at two different input resolutions.
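For the timing half of this goal, the measurement harness matters as much as the model. A minimal sketch of median-latency measurement, using a stand-in function where you would call your ONNX Runtime session (the `fake_forward` workload below is purely illustrative):

```python
import statistics
import time

def median_latency_ms(run, n_warmup=5, n_timed=50):
    """Median wall-clock latency of run() in milliseconds.

    Warm-up iterations are excluded: first calls often pay one-off costs
    (allocation, JIT, cache warm-up) that would skew the statistics.
    The median is reported because it resists latency spikes.
    """
    for _ in range(n_warmup):
        run()
    times = []
    for _ in range(n_timed):
        t0 = time.perf_counter()
        run()
        times.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(times)

# Stand-in for a model forward pass. With ONNX Runtime you would time
# something like `session.run(None, {input_name: batch})` instead.
def fake_forward():
    sum(i * i for i in range(10_000))

print(f"median: {median_latency_ms(fake_forward):.3f} ms")
```

Run it once per input resolution and compare the medians; also worth watching is the spread between warm-up and steady-state times, which previews the thermal behavior discussed in Step 5.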