
Module 4 — Object detection

On-device detection & deployment

Latency vs resolution budgets, batching, INT8 quantization for detectors, ONNX/TFLite export, and profiling on CPU/GPU/mobile.

~80 min read + exercises


Before we begin

A detector that scores 45 mAP on a workstation is only useful on a phone if it runs at acceptable latency, holds a stable frame rate, and stays within thermal limits. This lesson connects model choices to milliseconds, megabytes, and ship/no-ship product calls.


What you will learn

  • Budget latency, memory, and thermal headroom for detection pipelines.
  • Compare FP32, FP16, INT8 for detector backbones and heads.
  • Export torchvision / YOLO → ONNX → TensorRT / TFLite.
  • Place NMS on CPU vs GPU vs in-graph.
  • Define a minimal production checklist before launch.



Where time goes (typical pipeline)

| Stage | ms (illustrative mobile) | Notes |
| --- | --- | --- |
| Preprocess (resize, norm) | 2–8 | Often CPU |
| Backbone + heads | 15–80 | GPU/NPU |
| Decode boxes | 1–5 | |
| NMS | 2–20 | CPU if many candidates |
| Draw / tracking | 1–5 | |

Rule: profile end-to-end — optimizing backbone while NMS dominates is wasted effort.
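One way to apply this rule is to time each stage separately and compare its share of the total. The sketch below is a minimal illustration, not a full profiler: `StageTimer` is a hypothetical helper, and the `time.sleep` calls are placeholders for real preprocessing, inference, and NMS work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.totals_ms = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals_ms[name] += (time.perf_counter() - start) * 1000.0

    def shares(self):
        """Return each stage's fraction of the total measured time."""
        total = sum(self.totals_ms.values())
        if not total:
            return {}
        return {name: ms / total for name, ms in self.totals_ms.items()}


timer = StageTimer()
for _ in range(5):  # a handful of frames; use many more on a real device
    with timer.stage("preprocess"):
        time.sleep(0.001)  # placeholder for resize + normalize
    with timer.stage("inference"):
        time.sleep(0.004)  # placeholder for backbone + heads
    with timer.stage("nms"):
        time.sleep(0.002)  # placeholder for decode + NMS

# Print stages from most to least expensive.
for name, share in sorted(timer.shares().items(), key=lambda kv: -kv[1]):
    print(f"{name:<11}{timer.totals_ms[name]:8.1f} ms  {share:5.1%}")
```

On GPU or NPU backends an inference call may return before the work has finished, so synchronize with the device before a stage ends; otherwise the time is attributed to whichever stage waits for the result.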


Resolution vs accuracy vs cost

| Input | Relative compute | Small object AP |
| --- | --- | --- |
| 320×320 | 1× (reference) | Lower |
| 640×640 | ~4× | Baseline |
| 1280×1280 | ~16× | Higher |

Doubling the side length ≈ 4× the activation memory in early conv layers, because those feature maps scale with pixel count.
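The scaling in the table follows from pixel count: for a fully convolutional detector, compute and early-layer activation memory grow roughly in proportion to the number of input pixels. The arithmetic can be sketched as below, taking 320×320 as the reference and assuming a hypothetical first layer with 32 output channels, stride 2, and FP16 activations (these layer numbers are illustrative, not taken from any particular model).

```python
def relative_compute(side, reference_side=320):
    """Approximate compute relative to the reference input (cost proportional to pixels)."""
    return (side / reference_side) ** 2


def first_layer_activation_mib(side, channels=32, stride=2, bytes_per_value=2):
    """Size in MiB of one early feature map for a square input (illustrative layer shape)."""
    out_side = side // stride
    return out_side * out_side * channels * bytes_per_value / 2**20


for side in (320, 640, 1280):
    print(side, relative_compute(side), first_layer_activation_mib(side))
# 320 1.0 1.5625
# 640 4.0 6.25
# 1280 16.0 25.0
```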

Product pattern: preview stream at 320, capture high-res on demand for detail.

Figure: Deployment constraints. Four constraints to budget on-device: latency (frame-time stability), memory (activations dominate), thermal (sustained throttle), and precision (INT8 quantization). Latency, memory, thermal, precision: pick two aggressively.

Model selection for edge

| Model | Params (order) | Typical role |
| --- | --- | --- |
| YOLOv8n | ~3M | Phone real-time |
| MobileNet-SSD | ~5M | Classic mobile |
| Faster R-CNN R50-FPN | ~40M+ | Server / batch |
| YOLOv8x | ~68M | Offline accuracy |

Knowledge distillation: train small student to mimic large teacher logits + box outputs.
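As a rough illustration of the distillation objective, the sketch below combines a softened-logit term with a box-regression term for a single prediction that has already been matched between student and teacher. The temperature, the `box_weight`, and the use of KL divergence plus L1 are assumptions chosen for clarity; real detector distillation also needs a matching step between student and teacher predictions, which is omitted here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, student_box, teacher_box,
                      temperature=2.0, box_weight=1.0):
    """KL(teacher || student) on softened class scores plus mean L1 on box coordinates."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    box_l1 = sum(abs(s - t) for s, t in zip(student_box, teacher_box)) / len(teacher_box)
    # Multiplying by T^2 keeps the soft-target gradient scale comparable across temperatures.
    return temperature ** 2 * kl + box_weight * box_l1


teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.5, 1.2, 0.3]
teacher_box = [0.20, 0.10, 0.60, 0.80]  # normalized x1, y1, x2, y2
student_box = [0.22, 0.12, 0.58, 0.79]
loss = distillation_loss(student_logits, teacher_logits, student_box, teacher_box)
print(f"distillation loss: {loss:.4f}")
```

In practice this term is usually added to the ordinary detection loss on ground-truth labels rather than replacing it.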


Quantization

Post-training quantization (PTQ)

  1. Run 100–500 calibration images through model.
  2. Record activation ranges per layer.
  3. Map to INT8 scale/zero-point.

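Risk: small objects and fine box localization degrade first, so always re-run mAP on val after quantizing: mAP@0.5 at minimum, and ideally a stricter metric such as mAP@0.5:0.95, which is more sensitive to small box shifts.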
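Steps 2 and 3 of the PTQ recipe can be sketched for a single tensor as follows. This is a minimal example that assumes plain min/max calibration and asymmetric per-tensor INT8; real toolchains often use percentile or entropy-based ranges and per-channel scales, and the activation values here are made up.

```python
def calibrate_range(batches):
    """Step 2: track the min/max activation seen across calibration batches."""
    lo = min(min(batch) for batch in batches)
    hi = max(max(batch) for batch in batches)
    return lo, hi


def affine_int8_params(lo, hi, qmin=-128, qmax=127):
    """Step 3: map the float range [lo, hi] onto INT8 with a scale and zero-point."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # range must contain 0 so that 0.0 is exact
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero range
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Float to INT8: scale, shift by the zero-point, and clamp."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """INT8 back to float."""
    return (q - zero_point) * scale


calibration_batches = [[-1.0, 0.2, 3.0], [0.5, 2.5, -0.4]]  # made-up activations
lo, hi = calibrate_range(calibration_batches)
scale, zero_point = affine_int8_params(lo, hi)
print(scale, zero_point)

x = 1.234
x_hat = dequantize(quantize(x, scale, zero_point), scale, zero_point)
print(x, x_hat, abs(x - x_hat))  # error is bounded by scale / 2 inside the range
```

Values outside the calibrated range are clamped, which is one reason unrepresentative calibration images hurt accuracy.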

Quantization-aware training (QAT)

Insert fake-quant nodes during fine-tuning so the weights adapt to INT8 rounding error; this typically recovers 1–3 mAP points relative to PTQ.
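A fake-quant node quantizes a value and immediately dequantizes it, so the tensor stays in floating point but carries the rounding and clamping error that INT8 inference will introduce. The sketch below shows only that forward behavior, with an assumed per-tensor scale; training frameworks additionally pass gradients through the rounding step unchanged (the straight-through estimator), which is not shown.

```python
def fake_quant(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Quantize then dequantize: x stays float but picks up INT8 rounding error."""
    q = round(x / scale) + zero_point
    q = max(qmin, min(qmax, q))  # clamp to the INT8 range
    return (q - zero_point) * scale


weights = [0.013, -0.402, 0.250, 1.900]
scale = 0.01  # assumed per-tensor scale, e.g. taken from a calibration pass
simulated = [fake_quant(w, scale) for w in weights]
for w, s in zip(weights, simulated):
    print(f"{w:+.3f} -> {s:+.3f}")
```

The last weight falls outside the representable range (127 × 0.01) and is clamped, which is the kind of error the fine-tuned network learns to tolerate or avoid.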

Precision ladder

DtypeSpeedAccuracy
FP32BaselineBest
FP16~1.5–2× GPUUsually fine
INT82–4×Validate per class

Export paths

PyTorch → ONNX

python
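import torch

# Assumes `model` is an already-loaded detector (e.g. a torchvision detection model).
model.eval()
dummy = torch.randn(1, 3, 640, 640)  # NCHW dummy input at the deployment resolution
torch.onnx.export(
    model, dummy, "detector.onnx",
    input_names=["images"],
    output_names=["boxes", "labels", "scores"],
    opset_version=17,
    dynamic_axes={"images": {0: "batch"}},  # allow a variable batch size
)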

Verify with ONNX Runtime — compare max diff to PyTorch on same tensor.
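A minimal sketch of that check follows. It assumes the export above (input name `images`, file `detector.onnx`) and a torchvision-style model that returns a list of dicts with `boxes`, `labels`, and `scores`; `max_abs_diff` and `compare_with_onnxruntime` are hypothetical helper names.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise gap between two flat sequences of numbers."""
    reference, candidate = list(reference), list(candidate)
    if len(reference) != len(candidate):
        raise ValueError(f"length mismatch: {len(reference)} vs {len(candidate)}")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def compare_with_onnxruntime(model, dummy, onnx_path="detector.onnx"):
    """Run one tensor through PyTorch and ONNX Runtime and report per-output gaps."""
    # Third-party imports are kept local so max_abs_diff stays usable on its own.
    import onnxruntime as ort
    import torch

    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    onnx_outputs = session.run(None, {"images": dummy.numpy()})

    with torch.no_grad():
        detections = model(dummy)[0]  # assumes torchvision-style output: list of dicts
    names = ["boxes", "labels", "scores"]
    return {
        name: max_abs_diff(detections[name].flatten().tolist(), onnx_out.ravel().tolist())
        for name, onnx_out in zip(names, onnx_outputs)
    }


print(max_abs_diff([0.91, 0.40], [0.91, 0.45]))
```

If the two runtimes return a different number of detections, the length check raises, which is itself a useful signal: it often points to boxes sitting right at a score or NMS threshold.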

YOLO → ONNX / TFLite (Ultralytics)

bash
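yolo export model=best.pt format=onnx simplify=True
# INT8 export runs calibration; set data= to your dataset YAML so it calibrates on in-domain images.
yolo export model=best.pt format=tflite int8=True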

NMS placement

| Option | Pros | Cons |
| --- | --- | --- |
| In model graph | Single runtime call | Export complexity |
| App code (CPU) | Flexible thresholds | Extra latency |
| GPU NMS plugin | Fast for many boxes | Platform-specific |
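For the "App code (CPU)" option, greedy NMS can be sketched in plain Python as below. This is a class-agnostic illustration with assumed default thresholds; production code would typically run it per class and use a vectorized or library implementation.

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5, score_thresh=0.25):
    """Greedy NMS: keep the best box, drop overlapping lower-scored ones, repeat."""
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


boxes = [[10, 10, 110, 110], [12, 12, 112, 112], [200, 200, 260, 260], [0, 0, 5, 5]]
scores = [0.90, 0.80, 0.70, 0.10]
print(nms(boxes, scores))  # [0, 2]
```

The pairwise IoU checks make the cost grow quadratically with the number of candidates, which is why filtering by score before NMS matters so much on CPU.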

Serving pattern (server)

  1. Client uploads image or frame.
  2. Server preprocesses (same as training!).
  3. ONNX Runtime / TensorRT inference.
  4. NMS + filter by score_thresh.
  5. Return JSON: [{label, score, bbox}, ...].
json
{
  "detections": [
    {"label": "person", "score": 0.91, "bbox": [120, 40, 220, 200]}
  ],
  "latency_ms": 28
}
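Steps 4 and 5 can be sketched as a small function that filters post-NMS detections by score and serializes them in the shape shown above. `build_response`, the label map, and the default threshold are hypothetical; the bbox is passed through as four rounded numbers in whatever coordinate convention the model emits.

```python
import json
import time


def build_response(boxes, labels, scores, class_names, started_at, score_thresh=0.5):
    """Filter post-NMS detections by score and serialize them for the client."""
    detections = [
        {
            "label": class_names[label],
            "score": round(float(score), 2),
            "bbox": [int(round(v)) for v in box],
        }
        for box, label, score in zip(boxes, labels, scores)
        if score >= score_thresh
    ]
    latency_ms = int(round((time.perf_counter() - started_at) * 1000))
    return json.dumps({"detections": detections, "latency_ms": latency_ms})


started = time.perf_counter()  # taken when the request arrives
payload = build_response(
    boxes=[[120.2, 40.0, 219.8, 200.1], [5.0, 5.0, 20.0, 20.0]],
    labels=[0, 1],
    scores=[0.912, 0.31],
    class_names=["person", "dog"],  # hypothetical label map
    started_at=started,
)
print(payload)
```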

Module 7 project extends classifier serving to batched inference — same discipline applies.


Monitoring in production

Log per request, then aggregate over a time window:

  • Latency (report p50/p95)
  • Detection count (track the distribution)
  • Max score (track as a histogram)
  • Input resolution

Alert on distribution shift (suddenly zero detections on busy camera) — see Module 7 drift lesson.
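A minimal sketch of two of these checks over a window of logged requests: latency percentiles, and an alert when the share of empty results jumps well above a baseline. The baseline rate, the multiplier, and the minimum window size are assumptions to tune per camera, and both function names are hypothetical.

```python
from statistics import quantiles


def latency_percentiles(latencies_ms):
    """p50 and p95 over a window of per-request latencies."""
    cuts = quantiles(latencies_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94]}


def zero_detection_alert(detection_counts, baseline_zero_rate, factor=3.0, min_requests=50):
    """Flag a window whose share of empty results far exceeds the baseline."""
    if len(detection_counts) < min_requests:
        return False  # too little traffic to judge
    zero_rate = sum(1 for c in detection_counts if c == 0) / len(detection_counts)
    return zero_rate > factor * baseline_zero_rate


window_latencies = list(range(1, 101))  # stand-in for logged latencies in ms
print(latency_percentiles(window_latencies))

busy_camera_counts = [0] * 40 + [3] * 60  # 40% of frames suddenly empty
print(zero_detection_alert(busy_camera_counts, baseline_zero_rate=0.05))  # True
```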


Ship checklist

  • Val mAP documented with IoU threshold
  • score_thresh chosen on val, not test
  • Preprocess documented (RGB order, mean/std, resize)
  • Worst-case latency measured (after 5 min of sustained load, once thermal throttling sets in)
  • Failure gallery reviewed by team
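For the score_thresh item, one possible procedure is a sweep over candidate thresholds on validation predictions, keeping the one with the best F1. The sketch assumes predictions have already been matched to ground truth by IoU and labelled as true or false positives; F1 is just one reasonable criterion, and a product may instead fix a precision or recall target.

```python
def pick_score_thresh(val_predictions, num_ground_truth, candidates=None):
    """Return the (threshold, F1) pair that maximizes F1 on validation predictions.

    val_predictions: (score, is_true_positive) pairs, already matched by IoU.
    """
    candidates = candidates or [i / 20 for i in range(1, 20)]  # 0.05 .. 0.95
    best = (None, -1.0)
    for thresh in candidates:
        kept = [tp for score, tp in val_predictions if score >= thresh]
        true_pos = sum(kept)
        precision = true_pos / len(kept) if kept else 0.0
        recall = true_pos / num_ground_truth if num_ground_truth else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best[1]:  # strict comparison keeps the lowest threshold on ties
            best = (thresh, f1)
    return best


val_predictions = [
    (0.95, True), (0.88, True), (0.62, True),
    (0.52, False), (0.31, False), (0.18, True),
]
thresh, f1 = pick_score_thresh(val_predictions, num_ground_truth=4)
print(thresh, round(f1, 3))  # 0.55 0.857
```

Running the same sweep on the test set would leak test information into a deployment parameter, which is why the checklist insists on val.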

What's next

Module 4 quiz — then detector project.