On-device detection & deployment
Before we begin
A detector that scores 45 mAP on a workstation is only useful on a phone if it runs at acceptable latency, holds a stable frame rate, and stays within thermal limits. This lesson connects model choices to milliseconds, megabytes, and ship/no-ship product calls.
What you will learn
- Budget latency, memory, and thermal headroom for detection pipelines.
- Compare FP32, FP16, INT8 for detector backbones and heads.
- Export torchvision / YOLO → ONNX → TensorRT / TFLite.
- Place NMS on CPU vs GPU vs in-graph.
- Define a minimal production checklist before launch.
Where time goes (typical pipeline)
| Stage | ms (illustrative mobile) | Notes |
|---|---|---|
| Preprocess (resize, norm) | 2–8 | Often CPU |
| Backbone + heads | 15–80 | GPU/NPU |
| Decode boxes | 1–5 | |
| NMS | 2–20 | CPU if many candidates |
| Draw / tracking | 1–5 | |
Rule: profile end-to-end — optimizing backbone while NMS dominates is wasted effort.
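One way to act on this rule is to time every stage separately before optimizing anything. The sketch below is a minimal illustration, not a prescribed tool: `StageTimer` and the stage names are hypothetical, and the `time.sleep` calls stand in for real preprocess and NMS work. For GPU or NPU stages, the device must be synchronized before the clock stops, otherwise the measured time only covers the kernel launch.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.totals_ms = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raised.
            self.totals_ms[name] += (time.perf_counter() - start) * 1000.0
            self.counts[name] += 1

    def mean_ms(self):
        return {k: self.totals_ms[k] / self.counts[k] for k in self.totals_ms}

    def dominant_stage(self):
        means = self.mean_ms()
        return max(means, key=means.get)


# Stand-in workloads; replace the sleeps with real pipeline calls.
timer = StageTimer()
for _ in range(3):
    with timer.stage("preprocess"):
        time.sleep(0.002)
    with timer.stage("nms"):
        time.sleep(0.02)

print(timer.mean_ms())
print("optimize first:", timer.dominant_stage())
```

Averaging over many frames matters here because single-frame timings on mobile hardware are noisy.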
Resolution vs accuracy vs cost
| Input | Relative compute | Small object AP |
|---|---|---|
| 320×320 | 1× | Lower |
| 640×640 | ~4× | Baseline |
| 1280×1280 | ~16× | Higher |
Doubling the side length quadruples the pixel count, so compute and activation memory in early conv layers both grow ≈ 4×.
Product pattern: preview stream at 320, capture high-res on demand for detail.
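The relative-compute column in the table follows from pixel count alone. As a rough sketch, assuming cost scales linearly with the number of input pixels (a simplification that ignores fixed overheads), the helper below reproduces those multipliers. The function name `relative_cost` is made up for this example.

```python
def relative_cost(side, base_side=320):
    """Cost of a square input relative to base_side, assuming cost scales with pixel count."""
    return (side / base_side) ** 2


for side in (320, 640, 1280):
    print(f"{side}x{side}: ~{relative_cost(side):.0f}x")
```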
Figure: Deployment constraints
Model selection for edge
| Model | Params (order) | Typical role |
|---|---|---|
| YOLOv8n | ~3M | Phone real-time |
| MobileNet-SSD | ~5M | Classic mobile |
| Faster R-CNN R50-FPN | ~40M+ | Server / batch |
| YOLOv8x | ~68M | Offline accuracy |
Knowledge distillation: train small student to mimic large teacher logits + box outputs.
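To make the distillation objective concrete, here is a minimal pure-Python sketch for a single matched prediction. It assumes a temperature-softened KL term on the class logits plus a mean L1 term on the box coordinates; the function names, the temperature of 2.0, and the weighting are illustrative choices, not values from this lesson. A real detector also has to match student and teacher predictions to each other before comparing them, which is omitted here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, student_box, teacher_box,
                      temperature=2.0, box_weight=1.0):
    """KL(teacher || student) on softened class scores plus mean L1 on boxes."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(max(ps, 1e-12)))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    box_l1 = sum(abs(s - t) for s, t in zip(student_box, teacher_box)) / len(teacher_box)
    # The temperature**2 factor keeps the soft-target gradient scale comparable
    # across temperatures.
    return (temperature ** 2) * kl + box_weight * box_l1


# Hypothetical values for one prediction.
loss_same = distillation_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0], [0, 0, 1, 1], [0, 0, 1, 1])
loss_diff = distillation_loss([0.0, 1.0, 2.0], [2.0, 1.0, 0.0], [0, 0, 1, 1], [0, 0, 2, 2])
print(loss_same, loss_diff)
```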
Quantization
Post-training quantization (PTQ)
- Run 100–500 calibration images through model.
- Record activation ranges per layer.
- Map to INT8 scale/zero-point.
Risk: small objects and tightly localized boxes degrade first, so always re-run mAP@0.5 on val after quantizing.
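The last two calibration steps can be sketched for a single layer as follows. This assumes asymmetric (affine) per-tensor quantization to signed INT8 with plain min/max range tracking; real toolchains often use percentile or entropy-based ranges instead. The calibration values are invented for illustration.

```python
def int8_affine_params(x_min, x_max, qmin=-128, qmax=127):
    """Map an observed float range to an INT8 (scale, zero_point) pair."""
    # The range must contain 0.0 so that zero (e.g. padding) is represented exactly.
    x_min = min(x_min, 0.0)
    x_max = max(x_max, 0.0)
    if x_max == x_min:
        return 1.0, 0
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = round(qmin - x_min / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Float -> INT8 with clamping to the representable range."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """INT8 -> approximate float."""
    return (q - zero_point) * scale


# Hypothetical activations recorded for one layer over two calibration batches.
calibration_batches = [[-0.8, 0.1, 2.9], [-1.0, 0.4, 3.0]]
observed_min = min(min(batch) for batch in calibration_batches)
observed_max = max(max(batch) for batch in calibration_batches)
scale, zero_point = int8_affine_params(observed_min, observed_max)

value = 1.234
restored = dequantize(quantize(value, scale, zero_point), scale, zero_point)
print(scale, zero_point, abs(restored - value))
```

The round-trip error is bounded by roughly one quantization step (`scale`), which is why a layer with a wide activation range loses the most precision.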
Quantization-aware training (QAT)
Insert fake-quant nodes during fine-tuning so the weights adapt to INT8 rounding; this typically recovers 1–3 mAP points vs PTQ.
Precision ladder
| Dtype | Speed | Accuracy |
|---|---|---|
| FP32 | Baseline | Best |
| FP16 | ~1.5–2× GPU | Usually fine |
| INT8 | 2–4× | Validate per class |
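"Validate per class" can be as simple as comparing per-class AP before and after quantization and flagging the classes that fall too far. The sketch below assumes you already have per-class AP dictionaries from your evaluation tool; the function name, the 0.02 tolerance, and the AP numbers are placeholders for illustration, not measured results.

```python
def classes_regressed(ap_fp32, ap_int8, max_drop=0.02):
    """Return {class_name: drop} for classes whose AP fell by more than max_drop."""
    regressed = {}
    for name, baseline in ap_fp32.items():
        drop = baseline - ap_int8.get(name, 0.0)
        if drop > max_drop:
            regressed[name] = round(drop, 4)
    return regressed


# Placeholder numbers, not real measurements.
ap_fp32 = {"person": 0.62, "bicycle": 0.41, "traffic light": 0.30}
ap_int8 = {"person": 0.61, "bicycle": 0.40, "traffic light": 0.22}
print(classes_regressed(ap_fp32, ap_int8))
```

An aggregate mAP can hide a single class collapsing, which is the case this check is meant to surface.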
Export paths
PyTorch → ONNX
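    import torch

    # `model` is the trained detector, assumed to be loaded already.
    model.eval()  # inference behavior for batch norm / dropout
    dummy = torch.randn(1, 3, 640, 640)  # example input: batch 1, RGB, 640x640
    torch.onnx.export(
        model, dummy, "detector.onnx",
        input_names=["images"],
        output_names=["boxes", "labels", "scores"],
        opset_version=17,
        dynamic_axes={"images": {0: "batch"}},  # allow a variable batch size
    )

Verify with ONNX Runtime: compare the max absolute difference against the PyTorch outputs on the same input tensor.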
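A hedged sketch of that verification step is below. `max_abs_diff` and `check_export` are hypothetical helper names; `check_export` assumes the PyTorch outputs have already been converted to NumPy arrays in the same order as the ONNX outputs, and that the exported input is named `images` as in the export call. Detectors can emit a different number of boxes when scores sit near a threshold, in which case the length check raises instead of reporting a misleading difference.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two flat sequences."""
    reference, candidate = list(reference), list(candidate)
    if len(reference) != len(candidate):
        raise ValueError("outputs differ in length; compare shapes first")
    return max((abs(float(r) - float(c)) for r, c in zip(reference, candidate)),
               default=0.0)


def check_export(torch_outputs, input_array, onnx_path="detector.onnx"):
    """Run input_array through ONNX Runtime and diff each output against PyTorch.

    torch_outputs: list of NumPy arrays from the PyTorch model (assumed order:
    same as the ONNX graph outputs). input_array: float32 NumPy array.
    """
    # Imported lazily so max_abs_diff stays usable without onnxruntime installed.
    import onnxruntime as ort

    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    onnx_outputs = session.run(None, {"images": input_array})
    return [
        max_abs_diff(t.ravel(), o.ravel())
        for t, o in zip(torch_outputs, onnx_outputs)
    ]


# Pure-Python illustration of the comparison itself.
print(max_abs_diff([0.91, 0.40], [0.91, 0.41]))
```

What counts as an acceptable difference depends on precision: FP32 exports usually agree closely, while FP16 or INT8 engines need a looser tolerance and a task-level check such as mAP.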
YOLO → ONNX / TFLite (Ultralytics)
    yolo export model=best.pt format=onnx simplify=True
    yolo export model=best.pt format=tflite int8=True

NMS placement
| Option | Pros | Cons |
|---|---|---|
| In model graph | Single runtime call | Export complexity |
| App code (CPU) | Flexible thresholds | Extra latency |
| GPU NMS plugin | Fast for many boxes | Platform-specific |
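For the "App code (CPU)" row, a minimal greedy NMS looks like the sketch below. It is class-agnostic, pure Python, and assumes boxes in `[x1, y1, x2, y2]` format; the default thresholds are illustrative. The quadratic loop is what makes CPU NMS slow when many candidates survive the score filter, so filtering by score first is the cheapest optimization.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5, score_thresh=0.25):
    """Greedy NMS; returns indices of kept boxes, highest score first."""
    # Drop low-score candidates before the O(n^2) overlap checks.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```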
Serving pattern (server)
- Client uploads image or frame.
- Server preprocesses (same as training!).
- ONNX Runtime / TensorRT inference.
- NMS + filter by score_thresh.
- Return JSON: [{bbox, class, score}, ...].
    {
      "detections": [
        {"label": "person", "score": 0.91, "bbox": [120, 40, 220, 200]}
      ],
      "latency_ms": 28
    }

The Module 7 project extends classifier serving to batched inference; the same discipline applies.
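The last two serving steps, filtering by score and returning JSON, might be assembled as in this sketch. `build_response`, its arguments, and the 0.5 default threshold are assumptions for illustration; the output keys follow the JSON example above.

```python
import json


def build_response(boxes, labels, scores, class_names, latency_ms, score_thresh=0.5):
    """Filter post-NMS detections by score and serialize them as JSON."""
    detections = [
        {
            "label": class_names[label],
            "score": round(float(score), 2),
            "bbox": [round(v) for v in box],  # integer pixel coordinates
        }
        for box, label, score in zip(boxes, labels, scores)
        if score >= score_thresh
    ]
    return json.dumps({"detections": detections, "latency_ms": latency_ms})


# Hypothetical post-NMS outputs for one frame.
body = build_response(
    boxes=[[120.2, 40.0, 219.8, 200.1], [5.0, 5.0, 20.0, 20.0]],
    labels=[0, 1],
    scores=[0.912, 0.30],
    class_names=["person", "bicycle"],
    latency_ms=28,
)
print(body)
```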
Monitoring in production
Log per request:
- Latency p50/p95
- Detection count distribution
- Max score histogram
- Input resolution
Alert on distribution shift (suddenly zero detections on busy camera) — see Module 7 drift lesson.
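Two of these signals are easy to compute from a rolling window of request logs. The sketch below is one possible approach using only the standard library: `latency_percentiles` and `zero_detection_alert` are hypothetical names, and the window size and 90% threshold are placeholders to tune per camera.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return (p50, p95) from a window of per-request latencies (needs >= 2 samples)."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


def zero_detection_alert(detection_counts, window=50, max_zero_fraction=0.9):
    """True if most of the last `window` requests returned no detections."""
    recent = detection_counts[-window:]
    if len(recent) < window:
        return False  # not enough data to judge
    zero_fraction = sum(1 for c in recent if c == 0) / len(recent)
    return zero_fraction >= max_zero_fraction


# Synthetic logs: steady latencies, then a camera that suddenly sees nothing.
latencies = list(range(1, 102))
counts = [3] * 50 + [0] * 50
p50, p95 = latency_percentiles(latencies)
print(p50, p95, zero_detection_alert(counts))
```

A zero-detection alert only makes sense for cameras that are normally busy, so the threshold should come from that camera's own history.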
Ship checklist
- Val mAP documented with IoU threshold
- score_thresh chosen on val, not test
- Preprocess documented (RGB order, mean/std, resize)
- Worst-case latency measured (thermal throttling after 5 min)
- Failure gallery reviewed by team
What's next
Module 4 quiz — then detector project.