Module 7 — CV production & deployment

Edge deployment & optimization

INT8 quantization, pruning, knowledge distillation, TFLite and CoreML, NPU delegates, and profiling latency on device.

~75 min read + exercises

Before we begin

Mobile robots, phones, and embedded cameras must run models with low latency and low power draw, which often means INT8 models running on NPUs.


Learning objectives

  • Explain post-training quantization (PTQ) vs quantization-aware training (QAT).
  • Name the roles of TensorRT, TFLite, and CoreML.
  • Describe pruning and distillation at a high level.
  • Profile latency and memory on target hardware.

INT8 quantization

Map float32 weights and activations to 8-bit integers using a scale and a zero-point: real ≈ scale × (q − zero_point). The model is about 4× smaller and runs faster on hardware with INT8 support.
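
The mapping can be sketched in a few lines of plain Python. This is a minimal illustration, assuming signed 8-bit storage (range -128 to 127) and per-tensor parameters; the function names and the example scale of 0.02 are made up for this sketch and do not come from any particular framework.

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an 8-bit integer: q = clamp(round(x / scale) + zero_point)."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Recover an approximate float from the stored integer."""
    return scale * (q - zero_point)


# Illustrative parameters (assumed): each integer step represents 0.02.
scale, zero_point = 0.02, 0

q = quantize(0.5, scale, zero_point)
restored = dequantize(q, scale, zero_point)

# Values outside the representable range saturate at qmin / qmax.
q_clipped = quantize(10.0, scale, zero_point)
```

Values inside the representable range come back with an error of at most half a step (scale / 2); values outside it are clipped, which is why choosing the scale well matters.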

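PTQ: calibrate scale and zero-point on representative batches after training. Fast, but may lose 1–2% accuracy.
QAT: simulate quantization during training so the weights adapt to it. Needs a training run, but usually gives better accuracy at INT8.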
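
Both ideas can be sketched without a framework. The example below assumes the simplest calibration rule (track the min and max seen over the calibration batches) and a per-tensor signed INT8 range; real toolchains offer other calibrators and handle gradients for you. `calibrate` and `fake_quantize` are hypothetical names used only here.

```python
def calibrate(batches, qmin=-128, qmax=127):
    """PTQ-style calibration: derive scale and zero-point from observed min/max."""
    lo = min(min(batch) for batch in batches)
    hi = max(max(batch) for batch in batches)
    # Keep 0.0 inside the range so it maps exactly onto an integer.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against an all-zero tensor
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """QAT-style op: quantize then immediately dequantize, staying in float."""
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return scale * (q - zero_point)


# Hypothetical activation values standing in for representative batches.
calibration_batches = [[0.0, 1.0, 2.0], [0.5, 2.55]]
scale, zero_point = calibrate(calibration_batches)

in_range = fake_quantize(1.0, scale, zero_point)    # close to 1.0
saturated = fake_quantize(5.0, scale, zero_point)   # clipped to the calibrated max
```

In QAT, an operation like `fake_quantize` sits in the forward pass so the loss sees the rounding and clipping error; frameworks typically pass gradients through it unchanged (a straight-through estimator).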


ONNX → TensorRT

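Export the model to ONNX, then build a TensorRT engine for the specific target GPU. The builder fuses layers and picks kernels at the precision you allow (FP16 or INT8). Profile the result with trtexec or NVIDIA's profiling tools.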
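
As a rough sketch, the build step is often driven through the trtexec command line. The helper below only assembles the argument list; the file names are placeholders, and while `--onnx`, `--saveEngine`, `--fp16`, and `--int8` are trtexec flags to the best of my knowledge, check `trtexec --help` for your TensorRT version before relying on them.

```python
import shlex


def trtexec_command(onnx_path, engine_path, precision="fp16"):
    """Assemble a trtexec invocation that builds and times an engine."""
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if precision == "fp16":
        cmd.append("--fp16")
    elif precision == "int8":
        # Assumption: without calibration data, INT8 numbers from this run
        # are useful for latency only, not as an accuracy check.
        cmd.append("--int8")
    elif precision != "fp32":
        raise ValueError(f"unsupported precision: {precision}")
    return cmd


# Placeholder file names; run the printed command on the target GPU.
print(shlex.join(trtexec_command("model.onnx", "model.engine", "fp16")))
```

Pass the returned list to `subprocess.run(cmd, check=True)` on the deployment machine to build the engine there and read the latency summary trtexec prints.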


Mobile

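On Android, TFLite runs on the CPU by default and can offload work to the GPU or NPU through the GPU and NNAPI delegates. On Apple devices, CoreML plays the equivalent role. Watch operator support: ops a delegate cannot handle fall back to slower CPU paths.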
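
To see why one unsupported op can be expensive, here is a toy model of delegate partitioning. It assumes a strictly linear graph and a made-up supported-op set (real graphs are DAGs and real delegates publish their own op lists); `CUSTOM_NMS` is a hypothetical op name, and `partition_for_delegate` is not a TFLite API.

```python
def partition_for_delegate(ops, supported):
    """Group consecutive ops into (target, [ops]) segments."""
    segments = []
    for op in ops:
        target = "delegate" if op in supported else "cpu"
        if segments and segments[-1][0] == target:
            segments[-1][1].append(op)   # extend the current segment
        else:
            segments.append((target, [op]))  # start a new segment
    return segments


# Hypothetical model and delegate capabilities.
model_ops = ["CONV_2D", "RELU", "CUSTOM_NMS", "CONV_2D", "SOFTMAX"]
delegate_ops = {"CONV_2D", "RELU", "SOFTMAX"}

segments = partition_for_delegate(model_ops, delegate_ops)
cpu_fallbacks = [op for target, group in segments if target == "cpu" for op in group]
```

Here the single unsupported op splits the graph into three segments, so tensors cross between accelerator and CPU twice. That copying, on top of the slower CPU kernel, is what usually shows up in on-device profiles.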


Pruning & distillation

Pruning: remove low-magnitude weights or whole channels, then fine-tune briefly to recover accuracy.
Distillation: train a small student to mimic a large teacher's logits, yielding a compact model for edge devices.
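
Both techniques reduce to short formulas. The sketch below assumes unstructured magnitude pruning on a flat weight list and the common temperature-softened KL form of the distillation loss; the function names, the 50% sparsity, and the temperature of 2.0 are illustrative choices, not values from this lesson.

```python
import math


def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


pruned = magnitude_prune([0.5, -0.01, 0.2, 0.03], sparsity=0.5)

teacher = [1.0, 2.0, 3.0]
loss_same = distillation_loss([1.0, 2.0, 3.0], teacher)   # student agrees
loss_diff = distillation_loss([3.0, 2.0, 1.0], teacher)   # student disagrees
```

In practice the distillation term is usually mixed with the ordinary label loss, and the pruned model is fine-tuned with the zeroed weights held at zero.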


Trade-off table

Technique            Latency         Accuracy risk
FP16                 Moderate gain   Usually low
INT8 PTQ             Large gain      Medium
Smaller backbone     Large gain      Architecture-dependent
Input resolution ↓   Large gain      Can hurt small objects
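
The latency column only means something when measured on the target device. A minimal timing harness might look like the following; it assumes the model is wrapped in a zero-argument callable (here a stand-in workload) and that calls are synchronous, which is not true for every accelerator runtime.

```python
import statistics
import time


def profile_latency(fn, warmup=10, runs=100):
    """Time repeated calls to fn and report latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm caches, JIT, and lazy initialization before timing
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "max_ms": samples_ms[-1],
    }


# Stand-in for a real inference call such as interpreter.invoke().
stats = profile_latency(lambda: sum(range(1000)), warmup=2, runs=20)
```

Report the median together with a tail figure such as p95, since thermal throttling and background load on phones and embedded boards tend to show up in the tail rather than the median.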

What's next

Lesson 3 — Monitoring, drift & retraining