Module 7 — CV production & deployment

Model serving for vision

REST vs gRPC inference, batching, TensorRT and ONNX Runtime, warm-up, preprocessing on server vs client, and container basics.

~70 min read + exercises

Before we begin

Serving turns a trained weights file into a service that other applications call. Vision workloads add heavy preprocessing (decode, resize, normalize) and large request payloads.


Learning objectives

  • Compare REST vs gRPC for image inference.
  • Design preprocessing parity between train and serve.
  • Use batching and warm-up for stable latency.
  • Outline an ONNX Runtime deployment path.

Preprocessing contract

Document and test each of the following so that training and serving apply identical transforms:

  • Resize dimensions and crop policy
  • RGB vs BGR
  • Normalization mean/std
  • NCHW vs NHWC tensor layout

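Even a small mismatch, such as a slightly different resize size or interpolation mode, can noticeably degrade detection mAP in production. The failure is silent: the model still returns plausible-looking outputs, so only a parity test catches it.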
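One way to make the contract testable is to write it down as data and compare the serving path against a reference path on a few fixed images. The sketch below is illustrative only: `PreprocessContract`, `apply_contract`, and `find_mismatches` are hypothetical names, the defaults mirror the 224x224 ImageNet-style values used in this lesson, and resizing is assumed to have happened already (the resize interpolation mode would belong in the contract as well).

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class PreprocessContract:
    """Hypothetical record of the preprocessing both train and serve must follow."""
    height: int = 224
    width: int = 224
    channel_order: str = "RGB"               # "RGB" or "BGR"
    mean: tuple = (0.485, 0.456, 0.406)      # listed in channel_order
    std: tuple = (0.229, 0.224, 0.225)       # listed in channel_order
    layout: str = "NCHW"                     # "NCHW" or "NHWC"

def apply_contract(img_rgb: np.ndarray, c: PreprocessContract) -> np.ndarray:
    """Turn an already-resized HxWx3 uint8 RGB image into a float32 model input."""
    if img_rgb.shape != (c.height, c.width, 3):
        raise ValueError(f"expected {(c.height, c.width, 3)}, got {img_rgb.shape}")
    x = img_rgb.astype(np.float32) / 255.0
    if c.channel_order == "BGR":
        x = x[..., ::-1]                      # reorder channels before normalizing
    x = (x - np.asarray(c.mean, dtype=np.float32)) / np.asarray(c.std, dtype=np.float32)
    if c.layout == "NCHW":
        x = x.transpose(2, 0, 1)              # HWC -> CHW
    return np.ascontiguousarray(x[None, ...])  # add the batch dimension

def find_mismatches(reference_fn, serving_fn, images, atol=1e-5):
    """Return one message per image where the two pipelines disagree."""
    problems = []
    for i, img in enumerate(images):
        ref, out = reference_fn(img), serving_fn(img)
        if ref.shape != out.shape:
            problems.append(f"image {i}: shape {out.shape} != {ref.shape}")
        elif ref.dtype != out.dtype:
            problems.append(f"image {i}: dtype {out.dtype} != {ref.dtype}")
        elif not np.allclose(ref, out, atol=atol):
            problems.append(f"image {i}: max abs diff {np.abs(ref - out).max():.4f}")
    return problems

contract = PreprocessContract()
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8) for _ in range(3)]

def reference(img):
    return apply_contract(img, contract)

def buggy_serving(img):
    # Simulates a serving path that feeds BGR pixels into an RGB pipeline.
    return apply_contract(img[..., ::-1], contract)

clean_report = find_mismatches(reference, reference, images)
buggy_report = find_mismatches(reference, buggy_serving, images)
print(clean_report)
print(buggy_report)
```

In a real project the reference function would wrap the training-time transform and the serving function would wrap the server's `preprocess`, with the check run in CI on a small set of stored images.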


REST API sketch (FastAPI)

python
from fastapi import FastAPI, File, UploadFile
import numpy as np
import onnxruntime as ort
from PIL import Image
import io
 
app = FastAPI()
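# Load the model once at startup; creating a session per request is slow.
sess = ort.InferenceSession("classifier.onnx", providers=["CPUExecutionProvider"])
# Read the input tensor name from the model instead of hard-coding it.
input_name = sess.get_inputs()[0].name

# ImageNet normalization constants, kept as float32 so the result is not
# silently promoted to float64 (which a float32 model input would reject).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img: Image.Image) -> np.ndarray:
    # Must match training: RGB order, 224x224, same normalization, NCHW layout.
    img = img.convert("RGB").resize((224, 224))
    x = np.array(img).astype(np.float32) / 255.0   # HWC, values in [0, 1]
    x = (x - MEAN) / STD
    # HWC -> 1xCxHxW, copied into contiguous memory for the runtime.
    return np.ascontiguousarray(x.transpose(2, 0, 1)[None, ...])

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    img = Image.open(io.BytesIO(await file.read()))
    logits = sess.run(None, {input_name: preprocess(img)})[0]
    return {"class_id": int(logits.argmax())}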

Batching & warm-up

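  • Dynamic batching: hold incoming requests in a queue for a few milliseconds, then run them as one batch on the GPU. This trades a small, bounded delay for higher throughput.
  • Warm-up: run a few dummy inferences at startup, before accepting traffic, so that one-time costs (lazy initialization, memory allocation) do not show up as p99 latency spikes on the first real requests.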
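The queue-then-batch idea can be sketched with `asyncio` alone. This is a minimal illustration, not a production batcher: `MicroBatcher`, `run_batch`, and `fake_model` are hypothetical names, and `fake_model` stands in for a real call such as `sess.run` on a model exported with a dynamic batch dimension.

```python
import asyncio
import numpy as np

class MicroBatcher:
    """Collects single-image requests for a short window and runs them as one batch."""

    def __init__(self, run_batch, max_batch_size=8, max_wait_ms=5.0):
        self.run_batch = run_batch            # callable: (N, C, H, W) -> (N, ...)
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_ms / 1000.0
        self.queue = asyncio.Queue()
        self.batch_sizes = []                 # recorded for inspection only

    async def infer(self, x: np.ndarray) -> np.ndarray:
        """Called once per request with a (C, H, W) tensor; returns that request's output."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def worker(self):
        loop = asyncio.get_running_loop()
        while True:
            items = [await self.queue.get()]  # wait for the first request
            deadline = loop.time() + self.max_wait_s
            # Keep collecting until the batch is full or the window closes.
            while len(items) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            batch = np.stack([x for x, _ in items])
            self.batch_sizes.append(len(items))
            try:
                # Run the model off the event loop so new requests can still queue up.
                out = await loop.run_in_executor(None, self.run_batch, batch)
                for (_, fut), row in zip(items, out):
                    if not fut.done():
                        fut.set_result(row)
            except Exception as exc:
                for _, fut in items:
                    if not fut.done():
                        fut.set_exception(exc)

def fake_model(batch: np.ndarray) -> np.ndarray:
    # Stand-in for real inference: one "logit" per image (its mean pixel value).
    return batch.mean(axis=(1, 2, 3)).reshape(-1, 1)

async def demo():
    # Build the batcher inside the running loop so the queue binds to it.
    batcher = MicroBatcher(fake_model, max_batch_size=4, max_wait_ms=20)
    task = asyncio.create_task(batcher.worker())
    inputs = [np.full((3, 2, 2), i, dtype=np.float32) for i in range(6)]
    outputs = await asyncio.gather(*(batcher.infer(x) for x in inputs))
    task.cancel()
    return outputs, batcher.batch_sizes

outputs, batch_sizes = asyncio.run(demo())
print([float(o[0]) for o in outputs], batch_sizes)
```

The two knobs pull in opposite directions: a larger `max_wait_ms` fills batches more often but adds up to that much latency to every request, so the value is usually chosen against a latency budget.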

Health checks

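GET /health returns the model version and readiness. Kubernetes probes and load balancers poll an endpoint like this to decide when an instance can receive traffic, so it should report ready only after the model has loaded and warmed up.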
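Readiness and warm-up fit together naturally: the service reports not-ready until the dummy inferences have finished. The sketch below keeps that logic framework-free so it can be read on its own; `ModelService`, its methods, the version string, and the payload fields are hypothetical, and `fake_infer` stands in for a real inference call.

```python
import time

class ModelService:
    """Tracks model version and readiness; reports ready only after warm-up."""

    def __init__(self, infer, version: str):
        self.infer = infer            # callable that runs one inference
        self.version = version
        self.ready = False
        self.warmup_ms = []           # per-run warm-up latency, for logging

    def warm_up(self, dummy_input, runs: int = 3) -> None:
        """Run a few dummy inferences so one-time costs are paid before real traffic."""
        for _ in range(runs):
            start = time.perf_counter()
            self.infer(dummy_input)
            self.warmup_ms.append((time.perf_counter() - start) * 1000.0)
        self.ready = True

    def health(self):
        """Return (HTTP status code, JSON-serializable body) for GET /health."""
        status = 200 if self.ready else 503
        body = {
            "status": "ready" if self.ready else "loading",
            "model_version": self.version,
        }
        return status, body

calls = []

def fake_infer(x):
    # Stand-in for a real model call; records that it was invoked.
    calls.append(x)
    return 0

service = ModelService(fake_infer, version="classifier-v1")
before = service.health()             # not ready yet
service.warm_up(dummy_input=[0.0] * 4, runs=3)
after = service.health()              # ready once warm-up has completed
print(before, after)
```

In a FastAPI app, one plausible wiring is to call `warm_up` during application startup and have the `/health` route return the body with the matching status code, for example through `JSONResponse(content=body, status_code=status)`.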


What's next

Lesson 2 — Edge deployment & optimization