Model serving for vision

Before we begin

Serving turns a trained weights file into a service that other applications can call. Vision models add two complications: heavy image preprocessing and large request payloads.

Learning objectives

Compare REST vs gRPC for image inference.
Design preprocessing parity between train and serve.
Use batching and warm-up for stable latency.
Outline an ONNX Runtime deployment path.

Preprocessing contract

Document and test:

Resize dimensions and crop policy
RGB vs BGR
Normalization mean/std
NCHW vs NHWC tensor layout

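Even a one-pixel resize mismatch between training and serving can noticeably lower detection mAP in production, so treat these settings as part of the model's interface.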
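One way to make the contract testable is to write it down as data and compare the training and serving copies. The sketch below is only an illustration: `PreprocessSpec`, its field names, and `assert_parity` are hypothetical, and the default values simply mirror the 224x224 RGB, NCHW settings used in the FastAPI sketch in this lesson.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class PreprocessSpec:
    """Hypothetical record of the preprocessing contract (illustrative fields)."""
    resize_hw: Tuple[int, int] = (224, 224)
    crop_policy: str = "none"        # e.g. "none" or "center"
    channel_order: str = "RGB"       # "RGB" or "BGR"
    mean: Tuple[float, float, float] = (0.485, 0.456, 0.406)
    std: Tuple[float, float, float] = (0.229, 0.224, 0.225)
    layout: str = "NCHW"             # "NCHW" or "NHWC"

    def fingerprint(self) -> str:
        """Short stable hash of the spec, e.g. to store next to the model file."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def assert_parity(train: PreprocessSpec, serve: PreprocessSpec) -> None:
    """Raise ValueError listing every field where the two specs disagree."""
    train_d, serve_d = asdict(train), asdict(serve)
    diffs = {k: (train_d[k], serve_d[k]) for k in train_d if train_d[k] != serve_d[k]}
    if diffs:
        raise ValueError(f"preprocessing mismatch (train, serve): {diffs}")
```

A serving process could call `assert_parity` at startup with the spec saved during training and refuse to start on a mismatch. This checks only the declared settings; comparing actual tensors on a few fixed images is still worthwhile.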

REST API sketch (FastAPI)

python

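from fastapi import FastAPI, File, UploadFile
import numpy as np
import onnxruntime as ort
from PIL import Image
import io

app = FastAPI()
sess = ort.InferenceSession("classifier.onnx", providers=["CPUExecutionProvider"])
# Read the input name from the model instead of hard-coding "input".
input_name = sess.get_inputs()[0].name

# ImageNet statistics, kept in float32 so the tensor is not promoted to float64
# (a float32 model rejects float64 input).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img: Image.Image) -> np.ndarray:
    img = img.convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32) / 255.0  # HWC, values in [0, 1]
    x = (x - MEAN) / STD
    # HWC -> CHW, then add a batch dimension: shape (1, 3, 224, 224)
    return np.ascontiguousarray(x.transpose(2, 0, 1)[None, ...])

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    img = Image.open(io.BytesIO(await file.read()))
    logits = sess.run(None, {input_name: preprocess(img)})[0]
    return {"class_id": int(logits.argmax())}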

Batching & warm-up

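Dynamic batching: queue incoming requests for a few milliseconds, then run them as one batch on the GPU. This trades a small, bounded wait for higher throughput.
Warm-up: run a few dummy inferences at deploy time, before accepting traffic, so one-time initialization costs do not show up as p99 latency spikes on the first real requests.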
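Both ideas can be sketched with the standard library alone. The code below is a minimal illustration, not a production batcher: `MicroBatcher`, `warm_up`, `demo`, and the `max_batch` / `max_wait_ms` parameters are hypothetical names, and `run_batch` stands in for whatever batched model call the service actually makes (for example a batched `sess.run`).

```python
import asyncio
import time
from typing import Callable, List, Sequence


class MicroBatcher:
    """Groups single requests into batches (hypothetical helper, not a library API)."""

    def __init__(self, run_batch: Callable[[List], Sequence],
                 max_batch: int = 8, max_wait_ms: float = 5.0):
        self.run_batch = run_batch              # stand-in for a batched model call
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000.0    # seconds
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item):
        """Enqueue one input and wait until its result is available."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def worker(self):
        """Collect up to max_batch items, or whatever arrives within max_wait."""
        while True:
            item, fut = await self.queue.get()  # block until the first request
            items, futs = [item], [fut]
            deadline = time.monotonic() + self.max_wait
            while len(items) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    item, fut = await asyncio.wait_for(self.queue.get(), remaining)
                except asyncio.TimeoutError:
                    break
                items.append(item)
                futs.append(fut)
            try:
                # Run the blocking model call in a thread so the event loop stays free.
                outputs = await asyncio.to_thread(self.run_batch, items)
                for f, out in zip(futs, outputs):
                    if not f.done():
                        f.set_result(out)
            except Exception as exc:
                for f in futs:
                    if not f.done():
                        f.set_exception(exc)


def warm_up(run_batch: Callable[[List], Sequence], dummy, rounds: int = 3) -> List[float]:
    """Run a few throwaway inferences; return each round's latency in seconds."""
    timings = []
    for _ in range(rounds):
        start = time.perf_counter()
        run_batch([dummy])
        timings.append(time.perf_counter() - start)
    return timings


async def demo():
    """Warm up a fake model, then push six requests through the batcher."""
    batch_sizes = []

    def fake_model(items):                      # placeholder for real inference
        batch_sizes.append(len(items))
        return [x * 2 for x in items]

    warm_up(fake_model, dummy=0)
    batch_sizes.clear()                         # ignore warm-up calls in the stats
    batcher = MicroBatcher(fake_model, max_batch=4, max_wait_ms=20)
    task = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(6)))
    task.cancel()
    return results, batch_sizes
```

Running `asyncio.run(demo())` returns the doubled inputs in request order together with the batch sizes the fake model saw. In a real service the wait window and maximum batch size would be tuned against measured latency, and many teams would rely on a serving framework's built-in batching rather than hand-rolled code like this.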

Health checks

GET /health returns the model version and readiness status. Kubernetes readiness probes and load balancers poll an endpoint like this to decide whether to route traffic to the instance.
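The logic behind such an endpoint can be kept framework-independent. The sketch below assumes a hypothetical `ModelState` holder and `health_response` helper; the JSON field names are illustrative, and returning 503 while the model is not ready is one common convention rather than a requirement.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ModelState:
    """Hypothetical holder for what the health endpoint reports."""
    version: str
    ready: bool = False   # set to True once the model is loaded and warmed up


def health_response(state: ModelState) -> Tuple[int, Dict[str, object]]:
    """Return (HTTP status, JSON body): 200 when ready, 503 otherwise."""
    status = 200 if state.ready else 503
    return status, {"model_version": state.version, "ready": state.ready}


# Possible wiring in the FastAPI app from this lesson (shown as comments so the
# block runs without FastAPI installed):
#
# from fastapi.responses import JSONResponse
#
# @app.get("/health")
# def health():
#     status, body = health_response(state)
#     return JSONResponse(status_code=status, content=body)
```

Flipping `ready` only after warm-up completes keeps traffic away from an instance that would otherwise serve its first requests slowly.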

What's next

Lesson 2 — Edge deployment & optimization