Module 10 — Production & scaling

Model serving & deployment

Hosted APIs vs self-hosting, Next.js API routes, streaming, timeouts, and deployment patterns for GenAI workloads.

~70 min read + exercises

Before we begin

Training happens offline. Serving happens on every user request — your app calls a model and returns an answer under time and cost constraints.

Figure: Typical serving stack

Client (Next.js UI) → API route (auth + limits) → Orchestration (RAG / agent) → Model (hosted or GPU)
UI → API route → orchestration (RAG/agent) → model provider or GPU server.

What you will learn

  • Define model serving vs training vs fine-tuning.
  • Choose hosted API vs self-hosted inference.
  • Deploy GenAI routes in Next.js with streaming and timeouts.


Hosted API vs self-hosting

| | Hosted (OpenAI, Anthropic, etc.) | Self-hosted (vLLM, Ollama, TGI) |
| --- | --- | --- |
| Setup | Minutes | Days–weeks (GPU, ops) |
| Cost model | Per token | Fixed GPU + engineering |
| Best for | MVPs, most products | High volume, strict privacy |
| Scaling | Provider handles | You handle replicas |

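A common path is to start hosted, measure unit economics (what each request costs at your real traffic), then evaluate self-hosting once volume or privacy requirements justify the fixed GPU and engineering cost.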
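To make "measure unit economics" concrete, here is a minimal sketch that compares per-token hosted pricing with a flat monthly self-hosting bill. The type names, function names, and every figure you pass in are hypothetical placeholders, not real provider prices; the model also assumes self-hosted cost stays flat, which stops being true once you need more replicas.

```typescript
// All inputs are hypothetical placeholders; plug in your own measured numbers.
interface HostedPricing {
  usdPerMillionInputTokens: number;
  usdPerMillionOutputTokens: number;
}

interface SelfHostedCost {
  gpuUsdPerMonth: number;         // fixed GPU rental or amortized hardware
  engineeringUsdPerMonth: number; // ops and on-call time, also treated as fixed
}

interface Traffic {
  requestsPerMonth: number;
  inputTokensPerRequest: number;
  outputTokensPerRequest: number;
}

// Hosted cost scales linearly with token volume.
function hostedMonthlyUsd(p: HostedPricing, t: Traffic): number {
  const inputTokens = t.requestsPerMonth * t.inputTokensPerRequest;
  const outputTokens = t.requestsPerMonth * t.outputTokensPerRequest;
  return (
    (inputTokens / 1_000_000) * p.usdPerMillionInputTokens +
    (outputTokens / 1_000_000) * p.usdPerMillionOutputTokens
  );
}

// Self-hosted cost is modeled as flat (ignores extra replicas under load).
function selfHostedMonthlyUsd(c: SelfHostedCost): number {
  return c.gpuUsdPerMonth + c.engineeringUsdPerMonth;
}

// Requests per month at which both options cost the same.
function breakEvenRequests(p: HostedPricing, c: SelfHostedCost, t: Traffic): number {
  const hostedUsdPerRequest = hostedMonthlyUsd(p, { ...t, requestsPerMonth: 1 });
  return selfHostedMonthlyUsd(c) / hostedUsdPerRequest;
}
```

Below the break-even volume the hosted API is cheaper under these assumptions; above it, self-hosting starts to pay for its fixed cost.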


Next.js API route pattern

Keep secrets server-side. Never expose API keys in client bundles.

typescript
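// app/api/chat/route.ts
export async function POST(req: Request) {
  const { messages } = await req.json();

  // Abort the upstream request if no response arrives within 30 s.
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 30_000);

  try {
    const res = await fetch("https://api.openai.com/v1/chat/completions", {
      method: "POST",
      headers: {
        // Read on the server only; the key never reaches the client bundle.
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "gpt-4o-mini", messages, stream: true }),
      signal: controller.signal,
    });

    // Surface upstream failures instead of streaming an error body as SSE.
    if (!res.ok || !res.body) {
      return new Response("Upstream model error", { status: 502 });
    }

    // Pass the provider's event stream through to the client.
    return new Response(res.body, {
      headers: { "Content-Type": "text/event-stream", "Cache-Control": "no-cache" },
    });
  } catch (err) {
    // controller.abort() makes fetch reject with an AbortError.
    if (err instanceof Error && err.name === "AbortError") {
      return new Response("Upstream timeout", { status: 504 });
    }
    throw err;
  } finally {
    // Runs once headers arrive, so the 30 s limit covers time to first response, not the whole stream.
    clearTimeout(timeout);
  }
}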

Streaming

Autoregressive models generate one token at a time. Streaming sends tokens as they arrive — better perceived latency even when total time is similar.

Client: use fetch + ReadableStream or provider SDK stream helpers.
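The fetch + ReadableStream option can be sketched as follows. This assumes the server route passes through OpenAI-style server-sent events, where each event is a `data: {json}` line carrying `choices[0].delta.content` and the stream ends with `data: [DONE]`. The names `readDeltas` and `streamChat` and the `/api/chat` path are illustrative, not part of any SDK.

```typescript
// Reads an SSE byte stream and yields text deltas as they arrive.
// Assumes OpenAI-style chunks: data: {"choices":[{"delta":{"content":"..."}}]}
// followed by a final data: [DONE] line.
async function* readDeltas(body: ReadableStream<Uint8Array>): AsyncGenerator<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true keeps multi-byte characters intact across chunk boundaries.
      buffer += decoder.decode(value, { stream: true });

      // Network chunks do not align with lines, so hold back the trailing partial line.
      const lines = buffer.split("\n");
      buffer = lines.pop() ?? "";

      for (const line of lines) {
        const trimmed = line.trim();
        if (!trimmed.startsWith("data:")) continue; // skip blank separators and comments
        const payload = trimmed.slice("data:".length).trim();
        if (payload === "[DONE]") return;
        const delta = JSON.parse(payload)?.choices?.[0]?.delta?.content;
        if (typeof delta === "string") yield delta;
      }
    }
  } finally {
    reader.releaseLock();
  }
}

// Hypothetical usage: hand each delta to the UI as soon as it arrives.
async function streamChat(
  messages: { role: string; content: string }[],
  onDelta: (text: string) => void,
): Promise<void> {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
  });
  if (!res.ok || !res.body) throw new Error(`Chat request failed: ${res.status}`);
  for await (const delta of readDeltas(res.body)) onDelta(delta);
}
```

Buffering the trailing partial line matters because a single event can be split across two network chunks; parsing each chunk on its own would fail on the incomplete JSON.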


Timeouts and retries

  • Set hard timeouts on every upstream LLM call.
  • Retry transient failures (429, 503) with exponential backoff, and cap the number of attempts.
  • Do not blindly retry non-idempotent tool side effects (payments, deletes).
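The three rules above can be combined in one small helper. This is a sketch under stated assumptions: `fetchWithRetry`, `RetryOptions`, and the option names are hypothetical, only 429 and 503 count as retryable, and a timed-out attempt is not retried (the AbortError propagates to the caller).

```typescript
// Status codes treated as transient; taken from the list above.
const RETRYABLE_STATUS = new Set([429, 503]);

interface RetryOptions {
  maxAttempts: number; // total tries, including the first
  baseDelayMs: number; // delay before the second try
  timeoutMs: number;   // hard timeout for each attempt
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls doFetch with a fresh AbortSignal per attempt and retries transient statuses.
// Only wrap calls that are safe to repeat (no payments, deletes, or other side effects).
async function fetchWithRetry(
  doFetch: (signal: AbortSignal) => Promise<Response>,
  { maxAttempts, baseDelayMs, timeoutMs }: RetryOptions,
): Promise<Response> {
  for (let attempt = 1; ; attempt++) {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const res = await doFetch(controller.signal);
      // Return on success, on a non-retryable status, or when attempts run out.
      if (!RETRYABLE_STATUS.has(res.status) || attempt >= maxAttempts) return res;
      await res.body?.cancel(); // discard the failed response before retrying
    } finally {
      clearTimeout(timer);
    }
    // Exponential backoff: base, 2x base, 4x base, ...
    await sleep(baseDelayMs * 2 ** (attempt - 1));
  }
}
```

A caller would pass a closure such as `(signal) => fetch(url, { ...init, signal })`, so each attempt gets its own timeout.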

Deployment checklist

  • Env vars in host dashboard — not in git
  • Health check route (/api/health)
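  • Edge vs Node runtime: choose per route; LLM routes usually need Node, since the Edge runtime supports only a subset of Node APIs
  • Pin the model name (and version) in config rather than scattering it through route code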
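Several checklist items can live in one small route file. The sketch below is illustrative only: `MODEL_NAME` is a hypothetical env var, the `gpt-4o-mini` fallback simply mirrors the chat route earlier in this lesson, and returning 503 when the key is missing is one possible design choice, not a requirement.

```typescript
// app/api/health/route.ts (hypothetical file; adjust to your project layout)

// Next.js route segment config: run this route on the Node.js runtime, not Edge.
export const runtime = "nodejs";

// Model name pinned in config so changing it is an explicit deploy step.
// MODEL_NAME is a hypothetical env var set in the host dashboard.
const MODEL_NAME = process.env.MODEL_NAME ?? "gpt-4o-mini";

export async function GET(): Promise<Response> {
  // Report whether the secret is configured without ever returning its value.
  const hasApiKey = Boolean(process.env.OPENAI_API_KEY);
  const body = { status: hasApiKey ? "ok" : "misconfigured", model: MODEL_NAME };
  return new Response(JSON.stringify(body), {
    status: hasApiKey ? 200 : 503,
    headers: { "Content-Type": "application/json" },
  });
}
```

This check only confirms configuration; it deliberately does not call the model, so polling it costs no tokens.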

Checkpoint

You should be able to explain why the LLM call lives in a server route rather than the browser, and what streaming improves for users.


What's next

Lesson 2 — Caching for LLM apps