Module 10 — Production & scaling

Project: production-ready GenAI app

Combine Module 7 RAG + Module 8 agents in one Next.js app with Redis cache, rate limits, structured logging, and a simple metrics dashboard.

~360 min read + exercises

Before we begin

This is the course capstone. Combine your Module 7 RAG chatbot and Module 8 agent patterns into one Next.js app — then add what separates demos from portfolio pieces: caching, rate limits, logging, and error handling.

What you are building

Figure: architecture diagram with nodes UI (stream), API (limit), RAG (index), Agent (tools), Redis (cache), Logs (obs).
Caption: One app: RAG + optional agent tools, Redis cache, limits, structured logs.

How this connects to Module 10

| Lesson | Where you use it |
| --- | --- |
| Caching | Redis keyed by query + INDEX_VERSION |
| Rate limiting | Per-user fixed-window counter on /api/chat |
| Cost / tokens | Log inputTokens + outputTokens per request |
| Monitoring | JSON logs + /admin/metrics counters |
| Error handling | Timeouts, graceful 502, cache bypass if Redis is down |

Folder layout:

text
production-genai/
  app/
    api/chat/route.ts
    api/admin/metrics/route.ts
    chat-lab/page.tsx
  lib/
    redis.ts
    rateLimit.ts
    logger.ts
    rag.ts              # from Module 7
    agent.ts            # from Module 8
  .env.local            # OPENAI_API_KEY, REDIS_URL, INDEX_VERSION

What you will build

  1. Unified chat UI — streaming answers with citations (RAG mode).
  2. Agent mode toggle — e.g. “research this topic” runs tool loop with visible steps.
  3. Redis cache — completion cache keyed by query + index version.
  4. Rate limiting — per-user or per-IP on /api/chat.
  5. Structured logging — JSON per request with tokens and latency.
  6. Metrics page — simple /admin/metrics or dashboard card (request count, cache hit %, avg latency).
  7. Graceful errors — timeouts and upstream failures show clear UI messages.

Estimated time: 6–10 hours.


Before you start

  • Finish Module 10 quiz.
  • Have Module 7 RAG index code and Module 8 agent/tool code available to merge or copy.
  • Redis running locally (docker run -p 6379:6379 redis) or Upstash URL in .env.local.

Suggested folder: extend your existing Next.js app or create production-genai/.


Step 1 — Architecture (document in README)

text
User → POST /api/chat
  → rate limit check (Redis INCR + TTL)
  → cache lookup (hash of last user message + mode + INDEX_VERSION)
  → miss:
      mode=rag  → retrieve chunks → LLM stream
      mode=agent → tool loop (max 8 steps)
  → store cache (TTL 1h)
  → structured log + metrics counters
  → JSON response (or SSE stream)

Env vars:

| Variable | Purpose |
| --- | --- |
| OPENAI_API_KEY | Chat + embeddings |
| REDIS_URL | redis://127.0.0.1:6379 or Upstash |
| INDEX_VERSION | Bump when you rebuild the FAISS index |
| RATE_LIMIT_PER_MIN | e.g. 30 |
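
One way to keep configuration mistakes from surfacing as a 502 in the middle of a request is to read these variables once and fail fast. The sketch below is only an illustration: `readConfig`, `AppConfig`, and `Env` are hypothetical names that are not part of the course code, and the fallback values simply mirror the examples in the table.

```typescript
// Hypothetical helper: validate env vars once and fail fast on bad config.
interface AppConfig {
  openaiApiKey: string;
  redisUrl: string;
  indexVersion: string;
  rateLimitPerMin: number;
}

type Env = Record<string, string | undefined>;

function readConfig(env: Env): AppConfig {
  const openaiApiKey = env.OPENAI_API_KEY;
  if (!openaiApiKey) throw new Error("OPENAI_API_KEY is required");

  // Env values are always strings, so parse and check the number explicitly.
  const rateLimitPerMin = Number(env.RATE_LIMIT_PER_MIN ?? "30");
  if (!Number.isInteger(rateLimitPerMin) || rateLimitPerMin <= 0) {
    throw new Error("RATE_LIMIT_PER_MIN must be a positive integer");
  }

  return {
    openaiApiKey,
    redisUrl: env.REDIS_URL ?? "redis://127.0.0.1:6379",
    indexVersion: env.INDEX_VERSION ?? "v1",
    rateLimitPerMin,
  };
}
```

In the app you would call `readConfig(process.env)` once at module load and import the result wherever a setting is needed.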

Step 2 — Redis + cache helpers

typescript
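// lib/redis.ts
import { createHash } from "node:crypto";
import { createClient } from "redis";

type RedisClient = ReturnType<typeof createClient>;

// Cache the connection promise on globalThis so dev hot reloads and
// concurrent requests share one client instead of each opening their own.
const globalForRedis = globalThis as unknown as { redis?: Promise<RedisClient> };

export function getRedis(): Promise<RedisClient> {
  if (!globalForRedis.redis) {
    const client = createClient({ url: process.env.REDIS_URL });
    client.on("error", (e) => console.error("redis", e));
    globalForRedis.redis = client
      .connect()
      .then(() => client)
      .catch((err) => {
        globalForRedis.redis = undefined; // allow a retry on the next request
        throw err;
      });
  }
  return globalForRedis.redis;
}

export function buildCacheKey(messages: { role: string; content: string }[], mode: string) {
  // Key on the most recent user message only; earlier turns are ignored.
  const lastUser = [...messages].reverse().find((m) => m.role === "user")?.content ?? "";
  // Normalize case and whitespace so trivially different phrasings share a key.
  const normalized = lastUser.trim().toLowerCase().replace(/\s+/g, " ");
  const version = process.env.INDEX_VERSION ?? "v1";
  return `chat:${mode}:${version}:${hashString(normalized)}`;
}

function hashString(s: string) {
  // SHA-256 rather than a 32-bit rolling hash: a collision here would
  // serve one question's cached answer for a different question.
  return createHash("sha256").update(s).digest("hex").slice(0, 32);
}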

Why include INDEX_VERSION? After re-indexing blog posts, old cached answers would cite stale chunks without it.
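
To make the effect of the version segment concrete, here is a small standalone sketch of the key-building logic that runs without Redis. It is an illustration under assumptions: `cacheKeyFor`, `normalizeQuery`, and `tinyHash` are hypothetical names, the version is passed as a parameter instead of being read from the environment, and the hash is a small non-cryptographic one used only for brevity.

```typescript
// Standalone illustration of versioned cache keys (no Redis required).
function normalizeQuery(q: string): string {
  // Same normalization idea as buildCacheKey: trim, lowercase, collapse whitespace.
  return q.trim().toLowerCase().replace(/\s+/g, " ");
}

function tinyHash(s: string): string {
  // Small 32-bit rolling hash, fine for a demo but not for collision resistance.
  let h = 0;
  for (let i = 0; i < s.length; i++) h = (Math.imul(31, h) + s.charCodeAt(i)) | 0;
  return h.toString(36);
}

function cacheKeyFor(query: string, mode: string, indexVersion: string): string {
  return `chat:${mode}:${indexVersion}:${tinyHash(normalizeQuery(query))}`;
}

const keyBefore = cacheKeyFor("What is on-device AI?", "rag", "v1");
const keySameQuestion = cacheKeyFor("  what is   ON-DEVICE ai? ", "rag", "v1");
const keyAfterReindex = cacheKeyFor("What is on-device AI?", "rag", "v2");
// keyBefore and keySameQuestion are identical; keyAfterReindex differs,
// so answers cached under v1 are simply never looked up again.
```

Old entries are not deleted by the version bump; they stop being read and expire on their own through the cache TTL.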


Step 3 — Rate limiting

typescript
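// lib/rateLimit.ts
import { getRedis } from "./redis";

// Fixed-window counter: one Redis key per user per window.
export async function rateLimit(userId: string, limit = 30, windowSec = 60) {
  const redis = await getRedis();
  const nowSec = Math.floor(Date.now() / 1000);
  const windowId = Math.floor(nowSec / windowSec);
  const key = `rl:${userId}:${windowId}`;
  const count = await redis.incr(key);
  // The first hit in a window sets the TTL so the key cleans itself up.
  if (count === 1) await redis.expire(key, windowSec);
  return {
    ok: count <= limit,
    remaining: Math.max(0, limit - count),
    retryAfterSec: (windowId + 1) * windowSec - nowSec,
  };
}

export function getUserId(req: Request) {
  // x-user-id is client-supplied and spoofable; replace it with the session
  // user ID once you add auth. Fall back to the client IP so anonymous
  // users do not all share one bucket.
  const forwardedIp = req.headers.get("x-forwarded-for")?.split(",")[0]?.trim();
  return req.headers.get("x-user-id") ?? forwardedIp ?? "anon";
}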

When the limit is exceeded, return 429 with a body such as { error: "Rate limit exceeded", retryAfterSec: 60 } and show a friendly banner in the UI.
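
Since the limiter keys each window as floor(now / windowSec), the wait until the next window can be computed instead of hard-coded. The following is a minimal sketch under that assumption; `secondsUntilWindowReset` and `rateLimitBody` are hypothetical helper names, not part of the course code.

```typescript
// Seconds until the current fixed window rolls over, assuming windows are
// numbered floor(nowSec / windowSec) as in lib/rateLimit.ts.
function secondsUntilWindowReset(nowMs: number, windowSec = 60): number {
  const nowSec = Math.floor(nowMs / 1000);
  const windowEndSec = (Math.floor(nowSec / windowSec) + 1) * windowSec;
  return windowEndSec - nowSec;
}

// Hypothetical body for the 429 response; the UI can show retryAfterSec in its banner.
function rateLimitBody(nowMs: number, windowSec = 60) {
  return {
    error: "Rate limit exceeded",
    retryAfterSec: secondsUntilWindowReset(nowMs, windowSec),
  };
}
```

The same number can also be sent in the standard HTTP `Retry-After` response header, which takes a delay in seconds, so non-browser clients know when to retry.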


Step 4 — Structured logging

typescript
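// lib/logger.ts
import { getRedis } from "./redis";

// One JSON object per line so log tools can parse and filter by field.
export function logChatEvent(event: Record<string, unknown>) {
  console.log(JSON.stringify({ ts: new Date().toISOString(), ...event }));
}

// Increment a named counter; the metrics endpoint reads these keys.
export async function bumpMetric(name: string) {
  try {
    const redis = await getRedis();
    await redis.incr(`metrics:${name}`);
  } catch {
    // Redis down: don't fail the request
  }
}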

Every chat request should log: requestId, mode, cacheHit, latencyMs, inputTokens, outputTokens, userId.
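
A typed event can help keep those fields consistent across call sites. This is a sketch rather than the course's implementation: `ChatLogEvent` and `formatChatLog` are hypothetical names, and the clock is injectable only so the output is reproducible.

```typescript
// Hypothetical type for the per-request log line; field names follow the list above.
interface ChatLogEvent {
  requestId: string;
  userId: string;
  mode: "rag" | "agent";
  cacheHit: boolean;
  latencyMs: number;
  inputTokens?: number; // omitted when no model call was made, e.g. a cache hit
  outputTokens?: number;
}

// Serialize one event as a single JSON line with an ISO timestamp.
function formatChatLog(event: ChatLogEvent, now: Date = new Date()): string {
  return JSON.stringify({ ts: now.toISOString(), ...event });
}

const sampleLogLine = formatChatLog(
  { requestId: "req-1", userId: "anon", mode: "rag", cacheHit: true, latencyMs: 12 },
  new Date(0),
);
```

With a type like this, a misspelled field such as `latencyMS` becomes a compile error instead of a silent gap in your dashboards.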


Step 5 — Unified /api/chat route

typescript
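// app/api/chat/route.ts
import { getRedis, buildCacheKey } from "@/lib/redis";
import { rateLimit, getUserId } from "@/lib/rateLimit";
import { logChatEvent, bumpMetric } from "@/lib/logger";
import { runRagChat } from "@/lib/rag";
import { runAgentLoop } from "@/lib/agent";

const CACHE_TTL_SEC = 3600;
const TIMEOUT_MS = 30_000;

// Redis is an optimization: if a call throws, log it and carry on without it.
async function tryRedis<T>(label: string, fn: () => Promise<T>): Promise<T | null> {
  try {
    return await fn();
  } catch (err) {
    logChatEvent({ level: "warn", redisError: label, message: String(err) });
    return null;
  }
}

export async function POST(req: Request) {
  const requestId = crypto.randomUUID();
  const start = Date.now();
  const userId = getUserId(req);

  // Fail open: if the limiter cannot reach Redis, let the request through.
  const limit = Number(process.env.RATE_LIMIT_PER_MIN ?? 30);
  const limited = await tryRedis("rateLimit", () => rateLimit(userId, limit));
  if (limited && !limited.ok) {
    return Response.json({ error: "Rate limit exceeded", retryAfterSec: 60 }, { status: 429 });
  }

  // Validate the body before using it to build a cache key.
  const body = await req.json().catch(() => null);
  const messages = body?.messages;
  const mode = body?.mode === "agent" ? "agent" : "rag";
  if (!Array.isArray(messages) || messages.length === 0) {
    return Response.json({ error: "messages must be a non-empty array" }, { status: 400 });
  }
  const cacheKey = buildCacheKey(messages, mode);

  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), TIMEOUT_MS);
  try {
    const cached = await tryRedis("cacheGet", async () => (await getRedis()).get(cacheKey));
    if (cached) {
      await bumpMetric("cache_hits");
      await bumpMetric("requests");
      logChatEvent({ requestId, userId, mode, cacheHit: true, latencyMs: Date.now() - start });
      return Response.json({ ...JSON.parse(cached), cached: true });
    }

    const result =
      mode === "agent"
        ? await runAgentLoop(messages, { maxSteps: 8, signal: controller.signal })
        : await runRagChat(messages, { signal: controller.signal });

    await tryRedis("cacheSet", async () =>
      (await getRedis()).set(cacheKey, JSON.stringify(result), { EX: CACHE_TTL_SEC }),
    );
    await bumpMetric("requests");
    logChatEvent({
      requestId,
      userId,
      mode,
      cacheHit: false,
      latencyMs: Date.now() - start,
      inputTokens: result.usage?.inputTokens,
      outputTokens: result.usage?.outputTokens,
    });

    return Response.json({ ...result, cached: false });
  } catch (err) {
    const isTimeout = controller.signal.aborted || (err instanceof Error && err.name === "AbortError");
    await bumpMetric("errors");
    logChatEvent({ requestId, userId, mode, error: isTimeout ? "timeout" : "upstream", latencyMs: Date.now() - start });
    return Response.json(
      { error: isTimeout ? "Request timed out. Try a shorter question." : "Service temporarily unavailable." },
      { status: isTimeout ? 504 : 502 },
    );
  } finally {
    clearTimeout(timeout); // runs on success and on error, so the timer never leaks
  }
}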

Copy runRagChat from Module 7 and runAgentLoop from Module 8 into lib/rag.ts and lib/agent.ts.


Step 6 — Metrics endpoint

typescript
// app/api/admin/metrics/route.ts
import { getRedis } from "@/lib/redis";
 
export async function GET() {
  const redis = await getRedis();
  const [requests, cacheHits] = await Promise.all([
    redis.get("metrics:requests"),
    redis.get("metrics:cache_hits"),
  ]);
  const reqN = Number(requests ?? 0);
  const hitN = Number(cacheHits ?? 0);
  return Response.json({
    requests: reqN,
    cacheHits: hitN,
    cacheHitRate: reqN ? hitN / reqN : 0,
  });
}

Protect with auth in real deploy — for the course, localhost-only is fine.
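
The endpoint above reports request count and cache hit rate; the average latency mentioned in the build list needs one more counter. A possible approach, sketched under assumptions: keep a running latency total in Redis (for example with INCRBY on a hypothetical `metrics:latency_ms_total` key after each request) and divide at read time. `summarizeMetrics` and `RawCounters` are illustrative names only.

```typescript
// Raw counter values as Redis returns them: strings, or null if the key was never written.
type RawCounters = {
  requests: string | null;
  cacheHits: string | null;
  latencyMsTotal: string | null; // hypothetical running sum of latencyMs
};

function summarizeMetrics(raw: RawCounters) {
  const requests = Number(raw.requests ?? 0);
  const cacheHits = Number(raw.cacheHits ?? 0);
  const latencyMsTotal = Number(raw.latencyMsTotal ?? 0);
  return {
    requests,
    cacheHits,
    // Guard against division by zero before any traffic has arrived.
    cacheHitRate: requests ? cacheHits / requests : 0,
    avgLatencyMs: requests ? latencyMsTotal / requests : 0,
  };
}
```

Keeping the arithmetic in a pure function like this also makes the dashboard numbers easy to unit test without a running Redis.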


Step 7 — UI polish

app/chat-lab/page.tsx:

  • Mode toggle: RAG vs Agent.
  • Streaming tokens (optional SSE) for RAG answers.
  • Citation chips from Module 7.
  • Agent trace panel from Module 8.
  • Badge “cached” when response.cached === true.
  • Error banners for 429 / 504 — never blank screen.
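
For the error banners in the list above, one option is a single function that turns an API status into banner text, so every failure path renders something. This is a sketch: `bannerForStatus` is a hypothetical helper and the wording is only a suggestion.

```typescript
// Hypothetical helper for the chat page: map an API status to banner text.
// Returns null when no banner is needed.
function bannerForStatus(status: number, retryAfterSec?: number): string | null {
  if (status >= 200 && status < 300) return null;
  if (status === 429) {
    return retryAfterSec
      ? `Too many requests. Try again in ${retryAfterSec} seconds.`
      : "Too many requests. Please wait a moment and try again.";
  }
  if (status === 504) return "That took too long. Try a shorter question.";
  // Any other failure still gets a message, so the screen is never blank.
  return "Something went wrong. Please try again.";
}
```

The page component can call this with `response.status` and render the result in a banner above the message list.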

Prove cache works:

  1. Ask "What is on-device AI?"
  2. Ask the same question again — second response faster + cached badge.
  3. Check server log: cacheHit: true.

Prove rate limit:

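With RATE_LIMIT_PER_MIN=30, send 31 rapid requests from the same user (script or browser devtools), all inside one 60-second window. The 31st should return 429. If the burst straddles a window boundary, the counter resets and you may need to rerun it.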
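
A small script makes this check repeatable. The sketch below assumes nothing about your server: the sender is injected, so `burst`, `firstRateLimited`, and `SendOnce` are hypothetical names you can wire to `fetch` however you like.

```typescript
type SendOnce = () => Promise<number>; // resolves to the HTTP status code

// Send requests one at a time so "the 31st request" is well defined.
async function burst(send: SendOnce, total: number): Promise<number[]> {
  const statuses: number[] = [];
  for (let i = 0; i < total; i++) statuses.push(await send());
  return statuses;
}

// 1-based position of the first 429, or null if none was returned.
function firstRateLimited(statuses: number[]): number | null {
  const index = statuses.indexOf(429);
  return index === -1 ? null : index + 1;
}
```

To run it against a local dev server, pass a sender that POSTs a chat body to your /api/chat URL with `fetch` and resolves to `response.status`, then check that `firstRateLimited(await burst(send, 31))` is 31.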


Step 8 — Error handling checklist

| Failure | Behavior |
| --- | --- |
| LLM timeout (30s) | 504 + user message |
| Wrong API key | 502 + log requestId (no key in log) |
| Redis down | Skip cache; log once; still answer |
| Agent tool API fail | Partial answer + warning in UI |
| RAG empty retrieval | "I don't have that in the docs" (no hallucination) |

Test deliberately: set invalid API key → user sees error, log contains requestId for debugging.
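
Two rows of the checklist, the status mapping and "no key in log", can be sketched as pure functions. This is an illustration under stated assumptions: `redactSecrets` and `classifyError` are hypothetical helpers, and the pattern assumes keys that start with "sk-", which you would adjust for other providers.

```typescript
// Assumption: secret keys look like "sk-" followed by 8+ key characters.
const SECRET_PATTERN = /sk-[A-Za-z0-9_-]{8,}/g;

// Replace anything that looks like a key before the text reaches a log line.
function redactSecrets(message: string): string {
  return message.replace(SECRET_PATTERN, "[redacted]");
}

// Map a thrown value to the status and log label used by the chat route.
function classifyError(err: unknown): {
  status: 502 | 504;
  logError: "timeout" | "upstream";
  detail: string;
} {
  const isTimeout = err instanceof Error && err.name === "AbortError";
  const detail = redactSecrets(err instanceof Error ? err.message : String(err));
  return {
    status: isTimeout ? 504 : 502,
    logError: isTimeout ? "timeout" : "upstream",
    detail,
  };
}
```

Logging `detail` alongside `requestId` gives you enough to debug a bad-key failure without the key itself ever appearing in the logs.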


Acceptance criteria

  • RAG answers include at least one citation
  • Agent mode completes a tool call with visible steps
  • Repeat question hits cache (prove via log or UI badge)
  • 31st rapid request returns 429
  • Timeout shows user-friendly message, not blank screen
  • README with architecture diagram + env setup + demo GIF

Deliverables

  • Git repo (or folder) with working /api/chat
  • README explaining cost/limit choices
  • Optional: deploy to Vercel with Redis env vars

What's next

Congratulations — you completed the full AI: From Basics to GenAI track.

You moved from math intuition → ML → neural nets → transformers → RAG → agents → production. That arc is what strong junior AI engineer portfolios demonstrate.

Return to the AI course curriculum anytime, or revisit any module project to extend features.