← Back to curriculum

Module 8 — Agentic AI

Agentic system design

Production architecture, deployment and scaling, structured traces, guardrails, degradation, and on-call runbooks.

~90 min read + exercises

Agentic system design

Before we begin

A Jupyter notebook agent that works on your laptop is not a production agentic system. Real users bring concurrency, abuse, API outages, and budget limits — often all on the same afternoon.

This lesson ties Lessons 1–7 into architecture, deployment, scaling, observability, and guardrails — the Week 8 cohort theme before your quiz and travel project.

Design for failure first. Happy path demos are easy; on-call runbooks separate senior engineers.

Figure

Production agent — loop with guardrails

Agent loop: goal → think → tool → observe → repeatGoalThinkAct (tool)Observe
Same agent loop as Lesson 1 — wrapped in limits, logs, and human gates.

What you will learn

  • Pick among single-agent, planner–executor, and supervisor architectures.
  • Deploy stateless APIs with session stores and job queues.
  • Scale around LLM rate limits and tool bottlenecks.
  • Instrument traces and dashboards operators actually use.
  • Layer guardrails without destroying usefulness.
  • Write a minimal operational runbook.

Before this lesson


Architecture patterns (choose deliberately)

PatternWhen to useWatch out
Single agent + tools≤5 tools, simple goalsTool confusion as catalog grows
Planner + executorMulti-step trips, coding tasksMore LLM calls — justify with evals
Supervisor + workersLarge tool domainsSupervisor bottleneck
Human-in-the-loopRefunds, legal, production deploysLatency — async job pattern
RAG-onlyDoc Q&ANot an agent — do not over-engineer

Default path: single agent → planner/executor when evals fail → supervisor only if domains are truly separate.

Reference architecture (travel product)

text
                    ┌─────────────┐
  User → Next.js UI │  /api/chat  │
                    └──────┬──────┘

              ┌────────────▼────────────┐
              │   Orchestrator (graph)   │
              │  plan → execute → synth  │
              └─┬──────────┬──────────┬─┘
                │          │          │
           ┌────▼───┐ ┌────▼────┐ ┌───▼────┐
           │ LLM API│ │ Tool svc│ │ Redis  │
           └────────┘ │ weather │ │ session│
                      │ maps    │ │ + cache│
                      └─────────┘ └────────┘

                      ┌────▼────┐
                      │ MongoDB │
                      │  prefs  │
                      └─────────┘

Deployment patterns

Stateless application servers

Stateless means each API server holds no session in RAM — if the server restarts, nothing is lost because session data lives elsewhere.

  • No session state in Node/Python process memory.
  • Store session_id → messages, scratchpad in Redis or DB.
  • Horizontally scale API replicas behind load balancer.

Long-running agent jobs

Research tasks may take 2–10 minutes:

  1. POST /api/trips/plan → returns { job_id } immediately.
  2. Worker runs agent graph asynchronously (async — in the background, not blocking the HTTP response).
  3. Client polls GET /api/jobs/{id} or uses SSE (server-sent events: server pushes progress) or WebSocket (two-way live connection) for trace updates.

Do not hold HTTP connections open 10 minutes on serverless without streaming design.

Idempotent tools

Idempotent = safe to retry: calling the same action twice does not create duplicates.

create_support_ticket(idempotency_key=...) — retries after a timeout must not open two tickets.

Timeouts (layered)

LayerExample limit
Single HTTP tool call10s
Whole agent loop120s
User-facing request (sync mode)30s with partial progress UI

Scaling bottlenecks

BottleneckSymptomMitigation
LLM rate limits429 errorsQueue, exponential backoff, multi-key pool
LLM latencySlow UXStream tokens; show trace while working
Tool APIsMap quota exhaustedCache geocode results 24h
Context sizeTruncated instructionsSummarize tool history (Lesson 5)
CostInvoice shockmax_steps, cheaper model for planner
DBMemory fetch slowIndex user_id; cache profile in Redis

Model routing: planner on mini, final synthesis on full only when needed — measure quality impact on eval set.


Observability — what to log

Structured trace per user request (trace_id = unique ID to find this run in logs):

json
{
  "trace_id": "tr_8f3a2c",
  "user_id": "u_123",
  "session_id": "s_456",
  "model": "gpt-4.1-mini",
  "outcome": "success",
  "total_tokens_in": 4200,
  "total_tokens_out": 890,
  "total_cost_usd": 0.012,
  "duration_ms": 18400,
  "steps": [
    {"type": "llm", "node": "plan", "tokens_in": 800, "tokens_out": 200},
    {"type": "tool", "name": "get_weather", "ok": true, "latency_ms": 340},
    {"type": "llm", "node": "execute", "tokens_in": 1200, "tokens_out": 80}
  ]
}

Dashboards operators need

ChartAction when bad
Success rateRoll back prompt version
p95 latency95% of requests finish faster than this — check tool or model slowdown
Cost per successful taskTighten max steps or cache
Tool error rate by nameFix API or mock fallback
Loop count histogramDetect infinite replan bugs

Module 10 adds Redis caching, rate limits, and /admin/metrics patterns.


Guardrails (defense in depth)

LayerExampleMeasure
InputBlock known injection patterns; max message lengthFalse positive rate on eval
Tool allowlistExecutor may only call listed tools (e.g. no send_email in staging)Unauthorized tool attempts logged
Argument validationCity name regex; max limit on searchRejected args count
Output schemaItinerary must be valid JSON schemaParse failure rate
Grounding“Only use tool observations for weather”Faithfulness eval
Human gateBook button requires user clickN/A
BudgetStop after $0.50 LLM spend per sessionHard cap errors

Balance: over-blocking frustrates users — track false refusals on eval set alongside safety passes.


Graceful degradation

FailureUser-visible behavior
Weather API down“Live weather unavailable — here's a plan with indoor options and a note to check forecast”
LLM timeoutRetry once; then cached generic tips + apology
Redis downSkip cache; continue without session restore banner
Rate limited429 with Retry-After header

Never return empty 500 with no message.


Security checklist

  • API keys server-side only
  • user_id authorization on every memory fetch
  • Tool path allowlists (filesystem, URLs)
  • PII (personally identifiable information) redaction in logs
  • Prompt injection awareness in system prompts
  • Audit log for destructive tools

Operational runbook (template)

Keep in docs/agent-oncall.md:

  1. Disable tool globally — env flag DISABLE_WEATHER_TOOL=true, redeploy.
  2. Roll back promptsplanner_prompt_version pin in config service.
  3. Find user trace — search trace_id or user_id in log aggregator.
  4. Cost circuit breaker — auto-pause agent if hourly spend > $X (like an electrical fuse — stops runaway cost)
  5. Model outage — failover model string in config.
  6. Eval regression — link to last green CI (automated test) run.

On-call should not grep raw production DB without a playbook.


Lifecycle: design → eval → ship → monitor

text
Design architecture (this lesson)
    → Build planner/executor (Lessons 3, 6)
    → Write eval set (Lesson 7)
    → Pass CI gate (≥80% success on automated eval suite)
    → Ship with traces + guardrails
    → Monitor dashboards (Module 10)
    → Iterate prompts with eval diff

Travel project — production checklist

Before calling Module 8 complete:

  • max_iterations and per-tool retries
  • Trace returned to UI
  • User prefs loaded from DB
  • Tool errors surfaced to model and user
  • 10+ eval cases in repo
  • README architecture diagram

Module 10 capstone adds Redis cache, rate limits, and metrics dashboard.


Check yourself

  1. Why stateless servers + Redis sessions?
  2. Name three guardrail layers.
  3. What goes in a structured trace?
  4. When do you choose async background jobs vs a single synchronous /api/chat response?

What's next

Module 8 quiz — then the travel planner project.