Agentic system design

Before we begin

A Jupyter notebook agent that works on your laptop is not a production agentic system. Real users bring concurrency, abuse, API outages, and budget limits — often all on the same afternoon.

This lesson ties Lessons 1–7 into architecture, deployment, scaling, observability, and guardrails — the Week 8 cohort theme before your quiz and travel project.

Design for failure first. Happy path demos are easy; on-call runbooks separate senior engineers.

Figure

Production agent — loop with guardrails

Same agent loop as Lesson 1 — wrapped in limits, logs, and human gates.

What you will learn

Pick among single-agent, planner–executor, and supervisor architectures.
Deploy stateless APIs with session stores and job queues.
Scale around LLM rate limits and tool bottlenecks.
Instrument traces and dashboards operators actually use.
Layer guardrails without destroying usefulness.
Write a minimal operational runbook.

Before this lesson

Lesson 7 — Evals
Module 10 welcome (full serving, cache, rate limits next module)

Architecture patterns (choose deliberately)

Pattern	When to use	Watch out
Single agent + tools	≤5 tools, simple goals	Tool confusion as catalog grows
Planner + executor	Multi-step trips, coding tasks	More LLM calls — justify with evals
Supervisor + workers	Large tool domains	Supervisor bottleneck
Human-in-the-loop	Refunds, legal, production deploys	Latency — async job pattern
RAG-only	Doc Q&A	Not an agent — do not over-engineer

Default path: single agent → planner/executor when evals fail → supervisor only if domains are truly separate.

Reference architecture (travel product)

text

                    ┌─────────────┐
  User → Next.js UI │  /api/chat  │
                    └──────┬──────┘
                           │
              ┌────────────▼────────────┐
              │   Orchestrator (graph)   │
              │  plan → execute → synth  │
              └─┬──────────┬──────────┬─┘
                │          │          │
           ┌────▼───┐ ┌────▼────┐ ┌───▼────┐
           │ LLM API│ │ Tool svc│ │ Redis  │
           └────────┘ │ weather │ │ session│
                      │ maps    │ │ + cache│
                      └─────────┘ └────────┘
                           │
                      ┌────▼────┐
                      │ MongoDB │
                      │  prefs  │
                      └─────────┘

Deployment patterns

Stateless application servers

Stateless means each API server holds no session in RAM — if the server restarts, nothing is lost because session data lives elsewhere.

No session state in Node/Python process memory.
Store session_id → messages, scratchpad in Redis or DB.
Horizontally scale API replicas behind load balancer.

Long-running agent jobs

Research tasks may take 2–10 minutes:

POST /api/trips/plan → returns { job_id } immediately.
Worker runs agent graph asynchronously (async — in the background, not blocking the HTTP response).
Client polls GET /api/jobs/{id} or uses SSE (server-sent events: server pushes progress) or WebSocket (two-way live connection) for trace updates.

Do not hold HTTP connections open 10 minutes on serverless without streaming design.

Idempotent tools

Idempotent = safe to retry: calling the same action twice does not create duplicates.

create_support_ticket(idempotency_key=...) — retries after a timeout must not open two tickets.

Timeouts (layered)

Layer	Example limit
Single HTTP tool call	10s
Whole agent loop	120s
User-facing request (sync mode)	30s with partial progress UI

Scaling bottlenecks

Bottleneck	Symptom	Mitigation
LLM rate limits	429 errors	Queue, exponential backoff, multi-key pool
LLM latency	Slow UX	Stream tokens; show trace while working
Tool APIs	Map quota exhausted	Cache geocode results 24h
Context size	Truncated instructions	Summarize tool history (Lesson 5)
Cost	Invoice shock	`max_steps`, cheaper model for planner
DB	Memory fetch slow	Index `user_id`; cache profile in Redis

Model routing: planner on mini, final synthesis on full only when needed — measure quality impact on eval set.

Observability — what to log

Structured trace per user request (trace_id = unique ID to find this run in logs):

json

{
  "trace_id": "tr_8f3a2c",
  "user_id": "u_123",
  "session_id": "s_456",
  "model": "gpt-4.1-mini",
  "outcome": "success",
  "total_tokens_in": 4200,
  "total_tokens_out": 890,
  "total_cost_usd": 0.012,
  "duration_ms": 18400,
  "steps": [
    {"type": "llm", "node": "plan", "tokens_in": 800, "tokens_out": 200},
    {"type": "tool", "name": "get_weather", "ok": true, "latency_ms": 340},
    {"type": "llm", "node": "execute", "tokens_in": 1200, "tokens_out": 80}
  ]
}

Dashboards operators need

Chart	Action when bad
Success rate	Roll back prompt version
p95 latency	95% of requests finish faster than this — check tool or model slowdown
Cost per successful task	Tighten max steps or cache
Tool error rate by name	Fix API or mock fallback
Loop count histogram	Detect infinite replan bugs

Module 10 adds Redis caching, rate limits, and /admin/metrics patterns.

Guardrails (defense in depth)

Layer	Example	Measure
Input	Block known injection patterns; max message length	False positive rate on eval
Tool allowlist	Executor may only call listed tools (e.g. no `send_email` in staging)	Unauthorized tool attempts logged
Argument validation	City name regex; max `limit` on search	Rejected args count
Output schema	Itinerary must be valid JSON schema	Parse failure rate
Grounding	“Only use tool observations for weather”	Faithfulness eval
Human gate	Book button requires user click	N/A
Budget	Stop after $0.50 LLM spend per session	Hard cap errors

Balance: over-blocking frustrates users — track false refusals on eval set alongside safety passes.

Graceful degradation

Failure	User-visible behavior
Weather API down	“Live weather unavailable — here's a plan with indoor options and a note to check forecast”
LLM timeout	Retry once; then cached generic tips + apology
Redis down	Skip cache; continue without session restore banner
Rate limited	429 with `Retry-After` header

Never return empty 500 with no message.

Security checklist

API keys server-side only
user_id authorization on every memory fetch
Tool path allowlists (filesystem, URLs)
PII (personally identifiable information) redaction in logs
Prompt injection awareness in system prompts
Audit log for destructive tools

Operational runbook (template)

Keep in docs/agent-oncall.md:

Disable tool globally — env flag DISABLE_WEATHER_TOOL=true, redeploy.
Roll back prompts — planner_prompt_version pin in config service.
Find user trace — search trace_id or user_id in log aggregator.
Cost circuit breaker — auto-pause agent if hourly spend > $X (like an electrical fuse — stops runaway cost)
Model outage — failover model string in config.
Eval regression — link to last green CI (automated test) run.

On-call should not grep raw production DB without a playbook.

Lifecycle: design → eval → ship → monitor

text

Design architecture (this lesson)
    → Build planner/executor (Lessons 3, 6)
    → Write eval set (Lesson 7)
    → Pass CI gate (≥80% success on automated eval suite)
    → Ship with traces + guardrails
    → Monitor dashboards (Module 10)
    → Iterate prompts with eval diff

Travel project — production checklist

Before calling Module 8 complete:

max_iterations and per-tool retries
Trace returned to UI
User prefs loaded from DB
Tool errors surfaced to model and user
10+ eval cases in repo
README architecture diagram

Module 10 capstone adds Redis cache, rate limits, and metrics dashboard.

Check yourself

Why stateless servers + Redis sessions?
Name three guardrail layers.
What goes in a structured trace?
When do you choose async background jobs vs a single synchronous /api/chat response?

What's next

Module 8 quiz — then the travel planner project.