Agentic system design
Before we begin
A Jupyter notebook agent that works on your laptop is not a production agentic system. Real users bring concurrency, abuse, API outages, and budget limits — often all on the same afternoon.
This lesson ties Lessons 1–7 into architecture, deployment, scaling, observability, and guardrails — the Week 8 cohort theme before your quiz and travel project.
Design for failure first. Happy path demos are easy; on-call runbooks separate senior engineers.
Figure
Production agent — loop with guardrails
What you will learn
- Pick among single-agent, planner–executor, and supervisor architectures.
- Deploy stateless APIs with session stores and job queues.
- Scale around LLM rate limits and tool bottlenecks.
- Instrument traces and dashboards operators actually use.
- Layer guardrails without destroying usefulness.
- Write a minimal operational runbook.
Before this lesson
- Lesson 7 — Evals
- Module 10 welcome (full serving, cache, rate limits next module)
Architecture patterns (choose deliberately)
| Pattern | When to use | Watch out |
|---|---|---|
| Single agent + tools | ≤5 tools, simple goals | Tool confusion as catalog grows |
| Planner + executor | Multi-step trips, coding tasks | More LLM calls — justify with evals |
| Supervisor + workers | Large tool domains | Supervisor bottleneck |
| Human-in-the-loop | Refunds, legal, production deploys | Latency — async job pattern |
| RAG-only | Doc Q&A | Not an agent — do not over-engineer |
Default path: single agent → planner/executor when evals fail → supervisor only if domains are truly separate.
Reference architecture (travel product)
┌─────────────┐
User → Next.js UI │ /api/chat │
└──────┬──────┘
│
┌────────────▼────────────┐
│ Orchestrator (graph) │
│ plan → execute → synth │
└─┬──────────┬──────────┬─┘
│ │ │
┌────▼───┐ ┌────▼────┐ ┌───▼────┐
│ LLM API│ │ Tool svc│ │ Redis │
└────────┘ │ weather │ │ session│
│ maps │ │ + cache│
└─────────┘ └────────┘
│
┌────▼────┐
│ MongoDB │
│ prefs │
└─────────┘Deployment patterns
Stateless application servers
Stateless means each API server holds no session in RAM — if the server restarts, nothing is lost because session data lives elsewhere.
- No session state in Node/Python process memory.
- Store
session_id → messages, scratchpadin Redis or DB. - Horizontally scale API replicas behind load balancer.
Long-running agent jobs
Research tasks may take 2–10 minutes:
POST /api/trips/plan→ returns{ job_id }immediately.- Worker runs agent graph asynchronously (async — in the background, not blocking the HTTP response).
- Client polls
GET /api/jobs/{id}or uses SSE (server-sent events: server pushes progress) or WebSocket (two-way live connection) for trace updates.
Do not hold HTTP connections open 10 minutes on serverless without streaming design.
Idempotent tools
Idempotent = safe to retry: calling the same action twice does not create duplicates.
create_support_ticket(idempotency_key=...) — retries after a timeout must not open two tickets.
Timeouts (layered)
| Layer | Example limit |
|---|---|
| Single HTTP tool call | 10s |
| Whole agent loop | 120s |
| User-facing request (sync mode) | 30s with partial progress UI |
Scaling bottlenecks
| Bottleneck | Symptom | Mitigation |
|---|---|---|
| LLM rate limits | 429 errors | Queue, exponential backoff, multi-key pool |
| LLM latency | Slow UX | Stream tokens; show trace while working |
| Tool APIs | Map quota exhausted | Cache geocode results 24h |
| Context size | Truncated instructions | Summarize tool history (Lesson 5) |
| Cost | Invoice shock | max_steps, cheaper model for planner |
| DB | Memory fetch slow | Index user_id; cache profile in Redis |
Model routing: planner on mini, final synthesis on full only when needed — measure quality impact on eval set.
Observability — what to log
Structured trace per user request (trace_id = unique ID to find this run in logs):
{
"trace_id": "tr_8f3a2c",
"user_id": "u_123",
"session_id": "s_456",
"model": "gpt-4.1-mini",
"outcome": "success",
"total_tokens_in": 4200,
"total_tokens_out": 890,
"total_cost_usd": 0.012,
"duration_ms": 18400,
"steps": [
{"type": "llm", "node": "plan", "tokens_in": 800, "tokens_out": 200},
{"type": "tool", "name": "get_weather", "ok": true, "latency_ms": 340},
{"type": "llm", "node": "execute", "tokens_in": 1200, "tokens_out": 80}
]
}Dashboards operators need
| Chart | Action when bad |
|---|---|
| Success rate | Roll back prompt version |
| p95 latency | 95% of requests finish faster than this — check tool or model slowdown |
| Cost per successful task | Tighten max steps or cache |
| Tool error rate by name | Fix API or mock fallback |
| Loop count histogram | Detect infinite replan bugs |
Module 10 adds Redis caching, rate limits, and /admin/metrics patterns.
Guardrails (defense in depth)
| Layer | Example | Measure |
|---|---|---|
| Input | Block known injection patterns; max message length | False positive rate on eval |
| Tool allowlist | Executor may only call listed tools (e.g. no send_email in staging) | Unauthorized tool attempts logged |
| Argument validation | City name regex; max limit on search | Rejected args count |
| Output schema | Itinerary must be valid JSON schema | Parse failure rate |
| Grounding | “Only use tool observations for weather” | Faithfulness eval |
| Human gate | Book button requires user click | N/A |
| Budget | Stop after $0.50 LLM spend per session | Hard cap errors |
Balance: over-blocking frustrates users — track false refusals on eval set alongside safety passes.
Graceful degradation
| Failure | User-visible behavior |
|---|---|
| Weather API down | “Live weather unavailable — here's a plan with indoor options and a note to check forecast” |
| LLM timeout | Retry once; then cached generic tips + apology |
| Redis down | Skip cache; continue without session restore banner |
| Rate limited | 429 with Retry-After header |
Never return empty 500 with no message.
Security checklist
- API keys server-side only
-
user_idauthorization on every memory fetch - Tool path allowlists (filesystem, URLs)
- PII (personally identifiable information) redaction in logs
- Prompt injection awareness in system prompts
- Audit log for destructive tools
Operational runbook (template)
Keep in docs/agent-oncall.md:
- Disable tool globally — env flag
DISABLE_WEATHER_TOOL=true, redeploy. - Roll back prompts —
planner_prompt_versionpin in config service. - Find user trace — search
trace_idoruser_idin log aggregator. - Cost circuit breaker — auto-pause agent if hourly spend > $X (like an electrical fuse — stops runaway cost)
- Model outage — failover model string in config.
- Eval regression — link to last green CI (automated test) run.
On-call should not grep raw production DB without a playbook.
Lifecycle: design → eval → ship → monitor
Design architecture (this lesson)
→ Build planner/executor (Lessons 3, 6)
→ Write eval set (Lesson 7)
→ Pass CI gate (≥80% success on automated eval suite)
→ Ship with traces + guardrails
→ Monitor dashboards (Module 10)
→ Iterate prompts with eval diffTravel project — production checklist
Before calling Module 8 complete:
-
max_iterationsand per-tool retries - Trace returned to UI
- User prefs loaded from DB
- Tool errors surfaced to model and user
- 10+ eval cases in repo
- README architecture diagram
Module 10 capstone adds Redis cache, rate limits, and metrics dashboard.
Check yourself
- Why stateless servers + Redis sessions?
- Name three guardrail layers.
- What goes in a structured trace?
- When do you choose async background jobs vs a single synchronous
/api/chatresponse?
What's next
Module 8 quiz — then the travel planner project.