Evals for AI apps
Before we begin
You would not ship a payment API without tests. Yet many teams ship agent features and only discover regressions when users complain on Twitter.
An eval is a repeatable test case plus a scoring rule — human, automated, or LLM-judged — that you re-run whenever prompts, models, tools, or retrieval change.
If you cannot measure it, you cannot safely iterate.
What you will learn
- Build eval sets from real user questions and SME gold answers.
- Score RAG faithfulness and agent task success with rubrics.
- Use LLM-as-judge without blind trust.
- Wire evals into CI and deploy gates.
- Design trajectory evals for multi-step agents.
What to evaluate (by app type)
| App type | What “good” means | Example metric |
|---|---|---|
| RAG Q&A | Answer supported by retrieved text | Faithfulness ≥ 4/5 |
| RAG + cite | Correct source linked | Citation match rate |
| Single agent | Task done with ≤ N steps | Success @ 8 steps |
| Planner + executor | Plan covers required data sources | Plan completeness |
| Router / classifier | Right specialist chosen | Routing accuracy |
| Refusal | Unsafe query declined | Must-refuse pass rate |
Start with 20–50 cases drawn from support logs, sales calls, or internal dogfooding; quality beats quantity.
Anatomy of one eval case
{
  "id": "travel-014",
  "input": "Plan a rainy-day activity in Rome for a 7-year-old.",
  "context_setup": {"user_profile": {"child_age": 7}},
  "expected": {
    "must_call_tools": ["get_weather", "search_places"],
    "must_mention": ["indoor"],
    "must_not_claim": ["exact ticket price without tool"]
  },
  "gold_answer_notes": "Indoor museum acceptable; outdoor pool not ideal if raining."
}
Store cases as JSONL (JSON Lines: one JSON object per line) in git, versioned like code. The case above is pretty-printed for readability; in the file it occupies a single line.
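A file in this shape can be loaded and sanity-checked with a few lines of Python. The sketch below is illustrative only: `load_cases`, the `REQUIRED_KEYS` set, and the file path in the comment are assumptions, not part of any framework or of the course code.

```python
import json

# Assumed minimum schema, taken from the example case above.
REQUIRED_KEYS = {"id", "input", "expected"}

def load_cases(lines):
    """Parse JSONL lines into eval-case dicts, skipping blank lines."""
    cases = []
    seen_ids = set()
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        case = json.loads(line)
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        if case["id"] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate id {case['id']!r}")
        seen_ids.add(case["id"])
        cases.append(case)
    return cases

# With a real file (hypothetical path):
#   with open("evals/travel.jsonl") as f:
#       cases = load_cases(f)
sample = [
    '{"id": "travel-014", "input": "Plan a rainy-day activity in Rome for a 7-year-old.", '
    '"expected": {"must_call_tools": ["get_weather", "search_places"], "must_mention": ["indoor"]}}',
    "",
]
cases = load_cases(sample)
print(cases[0]["id"], cases[0]["expected"]["must_mention"])
```

Failing fast on a missing key or duplicate id keeps a malformed case from silently shrinking the eval set.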
Manual SME evals
SME = subject-matter expert — a person who knows the domain (travel advisor, lawyer, nurse) and labels correct answers.
| Label type | Use |
|---|---|
| Gold answer | Reference paragraph or bullet checklist |
| Acceptable variants | Multiple valid itineraries |
| Must-cite | Policy doc required for compliance |
| Must-refuse | Jailbreak or medical diagnosis |
Process:
- SME writes 30 cases.
- Engineer runs system weekly on staging.
- SME scores 1–5 or pass/fail in spreadsheet or Label Studio.
- Track trend over time — not one-off demos.
When SME review is mandatory: regulated domains, customer-facing commitments, SLAs with penalties.
LLM-as-judge
A second model scores the production output against a rubric.
Faithfulness judge prompt (RAG):
CONTEXT:
{retrieved_chunks}
QUESTION:
{user_question}
ANSWER:
{model_answer}
Score 1-5 for faithfulness: every factual claim in ANSWER must be
supported by CONTEXT. 5 = fully supported; 1 = mostly invented.
Reply JSON only: {"score": N, "reason": "..."}
Pros: fast, scales to hundreds of cases, catches subtle drift.
Cons: the judge can be wrong or lenient, so calibrate against human scores monthly.
Best practice: use a judge model at least as capable as the production model, and spot-check 10% of judged cases by hand.
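To make the judge usable in a script, you need to fill the prompt, parse its JSON reply defensively, and compare its scores with human ones. The following is a minimal sketch under stated assumptions: the function names, the one-point tolerance, and the sample scores are all hypothetical, and the call to the judge model itself is left out.

```python
import json

JUDGE_TEMPLATE = """CONTEXT:
{retrieved_chunks}
QUESTION:
{user_question}
ANSWER:
{model_answer}
Score 1-5 for faithfulness: every factual claim in ANSWER must be
supported by CONTEXT. 5 = fully supported; 1 = mostly invented.
Reply JSON only: {{"score": N, "reason": "..."}}"""

def build_judge_prompt(retrieved_chunks, user_question, model_answer):
    """Fill the rubric prompt; chunks are joined with a visible separator."""
    return JUDGE_TEMPLATE.format(
        retrieved_chunks="\n---\n".join(retrieved_chunks),
        user_question=user_question,
        model_answer=model_answer,
    )

def parse_judge_reply(reply_text):
    """Return (score, reason); raise ValueError on malformed or out-of-range replies."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")
    score = data.get("score")
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"invalid score: {score!r}")
    return score, str(data.get("reason", ""))

def calibration_report(judge_scores, human_scores, tolerance=1):
    """Compare judge and human scores given for the same cases, in the same order."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {
        "within_tolerance": sum(abs(d) <= tolerance for d in diffs) / len(diffs),
        "mean_bias": sum(diffs) / len(diffs),  # above 0: judge is more lenient than humans
    }

score, reason = parse_judge_reply('{"score": 4, "reason": "one unsupported date"}')
report = calibration_report(judge_scores=[5, 4, 4, 2], human_scores=[4, 4, 3, 2])
print(score, report)
```

A positive `mean_bias` that grows from month to month is a sign the judge is drifting lenient and its scores should not gate releases until it is recalibrated.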
Agent trajectory evals
Final answer quality is not enough. Score the path:
| Check | How |
|---|---|
| Tool sequence | get_weather before outdoor suggestions? |
| No skipped tools | Did not invent weather when rain mattered |
| Step budget | Finished in ≤ 10 LLM steps |
| Error recovery | Retried typo city name |
| Cost | Total tokens < threshold |
Implementation: run agent with mocked tools — fixtures (canned fake API responses) — in CI (continuous integration: automated tests on every code push):
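def mock_get_weather(city, **kwargs):
    # Fixture: canned response, so CI never calls the real weather API.
    return {"condition": "rain", "temp_c": 18}
# After running the agent with the mock registered in place of the real tool
# (final_answer and trace_contains_tool come from your own test harness):
assert "indoor" in final_answer.lower()
assert trace_contains_tool("get_weather")
Mocks make CI stable; real APIs belong in nightly staging evals.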
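The trajectory checks in the table can be run against a recorded trace. Below is a self-contained sketch, assuming a trace is simply an ordered list of tool calls; `ToolRecorder`, `toy_agent`, `mock_search_places`, and `check_trajectory` are hypothetical stand-ins for your own agent loop and harness, and the step budget counts tool calls rather than LLM steps to keep the example short.

```python
def mock_get_weather(city, **kwargs):
    return {"condition": "rain", "temp_c": 18}

def mock_search_places(city, indoor, **kwargs):
    return [{"name": "Children's museum", "indoor": indoor}]

class ToolRecorder:
    """Wraps tool functions and records every call in order."""

    def __init__(self, tools):
        self._tools = tools
        self.trace = []

    def call(self, name, **args):
        result = self._tools[name](**args)
        self.trace.append({"tool": name, "args": args, "result": result})
        return result

def toy_agent(question, tools):
    """Stand-in for the real agent loop, so the checks have something to run against."""
    weather = tools.call("get_weather", city="Rome")
    indoor = weather["condition"] == "rain"
    places = tools.call("search_places", city="Rome", indoor=indoor)
    kind = "indoor" if indoor else "outdoor"
    return f"Try an {kind} activity: {places[0]['name']}."

def called_before(trace, first, second):
    """True if both tools were called and `first` was first called before `second`."""
    names = [step["tool"] for step in trace]
    return first in names and second in names and names.index(first) < names.index(second)

def check_trajectory(trace, final_answer, case, max_steps=10):
    expected = case["expected"]
    called = {step["tool"] for step in trace}
    return {
        "tools_called": set(expected["must_call_tools"]) <= called,
        "weather_first": called_before(trace, "get_weather", "search_places"),
        "step_budget": len(trace) <= max_steps,
        "mentions": all(m in final_answer.lower() for m in expected["must_mention"]),
    }

case = {"expected": {"must_call_tools": ["get_weather", "search_places"],
                     "must_mention": ["indoor"]}}
recorder = ToolRecorder({"get_weather": mock_get_weather,
                         "search_places": mock_search_places})
answer = toy_agent("Plan a rainy-day activity in Rome for a 7-year-old.", recorder)
results = check_trajectory(recorder.trace, answer, case)
assert all(results.values()), results
```

Returning one boolean per check, rather than a single pass/fail, shows which part of the path regressed when a case starts failing.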
Metrics cheat sheet
| Metric | Question it answers |
|---|---|
| Task success rate | Did user goal get met? |
| Faithfulness | Claims grounded in context/tools? |
| Relevance | On-topic vs rambling? |
| Citation accuracy | Right doc linked? |
| Tool accuracy | Correct tool + args? |
| Latency p95 | How slow is a typical slow run? (95% of runs finish within this time) |
| Cost per success | Economically viable? |
Report all that matter to product — not only success rate.
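Three of these metrics can be computed directly from per-run records. The sketch below assumes each run is a dict with `success`, `latency_s`, and `cost_usd` fields (hypothetical names) and uses the nearest-rank definition of p95; other percentile definitions interpolate and give slightly different numbers.

```python
def p95(values):
    """Nearest-rank 95th percentile: smallest value with at least 95% of runs at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]

def summarize(runs):
    successes = [r for r in runs if r["success"]]
    total_cost = sum(r["cost_usd"] for r in runs)  # failed runs still cost money
    return {
        "success_rate": len(successes) / len(runs),
        "latency_p95_s": p95([r["latency_s"] for r in runs]),
        "cost_per_success_usd": total_cost / len(successes) if successes else float("inf"),
    }

runs = [
    {"success": True, "latency_s": 2.0, "cost_usd": 0.01},
    {"success": True, "latency_s": 3.0, "cost_usd": 0.02},
    {"success": False, "latency_s": 9.0, "cost_usd": 0.03},
    {"success": True, "latency_s": 4.0, "cost_usd": 0.02},
]
summary = summarize(runs)
print(summary)
```

Dividing total cost by successes, not by runs, is what makes the metric answer "economically viable?": a cheap agent that fails half the time is twice as expensive per solved task.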
Eval loop in practice
1. BASELINE — run eval v1.0 on system @ git SHA abc123
2. CHANGE — new planner prompt / GPT-4.1-mini → 4.1
3. RE-RUN — same frozen eval set
4. COMPARE — faithfulness 4.2 → 4.0 (−0.2) → investigate
5. GATE — block merge if success rate drops >5% (catches a **regression**: quality that used to pass)
Store results:
{
  "run_id": "2025-06-25-staging",
  "git_sha": "abc123",
  "prompt_version": "planner-v3",
  "model": "gpt-4.1-mini",
  "aggregate": {"success_rate": 0.86, "faithfulness_mean": 4.1}
}
Frameworks and tooling
| Tool | Strength |
|---|---|
| LangSmith | Traces + datasets + evals in LangChain ecosystem |
| Braintrust | Experiment tracking, human review UI |
| Phoenix (Arize) | Observability + eval on traces |
| Custom JSONL + pytest | Full control; pytest is Python’s standard test runner |
For this course, a JSONL file + Python script is enough to learn the discipline.
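As one example of what that Python script might do, here is a sketch of the compare-and-gate step from the eval loop. It assumes stored results in the shape shown earlier; `compare_runs` is a hypothetical helper, the candidate run's values are made up for illustration, and the 5% threshold is read as an absolute drop of 0.05 in success rate, which is an assumption about the intended meaning.

```python
def compare_runs(baseline, candidate, max_success_drop=0.05):
    """Diff aggregate metrics and decide whether the change should be blocked."""
    deltas = {
        metric: round(candidate["aggregate"][metric] - baseline["aggregate"][metric], 4)
        for metric in baseline["aggregate"]
    }
    # Gate only on success rate; other deltas are reported for investigation.
    blocked = deltas["success_rate"] < -max_success_drop
    return {"blocked": blocked, "deltas": deltas}

baseline = {
    "run_id": "2025-06-25-staging",
    "git_sha": "abc123",
    "aggregate": {"success_rate": 0.86, "faithfulness_mean": 4.1},
}
# Hypothetical candidate run after a prompt change.
candidate = {
    "run_id": "candidate",
    "git_sha": "def456",
    "aggregate": {"success_rate": 0.78, "faithfulness_mean": 3.9},
}
result = compare_runs(baseline, candidate)
print(result)
```

Both runs must use the same frozen eval set; otherwise the deltas mix changes in the system with changes in the test.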
CI integration (minimal example)
CI = continuous integration — e.g. GitHub Actions running tests automatically when you push code.
# .github/workflows/agent-evals.yml
- name: Run agent evals
  run: python scripts/run_evals.py --mock-tools --min-success 0.80
Nightly: full eval with live APIs (these runs are flaky, so retry and alert on trends rather than on a single failure).
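The CI step only works if the script exits non-zero when the threshold is missed. Here is a minimal sketch of how `scripts/run_evals.py` might handle its two flags; the internals are assumptions, and `run_case` is a placeholder that reads a canned `stub_result` field instead of running a real agent.

```python
import argparse

def parse_args(argv):
    parser = argparse.ArgumentParser(description="Run agent evals and gate on success rate.")
    parser.add_argument("--mock-tools", action="store_true", help="use canned tool fixtures")
    parser.add_argument("--min-success", type=float, default=0.80,
                        help="minimum success rate required to pass")
    return parser.parse_args(argv)

def run_case(case, mock_tools):
    # Placeholder: a real script would run the agent (with mocks if requested)
    # and apply the case's checks. Here the outcome is canned.
    return case["stub_result"]

def main(argv, cases):
    args = parse_args(argv)
    passed = sum(run_case(case, args.mock_tools) for case in cases)
    success_rate = passed / len(cases)
    print(f"success rate: {success_rate:.2f} (minimum {args.min_success:.2f})")
    return 0 if success_rate >= args.min_success else 1  # non-zero exit fails the CI step

# Ten canned cases, nine of which pass.
cases = [{"id": f"case-{i}", "stub_result": i != 0} for i in range(10)]
exit_code = main(["--mock-tools", "--min-success", "0.80"], cases)

# In the real script you would end with something like:
#   sys.exit(main(sys.argv[1:], load_your_cases()))
```

Keeping `main` free of `sys.exit` makes the gate itself testable: you can assert on the returned code without killing the test process.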
Connecting Module 7 and Module 8 evals
| Layer | Eval focus |
|---|---|
| Retrieval only | Did correct chunk appear in top-5? (Module 7) |
| RAG answer | Faithfulness to chunks |
| Agent plan | Steps cover required tools |
| Full agent | End-to-end success + cost |
Improve retrieval before blaming the planner prompt.
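The retrieval-only layer is the cheapest to measure, which is one reason to check it first. A sketch of hit rate at top-5, assuming each result records the ranked chunk ids that were retrieved and the gold chunk ids an SME marked as correct (the field names and ids are hypothetical):

```python
def hit_at_k(retrieved_ids, gold_ids, k=5):
    """True if any gold chunk id appears among the top-k retrieved ids."""
    return any(chunk_id in gold_ids for chunk_id in retrieved_ids[:k])

def hit_rate(results, k=5):
    hits = sum(hit_at_k(r["retrieved"], set(r["gold"]), k) for r in results)
    return hits / len(results)

results = [
    {"retrieved": ["c7", "c2", "c9", "c1", "c4", "c3"], "gold": ["c1"]},  # gold at rank 4: hit
    {"retrieved": ["c5", "c6", "c8", "c0", "c2", "c3"], "gold": ["c3"]},  # gold at rank 6: miss
]
print(hit_rate(results, k=5))
```

If this number is low, no planner or answer prompt can fix faithfulness, because the supporting text never reaches the model.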
Common mistakes
| Mistake | Fix |
|---|---|
| Eval set = 3 demo questions | Add real failure cases from logs |
| Only LLM-judge, no humans | Calibrate the judge against human scores monthly |
| Eval once before launch | Re-run on every prompt change |
| Live APIs in unit CI | Mock tools; staging for integration |
| Optimize success rate only | Track false refusals and cost |
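Tracking false refusals alongside the must-refuse pass rate takes only a second counter. The sketch below assumes each scored case carries two booleans, `should_refuse` (the SME label) and `refused` (what the system did); both field names are hypothetical.

```python
def refusal_metrics(cases):
    """Must-refuse pass rate and false-refusal rate from labeled, scored cases."""
    must_refuse = [c for c in cases if c["should_refuse"]]
    should_answer = [c for c in cases if not c["should_refuse"]]
    if not must_refuse or not should_answer:
        raise ValueError("need at least one case of each kind")
    return {
        "must_refuse_pass_rate": sum(c["refused"] for c in must_refuse) / len(must_refuse),
        "false_refusal_rate": sum(c["refused"] for c in should_answer) / len(should_answer),
    }

cases = [
    {"should_refuse": True, "refused": True},
    {"should_refuse": True, "refused": True},
    {"should_refuse": False, "refused": False},
    {"should_refuse": False, "refused": True},   # safe question wrongly declined
    {"should_refuse": False, "refused": False},
]
metrics = refusal_metrics(cases)
print(metrics)
```

A system that refuses everything scores a perfect must-refuse pass rate, which is why the two numbers are only meaningful together.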
Before the quiz and project
Your travel planner should include at least 10 eval cases: sunny trip, rainy trip, typo city, missing budget, train preference from memory, etc.