Evals for AI apps

Before we begin

You would not ship a payment API without tests. Yet many teams ship agent features and discover regressions only when users complain on Twitter.

An eval is a repeatable test case plus a scoring rule — human, automated, or LLM-judged — that you re-run whenever prompts, models, tools, or retrieval change.

If you cannot measure it, you cannot safely iterate.

What you will learn

Build eval sets from real user questions and SME gold answers.
Score RAG faithfulness and agent task success with rubrics.
Use LLM-as-judge without blind trust.
Wire evals into CI and deploy gates.
Design trajectory evals for multi-step agents.

What to evaluate (by app type)

App type	What “good” means	Example metric
RAG Q&A	Answer supported by retrieved text	Faithfulness 4/5
RAG + cite	Correct source linked	Citation match rate
Single agent	Task done with ≤ N steps	Success @ 8 steps
Planner + executor	Plan covers required data sources	Plan completeness
Router / classifier	Right specialist chosen	Routing accuracy
Refusal	Unsafe query declined	Must-refuse pass rate

Start with 20–50 cases from support logs, sales calls, or internal dogfood — quality beats quantity.

Anatomy of one eval case

json

{
  "id": "travel-014",
  "input": "Plan a rainy-day activity in Rome for a 7-year-old.",
  "context_setup": {"user_profile": {"child_age": 7}},
  "expected": {
    "must_call_tools": ["get_weather", "search_places"],
    "must_mention": ["indoor"],
    "must_not_claim": ["exact ticket price without tool"]
  },
  "gold_answer_notes": "Indoor museum acceptable; outdoor pool not ideal if raining."
}

Store as JSONL (JSON Lines — one JSON object per line) in git — versioned like code.
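As a rough illustration of how a case file like this could be consumed, the sketch below parses JSONL text and applies the two checks that are cheap to automate (`must_call_tools` and `must_mention`). The function names `load_cases` and `check_case`, and the choice of required keys, are assumptions made for this example; `must_not_claim` is left out because it usually needs a human or an LLM judge.

```python
import json

REQUIRED_KEYS = {"id", "input", "expected"}  # assumed minimum schema for this sketch


def load_cases(jsonl_text):
    """Parse JSONL text into a list of eval-case dicts, skipping blank lines."""
    cases = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        case = json.loads(line)
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        cases.append(case)
    return cases


def check_case(case, answer, tools_called):
    """Return a list of failure messages; an empty list means the case passed."""
    expected = case["expected"]
    failures = []
    for tool in expected.get("must_call_tools", []):
        if tool not in tools_called:
            failures.append(f"tool not called: {tool}")
    for phrase in expected.get("must_mention", []):
        if phrase.lower() not in answer.lower():
            failures.append(f"answer does not mention: {phrase}")
    return failures


SAMPLE = (
    '{"id": "travel-014", '
    '"input": "Plan a rainy-day activity in Rome for a 7-year-old.", '
    '"expected": {"must_call_tools": ["get_weather", "search_places"], '
    '"must_mention": ["indoor"]}}\n'
)

cases = load_cases(SAMPLE)
print(check_case(cases[0], "Try an indoor museum.", ["get_weather"]))
# -> ['tool not called: search_places']
```

In a real project the text would come from a file in the repository rather than an inline string.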

Manual SME evals

SME = subject-matter expert — a person who knows the domain (travel advisor, lawyer, nurse) and labels correct answers.

Label type	Use
Gold answer	Reference paragraph or bullet checklist
Acceptable variants	Multiple valid itineraries
Must-cite	Policy doc required for compliance
Must-refuse	Jailbreak or medical diagnosis

Process:

SME writes 30 cases.
Engineer runs system weekly on staging.
SME scores 1–5 or pass/fail in spreadsheet or Label Studio.
Track trend over time — not one-off demos.

Manual SME evals are mandatory in regulated domains, for customer-facing commitments, and for SLAs with penalties.
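To make the "track trend over time" step concrete, here is a minimal sketch that turns weekly SME scores into a mean and a pass rate per week. The data layout (a dict of week label to 1-5 scores), the function name `weekly_trend`, and the pass threshold of 4 are assumptions for illustration, not part of any particular labeling tool's export format.

```python
from statistics import mean


def weekly_trend(scores_by_week, pass_threshold=4):
    """Summarise SME scores per week as (week, mean score, pass rate).

    scores_by_week: {"2025-W01": [5, 4, 3, 4], ...} with 1-5 scores.
    pass_threshold: assumed cut-off; a score at or above it counts as a pass.
    """
    rows = []
    for week in sorted(scores_by_week):
        scores = scores_by_week[week]
        pass_rate = sum(s >= pass_threshold for s in scores) / len(scores)
        rows.append((week, round(mean(scores), 2), round(pass_rate, 2)))
    return rows


# Made-up scores, only to show the shape of the output.
example = {"2025-W01": [5, 4, 3, 4], "2025-W02": [4, 4, 2, 3]}
for row in weekly_trend(example):
    print(row)
```

A falling mean or pass rate across weeks is the signal to investigate, even when each individual run looks acceptable.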

LLM-as-judge

A second model scores the production output against a rubric.

Faithfulness judge prompt (RAG):

text

CONTEXT:
{retrieved_chunks}
 
QUESTION:
{user_question}
 
ANSWER:
{model_answer}
 
Score 1-5 for faithfulness: every factual claim in ANSWER must be
supported by CONTEXT. 5 = fully supported; 1 = mostly invented.
 
Reply JSON only: {"score": N, "reason": "..."}
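One way to use a prompt like this in code is sketched below: fill the template, send it to a judge model, and validate the reply before trusting the score. The call to the judge model is deliberately left out because it depends on your provider; `build_judge_prompt` and `parse_judge_reply` are hypothetical helper names, and the chunk separator is an arbitrary choice.

```python
import json

JUDGE_TEMPLATE = """CONTEXT:
{retrieved_chunks}

QUESTION:
{user_question}

ANSWER:
{model_answer}

Score 1-5 for faithfulness: every factual claim in ANSWER must be
supported by CONTEXT. 5 = fully supported; 1 = mostly invented.

Reply JSON only: {{"score": N, "reason": "..."}}"""


def build_judge_prompt(chunks, question, answer):
    """Fill the rubric template; chunks are joined with a visible separator."""
    return JUDGE_TEMPLATE.format(
        retrieved_chunks="\n---\n".join(chunks),
        user_question=question,
        model_answer=answer,
    )


def parse_judge_reply(reply_text):
    """Return (score, reason), or raise ValueError if the reply is unusable."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score missing or out of range: {score!r}")
    return score, str(data.get("reason", ""))


print(parse_judge_reply('{"score": 4, "reason": "one claim lacks support"}'))
# -> (4, 'one claim lacks support')
```

Treating an unparseable reply as an error, rather than as a low score, keeps judge failures from silently skewing the averages.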

Pros: fast, scales to hundreds of cases, catches subtle drift.
Cons: judge can be wrong or lenient — calibrate against human scores monthly.

Best practice: use a judge model at least as capable as the production model, and spot-check 10% of its scores by hand.
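Calibration and spot-checking can be done with very little code. The sketch below compares judge scores with human scores on the cases both have rated, and draws a reproducible 10% sample for manual review. The function names, the dict-of-scores layout, and the specific statistics reported are assumptions chosen for this example.

```python
import random


def calibration_report(judge_scores, human_scores):
    """Compare judge and human 1-5 scores on the cases both have rated.

    Both arguments map case id -> score.
    """
    shared = sorted(judge_scores.keys() & human_scores.keys())
    if not shared:
        raise ValueError("no overlapping cases to compare")
    diffs = [judge_scores[c] - human_scores[c] for c in shared]
    return {
        "n": len(shared),
        "exact_agreement": sum(d == 0 for d in diffs) / len(diffs),
        "mean_abs_diff": sum(abs(d) for d in diffs) / len(diffs),
        "mean_bias": sum(diffs) / len(diffs),  # above 0: judge is more lenient
    }


def spot_check_sample(case_ids, fraction=0.10, seed=0):
    """Pick a reproducible random subset of cases for manual review."""
    k = max(1, round(len(case_ids) * fraction))
    return random.Random(seed).sample(sorted(case_ids), k)


judge = {"a": 5, "b": 4, "c": 3, "d": 4}
human = {"a": 4, "b": 4, "c": 3, "d": 2}
print(calibration_report(judge, human))
# -> {'n': 4, 'exact_agreement': 0.5, 'mean_abs_diff': 0.75, 'mean_bias': 0.75}
```

A positive `mean_bias` suggests a lenient judge, which is the failure mode worth watching most closely.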

Agent trajectory evals

Final answer quality is not enough. Score the path:

Check	How
Tool sequence	`get_weather` before outdoor suggestions?
No skipped tools	Did not invent weather when rain mattered
Step budget	Finished in ≤ 10 LLM steps
Error recovery	Retried after a misspelled city name failed
Cost	Total tokens < threshold

Implementation: run the agent in CI (continuous integration: automated tests on every code push) with mocked tools, using fixtures (canned fake API responses):

python

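def mock_get_weather(city, **kwargs):
    """Fixture: canned response so CI never calls the real weather API."""
    return {"condition": "rain", "temp_c": 18}

# final_answer and trace_contains_tool() are assumed to come from your eval
# harness after running the agent with get_weather patched to the mock above.
assert "indoor" in final_answer.lower()
assert trace_contains_tool("get_weather")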

Mocks make CI stable — real APIs belong in nightly staging evals.
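The path checks from the table above (tool order, step budget, cost) can be expressed as one function over a recorded trace. This is a sketch under stated assumptions: the trace format (a list of step dicts with `tool` and `tokens` keys), the function name `check_trajectory`, and the 20,000-token budget are all hypothetical, and "weather before place search" stands in for "weather before outdoor suggestions".

```python
def check_trajectory(trace, first_tool="get_weather", later_tool="search_places",
                     max_steps=10, max_tokens=20_000):
    """Return a list of failure messages for one agent run.

    trace: list of steps such as {"tool": "get_weather", "tokens": 300};
           "tool" is None for steps where the model only produced text.
    """
    failures = []
    tools = [step["tool"] for step in trace if step.get("tool")]

    # Tool sequence: first_tool must be called, and before later_tool.
    if first_tool not in tools:
        failures.append(f"{first_tool} was never called")
    elif later_tool in tools and tools.index(first_tool) > tools.index(later_tool):
        failures.append(f"{first_tool} was called after {later_tool}")

    # Step budget: every entry in the trace counts as one LLM step.
    if len(trace) > max_steps:
        failures.append(f"used {len(trace)} steps, budget is {max_steps}")

    # Cost: total tokens across all steps.
    total_tokens = sum(step.get("tokens", 0) for step in trace)
    if total_tokens > max_tokens:
        failures.append(f"used {total_tokens} tokens, budget is {max_tokens}")
    return failures


trace = [
    {"tool": "search_places", "tokens": 500},
    {"tool": "get_weather", "tokens": 300},
    {"tool": None, "tokens": 400},
]
print(check_trajectory(trace))
# -> ['get_weather was called after search_places']
```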

Metrics cheat sheet

Metric	Question it answers
Task success rate	Did user goal get met?
Faithfulness	Claims grounded in context/tools?
Relevance	On-topic vs rambling?
Citation accuracy	Right doc linked?
Tool accuracy	Correct tool + args?
Latency p95	How slow is a typical slow case? (95% of runs finish under this time)
Cost per success	Economically viable?

Report all that matter to product — not only success rate.
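Two of these metrics are easy to get subtly wrong, so here is a minimal sketch of both. It assumes each run is recorded as a dict with `success`, `cost_usd`, and `latency_s` keys (hypothetical field names) and uses the nearest-rank definition of a percentile, which is one of several common definitions.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile: the smallest value such that at
    least 95% of the observations are less than or equal to it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(len(ordered) * 95 / 100)  # 1-based rank
    return ordered[rank - 1]


def cost_per_success(runs):
    """Total spend divided by the number of successful runs.

    Failed runs still cost money, so they stay in the numerator.
    """
    successes = sum(r["success"] for r in runs)
    total_cost = sum(r["cost_usd"] for r in runs)
    return total_cost / successes if successes else float("inf")


runs = [
    {"success": True, "cost_usd": 0.02, "latency_s": 3.1},
    {"success": False, "cost_usd": 0.03, "latency_s": 9.8},
    {"success": True, "cost_usd": 0.05, "latency_s": 4.4},
]
print(p95([r["latency_s"] for r in runs]))  # -> 9.8
print(round(cost_per_success(runs), 3))     # -> 0.05
```

Dividing by successes rather than by all runs is the point of the metric: a cheap system that rarely succeeds can still be expensive per solved task.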

Eval loop in practice

text

1. BASELINE — run eval v1.0 on system @ git SHA abc123
2. CHANGE    — new planner prompt / GPT-4.1-mini → 4.1
3. RE-RUN    — same frozen eval set
4. COMPARE   — faithfulness 4.2 → 4.0 (−0.2) → investigate
5. GATE      — block merge if success rate drops >5% (a regression: quality that used to pass now fails)

Store results:

json

{
  "run_id": "2025-06-25-staging",
  "git_sha": "abc123",
  "prompt_version": "planner-v3",
  "model": "gpt-4.1-mini",
  "aggregate": {"success_rate": 0.86, "faithfulness_mean": 4.1}
}
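Given two stored runs in this shape, the compare and gate steps reduce to a few lines. The sketch below assumes the 5% rule means an absolute drop of 0.05 in `success_rate`; if your team means a relative drop, the comparison needs to change. The function name `gate` is hypothetical.

```python
def gate(baseline, candidate, max_success_drop=0.05):
    """Compare two 'aggregate' dicts and decide whether to allow the merge.

    Returns (ok, deltas), where deltas maps each shared metric to
    candidate minus baseline. Assumes an absolute drop threshold.
    """
    deltas = {
        name: round(candidate[name] - baseline[name], 4)
        for name in baseline
        if name in candidate
    }
    ok = deltas.get("success_rate", 0.0) >= -max_success_drop
    return ok, deltas


baseline = {"success_rate": 0.86, "faithfulness_mean": 4.2}
candidate = {"success_rate": 0.80, "faithfulness_mean": 4.0}
print(gate(baseline, candidate))
# -> (False, {'success_rate': -0.06, 'faithfulness_mean': -0.2})
```

The deltas are returned even when the gate passes, so a smaller drop in faithfulness still shows up in the report.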

Frameworks and tooling

Tool	Strength
LangSmith	Traces + datasets + evals in LangChain ecosystem
Braintrust	Experiment tracking, human review UI
Phoenix (Arize)	Observability + eval on traces
Custom JSONL + pytest	Full control; pytest is Python’s standard test runner

For this course, a JSONL file + Python script is enough to learn the discipline.

CI integration (minimal example)

CI = continuous integration — e.g. GitHub Actions running tests automatically when you push code.

yaml

# .github/workflows/agent-evals.yml
- name: Run agent evals
  run: python scripts/run_evals.py --mock-tools --min-success 0.80

Nightly: run the full eval against live APIs. Live calls are flaky, so retry failures and alert on trends rather than on a single failed run.
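The workflow step above calls a script that the lesson does not show. A minimal sketch of what `scripts/run_evals.py` might look like follows; only the two flags come from the workflow, while `run_all_cases` is a hypothetical stand-in that returns canned results so the example runs on its own.

```python
import argparse


def run_all_cases(mock_tools):
    """Hypothetical stand-in: a real script would load the JSONL cases,
    run the agent on each one, and return one pass/fail flag per case."""
    return [True, True, True, True, False]


def main(argv=None):
    parser = argparse.ArgumentParser(description="Run agent evals and gate on success rate.")
    parser.add_argument("--mock-tools", action="store_true",
                        help="use canned tool responses instead of live APIs")
    parser.add_argument("--min-success", type=float, default=0.80,
                        help="minimum success rate required to pass")
    args = parser.parse_args(argv)

    results = run_all_cases(mock_tools=args.mock_tools)
    success_rate = sum(results) / len(results)
    print(f"success rate: {success_rate:.2f} (minimum {args.min_success:.2f})")

    # A non-zero exit code is what makes the CI step fail.
    return 0 if success_rate >= args.min_success else 1


exit_code = main(["--mock-tools", "--min-success", "0.80"])
print(exit_code)  # -> 0
```

In the actual script, the last two lines would be replaced by `raise SystemExit(main())` so that the exit code reaches the CI runner.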

Connecting Module 7 and Module 8 evals

Layer	Eval focus
Retrieval only	Did correct chunk appear in top-5? (Module 7)
RAG answer	Faithfulness to chunks
Agent plan	Steps cover required tools
Full agent	End-to-end success + cost

Improve retrieval before blaming the planner prompt.
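The retrieval-only check in the first row can be measured without calling any LLM. A minimal sketch, assuming each case records the ranked chunk ids the retriever returned and the ids an SME marked as correct (the field names `retrieved` and `gold` are made up for this example):

```python
def hit_at_k(retrieved_ids, gold_ids, k=5):
    """True if any gold chunk appears among the top-k retrieved chunks."""
    return any(chunk_id in gold_ids for chunk_id in retrieved_ids[:k])


def hit_rate(cases, k=5):
    """Fraction of cases where a correct chunk appears in the top k.

    cases: list of {"retrieved": [ids in rank order], "gold": [correct ids]}
    """
    hits = sum(hit_at_k(c["retrieved"], set(c["gold"]), k) for c in cases)
    return hits / len(cases)


cases = [
    {"retrieved": ["d3", "d9", "d1"], "gold": ["d1"]},                    # hit at rank 3
    {"retrieved": ["d4", "d5", "d6", "d7", "d8", "d2"], "gold": ["d2"]},  # rank 6: miss
]
print(hit_rate(cases))  # -> 0.5
```

If this number is low, prompt changes further up the stack are unlikely to fix faithfulness, because the model never sees the right text.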

Common mistakes

Mistake	Fix
Eval set = 3 demo questions	Add real failure cases from logs
Only LLM-judge, no humans	Calibrate against human scores monthly
Eval once before launch	Re-run on every prompt change
Live APIs in unit CI	Mock tools; staging for integration
Optimize success rate only	Track false refusals and cost
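The last row is easier to act on with numbers for both sides of the refusal trade-off. The sketch below assumes each result records whether the case should have been refused and whether the system actually refused; the field names and the function name `refusal_metrics` are hypothetical.

```python
def refusal_metrics(results):
    """Compute both sides of the refusal trade-off.

    results: list of {"should_refuse": bool, "refused": bool}
    Returns None for a rate when there are no cases of that kind.
    """
    must_refuse = [r for r in results if r["should_refuse"]]
    benign = [r for r in results if not r["should_refuse"]]
    return {
        # Unsafe queries that were correctly declined.
        "must_refuse_pass_rate": (
            sum(r["refused"] for r in must_refuse) / len(must_refuse)
            if must_refuse else None
        ),
        # Legitimate queries that were wrongly declined.
        "false_refusal_rate": (
            sum(r["refused"] for r in benign) / len(benign)
            if benign else None
        ),
    }


results = [
    {"should_refuse": True, "refused": True},
    {"should_refuse": True, "refused": False},
    {"should_refuse": False, "refused": False},
    {"should_refuse": False, "refused": True},
    {"should_refuse": False, "refused": False},
    {"should_refuse": False, "refused": False},
]
print(refusal_metrics(results))
# -> {'must_refuse_pass_rate': 0.5, 'false_refusal_rate': 0.25}
```

A system that refuses everything scores perfectly on the first rate and terribly on the second, which is why both belong in the report.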

Before the quiz and project

Your travel planner should include at least 10 eval cases: sunny trip, rainy trip, typo city, missing budget, train preference from memory, etc.

What's next

Lesson 8 — Agentic system design