Module 8 — Agentic AI

Evals for AI apps

SME gold sets, LLM-as-judge rubrics, trajectory evals with mocked tools, CI gates, and frameworks.

~90 min read + exercises

Before we begin

You would not ship a payment API without tests. Yet many teams ship agent features and only discover regressions when users complain on Twitter.

An eval is a repeatable test case plus a scoring rule — human, automated, or LLM-judged — that you re-run whenever prompts, models, tools, or retrieval change.

If you cannot measure it, you cannot safely iterate.


What you will learn

  • Build eval sets from real user questions and SME gold answers.
  • Score RAG faithfulness and agent task success with rubrics.
  • Use LLM-as-judge without blind trust.
  • Wire evals into CI and deploy gates.
  • Design trajectory evals for multi-step agents.


What to evaluate (by app type)

| App type | What “good” means | Example metric |
|---|---|---|
| RAG Q&A | Answer supported by retrieved text | Faithfulness 4/5 |
| RAG + cite | Correct source linked | Citation match rate |
| Single agent | Task done with ≤ N steps | Success @ 8 steps |
| Planner + executor | Plan covers required data sources | Plan completeness |
| Router / classifier | Right specialist chosen | Routing accuracy |
| Refusal | Unsafe query declined | Must-refuse pass rate |

Start with 20–50 cases from support logs, sales calls, or internal dogfood — quality beats quantity.


Anatomy of one eval case

json
{
  "id": "travel-014",
  "input": "Plan a rainy-day activity in Rome for a 7-year-old.",
  "context_setup": {"user_profile": {"child_age": 7}},
  "expected": {
    "must_call_tools": ["get_weather", "search_places"],
    "must_mention": ["indoor"],
    "must_not_claim": ["exact ticket price without tool"]
  },
  "gold_answer_notes": "Indoor museum acceptable; outdoor pool not ideal if raining."
}

Store as JSONL (JSON Lines — one JSON object per line) in git — versioned like code.
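
A minimal sketch of how such a file might be consumed, assuming the field names from the example case above. `load_cases` and `check_case` are hypothetical helper names, and the `run` dict (final answer plus the list of tool names called) is an assumed shape for whatever your agent harness records. `must_not_claim` is free text, so this sketch leaves it to a human or a judge model.

```python
import json

def load_cases(jsonl_text):
    """Parse JSONL text: one eval case per non-empty line."""
    cases = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        case = json.loads(line)
        for field in ("id", "input", "expected"):
            if field not in case:
                raise ValueError(f"line {line_no}: missing field {field!r}")
        cases.append(case)
    return cases

def check_case(case, run):
    """Return a list of failure messages; an empty list means the case passed.

    `run` is an assumed shape: {"final_answer": str, "tools_called": [str, ...]}.
    """
    expected = case["expected"]
    failures = []
    for tool in expected.get("must_call_tools", []):
        if tool not in run["tools_called"]:
            failures.append(f"missing tool call: {tool}")
    answer = run["final_answer"].lower()
    for phrase in expected.get("must_mention", []):
        if phrase.lower() not in answer:
            failures.append(f"missing phrase: {phrase}")
    return failures

# One case on one line, as it would appear in the JSONL file.
SAMPLE = (
    '{"id": "travel-014", '
    '"input": "Plan a rainy-day activity in Rome for a 7-year-old.", '
    '"expected": {"must_call_tools": ["get_weather", "search_places"], '
    '"must_mention": ["indoor"]}}'
)

cases = load_cases(SAMPLE)
good_run = {"final_answer": "Try an indoor museum.",
            "tools_called": ["get_weather", "search_places"]}
bad_run = {"final_answer": "Go to the park.", "tools_called": ["search_places"]}
print(check_case(cases[0], good_run))
print(check_case(cases[0], bad_run))
```

Returning a list of failure messages rather than a single boolean keeps the report readable when a case fails for more than one reason.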


Manual SME evals

SME = subject-matter expert — a person who knows the domain (travel advisor, lawyer, nurse) and labels correct answers.

| Label type | Use |
|---|---|
| Gold answer | Reference paragraph or bullet checklist |
| Acceptable variants | Multiple valid itineraries |
| Must-cite | Policy doc required for compliance |
| Must-refuse | Jailbreak or medical diagnosis |

Process:

  1. SME writes 30 cases.
  2. Engineer runs system weekly on staging.
  3. SME scores 1–5 or pass/fail in spreadsheet or Label Studio.
  4. Track trend over time — not one-off demos.

SME review is mandatory in regulated domains, for customer-facing commitments, and for SLAs with penalties.
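
The "track the trend" step can be sketched as follows, assuming SME scores are exported from the spreadsheet as `(week, case_id, score)` rows. The function names, the made-up sample rows, and the 0.3 drop threshold are all illustrative assumptions, not part of any tool mentioned above.

```python
from collections import defaultdict
from statistics import mean

def weekly_means(rows):
    """Average SME scores per week. `rows` are (week, case_id, score) tuples."""
    by_week = defaultdict(list)
    for week, _case_id, score in rows:
        by_week[week].append(score)
    # ISO week labels such as "2025-W03" sort correctly as plain strings.
    return {week: mean(scores) for week, scores in sorted(by_week.items())}

def flag_drops(means, max_drop=0.3):
    """List weeks whose mean fell by more than `max_drop` versus the previous week."""
    weeks = list(means)
    return [
        cur for prev, cur in zip(weeks, weeks[1:])
        if means[prev] - means[cur] > max_drop
    ]

# Made-up scores for illustration only.
rows = [
    ("2025-W01", "travel-001", 4), ("2025-W01", "travel-002", 5),
    ("2025-W02", "travel-001", 5), ("2025-W02", "travel-002", 4),
    ("2025-W03", "travel-001", 3), ("2025-W03", "travel-002", 4),
]
means = weekly_means(rows)
print(means)
print(flag_drops(means))
```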


LLM-as-judge

A second model scores the production output against a rubric.

Faithfulness judge prompt (RAG):

text
CONTEXT:
{retrieved_chunks}
 
QUESTION:
{user_question}
 
ANSWER:
{model_answer}
 
Score 1-5 for faithfulness: every factual claim in ANSWER must be
supported by CONTEXT. 5 = fully supported; 1 = mostly invented.
 
Reply JSON only: {"score": N, "reason": "..."}
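
A minimal sketch of wiring this prompt into code, under stated assumptions: `call_judge` is a hypothetical callable (prompt string in, reply string out) that wraps whatever model client you use, and `fake_judge` is a stub so the example runs without any API. The part worth copying is the strict validation of the judge's reply, since judges do not always return well-formed JSON.

```python
import json

JUDGE_TEMPLATE = """CONTEXT:
{retrieved_chunks}

QUESTION:
{user_question}

ANSWER:
{model_answer}

Score 1-5 for faithfulness: every factual claim in ANSWER must be
supported by CONTEXT. 5 = fully supported; 1 = mostly invented.

Reply JSON only: {{"score": N, "reason": "..."}}"""

def parse_judge_reply(reply):
    """Validate the judge's JSON reply; raise ValueError on anything malformed."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return JSON: {reply!r}") from exc
    if not isinstance(data, dict):
        raise ValueError(f"judge reply is not a JSON object: {reply!r}")
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer 1-5, got {score!r}")
    return {"score": score, "reason": str(data.get("reason", ""))}

def judge_faithfulness(call_judge, chunks, question, answer):
    """Build the prompt, call the judge, and return the validated result."""
    prompt = JUDGE_TEMPLATE.format(
        retrieved_chunks="\n---\n".join(chunks),
        user_question=question,
        model_answer=answer,
    )
    return parse_judge_reply(call_judge(prompt))

def fake_judge(prompt):
    """Stub for demonstration; a real judge would call a model API here."""
    return '{"score": 4, "reason": "One claim is only loosely supported."}'

result = judge_faithfulness(
    fake_judge,
    ["Rome has many museums."],
    "What can we do on a rainy day?",
    "Visit a museum.",
)
print(result)
```

Treating a malformed reply as an error, rather than silently defaulting the score, keeps broken judge runs from skewing the averages.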

Pros: fast, scales to hundreds of cases, catches subtle drift.
Cons: judge can be wrong or lenient — calibrate against human scores monthly.

Best practice: use a judge model at least as capable as the production model, and spot-check 10% of judged cases by hand.
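
One way to make the calibration and spot-check concrete is sketched below. It assumes human and judge scores are available as dicts keyed by case id on the same 1-5 scale; the function names and the sample scores are hypothetical.

```python
import random

def calibration_report(human, judge):
    """Compare judge scores to human scores for the case ids both have scored."""
    shared = sorted(set(human) & set(judge))
    if not shared:
        raise ValueError("no overlapping case ids")
    diffs = [judge[c] - human[c] for c in shared]
    return {
        "n": len(shared),
        "exact_agreement": sum(d == 0 for d in diffs) / len(shared),
        "mean_abs_diff": sum(abs(d) for d in diffs) / len(shared),
        # Positive bias means the judge scores higher (more lenient) than humans.
        "bias": sum(diffs) / len(shared),
    }

def spot_check_sample(case_ids, fraction=0.10, seed=0):
    """Pick a reproducible random subset of cases for manual review."""
    ids = sorted(case_ids)
    if not ids:
        return []
    k = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)

# Made-up scores for illustration only.
human = {"c1": 4, "c2": 2, "c3": 5, "c4": 3}
judge = {"c1": 5, "c2": 3, "c3": 5, "c4": 3}
report = calibration_report(human, judge)
print(report)
print(len(spot_check_sample([f"case-{i}" for i in range(30)])))
```

A consistently positive `bias` is the signal for a lenient judge; a fixed seed makes the spot-check sample reproducible across reviewers.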


Agent trajectory evals

Final answer quality is not enough. Score the path:

| Check | How |
|---|---|
| Tool sequence | get_weather before outdoor suggestions? |
| No skipped tools | Did not invent weather when rain mattered |
| Step budget | Finished in ≤ 10 LLM steps |
| Error recovery | Retried typo city name |
| Cost | Total tokens < threshold |

Implementation: run the agent with mocked tools, using fixtures (canned fake API responses), in CI (continuous integration: automated tests on every code push):

python
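def mock_get_weather(city, **kwargs):
    """Canned fixture: always reports rain so the indoor path is exercised."""
    return {"condition": "rain", "temp_c": 18}

# final_answer (the agent's reply) and trace_contains_tool (a lookup over the
# recorded tool calls) are assumed to come from your test harness.
assert "indoor" in final_answer.lower()
assert trace_contains_tool("get_weather")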

Mocks make CI stable — real APIs belong in nightly staging evals.
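
A self-contained sketch of a trajectory check with mocked tools follows. `toy_agent` is a hypothetical stand-in for a real agent loop, and `make_traced_tools` and `check_trajectory` are illustrative helper names, not part of any framework. Note that the step budget here counts tool calls, which is a simplification of counting LLM steps.

```python
def make_traced_tools(tools, trace):
    """Wrap each tool so every call is appended to `trace` as (name, kwargs)."""
    def wrap(name, fn):
        def traced(**kwargs):
            trace.append((name, kwargs))
            return fn(**kwargs)
        return traced
    return {name: wrap(name, fn) for name, fn in tools.items()}

# Canned fixtures standing in for real APIs.
def mock_get_weather(city, **kwargs):
    return {"condition": "rain", "temp_c": 18}

def mock_search_places(city, indoor, **kwargs):
    kind = "indoor" if indoor else "outdoor"
    return [f"{kind} museum in {city}"]

def toy_agent(question, tools):
    """Hypothetical stand-in for an agent: checks weather first, then places."""
    weather = tools["get_weather"](city="Rome")
    raining = weather["condition"] == "rain"
    places = tools["search_places"](city="Rome", indoor=raining)
    return f"Suggested: {places[0]}"

def check_trajectory(trace, required_order, max_steps):
    """Return failure messages for tool order and step budget."""
    names = [name for name, _ in trace]
    failures = []
    if len(names) > max_steps:
        failures.append(f"step budget exceeded: {len(names)} > {max_steps}")
    position = 0
    for tool in required_order:
        try:
            # Each required tool must appear after the previous one.
            position = names.index(tool, position) + 1
        except ValueError:
            failures.append(f"{tool} missing or out of order")
            break
    return failures

trace = []
tools = make_traced_tools(
    {"get_weather": mock_get_weather, "search_places": mock_search_places}, trace
)
final_answer = toy_agent("Plan a rainy-day activity in Rome for a 7-year-old.", tools)

assert "indoor" in final_answer.lower()
assert check_trajectory(trace, ["get_weather", "search_places"], max_steps=10) == []
```

Because the tools are injected as a dict, the same agent code can run against fixtures in CI and against real clients in nightly staging runs.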


Metrics cheat sheet

| Metric | Question it answers |
|---|---|
| Task success rate | Did user goal get met? |
| Faithfulness | Claims grounded in context/tools? |
| Relevance | On-topic vs rambling? |
| Citation accuracy | Right doc linked? |
| Tool accuracy | Correct tool + args? |
| Latency p95 | 95% of runs finish under this time (the “typical slow case” speed) |
| Cost per success | Economically viable? |

Report all that matter to product — not only success rate.
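
Three of these metrics can be computed from per-run records as sketched below. The record shape (`success`, `latency_s`, `cost_usd`) and the sample numbers are assumptions for illustration, and the p95 uses the nearest-rank definition, which is one of several common percentile conventions.

```python
import math

def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def summarize(runs):
    """`runs` is a list of dicts with keys: success (bool), latency_s, cost_usd."""
    successes = sum(1 for r in runs if r["success"])
    total_cost = sum(r["cost_usd"] for r in runs)
    return {
        "success_rate": successes / len(runs),
        "latency_p95_s": nearest_rank_percentile([r["latency_s"] for r in runs], 95),
        # Failed runs still cost money, so divide total spend by successes only.
        "cost_per_success_usd": total_cost / successes if successes else float("inf"),
    }

# Made-up run records for illustration only.
runs = [
    {"success": True, "latency_s": 2.0, "cost_usd": 0.02},
    {"success": True, "latency_s": 3.0, "cost_usd": 0.03},
    {"success": False, "latency_s": 9.0, "cost_usd": 0.05},
    {"success": True, "latency_s": 4.0, "cost_usd": 0.02},
]
summary = summarize(runs)
print(summary)
```

Dividing total spend by successful runs, rather than by all runs, is what makes cost per success rise when quality drops.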


Eval loop in practice

text
1. BASELINE: run eval v1.0 on system @ git SHA abc123
2. CHANGE:   new planner prompt / GPT-4.1-mini → 4.1
3. RE-RUN:   same frozen eval set
4. COMPARE:  faithfulness 4.2 → 4.0 (−0.2) → investigate
5. GATE:     block merge if success rate drops >5% (catches regressions: quality that used to pass and no longer does)

Store results:

json
{
  "run_id": "2025-06-25-staging",
  "git_sha": "abc123",
  "prompt_version": "planner-v3",
  "model": "gpt-4.1-mini",
  "aggregate": {"success_rate": 0.86, "faithfulness_mean": 4.1}
}
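
The COMPARE and GATE steps could look like the sketch below, which reads two stored result records of the shape shown above. `gate` is a hypothetical helper, and it interprets the 5% threshold as an absolute drop of 0.05 in success rate (5 percentage points); a relative-drop rule would be an equally valid reading.

```python
def gate(baseline, candidate, max_drop=0.05, metric="success_rate"):
    """Compare two stored runs; return (ok, message).

    `max_drop` is an absolute drop in the metric (0.05 = 5 percentage points).
    """
    before = baseline["aggregate"][metric]
    after = candidate["aggregate"][metric]
    ok = (before - after) <= max_drop
    verdict = "PASS" if ok else "BLOCK"
    message = f"{verdict}: {metric} {before:.2f} -> {after:.2f} (change {after - before:+.2f})"
    return ok, message

baseline = {
    "git_sha": "abc123",
    "aggregate": {"success_rate": 0.86, "faithfulness_mean": 4.1},
}
# Hypothetical candidate run after a prompt change.
candidate = {
    "git_sha": "def456",
    "aggregate": {"success_rate": 0.78, "faithfulness_mean": 4.0},
}

ok, message = gate(baseline, candidate)
print(message)
```

Keeping `git_sha`, prompt version, and model in every record is what makes such a comparison traceable to a specific change.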

Frameworks and tooling

| Tool | Strength |
|---|---|
| LangSmith | Traces + datasets + evals in LangChain ecosystem |
| Braintrust | Experiment tracking, human review UI |
| Phoenix (Arize) | Observability + eval on traces |
| Custom JSONL + pytest | Full control; pytest is Python’s de facto standard test runner |

For this course, a JSONL file + Python script is enough to learn the discipline.


CI integration (minimal example)

CI = continuous integration — e.g. GitHub Actions running tests automatically when you push code.

yaml
# .github/workflows/agent-evals.yml
- name: Run agent evals
  run: python scripts/run_evals.py --mock-tools --min-success 0.80

Nightly: run the full eval against live APIs. These runs are flaky, so retry failures and alert on trends rather than on a single flake.
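
A sketch of what `scripts/run_evals.py` might contain, assuming only the two flags shown in the workflow step. `run_all_cases` is a placeholder with fixed results so the example runs end to end; a real version would load the JSONL eval set and invoke the agent.

```python
import argparse

def run_all_cases(mock_tools):
    """Placeholder: run every eval case and return a list of pass/fail booleans.

    The fixed results below exist only so this sketch is runnable.
    """
    return [True, True, True, True, False]

def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Run agent evals and gate on success rate."
    )
    parser.add_argument("--mock-tools", action="store_true",
                        help="use canned tool fixtures instead of live APIs")
    parser.add_argument("--min-success", type=float, default=0.80,
                        help="minimum success rate required to pass")
    args = parser.parse_args(argv)

    results = run_all_cases(mock_tools=args.mock_tools)
    success_rate = sum(results) / len(results)
    print(f"success rate: {success_rate:.2f} (minimum {args.min_success:.2f})")
    # A non-zero exit code is what makes the CI step, and so the merge, fail.
    return 0 if success_rate >= args.min_success else 1

# In the script itself, finish with: raise SystemExit(main())
exit_code = main(["--mock-tools", "--min-success", "0.80"])
```

The gate works because GitHub Actions marks a `run` step as failed whenever its command exits with a non-zero status.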


Connecting Module 7 and Module 8 evals

| Layer | Eval focus |
|---|---|
| Retrieval only | Did correct chunk appear in top-5? (Module 7) |
| RAG answer | Faithfulness to chunks |
| Agent plan | Steps cover required tools |
| Full agent | End-to-end success + cost |

Improve retrieval before blaming the planner prompt.
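
The retrieval-only check in the first row can be sketched as a hit-rate-at-k computation. The chunk ids and example data are made up, and `hit_at_k` and `hit_rate` are illustrative names rather than functions from any library.

```python
def hit_at_k(retrieved_ids, gold_ids, k=5):
    """True if any gold chunk id appears among the top-k retrieved ids."""
    return any(chunk_id in gold_ids for chunk_id in retrieved_ids[:k])

def hit_rate(examples, k=5):
    """Fraction of examples with a hit. Each example is (retrieved_ids, gold_ids)."""
    hits = [hit_at_k(retrieved, set(gold), k) for retrieved, gold in examples]
    return sum(hits) / len(hits)

# Made-up retrieval results, ranked best first.
examples = [
    (["d7", "d2", "d9", "d1", "d4"], ["d2"]),        # gold at rank 2: hit
    (["d3", "d8", "d5", "d6", "d0", "d2"], ["d2"]),  # gold at rank 6: miss at k=5
]
print(hit_rate(examples))
print(hit_rate(examples, k=6))
```

If this number is low, poor faithfulness scores downstream are more likely a retrieval problem than a prompt problem.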


Common mistakes

| Mistake | Fix |
|---|---|
| Eval set = 3 demo questions | Add real failure cases from logs |
| Only LLM-judge, no humans | Calibrate monthly |
| Eval once before launch | Re-run on every prompt change |
| Live APIs in unit CI | Mock tools; staging for integration |
| Optimize success rate only | Track false refusals and cost |

Before the quiz and project

Your travel planner should include at least 10 eval cases: sunny trip, rainy trip, typo city, missing budget, train preference from memory, etc.


What's next

Lesson 8 — Agentic system design