What's the best LLM observability tool in 2026?

There is no single best tool — the right pick depends on stack and buy-vs-build appetite. For OSS, self-hosted setups we default to [Langfuse](https://langfuse.com). For SaaS teams optimizing for developer velocity, [Braintrust](https://braintrust.dev) is the fastest path to a productionized eval pipeline. For OpenTelemetry-native shops, [Phoenix by Arize](https://phoenix.arize.com) is the natural fit. If you're RAG-heavy, [Ragas](https://docs.ragas.io) ships purpose-built metrics (faithfulness, answer relevance, context precision). And if you live in the OpenAI ecosystem, [OpenAI Evals](https://github.com/openai/evals) is still the simplest first step.

How much should I budget for eval costs?

Eval cost scales as roughly N × M × LLM_cost, where N is the number of samples in your golden set, M is the number of eval criteria (correctness, relevance, groundedness, etc.) and LLM_cost is the per-call cost of your judge model. For a 200-case golden set scored across 5 criteria with a Claude 4.5 Sonnet or GPT-5-class judge, expect roughly $1-$5 per full eval run. Multiply by every PR that triggers a full regression and add online sampling (1-5% of production traffic) and it adds up. Most teams we work with budget eval costs as a fixed monthly line item rather than letting it grow uncapped.
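The N × M × LLM_cost formula above can be turned into a quick budget estimate. This is a minimal sketch: the per-call costs of $0.001 and $0.005 are simply what the $1-$5 range implies for 1,000 judge calls, and the 20 runs per day and 22 working days are illustrative assumptions, not measured figures.

```python
def eval_run_cost(n_samples: int, n_criteria: int, cost_per_call_usd: float) -> float:
    """Cost of one full offline eval run: N x M x per-call judge cost."""
    return n_samples * n_criteria * cost_per_call_usd


def monthly_eval_cost(run_cost_usd: float, full_runs_per_day: int, working_days: int = 22) -> float:
    """Offline spend only; online sampling is a separate, traffic-dependent term."""
    return run_cost_usd * full_runs_per_day * working_days


# 200 cases x 5 criteria = 1,000 judge calls per run.
low = eval_run_cost(200, 5, 0.001)    # assumed cheap end of the per-call range
high = eval_run_cost(200, 5, 0.005)   # assumed expensive end of the per-call range
print(f"per run: ${low:.2f} to ${high:.2f}")
print(f"monthly at 20 full runs/day: ${monthly_eval_cost(high, 20):,.2f}")
```

Plugging in your own traffic and judge pricing gives the fixed monthly line item to cap against.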

What's LLM-as-judge and when does it fail?

LLM-as-judge means using a stronger LLM to score the outputs of a cheaper agent — for example, scoring GPT-4o-mini outputs with Claude 4.5 Sonnet on a 1-5 rubric. It's the only scalable way to get thousands of quality scores per day. It fails when (1) the judge is biased toward its own family's outputs (a known effect — never use the same model family as both agent and judge for high-stakes scoring), (2) the judge model itself is silently updated by the provider and your scores drift, (3) the rubric is vague enough that the judge invents its own scoring axis, or (4) judge scores look stable but don't correlate with real user feedback. Always calibrate against a human-labeled subset of 50-100 cases before trusting any judge in production.

AI Agent Testing, Evaluation & Observability: 2026 Guide

Testing AI agents is fundamentally different from testing traditional software. A function either returns the correct value or it does not. An AI agent returns responses that exist on a spectrum from perfect to acceptable to wrong to harmful — and the same input can produce different outputs on different runs. You cannot test AI agents with simple assertions. You need evaluation frameworks that measure quality at scale.

In our experience, most AI projects that fail in production fail because of inadequate evaluation, not inadequate models. The team builds a demo that works on 20 hand-picked examples, ships it, and discovers that real-world inputs are nothing like their test set. This guide covers how to build the evaluation infrastructure that prevents this.

The Evaluation Stack

Production AI testing operates at three levels.

Level 1: Pre-deployment evaluation

Testing before the agent goes live. This catches problems before they reach users.

Unit evaluation — Does each component (retrieval, generation, tool calling) work correctly in isolation?
Integration evaluation — Do the components work together correctly?
End-to-end evaluation — Given real-world inputs, does the agent produce acceptable outputs?
Adversarial evaluation — Does the agent handle edge cases, malicious inputs, and unexpected scenarios safely?

Level 2: Pre-release regression

Testing before each update (prompt change, model upgrade, knowledge base update).

Regression suite — Run the full evaluation dataset to ensure the update does not degrade existing quality
A/B comparison — Compare new version against current version on the same inputs
Canary deployment — Roll out to a small percentage of traffic and monitor before full deployment

Level 3: Production monitoring

Continuous evaluation of the live system.

Automated scoring — Score a sample of live interactions on quality dimensions
Human review — Human evaluators review a random sample of agent interactions
User feedback — Thumbs up/down, ratings, and explicit feedback from users
Drift detection — Alert when agent quality degrades over time

Building an Evaluation Dataset

The evaluation dataset is the foundation of your testing strategy. Get this right and everything else follows.

What a good evaluation dataset contains

Component	Description	Example
Input	The user message or query	"How do I reset my password?"
Expected output	The ideal response (or acceptable response criteria)	"Navigate to Settings > Security > Reset Password..."
Context (if RAG)	The documents the agent should retrieve	Password reset documentation
Expected tools (if agent)	Which tools should be called	`search_knowledge_base("password reset")`
Metadata	Category, difficulty, source	category: "account", difficulty: "easy"
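One way to hold the fields from the table is a small record type stored as JSONL. This is a sketch under assumptions: the class name, field names, and the JSONL choice are illustrative, not a format any particular platform requires.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class EvalExample:
    input: str                                               # user message or query
    expected_output: str                                     # ideal response or acceptance criteria
    context: list[str] = field(default_factory=list)         # documents the agent should retrieve (RAG)
    expected_tools: list[str] = field(default_factory=list)  # tool calls the agent should make
    metadata: dict[str, str] = field(default_factory=dict)   # category, difficulty, source


example = EvalExample(
    input="How do I reset my password?",
    expected_output="Navigate to Settings > Security > Reset Password...",
    context=["Password reset documentation"],
    expected_tools=['search_knowledge_base("password reset")'],
    metadata={"category": "account", "difficulty": "easy"},
)

# One JSON object per line (JSONL) keeps the set diffable and easy to version.
line = json.dumps(asdict(example))
restored = EvalExample(**json.loads(line))
```

Keeping category and difficulty in metadata is what makes per-category regression checks possible later.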

How many examples do you need?

Purpose	Minimum	Recommended
Initial development	50	100
Pre-release regression	100	200–500
Comprehensive evaluation	200	500–1,000

How to collect evaluation examples

From real user data — The best source. Sample from actual user interactions. Annotate with correct answers.
From domain experts — Have subject matter experts write realistic queries and expected answers.
From error analysis — When the agent fails in production, add the failure case to the evaluation set.
Synthetic generation — Use an LLM to generate variations of existing examples. Useful for expanding coverage, but verify quality.

Categories to cover

Common queries (60%) — The bread-and-butter questions your agent handles daily
Edge cases (20%) — Ambiguous inputs, unusual phrasing, multi-part questions
Adversarial inputs (10%) — Prompt injection attempts, off-topic queries, harmful requests
Boundary cases (10%) — Questions at the edge of the agent's scope (should escalate vs answer)
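The 60/20/10/10 mix can be enforced when drawing a golden set from a larger labeled pool. The sketch below assumes each candidate is a dict with a "category" key; the category names and the function are hypothetical.

```python
import random
from collections import defaultdict

TARGET_MIX = {"common": 0.60, "edge": 0.20, "adversarial": 0.10, "boundary": 0.10}


def sample_golden_set(pool, size, mix=TARGET_MIX, seed=0):
    """Draw `size` examples from `pool` so each category matches its target share.

    Raises ValueError if a category does not have enough candidates.
    """
    rng = random.Random(seed)  # fixed seed so the same pool yields the same set
    by_category = defaultdict(list)
    for item in pool:
        by_category[item["category"]].append(item)

    picked = []
    for category, share in mix.items():
        quota = round(size * share)
        candidates = by_category[category]
        if len(candidates) < quota:
            raise ValueError(f"need {quota} '{category}' examples, have {len(candidates)}")
        picked.extend(rng.sample(candidates, quota))
    return picked


pool = [{"id": i, "category": c} for c in TARGET_MIX for i in range(100)]
golden = sample_golden_set(pool, size=50)
```

Failing loudly when a category is short is deliberate: a golden set with no adversarial cases should block the build rather than pass silently.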

Evaluation Metrics

For response quality

Metric	What It Measures	How to Calculate
Correctness	Is the answer factually accurate?	LLM-as-judge or human review
Relevance	Does the answer address the actual question?	LLM-as-judge scoring 1–5
Completeness	Does the answer cover all aspects of the question?	LLM-as-judge or checklist
Groundedness	Is the answer supported by retrieved context (not hallucinated)?	Compare claims against source documents
Harmlessness	Does the answer avoid harmful, biased, or inappropriate content?	Automated content classifiers + human review

For RAG quality

Metric	What It Measures	How to Calculate
Retrieval precision	What percentage of retrieved chunks are relevant?	Human annotation of retrieved chunks
Retrieval recall	What percentage of relevant chunks were retrieved?	Compare against known-relevant documents
Context utilization	Does the LLM actually use the retrieved context?	Compare response against context content
Citation accuracy	Are citations correct and pointing to actual sources?	Verify each citation against source
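Retrieval precision and recall reduce to set arithmetic once each chunk has an ID and the relevant chunks are labeled. A minimal sketch, with made-up document IDs:

```python
def retrieval_precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision and recall for one query, given chunk IDs.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four chunks retrieved, two of the three relevant ones among them.
precision, recall = retrieval_precision_recall(
    retrieved=["doc-1", "doc-4", "doc-7", "doc-9"],
    relevant={"doc-1", "doc-2", "doc-4"},
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Average the two numbers across the golden set to get a per-release retrieval score.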

For agent behavior

Metric	What It Measures	How to Calculate
Tool selection accuracy	Does the agent call the right tool?	Compare against expected tools in eval set
Tool argument accuracy	Are the arguments passed to tools correct?	Validate against expected arguments
Step efficiency	Does the agent complete the task in a reasonable number of steps?	Count LLM calls per task
Escalation accuracy	Does the agent correctly escalate when it should?	Compare escalation decisions against labels
Boundary adherence	Does the agent stay within its defined scope?	Test with out-of-scope inputs
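Tool selection and argument accuracy can be scored by comparing the recorded calls with the expected ones from the eval set. This sketch assumes a hypothetical trace format of {"name", "args"} dicts compared position by position; `create_ticket` is an invented tool name used only for illustration.

```python
def tool_call_accuracy(expected: list[dict], actual: list[dict]) -> dict[str, float]:
    """Compare tool calls position by position.

    Selection accuracy counts matching tool names over expected calls, so a
    missing call lowers it. Argument accuracy counts exact argument matches
    among the calls whose name was right. Extra trailing calls are ignored.
    """
    if not expected:
        return {"selection": 1.0 if not actual else 0.0, "arguments": 1.0}
    name_hits = 0
    arg_hits = 0
    for exp, act in zip(expected, actual):
        if exp["name"] == act["name"]:
            name_hits += 1
            if exp["args"] == act["args"]:
                arg_hits += 1
    return {
        "selection": name_hits / len(expected),
        "arguments": arg_hits / name_hits if name_hits else 0.0,
    }


expected_calls = [
    {"name": "search_knowledge_base", "args": {"query": "password reset"}},
    {"name": "create_ticket", "args": {"priority": "low"}},
]
actual_calls = [
    {"name": "search_knowledge_base", "args": {"query": "password reset"}},
    {"name": "create_ticket", "args": {"priority": "high"}},  # right tool, wrong argument
]
scores = tool_call_accuracy(expected_calls, actual_calls)
```

Exact argument matching is strict; free-text arguments such as search queries usually need a looser comparison.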

LLM-as-Judge — How to Do It Without Burning Your Eval Budget

The most scalable evaluation method is using another LLM to judge the quality of agent responses. It is also the line item that bankrupts evaluation pipelines when teams turn it on without a budget. Here is how we run it across 10 AI products without watching eval cost balloon past inference cost.

The basic pattern

judge_prompt = """
You are evaluating a customer support AI agent's response.

Customer question: {question}
Agent response: {response}
Reference answer: {reference}

Rate the response on these dimensions (1-5 scale):

1. Correctness: Is the information factually accurate?
2. Relevance: Does it address the customer's actual question?
3. Completeness: Does it cover all necessary information?
4. Tone: Is the tone appropriate and professional?
5. Actionability: Can the customer act on this response?

For each dimension, provide the score and a brief justification.
Return as JSON.
"""

Best practices for LLM-as-judge

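Use a stronger model as judge — If your agent uses GPT-4o-mini, use a stronger model from a different family, such as Claude 4.5 Sonnet, as the judge. A same-family judge such as GPT-5 is acceptable only for low-stakes scoring, because judges tend to over-score their own family's output style. We have a separate breakdown on which model to use as judge in Claude vs GPT vs Gemini 2026.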
Provide reference answers — Judges are more accurate when they have a gold standard to compare against.
Use structured rubrics — Specific scoring criteria produce more consistent results than open-ended evaluation.
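Validate with human agreement — Check that your LLM judge agrees with human evaluators on a sample. Aim for Cohen's kappa of 0.7+. Kappa corrects raw agreement for chance, so the matching percent agreement depends on how the scores are distributed, but on a 5-point rubric it often lands around 80%.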
Use multiple judge prompts — Average scores across different prompt framings to reduce bias.
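The judge prompt above ends with "Return as JSON" but does not pin down a shape, and a structured rubric only helps if the reply is checked before it becomes a score. The sketch below assumes a reply of the form {"correctness": {"score": 4, "justification": "..."}, ...}; that shape is an assumption of this example, so spell it out in the prompt if you adopt it.

```python
import json

DIMENSIONS = ("correctness", "relevance", "completeness", "tone", "actionability")


def parse_judge_output(raw: str) -> dict[str, int]:
    """Validate the judge's JSON reply and return one integer score per dimension.

    Raises ValueError on malformed replies so they never turn into scores.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")

    scores = {}
    for dim in DIMENSIONS:
        entry = data.get(dim)
        score = entry.get("score") if isinstance(entry, dict) else None
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"missing or out-of-range score for {dim!r}")
        scores[dim] = score
    return scores


reply = json.dumps({d: {"score": 4, "justification": "ok"} for d in DIMENSIONS})
parsed = parse_judge_output(reply)
```

Counting how often this raises is itself a useful health metric for the judge.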

Three failure modes that show up in production

1. Judge cost can quietly exceed agent cost. A 5-criterion rubric over a 500-case golden set is 2,500 judge calls per run. At ~$0.01/call that is ~$25 per CI run, and at twenty PRs a day that is ~$500 a day. Treat judge cost as a real line item; see our AI agent development cost model.

2. Judge bias is real and undertested. LLM judges over-score their own model family's output style. Rule we follow: judge family and agent family must differ. If the agent is OpenAI, the judge is Anthropic, or vice versa.

3. The meta-eval problem — who evals the judge? Providers silently update model versions, and the same prompt that scored a 4 last month scores a 3 today. The fix: a frozen 50-100 hand-labeled calibration set re-scored weekly. If judge agreement with human labels drops below baseline kappa, pin the judge to a dated checkpoint and investigate.
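The weekly calibration check comes down to computing Cohen's kappa between the frozen human labels and the judge's fresh scores. Below is a minimal sketch of the unweighted statistic with made-up labels; for an ordinal 1-5 rubric a weighted variant (for example the `weights` option of scikit-learn's `cohen_kappa_score`) may be the better fit.

```python
from collections import Counter


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_freq = Counter(human)
    judge_freq = Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_freq[label] * judge_freq[label] for label in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


human_labels = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
judge_labels = [5, 4, 3, 3, 5, 2, 4, 4, 3, 4]  # disagrees on 2 of 10 cases
kappa = cohens_kappa(human_labels, judge_labels)
print(f"kappa = {kappa:.3f}")
```

In this toy sample the raters agree on 8 of 10 cases and chance agreement is 0.29, which works out to a kappa of about 0.72.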

Bottom line: LLM-as-judge is the right answer at scale, but it is not free, not unbiased, and not stable. Treat the judge as production infrastructure with its own versioning, cost line, and calibration suite.

Regression Testing Workflow

Every time you change prompts, models, or knowledge base content:

1. Run full evaluation suite against current version → baseline scores
2. Make the change
3. Run full evaluation suite against new version → new scores
4. Compare: overall accuracy, per-category scores, worst-case examples
5. If new version is better overall AND no category regresses more than 5%:
 → Approve for deployment
6. Otherwise (overall score flat or down, or any category regresses more than 5%):
 → Investigate, fix, re-evaluate

Automate this in your CI/CD pipeline. Never ship a prompt change without running the evaluation suite.
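The compare-and-decide steps of the workflow can be sketched as a single function. Two assumptions here: "5%" is read as five absolute points on a 0-1 accuracy scale, and the overall score is the unweighted mean across categories. The category names and scores are made up.

```python
def regression_decision(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_category_drop: float = 0.05,
) -> tuple[str, list[str]]:
    """Return ("approve" or "investigate", list of regressed categories)."""
    regressed = [c for c in baseline if baseline[c] - candidate[c] > max_category_drop]
    overall_base = sum(baseline.values()) / len(baseline)
    overall_cand = sum(candidate[c] for c in baseline) / len(baseline)
    if overall_cand > overall_base and not regressed:
        return "approve", regressed
    return "investigate", regressed


baseline_scores = {"account": 0.90, "billing": 0.80, "shipping": 0.85}
candidate_scores = {"account": 0.95, "billing": 0.72, "shipping": 0.93}
decision, regressed = regression_decision(baseline_scores, candidate_scores)
print(decision, regressed)
```

The candidate here is better on average, yet billing drops eight points, so it goes back for investigation. That is the case the per-category rule exists to catch.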

Production Monitoring

Real-time metrics

Metric	Collection Method	Alert Threshold
Response latency	Application logging	> 5 seconds (P95)
Error rate	Application logging	> 2%
Tool call failure rate	Tool execution logging	> 5%
Escalation rate	Agent decision logging	> 30% (or sudden change)
User feedback score	In-app feedback	Under 3.5/5 (7-day rolling average)
Cost per interaction	Token counting + pricing	> 2x baseline
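The thresholds in the table translate directly into a periodic check over a metrics window. This is a sketch only: the window dict and its key names are hypothetical, and it assumes the window contains at least one request and one tool call.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not values:
        raise ValueError("no latency samples in window")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def check_alerts(window: dict) -> list[str]:
    """Return the names of every threshold from the table that the window breaches."""
    alerts = []
    if p95(window["latencies_s"]) > 5.0:
        alerts.append("latency_p95")
    if window["errors"] / window["requests"] > 0.02:
        alerts.append("error_rate")
    if window["tool_failures"] / window["tool_calls"] > 0.05:
        alerts.append("tool_failure_rate")
    if window["escalations"] / window["requests"] > 0.30:
        alerts.append("escalation_rate")
    if window["feedback_avg_7d"] < 3.5:
        alerts.append("user_feedback")
    if window["cost_per_interaction"] > 2 * window["baseline_cost"]:
        alerts.append("cost_per_interaction")
    return alerts


window = {
    "latencies_s": [1.0] * 95 + [6.0] * 5,
    "requests": 1000, "errors": 30,
    "tool_calls": 400, "tool_failures": 10,
    "escalations": 120,
    "feedback_avg_7d": 4.1,
    "cost_per_interaction": 0.05, "baseline_cost": 0.02,
}
alerts = check_alerts(window)
```

The "sudden change" condition on escalation rate needs a comparison with the previous window, which this sketch leaves out.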

Automated quality sampling

Score a random 1–5% of production interactions using LLM-as-judge daily. Track quality scores over time. Alert when scores drop below threshold or trend downward.

Important gotcha: eval scores do not translate 1:1 to user-perceived quality. A judge can score a response 4.5/5 on correctness while users rate it 2/5 because the tone was robotic or the answer was technically right but unhelpful. Calibrate eval scores against actual user feedback (CSAT, thumbs up/down, ticket reopen rate) at least quarterly. If the correlation drops below ~0.6, the rubric is measuring the wrong thing and needs rewriting.
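The quarterly calibration can be as simple as a Pearson correlation between judge scores and user ratings for the same interactions. A minimal sketch with made-up numbers; with real data you would want far more than six pairs.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


judge_scores = [4.5, 4.0, 3.0, 4.8, 2.5, 3.5]   # LLM-as-judge, 1-5
user_ratings = [2.0, 4.0, 3.0, 5.0, 2.0, 3.0]   # CSAT for the same interactions, 1-5
r = pearson(judge_scores, user_ratings)
needs_rubric_rewrite = r < 0.6
```

The first pair is the failure described above: the judge gives 4.5 while the user gives 2, and that alone pulls the correlation under the 0.6 line in this sample.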

Human review cadence

Review Type	Cadence	Sample Size
Random sample review	Weekly	50–100 interactions
Escalated interaction review	Daily	All escalated interactions
Low-confidence response review	Daily	All responses below confidence threshold
Negative feedback review	Daily	All negative feedback interactions

Observability Tooling in 2026

The observability tooling category matured fast between late 2024 and early 2026. In 2024 the typical setup was "a trace dashboard plus a Postgres table of scored rows." Today's stack is trace + eval + replay + score + dataset management, all in one platform, often with OpenTelemetry-compatible spans and built-in LLM-as-judge runners. If you are picking tooling for an agent project in May 2026, the choice space has narrowed to a handful of serious options.

What changed since 2024

Traces became first-class — full agent runs (every LLM call, tool call, retrieval, retry), not flat lists of completions.
Evals run on the trace, not just the response — you can score a trajectory ("did the agent take reasonable steps?"), not only the final string.
Replay arrived — re-run a stored production trace against a new prompt or model and diff the result. The closest thing to a unit test LLM agents have.
Dataset management is built in — failures sampled from production can be promoted into the golden set in one click.
OpenTelemetry-native ingest — most platforms accept OTel spans directly, so you do not commit to a vendor SDK.

Categorized tool roundup

Langfuse — Open-source observability with traces, scores, datasets, and a judge runner. Best for self-hosting and full data ownership. Gotcha: scaling self-hosted past ~10M spans/month needs real ops work.

Braintrust — Eval-first SaaS, strong on dataset diffing, replay, and human-in-the-loop scoring. Best for shipping a productionized eval loop in days. Gotcha: SaaS-only; per-seat plus per-event pricing scales with traffic.

Phoenix by Arize — Open-source tracing + eval, OpenTelemetry-native. Best for OTel-first shops. Gotcha: the OSS-vs-commercial split is confusing on first read; pick which side you are on early.

Ragas — Python framework for RAG metrics (faithfulness, answer relevance, context precision, context recall). Best when retrieval quality is the main question. Gotcha: it is a library, not a platform — you bring the dashboard.

TruLens — Code-first eval with composable feedback functions. Gotcha: project velocity has been uneven; verify activity before committing.

LangSmith — LangChain's own observability and eval platform. Best if you are on LangChain/LangGraph. Gotcha: framework lock-in — moving off LangChain means moving off LangSmith.

Helicone — LLM gateway with eval built in. Best for single-point capture across providers. Gotcha: a gateway is in your critical path — its outages are yours.

OpenAI Evals — OpenAI's OSS eval framework. Best as a Python-native starting point for OpenAI-ecosystem teams. Gotcha: framework, not a dashboard — ship the visualization yourself.

Which to pick — a short decision tree

OSS, self-hosted, full data ownership → Langfuse.
SaaS, prioritizing dev velocity → Braintrust.
OpenTelemetry-native, already on Arize or building OTel-first → Phoenix.
RAG-heavy agent where retrieval quality is the main question → Ragas (often layered on top of one of the above).
Already on LangChain/LangGraph → LangSmith — moving to anything else is unnecessary effort.
Already deep in the OpenAI ecosystem with no platform needs → OpenAI Evals as a starting point, graduate when needed.

We run a mix internally: Langfuse for the products where data residency matters, Braintrust where iteration speed matters, Ragas as a library inside both for RAG scoring. There is no one-tool-fits-all, and switching costs are lower than people think — most teams over-optimize this decision.

Online vs Offline Evals — When to Run Each

Offline evals run a fixed golden dataset against the agent in a controlled environment (CI, a notebook, a scheduled job). Online evals score real production traffic — sampled and labeled either by an LLM judge or by humans.

Offline evals are non-negotiable. They are the regression guard. They catch model swaps and prompt edits before they reach users.

Online evals are where teams cut corners and regret it. The reason is that they cost real money (you are paying for judge calls on a percentage of production traffic, every day) and the wins are slow — you measure a drift in week 8, not in PR 12. Teams skip them and then six months later discover that their offline scores look perfect but their support tickets are up 30%.

The simple rule we follow: run offline evals on every PR; sample 1-5% of production for online evals daily; reconcile the two once a week. If your offline scores diverge from your online scores by more than ~10%, your golden dataset is out of date.

Production Tracing — OpenTelemetry for LLM Apps

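By 2026, OpenTelemetry has GenAI-specific semantic conventions for LLM spans, with attributes such as gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.usage.output_tokens. The conventions are still evolving, and some early attribute names (gen_ai.prompt, gen_ai.completion, gen_ai.system) have been deprecated or renamed in later versions, so check the current spec before hard-coding them. Every serious observability platform mentioned above ingests OTel directly, which means you can capture once and route anywhere.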

A few rules that have saved us pain:

Tag every span with cost. Compute cost at ingest time, store it on the span. Querying cost-per-trace later without this is impossibly slow.
Sample aggressively at scale. At 1M+ traces/day, full-fidelity capture is not affordable. We sample 100% of errored or escalated runs, 5% of normal runs, and 1% of high-frequency repetitive runs.
Redact at the SDK layer, not the platform. Sensitive prompts and tool arguments must never leave your perimeter unredacted. Do the redaction client-side before the span is exported.
Keep raw responses for 7-30 days; keep scored summaries forever. This is a cost/value trade-off we have re-tuned twice.
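The sampling and redaction rules above can be sketched in plain Python. The trace kinds, the hash-bucket scheme, and the email-only redaction are illustrative assumptions; a real redactor would cover more than email addresses.

```python
import hashlib
import re

# Rates from the rule above: all errored or escalated runs, 5% normal, 1% repetitive.
SAMPLE_RATES = {"error": 1.0, "escalated": 1.0, "normal": 0.05, "repetitive": 0.01}
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def should_keep(trace_id: str, kind: str) -> bool:
    """Deterministic sampling: the same trace ID always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < SAMPLE_RATES[kind]


def redact(text: str) -> str:
    """Mask email addresses client-side, before the span is exported."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


kept = sum(should_keep(f"trace-{i}", "normal") for i in range(20_000))
print(f"kept {kept / 20_000:.1%} of normal traces")
print(redact("Customer jane.doe@example.com asked about billing"))
```

Hashing the trace ID instead of calling a random generator means every service that sees the same trace makes the same keep-or-drop decision, so sampled traces stay complete.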

For broader agent orchestration context and MCP-based agent infra, see our related guides — the tracing pattern is identical across orchestration frameworks.

Eval as CI Gate — The Engineering Pattern

The most useful operational change we have made in the last 12 months is gating model and prompt swaps on eval-score thresholds in CI. No prompt change, no model upgrade, no retrieval re-indexing ships to production without the gate passing.

The pattern, in pseudocode

# .ci/eval_gate.py: runs on every PR that touches /prompts, /models, or /retrieval
# Pseudocode: load_dataset, run_eval, and post_to_pr stand in for your own eval harness.
# Scores are assumed to be normalized to [0, 1], so 0.01 is one percentage point.
import os

GOLDEN_SET = load_dataset("golden/v2026-05")  # versioned, frozen per quarter
JUDGE = "claude-4-5-sonnet-2026-04-15"  # pinned, not floating
CRITERIA = ["correctness", "relevance", "groundedness", "tone", "safety"]
# Hard cap in USD; the environment variable name is a placeholder for your CI config
EVAL_BUDGET_PER_PR = float(os.environ["EVAL_BUDGET_PER_PR_USD"])

baseline = run_eval(branch="main", dataset=GOLDEN_SET, judge=JUDGE, criteria=CRITERIA)
candidate = run_eval(branch="HEAD", dataset=GOLDEN_SET, judge=JUDGE, criteria=CRITERIA)

# Post the diff table first so it reaches the PR even when a gate below fails
post_to_pr(candidate.summary)

# Gate 1: aggregate score must not regress by more than 1%
assert candidate.overall >= baseline.overall - 0.01, "overall regression"

# Gate 2: no individual criterion may regress by more than 5%
for c in CRITERIA:
    assert candidate[c] >= baseline[c] - 0.05, f"{c} regression"

# Gate 3: no individual example may regress from PASS to FAIL on safety
for example in GOLDEN_SET.where(criteria="safety"):
    assert not (baseline.passed(example) and not candidate.passed(example)), "safety regression (PASS to FAIL)"

# Gate 4: eval budget, fail if this PR's eval cost exceeds the per-PR budget
assert candidate.cost_usd < EVAL_BUDGET_PER_PR, "eval budget exceeded"

Patterns that matter

Pin the judge model. Floating judge versions are the most common cause of "the test suite passes today, fails tomorrow." We pin to a dated model ID and bump it explicitly on a quarterly cadence.
Version the golden set. A golden set is a product artifact. We tag it (golden/v2026-05) and ship new versions every 60-90 days — partly to add real-world failures, partly because stale golden sets quietly stop catching the failures that matter.
Budget per PR. The eval budget is a hard cap. If a PR's eval run would exceed it (because someone added 5x more criteria), the CI fails before the eval runs. This single check has saved us more than $4k/month of eval drift over the last year.
Surface the diff, not just pass/fail. Post a per-criterion delta table to the PR. Engineers fix things faster when they can see "correctness +2%, tone -1%, groundedness flat" than when they see a binary "FAIL."
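A per-criterion delta table is cheap to generate from the two score dicts. The sketch below renders plain text suitable for a PR comment; the scores are made up and the function name is hypothetical.

```python
def delta_table(baseline: dict[str, float], candidate: dict[str, float]) -> str:
    """Render a per-criterion delta table as plain text for a PR comment."""
    rows = [f"{'criterion':<14}{'baseline':>10}{'candidate':>11}{'delta':>9}"]
    for name in baseline:
        delta = candidate[name] - baseline[name]
        # The "+" flag forces an explicit sign so regressions stand out.
        rows.append(f"{name:<14}{baseline[name]:>10.3f}{candidate[name]:>11.3f}{delta:>+9.3f}")
    return "\n".join(rows)


baseline_scores = {"correctness": 0.88, "tone": 0.91, "groundedness": 0.85}
candidate_scores = {"correctness": 0.90, "tone": 0.90, "groundedness": 0.85}
table = delta_table(baseline_scores, candidate_scores)
print(table)
```

Posting this before the gates run means the reviewer sees the numbers even when the build fails.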

The whole point: model and prompt changes become as boring as schema migrations. They either pass the gate or they don't.

Evaluation Tools — Quick Reference

Tool	Type	Best For
Langfuse	OSS LLM observability	Self-hosted tracing, scoring, eval pipelines
Braintrust	SaaS eval platform	Fast eval iteration, replay, dataset diffing
Phoenix by Arize	OSS LLM tracing + eval	OpenTelemetry-native shops
Ragas	RAG eval library	Faithfulness, answer relevance, context precision
TruLens	Eval framework	Composable feedback functions
LangSmith	LangChain's platform	Teams on LangChain/LangGraph
Helicone	LLM gateway + eval	Single-point capture across providers
OpenAI Evals	OSS eval framework	OpenAI-ecosystem teams starting from scratch
DeepEval	OSS eval framework	Unit testing for LLM applications
Custom scripts	Python + your data	Full control, no vendor lock-in

Getting Started

Build your evaluation dataset — Start with 50 examples from real use cases. Grow to 200+ over the first month. Version it (golden/v2026-05) and plan a refresh every 60-90 days.
Pick observability tooling early — Choose one of Langfuse, Braintrust, Phoenix, or LangSmith in week 1 rather than waiting until you have 10k traces to wade through. The cost of switching is lower than the cost of flying blind.
Set up LLM-as-judge — Pin the judge model, cross-family with the agent, calibrate against 50-100 human labels.
Integrate into CI/CD as a gate — No prompt or model change ships without passing the eval-score thresholds. Budget eval cost per PR.
Deploy production monitoring + tracing — OpenTelemetry-compatible spans, cost on every span, sample 1-5% for online evals.
Establish human review cadence — Weekly reviews of sampled interactions, with one-click promotion of failures into the golden set.

Frequently Asked Questions

How do you measure AI agent quality?

AI agent quality is measured across multiple dimensions rather than a single pass/fail metric. The most important dimensions are correctness (factual accuracy), relevance (does the response address the user's actual question), completeness (does it cover all aspects), and groundedness (is the response supported by retrieved context rather than hallucinated). Production teams typically combine automated LLM-as-judge scoring on these dimensions with periodic human review of sampled interactions to get a comprehensive picture. Tracking these metrics over time with tools like Langfuse or LangSmith reveals quality trends that a one-time test cannot capture.

What metrics matter most for agent evaluation?

The metrics that matter most depend on your use case, but three are nearly universal. Task completion rate measures how often the agent successfully resolves the user's request without human intervention. Correctness measures factual accuracy against a reference answer set. Escalation accuracy measures whether the agent correctly identifies cases it cannot handle and routes them to a human. For RAG-based systems, retrieval precision and groundedness are also critical — an agent that retrieves the wrong documents will generate plausible-sounding but incorrect answers.

How often should you test AI agents?

You should run your full evaluation suite before every deployment, because any prompt change, model upgrade, or knowledge base update can cause unexpected regressions. Beyond pre-deployment testing, production agents need continuous monitoring: automated quality scoring on 1–5% of live interactions daily, weekly human review of 50–100 sampled conversations, and daily review of all escalated or negatively-rated interactions. The cadence matters because LLM behavior can drift as providers update models, and your users' questions evolve over time in ways your original test set may not cover.

What tools are best for testing AI agents?

The best tooling depends on your stack and scale — see the Observability Tooling in 2026 section above for the full breakdown. The short version: Langfuse for OSS self-hosted, Braintrust for SaaS dev velocity, Phoenix by Arize for OpenTelemetry-native shops, Ragas if you are RAG-heavy, LangSmith if you are already on LangChain/LangGraph, and OpenAI Evals if you are deep in the OpenAI ecosystem and want a Python-native starting point. Many production teams combine two or more — for example Langfuse for tracing plus Ragas as a library for RAG-specific metrics inside it.

For help building evaluation infrastructure for your AI agents, explore our AI agent development services or contact us. We build evaluation suites, observability pipelines, and CI gates as part of every agent deployment — and we can help you pick between Langfuse, Braintrust, Phoenix, and the rest based on your stack rather than the loudest blog post. For related operational guides see AI agent orchestration, MCP protocol, AI agent cost modeling, and Claude vs GPT vs Gemini 2026 for picking the judge model.

The Evaluation Stack

Production AI testing operates at three levels.

Level 1: Pre-deployment evaluation

Testing before the agent goes live. This catches problems before they reach users.

Unit evaluation — Does each component (retrieval, generation, tool calling) work correctly in isolation?
Integration evaluation — Do the components work together correctly?
End-to-end evaluation — Given real-world inputs, does the agent produce acceptable outputs?
Adversarial evaluation — Does the agent handle edge cases, malicious inputs, and unexpected scenarios safely?

Level 2: Pre-release regression

Testing before each update (prompt change, model upgrade, knowledge base update).

Regression suite — Run the full evaluation dataset to ensure the update does not degrade existing quality
A/B comparison — Compare new version against current version on the same inputs
Canary deployment — Roll out to a small percentage of traffic and monitor before full deployment

Level 3: Production monitoring

Continuous evaluation of the live system.

Automated scoring — Score a sample of live interactions on quality dimensions
Human review — Human evaluators review a random sample of agent interactions
User feedback — Thumbs up/down, ratings, and explicit feedback from users
Drift detection — Alert when agent quality degrades over time

Building an Evaluation Dataset

The evaluation dataset is the foundation of your testing strategy. Get this right and everything else follows.

What a good evaluation dataset contains

Component	Description	Example
Input	The user message or query	"How do I reset my password?"
Expected output	The ideal response (or acceptable response criteria)	"Navigate to Settings > Security > Reset Password..."
Context (if RAG)	The documents the agent should retrieve	Password reset documentation
Expected tools (if agent)	Which tools should be called	`search_knowledge_base("password reset")`
Metadata	Category, difficulty, source	category: "account", difficulty: "easy"

How many examples do you need?

Purpose	Minimum	Recommended
Initial development	50	100
Pre-release regression	100	200–500
Comprehensive evaluation	200	500–1,000

How to collect evaluation examples

From real user data — The best source. Sample from actual user interactions. Annotate with correct answers.
From domain experts — Have subject matter experts write realistic queries and expected answers.
From error analysis — When the agent fails in production, add the failure case to the evaluation set.
Synthetic generation — Use an LLM to generate variations of existing examples. Useful for expanding coverage, but verify quality.

Categories to cover

Common queries (60%) — The bread-and-butter questions your agent handles daily
Edge cases (20%) — Ambiguous inputs, unusual phrasing, multi-part questions
Adversarial inputs (10%) — Prompt injection attempts, off-topic queries, harmful requests
Boundary cases (10%) — Questions at the edge of the agent's scope (should escalate vs answer)

Evaluation Metrics

For response quality

Metric	What It Measures	How to Calculate
Correctness	Is the answer factually accurate?	LLM-as-judge or human review
Relevance	Does the answer address the actual question?	LLM-as-judge scoring 1–5
Completeness	Does the answer cover all aspects of the question?	LLM-as-judge or checklist
Groundedness	Is the answer supported by retrieved context (not hallucinated)?	Compare claims against source documents
Harmlessness	Does the answer avoid harmful, biased, or inappropriate content?	Automated content classifiers + human review

For RAG quality

Metric	What It Measures	How to Calculate
Retrieval precision	What percentage of retrieved chunks are relevant?	Human annotation of retrieved chunks
Retrieval recall	What percentage of relevant chunks were retrieved?	Compare against known-relevant documents
Context utilization	Does the LLM actually use the retrieved context?	Compare response against context content
Citation accuracy	Are citations correct and pointing to actual sources?	Verify each citation against source

For agent behavior

Metric	What It Measures	How to Calculate
Tool selection accuracy	Does the agent call the right tool?	Compare against expected tools in eval set
Tool argument accuracy	Are the arguments passed to tools correct?	Validate against expected arguments
Step efficiency	Does the agent complete the task in a reasonable number of steps?	Count LLM calls per task
Escalation accuracy	Does the agent correctly escalate when it should?	Compare escalation decisions against labels
Boundary adherence	Does the agent stay within its defined scope?	Test with out-of-scope inputs

LLM-as-Judge — How to Do It Without Burning Your Eval Budget

The basic pattern

judge_prompt = """
You are evaluating a customer support AI agent's response.

Customer question: {question}
Agent response: {response}
Reference answer: {reference}

Rate the response on these dimensions (1-5 scale):

1. Correctness: Is the information factually accurate?
2. Relevance: Does it address the customer's actual question?
3. Completeness: Does it cover all necessary information?
4. Tone: Is the tone appropriate and professional?
5. Actionability: Can the customer act on this response?

For each dimension, provide the score and a brief justification.
Return as JSON.
"""

Best practices for LLM-as-judge

Use a stronger model as judge — If your agent uses GPT-4o-mini, use GPT-5 or Claude 4.5 Sonnet as the judge. We have a separate breakdown on which model to use as judge in Claude vs GPT vs Gemini 2026.
Provide reference answers — Judges are more accurate when they have a gold standard to compare against.
Use structured rubrics — Specific scoring criteria produce more consistent results than open-ended evaluation.
Validate with human agreement — Check that your LLM judge agrees with human evaluators on a sample (aim for Cohen's kappa of 0.7+, which roughly corresponds to 80%+ agreement on a 5-point rubric).
Use multiple judge prompts — Average scores across different prompt framings to reduce bias.

Three failure modes that show up in production

Regression Testing Workflow

Every time you change prompts, models, or knowledge base content:

1. Run full evaluation suite against current version → baseline scores
2. Make the change
3. Run full evaluation suite against new version → new scores
4. Compare: overall accuracy, per-category scores, worst-case examples
5. If new version is better overall AND no category regresses more than 5%:
 → Approve for deployment
6. If any category regresses significantly:
 → Investigate, fix, re-evaluate

Automate this in your CI/CD pipeline. Never ship a prompt change without running the evaluation suite.

Production Monitoring

Real-time metrics

Metric	Collection Method	Alert Threshold
Response latency	Application logging	> 5 seconds (P95)
Error rate	Application logging	> 2%
Tool call failure rate	Tool execution logging	> 5%
Escalation rate	Agent decision logging	> 30% (or sudden change)
User feedback score	In-app feedback	< 3.5/5 (7-day rolling average)
Cost per interaction	Token counting + pricing	> 2x baseline

Automated quality sampling

Score a random 1–5% of production interactions each day using LLM-as-judge. Track quality scores over time, and alert when the rolling average drops below your threshold or trends downward.
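
A rough sketch of the sampling and alerting loop, using only the standard library. The function names, the 7-day window, and the `max_drop` default are assumptions chosen for illustration; the score floor and sampling rate are left to the caller.

```python
import random
from statistics import mean
from typing import Optional


def sample_interactions(interaction_ids: list[str], rate: float, seed: Optional[int] = None) -> list[str]:
    """Pick a uniform random subset of the day's interactions for judge scoring."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    if not interaction_ids:
        return []
    k = max(1, round(len(interaction_ids) * rate))
    return random.Random(seed).sample(interaction_ids, k)


def quality_alerts(daily_scores: list[float], floor: float, window: int = 7, max_drop: float = 0.2) -> list[str]:
    """Flag a low rolling average or a drop versus the previous window.

    daily_scores is oldest-first, one mean judge score per day.
    """
    alerts: list[str] = []
    if len(daily_scores) < window:
        return alerts  # not enough history yet
    current = mean(daily_scores[-window:])
    if current < floor:
        alerts.append("below_floor")
    if len(daily_scores) >= 2 * window:
        previous = mean(daily_scores[-2 * window:-window])
        if previous - current > max_drop:
            alerts.append("trending_down")
    return alerts
```

Comparing the current window against the previous one catches a slow decline that never crosses the absolute floor.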

Human review cadence

Review Type	Cadence	Sample Size
Random sample review	Weekly	50–100 interactions
Escalated interaction review	Daily	All escalated interactions
Low-confidence response review	Daily	All responses below confidence threshold
Negative feedback review	Daily	All negative feedback interactions

Observability Tooling in 2026

What changed since 2024

Traces became first-class — full agent runs (every LLM call, tool call, retrieval, retry), not flat lists of completions.
Evals run on the trace, not just the response — you can score a trajectory ("did the agent take reasonable steps?"), not only the final string.
Replay arrived — re-run a stored production trace against a new prompt or model and diff the result. The closest thing to a unit test LLM agents have.
Dataset management is built in — failures sampled from production one-click into the golden set.
OpenTelemetry-native ingest — most platforms accept OTel spans directly, so you do not commit to a vendor SDK.

Categorized tool roundup

Phoenix by Arize — Open-source tracing + eval, OpenTelemetry-native. Best for OTel-first shops. Gotcha: the OSS-vs-commercial split is confusing on first read; pick which side you are on early.

TruLens — Code-first eval with composable feedback functions. Gotcha: project velocity has been uneven; verify activity before committing.

LangSmith — LangChain's own observability and eval platform. Best if you are on LangChain/LangGraph. Gotcha: framework lock-in — moving off LangChain means moving off LangSmith.

Helicone — LLM gateway with eval built in. Best for single-point capture across providers. Gotcha: a gateway is in your critical path — its outages are yours.

OpenAI Evals — OpenAI's OSS eval framework. Best as a Python-native starting point for OpenAI-ecosystem teams. Gotcha: framework, not a dashboard — ship the visualization yourself.

Which to pick — a short decision tree

OSS, self-hosted, full data ownership → Langfuse.
SaaS, prioritizing dev velocity → Braintrust.
OpenTelemetry-native, already on Arize or building OTel-first → Phoenix.
RAG-heavy agent where retrieval quality is the main question → Ragas (often layered on top of one of the above).
Already on LangChain/LangGraph → LangSmith — moving to anything else is unnecessary effort.
Already deep in the OpenAI ecosystem with no platform needs → OpenAI Evals as a starting point, graduate when needed.

Online vs Offline Evals — When to Run Each

Offline evals are non-negotiable. They are the regression guard. They catch model swaps and prompt edits before they reach users.

Production Tracing — OpenTelemetry for LLM Apps

A few rules that have saved us pain:

Tag every span with cost. Compute cost at ingest time, store it on the span. Querying cost-per-trace later without this is impossibly slow.
Sample aggressively at scale. At 1M+ traces/day, full-fidelity capture is not affordable. We sample 100% of errored or escalated runs, 5% of normal runs, and 1% of high-frequency repetitive runs.
Redact at the SDK layer, not the platform. Sensitive prompts and tool arguments must never leave your perimeter unredacted. Do the redaction client-side before the span is exported.
Keep raw responses for 7-30 days; keep scored summaries forever. This is a cost/value trade-off we have re-tuned twice.
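
The first three rules can be sketched without committing to any tracing SDK. Everything here is illustrative: the function names, the trace `kind` labels, and the email-only redaction pattern are assumptions, and token prices are passed in by the caller rather than hard-coded. Only the keep-rates come from the sampling rule above.

```python
import hashlib
import re

# Keep-rates taken from the sampling rule above.
KEEP_RATES = {"errored": 1.0, "escalated": 1.0, "normal": 0.05, "repetitive": 0.01}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def should_keep(trace_id: str, kind: str) -> bool:
    """Deterministic sampling: hash the trace ID into [0, 1) and compare to the keep-rate.

    Hashing (rather than calling random.random()) gives every span of a trace
    the same decision, so traces are kept or dropped whole.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < KEEP_RATES[kind]


def redact(text: str) -> str:
    """Mask email addresses before the span leaves the process (one pattern, for illustration)."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def span_cost_usd(input_tokens: int, output_tokens: int, usd_per_1k_in: float, usd_per_1k_out: float) -> float:
    """Cost to store on the span at ingest time; prices are supplied by the caller."""
    return input_tokens / 1000 * usd_per_1k_in + output_tokens / 1000 * usd_per_1k_out
```

A real redaction layer would cover more than emails (API keys, phone numbers, account IDs), but the placement is the point: it runs in your process, before export.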

For broader agent orchestration context and MCP-based agent infra, see our related guides — the tracing pattern is identical across orchestration frameworks.

Eval as CI Gate — The Engineering Pattern

The pattern, in pseudocode

# .ci/eval_gate.py — runs on every PR that touches /prompts, /models, or /retrieval

GOLDEN_SET = load_dataset("golden/v2026-05") # versioned, frozen per quarter
JUDGE = "claude-4-5-sonnet-2026-04-15" # pinned, not floating
CRITERIA = ["correctness", "relevance", "groundedness", "tone", "safety"]
EVAL_BUDGET_PER_PR = 10.00 # USD; placeholder value, set your own per-PR cap

baseline = run_eval(branch="main", dataset=GOLDEN_SET, judge=JUDGE, criteria=CRITERIA)
candidate = run_eval(branch="HEAD", dataset=GOLDEN_SET, judge=JUDGE, criteria=CRITERIA)

# Gate 1: aggregate score must not drop by more than 0.01 (1 point on a 0-1 scale)
assert candidate.overall >= baseline.overall - 0.01, "overall regression"

# Gate 2: no individual criterion may drop by more than 0.05 (5 points)
for c in CRITERIA:
    assert candidate[c] >= baseline[c] - 0.05, f"{c} regression"

# Gate 3: no individual example may regress from PASS to FAIL on safety
for example in GOLDEN_SET.where(criteria="safety"):
    assert not (baseline.passed(example) and not candidate.passed(example)), "safety PASS-to-FAIL regression"

# Gate 4: eval budget — fail if this PR's eval cost exceeds the per-PR budget
assert candidate.cost_usd < EVAL_BUDGET_PER_PR, "eval budget exceeded"

post_to_pr(candidate.summary) # diff table on the PR
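
The summary posted in the last step could be a per-criterion delta table. A minimal sketch, assuming scores are plain dictionaries on a 0-1 scale and that the PR comment accepts Markdown; `delta_table` is a hypothetical helper, not part of the pseudocode above.

```python
def delta_table(baseline: dict[str, float], candidate: dict[str, float]) -> str:
    """Render a Markdown table of per-criterion score changes (0-1 scale)."""
    lines = ["| criterion | baseline | candidate | delta |", "|---|---|---|---|"]
    for name in sorted(baseline):
        delta = candidate[name] - baseline[name]
        lines.append(f"| {name} | {baseline[name]:.2f} | {candidate[name]:.2f} | {delta:+.2f} |")
    return "\n".join(lines)
```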

Patterns that matter

Pin the judge model. Floating judge versions are the most common cause of "the test suite passes today, fails tomorrow." We pin to a dated model ID and bump it explicitly on a quarterly cadence.
Version the golden set. A golden set is a product artifact. We tag it (golden/v2026-05) and ship new versions every 60-90 days — partly to add real-world failures, partly because stale golden sets quietly stop catching the failures that matter.
Budget per PR. The eval budget is a hard cap. If a PR's eval run would exceed it (because someone added 5x more criteria), the CI fails before the eval runs. This single check has saved us more than $4k/month of eval drift over the last year.
Surface the diff, not just pass/fail. Post a per-criterion delta table to the PR. Engineers fix things faster when they can see "correctness +2%, tone -1%, groundedness flat" than when they see a binary "FAIL."

The whole point: model and prompt changes become as boring as schema migrations. They either pass the gate or they don't.
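
Failing before the eval runs requires a cost estimate rather than a measured cost. One way to sketch it is with the rough N × M × per-call cost model used elsewhere in this article. The function names and the `runs=2` default (baseline plus candidate) are assumptions for illustration.

```python
def estimated_eval_cost(n_samples: int, n_criteria: int, usd_per_judge_call: float, runs: int = 2) -> float:
    """N x M x per-call cost, times the number of runs (baseline + candidate = 2)."""
    return n_samples * n_criteria * usd_per_judge_call * runs


def check_budget(n_samples: int, n_criteria: int, usd_per_judge_call: float, budget_usd: float) -> None:
    """Raise before any judge call is made if the projected spend exceeds the cap."""
    cost = estimated_eval_cost(n_samples, n_criteria, usd_per_judge_call)
    if cost > budget_usd:
        raise RuntimeError(f"projected eval cost ${cost:.2f} exceeds budget ${budget_usd:.2f}")
```

Called at the top of the CI script, this stops a PR that multiplies the criteria count before it spends anything; the measured-cost check after the run then serves as a backstop.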

Evaluation Tools — Quick Reference

Tool	Type	Best For
Langfuse	OSS LLM observability	Self-hosted tracing, scoring, eval pipelines
Braintrust	SaaS eval platform	Fast eval iteration, replay, dataset diffing
Phoenix by Arize	OSS LLM tracing + eval	OpenTelemetry-native shops
Ragas	RAG eval library	Faithfulness, answer relevance, context precision
TruLens	Eval framework	Composable feedback functions
LangSmith	LangChain's platform	Teams on LangChain/LangGraph
Helicone	LLM gateway + eval	Single-point capture across providers
OpenAI Evals	OSS eval framework	OpenAI-ecosystem teams starting from scratch
DeepEval	OSS eval framework	Unit testing for LLM applications
Custom scripts	Python + your data	Full control, no vendor lock-in

Getting Started

Build your evaluation dataset — Start with 50 examples from real use cases. Grow to 200+ over the first month. Version it (golden/v2026-05) and plan a refresh every 60-90 days.
Pick observability tooling early — Choose one of Langfuse, Braintrust, Phoenix, or LangSmith in week 1 rather than waiting until you have 10k traces to wade through. The cost of switching later is lower than the cost of flying blind.
Set up LLM-as-judge — Pin the judge model, cross-family with the agent, calibrate against 50-100 human labels.
Integrate into CI/CD as a gate — No prompt or model change ships without passing the eval-score thresholds. Budget eval cost per PR.
Deploy production monitoring + tracing — OpenTelemetry-compatible spans, cost on every span, sample 1-5% for online evals.
Establish human review cadence — Weekly reviews of sampled interactions, with one-click promotion of failures into the golden set.
