AI Agent Testing and Evaluation: How to Measure Quality Before and After Launch
Author: ZTABS Team
Testing AI agents is fundamentally different from testing traditional software. A function either returns the correct value or it does not. An AI agent returns responses that exist on a spectrum from perfect to acceptable to wrong to harmful — and the same input can produce different outputs on different runs. You cannot test AI agents with simple assertions. You need evaluation frameworks that measure quality at scale.
Most AI projects that fail in production fail because of inadequate evaluation, not inadequate models. The team builds a demo that works on 20 hand-picked examples, ships it, and discovers that real-world inputs are nothing like their test set. This guide covers how to build the evaluation infrastructure that prevents this.
The Evaluation Stack
Production AI testing operates at three levels.
Level 1: Pre-deployment evaluation
Testing before the agent goes live. This catches problems before they reach users.
- Unit evaluation — Does each component (retrieval, generation, tool calling) work correctly in isolation?
- Integration evaluation — Do the components work together correctly?
- End-to-end evaluation — Given real-world inputs, does the agent produce acceptable outputs?
- Adversarial evaluation — Does the agent handle edge cases, malicious inputs, and unexpected scenarios safely?
Level 2: Pre-release regression
Testing before each update (prompt change, model upgrade, knowledge base update).
- Regression suite — Run the full evaluation dataset to ensure the update does not degrade existing quality
- A/B comparison — Compare new version against current version on the same inputs
- Canary deployment — Roll out to a small percentage of traffic and monitor before full deployment
Level 3: Production monitoring
Continuous evaluation of the live system.
- Automated scoring — Score a sample of live interactions on quality dimensions
- Human review — Human evaluators review a random sample of agent interactions
- User feedback — Thumbs up/down, ratings, and explicit feedback from users
- Drift detection — Alert when agent quality degrades over time
Building an Evaluation Dataset
The evaluation dataset is the foundation of your testing strategy. Get this right and everything else follows.
What a good evaluation dataset contains
| Component | Description | Example |
|-----------|-------------|---------|
| Input | The user message or query | "How do I reset my password?" |
| Expected output | The ideal response (or acceptable response criteria) | "Navigate to Settings > Security > Reset Password..." |
| Context (if RAG) | The documents the agent should retrieve | Password reset documentation |
| Expected tools (if agent) | Which tools should be called | search_knowledge_base("password reset") |
| Metadata | Category, difficulty, source | category: "account", difficulty: "easy" |
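In practice, each row of this table becomes one record in a JSONL file. The sketch below shows one plausible record shape matching the fields above; the exact field names are an assumption, not a fixed schema.

```python
import json

# One hypothetical evaluation record; field names are illustrative.
record = {
    "input": "How do I reset my password?",
    "expected_output": "Navigate to Settings > Security > Reset Password...",
    "context": ["password-reset.md"],  # documents the agent should retrieve
    "expected_tools": [
        {"name": "search_knowledge_base", "args": {"query": "password reset"}}
    ],
    "metadata": {"category": "account", "difficulty": "easy"},
}

# JSONL storage: one record per line, trivially diffable and versionable.
line = json.dumps(record)
restored = json.loads(line)
```

Storing the dataset as JSONL keeps it easy to version-control alongside prompts, so evaluation data changes show up in code review.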
How many examples do you need?
| Purpose | Minimum | Recommended |
|---------|---------|-------------|
| Initial development | 50 | 100 |
| Pre-release regression | 100 | 200–500 |
| Comprehensive evaluation | 200 | 500–1,000 |
How to collect evaluation examples
- From real user data — The best source. Sample from actual user interactions. Annotate with correct answers.
- From domain experts — Have subject matter experts write realistic queries and expected answers.
- From error analysis — When the agent fails in production, add the failure case to the evaluation set.
- Synthetic generation — Use an LLM to generate variations of existing examples. Useful for expanding coverage, but verify quality.
Categories to cover
- Common queries (60%) — The bread-and-butter questions your agent handles daily
- Edge cases (20%) — Ambiguous inputs, unusual phrasing, multi-part questions
- Adversarial inputs (10%) — Prompt injection attempts, off-topic queries, harmful requests
- Boundary cases (10%) — Questions at the edge of the agent's scope (should escalate vs answer)
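A quick way to keep the dataset honest against this target mix is a coverage check that runs whenever examples are added. This is a minimal sketch assuming records carry the `metadata.category` field shown earlier; the category names and tolerance are assumptions.

```python
from collections import Counter

# Target distribution from the guidelines above (hypothetical category keys).
TARGET_MIX = {"common": 0.60, "edge": 0.20, "adversarial": 0.10, "boundary": 0.10}

def coverage_gaps(examples, tolerance=0.05):
    """Return {category: actual_share} for categories whose share of the
    dataset deviates from TARGET_MIX by more than `tolerance`."""
    counts = Counter(ex["metadata"]["category"] for ex in examples)
    total = sum(counts.values())
    gaps = {}
    for cat, target in TARGET_MIX.items():
        actual = counts.get(cat, 0) / total if total else 0.0
        if abs(actual - target) > tolerance:
            gaps[cat] = round(actual, 2)
    return gaps
```

Running this in CI alongside the evaluation suite prevents the dataset from silently drifting toward easy, common queries.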
Evaluation Metrics
For response quality
| Metric | What It Measures | How to Calculate |
|--------|-----------------|------------------|
| Correctness | Is the answer factually accurate? | LLM-as-judge or human review |
| Relevance | Does the answer address the actual question? | LLM-as-judge scoring 1–5 |
| Completeness | Does the answer cover all aspects of the question? | LLM-as-judge or checklist |
| Groundedness | Is the answer supported by retrieved context (not hallucinated)? | Compare claims against source documents |
| Harmlessness | Does the answer avoid harmful, biased, or inappropriate content? | Automated content classifiers + human review |
For RAG quality
| Metric | What It Measures | How to Calculate |
|--------|-----------------|------------------|
| Retrieval precision | What percentage of retrieved chunks are relevant? | Human annotation of retrieved chunks |
| Retrieval recall | What percentage of relevant chunks were retrieved? | Compare against known-relevant documents |
| Context utilization | Does the LLM actually use the retrieved context? | Compare response against context content |
| Citation accuracy | Are citations correct and pointing to actual sources? | Verify each citation against source |
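Retrieval precision and recall reduce to simple set arithmetic once you have per-query relevance labels. A minimal sketch, assuming chunks are identified by IDs:

```python
def retrieval_precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved chunks that are relevant.
    Recall: fraction of relevant chunks that were retrieved.
    Both inputs are iterables of chunk IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

Tracking both matters: a retriever that returns everything scores perfect recall with terrible precision, and a retriever that returns one safe chunk does the opposite.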
For agent behavior
| Metric | What It Measures | How to Calculate |
|--------|-----------------|------------------|
| Tool selection accuracy | Does the agent call the right tool? | Compare against expected tools in eval set |
| Tool argument accuracy | Are the arguments passed to tools correct? | Validate against expected arguments |
| Step efficiency | Does the agent complete the task in a reasonable number of steps? | Count LLM calls per task |
| Escalation accuracy | Does the agent correctly escalate when it should? | Compare escalation decisions against labels |
| Boundary adherence | Does the agent stay within its defined scope? | Test with out-of-scope inputs |
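Tool selection and argument accuracy can be scored together by comparing the expected and actual tool calls for each evaluation example. This is a hypothetical scorer, assuming records with `expected_tools` and `actual_tools` lists of `{"name", "args"}` dicts; it uses exact, order-insensitive matching, which you may want to relax for free-text arguments.

```python
def tool_call_accuracy(examples):
    """Fraction of examples whose actual tool calls exactly match the
    expected set of (tool name, arguments) pairs, ignoring call order."""
    def canonical(calls):
        # Hashable, order-independent representation of a call set.
        return {(c["name"], tuple(sorted(c.get("args", {}).items())))
                for c in calls}
    correct = sum(
        1 for ex in examples
        if canonical(ex["expected_tools"]) == canonical(ex["actual_tools"])
    )
    return correct / len(examples) if examples else 0.0
```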
LLM-as-Judge
The most scalable evaluation method is using another LLM to judge the quality of agent responses.
```python
judge_prompt = """
You are evaluating a customer support AI agent's response.
Customer question: {question}
Agent response: {response}
Reference answer: {reference}
Rate the response on these dimensions (1-5 scale):
1. Correctness: Is the information factually accurate?
2. Relevance: Does it address the customer's actual question?
3. Completeness: Does it cover all necessary information?
4. Tone: Is the tone appropriate and professional?
5. Actionability: Can the customer act on this response?
For each dimension, provide the score and a brief justification.
Return as JSON.
"""
```
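Because the judge returns JSON, the scoring pipeline needs a strict parser that rejects malformed or out-of-range output rather than silently recording bad scores. A minimal sketch, assuming the judge returns an object keyed by dimension with a `score` field (that exact shape is an assumption; LLM judges need the format pinned down in the prompt):

```python
import json

DIMENSIONS = ("correctness", "relevance", "completeness", "tone", "actionability")

def parse_judge_scores(raw, dimensions=DIMENSIONS):
    """Parse the judge's JSON reply into {dimension: score}.
    Raises KeyError on missing dimensions, ValueError on bad scores,
    so failures surface instead of polluting the metrics."""
    data = json.loads(raw)
    scores = {}
    for dim in dimensions:
        entry = data[dim]
        score = entry["score"] if isinstance(entry, dict) else entry
        if not isinstance(score, (int, float)) or not 1 <= score <= 5:
            raise ValueError(f"{dim} score out of range: {score!r}")
        scores[dim] = score
    return scores
```

Using a JSON mode or structured-output feature of your LLM provider, where available, makes this parsing far more reliable than free-text responses.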
Best practices for LLM-as-judge
- Use a stronger model as judge — If your agent uses GPT-4o-mini, use GPT-4o as the judge
- Provide reference answers — Judges are more accurate when they have a gold standard to compare against
- Use structured rubrics — Specific scoring criteria produce more consistent results than open-ended evaluation
- Validate with human agreement — Check that your LLM judge agrees with human evaluators on a sample (aim for 80%+ agreement)
- Use multiple judge prompts — Average scores across different prompt framings to reduce bias
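Validating the judge against humans comes down to an agreement rate over a jointly scored sample. A minimal sketch; the within-one-point tolerance on the 1–5 scale is an assumption (exact-match agreement is stricter and will score lower):

```python
def agreement_rate(judge_scores, human_scores, threshold=1):
    """Fraction of examples where the judge's score is within
    `threshold` points of the human score on the same 1-5 scale."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be the same length")
    close = sum(
        1 for j, h in zip(judge_scores, human_scores) if abs(j - h) <= threshold
    )
    return close / len(judge_scores)
```

If agreement stays below your target (the 80% figure above), fix the rubric or the judge prompt before trusting automated scores.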
Regression Testing Workflow
Every time you change prompts, models, or knowledge base content:
1. Run full evaluation suite against current version → baseline scores
2. Make the change
3. Run full evaluation suite against new version → new scores
4. Compare: overall accuracy, per-category scores, worst-case examples
5. If new version is better overall AND no category regresses more than 5%:
→ Approve for deployment
6. If any category regresses significantly:
→ Investigate, fix, re-evaluate
Automate this in your CI/CD pipeline. Never ship a prompt change without running the evaluation suite.
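The approval rule in steps 4–6 can be expressed as a small gate function suitable for a CI job. A minimal sketch, assuming scores are per-category accuracies in [0, 1] and using the 5% regression threshold from above:

```python
def regression_gate(baseline, candidate, max_drop=0.05):
    """Compare per-category accuracy dicts. Returns (approved, regressions):
    approved is True only if the candidate's mean accuracy is at least the
    baseline's AND no category drops by more than `max_drop`."""
    regressions = {
        cat: round(baseline[cat] - candidate.get(cat, 0.0), 3)
        for cat in baseline
        if baseline[cat] - candidate.get(cat, 0.0) > max_drop
    }
    overall_better = (
        sum(candidate.values()) / len(candidate)
        >= sum(baseline.values()) / len(baseline)
    )
    return overall_better and not regressions, regressions
```

Wiring this into CI (exit nonzero when not approved) makes "never ship without the eval suite" enforceable rather than aspirational.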
Production Monitoring
Real-time metrics
| Metric | Collection Method | Alert Threshold |
|--------|------------------|-----------------|
| Response latency | Application logging | > 5 seconds (P95) |
| Error rate | Application logging | > 2% |
| Tool call failure rate | Tool execution logging | > 5% |
| Escalation rate | Agent decision logging | > 30% (or sudden change) |
| User feedback score | In-app feedback | < 3.5/5 (7-day rolling average) |
| Cost per interaction | Token counting + pricing | > 2x baseline |
Automated quality sampling
Score a random 5–10% of production interactions using LLM-as-judge daily. Track quality scores over time. Alert when scores drop below threshold or trend downward.
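Drawing the daily sample is the easy part; the important property is that it is random (so scores are unbiased) and reproducible (so a scoring run can be re-audited). A minimal sketch with a seedable RNG; the function name is illustrative:

```python
import random

def sample_for_scoring(interaction_ids, rate=0.05, seed=None):
    """Pick a random `rate` fraction of the day's interaction IDs for
    LLM-as-judge scoring. Pass a `seed` to make the draw reproducible."""
    rng = random.Random(seed)
    k = max(1, int(len(interaction_ids) * rate))
    return rng.sample(interaction_ids, k)
```

In a real pipeline you would typically stratify this sample (e.g. oversample escalations and negative feedback) rather than draw uniformly.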
Human review cadence
| Review Type | Cadence | Sample Size |
|-------------|---------|-------------|
| Random sample review | Weekly | 50–100 interactions |
| Escalated interaction review | Daily | All escalated interactions |
| Low-confidence response review | Daily | All responses below confidence threshold |
| Negative feedback review | Daily | All negative feedback interactions |
Evaluation Tools
| Tool | Type | Best For |
|------|------|----------|
| Langfuse | Open-source LLM observability | Tracing, scoring, evaluation pipelines |
| LangSmith | LangChain's evaluation platform | Teams using LangChain/LangGraph |
| Braintrust | AI evaluation platform | Systematic eval datasets and scoring |
| Ragas | RAG evaluation framework | RAG-specific metrics (retrieval quality, groundedness) |
| DeepEval | Open-source eval framework | Unit testing for LLM applications |
| Custom scripts | Python + your eval dataset | Full control, no vendor lock-in |
Getting Started
- Build your evaluation dataset — Start with 50 examples from real use cases. Grow to 200+ over the first month.
- Set up LLM-as-judge — Automate quality scoring so you can evaluate at scale.
- Integrate into CI/CD — No prompt or model change ships without passing the evaluation suite.
- Deploy production monitoring — Track quality metrics from day one. Set alerts.
- Establish human review cadence — Weekly reviews of sampled interactions.
Frequently Asked Questions
How do you measure AI agent quality?
AI agent quality is measured across multiple dimensions rather than a single pass/fail metric. The most important dimensions are correctness (factual accuracy), relevance (does the response address the user's actual question), completeness (does it cover all aspects), and groundedness (is the response supported by retrieved context rather than hallucinated). Production teams typically combine automated LLM-as-judge scoring on these dimensions with periodic human review of sampled interactions to get a comprehensive picture. Tracking these metrics over time with tools like Langfuse or LangSmith reveals quality trends that a one-time test cannot capture.
What metrics matter most for agent evaluation?
The metrics that matter most depend on your use case, but three are nearly universal. Task completion rate measures how often the agent successfully resolves the user's request without human intervention. Correctness measures factual accuracy against a reference answer set. Escalation accuracy measures whether the agent correctly identifies cases it cannot handle and routes them to a human. For RAG-based systems, retrieval precision and groundedness are also critical — an agent that retrieves the wrong documents will generate plausible-sounding but incorrect answers.
How often should you test AI agents?
You should run your full evaluation suite before every deployment — any prompt change, model upgrade, or knowledge base update can cause unexpected regressions. Beyond pre-deployment testing, production agents need continuous monitoring: automated quality scoring on 5–10% of live interactions daily, weekly human review of 50–100 sampled conversations, and daily review of all escalated or negatively-rated interactions. The cadence matters because LLM behavior can drift as providers update models, and your users' questions evolve over time in ways your original test set may not cover.
What tools are best for testing AI agents?
The best tooling depends on your stack and scale. For teams using LangChain or LangGraph, LangSmith provides integrated tracing and evaluation. Langfuse is a strong open-source alternative that works with any framework and offers evaluation pipelines, scoring, and observability. Ragas is purpose-built for RAG evaluation metrics like retrieval quality and groundedness. For teams that want full control, custom Python scripts with your own evaluation dataset and LLM-as-judge scoring remain a practical approach — especially early on when your evaluation criteria are still evolving. Many production teams at scale combine two or more of these tools to cover different evaluation needs.
For help building evaluation infrastructure for your AI agents, explore our AI agent development services or contact us. We build evaluation suites as part of every agent deployment.