25 Questions to Ask an AI Development Company Before You Hire Them
Author
ZTABS Team
Date Published
Hiring the wrong AI development company is one of the most expensive mistakes a business can make. You spend $50,000–$200,000, wait 3–6 months, and end up with a demo that works in a meeting room but breaks in production. The market is flooded with agencies that repackaged their web development services as "AI development" after ChatGPT launched. Separating real AI expertise from marketing is hard.
These 25 questions are designed to reveal whether a company can actually build production AI — or just sell it. Ask all of them before signing a contract.
Production Experience (Questions 1–5)
These questions test whether the company has shipped real AI systems, not just prototypes.
1. "Can you show me an AI agent or LLM-powered system you built that is running in production right now?"
Why it matters: Anyone can build a demo. Production means the system handles real users, real data, edge cases, scale, and has been running for months. If they can only show demos, prototypes, or conference presentations, that is a red flag.
Good answer: They show you a live system, explain the architecture, and share metrics (uptime, accuracy, usage volume).
Red flag: "We have several in development" or they show a polished video instead of a live system.
2. "How many AI agents or LLM applications have you deployed to production?"
Why it matters: One production deployment could be luck. Five or more suggests repeatable capability.
Good answer: Specific number with brief descriptions of different use cases and industries.
Red flag: Vague answers like "many" or "several" without specifics.
3. "What was the hardest production issue you encountered with an AI system, and how did you resolve it?"
Why it matters: This reveals real-world experience. Teams that have dealt with hallucination crises, model deprecation, cost explosions, and latency problems at scale have wisdom that prototype-only teams lack.
Good answer: A specific, detailed story about a real problem with a clear resolution.
Red flag: Generic answers about "prompt tuning" or inability to describe a specific incident.
4. "Can I talk to a client whose AI system you built?"
Why it matters: References that you initiate are more valuable than curated testimonials. A company confident in their work will connect you with past clients.
Good answer: "Absolutely. Here are two clients you can speak with."
Red flag: Reluctance, NDAs on every project, or offering only written testimonials.
5. "What percentage of your AI projects have made it from prototype to production?"
Why it matters: Many AI projects die in the pilot phase. A company with a high prototype-to-production rate has both the technical chops and the project management to ship.
Good answer: 70%+ with explanations for why the rest did not proceed (client pivoted, budget change — not "the tech didn't work").
Red flag: Below 50%, or they cannot answer the question.
Technical Depth (Questions 6–12)
These questions test whether the team understands AI engineering at a deep level.
6. "Which AI agent frameworks do you use, and why?"
Why it matters: The answer reveals whether they have opinions based on experience or just use whatever is trending.
Good answer: They explain trade-offs — "We use LangGraph for complex single-agent systems because of its control flow. CrewAI for multi-agent workflows where speed of development matters. The choice depends on the use case." See our framework comparison for context.
Red flag: They only know one framework, or they answer "we use ChatGPT."
7. "How do you evaluate AI agent accuracy in production?"
Why it matters: Without evaluation, you cannot know if the agent is working or slowly degrading. This is one of the most neglected aspects of AI development.
Good answer: They describe evaluation datasets, automated scoring, human review sampling, regression testing, and drift detection.
Red flag: "We test it manually before launch" with no ongoing evaluation plan.
8. "How do you handle hallucination in production systems?"
Why it matters: Every LLM hallucinates. The question is how the team mitigates it.
Good answer: Multi-layered approach — RAG for grounding, citation verification, confidence scoring, output guardrails, and human-in-the-loop for high-risk outputs.
Red flag: "We use a good prompt" or "GPT-4o doesn't hallucinate much."
9. "What is your approach to RAG architecture?"
Why it matters: Most AI agents need retrieval-augmented generation. The implementation quality directly affects accuracy.
Good answer: They discuss chunking strategies, embedding models, hybrid search (semantic + keyword), re-ranking, metadata filtering, and evaluation of retrieval quality. See our RAG architecture guide for what good looks like.
Red flag: They mention RAG but cannot explain their chunking strategy or evaluation approach.
10. "How do you handle model deprecation and migration?"
Why it matters: LLM providers deprecate models regularly. GPT-4 was superseded by GPT-4o, which will eventually be superseded. A good team plans for this.
Good answer: They have a migration process — regression testing against the evaluation suite, prompt adjustment, staged rollout with monitoring.
Red flag: They have not considered this, or they say "we'll just switch the model name."
11. "What is your approach to prompt engineering and management?"
Why it matters: Prompts are the most critical and most fragile part of an AI system.
Good answer: Version-controlled prompts, evaluation datasets, A/B testing framework, prompt performance monitoring. See our prompt engineering guide.
Red flag: Prompts are hardcoded in the codebase with no version control or evaluation.
12. "How do you implement guardrails for AI agents?"
Why it matters: Production agents need boundaries — input filtering, output validation, action whitelisting, and kill switches. See our AI governance guide.
Good answer: They describe a multi-layer guardrail system covering input, output, and action-level controls.
Red flag: They have not built systems that needed guardrails, or they rely solely on the LLM's built-in safety.
Pricing and Process (Questions 13–19)
These questions reveal whether you will get predictable delivery or scope creep.
13. "Can you break down the total cost — development, infrastructure, LLM APIs, and ongoing maintenance?"
Why it matters: Many companies quote only development costs. The real cost includes monthly infrastructure, LLM API costs, and maintenance. Hidden costs are the norm in AI projects — a good partner surfaces them upfront.
Good answer: Detailed breakdown across all cost categories. See our AI agent development cost guide for what a complete breakdown looks like.
Red flag: Only development cost quoted, or "we'll figure out infrastructure costs later."
14. "What is included in post-launch support, and what costs extra?"
Why it matters: AI systems need ongoing prompt optimization, knowledge base updates, bug fixes, model migration, and monitoring. If post-launch support is not in the contract, you are on your own.
Good answer: Clear description of included support (e.g., "3 months of prompt optimization and bug fixes included, then optional retainer at $X/month").
Red flag: No post-launch support, or vague "we'll be available."
15. "What is your development process from kickoff to production?"
Why it matters: A clear process with defined milestones reduces risk of scope creep and delays.
Good answer: Discovery → Architecture → MVP build → Testing/Evaluation → Staged production rollout → Optimization. With defined timelines and deliverables at each stage.
Red flag: No defined process, or a process that does not include evaluation and staged rollout.
16. "What happens when scope changes mid-project?"
Why it matters: Scope always changes. The question is how it is managed.
Good answer: Change request process with impact assessment (cost, timeline) before proceeding. Regular checkpoints to catch scope drift early.
Red flag: "We'll figure it out" or a rigid process that does not accommodate any changes.
17. "Who will actually do the work — the people in this meeting, or a different team?"
Why it matters: Some agencies sell with senior architects and staff with junior developers. Know who builds your product.
Good answer: They introduce the team and their relevant experience. Senior engineers are on the project, not just advising.
Red flag: "Our dedicated team will be assigned after contract signing."
18. "What is the minimum engagement size?"
Why it matters: If their minimum is $200,000 and your budget is $50,000, do not waste time. If they will take any budget regardless of scope, they may under-deliver.
Good answer: Clear minimum with explanation of what it covers.
19. "Do I own the code and IP?"
Why it matters: You should own 100% of the code, models, prompts, and data generated during the project. Some companies retain partial ownership or license it back to you.
Good answer: "Yes, full ownership transfers to you. It is in our standard contract."
Red flag: Anything less than full ownership, or "we can discuss IP terms."
Post-Launch and Scale (Questions 20–25)
20. "How do you monitor AI agent performance after launch?"
Good answer: Observability stack (LangSmith, Langfuse, or custom), automated accuracy scoring, cost monitoring, latency tracking, and regular review cadence.
21. "What is your on-call process when the AI agent breaks in production?"
Good answer: Defined SLAs, escalation paths, and incident response procedures.
22. "How do you handle scaling — what happens when my usage grows 10x?"
Good answer: Architecture designed for scale from the start — horizontal scaling, cost optimization (model routing, caching), and load testing.
23. "Can you help us eventually bring this in-house?"
Good answer: Knowledge transfer, documentation, and training are part of the engagement. They are not trying to create permanent dependency.
24. "What security certifications do you hold?"
Good answer: SOC 2, ISO 27001, or demonstrable security practices. Relevant if you are in a regulated industry.
25. "What would you recommend we NOT do?"
Why it matters: A trustworthy partner pushes back on bad ideas. If they agree with everything you say, they are selling, not advising.
Good answer: Honest feedback — "I would not recommend building a multi-agent system for this use case. A single agent with good tool calling will be simpler, cheaper, and more reliable."
Red flag: Agreement with every request without questioning scope or approach.
Your Evaluation Scorecard
Score each company on a 1–5 scale for:
| Category | Weight | Score (1–5) | Weighted | |----------|--------|------------|----------| | Production experience | 25% | ___ | ___ | | Technical depth | 25% | ___ | ___ | | Pricing transparency | 20% | ___ | ___ | | Process maturity | 15% | ___ | ___ | | Post-launch support | 15% | ___ | ___ | | Total | 100% | | ___ |
Compare companies using this scorecard for an objective, evidence-based decision.
Next Steps
- Best AI agent development companies in 2026 — Our curated list with detailed evaluations
- Best AI development companies for startups — If you are early-stage
- AI agent development cost guide — Understand what projects should cost
- AI readiness assessment — Make sure you are ready before hiring
Ready to talk to a team that can answer all 25 of these questions? Contact ZTABS for a free consultation and detailed estimate within 48 hours.
Need Help Building Your Project?
From web apps and mobile apps to AI solutions and SaaS platforms — we ship production software for 300+ clients.
Related Articles
AI Agent Orchestration: How to Coordinate Agents in Production
AI agent orchestration is how you coordinate multiple agents, tools, and workflows into reliable production systems. This guide covers orchestration patterns, frameworks, state management, error handling, and the protocols (MCP, A2A) that make it work.
10 min readAI Agent Testing and Evaluation: How to Measure Quality Before and After Launch
You cannot ship an AI agent to production without a testing strategy. This guide covers evaluation datasets, accuracy metrics, regression testing, production monitoring, and the tools and frameworks for testing AI agents systematically.
10 min readAI Agents for Accounting & Finance: Bookkeeping, AP/AR, and Reporting
AI agents automate accounting tasks — invoice processing, expense management, reconciliation, and financial reporting — reducing manual work by 60–80% while improving accuracy. This guide covers use cases, ROI, compliance, and implementation.