25 Questions to Ask an AI Development Company Before Hiring (2026)

Hiring the wrong AI development company is one of the most expensive mistakes a business can make. You spend $50,000–$200,000, wait 3–6 months, and end up with a demo that works in a meeting room but breaks in production. The market is flooded with agencies that repackaged their web development services as "AI development" after ChatGPT launched. Separating real AI expertise from marketing is hard.

These 25 questions are designed to reveal whether a company can actually build production AI — or just sell it. Ask all of them before signing a contract.

Production Experience (Questions 1–5)

These questions test whether the company has shipped real AI systems, not just prototypes.

1. "Can you show me an AI agent or LLM-powered system you built that is running in production right now?"

Why it matters: Anyone can build a demo. Production means the system handles real users, real data, edge cases, scale, and has been running for months. If they can only show demos, prototypes, or conference presentations, that is a red flag.

Good answer: They show you a live system, explain the architecture, and share metrics (uptime, accuracy, usage volume).

Red flag: "We have several in development" or they show a polished video instead of a live system.

2. "How many AI agents or LLM applications have you deployed to production?"

Why it matters: One production deployment could be luck. Five or more suggests repeatable capability.

Good answer: Specific number with brief descriptions of different use cases and industries.

Red flag: Vague answers like "many" or "several" without specifics.

3. "What was the hardest production issue you encountered with an AI system, and how did you resolve it?"

Why it matters: This reveals real-world experience. Teams that have dealt with hallucination crises, model deprecation, cost explosions, and latency problems at scale have wisdom that prototype-only teams lack.

Good answer: A specific, detailed story about a real problem with a clear resolution.

Red flag: Generic answers about "prompt tuning" or inability to describe a specific incident.

4. "Can I talk to a client whose AI system you built?"

Why it matters: References that you initiate are more valuable than curated testimonials. A company confident in their work will connect you with past clients.

Good answer: "Absolutely. Here are two clients you can speak with."

Red flag: Reluctance, NDAs on every project, or offering only written testimonials.

5. "What percentage of your AI projects have made it from prototype to production?"

Why it matters: Many AI projects die in the pilot phase. A company with a high prototype-to-production rate has both the technical chops and the project management to ship.

Good answer: 70%+ with explanations for why the rest did not proceed (client pivoted, budget change — not "the tech didn't work").

Red flag: Below 50%, or they cannot answer the question.

Technical Depth (Questions 6–12)

These questions test whether the team understands AI engineering at a deep level.

6. "Which AI agent frameworks do you use, and why?"

Why it matters: The answer reveals whether they have opinions based on experience or just use whatever is trending.

Good answer: They explain trade-offs — "We use LangGraph for complex single-agent systems because of its control flow. CrewAI for multi-agent workflows where speed of development matters. The choice depends on the use case." See our framework comparison for context.

Red flag: They only know one framework, or they answer "we use ChatGPT."

7. "How do you evaluate AI agent accuracy in production?"

Why it matters: Without evaluation, you cannot know if the agent is working or slowly degrading. This is one of the most neglected aspects of AI development.

Good answer: They describe evaluation datasets, automated scoring, human review sampling, regression testing, and drift detection.

Red flag: "We test it manually before launch" with no ongoing evaluation plan.

8. "How do you handle hallucination in production systems?"

Why it matters: Every LLM hallucinates. The question is how the team mitigates it.

Good answer: Multi-layered approach — RAG for grounding, citation verification, confidence scoring, output guardrails, and human-in-the-loop for high-risk outputs.

Red flag: "We use a good prompt" or "GPT-4o doesn't hallucinate much."

9. "What is your approach to RAG architecture?"

Why it matters: Most AI agents need retrieval-augmented generation. The implementation quality directly affects accuracy.

Good answer: They discuss chunking strategies, embedding models, hybrid search (semantic + keyword), re-ranking, metadata filtering, and evaluation of retrieval quality. See our RAG architecture guide for what good looks like.

Red flag: They mention RAG but cannot explain their chunking strategy or evaluation approach.

10. "How do you handle model deprecation and migration?"

Why it matters: LLM providers deprecate models regularly. GPT-4 was superseded by GPT-4o, which will eventually be superseded. A good team plans for this.

Good answer: They have a migration process — regression testing against the evaluation suite, prompt adjustment, staged rollout with monitoring.

Red flag: They have not considered this, or they say "we'll just switch the model name."

11. "What is your approach to prompt engineering and management?"

Why it matters: Prompts are the most critical and most fragile part of an AI system.

Good answer: Version-controlled prompts, evaluation datasets, A/B testing framework, prompt performance monitoring. See our prompt engineering guide.

Red flag: Prompts are hardcoded in the codebase with no version control or evaluation.

12. "How do you implement guardrails for AI agents?"

Why it matters: Production agents need boundaries — input filtering, output validation, action whitelisting, and kill switches. See our AI governance guide.

Good answer: They describe a multi-layer guardrail system covering input, output, and action-level controls.

Red flag: They have not built systems that needed guardrails, or they rely solely on the LLM's built-in safety.

Pricing and Process (Questions 13–19)

These questions reveal whether you will get predictable delivery or scope creep.

13. "Can you break down the total cost — development, infrastructure, LLM APIs, and ongoing maintenance?"

Why it matters: Many companies quote only development costs. The real cost includes monthly infrastructure, LLM API costs, and maintenance. Hidden costs are the norm in AI projects — a good partner surfaces them upfront.

Good answer: Detailed breakdown across all cost categories. See our AI agent development cost guide for what a complete breakdown looks like.

Red flag: Only development cost quoted, or "we'll figure out infrastructure costs later."

14. "What is included in post-launch support, and what costs extra?"

Why it matters: AI systems need ongoing prompt optimization, knowledge base updates, bug fixes, model migration, and monitoring. If post-launch support is not in the contract, you are on your own.

Good answer: Clear description of included support (e.g., "3 months of prompt optimization and bug fixes included, then optional retainer at $X/month").

Red flag: No post-launch support, or vague "we'll be available."

15. "What is your development process from kickoff to production?"

Why it matters: A clear process with defined milestones reduces risk of scope creep and delays.

Good answer: Discovery → Architecture → MVP build → Testing/Evaluation → Staged production rollout → Optimization. With defined timelines and deliverables at each stage.

Red flag: No defined process, or a process that does not include evaluation and staged rollout.

16. "What happens when scope changes mid-project?"

Why it matters: Scope always changes. The question is how it is managed.

Good answer: Change request process with impact assessment (cost, timeline) before proceeding. Regular checkpoints to catch scope drift early.

Red flag: "We'll figure it out" or a rigid process that does not accommodate any changes.

17. "Who will actually do the work — the people in this meeting, or a different team?"

Why it matters: Some agencies sell with senior architects and staff with junior developers. Know who builds your product.

Good answer: They introduce the team and their relevant experience. Senior engineers are on the project, not just advising.

Red flag: "Our dedicated team will be assigned after contract signing."

18. "What is the minimum engagement size?"

Why it matters: If their minimum is $200,000 and your budget is $50,000, do not waste time. If they will take any budget regardless of scope, they may under-deliver.

Good answer: Clear minimum with explanation of what it covers.

19. "Do I own the code and IP?"

Why it matters: You should own 100% of the code, models, prompts, and data generated during the project. Some companies retain partial ownership or license it back to you.

Good answer: "Yes, full ownership transfers to you. It is in our standard contract."

Red flag: Anything less than full ownership, or "we can discuss IP terms."

Post-Launch and Scale (Questions 20–25)

20. "How do you monitor AI agent performance after launch?"

Good answer: Observability stack (LangSmith, Langfuse, or custom), automated accuracy scoring, cost monitoring, latency tracking, and regular review cadence.

21. "What is your on-call process when the AI agent breaks in production?"

Good answer: Defined SLAs, escalation paths, and incident response procedures.

22. "How do you handle scaling — what happens when my usage grows 10x?"

Good answer: Architecture designed for scale from the start — horizontal scaling, cost optimization (model routing, caching), and load testing.

23. "Can you help us eventually bring this in-house?"

Good answer: Knowledge transfer, documentation, and training are part of the engagement. They are not trying to create permanent dependency.

24. "What security certifications do you hold?"

Good answer: SOC 2, ISO 27001, or demonstrable security practices. Relevant if you are in a regulated industry.

25. "What would you recommend we NOT do?"

Why it matters: A trustworthy partner pushes back on bad ideas. If they agree with everything you say, they are selling, not advising.

Good answer: Honest feedback — "I would not recommend building a multi-agent system for this use case. A single agent with good tool calling will be simpler, cheaper, and more reliable."

Red flag: Agreement with every request without questioning scope or approach.

Your Evaluation Scorecard

Score each company on a 1–5 scale for:

Category	Weight	Score (1–5)	Weighted
Production experience	25%	___	___
Technical depth	25%	___	___
Pricing transparency	20%	___	___
Process maturity	15%	___	___
Post-launch support	15%	___	___
Total	100%		___

Compare companies using this scorecard for an objective, evidence-based decision.

Next Steps

Best AI agent development companies in 2026 — Our curated list with detailed evaluations
Best AI development companies for startups — If you are early-stage
AI agent development cost guide — Understand what projects should cost
AI readiness assessment — Make sure you are ready before hiring

Ready to talk to a team that can answer all 25 of these questions? Contact ZTABS for a free consultation and detailed estimate within 48 hours.

These 25 questions are designed to reveal whether a company can actually build production AI — or just sell it. Ask all of them before signing a contract.

Production Experience (Questions 1–5)

These questions test whether the company has shipped real AI systems, not just prototypes.

1. "Can you show me an AI agent or LLM-powered system you built that is running in production right now?"

Good answer: They show you a live system, explain the architecture, and share metrics (uptime, accuracy, usage volume).

Red flag: "We have several in development" or they show a polished video instead of a live system.

2. "How many AI agents or LLM applications have you deployed to production?"

Why it matters: One production deployment could be luck. Five or more suggests repeatable capability.

Good answer: Specific number with brief descriptions of different use cases and industries.

Red flag: Vague answers like "many" or "several" without specifics.

3. "What was the hardest production issue you encountered with an AI system, and how did you resolve it?"

Good answer: A specific, detailed story about a real problem with a clear resolution.

Red flag: Generic answers about "prompt tuning" or inability to describe a specific incident.

4. "Can I talk to a client whose AI system you built?"

Why it matters: References that you initiate are more valuable than curated testimonials. A company confident in their work will connect you with past clients.

Good answer: "Absolutely. Here are two clients you can speak with."

Red flag: Reluctance, NDAs on every project, or offering only written testimonials.

5. "What percentage of your AI projects have made it from prototype to production?"

Why it matters: Many AI projects die in the pilot phase. A company with a high prototype-to-production rate has both the technical chops and the project management to ship.

Good answer: 70%+ with explanations for why the rest did not proceed (client pivoted, budget change — not "the tech didn't work").

Red flag: Below 50%, or they cannot answer the question.

Technical Depth (Questions 6–12)

These questions test whether the team understands AI engineering at a deep level.

6. "Which AI agent frameworks do you use, and why?"

Why it matters: The answer reveals whether they have opinions based on experience or just use whatever is trending.

Red flag: They only know one framework, or they answer "we use ChatGPT."

7. "How do you evaluate AI agent accuracy in production?"

Why it matters: Without evaluation, you cannot know if the agent is working or slowly degrading. This is one of the most neglected aspects of AI development.

Good answer: They describe evaluation datasets, automated scoring, human review sampling, regression testing, and drift detection.

Red flag: "We test it manually before launch" with no ongoing evaluation plan.

8. "How do you handle hallucination in production systems?"

Why it matters: Every LLM hallucinates. The question is how the team mitigates it.

Good answer: Multi-layered approach — RAG for grounding, citation verification, confidence scoring, output guardrails, and human-in-the-loop for high-risk outputs.

Red flag: "We use a good prompt" or "GPT-4o doesn't hallucinate much."

9. "What is your approach to RAG architecture?"

Why it matters: Most AI agents need retrieval-augmented generation. The implementation quality directly affects accuracy.

Red flag: They mention RAG but cannot explain their chunking strategy or evaluation approach.

10. "How do you handle model deprecation and migration?"

Why it matters: LLM providers deprecate models regularly. GPT-4 was superseded by GPT-4o, which will eventually be superseded. A good team plans for this.

Good answer: They have a migration process — regression testing against the evaluation suite, prompt adjustment, staged rollout with monitoring.

Red flag: They have not considered this, or they say "we'll just switch the model name."

11. "What is your approach to prompt engineering and management?"

Why it matters: Prompts are the most critical and most fragile part of an AI system.

Good answer: Version-controlled prompts, evaluation datasets, A/B testing framework, prompt performance monitoring. See our prompt engineering guide.

Red flag: Prompts are hardcoded in the codebase with no version control or evaluation.

12. "How do you implement guardrails for AI agents?"

Why it matters: Production agents need boundaries — input filtering, output validation, action whitelisting, and kill switches. See our AI governance guide.

Good answer: They describe a multi-layer guardrail system covering input, output, and action-level controls.

Red flag: They have not built systems that needed guardrails, or they rely solely on the LLM's built-in safety.

Pricing and Process (Questions 13–19)

These questions reveal whether you will get predictable delivery or scope creep.

13. "Can you break down the total cost — development, infrastructure, LLM APIs, and ongoing maintenance?"

Good answer: Detailed breakdown across all cost categories. See our AI agent development cost guide for what a complete breakdown looks like.

Red flag: Only development cost quoted, or "we'll figure out infrastructure costs later."

14. "What is included in post-launch support, and what costs extra?"

Good answer: Clear description of included support (e.g., "3 months of prompt optimization and bug fixes included, then optional retainer at $X/month").

Red flag: No post-launch support, or vague "we'll be available."

15. "What is your development process from kickoff to production?"

Why it matters: A clear process with defined milestones reduces risk of scope creep and delays.

Good answer: Discovery → Architecture → MVP build → Testing/Evaluation → Staged production rollout → Optimization. With defined timelines and deliverables at each stage.

Red flag: No defined process, or a process that does not include evaluation and staged rollout.

16. "What happens when scope changes mid-project?"

Why it matters: Scope always changes. The question is how it is managed.

Good answer: Change request process with impact assessment (cost, timeline) before proceeding. Regular checkpoints to catch scope drift early.

Red flag: "We'll figure it out" or a rigid process that does not accommodate any changes.

17. "Who will actually do the work — the people in this meeting, or a different team?"

Why it matters: Some agencies sell with senior architects and staff with junior developers. Know who builds your product.

Good answer: They introduce the team and their relevant experience. Senior engineers are on the project, not just advising.

Red flag: "Our dedicated team will be assigned after contract signing."

18. "What is the minimum engagement size?"

Why it matters: If their minimum is $200,000 and your budget is $50,000, do not waste time. If they will take any budget regardless of scope, they may under-deliver.

Good answer: Clear minimum with explanation of what it covers.

19. "Do I own the code and IP?"

Why it matters: You should own 100% of the code, models, prompts, and data generated during the project. Some companies retain partial ownership or license it back to you.

Good answer: "Yes, full ownership transfers to you. It is in our standard contract."

Red flag: Anything less than full ownership, or "we can discuss IP terms."

Post-Launch and Scale (Questions 20–25)

20. "How do you monitor AI agent performance after launch?"

Good answer: Observability stack (LangSmith, Langfuse, or custom), automated accuracy scoring, cost monitoring, latency tracking, and regular review cadence.

21. "What is your on-call process when the AI agent breaks in production?"

Good answer: Defined SLAs, escalation paths, and incident response procedures.

22. "How do you handle scaling — what happens when my usage grows 10x?"

Good answer: Architecture designed for scale from the start — horizontal scaling, cost optimization (model routing, caching), and load testing.

23. "Can you help us eventually bring this in-house?"

Good answer: Knowledge transfer, documentation, and training are part of the engagement. They are not trying to create permanent dependency.

24. "What security certifications do you hold?"

Good answer: SOC 2, ISO 27001, or demonstrable security practices. Relevant if you are in a regulated industry.

25. "What would you recommend we NOT do?"

Why it matters: A trustworthy partner pushes back on bad ideas. If they agree with everything you say, they are selling, not advising.

Good answer: Honest feedback — "I would not recommend building a multi-agent system for this use case. A single agent with good tool calling will be simpler, cheaper, and more reliable."

Red flag: Agreement with every request without questioning scope or approach.

Your Evaluation Scorecard

Score each company on a 1–5 scale for:

Category	Weight	Score (1–5)	Weighted
Production experience	25%	___	___
Technical depth	25%	___	___
Pricing transparency	20%	___	___
Process maturity	15%	___	___
Post-launch support	15%	___	___
Total	100%		___

Compare companies using this scorecard for an objective, evidence-based decision.

Next Steps

Best AI agent development companies in 2026 — Our curated list with detailed evaluations
Best AI development companies for startups — If you are early-stage
AI agent development cost guide — Understand what projects should cost
AI readiness assessment — Make sure you are ready before hiring

Ready to talk to a team that can answer all 25 of these questions? Contact ZTABS for a free consultation and detailed estimate within 48 hours.

Production Experience (Questions 1–5)

1. "Can you show me an AI agent or LLM-powered system you built that is running in production right now?"

2. "How many AI agents or LLM applications have you deployed to production?"

3. "What was the hardest production issue you encountered with an AI system, and how did you resolve it?"

4. "Can I talk to a client whose AI system you built?"

5. "What percentage of your AI projects have made it from prototype to production?"

Technical Depth (Questions 6–12)

6. "Which AI agent frameworks do you use, and why?"

7. "How do you evaluate AI agent accuracy in production?"

8. "How do you handle hallucination in production systems?"

9. "What is your approach to RAG architecture?"

10. "How do you handle model deprecation and migration?"

11. "What is your approach to prompt engineering and management?"

12. "How do you implement guardrails for AI agents?"

Pricing and Process (Questions 13–19)

13. "Can you break down the total cost — development, infrastructure, LLM APIs, and ongoing maintenance?"

14. "What is included in post-launch support, and what costs extra?"

15. "What is your development process from kickoff to production?"

16. "What happens when scope changes mid-project?"

17. "Who will actually do the work — the people in this meeting, or a different team?"

18. "What is the minimum engagement size?"

19. "Do I own the code and IP?"

Post-Launch and Scale (Questions 20–25)

20. "How do you monitor AI agent performance after launch?"

21. "What is your on-call process when the AI agent breaks in production?"

22. "How do you handle scaling — what happens when my usage grows 10x?"

23. "Can you help us eventually bring this in-house?"

24. "What security certifications do you hold?"

25. "What would you recommend we NOT do?"

Your Evaluation Scorecard

Next Steps

Explore Related Solutions

Need Help Building Your Project?

Related Articles

AI Browser Automation in 2026: ChatGPT Agent, Computer Use, and What Actually Ships

AI Cost Optimization at Scale: How We Cut LLM Bills 60% Without Quality Loss

Blockchain Development in 2026: What's Actually Worth Building

Production Experience (Questions 1–5)

1. "Can you show me an AI agent or LLM-powered system you built that is running in production right now?"

2. "How many AI agents or LLM applications have you deployed to production?"

3. "What was the hardest production issue you encountered with an AI system, and how did you resolve it?"

4. "Can I talk to a client whose AI system you built?"

5. "What percentage of your AI projects have made it from prototype to production?"

Technical Depth (Questions 6–12)

6. "Which AI agent frameworks do you use, and why?"

7. "How do you evaluate AI agent accuracy in production?"

8. "How do you handle hallucination in production systems?"

9. "What is your approach to RAG architecture?"

10. "How do you handle model deprecation and migration?"

11. "What is your approach to prompt engineering and management?"

12. "How do you implement guardrails for AI agents?"

Pricing and Process (Questions 13–19)

13. "Can you break down the total cost — development, infrastructure, LLM APIs, and ongoing maintenance?"

14. "What is included in post-launch support, and what costs extra?"

15. "What is your development process from kickoff to production?"

16. "What happens when scope changes mid-project?"

17. "Who will actually do the work — the people in this meeting, or a different team?"

18. "What is the minimum engagement size?"

19. "Do I own the code and IP?"

Post-Launch and Scale (Questions 20–25)

20. "How do you monitor AI agent performance after launch?"

21. "What is your on-call process when the AI agent breaks in production?"

22. "How do you handle scaling — what happens when my usage grows 10x?"

23. "Can you help us eventually bring this in-house?"

24. "What security certifications do you hold?"

25. "What would you recommend we NOT do?"

Your Evaluation Scorecard

Next Steps

Explore Related Solutions

Need Help Building Your Project?

Related Articles

AI Browser Automation in 2026: ChatGPT Agent, Computer Use, and What Actually Ships

AI Cost Optimization at Scale: How We Cut LLM Bills 60% Without Quality Loss

Blockchain Development in 2026: What's Actually Worth Building