How We Built Chatsy: Lessons from Shipping a Production AI Customer-Support Platform
Chatsy is the AI customer-support platform we built and operate at ZTABS — one of 10 AI-powered SaaS products we ship and one of 17 production SaaS products in our portfolio. It's been live for over a year. This is the engineering retrospective: what worked, what we rebuilt twice, what we'd do differently if we started today.
If you're shipping production AI agents, especially in the customer-support category, these are the decisions you'll make whether you build on Chatsy, build on Intercom Fin, or build your own from scratch.
TL;DR — five engineering decisions that shaped Chatsy
- Agents that call APIs, not chatbots that answer questions. The single biggest product decision. Built around tool calling from day one, not bolted on.
- Stack: Next.js + TypeScript + Postgres + Redis + multi-provider LLM. Boring, proven, swappable.
- OAuth2 framework for customer-system integrations. The thing competitors copy us on six months later.
- Knowledge base + ticketing + mailbox in the same product. Resisting "AI-only" lets the platform actually replace the helpdesk.
- Quality matters more than model. We rebuilt the agent loop twice. Neither rebuild involved a model swap.
The full stack details, the rebuilds, and what we got wrong are below.
Why we started — the gap in the market in 2024
In 2024, the AI chatbot market split into two:
- Knowledge-base chatbots (Intercom's Fin v1, Zendesk Answer Bot, dozens of "RAG over your docs" startups). They could answer questions if the answer was in the docs. They couldn't do anything else.
- Workflow chatbots (decision-tree builders like Drift, the older Zendesk flows). They could route and capture lead info but couldn't reason.
We saw a third position open up: agentic AI that takes action. An agent that can process a refund, update an order, modify a subscription tier, query a shipping status — by calling the customer's actual APIs through OAuth2. We bet that this was the category that would matter in 2025-2026, not "better RAG."
That bet was right. By mid-2025, the term "agentic" was everywhere; by 2026 every major support platform has an agent story. We were 6-12 months ahead because the underlying technology (tool calling in GPT-4 and Claude 3) had stabilized enough to ship on.
The stack — and why we picked each piece
The stack we converged on:
- Frontend: Next.js 14+ (App Router), TypeScript end-to-end, Tailwind, Radix UI
- Backend: Next.js API routes (initially) then dedicated Node services for heavy queue work
- Database: PostgreSQL (Supabase managed in early days, our own AWS RDS instances at scale)
- Cache + queue: Redis (Upstash early, Redis on AWS ElastiCache later)
- LLM providers: OpenAI (initially) → multi-provider (OpenAI + Anthropic + open-source) by mid-2025
- Vector store: pgvector (Postgres extension) — never needed to graduate to a dedicated vector DB at our scale
- Auth: Auth.js + custom OAuth2 framework for customer-system integrations
- Observability: Sentry for errors, custom analytics for conversation outcomes, Logflare for log retention
- Hosting: Vercel for the marketing site, AWS for the application
Why this stack and not the obvious alternatives:
Why Postgres + pgvector and not a dedicated vector DB? Operating two databases doubles the operational surface. pgvector was fast enough for our chunk count even at scale. The "you must use Pinecone / Weaviate / Qdrant" advice is right above a certain corpus size — we were never there.
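To make the pgvector choice concrete, here is a minimal sketch of a tenant-scoped similarity query. It is an illustration, not our production code: the `chunks` table, its column names, and the helper names are assumptions. The one pgvector detail it relies on is that the `<=>` operator returns cosine distance, so `1 - distance` gives a similarity score.

```typescript
// Hypothetical schema: chunks(id, tenant_id, content, embedding vector(N)).

interface SqlQuery {
  text: string;
  values: (string | number)[];
}

// pgvector accepts a vector as a text literal such as '[0.1,0.2,0.3]'.
function toVectorLiteral(embedding: number[]): string {
  if (embedding.length === 0 || embedding.some((x) => !Number.isFinite(x))) {
    throw new Error("embedding must be a non-empty array of finite numbers");
  }
  return `[${embedding.join(",")}]`;
}

// Build a parameterized query that returns the nearest chunks for one tenant.
function buildChunkSearch(tenantId: string, embedding: number[], limit = 5): SqlQuery {
  return {
    text: [
      "SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity",
      "FROM chunks",
      "WHERE tenant_id = $2",
      "ORDER BY embedding <=> $1::vector",
      "LIMIT $3",
    ].join("\n"),
    values: [toVectorLiteral(embedding), tenantId, limit],
  };
}
```

The resulting `text` and `values` can be handed to any Postgres client that supports numbered parameters. Keeping the tenant filter in the same query as the vector search is the practical upside of a single database.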
Why Next.js and not a dedicated backend framework? The customer-facing dashboard and the API routes share authentication, schemas, and TypeScript types. Two codebases would have been more work, not less. We did extract worker services for long-running jobs (knowledge-base crawls, async LLM calls).
Why multi-provider LLM? Vendor lock-in is the biggest risk in the AI category. When Anthropic launched Claude 3.5 Sonnet and it became the best coding model overnight, we wanted to be able to point our customers at it within a week. We were. See Claude vs GPT vs Gemini in 2026 for the comparison we use internally to pick defaults.
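A minimal sketch of what "swappable" can look like in code, assuming a small provider interface of our own invention (`LlmProvider`, `completeWithFallback`, and the stub providers are all hypothetical, and no real SDK is called). The idea is that the agent depends on the interface, and a per-tenant preference order decides which provider is tried first.

```typescript
interface ChatRequest {
  system: string;
  user: string;
}

interface LlmProvider {
  name: string;
  complete(req: ChatRequest): Promise<string>;
}

interface RoutedReply {
  provider: string;
  text: string;
}

// Try providers in the tenant's preferred order; fall through on failure.
async function completeWithFallback(
  registry: Map<string, LlmProvider>,
  preferredOrder: string[],
  req: ChatRequest,
): Promise<RoutedReply> {
  const failures: string[] = [];
  for (const name of preferredOrder) {
    const provider = registry.get(name);
    if (!provider) {
      failures.push(`${name}: not registered`);
      continue;
    }
    try {
      return { provider: name, text: await provider.complete(req) };
    } catch (err) {
      failures.push(`${name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all providers failed (${failures.join("; ")})`);
}

// Stub providers standing in for real SDK clients.
const registry = new Map<string, LlmProvider>([
  ["primary", { name: "primary", complete: async (_req: ChatRequest) => { throw new Error("rate limited"); } }],
  ["secondary", { name: "secondary", complete: async (req: ChatRequest) => `echo: ${req.user}` }],
]);
```

With this shape, pointing a tenant at a newly released model is a change to the registry and the preference list rather than to the agent loop.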
The agentic loop — what we rebuilt twice
The agent loop is the heart of Chatsy. We've rebuilt it twice.
Version 1: Single-turn function calling (early 2024)
Stateless function-call format. Each user message triggers an LLM call with the tool definitions attached. The LLM returns either a text response or a function call. On a function call we execute it, append the result to the history, and call the LLM again, repeating until it returns a text response.
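The shape of that loop can be sketched as follows. This is a simplified reconstruction under assumptions, not the original code: `CallModel`, `ToolFn`, and `runSingleTurnLoop` are hypothetical names, and the model is injected as a function so the sketch stays self-contained.

```typescript
type ModelTurn =
  | { kind: "text"; text: string }
  | { kind: "call"; tool: string; args: Record<string, unknown> };

interface HistoryEntry {
  role: "user" | "assistant" | "tool";
  content: string;
}

type CallModel = (history: HistoryEntry[]) => Promise<ModelTurn>;
type ToolFn = (args: Record<string, unknown>) => Promise<string>;

async function runSingleTurnLoop(
  callModel: CallModel,
  tools: Map<string, ToolFn>,
  userMessage: string,
): Promise<{ reply: string; history: HistoryEntry[] }> {
  const history: HistoryEntry[] = [{ role: "user", content: userMessage }];
  // No iteration cap and no history trimming. Both gaps are deliberate here:
  // a model that never returns text keeps this loop running, and every turn
  // grows the history.
  for (;;) {
    const turn = await callModel(history);
    if (turn.kind === "text") {
      history.push({ role: "assistant", content: turn.text });
      return { reply: turn.text, history };
    }
    const tool = tools.get(turn.tool);
    const result = tool ? await tool(turn.args) : `error: unknown tool ${turn.tool}`;
    history.push({ role: "assistant", content: `call ${turn.tool}` });
    history.push({ role: "tool", content: result });
  }
}
```

Nothing in this loop bounds the number of iterations or the size of `history`, which is worth keeping in mind when reading how it behaved in production.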
This worked for simple cases. It broke on:
- Multi-step workflows. "Process a refund and email the customer" needed two function calls plus a response. The LLM would do one and forget the other.
- Context bloat. Each turn appended to history; long sessions hit context limits.
- No loop detection. Without a stop condition, agents would call the same tool repeatedly when the result wasn't what they expected.
Version 2: ReAct-style loop with explicit state machine (mid-2024)
We rewrote around a Reason-Act-Observe loop with explicit state tracking: Thought, Action, and Observation cycles, each instrumented separately. We capped the loop at 12 iterations and added a cost cap.
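A minimal sketch of a budgeted Reason-Act-Observe loop, assuming hypothetical types (`ReActStep`, `LoopBudget`) and injected `think` and `act` functions. The 12-iteration limit comes from the description above; the dollar value of the cost cap in the usage example is a placeholder, not a real figure.

```typescript
type ReActStep =
  | { kind: "act"; thought: string; tool: string; input: string; costUsd: number }
  | { kind: "answer"; thought: string; text: string; costUsd: number };

type StopReason = "answered" | "iteration_limit" | "cost_cap";

interface LoopBudget {
  maxIterations: number;
  maxCostUsd: number;
}

interface LoopResult {
  reason: StopReason;
  answer: string | null;
  iterations: number;
  costUsd: number;
}

type Think = (observations: string[]) => Promise<ReActStep>;
type Act = (tool: string, input: string) => Promise<string>;

async function runReActLoop(think: Think, act: Act, budget: LoopBudget): Promise<LoopResult> {
  const observations: string[] = [];
  let costUsd = 0;
  for (let i = 1; i <= budget.maxIterations; i++) {
    const step = await think(observations); // Thought
    costUsd += step.costUsd;
    if (step.kind === "answer") {
      return { reason: "answered", answer: step.text, iterations: i, costUsd };
    }
    // Stop before spending more once the cap is reached.
    if (costUsd >= budget.maxCostUsd) {
      return { reason: "cost_cap", answer: null, iterations: i, costUsd };
    }
    observations.push(await act(step.tool, step.input)); // Action -> Observation
  }
  return { reason: "iteration_limit", answer: null, iterations: budget.maxIterations, costUsd };
}

// Usage: runReActLoop(think, act, { maxIterations: 12, maxCostUsd: 0.5 })
```

Returning an explicit `StopReason` instead of throwing makes each exit path something the caller can count and alert on.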
This was better. It broke on:
- Tool composition. When the agent needed to chain three OAuth-protected actions across two customer systems, the explicit ReAct format produced verbose intermediate "thinking" tokens that bloated cost.
- Failure recovery. When an API returned an unexpected error, the agent would loop trying variations rather than gracefully escalating to a human.
Version 3: Anthropic-style tool-use with structured human-handoff (late 2024 → ongoing)
Current architecture: a native tool-use API (Anthropic and OpenAI both support this cleanly now), human handoff exposed as an explicit tool the LLM can call, hard limits on tool calls per conversation, and observable per-tool cost and error rate.
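The three guardrails in that description (handoff as a tool, a hard per-conversation limit, per-tool stats) can be sketched in one small dispatcher. Everything named here is hypothetical, including the `handoff_to_human` tool name and the flat per-call cost; vendor tool-use APIs are deliberately left out so the sketch runs on its own.

```typescript
type ToolOutcome =
  | { kind: "ok"; output: string }
  | { kind: "error"; message: string }
  | { kind: "handoff"; reason: string };

interface ToolStats {
  calls: number;
  errors: number;
  costUsd: number;
}

interface ToolHandler {
  costUsd: number; // assumed flat per-call cost, purely for illustration
  run(args: Record<string, unknown>): Promise<string>;
}

const HANDOFF_TOOL = "handoff_to_human"; // hypothetical tool name

function createToolRunner(handlers: Map<string, ToolHandler>, maxCallsPerConversation: number) {
  const stats = new Map<string, ToolStats>();
  let used = 0;

  async function call(name: string, args: Record<string, unknown>): Promise<ToolOutcome> {
    // Handoff is just another tool the model can pick; it ends the automated turn.
    if (name === HANDOFF_TOOL) {
      return { kind: "handoff", reason: String(args["reason"] ?? "requested by model") };
    }
    // Hard limit: once exhausted, escalate instead of letting the model keep trying.
    if (used >= maxCallsPerConversation) {
      return { kind: "handoff", reason: "tool-call limit reached" };
    }
    used += 1;
    const s = stats.get(name) ?? { calls: 0, errors: 0, costUsd: 0 };
    stats.set(name, s);
    s.calls += 1;
    const handler = handlers.get(name);
    if (!handler) {
      s.errors += 1;
      return { kind: "error", message: `unknown tool: ${name}` };
    }
    s.costUsd += handler.costUsd;
    try {
      return { kind: "ok", output: await handler.run(args) };
    } catch (err) {
      s.errors += 1;
      return { kind: "error", message: err instanceof Error ? err.message : String(err) };
    }
  }

  return { call, stats };
}
```

Errors come back as data rather than exceptions, so the model sees the failure and can choose the handoff tool instead of retrying variations.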
This is the version in production. We expect to rebuild again in 12-18 months as model capabilities shift — that's the nature of building on a fast-moving foundation.
The OAuth2 integration framework — our quietly hardest engineering investment
The feature that makes Chatsy useful in production is the ability to plug into a customer's existing systems. Shopify for order status. Stripe for refunds. Zendesk for ticket sync. HubSpot for CRM updates. Each one requires OAuth2 onboarding, token refresh, scope management, webhook handling, and graceful failure when the integration token expires.
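Token refresh and failure on revocation are where most of the subtlety sits, so here is a minimal sketch of that path. The store, the `RefreshFn` signature, and the one-minute skew are assumptions for illustration. The one standards detail used is that OAuth2 token endpoints report an invalid, expired, or revoked refresh token with the `invalid_grant` error code (RFC 6749, section 5.2).

```typescript
interface StoredToken {
  accessToken: string;
  refreshToken: string;
  expiresAt: number; // epoch milliseconds
}

type RefreshResponse =
  | { ok: true; accessToken: string; refreshToken?: string; expiresInSec: number }
  | { ok: false; error: string };

type RefreshFn = (refreshToken: string) => Promise<RefreshResponse>;

const REFRESH_SKEW_MS = 60 * 1000; // refresh a minute early; arbitrary choice

async function getAccessToken(
  store: Map<string, StoredToken>,
  key: string, // e.g. "<tenant>:<integration>"
  refresh: RefreshFn,
  now: number,
): Promise<string> {
  const current = store.get(key);
  if (!current) throw new Error(`no token stored for ${key}`);
  if (current.expiresAt - now > REFRESH_SKEW_MS) return current.accessToken;

  const res = await refresh(current.refreshToken);
  if (!res.ok) {
    if (res.error === "invalid_grant") {
      // Permanent: the customer must reconnect. Drop the token and fail loudly.
      store.delete(key);
      const err = new Error(`integration ${key} was revoked; reconnect required`);
      err.name = "TokenRevokedError";
      throw err;
    }
    // Anything else is treated as transient; the caller may retry.
    throw new Error(`token refresh failed: ${res.error}`);
  }
  store.set(key, {
    accessToken: res.accessToken,
    refreshToken: res.refreshToken ?? current.refreshToken,
    expiresAt: now + res.expiresInSec * 1000,
  });
  return res.accessToken;
}
```

Separating the permanent case from the transient one lets the agent tell the customer "please reconnect this integration" instead of retrying a call that can never succeed.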
We built a generic integration framework so adding a new integration is hours of work, not weeks:
- Manifest-driven: each integration is a TypeScript file describing endpoints, auth flow, scopes, and tools-it-exposes-to-the-agent.
- Token-storage abstraction: encrypted at rest with per-tenant keys, automatic refresh, fail-loud on permanent revocation.
- Per-action audit log: every tool call the agent makes through an integration is logged with user-id, tenant-id, request, response, timestamp. Customers can audit "what did the AI do on behalf of my user X?"
- Sandbox mode: new integrations ship in dry-run before going live — agents simulate the call but don't execute, so we can validate prompts before they touch production data.
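The manifest, audit log, and sandbox mode fit together roughly as sketched below. This is a simplified illustration rather than the real framework: the manifest fields, the `example-shop` service, its URLs, and its scope names are all placeholders.

```typescript
interface ToolSpec {
  name: string;
  description: string;
  method: "GET" | "POST";
  path: string;
  scope: string;
}

interface IntegrationManifest {
  id: string;
  baseUrl: string;
  auth: { authorizeUrl: string; tokenUrl: string; scopes: string[] };
  tools: ToolSpec[];
}

interface AuditEntry {
  tenantId: string;
  userId: string;
  integration: string;
  tool: string;
  request: string;
  response: string;
  timestamp: string;
  dryRun: boolean;
}

interface CallContext {
  tenantId: string;
  userId: string;
  mode: "dry-run" | "live";
}

type Transport = (method: string, url: string, body: unknown) => Promise<string>;

// Hypothetical manifest; the service, URLs, and scope names are placeholders.
const exampleShop: IntegrationManifest = {
  id: "example-shop",
  baseUrl: "https://shop.example.com/api",
  auth: {
    authorizeUrl: "https://shop.example.com/oauth/authorize",
    tokenUrl: "https://shop.example.com/oauth/token",
    scopes: ["orders:read", "refunds:write"],
  },
  tools: [
    { name: "get_order_status", description: "Look up an order", method: "GET", path: "/orders/status", scope: "orders:read" },
    { name: "refund_order", description: "Refund an order", method: "POST", path: "/refunds", scope: "refunds:write" },
  ],
};

async function callIntegrationTool(
  manifest: IntegrationManifest,
  toolName: string,
  body: unknown,
  ctx: CallContext,
  transport: Transport,
  auditLog: AuditEntry[],
): Promise<string> {
  const tool = manifest.tools.find((t) => t.name === toolName);
  if (!tool) throw new Error(`${manifest.id} has no tool named ${toolName}`);
  if (manifest.auth.scopes.indexOf(tool.scope) === -1) {
    throw new Error(`${toolName} needs scope ${tool.scope}, which the manifest does not request`);
  }
  const url = manifest.baseUrl + tool.path;
  const dryRun = ctx.mode === "dry-run";
  // In dry-run mode the call is described but never sent.
  const response = dryRun ? `[dry-run] would ${tool.method} ${url}` : await transport(tool.method, url, body);
  // Every call is audited, simulated or not.
  auditLog.push({
    tenantId: ctx.tenantId,
    userId: ctx.userId,
    integration: manifest.id,
    tool: tool.name,
    request: JSON.stringify({ method: tool.method, url, body }),
    response,
    timestamp: new Date().toISOString(),
    dryRun,
  });
  return response;
}
```

Because dry-run calls go through the same code path and land in the same audit log, a new integration's prompts can be reviewed against realistic traces before it goes live.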
The framework took us 8-10 weeks to build at the start of 2024 and has paid off every quarter since. Without it, each new integration would be 2-3 weeks of bespoke work. With it, the third-party integration count grew from 4 to 20+ over the year.
What we got wrong
Honest list:
We over-indexed on RAG quality early. The first 6 months of investment went into knowledge-base chunking, embedding strategies, hybrid search, and re-ranking. All of that mattered, but less than the agentic loop. RAG quality is necessary; it's not sufficient. If you're picking what to spend engineering time on, agent reliability beats retrieval polish.
We launched with too many AI models exposed in the UI. We thought "15+ AI models" was a feature; for most customers it was confusing. Now we default to a sensible model and surface alternatives only when the customer asks. Defaults matter more than choice.
We underestimated the support burden of the support product. The irony: shipping a customer-support platform created a customer-support obligation. We had to staff up our own support team in mid-2024. Build the customer-support burden into your roadmap when shipping a customer-support product.
The free tier was too generous. Some businesses could run their entire support operation on it indefinitely. We've revised it twice. Pricing is a product feature; iterate on it like one.
We waited too long to instrument cost-per-conversation. Until we built proper observability for LLM cost per conversation, we were billing on conversation volume without knowing which customers were unprofitable. Some of our enterprise customers had 10x the average cost-per-conversation because their flows involved chained tool calls. Instrument cost from day one; price after.
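A minimal sketch of the instrumentation we mean, assuming a hypothetical `UsageEvent` record emitted once per LLM call and a price table you maintain yourself (the model names and prices in any example are placeholders, not vendor list prices). Chained tool calls show up as several events under one conversation, which is how expensive flows become visible.

```typescript
interface UsageEvent {
  tenantId: string;
  conversationId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

interface ModelPrice {
  inputPerMTokUsd: number; // USD per million input tokens
  outputPerMTokUsd: number; // USD per million output tokens
}

function eventCostUsd(e: UsageEvent, prices: Map<string, ModelPrice>): number {
  const p = prices.get(e.model);
  if (!p) throw new Error(`no price configured for model ${e.model}`);
  return (e.inputTokens / 1e6) * p.inputPerMTokUsd + (e.outputTokens / 1e6) * p.outputPerMTokUsd;
}

// Average LLM cost per conversation for each tenant.
function avgCostByTenant(events: UsageEvent[], prices: Map<string, ModelPrice>): Map<string, number> {
  const totals = new Map<string, { costUsd: number; conversations: Set<string> }>();
  for (const e of events) {
    const t = totals.get(e.tenantId) ?? { costUsd: 0, conversations: new Set<string>() };
    totals.set(e.tenantId, t);
    t.costUsd += eventCostUsd(e, prices);
    t.conversations.add(e.conversationId);
  }
  const averages = new Map<string, number>();
  totals.forEach((t, tenantId) => averages.set(tenantId, t.costUsd / t.conversations.size));
  return averages;
}
```

Comparing each tenant's average against what that tenant pays per conversation is the check that would have flagged our unprofitable accounts earlier.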
Numbers we can share
A few quantitative anchors for founders building in this category:
- MVP cost: ~$180K-$260K equivalent engineering effort, 14-18 weeks calendar. That includes the agentic loop, knowledge-base CMS, ticketing system, OAuth2 framework, dashboard, and the first 4 integrations.
- Time to break even on a self-funded SaaS at our pricing: ~22-30 months from launch. That's typical for B2B SaaS in our band; the AI category isn't faster. As a related (though narrower) yardstick, published benchmarks put median CAC payback for private B2B SaaS in the high-teens to low-20s of months, with mid-market deals often pushing toward two years.
- Cost-per-conversation at production scale: typically a few cents on average, with long-tail high-cost conversations 5-10x the average. The shape of the cost distribution matters more than the mean. For reference, vendor list prices for AI-resolved conversations in 2026 land around $0.99/resolution for Intercom Fin and ~$1.50-$2.00 for Zendesk AI; Salesforce Agentforce sits higher.
- Conversation deflection rate: customers see 40-70% deflection once the knowledge base is mature and the agent has 5+ integrations to the customer's systems. Below 5 integrations, deflection rate caps around 30%. For reference, published 2025-2026 benchmarks land Intercom Fin around 50% (with Fin 2 reaching ~60% on tier-1 tickets and ~67% across its customer base by Fin's own reporting) and Zendesk AI roughly 35-45% on configured intents.
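To illustrate why the shape of the cost distribution matters more than the mean, here is a small sketch that summarizes per-conversation costs with a nearest-rank percentile. The function names are hypothetical, and any numbers you feed it in an example are made up rather than taken from our data.

```typescript
// Nearest-rank percentile on an ascending-sorted array.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) throw new Error("no data");
  const rank = Math.ceil((p / 100) * sortedAsc.length);
  return sortedAsc[Math.max(rank, 1) - 1];
}

interface CostSummary {
  mean: number;
  p50: number;
  p95: number;
}

function summarizeCosts(costs: number[]): CostSummary {
  const sorted = costs.slice().sort((a, b) => a - b);
  const total = sorted.reduce((sum, c) => sum + c, 0);
  return {
    mean: total / sorted.length,
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
  };
}
```

When a few conversations cost several times the typical one, the mean lands above the median and p95 sits far above both. That gap is the signal to look at before setting a per-conversation price.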
If we started today, what would we change
Five changes we'd make in a 2026 greenfield start:
- Anthropic native tool use from day one. In 2024 we built our own ReAct loop; today the vendor APIs are good enough that you should not.
- MCP servers, not bespoke integrations. MCP is the standard now. New integrations should be MCP servers; the framework should be MCP-client-shaped.
- Multi-tenant from day one. We had to refactor for multi-tenant at month 8. Costly.
- Cost observability before feature breadth. Build the per-conversation cost dashboard before the third integration ships.
- Pricing iteration earlier. We launched on intuitive pricing and refined it 4 times. Skip 2 of those iterations by doing real pricing research up front.
What ZTABS builds for clients
Chatsy is one of 10 AI-powered products we operate. We also build custom AI agents for clients who need something Chatsy doesn't do out-of-the-box — different verticals (healthcare-specific HIPAA flows, financial-services regulatory wrappers, vertical-specific compliance), different integration profiles (your stack, not OAuth2 standard), or different agent shapes (voice, mobile-native, multi-agent orchestration).
If you're building something agentic and want to skip the rebuilds we did, that's what we sell as an AI agent development engagement (/services/ai-agent-development). Typical timeline: 8-14 weeks from kickoff to production MVP, depending on integration count.
Related reading
- Chatsy — product page on ztabs.co
- AI agent development cost: how much does it cost to build an AI agent?
- AI agent orchestration guide — building production agents
- Claude vs GPT vs Gemini in 2026 — picking the model behind your agent
- MCP Protocol Explained — the standard for agent tool use in 2026
- AI integration for business — frameworks and build vs buy
- Customer service chatbot guide — the broader category
- ZTABS AI development services
- Hire AI/ML engineers from ZTABS
This retrospective covers Chatsy's engineering journey from early 2024 through May 2026. Specific cost numbers, integration counts, deflection rates, and model-mix details are tagged for editorial verification — the operating numbers shift over time as we iterate on the product.
Frequently Asked Questions
What is Chatsy and what makes it different from other AI chatbots?
Chatsy is an AI customer-support platform we built at ZTABS that deploys agents which take actions, not just answer questions. Most AI chatbots can only respond with text from a knowledge base. Chatsy agents call APIs (process refunds, update orders, modify subscriptions, query order status) through OAuth2 integrations backed by a full support stack — knowledge base, ticketing, shared mailbox, live chat with human takeover. Setup takes under 15 minutes.
How does Chatsy compare to Intercom Fin or Zendesk AI?
Intercom Fin and Zendesk AI are tightly bound to their parent platforms — they require you to be on Intercom or Zendesk first. Chatsy is a standalone platform that integrates with your existing stack via OAuth2 and works alongside any helpdesk. We optimize for teams who don't want to migrate their entire support stack just to get agentic AI. We also support 15+ AI models so you choose the underlying intelligence.
What stack does Chatsy run on?
Next.js + TypeScript on the frontend, PostgreSQL for relational state, Redis for queue and cache, OpenAI and Anthropic as the model providers (with 15+ supported), OAuth2 for customer-system integrations. AES-256 at rest, TLS 1.3 in transit, GDPR-compliant data handling.
How much did Chatsy cost to build to MVP?
Roughly $180K-$260K of equivalent engineering effort and 14-18 weeks calendar to first MVP — followed by continuous iteration. That includes the agentic loop, knowledge-base CMS, ticketing system, OAuth2 framework, dashboard, and the first 4 integrations. Most of that effort goes into the supporting platform, not the chatbot itself; the agentic loop is the smaller share of the build.
Is Chatsy free?
Chatsy has a free starter plan suitable for small teams or low volume. Paid plans scale with monthly active conversations and feature tier. Enterprise plans include SOC 2 reporting, custom action integrations, and dedicated support.
Can I use Chatsy for my own product without rebuilding it?
Yes. Chatsy is a SaaS product anyone can sign up for at chatsy.app. For teams that want a deeply customized agentic support system built specifically for their stack — different from Chatsy's out-of-the-box flows — we offer custom builds through our AI development services.