Function Calling in LLMs: How AI Agents Use Tools (Practical Guide)
TL;DR: Function calling (tool use) is what gives AI agents the ability to interact with the real world — searching databases, calling APIs, and taking actions. This guide covers how function calling works across GPT-4o, Claude, and Gemini, with code examples and production patterns.
Function calling — also called tool use — is the capability that transforms LLMs from text generators into AI agents that can act in the real world. Without function calling, an LLM can only produce text. With function calling, an LLM can search databases, call APIs, send emails, process payments, update CRM records, and execute virtually any operation you define.
If you are building AI agents, function calling is the most important capability to understand deeply. It is the bridge between the LLM's reasoning and your application's functionality.
How Function Calling Works
The flow is straightforward once you understand the mechanics. For vendor references, see the OpenAI Function Calling guide[1], the Anthropic Tool Use docs[2], and the Google Gemini function calling docs[3].
Step 1: Define functions
You describe the functions (tools) the LLM can call — name, description, parameters, and parameter types. These descriptions are passed to the LLM as part of the prompt.
Step 2: LLM decides to call a function
Based on the user's message and the available function descriptions, the LLM decides whether to call a function and which one. The LLM does not execute the function — it returns a structured JSON object indicating which function to call and what arguments to pass.
Step 3: Your code executes the function
Your application receives the function call request, executes the actual function (API call, database query, etc.), and returns the result to the LLM.
Step 4: LLM incorporates the result
The LLM receives the function result and uses it to generate its final response to the user.
User: "What's the weather in Houston?"
↓
LLM: "I should call get_weather with city='Houston'" (function call)
↓
Your code: calls weather API → returns "72°F, partly cloudy"
↓
LLM: "The current weather in Houston is 72°F and partly cloudy."
The LLM never has direct access to your systems. It can only request that you execute functions on its behalf. This separation is critical for security.
Function Calling Across Providers
OpenAI (GPT-4o, GPT-4o-mini)
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the current status of a customer order by order ID",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "The order ID (e.g., ORD-12345)"
                    }
                },
                "required": ["order_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_knowledge_base",
            "description": "Search the company knowledge base for product information, policies, and FAQs",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query"
                    }
                },
                "required": ["query"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful customer support agent."},
        {"role": "user", "content": "Where is my order ORD-12345?"}
    ],
    tools=tools,
    tool_choice="auto"
)

tool_call = response.choices[0].message.tool_calls[0]
# tool_call.function.name == "get_order_status"
# tool_call.function.arguments == '{"order_id": "ORD-12345"}'
Anthropic (Claude)
import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_order_status",
        "description": "Look up the current status of a customer order by order ID",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The order ID (e.g., ORD-12345)"
                }
            },
            "required": ["order_id"]
        }
    }
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a helpful customer support agent.",
    tools=tools,
    messages=[
        {"role": "user", "content": "Where is my order ORD-12345?"}
    ]
)

for block in response.content:
    if block.type == "tool_use":
        # block.name == "get_order_status"
        # block.input == {"order_id": "ORD-12345"}
        pass
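To finish the round trip, Claude expects your function's output back as a tool_result content block inside a user turn. A minimal sketch, assuming block is the tool_use block found in the loop above and a hypothetical result string:

follow_up = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a helpful customer support agent.",
    tools=tools,
    messages=[
        {"role": "user", "content": "Where is my order ORD-12345?"},
        # Echo the assistant's tool_use turn back verbatim.
        {"role": "assistant", "content": response.content},
        # Return the executed function's output as a tool_result block.
        {
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": "Status: shipped, ETA: 2 days"  # hypothetical result
                }
            ]
        }
    ]
)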
Google (Gemini)
import os

import google.generativeai as genai

# Assumes GOOGLE_API_KEY is set in the environment.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

get_order_status = genai.protos.FunctionDeclaration(
    name="get_order_status",
    description="Look up the current status of a customer order",
    parameters=genai.protos.Schema(
        type=genai.protos.Type.OBJECT,
        properties={
            "order_id": genai.protos.Schema(type=genai.protos.Type.STRING)
        },
        required=["order_id"]
    )
)

model = genai.GenerativeModel(
    "gemini-1.5-pro",
    tools=[genai.protos.Tool(function_declarations=[get_order_status])]
)

response = model.generate_content("Where is my order ORD-12345?")
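Gemini returns the function-call request as a structured part rather than text. Reading it out looks roughly like this, assuming a single-tool response where the call lands in the first part:

# Extract the model's function-call request from the response.
fc = response.candidates[0].content.parts[0].function_call
# fc.name == "get_order_status"
# dict(fc.args) == {"order_id": "ORD-12345"}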
The Tool Loop Pattern
In practice, an AI agent often needs to call multiple tools in sequence — search for information, then look up a record, then take an action. This requires a tool loop.
import json

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_message}
]

while True:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools
    )
    assistant_message = response.choices[0].message

    if not assistant_message.tool_calls:
        # No more tool calls — return the final response
        print(assistant_message.content)
        break

    messages.append(assistant_message)

    for tool_call in assistant_message.tool_calls:
        # Arguments arrive as a JSON string; parse before dispatching.
        args = json.loads(tool_call.function.arguments)
        result = execute_tool(tool_call.function.name, args)
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": str(result)
        })
This loop continues until the LLM generates a text response instead of a tool call — signaling that it has gathered enough information to answer.
Limiting tool calls
Without limits, the agent could loop indefinitely. Always set a maximum.
MAX_TOOL_CALLS = 5
tool_call_count = 0

while tool_call_count < MAX_TOOL_CALLS:
    response = client.chat.completions.create(...)  # same call as in the loop above
    assistant_message = response.choices[0].message
    if not assistant_message.tool_calls:
        break
    tool_call_count += len(assistant_message.tool_calls)
    # ... execute the tool calls and append results as before
Writing Good Function Descriptions
The quality of your function descriptions directly determines how reliably the LLM selects the right function and passes correct arguments. This is prompt engineering for tools.
Rules for function descriptions
Be specific about what the function does:
Bad: "Gets data"
Good: "Retrieves the current shipping status, tracking number, and estimated
delivery date for a customer order by its order ID"
Describe parameter formats and constraints:
Bad: "date": { "type": "string", "description": "Date" }
Good: "date": { "type": "string", "description": "Date in YYYY-MM-DD format (e.g., 2026-03-04)" }
Specify when to use the function vs when not to:
"description": "Search the product catalog for items matching a query.
Use this for product-related questions. Do NOT use this for order status
or account questions — use get_order_status or get_account_info instead."
Use enum types to restrict parameter values:
"status_filter": {
"type": "string",
"enum": ["pending", "shipped", "delivered", "returned"],
"description": "Filter orders by status"
}
Parallel Function Calling
GPT-4o and Claude can request multiple function calls in a single response. This is useful when the agent needs independent data from multiple sources.
User: "Compare my last order with my current subscription"
LLM responds with TWO tool calls:
1. get_recent_orders(customer_id="C-123", limit=1)
2. get_subscription(customer_id="C-123")
Both calls can execute in parallel, and the results are returned together, which can roughly halve latency compared to running the calls sequentially.
import asyncio

async def handle_parallel_calls(tool_calls):
    tasks = [
        execute_tool_async(tc.function.name, tc.function.arguments)
        for tc in tool_calls
    ]
    results = await asyncio.gather(*tasks)
    return results
Security Considerations
Function calling gives the LLM indirect access to your systems. Take security seriously.
Input validation
Never trust the LLM's arguments blindly. Validate every parameter before execution.
import re

def get_order_status(order_id: str) -> str:
    if not re.match(r'^ORD-\d{5,10}$', order_id):
        return "Invalid order ID format"
    order = db.orders.find_one({"id": order_id})
    if not order:
        return "Order not found"
    return f"Status: {order['status']}, Tracking: {order['tracking']}"
Authorization boundaries
Not every function should be available to every user. Implement per-user tool access.
def get_available_tools(user_role: str) -> list:
    base_tools = [search_knowledge_base, get_order_status]
    if user_role == "admin":
        return base_tools + [process_refund, update_account]
    return base_tools
Rate limiting
Prevent runaway agents from overwhelming your systems.
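A minimal sketch, assuming a per-user sliding window held in process memory (the limit is illustrative; production deployments usually back this with Redis):

import time
from collections import defaultdict, deque

MAX_CALLS_PER_MINUTE = 20  # illustrative limit
recent_calls = defaultdict(deque)  # user_id -> timestamps of recent tool calls

def allow_tool_call(user_id: str) -> bool:
    now = time.time()
    window = recent_calls[user_id]
    # Evict timestamps outside the 60-second window.
    while window and now - window[0] > 60:
        window.popleft()
    if len(window) >= MAX_CALLS_PER_MINUTE:
        return False
    window.append(now)
    return True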
Audit logging
Log every function call with the user context, arguments, result, and timestamp. This is essential for debugging, security, and compliance. See our AI governance guide.
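A minimal sketch using the standard logging module, emitting one structured record per call (the field names are illustrative):

import json
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("tool_audit")

def log_tool_call(user_id: str, tool_name: str, arguments: dict, result) -> None:
    audit_log.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "tool": tool_name,
        "arguments": arguments,
        "result_preview": str(result)[:500],  # truncate large payloads
    }))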
Function Calling vs MCP
Function calling is the low-level mechanism. Model Context Protocol (MCP) is the high-level standard built on top of it.
| Aspect | Function Calling | MCP |
|--------|-----------------|-----|
| Scope | Single model, single application | Universal standard across models |
| Discovery | You define tools in code | Client auto-discovers tools from servers |
| Portability | Model-specific API format | Works across any MCP-compatible model |
| Ecosystem | Custom per project | Growing ecosystem of pre-built servers |
| Best for | Simple applications with few tools | Complex applications with many tools or multi-model support |
Rule of thumb: Use native function calling for simple agents with 1–5 tools. Use MCP when you have many tools, need model portability, or are building a platform.
Structured Outputs and Strict Mode (2026)
OpenAI's Structured Outputs (launched August 2024, stable in gpt-4o-2024-08-06 and later) and Anthropic's tool-use schema validation have dramatically reduced schema-compliance failures. Enable them.
# OpenAI: strict Structured Outputs
tools = [
    {
        "type": "function",
        "function": {
            "name": "create_support_ticket",
            "description": "Create a support ticket after gathering all required fields",
            "parameters": {
                "type": "object",
                "properties": {
                    "subject": {"type": "string"},
                    "priority": {
                        "type": "string",
                        "enum": ["low", "medium", "high", "urgent"]
                    },
                    "customer_id": {"type": "string"},
                    "description": {"type": "string"}
                },
                "required": ["subject", "priority", "customer_id", "description"],
                "additionalProperties": False
            },
            "strict": True  # Guarantees 100% schema compliance
        }
    }
]
With strict: True, OpenAI guarantees the model's tool-call arguments will validate against your JSON schema — no missing required fields, no unexpected properties, no type errors. This moves schema failures from a runtime concern to a schema-design concern.
Tradeoffs to know:
- The first request with a new strict schema adds ~1–2 seconds of schema compilation latency. Subsequent calls are cached.
- Not every schema feature is supported — minLength, maxLength, complex $ref patterns, and recursive schemas are limited. Stick to types, enums, required, and additionalProperties: false.
- Strict mode is only available on gpt-4o-2024-08-06 and later; older snapshots fall back to best-effort.
Anthropic's equivalent (tool input_schema with required fields) hits roughly 98–99% compliance on Claude Sonnet 4 without a dedicated strict flag, per Anthropic's published evaluations.
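Since that compliance is high but not guaranteed, it is worth validating Claude's tool inputs before execution. A minimal sketch using the jsonschema package, reusing the input_schema dict from the tool definition (safe_execute is a hypothetical wrapper name):

from jsonschema import ValidationError, validate

def safe_execute(tool_fn, input_schema: dict, arguments: dict):
    try:
        validate(instance=arguments, schema=input_schema)
    except ValidationError as e:
        # Return the error as the tool result so the model can correct itself.
        return f"Invalid arguments: {e.message}"
    return tool_fn(**arguments)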
Cost Economics of Tool Schemas
Every tool definition you pass adds input tokens to every call. A realistic e-commerce agent with 12 tools averages 2,500 tokens of schema per request. At GPT-4o prices ($2.50/1M input), that's $0.00625 per request in schema alone — before the user prompt, conversation history, or RAG context.
Optimization tactics in order of payoff:
- Tool routing — Send 3–5 tools per request instead of 12 (see the sketch after this list). Typical savings: 60–75% on schema tokens.
- Prompt caching — Anthropic caches tool schemas at ~10% of the input price for cache hits; OpenAI prices cached input at 50% of the standard rate. For high-QPS agents, this is the largest single cost lever.
- Schema compression — Drop wordy descriptions for obvious fields; keep them for ambiguous ones. A 30% description trim rarely hurts accuracy and shaves 10–15% of schema tokens.
- Model routing — GPT-4o-mini handles 80%+ of well-defined tool calls with near-parity accuracy at 1/17th the price of GPT-4o. Reserve the flagship model for reasoning-heavy paths.
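A minimal sketch of tool routing, using a keyword lookup for brevity (the group names and schema variables are hypothetical; production routers more often rank tool descriptions by embedding similarity):

# Hypothetical intent -> tool-schema groups for the e-commerce agent above.
TOOL_GROUPS = {
    "order": [get_order_status_schema, get_recent_orders_schema],
    "product": [search_catalog_schema],
    "subscription": [get_subscription_schema],
}
DEFAULT_TOOLS = [search_knowledge_base_schema]

def route_tools(user_message: str, max_tools: int = 5) -> list:
    """Send a small, relevant tool subset instead of all 12 schemas."""
    text = user_message.lower()
    selected = [
        schema
        for keyword, schemas in TOOL_GROUPS.items()
        if keyword in text
        for schema in schemas
    ]
    return (selected or DEFAULT_TOOLS)[:max_tools]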
Tool-Call Evaluation Harness
Most teams evaluate LLM agents by vibes. That does not survive production. Stand up a proper eval from day one.
Minimum viable eval dataset:
- 50 labeled conversations per critical user intent
- For each: expected tool-call sequence, expected final-response category, expected failure modes (if any)
- Replay on every prompt change and every model release
Metrics to track per eval run:
- Tool-selection accuracy — did the agent pick the right tool? Target: >95% on well-scoped agents.
- Argument accuracy — did it pass the right arguments? Target: >93%.
- Sequence accuracy — did it complete the workflow correctly? Target: >90%.
- Unnecessary tool calls — did it over-call? Target: <5% of sessions with extra calls.
- Refusal correctness — did it correctly refuse out-of-scope requests? Target: >98%.
Promptfoo, Langfuse, and Phoenix (Arize) all support replay-style evaluation suites; stitching one together with plain pytest and a JSON eval file also works for teams just starting out.
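A minimal sketch of the plain-pytest flavor, assuming an evals.json file of labeled cases and a run_agent entry point that records the tool calls it made (both names are hypothetical):

import json

import pytest

with open("evals.json") as f:
    CASES = json.load(f)  # [{"user_message": ..., "expected_tools": [...]}, ...]

@pytest.mark.parametrize("case", CASES)
def test_tool_selection(case):
    transcript = run_agent(case["user_message"])  # hypothetical agent entry point
    called = [call.name for call in transcript.tool_calls]
    assert called == case["expected_tools"], (
        f"expected {case['expected_tools']}, got {called}"
    )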
Getting Started
- Start with one function. Build an agent that can call a single function reliably before adding more.
- Write excellent function descriptions. This is the highest-leverage optimization for function calling reliability.
- Implement the tool loop pattern. This is the standard architecture for AI agents.
- Add validation, rate limiting, and logging from day one.
- Test with adversarial inputs. What happens when the user tries to trick the agent into calling functions it should not?
- Build the eval harness before the agent ships. Retrofitting evals after launch is 5x more work and produces a weaker safety net.
For help building AI agents with production-grade function calling, explore our AI agent development services or contact us. Our team builds agents across customer support, e-commerce, and enterprise automation using both native function calling and MCP.
Frequently Asked Questions
How much does function calling add to the per-request cost compared to plain chat completion?
Function calling itself does not carry a separate fee on OpenAI or Anthropic, but the tool schema you pass in counts as input tokens on every call, which typically adds 200 to 800 tokens per request depending on how many functions you expose. At scale, that schema overhead can raise inference cost by 5 to 15 percent, so pruning unused tools per request is a real optimization. Caching via Anthropic prompt caching or OpenAI cached input can claw most of it back.
Is OpenAI function calling better than Anthropic tool use for a production RAG app?
Both providers now hit roughly comparable accuracy on well-specified tool schemas, and the choice usually comes down to parallel tool call behavior and how each handles partial failures. Anthropic tends to be more conservative about calling tools without clear signals, while OpenAI fires more aggressively, which matters for agent designs where over-calling is the failure mode. Run the same eval set against both before committing.
Can function calling scale to 50 or 100 tools on a single agent?
There is a practical ceiling: accuracy degrades noticeably once you push past roughly 20 tools in a single prompt, and beyond 50 tools the model starts hallucinating function signatures. The fix is tool routing, where a lightweight classifier picks the relevant 3 to 5 tools per request and only those get passed to the main model. MCP servers and retrieval-based tool selection make this pattern straightforward.
What breaks first when function calling hits real production traffic?
Schema validation errors are the first failure mode, usually because the model returns a string where an integer is expected or omits a required field under load. Strict JSON mode or structured outputs cuts this dramatically, but you still need retry logic and a dead-letter queue for malformed calls. The second failure is tool timeout cascades when one slow backend blocks the whole agent chain.