Function Calling in LLMs: How AI Agents Use Tools (Practical Guide)
TL;DR: Function calling (tool use) is what gives AI agents the ability to interact with the real world — searching databases, calling APIs, and taking actions. This guide covers how function calling works across GPT-4o, Claude, and Gemini, with code examples and production patterns.
Function calling — also called tool use — is the capability that transforms LLMs from text generators into AI agents that can act in the real world. Without function calling, an LLM can only produce text. With function calling, an LLM can search databases, call APIs, send emails, process payments, update CRM records, and execute virtually any operation you define.
If you are building AI agents, function calling is the most important capability to understand deeply. It is the bridge between the LLM's reasoning and your application's functionality.
How Function Calling Works
The flow is straightforward once you understand the mechanics. For vendor references, see the OpenAI Function Calling guide[1], the Anthropic Tool Use docs[2], and the Google Gemini function calling docs[3].
Step 1: Define functions
You describe the functions (tools) the LLM can call — name, description, parameters, and parameter types. These descriptions are passed to the LLM as part of the prompt.
Step 2: LLM decides to call a function
Based on the user's message and the available function descriptions, the LLM decides whether to call a function and which one. The LLM does not execute the function — it returns a structured JSON object indicating which function to call and what arguments to pass.
Step 3: Your code executes the function
Your application receives the function call request, executes the actual function (API call, database query, etc.), and returns the result to the LLM.
Step 4: LLM incorporates the result
The LLM receives the function result and uses it to generate its final response to the user.
User: "What's the weather in Houston?"
↓
LLM: "I should call get_weather with city='Houston'" (function call)
↓
Your code: calls weather API → returns "72°F, partly cloudy"
↓
LLM: "The current weather in Houston is 72°F and partly cloudy."
The LLM never has direct access to your systems. It can only request that you execute functions on its behalf. This separation is critical for security.
Function Calling Across Providers
OpenAI (GPT-4o, GPT-4o-mini)
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the current status of a customer order by order ID",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "The order ID (e.g., ORD-12345)"
                    }
                },
                "required": ["order_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_knowledge_base",
            "description": "Search the company knowledge base for product information, policies, and FAQs",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query"
                    }
                },
                "required": ["query"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful customer support agent."},
        {"role": "user", "content": "Where is my order ORD-12345?"}
    ],
    tools=tools,
    tool_choice="auto"
)

tool_call = response.choices[0].message.tool_calls[0]
# tool_call.function.name == "get_order_status"
# tool_call.function.arguments == '{"order_id": "ORD-12345"}'
Anthropic (Claude)
import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_order_status",
        "description": "Look up the current status of a customer order by order ID",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {
                    "type": "string",
                    "description": "The order ID (e.g., ORD-12345)"
                }
            },
            "required": ["order_id"]
        }
    }
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a helpful customer support agent.",
    tools=tools,
    messages=[
        {"role": "user", "content": "Where is my order ORD-12345?"}
    ]
)

for block in response.content:
    if block.type == "tool_use":
        # block.name == "get_order_status"
        # block.input == {"order_id": "ORD-12345"}
        pass
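To finish the round trip, Claude expects your function's output back as a tool_result content block inside a user turn. A minimal sketch, assuming block is the tool_use block found in the loop above and a hypothetical result string:

follow_up = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a helpful customer support agent.",
    tools=tools,
    messages=[
        {"role": "user", "content": "Where is my order ORD-12345?"},
        # Echo the assistant's tool_use turn back verbatim.
        {"role": "assistant", "content": response.content},
        # Return the executed function's output as a tool_result block.
        {
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": "Status: shipped, ETA: 2 days"  # hypothetical result
                }
            ]
        }
    ]
)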
Google (Gemini)
import os

import google.generativeai as genai

# Assumes GOOGLE_API_KEY is set in the environment.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

get_order_status = genai.protos.FunctionDeclaration(
    name="get_order_status",
    description="Look up the current status of a customer order",
    parameters=genai.protos.Schema(
        type=genai.protos.Type.OBJECT,
        properties={
            "order_id": genai.protos.Schema(type=genai.protos.Type.STRING)
        },
        required=["order_id"]
    )
)

model = genai.GenerativeModel(
    "gemini-1.5-pro",
    tools=[genai.protos.Tool(function_declarations=[get_order_status])]
)

response = model.generate_content("Where is my order ORD-12345?")
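Gemini returns the function-call request as a structured part rather than text. Reading it out looks roughly like this, assuming a single-tool response where the call lands in the first part:

# Extract the model's function-call request from the response.
fc = response.candidates[0].content.parts[0].function_call
# fc.name == "get_order_status"
# dict(fc.args) == {"order_id": "ORD-12345"}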
The Tool Loop Pattern
In practice, an AI agent often needs to call multiple tools in sequence — search for information, then look up a record, then take an action. This requires a tool loop.
import json

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_message}
]

while True:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools
    )
    assistant_message = response.choices[0].message

    if not assistant_message.tool_calls:
        # No more tool calls — return the final response
        print(assistant_message.content)
        break

    messages.append(assistant_message)

    for tool_call in assistant_message.tool_calls:
        # Arguments arrive as a JSON string; parse before dispatching.
        args = json.loads(tool_call.function.arguments)
        result = execute_tool(tool_call.function.name, args)
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": str(result)
        })
This loop continues until the LLM generates a text response instead of a tool call — signaling that it has gathered enough information to answer.
Limiting tool calls
Without limits, the agent could loop indefinitely. Always set a maximum.
MAX_TOOL_CALLS = 5
tool_call_count = 0

while tool_call_count < MAX_TOOL_CALLS:
    response = client.chat.completions.create(...)  # same call as in the loop above
    assistant_message = response.choices[0].message
    if not assistant_message.tool_calls:
        break
    tool_call_count += len(assistant_message.tool_calls)
    # ... execute the tool calls and append results as before
Writing Good Function Descriptions
The quality of your function descriptions directly determines how reliably the LLM selects the right function and passes correct arguments. This is prompt engineering for tools.
Rules for function descriptions
Be specific about what the function does:
Bad: "Gets data"
Good: "Retrieves the current shipping status, tracking number, and estimated
delivery date for a customer order by its order ID"
Describe parameter formats and constraints:
Bad: "date": { "type": "string", "description": "Date" }
Good: "date": { "type": "string", "description": "Date in YYYY-MM-DD format (e.g., 2026-03-04)" }
Specify when to use the function vs when not to:
"description": "Search the product catalog for items matching a query.
Use this for product-related questions. Do NOT use this for order status
or account questions — use get_order_status or get_account_info instead."
Use enum types to restrict parameter values:
"status_filter": {
"type": "string",
"enum": ["pending", "shipped", "delivered", "returned"],
"description": "Filter orders by status"
}
Parallel Function Calling
GPT-4o and Claude can request multiple function calls in a single response. This is useful when the agent needs independent data from multiple sources.
User: "Compare my last order with my current subscription"
LLM responds with TWO tool calls:
1. get_recent_orders(customer_id="C-123", limit=1)
2. get_subscription(customer_id="C-123")
Both calls can execute in parallel, and the results are returned together, which can roughly halve latency compared to running the calls sequentially.
import asyncio

async def handle_parallel_calls(tool_calls):
    tasks = [
        execute_tool_async(tc.function.name, tc.function.arguments)
        for tc in tool_calls
    ]
    results = await asyncio.gather(*tasks)
    return results
Security Considerations
Function calling gives the LLM indirect access to your systems. Take security seriously.
Input validation
Never trust the LLM's arguments blindly. Validate every parameter before execution.
import re

def get_order_status(order_id: str) -> str:
    if not re.match(r'^ORD-\d{5,10}$', order_id):
        return "Invalid order ID format"
    order = db.orders.find_one({"id": order_id})
    if not order:
        return "Order not found"
    return f"Status: {order['status']}, Tracking: {order['tracking']}"
Authorization boundaries
Not every function should be available to every user. Implement per-user tool access.
def get_available_tools(user_role: str) -> list:
    base_tools = [search_knowledge_base, get_order_status]
    if user_role == "admin":
        return base_tools + [process_refund, update_account]
    return base_tools
Rate limiting
Prevent runaway agents from overwhelming your systems.
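A minimal sketch, assuming a per-user sliding window held in process memory (the limit is illustrative; production deployments usually back this with Redis):

import time
from collections import defaultdict, deque

MAX_CALLS_PER_MINUTE = 20  # illustrative limit
recent_calls = defaultdict(deque)  # user_id -> timestamps of recent tool calls

def allow_tool_call(user_id: str) -> bool:
    now = time.time()
    window = recent_calls[user_id]
    # Evict timestamps outside the 60-second window.
    while window and now - window[0] > 60:
        window.popleft()
    if len(window) >= MAX_CALLS_PER_MINUTE:
        return False
    window.append(now)
    return True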
Audit logging
Log every function call with the user context, arguments, result, and timestamp. This is essential for debugging, security, and compliance. See our AI governance guide.
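A minimal sketch using the standard logging module, emitting one structured record per call (the field names are illustrative):

import json
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("tool_audit")

def log_tool_call(user_id: str, tool_name: str, arguments: dict, result) -> None:
    audit_log.info(json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "tool": tool_name,
        "arguments": arguments,
        "result_preview": str(result)[:500],  # truncate large payloads
    }))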
Function Calling vs MCP
Function calling is the low-level mechanism. Model Context Protocol (MCP) is the high-level standard built on top of it.
| Aspect | Function Calling | MCP |
|--------|-----------------|-----|
| Scope | Single model, single application | Universal standard across models |
| Discovery | You define tools in code | Client auto-discovers tools from servers |
| Portability | Model-specific API format | Works across any MCP-compatible model |
| Ecosystem | Custom per project | Growing ecosystem of pre-built servers |
| Best for | Simple applications with few tools | Complex applications with many tools or multi-model support |
Rule of thumb: Use native function calling for simple agents with 1–5 tools. Use MCP when you have many tools, need model portability, or are building a platform.
Structured Outputs and Strict Mode (2026)
OpenAI's Structured Outputs (launched August 2024, stable in gpt-4o-2024-08-06 and later) and Anthropic's tool-use schema validation have dramatically reduced schema-compliance failures. Enable them.
# OpenAI: strict Structured Outputs
tools = [
    {
        "type": "function",
        "function": {
            "name": "create_support_ticket",
            "description": "Create a support ticket after gathering all required fields",
            "parameters": {
                "type": "object",
                "properties": {
                    "subject": {"type": "string"},
                    "priority": {
                        "type": "string",
                        "enum": ["low", "medium", "high", "urgent"]
                    },
                    "customer_id": {"type": "string"},
                    "description": {"type": "string"}
                },
                "required": ["subject", "priority", "customer_id", "description"],
                "additionalProperties": False
            },
            "strict": True  # Guarantees 100% schema compliance
        }
    }
]
With strict: True, OpenAI guarantees the model's tool-call arguments will validate against your JSON schema — no missing required fields, no unexpected properties, no type errors. This moves schema failures from a runtime concern to a schema-design concern.
Tradeoffs to know:
- The first request with a new strict schema adds ~1–2 seconds of schema compilation latency. Subsequent calls are cached.
- Not every schema feature is supported — minLength, maxLength, complex $ref patterns, and recursive schemas are limited. Stick to types, enums, required, and additionalProperties: false.
- Strict mode is only available on gpt-4o-2024-08-06 and later; older snapshots fall back to best-effort.
Anthropic's equivalent (tool input_schema with required fields) hits roughly 98–99% compliance on Claude Sonnet 4 without a dedicated strict flag, per Anthropic's published evaluations.
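Since that compliance is high but not guaranteed, it is worth validating Claude's tool inputs before execution. A minimal sketch using the jsonschema package, reusing the input_schema dict from the tool definition (safe_execute is a hypothetical wrapper name):

from jsonschema import ValidationError, validate

def safe_execute(tool_fn, input_schema: dict, arguments: dict):
    try:
        validate(instance=arguments, schema=input_schema)
    except ValidationError as e:
        # Return the error as the tool result so the model can correct itself.
        return f"Invalid arguments: {e.message}"
    return tool_fn(**arguments)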
Cost Economics of Tool Schemas
Every tool definition you pass adds input tokens to every call. A realistic e-commerce agent with 12 tools averages 2,500 tokens of schema per request. At GPT-4o prices ($2.50/1M input), that's $0.00625 per request in schema alone — before the user prompt, conversation history, or RAG context.
Optimization tactics in order of payoff:
- Tool routing — Send 3–5 tools per request instead of 12 (see the sketch after this list). Typical savings: 60–75% on schema tokens.
- Prompt caching — Anthropic caches tool schemas at ~10% of the input price for cache hits; OpenAI prices cached input at 50% of the standard rate. For high-QPS agents, this is the largest single cost lever.
- Schema compression — Drop wordy descriptions for obvious fields; keep them for ambiguous ones. A 30% description trim rarely hurts accuracy and shaves 10–15% of schema tokens.
- Model routing — GPT-4o-mini handles 80%+ of well-defined tool calls with near-parity accuracy at 1/17th the price of GPT-4o. Reserve the flagship model for reasoning-heavy paths.
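A minimal sketch of tool routing, using a keyword lookup for brevity (the group names and schema variables are hypothetical; production routers more often rank tool descriptions by embedding similarity):

# Hypothetical intent -> tool-schema groups for the e-commerce agent above.
TOOL_GROUPS = {
    "order": [get_order_status_schema, get_recent_orders_schema],
    "product": [search_catalog_schema],
    "subscription": [get_subscription_schema],
}
DEFAULT_TOOLS = [search_knowledge_base_schema]

def route_tools(user_message: str, max_tools: int = 5) -> list:
    """Send a small, relevant tool subset instead of all 12 schemas."""
    text = user_message.lower()
    selected = [
        schema
        for keyword, schemas in TOOL_GROUPS.items()
        if keyword in text
        for schema in schemas
    ]
    return (selected or DEFAULT_TOOLS)[:max_tools]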
Tool-Call Evaluation Harness
Most teams evaluate LLM agents by vibes. That does not survive production. Stand up a proper eval from day one.
Minimum viable eval dataset:
- 50 labeled conversations per critical user intent
- For each: expected tool-call sequence, expected final-response category, expected failure modes (if any)
- Replay on every prompt change and every model release
Metrics to track per eval run:
- Tool-selection accuracy — did the agent pick the right tool? Target: >95% on well-scoped agents.
- Argument accuracy — did it pass the right arguments? Target: >93%.
- Sequence accuracy — did it complete the workflow correctly? Target: >90%.
- Unnecessary tool calls — did it over-call? Target: <5% of sessions with extra calls.
- Refusal correctness — did it correctly refuse out-of-scope requests? Target: >98%.
Promptfoo, Langfuse, and Phoenix (Arize) all support replay-style evaluation suites; stitching one together with plain pytest and a JSON eval file also works for teams just starting out.
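A minimal sketch of the plain-pytest flavor, assuming an evals.json file of labeled cases and a run_agent entry point that records the tool calls it made (both names are hypothetical):

import json

import pytest

with open("evals.json") as f:
    CASES = json.load(f)  # [{"user_message": ..., "expected_tools": [...]}, ...]

@pytest.mark.parametrize("case", CASES)
def test_tool_selection(case):
    transcript = run_agent(case["user_message"])  # hypothetical agent entry point
    called = [call.name for call in transcript.tool_calls]
    assert called == case["expected_tools"], (
        f"expected {case['expected_tools']}, got {called}"
    )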
Getting Started
- Start with one function. Build an agent that can call a single function reliably before adding more.
- Write excellent function descriptions. This is the highest-leverage optimization for function calling reliability.
- Implement the tool loop pattern. This is the standard architecture for AI agents.
- Add validation, rate limiting, and logging from day one.
- Test with adversarial inputs. What happens when the user tries to trick the agent into calling functions it should not?
- Build the eval harness before the agent ships. Retrofitting evals after launch is 5x more work and produces a weaker safety net.
For help building AI agents with production-grade function calling, explore our AI agent development services or contact us. Our team builds agents across customer support, e-commerce, and enterprise automation using both native function calling and MCP.
Frequently Asked Questions
How much does function calling add to the per-request cost compared to plain chat completion?
Function calling itself does not carry a separate fee on OpenAI or Anthropic, but the tool schema you pass in counts as input tokens on every call, which typically adds 200 to 800 tokens per request depending on how many functions you expose. At scale, that schema overhead can raise inference cost by 5 to 15 percent, so pruning unused tools per request is a real optimization. Caching via Anthropic prompt caching or OpenAI cached input can claw most of it back.
Is OpenAI function calling better than Anthropic tool use for a production RAG app?
Both providers now hit roughly comparable accuracy on well-specified tool schemas, and the choice usually comes down to parallel tool call behavior and how each handles partial failures. Anthropic tends to be more conservative about calling tools without clear signals, while OpenAI fires more aggressively, which matters for agent designs where over-calling is the failure mode. Run the same eval set against both before committing.
Can function calling scale to 50 or 100 tools on a single agent?
There is a practical ceiling: accuracy degrades noticeably once you push past roughly 20 tools in a single prompt, and beyond 50 tools the model starts hallucinating function signatures. The fix is tool routing, where a lightweight classifier picks the relevant 3 to 5 tools per request and only those get passed to the main model. MCP servers and retrieval-based tool selection make this pattern straightforward.
What breaks first when function calling hits real production traffic?
Schema validation errors are the first failure mode, usually because the model returns a string where an integer is expected or omits a required field under load. Strict JSON mode or structured outputs cuts this dramatically, but you still need retry logic and a dead-letter queue for malformed calls. The second failure is tool timeout cascades when one slow backend blocks the whole agent chain.