Sail supports tool calling through the Responses API, letting you build agents that call external tools and reason over the results across multiple turns.

How it works

  1. Send a user message along with tool definitions to /v1/responses.
  2. The model may return one or more function_call items instead of (or alongside) text.
  3. Execute the tools locally, then send the results back as function_call_output items in a new request — together with the full conversation history.
  4. Repeat until the model responds with text only.
Each request includes the entire conversation so far. Append response.output items directly to your conversation list — they are valid input items with no conversion needed — then append the function_call_output results.
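The round trip in steps 2–3 can be sketched with plain dicts (the call id, arguments, and tool output below are placeholders, not real API responses):

```python
import json

# Hypothetical function_call item, shaped like what the model returns (step 2).
function_call = {
    "type": "function_call",
    "call_id": "call_abc123",  # placeholder id
    "name": "get_weather",
    "arguments": '{"location": "San Francisco, CA"}',
}

conversation = [{"role": "user", "content": "What's the weather in San Francisco?"}]

# Step 3a: append the model's output item unchanged — it is a valid input item.
conversation.append(function_call)

# Step 3b: run the tool locally and append its result.
args = json.loads(function_call["arguments"])
result = '{"temperature": "62°F", "condition": "Foggy"}'  # stubbed tool output
conversation.append(
    {"type": "function_call_output", "call_id": function_call["call_id"], "output": result}
)

# The next /v1/responses request sends this list as `input`.
```

Note that the `call_id` on the output item must match the `call_id` of the function call it answers; this is how the model pairs results with requests.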

Full example: multi-turn weather agent

This example uses moonshotai/Kimi-K2.5 to build a two-turn conversation where the model calls a weather tool and then answers a follow-up question using context from the first turn.
import json
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://api.sailresearch.com/v1",
    api_key="YOUR_SAIL_API_KEY",
)

MODEL = "moonshotai/Kimi-K2.5"

TOOLS = [
    {
        "type": "function",
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City and state, e.g. San Francisco, CA",
                }
            },
            "required": ["location"],
            "additionalProperties": False,
        },
        "strict": True,
    },
]

# Stub implementation; replace with a real weather lookup in production.
TOOL_DISPATCH = {
    "get_weather": lambda location: '{"temperature": "62°F", "condition": "Foggy"}',
}


def poll(response, timeout=300):
    start = time.time()
    # Loop only while the response is still in flight; this also treats
    # "incomplete" (e.g. max_output_tokens reached) as terminal.
    while response.status in ("queued", "in_progress"):
        if time.time() - start > timeout:
            raise TimeoutError(f"{response.id} did not complete within {timeout}s")
        time.sleep(2)
        response = client.responses.retrieve(response.id)
    if response.status != "completed":
        raise RuntimeError(f"{response.id} status: {response.status}")
    return response


def agent_turn(conversation, user_message):
    """Send a user message and loop until the model stops calling tools."""
    conversation.append({"role": "user", "content": user_message})

    while True:
        response = client.responses.create(
            model=MODEL,
            input=conversation,
            tools=TOOLS,
            max_output_tokens=4096,
            background=True,
        )
        response = poll(response)

        tool_calls = [
            item for item in (response.output or [])
            if getattr(item, "type", None) == "function_call"
        ]
        conversation.extend(response.output)

        if not tool_calls:
            return response

        for call in tool_calls:
            args = json.loads(call.arguments)
            output = TOOL_DISPATCH[call.name](**args)
            conversation.append(
                {"type": "function_call_output", "call_id": call.call_id, "output": output}
            )


conversation = []

# Turn 1: triggers a get_weather tool call, then the model summarizes the result
response = agent_turn(conversation, "What's the weather in San Francisco?")
print("Turn 1:", response.output_text)

# Turn 2: follow-up reuses conversation context
response = agent_turn(conversation, "How about New York — warmer or colder?")
print("Turn 2:", response.output_text)

What happens under the hood

  1. Turn 1 — the model receives the user question plus the tool definition. It calls get_weather for San Francisco. After we send the tool result back, a second request is made and the model produces a text summary.
  2. Turn 2 — the full conversation (including Turn 1’s tool call and result) is sent again. The model calls get_weather for New York, gets the result, and compares it with the San Francisco data it already has in context.

Tips

  • background=True is recommended. Sail is throughput-optimized, so requests may take longer than a typical low-latency API. Background mode avoids HTTP timeouts and lets you poll for completion.
  • Send the full conversation in each request. Include all prior messages, response.output items, and tool results. Output items from previous responses can be appended directly — no serialization or conversion is needed.
  • strict: true on tool parameters enables structured output guarantees — the model’s arguments JSON will always conform to your schema.
  • Parallel tool calls are supported by default. The model may return multiple function_call items in a single response.
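When the model does return several function_call items at once, the tool executions are independent and can run concurrently before their outputs are appended. A minimal sketch using a thread pool (the call items and dispatch table here are placeholders):

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Stubbed dispatch table; real tools would do actual work.
TOOL_DISPATCH = {
    "get_weather": lambda location: f'{{"location": "{location}", "condition": "Sunny"}}',
}

# Hypothetical parallel function_call items from a single response.
tool_calls = [
    {"type": "function_call", "call_id": "call_1", "name": "get_weather",
     "arguments": '{"location": "San Francisco, CA"}'},
    {"type": "function_call", "call_id": "call_2", "name": "get_weather",
     "arguments": '{"location": "New York, NY"}'},
]

def run(call):
    """Execute one tool call and wrap the result as a function_call_output item."""
    args = json.loads(call["arguments"])
    output = TOOL_DISPATCH[call["name"]](**args)
    return {"type": "function_call_output", "call_id": call["call_id"], "output": output}

with ThreadPoolExecutor() as pool:
    # pool.map preserves input order, so outputs line up with tool_calls.
    outputs = list(pool.map(run, tool_calls))

# Append `outputs` to the conversation before the next request.
```

Since the output items carry their own `call_id`, order does not strictly matter to the model, but keeping them ordered makes the conversation easier to debug.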