Inference

sail.inference provides thin wrappers over Sail’s hosted inference endpoints. They POST the JSON payload as given and return the raw JSON response as a dict. When a Voyage is active, the wrappers attach correlation headers so the model call shows up on the Voyage timeline, scoped to the active span and agent.

import sail

resp = sail.inference.responses.create(
    model="zai-org/GLM-5.2-FP8",
    input="Say hello in one sentence.",
)

chat = sail.inference.chat.completions.create(
    model="zai-org/GLM-5.2-FP8",
    messages=[{"role": "user", "content": "hello"}],
)

Async code uses the same methods through .aio:

resp = await sail.inference.responses.create.aio(
    model="zai-org/GLM-5.2-FP8",
    input="Say hello in one sentence.",
)

responses.create

def create(
    *,
    voyage: Voyage | None = None,
    headers: Mapping[str, str] | None = None,
    timeout: float | None = None,
    **payload,
) -> dict

POSTs payload to /v1/responses and returns the raw JSON response dict.

Parameter	Default	Description
`**payload`		The request body (e.g. `model`, `input`). Sent as-is.
`voyage`	`None`	Correlate with an explicit Voyage. Defaults to the current Voyage.
`headers`	`None`	Extra request headers to merge.
`timeout`	`None`	Request timeout in seconds; defaults to a bounded 600s.

chat.completions.create

def create(
    *,
    voyage: Voyage | None = None,
    headers: Mapping[str, str] | None = None,
    timeout: float | None = None,
    **payload,
) -> dict

POSTs payload to /v1/chat/completions and returns the raw JSON response dict. Same parameters as responses.create.
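Because the wrappers return the raw response dict, reading the model's reply is ordinary dictionary access. The sketch below assumes the response follows the OpenAI-compatible chat completions shape (`choices[0].message.content`); this page does not specify the response schema, so treat the field names as an assumption. The helper name `first_message_text` and the `sample` dict are illustrative stand-ins, not part of the SDK.

```python
from typing import Any, Mapping, Optional


def first_message_text(chat_response: Mapping[str, Any]) -> Optional[str]:
    """Return the first choice's message content, or None if it is absent.

    Assumes the OpenAI-compatible chat completions shape:
    {"choices": [{"message": {"content": "..."}}]}.
    """
    choices = chat_response.get("choices") or []
    if not choices:
        return None
    message = choices[0].get("message") or {}
    return message.get("content")


# Stand-in for the dict a chat.completions.create call might return.
sample = {"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}
print(first_message_text(sample))
```

Using `.get` throughout keeps the helper from raising `KeyError` on an unexpected or partial response; it returns `None` instead.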

Voyage correlation

If a current Voyage exists (or you pass voyage=), the wrappers add the X-Sail-Voyage-Id header plus the active span/agent context so the dashboard attributes the model call to the right place on the timeline. Pass voyage= to correlate with a specific Voyage, or call inference with no active Voyage for ordinary uncorrelated inference.

with voyage.agent("Reviewer"):
    with voyage.span("draft"):
        # Auto-attributed to this agent/span.
        sail.inference.responses.create(model="zai-org/GLM-5.2-FP8", input="...")

Auto-spans: a wrapper call made with no active span gets a real, timed span created around it automatically (named after the calling function when derivable), so the model call is attributed to a span instead of appearing unscoped. Auto-spans are marked as auto in the dashboard. Explicit spans always win, so synthesis happens only where you declared nothing. Set SAIL_VOYAGE_AUTO_SPANS=0 to disable.

@sail.agent("Reviewer")
@sail.span("draft review")
def explicit_span():
    sail.inference.responses.create(model="zai-org/GLM-5.2-FP8", input="...")


@sail.agent("Reviewer")
def auto_span():
    # No active span here, so Sail creates a timed auto-span for this model call.
    sail.inference.responses.create(model="zai-org/GLM-5.2-FP8", input="...")

Auto-spans do not infer an agent. Wrap the function or block in @sail.agent(...) / with voyage.agent(...) when ownership should appear in the dashboard.

Streaming with the SDK wrapper

The high-level sail.inference.* wrappers return parsed JSON objects and do not expose a streaming iterator. Passing stream=True to those wrappers raises sail.InferenceError before sending the request. Use a raw HTTP client, the OpenAI SDK, or the Anthropic SDK pointed at Sail when you need API-level streaming.
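When streaming through a raw HTTP client, the response body has to be parsed incrementally. The sketch below assumes the streaming endpoint emits server-sent events in the OpenAI-compatible style (`data: <json>` lines, ending with `data: [DONE]`); this page does not document the wire format, so verify it against the API reference. The function name `iter_sse_events` and the `delta` field in the sample lines are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_sse_events(lines: Iterable[bytes]) -> Iterator[Dict[str, Any]]:
    """Yield the decoded JSON payload of each `data:` line in an SSE stream."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and `event:` lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed OpenAI-style end-of-stream sentinel
            return
        yield json.loads(data)


# Stand-in for the lines of a streamed HTTP response body.
sample_lines = [
    b'data: {"delta": "Hel"}\n',
    b"\n",
    b'data: {"delta": "lo"}\n',
    b"\n",
    b"data: [DONE]\n",
]
text = "".join(event["delta"] for event in iter_sse_events(sample_lines))
print(text)
```

A `urllib.request.urlopen` response object iterates over byte lines, so it can be passed to the function directly. The parser handles only single-line `data:` fields, which is a simplification of the full SSE format.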

Raw HTTP / OpenAI clients

Wrap an OpenAI-style client pointed at Sail’s API once, and every call attributes itself. Headers are computed at call time, so there is no construction-time snapshot that can go stale. Un-spanned calls get the same synthesized auto-spans as the sail.inference wrappers.

import os

from openai import OpenAI
import sail

client = sail.voyage.wrap_openai(
    OpenAI(
        base_url="https://api.sailresearch.com/v1",
        api_key=os.environ["SAIL_API_KEY"],
    )
)

with sail.voyage.agent("Reviewer"):
    client.responses.create(model="zai-org/GLM-5.2-FP8", input="...")  # auto-attributed

wrap_openai patches responses.create, responses.retrieve, and chat.completions.create in place (whichever exist on the client) and is idempotent, so wrapping the same client twice is harmless. On each call it resolves the context-local current Voyage, falling back to the process-wide one. Pass voyage= to pin a specific Voyage.

For any other HTTP client, call the endpoint directly and attach the attribution headers yourself with sail.voyage.headers(). The helper carries the full context (Voyage id plus the span and agent active at call time), so compute it per request, never once at client construction:

import json
import os
import urllib.request

import sail

sail.voyage.create(name="raw-client")

api_url = "https://api.sailresearch.com"

# Compute attribution headers at request time so they carry the current span/agent.
headers = sail.voyage.headers({"Content-Type": "application/json"})
headers["Authorization"] = "Bearer " + os.environ["SAIL_API_KEY"]

req = urllib.request.Request(
    api_url.rstrip("/") + "/v1/responses",
    data=json.dumps({"model": "zai-org/GLM-5.2-FP8", "input": "hello"}).encode(),
    headers=headers,
    method="POST",
)

# Send the request and decode the JSON body.
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
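One way to make the per-request rule hard to violate is to pass the header helper into a request builder instead of passing a precomputed dict. The following is a minimal sketch: `build_request`, `fake_headers`, and the `X-Example-Span` header are hypothetical names used for illustration, and `fake_headers` stands in for `sail.voyage.headers`, which is assumed (from the example above) to accept a base header dict and return the merged headers.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, Mapping


def build_request(
    url: str,
    payload: Mapping[str, Any],
    api_key: str,
    header_factory: Callable[[Dict[str, str]], Mapping[str, str]],
) -> urllib.request.Request:
    """Build a POST request, asking header_factory for fresh headers each time."""
    headers = dict(header_factory({"Content-Type": "application/json"}))
    headers["Authorization"] = "Bearer " + api_key
    return urllib.request.Request(
        url, data=json.dumps(payload).encode(), headers=headers, method="POST"
    )


# Stand-in for the context that changes between calls.
_current_span = {"name": "draft"}


def fake_headers(base: Dict[str, str]) -> Dict[str, str]:
    # Stand-in for sail.voyage.headers: merges current context into the base headers.
    return {**base, "X-Example-Span": _current_span["name"]}


url = "https://example.invalid/v1/responses"
first = build_request(url, {"input": "hi"}, "key", fake_headers)
_current_span["name"] = "review"  # the context changes between requests
second = build_request(url, {"input": "hi"}, "key", fake_headers)

# urllib stores header names via str.capitalize(), hence "X-example-span".
print(first.get_header("X-example-span"), second.get_header("X-example-span"))
```

In real use, `sail.voyage.headers` would be passed as `header_factory`. Because the factory runs inside `build_request`, each request picks up whatever context is active at that moment, while a dict computed once at client construction would keep reporting the first span.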

Voyage limitations

Voyages is telemetry only. It records and attributes your runs; it is not an agent framework and adds no orchestration or tool abstractions.

Errors

Inference wrappers raise sail.InferenceError (e.g. for unsupported wrapper options such as stream=True, or a missing API key) and sail.InferenceHTTPError for non-2xx responses. See Errors.
