When you need to submit thousands (or tens of thousands) of requests to Sail, serial requests become a major bottleneck. Sail offers two ways to handle high-volume workloads: the Batch API and the Responses API.
Batch API
The Batch API lets you submit up to 100,000 requests in a single call (max 256 MB per batch).
- Submit a batch: send all your requests in one POST /batches call.
- Poll for status: check GET /batches/{batch_id} until all requests are completed.
- Retrieve results: fetch individual results via GET /batches/{batch_id}/{custom_id}.
To view all your previously submitted batches, use GET /batches.
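If a workload exceeds the per-batch limits stated above, it has to be split across several POST /batches calls. The following is a minimal client-side sketch of one way to do that split. The helper name chunk_requests is hypothetical, and the sketch assumes that the 256 MB limit is measured on the JSON-encoded items and that "MB" means 10^6 bytes; neither is confirmed by this guide, so leave headroom or adjust max_bytes as needed.

```python
import json
from typing import Iterator

MAX_REQUESTS_PER_BATCH = 100_000      # limit stated in this guide
MAX_BYTES_PER_BATCH = 256 * 1000**2   # assumption: "256 MB" read as 256 * 10^6 bytes


def chunk_requests(
    items: list[dict],
    max_requests: int = MAX_REQUESTS_PER_BATCH,
    max_bytes: int = MAX_BYTES_PER_BATCH,
) -> Iterator[list[dict]]:
    """Yield consecutive slices of `items` that stay within both limits."""
    chunk: list[dict] = []
    chunk_bytes = 0
    for item in items:
        # Approximate each item's contribution by its JSON-encoded size.
        item_bytes = len(json.dumps(item).encode("utf-8"))
        if item_bytes > max_bytes:
            raise ValueError(f"{item.get('custom_id')} alone exceeds max_bytes")
        if chunk and (len(chunk) >= max_requests or chunk_bytes + item_bytes > max_bytes):
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(item)
        chunk_bytes += item_bytes
    if chunk:
        yield chunk


# Small limits used here only to make the behavior visible.
items = [{"custom_id": f"request-{i}", "params": {"input": "hi"}} for i in range(250)]
chunks = list(chunk_requests(items, max_requests=100))
print([len(c) for c in chunks])  # [100, 100, 50]
```

Each yielded chunk can then be sent as the "requests" field of its own batch submission.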
Attach an Idempotency-Key header on submission so a client retry after a network blip replays the original batch reservation instead of creating a duplicate. See Idempotent Requests.
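The key only helps if every retry of the same submission carries the same value. The sketch below illustrates that pattern with a fake transport so it runs without network access. The names submit_with_retry and flaky_send are hypothetical; in real use, send would wrap a call such as requests.post(f"{BASE_URL}/batches", headers={**HEADERS, **headers}, json=payload). Using a random UUID as the key is an assumption; see Idempotent Requests for the accepted key format.

```python
import uuid
from typing import Callable


def submit_with_retry(
    send: Callable[[dict, dict], dict],
    payload: dict,
    max_attempts: int = 3,
) -> dict:
    """Call `send` until it succeeds, reusing one Idempotency-Key throughout."""
    # Generate the key once, outside the retry loop, so every attempt carries it.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(payload, headers)
        except ConnectionError as exc:  # stand-in for a transient network failure
            last_error = exc
    raise RuntimeError("submission failed after retries") from last_error


# Fake transport for illustration: drops the first attempt, then succeeds.
seen_keys: list[str] = []


def flaky_send(payload: dict, headers: dict) -> dict:
    seen_keys.append(headers["Idempotency-Key"])
    if len(seen_keys) == 1:
        raise ConnectionError("simulated network blip")
    return {"id": "batch_example"}


result = submit_with_retry(flaky_send, {"requests": []})
print(result["id"], len(seen_keys), len(set(seen_keys)))  # batch_example 2 1
```

Two attempts were made, but both carried the same key, which is what lets the server recognize the second one as a replay.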
Batch items default to metadata.completion_window: "standard" when the field is omitted. If you set it explicitly, it must be either "standard" or "flex"; the low-latency "asap" and "priority" tiers are rejected for batch items. For latency-sensitive workloads, use the Responses API instead. See Completion Windows for the full tier definitions and per-model availability.
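A client-side pre-check can surface a disallowed tier before a large batch is uploaded. The following is a sketch only: normalize_completion_window is a hypothetical helper, and the server remains the source of truth for which tiers a given model accepts.

```python
ALLOWED_BATCH_WINDOWS = {"standard", "flex"}  # tiers this guide says batch items accept


def normalize_completion_window(item: dict) -> dict:
    """Return a copy of a batch item with an explicit, validated completion_window."""
    params = dict(item.get("params", {}))
    metadata = dict(params.get("metadata") or {})
    # Mirror the documented default when the field is omitted.
    window = metadata.setdefault("completion_window", "standard")
    if window not in ALLOWED_BATCH_WINDOWS:
        raise ValueError(
            f"{item.get('custom_id')}: completion_window {window!r} is not allowed "
            f"for batch items; use one of {sorted(ALLOWED_BATCH_WINDOWS)}"
        )
    params["metadata"] = metadata
    return {**item, "params": params}


item = {"custom_id": "request-0", "params": {"model": "zai-org/GLM-5.1-FP8", "input": "hi"}}
print(normalize_completion_window(item)["params"]["metadata"])
# {'completion_window': 'standard'}
```

The helper copies the item rather than mutating it, so the original request list can be reused unchanged for a Responses API submission with a different tier.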
Python example
First, install the requests library:
pip install requests
import time
import requests

BASE_URL = "https://api.sailresearch.com/v1"
API_KEY = "YOUR_KEY_HERE"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# 1. Submit a batch
batch_requests = [
    {
        "custom_id": f"request-{i}",
        "params": {
            "model": "zai-org/GLM-5.1-FP8",
            "max_output_tokens": 2048,
            "input": [{"role": "user", "content": f"What is a fun fact about the number {i}?"}],
            "metadata": {"completion_window": "standard"},
        },
    }
    for i in range(100)
]
resp = requests.post(
    f"{BASE_URL}/batches",
    headers=HEADERS,
    json={
        "endpoint": "/v1/responses",
        "label": "my-batch-job",
        "requests": batch_requests,
    },
)
batch = resp.json()
batch_id = batch["id"]
print(f"Created batch {batch_id} with {batch['request_counts']} requests")

# 2. Poll until all requests are completed
while True:
    status_resp = requests.get(f"{BASE_URL}/batches/{batch_id}", headers=HEADERS)
    status = status_resp.json()
    request_status = status["request_status"]
    done = sum(1 for r in request_status.values() if r["status"] in ("COMPLETED", "FAILED", "CANCELLED"))
    print(f"Progress: {done}/{len(request_status)}")
    if done == len(request_status):
        break
    time.sleep(5)

# 3. Retrieve results
for custom_id, info in request_status.items():
    if info["status"] == "COMPLETED":
        result = requests.get(
            f"{BASE_URL}/batches/{batch_id}/{custom_id}", headers=HEADERS
        ).json()
        output_text = "".join(
            c["text"]
            for item in result.get("output", [])
            for c in item.get("content", [])
            if c.get("type") == "output_text"
        )
        print(f"{custom_id}: {output_text[:100]}...")
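The example above retrieves only COMPLETED results and silently skips the rest. As a hedged sketch, the helper below tallies requests per status and lists the custom_ids that ended as FAILED or CANCELLED, so they can be logged or resubmitted. The name summarize is hypothetical, and the sample dictionary is hand-written in the shape the polling loop reads; it is not real API output.

```python
from collections import Counter

TERMINAL = {"COMPLETED", "FAILED", "CANCELLED"}  # statuses the example above treats as final


def summarize(request_status: dict) -> tuple[Counter, list[str]]:
    """Count requests per status and list the custom_ids that finished unsuccessfully."""
    counts = Counter(info["status"] for info in request_status.values())
    unsuccessful = [
        cid
        for cid, info in request_status.items()
        if info["status"] in TERMINAL and info["status"] != "COMPLETED"
    ]
    return counts, unsuccessful


# Hand-written sample; a real value would come from GET /batches/{batch_id}.
sample = {
    "request-0": {"status": "COMPLETED"},
    "request-1": {"status": "FAILED"},
    "request-2": {"status": "CANCELLED"},
    "request-3": {"status": "COMPLETED"},
}
counts, unsuccessful = summarize(sample)
print(dict(counts))   # {'COMPLETED': 2, 'FAILED': 1, 'CANCELLED': 1}
print(unsuccessful)   # ['request-1', 'request-2']
```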
Responses API with background mode
You can also submit requests individually using the Responses API with background=True. For large-volume workloads (1,000+ requests), we recommend:
- Use AsyncOpenAI with DefaultAioHttpClient(): the OpenAI SDK’s built-in aiohttp client is more efficient than the default httpx backend for high-concurrency workloads.
- Gate concurrency with an asyncio.Semaphore: this gives you fine-grained control over how many simultaneous connections are opened (e.g. 200), preventing connection exhaustion.
- Submit all requests concurrently with background=True, then poll: fire off all submissions in parallel (gated by the semaphore), collect the response IDs, and poll for completions in a separate loop.
- Send a per-request Idempotency-Key: a retry after a transient failure then replays the reservation instead of duplicating inference work. See Idempotent Requests.
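To make the semaphore and per-request key recommendations concrete, here is a small sketch that runs without network access: the real API call is replaced by asyncio.sleep, and the code records the peak number of submissions in flight. The function run_gated and the "my-job/request-{i}" key scheme are hypothetical. Passing the key through the OpenAI SDK's extra_headers argument is shown only in a comment and assumes Sail reads the header as described in Idempotent Requests.

```python
import asyncio
import uuid


async def run_gated(num_requests: int, limit: int) -> tuple[list[str], int]:
    """Run `num_requests` fake submissions with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def submit(i: int) -> str:
        nonlocal in_flight, peak
        # Deterministic key per logical request: a retry of request i reuses it.
        key = str(uuid.uuid5(uuid.NAMESPACE_URL, f"my-job/request-{i}"))
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            # Placeholder for the real call, e.g.
            # await client.responses.create(..., background=True,
            #     extra_headers={"Idempotency-Key": key})
            await asyncio.sleep(0.01)
            in_flight -= 1
        return key

    keys = await asyncio.gather(*(submit(i) for i in range(num_requests)))
    return list(keys), peak


keys, peak = asyncio.run(run_gated(num_requests=50, limit=8))
print(len(set(keys)), peak)  # 50 8
```

All 50 coroutines are created up front, but the semaphore lets only 8 proceed past the gate at a time, which is the same mechanism the full example uses with a limit of 200.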
Python example
First, install tqdm and the OpenAI SDK with the aiohttp extra:
pip install tqdm
pip install 'openai[aiohttp]'
import argparse
import asyncio

from openai import AsyncOpenAI, DefaultAioHttpClient
from tqdm import tqdm


async def main(
    num_requests: int,
    input_tokens: int,
    max_output_tokens: int,
    model: str,
):
    base_url = "https://api.sailresearch.com/v1"
    api_key = "YOUR_KEY_HERE"
    async with AsyncOpenAI(
        base_url=base_url,
        api_key=api_key,
        http_client=DefaultAioHttpClient(),
    ) as client:
        # List supported models
        models = await client.models.list()
        supported_models = [m.id for m in models.data]
        print("Supported Models:")
        print(supported_models)

        # Submit requests concurrently
        sem = asyncio.Semaphore(200)
        pbar_submit = tqdm(total=num_requests, desc="Submitting requests")

        async def submit(i):
            filler = ""
            if input_tokens > 0:
                filler = " " + ("word " * input_tokens)
            content = f"TASK {i}: What is a fun fact about the number {i}? Then, find the word at index {i} in the following sequence of words: {filler}"
            async with sem:
                response = await client.responses.create(
                    model=model,
                    input=[{"role": "user", "content": content}],
                    max_output_tokens=max_output_tokens,
                    background=True,
                )
            pbar_submit.update(1)
            return response.id

        response_ids = await asyncio.gather(*[submit(i) for i in range(num_requests)])
        pbar_submit.close()
        response_ids = list(response_ids)
        print(f"Created {len(response_ids)} response IDs")

        # Poll for completions
        completed = {}
        poll_sem = asyncio.Semaphore(200)
        pbar = tqdm(total=len(response_ids), desc="Polling responses")

        async def poll(response_id):
            for _ in range(3600):  # 1 hour timeout
                async with poll_sem:
                    response = await client.responses.retrieve(response_id)
                if response.status == "completed":
                    completed[response_id] = response
                    pbar.update(1)
                    return
                await asyncio.sleep(1)
            print(f"Timed out waiting for {response_id}")

        await asyncio.gather(*[poll(rid) for rid in response_ids])
        pbar.close()

        for response_id in response_ids:
            resp = completed.get(response_id)
            if resp is None:
                # Never reached "completed" before the timeout; skip it
                # rather than raising KeyError.
                continue
            output_text = "".join(
                c.text
                for item in resp.output
                for c in getattr(item, "content", []) or []
                if getattr(c, "type", None) == "output_text"
            )
            print(f"\n\n{response_id} completed with output:\n{output_text}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--num-requests", type=int, default=20000)
    parser.add_argument("--input-tokens", type=int, default=50)
    parser.add_argument("--max-output-tokens", type=int, default=4000)
    parser.add_argument("--model", type=str, default="zai-org/GLM-5.1-FP8")
    args = parser.parse_args()
    asyncio.run(
        main(
            num_requests=args.num_requests,
            input_tokens=args.input_tokens,
            max_output_tokens=args.max_output_tokens,
            model=args.model,
        )
    )
The semaphore limit of 200 is a good starting point. Lower it if you run into connection errors or timeouts; raise it if you have headroom and want faster submission throughput.