Sending Requests at Scale

When you need to submit thousands (or tens of thousands) of requests to Sail, serial requests become a major bottleneck. Sail offers two ways to handle high-volume workloads: the Batch API and the Responses API.

Batch API

The Batch API lets you submit up to 100,000 requests in a single call (max 256 MB per batch).

The workflow has three steps:

1. Submit a batch: send all your requests in one POST /batches call.
2. Poll for status: check GET /batches/{batch_id} until every request reaches a terminal state (COMPLETED, FAILED, or CANCELLED).
3. Retrieve results: fetch individual results via GET /batches/{batch_id}/{custom_id}.
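
Because a single batch is capped at 100,000 requests and 256 MB, a larger job has to be split across several POST /batches calls. The following is a minimal sketch of such a splitter, not part of any Sail SDK: chunk_requests is a hypothetical helper, and it assumes that "256 MB" means 256 * 1024 * 1024 bytes and that the limit applies to the JSON-encoded request items. Neither detail is stated on this page, so leave some headroom.

```python
import json

MAX_REQUESTS_PER_BATCH = 100_000         # limit stated for the Batch API
MAX_BYTES_PER_BATCH = 256 * 1024 * 1024  # assumption: "256 MB" read as MiB


def chunk_requests(items, max_requests=MAX_REQUESTS_PER_BATCH, max_bytes=MAX_BYTES_PER_BATCH):
    """Yield lists of batch items that stay under both limits.

    Size is estimated from each item's JSON encoding. The envelope around
    the "requests" array (endpoint, label, brackets) is not counted.
    """
    chunk, chunk_bytes = [], 0
    for item in items:
        # +2 for the ", " separator json.dumps emits between list items.
        item_bytes = len(json.dumps(item).encode("utf-8")) + 2
        if item_bytes > max_bytes:
            raise ValueError(f"item {item.get('custom_id')!r} alone exceeds the byte limit")
        # Close the current chunk if adding this item would break either limit.
        if chunk and (len(chunk) >= max_requests or chunk_bytes + item_bytes > max_bytes):
            yield chunk
            chunk, chunk_bytes = [], 0
        chunk.append(item)
        chunk_bytes += item_bytes
    if chunk:
        yield chunk
```

Each yielded list can then be sent as the "requests" field of its own POST /batches call, with custom_id values kept unique across chunks so results can be matched back to inputs.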

To view all your previously submitted batches, use GET /batches. Attach an Idempotency-Key header on submission so a client retry after a network blip replays the original batch reservation instead of creating a duplicate. See Idempotent Requests.
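
To make the retry behavior concrete, here is a minimal sketch of a submission helper that generates one Idempotency-Key and reuses it on every attempt. submit_with_idempotency_key is a hypothetical name, and the retry policy (three attempts, exponential backoff, retry on network errors and 5xx responses) is an assumption rather than documented Sail behavior. The post argument is any callable with the same shape as requests.post.

```python
import time
import uuid


def submit_with_idempotency_key(post, url, payload, headers, max_attempts=3, backoff_s=1.0):
    """POST `payload`, reusing a single Idempotency-Key across all retries."""
    # Create the key once, outside the loop. A fresh key per attempt would
    # defeat the purpose and could create duplicate batches.
    request_headers = {**headers, "Idempotency-Key": str(uuid.uuid4())}
    last_error = None
    for attempt in range(max_attempts):
        if attempt:
            time.sleep(backoff_s * 2 ** (attempt - 1))  # 1x, 2x, 4x, ...
        try:
            resp = post(url, headers=request_headers, json=payload)
        except OSError as exc:
            # requests' ConnectionError and Timeout are OSError subclasses.
            last_error = exc
            continue
        if resp.status_code >= 500:
            last_error = RuntimeError(f"server error {resp.status_code}")
            continue
        return resp  # 2xx or 4xx: retrying will not change the outcome
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

With the requests library this would be called as submit_with_idempotency_key(requests.post, f"{BASE_URL}/batches", body, HEADERS). The caller still needs to check the returned response for 4xx errors.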

Batch items default to metadata.completion_window: "standard" when the field is omitted. If you set it explicitly, it must be either "standard" or "flex" — the low-latency "asap" and "priority" tiers are rejected for batch items. For latency-sensitive workloads, use the Responses API instead. See Completion Windows for the full tier definitions and per-model availability.
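
The rule above can be checked on the client before a batch is submitted, which avoids a rejected upload. The sketch below is a hypothetical pre-check, not a Sail API: resolve_batch_window is an invented name, and the server remains the authority on what is accepted. It treats a missing metadata object or a missing completion_window field as "standard" and rejects everything else that is not "standard" or "flex".

```python
ALLOWED_BATCH_WINDOWS = {"standard", "flex"}  # tiers accepted for batch items


def resolve_batch_window(params):
    """Return the completion window a batch item would run under.

    Omitted field -> "standard". Any explicit value outside
    ALLOWED_BATCH_WINDOWS (for example "asap" or "priority") raises ValueError.
    """
    metadata = params.get("metadata") or {}
    window = metadata.get("completion_window", "standard")
    if window not in ALLOWED_BATCH_WINDOWS:
        raise ValueError(
            f"completion_window {window!r} is not accepted for batch items; "
            f"use one of {sorted(ALLOWED_BATCH_WINDOWS)} or the Responses API"
        )
    return window
```

Running this over every item's params before submission surfaces a bad tier locally instead of after the request body has been sent.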

Python example

First, install the requests library:

pip install requests

import time
import requests

BASE_URL = "https://api.sailresearch.com/v1"
API_KEY = "YOUR_KEY_HERE"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# 1. Submit a batch
batch_requests = [
    {
        "custom_id": f"request-{i}",
        "params": {
            "model": "zai-org/GLM-5.1-FP8",
            "max_output_tokens": 2048,
            "input": [{"role": "user", "content": f"What is a fun fact about the number {i}?"}],
            "metadata": {"completion_window": "standard"},
        },
    }
    for i in range(100)
]

resp = requests.post(
    f"{BASE_URL}/batches",
    headers=HEADERS,
    json={
        "endpoint": "/v1/responses",
        "label": "my-batch-job",
        "requests": batch_requests,
    },
)
resp.raise_for_status()  # surface a rejected batch here instead of a KeyError below
batch = resp.json()
batch_id = batch["id"]
print(f"Created batch {batch_id} with {batch['request_counts']} requests")

# 2. Poll until all requests are completed
while True:
    status_resp = requests.get(f"{BASE_URL}/batches/{batch_id}", headers=HEADERS)
    status_resp.raise_for_status()  # stop on an HTTP error rather than parsing an error body
    status = status_resp.json()
    request_status = status["request_status"]

    done = sum(1 for r in request_status.values() if r["status"] in ("COMPLETED", "FAILED", "CANCELLED"))
    print(f"Progress: {done}/{len(request_status)}")

    if done == len(request_status):
        break
    time.sleep(5)

# 3. Retrieve results
for custom_id, info in request_status.items():
    if info["status"] == "COMPLETED":
        result = requests.get(
            f"{BASE_URL}/batches/{batch_id}/{custom_id}", headers=HEADERS
        ).json()
        output_text = "".join(
            c["text"]
            for item in result.get("output", [])
            for c in item.get("content", [])
            if c.get("type") == "output_text"
        )
        print(f"{custom_id}: {output_text[:100]}...")

Responses API with background mode

You can also submit requests individually using the Responses API with background=True. For large-volume workloads (1,000+ requests), we recommend:

Use AsyncOpenAI with DefaultAioHttpClient() — The OpenAI SDK’s built-in aiohttp client is more efficient than the default httpx backend for high-concurrency workloads.
Gate concurrency with an asyncio.Semaphore — This gives you fine-grained control over how many simultaneous connections are opened (e.g. 200), preventing connection exhaustion.
Submit all requests concurrently with background=True, then poll — Fire off all submissions in parallel (gated by the semaphore), collect the response IDs, and poll for completions in a separate loop.
Send a per-request Idempotency-Key — So a retry after a transient failure replays the reservation instead of duplicating inference work. See Idempotent Requests.
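
The last recommendation can be sketched as a small wrapper that picks one key per logical request and reuses it on every retry. create_with_retry is a hypothetical helper, and the retry policy is an assumption. The wrapper passes the key through the extra_headers keyword, which the OpenAI SDK accepts on client.responses.create; the create argument can be any async callable with that shape.

```python
import asyncio
import uuid


async def create_with_retry(
    create,
    *,
    max_attempts=3,
    backoff_s=0.5,
    retry_on=(ConnectionError, TimeoutError),
    **params,
):
    """Await `create(**params)`, reusing one Idempotency-Key across retries."""
    key = str(uuid.uuid4())  # one key per logical request, not per attempt
    for attempt in range(max_attempts):
        try:
            return await create(extra_headers={"Idempotency-Key": key}, **params)
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            await asyncio.sleep(backoff_s * 2 ** attempt)
```

With the OpenAI SDK, a call might look like create_with_retry(client.responses.create, retry_on=(openai.APIConnectionError, openai.APITimeoutError), model=model, input=[...], background=True). The SDK also has its own max_retries setting, so decide which layer should own retries.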

Python example

First, install tqdm and the OpenAI SDK with the aiohttp extra:

pip install tqdm
pip install 'openai[aiohttp]'

import argparse
import asyncio

from tqdm import tqdm
from openai import AsyncOpenAI, DefaultAioHttpClient

async def main(
    num_requests: int,
    input_tokens: int,
    max_output_tokens: int,
    model: str,
):
    base_url = "https://api.sailresearch.com/v1"
    api_key = "YOUR_KEY_HERE"

    async with AsyncOpenAI(
        base_url=base_url,
        api_key=api_key,
        http_client=DefaultAioHttpClient(),
    ) as client:
        # List supported models
        models = await client.models.list()
        supported_models = [m.id for m in models.data]
        print("Supported Models:")
        print(supported_models)

        # Submit requests concurrently
        sem = asyncio.Semaphore(200)
        pbar_submit = tqdm(total=num_requests, desc="Submitting requests")

        async def submit(i):
            filler = ""
            if input_tokens > 0:
                filler = " " + ("word " * input_tokens)
            content = f"TASK {i}: What is a fun fact about the number {i}? Then, find the word at index {i} in the following sequence of words: {filler}"

            async with sem:
                response = await client.responses.create(
                    model=model,
                    input=[{"role": "user", "content": content}],
                    max_output_tokens=max_output_tokens,
                    background=True,
                )
            pbar_submit.update(1)
            return response.id

        response_ids = await asyncio.gather(*[submit(i) for i in range(num_requests)])
        pbar_submit.close()
        response_ids = list(response_ids)
        print(f"Created {len(response_ids)} response IDs")

        # Poll for completions
        completed = {}
        poll_sem = asyncio.Semaphore(200)
        pbar = tqdm(total=len(response_ids), desc="Polling responses")

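        async def poll(response_id):
            for _ in range(3600):  # up to 3600 polls, 1 s apart (about 1 hour)
                async with poll_sem:
                    response = await client.responses.retrieve(response_id)
                # Stop on any terminal status, not only success; otherwise a failed
                # request is polled until the timeout. The non-success names assume
                # Sail mirrors the OpenAI Responses API statuses.
                if response.status in ("completed", "failed", "cancelled", "incomplete"):
                    completed[response_id] = response
                    pbar.update(1)
                    return
                await asyncio.sleep(1)
            print(f"Timed out waiting for {response_id}")

        await asyncio.gather(*[poll(rid) for rid in response_ids])
        pbar.close()

        for response_id in response_ids:
            resp = completed.get(response_id)
            if resp is None:
                continue  # timed out above, so there is nothing to print
            if resp.status != "completed":
                print(f"\n\n{response_id} ended with status {resp.status}")
                continue
            output_text = "".join(
                c.text
                for item in resp.output
                for c in getattr(item, "content", []) or []
                if getattr(c, "type", None) == "output_text"
            )
            print(f"\n\n{response_id} completed with output:\n{output_text}")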


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--num-requests", type=int, default=20000)
    parser.add_argument("--input-tokens", type=int, default=50)
    parser.add_argument("--max-output-tokens", type=int, default=4000)
    parser.add_argument("--model", type=str, default="zai-org/GLM-5.1-FP8")
    args = parser.parse_args()

    asyncio.run(
        main(
            num_requests=args.num_requests,
            input_tokens=args.input_tokens,
            max_output_tokens=args.max_output_tokens,
            model=args.model,
        )
    )

The semaphore limit of 200 is a good starting point. Lower it if you run into connection errors or timeouts; raise it if you have headroom and want faster submission throughput.
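
To see what the limit actually bounds, the following self-contained sketch replaces the network call with asyncio.sleep and records the peak number of tasks inside the semaphore at once. run_gated is a hypothetical helper written for illustration; it uses only the standard library and makes no requests to Sail.

```python
import asyncio


async def run_gated(num_tasks, limit, work_s=0.01):
    """Run num_tasks fake requests behind a semaphore; return the peak in-flight count."""
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def one_request():
        nonlocal in_flight, peak
        async with sem:  # at most `limit` tasks get past this line at once
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(work_s)  # stand-in for the HTTP round trip
            in_flight -= 1

    # All tasks are created up front, as in the example above; the semaphore,
    # not the number of tasks, decides how many run concurrently.
    await asyncio.gather(*(one_request() for _ in range(num_tasks)))
    return peak
```

Calling asyncio.run(run_gated(num_tasks=50, limit=8)) returns a peak no higher than 8, however many tasks are queued. The same reasoning applies to the 200 used above: it caps open requests, while total work is set by num_requests.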
