What to get right when you move ZeroGPU from a first call into production.
Keep API keys in environment variables, never in source or client-side code. Handle the common status codes explicitly: 401 (bad API key), 403 (bad project ID), and 429 (rate limit). Set sensible timeouts and retries with backoff on 429 and 5xx. For large, non-real-time jobs, use the Batch and Files API instead of the synchronous endpoint - streaming is not supported in batch mode. Already on the OpenAI SDK? The drop-in client shown in the Introduction is the recommended way to call ZeroGPU from application code.
Read your x-api-key and x-project-id from the environment (or a secrets manager) and inject them at deploy time. Never commit them, never ship them in a browser bundle or mobile app - a key embedded in client-side code is a public key.
Authenticate ZeroGPU calls from your backend. If you need to call from a browser or mobile client, proxy the request through a server you control so the key stays server-side.
# .env (git-ignored) - load with your process manager or dotenv
export ZEROGPU_API_KEY="zgpu-..."
export ZEROGPU_PROJECT_ID="..."
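A minimal sketch of reading those variables at startup and failing fast when one is missing, so a misconfigured deploy surfaces immediately rather than as a 401 or 403 at request time. The helper name `load_zerogpu_headers` is hypothetical and not part of any ZeroGPU SDK; only the variable and header names come from this page.

```python
import os
from typing import Mapping


def load_zerogpu_headers(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Build the ZeroGPU auth headers from environment variables.

    Hypothetical helper: raises at startup if a variable is missing or empty.
    """
    required = {
        "x-api-key": "ZEROGPU_API_KEY",
        "x-project-id": "ZEROGPU_PROJECT_ID",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        # Name the missing variables, but never echo any secret value.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {header: env[var] for header, var in required.items()}
```

Passing the environment in as a mapping keeps the function easy to test and makes it straightforward to swap in a secrets manager later.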
Branch on the status code. Authentication and authorization errors are permanent - retrying them just burns time and quota. Rate limits and server errors are transient - those are the ones to retry.
| Status | Meaning | What to do |
| --- | --- | --- |
| 200 | Success | Parse and use the response. |
| 400 | Bad request | Fix the request body. Do not retry. |
| 401 | Bad API key | Check x-api-key. Do not retry. |
| 403 | Bad project ID | Check x-project-id and permissions. Do not retry. |
| 420 | Input over token limit | Shorten the input. Do not retry unchanged. |
| 429 | Rate limited | Back off and retry. Honor Retry-After. |
| 5xx | Server error | Retry with exponential backoff. |
Treat 408 (request timeout) and 409 (conflict) the same as 5xx for retry purposes. Network errors and client-side timeouts are retriable too.
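The branching in the table above can be sketched as a small classifier. This is an illustration rather than ZeroGPU's own logic: the `Action` enum and `classify_status` are hypothetical names, and treating every unlisted status as non-retriable is an assumption.

```python
from enum import Enum


class Action(Enum):
    USE = "parse and use the response"
    FIX = "fix the request or credentials; do not retry"
    RETRY = "back off and retry"


def classify_status(status: int) -> Action:
    """Map an HTTP status code to the handling described in the table."""
    if status == 200:
        return Action.USE
    # 408 and 409 are treated like 5xx; 429 is the rate limit.
    if status in (408, 409, 429) or 500 <= status < 600:
        return Action.RETRY
    # 400, 401, 403, 420 and (by assumption) anything unlisted: permanent.
    return Action.FIX
```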
If you call ZeroGPU through the drop-in OpenAI client, timeouts and retries are built in - set timeout and max_retries once on the client. The SDK retries 408, 409, 429, and 5xx with exponential backoff and respects Retry-After automatically.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.zerogpu.ai/v1",
    api_key="unused",  # ZeroGPU authenticates via the headers below
    default_headers={
        "x-api-key": os.environ["ZEROGPU_API_KEY"],
        "x-project-id": os.environ["ZEROGPU_PROJECT_ID"],
    },
    timeout=30.0,   # seconds, per request
    max_retries=5,  # exponential backoff on 408 / 409 / 429 / 5xx
)

resp = client.responses.create(
    model="llama-3.1-8b-instruct-fast",
    input="Your input text here...",
)
print(resp.output)
You can override either value per request, for example a longer timeout on a heavy call: client.responses.create(..., timeout=60.0).
When you call the HTTP API directly, implement the loop yourself: a per-request timeout, a retriable-status check, and exponential backoff with jitter that honors Retry-After.
# curl's built-in --retry covers 408/429/500/502/503/504. With no --retry-delay
# set, it doubles the wait between attempts (starting at 1 second), and recent
# curl versions honor Retry-After. It does NOT retry 400/401/403 - exactly right.
curl --retry 5 --max-time 30 \
  https://api.zerogpu.ai/v1/responses \
  -H "content-type: application/json" \
  -H "x-api-key: $ZEROGPU_API_KEY" \
  -H "x-project-id: $ZEROGPU_PROJECT_ID" \
  -d '{
    "model": "llama-3.1-8b-instruct-fast",
    "input": "Your input text here..."
  }'
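For application code that talks to the HTTP API without an SDK, the loop described above might look like the following sketch. Everything here is illustrative: `send` is a hypothetical zero-argument callable that performs one request with its own per-request timeout and returns `(status, headers, body)` with lowercase header names, and the base delay and cap are assumed values rather than ZeroGPU recommendations. Only the delta-seconds form of Retry-After is handled.

```python
import random
import time
from typing import Callable, Mapping, Optional, Tuple

Response = Tuple[int, Mapping[str, str], bytes]  # (status, headers, body)


def is_retriable(status: int) -> bool:
    """408, 409, 429 and any 5xx are worth retrying; everything else is not."""
    return status in (408, 409, 429) or 500 <= status < 600


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before the retry that follows attempt number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Honor the server's hint when it is given in seconds.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form is not parsed in this sketch; fall back to backoff.
    # Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)].
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(send: Callable[[], Response], max_retries: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call `send` until it returns a non-retriable status or retries run out."""
    for attempt in range(max_retries + 1):
        try:
            status, headers, body = send()
        except (TimeoutError, ConnectionError):
            # Network errors and client-side timeouts are retriable too.
            if attempt == max_retries:
                raise
            sleep(backoff_delay(attempt))
            continue
        if not is_retriable(status) or attempt == max_retries:
            return status, headers, body
        sleep(backoff_delay(attempt, headers.get("retry-after")))
    raise AssertionError("unreachable")
```

Injecting `send` and `sleep` keeps the retry policy independent of the HTTP library and lets the loop be tested without network access or real delays.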
For large, non-real-time workloads, the Batch and Files API is the right tool instead of looping over the synchronous endpoint. It processes up to 50,000 requests within a 24-hour window at a discounted rate and sidesteps per-request rate limits entirely - so you don’t need a retry loop at all.
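If a job is larger than one batch can hold, the inputs need to be split before submission. The sketch below assumes the 50,000 figure is a per-batch cap, which this page does not state explicitly; `chunk_requests` is a hypothetical helper, and the actual upload and batch-creation calls are left out because their shape is not described here.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

MAX_BATCH_REQUESTS = 50_000  # assumption: 50,000 is the cap for a single batch


def chunk_requests(items: Sequence[T],
                   size: int = MAX_BATCH_REQUESTS) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```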
You need…
Use
A single immediate response
The synchronous endpoint (POST /v1/responses) with the retry loop above
Thousands of completions, can wait minutes-to-hours