Every line in a batch input file must target /v1/chat/completions in the url field, and the endpoint field on the create call must match. This matches OpenAI’s published batchable surface for this API version.
| url | Use case | Sync equivalent |
| --- | --- | --- |
| /v1/chat/completions | OpenAI-style chat completions | POST /v1/chat/completions |
The request body inside each JSONL line’s body field is the same body you would send synchronously. The response in the output JSONL’s response.body is the same response the synchronous endpoint would return.
Other ZeroGPU endpoints are sync-only. Internal routes like /summary, /v1/responses, /usecase/classify/iab, and /gliner remain reachable as synchronous calls but are not batchable. A JSONL line with one of these url values is rejected at create time with the error “is not an allowed batch endpoint”.
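
The url restriction above can be checked locally before a file is uploaded. The following is a minimal sketch, not the official client: the helper name `make_batch_line` is hypothetical, and the `custom_id` and `method` keys are assumptions borrowed from OpenAI's batch line format, so confirm the exact line shape on the JSONL format page.

```python
import json

# The only url value this page lists as batchable.
ALLOWED_BATCH_URLS = {"/v1/chat/completions"}


def make_batch_line(custom_id: str, body: dict,
                    url: str = "/v1/chat/completions") -> str:
    """Serialize one request as a single JSONL line (hypothetical helper)."""
    if url not in ALLOWED_BATCH_URLS:
        # Fail locally with the same wording the create call is documented to use.
        raise ValueError(f"{url} is not an allowed batch endpoint")
    line = {
        "custom_id": custom_id,  # assumed key, following OpenAI's batch format
        "method": "POST",        # assumed key, following OpenAI's batch format
        "url": url,
        "body": body,            # same body you would send synchronously
    }
    return json.dumps(line)
```

Keeping the allowed set in one constant means the check needs a single change if more endpoints become batchable later.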

/v1/chat/completions

Standard OpenAI-style chat completions. Suitable for any model that supports chat-completion semantics.

Request body (inside JSONL body)

{
  "model": "<model-id>",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user",   "content": "What is the capital of France?" }
  ]
}
| Field | Required | Description |
| --- | --- | --- |
| model | Yes | A model ID enabled for your project. |
| messages | Yes | Array of {role, content}. role is one of user, assistant, system, tool, developer. |
| stream | - | Must not be true. Rejected at batch creation. |
Per-model token limits apply (configured in model metadata). If the input tokens exceed the model’s max_tokens setting, the behavior depends on the model’s error_on_max_token flag: either the line fails with code: "context_length_exceeded" (HTTP 400), or the input is silently truncated. Check your model’s configuration to see which behavior applies.
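
The field rules above lend themselves to a quick client-side check before a batch is created. This is a rough sketch under stated assumptions: `validate_chat_body` is a hypothetical helper, it covers only the three fields documented in the table, and it does not attempt token counting, which depends on the model.

```python
# Roles listed in the field table above.
ALLOWED_ROLES = {"user", "assistant", "system", "tool", "developer"}


def validate_chat_body(body: dict) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    model = body.get("model")
    if not isinstance(model, str) or not model:
        problems.append("model is required")
    messages = body.get("messages")
    if not isinstance(messages, list):
        problems.append("messages must be an array")
    else:
        for i, msg in enumerate(messages):
            role = msg.get("role") if isinstance(msg, dict) else None
            if role not in ALLOWED_ROLES:
                problems.append(f"messages[{i}].role is not an allowed role")
    # Streaming is documented as rejected at batch creation.
    if body.get("stream") is True:
        problems.append("stream must not be true")
    return problems
```

Returning a list rather than raising on the first problem lets a caller report every issue in a large input file in one pass.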

Response body (inside output JSONL response.body)

{
  "id":      "chatcmpl-...",
  "object":  "chat.completion",
  "created": 1736295000,
  "model":   "<model-id>",
  "choices": [
    {
      "index": 0,
      "message": {
        "role":        "assistant",
        "content":     "Paris.",
        "annotations": []
      },
      "finish_reason": "stop",
      "logprobs":      null
    }
  ],
  "usage": {
    "prompt_tokens":     12,
    "completion_tokens": 2,
    "total_tokens":      14,
    "prompt_tokens_details":     { "cached_tokens": 0, "audio_tokens": 0 },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "service_tier":       "default",
  "system_fingerprint": null
}
system_fingerprint is always null today. ZeroGPU does not yet expose backend-determinism fingerprints. The field is present for SDK compatibility: OpenAI’s ChatCompletion type declares it as optional, but OpenAI itself populates it on every response.

For classification models invoked via /v1/chat/completions, choices[0].message.content is a JSON-stringified classification response rather than free-form text. Parse it as JSON to extract the structured result.
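
Reading the content field from an output line, including the extra JSON decode for classification models, might look like the sketch below. The helper name `extract_content` is hypothetical, and the code assumes each output line nests the completion under response.body as described on this page. It does not handle failed lines, whose shape is not covered here.

```python
import json


def extract_content(output_line: str, parse_json: bool = False):
    """Pull choices[0].message.content out of one output JSONL line."""
    record = json.loads(output_line)
    content = record["response"]["body"]["choices"][0]["message"]["content"]
    # Classification models return a JSON string in content; decode it on request.
    return json.loads(content) if parse_json else content
```

Pass parse_json=True only for models known to return classification output, since free-form chat text such as "Paris." is not valid JSON and would raise json.JSONDecodeError.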

Next steps

JSONL format →

How to wrap each request body in a valid input JSONL line.

Batches API reference →

Create, list, retrieve, cancel; status lifecycle and limits.

Examples →

End-to-end walkthrough in curl and Python.