

The router decides which edge device handles each request. It evaluates four signals in milliseconds:
| Signal | What it optimizes |
| --- | --- |
| Geographic proximity | Lowest network latency |
| Device capability | Enough compute for the requested model |
| Current load | Avoids overloaded nodes |
| Model availability | Routes to nodes with the NLM already cached |

Request flow

1. **Request arrives.** The API gateway extracts the model identifier, payload size, and origin IP.
2. **Node selected.** The router queries the network topology; the best node that satisfies all constraints wins.
3. **Inference runs.** The request is forwarded to the edge device, where the NLM processes the input and returns the result.
4. **Fallback if needed.** If no edge node responds in time, the request goes to cloud infrastructure. The response format is the same, so your app doesn't know the difference.
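
The fallback step above can be sketched as a timeout race: try the edge path with a deadline, and route to the cloud path if it times out or errors. `edge_fn`, `cloud_fn`, and the 0.5 s budget are stand-ins, not real SDK names.

```python
import concurrent.futures

EDGE_TIMEOUT_S = 0.5  # assumed latency budget before falling back

def route_request(payload, edge_fn, cloud_fn, timeout=EDGE_TIMEOUT_S):
    """Return the edge result if it arrives within `timeout`,
    otherwise serve the same-format response from the cloud path."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(edge_fn, payload)
        try:
            return future.result(timeout=timeout)
        except Exception:
            # Timeout or edge-side error: fall back transparently.
            # Both paths return the same shape, so the caller can't tell.
            return cloud_fn(payload)
```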

Cloud fallback

Availability guarantee: if the edge network can't serve a request (capacity, model, or device constraints), cloud-hosted replicas handle it transparently. The trade-off is that cloud fallback may add slightly more latency than edge, but it ensures 100% availability. Your integration code doesn't change either way.
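
Because fallback is transparent, a client only ever builds one request. Here is a stdlib-only sketch of calling `POST /v1/responses`; only that path comes from this page, while the base URL, auth header, and body fields (`model`, `input`) are assumptions to illustrate the shape.

```python
import json
import urllib.request

def build_request(base_url: str, api_key: str, model: str, input_text: str):
    """Build a POST to /v1/responses. Body fields are hypothetical."""
    body = json.dumps({"model": model, "input": input_text}).encode()
    return urllib.request.Request(
        f"{base_url}/v1/responses",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )

def create_response(req) -> dict:
    # Identical call whether an edge node or a cloud replica answers.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Note that nothing in the client indicates which backend served the response; that is the point of the transparent fallback.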

Distributed Inference

The full architecture behind edge compute.

API Reference

Endpoint spec for /v1/responses.