> ## Documentation Index
> Fetch the complete documentation index at: https://docs.zerogpu.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# How ZeroGPU works

> ZeroGPU turns idle compute into one programmable inference layer - specialized small models, an edge-powered network, and routing handled for you.

ZeroGPU turns idle compute into one programmable inference layer. You send a request; we run it on a specialized small or nano model, on the cheapest compute that can serve it well.

## Specialized small and nano models

ZeroGPU runs purpose-built ZLMs (ZeroGPU Language Models) for high-volume tasks like IAB classification and signal extraction, alongside a catalog of open small and nano models - DeBERTa, GLiNER, LFM2.5, Llama 3.1 8B. Small enough to run at the edge, good enough for production. See the [Model Catalog](/docs/model-catalog).

## An edge-powered network

Requests run across a hybrid of:

<Columns cols={3}>
  <Card title="Edge devices" icon="mobile-screen">
    Phones, gaming PCs.
  </Card>

  <Card title="Optimized edge servers" icon="server">
    Mid-sized models, higher load.
  </Card>

  <Card title="Cloud fallback" icon="cloud">
    Consistent performance and burst capacity.
  </Card>
</Columns>

## Routing

For each request, ZeroGPU picks the right model and the right compute by capability, availability, and load - and routes geographically to cut latency. You call one endpoint; the orchestration is handled for you.

<Note>
  This is network-side routing (which model, which compute). It's distinct from the ZeroGPU Router in the skills/plugins, which decides - on the client side - which steps of an agent run to offload to ZeroGPU at all. See [Integrations → Skills + CLI (Claude Code)](/integrations/claude-code-plugin).
</Note>
