ZeroGPU turns idle compute into one programmable inference layer. You send a request; we run it on a specialized small or nano model, on the cheapest compute that can serve it well.

Specialized small and nano models

ZeroGPU runs purpose-built ZLMs (ZeroGPU Language Models) for high-volume tasks like IAB classification and signal extraction, alongside a catalog of open small and nano models: DeBERTa, GLiNER, LFM2.5, and Llama 3.1 8B. Small enough to run at the edge, good enough for production. See the Model Catalog.

An edge-powered network

Requests run across a hybrid of:

Edge devices

Phones, gaming PCs.

Optimized edge servers

Mid-sized models, higher load.

Cloud fallback

Consistent performance and burst capacity.

Routing

For each request, ZeroGPU picks the model and the compute based on capability, availability, and load, and routes geographically to cut latency. You call one endpoint; the orchestration is handled for you.
This is network-side routing: it chooses which model and which compute serve a request. It’s distinct from the ZeroGPU Router in the skills/plugins, which runs on the client side and decides which steps of an agent run to offload to ZeroGPU at all. See Integrations → Skills + CLI (Claude Code).
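To make the network-side selection concrete, here is a minimal Python sketch of how a router could weigh capability, availability, load, and geography, with cloud used only as a fallback. It is an illustration under stated assumptions, not ZeroGPU's actual routing logic: the `Node` type, the `pick_node` function, the `max_load` threshold, the tier and region labels, and the model identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical description of one compute node in the network."""
    name: str
    tier: str                 # "edge_device", "edge_server", or "cloud" (assumed labels)
    region: str               # coarse location, e.g. "eu" or "us" (assumed labels)
    models: FrozenSet[str]    # capability: models this node can serve
    online: bool              # availability
    load: float               # 0.0 (idle) to 1.0 (saturated)


def pick_node(nodes: List[Node], model: str, region: str,
              max_load: float = 0.9) -> Optional[Node]:
    """Return a node for the request, or None if nothing can serve it.

    Assumed policy: filter by capability, availability, and load; prefer
    edge tiers and use cloud only when no edge node is eligible; among the
    remaining nodes prefer the caller's region, then the lowest load.
    """
    def eligible(n: Node) -> bool:
        return n.online and model in n.models and n.load < max_load

    edge = [n for n in nodes if n.tier != "cloud" and eligible(n)]
    cloud = [n for n in nodes if n.tier == "cloud" and eligible(n)]
    pool = edge or cloud  # cloud fallback
    if not pool:
        return None
    # False sorts before True, so same-region nodes win; ties break on load.
    return min(pool, key=lambda n: (n.region != region, n.load))


# Hypothetical fleet, for illustration only.
nodes = [
    Node("phone-1", "edge_device", "eu", frozenset({"deberta"}), True, 0.2),
    Node("edge-eu", "edge_server", "eu", frozenset({"deberta", "llama-3.1-8b"}), True, 0.95),
    Node("edge-us", "edge_server", "us", frozenset({"llama-3.1-8b"}), True, 0.4),
    Node("cloud-eu", "cloud", "eu", frozenset({"deberta", "llama-3.1-8b"}), True, 0.1),
]

if __name__ == "__main__":
    choice = pick_node(nodes, "deberta", "eu")
    print(choice.name if choice else "no node available")
```

In this sketch a request for `"deberta"` from `"eu"` lands on `phone-1`, because the EU edge server is over the load threshold. Preferring any eligible edge node over cloud is one possible policy; a latency-first policy might instead choose a nearby cloud node over a distant edge server.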