How ZeroGPU works

ZeroGPU turns idle compute into one programmable inference layer. You send a request; we run it on a specialized small or nano model, on the cheapest compute that can serve it well.

Specialized small and nano models

ZeroGPU runs purpose-built ZLMs (ZeroGPU Language Models) for high-volume tasks like IAB classification and signal extraction, alongside a catalog of open small and nano models: DeBERTa, GLiNER, LFM2.5, and Llama 3.1 8B. These models are small enough to run at the edge and good enough for production. See the Model Catalog.

An edge-powered network

Requests run across a hybrid of:

Edge devices

Phones, gaming PCs.

Optimized edge servers

Mid-sized models, higher load.

Cloud fallback

Consistent performance and burst capacity.
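
The three tiers can be read as a preference order: use the cheapest tier that can serve the request, and fall back to cloud otherwise. The sketch below illustrates that idea only. The tier names, the parameter limits, the capacity flags, and the `pick_tier` helper are all hypothetical and do not describe ZeroGPU's actual configuration or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_model_params_b: float  # hypothetical: largest model (billions of parameters) the tier serves
    has_capacity: bool         # hypothetical: whether the tier can accept more work right now


def pick_tier(tiers, model_params_b):
    """Return the first tier, in preference order, that fits the model and has capacity."""
    for tier in tiers:
        if model_params_b <= tier.max_model_params_b and tier.has_capacity:
            return tier
    raise RuntimeError("no tier can serve this request")


# Assumed preference order: cheapest first, cloud last as the fallback.
TIERS = [
    Tier("edge-device", max_model_params_b=1.0, has_capacity=True),
    Tier("edge-server", max_model_params_b=10.0, has_capacity=False),
    Tier("cloud", max_model_params_b=float("inf"), has_capacity=True),
]

nano_tier = pick_tier(TIERS, 0.3)  # small enough for an edge device
mid_tier = pick_tier(TIERS, 8.0)   # edge servers are busy here, so cloud takes over
```

In this toy setup a 0.3B-parameter model lands on an edge device, while an 8B-parameter model skips the saturated edge servers and falls through to cloud.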

Routing

For each request, ZeroGPU picks the model and the compute based on capability, availability, and load, and routes geographically to cut latency. You call one endpoint; the orchestration is handled for you.

This is network-side routing (which model, which compute). It’s distinct from the ZeroGPU Router in the skills/plugins, which decides, on the client side, which steps of an agent run to offload to ZeroGPU at all. See Integrations → Claude Code Plugin.
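
As a rough illustration of the network-side routing described above, the sketch below filters nodes by capability and availability, then picks the one with the lowest combined cost of distance and load. The `Node` fields, the `route` function, the cost formula, and the node and model identifiers are assumptions made for this example, not ZeroGPU's real scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str
    models: frozenset   # models this node can run (capability)
    online: bool        # availability
    load: float         # 0.0 (idle) to 1.0 (saturated)
    distance_km: float  # distance from the caller, used as a stand-in for latency


def route(nodes, model, load_weight=1000.0):
    """Filter by capability and availability, then pick the lowest-cost node."""
    candidates = [
        n for n in nodes
        if n.online and model in n.models and n.load < 1.0
    ]
    if not candidates:
        raise LookupError(f"no available node can serve {model!r}")
    # Hypothetical cost: distance in km plus a penalty proportional to load.
    return min(candidates, key=lambda n: n.distance_km + load_weight * n.load)


# Made-up nodes for illustration only.
nodes = [
    Node("phone-1", frozenset({"deberta"}), True, 0.2, 5.0),
    Node("edge-eu", frozenset({"deberta", "gliner"}), True, 0.9, 50.0),
    Node("edge-us", frozenset({"gliner"}), False, 0.1, 6000.0),
    Node("cloud-eu", frozenset({"deberta", "gliner"}), True, 0.3, 400.0),
]

chosen_deberta = route(nodes, "deberta")
chosen_gliner = route(nodes, "gliner")
```

With these numbers the nearby, lightly loaded phone wins for the first request. For the second, the closer edge server is heavily loaded and the other is offline, so the cloud node is chosen. Changing `load_weight` shifts the trade-off between proximity and load.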
