Traditional inference: every request goes to a GPU data center, regardless of task complexity or user location. You pay for reserved capacity even when it’s idle. ZeroGPU: requests run on a distributed network of edge devices — laptops, phones, servers, browsers — using Nano Language Models that don’t need GPUs.

Why centralized inference is expensive

  • Traffic spikes: over-provision GPUs or accept latency spikes
  • Oversized models: LLMs consume GPU resources for tasks that don't need them
  • Regional egress: data round-trips to distant data centers add latency and fees
  • Idle capacity: reserved instances cost money 24/7

How ZeroGPU distributes it

[Diagram: ZeroGPU distributed inference architecture. Your App sends requests to the ZeroGPU Router, which distributes them to Edge Devices running NLMs, with Cloud Fallback.]
  1. Your app sends a request to ZeroGPU
  2. Router picks the best edge node by location, capacity, and model availability
  3. Edge device runs the NLM and returns the result
  4. Cloud fallback catches requests when no edge node is available
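The routing step above can be sketched as a small selection function. This is a minimal sketch, not the real ZeroGPU router: the `EdgeNode` fields and the ranking (same region first, then most spare capacity) are assumptions chosen to illustrate the location/capacity/model-availability criteria.

```python
from dataclasses import dataclass

@dataclass
class EdgeNode:
    # Hypothetical node record; real router state would be richer.
    region: str       # where the device is located
    free_slots: int   # spare inference capacity
    models: set       # NLMs the device has loaded

def pick_node(nodes, user_region, model):
    """Pick the best edge node for a request.

    Returns None when no node qualifies, signaling cloud fallback.
    """
    candidates = [n for n in nodes if model in n.models and n.free_slots > 0]
    if not candidates:
        return None  # step 4: no edge node available -> cloud fallback
    # Prefer nodes in the user's region, then nodes with the most free capacity.
    candidates.sort(key=lambda n: (n.region != user_region, -n.free_slots))
    return candidates[0]
```

A busier but local node beats a freer remote one here; a real router would also weigh measured latency and node health.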

What you get

  • Scale horizontally — more devices = more capacity, no GPU procurement
  • Pay for usage — not reserved instances sitting idle
  • Lower latency — inference runs near the user, not across the country
  • Resilience — no single point of failure; traffic reroutes automatically
Trade-off: Edge inference depends on device availability. ZeroGPU mitigates this with automatic cloud fallback — your app never notices the difference.
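The transparent-fallback behavior can be sketched as a wrapper that tries the edge path first and silently falls back to the cloud path on failure. Both callables here are hypothetical stand-ins, not ZeroGPU APIs; the point is that the caller sees one result either way.

```python
def run_inference(prompt, edge_infer, cloud_infer):
    """Run inference with transparent cloud fallback.

    edge_infer / cloud_infer are placeholder callables standing in for
    the edge and cloud inference paths. Any edge failure (no node
    available, timeout, etc.) is absorbed and retried against the cloud,
    so the calling app never notices the difference.
    """
    try:
        return edge_infer(prompt)
    except Exception:
        # Edge path unavailable: fall back to centralized inference.
        return cloud_infer(prompt)
```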