# NLMs vs LLMs
| | LLMs | NLMs |
|---|---|---|
| Parameters | 7B – 400B+ | Sub-1B |
| Runs on | GPU clusters | CPU, mobile, browser |
| Output | Variable | Predictable, task-specific |
| Cost | High | Low |
| Latency | 100ms – seconds | Single-digit milliseconds |
| Best for | Open-ended generation | Classification, extraction, routing |
## What NLMs handle well
- Content classification — categorize into taxonomies at scale
- Intent routing — map user queries to the right handler
- Entity extraction — pull names, dates, amounts from unstructured text
- Content moderation — flag violations in real time
- Summarization — condense documents and conversations
- Sentiment analysis — positive/negative/neutral at high throughput
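The intent-routing pattern above can be sketched as follows. This is a minimal illustration, not a library API: the classifier is a stub standing in for a real NLM call, and the labels and handler names are hypothetical.

```python
# Sketch of intent routing: an NLM classifies the query into a fixed label
# set, and the label selects a handler. classify_stub is a placeholder for
# a real NLM call; labels and handlers here are illustrative.

HANDLERS = {
    "billing": lambda q: f"billing team handles: {q}",
    "support": lambda q: f"support team handles: {q}",
    "other":   lambda q: f"default handler: {q}",
}

def classify_stub(query: str) -> str:
    # Placeholder for an NLM intent classifier returning one label.
    return "billing" if "invoice" in query.lower() else "other"

def route(query: str, classify=classify_stub) -> str:
    label = classify(query)
    # Fall back to the default handler for unrecognized labels.
    return HANDLERS.get(label, HANDLERS["other"])(query)

print(route("Where is my invoice?"))
```

Because the label set is fixed and the model's output is predictable, the routing table stays small and the fallback branch rarely fires.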
## "Why not just use a small LLM?"

Different architecture, different goals:

- Single-task fine-tuning — every parameter optimized for one job
- CPU-native — quantized and compiled for edge hardware, not adapted from GPU-first designs
- Deterministic output — consistent results production systems can rely on
## Available models
| Model | Use case |
|---|---|
| `zlm-v1-summary-cloud` | Text summarization |
| `zlm-v1-iab-classify-cloud` | IAB content classification |
Pass the model name in the `model` field when calling the API.
## API Reference
Send requests to NLMs via `/v1/responses`.
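As a sketch, a request to `/v1/responses` might be built like this. The endpoint path and the `model` field come from the docs above; the base URL, auth scheme, and payload shape (`input`) are assumptions, so check them against the actual API reference. The snippet only constructs the request body rather than sending it.

```python
import json

def build_request(model: str, text: str) -> dict:
    # Assumed payload shape: the docs only specify the `model` field;
    # the `input` key is a placeholder for the request text.
    return {"model": model, "input": text}

body = build_request(
    "zlm-v1-summary-cloud",
    "Quarterly revenue rose 12% year over year.",
)
payload = json.dumps(body)
print(payload)
```

To send it, POST this JSON to `/v1/responses` on the service's base URL with your API key in the appropriate auth header.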
