Route prompts between local and cloud LLMs based on task complexity.
Use local models (Ollama, llama.cpp, vLLM) for simple tasks — save money and keep data private. Automatically escalate to cloud APIs (OpenAI, Claude, Gemini, XiDao) when the prompt needs frontier reasoning.
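In 2026, the local AI movement is real. But not every task needs a 70B+ model, and not every task can be handled by a 7B model. local-llm-router gives you the best of both worlds by sending each prompt to the smallest model that can handle it. A typical split looks roughly like this: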
- 80% of prompts (summarization, extraction, simple Q&A) → local model (free, private, fast)
- 20% of prompts (complex reasoning, code generation, multi-step planning) → cloud API (capable, expensive)
- 🔀 Smart routing — classify prompt complexity before sending to a model
- 🏠 Local-first — defaults to Ollama/llama.cpp, falls back to cloud only when needed
- 💰 Cost tracking — logs every routing decision with latency and estimated cost
- ⚡ Low latency — complexity scoring adds <50ms overhead
- 📊 Dashboard output — JSON logs compatible with Grafana, Prometheus, or simple CLI stats
- 🔧 YAML config — define your models, thresholds, and routing rules in one file
- 🔄 OpenAI-compatible — works with any provider that exposes `/v1/chat/completions`
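
As a rough illustration of the cost-tracking and JSON-log features listed above, the sketch below estimates a per-request cost from a per-1k-token price and emits one JSON line per routing decision. The `RoutingRecord` class, its field names, and the `summarize` helper are hypothetical and are not the library's actual log schema; only the idea of a `cost_per_1k` price comes from the config format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RoutingRecord:
    """One routing decision. All field names here are hypothetical."""
    model: str
    local: bool
    complexity: float
    latency_ms: int
    total_tokens: int
    cost_per_1k: float  # price per 1,000 tokens; 0.0 for local models

    @property
    def estimated_cost(self) -> float:
        # Cost scales linearly with the number of tokens processed.
        return self.total_tokens / 1000 * self.cost_per_1k


def to_jsonl(record: RoutingRecord) -> str:
    """Serialize a record as a single JSON line, including the derived cost."""
    payload = asdict(record)
    payload["estimated_cost"] = round(record.estimated_cost, 6)
    return json.dumps(payload, sort_keys=True)


def summarize(records: list[RoutingRecord]) -> dict:
    """Aggregate the local routing rate and the total estimated cost."""
    total = len(records)
    local = sum(1 for r in records if r.local)
    return {
        "queries": total,
        "local_rate": local / total if total else 0.0,
        "total_cost": round(sum(r.estimated_cost for r in records), 6),
    }


# Made-up sample data, for illustration only.
records = [
    RoutingRecord("llama3.2-3b", True, 0.32, 145, 800, 0.0),
    RoutingRecord("qwen2.5-7b", True, 0.48, 280, 1500, 0.0),
    RoutingRecord("llama3.2-3b", True, 0.21, 120, 400, 0.0),
    RoutingRecord("gpt-4.1", False, 0.81, 650, 2000, 0.002),
]
for record in records:
    print(to_jsonl(record))
print(summarize(records))
```

One JSON object per line keeps the log append-only and easy to tail, filter, or ship to a dashboard.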
pip install local-llm-router
# Create a default config
llm-router init
# Route a single prompt
llm-router route "Summarize this article" --file article.txt
# Run as a proxy server (OpenAI-compatible endpoint)
llm-router serve --port 8080
# View routing stats
llm-router stats --last 24h

# config.yaml
models:
  local:
    - name: llama3.2-3b
      provider: ollama
      endpoint: http://localhost:11434
      max_tokens: 4096
      cost_per_1k: 0.0
    - name: qwen2.5-7b
      provider: ollama
      endpoint: http://localhost:11434
      max_tokens: 8192
      cost_per_1k: 0.0
  cloud:
    - name: claude-sonnet-4-20250514
      provider: openai-compatible
      endpoint: https://api.xidao.online
      api_key: ${XIDAO_API_KEY}
      max_tokens: 8192
      cost_per_1k: 0.003
    - name: gpt-4.1
      provider: openai-compatible
      endpoint: https://api.openai.com
      api_key: ${OPENAI_API_KEY}
      max_tokens: 16384
      cost_per_1k: 0.002
routing:
  complexity_threshold: 0.6    # Above this → cloud
  scorer: keyword-and-length   # Options: keyword-and-length, classifier, llm-judge
  fallback_chain:
    - qwen2.5-7b
    - claude-sonnet-4-20250514
    - gpt-4.1
  timeout_ms: 30000
  retry_attempts: 2
logging:
  format: json
  output: ./logs/routing.jsonl
  log_prompts: false           # Privacy: don't log prompt content by default

routing:
  scorer: keyword-and-length
  complexity_rules:
    high_complexity_keywords:
      - "analyze"
      - "compare and contrast"
      - "write code"
      - "debug"
      - "multi-step"
      - "chain of thought"
    low_complexity_keywords:
      - "summarize"
      - "extract"
      - "list"
      - "translate"
      - "format"
    length_thresholds:
      short: 200    # tokens — likely simple
      long: 2000    # tokens — likely complex

The `classifier` scorer uses a tiny classifier model (runs locally) to score prompt complexity:
routing:
  scorer: classifier
  classifier_model: local-complexity-v1   # Ships built-in, ~10 MB

The `llm-judge` scorer asks a small local model to judge complexity before routing:
routing:
  scorer: llm-judge
  judge_model: llama3.2-3b
  judge_prompt: "Rate this task complexity 1-10. Task: {prompt}"
  threshold: 6

Run as an OpenAI-compatible proxy — point your existing code at it:
llm-router serve --port 8080

# Your existing code works unchanged
# Note: the OpenAI client still requires OPENAI_API_KEY (or api_key=...) to be set
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1")
response = client.chat.completions.create(
    model="auto",  # Router picks local or cloud
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

# CLI summary
llm-router stats --last 7d
# Output:
# ┌─────────────────────┬─────────┬──────────┬───────────┬──────────┐
# │ Model               │ Queries │ Avg $/q  │ Total $   │ Avg ms   │
# ├─────────────────────┼─────────┼──────────┼───────────┼──────────┤
# │ llama3.2-3b (local) │ 847     │ $0.000   │ $0.000    │ 120      │
# │ qwen2.5-7b (local)  │ 203     │ $0.000   │ $0.000    │ 280      │
# │ claude-sonnet-4     │ 112     │ $0.003   │ $0.336    │ 890      │
# │ gpt-4.1             │ 38      │ $0.002   │ $0.076    │ 650      │
# └─────────────────────┴─────────┴──────────┴───────────┴──────────┘
# Total: 1,200 queries | $0.412 spent | 87.5% local routing rate
# Estimated savings vs cloud-only: $3.18 (88.6%)

from local_llm_router import Router
router = Router.from_config("config.yaml")
# Simple routing
result = router.route("Summarize this text: ...")
print(result.model) # "llama3.2-3b"
print(result.local) # True
print(result.complexity) # 0.32
print(result.latency_ms) # 145
# Force a specific model
result = router.route("Write a sorting algorithm", force_model="gpt-4.1")
# Get routing stats
stats = router.stats(period="24h")
print(f"Local routing rate: {stats.local_rate:.1%}")
print(f"Total cost: ${stats.total_cost:.3f}")

- Python 3.10+
- Ollama (for local models) — or any OpenAI-compatible local server
- One cloud API key (optional — fully local mode works without any cloud access)
# Basic (local-only mode)
pip install local-llm-router
# With cloud support
pip install "local-llm-router[cloud]"
# Development
git clone https://github.com/XidaoApi/local-llm-router.git
cd local-llm-router
pip install -e ".[dev]"

┌──────────┐     ┌──────────────┐     ┌───────────────────┐
│  Prompt  │────▶│  Complexity  │────▶│ Routing Decision  │
│  Input   │     │    Scorer    │     │ (threshold check) │
└──────────┘     └──────────────┘     └─────────┬─────────┘
                                                │
                           ┌────────────────────┼────────────────────┐
                           │                    │                    │
                           ▼                    ▼                    ▼
                    ┌─────────────┐      ┌──────────────┐     ┌──────────────┐
                    │ Local Model │      │ Cloud Model  │     │   Fallback   │
                    │  (Ollama)   │      │    (API)     │     │    Chain     │
                    └─────────────┘      └──────────────┘     └──────────────┘
                           │                    │                    │
                           ▼                    ▼                    ▼
                    ┌────────────────────────────────────────────────────────┐
                    │              Response + Cost/Latency Log               │
                    └────────────────────────────────────────────────────────┘
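
The flow in the diagram can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the scoring weights, `count_hits`, `score_complexity`, `route`, and the stub backends are assumptions made up for the example, while the keywords, length thresholds, the 0.6 threshold, and the fallback order mirror the sample config. Cost and latency logging is left out for brevity.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Values copied from the sample config; the weights used below are invented.
HIGH_KEYWORDS = ["analyze", "compare and contrast", "write code",
                 "debug", "multi-step", "chain of thought"]
LOW_KEYWORDS = ["summarize", "extract", "list", "translate", "format"]
SHORT_TOKENS, LONG_TOKENS = 200, 2000
COMPLEXITY_THRESHOLD = 0.6
LOCAL_MODELS = {"llama3.2-3b", "qwen2.5-7b"}


def count_hits(text: str, keywords: list[str]) -> int:
    """Count keywords that appear as whole words or phrases in the text."""
    return sum(bool(re.search(rf"\b{re.escape(kw)}\b", text)) for kw in keywords)


def score_complexity(prompt: str) -> float:
    """Hypothetical keyword-and-length heuristic, clamped to [0, 1]."""
    text = prompt.lower()
    score = 0.5 + 0.2 * count_hits(text, HIGH_KEYWORDS) - 0.2 * count_hits(text, LOW_KEYWORDS)
    tokens = len(text.split())  # crude stand-in for a real tokenizer
    if tokens <= SHORT_TOKENS:
        score -= 0.1
    elif tokens >= LONG_TOKENS:
        score += 0.2
    return max(0.0, min(1.0, score))


@dataclass
class Result:
    model: str
    local: bool
    complexity: float
    text: str


Backend = Callable[[str], str]


def route(prompt: str, backends: dict[str, Backend], local_model: str,
          cloud_model: str, fallback_chain: list[str]) -> Result:
    """Pick a model by threshold, then walk the fallback chain on failure."""
    complexity = score_complexity(prompt)
    primary = cloud_model if complexity > COMPLEXITY_THRESHOLD else local_model
    candidates = [primary] + [m for m in fallback_chain if m != primary]
    for name in candidates:
        try:
            text = backends[name](prompt)
        except Exception:
            continue  # this backend failed; try the next one in the chain
        return Result(name, name in LOCAL_MODELS, complexity, text)
    raise RuntimeError("all models in the fallback chain failed")


def fake_backend(name: str) -> Backend:
    """Stub standing in for a real model call."""
    return lambda prompt: f"[{name}] reply"


def failing_backend(prompt: str) -> str:
    raise ConnectionError("backend unavailable")


backends = {
    "llama3.2-3b": failing_backend,  # simulate a local model that is down
    "qwen2.5-7b": fake_backend("qwen2.5-7b"),
    "claude-sonnet-4-20250514": fake_backend("claude-sonnet-4-20250514"),
    "gpt-4.1": fake_backend("gpt-4.1"),
}
chain = ["qwen2.5-7b", "claude-sonnet-4-20250514", "gpt-4.1"]

simple = route("Summarize this article", backends,
               "llama3.2-3b", "claude-sonnet-4-20250514", chain)
hard = route("Analyze and debug this multi-step pipeline", backends,
             "llama3.2-3b", "claude-sonnet-4-20250514", chain)
print(simple.model, simple.local)
print(hard.model, hard.local)
```

Matching keywords on word boundaries avoids false hits such as "format" inside "information". A whitespace split is only an approximation of the token counts used by real models.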
| Project | Description |
|---|---|
| llm-cost-calculator | Compare LLM pricing across 50+ providers |
| llm-api-bench | Benchmark LLM providers on latency and cost |
| llm-failover-router-demo | Production failover with circuit breakers |
| xidao-cookbook | Routing recipes and migration guides |
MIT — use it however you want.