local-llm-router

Route prompts between local and cloud LLMs based on task complexity.

Use locally served models (via Ollama, llama.cpp, or vLLM) for simple tasks to save money and keep data private. Automatically escalate to cloud APIs (OpenAI, Claude, Gemini, XiDao) when the prompt needs frontier reasoning.

Why?

In 2026, running models locally is practical for everyday work. But not every task needs a 70B+ model, and not every task can be handled by a 7B model. local-llm-router sends each prompt to the cheapest model that can handle it. The intended split looks like this:

80% of prompts (summarization, extraction, simple Q&A) → local model (free, private, fast)
20% of prompts (complex reasoning, code generation, multi-step planning) → cloud API (capable, expensive)

Features

🔀 Smart routing — classify prompt complexity before sending to a model
🏠 Local-first — defaults to Ollama/llama.cpp, falls back to cloud only when needed
💰 Cost tracking — logs every routing decision with latency and estimated cost
⚡ Low latency — complexity scoring adds <50ms overhead
📊 Dashboard output — JSON logs compatible with Grafana, Prometheus, or simple CLI stats
🔧 YAML config — define your models, thresholds, and routing rules in one file
🔄 OpenAI-compatible — works with any provider that exposes /v1/chat/completions

Quick Start

pip install local-llm-router

# Create a default config
llm-router init

# Route a single prompt
llm-router route "Summarize this article" --file article.txt

# Run as a proxy server (OpenAI-compatible endpoint)
llm-router serve --port 8080

# View routing stats
llm-router stats --last 24h

Configuration

# config.yaml
models:
  local:
    - name: llama3.2-3b
      provider: ollama
      endpoint: http://localhost:11434
      max_tokens: 4096
      cost_per_1k: 0.0
      
    - name: qwen2.5-7b
      provider: ollama
      endpoint: http://localhost:11434
      max_tokens: 8192
      cost_per_1k: 0.0

  cloud:
    - name: claude-sonnet-4-20250514
      provider: openai-compatible
      endpoint: https://api.xidao.online
      api_key: ${XIDAO_API_KEY}
      max_tokens: 8192
      cost_per_1k: 0.003
      
    - name: gpt-4.1
      provider: openai-compatible
      endpoint: https://api.openai.com
      api_key: ${OPENAI_API_KEY}
      max_tokens: 16384
      cost_per_1k: 0.002

routing:
  complexity_threshold: 0.6  # Above this → cloud
  scorer: keyword-and-length  # Options: keyword-and-length, classifier, llm-judge
  fallback_chain:
    - qwen2.5-7b
    - claude-sonnet-4-20250514
    - gpt-4.1
  timeout_ms: 30000
  retry_attempts: 2

logging:
  format: json
  output: ./logs/routing.jsonl
  log_prompts: false  # Privacy: don't log prompt content by default
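
The ${XIDAO_API_KEY} and ${OPENAI_API_KEY} placeholders above suggest that API keys are read from the environment rather than stored in the file. How the router resolves them is not documented here. The following is a minimal sketch of one way to do it, using a hypothetical helper named expand_env that fails loudly when a variable is unset.

```python
import os
import re

# Matches ${NAME} placeholders such as ${XIDAO_API_KEY}.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str, env=None) -> str:
    """Replace ${NAME} placeholders with values from the environment.

    Hypothetical helper, not part of local-llm-router's documented API.
    Raises KeyError if a referenced variable is not set, so a missing
    key is caught at load time instead of on the first cloud request.
    """
    env = os.environ if env is None else env

    def _sub(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(_sub, value)
```

Strings without placeholders pass through unchanged, so the helper could be applied to every string value in the parsed config.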

Routing Strategies

Keyword + Length (default, fastest)

routing:
  scorer: keyword-and-length
  complexity_rules:
    high_complexity_keywords:
      - "analyze"
      - "compare and contrast"
      - "write code"
      - "debug"
      - "multi-step"
      - "chain of thought"
    low_complexity_keywords:
      - "summarize"
      - "extract"
      - "list"
      - "translate"
      - "format"
    length_thresholds:
      short: 200    # tokens — likely simple
      long: 2000    # tokens — likely complex
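
The exact scoring formula is not specified here. As a rough sketch of how keyword hits and prompt length might combine into a score comparable to complexity_threshold, the function below starts from a neutral 0.5, shifts by a fixed amount per keyword hit and for very short or very long prompts, and clamps the result to [0, 1]. The function name, the weights, and the whitespace-based token estimate are all assumptions.

```python
HIGH = ["analyze", "compare and contrast", "write code", "debug", "multi-step", "chain of thought"]
LOW = ["summarize", "extract", "list", "translate", "format"]


def score_complexity(prompt: str, short: int = 200, long: int = 2000) -> float:
    """Return a complexity score in [0, 1]; higher means more complex."""
    text = prompt.lower()
    # Crude token estimate: whitespace-separated words (an assumption;
    # a real tokenizer would give different counts).
    n_tokens = len(text.split())

    score = 0.5
    # Naive substring matching: "list" also matches "listen".
    score += 0.2 * sum(kw in text for kw in HIGH)
    score -= 0.2 * sum(kw in text for kw in LOW)
    if n_tokens <= short:
        score -= 0.1
    elif n_tokens >= long:
        score += 0.1
    return max(0.0, min(1.0, score))
```

With these weights, a short "Summarize this article" lands below the 0.6 threshold and stays local, while a prompt containing both "debug" and "analyze" lands above it.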

Classifier (accurate, small overhead)

Uses a tiny classifier model (runs locally) to score prompt complexity:

routing:
  scorer: classifier
  classifier_model: local-complexity-v1  # Ships built-in, ~10MB

LLM Judge (most accurate, adds latency)

Asks a small local model to judge complexity before routing:

routing:
  scorer: llm-judge
  judge_model: llama3.2-3b
  judge_prompt: "Rate this task complexity 1-10. Task: {prompt}"
  threshold: 6
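
A small model asked for a 1-10 rating will not always answer with a bare number, so the reply has to be parsed defensively. The sketch below is an assumption about how that step could look. The names parse_judge_score and should_escalate, the strict "above threshold" comparison, and the choice to treat unparseable replies as complex are illustrative, not documented behavior. The call to the judge model itself is omitted.

```python
import re


def parse_judge_score(reply: str, default: int = 10) -> int:
    """Extract the first integer from 1 to 10 in the judge's reply.

    Falls back to `default` (treated as most complex) when no valid
    rating is found.
    """
    match = re.search(r"\b(10|[1-9])\b", reply)
    return int(match.group(1)) if match else default


def should_escalate(reply: str, threshold: int = 6) -> bool:
    """True when the rating is above the threshold, i.e. route to cloud."""
    return parse_judge_score(reply) > threshold
```

Defaulting to the highest rating trades cost for safety: a confused judge sends the prompt to the cloud rather than risking a weak local answer.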

Proxy Server Mode

Run as an OpenAI-compatible proxy — point your existing code at it:

llm-router serve --port 8080

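# In your Python code, only base_url needs to change
from openai import OpenAI

# The OpenAI client requires an api_key (or OPENAI_API_KEY in the environment).
# The placeholder below assumes the router does not validate it.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
response = client.chat.completions.create(
    model="auto",  # Router picks local or cloud
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.choices[0].message.content)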
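
Because the proxy exposes /v1/chat/completions, a client without the openai package should be able to talk to it as well. The sketch below builds such a request with the standard library only. It assumes the proxy accepts the standard Chat Completions JSON body and needs no Authorization header; build_chat_request is a hypothetical helper.

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://localhost:8080/v1") -> urllib.request.Request:
    """Build (but do not send) a POST to the proxy's chat completions route."""
    body = {
        "model": "auto",  # let the router choose local or cloud
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending requires a running `llm-router serve --port 8080`:
# with urllib.request.urlopen(build_chat_request("Explain quantum computing")) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```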

Cost Dashboard

# CLI summary
llm-router stats --last 7d

# Output:
# ┌─────────────────────┬─────────┬──────────┬───────────┬──────────┐
# │ Model               │ Queries │ Avg $/q  │ Total $   │ Avg ms   │
# ├─────────────────────┼─────────┼──────────┼───────────┼──────────┤
# │ llama3.2-3b (local) │     847 │   $0.000 │    $0.000 │      120 │
# │ qwen2.5-7b (local)  │     203 │   $0.000 │    $0.000 │      280 │
# │ claude-sonnet-4     │     112 │   $0.003 │    $0.336 │      890 │
# │ gpt-4.1             │      38 │   $0.002 │    $0.076 │      650 │
# └─────────────────────┴─────────┴──────────┴───────────┴──────────┘
# Total: 1,200 queries | $0.412 spent | 87.5% local routing rate
# Estimated savings vs cloud-only: $3.18 (88.6%)
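
The summary lines follow from the table rows. The sketch below reproduces the arithmetic, treating the table's average cost per query as exact. How the tool computes the cloud-only baseline is not stated; assuming every query were billed at the claude-sonnet-4 rate of $0.003 per query, the savings come to about $3.19 (88.6%), close to the $3.18 shown.

```python
# (model, queries, avg cost per query in USD, is_local), copied from the table above
rows = [
    ("llama3.2-3b", 847, 0.000, True),
    ("qwen2.5-7b", 203, 0.000, True),
    ("claude-sonnet-4", 112, 0.003, False),
    ("gpt-4.1", 38, 0.002, False),
]

total_queries = sum(q for _, q, _, _ in rows)
total_cost = sum(q * cost for _, q, cost, _ in rows)
local_rate = sum(q for _, q, _, is_local in rows if is_local) / total_queries

# Assumed baseline: every query billed at $0.003 (the claude-sonnet-4 rate).
baseline = total_queries * 0.003
savings = baseline - total_cost

print(f"Total: {total_queries:,} queries | ${total_cost:.3f} spent | {local_rate:.1%} local")
print(f"Savings vs assumed baseline: ${savings:.3f} ({savings / baseline:.1%})")
```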

Programmatic Usage

from local_llm_router import Router

router = Router.from_config("config.yaml")

# Simple routing
result = router.route("Summarize this text: ...")
print(result.model)       # "llama3.2-3b"
print(result.local)       # True
print(result.complexity)  # 0.32
print(result.latency_ms)  # 145

# Force a specific model
result = router.route("Write a sorting algorithm", force_model="gpt-4.1")

# Get routing stats
stats = router.stats(period="24h")
print(f"Local routing rate: {stats.local_rate:.1%}")
print(f"Total cost: ${stats.total_cost:.3f}")

Requirements

Python 3.10+
Ollama (for local models) — or any OpenAI-compatible local server
One cloud API key (optional — fully local mode works without any cloud access)

Installation

# Basic (local-only mode)
pip install local-llm-router

# With cloud support (quotes keep shells like zsh from expanding the brackets)
pip install "local-llm-router[cloud]"

# Development
git clone https://github.com/XidaoApi/local-llm-router.git
cd local-llm-router
pip install -e ".[dev]"

How It Works

┌──────────┐     ┌──────────────┐     ┌────────────────────┐
│  Prompt  │────▶│  Complexity  │────▶│  Routing Decision  │
│  Input   │     │   Scorer     │     │  (threshold check) │
└──────────┘     └──────────────┘     └─────────┬──────────┘
                                                │
                           ┌────────────────────┼────────────────────┐
                           │                    │                    │
                           ▼                    ▼                    ▼
                    ┌─────────────┐     ┌──────────────┐    ┌──────────────┐
                    │ Local Model │     │ Cloud Model  │    │  Fallback    │
                    │ (Ollama)    │     │ (API)        │    │  Chain       │
                    └─────────────┘     └──────────────┘    └──────────────┘
                           │                    │                    │
                           ▼                    ▼                    ▼
                    ┌─────────────────────────────────────────────────────┐
                    │              Response + Cost/Latency Log            │
                    └─────────────────────────────────────────────────────┘
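
The flow in the diagram can be sketched in a few lines. The code below is an illustration under stated assumptions, not the library's implementation: score and call_model are injected callables standing in for the scorer and the model clients, and the candidate order (cloud only above the threshold, local first with cloud as fallback otherwise) is one plausible reading of complexity_threshold and fallback_chain.

```python
import time
from typing import Callable, Sequence


def route(
    prompt: str,
    score: Callable[[str], float],
    call_model: Callable[[str, str], str],
    local_models: Sequence[str],
    cloud_models: Sequence[str],
    threshold: float = 0.6,
) -> dict:
    """Score the prompt, pick a tier, and try models in order until one succeeds."""
    complexity = score(prompt)
    # Above the threshold go straight to cloud; otherwise try local first
    # and keep the cloud models as a fallback.
    if complexity > threshold:
        candidates = list(cloud_models)
    else:
        candidates = list(local_models) + list(cloud_models)

    errors = []
    for model in candidates:
        start = time.perf_counter()
        try:
            text = call_model(model, prompt)
        except Exception as exc:  # any provider failure moves on to the next model
            errors.append(f"{model}: {exc}")
            continue
        return {
            "model": model,
            "local": model in local_models,
            "complexity": complexity,
            "latency_ms": round((time.perf_counter() - start) * 1000),
            "text": text,
        }
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

The returned keys mirror the result fields shown under Programmatic Usage (model, local, complexity, latency_ms). A real router would also apply timeout_ms and retry_attempts and write the log entry, which are left out here for brevity.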

Related Projects

Project	Description
llm-cost-calculator	Compare LLM pricing across 50+ providers
llm-api-bench	Benchmark LLM providers on latency and cost
llm-failover-router-demo	Production failover with circuit breakers
xidao-cookbook	Routing recipes and migration guides

License

MIT — use it however you want.
