local-llm-router

License: MIT | Python 3.10+

Route prompts between local and cloud LLMs based on task complexity.

Use local models (Ollama, llama.cpp, vLLM) for simple tasks — save money and keep data private. Automatically escalate to cloud APIs (OpenAI, Claude, Gemini, XiDao) when the prompt needs frontier reasoning.

Why?

Local models are a practical option in 2026, but not every task needs a 70B+ model, and not every task can be handled by a 7B model. local-llm-router sends each prompt to the cheapest tier that is likely to handle it. The intended split is roughly:

  • 80% of prompts (summarization, extraction, simple Q&A) → local model (free, private, fast)
  • 20% of prompts (complex reasoning, code generation, multi-step planning) → cloud API (capable, expensive)

Features

  • 🔀 Smart routing — classify prompt complexity before sending to a model
  • 🏠 Local-first — defaults to Ollama/llama.cpp, falls back to cloud only when needed
  • 💰 Cost tracking — logs every routing decision with latency and estimated cost
  • Low latency — complexity scoring adds <50ms overhead
  • 📊 Dashboard output — JSON logs compatible with Grafana, Prometheus, or simple CLI stats
  • 🔧 YAML config — define your models, thresholds, and routing rules in one file
  • 🔄 OpenAI-compatible — works with any provider that exposes /v1/chat/completions

Quick Start

pip install local-llm-router

# Create a default config
llm-router init

# Route a single prompt
llm-router route "Summarize this article" --file article.txt

# Run as a proxy server (OpenAI-compatible endpoint)
llm-router serve --port 8080

# View routing stats
llm-router stats --last 24h

Configuration

# config.yaml
models:
  local:
    - name: llama3.2-3b
      provider: ollama
      endpoint: http://localhost:11434
      max_tokens: 4096
      cost_per_1k: 0.0
      
    - name: qwen2.5-7b
      provider: ollama
      endpoint: http://localhost:11434
      max_tokens: 8192
      cost_per_1k: 0.0

  cloud:
    - name: claude-sonnet-4-20250514
      provider: openai-compatible
      endpoint: https://api.xidao.online
      api_key: ${XIDAO_API_KEY}
      max_tokens: 8192
      cost_per_1k: 0.003
      
    - name: gpt-4.1
      provider: openai-compatible
      endpoint: https://api.openai.com
      api_key: ${OPENAI_API_KEY}
      max_tokens: 16384
      cost_per_1k: 0.002

routing:
  complexity_threshold: 0.6  # Above this → cloud
  scorer: keyword-and-length  # Options: keyword-and-length, classifier, llm-judge
  fallback_chain:
    - qwen2.5-7b
    - claude-sonnet-4-20250514
    - gpt-4.1
  timeout_ms: 30000
  retry_attempts: 2

logging:
  format: json
  output: ./logs/routing.jsonl
  log_prompts: false  # Privacy: don't log prompt content by default
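
The fallback_chain and retry_attempts settings above suggest a simple loop: try each model in order, retry a few times, then move on. The following is a minimal sketch of that idea, not the project's actual implementation. The call_with_fallback function, the call_model callable, and the reading of retry_attempts as "retries after the first try" are all assumptions made for illustration.

```python
import time
from typing import Callable, Sequence, Tuple


class AllModelsFailed(RuntimeError):
    """Raised when every model in the chain has exhausted its attempts."""


def call_with_fallback(
    prompt: str,
    chain: Sequence[str],
    call_model: Callable[[str, str], str],
    retry_attempts: int = 2,
    backoff_s: float = 0.0,
) -> Tuple[str, str]:
    """Return (model_name, response) from the first model that succeeds.

    call_model is a hypothetical callable (model_name, prompt) -> text
    that raises an exception on timeout or provider error.
    """
    errors = []
    for model in chain:
        # Assumption: retry_attempts counts retries after the first try.
        for attempt in range(1 + retry_attempts):
            try:
                return model, call_model(model, prompt)
            except Exception as exc:  # a real router would catch narrower types
                errors.append(f"{model} attempt {attempt + 1}: {exc}")
                if backoff_s:
                    time.sleep(backoff_s * (attempt + 1))
    raise AllModelsFailed("; ".join(errors))
```

With the chain from the config above, a local model that keeps timing out would be tried three times before the request moves on to the first cloud model.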

Routing Strategies

Keyword + Length (default, fastest)

routing:
  scorer: keyword-and-length
  complexity_rules:
    high_complexity_keywords:
      - "analyze"
      - "compare and contrast"
      - "write code"
      - "debug"
      - "multi-step"
      - "chain of thought"
    low_complexity_keywords:
      - "summarize"
      - "extract"
      - "list"
      - "translate"
      - "format"
    length_thresholds:
      short: 200    # tokens — likely simple
      long: 2000    # tokens — likely complex
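
To make the rules above concrete, here is one way a keyword-and-length scorer could combine them into a single score between 0 and 1. This is a sketch under stated assumptions: the weights (0.2 per keyword, 0.1 and 0.2 for length), the neutral starting value of 0.5, and the four-characters-per-token estimate are invented for illustration and are not taken from the project.

```python
HIGH = ("analyze", "compare and contrast", "write code", "debug",
        "multi-step", "chain of thought")
LOW = ("summarize", "extract", "list", "translate", "format")


def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (rule of thumb, not a tokenizer)."""
    return max(1, len(text) // 4)


def score_complexity(prompt: str, short: int = 200, long: int = 2000) -> float:
    """Return a score in [0, 1]; higher means more likely to need a cloud model."""
    text = prompt.lower()
    score = 0.5  # neutral starting point (assumed)
    # Plain substring matching: simple, but "list" also matches "listen".
    score += 0.2 * sum(kw in text for kw in HIGH)
    score -= 0.2 * sum(kw in text for kw in LOW)
    tokens = estimate_tokens(prompt)
    if tokens <= short:
        score -= 0.1  # short prompts are likely simple
    elif tokens >= long:
        score += 0.2  # long prompts are likely complex
    return min(1.0, max(0.0, score))


if __name__ == "__main__":
    for p in ("Summarize this article",
              "Debug and analyze this multi-step pipeline"):
        print(f"{score_complexity(p):.2f}  {p}")
```

With the default complexity_threshold of 0.6, the first prompt would stay local and the second would go to the cloud.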

Classifier (accurate, small overhead)

Uses a tiny classifier model (runs locally) to score prompt complexity:

routing:
  scorer: classifier
  classifier_model: local-complexity-v1  # Ships built-in, ~10MB

LLM Judge (most accurate, adds latency)

Asks a small local model to judge complexity before routing:

routing:
  scorer: llm-judge
  judge_model: llama3.2-3b
  judge_prompt: "Rate this task complexity 1-10. Task: {prompt}"
  threshold: 6
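
A judge model replies in free text, so the router has to extract a number before it can compare against the threshold. The sketch below shows one way to do that. The ask_judge callable stands in for a call to the local judge model and is hypothetical, as are the choices to treat ratings strictly above the threshold as cloud-bound and to escalate when no rating can be parsed.

```python
import re
from typing import Callable, Optional

JUDGE_PROMPT = "Rate this task complexity 1-10. Task: {prompt}"


def parse_rating(reply: str) -> Optional[int]:
    """Pull the first standalone integer from 1 to 10 out of the judge's reply."""
    match = re.search(r"\b(10|[1-9])\b", reply)
    return int(match.group(1)) if match else None


def needs_cloud(prompt: str, ask_judge: Callable[[str], str], threshold: int = 6) -> bool:
    """Return True when the judged complexity is above the threshold."""
    reply = ask_judge(JUDGE_PROMPT.format(prompt=prompt))
    rating = parse_rating(reply)
    if rating is None:
        return True  # assumed policy: escalate when the reply is unparseable
    return rating > threshold  # assumption: strictly above the threshold goes to cloud
```

Small judge models often wrap the number in extra words ("Rating: 8/10"), which is why the parser searches for a number instead of calling int() on the whole reply.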

Proxy Server Mode

Run as an OpenAI-compatible proxy — point your existing code at it:

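llm-router serve --port 8080

# Your existing code works unchanged; only base_url points at the router
from openai import OpenAI

# The OpenAI client requires an api_key (or OPENAI_API_KEY in the environment).
# The placeholder assumes the local proxy does not validate it.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="auto",  # Router picks local or cloud
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response.choices[0].message.content)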
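
Because the proxy speaks the OpenAI chat completions format, it can also be called without the OpenAI SDK. The sketch below builds the request with the standard library only. The helper names build_chat_request and ask are made up for this example, and it assumes the proxy accepts unauthenticated requests on localhost.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str, base_url: str = "http://localhost:8080/v1"
) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint."""
    payload = {
        "model": "auto",  # let the router pick local or cloud
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout_s: float = 30.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling ask("Explain quantum computing") requires the server from llm-router serve to be running on port 8080.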

Cost Dashboard

# CLI summary
llm-router stats --last 7d

# Output:
# ┌─────────────────────┬─────────┬──────────┬───────────┬──────────┐
# │ Model               │ Queries │ Avg $/q  │ Total $   │ Avg ms   │
# ├─────────────────────┼─────────┼──────────┼───────────┼──────────┤
# │ llama3.2-3b (local) │     847 │   $0.000 │    $0.000 │      120 │
# │ qwen2.5-7b (local)  │     203 │   $0.000 │    $0.000 │      280 │
# │ claude-sonnet-4     │     112 │   $0.003 │    $0.336 │      890 │
# │ gpt-4.1             │      38 │   $0.002 │    $0.076 │      650 │
# └─────────────────────┴─────────┴──────────┴───────────┴──────────┘
# Total: 1,200 queries | $0.412 spent | 87.5% local routing rate
# Estimated savings vs cloud-only: $3.19 (88.6%)
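
The totals in the sample follow from the per-model rows: 112 × $0.003 + 38 × $0.002 = $0.412, and 1,050 of 1,200 queries handled locally is 87.5%. The sketch below shows how the same figures could be computed from the JSONL log. The record field names (model, local, cost_usd, latency_ms) are assumptions for this example. The actual log schema is not documented here.

```python
import json
from collections import defaultdict


def summarize(jsonl_lines):
    """Aggregate query counts, cost, and latency from JSONL routing records.

    Field names (model, local, cost_usd, latency_ms) are assumed.
    """
    per_model = defaultdict(lambda: {"queries": 0, "cost": 0.0, "latency": 0.0})
    local = 0
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        rec = json.loads(line)
        row = per_model[rec["model"]]
        row["queries"] += 1
        row["cost"] += rec["cost_usd"]
        row["latency"] += rec["latency_ms"]
        local += bool(rec["local"])
    total = sum(r["queries"] for r in per_model.values())
    return {
        "total_queries": total,
        "total_cost": sum(r["cost"] for r in per_model.values()),
        "local_rate": local / total if total else 0.0,
        "avg_latency_ms": {
            m: r["latency"] / r["queries"] for m, r in per_model.items()
        },
    }
```

Passing an open file handle for ./logs/routing.jsonl works, since iterating a file yields one line at a time.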

Programmatic Usage

from local_llm_router import Router

router = Router.from_config("config.yaml")

# Simple routing
result = router.route("Summarize this text: ...")
print(result.model)       # "llama3.2-3b"
print(result.local)       # True
print(result.complexity)  # 0.32
print(result.latency_ms)  # 145

# Force a specific model
result = router.route("Write a sorting algorithm", force_model="gpt-4.1")

# Get routing stats
stats = router.stats(period="24h")
print(f"Local routing rate: {stats.local_rate:.1%}")
print(f"Total cost: ${stats.total_cost:.3f}")

Requirements

  • Python 3.10+
  • Ollama (for local models) — or any OpenAI-compatible local server
  • One cloud API key (optional — fully local mode works without any cloud access)

Installation

# Basic (local-only mode)
pip install local-llm-router

# With cloud support
pip install "local-llm-router[cloud]"

# Development
git clone https://github.com/XidaoApi/local-llm-router.git
cd local-llm-router
pip install -e ".[dev]"

How It Works

┌──────────┐     ┌──────────────┐     ┌───────────────────┐
│  Prompt  │────▶│  Complexity  │────▶│  Routing Decision  │
│  Input   │     │   Scorer     │     │  (threshold check) │
└──────────┘     └──────────────┘     └─────────┬─────────┘
                                                │
                           ┌────────────────────┼────────────────────┐
                           │                    │                    │
                           ▼                    ▼                    ▼
                    ┌─────────────┐     ┌──────────────┐    ┌──────────────┐
                    │ Local Model │     │ Cloud Model  │    │  Fallback    │
                    │ (Ollama)    │     │ (API)        │    │  Chain       │
                    └─────────────┘     └──────────────┘    └──────────────┘
                           │                    │                    │
                           ▼                    ▼                    ▼
                    ┌─────────────────────────────────────────────────────┐
                    │              Response + Cost/Latency Log            │
                    └─────────────────────────────────────────────────────┘
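
The flow in the diagram can be sketched in a few lines: score the prompt, compare against the threshold, call the chosen tier, fall back if it fails, and emit a log record. This is an illustrative outline, not the project's code. The route_once function, its callable parameters, the record fields, and the policy of falling back from local to cloud only are all assumptions.

```python
import time
from typing import Callable, Dict, Tuple


def route_once(
    prompt: str,
    scorer: Callable[[str], float],
    call_local: Callable[[str], str],
    call_cloud: Callable[[str], str],
    threshold: float = 0.6,
) -> Tuple[str, Dict]:
    """Score the prompt, pick a tier, call it, and build a log record."""
    complexity = scorer(prompt)
    use_cloud = complexity > threshold  # above the threshold goes to cloud
    # Assumed fallback policy: a failed local call escalates to cloud.
    chain = [("cloud", call_cloud)]
    if not use_cloud:
        chain.insert(0, ("local", call_local))

    start = time.perf_counter()
    for tier, call in chain:
        try:
            response = call(prompt)
            break
        except Exception:
            continue  # try the next tier in the chain
    else:
        raise RuntimeError("all tiers failed")

    record = {
        "tier": tier,
        "complexity": round(complexity, 2),
        "latency_ms": round((time.perf_counter() - start) * 1000),
        # Prompt text is left out on purpose, matching log_prompts: false.
    }
    return response, record
```

Each record can be written as one line of JSON with json.dumps(record), which yields the JSONL format described in the logging section.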

Related Projects

Project                     Description
llm-cost-calculator         Compare LLM pricing across 50+ providers
llm-api-bench               Benchmark LLM providers on latency and cost
llm-failover-router-demo    Production failover with circuit breakers
xidao-cookbook              Routing recipes and migration guides

License

MIT — use it however you want.