Commit 6d8e5a3


feat: expand model catalog + correct free-tier rate limits

Verified every slot against each provider's live API (catalog + real
inference + x-ratelimit-* response headers) on 2026-04-23. Numbers come
from the provider's own response or published rate-limit tables, not
from third-party write-ups.

Net change: +2 slots, 1 swap. 17 total slots vs the previous 15.

Additions:
- groq/openai/gpt-oss-20b (chat). Groq returns content + reasoning both
  set at max_tokens=100; reasoning-style but functional. Limits from
  live headers: 30 RPM / 1K RPD / 8K TPM / 200K TPD. Kept out of the
  summarizer group because its chain-of-thought would eat a tight
  max_tokens budget before emitting content.
- sambanova/DeepSeek-V3.2 (chat, merge). Live probe returns
  content="ok" with the reasoning field empty, so it is not a reasoning
  model. Shares the same free-tier cap as the other SambaNova slots.

Swap:
- openrouter/nousresearch/hermes-3-llama-3.1-405b:free ->
  qwen/qwen3-next-80b-a3b-instruct:free. hermes-3 was producing
  repeated 429s in the daily check without a behavioural advantage.
  qwen3-next is the Instruct variant (not Thinking), 262K native
  context, RULER long-context 91.8%.

Rejected / did not add:
- cerebras/gpt-oss-120b and cerebras/zai-glm-4.7. Both appear in the
  /v1/models catalog but every one of our 3 Cerebras keys returns
  HTTP 404 "Model ... does not exist or you do not have access to it".
  They look to be gated to paid plans despite the public blog post.
- openrouter/nvidia/nemotron-nano-12b-v2-vl:free. Live probe at
  max_tokens=100 returned message.content=null with the whole output in
  the `reasoning` field. Our stream manager only surfaces content, so in
  prod this would behave as an empty-reply model. Worth revisiting once
  we add reasoning-field plumbing.
- gemini/gemini-2.5-pro. Not free-tier on the Developer API; adding it
  would start real billing on a project that currently runs fully
  free-tier.
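The probe outcomes above (content and reasoning both set, content only, or content null with everything in `reasoning`) could be told apart with a small helper along these lines. This is a minimal sketch, not code from llm_client.py: the function name `classify_probe` and the returned labels are hypothetical, and it assumes the probe message is available as a plain dict with `content` and `reasoning` keys.

```python
from typing import Any, Mapping


def classify_probe(message: Mapping[str, Any]) -> str:
    """Label a chat-completion message by which fields carry text.

    Hypothetical helper: assumes `message` is the decoded message object
    of a probe response, with optional `content` and `reasoning` keys.
    """
    # Treat missing, null, and whitespace-only values the same way.
    content = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning") or "").strip()

    if content and reasoning:
        return "content+reasoning"  # e.g. the gpt-oss-20b probe
    if content:
        return "content-only"       # e.g. the DeepSeek-V3.2 probe
    if reasoning:
        return "reasoning-only"     # e.g. the nemotron-nano probe
    return "empty"
```

A slot that classifies as "reasoning-only" would look like an empty reply to any consumer that reads only `content`, which is the reason given above for rejecting it for now.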
Rate-limit corrections (all verified via real API response headers on
each provider's endpoint, not from third-party docs):
- Groq: per-model rpd/tpm/tpd now match the live
  x-ratelimit-limit-requests / tokens headers. Previously every model
  had rpd=14400 except gpt-oss-120b at 1000; only llama-3.1-8b-instant
  actually has 14.4K RPD, the rest are 1K.
- Cerebras: qwen-3-235b and llama3.1-8b confirmed at 30 RPM / 14.4K RPD
  / 60K TPM / 1M TPD from live headers.
- SambaNova: rpd dropped from 1000 to 20 per both the docs and the live
  X-Ratelimit-Limit-Requests-Day header. Previous values were 50x the
  real free-tier cap, so SmartRouter was seeing phantom budget.
- Gemini: 2.5 Flash dropped from rpd=1000 to 250 per the post-Dec-2025
  quota cut; 2.5 Flash-Lite stays at 1000. Note: the failure mode we
  see most on Gemini is upstream 503 "high demand", which rate-limit
  config can't prevent; that's a routing / score-weight concern.
- OpenRouter :free: confirmed free-tier via the /auth/key response
  ("is_free_tier": true, no credits purchased). rpd dropped from 200 to
  50 per the documented floor. If we later buy $10+ of credits, bump
  rpd to 1000 here (the only change needed).

Unit tests: 256 passed. Backend change, so staging validation still
required before merge.
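The checks behind these corrections can be sketched roughly as follows: read a limit from the response headers, compare it with the configured value, and apply the OpenRouter rule quoted above. All three function names are hypothetical and none of this is taken from llm_client.py. The header names are the ones mentioned in this message, and header lookup is assumed to need case-insensitive matching.

```python
from typing import Mapping, Optional


def read_limit(headers: Mapping[str, str], name: str) -> Optional[int]:
    """Return an integer rate-limit header, or None if absent or non-numeric."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    raw = lowered.get(name.lower())
    if raw is None:
        return None
    try:
        return int(raw.strip())
    except ValueError:
        return None


def rpd_drift(configured_rpd: int, live_rpd: Optional[int]) -> Optional[float]:
    """Ratio of configured to live daily cap; None when the live cap is unknown."""
    if not live_rpd:
        return None
    return configured_rpd / live_rpd


def openrouter_rpd(credits_purchased_usd: float) -> int:
    """Daily cap as stated in this message: 50 on free tier, 1000 at $10+ of credits."""
    return 1000 if credits_purchased_usd >= 10 else 50
```

For the SambaNova case, `rpd_drift(1000, read_limit(headers, "x-ratelimit-limit-requests-day"))` with a live header value of 20 gives 50.0, which is the 50x overstatement described above.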

1 parent 4912ebc, commit 6d8e5a3

1 file changed

backend/services
- llm_client.py
