Commit 6d8e5a3
feat: expand model catalog + correct free-tier rate limits
Verified every slot against each provider's live API (catalog + real
inference + x-ratelimit-* response headers) on 2026-04-23. Numbers come
from the provider's own response or published rate-limit tables, not
from third-party write-ups.
Net change: +2 slots, 1 swap. 17 total slots vs the previous 15.
Additions:
- groq/openai/gpt-oss-20b (chat). At max_tokens=100 Groq returns both
content and reasoning populated, so the model is reasoning-style but
still usable. Limits from live headers: 30 RPM / 1K RPD / 8K TPM /
200K TPD. Kept out of the summarizer group because its
chain-of-thought would eat a tight max_tokens budget before emitting
content.
- sambanova/DeepSeek-V3.2 (chat, merge). Live probe returns content=
"ok", reasoning field empty — not a reasoning model. Shares the same
free-tier cap as the other SambaNova slots.
Swap:
- openrouter/nousresearch/hermes-3-llama-3.1-405b:free ->
qwen/qwen3-next-80b-a3b-instruct:free. hermes-3 was producing repeated
429s in the daily check without a behavioural advantage. qwen3-next
is the Instruct variant (not Thinking), 262K native context, RULER
long-context 91.8%.
Rejected / did not add:
- cerebras/gpt-oss-120b and cerebras/zai-glm-4.7. Both appear in the
/v1/models catalog but every one of our 3 Cerebras keys returns HTTP
404 "Model ... does not exist or you do not have access to it". They
look to be gated to paid plans despite the public blog post.
- openrouter/nvidia/nemotron-nano-12b-v2-vl:free. Live probe at
max_tokens=100 returned message.content=null with the whole output
in the `reasoning` field — our stream manager only surfaces content,
so in prod this would behave as an empty-reply model. Worth
revisiting once we add reasoning-field plumbing.
- gemini/gemini-2.5-pro. Not free-tier on the Developer API; adding it
would start real billing on a project that currently runs fully
free-tier.
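
The content-vs-reasoning probes above can be sketched as a small classifier. This is a minimal illustration, not the project's actual probe: it assumes an OpenAI-style response body (`choices[0].message`) with the optional `reasoning` field mentioned above, and `classify_reply` is a hypothetical helper name.

```python
from typing import Any, Mapping


def classify_reply(response: Mapping[str, Any]) -> str:
    """Label a chat-completion response by which message fields carry text.

    Assumes an OpenAI-style body: response["choices"][0]["message"].
    """
    message = response["choices"][0]["message"]
    # Treat None, missing, and whitespace-only values as "no text".
    has_content = bool((message.get("content") or "").strip())
    has_reasoning = bool((message.get("reasoning") or "").strip())
    if has_content and has_reasoning:
        return "content+reasoning"
    if has_content:
        return "content-only"
    if has_reasoning:
        return "reasoning-only"
    return "empty"
```

Under these assumptions, a "reasoning-only" result corresponds to the nemotron case (content is null, so a content-only stream manager would surface an empty reply), while "content+reasoning" corresponds to gpt-oss-20b and "content-only" to DeepSeek-V3.2.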
Rate-limit corrections (all verified via real API response headers on
each provider's endpoint or the provider's own published limits, not
from third-party docs):
- Groq: per-model rpd/tpm/tpd now match the live
x-ratelimit-limit-requests / tokens headers. Previously every model
had rpd=14400 except gpt-oss-120b at 1000; only llama-3.1-8b-instant
actually has 14.4K RPD, the rest are 1K.
- Cerebras: qwen-3-235b and llama3.1-8b confirmed at 30 RPM / 14.4K
RPD / 60K TPM / 1M TPD from live headers.
- SambaNova: rpd dropped from 1000 to 20 per both docs and the live
X-Ratelimit-Limit-Requests-Day header. Previous values were 50x the
real free-tier cap — SmartRouter was seeing phantom budget.
- Gemini: 2.5 Flash dropped from rpd=1000 to 250 per post-Dec-2025
quota cut; 2.5 Flash-Lite stays at 1000. Note: the failure mode we
see most on Gemini is upstream 503 "high demand", which rate-limit
config can't prevent — that's a routing / score-weight concern.
- OpenRouter: confirmed free-tier via /auth/key response
("is_free_tier": true, no credits purchased). rpd dropped from 200
to 50 per the documented floor. If we later buy $10+ of credits,
bump rpd to 1000 here (the only change needed).
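
The header comparison behind these corrections can be sketched as follows. This is an illustration under stated assumptions, not the check that was actually run: `HEADER_FOR_LIMIT` and `limit_drift` are hypothetical names, the header-to-limit mapping (requests header as RPD, tokens header as TPM) is assumed and differs per provider, and header values are assumed to be plain integers.

```python
from typing import Dict, Mapping, Tuple

# Hypothetical mapping from config keys to response-header names.
# Assumption: the requests header reports RPD and the tokens header TPM.
HEADER_FOR_LIMIT = {
    "rpd": "x-ratelimit-limit-requests",
    "tpm": "x-ratelimit-limit-tokens",
}


def limit_drift(
    configured: Mapping[str, int], headers: Mapping[str, str]
) -> Dict[str, Tuple[int, int]]:
    """Return {limit: (configured, live)} for every limit that disagrees."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    drift: Dict[str, Tuple[int, int]] = {}
    for key, header in HEADER_FOR_LIMIT.items():
        raw = lowered.get(header)
        if raw is None or key not in configured:
            continue  # nothing to compare for this limit
        live = int(raw)
        if live != configured[key]:
            drift[key] = (configured[key], live)
    return drift
```

For example, a slot configured with rpd=14400 whose response carries a requests limit of 1000 would be reported as `{"rpd": (14400, 1000)}`, which is the kind of mismatch corrected for most Groq models here.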
Unit tests: 256 passed. Backend change, so staging validation still
required before merge.
1 parent 4912ebc commit 6d8e5a3
1 file changed
Lines changed: 210 additions & 68 deletions