fix(benchmarks): classify Gemini monthly spend-cap as terminal, fail-fast bundle runner by Davincc77 · Pull Request #92 · Davincc77/klickdskill

Davincc77 · 2026-05-29T16:51:58Z

Summary

Diagnosis: workflow run 26642239431 produced 230 errors and 0 raw outputs because every Gemini call returned 429 RESOURCE_EXHAUSTED with Your project has exceeded its monthly spending cap. Please go to AI Studio at https://ai.studio/spend.... The harness's classifier treated that as transient (it matched both the 429 status and the RESOURCE_EXHAUSTED token), so each of the 230 items was retried 6 times — burning the full 120-min job on a condition only a human can fix.
Fix the classifier so messages containing monthly spending cap, spend cap, billing hard cap, ai.studio/spend, or equivalent are classified terminal, never transient — even when the 429/RESOURCE_EXHAUSTED token is also present.
Add a new TerminalProviderError exception class (subclass of ProviderError, not of TransientProviderError) plus is_terminal_billing_error(). Gemini adapter raises TerminalProviderError for the spend-cap path.
Executor _classify() returns "terminal_billing"; _call_with_retry calls the provider exactly once for that class (no backoff burn).
Bundle executor (executor_b_bundles.py) trips an abort flag on the first terminal_billing final-class result, stops dispatching new batches, records remaining work as skipped_terminal_abort, and stamps abort_reason / aborted: true into both metrics_summary.json and run_manifest.json. With the bug, a 1800-call run with retry_max=6 made up to 12,600 provider calls; with the fix it makes at most batch_size calls before aborting.
Genuine transient 429 / rate-limit / 5xx behaviour is preserved (no spend-cap text → still retries with backoff as before).
No completed benchmark results are altered. No publish / tag / release / site / npm / PyPI / Zenodo touched.

Files changed

benchmarks/v4.1/providers/base.py — TerminalProviderError, _TERMINAL_BILLING_TOKENS, is_terminal_billing_error(); is_transient_error() short-circuits on the terminal signal.
benchmarks/v4.1/providers/__init__.py — export the new symbols.
benchmarks/v4.1/providers/gemini_adapter.py — wrap spend-cap SDK exceptions as TerminalProviderError; transient + timeout paths unchanged.
benchmarks/v4.1/runner/executor.py — _classify() returns "terminal_billing"; the retry loop's existing if cls != "transient": return None short-circuit naturally applies.
benchmarks/v4.1/runner/executor_b_bundles.py — abort_reason flag set on first terminal_billing; _do_call short-circuits subsequent in-batch calls into skipped_terminal_abort; batch loop breaks; summary + manifest record aborted and abort_reason.
benchmarks/v4.1/tests/test_terminal_billing.py (NEW, 15 tests).

Testing

pytest benchmarks/v4.1/tests/ → 110 passed locally (15 new + 95 existing, no regressions).
New tests pin:
- 5 spend-cap message variants are classified terminal, not transient.
- Genuine 429 rate-limit messages (no cap text) remain transient.
- _call_with_retry calls the provider exactly 1 time on spend-cap with retry_max=8.
- Gemini adapter wraps spend-cap as TerminalProviderError (with a fake client — no network).
- Gemini adapter still wraps 503 UNAVAILABLE as TransientProviderError.
- Bundle executor with retry_max=6 over 36 dispatched calls hits the provider ≤ batch_size times (vs the buggy 252 worst-case) and records aborted: true plus abort_reason in summary + manifest.
- Bundle executor does not abort on genuine transient 429 — full retry budget is exhausted as before.

Test plan

pytest benchmarks/v4.1/tests/ -v passes locally (110/110).
Reviewer to confirm no completed benchmark artefacts are altered (none staged in this diff).
Reviewer to confirm the patch does not change behaviour for mock / non-Gemini providers.

🤖 Generated with Claude Code

…fast Run 26642239431 produced 230 errors and 0 raw outputs because every Gemini call returned 429 RESOURCE_EXHAUSTED with the message "Your project has exceeded its monthly spending cap. Please go to AI Studio at https://ai.studio/spend...". The harness misclassified this as a transient 429, retried each item 6 times, and burned the full 120-min job on a guaranteed-failing run. This patch: - Adds is_terminal_billing_error() + TerminalProviderError to providers.base; the classifier checks the spend-cap signal BEFORE the transient signal so a 429-with-cap message becomes non-retryable. - Gemini adapter wraps spend-cap SDK exceptions as TerminalProviderError. - Executor _classify() exposes a "terminal_billing" class; the retry loop calls the provider exactly once for it (no backoff burn). - Bundle executor trips an abort flag on the first terminal_billing result, stops dispatching new batches, records remaining work as skipped_terminal_abort, and stamps abort_reason into both summary and manifest. - Genuine transient 429 rate-limit messages (no cap text) keep their existing retry behaviour. - 15 new tests in tests/test_terminal_billing.py cover classifier variants, retry-loop call count, adapter wrapping (no network), and the bundle executor's fail-fast + don't-abort-on-transient paths. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Davincc77 merged commit 82c1c26 into main May 29, 2026
3 checks passed

Davincc77 deleted the fix/bench-terminal-billing-classification branch May 29, 2026 17:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(benchmarks): classify Gemini monthly spend-cap as terminal, fail-fast bundle runner#92

fix(benchmarks): classify Gemini monthly spend-cap as terminal, fail-fast bundle runner#92
Davincc77 merged 1 commit into
mainfrom
fix/bench-terminal-billing-classification

Davincc77 commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Davincc77 commented May 29, 2026

Summary

Files changed

Testing

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants