Skip to content

fix(benchmarks): classify Gemini monthly spend-cap as terminal, fail-fast bundle runner#92

Merged
Davincc77 merged 1 commit into
mainfrom
fix/bench-terminal-billing-classification
May 29, 2026
Merged

fix(benchmarks): classify Gemini monthly spend-cap as terminal, fail-fast bundle runner#92
Davincc77 merged 1 commit into
mainfrom
fix/bench-terminal-billing-classification

Conversation

@Davincc77

Copy link
Copy Markdown
Owner

Summary

  • Diagnosis: workflow run 26642239431 produced 230 errors and 0 raw outputs because every Gemini call returned 429 RESOURCE_EXHAUSTED with Your project has exceeded its monthly spending cap. Please go to AI Studio at https://ai.studio/spend.... The harness's classifier treated that as transient (it matched both the 429 status and the RESOURCE_EXHAUSTED token), so each of the 230 items was retried 6 times — burning the full 120-min job on a condition only a human can fix.
  • Fix the classifier so messages containing monthly spending cap, spend cap, billing hard cap, ai.studio/spend, or equivalent are classified terminal, never transient — even when the 429/RESOURCE_EXHAUSTED token is also present.
  • Add a new TerminalProviderError exception class (subclass of ProviderError, not of TransientProviderError) plus is_terminal_billing_error(). Gemini adapter raises TerminalProviderError for the spend-cap path.
  • Executor _classify() returns "terminal_billing"; _call_with_retry calls the provider exactly once for that class (no backoff burn).
  • Bundle executor (executor_b_bundles.py) trips an abort flag on the first terminal_billing final-class result, stops dispatching new batches, records remaining work as skipped_terminal_abort, and stamps abort_reason / aborted: true into both metrics_summary.json and run_manifest.json. With the bug, a 1800-call run with retry_max=6 made up to 12,600 provider calls; with the fix it makes at most batch_size calls before aborting.
  • Genuine transient 429 / rate-limit / 5xx behaviour is preserved (no spend-cap text → still retries with backoff as before).
  • No completed benchmark results are altered. No publish / tag / release / site / npm / PyPI / Zenodo touched.

Files changed

  • benchmarks/v4.1/providers/base.pyTerminalProviderError, _TERMINAL_BILLING_TOKENS, is_terminal_billing_error(); is_transient_error() short-circuits on the terminal signal.
  • benchmarks/v4.1/providers/__init__.py — export the new symbols.
  • benchmarks/v4.1/providers/gemini_adapter.py — wrap spend-cap SDK exceptions as TerminalProviderError; transient + timeout paths unchanged.
  • benchmarks/v4.1/runner/executor.py_classify() returns "terminal_billing"; the retry loop's existing if cls != "transient": return None short-circuit naturally applies.
  • benchmarks/v4.1/runner/executor_b_bundles.py — abort_reason flag set on first terminal_billing; _do_call short-circuits subsequent in-batch calls into skipped_terminal_abort; batch loop breaks; summary + manifest record aborted and abort_reason.
  • benchmarks/v4.1/tests/test_terminal_billing.py (NEW, 15 tests).

Testing

  • pytest benchmarks/v4.1/tests/110 passed locally (15 new + 95 existing, no regressions).
  • New tests pin:
    • 5 spend-cap message variants are classified terminal, not transient.
    • Genuine 429 rate-limit messages (no cap text) remain transient.
    • _call_with_retry calls the provider exactly 1 time on spend-cap with retry_max=8.
    • Gemini adapter wraps spend-cap as TerminalProviderError (with a fake client — no network).
    • Gemini adapter still wraps 503 UNAVAILABLE as TransientProviderError.
    • Bundle executor with retry_max=6 over 36 dispatched calls hits the provider ≤ batch_size times (vs the buggy 252 worst-case) and records aborted: true plus abort_reason in summary + manifest.
    • Bundle executor does not abort on genuine transient 429 — full retry budget is exhausted as before.

Test plan

  • pytest benchmarks/v4.1/tests/ -v passes locally (110/110).
  • Reviewer to confirm no completed benchmark artefacts are altered (none staged in this diff).
  • Reviewer to confirm the patch does not change behaviour for mock / non-Gemini providers.

🤖 Generated with Claude Code

…fast

Run 26642239431 produced 230 errors and 0 raw outputs because every
Gemini call returned 429 RESOURCE_EXHAUSTED with the message
"Your project has exceeded its monthly spending cap. Please go to AI
Studio at https://ai.studio/spend...". The harness misclassified this
as a transient 429, retried each item 6 times, and burned the full
120-min job on a guaranteed-failing run.

This patch:
- Adds is_terminal_billing_error() + TerminalProviderError to
  providers.base; the classifier checks the spend-cap signal BEFORE
  the transient signal so a 429-with-cap message becomes non-retryable.
- Gemini adapter wraps spend-cap SDK exceptions as TerminalProviderError.
- Executor _classify() exposes a "terminal_billing" class; the retry
  loop calls the provider exactly once for it (no backoff burn).
- Bundle executor trips an abort flag on the first terminal_billing
  result, stops dispatching new batches, records remaining work as
  skipped_terminal_abort, and stamps abort_reason into both summary
  and manifest.
- Genuine transient 429 rate-limit messages (no cap text) keep their
  existing retry behaviour.
- 15 new tests in tests/test_terminal_billing.py cover classifier
  variants, retry-loop call count, adapter wrapping (no network), and
  the bundle executor's fail-fast + don't-abort-on-transient paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Davincc77 Davincc77 merged commit 82c1c26 into main May 29, 2026
3 checks passed
@Davincc77 Davincc77 deleted the fix/bench-terminal-billing-classification branch May 29, 2026 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants