fix(benchmarks): classify Gemini monthly spend-cap as terminal, fail-fast bundle runner#92
Merged
Conversation
…fast Run 26642239431 produced 230 errors and 0 raw outputs because every Gemini call returned 429 RESOURCE_EXHAUSTED with the message "Your project has exceeded its monthly spending cap. Please go to AI Studio at https://ai.studio/spend...". The harness misclassified this as a transient 429, retried each item 6 times, and burned the full 120-min job on a guaranteed-failing run. This patch: - Adds is_terminal_billing_error() + TerminalProviderError to providers.base; the classifier checks the spend-cap signal BEFORE the transient signal so a 429-with-cap message becomes non-retryable. - Gemini adapter wraps spend-cap SDK exceptions as TerminalProviderError. - Executor _classify() exposes a "terminal_billing" class; the retry loop calls the provider exactly once for it (no backoff burn). - Bundle executor trips an abort flag on the first terminal_billing result, stops dispatching new batches, records remaining work as skipped_terminal_abort, and stamps abort_reason into both summary and manifest. - Genuine transient 429 rate-limit messages (no cap text) keep their existing retry behaviour. - 15 new tests in tests/test_terminal_billing.py cover classifier variants, retry-loop call count, adapter wrapping (no network), and the bundle executor's fail-fast + don't-abort-on-transient paths. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
429 RESOURCE_EXHAUSTEDwithYour project has exceeded its monthly spending cap. Please go to AI Studio at https://ai.studio/spend.... The harness's classifier treated that as transient (it matched both the 429 status and theRESOURCE_EXHAUSTEDtoken), so each of the 230 items was retried 6 times — burning the full 120-min job on a condition only a human can fix.monthly spending cap,spend cap,billing hard cap,ai.studio/spend, or equivalent are classified terminal, never transient — even when the 429/RESOURCE_EXHAUSTED token is also present.TerminalProviderErrorexception class (subclass ofProviderError, not ofTransientProviderError) plusis_terminal_billing_error(). Gemini adapter raisesTerminalProviderErrorfor the spend-cap path._classify()returns"terminal_billing";_call_with_retrycalls the provider exactly once for that class (no backoff burn).executor_b_bundles.py) trips an abort flag on the firstterminal_billingfinal-class result, stops dispatching new batches, records remaining work asskipped_terminal_abort, and stampsabort_reason/aborted: trueinto bothmetrics_summary.jsonandrun_manifest.json. With the bug, a 1800-call run withretry_max=6made up to 12,600 provider calls; with the fix it makes at mostbatch_sizecalls before aborting.Files changed
benchmarks/v4.1/providers/base.py—TerminalProviderError,_TERMINAL_BILLING_TOKENS,is_terminal_billing_error();is_transient_error()short-circuits on the terminal signal.benchmarks/v4.1/providers/__init__.py— export the new symbols.benchmarks/v4.1/providers/gemini_adapter.py— wrap spend-cap SDK exceptions asTerminalProviderError; transient + timeout paths unchanged.benchmarks/v4.1/runner/executor.py—_classify()returns"terminal_billing"; the retry loop's existingif cls != "transient": return Noneshort-circuit naturally applies.benchmarks/v4.1/runner/executor_b_bundles.py— abort_reason flag set on firstterminal_billing;_do_callshort-circuits subsequent in-batch calls intoskipped_terminal_abort; batch loop breaks; summary + manifest recordabortedandabort_reason.benchmarks/v4.1/tests/test_terminal_billing.py(NEW, 15 tests).Testing
pytest benchmarks/v4.1/tests/→ 110 passed locally (15 new + 95 existing, no regressions)._call_with_retrycalls the provider exactly 1 time on spend-cap withretry_max=8.TerminalProviderError(with a fake client — no network).503 UNAVAILABLEasTransientProviderError.retry_max=6over 36 dispatched calls hits the provider ≤batch_sizetimes (vs the buggy 252 worst-case) and recordsaborted: trueplusabort_reasonin summary + manifest.Test plan
pytest benchmarks/v4.1/tests/ -vpasses locally (110/110).mock/ non-Gemini providers.🤖 Generated with Claude Code