
fix(benchmarks): robust retry/backoff for transient provider 503s #88

Merged
Davincc77 merged 1 commit into main from fix-benchmark-retry-backoff
May 28, 2026


@Davincc77


Summary

The first real v4.1 Gemini pilot produced 4 transient 503 UNAVAILABLE ("high demand") failures. The existing retry used retry_max=2 with a flat ~1s backoff and no error classification, so transient infrastructure flakes surfaced as noisy benchmark failures.

This PR adds a small, scientifically controlled retry layer before any larger pilot/full runs:

  • Classifier (providers.base.is_transient_error + TransientProviderError) recognises HTTP 429/500/502/503/504, gRPC UNAVAILABLE / RESOURCE_EXHAUSTED / DEADLINE_EXCEEDED, read timeouts, and overloaded/high-demand signatures.
  • Gemini adapter re-raises classified SDK exceptions as TransientProviderError so the executor only retries on conditions where a retry can plausibly succeed.
  • Executor uses exponential backoff with jitter, capped at retry_backoff_max_s. Permanent errors (auth, config, schema) abort immediately and are still recorded — retries do not hide them.
  • Structured logs now include retried_attempts, per-attempt error_class/error_type/sleep_s, final_error_class, and cumulative_retry_delay_s in both raw_outputs.jsonl and errors.jsonl. The run manifest records the full retry configuration.
  • CLI exposes --retry-max (default 5, cap 8), --retry-backoff (default 2s, cap 10s), --retry-backoff-max (default 30s, cap 30s), --retry-jitter (default 0.25).
  • Workflow exposes retry_max, retry_backoff, retry_backoff_max workflow_dispatch inputs, validates and caps them, and surfaces them in the job summary. Low default concurrency is unchanged.
  • Docs: README and BENCHMARK_PROTOCOL document the new retry/backoff defaults and the transient-vs-permanent contract.
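
As an illustration of the transient-vs-permanent contract in the bullets above, here is a minimal sketch of what a string-based classifier could look like. The names `TransientProviderError` and `is_transient_error` come from this PR; the matching patterns, the marker list, and the overall approach are assumptions, since the actual code in `providers.base` is not shown here.

```python
import re


class TransientProviderError(Exception):
    """Raised for failures where a retry can plausibly succeed."""


# ASSUMPTION: these patterns are illustrative, not the PR's real signatures.
# Matching status codes in free text is naive and could misfire on messages
# that happen to contain the same number.
_TRANSIENT_STATUS = re.compile(r"\b(429|500|502|503|504)\b")
_TRANSIENT_MARKERS = (
    "unavailable",
    "resource_exhausted",
    "deadline_exceeded",
    "timed out",
    "timeout",
    "overloaded",
    "high demand",
)


def is_transient_error(exc: BaseException) -> bool:
    """Return True if the exception looks retryable, False otherwise."""
    if isinstance(exc, TransientProviderError):
        return True
    # Inspect both the exception type name and its message, case-insensitively.
    text = f"{type(exc).__name__}: {exc}".lower()
    if _TRANSIENT_STATUS.search(text):
        return True
    return any(marker in text for marker in _TRANSIENT_MARKERS)
```

Under this sketch, `RuntimeError("503 UNAVAILABLE: high demand")` would be classified as transient, while an auth failure such as `PermissionError("invalid API key")` would not and would therefore abort without retrying.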

Concurrency defaults stay low (default 1, workflow cap 2); the retry layer prefers waiting over hammering.
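
To make the backoff schedule concrete, the following sketch computes a capped exponential delay with jitter. The defaults mirror the CLI defaults listed above (2s base, 30s cap, 0.25 jitter); the function name `backoff_delay`, the doubling factor, and the choice of symmetric multiplicative jitter are assumptions, as the PR description does not spell out the exact formula.

```python
import random


def backoff_delay(attempt, base_s=2.0, max_s=30.0, jitter=0.25, rng=random):
    """Delay in seconds before retry number `attempt` (1-based).

    ASSUMPTION: doubling per attempt and symmetric multiplicative jitter;
    the executor in this PR may use a different formula.
    """
    raw = base_s * (2 ** (attempt - 1))
    capped = min(raw, max_s)
    # Scale by a random factor in [1 - jitter, 1 + jitter].
    factor = 1.0 + rng.uniform(-jitter, jitter)
    # Re-apply the cap so jitter never pushes the delay past max_s.
    return min(capped * factor, max_s)
```

With jitter disabled, this sketch yields delays of 2, 4, 8, 16, and 30 seconds for retries 1 through 5, so five retries wait about a minute in total instead of retrying in quick succession.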

Testing

  • python3 -m pytest benchmarks/v4.1/tests/ -q: 66 passed (18 new in test_retry_backoff.py).
  • New tests cover: classifier on 503/429/RESOURCE_EXHAUSTED/timeouts and rejection of auth/config strings; transient-then-success path; persistent-failure surfacing with full trace; no-retry on permanent auth/config errors; backoff cap; jitter bounds; runner caps on --retry-max / --retry-backoff-max; manifest records retry settings.
  • No real LLM calls — all tests inject scripted mock providers.
  • yaml.safe_load validates the updated workflow.
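
The scripted-mock approach described in the testing notes can be sketched as follows. `ScriptedProvider` and `run_with_retry` are hypothetical names made up for this example, and the record layout is an assumption; only the field names (`retried_attempts`, `error_class`, `error_type`, `sleep_s`, `final_error_class`, `cumulative_retry_delay_s`) are taken from the PR description.

```python
class TransientProviderError(Exception):
    """Stand-in for the PR's transient error type."""


class ScriptedProvider:
    """Hypothetical mock that replays a fixed list of outcomes; no network."""

    def __init__(self, script):
        self._script = list(script)
        self.calls = 0

    def generate(self, prompt):
        self.calls += 1
        outcome = self._script.pop(0)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome


def run_with_retry(provider, prompt, retry_max=5,
                   delay_for=lambda attempt: 0.0, sleep=lambda s: None):
    """Retry only transient errors and keep a full per-attempt trace."""
    record = {"output": None, "retried_attempts": 0, "attempts": [],
              "final_error_class": None, "cumulative_retry_delay_s": 0.0}
    for attempt in range(retry_max + 1):
        try:
            record["output"] = provider.generate(prompt)
            record["final_error_class"] = None
            return record
        except Exception as exc:
            transient = isinstance(exc, TransientProviderError)
            error_class = "transient" if transient else "permanent"
            entry = {"error_class": error_class,
                     "error_type": type(exc).__name__, "sleep_s": 0.0}
            record["attempts"].append(entry)
            record["final_error_class"] = error_class
            # Permanent errors and exhausted budgets are surfaced, not hidden.
            if not transient or attempt == retry_max:
                return record
            entry["sleep_s"] = delay_for(attempt + 1)
            sleep(entry["sleep_s"])
            record["cumulative_retry_delay_s"] += entry["sleep_s"]
            record["retried_attempts"] += 1
    return record
```

Injecting `sleep` and `delay_for` keeps such tests fast and deterministic: a script of one transient error followed by a string exercises the transient-then-success path, while a script that starts with an auth-style exception exercises the no-retry path.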

No publish / no tag / no release / no Zenodo / no npm / no PyPI.


🤖 Generated by Computer

…backoff retry

Real Gemini pilot produced 4 transient 503 UNAVAILABLE failures; current
retry_max=2 with a flat ~1s backoff was too tight. Add a small,
scientifically controlled retry layer:

- providers.base: TransientProviderError + is_transient_error classifier
  for HTTP 429/500/502/503/504, gRPC UNAVAILABLE / RESOURCE_EXHAUSTED /
  DEADLINE_EXCEEDED, and read-timeout signatures.
- gemini_adapter: wraps SDK exceptions and re-raises transient ones as
  TransientProviderError so retries only fire on retryable conditions.
- executor: exponential backoff with jitter, capped at retry_backoff_max_s;
  retries only on transient errors; permanent errors abort immediately.
  Log retried_attempts, per-attempt error class/type/sleep, final_error_class,
  and cumulative_retry_delay_s in both raw_outputs.jsonl and errors.jsonl.
- runner CLI: --retry-max default 5 (cap 8), --retry-backoff default 2s
  (cap 10s), --retry-backoff-max default 30s (cap 30s), --retry-jitter 0.25.
- workflow: expose retry_max/retry_backoff/retry_backoff_max inputs with
  the same caps; surface them in the job summary.
- tests: classifier coverage, retry-after-503-then-success, persistent
  failure surfaces with full trace, no retry on permanent auth/config
  errors, runner caps, manifest records retry settings (18 new tests, 66
  total pass; zero network calls in CI).

Failures are still written to errors.jsonl as before — retries do not hide them.
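
The runner caps listed in the commit message could be enforced along these lines. The flag names, defaults, and caps are taken from the PR; the use of `argparse`, the helper name `parse_retry_args`, and the choice to clamp out-of-range values silently (rather than reject them) are assumptions about code that is not shown here.

```python
import argparse

# Caps as stated in the PR description.
RETRY_MAX_CAP = 8
RETRY_BACKOFF_CAP_S = 10.0
RETRY_BACKOFF_MAX_CAP_S = 30.0


def parse_retry_args(argv):
    """Parse retry flags and clamp them to the documented caps.

    ASSUMPTION: the real runner may validate differently, for example by
    exiting with an error instead of clamping.
    """
    parser = argparse.ArgumentParser(prog="runner")
    parser.add_argument("--retry-max", type=int, default=5)
    parser.add_argument("--retry-backoff", type=float, default=2.0)
    parser.add_argument("--retry-backoff-max", type=float, default=30.0)
    parser.add_argument("--retry-jitter", type=float, default=0.25)
    args = parser.parse_args(argv)

    # Clamp each value into [0, cap] so a large input cannot stall a run.
    args.retry_max = max(0, min(args.retry_max, RETRY_MAX_CAP))
    args.retry_backoff = max(0.0, min(args.retry_backoff, RETRY_BACKOFF_CAP_S))
    args.retry_backoff_max = max(
        0.0, min(args.retry_backoff_max, RETRY_BACKOFF_MAX_CAP_S)
    )
    return args
```

In this sketch, passing `--retry-max 20` results in an effective value of 8, and omitting all flags yields the documented defaults.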
@Davincc77 Davincc77 marked this pull request as ready for review May 28, 2026 21:31
@Davincc77 Davincc77 merged commit 102eeef into main May 28, 2026
3 checks passed
@Davincc77 Davincc77 deleted the fix-benchmark-retry-backoff branch May 28, 2026 21:31