fix(benchmarks): bound Gemini calls and fail-fast on hung bundle batches #91

Merged
Davincc77 merged 1 commit into main from fix/gemini-timeout-progress
May 29, 2026
@Davincc77
Owner

Summary

Bundle-based Test B long pilot stalled twice only for bundle_index=4 (b05_drone_mission_ops) on the same head SHA 9448382 with identical params (sessions_per_bundle=150, concurrency=2, retry_max=5, retry_backoff=2, retry_backoff_max=30). Runs 26634227315 and 26637685221 both wedged on the Pilot Test B bundles — execute against Gemini step for >80 min before being cancelled. Bundles 0..3 on the same SHA completed cleanly.

Diagnosis

Two compounding bugs, both independent of the Drone/Mission Ops fixtures themselves:

  1. No per-request timeout on Gemini. GeminiProvider.generate did not forward ProviderConfig.timeout_s (60s) to the SDK. google-genai has no client-side deadline in that path, so a stalled TLS read — a known Gemini behaviour around safety-filtered content, and the Drone/Mission Ops bundle deliberately mixes "drone-operator", "mission-control", "security-incident-response", "human veto on a security-relevant step", and "CI weakening resistance" content — can block indefinitely with no error to retry.
  2. ThreadPoolExecutor context-manager deadlock. executor_b_bundles ran each batch through with ThreadPoolExecutor(...) as pool: list(pool.map(...)). __exit__ joins all workers, so one hung worker wedges the entire 1800-call run with zero log output — exactly what the GitHub Actions UI showed.
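
The second point can be reproduced in isolation. The following is a minimal sketch, not the harness code: `slow_call` and `run_batch` are hypothetical names, and a short sleep stands in for a stalled Gemini call. It shows that a `with ThreadPoolExecutor(...)` block cannot return before its slowest worker does, because both `list(pool.map(...))` and `__exit__` (which calls `shutdown(wait=True)`) wait for every worker.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_call(seconds: float) -> float:
    """Stand-in for a provider call; a truly hung call would never return."""
    time.sleep(seconds)
    return seconds


def run_batch(delays):
    """Mirror the old pattern: pool.map consumed inside a with-block."""
    started = time.monotonic()
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(slow_call, delays))
    # Leaving the with-block joins all workers, so this line is only
    # reached once every call has returned.
    return results, time.monotonic() - started


results, elapsed = run_batch([0.01, 0.5])
print(results, f"{elapsed:.2f}s")
```

If `slow_call` blocked forever instead of sleeping for half a second, `run_batch` would never return and nothing after the `with` block would be logged.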

Mapping (bundle_index=4 → b05_drone_mission_ops / Drone/Mission Ops) is correct; there is no off-by-one. The first call is s001 / p01_scoping / no_memory (small prompt), so the stall is not at prompt construction. Maximum prompt size for bundle 4 (~70 KB) is comparable to bundle 0 (~68 KB); content is not pathologically large.

Fix

  • Thread ProviderConfig.timeout_s → HttpOptions(timeout=<ms>) so every Gemini call has a 60s per-attempt deadline (configurable via new --request-timeout-s flag, also wired into the workflow execute step).
  • Classify timeout exceptions as TransientProviderError so the retry loop can recover.
  • Replace the ThreadPoolExecutor in the bundle executor with a queue + daemon-thread pool plus a per-call wall-clock cap derived from timeout_s * (retry_max+1) + retry_backoff_max * retry_max + 5s. A wedged worker no longer blocks shutdown; its job is logged in errors.jsonl as final_error_class=wall_clock_timeout and the batch proceeds.
  • Add per-item progress logging to stderr ([bb] start N/M … / [bb] ok … / [bb] error … / [bb] WALL-TIMEOUT …) so GitHub Actions surfaces live progress.
  • Add CI-safe regression probe tests/test_bundle4_hang_probe.py:
    • Pins bundle_index=4 → b05_drone_mission_ops.
    • Pins the first call expansion (scoping / no_memory).
    • Synthetic _HangingProvider proves the executor finishes within 30s with all calls classified as wall_clock_timeout.
    • Asserts the Gemini adapter forwards timeout_s as http_options.timeout (ms).
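
The wall-clock cap and the daemon-thread idea from the list above can be sketched as follows. This is an illustration under stated assumptions, not the code merged in this PR: `wall_clock_cap_s` and `run_batch` are hypothetical names, and the sketch starts one daemon thread per job rather than a fixed worker pool. With the parameters from the Summary (timeout_s=60, retry_max=5, retry_backoff_max=30) the cap formula gives 60 × 6 + 30 × 5 + 5 = 515 seconds per call.

```python
import queue
import threading
import time


def wall_clock_cap_s(timeout_s, retry_max, retry_backoff_max, slack_s=5.0):
    """Worst case for one call: every attempt times out, every backoff is maximal."""
    return timeout_s * (retry_max + 1) + retry_backoff_max * retry_max + slack_s


def run_batch(jobs, call, concurrency, cap_s, poll_s=0.01):
    """Run call(job) on daemon threads; give up on any call after cap_s seconds."""
    results = {}                      # job index -> (status, payload)
    done = queue.Queue()              # workers report (index, status, payload)
    pending = list(enumerate(jobs))[::-1]
    active = {}                       # job index -> start time

    def worker(index, job):
        try:
            done.put((index, "ok", call(job)))
        except Exception as exc:      # record the failure instead of losing it
            done.put((index, "error", type(exc).__name__))

    while pending or active:
        # Keep at most `concurrency` calls in flight.
        while pending and len(active) < concurrency:
            index, job = pending.pop()
            active[index] = time.monotonic()
            threading.Thread(target=worker, args=(index, job), daemon=True).start()
        try:
            index, status, payload = done.get(timeout=poll_s)
            if index in active:       # ignore late reports from abandoned calls
                del active[index]
                results[index] = (status, payload)
        except queue.Empty:
            pass
        # Abandon calls that exceeded the cap; daemon threads do not block exit.
        now = time.monotonic()
        for index, started in list(active.items()):
            if now - started > cap_s:
                del active[index]
                results[index] = ("wall_clock_timeout", None)
    return [results[i] for i in range(len(jobs))]
```

Because the threads are daemons and the main loop never joins them, a call that never returns costs at most `cap_s` seconds before the batch moves on, and the interpreter can still exit afterwards.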

No completed benchmark results are altered. No publish / tag / release / Zenodo / npm / PyPI path touched. Harness still makes no Mem0 compatibility claim.

Test plan

  • pytest benchmarks/v4.1/tests/ — 95 passed in 28.13s (90 pre-existing + 4 new + 1 incidental from import wiring)
  • pytest benchmarks/v4.1/tests/test_bundle4_hang_probe.py — 4 passed in 8.30s (hang-probe runs in <10s against synthetic hanging provider)
  • python -m py_compile on all touched files
  • Re-run Benchmark v4.1 — Gemini Test B Bundle Long Pilot (controlled) with bundle_index=4 and verify it now either completes or fails with explicit wall_clock_timeout rows + visible [bb] progress logs (manual dispatch — not gated by this PR's CI)

🤖 Generated with Claude Code

Bundle-based Test B pilot stalled twice only for bundle_index=4
(b05_drone_mission_ops) on the same head SHA and same params, both
runs reaching >80 min on the execute step before being cancelled. Other
bundles (0..3) on the same SHA completed cleanly. Root cause is two
compounding gaps in the call path, both independent of the fixtures:

1. ``GeminiProvider.generate`` calls ``client.models.generate_content``
   without forwarding ``ProviderConfig.timeout_s``. The google-genai SDK
   has no client-side deadline in that code path, so a stalled TLS read
   (a known Gemini behaviour around safety-filtered content; the Drone/
   Mission Ops bundle deliberately mixes "drone-operator",
   "mission-control", "security-incident-response", "human veto on a
   security-relevant step" and "CI weakening resistance" content) can
   block forever with no error to retry.

2. ``executor_b_bundles`` ran the per-batch concurrency through a
   ``ThreadPoolExecutor`` used as a context manager. When one worker
   never returns, ``__exit__`` joins it indefinitely, which deadlocks
   the entire 1800-call run and produces no logs.

This patch:

- Threads ``ProviderConfig.timeout_s`` through to
  ``HttpOptions(timeout=<ms>)`` so every Gemini call has a 60s
  per-attempt deadline (configurable via ``--request-timeout-s``).
- Classifies timeout exceptions as ``TransientProviderError`` so the
  retry loop can recover.
- Replaces the ``ThreadPoolExecutor`` in the bundle executor with a
  small queue + daemon-thread pool plus a derived per-call wall-clock
  cap. A wedged worker no longer blocks shutdown; its job is recorded
  as ``final_error_class=wall_clock_timeout`` in ``errors.jsonl`` and
  the batch proceeds.
- Adds per-item progress logging on stderr (start / ok / error /
  WALL-TIMEOUT) so GitHub Actions surfaces real-time progress; before
  this change a hung call was indistinguishable from a healthy run.
- Adds a CI-safe regression probe
  (``tests/test_bundle4_hang_probe.py``) covering the bundle_index=4
  fixture mapping, the deterministic first call (scoping / no_memory),
  the deadlock-free behaviour against a synthetic hanging provider,
  and that the adapter forwards ``timeout_s`` to ``HttpOptions``.
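
The interaction between the per-attempt deadline and the retry loop can be sketched as below. This is a minimal illustration, not the harness code: only the name ``TransientProviderError`` comes from this patch, while `attempt_once`, `call_with_retry`, the use of the built-in `TimeoutError`, and the doubling backoff schedule are assumptions made for the example.

```python
import time


class TransientProviderError(Exception):
    """Retryable failure (name taken from the patch; body is a stand-in)."""


def attempt_once(attempt):
    """Run one attempt, re-raising a timeout as a retryable error."""
    try:
        return attempt()
    except TimeoutError as exc:
        raise TransientProviderError(f"request timed out: {exc}") from exc


def call_with_retry(attempt, retry_max, retry_backoff, retry_backoff_max,
                    sleep=time.sleep):
    """Retry transient failures up to retry_max times with capped backoff."""
    for n in range(retry_max + 1):
        try:
            return attempt_once(attempt)
        except TransientProviderError:
            if n == retry_max:
                raise  # retries exhausted: surface the final error
            # Assumed schedule: double the delay each time, capped at the maximum.
            sleep(min(retry_backoff * 2 ** n, retry_backoff_max))
```

Without the classification step a timeout would propagate as a non-retryable exception on the first attempt; without the deadline there would be no exception at all for the loop to act on.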

No completed benchmark results are altered. The harness still makes no
Mem0 compatibility claim and no publish/tag/release path is touched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Davincc77 Davincc77 merged commit 304c8ed into main May 29, 2026
3 checks passed
@Davincc77 Davincc77 deleted the fix/gemini-timeout-progress branch May 29, 2026 14:11