fix(benchmarks): bound Gemini calls and fail-fast on hung bundle batches #91
Merged
Bundle-based Test B pilot stalled twice only for bundle_index=4 (b05_drone_mission_ops) on the same head SHA and same params, both runs reaching >80 min on the execute step before being cancelled. Other bundles (0..3) on the same SHA completed cleanly.

Root cause is two compounding gaps in the call path, both independent of the fixtures:

1. ``GeminiProvider.generate`` calls ``client.models.generate_content`` without forwarding ``ProviderConfig.timeout_s``. The google-genai SDK has no client-side deadline in that code path, so a stalled TLS read (a known Gemini behaviour around safety-filtered content; the Drone/Mission Ops bundle deliberately mixes "drone-operator", "mission-control", "security-incident-response", "human veto on a security-relevant step" and "CI weakening resistance" content) can block forever with no error to retry.
2. ``executor_b_bundles`` ran the per-batch concurrency through a ``ThreadPoolExecutor`` used as a context manager. When one worker never returns, ``__exit__`` joins it indefinitely, which deadlocks the entire 1800-call run and produces no logs.

This patch:

- Threads ``ProviderConfig.timeout_s`` through to ``HttpOptions(timeout=<ms>)`` so every Gemini call has a 60s per-attempt deadline (configurable via ``--request-timeout-s``).
- Classifies timeout exceptions as ``TransientProviderError`` so the retry loop can recover.
- Replaces the ``ThreadPoolExecutor`` in the bundle executor with a small queue + daemon-thread pool plus a derived per-call wall-clock cap. A wedged worker no longer blocks shutdown; its job is recorded as ``final_error_class=wall_clock_timeout`` in ``errors.jsonl`` and the batch proceeds.
- Adds per-item progress logging on stderr (start / ok / error / WALL-TIMEOUT) so GitHub Actions surfaces real-time progress; before this change a hung call was indistinguishable from a healthy run.
- Adds a CI-safe regression probe (``tests/test_bundle4_hang_probe.py``) covering the bundle_index=4 fixture mapping, the deterministic first call (scoping / no_memory), the deadlock-free behaviour against a synthetic hanging provider, and that the adapter forwards ``timeout_s`` to ``HttpOptions``.

No completed benchmark results are altered. The harness still makes no Mem0 compatibility claim and no publish/tag/release path is touched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
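The second gap in the commit message (a ``ThreadPoolExecutor`` context manager joining a worker that never returns) can be seen in isolation with the standard library alone. The following is a minimal sketch, not code from this PR: ``slow_call`` is a hypothetical stand-in for a stalled provider call, and a timer releases it after 0.5 s only so that the demonstration terminates.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def slow_call(release: threading.Event) -> str:
    """Hypothetical stand-in for a provider call that stalls until released."""
    release.wait()
    return "done"


release = threading.Event()
# Unblock the worker after 0.5 s so this demo terminates. A genuinely stalled
# read has no such timer, so the wait measured below would never end.
threading.Timer(0.5, release.set).start()

start = time.monotonic()
with ThreadPoolExecutor(max_workers=2) as pool:
    future = pool.submit(slow_call, release)
    # The body ends immediately; nothing in it waits on the future.
# Leaving the `with` block calls shutdown(wait=True), which joins the worker.
blocked_for = time.monotonic() - start

print(f"with-block exit blocked for {blocked_for:.2f}s")
```

The time spent leaving the ``with`` block tracks the slowest worker, not the work done in the body, which is why a single call without a deadline is enough to wedge a whole batch.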
Summary
Bundle-based Test B long pilot stalled twice only for ``bundle_index=4`` (``b05_drone_mission_ops``) on the same head SHA ``9448382`` with identical params (``sessions_per_bundle=150``, ``concurrency=2``, ``retry_max=5``, ``retry_backoff=2``, ``retry_backoff_max=30``). Runs 26634227315 and 26637685221 both wedged on the ``Pilot Test B bundles — execute against Gemini`` step for >80 min before being cancelled. Bundles 0..3 on the same SHA completed cleanly.

Diagnosis
Two compounding bugs, both independent of the Drone/Mission Ops fixtures themselves:

1. ``GeminiProvider.generate`` did not forward ``ProviderConfig.timeout_s`` (60s) to the SDK. ``google-genai`` has no client-side deadline in that path, so a stalled TLS read (a known Gemini behaviour around safety-filtered content, and the Drone/Mission Ops bundle deliberately mixes "drone-operator", "mission-control", "security-incident-response", "human veto on a security-relevant step", and "CI weakening resistance" content) can block indefinitely with no error to retry.
2. ``ThreadPoolExecutor`` context-manager deadlock. ``executor_b_bundles`` ran each batch through ``with ThreadPoolExecutor(...) as pool: list(pool.map(...))``. ``__exit__`` joins all workers, so one hung worker wedges the entire 1800-call run with zero log output, which is exactly what the GitHub Actions UI showed.

Mapping (``bundle_index=4`` → ``b05_drone_mission_ops`` / Drone/Mission Ops) is correct; no off-by-one. First call is ``s001 / p01_scoping / no_memory`` (small prompt), so the stall is not at prompt construction. Maximum prompt size for bundle 4 (~70 KB) is comparable to bundle 0 (~68 KB); content is not pathologically large.

Fix
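A minimal sketch of how a per-attempt deadline combined with retry classification closes the first gap from the diagnosis. It is an illustration under stated assumptions, not the adapter code from this PR: ``to_http_timeout_ms`` and ``call_with_retry`` are hypothetical names, the built-in ``TimeoutError`` stands in for whatever timeout exceptions the SDK or its transport raises, and the exponential backoff schedule is assumed (the PR only names ``retry_backoff`` and ``retry_backoff_max``).

```python
import time


class TransientProviderError(Exception):
    """Retryable provider failure (stand-in for the harness's own class)."""


def to_http_timeout_ms(timeout_s: float) -> int:
    """Convert a per-attempt timeout in seconds to whole milliseconds."""
    return int(round(timeout_s * 1000))


def call_with_retry(attempt, retry_max=5, retry_backoff=2.0,
                    retry_backoff_max=30.0, sleep=time.sleep):
    """Call attempt() up to retry_max + 1 times, retrying transient errors."""
    for n in range(retry_max + 1):
        try:
            return attempt()
        except TimeoutError as exc:
            # Reclassify the timeout so the loop treats it as retryable.
            error = TransientProviderError(f"request timed out: {exc}")
        except TransientProviderError as exc:
            error = exc
        # Any other exception type propagates immediately, without a retry.
        if n == retry_max:
            raise error
        # Assumed schedule: exponential backoff capped at retry_backoff_max.
        sleep(min(retry_backoff * 2 ** n, retry_backoff_max))
```

Without a deadline the ``attempt()`` call never raises, so the retry loop never gets a chance to run. The deadline is what turns a silent stall into an error the loop can act on.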
- ``ProviderConfig.timeout_s`` → ``HttpOptions(timeout=<ms>)`` so every Gemini call has a 60s per-attempt deadline (configurable via new ``--request-timeout-s`` flag, also wired into the workflow execute step).
- Timeout exceptions are classified as ``TransientProviderError`` so the retry loop can recover.
- Replaces the ``ThreadPoolExecutor`` in the bundle executor with a queue + daemon-thread pool plus a per-call wall-clock cap derived from ``timeout_s * (retry_max+1) + retry_backoff_max * retry_max + 5s``. A wedged worker no longer blocks shutdown; its job is logged in ``errors.jsonl`` as ``final_error_class=wall_clock_timeout`` and the batch proceeds.
- Adds per-item progress logging on stderr (``[bb] start N/M …`` / ``[bb] ok …`` / ``[bb] error …`` / ``[bb] WALL-TIMEOUT …``) so GitHub Actions surfaces live progress.
- Adds a regression probe, ``tests/test_bundle4_hang_probe.py``:
  - Checks the mapping ``bundle_index=4 → b05_drone_mission_ops``.
  - Checks the deterministic first call (scoping / no_memory).
  - ``_HangingProvider`` proves the executor finishes within 30s with all calls classified as ``wall_clock_timeout``.
  - Checks that the adapter forwards ``timeout_s`` as ``http_options.timeout`` (ms).

No completed benchmark results are altered. No publish / tag / release / Zenodo / npm / PyPI path touched. Harness still makes no Mem0 compatibility claim.
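The wall-clock cap and the queue + daemon-thread pool from the Fix list can be sketched as follows. This is one plausible shape under stated assumptions, not the executor shipped in this PR: ``wall_clock_cap_s`` and ``run_batch`` are hypothetical names, jobs are assumed hashable, and each call runs on its own daemon thread that is simply abandoned if it exceeds the cap.

```python
import queue
import threading
import time


def wall_clock_cap_s(timeout_s, retry_max, retry_backoff_max, slack_s=5.0):
    """Worst case: every attempt times out and every backoff hits its cap."""
    return timeout_s * (retry_max + 1) + retry_backoff_max * retry_max + slack_s


def run_batch(jobs, call, concurrency, cap_s):
    """Run call(job) per job on daemon threads, abandoning calls after cap_s.

    Returns {job: (status, payload)} where status is "ok", "error",
    or "wall_clock_timeout".
    """
    todo = queue.Queue()
    for job in jobs:
        todo.put(job)
    results = {}
    lock = threading.Lock()

    def run_one(job, box):
        try:
            box.append(("ok", call(job)))
        except Exception as exc:
            box.append(("error", type(exc).__name__))

    def worker():
        while True:
            try:
                job = todo.get_nowait()
            except queue.Empty:
                return
            box = []
            runner = threading.Thread(target=run_one, args=(job, box), daemon=True)
            runner.start()
            runner.join(cap_s)  # bounded wait: returns even if the call hangs
            outcome = box[0] if box else ("wall_clock_timeout", None)
            with lock:
                results[job] = outcome

    workers = [threading.Thread(target=worker, daemon=True)
               for _ in range(concurrency)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()  # each worker waits at most cap_s per job, so this returns
    return results
```

With the parameters quoted in this PR (``timeout_s=60``, ``retry_max=5``, ``retry_backoff_max=30``) the cap evaluates to 60 * 6 + 30 * 5 + 5 = 515 seconds per call. Because the runner threads are daemons, a call that is still stuck when the batch finishes does not keep the interpreter alive at exit.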
Test plan
- ``pytest benchmarks/v4.1/tests/``: 95 passed in 28.13s (90 pre-existing + 4 new + 1 incidental from import wiring)
- ``pytest benchmarks/v4.1/tests/test_bundle4_hang_probe.py``: 4 passed in 8.30s (hang-probe runs in <10s against synthetic hanging provider)
- ``python -m py_compile`` on all touched files
- Run ``Benchmark v4.1 — Gemini Test B Bundle Long Pilot (controlled)`` with ``bundle_index=4`` and verify it now either completes or fails with explicit ``wall_clock_timeout`` rows + visible ``[bb]`` progress logs (manual dispatch; not gated by this PR's CI)

🤖 Generated with Claude Code