Skip to content

Latest commit

 

History

History
90 lines (59 loc) · 4.16 KB

File metadata and controls

90 lines (59 loc) · 4.16 KB

Matched-Budget Revision Cycle: 0.5B Smoke Gate

This report summarizes the cross-only revision-cycle experiment run from configs/matched_budget_revision_qwen_0p5b_smoke.yaml. The goal was to repair the unstable cross_paradigm arm while keeping the baseline, local-polishing arm, smoke seeds, split logic, backend family, and evaluation path fixed.

Top-level run root:

Family-level status artifacts:

Outcome

  • cycle1 selected cross_v2_typical_context
  • cycle2 selected cross_v5_boundary_clarity
  • Both cycles improved on the original frozen pilot's mean final_test accuracy
  • Neither cycle cleared the strict smoke-gate rule
  • Final family status: failed_after_max_revision_cycles

Cycle 1

Selector-screen winner:

  • cross_v2_typical_context

Key artifacts:

Cycle-1 3-seed means:

Split current_round_7 local_polishing cross_v2_typical_context
selector_dev 0.4912 0.4912 0.5789
final_test 0.4667 0.4667 0.5167

Gate result:

  • passed selector-best criterion
  • passed final-test-best criterion
  • failed strict win-count criterion versus baseline
  • baseline comparison on final_test: wins [43], ties [17, 29], losses []
  • local-polishing comparison on final_test: wins [17, 29], ties [43], losses []

Interpretation:

  • The revised arm became the best mean arm on both selector_dev and final_test
  • The smoke gate still failed because the improvement came as one outright win plus two ties against baseline, not 2/3 outright wins

Cycle 2

Selector-screen winner:

  • cross_v5_boundary_clarity

Key artifacts:

Cycle-2 3-seed means:

Split current_round_7 local_polishing cross_v5_boundary_clarity
selector_dev 0.4912 0.4912 0.5263
final_test 0.4667 0.4667 0.4833

Gate result:

  • passed selector-best criterion
  • passed final-test-best criterion
  • failed strict win-count criterion versus baseline
  • baseline comparison on final_test: wins [43], ties [17, 29], losses []
  • local-polishing comparison on final_test: wins [17, 29], ties [], losses [43]

Interpretation:

  • The boundary-explicit fallback preserved the mean lead, but the lead shrank relative to cycle 1
  • It still failed for the same structural reason: ties against baseline on seeds 17 and 29
  • It also lost to local polishing on seed 43, which became the priority failure seed in the cycle-2 audit

Bottom Line

The revised cross-paradigm family is better than the original frozen pilot in one important sense: both cycles made the revised arm the best mean arm on both selector_dev and final_test. But under the stricter promotion rule, that is still not enough. The revised arm did not produce the required number of outright held-out wins against baseline, so the experiment does not justify promotion to a larger phase or to stronger paper-facing claims.