This report summarizes the cross-only revision-cycle experiment run from configs/matched_budget_revision_qwen_0p5b_smoke.yaml. The goal was to repair the unstable cross_paradigm arm while keeping the baseline, local-polishing arm, smoke seeds, split logic, backend family, and evaluation path fixed.
Top-level run root:
Family-level status artifacts:
cycle1selectedcross_v2_typical_contextcycle2selectedcross_v5_boundary_clarity- Both cycles improved on the original frozen pilot's mean
final_testaccuracy - Neither cycle cleared the strict smoke-gate rule
- Final family status:
failed_after_max_revision_cycles
Selector-screen winner:
cross_v2_typical_context
Key artifacts:
cycle1/selector_screen/candidate_decision.jsoncycle1/smoke_gate/summary.jsoncycle1/smoke_gate/decision_log.json
Cycle-1 3-seed means:
| Split | current_round_7 |
local_polishing |
cross_v2_typical_context |
|---|---|---|---|
selector_dev |
0.4912 |
0.4912 |
0.5789 |
final_test |
0.4667 |
0.4667 |
0.5167 |
Gate result:
- passed selector-best criterion
- passed final-test-best criterion
- failed strict win-count criterion versus baseline
- baseline comparison on
final_test: wins[43], ties[17, 29], losses[] - local-polishing comparison on
final_test: wins[17, 29], ties[43], losses[]
Interpretation:
- The revised arm became the best mean arm on both
selector_devandfinal_test - The smoke gate still failed because the improvement came as one outright win plus two ties against baseline, not
2/3outright wins
Selector-screen winner:
cross_v5_boundary_clarity
Key artifacts:
cycle2/selector_screen/candidate_decision.jsoncycle2/smoke_gate/summary.jsoncycle2/smoke_gate/decision_log.jsoncycle2/smoke_gate/audits/
Cycle-2 3-seed means:
| Split | current_round_7 |
local_polishing |
cross_v5_boundary_clarity |
|---|---|---|---|
selector_dev |
0.4912 |
0.4912 |
0.5263 |
final_test |
0.4667 |
0.4667 |
0.4833 |
Gate result:
- passed selector-best criterion
- passed final-test-best criterion
- failed strict win-count criterion versus baseline
- baseline comparison on
final_test: wins[43], ties[17, 29], losses[] - local-polishing comparison on
final_test: wins[17, 29], ties[], losses[43]
Interpretation:
- The boundary-explicit fallback preserved the mean lead, but the lead shrank relative to cycle 1
- It still failed for the same structural reason: ties against baseline on seeds 17 and 29
- It also lost to local polishing on seed
43, which became the priority failure seed in the cycle-2 audit
The revised cross-paradigm family is better than the original frozen pilot in one important sense: both cycles made the revised arm the best mean arm on both selector_dev and final_test. But under the stricter promotion rule, that is still not enough. The revised arm did not produce the required number of outright held-out wins against baseline, so the experiment does not justify promotion to a larger phase or to stronger paper-facing claims.