Matched-Budget Revision Cycle: 0.5B Smoke Gate

This report summarizes the cross-only revision-cycle experiment run from configs/matched_budget_revision_qwen_0p5b_smoke.yaml. The goal was to repair the unstable cross_paradigm arm while keeping the baseline, local-polishing arm, smoke seeds, split logic, backend family, and evaluation path fixed.

Top-level run root:

outputs/matched_budget_revision_qwen_0p5b_smoke/

Family-level status artifacts:

Outcome

cycle1 selected cross_v2_typical_context
cycle2 selected cross_v5_boundary_clarity
Both cycles improved on the original frozen pilot's mean final_test accuracy
Neither cycle cleared the strict smoke-gate rule
Final family status: failed_after_max_revision_cycles

Cycle 1

Selector-screen winner:

cross_v2_typical_context

Key artifacts:

Cycle-1 3-seed means:

Split	`current_round_7`	`local_polishing`	`cross_v2_typical_context`
`selector_dev`	`0.4912`	`0.4912`	`0.5789`
`final_test`	`0.4667`	`0.4667`	`0.5167`

Gate result:

passed selector-best criterion
passed final-test-best criterion
failed strict win-count criterion versus baseline
baseline comparison on final_test: wins [43], ties [17, 29], losses []
local-polishing comparison on final_test: wins [17, 29], ties [43], losses []

Interpretation:

The revised arm became the best mean arm on both selector_dev and final_test
The smoke gate still failed because the improvement came as one outright win plus two ties against baseline, not 2/3 outright wins

Cycle 2

Selector-screen winner:

cross_v5_boundary_clarity

Key artifacts:

Cycle-2 3-seed means:

Split	`current_round_7`	`local_polishing`	`cross_v5_boundary_clarity`
`selector_dev`	`0.4912`	`0.4912`	`0.5263`
`final_test`	`0.4667`	`0.4667`	`0.4833`

Gate result:

passed selector-best criterion
passed final-test-best criterion
failed strict win-count criterion versus baseline
baseline comparison on final_test: wins [43], ties [17, 29], losses []
local-polishing comparison on final_test: wins [17, 29], ties [], losses [43]

Interpretation:

The boundary-explicit fallback preserved the mean lead, but the lead shrank relative to cycle 1
It still failed for the same structural reason: ties against baseline on seeds 17 and 29
It also lost to local polishing on seed 43, which became the priority failure seed in the cycle-2 audit

Bottom Line

The revised cross-paradigm family is better than the original frozen pilot in one important sense: both cycles made the revised arm the best mean arm on both selector_dev and final_test. But under the stricter promotion rule, that is still not enough. The revised arm did not produce the required number of outright held-out wins against baseline, so the experiment does not justify promotion to a larger phase or to stronger paper-facing claims.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Matched-Budget Revision Cycle: 0.5B Smoke Gate

Outcome

Cycle 1

Cycle 2

Bottom Line

FilesExpand file tree

matched_budget_revision_qwen_0p5b_smoke.md

Latest commit

History

matched_budget_revision_qwen_0p5b_smoke.md

File metadata and controls

Matched-Budget Revision Cycle: 0.5B Smoke Gate

Outcome

Cycle 1

Cycle 2

Bottom Line