Generated by scripts/generate_statistical_rigor_report.py on 2026-05-09.
This report is a deterministic analysis of saved artifacts only. It reads
paper/tables/publication_claim_tables.json; it does not call a model, change
prompts, or unlock any held-out split.
The statistical status is intentionally conservative. The 3D rows are sequential scientific probes, not iid samples from a well-defined deployment population. The sign tests and bootstrap intervals below are therefore diagnostic descriptive summaries, not population-level proof. They are useful because they make the residual pattern inspectable: which metrics repeatedly improve, which ones tie, and which ones still fail under transfer.
| Quantity | Value |
|---|---|
| Surfaced unlocked 3D rows | 9 |
| Claim-bearing rows excluding post-selection audit | 8 |
| Paper-clean held-out wins | 2 |
| Clean-win rate over claim-bearing rows | 0.2500 |
| Exact 95% clean-win-rate interval | [0.0319, 0.6509] |
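
The exact interval in the table can be checked by hand. The sketch below assumes "exact" means the Clopper-Pearson interval (the report does not name the method) and inverts the binomial tail probabilities by bisection using only the standard library. The function names are illustrative and are not taken from `scripts/generate_statistical_rigor_report.py`.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root_increasing(g, iters=200):
    """Bisection on [0, 1] for a function g that increases through zero."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Two-sided exact interval for a binomial proportion (assumed method)."""
    # Lower bound: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else _root_increasing(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2
    )
    # Upper bound: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else _root_increasing(
        lambda p: alpha / 2 - binom_cdf(x, n, p)
    )
    return lower, upper


# 2 paper-clean wins out of 8 claim-bearing rows, as in the table above.
low, high = clopper_pearson(2, 8)
print(f"[{low:.4f}, {high:.4f}]")
```

If the assumption about the method holds, the printed bounds should agree with the tabulated interval to about four decimal places.
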
Rows below use the claim-bearing held-out/prospective rows and exclude the post-selection audit seed (2407). A positive mean delta indicates an advantage for the selected scaffold; for fragility the delta is already inverted, computed as baseline fragility minus selected fragility.
| Metric | n | Mean delta | Bootstrap 95% CI | Win / tie / loss | One-sided sign p | Two-sided sign p |
|---|---|---|---|---|---|---|
| Salience | 8 | +0.0137 | [-0.0019, +0.0317] | 5 / 1 / 2 | 0.227 | 0.453 |
| Sensitivity | 8 | +0.2083 | [0.0000, +0.4583] | 4 / 3 / 1 | 0.188 | 0.375 |
| Valid format | 8 | 0.0000 | [0.0000, 0.0000] | 0 / 8 / 0 | N/A | N/A |
| Fragility | 8 | +0.0471 | [-0.0823, +0.1761] | 5 / 0 / 3 | 0.363 | 0.727 |
| Alignment | 8 | +0.0680 | [+0.0380, +0.0963] | 7 / 0 / 1 | 0.035 | 0.070 |
| WVS salience | 8 | +0.0536 | [+0.0061, +0.1092] | 6 / 1 / 1 | 0.062 | 0.125 |
| WVS sensitivity | 8 | +0.5000 | [+0.1250, +0.8750] | 4 / 4 / 0 | 0.062 | 0.125 |
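
The sign-test columns can be recomputed from the win / tie / loss counts alone. The sketch below assumes that ties are dropped and that the two-sided p-value is twice the one-sided binomial tail (capped at 1); both assumptions are consistent with the tabulated values but are not stated in the report. `sign_test` is an illustrative helper, not the generator's own function.

```python
from math import comb


def sign_test(wins, losses):
    """Exact sign test under H0: P(win) = 0.5, with ties already removed.

    Returns (one_sided_p, two_sided_p), or (None, None) when every row ties.
    """
    n = wins + losses
    if n == 0:
        return None, None  # nothing to test, reported as N/A in the table

    def upper_tail(k):
        # P(X >= k) for X ~ Binomial(n, 0.5)
        return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

    one_sided = upper_tail(wins)
    two_sided = min(1.0, 2 * upper_tail(max(wins, losses)))
    return one_sided, two_sided


# (wins, ties, losses) copied from the table above.
counts = {
    "Salience": (5, 1, 2),
    "Sensitivity": (4, 3, 1),
    "Valid format": (0, 8, 0),
    "Fragility": (5, 0, 3),
    "Alignment": (7, 0, 1),
    "WVS salience": (6, 1, 1),
    "WVS sensitivity": (4, 4, 0),
}

for metric, (wins, ties, losses) in counts.items():
    one_sided, two_sided = sign_test(wins, losses)
    if one_sided is None:
        print(f"{metric}: N/A")
    else:
        print(f"{metric}: one-sided {one_sided:.3f}, two-sided {two_sided:.3f}")
```

Because ties are discarded, the effective sample size is small for several metrics (for example, four rows for WVS sensitivity), which is one reason the p-values are treated as descriptive rather than confirmatory.
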
The paper-ready clean-win label is stricter than metric no-regression. A seed may show no metric regression but still be treated as boundary evidence if it missed a preregistered strict gate, was a post-selection audit, or had a non-claim-bearing status.
| Seed | Status | Paper-clean win | Metric no-regression | Metric losses | Source |
|---|---|---|---|---|---|
| 2407 | POST_SELECTION_AUDIT | no | yes | none | reports/3d_ethics_scaffold_family_prospective_seed2407_2026-05-06.json |
| 2801 | PREREGISTERED_HELD_OUT_WIN | yes | yes | none | reports/3d_ethics_scaffold_family_prospective_seed2801_2026-05-06.json |
| 3001 | CONFIRMATORY_HELD_OUT_MIXED | no | no | Salience, Fragility | reports/3d_ethics_scaffold_family_confirmatory_seed3001_2026-05-06.json |
| 3109 | CONFIRMATORY_HELD_OUT_MIXED | no | no | Fragility | reports/3d_ethics_scaffold_family_confirmatory_seed3109_2026-05-06.json |
| 4523 | CONFIRMED_HELD_OUT_WIN | yes | yes | none | reports/3d_ethics_v2_3s_to_v2_3u_wvs_guarded_replication_matrix_2026-05-07.json |
| 4627 | REPLICATION_BOUNDARY | no | yes | none | reports/3d_ethics_v2_3s_to_v2_3u_wvs_guarded_replication_matrix_2026-05-07.json |
| 4703 | REPLICATION_BOUNDARY | no | yes | none | reports/3d_ethics_v2_3s_to_v2_3u_wvs_guarded_replication_matrix_2026-05-07.json |
| 4909 | HELD_OUT_SELECTOR_GAP_FAILURE | no | no | Salience, Fragility, WVS salience | reports/3d_ethics_qwen3b_scaffold_family_tournament_v2_4b_semantic_gate_seed4909_2026-05-07.json |
| 8563 | PROSPECTIVE_MIXED_NEGATIVE | no | no | Sensitivity, Alignment | reports/3d_ethics_v2_10e_to_v2_10f_exact_count_transfer_2026-05-08.json |
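
The headline counts can be re-derived from the seed table. The sketch below transcribes the rows by hand (it does not read `paper/tables/publication_claim_tables.json`, whose schema is not shown here) and treats every status other than `POST_SELECTION_AUDIT` as claim-bearing, which is an assumption that matches the counts reported above.

```python
from collections import Counter

# (seed, status, paper_clean_win, metric_losses), transcribed from the table.
ROWS = [
    (2407, "POST_SELECTION_AUDIT", False, []),
    (2801, "PREREGISTERED_HELD_OUT_WIN", True, []),
    (3001, "CONFIRMATORY_HELD_OUT_MIXED", False, ["Salience", "Fragility"]),
    (3109, "CONFIRMATORY_HELD_OUT_MIXED", False, ["Fragility"]),
    (4523, "CONFIRMED_HELD_OUT_WIN", True, []),
    (4627, "REPLICATION_BOUNDARY", False, []),
    (4703, "REPLICATION_BOUNDARY", False, []),
    (4909, "HELD_OUT_SELECTOR_GAP_FAILURE", False,
     ["Salience", "Fragility", "WVS salience"]),
    (8563, "PROSPECTIVE_MIXED_NEGATIVE", False, ["Sensitivity", "Alignment"]),
]

# Assumption: only the post-selection audit row is excluded from claims.
claim_bearing = [row for row in ROWS if row[1] != "POST_SELECTION_AUDIT"]
clean_wins = [row for row in claim_bearing if row[2]]
no_regression = [row for row in claim_bearing if not row[3]]

# Per-metric loss counts across claim-bearing rows.
loss_counts = Counter(metric for row in claim_bearing for metric in row[3])

print("surfaced rows:", len(ROWS))
print("claim-bearing rows:", len(claim_bearing))
print("paper-clean wins:", len(clean_wins))
print("clean-win rate:", len(clean_wins) / len(claim_bearing))
print("no-regression rows:", len(no_regression))
print("losses per metric:", dict(loss_counts))
```

The gap between the no-regression count and the clean-win count corresponds to the boundary seeds: rows with no metric losses that still do not carry the paper-clean label.
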
| Issue | Current evidence | Required strengthening |
|---|---|---|
| Statistical power | Two clean wins in eight claim-bearing rows; the exact 95% interval [0.0319, 0.6509] is wide. | A preregistered multi-seed panel with fixed scaffold selection and the same access-log discipline. |
| Selector gap | Several boundary seeds show selector-dev does not reliably predict held-out quality. | Report selector-dev/final-test gaps as a first-class diagnostic. |
| Fragility tradeoff | Fragility improves on some rows but remains a repeated failure mode. | Add ablations that isolate update pressure from no-overreaction pressure. |
| WVS sensitivity | WVS improves strongly on some rows but ties or collapses on others. | Keep WVS endpoint-credit feasibility and semantic movement audits separate from official gate credit. |
| Prompt vs artifact route | Operation tags strengthen the baseline under dev-only replay. | Run an operation-tag off/on route ablation before any new held-out launch. |

- paper/tables/publication_claim_tables.md
- reports/publication_visual_table_package_2026-05-09.md
- docs/ablation_depth_plan.md
- docs/related_work_coverage.md