
Statistical Reporting and Experimental Rigor Audit

Generated by scripts/generate_statistical_rigor_report.py on 2026-05-09.

This report is a deterministic analysis of saved artifacts only. It reads paper/tables/publication_claim_tables.json; it does not call a model, change prompts, or unlock any held-out split.

Statistical Posture

The statistical status is intentionally conservative. The 3D rows are sequential scientific probes, not iid samples from a well-defined deployment population. The sign tests and bootstrap intervals below are therefore diagnostic descriptive summaries, not population-level proof. They are useful because they make the residual pattern inspectable: which metrics repeatedly improve, which ones tie, and which ones still fail under transfer.

| Quantity | Value |
| --- | --- |
| Surfaced unlocked 3D rows | 9 |
| Claim-bearing rows excluding post-selection audit | 8 |
| Paper-clean held-out wins | 2 |
| Clean-win rate over claim-bearing rows | 0.2500 |
| Exact 95% clean-win-rate interval | [0.0319, 0.6509] |
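
The rate and interval in the last two rows can be re-derived from the two counts above them. The sketch below assumes the "exact" interval is the Clopper-Pearson interval (the reported bounds are consistent with that assumption) and finds each bound by bisection on the binomial tail using only the standard library. The function names are illustrative and are not taken from scripts/generate_statistical_rigor_report.py.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo: float, hi: float, iters: int = 200) -> float:
    """Root of a monotone function f on [lo, hi], where f changes sign."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid  # root lies in the upper half
        else:
            hi = mid  # root lies in the lower half
    return 0.5 * (lo + hi)


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Assumed method: exact (Clopper-Pearson) interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k | p) equals alpha / 2 (0 when k == 0).
    lower = 0.0 if k == 0 else _bisect(
        lambda p: (1.0 - binom_cdf(k - 1, n, p)) - alpha / 2, 0.0, 1.0)
    # Upper bound: the p at which P(X <= k | p) equals alpha / 2 (1 when k == n).
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper


wins, rows = 2, 8  # paper-clean wins and claim-bearing rows from the table
rate = wins / rows
low, high = clopper_pearson(wins, rows)
print(f"rate={rate:.4f} interval=[{low:.4f}, {high:.4f}]")
```

With only eight claim-bearing rows, the interval spans most of the unit range, which is why the clean-win rate is reported as a descriptive summary rather than an estimate of a population rate.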

Metric-Level Seed Summary

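Rows below use the claim-bearing held-out/prospective rows and exclude the post-selection audit seed. A positive mean delta favors the selected scaffold. Fragility is already sign-inverted (baseline fragility minus selected fragility), so a positive fragility delta also means the selected scaffold did better. Sign-test p-values are computed over non-tied rows only, so a metric that ties on every row has no p-value (N/A).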

| Metric | n | Mean delta | Bootstrap 95% CI | Win / tie / loss | One-sided sign p | Two-sided sign p |
| --- | --- | --- | --- | --- | --- | --- |
| Salience | 8 | +0.0137 | [-0.0019, +0.0317] | 5 / 1 / 2 | 0.227 | 0.453 |
| Sensitivity | 8 | +0.2083 | [0.0000, +0.4583] | 4 / 3 / 1 | 0.188 | 0.375 |
| Valid format | 8 | 0.0000 | [0.0000, 0.0000] | 0 / 8 / 0 | N/A | N/A |
| Fragility | 8 | +0.0471 | [-0.0823, +0.1761] | 5 / 0 / 3 | 0.363 | 0.727 |
| Alignment | 8 | +0.0680 | [+0.0380, +0.0963] | 7 / 0 / 1 | 0.035 | 0.070 |
| WVS salience | 8 | +0.0536 | [+0.0061, +0.1092] | 6 / 1 / 1 | 0.062 | 0.125 |
| WVS sensitivity | 8 | +0.5000 | [+0.1250, +0.8750] | 4 / 4 / 0 | 0.062 | 0.125 |
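
As a reading aid, the sign-test columns can be recomputed from the win / tie / loss counts alone. The sketch below assumes ties are dropped and that the two-sided p-value is twice the smaller tail, capped at 1; both assumptions are consistent with the tabulated values. The bootstrap helper assumes a percentile bootstrap over per-seed deltas, which this report does not confirm, and the per-seed deltas themselves are not listed here, so only the degenerate all-ties case is exercised. All function names are hypothetical.

```python
import random
from math import comb
from typing import Optional, Sequence


def sign_test(wins: int, losses: int) -> Optional[tuple[float, float]]:
    """One- and two-sided sign-test p-values, with ties already dropped.

    Returns None when every row ties, matching the N/A entries in the table.
    """
    n = wins + losses
    if n == 0:
        return None
    # One-sided: P(X >= wins) for X ~ Binomial(n, 0.5).
    upper_tail = sum(comb(n, i) for i in range(wins, n + 1)) / 2**n
    # Opposite tail: P(X <= wins).
    lower_tail = sum(comb(n, i) for i in range(0, wins + 1)) / 2**n
    # Two-sided (assumed convention): double the smaller tail, capped at 1.
    two_sided = min(1.0, 2.0 * min(upper_tail, lower_tail))
    return upper_tail, two_sided


def bootstrap_mean_ci(deltas: Sequence[float], n_resamples: int = 10_000,
                      alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean delta (assumed method)."""
    rng = random.Random(seed)  # fixed seed keeps the summary deterministic
    n = len(deltas)
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * (n_resamples - 1))
    hi_idx = int((1 - alpha / 2) * (n_resamples - 1))
    return means[lo_idx], means[hi_idx]


# Win / loss counts copied from the table (ties omitted).
for name, wins, losses in [("Salience", 5, 2), ("Fragility", 5, 3),
                           ("Alignment", 7, 1), ("Valid format", 0, 0)]:
    print(name, sign_test(wins, losses))

# A metric that ties on every seed yields a zero-width interval.
print(bootstrap_mean_ci([0.0] * 8))
```

Dropping ties shrinks the effective sample: WVS sensitivity has no losses, but only four non-tied rows, so its one-sided p-value cannot fall below 1/16.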

Seed-Level Clean-Win Audit

The paper-ready clean-win label is stricter than metric no-regression. A seed may show no metric regression and still be treated as boundary evidence if it missed a preregistered strict gate, was a post-selection audit, or had a non-claim-bearing status.

| Seed | Status | Paper-clean win | Metric no-regression | Metric losses | Source |
| --- | --- | --- | --- | --- | --- |
| 2407 | POST_SELECTION_AUDIT | no | yes | none | reports/3d_ethics_scaffold_family_prospective_seed2407_2026-05-06.json |
| 2801 | PREREGISTERED_HELD_OUT_WIN | yes | yes | none | reports/3d_ethics_scaffold_family_prospective_seed2801_2026-05-06.json |
| 3001 | CONFIRMATORY_HELD_OUT_MIXED | no | no | Salience, Fragility | reports/3d_ethics_scaffold_family_confirmatory_seed3001_2026-05-06.json |
| 3109 | CONFIRMATORY_HELD_OUT_MIXED | no | no | Fragility | reports/3d_ethics_scaffold_family_confirmatory_seed3109_2026-05-06.json |
| 4523 | CONFIRMED_HELD_OUT_WIN | yes | yes | none | reports/3d_ethics_v2_3s_to_v2_3u_wvs_guarded_replication_matrix_2026-05-07.json |
| 4627 | REPLICATION_BOUNDARY | no | yes | none | reports/3d_ethics_v2_3s_to_v2_3u_wvs_guarded_replication_matrix_2026-05-07.json |
| 4703 | REPLICATION_BOUNDARY | no | yes | none | reports/3d_ethics_v2_3s_to_v2_3u_wvs_guarded_replication_matrix_2026-05-07.json |
| 4909 | HELD_OUT_SELECTOR_GAP_FAILURE | no | no | Salience, Fragility, WVS salience | reports/3d_ethics_qwen3b_scaffold_family_tournament_v2_4b_semantic_gate_seed4909_2026-05-07.json |
| 8563 | PROSPECTIVE_MIXED_NEGATIVE | no | no | Sensitivity, Alignment | reports/3d_ethics_v2_10e_to_v2_10f_exact_count_transfer_2026-05-08.json |
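
The seed table can be cross-checked against the headline counts and the per-metric loss column. The sketch below transcribes the rows by hand into an illustrative tuple layout; it does not read the JSON sources, whose schema is not shown in this report, and it takes the paper-clean label as given rather than re-deriving the strict gate.

```python
from collections import Counter

# (seed, status, paper_clean_win, metric_losses), transcribed from the table.
SEED_ROWS = [
    (2407, "POST_SELECTION_AUDIT", False, []),
    (2801, "PREREGISTERED_HELD_OUT_WIN", True, []),
    (3001, "CONFIRMATORY_HELD_OUT_MIXED", False, ["Salience", "Fragility"]),
    (3109, "CONFIRMATORY_HELD_OUT_MIXED", False, ["Fragility"]),
    (4523, "CONFIRMED_HELD_OUT_WIN", True, []),
    (4627, "REPLICATION_BOUNDARY", False, []),
    (4703, "REPLICATION_BOUNDARY", False, []),
    (4909, "HELD_OUT_SELECTOR_GAP_FAILURE", False,
     ["Salience", "Fragility", "WVS salience"]),
    (8563, "PROSPECTIVE_MIXED_NEGATIVE", False, ["Sensitivity", "Alignment"]),
]

# Claim-bearing rows exclude the post-selection audit seed.
claim_bearing = [r for r in SEED_ROWS if r[1] != "POST_SELECTION_AUDIT"]

# Seeds labeled paper-clean wins.
clean_wins = [seed for seed, _, clean, _ in claim_bearing if clean]

# Seeds with no metric regression that still do not count as clean wins.
boundary_only = [seed for seed, _, clean, losses in claim_bearing
                 if not losses and not clean]

# Losses per metric across claim-bearing rows.
loss_counts = Counter(m for _, _, _, losses in claim_bearing for m in losses)

print(len(SEED_ROWS), len(claim_bearing), clean_wins)
print(boundary_only)
print(dict(loss_counts))
```

Seeds 4627 and 4703 are the rows where the two labels diverge: no metric regressed, yet neither counts as a paper-clean win.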

Rigor Implications

| Issue | Current evidence | Required strengthening |
| --- | --- | --- |
| Statistical power | Two clean wins exist, but the confidence interval is wide. | A preregistered multi-seed panel with fixed scaffold selection and the same access-log discipline. |
| Selector gap | Several boundary seeds show selector-dev does not reliably predict held-out quality. | Report selector-dev/final-test gaps as a first-class diagnostic. |
| Fragility tradeoff | Fragility improves on some rows but remains a repeated failure mode. | Add ablations that isolate update pressure from no-overreaction pressure. |
| WVS sensitivity | WVS improves strongly on some rows but ties or collapses on others. | Keep WVS endpoint-credit feasibility and semantic movement audits separate from official gate credit. |
| Prompt vs artifact route | Operation tags strengthen the baseline under dev-only replay. | Run an operation-tag off/on route ablation before any new held-out launch. |
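
To make the statistical-power row concrete, the sketch below computes how many wins a one-sided sign test needs among n non-tied seeds to reach a given threshold. The 0.05 threshold and the panel sizes other than 8 are illustrative assumptions, not preregistered values from this project, and the function names are hypothetical.

```python
from math import comb
from typing import Optional


def one_sided_sign_p(wins: int, n: int) -> float:
    """P(X >= wins) for X ~ Binomial(n, 0.5), i.e. the one-sided sign-test p."""
    return sum(comb(n, i) for i in range(wins, n + 1)) / 2**n


def min_wins_for_threshold(n: int, alpha: float = 0.05) -> Optional[int]:
    """Smallest win count among n non-tied seeds with one-sided p <= alpha.

    Returns None when even n wins out of n cannot reach the threshold.
    """
    for wins in range(n + 1):
        if one_sided_sign_p(wins, n) <= alpha:
            return wins
    return None


# 8 matches the current claim-bearing panel; 4 and 16 are hypothetical sizes.
for n in (4, 8, 16):
    print(n, min_wins_for_threshold(n))
```

Under these assumptions, a metric with only four non-tied seeds cannot reach the threshold even with four wins, and an eight-seed panel needs seven wins, so each tie or loss is costly at the current panel size.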

Related Outputs

  • paper/tables/publication_claim_tables.md
  • reports/publication_visual_table_package_2026-05-09.md
  • docs/ablation_depth_plan.md
  • docs/related_work_coverage.md