Statistical Reporting and Experimental Rigor Audit

Generated by scripts/generate_statistical_rigor_report.py on 2026-05-09.

This report is a deterministic analysis of saved artifacts only. It reads paper/tables/publication_claim_tables.json; it does not call a model, change prompts, or unlock any held-out split.

Statistical Posture

The statistical posture is intentionally conservative. The 3D rows are sequential scientific probes, not iid samples from a well-defined deployment population. The sign tests and bootstrap intervals below are therefore descriptive diagnostics, not population-level proof. They are useful because they make the residual pattern inspectable: which metrics repeatedly improve, which tie, and which still fail under transfer.
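As a rough illustration of how a bootstrap interval behaves over so few rows, the sketch below computes a percentile bootstrap interval for a mean of per-seed deltas. The percentile method, the resample count, the fixed RNG seed, the function name `bootstrap_mean_ci`, and the `hypothetical_deltas` values are all assumptions made for illustration. The report does not state which bootstrap variant the generator script uses, and the values are placeholders, not numbers from the saved artifacts.

```python
import random

def bootstrap_mean_ci(deltas, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-seed deltas."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    n = len(deltas)
    # Resample the rows with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    tail = (1.0 - level) / 2.0
    lo = means[int(tail * (n_resamples - 1))]
    hi = means[int((1.0 - tail) * (n_resamples - 1))]
    return lo, hi

# Hypothetical deltas for eight rows (placeholders, NOT values from the report).
hypothetical_deltas = [0.05, 0.00, -0.02, 0.03, 0.01, 0.04, -0.01, 0.02]
lo, hi = bootstrap_mean_ci(hypothetical_deltas)
print(f"mean={sum(hypothetical_deltas) / 8:+.4f}  CI=[{lo:+.4f}, {hi:+.4f}]")

# Eight exact ties give a degenerate interval with both ends at zero.
print(bootstrap_mean_ci([0.0] * 8))
```

With eight rows the resampled means take few distinct values, so the interval endpoints are coarse. This is one more reason to read such intervals as descriptive rather than inferential.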

Quantity	Value
Surfaced unlocked 3D rows	9
Claim-bearing rows excluding post-selection audit	8
Paper-clean held-out wins	2
Clean-win rate over claim-bearing rows	0.2500
Exact 95% clean-win-rate interval	[0.0319, 0.6509]
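The exact interval for 2 clean wins in 8 claim-bearing rows can be checked independently. The sketch below assumes the interval is a Clopper-Pearson interval, which the report does not name explicitly, and solves for its bounds by bisection on the binomial tail probabilities using only the standard library. The function names are illustrative.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); evaluates to 0 when k < 0."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))

def _solve_increasing(f, target, iters=200):
    """Bisection on [0, 1] for an increasing function f with f(p) = target."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def clopper_pearson(successes, n, level=0.95):
    """Exact two-sided binomial interval (assumed method, see lead-in)."""
    alpha = 1.0 - level
    # Lower bound: the p at which P(X >= successes) equals alpha / 2.
    lower = 0.0 if successes == 0 else _solve_increasing(
        lambda p: 1.0 - binom_cdf(successes - 1, n, p), alpha / 2.0)
    # Upper bound: the p at which P(X <= successes) equals alpha / 2.
    upper = 1.0 if successes == n else _solve_increasing(
        lambda p: 1.0 - binom_cdf(successes, n, p), 1.0 - alpha / 2.0)
    return lower, upper

lower, upper = clopper_pearson(2, 8)
print(f"rate={2 / 8:.4f}  interval=[{lower:.4f}, {upper:.4f}]")
```

Under that assumption the printed bounds should agree with the interval in the table. The width of the interval, roughly 0.03 to 0.65, is the quantitative form of the statement that two clean wins carry little statistical power.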

Metric-Level Seed Summary

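Rows below use the claim-bearing held-out/prospective rows and exclude the post-selection audit seed (2407). A positive mean delta favors the selected scaffold; the fragility delta is already inverted (baseline fragility minus selected fragility), so positive is better there too. The reported sign-test p-values are consistent with dropping ties, which is why the Valid format row, with eight ties, has no p-value.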

Metric	n	Mean delta	Bootstrap 95% CI	Win / tie / loss	One-sided sign p	Two-sided sign p
Salience	8	+0.0137	[-0.0019, +0.0317]	5 / 1 / 2	0.227	0.453
Sensitivity	8	+0.2083	[0.0000, +0.4583]	4 / 3 / 1	0.188	0.375
Valid format	8	0.0000	[0.0000, 0.0000]	0 / 8 / 0	N/A	N/A
Fragility	8	+0.0471	[-0.0823, +0.1761]	5 / 0 / 3	0.363	0.727
Alignment	8	+0.0680	[+0.0380, +0.0963]	7 / 0 / 1	0.035	0.070
WVS salience	8	+0.0536	[+0.0061, +0.1092]	6 / 1 / 1	0.062	0.125
WVS sensitivity	8	+0.5000	[+0.1250, +0.8750]	4 / 4 / 0	0.062	0.125
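The sign-test columns can be recomputed from the win / tie / loss counts alone. The sketch below assumes an exact binomial sign test with ties dropped and a two-sided p-value equal to twice the smaller tail (capped at 1); the report does not state these conventions, but they are consistent with the table. The function name `sign_test` is illustrative.

```python
from math import comb

def sign_test(wins, losses):
    """Exact sign test with ties dropped; returns (one_sided, two_sided)."""
    n = wins + losses
    if n == 0:
        return None, None  # all ties: no test is defined
    upper = sum(comb(n, k) for k in range(wins, n + 1)) / 2**n  # P(X >= wins)
    lower = sum(comb(n, k) for k in range(0, wins + 1)) / 2**n  # P(X <= wins)
    two_sided = min(1.0, 2.0 * min(upper, lower))
    return upper, two_sided

# (wins, losses) per metric, taken from the table above; ties are omitted.
rows = {
    "Salience": (5, 2),
    "Sensitivity": (4, 1),
    "Valid format": (0, 0),
    "Fragility": (5, 3),
    "Alignment": (7, 1),
    "WVS salience": (6, 1),
    "WVS sensitivity": (4, 0),
}
for name, (wins, losses) in rows.items():
    one, two = sign_test(wins, losses)
    if one is None:
        print(f"{name}: N/A")
    else:
        print(f"{name}: one-sided={one:.4f} two-sided={two:.4f}")
```

Because ties are dropped, a metric with four wins and four ties (WVS sensitivity) is tested on only four informative rows, so its smallest attainable one-sided p-value is 1/16.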

Seed-Level Clean-Win Audit

The paper-ready clean-win label is stricter than metric no-regression. A seed may show no metric regression and still be treated as boundary evidence if it missed a preregistered strict gate, was a post-selection audit, or had a non-claim-bearing status.

Seed	Status	Paper-clean win	Metric no-regression	Metric losses	Source
2407	POST_SELECTION_AUDIT	no	yes	none	`reports/3d_ethics_scaffold_family_prospective_seed2407_2026-05-06.json`
2801	PREREGISTERED_HELD_OUT_WIN	yes	yes	none	`reports/3d_ethics_scaffold_family_prospective_seed2801_2026-05-06.json`
3001	CONFIRMATORY_HELD_OUT_MIXED	no	no	Salience, Fragility	`reports/3d_ethics_scaffold_family_confirmatory_seed3001_2026-05-06.json`
3109	CONFIRMATORY_HELD_OUT_MIXED	no	no	Fragility	`reports/3d_ethics_scaffold_family_confirmatory_seed3109_2026-05-06.json`
4523	CONFIRMED_HELD_OUT_WIN	yes	yes	none	`reports/3d_ethics_v2_3s_to_v2_3u_wvs_guarded_replication_matrix_2026-05-07.json`
4627	REPLICATION_BOUNDARY	no	yes	none	`reports/3d_ethics_v2_3s_to_v2_3u_wvs_guarded_replication_matrix_2026-05-07.json`
4703	REPLICATION_BOUNDARY	no	yes	none	`reports/3d_ethics_v2_3s_to_v2_3u_wvs_guarded_replication_matrix_2026-05-07.json`
4909	HELD_OUT_SELECTOR_GAP_FAILURE	no	no	Salience, Fragility, WVS salience	`reports/3d_ethics_qwen3b_scaffold_family_tournament_v2_4b_semantic_gate_seed4909_2026-05-07.json`
8563	PROSPECTIVE_MIXED_NEGATIVE	no	no	Sensitivity, Alignment	`reports/3d_ethics_v2_10e_to_v2_10f_exact_count_transfer_2026-05-08.json`
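The relationship between the two labels can be tallied directly from the table. In the sketch below, the rows are transcribed from the table above, while the `CLEAN_WIN_STATUSES` and `NON_CLAIM_BEARING` sets are assumptions inferred from which rows the table marks as clean wins or excludes; they are not taken from the generator script.

```python
from collections import Counter

# (seed, status, metric losses) transcribed from the table above.
SEED_ROWS = [
    (2407, "POST_SELECTION_AUDIT", []),
    (2801, "PREREGISTERED_HELD_OUT_WIN", []),
    (3001, "CONFIRMATORY_HELD_OUT_MIXED", ["Salience", "Fragility"]),
    (3109, "CONFIRMATORY_HELD_OUT_MIXED", ["Fragility"]),
    (4523, "CONFIRMED_HELD_OUT_WIN", []),
    (4627, "REPLICATION_BOUNDARY", []),
    (4703, "REPLICATION_BOUNDARY", []),
    (4909, "HELD_OUT_SELECTOR_GAP_FAILURE",
     ["Salience", "Fragility", "WVS salience"]),
    (8563, "PROSPECTIVE_MIXED_NEGATIVE", ["Sensitivity", "Alignment"]),
]

# Assumption: status sets inferred from the table, not from the real script.
CLEAN_WIN_STATUSES = {"PREREGISTERED_HELD_OUT_WIN", "CONFIRMED_HELD_OUT_WIN"}
NON_CLAIM_BEARING = {"POST_SELECTION_AUDIT"}

claim_bearing = [row for row in SEED_ROWS if row[1] not in NON_CLAIM_BEARING]
clean_wins = [seed for seed, status, _ in claim_bearing
              if status in CLEAN_WIN_STATUSES]
no_regression = [seed for seed, _, losses in SEED_ROWS if not losses]
# Per-metric loss counts over claim-bearing rows only.
loss_counts = Counter(m for _, _, losses in claim_bearing for m in losses)

print("claim-bearing rows:", len(claim_bearing))
print("clean wins:", clean_wins)
print("no-regression seeds:", no_regression)
print("losses per metric:", dict(loss_counts))
```

Five seeds show no metric regression but only two count as clean wins, which is the gap the stricter label is meant to expose. The per-metric loss counts also agree with the loss column of the metric-level summary.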

Rigor Implications

Issue	Current evidence	Required strengthening
Statistical power	Two clean wins exist, but the confidence interval is wide.	A preregistered multi-seed panel with fixed scaffold selection and the same access-log discipline.
Selector gap	Several boundary seeds show that selector-dev performance does not reliably predict held-out quality.	Report selector-dev/final-test gaps as a first-class diagnostic.
Fragility tradeoff	Fragility improves on some rows but remains a repeated failure mode.	Add ablations that isolate update pressure from no-overreaction pressure.
WVS sensitivity	WVS improves strongly on some rows but ties or collapses on others.	Keep WVS endpoint-credit feasibility and semantic movement audits separate from official gate credit.
Prompt vs artifact route	Operation tags strengthen the baseline under dev-only replay.	Run an operation-tag off/on route ablation before any new held-out launch.

Related Outputs

paper/tables/publication_claim_tables.md
reports/publication_visual_table_package_2026-05-09.md
docs/ablation_depth_plan.md
docs/related_work_coverage.md
