This matrix maps the repository’s paper-facing claims to concrete artifacts and keeps the line clear between supported evidence, directional evidence, and development-only diagnostics.
Paper-facing prose uses the term `incumbent baseline` for the locked pre-scaffold
commonsense comparator. Historical reports and raw artifact paths retain the
legacy id `current_round_7`.
| Claim | Artifact | Current support | Claim boundary | Release use |
|---|---|---|---|---|
| Full-seed iterative rewriting is directionally better than the manual fixed prompt | ../outputs/runs/seed_17/final_evaluation/statistics.json, neurips_assets_summary.json | Directionally supported | Pairwise test is not significant | Repeat across more seeds with matched local backend |
| A frozen teacher-refined prompt can beat continued adaptation | ../outputs/final_gemini_experiment_qwen_0p5b_seed17_checkpoint320/final_experiment_summary.json, figures/checkpoint320_final_test_accuracy.png | Narrow claim-bearing support | Held-out final test is only 64 examples | Re-run the same controlled frozen-vs-adaptive design on more seeds |
| Selector-dev can mis-rank the held-out winner | figures/checkpoint320_selector_vs_final.png, ../outputs/final_gemini_experiment_qwen_0p5b_seed17_checkpoint320/runs/seed_17/frozen_track/track_summary.json | Artifact-backed | Based on one completed checkpoint seed | Measure selector regret across more seeds and selector slices |
| ETHICS scaffold-freezing works beyond the checkpoint mechanism microscope | ../paper_aies_expanded/supplement.pdf, ../paper/tables/publication_claim_tables.md, figures/ethics_10seed_final_deltas.png | Static-classification evidence: the 10-seed tournament reports 6 frozen wins, 2 ties, 2 continued wins, and mean frozen-minus-continued advantage +0.0438 | ETHICS is supporting route evidence, not the headline perturbation proof | Treat as scaffold-freezing evidence, not as universal ETHICS victory |
| Fixed ETHICS prompt artifacts retain useful audit signal and expose capacity sensitivity | figures/ethics_postselection_audit_fixed_artifacts.png, figures/ethics_capacity_audit_fixed_artifacts.png, ../paper/refined_prompt_shape_epiplexity_paper.pdf | Post-selection audit and capacity audit support artifact robustness and route-specificity | Audit rows are not independent discovery; unchanged prompt transfer is student-size sensitive | Use as VAE cost/residual and transfer-boundary evidence |
| Prompt-shape search is a better conceptual frame than local prompt polishing alone | ../paper/refined_prompt_shape_epiplexity_paper.tex, ../docs/3d_ethics_prompt_shape_framework.md, figures/prompt_shape_landscape.png, matched_budget_revision_qwen_0p5b_smoke.md | Conceptual plus exploratory operational support | Broad matched-budget dominance is outside the release claim | Keep the paper claim conceptual and artifact-backed rather than broad |
| The corrected 3D Ethics prompt-rewriting protocol is implemented with locked-split discipline and can produce clean held-out teacher-family wins against the incumbent baseline | ../docs/3d_ethics_stability_protocol.md, 3d_ethics_scaffold_family_prospective_protocol_seed2801_2026-05-06.md, 3d_ethics_scaffold_family_prospective_seed2801_2026-05-06.md, 3d_ethics_qwen3b_scaffold_family_tournament_v2_3s_wvs_guarded_seed4523_2026-05-07.md | Two narrow claim-bearing held-out wins: seed 2801 and seed 4523 | Wins are relative to the locked incumbent baseline; broad all-seed confirmation is outside the release claim | Use the wins as real anchors, while keeping mixed rows visible as the repeatability boundary |
| The surfaced 3D held-out/prospective rows show a positive but underpowered seed-level pattern | statistical_reporting_3d_2026-05-09.md, statistical_reporting_3d_2026-05-09.json, ../paper/tables/publication_claim_tables.md | Descriptive statistical support | Rows are sequential probes, not iid samples; clean-win interval is wide and p-values are diagnostic | Use the report to calibrate language; do not claim population-level significance |
| A teacher-generated scaffold family can clear strict 3D development gates on fresh splits | 3d_ethics_scaffold_family_replay_seed211_dev_2026-05-04.md, 3d_ethics_qwen3b_scaffold_family_tournament_v2_1_dev_2026-05-04.md, 3d_ethics_qwen3b_scaffold_family_tournament_v2_1_replay_seed613_dev_2026-05-04.md | Development-only support | No single family stayed gate-passing across all later fresh dev seeds | Continue dev-only family consolidation |
| The first prospective 3D scaffold-family smoke showed a real selector gap and real family weakness | 3d_ethics_scaffold_family_prospective_seed307_2026-05-04.md, 3d_ethics_scaffold_family_seed307_postprospective_autopsy_2026-05-04.md | Strong diagnostic support | Negative result, not a launch | Use only as development evidence; do not reuse the split confirmatorily |
| The seed-2801 support-state win plus the immediate three-seed confirmation boundary sharply localized the first repeatability problem | 3d_ethics_scaffold_family_confirmatory_matrix_seeds2903_3001_3109_2026-05-06.md, 3d_ethics_scaffold_family_confirmatory_seed3001_2026-05-06.md, 3d_ethics_scaffold_family_confirmatory_seed3109_2026-05-06.md, ../docs/current_status.md | Mixed: the seed-2801 claim-bearing win plus strong negative repeatability evidence | Broad multi-seed support is not safe; the wall was tiny salience-margin misses, fragility regression, and one selector-gap recurrence | Use this boundary as the rationale for the later WVS-guarded named-criterion lane, not as a broad confirmation |
| Post-matrix support-basis/criterion-lock search can create strong tuned-dev candidates, but the current best did not survive fresh-dev replay | 3d_ethics_qwen3b_scaffold_family_tournament_v2_3l_criterion_lock_seed3709_dev_2026-05-07.md, 3d_ethics_qwen3b_scaffold_family_tournament_v2_3l_replay_seed3907_dev_2026-05-07.md, 3d_ethics_qwen3b_scaffold_family_tournament_v2_3m_minimal_criterion_patch_seed3907_dev_2026-05-07.md, 3d_ethics_v2_3l_v2_3m_pareto_diagnostic_2026-05-07.md, ../docs/research_logs/3d_ethics_v2_3j_to_v2_3m_dev_cycle_2026-05-07.md | Development-only diagnostic support | Seed 3709 v2.3l passed all dev gates, but seed 3907 v2.3l/v2.3m failed salience and fragility; no final-test was accessed | Use the seed-4127 Pareto-frontier cycle as the follow-up evidence; do not promote the 3709 tuned pass |
| Seed-4127 Pareto-frontier and changed-case probes show the support-change mechanism can recover WVS sensitivity and low fragility, but still cannot clear the strict salience gate | 3d_ethics_qwen3b_scaffold_family_tournament_v2_3n_pareto_seed4127_dev_2026-05-07.md, 3d_ethics_qwen3b_scaffold_family_tournament_v2_3o_terminal_patch_seed4127_dev_2026-05-07.md, 3d_ethics_qwen3b_scaffold_family_tournament_v2_3p_changed_case_seed4127_dev_2026-05-07.md, 3d_ethics_seed4127_changed_case_salience_gate_audit_2026-05-07.md, 3d_ethics_seed4127_wvs_salience_expert_adjudication_2026-05-07.md, ../docs/research_logs/3d_ethics_v2_3n_to_v2_3p_pareto_seed4127_dev_2026-05-07.md | Development-only no-launch evidence plus derived/single-expert audit | Best changed-case candidate failed only the strict salience-improvement gate; all three runs kept final_test locked; expert adjudication says the gate should not be waived because at least one WVS preservation omission is real | Run the pre-specified seed-4303 dev-only WVS-preservation patch protocol before any held-out spend |
| The seed-4303/4409 bridge showed that WVS preservation and WVS fact-change sensitivity must be gated explicitly before held-out launch | 3d_ethics_wvs_preservation_patch_protocol_seed4303_dev_2026-05-07.md, 3d_ethics_qwen3b_scaffold_family_tournament_v2_3q_wvs_preservation_seed4303_dev_2026-05-07.md, 3d_ethics_qwen3b_scaffold_family_tournament_v2_3r_salience_micro_lift_seed4409_2026-05-07.md | Development/no-launch plus negative held-out diagnostic | Seed 4303 did not launch; seed 4409 launched a WVS-blind selector winner and failed final salience/fragility | Justifies the prospective WVS-specific sensitivity gate introduced for seed 4523; does not itself support a positive held-out claim |
| A WVS-guarded named-criterion scaffold can produce a second clean held-out win against the incumbent baseline | 3d_ethics_qwen3b_scaffold_family_tournament_v2_3s_wvs_guarded_seed4523_2026-05-07.md, 3d_ethics_qwen3b_scaffold_family_tournament_v2_3s_wvs_guarded_seed4523_2026-05-07.json, ../outputs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_3s_wvs_guarded_seed4523/stability_prompt_rewrite_runs/seed_4523/data/access_log.json | New claim-bearing held-out support vs the incumbent baseline | Seed 4523 is one clean prospective win; it should be combined with seed 2801 as evidence of a real scaffold family, not described as broad all-seed confirmation | Use in paper as a second held-out win; keep seed 4627/4703 replication boundaries visible |
| The v2.3s WVS-guarded basin improves salience/fragility repeatedly while exposing held-out WVS-sensitivity instability across fresh seeds | 3d_ethics_v2_3s_to_v2_3u_wvs_guarded_replication_matrix_2026-05-07.md, 3d_ethics_qwen3b_scaffold_family_tournament_v2_3t_wvs_guarded_replication_seed4627_2026-05-07.md, 3d_ethics_qwen3b_scaffold_family_tournament_v2_3u_wvs_guarded_replication_seed4703_2026-05-07.md | Mixed prospective replication evidence | Seed 4627 ties salience instead of strictly winning; seed 4703 wins salience/fragility but WVS sensitivity remains 0.0; both accessed final_test exactly once | Use as residual-frontier evidence for WVS sensitivity-control analysis |
| A semantic WVS audit suggests some official WVS sensitivity misses are threshold/measurement boundaries rather than total support-removal blindness | 3d_ethics_wvs_changed_fact_semantic_audit_2026-05-07.md, 3d_ethics_wvs_changed_fact_semantic_audit_2026-05-07.json, 3d_ethics_wvs_changed_fact_semantic_audit_2026-05-07/audit_manifest.json | Post-hoc measurement-audit support | This is not the official metric and cannot replace held-out WVS sensitivity | Use as a measurement caveat and define auxiliary gates prospectively |
| v2.6a shows the WVS boundary consists of weak directional support updates that fall below the official sensitivity threshold, not pure support-change blindness | 3d_ethics_v2_6_support_basis_contrast_audit_2026-05-08.md, 3d_ethics_v2_6_support_basis_contrast_audit_2026-05-08.json, ../configs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_6a_support_basis_contrast_seed7307_dev.yaml, ../outputs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_6a_support_basis_contrast_seed7307_dev/stability_prompt_rewrite_runs/seed_7307/data/access_log.json | Development-only no-launch plus row-level measurement audit | No candidate passed hard gates and final_test remained locked; this cannot support a positive held-out claim | Use with v2.6b as metric-aware dev evidence |
| v2.6b shows weak WVS support recognition is real but cannot replace official WVS sensitivity as the paper gate | 3d_ethics_v2_6b_semantic_wvs_weak_update_audit_2026-05-08.md, 3d_ethics_v2_6b_semantic_wvs_weak_update_audit_2026-05-08.json, 3d_ethics_v2_6b_semantic_wvs_weak_update_audit_2026-05-08/audit_manifest.json, ../docs/research_logs/3d_ethics_v2_6b_semantic_wvs_weak_update_audit_2026-05-08.md | Supplemental measurement-audit support over saved selector-dev rows | This audit runs after prompt selection, does not access final_test, and weak one-point updates are also common in the incumbent baseline; it is diagnostic, not headline held-out evidence | Run a fresh dev-only metric-aware replay that reports official WVS sensitivity and supplemental weak-update sensitivity side by side |
| v2.6c--v2.6f show that the metric-aware frozen pool can still pass fresh dev but does not transfer prospectively enough for another held-out launch | 3d_ethics_v2_6c_to_v2_6f_metric_aware_replay_2026-05-08.md, 3d_ethics_v2_6c_to_v2_6f_metric_aware_replay_2026-05-08.json, ../docs/research_logs/3d_ethics_v2_6c_metric_aware_replay_seed7403_dev_protocol_2026-05-08.md, ../docs/research_logs/3d_ethics_v2_6d_metric_aware_prospective_seed7507_protocol_2026-05-08.md, ../docs/research_logs/3d_ethics_v2_6f_metric_aware_prospective_seed7703_protocol_2026-05-08.md | Mixed development/prospective no-launch evidence | Seed 7403 is a strong dev pass, but seeds 7507 and 7703 both blocked before final-test and seed 7603 repair failed; no new held-out win exists | Stop immediate prospective attempts from this frozen pool; move to selector calibration or support-basis tagging |
| v2.7a--v2.7g show that native WVS label handling plus a minimal output lock can pass one dev split while failing paper-ready transfer | 3d_ethics_v2_7a_to_v2_7g_native_label_output_lock_cycle_2026-05-08.md, 3d_ethics_v2_7a_to_v2_7g_native_label_output_lock_cycle_2026-05-08.json, ../docs/research_logs/3d_ethics_v2_7c_minimal_output_lock_seed7803_dev_protocol_2026-05-08.md, ../docs/research_logs/3d_ethics_v2_7f_minimal_output_lock_prospective_seed8009_protocol_2026-05-08.md | Development-only pass plus prospective no-launch evidence | Seed 7803 v2.7c passed all dev hard gates, but seed 7901 fresh replay failed fragility/WVS sensitivity and seed 8009 prospective blocked before final-test; no new held-out win exists | Use as evidence for split-safe support-basis tags and selector calibration |
| v2.8a--v2.8f show that split-safe support-basis artifacts are a promising VAE route, but still not stable enough for a new held-out claim | 3d_ethics_v2_8a_to_v2_8f_support_basis_artifact_cycle_2026-05-08.md, 3d_ethics_v2_8a_to_v2_8f_support_basis_artifact_cycle_2026-05-08.json, ../configs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_8c_support_basis_tags_prospective_seed8303.yaml, ../outputs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_8c_support_basis_tags_prospective_seed8303/stability_prompt_rewrite_runs/seed_8303/data/access_log.json | Strong development-only mechanism evidence plus prospective no-launch | Seeds 8101 and 8209 passed dev gates with WVS sensitivity 1.0, but seed 8303 blocked before final-test; repair runs v2.8d--f did not recover a clean gate package. No v2.8 run accessed final-test. | Build WVS polarity / least-supportive-label instrumentation and require multi-dev transfer before any new prospective launch |
| v2.9a--v2.9c show that WVS polarity artifacts can execute endpoint movement, but official WVS sensitivity can be infeasible for already-skeptical canonical rows | 3d_ethics_v2_9a_to_v2_9b_wvs_polarity_artifact_audit_2026-05-08.md, 3d_ethics_v2_9a_to_v2_9b_wvs_polarity_artifact_audit_2026-05-08.json, 3d_ethics_v2_9c_wvs_endpoint_credit_audit_2026-05-08.md, 3d_ethics_v2_9c_wvs_endpoint_credit_audit_2026-05-08.json, ../outputs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_9a_wvs_polarity_artifact_seed8407_dev/stability_prompt_rewrite_runs/seed_8407/data/access_log.json, ../outputs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_9b_wvs_support_endpoints_seed8407_dev/stability_prompt_rewrite_runs/seed_8407/data/access_log.json | Development-only near-win plus derived measurement audit | v2.9a improved salience, fragility, alignment, and WVS salience but failed official WVS sensitivity. The v2.9c audit found 16/16 WVS changed-support rows reached the printed least-supportive endpoint, while only 5/16 could receive official credit; no v2.9 run accessed final-test. | Run a feasibility-stratified dev replay that reports official WVS sensitivity and endpoint movement separately before any prospective launch |
| v2.10a--v2.10d show that feasibility-stratified WVS core-values rows can recover official WVS sensitivity, but the minimal endpoint repair still needs fresh-dev transfer before any held-out launch | 3d_ethics_v2_10a_to_v2_10d_feasibility_stratified_cycle_2026-05-08.md, 3d_ethics_v2_10a_to_v2_10d_feasibility_stratified_cycle_2026-05-08.json, ../configs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_10d_minimal_baseline_endpoint_seed8511_dev.yaml, ../outputs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_10d_minimal_baseline_endpoint_seed8511_dev/stability_prompt_rewrite_runs/seed_8511/data/access_log.json | Development-only mechanism support and strongest current near-pass | v2.10d beats the incumbent baseline on salience (0.9796 vs 0.9558), WVS sensitivity (1.0 vs 0.5), WVS salience (0.9388 vs 0.8673), and fragility (0.0238 vs 0.0754), and ties it on valid format (1.0), but aggregate sensitivity is exactly 4/6 = 0.6667, below the configured decimal 0.67 floor. No v2.10 run accessed final-test. | Run a fresh dev transfer of the minimal endpoint family with an explicitly pre-specified exact-count sensitivity gate (>= 4/6) if that is the intended threshold |
| v2.10e--v2.10f show exact-count endpoint transfer on fresh dev and a mixed-negative prospective held-out boundary | 3d_ethics_v2_10e_to_v2_10f_exact_count_transfer_2026-05-08.md, 3d_ethics_v2_10e_to_v2_10f_exact_count_transfer_2026-05-08.json, ../configs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_10e_minimal_endpoint_exact_count_seed8541_dev.yaml, ../configs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_10f_minimal_endpoint_exact_count_prospective_seed8563.yaml, ../outputs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_10f_minimal_endpoint_exact_count_prospective_seed8563/stability_prompt_rewrite_runs/seed_8563/data/access_log.json | Fresh dev support plus mixed-negative prospective held-out evidence | v2.10e passed all selector-dev gates. v2.10f unlocked final-test exactly once and improved held-out salience, fragility, and WVS salience, but aggregate sensitivity collapsed to 0/3 vs baseline 1/3; therefore no new held-out win exists. | Use as non-WVS changed-support sensitivity boundary evidence |
| v2.10g--v2.10h show that local non-WVS wording repairs hit a sensitivity/fragility tradeoff rather than producing a launchable candidate | 3d_ethics_v2_10g_to_v2_10h_non_wvs_repair_2026-05-08.md, 3d_ethics_v2_10g_to_v2_10h_non_wvs_repair_2026-05-08.json, ../configs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_10g_non_wvs_update_repair_seed8597_dev.yaml, ../configs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_10h_changed_only_no_drift_seed8597_dev.yaml, ../outputs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_10h_changed_only_no_drift_seed8597_dev/stability_prompt_rewrite_runs/seed_8597/data/access_log.json | Development-only no-launch evidence | v2.10g recovered sensitivity only with high fragility; v2.10h preserved low fragility only with sensitivity stuck at 3/6. Both runs kept final_test locked and produced no new held-out win. | Move to a structural row-local support-state execution artifact before any new prospective held-out launch |
| v2.10i--v2.10l show that row-local operation tags are a strong VAE-style artifact route with a fresh-dev transfer boundary | 3d_ethics_v2_10i_to_v2_10l_operation_artifact_2026-05-08.md, 3d_ethics_v2_10i_to_v2_10l_operation_artifact_2026-05-08.json, ../configs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_10i_operation_tag_seed8601_dev.yaml, ../configs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_10l_operation_artifact_salience_lift_fresh_seed8629_dev.yaml, ../outputs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_10l_operation_artifact_salience_lift_fresh_seed8629_dev/stability_prompt_rewrite_runs/seed_8629/data/access_log.json, 3d_ethics_operation_route_ablation_config_audit_2026-05-09.md | Development-only mechanism evidence plus fresh-dev transfer boundary; route-ablation config pair is ready | v2.10k passed all same-seed dev gates, but v2.10l failed salience and fragility versus the stronger operation-artifact baseline despite keeping sensitivity 5/6, valid format 1.0, WVS sensitivity 1.0, and high alignment. All v2.10i-l access logs show zero final-test events. Seed 8707 is the matched route-ablation config pair. | Use the paired dev-only route ablation to compare operation-tag off vs on with prompts and splits held fixed |
| The current implementation now guards the operation-artifact route against inconsistent configs and misleading selector summaries | 3d_ethics_implementation_design_audit_2026-05-09.md, 3d_ethics_operation_route_ablation_config_audit_2026-05-09.md, ../docs/research_logs/3d_ethics_operation_artifact_route_ablation_protocol_2026-05-09.md, ../src/ethics_prompt_rewrite/config.py, ../src/ethics_prompt_rewrite/stability_experiment.py | Implementation and protocol hardening | These audits do not add a new model result and cannot support a held-out win by themselves. | Use the paired dev-only route ablation as the release control lane |
| v2.4a fresh-dev semantic gating strengthened the named-criterion basin before a prospective check | 3d_ethics_qwen3b_scaffold_family_tournament_v2_4a_semantic_gate_seed4801_dev_2026-05-07.md, 3d_ethics_qwen3b_scaffold_family_tournament_v2_4a_semantic_gate_seed4801_dev_2026-05-07.json, ../outputs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_4a_semantic_gate_seed4801_dev/stability_prompt_rewrite_runs/seed_4801/data/access_log.json | Development-only support | final_test remained locked, so it cannot support a held-out claim by itself | Use as the rationale for the seed-4909 prospective check |
| v2.4b seed-4909 is negative prospective evidence: the selected named-criterion scaffold failed held-out clean-win criteria due to selector-gap fragility collapse | 3d_ethics_qwen3b_scaffold_family_tournament_v2_4b_semantic_gate_seed4909_2026-05-07.md, 3d_ethics_v2_4b_seed4909_fragility_autopsy_2026-05-07.md, ../outputs/3d_ethics_stability_qwen_3b_scaffold_family_tournament_v2_4b_semantic_gate_seed4909/stability_prompt_rewrite_runs/seed_4909/data/access_log.json | Claim-bearing negative held-out evidence | It weakens broad all-seed claims; it does not erase the seed-2801/4523 wins | Treat as selector-gap/fragility frontier evidence and avoid further held-out launches without a selector repair |
| v2.4c/v2.4d show that prompt wording alone has not solved the joint salience/fragility/WVS gate problem | 3d_ethics_qwen3b_scaffold_family_tournament_v2_4c_fragility_hardening_seed5003_dev_2026-05-07.md, 3d_ethics_qwen3b_scaffold_family_tournament_v2_4d_minimal_support_patch_seed5101_dev_2026-05-07.md, ../docs/research_logs/3d_ethics_v2_4a_to_v2_4d_semantic_gate_and_fragility_cycle_2026-05-07.md | Development-only no-launch evidence | No held-out access; no positive claim. Seed 5101 shows a strong named-criterion near-pass but still missed the pre-specified salience gate | Run a dev-only selector/gate audit before any new prospective seed |
- If you are writing the paper, treat the ETHICS checkpoint/scaffold-freezing rows plus the 3D `2801`/`4523` rows as the strongest release-surface evidence.
- If you are touching the 3D program, read rows 16 onward together; the 3D line now has two access-log verified held-out wins (`2801`, `4523`) and a clearly labeled boundary against broad all-seed confirmation.
- If a result depends on a split whose `final_test` was later used to redesign the method, treat that result as diagnostic only.
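
The v2.10d row in the matrix turns on a rounding detail: an aggregate sensitivity of 4/6 = 0.6667 falls just under a decimal floor of 0.67, while an exact-count gate of at least 4/6 would pass. The sketch below only illustrates that arithmetic. The function names and signatures are hypothetical and are not taken from the repository's gate implementation.

```python
from fractions import Fraction


def passes_decimal_floor(hits: int, total: int, floor: float = 0.67) -> bool:
    """Hypothetical gate: compare the float ratio against a rounded decimal floor."""
    return hits / total >= floor


def passes_exact_count(hits: int, total: int, min_hits: int, min_total: int) -> bool:
    """Hypothetical gate: compare exact rationals, so 4/6 meets a 4/6 threshold."""
    return Fraction(hits, total) >= Fraction(min_hits, min_total)


# v2.10d dev result from the matrix: 4 of 6 changed-support rows counted as sensitive.
print(passes_decimal_floor(4, 6))        # False: 0.6667 is below 0.67
print(passes_exact_count(4, 6, 4, 6))    # True: 4/6 meets an exact 4/6 gate
print(passes_exact_count(3, 6, 4, 6))    # False: 3/6 is below 4/6
```

Stating the threshold as a count, fixed before the run, avoids having the pass/fail outcome depend on how a repeating decimal was rounded in a config file.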