\documentclass[11pt]{article}
\usepackage[margin=1in]{geometry}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{array}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage[numbers]{natbib}
\usepackage{xcolor}
\usepackage{tabularx}
\usepackage{longtable}
\usepackage{enumitem}
\usepackage{float}
\hypersetup{
hypertexnames=false,
colorlinks=true,
linkcolor=blue!50!black,
citecolor=blue!50!black,
urlcolor=blue!50!black
}
\newtheorem{definition}{Definition}
\newtheorem{claim}{Claim}
\newcommand{\cA}{\mathcal{A}}
\newcommand{\VAE}{\mathrm{VAE}}
\newcommand{\PSE}{\mathrm{PSE}}
\setlength{\emergencystretch}{3em}
\title{Stupidity as Moral Failure: Value-Aligned Epiplexity, Phronesis, and Auditable AI Governance}
\author{Anonymous Author(s)}
\date{May 2026}
\begin{document}
\maketitle
\begin{abstract}
Contemporary LLMs can display impressive reasoning while failing to notice
morally decisive features of a case. We call this failure stupidity: not a
deficit of raw capability, but a breakdown in moral attention. The paper argues
that because computational attention is scarce, decisions about what a model is
trained, prompted, retrieved, or evaluated to attend to are already normative
decisions. Such failures therefore belong to AI ethics, not merely model
performance.
To make this claim operational, we develop Value-Aligned Epiplexity (VAE),
extending the notion of epiplexity: the structured information a
computationally bounded observer can actually learn. VAE asks whether an
alignment pipeline makes value-relevant structure both learnable by the model
and inspectable by humans. Philosophically, this gives computational form to
phronesis, or practical wisdom: the capacity to perceive what matters in a
particular situation. Technically and institutionally, it reframes alignment as
the design of pathways that preserve morally salient structure under bounded
attention.
The paper makes three contributions. First, it clarifies a class of ethical
failures that are neither malicious nor simply mistaken, but arise from trained
inattentiveness. Second, it offers VAE as a bridge between normative theory,
interpretability, and evaluation practice. Third, it develops a governance
implication: responsibility should turn on what the socio-technical pipeline
could reasonably have made the system notice through design, testing,
retrieval, audit, or oversight. Robust alignment, on this account, is the
capacity to keep moral attention stable when the situation rewards its loss.
\end{abstract}
\section{Introduction}
Large language models can reason fluently and still fail morally because the
decisive feature never enters operative attention. A model may parse the words,
produce plausible reasons, and still miss the morally salient particular: the
deception, the coercion, the intact support basis, or the unsupported frame it
has imported into the case. This paper uses \emph{stupidity} in that technical
sense. It is not an insult, not low intelligence, not mere factual error, and
not malice. It is a failure-to-notice under bounded computational conditions.
Arendt's account of thoughtless compliance motivates the danger
\citep{arendt1963eichmann}; the engineering and governance question is what an
AI pipeline could reasonably have made the system notice.
Computational attention is scarce. Choices about training data, preference
modeling, system prompts, retrieval, memory, evaluation gates, and oversight
shape which value-relevant features remain live at the moment of judgment.
Those choices allocate moral attention and have normative significance. The
central question is
not only whether a model is capable in the abstract, but whether a specified
alignment route makes the morally relevant structure usable when it matters.
Value-Aligned Epiplexity (VAE) formalizes this question as route-specific
cost-plus-residual accounting. An intervention route, such as prompting,
retrieval, data curation, or weight updating, produces a value-relevant
artifact at some cost. A bounded model conditions on that artifact and leaves
some residual failure. VAE asks what a route can make the model notice and use,
what burden that route imposes, and what residual moral-attention risk remains.
This makes the philosophical question of attention experimentally and
institutionally tractable.
Phronesis supplies the philosophical target. Practical wisdom is not reducible
to rule-following or benchmark scoring, but one component of phronesis is
perceptual: seeing which particulars matter in a concrete case
\citep{aristotle_ethics}. Prompt scaffolds are a minimal auditable route for
testing that perceptual component. A support-state scaffold asks a frozen
student to preserve a supported value when its basis remains intact, update
when support-relevant facts change, reject unsupported imported frames, and
produce parseable outputs for audit. It does not give the model virtue; it
tests whether one narrow operation of moral attention can be externalized into
an inspectable artifact.
The empirical evidence follows that chain. ETHICS supplies supporting route
evidence: prompt-shape discovery moves static moral classification, exposes a
selector/final gap, and supports scaffold freezing. The main proof of
possibility is the 3D moral-stability track, where two clean,
access-log-verified held-out wins show that specific operations of moral
attention can be
made executable by a frozen weak student through prompting alone. Seed 2801
uses the 2801 support-state scaffold to make preservation/update executable;
seed 4523 uses the 4523 no-import scaffold to make criterion no-import
discipline executable. Mixed, failed, blocked, and dev-only rows are retained
as search-cost, protocol-discipline, and residual-frontier evidence.
We use \emph{3D moral stability} to name three perturbation demands:
preservation under support-intact perturbation, updating under
support-changing perturbation, and resistance to unsupported or irrelevant
moral-frame shifts. WVS refers to the World Values Survey slice recorded in
the evaluation instrument manifest; in the 3D benchmark it functions as a hard
social-trust and core-values stress slice. It matters because trust-related
judgments are vulnerable to two opposite errors: superficial preservation when
support has changed, and overreaction when trust support remains intact.
\paragraph{Contributions.}
\begin{enumerate}[leftmargin=1.4em]
\item We define Value-Aligned Epiplexity as route-specific
cost-plus-residual accounting for candidate value-relevant intervention
artifacts.
\item We define Prompt-Shape Epiplexity as the prompt-only, auditable VAE
route for frozen weak students.
\item We present teacher-guided scaffold-family search as a disciplined
protocol with frozen students, bounded prompt slots, locked splits,
schema-validation gates, and access logs.
\item We formulate 3D moral stability as a perturbation test for preserving
supported values, updating under changed support, and controlling fragility.
\item We report clean held-out mechanism evidence: two 3D scaffold wins
against the incumbent baseline, supported by ETHICS route evidence for
prompt-shape discovery, selector-gap diagnosis, and scaffold freezing.
\item We analyze selected prompt-scaffold artifacts as bounded
crystallizations of searched-for moral-attention operations: threshold
calibration, support preservation, no-import discipline, and parseable
judgment.
\item We connect auditable prompt-shapes to governance: institutions can show
what they tried to make a system notice, what it cost, how it
was tested, and what residual failures remained.
\end{enumerate}
\section{Related Work}
\paragraph{Epiplexity, MDL, and residual accounting.}
Epiplexity distinguishes structure that a bounded observer can extract from
residual uncertainty that remains outside its effective budget
\citep{finzi2026epiplexity}. VAE adapts that idea to alignment interventions:
the artifact has a route-specific production and audit cost, and it leaves
residual behavioral loss. This is related to minimum-description-length
thinking, where model description and residual fit are counted together
\citep{rissanen2004mdl,grunwald2007mdl}.
\paragraph{Automatic prompt optimization.}
Automatic prompt engineering, gradient-style prompt optimization,
LLM-as-optimizer methods, prompt evolution, and iterative self-refinement all
show that prompt text can be searched, selected, and audited rather than
written by hand
\citep{zhou2022ape,pryzant2023apo,yang2023opro,fernando2023promptbreeder,madaan2023selfrefine}.
This paper uses that broad insight, but changes the unit of analysis. The
object searched here is not generic task wording; it is auditable
moral-attention scaffold geometry under frozen-student, locked-split, and
access-log discipline.
\paragraph{Moral and value benchmarks.}
ETHICS provides proxy labels for commonsense moral classification
\citep{hendrycks2020ethics}. The ETHICS track in this paper uses that
benchmark as a static classification probe, while the 3D track adds a
perturbation-based stability test: preserving when support remains intact,
updating when support changes, and resisting unsupported moral-frame shifts.
The empirical target is therefore benchmark-relevant moral attention, with the
labels treated explicitly as proxies.
\paragraph{Robustness, invariance, and counterfactual sensitivity.}
The 3D benchmark is adjacent to behavioral testing, contrast sets, and
counterfactual robustness methods: a system should be invariant to irrelevant
changes and sensitive to changes that alter the target property
\citep{ribeiro2020checklist,gardner2020contrast,kaushik2020counterfactual}.
The difference is the target relation. Here invariance and sensitivity are
moral-attention properties rather than generic accuracy properties. A
scaffold should preserve the same supported value under irrelevant
perturbation but revise the judgment when the support basis is weakened,
removed, or contradicted.
\paragraph{Teacher-student transfer and governance.}
The teacher-student framing is related to distillation and capacity transfer
\citep{hinton2015distilling,romero2015fitnets,touvron2021deit}, but it keeps the
student weights frozen. The teacher proposes candidate artifacts that the
frozen student must execute through prompting. The governance framing draws on
Arendt's analysis of thoughtless failure and Hand's burden-of-precaution logic
as analogies for reasonable, auditable precautions
\citep{arendt1963eichmann,carrolltowing1947}. The artifact discipline also
connects to auditability work such as datasheets, model cards, and internal
audit frameworks \citep{gebru2021datasheets,mitchell2019modelcards,raji2020closing}:
the paper's distinctive contribution is to make the prompt-shape itself a
governance object with route cost, residual behavior, and lineage.
\section{Value-Aligned Epiplexity}
\begin{definition}[Value-Aligned Epiplexity]
Let $r$ be an intervention route and let $A \in \cA_r$ be a permitted search or
production procedure that outputs a candidate value-relevant artifact $a_A$.
Let $C_r(A,a_A)$ denote the route-specific cost of producing, selecting,
auditing, and deploying that artifact. Let $\ell_D(M \mid a_A)$ denote the
residual behavioral loss of bounded model $M$ on distribution or split $D$
after conditioning on the artifact. Then:
\begin{equation}
\VAE_r(D;M)
=
\inf_{A \in \cA_r}
\left[
C_r(A,a_A) + \ell_D(M \mid a_A)
\right].
\label{eq:vae}
\end{equation}
\end{definition}
Equation~\ref{eq:vae} is a route-specific accounting objective. This paper
reports a qualitative prompt-only VAE ledger rather than a single scalar
estimate: cost proxies, printable artifact content, and residual behavior are
placed side by side. Cost is logged through teacher/search/audit proxies;
residual behavior is measured by ETHICS accuracy and 3D stability metrics.
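\paragraph{Illustrative reading of Equation~\ref{eq:vae}.}
As a minimal sketch of how the objective would be read if cost and residual
were expressed in a common unit (an assumption made only for this example; the
ledger reported here is qualitative and does not scalarize cost), consider a
hypothetical route $r$ with exactly two permitted procedures, $A_1$ and $A_2$.
The numbers below are invented for illustration and are not experimental
results.
\begin{align*}
C_r(A_1,a_{A_1}) + \ell_D(M \mid a_{A_1}) &= 0.10 + 0.30 = 0.40, \\
C_r(A_2,a_{A_2}) + \ell_D(M \mid a_{A_2}) &= 0.25 + 0.10 = 0.35.
\end{align*}
Under these assumed values, $\VAE_r(D;M) = \min\{0.40,\,0.35\} = 0.35$, attained
by $A_2$. The costlier procedure is preferred because the residual it removes
($0.30 - 0.10 = 0.20$) exceeds the extra cost it imposes
($0.25 - 0.10 = 0.15$). If the extra cost had exceeded the residual reduction,
the cheaper procedure $A_1$ would attain the infimum instead.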
\begin{table}[H]
\centering
\small
\begin{tabularx}{\linewidth}{>{\raggedright\arraybackslash}p{0.22\linewidth}>{\raggedright\arraybackslash}X>{\raggedright\arraybackslash}X}
\toprule
VAE component & Prompt-only realization & Completed artifact status \\
\midrule
Route $r$ & Teacher-guided prompt scaffold search for a frozen student. & Fixed by protocol. \\
Cost $C_r$ & Teacher/search burden, family count, selector evaluations, validation, prompt length, audit packet, access logs. & Logged qualitatively through saved configs, reports, and evidence records; not scalarized into one comparable number. \\
Artifact $a_A$ & Support-state, no-import, support-basis, and operation-tag scaffold artifacts. & Selected prompt families and report identifiers are recorded in the evidence package. \\
Residual $\ell_D$ & Remaining moral instability under held-out perturbation metrics. & Reported as salience, sensitivity, valid format, fragility, alignment, WVS salience, and WVS sensitivity. \\
Audit evidence & Split locks, access logs, prompt hashes, report paths, and metric summaries. & Held-out claims cite saved reports, JSON artifacts, and access-log status; dev-only rows are kept out of primary claims. \\
\bottomrule
\end{tabularx}
\caption{Prompt-only VAE ledger. The table connects the governance question
``what could the pipeline reasonably make the model notice?'' to the technical
objects reported here: route cost proxies, printable scaffold artifacts,
residual behavior, and audit evidence.}
\label{tab:vae-ledger}
\end{table}
\paragraph{Scope of VAE.}
VAE denotes Value-Aligned Epiplexity, a route-accounting framework distinct from
variational autoencoding. It asks how a specified route exposes value-relevant
structure to a bounded model, what the route costs, and what residual failure
remains.
\subsection{Why Support-State Scaffolds Are VAE Artifacts}
A support-state scaffold is a candidate VAE artifact because it changes what
the frozen student can use without changing weights. It specifies four
operations:
\begin{enumerate}[leftmargin=1.4em]
\item what must be preserved when the support basis remains intact;
\item what must update when support is weakened, removed, or contradicted;
\item what moral frames or duties must not be imported without support; and
\item what output structure must remain parseable for audit.
\end{enumerate}
The metric map is direct. Salience measures whether value-relevant structure
remains accessible. Sensitivity measures whether changed support triggers an
update. Fragility measures residual instability under irrelevant perturbation.
Valid format measures auditability and parseability. Alignment measures
benchmark-target residual agreement. WVS salience and WVS sensitivity stress a
World Values Survey social-trust/core-values slice where superficial
preservation and support-relevant updating are especially easy to confuse.
\begin{claim}[Inferential reading of a held-out scaffold win]
If a frozen student improves on held-out perturbation metrics under a scaffold
selected without final-test access, the strongest explanation is not weight
learning or answer leakage. It is that the scaffold made value-relevant
structure more usable to the bounded student.
\end{claim}
This is a bounded mechanism inference. The clean wins, WVS sensitivity gains,
and prompt lineage point to support-state and no-import reasoning, rather than
local wording polish, as the operative mechanism.
\section{Moral Attention, Phronesis, and 3D Stability}
Moral failure can be a failure of salience. The relevant fact can be present in
the input while remaining unavailable to the model's operative judgment, or it
can be displaced by pressure, paraphrase, or an imported moral frame. Prompting
allocates attention by specifying which features remain live at judgment time.
Phronesis is the capacity to perceive what matters in a particular case. A
prompt scaffold does not supply Aristotelian virtue; it supplies a bounded
computational analogue of one perceptual component: tracking which facts
preserve a basis of judgment and which deltas change it.
\begin{table}[H]
\centering
\small
\begin{tabularx}{\linewidth}{>{\raggedright\arraybackslash}p{0.28\linewidth}>{\raggedright\arraybackslash}X>{\raggedright\arraybackslash}p{0.23\linewidth}}
\toprule
3D condition & Required behavior & Metric view \\
\midrule
Support intact & Preserve the same supported value. & Salience / fragility \\
Support changed & Update the judgment or score. & Sensitivity / WVS sensitivity \\
Irrelevant pressure or wording change & Resist overreaction. & Fragility \\
Unsupported imported frame & Reject the imported duty, motive, or generic criterion. & Alignment / WVS salience \\
Format requirement & Keep parseable output. & Valid format \\
\bottomrule
\end{tabularx}
\caption{\textbf{Conceptual table.} 3D moral stability operationalizes moral
attention as conditional invariance: preserve the right basis, update when that
basis changes, resist irrelevant pressure, reject unsupported imports, and
remain parseable for audit. Table~\ref{tab:metric-map} maps these duties to
metrics; Appendix~\ref{app:3d-metric-definitions} gives operational details.}
\label{tab:3d-stability-logic}
\end{table}
Table~\ref{tab:3d-stability-logic} is the bridge from philosophy to evaluation.
Classification asks for one judgment on one input; 3D stability asks whether
the model can keep the morally relevant basis fixed under irrelevant
perturbation and revise it under support-relevant factual deltas.
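\paragraph{A possible formalization.}
The conditional-invariance reading can be sketched as follows. The notation is
introduced here only for illustration and is not the benchmark's operational
definition, which is given in Appendix~\ref{app:3d-metric-definitions}. Let
$f_a(x)$ denote the frozen student's judgment on item $x$ under scaffold $a$,
let $s(x)$ denote the stated support basis of $x$, and let $\pi$ be a
perturbation of the item. In the simplest reading, the first two 3D demands are
\begin{align*}
s(\pi(x)) = s(x) &\implies f_a(\pi(x)) = f_a(x) && \text{(preserve)}, \\
s(\pi(x)) \neq s(x) &\implies f_a(\pi(x)) \neq f_a(x) && \text{(update)}.
\end{align*}
Resistance to irrelevant pressure or wording change is the first implication
applied to perturbations that leave $s$ untouched. The no-import demand adds
that $f_a$ should not depend on duties, motives, or frames that are absent from
$s(x)$. The second implication is deliberately coarse: it says only that the
judgment moves, not how far or in which direction, which would require a
target label for the perturbed item.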
\subsection{Moral Attention Across the Pipeline}
Value alignment is partly a scarce-attention allocation problem. Training,
post-training, preference modeling, RLHF, data curation, and fine-tuning shape
which features a model tends to treat as salient before any particular prompt
is seen: harm, deception, fairness, trust, justification, conflict, pressure,
reasonable response, and related moral cues.
Inference-time systems allocate attention differently. System prompts,
retrieved context, memory compaction, agent harnesses, output schemas, and
evaluation gates determine which morally relevant features remain live at the
moment of judgment. Prompt-shape scaffolds are a minimal, inspectable
intervention into inference-time moral attention: they state what the frozen
model must preserve, revise, or reject.
Long context does not eliminate the problem. A model can still compress,
ignore, misprioritize, or fail to operationalize value-relevant information.
The alignment question is whether the relevant structure remains usable at
judgment time.
\section{Teacher-Guided Scaffold-Family Search}
The protocol turns prompt-shape discovery into an auditable precaution rather
than an informal prompt-tuning exercise. The student is frozen; the mutable
object is a bounded prompt slot. The teacher audits development failures,
proposes scaffold families, and refines within those families. Representatives
are frozen before selector-dev scoring. Final-test access is locked and
unlocked only under a prespecified gate. Table~\ref{tab:protocol} gives the
audit sequence.
The separation shown in Figure~\ref{fig:teacher-scaffold-family-protocol}
matters because the legal and ethical force of a held-out win depends on keeping
search, selection, final-test access, and audit evidence distinct.
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{figures/teacher_scaffold_family_protocol.png}
\caption{Teacher-guided scaffold-family search protocol. Development search,
selection, final-test unlock, and audit evidence are separated so that a final
win can be interpreted as a locked evaluation of a frozen scaffold artifact, not
an informal prompt edit.}
\label{fig:teacher-scaffold-family-protocol}
\end{figure}
\begin{table}[H]
\centering
\small
\begin{tabularx}{\linewidth}{>{\raggedright\arraybackslash}p{0.18\linewidth}>{\raggedright\arraybackslash}X>{\raggedright\arraybackslash}X}
\toprule
Stage & Purpose & Failure mode controlled \\
\midrule
Materialize splits & Fix teacher-dev, selector-dev, and final-test before search. & Leakage and retrospective selection. \\
Evaluate baseline & Establish frozen-student residual behavior. & Unclear comparison target. \\
Audit teacher-dev failures & Identify support-state, fragility, no-import, and schema failures. & Cosmetic prompt edits. \\
Propose scaffold families & Search over scaffold-family structure, not only wording. & Local prompt polish. \\
Freeze representatives & Stop adaptation before selector-dev scoring. & Overfitting to selector-dev. \\
Run selector-dev tournament & Compare frozen representatives on the selector split. & Informal winner selection. \\
Apply gate check & Require format, stability, and improvement criteria before final. & Premature held-out launch. \\
Unlock locked final-test once & Score only when gate discipline justifies it. & Final-test leakage. \\
Report access logs and lineage & Preserve prompts, configs, reports, metrics, final-test access logs, and artifact lineage. & Unsupported claims. \\
\bottomrule
\end{tabularx}
\caption{Teacher-guided scaffold-family search protocol. Each stage controls a
specific failure mode in the route from teacher failure audit to frozen
artifact, locked evaluation, and evidence suitable for audit.}
\label{tab:protocol}
\end{table}
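\paragraph{Gate check as a predicate.}
One way to read the gate-check stage of Table~\ref{tab:protocol} is as a
conjunction evaluated on the selector-dev split for a frozen representative $a$
against the incumbent $a_0$. This is a sketch under assumed notation: the
functions $\mathrm{fmt}$, $\mathrm{frag}$, and $\mathrm{score}$ and the
thresholds $\tau_{\mathrm{fmt}}$ and $\epsilon$ are hypothetical placeholders,
not the criteria recorded in the protocol configs.
\begin{equation*}
G(a) =
\mathbf{1}\bigl[\mathrm{fmt}(a) \ge \tau_{\mathrm{fmt}}\bigr]
\cdot
\mathbf{1}\bigl[\mathrm{frag}(a) \le \mathrm{frag}(a_0) + \epsilon\bigr]
\cdot
\mathbf{1}\bigl[\mathrm{score}(a) > \mathrm{score}(a_0)\bigr].
\end{equation*}
On this reading, final-test is unlocked only when $G(a) = 1$, that is, when the
format, stability, and improvement criteria all hold at once. A candidate that
improves the selector score while breaking parseability or raising fragility
beyond tolerance is blocked before final and recorded as a blocked row.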
\subsection{Search Ledger and Anti-Cherry-Picking Discipline}
The scientific object is not a single winning sentence. It is the scaffold
family search route plus its artifact boundary: failed candidates, diagnostic
rows, blocked rows, clean wins, split access, and residual failure modes.
Failure rows are part of the evidence because they measure search cost. Mixed
rows document the residual frontier; blocked-before-final rows document
protocol discipline; dev-only rows document hypothesis generation; clean
held-out wins document claim-bearing scaffold executability. This audit trail
is the anti-cherry-picking discipline.
ETHICS records a compact selector-gap route in which frozen prompt arms are
compared against continued adaptation. 3D records support-state,
support-basis, no-import, and WVS scaffold families under selector gates and
locked held-out access. Candidate counts, prompts, reports, and access logs are
kept in the artifact manifest.
The protocol is designed to make three objects jointly inspectable: the search
route, the prompt artifact, and the residual behavior. A clean held-out win
shows that a specified scaffold family made a specified value-relevant
operation executable by a frozen bounded model under locked evaluation.
\section{Prompt Scaffolds as Auditable Moral-Attention Artifacts}
Selected prompt scaffolds are the paper's inspected artifacts. They crystallize
searched-for operations of moral attention by compressing practical
distinctions into executable text. Their evidential force comes from artifact
content, search provenance, and residual behavior under locked evaluation.
\paragraph{Scaffold names.}
The manuscript uses short scaffold names: ETHICS threshold scaffold, 2801
support-state scaffold, and 4523 no-import scaffold. Boundary rows use seed,
operation, and evidence role. Exact artifact identifiers remain in the
artifact manifest.
\begin{table}[H]
\centering
\small
\begin{tabularx}{\linewidth}{>{\raggedright\arraybackslash}p{0.20\linewidth}>{\raggedright\arraybackslash}X>{\raggedright\arraybackslash}p{0.22\linewidth}>{\raggedright\arraybackslash}p{0.24\linewidth}}
\toprule
Artifact & Short prompt cue & Moral operation & Main empirical role \\
\midrule
ETHICS threshold scaffold & ``Consider the full context provided. Is the action generally considered acceptable by a typical person?'' & Threshold calibration for ordinary commonsense acceptability. & Mechanism probe: the ETHICS threshold scaffold wins held-out checkpoint accuracy despite losing selector-dev, exposing the selector gap. \\
2801 support-state scaffold & ``preserve the same supported value \ldots Update your judgment only if a changed fact weakens, removes, or contradicts that support.'' & Preserve the same support basis; update only when support is weakened, removed, or contradicted. & Main 3D clean win: makes support-state preservation/update executable against the incumbent baseline. \\
4523 no-import scaffold & ``Do not replace the item's named moral criterion with a generic frame, and do not import new duties or motives.'' & Lock the stated criterion; reject unsupported imported duties, motives, or generic frames. & Main 3D clean win: makes no-import criterion discipline executable against the incumbent baseline. \\
\bottomrule
\end{tabularx}
\caption{Core prompt scaffolds as auditable moral-attention artifacts. The table
shows what each prompt makes the student notice, which moral operation it
externalizes, and how that artifact supports the paper's bounded claim. Full
prompt text and boundary artifacts are in Appendix~\ref{app:full-prompts}.}
\label{tab:core-prompt-artifacts}
\end{table}
\subsection{Threshold Calibration in Static Classification}
In binary value classification, effective scaffolds need not add more moral
principles. They can improve behavior by making the operative threshold of
judgment explicit. The ETHICS scaffold anchors the student in ordinary
commonsense morality, full-context assessment, and typical acceptability. That
structure controls two errors at once: over-moralizing ordinary conflict or
reasonable response, and under-detecting clear moral violation. The more
explicit malice/deception/unjustified-harm threshold family is preserved in
Appendix~\ref{app:full-prompts} as artifacted comparator language rather than
as the checkpoint winner.
\subsection{The Shared Structure of the Two Clean 3D Prompts}
The two clean 3D prompts share a deeper architecture. They do not add a new
moral doctrine or ask the student to become more cautious. They give the
student an attention procedure: identify the stated support basis, distinguish
intact support from changed support, reject unsupported frame import, and keep
the answer auditable.
The 2801 support-state scaffold emphasizes perturbational discipline: preserve
the supported value when morally relevant facts and context are unchanged, and
revise only when a changed fact weakens, removes, or contradicts the support.
The 4523 no-import scaffold emphasizes frame discipline: judge from the stated
facts, preserve the named criterion, and refuse new duties, motives, or generic
frames not supported by the item. Together they define moral stability as
fidelity to the right basis of judgment, not mere answer repetition.
This is the bounded phronetic content of the clean prompts. Phronesis is not
rule-following and the prompts do not confer virtue. But practical wisdom has a
perceptual component: seeing which particulars matter. The clean scaffolds
operationalize that component by making the support basis, update condition,
no-import rule, and parseable form explicit enough for a frozen weak student to
execute.
In VAE terms, the prompt is the visible route artifact. Teacher-guided search
pays the discovery cost; the scaffold encodes the discovered moral-attention
structure; held-out 3D metrics measure the residual instability after the
student conditions on it. The clean wins therefore show that searched-for
moral-attention structure can be externalized into text and executed more
reliably by a bounded frozen model. Appendix~\ref{app:full-prompts} gives exact
prompt texts; source paths remain in the artifact manifest.
\section{Experimental Program}
The experimental sections follow the paper's conceptual chain. VAE is the
accounting framework; the prompt scaffold is the visible alignment artifact;
ETHICS tests prompt-shape search in static classification; 3D tests moral
stability under perturbation; phronesis explains why moral attention depends
on perceiving what is salient; governance asks what an institution can audit.
\subsection{Baselines and Incumbents}
The main comparisons use named prompt arms, not anonymous runs. The incumbent
baseline is the frozen commonsense prompt in the artifact set:
``Evaluate the action
based on ordinary commonsense morality. Does the action cause harm, or is it
unfair, deceptive, or disrespectful? Judge only the information given.'' In
the 3D configs it is both the \texttt{prompt\_rewrite\_start\_arm} and the
\texttt{selector\_alignment\_reference\_arm}. Thus it is the incumbent prompt
that a new scaffold must beat, not a post-hoc baseline chosen after seeing
final-test outcomes.
\begin{center}
\begingroup
\scriptsize
\setlength{\tabcolsep}{2pt}
\renewcommand{\arraystretch}{0.94}
\begin{tabularx}{\linewidth}{>{\raggedright\arraybackslash}p{0.20\linewidth}>{\raggedright\arraybackslash}p{0.13\linewidth}>{\raggedright\arraybackslash}X>{\raggedright\arraybackslash}X>{\raggedright\arraybackslash}p{0.18\linewidth}}
\toprule
Baseline / arm & Track & What it is & Why it is included & Where it appears \\
\midrule
Empty prompt & ETHICS full-adaptation track & No mutable moral instruction beyond the task wrapper. & Lower-bound prompt-only comparator. & Appendix directional background. \\
Researcher fixed & ETHICS and 3D & Human-written commonsense morality prompt in configs. & Tests whether teacher-discovered artifacts beat a reasonable fixed instruction. & ETHICS checkpoint; 3D auxiliary arms. \\
Incumbent baseline & ETHICS and 3D & Frozen incumbent prompt and 3D start/reference arm. & Main incumbent baseline for selected scaffold comparisons. & ETHICS checkpoint; all claim-bearing 3D scorecards. \\
ETHICS threshold scaffold & ETHICS & Frozen prompt produced by teacher refinement and evaluated without further adaptation. & Tests whether a teacher-discovered threshold scaffold can beat the incumbent under held-out evaluation. & ETHICS checkpoint. \\
Continued teacher adaptation & ETHICS & Multi-round teacher adaptation track. & Mechanism control: continued adaptation can trail a frozen representative. & ETHICS checkpoint final evaluation. \\
Best frozen scaffold & ETHICS & The best frozen representative chosen from scaffold-family candidates for a seed. & Tests whether freezing a discovered scaffold family can outperform continued local adaptation. & ETHICS 10-seed tournament and post-selection audit. \\
Selected 3D scaffold & 3D & Teacher-family prompt selected on selector-dev, then frozen for final-test. & Claim-bearing treatment arm. & Seeds 2801 and 4523 clean wins; mixed boundary rows. \\
Capacity-audit arms & ETHICS audit & The same fixed 0.5B-discovered frozen/continued prompt pairs evaluated on 0.5B, 1.5B, and 3B students. & Tests unchanged cross-capacity prompt reuse. & Appendix capacity audit. \\
\bottomrule
\end{tabularx}
\refstepcounter{table}\label{tab:baselines}
\smallskip
\noindent\textbf{Table~\thetable: Baselines and incumbents.} These definitions
fix the comparison arms before results are reported, preventing the baseline
choice from becoming a post-hoc interpretive move.
\endgroup
\end{center}
\subsection{Main Evidence at a Glance}
Table~\ref{tab:main-evidence-glance} sets the paper's evidentiary discipline.
It separates the two claim-bearing 3D wins from supporting route evidence,
audits, boundary rows, dev-only rows, and blocked-before-final rows, so the
philosophical and governance claims rest on the right empirical layer.
\begin{center}
\begingroup
\small
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.03}
\begin{tabularx}{\linewidth}{>{\raggedright\arraybackslash}p{0.22\linewidth}>{\raggedright\arraybackslash}p{0.25\linewidth}>{\raggedright\arraybackslash}X>{\raggedright\arraybackslash}p{0.22\linewidth}}
\toprule
Evidence class & Included artifacts & Permitted inference & Main-text role \\
\midrule
Claim-bearing 3D wins & Seeds 2801 and 4523. & Specific support-state and no-import operations can be made executable by prompt scaffolds under locked held-out perturbation evaluation. & Primary proof-of-possibility evidence. \\
Supporting ETHICS route evidence & Checkpoint selector-gap result, 10-seed scaffold tournament, plus appendix audit/capacity context. & Prompt-shape discovery moves static classification, exposes selector/final gaps, and supports scaffold freezing as a prompt-only VAE route. & Supporting route evidence. \\
Boundary / mixed rows & 3D seeds 3001, 3109, 4627, 4703, 4909, 8563 and ETHICS confirmation-boundary rows. & Partial scaffold success maps the residual frontier: WVS sensitivity, selector transfer, and salience--fragility control. & Diagnostic frontier evidence. \\
Audit context & ETHICS fixed-artifact checks and 3D seed 2407. & Selected artifacts retain useful robustness context after selection, without becoming independent discovery claims. & Audit context. \\
Dev-only rows & 3203, 3503, and later repair/frontier cycles. & These rows generate scaffold-family hypotheses under locked final-test discipline. & Search-frontier evidence only. \\
Blocked-before-final rows & Seed 2903 and no-launch rows. & Final-test access is withheld when gates fail, so no held-out metrics are claimed. & Protocol-discipline evidence. \\
\bottomrule
\end{tabularx}
\refstepcounter{table}\label{tab:main-evidence-glance}
\smallskip
\noindent\textbf{Table~\thetable: Main evidence at a glance.} This dashboard
states which inferences each evidence class can support; detailed registries
remain in the evidence package.
\endgroup
\end{center}
\subsection{Experiment Track I: ETHICS Supporting Route Evidence}
ETHICS supplies route evidence: prompt-shape discovery moves static
classification, selector-dev can mis-rank final-test quality, and frozen
scaffold representatives can preserve useful prompt-shape families. The main
claim-bearing evidence remains the two 3D clean wins. The appendix keeps the
ETHICS tables needed for that supporting role.
Two ETHICS facts matter in the main flow. First, selector-dev prefers the
incumbent baseline, but the ETHICS threshold scaffold wins locked final-test.
Second, the 10-seed tournament supports scaffold freezing over continued local
adaptation. Post-selection audit and capacity sensitivity remain appendix
context.
\subsection{Experiment Track II: 3D Claim-Bearing Moral-Stability Benchmark}
The 3D moral-stability benchmark is the paper's main locked perturbation
test. The frozen student receives either the incumbent baseline or a selected
scaffold and must answer both original and
perturbed cases. The benchmark checks preservation, update, resistance to
irrelevant perturbation, rejection of unsupported frames, and parseability.
For seeds 2801 and 4523, the split summaries record 15 item families and 135
rows total. Each final-test split contains 3 held-out item families and 27 rows
with one held-out family each from MFQ, WVS, and dilemmas; selector-dev and
teacher-dev each contain 6 families and 54 rows. The final-test split contains
3 fact-changing sensitivity-control rows. Aggregate 3D metrics are primarily
family-level means over item families; the saved \texttt{sensitivity\_control}
summary also reports a row-weighted view for fact-changing rows.
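As a consistency check on these counts, and assuming each item family
contributes the same number of rows (an inference from the totals; the saved
split summaries remain authoritative), the reported figures imply nine rows per
family:
\[
\frac{135}{15} = \frac{27}{3} = \frac{54}{6} = 9, \qquad
3 + 6 + 6 = 15, \qquad 27 + 54 + 54 = 135 .
\]
The two aggregation views can be sketched with illustrative notation that is
not part of the saved summaries. Let \(F\) be the set of item families,
\(R_f\) the scored rows of family \(f\), and \(m_i\) a per-row score, and assume
for illustration that the metric is a simple average of row scores:
\[
\bar{m}_{\mathrm{family}} = \frac{1}{|F|} \sum_{f \in F} \frac{1}{|R_f|}
\sum_{i \in R_f} m_i ,
\qquad
\bar{m}_{\mathrm{row}} = \frac{\sum_{f \in F} \sum_{i \in R_f} m_i}
{\sum_{f \in F} |R_f|} .
\]
The two views coincide when every family contributes the same number of scored
rows and can diverge otherwise; Appendix~\ref{app:3d-metric-definitions} gives
the operational definitions.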
\subsubsection{A Running Example of 3D Moral Stability}
Consider a toy support-basis example. The original
case says that a person keeps a promise to help a friend in a way that supports
trust and wellbeing. A support-intact perturbation changes tone, wording, or
social pressure but leaves the promise and help intact. A support-changing
perturbation removes the promised help or reveals deception. An imported-frame
perturbation tempts the model to introduce an unsupported duty or motive that
the facts do not state. Correct behavior is conditional stability: preserve
when support remains, update when support is weakened or contradicted, reject
the imported frame, and keep the answer parseable.
The completed 3D results show that support-state and no-import scaffolds can
reduce residual moral instability without changing student weights. Seeds 2801
and 4523 are clean held-out wins: each beats
the incumbent baseline on salience, sensitivity, fragility, alignment, WVS
salience, and WVS sensitivity, while tying valid format at 1.000. The remaining
held-out rows identify selector transfer, WVS sensitivity, and
salience--fragility control as the residual frontier.
\begin{table}[H]
\centering
\small
\begin{tabularx}{\linewidth}{>{\raggedright\arraybackslash}p{0.18\linewidth}>{\raggedright\arraybackslash}X>{\raggedright\arraybackslash}X>{\raggedright\arraybackslash}X}
\toprule
Metric & Direction & VAE interpretation & Experimental use \\
\midrule
Salience & Higher better & Relevant structure remains accessible. & Main preservation signal. \\
Sensitivity & Higher better & Changed support triggers an update. & Main update signal. \\
Valid format & Higher better & Output remains parseable for audit. & Schema-discipline gate. \\
Fragility & Lower better & Residual instability under irrelevant perturbation. & No-overreaction signal. \\
Alignment & Higher better & Benchmark-target residual agreement. & Residual task agreement. \\
WVS salience & Higher better & Hard social-trust preservation slice. & Stress-test preservation signal. \\
WVS sensitivity & Higher better & Hard social-trust update slice. & Stress-test update signal. \\
\bottomrule
\end{tabularx}
\caption{3D metrics as residual VAE views. Each metric records a remaining way
the student can fail to notice, preserve, update, resist, or report the relevant
moral structure after conditioning on a scaffold. Fragility is lower-is-better;
Appendix~\ref{app:3d-metric-definitions} gives operational definitions.}
\label{tab:metric-map}
\end{table}
\section{Results: Prompt-Shape Scaffolds Can Make Moral-Attention Operations Executable}
The Results section is organized by evidentiary role rather than chronology.
It begins with the two locked 3D clean wins because they carry the paper's
bounded proof of possibility: specific moral-attention operations can be made
executable by a frozen weak student through auditable prompt scaffolds. ETHICS
then supplies route evidence; failed, mixed, blocked, and dev-only rows document
search cost, protocol discipline, and residual frontier.
\begin{table}[H]
\centering
\scriptsize
\setlength{\tabcolsep}{3.5pt}
\renewcommand{\arraystretch}{1.12}
\resizebox{\linewidth}{!}{%
\begin{tabular}{llccccccc}
\toprule
Seed & Scaffold label & Salience & Sens. & Fragility $\downarrow$ & Alignment & \shortstack{WVS\\sal.} & \shortstack{WVS\\sens.} & \shortstack{Valid\\format} \\
\midrule
\textbf{2801} & Support-state scaffold & \textbf{0.980}/0.923 & \textbf{1.000}/0.333 & \textbf{0.127}/0.262 & \textbf{0.668}/0.572 & \textbf{0.939}/0.769 & \textbf{1.000}/0.000 & \textbf{1.000}/\textbf{1.000} \\
\textbf{4523} & No-import scaffold & \textbf{0.914}/0.910 & \textbf{0.667}/0.333 & \textbf{0.000}/0.167 & \textbf{0.768}/0.676 & \textbf{0.741}/0.731 & \textbf{1.000}/0.000 & \textbf{1.000}/\textbf{1.000} \\
\bottomrule
\end{tabular}%
}
\caption{Main claim-bearing 3D clean wins. Each cell reports selected scaffold /
incumbent baseline; bold marks the better value, with tied values bolded in
both cells. These two locked held-out rows are the empirical anchor for the
claim that support-state and no-import operations can be externalized into
auditable scaffolds and executed by a frozen weak student. Fragility is
lower-is-better.}
\label{tab:clean-win-mechanisms}
\end{table}
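The margins in Table~\ref{tab:clean-win-mechanisms} can also be read as
differences. The following sketch computes selected scaffold minus incumbent
baseline from the rounded three-decimal entries, so each difference may deviate
by about 0.001 from the exact saved values. For fragility, a negative
difference is the improvement.
\[
\begin{array}{lrrrrrrr}
\mathrm{Seed} & \mathrm{Sal.} & \mathrm{Sens.} & \mathrm{Frag.} &
\mathrm{Align.} & \mathrm{WVS\ sal.} & \mathrm{WVS\ sens.} & \mathrm{Valid} \\
2801 & +0.057 & +0.667 & -0.135 & +0.096 & +0.170 & +1.000 & 0.000 \\
4523 & +0.004 & +0.334 & -0.167 & +0.092 & +0.010 & +1.000 & 0.000
\end{array}
\]
Every substantive difference points in the favorable direction for both seeds,
although the margins differ considerably in size across metrics.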
Figure~\ref{fig:heldout-3d-results-by-seed} visualizes the same claim beside
its boundary context: the clean wins are visible, but mixed and audit rows are
not converted into aggregate proof.
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{figures/heldout_3d_results_by_seed.png}
\caption{Real-data 3D held-out result against the incumbent baseline. Panel A
compares selected-scaffold salience with the incumbent baseline; Panel B counts
selected wins, ties, and baseline wins across the seven 3D metrics. The 2801
support-state scaffold and 4523 no-import scaffold are the clean,
access-log-verified held-out wins; other rows remain boundary or audit context, not
additional claim-bearing wins.}
\label{fig:heldout-3d-results-by-seed}
\end{figure}
\subsection{Main Proof-of-Possibility Evidence: Two Locked Held-Out 3D Clean Wins}
Table~\ref{tab:clean-win-mechanisms} puts the primary evidence first. Seed
2801 is a mechanism demonstration. The support-state scaffold makes the
student distinguish same-basis preservation from support-changing update:
preserve when morally relevant facts and context are unchanged; update when a
changed fact weakens, removes, or contradicts support. Under
access-log-verified held-out evaluation, the scaffold beats the incumbent
baseline on salience, sensitivity, fragility, alignment, WVS salience, and WVS
sensitivity, while tying valid format at 1.000.
Seed 4523 is a second, distinct mechanism demonstration: the no-import
scaffold makes the student resist unsupported moral-frame substitution by
locking the item's stated criterion/value basis and blocking generic
moral-frame import. Under access-log-verified held-out evaluation, it shows the
same clean pattern: six substantive metric wins and a valid-format tie at
1.000.
Together, these rows establish the existence proof: two distinct
moral-attention operations can be externalized into prompt scaffolds that a
frozen weak student executes more reliably under locked perturbation evaluation.
\subsection{Why the Clean Wins Are Interpretable Artifacts}
The clean wins are interpretable because each follows the chain
artifact \(\rightarrow\) operation \(\rightarrow\) metric \(\rightarrow\)
locked held-out evidence. The seed 2801 scaffold prints a support-state
operation: it tells the student to preserve when the same support basis
remains and to update only when a changed fact weakens, removes, or contradicts
that basis. The six-metric held-out win shows that same-basis versus
changed-basis tracking became more executable.
The seed 4523 scaffold prints a different moral-attention operation: it locks
the item's named criterion and instructs the student not to replace it with a
generic moral frame or import unsupported duties and motives. Its six-metric
held-out win shows that no-import discipline and criterion tracking became more
executable.
Appendix~\ref{app:3d-boundary} retains the boundary scorecard.
\subsection{Search-Cost Ledger and Anti-Cherry-Picking Logic}
Failed candidates, mixed rows, blocked-before-final rows, and dev-only rows
document the cost of finding a scaffold that survives locked evaluation. Two
scaffolds survived the strict clean-win gate: the 2801 support-state scaffold
and the 4523 no-import scaffold.
The compact evidence dashboard in Table~\ref{tab:main-evidence-glance} is the
only main-text taxonomy. Appendix~\ref{app:3d-search-scope} gives the short
3D-only scope note.
\subsection{Supporting ETHICS Route Evidence}
ETHICS supports the route mechanism. The checkpoint result shows the selector
gap: selector-dev preferred
the incumbent baseline, while the ETHICS threshold scaffold won locked
final-test accuracy against it (0.563 vs. 0.531). The completed 10-seed
scaffold tournament supports scaffold freezing: best frozen scaffolds win 6,
lose 2, and tie 2 against continued adaptation, with mean
frozen-minus-continued advantage +0.044. Post-selection audit and capacity
sensitivity remain appendix context.
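The headline ETHICS figures reduce to simple arithmetic. Using the rounded
three-decimal accuracies reported above (so the gap is itself approximate):
\[
0.563 - 0.531 = 0.032, \qquad 6 + 2 + 2 = 10, \qquad
\frac{6}{10} = 0.6, \qquad \frac{6}{6 + 2} = 0.75 .
\]
The frozen scaffold therefore wins on 6 of 10 seeds overall and on 6 of the 8
seeds that do not tie. The mean frozen-minus-continued advantage of \(+0.044\)
is a separate per-seed average and cannot be recovered from the win, loss, and
tie counts alone.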
\subsection{Residual Frontier}
Mixed rows are not pooled into the claim-bearing inference. They localize the
residual frontier: WVS sensitivity, selector transfer, simultaneous
salience--fragility control, and capacity-specific transfer.
\subsection{Implications for VAE, Phronesis, and Governance}
The empirical lesson is sharper than ``prompting helped.'' The clean wins link
route, artifact, operation, metric, and held-out evidence: seed 2801 makes
support-state preservation/update executable, while seed 4523 makes no-import
discipline executable.
The phronesis claim is bounded: the scaffold externalizes a perceptual
component of practical judgment. Mixed rows identify which operations remain
hard to stabilize together: WVS sensitivity, selector-transfer reliability,
and salience--fragility control.
The governance implication follows from the same structure: a pipeline can
search for a value-relevant operation, freeze the prompt scaffold, test it
under locked access discipline, and report residual failures.
The reusable evaluation pattern is: report the search route, freeze the
artifact, audit held-out access, and place cost proxies beside residual
behavior. VAE supplies the ledger; phronesis supplies the target operation:
making what matters available at judgment time.
\section{Discussion: Route Extensions and Mechanism Controls}
\subsection{Design Principles from the ETHICS Prompt-Only Route}
The ETHICS route yields seven design rules: search scaffold families before
polishing one string; evaluate prompt paradigms, not surface strings; freeze
useful intermediates; treat selector-dev as noisy evidence; log failures and
final-test access; concentrate threshold, support, no-import, and audit cues;
and compare routes by cost plus residual error. The search history strengthens the
argument because failed, mixed, and blocked rows record the cost of discovering
the two clean 3D wins.
\subsection{Prompt Search as Interpretable External Optimization}
Prompt search is functionally analogous to optimization over parameters, but
the optimized artifact is external and inspectable. Traditional training
searches parameter space to reduce loss. Teacher-guided prompt search explores
scaffold space to reduce residual moral-instability loss. It faces the same
scientific hazards as other optimization routes: regressions, selector
overfitting, search cost, and transfer boundaries. Unlike weight updates,
the artifact can be read, frozen, reproduced, and audited.
Figure~\ref{fig:prompt-shape-search-landscape} is a schematic of that
route-comparison point. It does not measure a loss surface; it shows why VAE
treats prompt search as a route with cost, artifact content, and residual
behavior.
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{figures/prompt_shape_search_landscape_current.png}
\caption{Conceptual prompt-shape search landscape. This schematic is not a
measured loss surface. It distinguishes local wording polish around the
incumbent baseline from scaffold-family search over moral-attention operations:
the 2801 support-state scaffold and the 4523 no-import scaffold. Frozen family
representatives enter selector-dev before locked final-test access; the VAE
ledger records route cost, printable artifact, and residual behavior.}
\label{fig:prompt-shape-search-landscape}
\end{figure}
Concretely, local rewriting around the incumbent baseline can produce many
nearby instructions without changing the moral-attention operation that the
student actually executes. Teacher-guided scaffold-family search changes the
candidate operation itself: support-state scaffolds preserve and update the
same support basis; no-import scaffolds block unsupported frame
substitution; boundary families reveal residual failure. Once a useful family
representative appears, the protocol freezes it, compares it in a selector-dev
tournament, and unlocks final-test only under access-log discipline. In VAE
terms, the output is not just a score but a ledger: route cost, printed prompt
artifact, and residual held-out behavior.
A bounded prompt slot can serve as a search space for moral-attention
structure. The selected prompt is a route-specific artifact; its value is the
frozen model's residual behavior after conditioning.
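The contrast can be written as an idealized sketch. The notation is introduced
here for illustration only and is an assumption, not the paper's formal VAE
definition: \(\theta\) denotes trainable weights, \(\theta_0\) the frozen
student's weights, \(\mathcal{S}\) the bounded prompt slot, and
\(R_{\mathrm{dev}}\) and \(R_{\mathrm{final}}\) the residual instability
measured on selector-dev and final-test.
\[
\theta^\star = \arg\min_{\theta} \mathcal{L}(\theta)
\qquad \mathrm{versus} \qquad
s^\star = \arg\min_{s \in \mathcal{S}} R_{\mathrm{dev}}(\theta_0, s) .
\]
The reported quantity is \(R_{\mathrm{final}}(\theta_0, s^\star)\), not the
selector-dev value used for selection, so selector overfitting appears as a gap
between the two. Actual teacher-guided search is heuristic and does not
guarantee an exact minimizer; the \(\arg\min\) is shorthand for the best
candidate found under the search budget.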
\subsection{Toward Prompt-Parameter Route Comparison}
The prompt-only route supplies one side of a route-comparison ledger: artifact
content, search/audit cost proxies, residual metrics, and auditability. A
route-comparison study can ask whether a prompt scaffold, LoRA adapter, GRPO
policy update, supervised fine-tune, retrieval policy, or hybrid route reaches
the same residual target at comparable cost. This paper does not claim
equivalence to weight-update routes.
The engineering implication is route choice: some bounded alignment gains may
come from prompts, retrieval rules, prefix policies, or scaffolds rather than
full post-training. The economic and governance implication is the same:
compare burden, residual risk, and auditability.
\subsection{Human Expert Prompts as Cost-Saving Initialization}
Human experts and automated scaffold search are complementary. Expert-authored
prompts can act as search initializers: they encode domain knowledge,
philosophical distinctions, legal caution, or evaluation experience before the
teacher search begins. Teacher-guided search then verifies, refines, and
stress-tests those scaffolds under locked-split discipline.
Under VAE, expert prompt labor is valuable if it reduces the route cost needed
to reach the same residual moral-instability loss.
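A minimal way to state this condition, with hypothetical symbols that are not
defined elsewhere in the paper: let \(\rho\) be a target residual, let
\(C_{\mathrm{search}}(\rho)\) be the route cost of reaching \(\rho\) from the
incumbent start, and let \(C_{\mathrm{expert}}\) be the cost of expert
authoring, after which search costs
\(C_{\mathrm{search \mid expert}}(\rho)\). Expert initialization pays off when
\[
C_{\mathrm{expert}} + C_{\mathrm{search \mid expert}}(\rho)
< C_{\mathrm{search}}(\rho) .
\]
The comparison holds the residual target fixed, so a cheaper route that lands
at a worse residual does not count as a saving.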
\subsection{Scale Thresholds and Capability Tradeoffs}
Scaffold effects depend on model scale. Very small models may be too unstable
to execute support-state or no-import scaffolds; larger models may require
different prompt artifacts. Route comparisons should test scale under identical
locked-split discipline.
The appendix capacity audit shows why this route comparison must be
student-specific: unchanged prompt reuse across larger students can reverse the
direction of the fixed-artifact comparison. A prompt artifact has value
relative to the bounded model and route that can execute it.
The second route question is tradeoff: whether moral calibration degrades or
improves general reasoning. Route comparisons should pair moral-stability
benchmarks with general benchmarks such as MMLU, GSM8K, ARC, BBH, TruthfulQA,
or small-model-appropriate alternatives.
\subsection{Route Comparisons and Mechanism Controls}
Adapter, LoRA, GRPO, and hybrid routes are comparison surfaces in this VAE
ledger. They can be evaluated against prompt-only scaffolds by the same
accounting structure: route cost,
intervention artifact, residual behavior, split discipline, and auditability.
Route-control extensions should include length-matched generic prompts,
format-only prompts, generic morality prompts, and operation-isolation
ablations such as no-import-only and support-update-only variants. These are
mechanism-decomposition controls; the present claims remain tied to artifacted
held-out rows.
\section{Governance Implications: Auditable Moral Attention}
VAE gives auditable moral attention a governance form. If a foreseeable
failure-to-notice exists, the institutional question is what intervention could
make the relevant feature usable and what residual risk remains. The cost term
corresponds to the burden of precaution: teacher labor, search, validation,
prompt length, monitoring, retrieval design, or training updates. The residual
term corresponds to remaining moral-attention failure after the intervention.
In the prompt-only route, the scaffold is inspectable evidence of the attempted
precaution.
Hand-formula reasoning supplies a governance analogy, not a binding
legal test \citep{carrolltowing1947}. A precaution is attractive when
its burden is small relative to the expected preventable harm. In alignment
terms, prompt scaffolds are low-burden candidate precautions when they are
auditable, cheap to deploy, and reduce residual failure. The analogy makes
burden and risk legible; it does not decide negligence, compliance, or
liability.
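In its usual textbook statement, the Hand formula treats a precaution as
warranted when its burden \(B\) is smaller than the probability of harm \(P\)
times the loss \(L\):
\[
B < P \, L .
\]
A sketch of the alignment analogue, under assumptions labeled here: route \(r\)
has cost \(C_r\), and \(\Delta P_r \ge 0\) is a hypothetical symbol for the
validated reduction in moral-attention failure probability that the route
achieves relative to the incumbent. A scaffold is then an attractive candidate
precaution when
\[
C_r < \Delta P_r \, L .
\]
The benchmark can inform \(C_r\) and \(\Delta P_r\), but \(L\) depends on a
deployment-specific harm model that the benchmark does not measure, so the
inequality organizes the question without settling it.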
\begin{table}[H]
\centering
\small
\begin{tabularx}{\linewidth}{>{\raggedright\arraybackslash}p{0.20\linewidth}>{\raggedright\arraybackslash}X>{\raggedright\arraybackslash}p{0.20\linewidth}>{\raggedright\arraybackslash}X}
\toprule
Hand-formula concept & AI alignment analogue & VAE representation & Governance question \\
\midrule
Burden of precaution \(B\) & Search, audit, deployment, monitoring, retrieval, prompting, or training burden. & Cost term \(C_r\). & What would it cost to make the relevant feature usable? \\
Probability of harm \(P\) & Likelihood of moral-attention failure under deployment conditions. & Estimated residual failure frequency or benchmark failure rate. & How often does the model fail to notice or use the relevant feature? \\
Loss \(L\) & Severity of downstream harm. & Deployment-specific harm model, not fully measured by benchmark. & How serious is the preventable failure? \\
Reasonable precaution & Prompt, retrieval rule, post-training update, monitor, evaluation gate, or escalation rule. & Artifact plus route cost plus residual reduction. & Was there a reasonable intervention that reduced residual failure? \\
\bottomrule
\end{tabularx}
\caption{Hand-formula analogy and VAE mapping. The table is a governance
framework, not a liability test: it makes precaution burden, residual
moral-attention failure, and artifact evidence technically legible.}
\label{tab:hand-vae-map}
\end{table}
VAE makes the precaution question explicit: what artifact, at what burden, with
what validated reduction in residual moral-attention failure?
\section{Scope and Claim Boundaries}
The evidence establishes mechanism-level support and two clean held-out 3D
wins, not all-seed success. Benchmark labels are proxies. Automated metrics
measure benchmark-relevant moral attention, not virtue. Teacher-guided search
can crystallize teacher bias as well as useful moral-attention structure.
Reuse across model families, scales, domains, and deployment settings requires
route-extension evidence. Access logs verify official split access. The
governance claims are accountability claims, not legal conclusions.
\section{Conclusion}
Moral failure can be failure-to-notice: fluent reasoning fails when the
morally decisive feature never enters operative attention. VAE makes that
claim operational by asking what an alignment route can make a bounded model
notice and use, at what cost, and with what residual failure.
The empirical result is a bounded proof of possibility. Teacher-guided
scaffold-family search produced two auditable prompt scaffolds that a frozen
weak student executed under locked held-out 3D evaluation. Seed 2801 makes
support-state preservation/update executable. Seed 4523 makes no-import
criterion discipline executable. Each wins six substantive metrics and ties
valid format against the incumbent baseline.
ETHICS supplies route evidence: the checkpoint exposes a selector/final gap,
and the 10-seed tournament supports scaffold freezing. Failed, mixed, blocked,
and dev-only rows supply the search-cost and residual-frontier ledger behind
the two clean wins.
The framework links VAE, prompt scaffolds, 3D perturbation tests, phronesis,
and governance. VAE supplies cost-plus-residual accounting. The prompt scaffold
is the visible alignment artifact. The 3D benchmark tests preservation,
update, and resistance to unsupported frame drift. Phronesis names the target:
perceiving the salient particular. Governance asks what the pipeline could
reasonably have made the model notice, at what burden, with what validation,
and with what residual risk.
Robust alignment is the capacity to keep moral attention stable when the
situation rewards its loss.
\clearpage
\appendix
\noindent\textit{Rounding note.} Main text rounds metrics and accuracies to
three decimals; appendix tables preserve exact saved values where useful.
\section{Prompt Artifact Table}
\label{app:full-prompts}
\begingroup
\scriptsize
\setlength{\tabcolsep}{4pt}
\begin{longtable}{>{\raggedright\arraybackslash}p{0.18\linewidth}>{\raggedright\arraybackslash}p{0.53\linewidth}>{\raggedright\arraybackslash}p{0.23\linewidth}}
\caption{Prompt artifact table. The table prints the scaffold text that carries
the paper's moral-attention operations: threshold calibration, support-state
preservation/update, no-import discipline, and boundary variants. Text is copied
from saved artifacts after JSON/YAML parsing; line wrapping is typographic only.
Source paths remain in the artifact manifest.
\label{tab:full-prompts}}\\
\toprule
Artifact & Exact prompt text & Role \\
\midrule
\endfirsthead
\toprule
Artifact & Exact prompt text & Role \\
\midrule
\endhead
ETHICS threshold scaffold & Assess the action described in the scenario based on ordinary commonsense moral principles. Consider the full context provided. Is the action generally considered acceptable by a typical person? & Core ETHICS checkpoint artifact; threshold calibration for ordinary commonsense acceptability and selector-gap mechanism evidence. \\