@@ -7,6 +7,51 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.12.0] - 2026-05-06
+
+### Added
+- **Cross-trace generalization gate** for the SkillManager — four-criterion check
+  (≥3 instances across ≥2 domains, named slot, no API-specific params in the
+  action, verifiable runtime trigger) that constrains when SM may write a broad
+  skill subsuming existing narrow ones. Backed by [skill_generalization.md](ace-eval/research/skill_generalization.md)
+  (14 cited sources).
+- **Action-equivalence rule** for within-run skill writing — splits on action,
+  not on trigger surface. Prevents over-decomposition of structurally identical
+  rules.
+- **Atomicity rule** in `insight` formatting — one trigger + one action per
+  skill, with explicit good/bad shape examples in the prompt.
+- **Insight format guidance** in the SM prompt sourced from the in-context-
+  learning research doc ([icl_skill_formatting.md](ace-eval/research/icl_skill_formatting.md)) — 15-50 word cap, imperative
+  voice, positive framing default, examples only for format/shape rules.
+- **Evidence-only tagging** — SM tags only skills the reflection actually
+  implicates, instead of iterating over every injected_skill_id.
+- **Broaden-via-comparison rule** for UPDATE — when two skills target the same
+  root cause in different niches, broaden `issue` rather than adding a duplicate.
+- **Prompt caching for SM** via `CachePoint(ttl="5m")` mirroring RR's caching;
+  cache_read/write tokens forwarded in run metadata.
+- **SM behavior spec + harness** — `ace-eval/scripts/sm_behavior_check.py`,
+  `sm_iterative_check.py`, `sm_stability_check.py` and matching scenario
+  fixtures cover replay stability, convergence, scope expansion, and the
+  below-threshold gate boundary.
+
+### Changed
+- **`update_skills` signature** — `source` is now optional; `SkillbookView`
+  was dropped from the parameter list (callers pass the real `Skillbook`
+  directly).
+- **Hard removal cap removed** — SM no longer auto-removes skills whose
+  `harmful_count >= 3`. Heavily-used skills can legitimately accumulate
+  harmful tags without being net-negative; REMOVE now requires explicit
+  reflection evidence.
+- **TauBench evaluator** — `evaluation_type=ALL_WITH_NL_ASSERTIONS` on both
+  `run_task` and `run_tasks` call sites in
+  `ace-eval/src/ace_eval/e2e/benchmarks/tau_bench.py`. Retail (and any future
+  benchmark with `NL_ASSERTION` in `reward_basis`) now produces real reward
+  numbers instead of crashing on every task during reward computation.
+
+### Removed
+- **Skillbook v1 legacy aliases** on `Skill` and `UpdateOperation` — v2 schema
+  is now the only schema.
+
 ## [0.11.0] - 2026-04-29
 
 ### Added