chore(release): bump to 0.12.0

davidfarah2003 · davidfarah2003 · commit b61567b30661 · 2026-05-06T21:04:42.000-07:00
SkillManager (SM) hardening release. Adds the cross-trace
generalization gate, action-equivalence rule, atomicity check,
ICL-grounded insight format, evidence-only tagging,
broaden-via-comparison rule, SM prompt caching, and the SM behavior
spec + harness. Removes the harmful-count hard removal cap and the
Skillbook v1 legacy aliases. Submodule fix lets tau-bench retail
produce real benchmark numbers.
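
For reviewers, the four-criterion generalization gate listed in the changelog entry of this commit can be read as a simple predicate. The sketch below is illustrative only: `BroadSkillCandidate`, its fields, and `passes_generalization_gate` are hypothetical names, and the real gate is expressed in the SM prompt rather than in a Python function like this one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BroadSkillCandidate:
    """Hypothetical summary of a proposed broad skill (not the project's real type)."""

    instance_count: int          # supporting instances observed across traces
    domain_count: int            # distinct domains those instances come from
    has_named_slot: bool         # the skill names the slot it generalizes over
    action_has_api_params: bool  # the action text contains API-specific parameters
    has_runtime_trigger: bool    # the trigger can be verified at runtime


def passes_generalization_gate(c: BroadSkillCandidate) -> bool:
    """Return True only when all four changelog criteria hold."""
    return (
        c.instance_count >= 3 and c.domain_count >= 2  # >=3 instances across >=2 domains
        and c.has_named_slot                           # named slot
        and not c.action_has_api_params                # no API-specific params in the action
        and c.has_runtime_trigger                      # verifiable runtime trigger
    )
```

Under these assumptions, a candidate backed by only two instances is rejected even if every other criterion holds, which corresponds to the below-threshold boundary the harness is said to cover.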
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,6 +7,51 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.12.0] - 2026-05-06
+
+### Added
+- **Cross-trace generalization gate** for the SkillManager — four-criterion check
+  (≥3 instances across ≥2 domains, named slot, no API-specific params in the
+  action, verifiable runtime trigger) that constrains when SM may write a broad
+  skill subsuming existing narrow ones. Backed by [skill_generalization.md](ace-eval/research/skill_generalization.md)
+  (14 cited sources).
+- **Action-equivalence rule** for within-run skill writing — splits on action,
+  not on trigger surface. Prevents over-decomposition of structurally identical
+  rules.
+- **Atomicity rule** in `insight` formatting — one trigger + one action per
+  skill, with explicit good/bad shape examples in the prompt.
+- **Insight format guidance** in the SM prompt, sourced from the in-context
+  learning research doc ([icl_skill_formatting.md](ace-eval/research/icl_skill_formatting.md)): 15-50 words per insight, imperative
+  voice, positive framing by default, examples only for format/shape rules.
+- **Evidence-only tagging** — SM tags only skills the reflection actually
+  implicates, instead of iterating over every `injected_skill_id`.
+- **Broaden-via-comparison rule** for UPDATE — when two skills target the same
+  root cause in different niches, broaden `issue` rather than adding a duplicate.
+- **Prompt caching for SM** via `CachePoint(ttl="5m")` mirroring RR's caching;
+  cache_read/write tokens forwarded in run metadata.
+- **SM behavior spec + harness** — `ace-eval/scripts/sm_behavior_check.py`,
+  `sm_iterative_check.py`, `sm_stability_check.py` and matching scenario
+  fixtures cover replay stability, convergence, scope expansion, and the
+  below-threshold gate boundary.
+
+### Changed
+- **`update_skills` signature** — `source` is now optional; `SkillbookView`
+  was dropped from the parameter list (callers pass the real `Skillbook`
+  directly).
+- **Hard removal cap removed** — SM no longer auto-removes skills whose
+  `harmful_count >= 3`. Heavily used skills can legitimately accumulate
+  harmful tags without being net-negative; REMOVE now requires explicit
+  reflection evidence.
+- **TauBench evaluator** now sets `evaluation_type=ALL_WITH_NL_ASSERTIONS` at
+  both the `run_task` and `run_tasks` call sites in
+  `ace-eval/src/ace_eval/e2e/benchmarks/tau_bench.py`. Retail (and any future
+  benchmark with `NL_ASSERTION` in `reward_basis`) now produces real reward
+  numbers instead of crashing on every task during reward computation.
+
+### Removed
+- **Skillbook v1 legacy aliases** on `Skill` and `UpdateOperation` — v2 schema
+  is now the only schema.
+
 ## [0.11.0] - 2026-04-29
 
 ### Added
diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "ace-framework"
-version = "0.11.0"
+version = "0.12.0"
 description = "Build self-improving AI agents that learn from experience"
 readme = "README.md"
 requires-python = ">=3.12"
diff --git a/uv.lock b/uv.lock