kayba-ai
diff --git a/.claude/projects/-home-david-projects-Kayba-agentic-context-engine/memory/feedback_bedrock_only.md b/.claude/projects/-home-david-projects-Kayba-agentic-context-engine/memory/feedback_bedrock_only.md
Lines changed: 11 additions & 0 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 7 additions & 1 deletion
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 40 additions & 54 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 75 additions & 1 deletion
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 5 additions & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 2 additions & 2 deletions
diff --git a/ace-eval b/ace-eval
@@ -0,0 +1,11 @@
+---
+name: Always use Bedrock
+description: Never use direct Anthropic API key or fall back to OpenAI — always use Bedrock via AWS_BEARER_TOKEN_BEDROCK
+type: feedback
+---
+
+Always use Bedrock for LLM calls. Never use the Anthropic API key directly, never fall back to OpenAI or any other provider.
+
+**Why:** The user has Bedrock configured with `AWS_BEARER_TOKEN_BEDROCK` and does not want direct Anthropic API usage (burns quota/money on the wrong account). Fallback logic is unacceptable — it silently uses the wrong provider.
+
+**How to apply:** In integration tests and any code that needs an LLM model string, use `bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0` (or similar Bedrock model). Never write fallback chains like "if ANTHROPIC_KEY else OPENAI". Just use Bedrock, period.
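
Reviewer sketch (not part of the diff): the "Bedrock only, no fallback" rule in the note above could look roughly like this in test setup code. The helper name `resolve_model` and the choice of `RuntimeError` are assumptions made for illustration. Only the model string and the `AWS_BEARER_TOKEN_BEDROCK` variable come from the note itself.

```python
import os

# Model string quoted in the memory note above.
BEDROCK_MODEL = "bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0"


def resolve_model(env=None):
    """Return the Bedrock model string, or raise if Bedrock is not configured.

    `resolve_model` is a hypothetical helper, not a function from the repository.
    """
    env = os.environ if env is None else env
    if not env.get("AWS_BEARER_TOKEN_BEDROCK"):
        # Fail loudly rather than silently switching to another provider.
        raise RuntimeError(
            "AWS_BEARER_TOKEN_BEDROCK is not set; refusing to fall back"
        )
    return BEDROCK_MODEL
```

The point of the sketch is the missing branch: there is no `elif` for other provider keys, so a misconfigured environment surfaces as an error instead of a call billed to the wrong account.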
@@ -113,4 +113,10 @@ site/
 .private/
 
 # GSD planning artifacts (NEVER commit!)
-.planning/
+.planning/
+
+# Node build artifacts
+node_modules/
+
+# Local scratch / experiment outputs
+tmp/
@@ -7,72 +7,60 @@ This file provides guidance to coding agents working in this repository.
 ### Pipeline-First Development (MANDATORY)
 **All new functionality MUST be implemented as pipeline Steps composed via the Pipeline engine.** Do NOT write standalone scripts, ad-hoc loops, or inline logic that bypasses the pipeline. Before writing any code:
 
-1. Read `docs/design/PIPELINE_DESIGN.md` to understand the Step -> Pipeline -> Branch model.
+1. Read `docs/design/PIPELINE_DESIGN.md` to understand the Step → Pipeline → Branch model.
 2. Implement logic as a `Step` class with `requires`/`provides` declarations and a `__call__(self, ctx) -> ctx` method.
-3. Compose steps using `Pipeline().then(...)` and `.branch(...)` - never manual for-loops or direct function chaining.
-4. Use `StepContext.replace()` for immutable context updates - never mutate context directly.
+3. Compose steps using `Pipeline().then(...)` and `.branch(...)` — never manual for-loops or direct function chaining.
+4. Use `StepContext.replace()` for immutable context updates — never mutate context directly.
 5. Put integration-specific data in `metadata`, not new context fields, unless the field is shared across multiple pipelines.
 
 **Anti-patterns to reject:**
 - Writing a function that calls multiple steps manually instead of composing them in a Pipeline
-- Inline reflection/evaluation logic instead of creating a `ReflectStep` or `EvaluateStep`
+- Inline reflection/evaluation logic instead of creating a ReflectStep or EvaluateStep
 - Ad-hoc `ThreadPoolExecutor` usage instead of `async_boundary` and `max_workers` on steps
 - Standalone scripts that duplicate pipeline functionality without using the pipeline engine
 - Bypassing `requires`/`provides` contracts by accessing context fields not declared in `requires`
 
-If a task seems like it cannot fit the pipeline model, explain why to the user before proceeding - do not silently circumvent it.
+If a task seems like it cannot fit the pipeline model, explain why to the user before proceeding — do not silently circumvent it.
 
 ### Core Code Protection
-**Do NOT modify core modules (`ace/`, `ace/core/`, `pipeline/`) without explicit user approval.** Before proposing any change to these directories:
+**Do NOT modify core modules (`ace/core/`, `pipeline/`) without explicit user approval.** Before proposing any change to these directories:
 1. Read the relevant design docs (`docs/design/ACE_ARCHITECTURE.md`, `docs/design/PIPELINE_DESIGN.md`) thoroughly.
-2. Evaluate whether the change is truly required or if it can be achieved outside the core (for example, in an integration, step, or example).
-3. Clearly explain the proposed change and its justification to the user before making any edits.
+2. Evaluate whether the change is truly required or if it can be achieved outside the core (e.g., in an integration, step, or example).
+3. Clearly explain the proposed change and its justification to the user **before** making any edits.
 4. Wait for the user to explicitly accept before proceeding.
 
 ### Documentation Maintenance
 Before working on code in `ace/`, read `docs/design/ACE_ARCHITECTURE.md` to understand the current architecture.
-Before working on code in `pipeline/`, read `docs/design/PIPELINE_DESIGN.md` to understand the pipeline engine.
-Before working on code in `ace/rr/`, read `docs/design/RR_DESIGN.md` to understand the recursive reflection design.
-Before working on code in `ace/cli/`, read `docs/design/CLI_DESIGN.md` to understand the CLI architecture.
+Before working on code in `pipeline/` or `ace/core/`, read `docs/design/PIPELINE_DESIGN.md` to understand the pipeline engine.
 
-**Docs MUST be kept in sync with code.** Any change that alters a public API, renames a concept, adds or removes a module, or changes execution flow requires a corresponding update to the relevant docs. Do not merge code changes that make the documentation inaccurate.
+**Docs MUST be kept in sync with code.** Any change that alters a public API, renames a concept, adds/removes a module, or changes execution flow **requires** a corresponding update to the relevant docs. Do not merge code changes that make the documentation inaccurate.
 
 Key design docs:
-- `docs/design/ACE_ARCHITECTURE.md` - core ACE architecture: roles, runners, skillbook, adaptation loops, integrations, and public API
-- `docs/design/PIPELINE_DESIGN.md` - pipeline engine: steps, `StepProtocol`, `Pipeline`, branching, execution, and `SubRunner`
-- `docs/design/RR_DESIGN.md` - recursive reflection design in `ace/rr/`
-- `docs/design/CLI_DESIGN.md` - CLI architecture, lazy imports, and command design
-- `docs/design/ACE_REFERENCE.md` - code reference and examples
-- `docs/design/ACE_DECISIONS.md` - design decisions and rejected alternatives
+- `docs/design/ACE_ARCHITECTURE.md` — ACE architecture: layers, core concepts, roles, steps, runners, integrations
+- `docs/design/ACE_REFERENCE.md` — ACE code reference: full implementations, API signatures, usage examples
+- `docs/design/ACE_DECISIONS.md` — design decisions and rejected alternatives (ACE, pipeline, migration)
+- `docs/design/PIPELINE_DESIGN.md` — pipeline engine: steps, StepProtocol, Pipeline, Branch, concurrency
 - If you need to work with collected traces from Logfire, read `agent-guides/logfire.md`
 
 ### Project Structure
-- `ace/` - main package: core data types, role implementations, steps, runners, integrations, providers, recursive reflection, and observability
-- `pipeline/` - generic pipeline engine used by ACE
-- `ace-eval/` - evaluation framework submodule / separate repo workspace
-- `tests/` - pytest-based test suite, including pipeline engine and RR coverage
-- `examples/` - runnable demos for ACE, integrations, and pipeline composition
-- `benchmarks/` - benchmark loaders and task definitions
-- `scripts/` - helper scripts and research tooling
-- `agent-guides/` - internal development guides for LLM agents; not part of the public docs site
-- `docs/` - guides and reference material
-  - `docs/getting-started/` - installation, setup, and quick start
-  - `docs/concepts/` - core concepts such as roles, skillbook, updates, and insight levels
-  - `docs/guides/` - in-depth guides for full pipelines, composition, prompts, integration, and testing
-  - `docs/integrations/` - per-integration docs for LiteLLM, browser-use, LangChain, Claude Code, Claude SDK, MCP, OpenClaw, hosted API, and Opik
-  - `docs/pipeline/` - pipeline engine guides and API reference
-  - `docs/api/` - package API index
-  - `docs/design/` - architecture references (ACE_ARCHITECTURE, ACE_REFERENCE, ACE_DECISIONS, PIPELINE_DESIGN, RR_DESIGN, CLI_DESIGN)
+- `ace/` — core library: roles (PydanticAI-backed), skillbook, steps, runners, providers, RR, integrations, observability
+- `pipeline/` — generic pipeline engine that `ace` is built on (see `docs/design/PIPELINE_DESIGN.md`)
+- `ace-eval/` — evaluation framework (submodule, separate repo)
+- `tests/` — unit/integration tests (pytest)
+- `examples/` — runnable demos grouped by integration
+- `agent-guides/` — internal development guides for LLM agents; not part of the public docs site
+- `docs/` — guides and reference material
+  - `docs/design/ACE_ARCHITECTURE.md` — architecture and concepts (keep in sync with code)
+  - `docs/design/ACE_REFERENCE.md` — code reference and examples (keep in sync with code)
+  - `docs/design/ACE_DECISIONS.md` — design decisions and rejected alternatives
+  - `docs/design/PIPELINE_DESIGN.md` — pipeline engine design doc (keep in sync with code)
 
 ### Commands
-- `uv sync` - install dependencies
-- `uv run pytest` - run tests with coverage on `ace` and `pipeline` (`--cov-fail-under=25`)
-- `uv run pytest -m unit` - run unit tests
-- `uv run pytest -m integration` - run integration tests
-- `uv run pytest -m slow` - run slow tests
-- `uv run pytest -m requires_api` - run tests that need live API credentials
-- `uv run black ace/ pipeline/ tests/ examples/` - format code
-- `uv run mypy ace/` - type check the main package
+- `uv sync` — install all dependencies
+- `uv run pytest` — run tests (coverage enforced `--cov-fail-under=25`)
+- `uv run pytest -m unit` / `-m integration` / `-m slow` — run by marker
+- `uv run black ace/ tests/ examples/` — format code
+- `uv run mypy ace/` — type check
 
 ### Coding Style
 - PEP 8 with Black formatting (line length 88)
@@ -82,12 +70,11 @@ Key design docs:
 
 ### Testing
 - Pytest is the primary runner
-- Some tests use `unittest`-style classes but still run under pytest
-- Use the existing markers: `unit`, `integration`, `slow`, and `requires_api`
-- Add tests for new features and regression tests for bug fixes
+- Add tests for new features; include regression tests for bug fixes
 
 ### Commits
 - Conventional Commits: `feat(scope): subject`, `fix(scope): subject`
+- Do NOT add `Co-Authored-By` trailers to commit messages
 - PRs should include description, test results, and relevant docs updates
 
 ### ACE Roles (quick reference)
@@ -98,15 +85,14 @@ Key design docs:
 | **Reflector** | Analyzes execution results | `Reflector` |
 | **SkillManager** | Updates the skillbook with new strategies | `SkillManager` |
 
-### Public Runners
+### Integration Runners
 
 | Runner | Framework | Use Case |
 |--------|-----------|----------|
-| `ACELiteLLM` | LiteLLM | Batteries-included self-improving agent with `.ask()`, `.learn()`, and trace learning helpers |
-| `ACE` | Core ACE runner | Full learning loop over `Sample` + `TaskEnvironment` |
-| `TraceAnalyser` | Offline traces | Learn from recorded traces without re-running tasks |
-| `BrowserUse` | browser-use | Browser automation with learning |
-| `LangChain` | LangChain | Wrap chains, agents, or graphs with learning |
-| `ClaudeCode` | Claude Code CLI | Coding tasks with learning |
-
-The Anthropic SDK integration lives in `ace/integrations/claude_sdk.py` and is step-based rather than a public runner class.
+| `ACELiteLLM` | LiteLLM (100+ providers) | Simple self-improving agent |
+| `ACELangChain` | LangChain | Wrap chains/agents with learning |
+| `ACEBrowserUse` | browser-use | Browser automation with learning |
+| `ACEClaudeCode` | Claude Code CLI | Coding tasks with learning |
+
+NEVER USE FALLBACKS OR IMPLEMENT THINGS I NEVER ASKED FOR.
+IF IT'S STRAIGHTFORWARD, IMPLEMENT IT STRAIGHTFORWARD.
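
Reviewer sketch (not part of the diff): the pipeline-first rules in the AGENTS.md hunk above describe steps with `requires`/`provides` declarations, a `__call__(self, ctx) -> ctx` method, composition via `Pipeline().then(...)`, and immutable updates through `StepContext.replace()`. The toy below mimics that shape with simplified stand-in classes. It does not import the real `pipeline` package; `UppercaseStep` and the `run` method are assumed names, and the real engine may differ.

```python
from dataclasses import dataclass, field, replace as dc_replace


@dataclass(frozen=True)
class StepContext:
    """Stand-in immutable context; the real class lives in the pipeline package."""
    question: str = ""
    answer: str = ""
    metadata: dict = field(default_factory=dict)

    def replace(self, **changes):
        # Return a new context instead of mutating this one.
        return dc_replace(self, **changes)


class UppercaseStep:
    """Hypothetical step: declares its contract, then transforms the context."""
    requires = frozenset({"question"})
    provides = frozenset({"answer"})

    def __call__(self, ctx):
        return ctx.replace(answer=ctx.question.upper())


class Pipeline:
    """Toy sequential engine; `run` is an assumed method name."""

    def __init__(self):
        self._steps = []

    def then(self, step):
        self._steps.append(step)
        return self  # allow chaining

    def run(self, ctx):
        for step in self._steps:
            ctx = step(ctx)
        return ctx


start = StepContext(question="why pipelines?")
result = Pipeline().then(UppercaseStep()).run(start)
```

Because the context is frozen, `start` is unchanged after the run; each step hands a fresh context to the next one, which is the property the "never mutate context directly" rule protects.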
@@ -7,6 +7,80 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.12.0] - 2026-05-06
+
+### Added
+- **Cross-trace generalization gate** for the SkillManager — four-criterion check
+  (≥3 instances across ≥2 domains, named slot, no API-specific params in the
+  action, verifiable runtime trigger) that constrains when SM may write a broad
+  skill subsuming existing narrow ones. Backed by [skill_generalization.md](ace-eval/research/skill_generalization.md)
+  (14 cited sources).
+- **Action-equivalence rule** for within-run skill writing — splits on action,
+  not on trigger surface. Prevents over-decomposition of structurally identical
+  rules.
+- **Atomicity rule** in `insight` formatting — one trigger + one action per
+  skill, with explicit good/bad shape examples in the prompt.
+- **Insight format guidance** in the SM prompt sourced from the in-context-
+  learning research doc ([icl_skill_formatting.md](ace-eval/research/icl_skill_formatting.md)) — 15-50 word cap, imperative
+  voice, positive framing default, examples only for format/shape rules.
+- **Evidence-only tagging** — SM tags only skills the reflection actually
+  implicates, instead of iterating over every injected_skill_id.
+- **Broaden-via-comparison rule** for UPDATE — when two skills target the same
+  root cause in different niches, broaden `issue` rather than adding a duplicate.
+- **Prompt caching for SM** via `CachePoint(ttl="5m")` mirroring RR's caching;
+  cache_read/write tokens forwarded in run metadata.
+- **SM behavior spec + harness** — `ace-eval/scripts/sm_behavior_check.py`,
+  `sm_iterative_check.py`, `sm_stability_check.py` and matching scenario
+  fixtures cover replay stability, convergence, scope expansion, and the
+  below-threshold gate boundary.
+
+### Changed
+- **`update_skills` signature** — `source` is now optional; `SkillbookView`
+  was dropped from the parameter list (callers pass the real `Skillbook`
+  directly).
+- **Hard removal cap removed** — SM no longer auto-removes skills whose
+  `harmful_count >= 3`. Heavily-used skills can legitimately accumulate
+  harmful tags without being net-negative; REMOVE now requires explicit
+  reflection evidence.
+- **TauBench evaluator** — `evaluation_type=ALL_WITH_NL_ASSERTIONS` on both
+  `run_task` and `run_tasks` call sites in
+  `ace-eval/src/ace_eval/e2e/benchmarks/tau_bench.py`. Retail (and any future
+  benchmark with `NL_ASSERTION` in `reward_basis`) now produces real reward
+  numbers instead of crashing on every task during reward computation.
+
+### Removed
+- **Skillbook v1 legacy aliases** on `Skill` and `UpdateOperation` — v2 schema
+  is now the only schema.
+
+## [0.11.0] - 2026-04-29
+
+### Added
+- **`RecursiveAgent` core abstraction** — extracted from RR into `ace/core/recursive_agent.py`; provides a generic recursive PydanticAI agent with sandbox, microcompaction, default tool set, and depth-aware sub-agent registration. Reusable across roles beyond the Reflector.
+- **Skillbook v2 schema** — full rewrite of `ace/core/skillbook.py` with section-grouped storage, richer `InsightSource` provenance, and BM25-backed retrieval (`rank-bm25` runtime dependency).
+- **Agentic SkillManager** — `SkillManager` rewritten as a tool-calling loop (`ace/implementations/sm_tools.py`). Provenance is now populated by the SkillManager agent directly rather than a dedicated step.
+- **RR skillbook tools for the Reflector** — Reflector can introspect and propose updates to the skillbook from inside the recursive loop.
+- **Anthropic prompt caching enabled by default** for RR agents; `cache_read_tokens` and `cache_write_tokens` are forwarded in run metadata for cost accounting.
+- **Logfire spans around recursive agent sessions** for end-to-end observability of nested RR runs.
+- **Online / offline mode** in the ACE runner.
+- **`nest-asyncio`** added to the dev extra to support nested loops in notebooks and live test scripts.
+
+### Changed
+- **RR collapsed into a single `RRStep`** — the orchestrator/worker split, batch machinery, and `AttachInsightSourcesStep` have been removed. RR now runs as a true recursive loop with depth-bounded sub-agent delegation and microcompaction of stale tool results.
+- **Reflector prompts** simplified, deduplicated, and made input-agnostic; added early-skillbook-skim and parallel-tool guidance.
+- **`record_observation` tool renamed to `think`** to clarify it is a scratch reasoning channel, not persistent storage.
+- **Native evidence summaries** are produced inside RR before final synthesis.
+- **Skillbook prompt format is now markdown** — `Skillbook.as_prompt()` returns a section-grouped markdown list instead of TOON. The `python-toon` dependency has been dropped.
+- **`metered_model` and `sandbox`** moved from `ace/rr/` into `ace/core/` to reflect their cross-role use.
+- **Pytest defaults** — `uv run pytest` now excludes `integration` and `requires_api` markers by default; coverage flags removed from `addopts` (run with `--cov` explicitly when needed).
+- **Observability** — `tool_arguments` and `tool_response` are no longer scrubbed by the Logfire callback so tool I/O remains inspectable.
+
+### Removed
+- `ace/rr/` legacy package layout (`agent.py`, `runner.py`, `trace_context.py`, `message_trimming.py`, batch helpers). Functionality is now in `ace/core/recursive_agent.py` and `ace/implementations/rr/`.
+- `AttachInsightSourcesStep` and its pipeline wiring — provenance is attached by the SkillManager agent.
+- `python-toon` runtime dependency.
+- TAG handling from the SkillManager.
+- Citation scanning from the Reflector.
+
 ## [0.10.0] - 2026-04-13
 
 ### Added
@@ -32,7 +106,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [0.9.2] - 2026-03-31
 
 ### Added
-- **Insight source provenance** — `InsightSource` typed model captures the origin of each skillbook update (trace ID, sample question, epoch/step, reflection summary, integration metadata); `AttachInsightSourcesStep` automatically enriches `UpdateBatch` operations with provenance and is wired into the default learning tail
+- **Insight source provenance** — `InsightSource` typed model captures the origin of each skillbook update (trace ID, sample question, epoch/step, reflection summary, integration metadata); provenance is now populated by the SkillManager agent directly
 - **Claude SDK step** — `ClaudeSDKStep` integration for running Claude Code sub-agents from within ACE pipelines
 - **RR sub-agent code execution** — Recursive Reflector can now delegate to code-execution sub-agents at runtime
 - **RR raw trace batch helpers** — `build_raw_trace_batches` and related runtime utilities for feeding raw traces directly into the RR pipeline
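
Reviewer sketch (not part of the diff): the 0.12.0 entry above lists four criteria for the cross-trace generalization gate. Per the changelog the actual gate is expressed in the SkillManager prompt, so the predicate below is only one way to read those criteria as code. The `Candidate` type, its field names, and `passes_gate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical summary of a proposed broad skill."""
    instances: int             # supporting instances observed
    domains: int               # distinct domains those instances span
    has_named_slot: bool       # the skill names the slot it generalizes over
    api_specific_params: bool  # the action mentions API-specific parameters
    verifiable_trigger: bool   # the trigger can be checked at runtime


def passes_gate(c):
    """Return True only when all four changelog criteria hold."""
    return (
        c.instances >= 3
        and c.domains >= 2
        and c.has_named_slot
        and not c.api_specific_params
        and c.verifiable_trigger
    )
```

A candidate with two instances, or with three instances confined to one domain, fails the check and would stay as narrow skills under this reading.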
 
@@ -57,7 +57,7 @@ Key design docs:
 
 ### Commands
 - `uv sync` — install all dependencies
-- `uv run pytest` — run tests (coverage enforced `--cov-fail-under=25`)
+- `uv run pytest` — run tests (excludes `integration` and `requires_api` markers by default)
 - `uv run pytest -m unit` / `-m integration` / `-m slow` — run by marker
 - `uv run black ace/ tests/ examples/` — format code
 - `uv run mypy ace/` — type check
@@ -93,3 +93,7 @@ Key design docs:
 | `ACELangChain` | LangChain | Wrap chains/agents with learning |
 | `ACEBrowserUse` | browser-use | Browser automation with learning |
 | `ACEClaudeCode` | Claude Code CLI | Coding tasks with learning |
+
+NEVER USE FALLBACKS OR IMPLEMENT THINGS I NEVER ASKED FOR. 
+
+Keep your answers concise and to the point. If you don't know something, say you don't know instead of making assumptions or fabricating information. Always ask clarifying questions if the user's request is ambiguous or lacks necessary details.
@@ -168,15 +168,15 @@ ACE + Claude Code translated this library from Python to TypeScript with zero su
 ACE is built on a composable pipeline engine. Each step declares what it requires and what it produces:
 
 ```
-AgentStep -> EvaluateStep -> ReflectStep -> UpdateStep -> ApplyStep -> DeduplicateStep
+AgentStep -> EvaluateStep -> ReflectStep -> UpdateStep -> DeduplicateStep
 ```
 
 Use `learning_tail()` for the standard learning sequence, or compose custom pipelines:
 
 ```python
 from ace import Pipeline, AgentStep, EvaluateStep, learning_tail
 
-steps = [AgentStep(agent), EvaluateStep(env)] + learning_tail(reflector, skill_manager, skillbook)
+steps = [AgentStep(agent, skillbook), EvaluateStep(env)] + learning_tail(reflector, skill_manager, skillbook)
 pipeline = Pipeline(steps)
 ```