Commit 300efc0

Merge pull request #130 from kayba-ai/refactor/davidfarah2003/simplify-rr
Release 0.12.0 (incl. 0.11.0): RR/Skillbook v2 rewrite + SM hardening
2 parents 5889da7 + b61567b commit 300efc0

85 files changed

Lines changed: 6558 additions & 5540 deletions

Lines changed: 11 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,11 @@
1+
---
2+
name: Always use Bedrock
3+
description: Never use direct Anthropic API key or fall back to OpenAI — always use Bedrock via AWS_BEARER_TOKEN_BEDROCK
4+
type: feedback
5+
---
6+
7+
Always use Bedrock for LLM calls. Never use the Anthropic API key directly, never fall back to OpenAI or any other provider.
8+
9+
**Why:** The user has Bedrock configured with `AWS_BEARER_TOKEN_BEDROCK` and does not want direct Anthropic API usage (burns quota/money on the wrong account). Fallback logic is unacceptable — it silently uses the wrong provider.
10+
11+
**How to apply:** In integration tests and any code that needs an LLM model string, use `bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0` (or similar Bedrock model). Never write fallback chains like "if ANTHROPIC_KEY else OPENAI". Just use Bedrock, period.
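A minimal sketch of the "no fallback" rule above, assuming a hypothetical helper named `resolve_model` (not part of the repository). It returns the Bedrock model string quoted in the note and fails loudly when `AWS_BEARER_TOKEN_BEDROCK` is missing, instead of silently switching providers.

```python
import os

# Model string taken from the note above; any similar Bedrock model would do.
BEDROCK_MODEL = "bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0"


def resolve_model(env=None):
    """Return the Bedrock model string, or raise if Bedrock is not configured.

    `resolve_model` is a hypothetical helper for illustration only.
    There is deliberately no branch that tries another provider.
    """
    env = os.environ if env is None else env
    if not env.get("AWS_BEARER_TOKEN_BEDROCK"):
        raise RuntimeError(
            "AWS_BEARER_TOKEN_BEDROCK is not set; refusing to fall back "
            "to another provider"
        )
    return BEDROCK_MODEL
```

Raising here keeps a misconfigured environment visible in test output, which is the behavior the note asks for.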

.gitignore

Lines changed: 7 additions & 1 deletion
@@ -113,4 +113,10 @@ site/
113113
.private/
114114

115115
# GSD planning artifacts (NEVER commit!)
116-
.planning/
116+
.planning/
117+
118+
# Node build artifacts
119+
node_modules/
120+
121+
# Local scratch / experiment outputs
122+
tmp/

AGENTS.md

Lines changed: 40 additions & 54 deletions
@@ -7,72 +7,60 @@ This file provides guidance to coding agents working in this repository.
77
### Pipeline-First Development (MANDATORY)
88
**All new functionality MUST be implemented as pipeline Steps composed via the Pipeline engine.** Do NOT write standalone scripts, ad-hoc loops, or inline logic that bypasses the pipeline. Before writing any code:
99

10-
1. Read `docs/design/PIPELINE_DESIGN.md` to understand the Step -> Pipeline -> Branch model.
10+
1. Read `docs/design/PIPELINE_DESIGN.md` to understand the Step → Pipeline → Branch model.
1111
2. Implement logic as a `Step` class with `requires`/`provides` declarations and a `__call__(self, ctx) -> ctx` method.
12-
3. Compose steps using `Pipeline().then(...)` and `.branch(...)` - never manual for-loops or direct function chaining.
13-
4. Use `StepContext.replace()` for immutable context updates - never mutate context directly.
12+
3. Compose steps using `Pipeline().then(...)` and `.branch(...)`; never manual for-loops or direct function chaining.
13+
4. Use `StepContext.replace()` for immutable context updates; never mutate context directly.
1414
5. Put integration-specific data in `metadata`, not new context fields, unless the field is shared across multiple pipelines.
1515

1616
**Anti-patterns to reject:**
1717
- Writing a function that calls multiple steps manually instead of composing them in a Pipeline
18-
- Inline reflection/evaluation logic instead of creating a `ReflectStep` or `EvaluateStep`
18+
- Inline reflection/evaluation logic instead of creating a ReflectStep or EvaluateStep
1919
- Ad-hoc `ThreadPoolExecutor` usage instead of `async_boundary` and `max_workers` on steps
2020
- Standalone scripts that duplicate pipeline functionality without using the pipeline engine
2121
- Bypassing `requires`/`provides` contracts by accessing context fields not declared in `requires`
2222

23-
If a task seems like it cannot fit the pipeline model, explain why to the user before proceeding - do not silently circumvent it.
23+
If a task seems like it cannot fit the pipeline model, explain why to the user before proceeding; do not silently circumvent it.
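The pipeline-first rules above can be illustrated with a small, self-contained sketch. Everything here (`SketchContext`, `SketchPipeline`, `UppercaseStep`, `CountStep`) is a hypothetical stand-in written for this example, not the real `pipeline` package API; it only mirrors the ideas of `requires`/`provides` declarations, `__call__(self, ctx) -> ctx`, composition via `.then(...)`, and immutable context updates.

```python
import dataclasses
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SketchContext:
    """Hypothetical immutable context; stands in for a real step context."""

    values: dict = field(default_factory=dict)

    def replace(self, **updates):
        # Build a new context instead of mutating the existing one.
        return dataclasses.replace(self, values={**self.values, **updates})


class UppercaseStep:
    requires = frozenset({"text"})
    provides = frozenset({"upper"})

    def __call__(self, ctx):
        return ctx.replace(upper=ctx.values["text"].upper())


class CountStep:
    requires = frozenset({"upper"})
    provides = frozenset({"length"})

    def __call__(self, ctx):
        return ctx.replace(length=len(ctx.values["upper"]))


class SketchPipeline:
    """Hypothetical composer: runs steps in order and checks `requires`."""

    def __init__(self):
        self._steps = []

    def then(self, step):
        self._steps.append(step)
        return self

    def run(self, ctx):
        for step in self._steps:
            missing = step.requires - set(ctx.values)
            if missing:
                raise KeyError(f"{type(step).__name__} is missing {sorted(missing)}")
            ctx = step(ctx)
        return ctx
```

A caller would compose with `SketchPipeline().then(UppercaseStep()).then(CountStep())` and pass in a starting context; the starting context is left unchanged because each step returns a new one.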
2424

2525
### Core Code Protection
26-
**Do NOT modify core modules (`ace/`, `ace/core/`, `pipeline/`) without explicit user approval.** Before proposing any change to these directories:
26+
**Do NOT modify core modules (`ace/core/`, `pipeline/`) without explicit user approval.** Before proposing any change to these directories:
2727
1. Read the relevant design docs (`docs/design/ACE_ARCHITECTURE.md`, `docs/design/PIPELINE_DESIGN.md`) thoroughly.
28-
2. Evaluate whether the change is truly required or if it can be achieved outside the core (for example, in an integration, step, or example).
29-
3. Clearly explain the proposed change and its justification to the user before making any edits.
28+
2. Evaluate whether the change is truly required or if it can be achieved outside the core (e.g., in an integration, step, or example).
29+
3. Clearly explain the proposed change and its justification to the user **before** making any edits.
3030
4. Wait for the user to explicitly accept before proceeding.
3131

3232
### Documentation Maintenance
3333
Before working on code in `ace/`, read `docs/design/ACE_ARCHITECTURE.md` to understand the current architecture.
34-
Before working on code in `pipeline/`, read `docs/design/PIPELINE_DESIGN.md` to understand the pipeline engine.
35-
Before working on code in `ace/rr/`, read `docs/design/RR_DESIGN.md` to understand the recursive reflection design.
36-
Before working on code in `ace/cli/`, read `docs/design/CLI_DESIGN.md` to understand the CLI architecture.
34+
Before working on code in `pipeline/` or `ace/core/`, read `docs/design/PIPELINE_DESIGN.md` to understand the pipeline engine.
3735

38-
**Docs MUST be kept in sync with code.** Any change that alters a public API, renames a concept, adds or removes a module, or changes execution flow requires a corresponding update to the relevant docs. Do not merge code changes that make the documentation inaccurate.
36+
**Docs MUST be kept in sync with code.** Any change that alters a public API, renames a concept, adds/removes a module, or changes execution flow **requires** a corresponding update to the relevant docs. Do not merge code changes that make the documentation inaccurate.
3937

4038
Key design docs:
41-
- `docs/design/ACE_ARCHITECTURE.md` - core ACE architecture: roles, runners, skillbook, adaptation loops, integrations, and public API
42-
- `docs/design/PIPELINE_DESIGN.md` - pipeline engine: steps, `StepProtocol`, `Pipeline`, branching, execution, and `SubRunner`
43-
- `docs/design/RR_DESIGN.md` - recursive reflection design in `ace/rr/`
44-
- `docs/design/CLI_DESIGN.md` - CLI architecture, lazy imports, and command design
45-
- `docs/design/ACE_REFERENCE.md` - code reference and examples
46-
- `docs/design/ACE_DECISIONS.md` - design decisions and rejected alternatives
39+
- `docs/design/ACE_ARCHITECTURE.md` — ACE architecture: layers, core concepts, roles, steps, runners, integrations
40+
- `docs/design/ACE_REFERENCE.md` — ACE code reference: full implementations, API signatures, usage examples
41+
- `docs/design/ACE_DECISIONS.md` — design decisions and rejected alternatives (ACE, pipeline, migration)
42+
- `docs/design/PIPELINE_DESIGN.md` — pipeline engine: steps, StepProtocol, Pipeline, Branch, concurrency
4743
- If you need to work with collected traces from Logfire, read `agent-guides/logfire.md`
4844

4945
### Project Structure
50-
- `ace/` - main package: core data types, role implementations, steps, runners, integrations, providers, recursive reflection, and observability
51-
- `pipeline/` - generic pipeline engine used by ACE
52-
- `ace-eval/` - evaluation framework submodule / separate repo workspace
53-
- `tests/` - pytest-based test suite, including pipeline engine and RR coverage
54-
- `examples/` - runnable demos for ACE, integrations, and pipeline composition
55-
- `benchmarks/` - benchmark loaders and task definitions
56-
- `scripts/` - helper scripts and research tooling
57-
- `agent-guides/` - internal development guides for LLM agents; not part of the public docs site
58-
- `docs/` - guides and reference material
59-
- `docs/getting-started/` - installation, setup, and quick start
60-
- `docs/concepts/` - core concepts such as roles, skillbook, updates, and insight levels
61-
- `docs/guides/` - in-depth guides for full pipelines, composition, prompts, integration, and testing
62-
- `docs/integrations/` - per-integration docs for LiteLLM, browser-use, LangChain, Claude Code, Claude SDK, MCP, OpenClaw, hosted API, and Opik
63-
- `docs/pipeline/` - pipeline engine guides and API reference
64-
- `docs/api/` - package API index
65-
- `docs/design/` - architecture references (ACE_ARCHITECTURE, ACE_REFERENCE, ACE_DECISIONS, PIPELINE_DESIGN, RR_DESIGN, CLI_DESIGN)
46+
- `ace/` — core library: roles (PydanticAI-backed), skillbook, steps, runners, providers, RR, integrations, observability
47+
- `pipeline/` — generic pipeline engine that `ace` is built on (see `docs/design/PIPELINE_DESIGN.md`)
48+
- `ace-eval/` — evaluation framework (submodule, separate repo)
49+
- `tests/` — unit/integration tests (pytest)
50+
- `examples/` — runnable demos grouped by integration
51+
- `agent-guides/` — internal development guides for LLM agents; not part of the public docs site
52+
- `docs/` — guides and reference material
53+
- `docs/design/ACE_ARCHITECTURE.md` — architecture and concepts (keep in sync with code)
54+
- `docs/design/ACE_REFERENCE.md` — code reference and examples (keep in sync with code)
55+
- `docs/design/ACE_DECISIONS.md` — design decisions and rejected alternatives
56+
- `docs/design/PIPELINE_DESIGN.md` — pipeline engine design doc (keep in sync with code)
6657

6758
### Commands
68-
- `uv sync` - install dependencies
69-
- `uv run pytest` - run tests with coverage on `ace` and `pipeline` (`--cov-fail-under=25`)
70-
- `uv run pytest -m unit` - run unit tests
71-
- `uv run pytest -m integration` - run integration tests
72-
- `uv run pytest -m slow` - run slow tests
73-
- `uv run pytest -m requires_api` - run tests that need live API credentials
74-
- `uv run black ace/ pipeline/ tests/ examples/` - format code
75-
- `uv run mypy ace/` - type check the main package
59+
- `uv sync` — install all dependencies
60+
- `uv run pytest` — run tests (coverage enforced `--cov-fail-under=25`)
61+
- `uv run pytest -m unit` / `-m integration` / `-m slow` — run by marker
62+
- `uv run black ace/ tests/ examples/` — format code
63+
- `uv run mypy ace/` — type check
7664

7765
### Coding Style
7866
- PEP 8 with Black formatting (line length 88)
@@ -82,12 +70,11 @@ Key design docs:
8270

8371
### Testing
8472
- Pytest is the primary runner
85-
- Some tests use `unittest`-style classes but still run under pytest
86-
- Use the existing markers: `unit`, `integration`, `slow`, and `requires_api`
87-
- Add tests for new features and regression tests for bug fixes
73+
- Add tests for new features; include regression tests for bug fixes
8874

8975
### Commits
9076
- Conventional Commits: `feat(scope): subject`, `fix(scope): subject`
77+
- Do NOT add `Co-Authored-By` trailers to commit messages
9178
- PRs should include description, test results, and relevant docs updates
9279

9380
### ACE Roles (quick reference)
@@ -98,15 +85,14 @@ Key design docs:
9885
| **Reflector** | Analyzes execution results | `Reflector` |
9986
| **SkillManager** | Updates the skillbook with new strategies | `SkillManager` |
10087

101-
### Public Runners
88+
### Integration Runners
10289

10390
| Runner | Framework | Use Case |
10491
|--------|-----------|----------|
105-
| `ACELiteLLM` | LiteLLM | Batteries-included self-improving agent with `.ask()`, `.learn()`, and trace learning helpers |
106-
| `ACE` | Core ACE runner | Full learning loop over `Sample` + `TaskEnvironment` |
107-
| `TraceAnalyser` | Offline traces | Learn from recorded traces without re-running tasks |
108-
| `BrowserUse` | browser-use | Browser automation with learning |
109-
| `LangChain` | LangChain | Wrap chains, agents, or graphs with learning |
110-
| `ClaudeCode` | Claude Code CLI | Coding tasks with learning |
111-
112-
The Anthropic SDK integration lives in `ace/integrations/claude_sdk.py` and is step-based rather than a public runner class.
92+
| `ACELiteLLM` | LiteLLM (100+ providers) | Simple self-improving agent |
93+
| `ACELangChain` | LangChain | Wrap chains/agents with learning |
94+
| `ACEBrowserUse` | browser-use | Browser automation with learning |
95+
| `ACEClaudeCode` | Claude Code CLI | Coding tasks with learning |
96+
97+
NEVER USE FALLBACKS OR IMPLEMENT THINGS I NEVER ASKED FOR.
98+
IF IT'S STRAIGHTFORWARD, IMPLEMENT IT STRAIGHTFORWARDLY.

CHANGELOG.md

Lines changed: 75 additions & 1 deletion
@@ -7,6 +7,80 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
## [0.12.0] - 2026-05-06
11+
12+
### Added
13+
- **Cross-trace generalization gate** for the SkillManager — four-criterion check
14+
(≥3 instances across ≥2 domains, named slot, no API-specific params in the
15+
action, verifiable runtime trigger) that constrains when SM may write a broad
16+
skill subsuming existing narrow ones. Backed by [skill_generalization.md](ace-eval/research/skill_generalization.md)
17+
(14 cited sources).
18+
- **Action-equivalence rule** for within-run skill writing — splits on action,
19+
not on trigger surface. Prevents over-decomposition of structurally identical
20+
rules.
21+
- **Atomicity rule** in `insight` formatting — one trigger + one action per
22+
skill, with explicit good/bad shape examples in the prompt.
23+
- **Insight format guidance** in the SM prompt sourced from the in-context-
24+
learning research doc ([icl_skill_formatting.md](ace-eval/research/icl_skill_formatting.md)) — 15-50 word cap, imperative
25+
voice, positive framing default, examples only for format/shape rules.
26+
- **Evidence-only tagging** — SM tags only skills the reflection actually
27+
implicates, instead of iterating over every injected_skill_id.
28+
- **Broaden-via-comparison rule** for UPDATE — when two skills target the same
29+
root cause in different niches, broaden `issue` rather than adding a duplicate.
30+
- **Prompt caching for SM** via `CachePoint(ttl="5m")` mirroring RR's caching;
31+
cache_read/write tokens forwarded in run metadata.
32+
- **SM behavior spec + harness**: `ace-eval/scripts/sm_behavior_check.py`,
33+
`sm_iterative_check.py`, `sm_stability_check.py` and matching scenario
34+
fixtures cover replay stability, convergence, scope expansion, and the
35+
below-threshold gate boundary.
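The four criteria of the cross-trace generalization gate listed in the first item above can be written out as a plain predicate. This is only a sketch: the changelog does not say how the check is implemented, and `GateEvidence` and `passes_generalization_gate` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateEvidence:
    """Hypothetical summary of the evidence behind a proposed broad skill."""

    instances: int                       # supporting instances observed
    domains: int                         # distinct domains they come from
    has_named_slot: bool                 # the skill names the slot it generalizes over
    action_has_api_params: bool          # the action mentions API-specific parameters
    trigger_verifiable_at_runtime: bool  # the trigger can be checked at runtime


def passes_generalization_gate(evidence):
    """Return True only when all four changelog criteria hold."""
    return (
        evidence.instances >= 3
        and evidence.domains >= 2
        and evidence.has_named_slot
        and not evidence.action_has_api_params
        and evidence.trigger_verifiable_at_runtime
    )
```

With two instances instead of three, or with API-specific parameters in the action, the predicate returns `False`, which corresponds to the below-threshold boundary mentioned in the harness item.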
36+
37+
### Changed
38+
- **`update_skills` signature**: `source` is now optional; `SkillbookView`
39+
was dropped from the parameter list (callers pass the real `Skillbook`
40+
directly).
41+
- **Hard removal cap removed** — SM no longer auto-removes skills whose
42+
`harmful_count >= 3`. Heavily-used skills can legitimately accumulate
43+
harmful tags without being net-negative; REMOVE now requires explicit
44+
reflection evidence.
45+
- **TauBench evaluator**: `evaluation_type=ALL_WITH_NL_ASSERTIONS` on both
46+
`run_task` and `run_tasks` call sites in
47+
`ace-eval/src/ace_eval/e2e/benchmarks/tau_bench.py`. Retail (and any future
48+
benchmark with `NL_ASSERTION` in `reward_basis`) now produces real reward
49+
numbers instead of crashing on every task during reward computation.
50+
51+
### Removed
52+
- **Skillbook v1 legacy aliases** on `Skill` and `UpdateOperation` — v2 schema
53+
is now the only schema.
54+
55+
## [0.11.0] - 2026-04-29
56+
57+
### Added
58+
- **`RecursiveAgent` core abstraction** — extracted from RR into `ace/core/recursive_agent.py`; provides a generic recursive PydanticAI agent with sandbox, microcompaction, default tool set, and depth-aware sub-agent registration. Reusable across roles beyond the Reflector.
59+
- **Skillbook v2 schema** — full rewrite of `ace/core/skillbook.py` with section-grouped storage, richer `InsightSource` provenance, and BM25-backed retrieval (`rank-bm25` runtime dependency).
60+
- **Agentic SkillManager**: `SkillManager` rewritten as a tool-calling loop (`ace/implementations/sm_tools.py`). Provenance is now populated by the SkillManager agent directly rather than a dedicated step.
61+
- **RR skillbook tools for the Reflector** — Reflector can introspect and propose updates to the skillbook from inside the recursive loop.
62+
- **Anthropic prompt caching enabled by default** for RR agents; `cache_read_tokens` and `cache_write_tokens` are forwarded in run metadata for cost accounting.
63+
- **Logfire spans around recursive agent sessions** for end-to-end observability of nested RR runs.
64+
- **Online / offline mode** in the ACE runner.
65+
- **`nest-asyncio`** added to the dev extra to support nested loops in notebooks and live test scripts.
66+
67+
### Changed
68+
- **RR collapsed into a single `RRStep`** — the orchestrator/worker split, batch machinery, and `AttachInsightSourcesStep` have been removed. RR now runs as a true recursive loop with depth-bounded sub-agent delegation and microcompaction of stale tool results.
69+
- **Reflector prompts** simplified, deduplicated, and made input-agnostic; added early-skillbook-skim and parallel-tool guidance.
70+
- **`record_observation` tool renamed to `think`** to clarify it is a scratch reasoning channel, not persistent storage.
71+
- **Native evidence summaries** are produced inside RR before final synthesis.
72+
- **Skillbook prompt format is now markdown**: `Skillbook.as_prompt()` returns a section-grouped markdown list instead of TOON. The `python-toon` dependency has been dropped.
73+
- **`metered_model` and `sandbox`** moved from `ace/rr/` into `ace/core/` to reflect their cross-role use.
74+
- **Pytest defaults**: `uv run pytest` now excludes `integration` and `requires_api` markers by default; coverage flags removed from `addopts` (run with `--cov` explicitly when needed).
75+
- **Observability**: `tool_arguments` and `tool_response` are no longer scrubbed by the Logfire callback so tool I/O remains inspectable.
76+
77+
### Removed
78+
- `ace/rr/` legacy package layout (`agent.py`, `runner.py`, `trace_context.py`, `message_trimming.py`, batch helpers). Functionality is now in `ace/core/recursive_agent.py` and `ace/implementations/rr/`.
79+
- `AttachInsightSourcesStep` and its pipeline wiring — provenance is attached by the SkillManager agent.
80+
- `python-toon` runtime dependency.
81+
- TAG handling from the SkillManager.
82+
- Citation scanning from the Reflector.
83+
1084
## [0.10.0] - 2026-04-13
1185

1286
### Added
@@ -32,7 +106,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
32106
## [0.9.2] - 2026-03-31
33107

34108
### Added
35-
- **Insight source provenance**: `InsightSource` typed model captures the origin of each skillbook update (trace ID, sample question, epoch/step, reflection summary, integration metadata); `AttachInsightSourcesStep` automatically enriches `UpdateBatch` operations with provenance and is wired into the default learning tail
109+
- **Insight source provenance**: `InsightSource` typed model captures the origin of each skillbook update (trace ID, sample question, epoch/step, reflection summary, integration metadata); provenance is now populated by the SkillManager agent directly
36110
- **Claude SDK step**: `ClaudeSDKStep` integration for running Claude Code sub-agents from within ACE pipelines
37111
- **RR sub-agent code execution** — Recursive Reflector can now delegate to code-execution sub-agents at runtime
38112
- **RR raw trace batch helpers**: `build_raw_trace_batches` and related runtime utilities for feeding raw traces directly into the RR pipeline

CLAUDE.md

Lines changed: 5 additions & 1 deletion
@@ -57,7 +57,7 @@ Key design docs:
5757

5858
### Commands
5959
- `uv sync` — install all dependencies
60-
- `uv run pytest` — run tests (coverage enforced `--cov-fail-under=25`)
60+
- `uv run pytest` — run tests (excludes `integration` and `requires_api` markers by default)
6161
- `uv run pytest -m unit` / `-m integration` / `-m slow` — run by marker
6262
- `uv run black ace/ tests/ examples/` — format code
6363
- `uv run mypy ace/` — type check
@@ -93,3 +93,7 @@ Key design docs:
9393
| `ACELangChain` | LangChain | Wrap chains/agents with learning |
9494
| `ACEBrowserUse` | browser-use | Browser automation with learning |
9595
| `ACEClaudeCode` | Claude Code CLI | Coding tasks with learning |
96+
97+
NEVER USE FALLBACKS OR IMPLEMENT THINGS I NEVER ASKED FOR.
98+
99+
Keep your answers concise and to the point. If you don't know something, say you don't know instead of making assumptions or fabricating information. Always ask clarifying questions if the user's request is ambiguous or lacks necessary details.

README.md

Lines changed: 2 additions & 2 deletions
@@ -168,15 +168,15 @@ ACE + Claude Code translated this library from Python to TypeScript with zero su
168168
ACE is built on a composable pipeline engine. Each step declares what it requires and what it produces:
169169

170170
```
171-
AgentStep -> EvaluateStep -> ReflectStep -> UpdateStep -> ApplyStep -> DeduplicateStep
171+
AgentStep -> EvaluateStep -> ReflectStep -> UpdateStep -> DeduplicateStep
172172
```
173173

174174
Use `learning_tail()` for the standard learning sequence, or compose custom pipelines:
175175

176176
```python
177177
from ace import Pipeline, AgentStep, EvaluateStep, learning_tail
178178

179-
steps = [AgentStep(agent), EvaluateStep(env)] + learning_tail(reflector, skill_manager, skillbook)
179+
steps = [AgentStep(agent, skillbook), EvaluateStep(env)] + learning_tail(reflector, skill_manager, skillbook)
180180
pipeline = Pipeline(steps)
181181
```
182182

ace-eval
