PR title target: feat: add evaluation harness foundation
Ship the smallest useful benchmark-first slice:
- config and schema flags
- typed benchmark pack format
- typed run-summary format
- status CLI for operators
- README and docs alignment
This PR does not change live recall or extraction behavior.
This matches the roadmap priority order:
1. Evaluation harness and shadow-mode measurement.
2. Objective-state + causal trajectory memory.
3. Trust-zoned memory promotion and poisoning defense.
4. Harmonic retrieval over abstractions plus anchors.
5. Creation-memory, commitments, and recoverability.
Without PR1, every later memory change still lands on intuition.
- src/evals.ts
- src/cli.ts
- src/types.ts
- src/config.ts
- openclaw.plugin.json
- tests/config-eval-harness.test.ts
- tests/cli-benchmark-status.test.ts
- README.md
- docs/config-reference.md
- docs/evaluation-harness.md
- docs/plans/2026-03-06-engram-agentic-memory-roadmap.md
- docs/plans/2026-03-06-engram-pr1-eval-harness-foundation.md
- evalHarnessEnabled
- evalShadowModeEnabled
- evalStoreDir
All default off or inert.
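
A minimal sketch of how these defaults might be resolved, assuming the flags arrive as optional fields on a raw config object. The function name `resolveEvalConfig` and the `evals` subdirectory are illustrative assumptions, not names taken from this plan; only the three flag names come from the list above.

```typescript
import * as path from "node:path";

// Hypothetical shape: only the three field names come from the plan.
interface EvalConfig {
  evalHarnessEnabled: boolean;
  evalShadowModeEnabled: boolean;
  evalStoreDir: string;
}

// Resolve eval settings from a raw config object.
// Flags stay off unless explicitly set to true. The store dir falls back
// to a path under memoryDir; the "evals" subdirectory name is an assumption.
function resolveEvalConfig(
  raw: Partial<EvalConfig>,
  memoryDir: string,
): EvalConfig {
  const explicitDir =
    typeof raw.evalStoreDir === "string" && raw.evalStoreDir.length > 0
      ? raw.evalStoreDir
      : undefined;
  return {
    evalHarnessEnabled: raw.evalHarnessEnabled === true,
    evalShadowModeEnabled: raw.evalShadowModeEnabled === true,
    evalStoreDir: explicitDir ?? path.join(memoryDir, "evals"),
  };
}
```

Comparing against `true` rather than relying on truthiness keeps a malformed value such as the string `"false"` from accidentally enabling the harness.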
Required benchmark pack fields:
- schemaVersion
- benchmarkId
- title
- cases[].id
- cases[].prompt
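
The required fields above could be expressed as a type plus a small validator. This is a sketch under assumptions: the plan does not state field types, so `schemaVersion` is treated as a number and the rest as non-empty strings, and `validateBenchmarkPack` is a hypothetical name.

```typescript
// Field names come from the plan; the types are assumptions.
interface BenchmarkCase {
  id: string;
  prompt: string;
}

interface BenchmarkPack {
  schemaVersion: number;
  benchmarkId: string;
  title: string;
  cases: BenchmarkCase[];
}

function isNonEmptyString(v: unknown): v is string {
  return typeof v === "string" && v.length > 0;
}

// Returns a list of problems; an empty list means the manifest is valid.
function validateBenchmarkPack(input: unknown): string[] {
  if (typeof input !== "object" || input === null) {
    return ["manifest must be an object"];
  }
  const m = input as Record<string, unknown>;
  const errors: string[] = [];
  if (typeof m.schemaVersion !== "number") errors.push("schemaVersion must be a number");
  if (!isNonEmptyString(m.benchmarkId)) errors.push("benchmarkId is required");
  if (!isNonEmptyString(m.title)) errors.push("title is required");
  const cases = m.cases;
  if (!Array.isArray(cases)) {
    errors.push("cases must be an array");
  } else {
    cases.forEach((c: unknown, i: number) => {
      const entry = (typeof c === "object" && c !== null ? c : {}) as Record<string, unknown>;
      if (!isNonEmptyString(entry.id)) errors.push(`cases[${i}].id is required`);
      if (!isNonEmptyString(entry.prompt)) errors.push(`cases[${i}].prompt is required`);
    });
  }
  return errors;
}
```

Returning every problem at once, rather than throwing on the first, would let a status command surface an invalid manifest with a complete explanation.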
Required run-summary fields:
- schemaVersion
- runId
- benchmarkId
- status
- startedAt
- totalCases
- passedCases
- failedCases
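
A type guard over these run-summary fields might look like the following. The types are assumptions (the plan only names the fields): `status` and `startedAt` are kept as plain strings because the allowed status values and the timestamp format are not specified here, and `isRunSummary` is a hypothetical name.

```typescript
// Field names come from the plan; the types are assumptions.
interface RunSummary {
  schemaVersion: number;
  runId: string;
  benchmarkId: string;
  status: string;
  startedAt: string;
  totalCases: number;
  passedCases: number;
  failedCases: number;
}

const STRING_FIELDS = ["runId", "benchmarkId", "status", "startedAt"] as const;
const COUNT_FIELDS = ["totalCases", "passedCases", "failedCases"] as const;

// Narrow an unknown parsed JSON value to RunSummary.
// Counts must be non-negative integers; strings must be non-empty.
function isRunSummary(input: unknown): input is RunSummary {
  if (typeof input !== "object" || input === null) return false;
  const r = input as Record<string, unknown>;
  if (typeof r.schemaVersion !== "number") return false;
  const stringsOk = STRING_FIELDS.every((k) => {
    const v = r[k];
    return typeof v === "string" && v.length > 0;
  });
  const countsOk = COUNT_FIELDS.every((k) => {
    const v = r[k];
    return typeof v === "number" && Number.isInteger(v) && v >= 0;
  });
  return stringsOk && countsOk;
}
```

The guard deliberately does not require `passedCases + failedCases` to equal `totalCases`, since the plan does not say whether other outcomes (for example, skipped cases) exist.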
openclaw engram benchmark-status

The command must:
- work even when evalHarnessEnabled is false
- report benchmark pack counts
- report invalid manifests
- summarize latest run
- fail open on missing directories
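
The requirements above can be sketched as a pure status-collection function. Everything beyond the listed behaviors is an assumption for illustration: the `benchmarks` and `runs` subdirectory layout, the one-JSON-file-per-item storage, the `collectStatus` name, and the choice to pick the latest run by comparing `startedAt` as a UTC ISO 8601 string. The function never reads `evalHarnessEnabled`, so it works whether or not the harness is on.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

interface BenchmarkStatus {
  validPacks: number;
  invalidManifests: string[];
  latestRun: { runId: string; startedAt: string } | null;
}

// Parse every *.json file in dir. A missing or unreadable directory yields
// an empty list instead of throwing (fail open). Unparseable files are kept
// with data undefined so the caller can report them as invalid.
function readJsonFiles(dir: string): { file: string; data: unknown }[] {
  let names: string[];
  try {
    names = fs.readdirSync(dir);
  } catch {
    return [];
  }
  return names
    .filter((n) => n.endsWith(".json"))
    .map((n) => {
      const file = path.join(dir, n);
      try {
        return { file, data: JSON.parse(fs.readFileSync(file, "utf8")) as unknown };
      } catch {
        return { file, data: undefined };
      }
    });
}

// Directory layout ("benchmarks", "runs") is assumed, not taken from the plan.
function collectStatus(
  storeDir: string,
  isValidPack: (data: unknown) => boolean,
): BenchmarkStatus {
  const packs = readJsonFiles(path.join(storeDir, "benchmarks"));
  const invalidManifests = packs.filter((p) => !isValidPack(p.data)).map((p) => p.file);

  let latestRun: BenchmarkStatus["latestRun"] = null;
  for (const { data } of readJsonFiles(path.join(storeDir, "runs"))) {
    if (typeof data !== "object" || data === null) continue;
    const r = data as Record<string, unknown>;
    const runId = r.runId;
    const startedAt = r.startedAt;
    if (typeof runId !== "string" || typeof startedAt !== "string") continue;
    // Assumes UTC ISO 8601 timestamps, which sort correctly as strings.
    if (latestRun === null || startedAt > latestRun.startedAt) {
      latestRun = { runId, startedAt };
    }
  }

  return {
    validPacks: packs.length - invalidManifests.length,
    invalidManifests,
    latestRun,
  };
}
```

Keeping the collection step separate from printing would let the CLI tests assert on the returned object for both the empty and the populated state.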
- Config defaults:
  - flags off
  - store dir derived from memoryDir
- Config overrides:
  - explicit flags respected
  - explicit store dir respected
- CLI empty state:
  - zero counts
  - no crash on missing dirs
- CLI populated state:
  - valid benchmark counted
  - invalid manifest surfaced
  - latest run summarized
Run before pushing:
- npx tsx --test tests/config-eval-harness.test.ts tests/cli-benchmark-status.test.ts
- npm run check-types
- npm test
- npm run build
- PR2 benchmark pack validator/import tools
- PR3 shadow recording for recall behavior
- PR4 CI benchmark delta gating
- PR5 objective-state memory store