PR1 Eval Harness Foundation Plan

PR title target: feat: add evaluation harness foundation

Goal

Ship the smallest useful benchmark-first slice:

config and schema flags
typed benchmark pack format
typed run-summary format
status CLI for operators
README and docs alignment

This PR does not change live recall or extraction behavior.

Why PR1 Starts Here

This matches the roadmap priority order:

1. Evaluation harness and shadow-mode measurement.
2. Objective-state + causal trajectory memory.
3. Trust-zoned memory promotion and poisoning defense.
4. Harmonic retrieval over abstractions plus anchors.
5. Creation-memory, commitments, and recoverability.

Without PR1, every later memory change would still be judged by intuition rather than by measurement.

Scope

Code

src/evals.ts
src/cli.ts
src/types.ts
src/config.ts
openclaw.plugin.json

Tests

tests/config-eval-harness.test.ts
tests/cli-benchmark-status.test.ts

Docs

README.md
docs/config-reference.md
docs/evaluation-harness.md
docs/plans/2026-03-06-engram-agentic-memory-roadmap.md
docs/plans/2026-03-06-engram-pr1-eval-harness-foundation.md

Feature Flags

evalHarnessEnabled
evalShadowModeEnabled
evalStoreDir

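All three default to off or inert: the two boolean flags default to false, and evalStoreDir is derived from memoryDir unless set explicitly.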
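
A minimal sketch of how these defaults could be resolved, included only to illustrate the "off or inert" rule. The three flag names come from this plan; the `EvalConfig` interface, the `resolveEvalConfig` helper, and the `evals` subdirectory name are hypothetical and may not match what src/config.ts actually does.

```typescript
import * as path from "node:path";

// Hypothetical shape: only the three field names come from the plan.
interface EvalConfig {
  evalHarnessEnabled: boolean;
  evalShadowModeEnabled: boolean;
  evalStoreDir: string;
}

// Assumption: the store directory defaults to an "evals" folder under memoryDir.
function resolveEvalConfig(
  memoryDir: string,
  overrides: Partial<EvalConfig> = {},
): EvalConfig {
  return {
    // Both flags stay off unless the operator opts in explicitly.
    evalHarnessEnabled: overrides.evalHarnessEnabled ?? false,
    evalShadowModeEnabled: overrides.evalShadowModeEnabled ?? false,
    // An explicit store dir wins; otherwise derive it from memoryDir.
    evalStoreDir: overrides.evalStoreDir ?? path.join(memoryDir, "evals"),
  };
}
```

Using `??` rather than `||` keeps an explicit `false` override distinct from an unset value, which matters once the defaults ever change.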

Contract

Benchmark manifest

Required:

schemaVersion
benchmarkId
title
cases[].id
cases[].prompt

Run summary

Required:

schemaVersion
runId
benchmarkId
status
startedAt
totalCases
passedCases
failedCases
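
The two contracts above could be typed and checked roughly as follows. The field names are taken from the required lists; the concrete field types, the `validateManifest` and `validateRunSummary` helpers, and the error message wording are assumptions made for illustration, not the contents of src/types.ts or src/evals.ts.

```typescript
// Field names come from the plan; the types chosen here are assumptions.
interface BenchmarkCase { id: string; prompt: string; }
interface BenchmarkManifest {
  schemaVersion: number;
  benchmarkId: string;
  title: string;
  cases: BenchmarkCase[];
}
interface RunSummary {
  schemaVersion: number;
  runId: string;
  benchmarkId: string;
  status: string;
  startedAt: string;
  totalCases: number;
  passedCases: number;
  failedCases: number;
}

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Returns a list of problems; an empty list means the required fields are present.
function validateManifest(raw: unknown): string[] {
  if (!isRecord(raw)) return ["manifest must be an object"];
  const errors: string[] = [];
  // Presence only: the plan does not say whether schemaVersion is a number or string.
  if (raw.schemaVersion === undefined) errors.push("schemaVersion missing");
  for (const key of ["benchmarkId", "title"]) {
    if (typeof raw[key] !== "string" || raw[key] === "") errors.push(`${key} missing`);
  }
  if (!Array.isArray(raw.cases)) {
    errors.push("cases missing or not an array");
    return errors;
  }
  raw.cases.forEach((c, i) => {
    if (!isRecord(c)) {
      errors.push(`cases[${i}] must be an object`);
      return;
    }
    if (typeof c.id !== "string") errors.push(`cases[${i}].id missing`);
    if (typeof c.prompt !== "string") errors.push(`cases[${i}].prompt missing`);
  });
  return errors;
}

const RUN_SUMMARY_KEYS = [
  "schemaVersion", "runId", "benchmarkId", "status",
  "startedAt", "totalCases", "passedCases", "failedCases",
] as const;

// Checks only that every required run-summary key is present.
function validateRunSummary(raw: unknown): string[] {
  if (!isRecord(raw)) return ["run summary must be an object"];
  return RUN_SUMMARY_KEYS.filter((k) => raw[k] === undefined).map((k) => `${k} missing`);
}
```

Returning a list of problems instead of throwing lets a status command surface every invalid manifest in one pass rather than stopping at the first.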

CLI Surface

openclaw engram benchmark-status

The command must:

work even when evalHarnessEnabled is false
report benchmark pack counts
report invalid manifests
summarize latest run
fail open on missing directories
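
The requirements above can be sketched as a single status-collection function. This is an illustration under stated assumptions, not the actual command: the `benchmarks/` and `runs/` subdirectories, the one-JSON-file-per-item layout, the `collectStatus` and `listJson` helpers, and the `BenchmarkStatus` shape are all hypothetical. It also assumes `startedAt` is an ISO 8601 UTC timestamp, so that string comparison orders runs correctly.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical report shape for the status command.
interface BenchmarkStatus {
  validBenchmarks: number;
  invalidManifests: { file: string; reason: string }[];
  latestRun: { runId: string; status: string; startedAt: string } | null;
}

// Fail open: a missing or unreadable directory is treated as empty.
function listJson(dir: string): string[] {
  try {
    return fs.readdirSync(dir)
      .filter((f) => f.endsWith(".json"))
      .map((f) => path.join(dir, f));
  } catch {
    return [];
  }
}

// Assumed layout: <storeDir>/benchmarks/*.json and <storeDir>/runs/*.json.
function collectStatus(storeDir: string): BenchmarkStatus {
  const status: BenchmarkStatus = { validBenchmarks: 0, invalidManifests: [], latestRun: null };

  for (const file of listJson(path.join(storeDir, "benchmarks"))) {
    try {
      const m = JSON.parse(fs.readFileSync(file, "utf8"));
      if (typeof m?.benchmarkId === "string" && Array.isArray(m?.cases)) {
        status.validBenchmarks += 1;
      } else {
        status.invalidManifests.push({ file, reason: "missing required fields" });
      }
    } catch {
      // Surface the bad file instead of crashing the whole report.
      status.invalidManifests.push({ file, reason: "unreadable or invalid JSON" });
    }
  }

  for (const file of listJson(path.join(storeDir, "runs"))) {
    try {
      const r = JSON.parse(fs.readFileSync(file, "utf8"));
      if (typeof r?.runId !== "string" || typeof r?.startedAt !== "string") continue;
      // Assumption: ISO 8601 UTC timestamps, so lexical order is chronological.
      if (!status.latestRun || r.startedAt > status.latestRun.startedAt) {
        status.latestRun = { runId: r.runId, status: String(r.status), startedAt: r.startedAt };
      }
    } catch {
      // Skip run files that cannot be parsed.
    }
  }
  return status;
}
```

Note that the function never reads evalHarnessEnabled, which is one way to satisfy the requirement that the command works while the harness is disabled.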

Tests Required

Config defaults:
- flags off
- store dir derived from memoryDir
Config overrides:
- explicit flags respected
- explicit store dir respected
CLI empty state:
- zero counts
- no crash on missing dirs
CLI populated state:
- valid benchmark counted
- invalid manifest surfaced
- latest run summarized

Verification Gate

Run before pushing:

npx tsx --test tests/config-eval-harness.test.ts tests/cli-benchmark-status.test.ts
npm run check-types
npm test
npm run build

Follow-On PRs Unblocked by PR1

PR2 benchmark pack validator/import tools
PR3 shadow recording for recall behavior
PR4 CI benchmark delta gating
PR5 objective-state memory store
