Skip to content

bhyi4/measure-mirror

Repository files navigation

🪞 Measurement Mirror

Measurement Mirror

CI License: Apache 2.0 deps: zero Python 3.10+

Catch AI evaluation illusions — false positives and false negatives — automatically.
Zero training · Deterministic · Zero-dep core (Python 3.10+ stdlib; judge module optional).

Built while honestly killing our own project.
The makers ran it on themselves first. → 🦋 Origin Story

📖 Full Probe Guide → — detailed explanations, worked examples, and workflows for all 23 probes

🪞🔎🪪 New — Mirror Stack (stack/): measure-mirror is the claims layer of a three-mirror integrity stack for autonomous research agents (claims · actions · provenance, joined by five conventions + one verify-all command). Includes a real case study: an agent that retracted its own experiment before spending a single token — preregistration → power-check design fix → adversarial amendments → prior-art retraction, with the actual chain-sealed ledger bundled so you can verify it yourself. measure-mirror itself is unchanged (feature-frozen core; the stack adds conventions, not probes).

Tool Audits Question
🪞 measure-mirror (you are here) AI evaluation claims Is the claim honest?
🪪 action-mirror Agent behaviour Who did what, provably?
🔎 provenance-mirror Content authenticity Is the origin proven?
👁 mirror-witness Cross-operator witness board Who else witnessed it?

💬 Discussions — questions · ideas · independent reproductions welcome.


The Problem

AI/ML papers routinely overclaim. The most common failure modes:

Illusion How it happens
Small-sample mirage n=9, acc=55.6% reported as breakthrough
Post-hoc metric swap Register accuracy, report the F1 that happened to look better
Crippled baseline Compare against a deliberately weak competitor
Data leakage Train/test overlap inflating every number
Scope overreach "Works on task A" claimed as "general reasoning"

Measurement Mirror catches these structurally, not by opinion.

Precisely what it does (and does not). It runs deterministic checks on the inputs you provide — it is input-driven, not an autonomous flaw-hunter. Arithmetic/statistical probes (small-sample CI, GRIM, power, multiple-comparisons) are fully deterministic. But design-flaw probes (crippled baseline, gaming, scope) only fire on the baseline / reward terms / scopes you declare — the tool cannot discover a flaw you hide, and many real catches are still your judgment, guided by the discipline. It makes honesty provable; it does not force it.


Install

pip install -e .                        # core (zero deps)
pip install -e ".[mcp]"                 # + MCP server for AI agents
pip install -e ".[judge]"               # + LLM-as-a-Judge runner (openai / anthropic)
pip install -e ".[test]"                # + pytest plugin
pip install -e ".[mcp,judge,test]"      # everything

CLI entry point: mm
MCP entry point: mm-mcp


Quick Start

CLI

# Step 1 — BEFORE running your experiment: seal the criteria
mm register my_model --metric acc --min-n 200 --baseline 0.5 --pass 0.60

# Step 2 — AFTER evaluation: one-command audit
mm my_model                            # auto-loads my_model.json
mm audit my_model --acc 0.72 --n 500
mm audit --file results.json

results.json format: {"claim_id": "my_model", "metric": "acc", "acc": 0.72, "n": 500}

Python API

from measure_mirror import mm

LEDGER = "mm_ledger.jsonl"

# ① Before experiment — seal criteria (tamper-evident hash)
mm.preregister(LEDGER, "my_model",
               metric="acc", min_n=200, baseline=0.5, pass_threshold=0.60)

# ② After evaluation — full 7-probe audit at once
findings = mm.full_audit(
    LEDGER, "my_model",
    reported_metric="acc", reported_acc=0.72, n=500,
    baseline=0.5,
    competing_name="strong_baseline", competing_acc=0.68,   # ② fairness
    reward_terms=["cross_entropy"],                          # ③ gaming check
    train_items=train_set, test_items=test_set,              # ④ leakage
    seed_results=[0.70, 0.72, 0.74],                         # ⑤ multi-seed
    claimed_scope=["reasoning"], tested_scope=["task_a"],    # ⑥ scope
)
mm.report("my_model", findings)

# Individual probes
mm.report("fairness", [mm.baseline_fairness("vs GRU", 0.72, 0.68)])
mm.report("leakage",  [mm.leakage_check(train_items, test_items)])
mm.report("seeds",    [mm.multiseed_check([0.70, 0.72, 0.74], baseline=0.5)])

Regression / Continuous Metrics

For MSE, Pearson r, RMSE, and other non-binary metrics:

findings = mm.continuous_audit(
    LEDGER, "my_regressor",
    reported_metric="mse", reported_value=0.10,
    baseline_value=0.15, n=500,
    higher_better=False,   # lower MSE is better
    std=0.02,              # optional: enables effect-size check
)
mm.report("regression", findings)

pytest Integration (CI Gate)

# conftest.py
pytest_plugins = ["measure_mirror.pytest_plugin"]

# test_eval.py
from measure_mirror import mm
from measure_mirror.pytest_plugin import assert_clean

def test_my_model_is_real():
    findings = mm.audit("ledger.jsonl", "my_model",
                        reported_metric="acc", reported_acc=0.78, n=1000)
    assert_clean(findings)   # FAIL findings → test fails → CI goes red

Three Verification Tiers

You don't need to memorize 23 probes — there are exactly three ways to use the mirror:

# FULL — one shot, everything applicable runs automatically
mm verify --file results.json

# GROUP — restrict to verification groups
mm verify --file results.json --groups stats judge
mm verify --list-groups

# INDIVIDUAL — any probe directly (Python)
mm.grim_check(reported_acc=0.33, n=10)
from measure_mirror import mm

# FULL: probes activate based on which keys exist in data
findings = mm.verify("ledger.jsonl", {
    "claim_id": "my_model", "acc": 0.72, "n": 500,        # → ledger + stats
    "seed_results": [0.70, 0.72, 0.74],                    # → ⑤
    "scores": [3, 7, 5, 8, 4],                             # → judge ⑰
})

# GROUP: same data, judge group only
findings = mm.verify("ledger.jsonl", data, groups=["judge"])

verify() runs only the probes whose inputs are present — an empty key fires nothing, extra keys fire extra probes. The data keys are identical to the JSON file format.

Probes by Verification Group

ledger — pre-registration & ledger integrity

Probe # Catches
preregister / audit Post-hoc metric swap · sample underrun · ledger tampering
verify_chain Deleted/inserted entries · ledger tampering
cascade_check Claim or transitive dependency retracted → FAIL/WARN stale

stats — statistical validity

Probe # Catches
audit — Wilson CI ④a Results indistinguishable from chance (small sample)
audit — direction ④a Performance worse than baseline (anti-signal)
multiseed_check Unstable signal / lucky seed
too_good_check Suspiciously large Δ over baseline
power_check n too small to detect minimum effect (false-negative guard)
multiple_comparisons_check k>1 experiments in ledger — Bonferroni correction alarm
grim_check Reported acc × n is arithmetically impossible (fabricated value)

design — experiment-design fairness

Probe # Catches
baseline_fairness Crippled / tied / reversed baseline
gaming_check Metric directly in reward/loss (self-fulfilling)
leakage_check ④a Train∩test data contamination
scope_check Claimed scope wider than tested scope
falsifiability_check No kill-condition → unfalsifiable; kill_threshold triggered → claim is dead

negative — Resolved-Negative closure gate

Probe # Catches
negative_audit Negative conclusion has too few independent angles, unregistered angles, or scope overshoot

judge — LLM-judge reliability

Probe # Catches
judge_consistency_check LLM judge flip-rate too high — unreliable judge detector
judge_bias_check Judge systematically favors position A or B — position bias detector
inter_rater_agreement Cohen's κ between two different judges below threshold (standalone-only)
judge_score_sanity Judge assigns identical/near-identical scores — degenerate distribution
judge_swap_check Verdict stays with the slot after AB→BA swap — judge reads position, not content

ranking — leaderboard integrity

Probe # Catches
judge_transitivity_check A>B>C>A preference cycles — judge has no consistent quality scale
ranking_stability_check "A beats B" flips under bootstrap resampling — ranking mirage
Utility Purpose
anchor Print tamper-evident ledger snapshot (hash + head seal) to stdout for external archival
calibrate Self-test: 5 synthetic known-good/bad cases; confirms tool health
witness Execute a command, capture output, seal tamper-evident run record
retract Append a chain-linked retraction entry; cascades to dependents via cascade_check
certificate Issue a sealed verification certificate: CERTIFIED / WITH-WARNINGS / UNVERIFIED / REJECTED
badge Render a certificate as an embeddable markdown / SVG badge (verdict-colored)

Chain hash ledger (① extended)

Every preregister() call now embeds the previous entry's seal into the new one before computing the SHA-256. This makes the ledger tamper-evident end-to-end:

# verify the entire ledger chain at any time
findings = mm.verify_chain("mm_ledger.jsonl")
mm.report("ledger integrity", findings)

Catches: entry deletion, entry insertion, content modification.
Documented limitation: complete file deletion + fresh re-registration is not caught — commit the ledger file to git for that guarantee.

Power check ⑧

# warn when n is too small to detect a real effect
f = mm.power_check(n=50, baseline=0.5, min_detectable_effect=0.05)
# ⚠️  n=50 insufficient to detect Δ=+0.05 at 80% power (need n≥388)

# or activate via full_audit
findings = mm.full_audit(LEDGER, "my_model", ..., min_detectable_effect=0.05)

Multiple comparisons ⑨

# warn when k>1 experiments share a ledger (Bonferroni)
f = mm.multiple_comparisons_check("mm_ledger.jsonl")
# ⚠️  k=3 experiments → Bonferroni α=0.0167 (not 0.05)

# or activate via full_audit
findings = mm.full_audit(LEDGER, "my_model", ..., check_multiplicity=True)

Falsifiability ⑪ — the Popper gate

# Register BEFORE the experiment — seal what would kill the claim
mm.preregister("mm_ledger.jsonl", "my_model",
               metric="acc", min_n=200, baseline=0.5, pass_threshold=0.60,
               # human-readable (optional)
               kill_condition="accuracy on held-out test drops below 0.55",
               # structured: auto-evaluated at audit time
               kill_threshold={"metric": "acc", "threshold": 0.55, "direction": "below"})

# After the experiment — audit checks ⑪ automatically
findings = mm.audit("mm_ledger.jsonl", "my_model",
                    reported_metric="acc", reported_acc=0.50, n=500)
# 🔴 [⑪ falsifiability] Kill condition triggered: acc=0.5 < 0.55.
#    Claim 'my_model' is falsified by its own pre-registered criterion.

# Or check standalone
f = mm.falsifiability_check("mm_ledger.jsonl", "my_model", reported_acc=0.50)
# CLI
mm register my_model --metric acc --min-n 200 --baseline 0.5 --pass 0.60 \
  --kill "accuracy < 0.55 on held-out" \
  --kill-threshold 0.55 --kill-direction below

Claims registered without any kill_condition or kill_threshold receive a WARN: Unfalsifiable at audit time — operationalizing Popper's criterion as a code contract. OSF pre-registration accepts hypotheses; this is the first tool that also seals the kill condition.

Retraction cascade ⑫

# Register a claim that depends on prior work
mm.preregister("mm_ledger.jsonl", "model_v2",
               metric="acc", min_n=200, baseline=0.5, pass_threshold=0.60,
               depends_on=["dataset_v1", "baseline_eval"])

# Later: dataset_v1 turns out to have contamination — retract it
mm.retract("mm_ledger.jsonl", "dataset_v1", "train/test overlap discovered")

# cascade_check flags model_v2 as STALE automatically
f = mm.cascade_check("mm_ledger.jsonl", "model_v2")
# ⚠️  [⑫ retraction-cascade] Claim 'model_v2' is STALE: depends (transitively) on
#     retracted claim(s): 'dataset_v1'

# audit() runs cascade_check automatically — only WARN/FAIL are appended
findings = mm.audit("mm_ledger.jsonl", "model_v2",
                    reported_metric="acc", reported_acc=0.72, n=500)
# CLI
mm register model_v2 --metric acc --min-n 200 --baseline 0.5 --pass 0.60 \
  --depends-on dataset_v1 baseline_eval

mm retract dataset_v1 --reason "train/test overlap discovered"

The retraction entry is chain-linked — deleting it from the ledger breaks the chain and is detected by verify_chain(). Retraction propagates regardless of publication order: a claim built on a retracted foundation is automatically stale.

Negative-claim audit ⑬ — the premature-closure gate

A single negative result may reflect a frame flaw, not a universal wall. negative_audit gates "Resolved-Negative" conclusions: you must present ≥ min_angles independent pre-registered experiments before closing.

# Register each independent angle BEFORE running the experiment
for angle_id in ["oee_test_wave", "oee_test_bilinear", "oee_test_alife"]:
    mm.preregister("mm_ledger.jsonl", angle_id,
                   metric="oee_score", min_n=100, baseline=0.5, pass_threshold=0.0)

# After all angles converge negative — gate the closure
f = mm.negative_audit("mm_ledger.jsonl",
                      angles=["oee_test_wave", "oee_test_bilinear", "oee_test_alife"],
                      min_angles=3)
# ✅ [⑬ negative-audit] 3/3 independent pre-registered angle(s) verified —
#    negative conclusion is supported.

# Optional scope check: negative conclusion must not overgeneralize
f = mm.negative_audit("mm_ledger.jsonl",
                      angles=["oee_test_wave", "oee_test_bilinear", "oee_test_alife"],
                      conclusion_scope=["all_substrates"],   # claimed scope
                      tested_scope=["in_silico"])            # actually tested
# 🔴 [⑬ negative-audit] conclusion scope includes untested domain(s): ['all_substrates'].
# CLI
mm negative --angles oee_test_wave oee_test_bilinear oee_test_alife --min-angles 3

# Or activate inside full_audit
findings = mm.full_audit(LEDGER, "main_claim", ..., angles=["a1", "a2", "a3"])
Condition Level
len(angles) < min_angles FAIL — premature closure
Any angle not pre-registered FAIL — unverifiable evidence
conclusion_scope ⊄ tested_scope FAIL — scope overshoot
A registered angle is retracted WARN — weakened case
All checks pass OK

LLM-as-a-Judge probes ⑭⑮⑯⑰

LLM judges introduce their own failure modes that numeric metrics don't catch. The four judge probes audit the judge itself, not just the model being evaluated.

pip install "measure-mirror[judge]"   # adds openai and anthropic
from measure_mirror.judge import anthropic_judge, openai_judge, judge_run

# Build a judge callable (pairwise A-vs-B mode)
judge_fn = anthropic_judge(model="claude-opus-4-8")
# or: judge_fn = openai_judge(model="gpt-4o")

# Each item: {"prompt": str, "a": str, "b": str}  (pairwise)
#            {"prompt": str, "response": str}       (rating, pairwise=False)
pairs = [
    {"prompt": "Summarize quantum entanglement",
     "a": "candidate_A output", "b": "candidate_B output"},
    ...
]

# judge_run calls judge_fn runs×len(items) times,
# fires ⑭⑮⑯⑰(⑱) automatically, and seals a chain-linked entry.
result = judge_run("mm_ledger.jsonl", "my_llm_eval",
                   judge_fn=judge_fn,
                   items=pairs,
                   runs=2,               # run each item twice → ⑭ consistency
                   pairwise=True,        # A-vs-B → ⑮ bias check
                   swap_positions=True)  # extra AB→BA pass → ⑱ swap test

for f in result["findings"]:
    print(f"  {f.level}  [{f.probe}]  {f.msg}")

The checks triggered by judge_run:

Probe Catches
judge_consistency_check Judge gives different verdict on re-run (stochastic / unreliable)
judge_bias_check Judge systematically favors position A or B regardless of content
judge_score_sanity Judge assigns identical / near-identical scores to everything
judge_swap_check Verdict stays with the slot after AB→BA swap (content-blind judge)

inter_rater_agreement is standalone-only: it compares two genuinely different judges; re-running the same judge is already covered by ⑭.

Unparseable judge responses score -1 and are excluded from all probes; a judge-parse WARN fires when the failure rate exceeds 10% (FAIL when nothing parsed).

Why ⑱ matters — a deterministic judge that never reads the responses passes ⑭ (perfectly consistent), ⑮ (balanced win-rate), ⑯ (κ=1.0), and ⑰ (varied scores). Only swapping A and B exposes it: a content-driven judge must invert its verdict, a content-blind judge keeps choosing the same slot. Run the demo:

python examples/demo_judge.py   # no API key needed — mock judges

Standalone usage (bring-your-own scores):

from measure_mirror import mm

# ⑭ consistency — did the judge flip its verdict?
score_pairs = [(1, 1), (0, 0), (1, 0), (0, 0), (1, 1)]  # (run1, run2) per item
f = mm.judge_consistency_check(score_pairs, flip_threshold=0.20)
# ✅  flip rate 20.0% ≤ 20.0% (1/5 flips). Consistent.

# ⑮ position bias — does A always win?
results = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]  # 0=A won, 1=B won
f = mm.judge_bias_check(results, bias_threshold=0.60)
# 🔴  Position A win rate 90.0% > 60.0%. Strong position bias detected.

# ⑯ inter-rater — Cohen's κ between two raters / runs
matrix = [(1, 1), (0, 0), (1, 1), (0, 1), (1, 0), (0, 0)]
f = mm.inter_rater_agreement(matrix, min_kappa=0.40)
# ⚠️  Cohen's κ=0.333 < 0.40 — fair agreement only.

# ⑰ score sanity — degenerate distribution?
scores = [8, 8, 8, 8, 8, 8, 8, 7, 8, 8]  # 90% are 8
f = mm.judge_score_sanity(scores)
# ⚠️  90% of scores are '8' — near-degenerate distribution.

# ⑱ position-swap — does the verdict follow content or slot?
forward = [0, 1, 0, 1, 0]   # winners in (A, B) order
swapped = [0, 1, 0, 1, 0]   # winners after AB→BA swap — identical = locked!
f = mm.judge_swap_check(forward, swapped)
# 🔴  Position-lock rate 100.0% > 65.0%. Judge is reading position, not content.

# ⑲ transitivity — does the judge have a consistent quality scale?
matches = [("gpt", "claude", 0), ("claude", "llama", 0), ("llama", "gpt", 0)]
f = mm.judge_transitivity_check(matches)
# 🔴  Preference cycle detected: gpt > claude > llama > gpt.
#     Any leaderboard from these verdicts is order-dependent.

# ⑳ ranking stability — does "A beats B" survive resampling?
f = mm.ranking_stability_check(scores_model_a, scores_model_b)
# 🔴  Ranking 'A > B' survives only 64.2% of 1000 bootstrap resamples (n=7).
#     The ranking is noise — indistinguishable from a tie.
# CLI: audit pre-collected judge scores from a JSON file
mm judge --file judge_scores.json
# keys: score_pairs / pairwise_results / ratings_matrix / scores /
#       forward_results + swapped_results / matches / scores_a + scores_b

Certificate 📜 — one sealed verdict per claim

certificate() collapses the full integrity state of a claim into a single verifiable artifact you can embed in a paper, README, or release notes:

# Structural certificate (prereg seal + chain + retraction status)
cert = mm.certificate("mm_ledger.jsonl", "my_model")

# Full certificate — fold audit findings in
findings = mm.audit("mm_ledger.jsonl", "my_model",
                    reported_metric="acc", reported_acc=0.72, n=500)
cert = mm.certificate("mm_ledger.jsonl", "my_model", findings=findings)
# {"verdict": "CERTIFIED", "prereg_seal": "6c802655ab095e8b",
#  "anchor_hash": "sha256...", "findings": {"ok": 4, "warn": 0, "fail": 0},
#  "seal": "9d1e83a4b72f0c5e", ...}
# CLI
mm certify my_model --pretty                  # structural only
mm certify my_model --acc 0.72 --n 500        # + audit findings folded in
mm certify my_model | gh gist create -        # publish externally
Verdict Meaning
CERTIFIED Pre-registered, chain intact, not retracted, no FAIL/WARN findings
CERTIFIED-WITH-WARNINGS Valid but has stale dependencies or WARN findings
UNVERIFIED No pre-registration exists — nothing to certify against
REJECTED Chain broken, seal tampered, retracted, or FAIL findings

The certificate embeds the ledger's anchor_hash, so it attests to one specific ledger state — and the certificate itself is sealed, so any field edit is detectable.

Badge 🏷 — embed the verdict in your README:

mm certify my_model --badge markdown >> README.md   # shields.io badge
mm certify my_model --badge svg > badge.svg          # offline self-contained SVG
cert = mm.certificate("mm_ledger.jsonl", "my_model")
print(mm.badge(cert))                 # markdown (default)
print(mm.badge(cert, fmt="svg"))      # SVG with cert seal in the tooltip
# ![🪞 my_model: CERTIFIED](https://img.shields.io/badge/🪞_my__model-CERTIFIED-brightgreen)

Badge color follows the verdict: CERTIFIED = green · WITH-WARNINGS = yellow · UNVERIFIED = grey · REJECTED = red. The SVG variant embeds the certificate seal and anchor-hash prefix in its tooltip, making the badge traceable back to the exact sealed certificate it renders.

Anchor ⎈

# Print tamper-evident ledger snapshot to stdout — pipe wherever you trust
mm anchor                              # compact JSON (default)
mm anchor --pretty                     # human-readable

# External archival examples (zero new dependencies)
mm anchor >> ~/Dropbox/mm_anchors.jsonl          # local backup
mm anchor | gh gist create -                     # GitHub Gist
mm anchor | aws s3 cp - s3://bucket/anchor.json  # S3
a = mm.anchor("mm_ledger.jsonl")
# {"_type": "anchor", "ts": "...", "entry_count": 3,
#  "head_seal": "a3b9f2c1", "anchor_hash": "sha256hex...", "chain_ok": true}

The anchor_hash (full SHA-256 of the ledger file) detects even complete file replacement — the one attack that chain hashes alone cannot catch. Save it externally before publishing results.

Calibrate + Witness run

# Verify the mirror itself is working correctly
mm calibrate
# ✅ [⚙ calibrate] 5/5 synthetic cases correct — mirror is calibrated.

# Witness-execute a command: calibrate first, then run and seal the record
mm run my_model -- python evaluate.py --model my_model
# ✅ [⚙ calibrate] 5/5 synthetic cases correct — mirror is calibrated.
#
# 🎬 Witnessed: my_model
#    Command:     python evaluate.py --model my_model
#    Started:     2026-06-11T12:00:00Z
#    Ended:       2026-06-11T12:00:03Z
#    Exit code:   0  (ok)
#    Output hash: a3b9f2c1e8d74f6a
#    Prev seal:   6c802655ab095e8b
#    Seal:        9d1e83a4b72f0c5e
# Python API
findings = mm.calibrate()
mm.report("Mirror calibration", findings)

entry = mm.witness("mm_ledger.jsonl", "my_model",
                   ["python", "evaluate.py", "--model", "my_model"])
# entry["output_hash"] changes if the script output ever changes

GRIM ⑩

# catch arithmetically impossible accuracy values
f = mm.grim_check(reported_acc=0.33, n=10)
# 🔴  acc=0.33 is arithmetically impossible for n=10.
#     No integer k satisfies round(k/10, 2) = 0.33.
#     (candidates: k=3 → 0.3, k=4 → 0.4). Fabricated value or mis-reported n.

# GRIM runs automatically inside audit() — FAIL is appended to findings
findings = mm.audit(LEDGER, "my_model", reported_metric="acc", reported_acc=0.33, n=10)

MCP Server — AI Agent Integration

Any MCP-compatible AI (Claude Code, Cursor, Windsurf, …) can call Measurement Mirror directly mid-conversation.

Setup

pip install "measure-mirror[mcp]"

Claude Code — add to .mcp.json in your project root:

{
  "mcpServers": {
    "measure-mirror": {
      "command": "python",
      "args": ["-m", "measure_mirror.mcp_server"],
      "cwd": "/path/to/measure-mirror"
    }
  }
}

Other MCP clients — run mm-mcp as the stdio server command.

All 23 probes + 6 utilities + the mm_verify umbrella are exposed as MCP tools:
mm_verify (full / group-filtered) ·
mm_register · mm_verify_chain · mm_audit · mm_continuous_audit · mm_full_audit ·
mm_baseline_fairness · mm_gaming_check · mm_multiseed_check · mm_scope_check ·
mm_too_good_check · mm_power_check · mm_multiple_comparisons_check · mm_grim_check ·
mm_falsifiability_check · mm_cascade_check · mm_negative_audit ·
mm_judge_consistency_check · mm_judge_bias_check · mm_inter_rater_agreement ·
mm_judge_score_sanity · mm_judge_swap_check · mm_judge_transitivity_check ·
mm_ranking_stability_check ·
mm_anchor · mm_calibrate · mm_witness · mm_retract · mm_certificate · mm_badge


Real Catches — Dog-Fooding

We ran Measurement Mirror on our own AI research before publishing. It caught:

🪞 Audit: ZERO "55.6% Best" claim
   🔴 FAIL
   🔴 [④a small-sample CI] n=9, acc=0.556 → 95%CI [0.267, 0.811] ⊃ baseline(0.5)
       Statistically indistinguishable from chance.
   🔴 [① pre-registration(min_n)] reported n=9 < registered min_n=200. Undersized.
   🔴 [① pre-registration(metric swap)] reported 'best_of_9' ≠ registered 'acc_full_balanced'
       Post-hoc metric swap detected. (seal=6c802655ab095e8b)

🪞 Audit: Field candidate 5 (control sim)
   🔴 FAIL
   🔴 [② fair baseline] Field 0.996 ≈ GRU-ODE 0.998 (Δ+0.002 < 0.01). Tied — no genuine advantage.

Run the demos yourself:

python examples/quickstart.py    # happy path: honest researcher
python examples/demo_zero.py     # ZERO 55.6% mirage (our own project killed)
python examples/demo_field.py    # Field candidate false positives

Project Structure

measure-mirror/
├── measure_mirror/
│   ├── mm.py              # verify() + 20 probes + CLI + DB lookup (zero deps)
│   ├── mcp_server.py      # MCP server — 30 tools (pip install .[mcp])
│   ├── judge.py           # LLM-as-a-Judge runner (pip install .[judge])
│   └── pytest_plugin.py   # assert_clean() for CI gates
├── docs/
│   ├── GUIDE.md           # full per-probe reference (English)
│   └── GUIDE_KO.md        # full per-probe reference (Korean)
├── examples/
│   ├── quickstart.py      # happy path demo
│   ├── demo_zero.py       # ZERO false-positive (dog-food)
│   ├── demo_field.py      # Field false-positive (dog-food)
│   ├── demo_judge.py      # LLM-judge failure modes (no API key needed)
│   └── mcp_example.py     # MCP tool usage reference
├── db/                    # local memory, split by who produced the record
│   ├── README.md              the measured/ vs curated/ distinction
│   ├── measured/             ← measure-mirror's own output (quantitative)
│   │   ├── baselines.json         task-level fair baselines
│   │   └── reproductions.jsonl    reproductions (verdict auto-judged)
│   └── curated/              ← human-curated (qualitative)
│       ├── self_catches.jsonl          false positives on ourselves
│       ├── false_negative_guards.jsonl false negatives re-checked
│       ├── gaming_patterns.json        gaming signatures
│       ├── contamination.jsonl         data leakage found
│       └── research_closures.jsonl     qualitative negative conclusions
└── tests/
    ├── test_mm.py         # 145 tests for core probes, CI-enforced
    ├── test_judge.py      # 17 tests for judge.py module
    └── test_sync.py       # sync gate: probe ↔ MCP ↔ tests ↔ README ↔ exports ↔ version

Local Memory (db/)

db/ is your local memory of past audits — not a shared/crowd database. We tried the "CVE / shared signature" framing and for us it didn't earn its keep (this is a scoped observation, not a universal law): contributing would mean publishing your own research that was wrong (self_catches) or a peer's failed reproduction (reproductions), which runs into the trust ⊥ reputation dilemma. A team with the right incentives might sustain a shared DB; we didn't, and the value below needs no sharing anyway.

The value that does hold needs no sharing at all: warn future-you about a pattern past-you already got burned by. It works regardless of how private the data is — it never leaves your machine.

db/ is split by who produced the record, so the two kinds are never confused (see db/README.md for the full structure):

db/measured/ — what measure-mirror produces (quantitative)

Verdicts computed by the tool itself; re-running on the same numbers reproduces the same verdict exactly. These are wired into the audit loop and grow only via record_reproduction().

File How it's used
measured/baselines.json audit(task="musr") auto-fetches the fair baseline
measured/reproductions.jsonl audit(task=...) warns on a prior reproduction failure; record_reproduction(...) appends new ones (verdict auto-judged from Wilson CI)
# memory grows: you reproduce a claim and record the result
mm.record_reproduction("musr", claim="ZERO 55.6%", acc_claimed=0.556,
                       n_claimed=9, acc=0.385, n=1050, note="collapsed at scale")
# → verdict auto-judged FAIL, appended to db/measured/reproductions.jsonl

# later: any audit on the same task surfaces it automatically
mm.audit("ledger.jsonl", "new_claim", reported_metric="acc",
         reported_acc=0.62, n=120, task="musr")
# ⚠️ [⚙ prior-reproduction] task 'musr' has a prior reproduction failure:
#    'ZERO 55.6%' claimed 0.556 (n=9) → reproduced 0.385 (n=1050). collapsed at scale

db/curated/ — what we wrote by hand (qualitative)

Our catch log and research closures — human-curated, not the tool's automatic output. Searchable via catch_history(), but not auto-wired into audit() (matching is fuzzy text, not a clean task key, so auto-warning would mean false positives).

File kind Catch history of
curated/self_catches.jsonl self_catch false positives you flagged on yourself
curated/false_negative_guards.jsonl false_negative false negatives you re-checked
curated/gaming_patterns.json gaming gaming / mirage signatures you've seen
curated/contamination.jsonl contamination data leakage you found
curated/research_closures.jsonl closure qualitative negative conclusions (no acc/nnot a tool verdict)
mm.catch_history(db_dir="db")                 # all curated records
mm.catch_history(kind="gaming", db_dir="db")  # the gaming-signature catalog
mm.catch_history(source="fm_cde_pixel_feasibility")  # records from one arc

Why the split is honest: calling db/ as a whole "measure-mirror history" would over-claim — only measured/ is that. Every measured/ record was cross-checked: feeding its (acc, n) back through the tool's own Wilson-CI logic reproduces the recorded verdict with 0 mismatches.


Design Principles

  • Zero-dep core — pure Python stdlib. The optional judge module adds openai/anthropic for LLM-as-a-Judge; nothing else, nothing in the core.
  • Bidirectional — catches false positives and false negatives. Premature negative closures are also illusions.
  • Tamper-evident pre-registration — SHA-256 seal on first write. Re-registration is silently ignored. Ledger tampering is detected on every audit.
  • Independent probes — each check is a standalone function. Add new ones without touching existing code.
  • Adversarial by default — "too good to be true" is flagged before you believe it.

Contributing

New probes and baseline contributions are welcome.

  1. Fork → branch → PR
  2. db/baselines.json: shareable task baselines (not your private failures)
  3. New probes: add the function to mm.py + tests in tests/test_mm.py
  4. CI must stay green: pytest tests/

License

Apache 2.0


한국어 README →

About

🪞 Catch false positives/negatives in AI eval claims — pre-registration, fair-baseline, small-sample CI. Zero training, zero deps.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages