🪞 Measurement Mirror

Catch AI evaluation illusions — false positives and false negatives — automatically.
Zero training · Deterministic · Zero-dep core (Python 3.10+ stdlib; judge module optional).

Built while honestly killing our own project.
The makers ran it on themselves first. → 🦋 Origin Story

📖 Full Probe Guide → — detailed explanations, worked examples, and workflows for all 23 probes

🪞🔎🪪 New — Mirror Stack (stack/): measure-mirror is the claims layer of a three-mirror integrity stack for autonomous research agents (claims · actions · provenance, joined by five conventions + one verify-all command). Includes a real case study: an agent that retracted its own experiment before spending a single token — preregistration → power-check design fix → adversarial amendments → prior-art retraction, with the actual chain-sealed ledger bundled so you can verify it yourself. measure-mirror itself is unchanged (feature-frozen core; the stack adds conventions, not probes).

Tool	Audits	Question
🪞 measure-mirror (you are here)	AI evaluation claims	Is the claim honest?
🪪 action-mirror	Agent behaviour	Who did what, provably?
🔎 provenance-mirror	Content authenticity	Is the origin proven?
👁 mirror-witness	Cross-operator witness board	Who else witnessed it?

💬 Discussions — questions · ideas · independent reproductions welcome.

The Problem

AI/ML papers routinely overclaim. The most common failure modes:

Illusion	How it happens
Small-sample mirage	n=9, acc=55.6% reported as breakthrough
Post-hoc metric swap	Register accuracy, report the F1 that happened to look better
Crippled baseline	Compare against a deliberately weak competitor
Data leakage	Train/test overlap inflating every number
Scope overreach	"Works on task A" claimed as "general reasoning"

Measurement Mirror catches these structurally, not by opinion.

Precisely what it does (and does not). It runs deterministic checks on the inputs you provide — it is input-driven, not an autonomous flaw-hunter. Arithmetic/statistical probes (small-sample CI, GRIM, power, multiple-comparisons) are fully deterministic. But design-flaw probes (crippled baseline, gaming, scope) only fire on the baseline / reward terms / scopes you declare — the tool cannot discover a flaw you hide, and many real catches are still your judgment, guided by the discipline. It makes honesty provable; it does not force it.

Install

pip install -e .                        # core (zero deps)
pip install -e ".[mcp]"                 # + MCP server for AI agents
pip install -e ".[judge]"               # + LLM-as-a-Judge runner (openai / anthropic)
pip install -e ".[test]"                # + pytest plugin
pip install -e ".[mcp,judge,test]"      # everything

CLI entry point: mm
MCP entry point: mm-mcp

Quick Start

CLI

# Step 1 — BEFORE running your experiment: seal the criteria
mm register my_model --metric acc --min-n 200 --baseline 0.5 --pass 0.60

# Step 2 — AFTER evaluation: one-command audit
mm my_model                            # auto-loads my_model.json
mm audit my_model --acc 0.72 --n 500
mm audit --file results.json

results.json format: {"claim_id": "my_model", "metric": "acc", "acc": 0.72, "n": 500}

Python API

from measure_mirror import mm

LEDGER = "mm_ledger.jsonl"

# ① Before experiment — seal criteria (tamper-evident hash)
mm.preregister(LEDGER, "my_model",
               metric="acc", min_n=200, baseline=0.5, pass_threshold=0.60)

# ② After evaluation — full 7-probe audit at once
findings = mm.full_audit(
    LEDGER, "my_model",
    reported_metric="acc", reported_acc=0.72, n=500,
    baseline=0.5,
    competing_name="strong_baseline", competing_acc=0.68,   # ② fairness
    reward_terms=["cross_entropy"],                          # ③ gaming check
    train_items=train_set, test_items=test_set,              # ④ leakage
    seed_results=[0.70, 0.72, 0.74],                         # ⑤ multi-seed
    claimed_scope=["reasoning"], tested_scope=["task_a"],    # ⑥ scope
)
mm.report("my_model", findings)

# Individual probes
mm.report("fairness", [mm.baseline_fairness("vs GRU", 0.72, 0.68)])
mm.report("leakage",  [mm.leakage_check(train_items, test_items)])
mm.report("seeds",    [mm.multiseed_check([0.70, 0.72, 0.74], baseline=0.5)])

Regression / Continuous Metrics

For MSE, Pearson r, RMSE, and other non-binary metrics:

findings = mm.continuous_audit(
    LEDGER, "my_regressor",
    reported_metric="mse", reported_value=0.10,
    baseline_value=0.15, n=500,
    higher_better=False,   # lower MSE is better
    std=0.02,              # optional: enables effect-size check
)
mm.report("regression", findings)

pytest Integration (CI Gate)

# conftest.py
pytest_plugins = ["measure_mirror.pytest_plugin"]

# test_eval.py
from measure_mirror import mm
from measure_mirror.pytest_plugin import assert_clean

def test_my_model_is_real():
    findings = mm.audit("ledger.jsonl", "my_model",
                        reported_metric="acc", reported_acc=0.78, n=1000)
    assert_clean(findings)   # FAIL findings → test fails → CI goes red

Three Verification Tiers

You don't need to memorize 23 probes — there are exactly three ways to use the mirror:

# FULL — one shot, everything applicable runs automatically
mm verify --file results.json

# GROUP — restrict to verification groups
mm verify --file results.json --groups stats judge
mm verify --list-groups

# INDIVIDUAL — any probe directly (Python)
mm.grim_check(reported_acc=0.33, n=10)

from measure_mirror import mm

# FULL: probes activate based on which keys exist in data
findings = mm.verify("ledger.jsonl", {
    "claim_id": "my_model", "acc": 0.72, "n": 500,        # → ledger + stats
    "seed_results": [0.70, 0.72, 0.74],                    # → ⑤
    "scores": [3, 7, 5, 8, 4],                             # → judge ⑰
})

# GROUP: same data, judge group only
findings = mm.verify("ledger.jsonl", data, groups=["judge"])

verify() runs only the probes whose inputs are present — an empty key fires nothing, extra keys fire extra probes. The data keys are identical to the JSON file format.

Probes by Verification Group

`ledger` — pre-registration & ledger integrity

Probe	#	Catches
`preregister` / `audit`	①	Post-hoc metric swap · sample underrun · ledger tampering
`verify_chain`	①	Deleted/inserted entries · ledger tampering
`cascade_check`	⑫	Claim or transitive dependency retracted → FAIL/WARN stale

`stats` — statistical validity

Probe	#	Catches
`audit` — Wilson CI	④a	Results indistinguishable from chance (small sample)
`audit` — direction	④a	Performance worse than baseline (anti-signal)
`multiseed_check`	⑤	Unstable signal / lucky seed
`too_good_check`	⑦	Suspiciously large Δ over baseline
`power_check`	⑧	n too small to detect minimum effect (false-negative guard)
`multiple_comparisons_check`	⑨	k>1 experiments in ledger — Bonferroni correction alarm
`grim_check`	⑩	Reported acc × n is arithmetically impossible (fabricated value)

`design` — experiment-design fairness

Probe	#	Catches
`baseline_fairness`	②	Crippled / tied / reversed baseline
`gaming_check`	③	Metric directly in reward/loss (self-fulfilling)
`leakage_check`	④a	Train∩test data contamination
`scope_check`	⑥	Claimed scope wider than tested scope
`falsifiability_check`	⑪	No kill-condition → unfalsifiable; kill_threshold triggered → claim is dead

`negative` — Resolved-Negative closure gate

Probe	#	Catches
`negative_audit`	⑬	Negative conclusion has too few independent angles, unregistered angles, or scope overshoot

`judge` — LLM-judge reliability

Probe	#	Catches
`judge_consistency_check`	⑭	LLM judge flip-rate too high — unreliable judge detector
`judge_bias_check`	⑮	Judge systematically favors position A or B — position bias detector
`inter_rater_agreement`	⑯	Cohen's κ between two different judges below threshold (standalone-only)
`judge_score_sanity`	⑰	Judge assigns identical/near-identical scores — degenerate distribution
`judge_swap_check`	⑱	Verdict stays with the slot after AB→BA swap — judge reads position, not content

`ranking` — leaderboard integrity

Probe	#	Catches
`judge_transitivity_check`	⑲	A>B>C>A preference cycles — judge has no consistent quality scale
`ranking_stability_check`	⑳	"A beats B" flips under bootstrap resampling — ranking mirage

Utility	Purpose
`anchor`	Print tamper-evident ledger snapshot (hash + head seal) to stdout for external archival
`calibrate`	Self-test: 5 synthetic known-good/bad cases; confirms tool health
`witness`	Execute a command, capture output, seal tamper-evident run record
`retract`	Append a chain-linked retraction entry; cascades to dependents via cascade_check
`certificate`	Issue a sealed verification certificate: CERTIFIED / WITH-WARNINGS / UNVERIFIED / REJECTED
`badge`	Render a certificate as an embeddable markdown / SVG badge (verdict-colored)

Chain hash ledger (① extended)

Every preregister() call now embeds the previous entry's seal into the new one before computing the SHA-256. This makes the ledger tamper-evident end-to-end:

# verify the entire ledger chain at any time
findings = mm.verify_chain("mm_ledger.jsonl")
mm.report("ledger integrity", findings)

Catches: entry deletion, entry insertion, content modification.
Documented limitation: complete file deletion + fresh re-registration is not caught — commit the ledger file to git for that guarantee.

Power check ⑧

# warn when n is too small to detect a real effect
f = mm.power_check(n=50, baseline=0.5, min_detectable_effect=0.05)
# ⚠️  n=50 insufficient to detect Δ=+0.05 at 80% power (need n≥388)

# or activate via full_audit
findings = mm.full_audit(LEDGER, "my_model", ..., min_detectable_effect=0.05)

Multiple comparisons ⑨

# warn when k>1 experiments share a ledger (Bonferroni)
f = mm.multiple_comparisons_check("mm_ledger.jsonl")
# ⚠️  k=3 experiments → Bonferroni α=0.0167 (not 0.05)

# or activate via full_audit
findings = mm.full_audit(LEDGER, "my_model", ..., check_multiplicity=True)

Falsifiability ⑪ — the Popper gate

# Register BEFORE the experiment — seal what would kill the claim
mm.preregister("mm_ledger.jsonl", "my_model",
               metric="acc", min_n=200, baseline=0.5, pass_threshold=0.60,
               # human-readable (optional)
               kill_condition="accuracy on held-out test drops below 0.55",
               # structured: auto-evaluated at audit time
               kill_threshold={"metric": "acc", "threshold": 0.55, "direction": "below"})

# After the experiment — audit checks ⑪ automatically
findings = mm.audit("mm_ledger.jsonl", "my_model",
                    reported_metric="acc", reported_acc=0.50, n=500)
# 🔴 [⑪ falsifiability] Kill condition triggered: acc=0.5 < 0.55.
#    Claim 'my_model' is falsified by its own pre-registered criterion.

# Or check standalone
f = mm.falsifiability_check("mm_ledger.jsonl", "my_model", reported_acc=0.50)

# CLI
mm register my_model --metric acc --min-n 200 --baseline 0.5 --pass 0.60 \
  --kill "accuracy < 0.55 on held-out" \
  --kill-threshold 0.55 --kill-direction below

Claims registered without any kill_condition or kill_threshold receive a WARN: Unfalsifiable at audit time — operationalizing Popper's criterion as a code contract. OSF pre-registration accepts hypotheses; this is the first tool that also seals the kill condition.

Retraction cascade ⑫

# Register a claim that depends on prior work
mm.preregister("mm_ledger.jsonl", "model_v2",
               metric="acc", min_n=200, baseline=0.5, pass_threshold=0.60,
               depends_on=["dataset_v1", "baseline_eval"])

# Later: dataset_v1 turns out to have contamination — retract it
mm.retract("mm_ledger.jsonl", "dataset_v1", "train/test overlap discovered")

# cascade_check flags model_v2 as STALE automatically
f = mm.cascade_check("mm_ledger.jsonl", "model_v2")
# ⚠️  [⑫ retraction-cascade] Claim 'model_v2' is STALE: depends (transitively) on
#     retracted claim(s): 'dataset_v1'

# audit() runs cascade_check automatically — only WARN/FAIL are appended
findings = mm.audit("mm_ledger.jsonl", "model_v2",
                    reported_metric="acc", reported_acc=0.72, n=500)

# CLI
mm register model_v2 --metric acc --min-n 200 --baseline 0.5 --pass 0.60 \
  --depends-on dataset_v1 baseline_eval

mm retract dataset_v1 --reason "train/test overlap discovered"

The retraction entry is chain-linked — deleting it from the ledger breaks the chain and is detected by verify_chain(). Retraction propagates regardless of publication order: a claim built on a retracted foundation is automatically stale.

Negative-claim audit ⑬ — the premature-closure gate

A single negative result may reflect a frame flaw, not a universal wall. negative_audit gates "Resolved-Negative" conclusions: you must present ≥ min_angles independent pre-registered experiments before closing.

# Register each independent angle BEFORE running the experiment
for angle_id in ["oee_test_wave", "oee_test_bilinear", "oee_test_alife"]:
    mm.preregister("mm_ledger.jsonl", angle_id,
                   metric="oee_score", min_n=100, baseline=0.5, pass_threshold=0.0)

# After all angles converge negative — gate the closure
f = mm.negative_audit("mm_ledger.jsonl",
                      angles=["oee_test_wave", "oee_test_bilinear", "oee_test_alife"],
                      min_angles=3)
# ✅ [⑬ negative-audit] 3/3 independent pre-registered angle(s) verified —
#    negative conclusion is supported.

# Optional scope check: negative conclusion must not overgeneralize
f = mm.negative_audit("mm_ledger.jsonl",
                      angles=["oee_test_wave", "oee_test_bilinear", "oee_test_alife"],
                      conclusion_scope=["all_substrates"],   # claimed scope
                      tested_scope=["in_silico"])            # actually tested
# 🔴 [⑬ negative-audit] conclusion scope includes untested domain(s): ['all_substrates'].

# CLI
mm negative --angles oee_test_wave oee_test_bilinear oee_test_alife --min-angles 3

# Or activate inside full_audit
findings = mm.full_audit(LEDGER, "main_claim", ..., angles=["a1", "a2", "a3"])

Condition	Level
`len(angles) < min_angles`	FAIL — premature closure
Any angle not pre-registered	FAIL — unverifiable evidence
`conclusion_scope ⊄ tested_scope`	FAIL — scope overshoot
A registered angle is retracted	WARN — weakened case
All checks pass	OK

LLM-as-a-Judge probes ⑭⑮⑯⑰

LLM judges introduce their own failure modes that numeric metrics don't catch. The four judge probes audit the judge itself, not just the model being evaluated.

pip install "measure-mirror[judge]"   # adds openai and anthropic

from measure_mirror.judge import anthropic_judge, openai_judge, judge_run

# Build a judge callable (pairwise A-vs-B mode)
judge_fn = anthropic_judge(model="claude-opus-4-8")
# or: judge_fn = openai_judge(model="gpt-4o")

# Each item: {"prompt": str, "a": str, "b": str}  (pairwise)
#            {"prompt": str, "response": str}       (rating, pairwise=False)
pairs = [
    {"prompt": "Summarize quantum entanglement",
     "a": "candidate_A output", "b": "candidate_B output"},
    ...
]

# judge_run calls judge_fn runs×len(items) times,
# fires ⑭⑮⑯⑰(⑱) automatically, and seals a chain-linked entry.
result = judge_run("mm_ledger.jsonl", "my_llm_eval",
                   judge_fn=judge_fn,
                   items=pairs,
                   runs=2,               # run each item twice → ⑭ consistency
                   pairwise=True,        # A-vs-B → ⑮ bias check
                   swap_positions=True)  # extra AB→BA pass → ⑱ swap test

for f in result["findings"]:
    print(f"  {f.level}  [{f.probe}]  {f.msg}")

The checks triggered by judge_run:

Probe	Catches
`judge_consistency_check` ⑭	Judge gives different verdict on re-run (stochastic / unreliable)
`judge_bias_check` ⑮	Judge systematically favors position A or B regardless of content
`judge_score_sanity` ⑰	Judge assigns identical / near-identical scores to everything
`judge_swap_check` ⑱	Verdict stays with the slot after AB→BA swap (content-blind judge)

⑯ inter_rater_agreement is standalone-only: it compares two genuinely different judges; re-running the same judge is already covered by ⑭.

Unparseable judge responses score -1 and are excluded from all probes; a judge-parse WARN fires when the failure rate exceeds 10% (FAIL when nothing parsed).

Why ⑱ matters — a deterministic judge that never reads the responses passes ⑭ (perfectly consistent), ⑮ (balanced win-rate), ⑯ (κ=1.0), and ⑰ (varied scores). Only swapping A and B exposes it: a content-driven judge must invert its verdict, a content-blind judge keeps choosing the same slot. Run the demo:

python examples/demo_judge.py   # no API key needed — mock judges

Standalone usage (bring-your-own scores):

from measure_mirror import mm

# ⑭ consistency — did the judge flip its verdict?
score_pairs = [(1, 1), (0, 0), (1, 0), (0, 0), (1, 1)]  # (run1, run2) per item
f = mm.judge_consistency_check(score_pairs, flip_threshold=0.20)
# ✅  flip rate 20.0% ≤ 20.0% (1/5 flips). Consistent.

# ⑮ position bias — does A always win?
results = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]  # 0=A won, 1=B won
f = mm.judge_bias_check(results, bias_threshold=0.60)
# 🔴  Position A win rate 90.0% > 60.0%. Strong position bias detected.

# ⑯ inter-rater — Cohen's κ between two raters / runs
matrix = [(1, 1), (0, 0), (1, 1), (0, 1), (1, 0), (0, 0)]
f = mm.inter_rater_agreement(matrix, min_kappa=0.40)
# ⚠️  Cohen's κ=0.333 < 0.40 — fair agreement only.

# ⑰ score sanity — degenerate distribution?
scores = [8, 8, 8, 8, 8, 8, 8, 7, 8, 8]  # 90% are 8
f = mm.judge_score_sanity(scores)
# ⚠️  90% of scores are '8' — near-degenerate distribution.

# ⑱ position-swap — does the verdict follow content or slot?
forward = [0, 1, 0, 1, 0]   # winners in (A, B) order
swapped = [0, 1, 0, 1, 0]   # winners after AB→BA swap — identical = locked!
f = mm.judge_swap_check(forward, swapped)
# 🔴  Position-lock rate 100.0% > 65.0%. Judge is reading position, not content.

# ⑲ transitivity — does the judge have a consistent quality scale?
matches = [("gpt", "claude", 0), ("claude", "llama", 0), ("llama", "gpt", 0)]
f = mm.judge_transitivity_check(matches)
# 🔴  Preference cycle detected: gpt > claude > llama > gpt.
#     Any leaderboard from these verdicts is order-dependent.

# ⑳ ranking stability — does "A beats B" survive resampling?
f = mm.ranking_stability_check(scores_model_a, scores_model_b)
# 🔴  Ranking 'A > B' survives only 64.2% of 1000 bootstrap resamples (n=7).
#     The ranking is noise — indistinguishable from a tie.

# CLI: audit pre-collected judge scores from a JSON file
mm judge --file judge_scores.json
# keys: score_pairs / pairwise_results / ratings_matrix / scores /
#       forward_results + swapped_results / matches / scores_a + scores_b

Certificate 📜 — one sealed verdict per claim

certificate() collapses the full integrity state of a claim into a single verifiable artifact you can embed in a paper, README, or release notes:

# Structural certificate (prereg seal + chain + retraction status)
cert = mm.certificate("mm_ledger.jsonl", "my_model")

# Full certificate — fold audit findings in
findings = mm.audit("mm_ledger.jsonl", "my_model",
                    reported_metric="acc", reported_acc=0.72, n=500)
cert = mm.certificate("mm_ledger.jsonl", "my_model", findings=findings)
# {"verdict": "CERTIFIED", "prereg_seal": "6c802655ab095e8b",
#  "anchor_hash": "sha256...", "findings": {"ok": 4, "warn": 0, "fail": 0},
#  "seal": "9d1e83a4b72f0c5e", ...}

# CLI
mm certify my_model --pretty                  # structural only
mm certify my_model --acc 0.72 --n 500        # + audit findings folded in
mm certify my_model | gh gist create -        # publish externally

Verdict	Meaning
`CERTIFIED`	Pre-registered, chain intact, not retracted, no FAIL/WARN findings
`CERTIFIED-WITH-WARNINGS`	Valid but has stale dependencies or WARN findings
`UNVERIFIED`	No pre-registration exists — nothing to certify against
`REJECTED`	Chain broken, seal tampered, retracted, or FAIL findings

The certificate embeds the ledger's anchor_hash, so it attests to one specific ledger state — and the certificate itself is sealed, so any field edit is detectable.

Badge 🏷 — embed the verdict in your README:

mm certify my_model --badge markdown >> README.md   # shields.io badge
mm certify my_model --badge svg > badge.svg          # offline self-contained SVG

cert = mm.certificate("mm_ledger.jsonl", "my_model")
print(mm.badge(cert))                 # markdown (default)
print(mm.badge(cert, fmt="svg"))      # SVG with cert seal in the tooltip
# ![🪞 my_model: CERTIFIED](https://img.shields.io/badge/🪞_my__model-CERTIFIED-brightgreen)

Badge color follows the verdict: CERTIFIED = green · WITH-WARNINGS = yellow · UNVERIFIED = grey · REJECTED = red. The SVG variant embeds the certificate seal and anchor-hash prefix in its tooltip, making the badge traceable back to the exact sealed certificate it renders.

Anchor ⎈

# Print tamper-evident ledger snapshot to stdout — pipe wherever you trust
mm anchor                              # compact JSON (default)
mm anchor --pretty                     # human-readable

# External archival examples (zero new dependencies)
mm anchor >> ~/Dropbox/mm_anchors.jsonl          # local backup
mm anchor | gh gist create -                     # GitHub Gist
mm anchor | aws s3 cp - s3://bucket/anchor.json  # S3

a = mm.anchor("mm_ledger.jsonl")
# {"_type": "anchor", "ts": "...", "entry_count": 3,
#  "head_seal": "a3b9f2c1", "anchor_hash": "sha256hex...", "chain_ok": true}

The anchor_hash (full SHA-256 of the ledger file) detects even complete file replacement — the one attack that chain hashes alone cannot catch. Save it externally before publishing results.

Calibrate + Witness run

# Verify the mirror itself is working correctly
mm calibrate
# ✅ [⚙ calibrate] 5/5 synthetic cases correct — mirror is calibrated.

# Witness-execute a command: calibrate first, then run and seal the record
mm run my_model -- python evaluate.py --model my_model
# ✅ [⚙ calibrate] 5/5 synthetic cases correct — mirror is calibrated.
#
# 🎬 Witnessed: my_model
#    Command:     python evaluate.py --model my_model
#    Started:     2026-06-11T12:00:00Z
#    Ended:       2026-06-11T12:00:03Z
#    Exit code:   0  (ok)
#    Output hash: a3b9f2c1e8d74f6a
#    Prev seal:   6c802655ab095e8b
#    Seal:        9d1e83a4b72f0c5e

# Python API
findings = mm.calibrate()
mm.report("Mirror calibration", findings)

entry = mm.witness("mm_ledger.jsonl", "my_model",
                   ["python", "evaluate.py", "--model", "my_model"])
# entry["output_hash"] changes if the script output ever changes

GRIM ⑩

# catch arithmetically impossible accuracy values
f = mm.grim_check(reported_acc=0.33, n=10)
# 🔴  acc=0.33 is arithmetically impossible for n=10.
#     No integer k satisfies round(k/10, 2) = 0.33.
#     (candidates: k=3 → 0.3, k=4 → 0.4). Fabricated value or mis-reported n.

# GRIM runs automatically inside audit() — FAIL is appended to findings
findings = mm.audit(LEDGER, "my_model", reported_metric="acc", reported_acc=0.33, n=10)

MCP Server — AI Agent Integration

Any MCP-compatible AI (Claude Code, Cursor, Windsurf, …) can call Measurement Mirror directly mid-conversation.

Setup

pip install "measure-mirror[mcp]"

Claude Code — add to .mcp.json in your project root:

{
  "mcpServers": {
    "measure-mirror": {
      "command": "python",
      "args": ["-m", "measure_mirror.mcp_server"],
      "cwd": "/path/to/measure-mirror"
    }
  }
}

Other MCP clients — run mm-mcp as the stdio server command.

All 23 probes + 6 utilities + the mm_verify umbrella are exposed as MCP tools:
mm_verify (full / group-filtered) ·
mm_register · mm_verify_chain · mm_audit · mm_continuous_audit · mm_full_audit ·
mm_baseline_fairness · mm_gaming_check · mm_multiseed_check · mm_scope_check ·
mm_too_good_check · mm_power_check · mm_multiple_comparisons_check · mm_grim_check ·
mm_falsifiability_check · mm_cascade_check · mm_negative_audit ·
mm_judge_consistency_check · mm_judge_bias_check · mm_inter_rater_agreement ·
mm_judge_score_sanity · mm_judge_swap_check · mm_judge_transitivity_check ·
mm_ranking_stability_check ·
mm_anchor · mm_calibrate · mm_witness · mm_retract · mm_certificate · mm_badge

Real Catches — Dog-Fooding

We ran Measurement Mirror on our own AI research before publishing. It caught:

🪞 Audit: ZERO "55.6% Best" claim
   🔴 FAIL
   🔴 [④a small-sample CI] n=9, acc=0.556 → 95%CI [0.267, 0.811] ⊃ baseline(0.5)
       Statistically indistinguishable from chance.
   🔴 [① pre-registration(min_n)] reported n=9 < registered min_n=200. Undersized.
   🔴 [① pre-registration(metric swap)] reported 'best_of_9' ≠ registered 'acc_full_balanced'
       Post-hoc metric swap detected. (seal=6c802655ab095e8b)

🪞 Audit: Field candidate 5 (control sim)
   🔴 FAIL
   🔴 [② fair baseline] Field 0.996 ≈ GRU-ODE 0.998 (Δ+0.002 < 0.01). Tied — no genuine advantage.

Run the demos yourself:

python examples/quickstart.py    # happy path: honest researcher
python examples/demo_zero.py     # ZERO 55.6% mirage (our own project killed)
python examples/demo_field.py    # Field candidate false positives

Project Structure

measure-mirror/
├── measure_mirror/
│   ├── mm.py              # verify() + 20 probes + CLI + DB lookup (zero deps)
│   ├── mcp_server.py      # MCP server — 30 tools (pip install .[mcp])
│   ├── judge.py           # LLM-as-a-Judge runner (pip install .[judge])
│   └── pytest_plugin.py   # assert_clean() for CI gates
├── docs/
│   ├── GUIDE.md           # full per-probe reference (English)
│   └── GUIDE_KO.md        # full per-probe reference (Korean)
├── examples/
│   ├── quickstart.py      # happy path demo
│   ├── demo_zero.py       # ZERO false-positive (dog-food)
│   ├── demo_field.py      # Field false-positive (dog-food)
│   ├── demo_judge.py      # LLM-judge failure modes (no API key needed)
│   └── mcp_example.py     # MCP tool usage reference
├── db/                    # local memory, split by who produced the record
│   ├── README.md              the measured/ vs curated/ distinction
│   ├── measured/             ← measure-mirror's own output (quantitative)
│   │   ├── baselines.json         task-level fair baselines
│   │   └── reproductions.jsonl    reproductions (verdict auto-judged)
│   └── curated/              ← human-curated (qualitative)
│       ├── self_catches.jsonl          false positives on ourselves
│       ├── false_negative_guards.jsonl false negatives re-checked
│       ├── gaming_patterns.json        gaming signatures
│       ├── contamination.jsonl         data leakage found
│       └── research_closures.jsonl     qualitative negative conclusions
└── tests/
    ├── test_mm.py         # 145 tests for core probes, CI-enforced
    ├── test_judge.py      # 17 tests for judge.py module
    └── test_sync.py       # sync gate: probe ↔ MCP ↔ tests ↔ README ↔ exports ↔ version

Local Memory (`db/`)

db/ is your local memory of past audits — not a shared/crowd database. We tried the "CVE / shared signature" framing and for us it didn't earn its keep (this is a scoped observation, not a universal law): contributing would mean publishing your own research that was wrong (self_catches) or a peer's failed reproduction (reproductions), which runs into the trust ⊥ reputation dilemma. A team with the right incentives might sustain a shared DB; we didn't, and the value below needs no sharing anyway.

The value that does hold needs no sharing at all: warn future-you about a pattern past-you already got burned by. It works regardless of how private the data is — it never leaves your machine.

db/ is split by who produced the record, so the two kinds are never confused (see db/README.md for the full structure):

`db/measured/` — what measure-mirror produces (quantitative)

Verdicts computed by the tool itself; re-running on the same numbers reproduces the same verdict exactly. These are wired into the audit loop and grow only via record_reproduction().

File	How it's used
`measured/baselines.json`	`audit(task="musr")` auto-fetches the fair baseline
`measured/reproductions.jsonl`	`audit(task=...)` warns on a prior reproduction failure; `record_reproduction(...)` appends new ones (verdict auto-judged from Wilson CI)

# memory grows: you reproduce a claim and record the result
mm.record_reproduction("musr", claim="ZERO 55.6%", acc_claimed=0.556,
                       n_claimed=9, acc=0.385, n=1050, note="collapsed at scale")
# → verdict auto-judged FAIL, appended to db/measured/reproductions.jsonl

# later: any audit on the same task surfaces it automatically
mm.audit("ledger.jsonl", "new_claim", reported_metric="acc",
         reported_acc=0.62, n=120, task="musr")
# ⚠️ [⚙ prior-reproduction] task 'musr' has a prior reproduction failure:
#    'ZERO 55.6%' claimed 0.556 (n=9) → reproduced 0.385 (n=1050). collapsed at scale

`db/curated/` — what we wrote by hand (qualitative)

Our catch log and research closures — human-curated, not the tool's automatic output. Searchable via catch_history(), but not auto-wired into audit() (matching is fuzzy text, not a clean task key, so auto-warning would mean false positives).

File	`kind`	Catch history of
`curated/self_catches.jsonl`	`self_catch`	false positives you flagged on yourself
`curated/false_negative_guards.jsonl`	`false_negative`	false negatives you re-checked
`curated/gaming_patterns.json`	`gaming`	gaming / mirage signatures you've seen
`curated/contamination.jsonl`	`contamination`	data leakage you found
`curated/research_closures.jsonl`	`closure`	qualitative negative conclusions (no `acc`/`n` — not a tool verdict)

mm.catch_history(db_dir="db")                 # all curated records
mm.catch_history(kind="gaming", db_dir="db")  # the gaming-signature catalog
mm.catch_history(source="fm_cde_pixel_feasibility")  # records from one arc

Why the split is honest: calling db/ as a whole "measure-mirror history" would over-claim — only measured/ is that. Every measured/ record was cross-checked: feeding its (acc, n) back through the tool's own Wilson-CI logic reproduces the recorded verdict with 0 mismatches.

Design Principles

Zero-dep core — pure Python stdlib. The optional judge module adds openai/anthropic for LLM-as-a-Judge; nothing else, nothing in the core.
Bidirectional — catches false positives and false negatives. Premature negative closures are also illusions.
Tamper-evident pre-registration — SHA-256 seal on first write. Re-registration is silently ignored. Ledger tampering is detected on every audit.
Independent probes — each check is a standalone function. Add new ones without touching existing code.
Adversarial by default — "too good to be true" is flagged before you believe it.

Contributing

New probes and baseline contributions are welcome.

Fork → branch → PR
db/baselines.json: shareable task baselines (not your private failures)
New probes: add the function to mm.py + tests in tests/test_mm.py
CI must stay green: pytest tests/

License

Apache 2.0

한국어 README →

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
.github/workflows		.github/workflows
db		db
docs		docs
eval/self_fpfn		eval/self_fpfn
examples		examples
measure_mirror		measure_mirror
stack		stack
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
README_KO.md		README_KO.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

🪞 Measurement Mirror

The Problem

Install

Quick Start

CLI

Python API

Regression / Continuous Metrics

pytest Integration (CI Gate)

Three Verification Tiers

Probes by Verification Group

ledger — pre-registration & ledger integrity

stats — statistical validity

design — experiment-design fairness

negative — Resolved-Negative closure gate

judge — LLM-judge reliability

ranking — leaderboard integrity

Chain hash ledger (① extended)

Power check ⑧

Multiple comparisons ⑨

Falsifiability ⑪ — the Popper gate

Retraction cascade ⑫

Negative-claim audit ⑬ — the premature-closure gate

LLM-as-a-Judge probes ⑭⑮⑯⑰

Certificate 📜 — one sealed verdict per claim

Anchor ⎈

Calibrate + Witness run

GRIM ⑩

MCP Server — AI Agent Integration

Setup

Real Catches — Dog-Fooding

Project Structure

Local Memory (db/)

db/measured/ — what measure-mirror produces (quantitative)

db/curated/ — what we wrote by hand (qualitative)

Design Principles

Contributing

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`ledger` — pre-registration & ledger integrity

`stats` — statistical validity

`design` — experiment-design fairness

`negative` — Resolved-Negative closure gate

`judge` — LLM-judge reliability

`ranking` — leaderboard integrity

Local Memory (`db/`)

`db/measured/` — what measure-mirror produces (quantitative)

`db/curated/` — what we wrote by hand (qualitative)

Packages