Agent can lower val_bpb without improving the model — add an optional experiment integrity log? #599

Description

@pulkit6732

The loop keeps a commit when the printed val_bpb improves (program.md:99-103).
But the agent owns train.py, and nothing binds that printed number to real
training — so an agent optimizing for the metric can "win" without learning
anything. I verified the following against master (228791f).

Four ways the number can move without a better model:

  1. Skip the optimizer, still finish the budget. step / dt /
    total_training_time are plain locals (train.py:540, 578-579, 603). Advance
    step and accrue total_training_time without loss.backward() /
    optimizer.step() and the loop completes on an untrained model.
  2. Short-circuit the metric. train.py:613 calls evaluate_bpb and
    train.py:622 prints it; the agent can replace/wrap that call. The
    "do not change" on evaluate_bpb is prose only (program.md:28-31) — nothing
    enforces it.
  3. Val set is never fingerprinted (prepare.py:353-354) — shrinking
    EVAL_TOKENS or swapping the shard leaves no trace. (Related to the cache-trust
    work in #41, "Harden cache artifact trust boundary in prepare.py", and #215,
    "Harden downloaded dataset shard cache in prepare.py", but here it's the eval side.)
  4. Results are print()ed, never bound to code/data/model (train.py:621-630),
    so results.tsv keeps a commit + a number that can't be reproduced later.
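
To make item 3 concrete, here is a minimal stdlib sketch of what fingerprinting the eval side could look like. The function name `fingerprint_val_set`, the throwaway file, and the token counts are all hypothetical and not taken from prepare.py; the only idea illustrated is that hashing the shard bytes together with the eval token budget makes either change visible.

```python
import hashlib
import os
import tempfile


def fingerprint_val_set(shard_path, eval_tokens, chunk_size=1 << 20):
    """Return a sha256 hex digest over the eval token budget and the shard bytes.

    Hypothetical helper for illustration; not part of prepare.py.
    """
    h = hashlib.sha256()
    # Mix the token budget in first, so shrinking it changes the digest
    h.update(f"eval_tokens={eval_tokens}\n".encode())
    with open(shard_path, "rb") as f:
        # Read in chunks so a large shard never has to fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Demo on a throwaway file standing in for the val shard
with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "val_shard.bin")
    with open(shard, "wb") as f:
        f.write(b"\x01\x02" * 1000)
    pinned = fingerprint_val_set(shard, eval_tokens=4096)
    same = fingerprint_val_set(shard, eval_tokens=4096)
    shrunk = fingerprint_val_set(shard, eval_tokens=512)  # smaller eval budget
    with open(shard, "wb") as f:
        f.write(b"\x00" * 2000)  # swapped shard contents
    swapped = fingerprint_val_set(shard, eval_tokens=4096)

print(same == pinned, shrunk == pinned, swapped == pinned)
```

Comparing each run's digest against a value pinned by the human supervisor would flag both a shrunken eval budget and a swapped shard.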

Minimal reproducer (stdlib only, no torch/GPU): https://gist.github.com/pulkit6732/a5c3ff9113bfac7e0b6ae50e69b8b567
In it, a fabricated val_bpb beats the current best and gets KEPT, while a
one-line receipt exposes it:

[honest          (real training)]   val_bpb: 0.317639   -> KEEP (best)
[game_eval_const (skip optim+fake)] val_bpb: 0.050000   -> KEEP  <-- fake result recorded

experiment                                      val_bpb  rep_step  opt_step  untrained?
honest          (real training)                0.317639        86       688        no
game_eval_const (skip optimizer + fake eval)       0.05        12         0       YES

real_opt_steps (the opt_step column above) is read from the optimizer state,
not the loop variable, and in the gamed run the model hash equals the untrained
baseline, so the gamed run is unmistakable.
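
The two signals can be illustrated with a toy, torch-free sketch. Everything in it is an assumption made for illustration: `ToySGD`, `model_hash`, and `run` are hypothetical names, and the "model" is a two-float dict rather than anything from train.py. The point is only that a counter owned by the optimizer and a hash of the final weights cannot be advanced by the loop variable alone.

```python
import hashlib
import json


class ToySGD:
    """Stand-in optimizer that counts its own updates in its state."""

    def __init__(self, params, lr=0.1):
        self.params = params
        self.lr = lr
        self.state = {"step": 0}

    def step(self, grads):
        for name, g in grads.items():
            self.params[name] -= self.lr * g
        self.state["step"] += 1  # only a real update advances this counter


def model_hash(params):
    """sha256 over a canonical JSON dump of the parameters."""
    blob = json.dumps(params, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


def run(train_for_real, budget_steps=12):
    params = {"w": 0.0, "b": 0.0}
    baseline = model_hash(params)  # hash of the untrained model
    opt = ToySGD(params)
    step = 0  # plain local, freely advanced by the loop
    while step < budget_steps:
        if train_for_real:
            # Gradient of (w - 1)^2 + (b + 2)^2
            grads = {"w": 2 * (params["w"] - 1.0), "b": 2 * (params["b"] + 2.0)}
            opt.step(grads)
        step += 1
    return {
        "reported_step": step,
        "real_opt_steps": opt.state["step"],
        "untrained": model_hash(params) == baseline,
    }


honest = run(train_for_real=True)
gamed = run(train_for_real=False)
print(honest)
print(gamed)
```

Both runs report the same loop step, but only the honest one has a nonzero optimizer count and a model hash that differs from the baseline.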

Proposed fix — a ~30-line stdlib integrity log. One line per run binding
val_bpb to sha256 of train.py + prepare.py + the val shard + the final
model state, plus the real optimizer-step count and wall-time. Detection, not
prevention; no new deps; keeps the repo's minimal spirit. Integration is ~11
lines after the eval, wrapped so it can never fail a run.
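
A rough sketch of the shape such a log could take, not the proposed module itself. The names `sha256_file` and `log_run`, the JSON-lines layout, and the field names are assumptions; the demo uses throwaway files, and the step count and wall-time are placeholders. It shows one appended line per run and a wrapper that swallows its own errors so logging cannot fail a run.

```python
import hashlib
import json
import os
import tempfile
import time


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def log_run(log_path, val_bpb, file_paths, model_bytes, real_opt_steps, wall_time_s):
    """Append one JSON line binding val_bpb to code, data, and model hashes.

    Returns True on success and False on any failure (hypothetical helper).
    """
    try:
        record = {
            "time": time.time(),
            "val_bpb": val_bpb,
            "files": {os.path.basename(p): sha256_file(p) for p in file_paths},
            "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
            "real_opt_steps": real_opt_steps,
            "wall_time_s": wall_time_s,
        }
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record, sort_keys=True) + "\n")
        return True
    except Exception:
        # Detection aid only: a logging failure must never fail the run
        return False


# Demo with throwaway stand-ins for train.py and the serialized model state
with tempfile.TemporaryDirectory() as tmp:
    fake_train = os.path.join(tmp, "train.py")
    with open(fake_train, "w", encoding="utf-8") as f:
        f.write("print('placeholder')\n")
    log_path = os.path.join(tmp, "integrity.jsonl")
    ok = log_run(log_path, 0.317639, [fake_train], b"fake-weights", 688, 1.0)
    # A missing input file is swallowed instead of raising
    failed = log_run(log_path, 0.05, [os.path.join(tmp, "missing.bin")], b"", 0, 1.0)
    with open(log_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    record = json.loads(lines[0])

print(ok, failed, len(lines))
```

Returning a boolean instead of raising is one way to honour the "can never fail a run" constraint; the trade-off is that a silently missing log line then has to be treated as a signal in its own right.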

Is an optional log like this something you'd want in the repo? If so I'm happy to
send a small PR (module + the hook + the standalone reproducer). And if this is
intentionally left to the human supervisor, that's a fair answer too — figured
it was worth raising.
