Agent can lower val_bpb without improving the model — add an optional experiment integrity log? #599

The loop keeps a commit when the printed `val_bpb` improves (`program.md:99-103`).
But the agent owns `train.py`, and nothing binds that printed number to real
training — so an agent optimizing for the metric can "win" without learning
anything. I verified the following against `master` (`228791f`).

**Four ways the number can move without a better model:**

1. **Skip the optimizer, still finish the budget.** `step` / `dt` /
   `total_training_time` are plain locals (`train.py:540, 578-579, 603`). Advance
   `step` and accrue `total_training_time` without `loss.backward()` /
   `optimizer.step()` and the loop completes on an untrained model.
2. **Short-circuit the metric.** `train.py:613` calls `evaluate_bpb` and
   `train.py:622` prints it; the agent can replace/wrap that call. The
   "do not change" on `evaluate_bpb` is prose only (`program.md:28-31`) — nothing
   enforces it.
3. **Val set is never fingerprinted** (`prepare.py:353-354`) — shrinking
   `EVAL_TOKENS` or swapping the shard leaves no trace. (Related to the cache-trust
   work in #41 / #215, but here it's the *eval* side.)
4. **Results are `print()`ed, never bound to code/data/model** (`train.py:621-630`),
   so `results.tsv` keeps a commit + a number that can't be reproduced later.
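For item 3, a rough sketch of what fingerprinting the eval side could look like. This is an illustration only: `val_fingerprint` and its arguments are hypothetical names, and I am assuming the val shard is a single file on disk and the eval budget is one integer.

```python
import hashlib
import os
import tempfile


def val_fingerprint(shard_path: str, eval_tokens: int) -> str:
    """Hash the shard bytes together with the eval token budget (hypothetical helper)."""
    h = hashlib.sha256()
    # Mix in the token budget first so shrinking it changes the digest.
    h.update(str(eval_tokens).encode("ascii") + b"\0")
    with open(shard_path, "rb") as f:
        # Stream in 1 MiB chunks so large shards do not need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        shard = os.path.join(d, "val_shard.bin")  # stand-in file, not the real shard
        with open(shard, "wb") as f:
            f.write(b"tokens" * 1000)
        full = val_fingerprint(shard, eval_tokens=1000)
        shrunk = val_fingerprint(shard, eval_tokens=500)
        print("same data, smaller budget changes digest:", full != shrunk)
```

With something like this recorded per run, a smaller eval budget or a swapped shard would show up as a digest mismatch between runs.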

**Minimal reproducer** (stdlib only, no torch/GPU): https://gist.github.com/pulkit6732/a5c3ff9113bfac7e0b6ae50e69b8b567

In it, a fabricated `val_bpb` beats the current best and gets KEPT, while a
one-line receipt exposes it:

```
[honest          (real training)]   val_bpb: 0.317639   -> KEEP (best)
[game_eval_const (skip optim+fake)] val_bpb: 0.050000   -> KEEP  <-- fake result recorded

experiment                                      val_bpb  rep_step  opt_step  untrained?
honest          (real training)                0.317639        86       688        no
game_eval_const (skip optimizer + fake eval)       0.05        12         0       YES
```

`real_opt_steps` (the `opt_step` column) is read from the optimizer state, not
the loop variable, and the gamed run's model hash equals the untrained baseline,
so the gamed run is unmistakable.
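To illustrate the two signals, here is a toy, stdlib-only sketch. It is not the gist's code: `ToySGD`, `model_hash`, and `run` are hypothetical stand-ins, and the "model" is just a list of floats. The point is only that a counter kept inside the optimizer and a hash of the weights cannot be moved by advancing the loop variable.

```python
import hashlib
import struct


class ToySGD:
    """Stand-in optimizer that counts its own updates in its state."""

    def __init__(self, params, lr=0.1):
        self.params = params  # shared list, updated in place
        self.lr = lr
        self.state = {"step": 0}

    def step(self, grads):
        for i, g in enumerate(grads):
            self.params[i] -= self.lr * g
        self.state["step"] += 1


def model_hash(params):
    """SHA-256 over the packed little-endian float64 weights."""
    return hashlib.sha256(struct.pack(f"<{len(params)}d", *params)).hexdigest()


def run(num_steps, train=True):
    params = [1.0, -2.0, 0.5]
    baseline = model_hash(params)  # hash of the untrained weights
    opt = ToySGD(params)
    step = 0
    while step < num_steps:
        if train:
            grads = [2.0 * p for p in params]  # gradient of sum(p**2)
            opt.step(grads)
        step += 1  # the loop counter advances either way
    return {
        "reported_steps": step,
        "real_opt_steps": opt.state["step"],
        "untrained": model_hash(params) == baseline,
    }


if __name__ == "__main__":
    print("honest:", run(12, train=True))
    print("gamed: ", run(12, train=False))
```

In the gamed call the loop still reports 12 steps, but the optimizer's own counter stays at 0 and the weight hash matches the baseline.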

**Proposed fix — a ~30-line stdlib integrity log.** One line per run binding
`val_bpb` to `sha256` of `train.py` + `prepare.py` + the val shard + the final
model state, plus the real optimizer-step count and wall-time. Detection, not
prevention; no new deps; keeps the repo's minimal spirit. Integration is ~11
lines after the eval, wrapped so it can never fail a run.
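A minimal sketch of the shape I have in mind, not the final module. The function names, the JSON-lines format, and the field names are all assumptions for illustration; the model state is passed in as raw bytes here so the example stays stdlib-only.

```python
import hashlib
import json
import os
import tempfile
import time


def sha256_file(path):
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def log_integrity(log_path, val_bpb, files, model_bytes, real_opt_steps, wall_time_s):
    """Append one JSON line binding val_bpb to code, data, and model hashes.

    Every error is swallowed so logging can never fail a run.
    Returns True if a line was written, False otherwise.
    """
    try:
        record = {
            "time": time.time(),
            "val_bpb": val_bpb,
            "files": {name: sha256_file(path) for name, path in files.items()},
            "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
            "real_opt_steps": real_opt_steps,
            "wall_time_s": wall_time_s,
        }
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record, sort_keys=True) + "\n")
        return True
    except Exception:
        return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "train.py")  # stand-in file for the demo
        with open(src, "w", encoding="utf-8") as f:
            f.write("print('hello')\n")
        log = os.path.join(d, "integrity.jsonl")
        ok = log_integrity(log, 0.317639, {"train.py": src}, b"weights", 688, 300.0)
        print("logged:", ok)
```

Because the hashing happens inside the `try`, a missing file yields `False` and no partial line, rather than an exception after the eval.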

Is an optional log like this something you'd want in the repo? If so I'm happy to
send a small PR (module + the hook + the standalone reproducer). And if this is
intentionally left to the human supervisor, that's a fair answer too — figured
it was worth raising.

