The loop keeps a commit when the printed `val_bpb` improves (`program.md:99-103`). But the agent owns `train.py`, and nothing binds that printed number to real training, so an agent optimizing for the metric can "win" without learning anything. I verified the following against `master` (228791f).

Four ways the number can move without a better model:
1. Skip the optimizer, still finish the budget. `step` / `dt` / `total_training_time` are plain locals (`train.py:540, 578-579, 603`). Advance `step` and accrue `total_training_time` without `loss.backward()` / `optimizer.step()` and the loop completes on an untrained model.
2. Short-circuit the metric. `train.py:613` calls `evaluate_bpb` and `train.py:622` prints it; the agent can replace or wrap that call. The "do not change" on `evaluate_bpb` is prose only (`program.md:28-31`); nothing enforces it.
3. Change the eval data (`prepare.py:353-354`): shrinking `EVAL_TOKENS` or swapping the shard leaves no trace. (Related to the cache-trust work in #41, "Harden cache artifact trust boundary in prepare.py", and #215, "Harden downloaded dataset shard cache in prepare.py", but here it's the eval side.)
4. The result is only `print()`ed, never bound to code/data/model (`train.py:621-630`), so `results.tsv` keeps a commit plus a number that can't be reproduced later.

Minimal reproducer (stdlib only, no torch/GPU): https://gist.github.com/pulkit6732/a5c3ff9113bfac7e0b6ae50e69b8b567

A fabricated `val_bpb` beats the current best and gets KEPT, while a one-line receipt exposes it:
`real_opt_steps` is read from the optimizer state (not the loop variable) and the model hash equals the untrained baseline, so the gamed run is unmistakable.
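To make those two signals concrete, here is a toy sketch. It is not the linked gist and not the repo's `train.py`: `ToyOptimizer`, `model_hash`, and `run` are hypothetical stand-ins, and the "model" is just a list of floats. It only shows why a loop counter proves nothing while an optimizer-side counter and a weight hash do.

```python
import hashlib
import json


class ToyOptimizer:
    """Hypothetical stand-in for a real optimizer; it counts its own updates."""

    def __init__(self, params, lr=0.1):
        self.params = params
        self.lr = lr
        self.state = {"step": 0}

    def step(self, grads):
        # Plain gradient descent on the shared parameter list.
        for i, g in enumerate(grads):
            self.params[i] -= self.lr * g
        self.state["step"] += 1


def model_hash(params):
    """sha256 over a JSON encoding of the parameters."""
    return hashlib.sha256(json.dumps(params).encode()).hexdigest()


def run(budget_steps, skip_optimizer):
    params = [0.5, -0.25, 1.0]
    baseline = model_hash(params)  # hash of the untrained model
    opt = ToyOptimizer(params)
    step = 0
    while step < budget_steps:
        if not skip_optimizer:
            grads = [2.0 * p for p in params]  # gradient of sum(p**2)
            opt.step(grads)
        step += 1  # the loop counter advances either way
    return {
        "loop_steps": step,
        "real_opt_steps": opt.state["step"],  # read from optimizer state
        "model_unchanged": model_hash(params) == baseline,
    }


honest = run(10, skip_optimizer=False)
gamed = run(10, skip_optimizer=True)
print(honest)
print(gamed)
```

Both runs report `loop_steps == 10`, but only the gamed run has `real_opt_steps == 0` and a model hash identical to the baseline.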
Proposed fix: a ~30-line stdlib integrity log. One line per run binding `val_bpb` to the `sha256` of `train.py` + `prepare.py` + the val shard + the final model state, plus the real optimizer-step count and wall-time. Detection, not prevention; no new deps; keeps the repo's minimal spirit. Integration is ~11 lines after the eval, wrapped so it can never fail a run.
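A rough sketch of what such a log could look like, to make the proposal easier to evaluate. The function names (`sha256_file`, `write_receipt`, `safe_write_receipt`), the JSON-lines layout, and the field names are assumptions for illustration, not the module from the gist. How the final model state is serialized to bytes is left to the caller and not shown here.

```python
import hashlib
import json
import time


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large shards do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_receipt(log_path, val_bpb, files, model_bytes, real_opt_steps, wall_time_s):
    """Append one JSON line binding val_bpb to code, data, and model hashes.

    `files` maps a label to a path; `model_bytes` is an already-serialized
    model state (an assumption: serialization happens in the caller).
    """
    record = {
        "val_bpb": val_bpb,
        "files": {name: sha256_file(path) for name, path in files.items()},
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
        "real_opt_steps": real_opt_steps,
        "wall_time_s": wall_time_s,
        "logged_at": time.time(),
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def safe_write_receipt(*args, **kwargs):
    """Wrapper so a logging problem can never fail a training run."""
    try:
        return write_receipt(*args, **kwargs)
    except Exception as exc:
        print(f"integrity log skipped: {exc!r}")
        return None
```

The hook after the eval would then be a single `safe_write_receipt(...)` call, with the file map covering `train.py`, `prepare.py`, and the val shard. A later audit can re-hash those files and compare against the logged line.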
Is an optional log like this something you'd want in the repo? If so I'm happy to
send a small PR (module + the hook + the standalone reproducer). And if this is
intentionally left to the human supervisor, that's a fair answer too — figured
it was worth raising.