Predict-then-attribute loop: 1.349 val_bpb in 26 experiments (vs 1.405 baseline in 25) #340
Hzz-Git started this conversation in Show and tell
I ran a 4-arm experiment testing whether making the agent predict outcomes before each run and attribute errors afterward improves search efficiency in this setup. This is a single run on one hardware config, so treat it as exploratory. I'm sharing it to see if the pattern holds elsewhere.
Setup: M5 Pro (18-core, 64GB), MPS backend, 5-min budget. 4 program.md variants, same starting train.py.
Arm 4 reached a better config in fewer trials. Arms 2 and 3 ran longer but didn't reach the same level. The uneven experiment counts are a limitation — ideally each arm would run the same number.
What the reflective agent did differently:
The agent maintains a beliefs.md file (max 20 beliefs, rewritten not appended). Before each experiment it writes a prediction with reasoning. After, it compares prediction vs reality and identifies which belief was wrong.

My working hypothesis is that forcing explicit predictions makes the agent expose its internal assumptions, and attribution turns failed runs into updates rather than just logs.
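To make the loop concrete, here is a minimal Python sketch of the bookkeeping it implies. This is not the code from the repo: the `Trial` record, `revise_beliefs`, and `write_beliefs` are hypothetical names, and dropping the oldest belief when the cap is exceeded is my assumption. The post only says the file holds at most 20 beliefs and is rewritten rather than appended.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional

MAX_BELIEFS = 20  # cap stated in the post


@dataclass
class Trial:
    """One experiment: prediction written before the run, attribution after."""
    change: str                         # what the agent modified
    predicted_bpb: float                # prediction made before running
    reasoning: str                      # why the agent expects that number
    actual_bpb: Optional[float] = None  # filled in after the run
    wrong_belief: Optional[str] = None  # belief blamed for the miss, if any

    def gap(self) -> float:
        """Absolute prediction error |predicted - actual| in val_bpb."""
        if self.actual_bpb is None:
            raise ValueError("trial has not been run yet")
        return abs(self.predicted_bpb - self.actual_bpb)


def revise_beliefs(beliefs: List[str], wrong: str, replacement: str) -> List[str]:
    """Return a new belief list with `wrong` swapped out for `replacement`.

    Assumption: if the list exceeds the cap, the oldest beliefs are dropped.
    """
    kept = [b for b in beliefs if b != wrong]
    return (kept + [replacement])[-MAX_BELIEFS:]


def write_beliefs(path: str, beliefs: List[str]) -> None:
    """Rewrite the whole file instead of appending to it."""
    Path(path).write_text("".join(f"- {b}\n" for b in beliefs))
```

The point of returning a fresh list and overwriting the file is that a refuted belief disappears from the agent's context instead of lingering next to its replacement.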
By experiment ~10, the agent had learned "MPS throughput is the binding constraint." It combined this with "softcap removal helps convergence" to reason: try a shallower model (depth=3) → more training steps in 5 minutes → wins. This hypothesis did not emerge in the other arms over the runs shown.
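The throughput argument can be sketched as simple arithmetic. The linear step-time model and every number below are illustrative assumptions, not measurements from this run; only the 5-minute budget and the depth=3 idea come from the post.

```python
BUDGET_S = 300.0  # the 5-minute budget from the setup


def step_time_for_depth(depth: int, per_layer_s: float, overhead_s: float) -> float:
    """Assumed model: fixed per-step overhead plus a cost linear in depth."""
    return overhead_s + depth * per_layer_s


def steps_in_budget(budget_s: float, step_time_s: float) -> int:
    """Number of whole optimizer steps that fit in a fixed wall-clock budget."""
    if step_time_s <= 0:
        raise ValueError("step_time_s must be positive")
    return int(budget_s // step_time_s)


# Illustrative placeholder timings, not measured on MPS.
deep = steps_in_budget(BUDGET_S, step_time_for_depth(6, per_layer_s=0.125, overhead_s=0.25))
shallow = steps_in_budget(BUDGET_S, step_time_for_depth(3, per_layer_s=0.125, overhead_s=0.25))
print(deep, shallow)
```

Under this model a shallower network always gets more steps in the same budget. Whether the extra steps outweigh the lost capacity is the empirical question the agent's experiment answered for this setup.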
Prediction calibration appeared to improve over the run. Measuring gap as absolute error |predicted − actual| in val_bpb: first-half mean gap was 0.066, second-half was 0.019 (n=26 total, so treat as suggestive).
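For anyone who wants to compute the same metric on their own run, a minimal sketch follows. The function name `calibration_by_half` is hypothetical, and splitting at the midpoint of the experiment list is my reading of "first-half" and "second-half". The example values are made up for illustration and are not the numbers from this run.

```python
from statistics import mean
from typing import Sequence, Tuple


def calibration_by_half(
    predicted: Sequence[float], actual: Sequence[float]
) -> Tuple[float, float]:
    """Mean absolute gap |predicted - actual| for each half of a run."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(predicted) < 2:
        raise ValueError("need at least two experiments to split into halves")
    gaps = [abs(p - a) for p, a in zip(predicted, actual)]
    mid = len(gaps) // 2  # with odd n, the extra experiment lands in the second half
    return mean(gaps[:mid]), mean(gaps[mid:])


# Illustrative values only.
first, second = calibration_by_half(
    predicted=[1.50, 1.45, 1.40, 1.38],
    actual=[1.42, 1.41, 1.39, 1.37],
)
print(f"first-half gap {first:.3f}, second-half gap {second:.3f}")
```

With 13 experiments per half, each mean is sensitive to one or two large misses, which is one more reason to read the improvement as suggestive.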
The baseline agent found batch 8K at experiment 5, then spent 20 experiments without further improvement.
Intervention is minimal: just changes to program.md + one markdown file. No infra, no databases, no fine-tuning.
Repo: https://github.com/Hzz-Git/reflective-autoresearch
This is one run on MPS. Curious if others see similar patterns on different hardware. CUDA/H100 results especially welcome since throughput dynamics would be different.