Predict-then-attribute loop: 1.349 val_bpb in 26 experiments (vs 1.405 baseline in 25) #340

Hzz-Git · 2026-03-19T16:04:30Z

I ran a 4-arm experiment testing whether having the agent predict each outcome before a run, and attribute its prediction error afterward, improves search efficiency in this setup. This is a single run on one hardware config, so treat it as exploratory; I'm sharing it to see if the pattern holds elsewhere.

Setup: M5 Pro (18-core, 64GB), MPS backend, 5-minute training budget per experiment. Four program.md variants, all starting from the same train.py.

Arm	Strategy	Best val_bpb	Experiments
1	Original program.md	1.405	25
2	+ running summary	1.404	50
3	+ structured beliefs file	1.392	52
4	+ beliefs + prediction + attribution	1.349	26

Arm 4 reached the best config (1.349) in 26 experiments, about half as many as arms 2 and 3 (50 and 52) and roughly the same as the baseline (25). Arms 2 and 3 ran longer but didn't reach the same level. The uneven experiment counts are a limitation: ideally each arm would run the same number.
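To make the comparison concrete, here is a small sketch that computes each arm's absolute and relative val_bpb reduction against arm 1. The numbers are copied from the table above; the function name `improvement_over_baseline` is just an illustrative helper, not something from the repo.

```python
# Results copied from the table in the post: (arm, strategy, best val_bpb, experiments).
ARMS = [
    (1, "Original program.md", 1.405, 25),
    (2, "+ running summary", 1.404, 50),
    (3, "+ structured beliefs file", 1.392, 52),
    (4, "+ beliefs + prediction + attribution", 1.349, 26),
]


def improvement_over_baseline(arms, baseline_arm=1):
    """Return {arm: (absolute val_bpb reduction, relative reduction)} vs. the baseline arm."""
    baseline = next(bpb for arm, _, bpb, _ in arms if arm == baseline_arm)
    return {
        arm: (baseline - bpb, (baseline - bpb) / baseline)
        for arm, _, bpb, _ in arms
    }


if __name__ == "__main__":
    for arm, (abs_gain, rel_gain) in improvement_over_baseline(ARMS).items():
        print(f"arm {arm}: {abs_gain:.3f} lower val_bpb ({rel_gain:.1%} relative)")
```

For arm 4 this works out to a 0.056 reduction, about 4% relative to the baseline. Since each arm is a single run, these are point estimates with no variance attached.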

What the reflective agent did differently:

The arm 4 agent maintains a beliefs.md file (max 20 beliefs, rewritten rather than appended). Before each experiment it writes a prediction with its reasoning. Afterward, it compares the prediction with the actual result and identifies which belief was wrong.
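In the actual setup this loop is driven entirely by instructions in program.md, with the agent editing beliefs.md itself. Purely as an illustration of the bookkeeping, here is a minimal Python sketch. Everything in it is an assumption for exposition: the names `BeliefStore`, `Prediction`, and `attribute`, the 0.01 tolerance, the drop-oldest eviction rule, and the example numbers are hypothetical and do not come from the repo.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional

MAX_BELIEFS = 20  # cap stated in the post


@dataclass
class BeliefStore:
    """Hypothetical in-memory mirror of beliefs.md."""
    path: Path
    beliefs: List[str] = field(default_factory=list)

    def revise(self, new_belief: str, replaces: Optional[str] = None) -> None:
        """Swap out a wrong belief (or add a new one), then rewrite the file."""
        if replaces in self.beliefs:
            self.beliefs[self.beliefs.index(replaces)] = new_belief
        else:
            self.beliefs.append(new_belief)
        # Assumption: when over the cap, the oldest beliefs are dropped.
        self.beliefs = self.beliefs[-MAX_BELIEFS:]
        # Rewritten, not appended: the file only ever holds the current set.
        self.path.write_text("\n".join(f"- {b}" for b in self.beliefs) + "\n")


@dataclass
class Prediction:
    change: str          # what the experiment modifies
    predicted_bpb: float
    reasoning: str       # which beliefs the prediction rests on


def attribute(pred: Prediction, actual_bpb: float, tolerance: float = 0.01) -> Optional[float]:
    """Return the prediction gap if it exceeds the tolerance, else None."""
    gap = abs(pred.predicted_bpb - actual_bpb)
    return gap if gap > tolerance else None


if __name__ == "__main__":
    store = BeliefStore(Path(tempfile.mkdtemp()) / "beliefs.md")
    store.revise("Deeper models win at this budget")  # illustrative starting belief
    pred = Prediction("depth=3", 1.42, "Deeper models win, so shallower should hurt")
    gap = attribute(pred, actual_bpb=1.36)  # illustrative numbers, not from the run
    if gap is not None:
        # The miss is traced back to a specific belief, which is then replaced.
        store.revise("MPS throughput is the binding constraint",
                     replaces="Deeper models win at this budget")
    print(store.path.read_text())
```

The point of the sketch is the shape of the update: a large miss is tied to one named belief and that belief is overwritten, so the file stays a current model rather than a growing log.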

My working hypothesis is that forcing explicit predictions makes the agent expose its internal assumptions, and attribution turns failed runs into updates rather than just logs.

By experiment ~10, the agent had learned "MPS throughput is the binding constraint." It combined this with "softcap removal helps convergence" to reason: try a shallower model (depth=3) → more training steps in 5 minutes → wins. This hypothesis did not emerge in the other arms over the runs shown.

Prediction calibration appeared to improve over the run. Measuring gap as absolute error |predicted − actual| in val_bpb: first-half mean gap was 0.066, second-half was 0.019 (n=26 total, so treat as suggestive).
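For anyone replicating this, the calibration metric can be computed along these lines. This is a sketch under assumptions: it splits the run at the midpoint by experiment index (how the original halves were split is not stated here), the function name `calibration_by_half` is hypothetical, and the example values are placeholders rather than data from the run.

```python
from statistics import mean


def calibration_by_half(predicted, actual):
    """Mean |predicted - actual| for the first and second half of a run."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(predicted) < 2:
        raise ValueError("need at least two experiments to split into halves")
    gaps = [abs(p - a) for p, a in zip(predicted, actual)]
    mid = len(gaps) // 2  # with an odd count, the extra run lands in the second half
    return mean(gaps[:mid]), mean(gaps[mid:])


if __name__ == "__main__":
    # Placeholder val_bpb values, only to show the call shape.
    predicted = [1.50, 1.40, 1.40, 1.35]
    actual = [1.40, 1.40, 1.39, 1.35]
    first, second = calibration_by_half(predicted, actual)
    print(f"first-half mean gap: {first:.3f}, second-half mean gap: {second:.3f}")
```

A falling gap can also come from later experiments being smaller perturbations of a known-good config, so it is worth checking alongside the size of the changes being tested.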

The baseline agent found batch 8K at experiment 5, then spent 20 experiments without further improvement.

The intervention is minimal: changes to program.md plus one markdown file. No infra, no databases, no fine-tuning.

Repo: https://github.com/Hzz-Git/reflective-autoresearch

This is one run on MPS. Curious if others see similar patterns on different hardware. CUDA/H100 results especially welcome since throughput dynamics would be different.
