fix(train): save model checkpoint before evaluation to prevent losing results on eval crash #609

Open
aniruddhaadak80 wants to merge 1 commit into karpathy:master from aniruddhaadak80:fix/issue-7-checkpoint-recovery

@aniruddhaadak80

Saves the model's state dict to checkpoint_preeval.pt right after the training loop finishes and before evaluation begins. The checkpoint is deleted once evaluation succeeds; if evaluation crashes or runs out of memory, the file stays on disk for recovery and debugging (Fixes #7).
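
The description does not show how a leftover checkpoint would be consumed. One possible recovery step is sketched below, assuming a later run checks for the file before doing anything else. `load_leftover_checkpoint` is a hypothetical helper that is not part of this PR, and `pickle` stands in for `torch.load` only so the snippet runs without PyTorch.

```python
import os
import pickle


def _pickle_load(path):
    # Stand-in for torch.load(path); an assumption made for this sketch only.
    with open(path, "rb") as f:
        return pickle.load(f)


def load_leftover_checkpoint(checkpoint_path="checkpoint_preeval.pt",
                             load_fn=_pickle_load):
    """Return the saved state if a previous run left a checkpoint, else None.

    Hypothetical helper: a missing file means the previous evaluation either
    finished and cleaned up, or never reached the save step.
    """
    if not os.path.exists(checkpoint_path):
        return None
    return load_fn(checkpoint_path)
```

With PyTorch, `load_fn` could be `torch.load`, and a non-None result would typically be passed to `model.load_state_dict(...)` before re-running evaluation.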

Copilot AI review requested due to automatic review settings June 13, 2026 04:30

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a “pre-eval” checkpoint dump right before final evaluation, and deletes it afterward to keep the workspace clean if evaluation completes.

Changes:

  • Save model.state_dict() to a fixed checkpoint_preeval.pt before final eval
  • Remove the checkpoint file after eval completes
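
The two changes above amount to a save, evaluate, clean-up-on-success sequence. The following is a minimal sketch of that control flow, not the PR's actual code: `run_eval_with_checkpoint` is a hypothetical helper, and `pickle` stands in for `torch.save` so the example runs without PyTorch.

```python
import os
import pickle


def _pickle_save(state, path):
    # Stand-in for torch.save(state, path); an assumption made for this sketch only.
    with open(path, "wb") as f:
        pickle.dump(state, f)


def run_eval_with_checkpoint(state, evaluate,
                             checkpoint_path="checkpoint_preeval.pt",
                             save_fn=_pickle_save):
    """Save `state`, run `evaluate()`, and delete the checkpoint only on success.

    Hypothetical helper: if `evaluate` raises, the exception propagates and
    the file at `checkpoint_path` is left on disk for recovery.
    """
    save_fn(state, checkpoint_path)
    result = evaluate()  # may raise, e.g. on an out-of-memory error
    # This point is reached only when evaluation returned normally.
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)
    return result
```

In a training script the equivalent call would presumably pass `model.state_dict()` as `state` and `torch.save` as `save_fn`. Because the clean-up sits after the evaluation call rather than in a `finally` block, a crash skips the deletion, which is the behavior the PR relies on.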

Comment thread train.py
Comment on lines +612 to +613
checkpoint_path = "checkpoint_preeval.pt"
torch.save(model.state_dict(), checkpoint_path)
Comment thread train.py
Comment on lines 614 to +619
with autocast_ctx:
    val_bpb = evaluate_bpb(model, tokenizer, DEVICE_BATCH_SIZE)

# Clean up checkpoint if evaluation succeeded
if os.path.exists(checkpoint_path):
    os.remove(checkpoint_path)
Comment thread train.py
Comment on lines +618 to +619
if os.path.exists(checkpoint_path):
    os.remove(checkpoint_path)
Development

Successfully merging this pull request may close these issues.

Training results lost if evaluation crashes (no pre-eval checkpoint)

2 participants