fix(train,prepare): resolve open issues (#556, #547, #549, #552, #542) #604
Open
aniruddhaadak80 wants to merge 5 commits into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds multi-GPU DistributedDataParallel (DDP) support to the training script and introduces an automated “Critic Sandbox” workflow to run/score experiments and manage keep/discard decisions, alongside small robustness/formatting improvements in data preparation.
Changes:
- Add DDP initialization, rank-aware seeding, gradient-sync control for grad accumulation, and MFU FLOPs estimation updates in `train.py`.
- Introduce `sandbox.py` to compile-check, run, parse metrics, log results, and auto-commit/rollback experiments.
- Improve `prepare.py` download integrity checking and tokenizer/token-bytes handling; update experiment loop docs in `program.md`.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| train.py | Adds DDP setup, world-size-aware batching/accumulation, and architecture-based FLOPs for MFU. |
| sandbox.py | New automation script to run experiments, log metrics, and manage git keep/discard. |
| program.md | Updates the experiment loop to use the new sandbox workflow. |
| prepare.py | Adds truncated-download detection, safer tensor loading, and refactors formatting/token-bytes computation. |
Comment on lines +488 to +497
```python
# Peak BF16/FP16 Tensor Core FLOPS for different GPU architectures
gpu_peak_flops = {
    (8, 0): 312e12,  # A100 SXM/PCIe
    (8, 6): 142e12,  # RTX 3090 / A10G
    (8, 9): 330e12,  # L4 / L40 / RTX 4090
    (9, 0): 989.5e12,  # H100 SXM
    (10, 0): 2250e12,  # Blackwell sm_100 (B200 SXM)
    (10, 2): 660e12,  # Blackwell sm_120 (RTX 5090 / Workstation)
}
peak_flops = gpu_peak_flops.get(cap, 989.5e12)
```
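As a rough illustration of how a table like this feeds an MFU estimate, the sketch below looks up a peak-FLOPS figure by compute capability and divides achieved FLOPS by the aggregate peak. The helper names (`lookup_peak_flops`, `mfu`) and the sample numbers are hypothetical and not taken from `train.py`; the sketch assumes `cap` is a `(major, minor)` tuple such as the one `torch.cuda.get_device_capability()` returns. One thing worth checking: parts labeled sm_120 report capability `(12, 0)` as far as I know, so a `(10, 2)` key may never match and would silently fall through to the H100 default. Returning a "known" flag makes that kind of fallback visible.

```python
# Hypothetical helpers; names and sample figures are illustrative only.
PEAK_FLOPS = {
    (8, 0): 312e12,    # A100
    (9, 0): 989.5e12,  # H100 SXM
}


def lookup_peak_flops(cap, table=PEAK_FLOPS, default=None):
    """Return (peak_flops, known) so callers can tell a match from a fallback."""
    if cap in table:
        return table[cap], True
    return default, False


def mfu(flops_per_token, tokens_per_second, peak_flops, world_size=1):
    """Model FLOPs utilization: achieved FLOPS over aggregate peak FLOPS."""
    return flops_per_token * tokens_per_second / (peak_flops * world_size)


peak, known = lookup_peak_flops((8, 0))
print(known, mfu(6e9, 26_000, peak))
```

With the flag, the caller can log a warning when the capability is not in the table instead of reporting an MFU computed against the wrong peak.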
Comment on lines +548 to +556
```python
def ddp_dataloader_wrapper(loader, rank, world_size):
    for _ in range(rank):
        next(loader)
    while True:
        yield next(loader)
        for _ in range(world_size - 1):
            next(loader)


train_loader = ddp_dataloader_wrapper(train_loader, ddp_rank, ddp_world_size)
```
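The wrapper above amounts to strided sharding: rank `r` takes items `r`, `r + world_size`, `r + 2 * world_size`, and so on. A minimal sketch of the same idea with `itertools.islice` follows; `shard_stream` is a hypothetical name, and `count()` stands in for the real batch stream. Two caveats apply to either form: every rank still produces and discards the other ranks' batches, and with a finite loader the hand-written generator would surface `StopIteration` as a `RuntimeError` (PEP 479), whereas `islice` simply stops.

```python
from itertools import count, islice


def shard_stream(loader, rank, world_size):
    """Yield every world_size-th item of `loader`, starting at index `rank`."""
    return islice(loader, rank, None, world_size)


# Stand-in for an infinite batch stream: batch ids 0, 1, 2, ...
rank0 = list(islice(shard_stream(count(), 0, 4), 3))
rank1 = list(islice(shard_stream(count(), 1, 4), 3))
print(rank0, rank1)
```

Each rank must construct its own underlying loader with identical ordering for the shards to be disjoint.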
Comment on lines +594 to +595
```python
if ddp:
    model._orig_mod.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)
```
Comment on lines +529 to +531
```python
tokens_per_fwdbwd = DEVICE_BATCH_SIZE * MAX_SEQ_LEN * ddp_world_size
grad_accum_steps = max(1, TOTAL_BATCH_SIZE // tokens_per_fwdbwd)
TOTAL_BATCH_SIZE = grad_accum_steps * tokens_per_fwdbwd
```
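To make the rounding behavior of these three lines concrete, here is a small worked sketch. `plan_accumulation` is a hypothetical helper that mirrors the arithmetic, and the sample sizes are assumptions chosen for illustration rather than the script's actual constants. Note that the effective batch size can end up below the requested one (floor division) or above it (when a single step across all ranks already exceeds the target).

```python
def plan_accumulation(total_batch_size, device_batch_size, max_seq_len, world_size):
    """Mirror the arithmetic above; return (grad_accum_steps, effective_batch)."""
    tokens_per_fwdbwd = device_batch_size * max_seq_len * world_size
    grad_accum_steps = max(1, total_batch_size // tokens_per_fwdbwd)
    effective_batch = grad_accum_steps * tokens_per_fwdbwd
    return grad_accum_steps, effective_batch


requested = 2**19  # 524288 tokens, an assumed target
for world_size in (1, 3, 16):
    steps, effective = plan_accumulation(requested, 32, 2048, world_size)
    print(world_size, steps, effective, effective == requested)
```

Because the effective value can differ from the configured one, logging both makes runs with different GPU counts easier to compare.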
```
@@ -0,0 +1,105 @@
#!/usr/bin/env python3
```
The idea is that you are a completely autonomous researcher trying things out. The Sandbox handles the Git and execution discipline. If things work, the Sandbox keeps them. If they don't, the Sandbox discards them. If you feel like you're getting stuck in some way, review the code and brainstorm radically new directions.

**Timeout**: The Sandbox handles the timeout. If a run crashes (OOM or bug) and the Sandbox rolls back, read `run.log` to diagnose, adapt your idea, and quickly try again. If it is fundamentally broken, just skip it (the Sandbox already logged "crash") and move on to a new idea.
Comment on lines +276 to +277
```diff
 with open(path, "rb") as f:
-    return torch.load(f, map_location=device)
+    return torch.load(f, map_location=device, weights_only=True)
```
This change mitigates the risk of insecure deserialization by restricting the classes that can be loaded via `torch.load` to a safe subset. This prevents potential arbitrary code execution from malicious files.

Co-authored-by: aniruddhaadak80 <127435065+aniruddhaadak80@users.noreply.github.com>
102d953 to 2cc9630
This PR fixes multiple open issues in the repository: