Skip to content

fix(train,prepare): resolve open issues (#556, #547, #549, #552, #542)#604

Open
aniruddhaadak80 wants to merge 5 commits into
karpathy:masterfrom
aniruddhaadak80:fix/resolve-open-issues
Open

fix(train,prepare): resolve open issues (#556, #547, #549, #552, #542)#604
aniruddhaadak80 wants to merge 5 commits into
karpathy:masterfrom
aniruddhaadak80:fix/resolve-open-issues

Conversation

@aniruddhaadak80

Copy link
Copy Markdown

This PR fixes multiple open issues in the repository:

  1. Issue fix(train): warmup off-by-one in steady_state_mfu accumulation #556: Fixes the warmup off-by-one error in steady state MFU calculations by using step - 11 (since timing starts at step 11).
  2. Issue fix: GPU-aware MFU with Blackwell (sm_100/sm_120) coverage #547: Implements dynamic GPU-aware peak FLOPS mapping including Blackwell architectures (sm_100 and sm_120) instead of statically defaulting to H100 SXM.
  3. Issue fix(prepare): detect truncated shard downloads via Content-Length #549: Adds verification of Content-Length vs file size on disk in prepare.py to prevent storing truncated/corrupted shards.
  4. Issue fix(prepare): use raw bytes for BPB to avoid U+FFFD inflation (#384) #552: Uses raw token byte sizes via enc.decode_single_token_bytes() to compute BPB without U+FFFD string encoding inflation.
  5. Issue Minor cleanups: README typo and redundant .tmp check in prepare.py #542: Removes the redundant .tmp check in list_parquet_files() in prepare.py.

Copilot AI review requested due to automatic review settings June 13, 2026 02:25

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds multi-GPU DistributedDataParallel (DDP) support to the training script and introduces an automated “Critic Sandbox” workflow to run/score experiments and manage keep/discard decisions, alongside small robustness/formatting improvements in data preparation.

Changes:

  • Add DDP initialization, rank-aware seeding, gradient-sync control for grad accumulation, and MFU FLOPs estimation updates in train.py.
  • Introduce sandbox.py to compile-check, run, parse metrics, log results, and auto-commit/rollback experiments.
  • Improve prepare.py download integrity checking and tokenizer/token-bytes handling; update experiment loop docs in program.md.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File Description
train.py Adds DDP setup, world-size-aware batching/accumulation, and architecture-based FLOPs for MFU.
sandbox.py New automation script to run experiments, log metrics, and manage git keep/discard.
program.md Updates the experiment loop to use the new sandbox workflow.
prepare.py Adds truncated-download detection, safer tensor loading, and refactors formatting/token-bytes computation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread train.py Outdated
Comment on lines +488 to +497
# Peak BF16/FP16 Tensor Core FLOPS for different GPU architectures
gpu_peak_flops = {
(8, 0): 312e12, # A100 SXM/PCIe
(8, 6): 142e12, # RTX 3090 / A10G
(8, 9): 330e12, # L4 / L40 / RTX 4090
(9, 0): 989.5e12, # H100 SXM
(10, 0): 2250e12, # Blackwell sm_100 (B200 SXM)
(10, 2): 660e12, # Blackwell sm_120 (RTX 5090 / Workstation)
}
peak_flops = gpu_peak_flops.get(cap, 989.5e12)
Comment thread train.py Outdated
Comment on lines +548 to +556
def ddp_dataloader_wrapper(loader, rank, world_size):
for _ in range(rank):
next(loader)
while True:
yield next(loader)
for _ in range(world_size - 1):
next(loader)

train_loader = ddp_dataloader_wrapper(train_loader, ddp_rank, ddp_world_size)
Comment thread train.py Outdated
Comment on lines +594 to +595
if ddp:
model._orig_mod.require_backward_grad_sync = (micro_step == grad_accum_steps - 1)
Comment thread train.py Outdated
Comment on lines +529 to +531
tokens_per_fwdbwd = DEVICE_BATCH_SIZE * MAX_SEQ_LEN * ddp_world_size
grad_accum_steps = max(1, TOTAL_BATCH_SIZE // tokens_per_fwdbwd)
TOTAL_BATCH_SIZE = grad_accum_steps * tokens_per_fwdbwd
Comment thread sandbox.py
@@ -0,0 +1,105 @@
#!/usr/bin/env python3
Comment thread program.md

The idea is that you are a completely autonomous researcher trying things out. The Sandbox handles the Git and execution discipline. If things work, the Sandbox keeps them. If they don't, the Sandbox discards them. If you feel like you're getting stuck in some way, review the code and brainstorm radically new directions.

**Timeout**: The Sandbox handles the timeout. If a run crashes (OOM or bug) and the Sandbox rolled back, read `run.log` to diagnose, adapt your idea, and quickly try again. If it is fundamentally broken, just skip it—the Sandbox already logged "crash"—and move on to a new idea.
Comment thread prepare.py Outdated
Comment on lines +276 to +277
with open(path, "rb") as f:
return torch.load(f, map_location=device)
return torch.load(f, map_location=device, weights_only=True)
aniruddhaadak80 and others added 5 commits June 13, 2026 09:37
This change mitigates the risk of insecure deserialization by restricting
the classes that can be loaded via torch.load to a safe subset.
This prevents potential arbitrary code execution from malicious files.

Co-authored-by: aniruddhaadak80 <127435065+aniruddhaadak80@users.noreply.github.com>
@aniruddhaadak80 aniruddhaadak80 force-pushed the fix/resolve-open-issues branch from 102d953 to 2cc9630 Compare June 13, 2026 04:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants