feat(train): GPU-aware MFU, warmup MFU fix, experiment integrity log #623
Open
eli-labz wants to merge 1 commit into
Three targeted improvements to train.py:

fix(train): GPU-aware peak FLOPS for accurate MFU on non-H100 GPUs (karpathy#547)
- Add GPU_PEAK_FLOPS dict mapping (compute_cap) -> peak BF16 FLOPS
- Covers V100, A100, RTX 3090/A10G, L4/L40S/RTX 4090, H100, B200, RTX 5090
- Falls back to H100 value (989.5e12) for unknown GPUs
- H100_BF16_PEAK_FLOPS remains as the resolved scalar for backward compat

fix(train): warmup off-by-one in steady_state_mfu accumulation (karpathy#556)
- Training skips timing for steps 0..10 inclusive (11 steps, not 10)
- Denominator corrected from (step - 10) to (step - 11)

feat(train): experiment integrity log to detect metric gaming (karpathy#599)
- Add log_integrity() function using stdlib only (hashlib, datetime)
- Writes one line per run to integrity.log binding val_bpb to:
  - sha256[:16] of train.py and prepare.py source files
  - sha256[:16] of first 1 MB of final model weights
  - real optimizer step count from optimizer internal state (not loop var)
  - wall-clock training seconds
- Detection only, never raises, wrapped in try/except so it can never crash a run; integrity.log should be gitignored like results.tsv
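For reviewers, the two MFU fixes can be sketched roughly as follows. This is a minimal illustration, not the code in the diff: `resolve_peak_flops`, `steady_state_mfu`, and `WARMUP_STEPS` are hypothetical names, the table is deliberately partial, and only the H100 figure (989.5e12) is taken from the description above. The A100 entry is the vendor's published dense BF16 figure and is included for illustration only.

```python
# Partial, illustrative table keyed by CUDA compute capability (major, minor).
# Only the H100 value comes from the PR description; other GPUs listed in the
# PR are intentionally omitted here rather than guessed.
GPU_PEAK_FLOPS = {
    (8, 0): 312e12,    # A100 dense BF16 (vendor spec, illustrative entry)
    (9, 0): 989.5e12,  # H100 dense BF16 (value named in the PR)
}
FALLBACK_PEAK_FLOPS = 989.5e12  # unknown GPUs fall back to the H100 value

# Steps 0..10 inclusive are untimed, which is 11 steps, not 10.
WARMUP_STEPS = 11


def resolve_peak_flops(compute_cap):
    """Return peak BF16 FLOPS for a (major, minor) compute capability.

    In a real run the tuple would typically come from
    torch.cuda.get_device_capability(); it is passed in here so the
    sketch stays free of a torch dependency.
    """
    return GPU_PEAK_FLOPS.get(tuple(compute_cap), FALLBACK_PEAK_FLOPS)


def mfu(flops_per_step, step_seconds, peak_flops):
    """Model FLOPS utilization for one step, as a fraction of peak."""
    return flops_per_step / step_seconds / peak_flops


def steady_state_mfu(per_step_mfu):
    """Average MFU over timed steps only.

    per_step_mfu[i] is the MFU of step i. With `step` total steps, the
    number of timed steps is step - 11, which is the corrected denominator.
    """
    timed = per_step_mfu[WARMUP_STEPS:]
    if not timed:
        return 0.0
    return sum(timed) / len(timed)
```

With 21 steps, ten are timed; dividing their sum by `step - 10` (11) instead of `step - 11` (10) would understate the average by roughly 9%.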