Problem
After PR #2 added support for non-Hopper GPUs, the MFU (Model FLOPS Utilization) metric is now misleading on any GPU that isn't an H100.
The peak FLOPS value is hardcoded at line 491:
```python
H100_BF16_PEAK_FLOPS = 989.5e12
```
MFU divides achieved FLOPS by this peak, so overstating the peak understates MFU. On an A100 (312 TFLOPS BF16), the reported MFU would be ~3.2x too low. On an RTX 4090 (165 TFLOPS), it would be ~6x too low. This makes the MFU metric useless for comparing runs across different hardware.
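The factors quoted above are the ratio of the hardcoded peak to each GPU's actual peak. A quick check, using only the figures from this issue (the helper name `peak_ratio` is made up for illustration):

```python
H100_BF16_PEAK_FLOPS = 989.5e12  # value currently hardcoded in the script

# Dense BF16 peaks quoted in this issue.
actual_peak_flops = {"A100": 312e12, "RTX 4090": 165e12}


def peak_ratio(actual_peak: float, assumed_peak: float = H100_BF16_PEAK_FLOPS) -> float:
    """Ratio of the assumed (hardcoded) peak to the GPU's actual peak."""
    return assumed_peak / actual_peak


for name, peak in actual_peak_flops.items():
    print(f"{name}: MFU off by {peak_ratio(peak):.2f}x")
```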
Proposed Fix
Auto-detect the GPU via `torch.cuda.get_device_capability()` and look up the correct theoretical peak FLOPS from a table of known GPU architectures. Fall back to the H100 value for unknown GPUs.
This is a minimal change (~10 lines) that makes the logging output accurate on all supported hardware.
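A rough sketch of what the lookup could look like. The table and function names are placeholders, not the final PR, and the table lists only the GPUs mentioned in this issue:

```python
H100_BF16_PEAK_FLOPS = 989.5e12  # existing constant, kept as the fallback

# Hypothetical table: CUDA compute capability (major, minor) -> dense BF16 peak FLOPS.
PEAK_BF16_FLOPS_BY_CAPABILITY = {
    (8, 0): 312e12,                # A100
    (8, 9): 165e12,                # RTX 4090
    (9, 0): H100_BF16_PEAK_FLOPS,  # H100
}


def peak_flops_for_capability(capability):
    """Look up the peak for a (major, minor) tuple; unknown GPUs get the H100 value."""
    return PEAK_BF16_FLOPS_BY_CAPABILITY.get(tuple(capability), H100_BF16_PEAK_FLOPS)


def detect_peak_flops():
    """Query the current CUDA device, falling back if torch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return H100_BF16_PEAK_FLOPS
    if not torch.cuda.is_available():
        return H100_BF16_PEAK_FLOPS
    return peak_flops_for_capability(torch.cuda.get_device_capability())
```

One caveat: a compute capability is not unique to a single GPU model (8.0 also covers the A30, and 8.9 also covers the L4 and L40), so the table may eventually need to key on `torch.cuda.get_device_name()` as well.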
I'll submit a PR for this.