Summary
Enable consistent GPU metrics collection across Kubernetes, SLURM, and Flux environments using the same binary and output format. Users should be able to compare GPU performance across on-prem and cloud with identical metrics.
Motivation
HPSF TAC reviewer feedback: "what would be nice to have is a means to collect metrics in the same way across environments. Allow me to run a workload on-premises and in cloud and use the same tool to compare the GPU performance."
Tasks
- --env flag to auto-detect environment (k8s, slurm, flux, standalone)
References
- Standalone CLI:
cmd/gpu-metrics/main.go
- KEDA scaler:
cmd/keda-gpu-scaler/main.go
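A minimal sketch of how the --env auto-detection task could work, not the project's actual implementation. It assumes detection can rely on environment variables that each system normally sets for its workloads: KUBERNETES_SERVICE_HOST in Kubernetes pods, SLURM_JOB_ID in SLURM jobs, and FLUX_URI or FLUX_JOB_ID under Flux. The function name detectEnv, the "auto" default value, and the precedence order (Flux, then SLURM, then Kubernetes) are all hypothetical choices made for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// Environment names as listed in the task description.
const (
	EnvK8s        = "k8s"
	EnvSlurm      = "slurm"
	EnvFlux       = "flux"
	EnvStandalone = "standalone"
)

// detectEnv guesses the runtime environment from well-known variables.
// getenv is injected so the logic can be tested without modifying the
// real process environment. The precedence is an assumption: schedulers
// that may be nested inside another system are checked first.
func detectEnv(getenv func(string) string) string {
	switch {
	case getenv("FLUX_URI") != "" || getenv("FLUX_JOB_ID") != "":
		return EnvFlux
	case getenv("SLURM_JOB_ID") != "":
		return EnvSlurm
	case getenv("KUBERNETES_SERVICE_HOST") != "":
		return EnvK8s
	default:
		return EnvStandalone
	}
}

func main() {
	// "auto" as the default value is hypothetical; the issue only names the flag.
	env := flag.String("env", "auto", "environment: auto, k8s, slurm, flux, standalone")
	flag.Parse()

	if *env == "auto" {
		*env = detectEnv(os.Getenv)
	}
	fmt.Println("environment:", *env)
}
```

Passing an explicit value such as --env slurm would skip detection in this sketch, which gives users an override when the heuristics guess wrong (for example, a Flux instance running inside a Kubernetes pod).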