feat: cross-environment GPU metrics parity (K8s, SLURM, Flux) #54

@pmady

Summary

Enable consistent GPU metrics collection across Kubernetes, SLURM, and Flux environments using the same binary and the same output format. Users should be able to run a workload on-premises and in the cloud and compare GPU performance using identical metrics.

Motivation

HPSF TAC reviewer feedback: "what would be nice to have is a means to collect metrics in the same way across environments. Allow me to run a workload on-premises and in cloud and use the same tool to compare the GPU performance."

Tasks

  • Add --env flag to auto-detect environment (k8s, slurm, flux, standalone)
  • Include environment metadata in output (orchestrator, node, job ID)
  • Define a unified JSON schema shared by all environments
  • Write comparison documentation with examples
  • Contact NVIDIA re: go-nvml collaboration for cross-platform metrics
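
The first three tasks could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the type names (`EnvMeta`, `Record`), the JSON field names, the `auto` default for `--env`, and the detection order (Flux, then SLURM, then Kubernetes) are all assumptions. Detection relies on environment variables that each orchestrator normally sets (`FLUX_JOB_ID`, `SLURM_JOB_ID`, `KUBERNETES_SERVICE_HOST`). `NODE_NAME` is not set by Kubernetes by default and is assumed here to be injected through the downward API.

```go
package main

import (
	"encoding/json"
	"flag"
	"fmt"
	"os"
)

// EnvMeta is a hypothetical metadata block attached to every metrics record.
type EnvMeta struct {
	Orchestrator string `json:"orchestrator"` // "k8s", "slurm", "flux", or "standalone"
	Node         string `json:"node"`
	JobID        string `json:"job_id,omitempty"`
}

// Record is a hypothetical unified output record; the GPU fields are placeholders.
type Record struct {
	Env            EnvMeta `json:"env"`
	GPUIndex       int     `json:"gpu_index"`
	UtilizationPct float64 `json:"utilization_pct"`
}

// firstNonEmpty returns the first non-empty string, or "" if all are empty.
func firstNonEmpty(vals ...string) string {
	for _, v := range vals {
		if v != "" {
			return v
		}
	}
	return ""
}

// detectEnv inspects well-known environment variables through getenv.
// Flux is checked first because a Flux instance can itself run inside a
// SLURM allocation or a Kubernetes pod (ordering is an assumption).
func detectEnv(getenv func(string) string, hostname string) EnvMeta {
	switch {
	case getenv("FLUX_JOB_ID") != "":
		return EnvMeta{Orchestrator: "flux", Node: hostname, JobID: getenv("FLUX_JOB_ID")}
	case getenv("SLURM_JOB_ID") != "":
		node := firstNonEmpty(getenv("SLURMD_NODENAME"), hostname)
		return EnvMeta{Orchestrator: "slurm", Node: node, JobID: getenv("SLURM_JOB_ID")}
	case getenv("KUBERNETES_SERVICE_HOST") != "":
		// NODE_NAME is assumed to come from the downward API; fall back to the pod hostname.
		node := firstNonEmpty(getenv("NODE_NAME"), hostname)
		return EnvMeta{Orchestrator: "k8s", Node: node}
	default:
		return EnvMeta{Orchestrator: "standalone", Node: hostname}
	}
}

func main() {
	// "auto" (an assumed default) means detect; any other value overrides the result.
	envFlag := flag.String("env", "auto", "k8s, slurm, flux, standalone, or auto")
	flag.Parse()

	host, _ := os.Hostname()
	meta := detectEnv(os.Getenv, host)
	if *envFlag != "auto" {
		meta.Orchestrator = *envFlag
	}

	// The metric values here are zero placeholders; real values would come from the collector.
	out, err := json.MarshalIndent(Record{Env: meta}, "", "  ")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(string(out))
}
```

Passing `getenv` as a parameter keeps the detection logic testable without mutating the process environment, and emitting the same `Record` shape in every environment is what would make on-premises and cloud runs directly comparable.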

References

  • Standalone CLI: cmd/gpu-metrics/main.go
  • KEDA scaler: cmd/keda-gpu-scaler/main.go

Metadata

Assignees

No one assigned

Labels

enhancement (New feature or request)
