GPUDirect RDMA cannot be used with ROCm on Fedora CoreOS (and possibly other operating systems).
Output of `ucx_info -d`:
# Memory domain: self
# Component: self
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
# rkey_ptr is supported
# memory types: host (access,reg_nonblock,reg,cache)
#
# Transport: self
# Device: memory
# Type: loopback
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 19360.00 MB/sec
# latency: 0 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 8K
# am_bcopy: <= 8K
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: ep_check
# device mem_element: 0 bytes
#
#
# Memory domain: tcp
# Component: tcp
# memory types:
#
# Transport: tcp
# Device: enp2s0f4u1u2c2
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.32/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
# device mem_element: 0 bytes
#
# Transport: tcp
# Device: enp97s0np0
# Type: network
# System device: enp97s0np0 (0)
#
# capabilities:
# bandwidth: 2200.00/ppn + 0.00 MB/sec
# latency: 5203 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 0
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
# device mem_element: 0 bytes
#
# Transport: tcp
# Device: lo
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.91/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 18 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
# device mem_element: 0 bytes
#
# Transport: tcp
# Device: tunl0
# Type: network
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.60/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
# device mem_element: 0 bytes
#
#
# Connection manager: tcp
# max_conn_priv: 2064 bytes
#
# Memory domain: sysv
# Component: sysv
# allocate: unlimited
# remote key: 12 bytes
# rkey_ptr is supported
# memory types: host (access,alloc,cache)
#
# Transport: sysv
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 15360.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 16 bytes
# iface address: 8 bytes
# error handling: ep_check
# device mem_element: 0 bytes
#
#
# Memory domain: posix
# Component: posix
# allocate: <= 703227384K
# remote key: 32 bytes
# rkey_ptr is supported
# memory types: host (access,alloc,cache)
#
# Transport: posix
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 15360.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 16 bytes
# iface address: 16 bytes
# error handling: ep_check
# device mem_element: 0 bytes
#
#
# Memory domain: bnxt_re0
# Component: ib
# register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
# remote key: 8 bytes
# local memory handle is required for zcopy
# memory types: host (access,reg,cache), rocm (reg,cache)
#
# Transport: rc_verbs
# Device: bnxt_re0:1
# Type: network
# System device: bnxt_re0 (0)
#
# capabilities:
# bandwidth: 23329.25/ppn + 0.00 MB/sec
# latency: 800 + 1.000 * N nsec
# overhead: 75 nsec
# put_short: <= 96
# put_bcopy: <= 8256
# put_zcopy: <= 1G, up to 5 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 8256
# get_zcopy: 65..1G, up to 5 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 95
# am_bcopy: <= 8255
# am_zcopy: <= 8255, up to 4 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 127
# domain: cpu
# atomic_add: 64 bit
# atomic_fadd: 64 bit
# atomic_cswap: 64 bit
# connection: to ep
# device priority: 0
# device num paths: 1
# max eps: 256
# device address: 17 bytes
# ep address: 4 bytes
# error handling: peer failure, ep_check
# device mem_element: 0 bytes
#
#
# Transport: ud_verbs
# Device: bnxt_re0:1
# Type: network
# System device: bnxt_re0 (0)
#
# capabilities:
# bandwidth: 23329.25/ppn + 0.00 MB/sec
# latency: 830 nsec
# overhead: 105 nsec
# am_short: <= 88
# am_bcopy: <= 4088
# am_zcopy: <= 4088, up to 5 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 3992
# connection: to ep, to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 17 bytes
# iface address: 3 bytes
# ep address: 6 bytes
# error handling: peer failure, ep_check
# device mem_element: 0 bytes
#
#
# Connection manager: rdmacm
# max_conn_priv: 54 bytes
#
# Memory domain: rocm_cpy
# Component: rocm_cpy
# allocate: unlimited
# register: unlimited, cost: 0 nsec
# remote key: 16 bytes
# memory types: host (reg,cache), rocm (access,alloc,reg,cache,detect)
#
# Transport: rocm_copy
# Device: rocm_cpy
# Type: accelerator
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 6911.00 MB/sec
# latency: 100 nsec
# overhead: 0 nsec
# put_short: <= 4294967295
# put_zcopy: unlimited, up to 1 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_short: <= 4294967295
# get_zcopy: unlimited, up to 1 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: none
# device mem_element: 0 bytes
#
#
# Memory domain: rocm_ipc
# Component: rocm_ipc
# register: unlimited, cost: 9 nsec
# remote key: 56 bytes
# memory types: rocm (access,reg,cache)
#
# Transport: rocm_ipc
# Device: rocm_ipc
# Type: accelerator
# System device: <unknown>
#
# capabilities:
# bandwidth: 204800.00/ppn + 0.00 MB/sec
# latency: 100 nsec
# overhead: 0 nsec
# put_zcopy: 128..inf, up to 1 iov
# put_opt_zcopy_align: <= 4
# put_align_mtu: <= 4
# get_zcopy: 128..inf, up to 1 iov
# get_opt_zcopy_align: <= 4
# get_align_mtu: <= 4
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 4 bytes
# error handling: none
# device mem_element: 0 bytes
#
#
# Memory domain: cma
# Component: cma
# memory types:
#
# Transport: cma
# Device: memory
# Type: intra-node
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 11145.00 MB/sec
# latency: 80 nsec
# overhead: 2000 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 16 bytes
# iface address: 16 bytes
# error handling: peer failure, ep_check
# device mem_element: 0 bytes
Describe the bug
GPUDirect RDMA cannot be used with ROCm on Fedora CoreOS (and possibly other operating systems).
Steps to Reproduce
Run `ucx_perftest` with the following command line:

    HSA_FORCE_FINE_GRAIN_PCIE=1 UCX_ROCM_COPY_D2H_THRESH=0 UCX_ROCM_COPY_H2D_THRESH=0 UCX_ROCM_COPY_DMABUF=yes HIP_VISIBLE_DEVICES=0 UCX_IB_GPU_DIRECT_RDMA=yes UCX_TLS=rc,rocm ucx_perftest $HOST -t ucp_am_bw -m rocm -s 10M

It will fail with:

    [1781012436.055940] [ngt-003-w7900-roc-x2mfskb45hrx-node-3:29366:0] rocm_copy_md.c:425 UCX ERROR ROCm dmabuf support requested but not found
    [1781012436.061926] [ngt-003-w7900-roc-x2mfskb45hrx-node-3:29366:0] libperf.c:2081 UCX WARN ucp test failed to allocate memory
    [1781012436.062123] [ngt-003-w7900-roc-x2mfskb45hrx-node-3:29366:0] perftest_run.c:363 UCX ERROR Failed to run test: Out of memory

Setup and versions
`cat /etc/issue` or `cat /etc/redhat-release` + `uname -a`:

    AlmaLinux release 9.7 (Moss Jungle Cat)
    Linux 6.8.4-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr 4 20:45:21 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Additional information (depending on the issue)
`ucx_info -d` to show transports and devices recognized by UCX: see the listing above.
Root Cause
UCX checks `/boot/config-` for dmabuf support at https://github.com/openucx/ucx/blob/master/src/uct/rocm/base/rocm_base.c#L300. This file does not exist in Fedora CoreOS.
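
To confirm this on a given host, the lookup can be mimicked outside UCX. The sketch below is not UCX code. It assumes the path is completed with the running kernel release (what `uname -r` prints), and the helper names `boot_config_path` and `boot_config_exists` are hypothetical.

```python
import os
import platform


def boot_config_path(release=None, boot_dir="/boot"):
    """Return the kernel config path under boot_dir for a kernel release.

    Assumption: the file is named config-<release>, where <release> is
    what `uname -r` prints (platform.release() in Python).
    """
    if release is None:
        release = platform.release()
    return os.path.join(boot_dir, "config-" + release)


def boot_config_exists(release=None, boot_dir="/boot"):
    """True if the kernel config is a regular file readable by this process."""
    path = boot_config_path(release, boot_dir)
    return os.path.isfile(path) and os.access(path, os.R_OK)


if __name__ == "__main__":
    path = boot_config_path()
    state = "found" if boot_config_exists() else "missing"
    print(f"{path}: {state}")
```

On a host where the file is absent, as reported here for Fedora CoreOS, the script prints the path followed by `missing`. Inside a container the result also depends on whether the host's `/boot` is mounted into it.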
Fix
Check other locations for the config file, as already implemented in ROCm (RCCL) at https://github.com/ROCm/rocm-systems/blob/develop/projects/rccl/src/misc/rocmwrap.cc#L282.
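
A minimal sketch of such a fallback search, written in Python for brevity rather than in UCX's C. The candidate paths and the two kernel options are assumptions for illustration and are not copied from the linked RCCL code. `find_kernel_config` and `dmabuf_options_enabled` are hypothetical names.

```python
import gzip
import os
import platform

# Assumed candidate locations; {rel} is the kernel release (`uname -r`).
# /boot stays first so hosts that already work keep their behaviour.
CANDIDATES = (
    "/boot/config-{rel}",
    "/usr/lib/modules/{rel}/config",
    "/lib/modules/{rel}/build/.config",
    "/proc/config.gz",
)

# Assumed kernel options taken to indicate dmabuf peer-to-peer support.
REQUIRED = ("CONFIG_DMABUF_MOVE_NOTIFY=y", "CONFIG_PCI_P2PDMA=y")


def find_kernel_config(release=None, root="/", candidates=CANDIDATES):
    """Return the first candidate config file that exists, or None."""
    rel = release or platform.release()
    for template in candidates:
        path = os.path.join(root, template.format(rel=rel).lstrip("/"))
        if os.path.isfile(path):
            return path
    return None


def dmabuf_options_enabled(path, required=REQUIRED):
    """True if every required option appears as its own line in the file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        lines = {line.strip() for line in fh}
    return all(opt in lines for opt in required)


if __name__ == "__main__":
    cfg = find_kernel_config()
    if cfg is None:
        print("no kernel config found")
    elif dmabuf_options_enabled(cfg):
        print(f"{cfg}: dmabuf options enabled")
    else:
        print(f"{cfg}: dmabuf options missing")
```

The search stops at the first file that exists, so the order of `CANDIDATES` decides which config wins when several are present. The `root` parameter is there only so the lookup can be exercised against a scratch directory.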