GPUDirect RDMA is not supported on Fedora CoreOS #11540

@hahahannes

Describe the bug

GPUDirect RDMA cannot be used with ROCm on Fedora CoreOS (and possibly other operating systems).

Steps to Reproduce

  • Command line
  1. Install UCX on CoreOS with ROCm
git clone https://github.com/openucx/ucx.git
cd ucx
git checkout v1.20.0
./autogen.sh
mkdir build
cd build
../configure \
        --prefix=/usr/local/ucx \
        --with-rocm=/opt/rocm \
        --with-go=no \
        --with-verbs \
        --without-java \
        --disable-doxygen-doc \
        --enable-optimizations \
        --enable-mt \
        --disable-debug \
        --disable-logging \
        --disable-assertions \
        --disable-params-check \
        --disable-dependency-tracking \
        --enable-cma \
        --with-rc \
        --with-ud \
        --with-dc \
        --with-mlx5-dv \
        --with-ib-hw-tm \
        --with-dm \
        --with-avx \
        --without-cm \
        --with-rdmacm
sudo make && sudo make install
  2. Run ucx_perftest
HSA_FORCE_FINE_GRAIN_PCIE=1 UCX_ROCM_COPY_D2H_THRESH=0 UCX_ROCM_COPY_H2D_THRESH=0 UCX_ROCM_COPY_DMABUF=yes HIP_VISIBLE_DEVICES=0 UCX_IB_GPU_DIRECT_RDMA=yes UCX_TLS=rc,rocm ucx_perftest $HOST -t ucp_am_bw -m rocm -s 10M

It will fail with:

[1781012436.055940] [ngt-003-w7900-roc-x2mfskb45hrx-node-3:29366:0] rocm_copy_md.c:425 UCX ERROR ROCm dmabuf support requested but not found
[1781012436.061926] [ngt-003-w7900-roc-x2mfskb45hrx-node-3:29366:0] libperf.c:2081 UCX WARN ucp test failed to allocate memory
[1781012436.062123] [ngt-003-w7900-roc-x2mfskb45hrx-node-3:29366:0] perftest_run.c:363 UCX ERROR Failed to run test: Out of memory
  • UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
# Library version: 1.20.0
# Library path: /usr/local/ucx/lib/libucs.so.0
# API headers version: 1.20.0
# Git branch '', revision 4b7a6ca
# Configured with: --prefix=/usr/local/ucx --with-rocm=/opt/rocm --with-go=no --with-verbs --without-java --disable-doxygen-doc --enable-optimizations --enable-mt --disable-debug --disable-logging --disable-assertions --disable-params-check --disable-dependency-tracking --enable-cma --with-rc --with-ud --with-dc --with-mlx5-dv --with-ib-hw-tm --with-dm --with-avx --without-cm --with-rdmacm
  • Any UCX environment variables used
HSA_FORCE_FINE_GRAIN_PCIE=1 UCX_ROCM_COPY_D2H_THRESH=0 UCX_ROCM_COPY_H2D_THRESH=0 UCX_ROCM_COPY_DMABUF=yes HIP_VISIBLE_DEVICES=0 UCX_IB_GPU_DIRECT_RDMA=yes UCX_TLS=rc,rocm

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    • cat /etc/issue or cat /etc/redhat-release + uname -a
AlmaLinux release 9.7 (Moss Jungle Cat)
Linux  6.8.4-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr  4 20:45:21 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • For Nvidia Bluefield SmartNIC include cat /etc/mlnx-release (the string identifies software and firmware setup)
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • rpm -q rdma-core or rpm -q libibverbs
rdma-core-61.0-2.el9.x86_64
    - or: MLNX_OFED version `ofed_info -s`
  • HW information from ibstat or ibv_devinfo -vv command
  • For GPU related issues:
    • AMD W7900
ROCm: 7.1.0
Driver: 6.16.6 (30.20)

Additional information (depending on the issue)

  • OpenMPI version
  • Output of ucx_info -d to show transports and devices recognized by UCX
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#           rkey_ptr is supported
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: self
#         Device: memory
#           Type: loopback
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 19360.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#   device mem_element: 0 bytes
#
#
# Memory domain: tcp
#     Component: tcp
#         memory types: 
#
#      Transport: tcp
#         Device: enp2s0f4u1u2c2
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.32/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#   device mem_element: 0 bytes
#
#      Transport: tcp
#         Device: enp97s0np0
#           Type: network
#  System device: enp97s0np0 (0)
#
#      capabilities:
#            bandwidth: 2200.00/ppn + 0.00 MB/sec
#              latency: 5203 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 0
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#   device mem_element: 0 bytes
#
#      Transport: tcp
#         Device: lo
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#   device mem_element: 0 bytes
#
#      Transport: tcp
#         Device: tunl0
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.60/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#   device mem_element: 0 bytes
#
#
# Connection manager: tcp
#      max_conn_priv: 2064 bytes
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#         memory types: host (access,alloc,cache)
#
#      Transport: sysv
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 15360.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 16 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#   device mem_element: 0 bytes
#
#
# Memory domain: posix
#     Component: posix
#             allocate: <= 703227384K
#           remote key: 32 bytes
#           rkey_ptr is supported
#         memory types: host (access,alloc,cache)
#
#      Transport: posix
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 15360.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 16 bytes
#        iface address: 16 bytes
#       error handling: ep_check
#   device mem_element: 0 bytes
#
#
# Memory domain: bnxt_re0
#     Component: ib
#             register: unlimited, dmabuf, cost: 16000 + 0.060 * N nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#         memory types: host (access,reg,cache), rocm (reg,cache)
#
#      Transport: rc_verbs
#         Device: bnxt_re0:1
#           Type: network
#  System device: bnxt_re0 (0)
#
#      capabilities:
#            bandwidth: 23329.25/ppn + 0.00 MB/sec
#              latency: 800 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 96
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 5 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 5 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 95
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 4 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 127
#               domain: cpu
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 0
#     device num paths: 1
#              max eps: 256
#       device address: 17 bytes
#           ep address: 4 bytes
#       error handling: peer failure, ep_check
#   device mem_element: 0 bytes
#
#
#      Transport: ud_verbs
#         Device: bnxt_re0:1
#           Type: network
#  System device: bnxt_re0 (0)
#
#      capabilities:
#            bandwidth: 23329.25/ppn + 0.00 MB/sec
#              latency: 830 nsec
#             overhead: 105 nsec
#             am_short: <= 88
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 3992
#           connection: to ep, to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 17 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#   device mem_element: 0 bytes
#
#
# Connection manager: rdmacm
#      max_conn_priv: 54 bytes
#
# Memory domain: rocm_cpy
#     Component: rocm_cpy
#             allocate: unlimited
#             register: unlimited, cost: 0 nsec
#           remote key: 16 bytes
#         memory types: host (reg,cache), rocm (access,alloc,reg,cache,detect)
#
#      Transport: rocm_copy
#         Device: rocm_cpy
#           Type: accelerator
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 6911.00 MB/sec
#              latency: 100 nsec
#             overhead: 0 nsec
#            put_short: <= 4294967295
#            put_zcopy: unlimited, up to 1 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_short: <= 4294967295
#            get_zcopy: unlimited, up to 1 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: none
#   device mem_element: 0 bytes
#
#
# Memory domain: rocm_ipc
#     Component: rocm_ipc
#             register: unlimited, cost: 9 nsec
#           remote key: 56 bytes
#         memory types: rocm (access,reg,cache)
#
#      Transport: rocm_ipc
#         Device: rocm_ipc
#           Type: accelerator
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 204800.00/ppn + 0.00 MB/sec
#              latency: 100 nsec
#             overhead: 0 nsec
#            put_zcopy: 128..inf, up to 1 iov
#  put_opt_zcopy_align: <= 4
#        put_align_mtu: <= 4
#            get_zcopy: 128..inf, up to 1 iov
#  get_opt_zcopy_align: <= 4
#        get_align_mtu: <= 4
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: none
#   device mem_element: 0 bytes
#
#
# Memory domain: cma
#     Component: cma
#         memory types: 
#
#      Transport: cma
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 16 bytes
#        iface address: 16 bytes
#       error handling: peer failure, ep_check
#   device mem_element: 0 bytes
  • Configure result - config.log
  • Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"

Root Cause

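UCX detects dmabuf support by reading the kernel config file under /boot/config- (see https://github.com/openucx/ucx/blob/master/src/uct/rocm/base/rocm_base.c#L300).
This file does not exist on Fedora CoreOS, so the check fails and UCX reports "ROCm dmabuf support requested but not found".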

Fix

Check other locations for the kernel config file, as already implemented in ROCm (RCCL) at https://github.com/ROCm/rocm-systems/blob/develop/projects/rccl/src/misc/rocmwrap.cc#L282.
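
The proposed fallback can be sketched in shell as a quick diagnostic. This is only an illustration under stated assumptions, not the UCX or RCCL implementation: the helper names `find_kernel_config` and `config_has_option` are hypothetical, the file name suffix is assumed to be the running kernel release from `uname -r`, the candidate list (`/boot`, `/usr/lib/modules`, `/lib/modules`) is an assumption about where distributions such as Fedora CoreOS keep the config, and the two option names are assumed to be the ones relevant to dmabuf.

```shell
#!/bin/sh
# Hypothetical helpers; paths and option names below are assumptions.

# find_kernel_config RELEASE [ROOT]
# Print the first readable kernel config among the candidate locations.
# ROOT is an optional prefix (empty means the real filesystem root).
find_kernel_config() {
    release=$1
    root=${2:-}
    for candidate in \
        "$root/boot/config-$release" \
        "$root/usr/lib/modules/$release/config" \
        "$root/lib/modules/$release/config"; do
        if [ -r "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# config_has_option FILE OPTION
# Succeed only if FILE has a line that is exactly "OPTION=y".
config_has_option() {
    grep -qxF "$2=y" "$1"
}

# Usage: report what the running kernel provides.
if cfg=$(find_kernel_config "$(uname -r)"); then
    echo "kernel config: $cfg"
    for opt in CONFIG_PCI_P2PDMA CONFIG_DMABUF_MOVE_NOTIFY; do
        if config_has_option "$cfg" "$opt"; then
            echo "$opt=y"
        else
            echo "$opt is not enabled"
        fi
    done
else
    echo "no kernel config file found in the candidate locations"
fi
```

Trying the locations in order keeps the current behaviour on systems that do ship the file under /boot and only falls through to the other paths when it is missing.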
