Skip to content

Multi-GPU HIP distributed factorization fails on AMD MI300A with Multiple MPI ranks #141

Description

@safa0rhan

Hi,

I’m reporting a distributed multi-GPU failure in STRUMPACK on AMD/ROCm. I tried to reduce this carefully and rule out local setup mistakes before posting.

Summary

I can build STRUMPACK successfully with:

  • MPI
  • HIP
  • OpenMP
  • ParMETIS
  • SLATE

and:

  • 1 MPI rank / 1 GPU works
  • 2 MPI ranks / 2 GPUs fail during distributed GPU factorization

I also tested the fix from issue #126 (8f095d2a), but it does not resolve the problem in my case.

The failure mode depends on OMP_NUM_THREADS:

  • OMP_NUM_THREADS=8
    • fails in src/dense/HIPWrapper.cpp:351
    • hipBLAS assertion failed: 1
  • OMP_NUM_THREADS=1
    • segfault in libblaspp.so / libslate.so
    • specifically in blas::device_free, called from:
      • slate::MatrixStorage<double>::allocateBatchArrays(...)
      • slate::impl::getrf_tntpiv(...)

So this appears to be a distributed ROCm multi-GPU failure somewhere in the STRUMPACK + SLATE/BLAS++ path.

Environment

Hardware:

  • AMD Instinct MI300A
  • gfx942

Software stack:

  • GCC 14
  • ROCm 7.2
  • OpenMPI 5.0
  • METIS 5.1
  • ParMETIS 4.0
  • AOCL 5.2
  • ScaLAPACK
  • HDF5 1.14.x

STRUMPACK:

  • built from tag v8.0.0
  • tested also after cherry-picking:
    • 8f095d2a Set device cuda/hip for each openmp thread. This isn't ideal as it assumes the threads are reused.

SLATE stack:

  • locally built from v2025.05.28
  • matching BLAS++ / LAPACK++

STRUMPACK configure highlights:

  • STRUMPACK_USE_MPI=ON
  • STRUMPACK_USE_HIP=ON
  • STRUMPACK_USE_OPENMP=ON
  • STRUMPACK_USE_CUDA=OFF
  • TPL_ENABLE_SLATE=ON
  • TPL_ENABLE_PARMETIS=ON
  • HIP architecture set to gfx942

Note about ROCm 7.2 source compatibility

To build v8.0.0 against ROCm 7.2, I had to patch STRUMPACK source to replace the older hipBLAS complex typedef names:

  • hipblasComplex -> hipFloatComplex
  • hipblasDoubleComplex -> hipDoubleComplex

This was only a local source compatibility patch in STRUMPACK, not a modification of any vendor libraries.

What works

1 MPI rank / 1 GPU

The benchmark runs successfully with GPU enabled.

This validates:

  • HIP build works
  • runtime libraries are correct
  • single-rank GPU factorization/solve works

MPI sanity checks

Plain MPI works:
mpirun -np 2 hostname

Standalone MPI_Init_thread(MPI_THREAD_MULTIPLE) test works on 2 ranks and returns:
provided=3

So the OpenMPI stack on this system can provide MPI_THREAD_MULTIPLE.

Reproducer

Run on one node with 2 MPI ranks and 2 GPUs visible, then:

mpirun -np 2 ./strumpack_benchmark <matrix_dir> --sp_enable_gpu --sp_reordering_method parmetis

The benchmark gets through:

  • matrix read
  • reorder
  • symbolic factorization

and fails in numerical factorization.

Failure mode 1: OMP_NUM_THREADS=8

Run:
export OMP_NUM_THREADS=8
unset SLATE_GPU_AWARE_MPI
mpirun -np 2 ./strumpack_benchmark <matrix_dir> --sp_enable_gpu --sp_reordering_method parmetis

Failure:
hipBLAS assertion failed: 1 .../src/dense/HIPWrapper.cpp 351

The failing line is the hipblasDgemm(...) call in HIPWrapper.cpp.

Failure mode 2: OMP_NUM_THREADS=1

Run:
export OMP_NUM_THREADS=1
unset SLATE_GPU_AWARE_MPI
mpirun -np 2 ./strumpack_benchmark <matrix_dir> --sp_enable_gpu --sp_reordering_method parmetis

Failure:

  • segfault in libblaspp.so
  • blas::device_free(...)
  • called from:
    • slate::MatrixStorage<double>::allocateBatchArrays(...)
    • slate::impl::getrf_tntpiv(...)

So with 1 OpenMP thread, the failure moves from hipblasDgemm to the SLATE / BLAS++ GPU memory-management path.

SLATE_GPU_AWARE_MPI test

I tested both:

  • export SLATE_GPU_AWARE_MPI=0
  • export SLATE_GPU_AWARE_MPI=1

This did not resolve the OMP_NUM_THREADS=1 crash. The failure remained in the same libblaspp / libslate path.

Issue #126 relation

I checked issue #126 and confirmed that my original v8.0.0 tree did not include commit 8f095d2a.

I cherry-picked 8f095d2a, rebuilt, reinstalled, and reran the tests above.

Result:

  • OMP_NUM_THREADS=8: still fails with hipBLAS assertion failed: 1 at HIPWrapper.cpp:351
  • OMP_NUM_THREADS=1: still segfaults in libblaspp / libslate

So 8f095d2a does not fix this ROCm/MI300A case.

Why I think this is a real upstream distributed GPU issue

At this point I believe I have ruled out the common local mistakes:

The remaining failures are now inside:

  • STRUMPACK HIP wrapper (hipblasDgemm)
  • or SLATE/BLAS++ GPU memory handling (device_free, allocateBatchArrays, getrf_tntpiv)

Questions

  1. Is distributed multi-GPU HIP on AMD/ROCm expected to work with:

    • STRUMPACK v8.0.0
    • SLATE v2025.05.28
    • ROCm 7.2
    • MI300A / gfx942
    • OpenMPI 5.0
  2. Does the hipBLAS assertion failed: 1 at HIPWrapper.cpp:351 suggest a known ROCm multi-rank handle/context issue in STRUMPACK?

  3. Does the OMP_NUM_THREADS=1 segfault in blas::device_free / slate::MatrixStorage::allocateBatchArrays suggest a BLAS++ / SLATE ROCm multi-GPU problem rather than STRUMPACK proper?

  4. Is there a recommended STRUMPACK / SLATE / BLAS++ version combination for ROCm multi-GPU that is known to be more stable than:

    • STRUMPACK v8.0.0
    • SLATE v2025.05.28

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions