Hi,
I’m reporting a distributed multi-GPU failure in STRUMPACK on AMD/ROCm. I tried to reduce this carefully and rule out local setup mistakes before posting.
Summary
I can build STRUMPACK successfully with:
- MPI
- HIP
- OpenMP
- ParMETIS
- SLATE
and:
- 1 MPI rank / 1 GPU works
- 2 MPI ranks / 2 GPUs fail during distributed GPU factorization
I also tested the fix from issue #126 (8f095d2a), but it does not resolve the problem in my case.
The failure mode depends on OMP_NUM_THREADS:
OMP_NUM_THREADS=8
- fails in src/dense/HIPWrapper.cpp:351 with "hipBLAS assertion failed: 1"
OMP_NUM_THREADS=1
- segfault in libblaspp.so / libslate.so
- specifically in blas::device_free, called from:
  slate::MatrixStorage<double>::allocateBatchArrays(...)
  slate::impl::getrf_tntpiv(...)
So this appears to be a distributed ROCm multi-GPU failure somewhere in the STRUMPACK + SLATE/BLAS++ path.
Environment
Hardware:
- AMD Instinct MI300A (gfx942)
Software stack:
- GCC 14
- ROCm 7.2
- OpenMPI 5.0
- METIS 5.1
- ParMETIS 4.0
- AOCL 5.2
- ScaLAPACK
- HDF5 1.14.x
STRUMPACK:
- built from tag v8.0.0
- also tested after cherry-picking:
  8f095d2a Set device cuda/hip for each openmp thread. This isn't ideal as it assumes the threads are reused.
SLATE stack:
- locally built from v2025.05.28
- matching BLAS++ / LAPACK++
STRUMPACK configure highlights:
STRUMPACK_USE_MPI=ON
STRUMPACK_USE_HIP=ON
STRUMPACK_USE_OPENMP=ON
STRUMPACK_USE_CUDA=OFF
TPL_ENABLE_SLATE=ON
TPL_ENABLE_PARMETIS=ON
- HIP architecture set to gfx942
Note about ROCm 7.2 source compatibility
To build v8.0.0 against ROCm 7.2, I had to patch STRUMPACK source to replace the older hipBLAS complex typedef names:
hipblasComplex -> hipFloatComplex
hipblasDoubleComplex -> hipDoubleComplex
This was only a local source compatibility patch in STRUMPACK, not a modification of any vendor libraries.
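For reference, the rename described above can be sketched as a small shell helper. This is an illustration only, not necessarily the exact patch applied: the function name patch_hipblas_complex is hypothetical, and GNU sed is assumed (for in-place editing with -i and the \b word boundary).

```shell
# Sketch: rename the older hipBLAS complex typedefs in the given files.
# Assumption: GNU sed. patch_hipblas_complex is a hypothetical helper name.
patch_hipblas_complex() {
    # "$@" is the list of source files to rewrite in place.
    # Word boundaries keep longer identifiers that merely contain
    # these names from being touched.
    sed -i -E \
        -e 's/\bhipblasDoubleComplex\b/hipDoubleComplex/g' \
        -e 's/\bhipblasComplex\b/hipFloatComplex/g' \
        "$@"
}
```

It would be called with the STRUMPACK source files that still use the old names, for example the files reported by grep -rl for hipblasComplex and hipblasDoubleComplex in the source tree.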
What works
1 MPI rank / 1 GPU
The benchmark runs successfully with GPU enabled.
This validates:
- HIP build works
- runtime libraries are correct
- single-rank GPU factorization/solve works
MPI sanity checks
Plain MPI works:
mpirun -np 2 hostname
Standalone MPI_Init_thread(MPI_THREAD_MULTIPLE) test works on 2 ranks and returns:
provided=3
So the OpenMPI stack on this system can provide MPI_THREAD_MULTIPLE.
Reproducer
Run on one node with 2 MPI ranks and 2 GPUs visible, then:
mpirun -np 2 ./strumpack_benchmark <matrix_dir> --sp_enable_gpu --sp_reordering_method parmetis
The benchmark gets through:
- matrix read
- reorder
- symbolic factorization
and fails in numerical factorization.
Failure mode 1: OMP_NUM_THREADS=8
Run:
export OMP_NUM_THREADS=8
unset SLATE_GPU_AWARE_MPI
mpirun -np 2 ./strumpack_benchmark <matrix_dir> --sp_enable_gpu --sp_reordering_method parmetis
Failure:
hipBLAS assertion failed: 1 .../src/dense/HIPWrapper.cpp 351
The failing line is the hipblasDgemm(...) call in HIPWrapper.cpp.
Failure mode 2: OMP_NUM_THREADS=1
Run:
export OMP_NUM_THREADS=1
unset SLATE_GPU_AWARE_MPI
mpirun -np 2 ./strumpack_benchmark <matrix_dir> --sp_enable_gpu --sp_reordering_method parmetis
Failure:
- segfault in libblaspp.so, in blas::device_free(...)
- called from:
  slate::MatrixStorage<double>::allocateBatchArrays(...)
  slate::impl::getrf_tntpiv(...)
So with 1 OpenMP thread, the failure moves from hipblasDgemm to the SLATE / BLAS++ GPU memory-management path.
SLATE_GPU_AWARE_MPI test
I tested both:
export SLATE_GPU_AWARE_MPI=0
export SLATE_GPU_AWARE_MPI=1
This did not resolve the OMP_NUM_THREADS=1 crash. The failure remained in the same libblaspp / libslate path.
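The combinations of OMP_NUM_THREADS and SLATE_GPU_AWARE_MPI can be scripted so that each case runs with an identical command line. The following is a minimal sketch, not the script used for the runs above: run_case, run_matrix and DRY_RUN are hypothetical names, env -u is assumed to be available (GNU and BSD env both provide it), and the loop covers more combinations than are reported here.

```shell
# Sketch: run (or just print) one reproducer case.
#   $1 = OMP_NUM_THREADS value
#   $2 = SLATE_GPU_AWARE_MPI value, or the word "unset"
#   $3 = matrix directory
# With DRY_RUN=1 the command line is printed instead of executed.
run_case() {
    threads=$1; gpu_aware=$2; matrix_dir=$3
    if [ "$gpu_aware" = "unset" ]; then
        set -- env -u SLATE_GPU_AWARE_MPI "OMP_NUM_THREADS=$threads"
    else
        set -- env "OMP_NUM_THREADS=$threads" "SLATE_GPU_AWARE_MPI=$gpu_aware"
    fi
    # Same benchmark invocation as in the reproducer section.
    set -- "$@" mpirun -np 2 ./strumpack_benchmark "$matrix_dir" \
        --sp_enable_gpu --sp_reordering_method parmetis
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Sketch: iterate over both thread counts and all three settings.
run_matrix() {
    for t in 8 1; do
        for g in unset 0 1; do
            run_case "$t" "$g" "$1"
        done
    done
}
```

Setting DRY_RUN=1 before calling run_matrix with a matrix directory prints the six command lines without launching anything, which is a cheap way to check the environment each case would see.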
Issue #126 relation
I checked issue #126 and confirmed that my original v8.0.0 tree did not include commit 8f095d2a.
I cherry-picked 8f095d2a, rebuilt, reinstalled, and reran the tests above.
Result:
- OMP_NUM_THREADS=8: still fails with "hipBLAS assertion failed: 1" at HIPWrapper.cpp:351
- OMP_NUM_THREADS=1: still segfaults in libblaspp / libslate
So 8f095d2a does not fix this ROCm/MI300A case.
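When comparing runs before and after the cherry-pick, the two failure signatures can be told apart from a captured log. The helper below is a hypothetical sketch: classify_failure and its labels are made-up names, and it assumes the log contains either the hipBLAS assertion text or a backtrace that mentions device_free.

```shell
# Sketch: label a captured run log by the failure signature it contains.
# Assumption: the log holds stdout/stderr of one benchmark run, including
# any backtrace printed on the segfault.
classify_failure() {
    log=$1
    if grep -q 'hipBLAS assertion failed' "$log"; then
        echo "hipblas-assert"          # failure mode 1
    elif grep -q 'device_free' "$log"; then
        echo "device-free-segfault"    # failure mode 2
    else
        echo "other"
    fi
}
```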
Why I think this is a real upstream distributed GPU issue
At this point I believe I have ruled out the common local mistakes:
The remaining failures are now inside:
- STRUMPACK HIP wrapper (hipblasDgemm)
- or SLATE/BLAS++ GPU memory handling (device_free, allocateBatchArrays, getrf_tntpiv)
Questions
- Is distributed multi-GPU HIP on AMD/ROCm expected to work with:
  - STRUMPACK v8.0.0
  - SLATE v2025.05.28
  - ROCm 7.2
  - MI300A / gfx942
  - OpenMPI 5.0
- Does the "hipBLAS assertion failed: 1" at HIPWrapper.cpp:351 suggest a known ROCm multi-rank handle/context issue in STRUMPACK?
- Does the OMP_NUM_THREADS=1 segfault in blas::device_free / slate::MatrixStorage::allocateBatchArrays suggest a BLAS++ / SLATE ROCm multi-GPU problem rather than STRUMPACK proper?
- Is there a recommended STRUMPACK / SLATE / BLAS++ version combination for ROCm multi-GPU that is known to be more stable than:
  - STRUMPACK v8.0.0
  - SLATE v2025.05.28