Multi-GPU HIP distributed factorization fails on AMD MI300A with Multiple MPI ranks

Hi,

I’m reporting a distributed multi-GPU failure in STRUMPACK on AMD/ROCm. I tried to reduce this carefully and rule out local setup mistakes before posting.

## Summary

I can build STRUMPACK successfully with:
- MPI
- HIP
- OpenMP
- ParMETIS
- SLATE

and:

- 1 MPI rank / 1 GPU works
- 2 MPI ranks / 2 GPUs fail during distributed GPU factorization

I also tested the fix from issue #126 (`8f095d2a`), but it does not resolve the problem in my case.

The failure mode depends on `OMP_NUM_THREADS`:

- `OMP_NUM_THREADS=8`
  - fails in `src/dense/HIPWrapper.cpp:351`
  - `hipBLAS assertion failed: 1`
- `OMP_NUM_THREADS=1`
  - segfault in `libblaspp.so` / `libslate.so`
  - specifically in `blas::device_free`, called from:
    - `slate::MatrixStorage<double>::allocateBatchArrays(...)`
    - `slate::impl::getrf_tntpiv(...)`

So this appears to be a distributed ROCm multi-GPU failure somewhere in the STRUMPACK + SLATE/BLAS++ path.

## Environment

Hardware:
- AMD Instinct MI300A
- `gfx942`

Software stack:
- GCC 14
- ROCm 7.2
- OpenMPI 5.0
- METIS 5.1
- ParMETIS 4.0
- AOCL 5.2
- ScaLAPACK
- HDF5 1.14.x

STRUMPACK:
- built from tag `v8.0.0`
- tested also after cherry-picking:
  - `8f095d2a Set device cuda/hip for each openmp thread. This isn't ideal as it assumes the threads are reused.`

SLATE stack:
- locally built from `v2025.05.28`
- matching BLAS++ / LAPACK++

STRUMPACK configure highlights:
- `STRUMPACK_USE_MPI=ON`
- `STRUMPACK_USE_HIP=ON`
- `STRUMPACK_USE_OPENMP=ON`
- `STRUMPACK_USE_CUDA=OFF`
- `TPL_ENABLE_SLATE=ON`
- `TPL_ENABLE_PARMETIS=ON`
- HIP architecture set to `gfx942`

## Note about ROCm 7.2 source compatibility

To build `v8.0.0` against ROCm 7.2, I had to patch STRUMPACK source to replace the older hipBLAS complex typedef names:

- `hipblasComplex` -> `hipFloatComplex`
- `hipblasDoubleComplex` -> `hipDoubleComplex`

This was only a local source compatibility patch in STRUMPACK, not a modification of any vendor libraries.

## What works

### 1 MPI rank / 1 GPU
The benchmark runs successfully with GPU enabled.

This validates:
- HIP build works
- runtime libraries are correct
- single-rank GPU factorization/solve works

### MPI sanity checks
Plain MPI works:
`mpirun -np 2 hostname`

Standalone `MPI_Init_thread(MPI_THREAD_MULTIPLE)` test works on 2 ranks and returns:
`provided=3`

So the OpenMPI stack on this system can provide `MPI_THREAD_MULTIPLE`.

## Reproducer

Run on one node with 2 MPI ranks and 2 GPUs visible, then:

`mpirun -np 2 ./strumpack_benchmark <matrix_dir> --sp_enable_gpu --sp_reordering_method parmetis`

The benchmark gets through:
- matrix read
- reorder
- symbolic factorization

and fails in numerical factorization.

## Failure mode 1: `OMP_NUM_THREADS=8`

Run:
`export OMP_NUM_THREADS=8`  
`unset SLATE_GPU_AWARE_MPI`  
`mpirun -np 2 ./strumpack_benchmark <matrix_dir> --sp_enable_gpu --sp_reordering_method parmetis`

Failure:
`hipBLAS assertion failed: 1 .../src/dense/HIPWrapper.cpp 351`

The failing line is the `hipblasDgemm(...)` call in `HIPWrapper.cpp`.

## Failure mode 2: `OMP_NUM_THREADS=1`

Run:
`export OMP_NUM_THREADS=1`  
`unset SLATE_GPU_AWARE_MPI`  
`mpirun -np 2 ./strumpack_benchmark <matrix_dir> --sp_enable_gpu --sp_reordering_method parmetis`

Failure:
- segfault in `libblaspp.so`
- `blas::device_free(...)`
- called from:
  - `slate::MatrixStorage<double>::allocateBatchArrays(...)`
  - `slate::impl::getrf_tntpiv(...)`

So with 1 OpenMP thread, the failure moves from `hipblasDgemm` to the SLATE / BLAS++ GPU memory-management path.

## `SLATE_GPU_AWARE_MPI` test

I tested both:
- `export SLATE_GPU_AWARE_MPI=0`
- `export SLATE_GPU_AWARE_MPI=1`

This did not resolve the `OMP_NUM_THREADS=1` crash. The failure remained in the same `libblaspp` / `libslate` path.

## Issue #126 relation

I checked issue #126 and confirmed that my original `v8.0.0` tree did not include commit `8f095d2a`.

I cherry-picked `8f095d2a`, rebuilt, reinstalled, and reran the tests above.

Result:
- `OMP_NUM_THREADS=8`: still fails with `hipBLAS assertion failed: 1` at `HIPWrapper.cpp:351`
- `OMP_NUM_THREADS=1`: still segfaults in `libblaspp` / `libslate`

So `8f095d2a` does not fix this ROCm/MI300A case.

## Why I think this is a real upstream distributed GPU issue

At this point I believe I have ruled out the common local mistakes:

- build is successful
- dependency chain is coherent
- runtime libs are correct
- single-rank GPU works
- MPI itself works
- issue #126 patch was applied and tested

The remaining failures are now inside:
- STRUMPACK HIP wrapper (`hipblasDgemm`)
- or SLATE/BLAS++ GPU memory handling (`device_free`, `allocateBatchArrays`, `getrf_tntpiv`)

## Questions

1. Is distributed multi-GPU HIP on AMD/ROCm expected to work with:
   - STRUMPACK `v8.0.0`
   - SLATE `v2025.05.28`
   - ROCm `7.2`
   - MI300A / `gfx942`
   - OpenMPI `5.0`

2. Does the `hipBLAS assertion failed: 1` at `HIPWrapper.cpp:351` suggest a known ROCm multi-rank handle/context issue in STRUMPACK?

3. Does the `OMP_NUM_THREADS=1` segfault in `blas::device_free` / `slate::MatrixStorage::allocateBatchArrays` suggest a BLAS++ / SLATE ROCm multi-GPU problem rather than STRUMPACK proper?

4. Is there a recommended STRUMPACK / SLATE / BLAS++ version combination for ROCm multi-GPU that is known to be more stable than:
   - STRUMPACK `v8.0.0`
   - SLATE `v2025.05.28`


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multi-GPU HIP distributed factorization fails on AMD MI300A with Multiple MPI ranks #141

Summary

Environment

Note about ROCm 7.2 source compatibility

What works

1 MPI rank / 1 GPU

MPI sanity checks

Reproducer

Failure mode 1: `OMP_NUM_THREADS=8`

Failure mode 2: `OMP_NUM_THREADS=1`

`SLATE_GPU_AWARE_MPI` test

Issue #126 relation

Why I think this is a real upstream distributed GPU issue

Questions

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Multi-GPU HIP distributed factorization fails on AMD MI300A with Multiple MPI ranks #141

Description

Summary

Environment

Note about ROCm 7.2 source compatibility

What works

1 MPI rank / 1 GPU

MPI sanity checks

Reproducer

Failure mode 1: OMP_NUM_THREADS=8

Failure mode 2: OMP_NUM_THREADS=1

SLATE_GPU_AWARE_MPI test

Issue #126 relation

Why I think this is a real upstream distributed GPU issue

Questions

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Failure mode 1: `OMP_NUM_THREADS=8`

Failure mode 2: `OMP_NUM_THREADS=1`

`SLATE_GPU_AWARE_MPI` test