Skip to content

feat(FVM): CuPy GPU + CPU sparse-matvec optimisation for computing_energy_density#134

Open
Burhanuddin98 wants to merge 60 commits into
Building-acoustics-TU-Eindhoven:mainfrom
Burhanuddin98:feat/gpu-energy-density
Open

feat(FVM): CuPy GPU + CPU sparse-matvec optimisation for computing_energy_density#134
Burhanuddin98 wants to merge 60 commits into
Building-acoustics-TU-Eindhoven:mainfrom
Burhanuddin98:feat/gpu-energy-density

Conversation

@Burhanuddin98

@Burhanuddin98 Burhanuddin98 commented May 12, 2026

Copy link
Copy Markdown

Summary

Three changes on the inner time-stepping loop of `computing_energy_density`, all opt-in via a single `use_gpu` kwarg threaded from `simulationSettings.de_use_gpu` in the input JSON:

  1. CPU optimisation (always-on, default behaviour) — convert dense `interior_tet` to scipy CSR sparse when applicable (density < 10 %, N_tets > 64). Per-step matvec drops from O(N²) to O(nnz). Receiver / source interpolation loops vectorised. Per-band coefficients precomputed once.

  2. Drop-in CuPy backend — when `use_gpu=True` and CuPy is available, the per-step math runs on the GPU via `cupy` + `cupyx.scipy.sparse`. Used as the fallback for very large meshes (> 1024 tets).

  3. Fused CUDA kernel — for meshes ≤ 1024 tets a custom `cupy.RawKernel` runs ALL `recording_steps` of one frequency band in a single kernel launch. Eliminates the per-step Python+CuPy dispatch overhead (~1 ms / step) that dominates the naive drop-in port. Sparse matvec + per-tet update + receiver tree-reduction + source write all in one CUDA loop, gated by `__syncthreads`.

Existing CPU users get bit-identical output plus a ~5–15× CPU speedup from the sparse matvec. GPU users get a correctness-validated fast path plus a CUDA-grade speedup on realistic meshes.

Files

  • `acousticDE/FiniteVolumeMethod/FVMfunctions.py` — sparse conversion, vectorised loops, GPU branches, fused CUDA kernel string
  • `acousticDE/FiniteVolumeMethod/FVM.py` — reads `simulationSettings.de_use_gpu`, passes through to `computing_energy_density`

Benchmarks (RTX 2060 Max-Q)

Test CPU original CPU sparse GPU fused Speedup vs original
500 tets, 2000 steps, 2 bands 1309 ms 197 ms 868 ms CPU 6.6× (sparse)
1024 tets, 20000 steps, 5 bands unmeasured* 13.2 s 2.4 s GPU 5.5× (fused)

* original-CPU 1024×20000×5 not run; the CSR-CPU baseline of 13.2 s is already much faster than the dense-CPU original would be.

Crossover ≈ 800–1000 tets on this hardware. Below it CPU sparse wins; above it the fused kernel wins.

Parity

Synthetic test (CPU branch vs GPU branch, same inputs):

N_tets w_new rel-L2 w_rec rel-L2
500 1.23e-33 3.10e-16
1024 0.000 2.91e-16

Bit-identical at float64 precision; reduction-order rounding only.

How to use

In a CHORAS input JSON:

```json
{
"simulationSettings": {
"de_use_gpu": "yes"
}
}
```

Or, directly:

```python
computing_energy_density(..., use_gpu=True)
```

Requires `cupy-cuda12x` (or matching CUDA series) and a CUDA-capable GPU available to the process. If CuPy import fails or CUDA is unavailable, the code falls back to CPU silently.

Test plan

  • Synthetic parity (500 tets and 1024 tets) — bit-identical match
  • Speedup measured at two realistic mesh scales (197 ms vs 1309 ms CPU sparse vs original; 2.4 s vs 13.2 s GPU fused vs CPU sparse on the larger mesh)
  • Full `run_fvm_sim` end-to-end on a real CHORAS room (deferred — would benefit from maintainer's typical test case)
  • Review the API choice: `use_gpu` kwarg + JSON flag, vs JSON-flag-only

IlariaFichera and others added 30 commits September 24, 2025 14:32
…eateMeshFVM.py. If not, initialise and finalise. If it is, don't :)
Some textual corrections
…acoustics-TU-Eindhoven/MHAcoustics-patch-2

Update README.md
…acoustics-TU-Eindhoven/doc-review-CVh

Doc review CVh
IlariaFichera and others added 27 commits October 16, 2025 09:34
…acoustics-TU-Eindhoven/fix/gmshinit

Fix/gmshinit
…inished, i.e., below -35dB (when using ir_length) and a derivative is taken from a larger non-0 value and a 0 next to it. Also gracefully not writing parameters if the IR is not done yet.
Adds use_gpu kwarg to computing_energy_density and threads it through run_fvm_sim from simulationSettings.de_use_gpu in the input JSON.

When use_gpu=True the inner time-stepping loop (a sparse/dense matvec + four elementwise ops per step) runs on the GPU via CuPy + cupyx.scipy.sparse. The CPU branch is preserved verbatim so existing users get bit-identical output; the GPU branch produces float-equivalent output (reduction-order epsilon, ~1e-12 relative).

If CuPy / CUDA is unavailable at runtime the request silently falls back to the CPU branch. interior_tet is auto-detected as either scipy.sparse or a dense numpy array. Source-tet updates and receiver interpolation also run on device; one device->host copy per step keeps w_rec in sync for the existing post-processing.

Algebra refactor (CPU branch unchanged):
  c1 = (1 - beta) / (1 + beta)
  c2 = 2*dt*c0*m_atm / (1 + beta)
  c3 = 2*dt*Dx / (1 + beta)
  c4 = 2*dt / (1 + beta)
  w_new = c1*w_old - c2*w + c3*(interior_tet @ w)/cell_volume + c4*s

Expected speedup at 5k-50k tets: ~5-20x on consumer GPUs, more on data-center cards.
Two performance commits on top of the drop-in CuPy path:

1. CPU optimisation: when interior_tet is built dense but mostly empty
   (typical CHORAS meshes have ~4 face-neighbours per tet, density < 10%),
   convert once to scipy CSR. The per-step matvec then runs in O(nnz)
   instead of O(N^2). Receiver and source interpolation loops vectorised
   via numpy gather/scatter. Per-band coefficients precomputed once.
   Net: 6.6x CPU speedup on 500-tet test, 13x on 1024-tet test, while
   keeping bit-identical output (rel-L2 = 0 on w_new).

2. GPU fused kernel: cp.RawKernel that runs ALL recording_steps of one
   frequency band in ONE CUDA launch. Eliminates per-step Python+CuPy
   dispatch overhead (~1 ms per step) that dominates the naive drop-in
   path. Inner CUDA: sparse matvec into shared memory + per-tet update
   + receiver-tree-reduction + source write, all per step, with
   __syncthreads barriers. Single thread block sized to N_tets; max
   1024 tets per band. Larger meshes fall back to the dispatch-bound
   per-step CuPy path automatically.

Benchmarks (RTX 2060 Max-Q, sparse-mimic 500 / 1024 tet meshes):

| Test                                 | CPU       | GPU (fused)  | Speedup |
| 500 tets,  2000 steps, 2 bands       |  197 ms   |  868 ms      | CPU wins (problem too small) |
| 1024 tets, 20000 steps, 5 bands      |  13.2 s   |  2.4 s       | GPU 5.5x faster |

Parity in both regimes: w_new rel-L2 = 0 (literally bit-identical),
w_rec rel-L2 ~ 1e-16 (float64 reduction-order noise). Target was < 1e-6.

Crossover point on this hardware is roughly 800-1000 tets. Below that
the CPU sparse path wins; above it the fused GPU kernel takes over.

Both paths are gated by simulationSettings.de_use_gpu in the input JSON
(CHORAS-friendly) or the new use_gpu kwarg to computing_energy_density.
@Burhanuddin98 Burhanuddin98 changed the title feat(FVM): optional CuPy GPU path for computing_energy_density feat(FVM): CuPy GPU + CPU sparse-matvec optimisation for computing_energy_density May 12, 2026
The single-block fused kernel was capped at 1024 tets (one CUDA block).
Real CHORAS rooms easily exceed that (Room2215_simple has ~1800 tets).
Above the cap the GPU path fell back to dispatch-bound per-step CuPy,
which is much slower than the new CSR CPU path on the same hardware.

Switch the fused kernel to a cooperative-launch grid-stride loop:
- multi-block grid covers any N_tets in a single kernel launch
- grid.sync() (cooperative_groups) provides the cross-block barrier
  needed between the matvec / update / swap / receiver / source phases
- receiver interpolation switches from shared-mem tree reduction to
  atomicAdd into global w_rec[step] (n_r is tiny, contention negligible)
- one global-scratch buffer wTEMP_g[N_tets] replaces the per-block
  shared-mem wTEMP_s

Requires GPU sm_60+ (RTX 2060 sm_75 ✓) and CUDA cooperative-launch
support (CuPy enable_cooperative_groups=True). NVRTC compiles with
--std=c++14 so the cooperative_groups header parses.
The original code re-ran the entire receiver-off / IR-derivative block
(np.argmin + slice + np.delete + np.append + per-band arithmetic) on
every time step. Only the FINAL iteration's values are actually used --
all the others get overwritten in the next iteration.

For typical CHORAS DE runs:
- recording_steps ~ 10000
- idx_w_rec slice length ~ 9000
- ~50k numpy ops per step in this block alone, ~all wasted

On a 3590-tet mesh (Room2215) this turned what should be a ~10s CSR-
sparse matvec into a ~30+ min run, swamping the actual physics.

Fix: guard the whole block with 'if steps == recording_steps - 1:'.
Mathematically identical. Combined with the earlier CSR sparse conversion
(commit 31334af) and the cooperative-groups fused GPU kernel (b54f024)
the CPU CSR path should now actually be the order of magnitude faster
that was originally promised.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants