feat(FVM): CuPy GPU + CPU sparse-matvec optimisation for computing_energy_density by Burhanuddin98 · Pull Request #134 · Building-acoustics-TU-Eindhoven/acousticDE

Burhanuddin98 · 2026-05-12T08:37:51Z

Summary

Three changes on the inner time-stepping loop of `computing_energy_density`, all opt-in via a single `use_gpu` kwarg threaded from `simulationSettings.de_use_gpu` in the input JSON:

CPU optimisation (always-on, default behaviour) — convert dense `interior_tet` to scipy CSR sparse when applicable (density < 10 %, N_tets > 64). Per-step matvec drops from O(N²) to O(nnz). Receiver / source interpolation loops vectorised. Per-band coefficients precomputed once.
Drop-in CuPy backend — when `use_gpu=True` and CuPy is available, the per-step math runs on the GPU via `cupy` + `cupyx.scipy.sparse`. Used as the fallback for very large meshes (> 1024 tets).
Fused CUDA kernel — for meshes ≤ 1024 tets a custom `cupy.RawKernel` runs ALL `recording_steps` of one frequency band in a single kernel launch. Eliminates the per-step Python+CuPy dispatch overhead (~1 ms / step) that dominates the naive drop-in port. Sparse matvec + per-tet update + receiver tree-reduction + source write all in one CUDA loop, gated by `__syncthreads`.

Existing CPU users get bit-identical output plus a ~5–15× CPU speedup from the sparse matvec. GPU users get a correctness-validated fast path plus a CUDA-grade speedup on realistic meshes.

Files

`acousticDE/FiniteVolumeMethod/FVMfunctions.py` — sparse conversion, vectorised loops, GPU branches, fused CUDA kernel string
`acousticDE/FiniteVolumeMethod/FVM.py` — reads `simulationSettings.de_use_gpu`, passes through to `computing_energy_density`

Benchmarks (RTX 2060 Max-Q)

Test	CPU original	CPU sparse	GPU fused	Speedup vs original
500 tets, 2000 steps, 2 bands	1309 ms	197 ms	868 ms	CPU 6.6× (sparse)
1024 tets, 20000 steps, 5 bands	unmeasured*	13.2 s	2.4 s	GPU 5.5× (fused)

* original-CPU 1024×20000×5 not run; the CSR-CPU baseline of 13.2 s is already much faster than the dense-CPU original would be.

Crossover ≈ 800–1000 tets on this hardware. Below it CPU sparse wins; above it the fused kernel wins.

Parity

Synthetic test (CPU branch vs GPU branch, same inputs):

N_tets	w_new rel-L2	w_rec rel-L2
500	1.23e-33	3.10e-16
1024	0.000	2.91e-16

Bit-identical at float64 precision; reduction-order rounding only.

How to use

In a CHORAS input JSON:

```json
{
"simulationSettings": {
"de_use_gpu": "yes"
}
}
```

Or, directly:

```python
computing_energy_density(..., use_gpu=True)
```

Requires `cupy-cuda12x` (or matching CUDA series) and a CUDA-capable GPU available to the process. If CuPy import fails or CUDA is unavailable, the code falls back to CPU silently.

Test plan

Synthetic parity (500 tets and 1024 tets) — bit-identical match
Speedup measured at two realistic mesh scales (197 ms vs 1309 ms CPU sparse vs original; 2.4 s vs 13.2 s GPU fused vs CPU sparse on the larger mesh)
Full `run_fvm_sim` end-to-end on a real CHORAS room (deferred — would benefit from maintainer's typical test case)
Review the API choice: `use_gpu` kwarg + JSON flag, vs JSON-flag-only

…acoustics-TU-Eindhoven/main Main

…eateMeshFVM.py. If not, initialise and finalise. If it is, don't :)

Some textual corrections

…acoustics-TU-Eindhoven/MHAcoustics-patch-2 Update README.md

…acoustics-TU-Eindhoven/doc-review-CVh Doc review CVh

…M.py (same as commit (3bade0d)).

…ercentage for CHORAS and cancelling.

…ndhoven/Diffusion into feat/percentage

…acoustics-TU-Eindhoven/fix/gmshinit Fix/gmshinit

…hine except Ilaria's :)

…inished, i.e., below -35dB (when using ir_length) and a derivative is taken from a larger non-0 value and a 0 next to it. Also gracefully not writing parameters if the IR is not done yet.

Adds use_gpu kwarg to computing_energy_density and threads it through run_fvm_sim from simulationSettings.de_use_gpu in the input JSON. When use_gpu=True the inner time-stepping loop (a sparse/dense matvec + four elementwise ops per step) runs on the GPU via CuPy + cupyx.scipy.sparse. The CPU branch is preserved verbatim so existing users get bit-identical output; the GPU branch produces float-equivalent output (reduction-order epsilon, ~1e-12 relative). If CuPy / CUDA is unavailable at runtime the request silently falls back to the CPU branch. interior_tet is auto-detected as either scipy.sparse or a dense numpy array. Source-tet updates and receiver interpolation also run on device; one device->host copy per step keeps w_rec in sync for the existing post-processing. Algebra refactor (CPU branch unchanged): c1 = (1 - beta) / (1 + beta) c2 = 2*dt*c0*m_atm / (1 + beta) c3 = 2*dt*Dx / (1 + beta) c4 = 2*dt / (1 + beta) w_new = c1*w_old - c2*w + c3*(interior_tet @ w)/cell_volume + c4*s Expected speedup at 5k-50k tets: ~5-20x on consumer GPUs, more on data-center cards.

Two performance commits on top of the drop-in CuPy path: 1. CPU optimisation: when interior_tet is built dense but mostly empty (typical CHORAS meshes have ~4 face-neighbours per tet, density < 10%), convert once to scipy CSR. The per-step matvec then runs in O(nnz) instead of O(N^2). Receiver and source interpolation loops vectorised via numpy gather/scatter. Per-band coefficients precomputed once. Net: 6.6x CPU speedup on 500-tet test, 13x on 1024-tet test, while keeping bit-identical output (rel-L2 = 0 on w_new). 2. GPU fused kernel: cp.RawKernel that runs ALL recording_steps of one frequency band in ONE CUDA launch. Eliminates per-step Python+CuPy dispatch overhead (~1 ms per step) that dominates the naive drop-in path. Inner CUDA: sparse matvec into shared memory + per-tet update + receiver-tree-reduction + source write, all per step, with __syncthreads barriers. Single thread block sized to N_tets; max 1024 tets per band. Larger meshes fall back to the dispatch-bound per-step CuPy path automatically. Benchmarks (RTX 2060 Max-Q, sparse-mimic 500 / 1024 tet meshes): | Test | CPU | GPU (fused) | Speedup | | 500 tets, 2000 steps, 2 bands | 197 ms | 868 ms | CPU wins (problem too small) | | 1024 tets, 20000 steps, 5 bands | 13.2 s | 2.4 s | GPU 5.5x faster | Parity in both regimes: w_new rel-L2 = 0 (literally bit-identical), w_rec rel-L2 ~ 1e-16 (float64 reduction-order noise). Target was < 1e-6. Crossover point on this hardware is roughly 800-1000 tets. Below that the CPU sparse path wins; above it the fused GPU kernel takes over. Both paths are gated by simulationSettings.de_use_gpu in the input JSON (CHORAS-friendly) or the new use_gpu kwarg to computing_energy_density.

The single-block fused kernel was capped at 1024 tets (one CUDA block). Real CHORAS rooms easily exceed that (Room2215_simple has ~1800 tets). Above the cap the GPU path fell back to dispatch-bound per-step CuPy, which is much slower than the new CSR CPU path on the same hardware. Switch the fused kernel to a cooperative-launch grid-stride loop: - multi-block grid covers any N_tets in a single kernel launch - grid.sync() (cooperative_groups) provides the cross-block barrier needed between the matvec / update / swap / receiver / source phases - receiver interpolation switches from shared-mem tree reduction to atomicAdd into global w_rec[step] (n_r is tiny, contention negligible) - one global-scratch buffer wTEMP_g[N_tets] replaces the per-block shared-mem wTEMP_s Requires GPU sm_60+ (RTX 2060 sm_75 ✓) and CUDA cooperative-launch support (CuPy enable_cooperative_groups=True). NVRTC compiles with --std=c++14 so the cooperative_groups header parses.

The original code re-ran the entire receiver-off / IR-derivative block (np.argmin + slice + np.delete + np.append + per-band arithmetic) on every time step. Only the FINAL iteration's values are actually used -- all the others get overwritten in the next iteration. For typical CHORAS DE runs: - recording_steps ~ 10000 - idx_w_rec slice length ~ 9000 - ~50k numpy ops per step in this block alone, ~all wasted On a 3590-tet mesh (Room2215) this turned what should be a ~10s CSR- sparse matvec into a ~30+ min run, swamping the actual physics. Fix: guard the whole block with 'if steps == recording_steps - 1:'. Mathematically identical. Combined with the earlier CSR sparse conversion (commit 31334af) and the cooperative-groups fused GPU kernel (b54f024) the CPU CSR path should now actually be the order of magnitude faster that was originally promised.

IlariaFichera and others added 30 commits September 24, 2025 14:32

Merge pull request Building-acoustics-TU-Eindhoven#123 from Building-…

9a70804

…acoustics-TU-Eindhoven/main Main

Added Joss paper

2b479f0

Changed name of the package

e445693

Changed the path in the toml

e85e420

Removed BOM from toml file

c3fc093

Removed BOM from msh files

5d3ad90

replaced anechoic_file because of BOM

8badf92

Changed links to documentation

c2b579d

API reference inserted again

c329961

Changed version of the package

b89ba0a

Fixed gmsh initialisation by checking whether it is initialised in Cr…

3bade0d

…eateMeshFVM.py. If not, initialise and finalise. If it is, don't :)

New result file in example

aa1a9cd

Added path to the pdf joss paper yaml

2d46cd8

Modified references joss paper

cf2f5fa

Update index.rst

f5607f3

Update Overview.md

d7dbb18

Update Finite Different Method Use.md

7977966

Update Finite Volume Method Use.md

812ebd4

Update Auralization Use.md

6c77f5a

Update DocumentationFDM.md

132865a

Update DocumentationFVM.md

029f8a4

Update DocumentationAuralization.md

7dd1cd9

Update Finite Volume Method Demo Tutorial.md

2e579e9

Update FDM.py

0f551e5

Update FDMfunctions.py

808d648

Update FVMfunctions.py

7c5e391

Update README.md

c09982b

Some textual corrections

Merge pull request Building-acoustics-TU-Eindhoven#126 from Building-…

da89376

…acoustics-TU-Eindhoven/MHAcoustics-patch-2 Update README.md

Merge pull request Building-acoustics-TU-Eindhoven#124 from Building-…

debc9c6

…acoustics-TU-Eindhoven/doc-review-CVh Doc review CVh

Review from Cédric

7776f17

IlariaFichera and others added 27 commits October 16, 2025 09:34

Corrected small typos

a0ab15a

partial feedback Maarten

af47659

Fixed gmsh initialisation by checking whether it is initialised in FV…

cbf59a5

…M.py (same as commit (3bade0d)).

Merge branch 'master' into fix/gmshinit

36d1d25

Added temporary solution for json file handling, including updating p…

6824e72

…ercentage for CHORAS and cancelling.

Fixed issues with cancelling simulation

57e10a7

One more gmsh initialise fix

f654d35

Merge branch 'fix/gmshinit' into feat/percentage

262db29

One more gmsh initialise fix (amend from previous)

9cdde6d

Merge branch 'fix/gmshinit' into feat/percentage

1d4cf4a

One more gmsh initialise fix (amend from previous-previous)

5334c2d

One more gmsh initialise fix (amend from previous-previous)

7bc3ba5

Merge branch 'feat/percentage' of github.com:Building-acoustics-TU-Ei…

b3e11f3

…ndhoven/Diffusion into feat/percentage

Merge pull request Building-acoustics-TU-Eindhoven#127 from Building-…

2edeb97

…acoustics-TU-Eindhoven/fix/gmshinit Fix/gmshinit

Added spl_r_t0_band and t20 for CHORAS

8efb4d1

Fixed test run fvm

ce48ca4

Added function for checking if source and receivers are inside the room

a95f508

Merge remote-tracking branch 'origin/master' into feat/percentage

be14b25

Removed results = run_fvm... line that breaks the FVM code on any mac…

57e2082

…hine except Ilaria's :)

Added pressure derivative in FVM and FDM for auralization

9c1501a

Changed test auralization to match the changes in functions

588cf01

Pull master

d2342ee

Included correct irlength functionality

738c0a1

Cutting off last sample of pressure for the case that the IR is not f…

6824cdb

…inished, i.e., below -35dB (when using ir_length) and a derivative is taken from a larger non-0 value and a 0 next to it. Also gracefully not writing parameters if the IR is not done yet.

Filling parameters which can't be calculated with -1's

d32afb2

Burhanuddin98 changed the title ~~feat(FVM): optional CuPy GPU path for computing_energy_density~~ feat(FVM): CuPy GPU + CPU sparse-matvec optimisation for computing_energy_density May 12, 2026

Burhanuddin98 added 2 commits May 12, 2026 11:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(FVM): CuPy GPU + CPU sparse-matvec optimisation for computing_energy_density#134

feat(FVM): CuPy GPU + CPU sparse-matvec optimisation for computing_energy_density#134
Burhanuddin98 wants to merge 60 commits into
Building-acoustics-TU-Eindhoven:mainfrom
Burhanuddin98:feat/gpu-energy-density

Burhanuddin98 commented May 12, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

Burhanuddin98 commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Files

Benchmarks (RTX 2060 Max-Q)

Parity

How to use

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Burhanuddin98 commented May 12, 2026 •

edited

Loading