feat(FVM): CuPy GPU + CPU sparse-matvec optimisation for computing_energy_density#134
Open
Burhanuddin98 wants to merge 60 commits into
Open
Conversation
…acoustics-TU-Eindhoven/main Main
…eateMeshFVM.py. If not, initialise and finalise. If it is, don't :)
Some textual corrections
…acoustics-TU-Eindhoven/MHAcoustics-patch-2 Update README.md
…acoustics-TU-Eindhoven/doc-review-CVh Doc review CVh
…M.py (same as commit (3bade0d)).
…ercentage for CHORAS and cancelling.
…ndhoven/Diffusion into feat/percentage
…acoustics-TU-Eindhoven/fix/gmshinit Fix/gmshinit
…hine except Ilaria's :)
…inished, i.e., below -35dB (when using ir_length) and a derivative is taken from a larger non-0 value and a 0 next to it. Also gracefully not writing parameters if the IR is not done yet.
Adds use_gpu kwarg to computing_energy_density and threads it through run_fvm_sim from simulationSettings.de_use_gpu in the input JSON. When use_gpu=True the inner time-stepping loop (a sparse/dense matvec + four elementwise ops per step) runs on the GPU via CuPy + cupyx.scipy.sparse. The CPU branch is preserved verbatim so existing users get bit-identical output; the GPU branch produces float-equivalent output (reduction-order epsilon, ~1e-12 relative). If CuPy / CUDA is unavailable at runtime the request silently falls back to the CPU branch. interior_tet is auto-detected as either scipy.sparse or a dense numpy array. Source-tet updates and receiver interpolation also run on device; one device->host copy per step keeps w_rec in sync for the existing post-processing. Algebra refactor (CPU branch unchanged): c1 = (1 - beta) / (1 + beta) c2 = 2*dt*c0*m_atm / (1 + beta) c3 = 2*dt*Dx / (1 + beta) c4 = 2*dt / (1 + beta) w_new = c1*w_old - c2*w + c3*(interior_tet @ w)/cell_volume + c4*s Expected speedup at 5k-50k tets: ~5-20x on consumer GPUs, more on data-center cards.
Two performance commits on top of the drop-in CuPy path: 1. CPU optimisation: when interior_tet is built dense but mostly empty (typical CHORAS meshes have ~4 face-neighbours per tet, density < 10%), convert once to scipy CSR. The per-step matvec then runs in O(nnz) instead of O(N^2). Receiver and source interpolation loops vectorised via numpy gather/scatter. Per-band coefficients precomputed once. Net: 6.6x CPU speedup on 500-tet test, 13x on 1024-tet test, while keeping bit-identical output (rel-L2 = 0 on w_new). 2. GPU fused kernel: cp.RawKernel that runs ALL recording_steps of one frequency band in ONE CUDA launch. Eliminates per-step Python+CuPy dispatch overhead (~1 ms per step) that dominates the naive drop-in path. Inner CUDA: sparse matvec into shared memory + per-tet update + receiver-tree-reduction + source write, all per step, with __syncthreads barriers. Single thread block sized to N_tets; max 1024 tets per band. Larger meshes fall back to the dispatch-bound per-step CuPy path automatically. Benchmarks (RTX 2060 Max-Q, sparse-mimic 500 / 1024 tet meshes): | Test | CPU | GPU (fused) | Speedup | | 500 tets, 2000 steps, 2 bands | 197 ms | 868 ms | CPU wins (problem too small) | | 1024 tets, 20000 steps, 5 bands | 13.2 s | 2.4 s | GPU 5.5x faster | Parity in both regimes: w_new rel-L2 = 0 (literally bit-identical), w_rec rel-L2 ~ 1e-16 (float64 reduction-order noise). Target was < 1e-6. Crossover point on this hardware is roughly 800-1000 tets. Below that the CPU sparse path wins; above it the fused GPU kernel takes over. Both paths are gated by simulationSettings.de_use_gpu in the input JSON (CHORAS-friendly) or the new use_gpu kwarg to computing_energy_density.
The single-block fused kernel was capped at 1024 tets (one CUDA block). Real CHORAS rooms easily exceed that (Room2215_simple has ~1800 tets). Above the cap the GPU path fell back to dispatch-bound per-step CuPy, which is much slower than the new CSR CPU path on the same hardware. Switch the fused kernel to a cooperative-launch grid-stride loop: - multi-block grid covers any N_tets in a single kernel launch - grid.sync() (cooperative_groups) provides the cross-block barrier needed between the matvec / update / swap / receiver / source phases - receiver interpolation switches from shared-mem tree reduction to atomicAdd into global w_rec[step] (n_r is tiny, contention negligible) - one global-scratch buffer wTEMP_g[N_tets] replaces the per-block shared-mem wTEMP_s Requires GPU sm_60+ (RTX 2060 sm_75 ✓) and CUDA cooperative-launch support (CuPy enable_cooperative_groups=True). NVRTC compiles with --std=c++14 so the cooperative_groups header parses.
The original code re-ran the entire receiver-off / IR-derivative block (np.argmin + slice + np.delete + np.append + per-band arithmetic) on every time step. Only the FINAL iteration's values are actually used -- all the others get overwritten in the next iteration. For typical CHORAS DE runs: - recording_steps ~ 10000 - idx_w_rec slice length ~ 9000 - ~50k numpy ops per step in this block alone, ~all wasted On a 3590-tet mesh (Room2215) this turned what should be a ~10s CSR- sparse matvec into a ~30+ min run, swamping the actual physics. Fix: guard the whole block with 'if steps == recording_steps - 1:'. Mathematically identical. Combined with the earlier CSR sparse conversion (commit 31334af) and the cooperative-groups fused GPU kernel (b54f024) the CPU CSR path should now actually be the order of magnitude faster that was originally promised.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three changes on the inner time-stepping loop of `computing_energy_density`, all opt-in via a single `use_gpu` kwarg threaded from `simulationSettings.de_use_gpu` in the input JSON:
CPU optimisation (always-on, default behaviour) — convert dense `interior_tet` to scipy CSR sparse when applicable (density < 10 %, N_tets > 64). Per-step matvec drops from O(N²) to O(nnz). Receiver / source interpolation loops vectorised. Per-band coefficients precomputed once.
Drop-in CuPy backend — when `use_gpu=True` and CuPy is available, the per-step math runs on the GPU via `cupy` + `cupyx.scipy.sparse`. Used as the fallback for very large meshes (> 1024 tets).
Fused CUDA kernel — for meshes ≤ 1024 tets a custom `cupy.RawKernel` runs ALL `recording_steps` of one frequency band in a single kernel launch. Eliminates the per-step Python+CuPy dispatch overhead (~1 ms / step) that dominates the naive drop-in port. Sparse matvec + per-tet update + receiver tree-reduction + source write all in one CUDA loop, gated by `__syncthreads`.
Existing CPU users get bit-identical output plus a ~5–15× CPU speedup from the sparse matvec. GPU users get a correctness-validated fast path plus a CUDA-grade speedup on realistic meshes.
Files
Benchmarks (RTX 2060 Max-Q)
* original-CPU 1024×20000×5 not run; the CSR-CPU baseline of 13.2 s is already much faster than the dense-CPU original would be.
Crossover ≈ 800–1000 tets on this hardware. Below it CPU sparse wins; above it the fused kernel wins.
Parity
Synthetic test (CPU branch vs GPU branch, same inputs):
Bit-identical at float64 precision; reduction-order rounding only.
How to use
In a CHORAS input JSON:
```json
{
"simulationSettings": {
"de_use_gpu": "yes"
}
}
```
Or, directly:
```python
computing_energy_density(..., use_gpu=True)
```
Requires `cupy-cuda12x` (or matching CUDA series) and a CUDA-capable GPU available to the process. If CuPy import fails or CUDA is unavailable, the code falls back to CPU silently.
Test plan