LocalVQE

Local Voice Quality Enhancement — compact neural models for acoustic echo cancellation (AEC), noise suppression (NS), and dereverberation of 16 kHz speech, running on commodity CPUs in real time. Causal and streaming (256-sample hop, 16 ms latency). F32 inference in C++ via GGML; a PyTorch reference is included for research.

A streaming, CPU-tuned derivative of DeepVQE (Indenbom et al., Interspeech 2023).

Models

Speed is the processing time per 16 ms hop on a Ryzen 9 7900 (Zen 4) with 4 threads. RT is the realtime factor: values above 1× are faster than realtime, and higher is faster.

| Version | Does | Params | Size (F32) | Speed | Pick it when |
|---|---|---|---|---|---|
| v1.3 (current) | AEC + NS + dereverb | 4.8 M | ~19 MB | 3.2 ms · 5.0× RT | best joint quality, CPU budget available |
| v1.2 | AEC + NS + dereverb | 1.3 M | ~5 MB | 1.7 ms · 8.9× RT | tight CPU / low-power devices |
| v1.4-AEC | echo only (keeps voice, noise, room) | 203 K | ~3 MB | 0.83 ms · 19× RT | NS is handled elsewhere, or you want the room kept |
| v1.4-AEC 2.7K | echo only, linear filter (no mask) | 2.7 K | ~17 KB | 0.36 ms · 44× RT | lightest echo canceller; echo isn't heavily reverberant |
| v1.1 / v1 | AEC + NS + dereverb | 1.3 M | ~5 MB | | superseded by v1.2 |

  • Joint models (v1.2 / v1.3) clean echo, noise, and reverb in one pass. v1.3 is wider and filters noise better; v1.2 has roughly a quarter of the parameters and about half the measured per-hop time.
  • v1.4-AEC removes only the far-end echo and passes voice, room, and background through unchanged. It's a classical adaptive filter followed by a small neural mask. The 2.7K build is that filter alone — cheaper and gentler, but it can't remove heavily reverberant echo the way the mask can.
  • Every model needs a far-end reference signal (a loopback of what your speakers play) in addition to the mic.
  • bf16 GGUFs are ~12 % smaller with identical quality and speed; pick f32 unless download size matters.

Weight files on Hugging Face

| File | Model |
|---|---|
| localvqe-v1.3-4.8M-f32.gguf / .pt | v1.3 joint (GGUF for inference, .pt for research) |
| localvqe-v1.2-1.3M-f32.gguf / .pt | v1.2 joint |
| localvqe-v1.4-aec-200K-f32.gguf / -bf16.gguf | v1.4-AEC (echo only) |
| localvqe-v1.4-aec-2.7K-f32.gguf | v1.4-AEC front-end only |
| localvqe-v1.1-1.3M-f32.gguf, localvqe-v1-1.3M-f32.gguf | older releases |

v1.4-AEC is GGUF-only (no .pt). GGUF integrity is checked at load time against a built-in SHA256 allowlist (ggml/model_hash.cpp). PyTorch checkpoint hashes:

22d3e2f33bb8b25ec1c6a928cfb741bb631d45bae2b3759684818b101c95878e  localvqe-v1.3-4.8M.pt
ff6885e7c8d7d29a8ce963303dcd668ae0f2a7bdafae28631292fe6f06f7cd77  localvqe-v1.2-1.3M.pt
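
A downloaded checkpoint can be compared against the digests above with nothing but the Python standard library. The following is a minimal sketch; the names `EXPECTED_SHA256`, `sha256_of`, and `verify_checkpoint` are illustrative and not part of the repository, and the file is assumed to keep the name shown in the hash list.

```python
import hashlib
from pathlib import Path

# Digests copied from the list above; the dictionary name is illustrative.
EXPECTED_SHA256 = {
    "localvqe-v1.3-4.8M.pt": "22d3e2f33bb8b25ec1c6a928cfb741bb631d45bae2b3759684818b101c95878e",
    "localvqe-v1.2-1.3M.pt": "ff6885e7c8d7d29a8ce963303dcd668ae0f2a7bdafae28631292fe6f06f7cd77",
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path):
    """True only if the file name is listed and its digest matches."""
    expected = EXPECTED_SHA256.get(Path(path).name)
    return expected is not None and sha256_of(path) == expected
```

Calling `verify_checkpoint("localvqe-v1.2-1.3M.pt")` before `torch.load` gives the .pt files a check comparable in spirit to the one the GGUF loader performs at load time.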

Performance

Full 800-clip eval on the ICASSP 2022 AEC Challenge blind test set (real recordings). AECMOS echo / deg are 1–5 (higher = more echo removed / cleaner speech); blind ERLE is 10·log10(E[mic²]/E[enh²]), only meaningful on far-end-only clips. Unprocessed-mic echo MOS is 2.67 / 2.56 / 1.90 / 2.13 / 5.00 across the five scenarios.
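
The blind ERLE formula above can be written out directly. This is a sketch of the stated formula only, not the evaluation script used for the tables; the function name `blind_erle_db` and the handling of silent signals are assumptions.

```python
import math


def blind_erle_db(mic, enhanced):
    """10*log10(E[mic^2] / E[enh^2]) for two equal-length sample sequences."""
    if len(mic) != len(enhanced) or len(mic) == 0:
        raise ValueError("mic and enhanced must be non-empty and equal length")
    mic_power = sum(x * x for x in mic) / len(mic)
    enh_power = sum(y * y for y in enhanced) / len(enhanced)
    if enh_power == 0.0:
        return math.inf   # assumption: a fully silent output counts as infinite ERLE
    if mic_power == 0.0:
        return -math.inf  # assumption: silent input with non-silent output
    return 10.0 * math.log10(mic_power / enh_power)
```

An output scaled to one tenth of the mic amplitude has one hundredth of the power, which this formula reports as 20 dB. On clips that contain near-end speech the ratio also reflects speech that was correctly preserved, which is why the text restricts it to far-end-only clips.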

v1.4-AEC — keeps background noise and room by design, so its ERLE and far-end DNSMOS are intentionally lower than the joint models (it isn't deleting the ambience):

| Scenario | n | echo ↑ | deg ↑ | ERLE ↑ | OVRL |
|---|---|---|---|---|---|
| doubletalk | 115 | 4.20 | 2.45 | | 2.59 |
| doubletalk-with-movement | 185 | 4.19 | 2.45 | | 2.55 |
| farend-singletalk | 107 | 3.80 | 4.99 | 14.6 dB | 1.37 |
| farend-singletalk-with-movement | 193 | 3.86 | 4.95 | 11.1 dB | 1.31 |
| nearend-singletalk | 200 | 4.99 | 3.99 | | 3.08 |

v1.4-AEC 2.7K (front-end only) — matches or beats the full model's perceptual far-end echo at 1/74 the parameters; the mask's extra work shows up as higher ERLE above, not higher echo MOS:

| Scenario | n | echo ↑ | deg ↑ | ERLE ↑ | OVRL |
|---|---|---|---|---|---|
| doubletalk | 115 | 4.00 | 2.79 | | 2.46 |
| doubletalk-with-movement | 185 | 3.90 | 2.92 | | 2.42 |
| farend-singletalk | 107 | 4.06 | 5.00 | 6.5 dB | 1.24 |
| farend-singletalk-with-movement | 193 | 4.05 | 4.97 | 3.9 dB | 1.22 |
| nearend-singletalk | 200 | 4.98 | 3.77 | | 3.03 |

v1.3 (joint) and v1.2 (joint) — these also delete the background, so their far-end ERLE is much higher and not comparable to v1.4-AEC's:

| Scenario | n | v1.3 echo / deg / ERLE / OVRL | v1.2 echo / deg / ERLE / OVRL |
|---|---|---|---|
| doubletalk | 115 | 4.73 / 2.62 / 8.5 dB / 2.89 | 4.72 / 2.37 / 8.4 dB / 2.83 |
| doubletalk-with-movement | 185 | 4.67 / 2.43 / 8.3 dB / 2.85 | 4.65 / 2.30 / 8.1 dB / 2.79 |
| farend-singletalk | 107 | 3.69 / 4.83 / 50.9 dB / 1.94 | 3.78 / 4.91 / 45.7 dB / 1.80 |
| farend-singletalk-with-movement | 193 | 3.88 / 4.98 / 49.9 dB / 1.96 | 4.12 / 4.96 / 40.6 dB / 1.75 |
| nearend-singletalk | 200 | 5.00 / 4.18 / 2.4 dB / 3.17 | 5.00 / 4.16 / 2.1 dB / 3.17 |

Latency

Per-hop p50 / p99 and RT factor. 16 kHz, 256-sample hop, 16 ms budget.
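
The 16 ms budget follows from the hop size and sample rate, and a realtime factor can be derived from it. The sketch below assumes RT means hop duration divided by per-hop processing time; the function names are illustrative. The published RT column may be based on a different statistic than p50, so it need not equal this ratio exactly.

```python
SAMPLE_RATE_HZ = 16_000
HOP_SAMPLES = 256


def hop_budget_ms(hop_samples=HOP_SAMPLES, sample_rate_hz=SAMPLE_RATE_HZ):
    """Audio duration covered by one hop, in milliseconds."""
    return 1000.0 * hop_samples / sample_rate_hz


def realtime_factor(per_hop_ms, budget_ms=None):
    """Assumed definition: hop duration divided by per-hop processing time."""
    if budget_ms is None:
        budget_ms = hop_budget_ms()
    return budget_ms / per_hop_ms


def fits_realtime(p99_ms, budget_ms=None):
    """True if even the slow (p99) hops finish within one hop of audio."""
    if budget_ms is None:
        budget_ms = hop_budget_ms()
    return p99_ms < budget_ms
```

Checking p99 rather than p50 against the budget is the more conservative test for a streaming pipeline, since a single late hop is audible as a dropout.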

v1.4-AEC (Ryzen 9 7900, CPU):

| Threads | p50 | p99 | RT |
|---|---|---|---|
| 1 | 1.29 ms | 1.89 ms | 12.2× |
| 4 | 0.83 ms | 1.30 ms | 18.6× |

The 2.7K front-end-only build runs at 0.36 ms p50 (≈44× RT), single-threaded by nature. The adaptive front-end always runs on CPU; the neural stage is too small for GPU offload to pay off, so run v1.4-AEC on CPU.

v1.3 (joint):

| Hardware | Backend | Threads | p50 | p99 | RT |
|---|---|---|---|---|---|
| Ryzen 9 7900 | CPU | 1 | 9.73 ms | 14.48 ms | 1.58× |
| Ryzen 9 7900 | CPU | 4 | 3.21 ms | 3.42 ms | 4.97× |
| Ryzen 9 7900 + RTX 5070 Ti | Vulkan | | 2.57 ms | 4.21 ms | 6.07× |

v1.2 (joint):

| Hardware | Backend | Threads | p50 | p99 | RT |
|---|---|---|---|---|---|
| Ryzen 9 7900 | CPU | 1 | 4.28 ms | 4.85 ms | 3.72× |
| Ryzen 9 7900 | CPU | 4 | 1.65 ms | 2.91 ms | 8.90× |
| Ryzen 9 7900 + RTX 5070 Ti | Vulkan | | 1.96 ms | 3.64 ms | 7.85× |
| Ryzen 7 6800U (laptop) | CPU | 4 | 2.11 ms | 2.77 ms | 7.44× |

These graphs are small, so threads hit diminishing returns past ~4. The library defaults to min(4, available CPUs) (respects taskset / cgroup limits); override with localvqe_options_set_threads. Run bench-run (below) to reproduce on your hardware.
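
The default rule described above, min(4, available CPUs), can be mirrored in a few lines when sizing a host application's own thread pool. This is a sketch of the stated rule, not the library's code. `os.sched_getaffinity` reflects affinity masks such as those set by taskset or a cpuset cgroup, but not CPU-quota limits; how the library treats quotas is not stated here.

```python
import os

MAX_DEFAULT_THREADS = 4  # cap taken from the paragraph above


def available_cpus():
    """CPUs this process may run on, honouring the affinity mask where possible."""
    if hasattr(os, "sched_getaffinity"):       # Linux and some other Unixes
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1                 # fallback without an affinity API


def default_threads():
    """Thread count following the min(4, available CPUs) rule."""
    return min(MAX_DEFAULT_THREADS, available_cpus())
```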

Memory (CPU)

Working set the model adds on top of the ~7 MiB binary baseline:

| Model | Post-load delta | Peak RSS |
|---|---|---|
| v1.3 (4.8 M) | +24.4 MiB | 34.1 MiB |
| v1.2 (1.3 M) | +10.0 MiB | 19.6 MiB |
| v1.4-AEC (203 K) | +6.7 MiB | 17.0 MiB |

Usage

Build

Requires CMake ≥ 3.20 and a C++17 compiler. A Nix flake is provided (nix develop); without Nix, install cmake, gcc/clang, pkg-config, and libsndfile.

git clone --recursive https://github.com/localai-org/LocalVQE.git
cd LocalVQE
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)

Binaries land in ggml/build/bin/. The CPU build produces several libggml-cpu-*.so variants (SSE4.2 → AVX-512) selected at runtime — keep them next to the binary. For GPU, add -DLOCALVQE_VULKAN=ON (the loader falls back to CPU when no Vulkan ICD is present).

Run (CLI)

./ggml/build/bin/localvqe localvqe-v1.3-4.8M-f32.gguf \
    --in-wav mic.wav ref.wav \
    --out-wav enhanced.wav

Both the mic and the far-end reference must be 16 kHz mono PCM. Swap the GGUF to switch models: the command is the same for every version, because the engine reads what to do from the file.
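
Input files can be checked for the 16 kHz mono requirement before they are handed to the CLI. The sketch below uses Python's standard `wave` module; the helper name `check_wav_16k_mono` is illustrative. The module raises `wave.Error` for WAV files it cannot parse (for example compressed formats). Bit depth is not checked, because the accepted sample widths are not stated here.

```python
import wave


def check_wav_16k_mono(path):
    """Return a list of problems; an empty list means the file looks usable."""
    problems = []
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        if rate != 16000:
            problems.append(f"sample rate is {rate} Hz, expected 16000")
        if channels != 1:
            problems.append(f"{channels} channels, expected mono")
    return problems


def check_pair(mic_path, ref_path):
    """Check both inputs and return a dict of path -> problems."""
    return {p: check_wav_16k_mono(p) for p in (mic_path, ref_path)}
```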

Embed (C API)

cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON
cmake --build ggml/build -j$(nproc)   # -> liblocalvqe.so

API in ggml/localvqe_api.h:

localvqe_ctx_t ctx = localvqe_new("localvqe-v1.3-4.8M-f32.gguf");
localvqe_process_f32(ctx, mic, ref, n_samples, out);   // whole clip
// or per 256-sample hop for real-time: localvqe_process_frame_f32(...)
localvqe_free(ctx);

See ggml/example_purego_test.go for a Go / purego binding.
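
For the per-hop path, the caller has to cut the mic and reference streams into aligned 256-sample frames. The sketch below shows only that framing step, in Python for brevity, and does not call the library; in a real integration each pair would be passed to `localvqe_process_frame_f32`, whose exact signature is in ggml/localvqe_api.h and is not reproduced here. Zero-padding the final partial hop is an assumption of this sketch.

```python
HOP = 256  # samples per hop at 16 kHz, as stated above


def iter_hops(mic, ref, hop=HOP):
    """Yield aligned (mic_frame, ref_frame) lists of exactly `hop` samples."""
    if len(mic) != len(ref):
        raise ValueError("mic and ref must have the same number of samples")
    for start in range(0, len(mic), hop):
        mic_frame = list(mic[start:start + hop])
        ref_frame = list(ref[start:start + hop])
        pad = hop - len(mic_frame)
        if pad:
            # Assumption: pad the last partial hop with silence.
            mic_frame += [0.0] * pad
            ref_frame += [0.0] * pad
        yield mic_frame, ref_frame
```

Keeping mic and reference frames index-aligned matters more than the padding policy: the echo canceller relies on the reference describing what the speakers played during the same hop.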

Benchmark / test

cmake --build ggml/build --target bench-run          # downloads a model + clip, benches
cmake --build ggml/build --target test_regression regression-assets
ctest --test-dir ggml/build --output-on-failure      # SKIPs models not downloaded

bench-run honors -DBENCH_BACKEND=Vulkan -DBENCH_DEVICE=N -DBENCH_ITERS=N set at configure time; bench-list-devices enumerates backends.

OBS Studio plugin

obs-plugin/ wraps liblocalvqe.so as an audio filter — appears as "LocalVQE (AEC + Noise + Dereverb)" in any source's filter list, with the bundled v1.3 GGUF preselected. NS and dereverb work out of the box; for AEC, set a Reference source (usually "Desktop Audio") so the model knows what's playing. Browse to localvqe-v1.4-aec-200K-f32.gguf to switch to echo-only mode.

nix develop .#obs-plugin
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_BUILD_SHARED=ON
cmake --build ggml/build -j$(nproc)
cmake --build ggml/build --target regression-assets
cp ggml/build/bench_assets/localvqe-v1.3-4.8M-f32.gguf obs-plugin/data/
cmake -S obs-plugin -B obs-plugin/build -DCMAKE_BUILD_TYPE=Release
cmake --build obs-plugin/build -j$(nproc) && cmake --install obs-plugin/build

The install is self-contained (plugin .so + liblocalvqe.so + the libggml-cpu-*.so variants under ~/.config/obs-studio/plugins/). Tested on Linux; macOS expected to work; Windows implemented but unverified.

PyTorch reference

pytorch/ holds the model definition used to train and export the weights — for verification and research, not end-user inference (use the GGML build).

cd pytorch && pip install -r requirements.txt
python -c "import yaml, torch; from localvqe.model import LocalVQE; \
cfg = yaml.safe_load(open('configs/default.yaml')); \
m = LocalVQE(**cfg['model'], n_freqs=cfg['audio']['n_freqs']); \
print(sum(p.numel() for p in m.parameters()))"

Repository layout

ggml/        C++ streaming inference (GGML graph, CLI, C API, tests)
pytorch/     PyTorch reference (model definition only)
obs-plugin/  OBS Studio audio filter wrapping liblocalvqe.so

Citing

Cite the repository via CITATION.cff (GitHub's "Cite this repository" button produces APA / BibTeX), and the upstream DeepVQE paper:

@inproceedings{indenbom2023deepvqe,
  title     = {DeepVQE: Real Time Deep Voice Quality Enhancement for Joint
               Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
  author    = {Indenbom, Evgenii and Ristea, Nicolae-C{\u{a}}t{\u{a}}lin
               and Saabas, Ando and P{\"a}rnamaa, Tanel
               and Gu{\v{z}}vin, Jegor and Cutler, Ross},
  booktitle = {Interspeech}, year = {2023},
  doi       = {10.21437/Interspeech.2023-2176}
}

Attribution, safety, license

Weights are trained on the ICASSP 2023 DNS Challenge (Microsoft, CC BY 4.0) and fine-tuned on the ICASSP 2022/2023 AEC Challenge.

Safety: training data was filtered by DNSMOS, which can misclassify distressed speech (screaming, crying) as noise. LocalVQE may attenuate such signals and must not be relied on for emergency or safety-critical use.

Licensed under Apache 2.0 — see LICENSE.