Exports the DiariZen diarization pipeline to ONNX and binary parameter files for use in Vernacula's C# inference code.
The pipeline has two neural components and one set of static transform parameters:
- Segmentation model — WavLM Large + Conformer (EEND), outputs per-frame speaker activity scores
- Speaker embedding model — WeSpeaker ResNet34, outputs 256-dim embeddings from Kaldi Fbank features
- LDA/PLDA transform — static parameters exported as flat binary files for the VBx clustering step
The clustering logic itself is algorithmic (not a neural net) and is reimplemented directly in C#.
- `export_diarizen_onnx.py`: exports the DiariZen segmentation model to ONNX
- `export_pyannote_wespeaker_onnx.py`: exports the WeSpeaker speaker embedding model to ONNX
- `export_lda_transform.py`: exports the LDA/PLDA transform parameters as flat binary files
- `diarizen_requirements.txt`: export dependencies

These scripts require Python 3.10 and the specific torch versions pinned in the requirements file. Newer torchaudio releases break the import paths that the local DiariZen/pyannote fork expects.

    pip install -r scripts/diarizen_export/diarizen_requirements.txt

The segmentation and embedding export scripts also require local installs of the DiariZen repo and its pyannote-audio fork:

    pip install -e /path/to/DiariZen/pyannote-audio
    pip install -e /path/to/DiariZen

The scripts search for the DiariZen repo at `../../DiariZen` relative to the repo root, then fall back to the hard-coded path `/home/chris/Programming/DiariZen`.
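The lookup order described above can be sketched as follows. This is an illustration only, not the scripts' actual code: `find_diarizen` is a hypothetical helper, and the demo runs against a throwaway directory tree rather than a real checkout.

```python
import tempfile
from pathlib import Path

def find_diarizen(repo_root, fallback):
    """Return the first existing DiariZen directory, or None if neither exists."""
    candidates = [Path(repo_root) / ".." / ".." / "DiariZen", Path(fallback)]
    for candidate in candidates:
        if candidate.is_dir():
            return candidate.resolve()
    return None

# Demo on a temporary tree: <tmp>/x/y stands in for the repo root,
# so ../../DiariZen resolves to <tmp>/DiariZen.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "x" / "y"
    root.mkdir(parents=True)
    missing = find_diarizen(root, Path(tmp) / "nowhere")
    (Path(tmp) / "DiariZen").mkdir()
    found = find_diarizen(root, Path(tmp) / "nowhere")
    expected = (Path(tmp) / "DiariZen").resolve()
```

Returning `None` instead of raising lets the caller decide how to report a missing checkout.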
Downloads the checkpoint from HuggingFace and exports the WavLM + Conformer segmentation model.

    python scripts/diarizen_export/export_diarizen_onnx.py \
      --model-repo BUT-FIT/diarizen-wavlm-large-s80-md \
      --output-dir ~/models/diarizen_onnx

Output: `diarizen_segmentation.onnx`, `metadata.json`

Exports the WeSpeaker ResNet34 model from `pyannote/wespeaker-voxceleb-resnet34-LM`.

    python scripts/diarizen_export/export_pyannote_wespeaker_onnx.py \
      --output-dir ~/models/diarizen_onnx

Output: `wespeaker.onnx`

Exports the LDA projection matrix and PLDA parameters from the HuggingFace snapshot as flat float32 binary files.

    python scripts/diarizen_export/export_lda_transform.py \
      --repo-id BUT-FIT/diarizen-wavlm-large-s80-md \
      --output-dir ~/models/diarizen_onnx/plda

Outputs:

| File | Shape | Description |
|---|---|---|
| `mean1.bin` | float32[256] | Subtract from raw WeSpeaker embedding before LDA |
| `lda.bin` | float32[256×128] | LDA projection matrix (256→128) |
| `mean2.bin` | float32[128] | Subtract after LDA projection |
| `plda_mu.bin` | float32[128] | PLDA mean (subtract before PLDA transform) |
| `plda_tr.bin` | float32[128×128] | PLDA whitening/eigenspace transform |
| `plda_psi.bin` | float32[128] | PLDA eigenvalues (Phi for VBx) |

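Because the files carry no header, a reader needs only the expected element count and the shape from the table. A minimal sketch of such a reader, assuming little-endian float32 values stored in row-major order (neither is stated above, so check against the export script); `read_f32` and `reshape_rows` are hypothetical helper names, and the demo uses a synthetic 2×3 matrix rather than a real export.

```python
import os
import struct
import sys
import tempfile
from array import array

def read_f32(path, expected_count):
    """Read a headerless float32 file and verify its element count."""
    values = array("f")
    with open(path, "rb") as f:
        values.frombytes(f.read())
    if sys.byteorder == "big":
        values.byteswap()  # assumption: the files are little-endian
    if len(values) != expected_count:
        raise ValueError(f"{path}: expected {expected_count} floats, got {len(values)}")
    return values.tolist()

def reshape_rows(flat, rows, cols):
    """Split a flat list into rows x cols, assuming row-major layout."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

# Round-trip demo with synthetic data (values exactly representable in float32).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as f:
        f.write(struct.pack("<6f", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
    demo = reshape_rows(read_f32(demo_path, 6), 2, 3)
```

Under the same assumptions, `lda.bin` would be read with `read_f32(path, 256 * 128)` and reshaped to 256 rows of 128 columns.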
The full pipeline applied in C#:

    xvec = sqrt(128) * l2_norm(lda.T @ (sqrt(256) * l2_norm(raw - mean1)) - mean2)
    fea  = (xvec - plda_mu) @ plda_tr.T
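For cross-checking a C# port, the two formulas can be written out step by step. The following is a dependency-free Python sketch, not the project's implementation: the function names are hypothetical, `lda` is taken as a list of input-dimension rows (256×128 in the real export), and the demo uses toy 4→2 dimensions with made-up values.

```python
import math

def l2_norm(v):
    """Scale a vector to unit length (assumes it is non-zero)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def xvec_transform(raw, mean1, lda, mean2):
    in_dim, out_dim = len(lda), len(lda[0])
    centred = l2_norm([r - m for r, m in zip(raw, mean1)])
    scaled = [math.sqrt(in_dim) * x for x in centred]
    # lda.T @ scaled: column j of lda dotted with the scaled vector
    projected = [sum(lda[i][j] * scaled[i] for i in range(in_dim))
                 for j in range(out_dim)]
    shifted = l2_norm([p - m for p, m in zip(projected, mean2)])
    return [math.sqrt(out_dim) * x for x in shifted]

def plda_transform(xvec, plda_mu, plda_tr):
    # (xvec - plda_mu) @ plda_tr.T: row k of plda_tr dotted with the centred vector
    centred = [x - m for x, m in zip(xvec, plda_mu)]
    return [sum(row[i] * centred[i] for i in range(len(centred)))
            for row in plda_tr]

# Toy example: 4-dim input projected to 2 dims; all values are illustrative.
toy_lda = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
xvec = xvec_transform([3.0, 4.0, 0.0, 0.0], [0.0] * 4, toy_lda, [0.0, 0.0])
fea = plda_transform(xvec, [0.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
xvec_norm = math.sqrt(sum(x * x for x in xvec))
```

After the first step the vector always has norm sqrt(out_dim), which is a convenient invariant to assert in the C# code as well.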
diarizen_segmentation.onnx
| Name | Shape | Description |
|---|---|---|
| `waveform` | [batch, 1, samples] | Input: 16 kHz mono audio |
| `scores` | [batch, frames, speakers] | Output: per-frame powerset speaker activity scores |

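Powerset scores assign one class to each subset of simultaneously active speakers, so downstream code must map the winning class of each frame back to per-speaker activity. The sketch below shows one way to do that. It rests on several assumptions not stated here: the last axis of `scores` indexes powerset classes, the classes follow the pyannote-audio ordering (subsets sorted by size, then lexicographically), and there are 4 speakers with at most 2 overlapping. Verify all three against the exported model before relying on it.

```python
from itertools import combinations

def powerset_classes(num_speakers, max_overlap):
    """List speaker subsets by size, then lexicographically (assumed ordering)."""
    classes = []
    for size in range(max_overlap + 1):
        classes.extend(combinations(range(num_speakers), size))
    return classes

def decode_frames(scores, num_speakers, max_overlap):
    """Turn per-frame powerset scores into per-speaker 0/1 activity."""
    classes = powerset_classes(num_speakers, max_overlap)
    decoded = []
    for frame in scores:
        if len(frame) != len(classes):
            raise ValueError("score width does not match the powerset size")
        best = max(range(len(frame)), key=frame.__getitem__)
        active = classes[best]
        decoded.append([1 if s in active else 0 for s in range(num_speakers)])
    return decoded

# Illustrative frames: silence, speaker 1 alone, speakers 1 and 2 overlapping.
classes = powerset_classes(4, 2)
frames = [[0.0] * 11 for _ in range(3)]
frames[0][0] = 1.0   # class 0 is the empty set
frames[1][2] = 1.0   # class 2 is {1}
frames[2][8] = 1.0   # class 8 is {1, 2}
activity = decode_frames(frames, 4, 2)
```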
wespeaker.onnx
| Name | Shape | Description |
|---|---|---|
| `fbank` | [batch, time_frames, 80] | Input: Kaldi Fbank features (80 mel bins, 10 ms shift) |
| `embedding` | [batch, 256] | Output: raw (un-normalised) speaker embedding |

The embedding model outputs raw embeddings. L2 normalisation is applied in C# as part of the LDA transform (before and after the projection), not inside the ONNX model.
Fbank parameters (must match C# `WeSpeakerEmbedder`):
- `num_mel_bins = 80`
- `frame_length = 25 ms` (400 samples at 16 kHz)
- `frame_shift = 10 ms` (160 samples at 16 kHz)
- `window_type = "hamming"`
- `waveform_scale = 32768`
- `dither = 0.0`, `use_energy = false`
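A quick way to catch a mismatch between the two implementations is to compare frame counts for a known input length. The sketch below assumes Kaldi's default `snip_edges = true` behaviour (only frames that fit entirely inside the signal are kept) and that `waveform_scale` multiplies float samples in [-1, 1]. Neither is stated in the list above, and the helper names are hypothetical.

```python
SAMPLE_RATE = 16000
FRAME_LENGTH = 400      # 25 ms at 16 kHz
FRAME_SHIFT = 160       # 10 ms at 16 kHz
WAVEFORM_SCALE = 32768

def num_fbank_frames(num_samples):
    """Frame count when only complete frames are kept (assumed snip_edges = true)."""
    if num_samples < FRAME_LENGTH:
        return 0
    return 1 + (num_samples - FRAME_LENGTH) // FRAME_SHIFT

def scale_waveform(samples):
    """Scale float samples in [-1, 1] to the 16-bit range (assumed convention)."""
    return [s * WAVEFORM_SCALE for s in samples]

frames_1s = num_fbank_frames(SAMPLE_RATE)   # one second of audio
scaled = scale_waveform([0.5, -1.0])
```

With these settings one second of audio gives 1 + (16000 - 400) // 160 = 98 frames, so the `fbank` input would have shape [batch, 98, 80].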