Benchmarks

Throughput and accuracy numbers for the diarization and ASR backends. Throughput figures come from in-house runs; accuracy is reproduced from the model authors' published benchmarks.

Throughput

10-minute English audio file, fp32 models. RTF < 1.0 = faster than real-time.

Backend	Hardware	Diarization	ASR	Total	RTF
Silero VAD	AMD Ryzen 7 7840U	2.1s	50.5s	52.7s	0.088
Sortformer	AMD Ryzen 7 7840U	33.2s	49.2s	82.4s	0.137
DiariZen	AMD Ryzen 7 7840U	502.0s	55.8s	557.8s	0.930
Silero VAD	NVIDIA RTX 3090	2.1s	5.4s	7.4s	0.012
Sortformer	NVIDIA RTX 3090	16.0s	5.5s	21.4s	0.036
DiariZen	NVIDIA RTX 3090	16.8s	5.4s	22.2s	0.037

DiariZen's segmentation and embedding pipeline is heavily GPU-accelerated — CUDA reduces diarization time from 502s to 16.8s (~30×) and brings total runtime in line with Sortformer.

Parakeet TDT beam search (--parakeet-beam 4) adds roughly 3–5× to ASR latency. With KenLM fusion at typical weights the additional lookup cost is a few hundred milliseconds per clip (one-time LM load plus microsecond-per-beam-expansion scoring).

Accuracy (DER)

Diarization Error Rate from published benchmarks. Lower is better.

Backend	AMI-SDM	VoxConverse	DIHARD III	Source
Sortformer v2-stream	20.6%	13.9%	20.2%	HuggingFace
DiariZen-Large	13.9%	9.1%	14.5%	BUTSpeechFIT/DiariZen

Benchmarks use different evaluation conditions (collar, overlap handling) — direct cross-model comparison should be treated as indicative only. The independent survey Benchmarking Diarization Models (2509.26177) found Sortformer v2-stream and DiariZen among the top open-source performers overall.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmarks

Throughput

Accuracy (DER)

FilesExpand file tree

benchmarks.md

Latest commit

History

benchmarks.md

File metadata and controls

Benchmarks

Throughput

Accuracy (DER)