Benchmarks

RTF (real-time factor) below 1.0 means faster than real-time.
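
For reference, RTF is processing time divided by audio duration, and the "Nx real-time" figures quoted later are its reciprocal. A minimal sketch of that arithmetic, with illustrative function names that are not part of the benchmark scripts:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speed_multiple(rtf: float) -> float:
    """How many times faster than real time a given RTF is (1 / RTF)."""
    return 1.0 / rtf


# Illustrative numbers only: 0.9 s spent on a 10 s clip.
rtf = real_time_factor(0.9, 10.0)
print(f"RTF {rtf:.3f} -> {speed_multiple(rtf):.1f}x real-time")
```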

Apple Silicon (MLX + CoreML)

All benchmarks on M2 Max, 64 GB, macOS 14 with release builds and compiled metallib.

ASR — Word Error Rate

Evaluated on LibriSpeech test-clean (2620 utterances, ~5.4 hours of English read speech).

Model	Bits	Size	WER%	RTF
Qwen3-ASR 1.7B	8-bit	2.3 GB	2.35	0.090
Qwen3-ASR 1.7B	4-bit	1.2 GB	2.57	0.045
Parakeet TDT 0.6B	INT8	634 MB	2.74	0.089
Qwen3-ASR 0.6B	8-bit	960 MB	2.80	0.025
Qwen3-ASR 0.6B	4-bit	675 MB	3.34	0.023

Comparison with published models

Model	Params	Size	Precision	WER%	Source
Qwen3-ASR 1.7B	1.7B	2.3 GB	8-bit	2.35	This benchmark
Whisper Large v3 Turbo	809M	1.6 GB	FP16	2.5	OpenAI (2024)
Qwen3-ASR 1.7B	1.7B	1.2 GB	4-bit	2.57	This benchmark
Whisper Large v3	1.5B	3.1 GB	FP16	2.7	OpenAI (2023)
Parakeet TDT 0.6B	600M	634 MB	INT8	2.74	This benchmark
Qwen3-ASR 0.6B	600M	960 MB	8-bit	2.80	This benchmark
Whisper Medium	769M	1.5 GB	FP16	3.0	OpenAI (2022)
Qwen3-ASR 0.6B	600M	675 MB	4-bit	3.34	This benchmark
Whisper Small	244M	483 MB	FP16	3.4	OpenAI (2022)

Long-form stability (sustained Neural Engine load)

200 LibriSpeech utterances processed sequentially (~30 min audio, M2 Max). Tests whether WER or latency degrade under sustained transcription.

Metric	First 25%	Last 25%	Overall
WER%	1.30	1.23	2.43
RTF	0.672	0.400	0.539

No degradation detected. WER does not rise between the first and last quarter of the session (1.30% vs 1.23%), and RTF improves from 0.672 to 0.400 as CoreML warms up its execution plan cache. No thermal throttling after 42 minutes of continuous Neural Engine inference. Parakeet processes each chunk independently, so no state accumulates across chunks.

Multilingual results (FLEURS)

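CER is used for Chinese and Japanese, which are written without word boundaries; all other languages, including Korean, use WER. Parakeet supports ~25 European languages only, so it has no results for CJK, Hindi, or Arabic.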

Language	Metric	Qwen3 4-bit	Qwen3 8-bit	Parakeet INT8
Spanish	WER	6.44	5.06	5.18
English	WER	6.57	5.64	9.30
Chinese	CER	8.41	7.71	—
German	WER	9.45	6.81	12.33
French	WER	11.42	8.50	13.02
Japanese	CER	16.11	8.64	—
Russian	WER	16.35	10.52	11.49
Korean	WER	19.95	6.89	—
Hindi	WER	25.93	18.57	—
Arabic	WER	33.47	20.31	—
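
The difference between WER and CER is only the unit the edit distance is counted over. A minimal sketch, assuming plain Levenshtein distance and no text normalization (the benchmark scripts may tokenize and normalize differently); `edit_distance` and `error_rate` are illustrative names, not repository functions:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit):
    """Percent error rate over words (WER) or characters (CER)."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        # Assumption: whitespace is ignored when scoring characters.
        ref = [c for c in reference if not c.isspace()]
        hyp = [c for c in hypothesis if not c.isspace()]
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference is empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


reference = "今天天气很好"
hypothesis = "今天天气真好"
print(error_rate(reference, hypothesis, "char"))  # one of six characters wrong
print(error_rate(reference, hypothesis, "word"))  # the whole line is one "word"
```

With whitespace tokenization an unsegmented sentence is a single token, so one wrong character makes the whole sentence count as wrong; character-level scoring avoids that.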

Compression delta

Accuracy loss from quantizing to lower bit widths.

Variant	WER%	Substitutions	Insertions	Deletions	Total Errors	Size
Qwen3 0.6B 8-bit	2.80	1111	92	268	1471	960 MB
Qwen3 0.6B 4-bit	3.34	1323	123	308	1754	675 MB
Delta	+0.54	+212	+31	+40	+283	-30%
Parakeet TDT INT8	2.74	990	125	308	1423	634 MB
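
The substitution, insertion, and deletion columns come from aligning each hypothesis against its reference; WER is their sum divided by the number of reference words. A minimal sketch of that alignment, assuming whitespace tokenization and no normalization; `wer_breakdown` is a hypothetical helper, not the function used by `benchmark_asr.py`:

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # best[i][j] = (total, subs, ins, dels) for turning ref[:i] into hyp[:j]
    best = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        best[i][0] = (i, 0, 0, i)  # only deletions
    for j in range(1, len(hyp) + 1):
        best[0][j] = (j, 0, j, 0)  # only insertions

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                best[i][j] = best[i - 1][j - 1]  # a match costs nothing
                continue
            t, s, n, d = best[i - 1][j - 1]
            substitution = (t + 1, s + 1, n, d)
            t, s, n, d = best[i][j - 1]
            insertion = (t + 1, s, n + 1, d)
            t, s, n, d = best[i - 1][j]
            deletion = (t + 1, s, n, d + 1)
            # The total is minimal; ties between error types are broken
            # by tuple order, so the per-type split can vary by tool.
            best[i][j] = min(substitution, insertion, deletion)

    total, subs, ins, dels = best[-1][-1]
    return {
        "substitutions": subs,
        "insertions": ins,
        "deletions": dels,
        "errors": total,
        "wer_percent": 100.0 * total / len(ref),
    }


result = wer_breakdown("the cat sat on the mat", "the cat sat on mat")
print(result)
```

For a corpus-level figure, the error counts and reference word counts would be summed over all utterances before dividing, rather than averaging per-utterance percentages.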

Key takeaway

Qwen3-ASR 1.7B 8-bit achieves 2.35% WER, below the published figures for Whisper Large v3 Turbo (2.5%) and Whisper Large v3 (2.7%), while running at about 11x real-time (RTF 0.090) on Apple Silicon.

TTS — Round-Trip Intelligibility

Synthesize text, then transcribe the audio back with Qwen3-ASR 0.6B and compute WER against the original text. Evaluated on 30 built-in English conversational sentences.

Engine	Model	Params	Size	WER%	RTF
CosyVoice3	0.5B 4-bit	500M	~1.9 GB	3.25	0.59
Qwen3-TTS	1.7B 4-bit	1.7B	~2.3 GB	3.47	0.79
Qwen3-TTS	1.7B 8-bit	1.7B	~3.5 GB	3.66	0.85
Kokoro-82M	CoreML	82M	~170 MB	3.90	0.17
Qwen3-TTS	0.6B 8-bit	600M	~960 MB	9.74	0.76
Qwen3-TTS	0.6B 4-bit	600M	~675 MB	15.58	0.76
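
The round-trip procedure described above can be sketched as a loop over sentences. This is an illustration under stated assumptions, not the logic of `benchmark_tts.py`: `synthesize` and `transcribe` are hypothetical callables standing in for the TTS engine and Qwen3-ASR, the normalization (lowercase, punctuation removed) is assumed, and errors are pooled over all sentences before dividing.

```python
import re


def normalize(text: str) -> list:
    """Lowercase, drop punctuation, split on whitespace (assumed normalization)."""
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def word_edits(ref: list, hyp: list) -> int:
    """Levenshtein distance over word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def round_trip_wer(sentences, synthesize, transcribe) -> float:
    """Synthesize each sentence, transcribe it back, and pool word errors."""
    errors = words = 0
    for sentence in sentences:
        heard = transcribe(synthesize(sentence))
        ref, hyp = normalize(sentence), normalize(heard)
        errors += word_edits(ref, hyp)
        words += len(ref)
    return 100.0 * errors / words


# Stand-in engines so the sketch runs without any model.
def fake_tts(text):
    return text.encode("utf-8")


def fake_asr(audio):
    return audio.decode("utf-8").replace("quick", "quit")


score = round_trip_wer(["The quick brown fox.", "Hello there, world!"], fake_tts, fake_asr)
print(f"{score:.2f}% WER")
```

Because the ASR model contributes its own errors, a round-trip score of this kind is an upper bound on the TTS engine's unintelligibility rather than a direct measurement of it.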

Latency breakdown (Qwen3-TTS)

Stage	Time	% of Total	Description
Embed	1-3 ms	<1%	Text embedding (TTFT)
Generate	2-6 s	~92%	Autoregressive codec tokens
Decode	244-457 ms	~8%	Codec decoder to waveform

Key takeaway

All TTS engines run faster than real-time (RTF < 1.0). CosyVoice3 leads in intelligibility (3.25% WER). Kokoro is the fastest (RTF 0.17) at only 170 MB.

VAD — Detection Accuracy

FLEURS evaluation (10 languages, 250 files)

Evaluated against Python FireRedVAD reference ground truth at the same threshold.

Engine	Params	Backend	F1%	FAR%	MR%	RTF
FireRedVAD	588K	CoreML (ANE)	99.12	2.52	0.47	0.007
Silero v5	309K	CoreML (ANE)	95.13	15.76	1.89	0.022
Silero v5	309K	MLX (GPU)	95.11	15.85	1.89	0.027
Pyannote	1.5M	MLX (GPU)	94.86	14.71	2.92	0.358

VoxConverse evaluation (multi-speaker)

5 multi-speaker conversation files evaluated at 10 ms frame resolution.

Engine	Params	Backend	F1%	FAR%	MR%	RTF
Pyannote	1.5M	MLX (GPU)	98.22	50.09	0.19	0.358
Silero v5	309K	CoreML (ANE)	97.52	33.29	2.69	0.022
Silero v5	309K	MLX (GPU)	95.98	21.02	5.88	0.027
FireRedVAD	588K	CoreML (ANE)	94.21	69.33	5.05	0.007
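
Frame-level scoring of this kind can be sketched as follows. The definitions are assumptions, since the document does not spell them out: FAR is taken as false-alarm frames over reference non-speech frames, MR as missed frames over reference speech frames, and F1 as the harmonic mean of frame precision and recall. The function names are illustrative.

```python
def segments_to_frames(segments, duration, frame=0.010):
    """Rasterize (start, end) speech segments in seconds onto 10 ms frames."""
    n_frames = int(round(duration / frame))
    labels = [False] * n_frames
    for start, end in segments:
        first = max(0, int(round(start / frame)))
        last = min(n_frames, int(round(end / frame)))
        for i in range(first, last):
            labels[i] = True
    return labels


def vad_metrics(reference, hypothesis):
    """Return F1, false alarm rate, and miss rate in percent."""
    pairs = list(zip(reference, hypothesis))
    tp = sum(r and h for r, h in pairs)
    fp = sum((not r) and h for r, h in pairs)
    fn = sum(r and (not h) for r, h in pairs)
    tn = sum((not r) and (not h) for r, h in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    far = fp / (fp + tn) if fp + tn else 0.0
    mr = fn / (fn + tp) if fn + tp else 0.0
    return {"f1": 100 * f1, "far": 100 * far, "mr": 100 * mr}


# Toy example: 1 s of audio, reference speech 0.2-0.8 s, detector says 0.1-0.7 s.
ref = segments_to_frames([(0.2, 0.8)], duration=1.0)
hyp = segments_to_frames([(0.1, 0.7)], duration=1.0)
print(vad_metrics(ref, hyp))
```

Under these definitions FAR is normalized by non-speech frames only, so on recordings that are mostly speech a detector can show a large FAR while F1 stays high.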

Comparison with published numbers

Model	F1%	FAR%	MR%	Params	Dataset
Pyannote (ours)	98.22	50.09	0.19	1.5M	VoxConverse
FireRedVAD (paper)	97.57	2.69	3.62	588K	FLEURS-VAD-102
Silero (ours)	95.98	21.02	5.88	309K	VoxConverse
Silero-VAD (paper)	95.95	9.41	3.95	309K	FLEURS-VAD-102
FireRedVAD (ours)	94.21	69.33	5.05	588K	VoxConverse

Key takeaway

FireRedVAD achieves 99.12% F1 on FLEURS with the lowest false alarm rate (2.52%) and runs at 135x real-time. Silero v5 provides the best streaming option at 32 ms per chunk.

Speaker Embeddings

Extraction latency

20-second audio clip, 10 iterations after warmup.

Model	Dim	Backend	Latency
CAM++ (3D-Speaker)	192	CoreML (ANE)	12 ms
WeSpeaker ResNet34-LM	256	MLX (GPU)	64 ms
WeSpeaker ResNet34-LM	256	CoreML (ANE)	143 ms
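
A timing harness in the spirit of "10 iterations after warmup" might look like the sketch below. The number of warmup runs and the use of the mean are assumptions (the document does not say whether mean or median is reported), and `extract` is a placeholder for the embedding call.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=3, iterations=10):
    """Run fn a few times untimed, then time `iterations` calls in milliseconds."""
    for _ in range(warmup):
        fn()  # lets caches and compiled kernels settle before timing
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), samples


def extract():
    # Placeholder workload standing in for a speaker-embedding forward pass.
    sum(i * i for i in range(10_000))


mean_ms, samples = measure_latency_ms(extract)
print(f"mean {mean_ms:.2f} ms over {len(samples)} runs")
```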

Embedding quality (VoxConverse)

Cosine similarity between segment-level embeddings from 5 multi-speaker recordings. Higher separation = better speaker discrimination.

Model	Backend	Intra-Speaker	Inter-Speaker	Separation
WeSpeaker	MLX	0.726	0.142	0.584
WeSpeaker	CoreML	0.726	0.143	0.582
CAM++	CoreML	0.723	0.395	0.328
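
The separation column is the mean intra-speaker cosine similarity minus the mean inter-speaker cosine similarity. A minimal sketch of that computation, assuming every pair of segments is compared (the benchmark may sample or group pairs differently); the names are illustrative:

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def separation(embeddings, speakers):
    """Return (mean intra, mean inter, intra - inter) over all segment pairs."""
    intra, inter = [], []
    labelled = list(zip(embeddings, speakers))
    for (emb_a, spk_a), (emb_b, spk_b) in combinations(labelled, 2):
        (intra if spk_a == spk_b else inter).append(cosine(emb_a, emb_b))
    mean_intra = sum(intra) / len(intra)
    mean_inter = sum(inter) / len(inter)
    return mean_intra, mean_inter, mean_intra - mean_inter


# Toy 2-D embeddings for two speakers, two segments each.
embeddings = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0), (0.6, 0.8)]
speakers = ["A", "A", "B", "B"]
print(separation(embeddings, speakers))
```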

Key takeaway

All three engines match the Python pyannote reference (0.577 separation, cosine similarity >0.96). WeSpeaker achieves 0.584 separation on MLX and 0.582 on CoreML. CAM++ runs about 5x faster (12 ms vs 64 ms for WeSpeaker on MLX) at the cost of lower separation (0.328).

Source Separation — SDR

Signal-to-Distortion Ratio (SDR) on MUSDB18-HQ (50 full-length test tracks, stereo, 44.1 kHz). Higher is better. Two model sizes: HQ (8.9M params/stem) and L (28.3M params/stem).

Target	UMX-HQ (MLX)	UMX-L (MLX)	UMX-HQ (published)
Vocals	6.23 dB	~10.5 dB	6.32 dB
Drums	6.44 dB	~7.0 dB	5.73 dB
Bass	4.56 dB	~5.5 dB	5.23 dB
Other	3.41 dB	~4.5 dB	4.02 dB

Model	Params/stem	Size	RTF	Speed
Open-Unmix HQ	8.9M	136 MB	0.23	4.3x real-time
Open-Unmix L	28.3M	432 MB	0.21	4.8x real-time
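
As a reminder of what the dB figures mean, SDR in its simplest form is the ratio of reference signal energy to the energy of the estimation error. The sketch below uses that basic definition on a single channel; it is a simplification, since published MUSDB18 scores are normally produced with BSS Eval style tooling that scores short frames and aggregates them, and the document does not say which variant `benchmark_separation.py` uses.

```python
import math


def sdr_db(reference, estimate):
    """Basic SDR: 10 * log10(||s||^2 / ||s - s_hat||^2), in dB."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal == 0:
        raise ValueError("reference is silent; SDR is undefined")
    if error == 0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal / error)


# Toy signal: the estimate is the reference scaled by 0.9,
# so the error has 1% of the reference energy.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9 * x for x in reference]
print(f"{sdr_db(reference, estimate):.2f} dB")
```

Each 10 dB step corresponds to a tenfold reduction in error energy relative to the target stem.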

Key takeaway

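UMX-HQ is within 0.1 dB of the published SDR on vocals and exceeds it on drums, while trailing by about 0.6 to 0.7 dB on bass and other, using a lightweight model of 8.9M parameters per stem. UMX-L improves on UMX-HQ by roughly 0.6 to 4.3 dB depending on the stem (approximate figures from the table above) at about 3x the model size. Both include multichannel Wiener EM post-filtering and run faster than real-time on Apple Silicon.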

Reproduction

# ASR benchmarks (LibriSpeech test-clean)
make build
python scripts/benchmark_asr.py --batch --engine qwen3 --model 0.6B
python scripts/benchmark_asr.py --batch --engine qwen3 --model 0.6B-8bit
python scripts/benchmark_asr.py --batch --engine parakeet
python scripts/benchmark_asr.py --batch --engine parakeet --model int8

# ASR multilingual (FLEURS, auto-download)
python scripts/benchmark_asr.py --dataset fleurs --language en_us --batch

# TTS round-trip
python scripts/benchmark_tts.py --compare

# VAD comparison
python scripts/benchmark_vad.py --compare

# Speaker embeddings comparison
python scripts/benchmark_speaker.py --compare

# Source separation (MUSDB18-HQ, download from Zenodo)
python scripts/benchmark_separation.py --data-dir benchmarks/data/musdb18-hq