Benchmarks
RTF (real-time factor) below 1.0 means faster than real-time.
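The relationship between the RTF and xRT columns used throughout this page can be sketched as follows. This is a minimal illustration, assuming RTF is wall-clock processing time divided by audio duration; the function names and the sample numbers are hypothetical and not taken from the benchmark harness.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """xRT, the inverse of RTF: seconds of audio handled per second of compute."""
    return 1.0 / rtf


# Hypothetical run: 60 s of audio transcribed in 2 s of wall-clock time.
rtf = real_time_factor(2.0, 60.0)
print(f"RTF {rtf:.3f}, {speedup(rtf):.1f}x real-time")
```

Because the tables report rounded RTF values, inverting a table entry will not always reproduce the xRT column exactly.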
Apple Silicon (MLX + CoreML)
All benchmarks on Apple M5 Pro, 48 GB, macOS 25.5 with release builds and compiled metallib. Each engine runs in a separate child process (asr-bench --isolated) so the Peak RSS column is the real per-engine cost, not a sequential high-water mark.
ASR — Word Error Rate
Evaluated on LibriSpeech test-clean, first 200 utterances (~30 min of English read speech). WER computed via a Whisper-style normalizer + Levenshtein over whitespace tokens.
| Engine | Quant | WER% | RTF | xRT | Peak RSS |
|---|---|---|---|---|---|
| Qwen3-ASR 1.7B MLX | 8-bit | 1.52 | 0.033 | 30.5× | 2.7 GB |
| WhisperKit Large-v3 Turbo | FP16 | 1.71 | 0.084 | 11.9× | 0.4 GB |
| Qwen3-ASR 0.6B MLX | 8-bit | 1.82 | 0.015 | 66.0× | 1.3 GB |
| Qwen3-ASR 0.6B MLX | 4-bit | 2.20 | 0.012 | 85.6× | 1.0 GB |
| Parakeet TDT v3 | INT8 | 2.37 | 0.009 | 117.4× | 0.9 GB |
| Qwen3-ASR 0.6B CoreML | INT8 | 3.02 | 0.098 | 10.2× | 1.4 GB |
| Omnilingual CTC 300M MLX | 4-bit | 4.26 | 0.005 | 222.1× | 0.4 GB |
| Omnilingual CTC 300M CoreML | INT8 | 5.67 | 0.128 | 7.8× | 0.5 GB |
| Nemotron Streaming | INT8 | 12.26 | 0.107 | 9.3× | 1.5 GB |
Headline picks: Qwen3-ASR 1.7B MLX 8-bit beats WhisperKit Large-v3 Turbo on WER (1.52% vs 1.71%) and runs 2.6× faster, at roughly 7× the peak memory (2.7 GB vs 0.4 GB). Parakeet TDT v3 is the fastest engine under 3% WER (117× real-time) and covers 25 European languages. Omnilingual CTC 300M MLX 4-bit is the multilingual throughput leader: 222× real-time, 384 MB peak, 1 672 languages.
The Qwen3-ASR 0.6B CoreML row reflects the rebuilt chunked block-attention encoder (aufklarer/Qwen3-ASR-CoreML) — the previous export ran unmasked global self-attention over zero-padded mel and emitted <|im_end|> right after the first sentence-final period (24.88% WER on the same fixture before the rebuild).
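The WER metric described above (normalize, then Levenshtein distance over whitespace tokens) can be sketched like this. The `normalize` function is a simplified stand-in and not the Whisper normalizer itself, and corpus-level aggregation (total edits over total reference words) is an assumption about how the per-utterance scores are combined.

```python
import re


def normalize(text: str) -> list[str]:
    """Simplified stand-in for a Whisper-style normalizer (assumption):
    lowercase, drop punctuation except apostrophes, split on whitespace."""
    text = re.sub(r"[^\w\s']", " ", text.lower())
    return text.split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences (two-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """Corpus-level WER in percent: total edits / total reference words."""
    errors = words = 0
    for reference, hypothesis in pairs:
        ref, hyp = normalize(reference), normalize(hypothesis)
        errors += edit_distance(ref, hyp)
        words += len(ref)
    return 100.0 * errors / words


# Toy data: one substitution across eight reference words.
pairs = [("The cat sat on the mat.", "the cat sat on a mat"),
         ("Hello, world!", "hello world")]
print(f"{corpus_wer(pairs):.2f}")
```

A real Whisper-style normalizer also handles numerals, contractions, and spelling variants, so scores from this sketch would likely differ from the table.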
Long-form stability (sustained Neural Engine load)
200 LibriSpeech utterances processed sequentially (~30 min audio, M5 Pro). Tests whether WER or latency degrade under sustained transcription.
| Metric | First 25% | Last 25% | Overall |
|---|---|---|---|
| WER% | 1.30 | 1.23 | 2.43 |
| RTF | 0.672 | 0.400 | 0.539 |
No degradation detected. WER is stable across the session. RTF actually improves as CoreML warms up its execution plan cache. No thermal throttling after 42 minutes of continuous Neural Engine inference. Parakeet processes each chunk independently — no cross-chunk state accumulation.
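A first-quarter versus last-quarter comparison like the one in the table can be sketched as below. The per-utterance record schema (`errors`, `ref_words`, `proc_s`, `audio_s`) is hypothetical, and the toy values are illustrative only.

```python
def quartile_summary(results):
    """results: per-utterance dicts in processing order with the hypothetical
    keys 'errors', 'ref_words', 'proc_s', 'audio_s'."""
    def aggregate(chunk):
        wer = 100.0 * sum(r["errors"] for r in chunk) / sum(r["ref_words"] for r in chunk)
        rtf = sum(r["proc_s"] for r in chunk) / sum(r["audio_s"] for r in chunk)
        return wer, rtf

    q = max(1, len(results) // 4)  # size of one quarter of the session
    return {
        "first": aggregate(results[:q]),
        "last": aggregate(results[-q:]),
        "overall": aggregate(results),
    }


# Toy session: 8 utterances, harder material in the middle, faster at the end.
errors = [1, 1, 4, 4, 4, 4, 1, 1]
proc_s = [6, 6, 5, 5, 5, 5, 4, 4]
results = [{"errors": e, "ref_words": 100, "proc_s": p, "audio_s": 10.0}
           for e, p in zip(errors, proc_s)]
summary = quartile_summary(results)
```

The overall figure also covers the middle half of the session, so it does not have to fall between the first-quarter and last-quarter values.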
Multilingual results (FLEURS)
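CER is used for Chinese and Japanese, which are written without whitespace word boundaries; all other languages, including Korean, use WER. Parakeet supports only ~25 European languages, so its column is blank for the other rows.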
| Language | Metric | Qwen3 4-bit | Qwen3 8-bit | Parakeet INT8 |
|---|---|---|---|---|
| Spanish | WER | 6.44 | 5.06 | 5.18 |
| English | WER | 6.57 | 5.64 | 9.30 |
| Chinese | CER | 8.41 | 7.71 | — |
| German | WER | 9.45 | 6.81 | 12.33 |
| French | WER | 11.42 | 8.50 | 13.02 |
| Japanese | CER | 16.11 | 8.64 | — |
| Russian | WER | 16.35 | 10.52 | 11.49 |
| Korean | WER | 19.95 | 6.89 | — |
| Hindi | WER | 25.93 | 18.57 | — |
| Arabic | WER | 33.47 | 20.31 | — |
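
For the CER rows, the same edit-distance idea applies at character level. The sketch below assumes NFKC normalization and whitespace removal before scoring; the actual benchmark may normalize differently, and the sample sentence is illustrative.

```python
import unicodedata


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (two-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def to_chars(text: str) -> list[str]:
    """Assumed normalization: NFKC, then drop all whitespace."""
    return [c for c in unicodedata.normalize("NFKC", text) if not c.isspace()]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER in percent: character edits / reference characters."""
    ref, hyp = to_chars(reference), to_chars(hypothesis)
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# One dropped character out of seven reference characters.
cer = char_error_rate("今日は晴れです", "今日は晴です")
print(f"{cer:.2f}")
```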
Compression delta
Accuracy loss from quantizing to lower bit widths.
| Variant | WER% | Substitutions | Insertions | Deletions | Total Errors | Size |
|---|---|---|---|---|---|---|
| Qwen3 0.6B 8-bit | 2.80 | 1111 | 92 | 268 | 1471 | 960 MB |
| Qwen3 0.6B 4-bit | 3.34 | 1323 | 123 | 308 | 1754 | 675 MB |
| Delta | +0.54 | +212 | +31 | +40 | +283 | -30% |
| Parakeet TDT INT8 | 2.74 | 990 | 125 | 308 | 1423 | 634 MB |
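
The Delta row can be re-derived from the two Qwen3 0.6B rows in the table above. This is plain arithmetic on the published values; the dictionary layout is only for illustration.

```python
rows = {
    "8-bit": {"wer": 2.80, "sub": 1111, "ins": 92, "del": 268, "size_mb": 960},
    "4-bit": {"wer": 3.34, "sub": 1323, "ins": 123, "del": 308, "size_mb": 675},
}

# Total errors = substitutions + insertions + deletions.
for row in rows.values():
    row["total"] = row["sub"] + row["ins"] + row["del"]

base, quant = rows["8-bit"], rows["4-bit"]
delta = {k: quant[k] - base[k] for k in ("sub", "ins", "del", "total")}
delta_wer = round(quant["wer"] - base["wer"], 2)
size_change_pct = round(100 * (quant["size_mb"] - base["size_mb"]) / base["size_mb"])

print(delta, delta_wer, size_change_pct)
```

In relative terms, going from 1471 to 1754 errors is roughly a 19% increase in error count for a 30% smaller model.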
Qwen3-ASR 1.7B 8-bit achieves 2.35% WER — surpassing Whisper Large v3 Turbo (2.5%) and Whisper Large v3 (2.7%) while running at 11x real-time on Apple Silicon.
TTS — Round-Trip Intelligibility
Synthesize text, then transcribe the audio back with Qwen3-ASR 0.6B and compute WER against the original text. Evaluated on 30 built-in English conversational sentences.
| Engine | Model | Params | Size | WER% | RTF |
|---|---|---|---|---|---|
| CosyVoice3 | 0.5B 4-bit | 500M | ~1.9 GB | 3.25 | 0.59 |
| Qwen3-TTS | 1.7B 4-bit | 1.7B | ~2.3 GB | 3.47 | 0.79 |
| Qwen3-TTS | 1.7B 8-bit | 1.7B | ~3.5 GB | 3.66 | 0.85 |
| Kokoro-82M | CoreML | 82M | ~170 MB | 3.90 | 0.17 |
| Qwen3-TTS | 0.6B 8-bit | 600M | ~960 MB | 9.74 | 0.76 |
| Qwen3-TTS | 0.6B 4-bit | 600M | ~675 MB | 15.58 | 0.76 |
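
The round-trip procedure (synthesize, transcribe, score against the input text) can be sketched as follows. The `synthesize` and `transcribe` callables are hypothetical placeholders for a TTS engine and an ASR engine, and the stand-in engines at the bottom exist only so the example runs; they are not part of the benchmark.

```python
from typing import Callable, Sequence


def word_errors(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (two-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def round_trip_wer(sentences: Sequence[str],
                   synthesize: Callable[[str], bytes],
                   transcribe: Callable[[bytes], str]) -> float:
    """WER in percent between the input text and the ASR transcript of its TTS audio."""
    errors = words = 0
    for text in sentences:
        transcript = transcribe(synthesize(text))
        ref, hyp = text.lower().split(), transcript.lower().split()
        errors += word_errors(ref, hyp)
        words += len(ref)
    return 100.0 * errors / words


# Stand-in engines (assumption): "audio" is just UTF-8 text, and the fake
# ASR confuses one homophone so the score is non-zero.
def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8").replace("four", "for")


score = round_trip_wer(["see you at four", "thanks a lot"], fake_tts, fake_asr)
print(f"{score:.2f}")
```

A round-trip score includes the ASR engine's own errors, so it is best read as a relative comparison between TTS engines scored with the same transcriber.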
Latency breakdown (Qwen3-TTS)
| Stage | Time | % of Total | Description |
|---|---|---|---|
| Embed | 1-3 ms | <1% | Text embedding (TTFT) |
| Generate | 2-6 s | ~92% | Autoregressive codec tokens |
| Decode | 244-457 ms | ~8% | Codec decoder to waveform |
All TTS engines run faster than real-time (RTF < 1.0). CosyVoice3 leads in intelligibility (3.25% WER). Kokoro is the fastest (RTF 0.17) at only 170 MB.
VAD — Detection Accuracy
FLEURS evaluation (10 languages, 250 files)
Evaluated against Python FireRedVAD reference ground truth at the same threshold.
| Engine | Params | Backend | F1% | FAR% | MR% | RTF |
|---|---|---|---|---|---|---|
| FireRedVAD | 588K | CoreML (ANE) | 99.12 | 2.52 | 0.47 | 0.007 |
| Silero v5 | 309K | CoreML (ANE) | 95.13 | 15.76 | 1.89 | 0.022 |
| Silero v5 | 309K | MLX (GPU) | 95.11 | 15.85 | 1.89 | 0.027 |
| Pyannote | 1.5M | MLX (GPU) | 94.86 | 14.71 | 2.92 | 0.358 |
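
The F1, FAR, and MR columns can be computed from frame-level labels roughly as follows. The definitions here are an assumption based on common usage (FAR as false alarms over reference non-speech frames, MR as misses over reference speech frames); the benchmark script may define them differently.

```python
def vad_metrics(reference, predicted):
    """Frame-level VAD scores in percent. Inputs are equal-length sequences
    of 0/1 labels (1 = speech). Metric definitions are assumed, see lead-in."""
    tp = sum(1 for r, p in zip(reference, predicted) if r and p)
    fp = sum(1 for r, p in zip(reference, predicted) if not r and p)
    fn = sum(1 for r, p in zip(reference, predicted) if r and not p)
    tn = sum(1 for r, p in zip(reference, predicted) if not r and not p)

    def pct(num, den):
        return 100.0 * num / den if den else 0.0

    return {
        "f1": pct(2 * tp, 2 * tp + fp + fn),  # same as 2PR / (P + R)
        "far": pct(fp, fp + tn),              # false alarm rate
        "mr": pct(fn, fn + tp),               # miss rate
    }


# Toy clip: 8 speech frames, 2 non-speech frames, one false alarm.
reference = [1] * 8 + [0] * 2
predicted = [1] * 8 + [1, 0]
scores = vad_metrics(reference, predicted)
print(scores)
```

Under these definitions, a recording with very little non-speech can show a high F1 alongside a high FAR, because a handful of false-alarm frames is a large share of the few non-speech frames.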
VoxConverse evaluation (multi-speaker)
5 multi-speaker conversation files evaluated at 10 ms frame resolution.
| Engine | Params | Backend | F1% | FAR% | MR% | RTF |
|---|---|---|---|---|---|---|
| Pyannote | 1.5M | MLX (GPU) | 98.22 | 50.09 | 0.19 | 0.358 |
| Silero v5 | 309K | CoreML (ANE) | 97.52 | 33.29 | 2.69 | 0.022 |
| Silero v5 | 309K | MLX (GPU) | 95.98 | 21.02 | 5.88 | 0.027 |
| FireRedVAD | 588K | CoreML (ANE) | 94.21 | 40.12 | 5.05 | 0.007 |
Comparison with published numbers
| Model | F1% | FAR% | MR% | Params | Dataset |
|---|---|---|---|---|---|
| Pyannote (ours) | 98.22 | 50.09 | 0.19 | 1.5M | VoxConverse |
| FireRedVAD (paper) | 97.57 | 2.69 | 3.62 | 588K | FLEURS-VAD-102 |
| Silero (ours) | 95.98 | 21.02 | 5.88 | 309K | VoxConverse |
| Silero-VAD (paper) | 95.95 | 9.41 | 3.95 | 309K | FLEURS-VAD-102 |
| FireRedVAD (ours) | 94.21 | 69.33 | 5.05 | 588K | VoxConverse |
FireRedVAD achieves 99.12% F1 on FLEURS with the lowest false alarm rate (2.52%) and runs at 135x real-time. Silero v5 provides the best streaming option at 32 ms per chunk.
Wake-Word / Keyword Spotting
KWS Zipformer (gigaspeech fine-tune)
Streaming Zipformer2 transducer (3.49M params, Apache-2.0) with INT8 palettization on CoreML. Evaluated against 12 keywords on LibriSpeech test-clean (158 positive utterances, 60 negative). Thresholds tuned: acThreshold = 0.15, contextScore = 0.5, numTrailingBlanks = 1.
| Metric | Value | Notes |
|---|---|---|
| RTF (CPU + Neural Engine) | 0.04 | 26× real-time on M-series |
| Recall (12 keywords) | 88% | LibriSpeech test-clean, 158 positive utterances |
| False positives / utterance | 0.27 | 60 negative utterances |
| CoreML INT8 vs PyTorch FP32 | 99% | Emission agreement |
| Compiled model size | ~4 MB | encoder 3.3 MB + decoder 525 KB + joiner 160 KB |
| Runtime memory | ~6 MB | Weights + encoder state caches |
Tuned defaults improved recall from 62% to 88% (and cut FP/utt from 0.43 to 0.27) vs. the upstream icefall defaults (acThreshold = 0.25, contextScore = 2.0). See the wake-word guide for keyword-file format and per-phrase threshold tuning.
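Recall and false positives per utterance can be tallied as in the sketch below. The scoring rules are an assumption (a positive utterance counts as a hit if its expected keyword is detected at least once; every detection on a negative utterance counts as a false positive), and the keywords shown are made up.

```python
def kws_scores(positives, negatives):
    """positives: (expected_keyword, detections) per utterance containing a keyword.
    negatives: detections per utterance containing no keyword.
    Returns (recall in percent, false positives per negative utterance)."""
    hits = sum(1 for keyword, detections in positives if keyword in detections)
    false_positives = sum(len(detections) for detections in negatives)
    recall = 100.0 * hits / len(positives)
    fp_per_utt = false_positives / len(negatives)
    return recall, fp_per_utt


# Toy data with hypothetical keywords: 3 of 4 positives detected,
# 1 false trigger over 4 negative utterances.
positives = [("hey computer", ["hey computer"]),
             ("lights on", []),
             ("stop music", ["stop music"]),
             ("lights on", ["lights on"])]
negatives = [[], ["stop music"], [], []]
recall, fp_per_utt = kws_scores(positives, negatives)
print(recall, fp_per_utt)
```

At the reported rates, 88% of 158 positive utterances is about 139 detections, and 0.27 per utterance over 60 negatives is about 16 false triggers.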
Speaker Embeddings
Extraction latency
20-second audio clip, 10 iterations after warmup.
| Model | Dim | Backend | Latency |
|---|---|---|---|
| CAM++ (3D-Speaker) | 192 | CoreML (ANE) | 12 ms |
| WeSpeaker ResNet34-LM | 256 | MLX (GPU) | 64 ms |
| WeSpeaker ResNet34-LM | 256 | CoreML (ANE) | 143 ms |
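
A latency measurement with warmup followed by 10 timed iterations can be sketched as below. The `extract` callable is a hypothetical placeholder for an embedding model, and since the source does not say whether the table reports the mean or the median, the sketch returns both.

```python
import statistics
import time


def measure_latency_ms(fn, warmup: int = 1, iterations: int = 10):
    """Run fn untimed `warmup` times, then time `iterations` calls.
    Returns (mean_ms, median_ms)."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.median(samples)


# Stand-in workload (assumption): replace with a real embedding call.
calls = []


def extract():
    calls.append(sum(i * i for i in range(1000)))


mean_ms, median_ms = measure_latency_ms(extract)
print(f"mean {mean_ms:.3f} ms, median {median_ms:.3f} ms")
```

The warmup call matters on CoreML and MLX because the first invocation typically pays one-off model compilation and loading costs.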
Embedding quality (VoxConverse)
Cosine similarity between segment-level embeddings from 5 multi-speaker recordings. Higher separation = better speaker discrimination.
| Model | Backend | Intra-Speaker | Inter-Speaker | Separation |
|---|---|---|---|---|
| WeSpeaker | MLX | 0.726 | 0.142 | 0.584 |
| WeSpeaker | CoreML | 0.726 | 0.143 | 0.582 |
| CAM++ | CoreML | 0.723 | 0.395 | 0.328 |
All three engines match the Python pyannote reference (0.577 separation, cosine similarity >0.96). WeSpeaker achieves 0.584 separation on MLX and 0.582 on CoreML. CAM++ runs about 5x faster than WeSpeaker on MLX (12 ms vs 64 ms), at lower separation (0.328).
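The Separation column is the mean intra-speaker cosine similarity minus the mean inter-speaker cosine similarity. A minimal sketch, assuming all segment pairs are compared and averaged with equal weight; the 2-dimensional embeddings are toy values, not model output.

```python
import math
from itertools import combinations


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def separation(embeddings):
    """embeddings: dict mapping speaker id -> list of segment embeddings.
    Returns (mean intra-speaker, mean inter-speaker, separation)."""
    items = [(spk, e) for spk, embs in embeddings.items() for e in embs]
    intra, inter = [], []
    for (s1, e1), (s2, e2) in combinations(items, 2):
        (intra if s1 == s2 else inter).append(cosine(e1, e2))
    mean_intra = sum(intra) / len(intra)
    mean_inter = sum(inter) / len(inter)
    return mean_intra, mean_inter, mean_intra - mean_inter


toy = {
    "spk_a": [[1.0, 0.0], [0.8, 0.6]],
    "spk_b": [[0.0, 1.0], [0.6, 0.8]],
}
intra, inter, sep = separation(toy)
print(f"{intra:.3f} {inter:.3f} {sep:.3f}")
```

Applied to the table, 0.726 minus 0.142 gives the 0.584 reported for WeSpeaker on MLX.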
Source Separation — SDR
Signal-to-Distortion Ratio (SDR) on MUSDB18-HQ (50 full-length test tracks, stereo 44.1 kHz). Higher is better. Two model sizes: HQ (8.9M params/stem) and L (28.3M params/stem).
| Target | UMX-HQ (MLX) | UMX-L (MLX) | UMX-HQ (published) |
|---|---|---|---|
| Vocals | 6.23 dB | ~10.5 dB | 6.32 dB |
| Drums | 6.44 dB | ~7.0 dB | 5.73 dB |
| Bass | 4.56 dB | ~5.5 dB | 5.23 dB |
| Other | 3.41 dB | ~4.5 dB | 4.02 dB |
| Model | Params/stem | Size | RTF | Speed |
|---|---|---|---|---|
| Open-Unmix HQ | 8.9M | 136 MB | 0.23 | 4.3x real-time |
| Open-Unmix L | 28.3M | 432 MB | 0.21 | 4.8x real-time |
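UMX-HQ comes within 0.1 dB of the published SDR on vocals and exceeds it on drums (6.44 vs 5.73 dB), while trailing by about 0.6–0.7 dB on bass and other, with a lightweight 8.9M-parameter model per stem. UMX-L improves SDR over UMX-HQ by roughly 0.6 dB (drums) to 4.3 dB (vocals) at 3x the model size. Both include multichannel Wiener EM post-filtering and run faster than real-time on Apple Silicon.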
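For reference, the basic SDR definition is the ratio of reference signal energy to residual energy, in decibels. The sketch below uses that plain single-channel definition as an assumption; published MUSDB18 figures are typically produced with a BSSEval-style toolkit that windows and aggregates differently, so this will not reproduce the table values.

```python
import math


def sdr_db(reference, estimate) -> float:
    """SDR = 10 * log10(||s||^2 / ||s - s_hat||^2) for equal-length signals."""
    signal = sum(s * s for s in reference)
    residual = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10.0 * math.log10(signal / residual)


# Toy signal: the estimate is the reference scaled by 0.9, i.e. a 10% amplitude error.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9 * s for s in reference]
value = sdr_db(reference, estimate)
print(f"{value:.2f} dB")
```

A 10% amplitude error corresponds to about 20 dB, which gives a sense of scale for the 3 to 10 dB range in the table.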
Reproduction
```bash
# ASR benchmarks (LibriSpeech test-clean)
make build
python scripts/benchmark_asr.py --batch --engine qwen3 --model 0.6B
python scripts/benchmark_asr.py --batch --engine qwen3 --model 0.6B-8bit
python scripts/benchmark_asr.py --batch --engine parakeet
python scripts/benchmark_asr.py --batch --engine parakeet --model int8

# ASR multilingual (FLEURS, auto-download)
python scripts/benchmark_asr.py --dataset fleurs --language en_us --batch

# TTS round-trip
python scripts/benchmark_tts.py --compare

# VAD comparison
python scripts/benchmark_vad.py --compare

# Speaker embeddings comparison
python scripts/benchmark_speaker.py --compare

# Source separation (MUSDB18-HQ, download from Zenodo)
python scripts/benchmark_separation.py --data-dir benchmarks/data/musdb18-hq
```