# Benchmarks

An RTF (real-time factor) below 1.0 means faster than real time.
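RTF here is wall-clock processing time divided by audio duration. A minimal sketch of how it might be measured (`process` and `audio` are stand-ins for any engine and input in this repo, not actual APIs):

```python
import time

def real_time_factor(process, audio, duration_s):
    """Wall-clock processing time divided by audio duration.

    `process` and `audio` are placeholders for an inference function and
    its input; RTF < 1.0 means the engine keeps up with real time.
    """
    start = time.perf_counter()
    process(audio)
    elapsed = time.perf_counter() - start
    return elapsed / duration_s
```

For example, an engine that needs 0.9 s of wall-clock time for 10 s of audio has RTF 0.09, i.e. about 11x real-time.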

## Apple Silicon (MLX + CoreML)

All benchmarks were run on an M2 Max (64 GB, macOS 14) with release builds and a compiled metallib.


### ASR — Word Error Rate

Evaluated on LibriSpeech test-clean (2620 utterances, ~5.4 hours of English read speech).

| Model | Bits | Size | WER% | RTF |
|---|---|---|---|---|
| Qwen3-ASR 1.7B | 8-bit | 2.3 GB | 2.35 | 0.090 |
| Qwen3-ASR 1.7B | 4-bit | 1.2 GB | 2.57 | 0.045 |
| Parakeet TDT 0.6B | INT8 | 634 MB | 2.74 | 0.089 |
| Qwen3-ASR 0.6B | 8-bit | 960 MB | 2.80 | 0.025 |
| Qwen3-ASR 0.6B | 4-bit | 675 MB | 3.34 | 0.023 |

#### Comparison with published models

| Model | Params | Size | Precision | WER% | Source |
|---|---|---|---|---|---|
| Qwen3-ASR 1.7B | 1.7B | 2.3 GB | 8-bit | 2.35 | This benchmark |
| Whisper Large v3 Turbo | 809M | 1.6 GB | FP16 | 2.5 | OpenAI (2024) |
| Qwen3-ASR 1.7B | 1.7B | 1.2 GB | 4-bit | 2.57 | This benchmark |
| Whisper Large v3 | 1.5B | 3.1 GB | FP16 | 2.7 | OpenAI (2023) |
| Parakeet TDT 0.6B | 600M | 634 MB | INT8 | 2.74 | This benchmark |
| Qwen3-ASR 0.6B | 600M | 960 MB | 8-bit | 2.80 | This benchmark |
| Whisper Medium | 769M | 1.5 GB | FP16 | 3.0 | OpenAI (2022) |
| Qwen3-ASR 0.6B | 600M | 675 MB | 4-bit | 3.34 | This benchmark |
| Whisper Small | 244M | 483 MB | FP16 | 3.4 | OpenAI (2022) |

#### Long-form stability (sustained Neural Engine load)

200 LibriSpeech utterances processed sequentially (~30 min of audio, M2 Max). This tests whether WER or latency degrades under sustained transcription.

| Metric | First 25% | Last 25% | Overall |
|---|---|---|---|
| WER% | 1.30 | 1.23 | 2.43 |
| RTF | 0.672 | 0.400 | 0.539 |

No degradation detected: WER is stable across the session, and RTF actually improves as CoreML warms up its execution-plan cache. No thermal throttling was observed after 42 minutes of continuous Neural Engine inference. Parakeet processes each chunk independently, so no cross-chunk state accumulates.

#### Multilingual results (FLEURS)

CER is used for CJK languages (no word boundaries). Parakeet supports ~25 European languages (no CJK); unsupported languages are marked "—".

| Language | Metric | Qwen3 4-bit | Qwen3 8-bit | Parakeet INT8 |
|---|---|---|---|---|
| Spanish | WER | 6.44 | 5.06 | 5.18 |
| English | WER | 6.57 | 5.64 | 9.30 |
| Chinese | CER | 8.41 | 7.71 | — |
| German | WER | 9.45 | 6.81 | 12.33 |
| French | WER | 11.42 | 8.50 | 13.02 |
| Japanese | CER | 16.11 | 8.64 | — |
| Russian | WER | 16.35 | 10.52 | 11.49 |
| Korean | WER | 19.95 | 6.89 | — |
| Hindi | WER | 25.93 | 18.57 | — |
| Arabic | WER | 33.47 | 20.31 | — |

#### Compression delta

Accuracy loss from quantizing to lower bit widths.

| Variant | WER% | Substitutions | Insertions | Deletions | Total Errors | Size |
|---|---|---|---|---|---|---|
| Qwen3 0.6B 8-bit | 2.80 | 1111 | 92 | 268 | 1471 | 960 MB |
| Qwen3 0.6B 4-bit | 3.34 | 1323 | 123 | 308 | 1754 | 675 MB |
| Delta | +0.54 | +212 | +31 | +40 | +283 | -30% |
| Parakeet TDT INT8 | 2.74 | 990 | 125 | 308 | 1423 | 634 MB |

Qwen3-ASR 1.7B 8-bit achieves 2.35% WER, surpassing Whisper Large v3 Turbo (2.5%) and Whisper Large v3 (2.7%) while running at ~11x real-time on Apple Silicon.
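The substitution/insertion/deletion counts in the compression table come from a Levenshtein alignment between reference and hypothesis words. A minimal sketch (the benchmark scripts may use a library such as jiwer instead):

```python
def wer_counts(ref, hyp):
    """Levenshtein alignment between reference and hypothesis word lists.

    Returns (substitutions, insertions, deletions); WER is their sum
    divided by len(ref). Run on characters instead of words to get CER.
    """
    # dp[i][j] = (cost, subs, ins, dels) for ref[:i] vs hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, 0, i)          # all deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, j, 0)          # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
                continue
            sub, ins, dele = dp[i - 1][j - 1], dp[i][j - 1], dp[i - 1][j]
            best = min(sub, ins, dele)   # lowest cost wins (ties by tuple order)
            if best is sub:
                dp[i][j] = (best[0] + 1, best[1] + 1, best[2], best[3])
            elif best is ins:
                dp[i][j] = (best[0] + 1, best[1], best[2] + 1, best[3])
            else:
                dp[i][j] = (best[0] + 1, best[1], best[2], best[3] + 1)
    _, s, ins_, d = dp[len(ref)][len(hyp)]
    return s, ins_, d
```

With this, WER% = 100 * (S + I + D) / reference word count, matching the columns in the table above.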

### TTS — Round-Trip Intelligibility

Synthesize each sentence, transcribe the audio back with Qwen3-ASR 0.6B, and compute WER against the original text. Evaluated on 30 built-in English conversational sentences.

| Engine | Model | Params | Size | WER% | RTF |
|---|---|---|---|---|---|
| CosyVoice3 | 0.5B 4-bit | 500M | ~1.9 GB | 3.25 | 0.59 |
| Qwen3-TTS | 1.7B 4-bit | 1.7B | ~2.3 GB | 3.47 | 0.79 |
| Qwen3-TTS | 1.7B 8-bit | 1.7B | ~3.5 GB | 3.66 | 0.85 |
| Kokoro-82M | CoreML | 82M | ~170 MB | 3.90 | 0.17 |
| Qwen3-TTS | 0.6B 8-bit | 600M | ~960 MB | 9.74 | 0.76 |
| Qwen3-TTS | 0.6B 4-bit | 600M | ~675 MB | 15.58 | 0.76 |
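The round-trip loop can be sketched as follows. `tts.synthesize`, `asr.transcribe`, and `word_errors` are hypothetical stand-ins for the engines and scorer; the repo's actual APIs may differ:

```python
def round_trip_wer(tts, asr, sentences, word_errors):
    """Round-trip intelligibility: synthesize each sentence, transcribe
    the audio back, and score the transcript against the original text.

    `tts.synthesize`, `asr.transcribe`, and `word_errors` are stand-ins
    for the engines and the word-level edit-distance scorer.
    """
    total_errors = 0
    total_words = 0
    for text in sentences:
        audio = tts.synthesize(text)          # text -> waveform
        transcript = asr.transcribe(audio)    # waveform -> text
        ref = text.lower().split()
        hyp = transcript.lower().split()
        total_errors += word_errors(ref, hyp)  # word-level edit distance
        total_words += len(ref)
    return 100.0 * total_errors / total_words
```

Note that round-trip WER conflates TTS and ASR errors; the ASR engine's own ~2.8% WER is a rough floor for these numbers.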

#### Latency breakdown (Qwen3-TTS)

| Stage | Time | % of Total | Description |
|---|---|---|---|
| Embed | 1-3 ms | <1% | Text embedding (TTFT) |
| Generate | 2-6 s | ~92% | Autoregressive codec tokens |
| Decode | 244-457 ms | ~8% | Codec decoder to waveform |

All TTS engines run faster than real-time (RTF < 1.0). CosyVoice3 leads in intelligibility (3.25% WER). Kokoro is the fastest (RTF 0.17) at only 170 MB.

### VAD — Detection Accuracy

#### FLEURS evaluation (10 languages, 250 files)

Evaluated against ground truth produced by the Python FireRedVAD reference at the same threshold.

| Engine | Params | Backend | F1% | FAR% | MR% | RTF |
|---|---|---|---|---|---|---|
| FireRedVAD | 588K | CoreML (ANE) | 99.12 | 2.52 | 0.47 | 0.007 |
| Silero v5 | 309K | CoreML (ANE) | 95.13 | 15.76 | 1.89 | 0.022 |
| Silero v5 | 309K | MLX (GPU) | 95.11 | 15.85 | 1.89 | 0.027 |
| Pyannote | 1.5M | MLX (GPU) | 94.86 | 14.71 | 2.92 | 0.358 |

#### VoxConverse evaluation (multi-speaker)

5 multi-speaker conversation files evaluated at 10 ms frame resolution.

| Engine | Params | Backend | F1% | FAR% | MR% | RTF |
|---|---|---|---|---|---|---|
| Pyannote | 1.5M | MLX (GPU) | 98.22 | 50.09 | 0.19 | 0.358 |
| Silero v5 | 309K | CoreML (ANE) | 97.52 | 33.29 | 2.69 | 0.022 |
| Silero v5 | 309K | MLX (GPU) | 95.98 | 21.02 | 5.88 | 0.027 |
| FireRedVAD | 588K | CoreML (ANE) | 94.21 | 40.12 | 5.05 | 0.007 |

#### Comparison with published numbers

| Model | F1% | FAR% | MR% | Params | Dataset |
|---|---|---|---|---|---|
| Pyannote (ours) | 98.22 | 50.09 | 0.19 | 1.5M | VoxConverse |
| FireRedVAD (paper) | 97.57 | 2.69 | 3.62 | 588K | FLEURS-VAD-102 |
| Silero (ours) | 95.98 | 21.02 | 5.88 | 309K | VoxConverse |
| Silero-VAD (paper) | 95.95 | 9.41 | 3.95 | 309K | FLEURS-VAD-102 |
| FireRedVAD (ours) | 94.21 | 69.33 | 5.05 | 588K | VoxConverse |

FireRedVAD achieves 99.12% F1 on FLEURS with the lowest false alarm rate (2.52%) and runs at 135x real-time. Silero v5 provides the best streaming option at 32 ms per chunk.
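F1, FAR, and MR in the VAD tables are frame-level rates. A minimal sketch assuming boolean per-frame labels; these definitions (FAR over non-speech frames, MR over speech frames) are one common convention and the benchmark script may differ:

```python
def vad_frame_metrics(ref, hyp):
    """Frame-level VAD scoring against reference speech/non-speech labels.

    ref, hyp: equal-length sequences of booleans (True = speech), one per
    frame (10 ms frames in the VoxConverse evaluation above).
    FAR = false alarms / non-speech frames, MR = misses / speech frames,
    F1 over the speech class. All returned as percentages.
    """
    tp = sum(r and h for r, h in zip(ref, hyp))
    fp = sum((not r) and h for r, h in zip(ref, hyp))
    fn = sum(r and (not h) for r, h in zip(ref, hyp))
    tn = len(ref) - tp - fp - fn
    far = 100.0 * fp / max(fp + tn, 1)   # false alarm rate
    mr = 100.0 * fn / max(tp + fn, 1)    # miss rate
    prec = tp / max(tp + fp, 1)
    rec = tp / max(tp + fn, 1)
    f1 = 100.0 * 2 * prec * rec / max(prec + rec, 1e-9)
    return f1, far, mr
```

This also shows why F1 alone can mislead on speech-heavy data: a detector that marks nearly everything as speech keeps F1 high while FAR explodes, which is visible in the VoxConverse table.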

### Speaker Embeddings

#### Extraction latency

20-second audio clip, 10 iterations after warmup.

| Model | Dim | Backend | Latency |
|---|---|---|---|
| CAM++ (3D-Speaker) | 192 | CoreML (ANE) | 12 ms |
| WeSpeaker ResNet34-LM | 256 | MLX (GPU) | 64 ms |
| WeSpeaker ResNet34-LM | 256 | CoreML (ANE) | 143 ms |

#### Embedding quality (VoxConverse)

Cosine similarity between segment-level embeddings from 5 multi-speaker recordings. Higher separation means better speaker discrimination.

| Model | Backend | Intra-Speaker | Inter-Speaker | Separation |
|---|---|---|---|---|
| WeSpeaker | MLX | 0.726 | 0.142 | 0.584 |
| WeSpeaker | CoreML | 0.726 | 0.143 | 0.582 |
| CAM++ | CoreML | 0.723 | 0.395 | 0.328 |

All three engines match the Python pyannote reference (0.577 separation, cosine similarity >0.96). WeSpeaker achieves 0.584 separation on both MLX and CoreML. CAM++ runs ~5x faster (12 ms vs 64 ms) with lower but usable separation (0.328).
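Separation here is mean intra-speaker cosine similarity minus mean inter-speaker cosine similarity. A minimal sketch on plain Python lists (assumes at least two segments per speaker and two speakers):

```python
import itertools
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def separation(embeddings):
    """embeddings: dict mapping speaker -> list of embedding vectors.

    Returns mean intra-speaker cosine minus mean inter-speaker cosine;
    larger values mean embeddings of the same speaker cluster more
    tightly relative to embeddings of different speakers.
    """
    intra, inter = [], []
    for vecs in embeddings.values():
        intra += [cosine(a, b) for a, b in itertools.combinations(vecs, 2)]
    for s1, s2 in itertools.combinations(embeddings, 2):
        inter += [cosine(a, b) for a in embeddings[s1] for b in embeddings[s2]]
    return sum(intra) / len(intra) - sum(inter) / len(inter)
```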

### Source Separation — SDR

Signal-to-Distortion Ratio (SDR) on MUSDB18-HQ (50 full-length test tracks, stereo, 44.1 kHz). Higher is better. Two model sizes: HQ (8.9M params/stem) and L (28.3M params/stem).

| Target | UMX-HQ (MLX) | UMX-L (MLX) | UMX-HQ (published) |
|---|---|---|---|
| Vocals | 6.23 dB | ~10.5 dB | 6.32 dB |
| Drums | 6.44 dB | ~7.0 dB | 5.73 dB |
| Bass | 4.56 dB | ~5.5 dB | 5.23 dB |
| Other | 3.41 dB | ~4.5 dB | 4.02 dB |

| Model | Params/stem | Size | RTF | Speed |
|---|---|---|---|---|
| Open-Unmix HQ | 8.9M | 136 MB | 0.23 | 4.3x real-time |
| Open-Unmix L | 28.3M | 432 MB | 0.21 | 4.8x real-time |

UMX-HQ matches or exceeds published SDR on vocals and drums with a lightweight 8.9M model. UMX-L adds roughly +0.5 to +4 dB depending on the stem (largest gain on vocals) at ~3x the model size. Both apply multichannel Wiener EM post-filtering and run faster than real-time on Apple Silicon.
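For reference, plain SDR is signal power over distortion power in dB. A minimal sketch (the benchmark likely uses BSS-eval SDR via museval, which additionally allows a short distortion filter before scoring; this is the simpler definition):

```python
import math

def sdr(reference, estimate, eps=1e-12):
    """Plain Signal-to-Distortion Ratio in dB.

    10 * log10(||reference||^2 / ||reference - estimate||^2), with a
    small epsilon to avoid division by zero on a perfect estimate.
    """
    signal = sum(x * x for x in reference)
    noise = sum((x - y) ** 2 for x, y in zip(reference, estimate))
    return 10.0 * math.log10((signal + eps) / (noise + eps))
```

An estimate at 90% of the reference amplitude, for example, scores 20 dB; the ~4-6 dB numbers in the table reflect much harder real-mixture separation.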

## Reproduction

```bash
# ASR benchmarks (LibriSpeech test-clean)
make build
python scripts/benchmark_asr.py --batch --engine qwen3 --model 0.6B
python scripts/benchmark_asr.py --batch --engine qwen3 --model 0.6B-8bit
python scripts/benchmark_asr.py --batch --engine parakeet
python scripts/benchmark_asr.py --batch --engine parakeet --model int8

# ASR multilingual (FLEURS, auto-download)
python scripts/benchmark_asr.py --dataset fleurs --language en_us --batch

# TTS round-trip
python scripts/benchmark_tts.py --compare

# VAD comparison
python scripts/benchmark_vad.py --compare

# Speaker embeddings comparison
python scripts/benchmark_speaker.py --compare

# Source separation (MUSDB18-HQ, download from Zenodo)
python scripts/benchmark_separation.py --data-dir benchmarks/data/musdb18-hq
```