Benchmarks

RTF (real-time factor) below 1.0 means faster than real-time.
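
As a quick illustration of how the two speed columns in the tables relate, the sketch below computes RTF as processing time divided by audio duration, and the "x real-time" figure as its reciprocal. The function names are illustrative and are not part of the benchmark scripts.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """Speed relative to real time is the reciprocal of RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


# Example: 60 s of audio transcribed in 3 s of wall-clock time.
rtf = real_time_factor(3.0, 60.0)
print(f"RTF={rtf:.3f}, {speedup(rtf):.1f}x real-time")
```

An RTF of 0.05 therefore corresponds to 20x real-time; the table values follow the same relationship up to rounding.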

Apple Silicon (MLX + CoreML)

All benchmarks ran on an Apple M5 Pro (48 GB, macOS 25.5) with release builds and a compiled metallib. Each engine runs in a separate child process (asr-bench --isolated), so the Peak RSS column reflects the real per-engine cost rather than a sequential high-water mark.

ASR — Word Error Rate

Evaluated on LibriSpeech test-clean, first 200 utterances (~30 min of English read speech). WER computed via a Whisper-style normalizer + Levenshtein over whitespace tokens.
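
The WER procedure described above can be sketched as follows. The normalizer here is a deliberately simplified stand-in (lowercasing and punctuation stripping), not the Whisper normalizer itself, and pooling errors over the whole corpus is an assumption about how per-utterance counts are combined.

```python
import string


def normalize(text: str) -> list[str]:
    """Simplified stand-in for a Whisper-style normalizer (assumption):
    lowercase, strip ASCII punctuation, split on whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over whitespace tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(pairs: list[tuple[str, str]]) -> float:
    """Corpus-level WER in percent: total edits / total reference words."""
    errors = words = 0
    for ref_text, hyp_text in pairs:
        ref, hyp = normalize(ref_text), normalize(hyp_text)
        errors += edit_distance(ref, hyp)
        words += len(ref)
    if words == 0:
        raise ValueError("no reference words to score against")
    return 100.0 * errors / words


# Each pair is (reference transcript, engine output).
print(wer([("The cat sat on the mat.", "the cat sit on mat")]))
```

In the example, one substitution and one deletion against six reference words give a WER of about 33%.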

| Engine | Quant | WER% | RTF | xRT | Peak RSS |
|---|---|---|---|---|---|
| Qwen3-ASR 1.7B MLX | 8-bit | 1.52 | 0.033 | 30.5× | 2.7 GB |
| WhisperKit Large-v3 Turbo | FP16 | 1.71 | 0.084 | 11.9× | 0.4 GB |
| Qwen3-ASR 0.6B MLX | 8-bit | 1.82 | 0.015 | 66.0× | 1.3 GB |
| Qwen3-ASR 0.6B MLX | 4-bit | 2.20 | 0.012 | 85.6× | 1.0 GB |
| Parakeet TDT v3 | INT8 | 2.37 | 0.009 | 117.4× | 0.9 GB |
| Qwen3-ASR 0.6B CoreML | INT8 | 3.02 | 0.098 | 10.2× | 1.4 GB |
| Omnilingual CTC 300M MLX | 4-bit | 4.26 | 0.005 | 222.1× | 0.4 GB |
| Omnilingual CTC 300M CoreML | INT8 | 5.67 | 0.128 | 7.8× | 0.5 GB |
| Nemotron Streaming | INT8 | 12.26 | 0.107 | 9.3× | 1.5 GB |

Headline picks: Qwen3-ASR 1.7B MLX 8-bit beats WhisperKit Large-v3 Turbo on WER (1.52% vs 1.71%) and runs 2.6× faster, at roughly 7× the peak memory (2.7 GB vs 0.4 GB). Parakeet TDT v3 is the fastest of the engines aimed at English and European languages (117× real-time, 25 European languages). Omnilingual CTC 300M MLX 4-bit is the multilingual throughput leader: 222× real-time, 384 MB peak, 1 672 languages.

The Qwen3-ASR 0.6B CoreML row reflects the rebuilt chunked block-attention encoder (aufklarer/Qwen3-ASR-CoreML) — the previous export ran unmasked global self-attention over zero-padded mel and emitted <|im_end|> right after the first sentence-final period (24.88% WER on the same fixture before the rebuild).

Long-form stability (sustained Neural Engine load)

200 LibriSpeech utterances processed sequentially (~30 min audio, M5 Pro). Tests whether WER or latency degrade under sustained transcription.

| Metric | First 25% | Last 25% | Overall |
|---|---|---|---|
| WER% | 1.30 | 1.23 | 2.43 |
| RTF | 0.672 | 0.400 | 0.539 |

No degradation detected. WER is stable across the session. RTF actually improves as CoreML warms up its execution plan cache. No thermal throttling after 42 minutes of continuous Neural Engine inference. Parakeet processes each chunk independently — no cross-chunk state accumulation.

Multilingual results (FLEURS)

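CER is used for Chinese and Japanese, which have no whitespace word boundaries; Korean and the remaining languages use WER. Parakeet supports ~25 European languages, so it has no results for the CJK languages, Hindi, or Arabic.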

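| Language | Metric | Qwen3 4-bit | Qwen3 8-bit | Parakeet INT8 |
|---|---|---|---|---|
| Spanish | WER | 6.44 | 5.06 | 5.18 |
| English | WER | 6.57 | 5.64 | 9.30 |
| Chinese | CER | 8.41 | 7.71 | - |
| German | WER | 9.45 | 6.81 | 12.33 |
| French | WER | 11.42 | 8.50 | 13.02 |
| Japanese | CER | 16.11 | 8.64 | - |
| Russian | WER | 16.35 | 10.52 | 11.49 |
| Korean | WER | 19.95 | 6.89 | - |
| Hindi | WER | 25.93 | 18.57 | - |
| Arabic | WER | 33.47 | 20.31 | - |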

Compression delta

Accuracy loss from quantizing to lower bit widths.

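| Variant | WER% | Substitutions | Insertions | Deletions | Total Errors | Size |
|---|---|---|---|---|---|---|
| Qwen3 0.6B 8-bit | 2.80 | 1111 | 92 | 268 | 1471 | 960 MB |
| Qwen3 0.6B 4-bit | 3.34 | 1323 | 123 | 308 | 1754 | 675 MB |
| Delta | +0.54 | +212 | +31 | +40 | +283 | -30% |
| Parakeet TDT INT8 | 2.74 | 990 | 125 | 308 | 1423 | 634 MB |

Key takeaway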

Qwen3-ASR 1.7B 8-bit achieves 2.35% WER — surpassing Whisper Large v3 Turbo (2.5%) and Whisper Large v3 (2.7%) while running at 11x real-time on Apple Silicon.

TTS — Round-Trip Intelligibility

Synthesize text, then transcribe the audio back with Qwen3-ASR 0.6B and compute WER against the original text. Evaluated on 30 built-in English conversational sentences.

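| Engine | Model | Params | Size | WER% | RTF |
|---|---|---|---|---|---|
| CosyVoice3 | 0.5B 4-bit | 500M | ~1.9 GB | 3.25 | 0.59 |
| Qwen3-TTS | 1.7B 4-bit | 1.7B | ~2.3 GB | 3.47 | 0.79 |
| Qwen3-TTS | 1.7B 8-bit | 1.7B | ~3.5 GB | 3.66 | 0.85 |
| Kokoro-82M | CoreML | 82M | ~170 MB | 3.90 | 0.17 |
| Qwen3-TTS | 0.6B 8-bit | 600M | ~960 MB | 9.74 | 0.76 |
| Qwen3-TTS | 0.6B 4-bit | 600M | ~675 MB | 15.58 | 0.76 |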

Latency breakdown (Qwen3-TTS)

| Stage | Time | % of Total | Description |
|---|---|---|---|
| Embed | 1-3 ms | <1% | Text embedding (TTFT) |
| Generate | 2-6 s | ~92% | Autoregressive codec tokens |
| Decode | 244-457 ms | ~8% | Codec decoder to waveform |

Key takeaway

All TTS engines run faster than real-time (RTF < 1.0). CosyVoice3 leads in intelligibility (3.25% WER). Kokoro is the fastest (RTF 0.17) at only 170 MB.

VAD — Detection Accuracy

FLEURS evaluation (10 languages, 250 files)

Evaluated against Python FireRedVAD reference ground truth at the same threshold.

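| Engine | Params | Backend | F1% | FAR% | MR% | RTF |
|---|---|---|---|---|---|---|
| FireRedVAD | 588K | CoreML (ANE) | 99.12 | 2.52 | 0.47 | 0.007 |
| Silero v5 | 309K | CoreML (ANE) | 95.13 | 15.76 | 1.89 | 0.022 |
| Silero v5 | 309K | MLX (GPU) | 95.11 | 15.85 | 1.89 | 0.027 |
| Pyannote | 1.5M | MLX (GPU) | 94.86 | 14.71 | 2.92 | 0.358 |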

VoxConverse evaluation (multi-speaker)

5 multi-speaker conversation files evaluated at 10 ms frame resolution.
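
For readers unfamiliar with the columns in the VAD tables, the sketch below shows one common way to compute frame-level F1, false alarm rate (FAR), and miss rate (MR) from binary speech labels. The definitions used here (FAR as false alarms over reference non-speech frames, MR as missed frames over reference speech frames) are assumptions; the benchmark script may define them differently.

```python
def vad_metrics(reference: list[int], predicted: list[int]) -> dict[str, float]:
    """Frame-level VAD metrics in percent; 1 = speech, 0 = non-speech."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same number of frames")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # speech detected
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false alarm
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # missed speech
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)  # silence kept silent

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    far = fp / (fp + tn) if fp + tn else 0.0  # share of non-speech flagged as speech
    mr = fn / (fn + tp) if fn + tp else 0.0   # share of speech that was missed
    return {"f1": 100 * f1, "far": 100 * far, "mr": 100 * mr}


# Eight frames (80 ms at 10 ms resolution): one miss and one false alarm.
print(vad_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0, 0]))
```

Under these definitions a recording with little non-speech can show a high FAR alongside a high F1, because FAR is normalized by the number of non-speech frames only.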

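| Engine | Params | Backend | F1% | FAR% | MR% | RTF |
|---|---|---|---|---|---|---|
| Pyannote | 1.5M | MLX (GPU) | 98.22 | 50.09 | 0.19 | 0.358 |
| Silero v5 | 309K | CoreML (ANE) | 97.52 | 33.29 | 2.69 | 0.022 |
| Silero v5 | 309K | MLX (GPU) | 95.98 | 21.02 | 5.88 | 0.027 |
| FireRedVAD | 588K | CoreML (ANE) | 94.21 | 40.12 | 5.05 | 0.007 |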

Comparison with published numbers

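| Model | F1% | FAR% | MR% | Params | Dataset |
|---|---|---|---|---|---|
| Pyannote (ours) | 98.22 | 50.09 | 0.19 | 1.5M | VoxConverse |
| FireRedVAD (paper) | 97.57 | 2.69 | 3.62 | 588K | FLEURS-VAD-102 |
| Silero (ours) | 95.98 | 21.02 | 5.88 | 309K | VoxConverse |
| Silero-VAD (paper) | 95.95 | 9.41 | 3.95 | 309K | FLEURS-VAD-102 |
| FireRedVAD (ours) | 94.21 | 69.33 | 5.05 | 588K | VoxConverse |

Key takeaway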

FireRedVAD achieves 99.12% F1 on FLEURS with the lowest false alarm rate (2.52%) and runs at 135x real-time. Silero v5 provides the best streaming option at 32 ms per chunk.

Wake-Word / Keyword Spotting

KWS Zipformer (gigaspeech fine-tune)

Streaming Zipformer2 transducer (3.49M params, Apache-2.0) with INT8 palettization on CoreML. Evaluated against 12 keywords on LibriSpeech test-clean (158 positive utterances, 60 negative). Thresholds tuned: acThreshold = 0.15, contextScore = 0.5, numTrailingBlanks = 1.

| Metric | Value | Notes |
|---|---|---|
| RTF (CPU + Neural Engine) | 0.04 | 26× real-time on M-series |
| Recall (12 keywords) | 88% | LibriSpeech test-clean, 158 positive utterances |
| False positives / utterance | 0.27 | 60 negative utterances |
| CoreML INT8 vs PyTorch FP32 | 99% | Emission agreement |
| Compiled model size | ~4 MB | encoder 3.3 MB + decoder 525 KB + joiner 160 KB |
| Runtime memory | ~6 MB | Weights + encoder state caches |

Tuned defaults improved recall from 62% to 88% (and cut FP/utt from 0.43 to 0.27) vs. the upstream icefall defaults (acThreshold = 0.25, contextScore = 2.0). See the wake-word guide for keyword-file format and per-phrase threshold tuning.

Speaker Embeddings

Extraction latency

20-second audio clip, 10 iterations after warmup.

| Model | Dim | Backend | Latency |
|---|---|---|---|
| CAM++ (3D-Speaker) | 192 | CoreML (ANE) | 12 ms |
| WeSpeaker ResNet34-LM | 256 | MLX (GPU) | 64 ms |
| WeSpeaker ResNet34-LM | 256 | CoreML (ANE) | 143 ms |

Embedding quality (VoxConverse)

Cosine similarity between segment-level embeddings from 5 multi-speaker recordings. Higher separation = better speaker discrimination.
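
The Separation column in the table that follows is consistent with the mean intra-speaker similarity minus the mean inter-speaker similarity (for example, 0.726 - 0.142 = 0.584). A minimal sketch of that computation, assuming each segment embedding is a plain list of floats with one speaker label per segment; the helper names are illustrative and not taken from the benchmark script:

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def separation(embeddings: list[list[float]],
               labels: list[str]) -> tuple[float, float, float]:
    """Return (intra-speaker mean, inter-speaker mean, separation)."""
    intra, inter = [], []
    # Compare every unordered pair of segments exactly once.
    for (ea, la), (eb, lb) in combinations(zip(embeddings, labels), 2):
        (intra if la == lb else inter).append(cosine(ea, eb))
    if not intra or not inter:
        raise ValueError("need same-speaker and cross-speaker pairs")
    intra_mean = sum(intra) / len(intra)
    inter_mean = sum(inter) / len(inter)
    return intra_mean, inter_mean, intra_mean - inter_mean


# Toy example: two speakers with perfectly orthogonal embeddings.
print(separation([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
                 ["a", "a", "b", "b"]))
```

A larger gap means same-speaker segments cluster tightly while different speakers stay apart, which is what a downstream clustering or diarization step relies on.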

| Model | Backend | Intra-Speaker | Inter-Speaker | Separation |
|---|---|---|---|---|
| WeSpeaker | MLX | 0.726 | 0.142 | 0.584 |
| WeSpeaker | CoreML | 0.726 | 0.143 | 0.582 |
| CAM++ | CoreML | 0.723 | 0.395 | 0.328 |

Key takeaway

All three engines match the Python pyannote reference (0.577 separation, cosine similarity >0.96). WeSpeaker achieves 0.584 separation on MLX and 0.582 on CoreML. CAM++ runs 5x faster than WeSpeaker on MLX (12 ms vs 64 ms) with good separation (0.328).

Source Separation — SDR

Signal-to-Distortion Ratio (SDR) on MUSDB18-HQ (50 full-length test tracks, stereo 44.1kHz). Higher is better. Two model sizes: HQ (8.9M params/stem) and L (28.3M params/stem).
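
As a reminder of what the metric measures, the sketch below computes a basic SDR in dB for one reference/estimate pair as 10·log10 of signal energy over error energy. This is the simple textbook definition, shown under the assumption of equal-length mono sample lists. It is not necessarily the exact evaluation behind the table, since published MUSDB18 scores are typically computed with BSSEval-style tooling over windows.

```python
import math


def sdr_db(reference: list[float], estimate: list[float]) -> float:
    """Basic SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same number of samples")
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal_energy == 0:
        raise ValueError("reference is silent; SDR is undefined")
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


# An estimate at half the reference amplitude leaves an error with a
# quarter of the signal energy, i.e. 10 * log10(4) dB.
print(f"{sdr_db([1.0, -1.0, 1.0, -1.0], [0.5, -0.5, 0.5, -0.5]):.2f} dB")
```

Because the scale is logarithmic, each additional 3 dB roughly halves the residual error energy relative to the target stem.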

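| Target | UMX-HQ (MLX) | UMX-L (MLX) | UMX-HQ (published) |
|---|---|---|---|
| Vocals | 6.23 dB | ~10.5 dB | 6.32 dB |
| Drums | 6.44 dB | ~7.0 dB | 5.73 dB |
| Bass | 4.56 dB | ~5.5 dB | 5.23 dB |
| Other | 3.41 dB | ~4.5 dB | 4.02 dB |

| Model | Params/stem | Size | RTF | Speed |
|---|---|---|---|---|
| Open-Unmix HQ | 8.9M | 136 MB | 0.23 | 4.3x real-time |
| Open-Unmix L | 28.3M | 432 MB | 0.21 | 4.8x real-time |

Key takeaway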

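UMX-HQ comes within 0.1 dB of the published SDR on vocals and exceeds it on drums with a lightweight 8.9M-parameter model, while trailing by 0.6 to 0.7 dB on bass and other. UMX-L improves on UMX-HQ by roughly 0.5 to 4 dB depending on the stem, with the largest gain on vocals, at 3x the model size. Both include multichannel Wiener EM post-filtering and run faster than real-time on Apple Silicon.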

Reproduction

# ASR benchmarks (LibriSpeech test-clean)
make build
python scripts/benchmark_asr.py --batch --engine qwen3 --model 0.6B
python scripts/benchmark_asr.py --batch --engine qwen3 --model 0.6B-8bit
python scripts/benchmark_asr.py --batch --engine parakeet
python scripts/benchmark_asr.py --batch --engine parakeet --model int8

# ASR multilingual (FLEURS, auto-download)
python scripts/benchmark_asr.py --dataset fleurs --language en_us --batch

# TTS round-trip
python scripts/benchmark_tts.py --compare

# VAD comparison
python scripts/benchmark_vad.py --compare

# Speaker embeddings comparison
python scripts/benchmark_speaker.py --compare

# Source separation (MUSDB18-HQ, download from Zenodo)
python scripts/benchmark_separation.py --data-dir benchmarks/data/musdb18-hq