Benchmarks

RTF (real-time factor) below 1.0 means faster than real-time.
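
Concretely, RTF is wall-clock processing time divided by audio duration; a minimal sketch:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock processing time / audio duration.
    Values below 1.0 mean the engine outpaces playback."""
    return processing_seconds / audio_seconds

# Transcribing 60 s of audio in 5.4 s of compute:
print(real_time_factor(5.4, 60.0))  # ≈ 0.09, i.e. ~11x faster than real-time
```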

Apple Silicon (MLX + CoreML)

All benchmarks on M2 Max, 64 GB, macOS 14 with release builds and compiled metallib.

ASR — Word Error Rate

Evaluated on LibriSpeech test-clean (2620 utterances, ~5.4 hours of English read speech).

| Model | Bits | Size | WER% | RTF |
|---|---|---|---|---|
| Qwen3-ASR 1.7B | 8-bit | 2.3 GB | 2.35 | 0.090 |
| Qwen3-ASR 1.7B | 4-bit | 1.2 GB | 2.57 | 0.045 |
| Parakeet TDT 0.6B | INT8 | 634 MB | 2.74 | 0.089 |
| Qwen3-ASR 0.6B | 8-bit | 960 MB | 2.80 | 0.025 |
| Qwen3-ASR 0.6B | 4-bit | 675 MB | 3.34 | 0.023 |

Comparison with published models

| Model | Params | Size | Precision | WER% | Source |
|---|---|---|---|---|---|
| Qwen3-ASR 1.7B | 1.7B | 2.3 GB | 8-bit | 2.35 | This benchmark |
| Whisper Large v3 Turbo | 809M | 1.6 GB | FP16 | 2.5 | OpenAI (2024) |
| Qwen3-ASR 1.7B | 1.7B | 1.2 GB | 4-bit | 2.57 | This benchmark |
| Whisper Large v3 | 1.5B | 3.1 GB | FP16 | 2.7 | OpenAI (2023) |
| Parakeet TDT 0.6B | 600M | 634 MB | INT8 | 2.74 | This benchmark |
| Qwen3-ASR 0.6B | 600M | 960 MB | 8-bit | 2.80 | This benchmark |
| Whisper Medium | 769M | 1.5 GB | FP16 | 3.0 | OpenAI (2022) |
| Qwen3-ASR 0.6B | 600M | 675 MB | 4-bit | 3.34 | This benchmark |
| Whisper Small | 244M | 483 MB | FP16 | 3.4 | OpenAI (2022) |

Long-form stability (sustained Neural Engine load)

200 LibriSpeech utterances processed sequentially (~30 min of audio, M2 Max). Tests whether WER or latency degrades under sustained transcription.

| Metric | First 25% | Last 25% | Overall |
|---|---|---|---|
| WER% | 1.30 | 1.23 | 2.43 |
| RTF | 0.672 | 0.400 | 0.539 |

No degradation detected. WER is stable across the session. RTF actually improves as CoreML warms up its execution plan cache. No thermal throttling after 42 minutes of continuous Neural Engine inference. Parakeet processes each chunk independently — no cross-chunk state accumulation.

Multilingual results (FLEURS)

CER is used for Chinese and Japanese (no word boundaries); the other languages use WER. Parakeet supports ~25 European languages, so rows outside that set have no Parakeet result.

| Language | Metric | Qwen3 4-bit | Qwen3 8-bit | Parakeet INT8 |
|---|---|---|---|---|
| Spanish | WER | 6.44 | 5.06 | 5.18 |
| English | WER | 6.57 | 5.64 | 9.30 |
| Chinese | CER | 8.41 | 7.71 | — |
| German | WER | 9.45 | 6.81 | 12.33 |
| French | WER | 11.42 | 8.50 | 13.02 |
| Japanese | CER | 16.11 | 8.64 | — |
| Russian | WER | 16.35 | 10.52 | 11.49 |
| Korean | WER | 19.95 | 6.89 | — |
| Hindi | WER | 25.93 | 18.57 | — |
| Arabic | WER | 33.47 | 20.31 | — |

Compression delta

Accuracy loss from quantizing to lower bit widths.

| Variant | WER% | Substitutions | Insertions | Deletions | Total Errors | Size |
|---|---|---|---|---|---|---|
| Qwen3 0.6B 8-bit | 2.80 | 1111 | 92 | 268 | 1471 | 960 MB |
| Qwen3 0.6B 4-bit | 3.34 | 1323 | 123 | 308 | 1754 | 675 MB |
| Delta | +0.54 | +212 | +31 | +40 | +283 | -30% |
| Parakeet TDT INT8 | 2.74 | 990 | 125 | 308 | 1423 | 634 MB |

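
The error categories above relate to WER as (substitutions + insertions + deletions) / reference words, computed from a word-level Levenshtein alignment. A minimal sketch (total edit distance only, without splitting out the three categories):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words,
    via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = minimum edits turning r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                      # delete all i reference words
    for j in range(len(h) + 1):
        dp[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # ≈ 0.167 (1 sub / 6 words)
```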
Key takeaway

Qwen3-ASR 1.7B 8-bit achieves 2.35% WER — surpassing Whisper Large v3 Turbo (2.5%) and Whisper Large v3 (2.7%) while running at 11x real-time on Apple Silicon.

TTS — Round-Trip Intelligibility

Synthesize text, then transcribe the audio back with Qwen3-ASR 0.6B and compute WER against the original text. Evaluated on 30 built-in English conversational sentences.

| Engine | Model | Params | Size | WER% | RTF |
|---|---|---|---|---|---|
| CosyVoice3 | 0.5B 4-bit | 500M | ~1.9 GB | 3.25 | 0.59 |
| Qwen3-TTS | 1.7B 4-bit | 1.7B | ~2.3 GB | 3.47 | 0.79 |
| Qwen3-TTS | 1.7B 8-bit | 1.7B | ~3.5 GB | 3.66 | 0.85 |
| Kokoro-82M | CoreML | 82M | ~170 MB | 3.90 | 0.17 |
| Qwen3-TTS | 0.6B 8-bit | 600M | ~960 MB | 9.74 | 0.76 |
| Qwen3-TTS | 0.6B 4-bit | 600M | ~675 MB | 15.58 | 0.76 |
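
The round-trip protocol itself is engine-agnostic; a sketch in which `synthesize`, `transcribe`, and `word_error_rate` are hypothetical stand-ins for the TTS engine under test, the Qwen3-ASR 0.6B judge, and a WER scorer:

```python
def round_trip_wer(sentences, synthesize, transcribe, word_error_rate):
    """Round-trip intelligibility: synthesize each sentence, transcribe the
    audio back, and average the WER of each transcript against its source text.
    The three callables are placeholders for the TTS engine, the ASR judge,
    and the scoring function."""
    scores = []
    for text in sentences:
        audio = synthesize(text)          # TTS engine under test
        transcript = transcribe(audio)    # ASR judge (e.g. Qwen3-ASR 0.6B)
        scores.append(word_error_rate(text, transcript))
    return sum(scores) / len(scores)
```

One caveat of this metric: it measures the TTS and ASR engines jointly, so the judge's own error rate sets a floor on the reported WER.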

Latency breakdown (Qwen3-TTS)

| Stage | Time | % of Total | Description |
|---|---|---|---|
| Embed | 1-3 ms | <1% | Text embedding (TTFT) |
| Generate | 2-6 s | ~92% | Autoregressive codec tokens |
| Decode | 244-457 ms | ~8% | Codec decoder to waveform |

Key takeaway

All TTS engines run faster than real-time (RTF < 1.0). CosyVoice3 leads in intelligibility (3.25% WER). Kokoro is the fastest (RTF 0.17) at only 170 MB.

VAD — Detection Accuracy

FLEURS evaluation (10 languages, 250 files)

Ground truth comes from the Python FireRedVAD reference implementation, evaluated at the same threshold.

| Engine | Params | Backend | F1% | FAR% | MR% | RTF |
|---|---|---|---|---|---|---|
| FireRedVAD | 588K | CoreML (ANE) | 99.12 | 2.52 | 0.47 | 0.007 |
| Silero v5 | 309K | CoreML (ANE) | 95.13 | 15.76 | 1.89 | 0.022 |
| Silero v5 | 309K | MLX (GPU) | 95.11 | 15.85 | 1.89 | 0.027 |
| Pyannote | 1.5M | MLX (GPU) | 94.86 | 14.71 | 2.92 | 0.358 |
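
F1, FAR (false alarm rate), and MR (miss rate) can be computed from frame-level speech labels. A sketch, assuming FAR is normalized by non-speech frames and MR by speech frames (the benchmark's exact normalization may differ):

```python
def vad_metrics(ref, hyp):
    """Frame-level VAD scoring; ref/hyp are equal-length sequences of 0/1
    speech labels. Returns (f1, far, mr) with F1 over the speech class,
    FAR = false alarms / non-speech frames, MR = misses / speech frames."""
    tp = sum(1 for r, h in zip(ref, hyp) if r and h)          # speech hit
    fp = sum(1 for r, h in zip(ref, hyp) if not r and h)      # false alarm
    fn = sum(1 for r, h in zip(ref, hyp) if r and not h)      # miss
    tn = sum(1 for r, h in zip(ref, hyp) if not r and not h)  # silence hit
    f1 = 2 * tp / (2 * tp + fp + fn)
    far = fp / (fp + tn)
    mr = fn / (fn + tp)
    return f1, far, mr
```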

VoxConverse evaluation (multi-speaker)

5 multi-speaker conversation files evaluated at 10 ms frame resolution.

| Engine | Params | Backend | F1% | FAR% | MR% | RTF |
|---|---|---|---|---|---|---|
| Pyannote | 1.5M | MLX (GPU) | 98.22 | 50.09 | 0.19 | 0.358 |
| Silero v5 | 309K | CoreML (ANE) | 97.52 | 33.29 | 2.69 | 0.022 |
| Silero v5 | 309K | MLX (GPU) | 95.98 | 21.02 | 5.88 | 0.027 |
| FireRedVAD | 588K | CoreML (ANE) | 94.21 | 40.12 | 5.05 | 0.007 |

Comparison with published numbers

| Model | F1% | FAR% | MR% | Params | Dataset |
|---|---|---|---|---|---|
| Pyannote (ours) | 98.22 | 50.09 | 0.19 | 1.5M | VoxConverse |
| FireRedVAD (paper) | 97.57 | 2.69 | 3.62 | 588K | FLEURS-VAD-102 |
| Silero (ours) | 95.98 | 21.02 | 5.88 | 309K | VoxConverse |
| Silero-VAD (paper) | 95.95 | 9.41 | 3.95 | 309K | FLEURS-VAD-102 |
| FireRedVAD (ours) | 94.21 | 69.33 | 5.05 | 588K | VoxConverse |

Key takeaway

FireRedVAD achieves 99.12% F1 on FLEURS with the lowest false alarm rate (2.52%) and runs at 135x real-time. Silero v5 provides the best streaming option at 32 ms per chunk.

Speaker Embeddings

Extraction latency

20-second audio clip, 10 iterations after warmup.

| Model | Dim | Backend | Latency |
|---|---|---|---|
| CAM++ (3D-Speaker) | 192 | CoreML (ANE) | 12 ms |
| WeSpeaker ResNet34-LM | 256 | MLX (GPU) | 64 ms |
| WeSpeaker ResNet34-LM | 256 | CoreML (ANE) | 143 ms |

Embedding quality (VoxConverse)

Cosine similarity between segment-level embeddings from 5 multi-speaker recordings. Higher separation = better speaker discrimination.

| Model | Backend | Intra-Speaker | Inter-Speaker | Separation |
|---|---|---|---|---|
| WeSpeaker | MLX | 0.726 | 0.142 | 0.584 |
| WeSpeaker | CoreML | 0.726 | 0.143 | 0.582 |
| CAM++ | CoreML | 0.723 | 0.395 | 0.328 |

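
The separation column is the mean intra-speaker cosine similarity minus the mean inter-speaker similarity. A sketch of that computation (the benchmark's exact segment pooling may differ):

```python
import itertools
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def separation(embeddings_by_speaker):
    """Mean intra-speaker cosine similarity minus mean inter-speaker similarity.
    `embeddings_by_speaker` maps speaker id -> list of segment embeddings.
    Higher values mean better speaker discrimination."""
    intra, inter = [], []
    speakers = list(embeddings_by_speaker)
    for spk in speakers:                              # same-speaker pairs
        for a, b in itertools.combinations(embeddings_by_speaker[spk], 2):
            intra.append(cosine(a, b))
    for s1, s2 in itertools.combinations(speakers, 2):  # cross-speaker pairs
        for a in embeddings_by_speaker[s1]:
            for b in embeddings_by_speaker[s2]:
                inter.append(cosine(a, b))
    return sum(intra) / len(intra) - sum(inter) / len(inter)
```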
Key takeaway

All three engines match the Python pyannote reference (0.577 separation, cosine similarity >0.96). WeSpeaker achieves 0.584 separation on both MLX and CoreML. CAM++ runs about 5x faster (12 ms vs 64 ms) with lower but still usable separation (0.328).

Source Separation — SDR

Signal-to-Distortion Ratio (SDR) on MUSDB18-HQ (50 full-length test tracks, stereo 44.1kHz). Higher is better. Two model sizes: HQ (8.9M params/stem) and L (28.3M params/stem).

| Target | UMX-HQ (MLX) | UMX-L (MLX) | UMX-HQ (published) |
|---|---|---|---|
| Vocals | 6.23 dB | ~10.5 dB | 6.32 dB |
| Drums | 6.44 dB | ~7.0 dB | 5.73 dB |
| Bass | 4.56 dB | ~5.5 dB | 5.23 dB |
| Other | 3.41 dB | ~4.5 dB | 4.02 dB |

| Model | Params/stem | Size | RTF | Speed |
|---|---|---|---|---|
| Open-Unmix HQ | 8.9M | 136 MB | 0.23 | 4.3x real-time |
| Open-Unmix L | 28.3M | 432 MB | 0.21 | 4.8x real-time |

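
SDR follows the standard definition: 10·log10 of reference signal energy over residual energy. A plain sketch (BSSEval-style evaluation on MUSDB18 typically adds permutation and distortion-filter allowances this omits):

```python
import math

def sdr_db(reference, estimate):
    """Signal-to-Distortion Ratio in dB:
    10 * log10(||s||^2 / ||s - s_hat||^2). Higher is better."""
    signal = sum(s * s for s in reference)
    noise = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10 * math.log10(signal / noise)

# An estimate attenuated to 90% of the reference leaves 1% residual energy:
print(sdr_db([1.0, 0.0, 1.0, 0.0], [0.9, 0.0, 0.9, 0.0]))  # ≈ 20 dB
```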
Key takeaway

UMX-HQ matches published SDR on vocals and drums with a lightweight 8.9M-parameter model. UMX-L improves vocals by roughly 4 dB and the remaining stems by about 0.5–1 dB at roughly 3x the model size. Both include multichannel Wiener EM post-filtering and run faster than real-time on Apple Silicon.

Reproduction

# ASR benchmarks (LibriSpeech test-clean)
make build
python scripts/benchmark_asr.py --batch --engine qwen3 --model 0.6B
python scripts/benchmark_asr.py --batch --engine qwen3 --model 0.6B-8bit
python scripts/benchmark_asr.py --batch --engine parakeet
python scripts/benchmark_asr.py --batch --engine parakeet --model int8

# ASR multilingual (FLEURS, auto-download)
python scripts/benchmark_asr.py --dataset fleurs --language en_us --batch

# TTS round-trip
python scripts/benchmark_tts.py --compare

# VAD comparison
python scripts/benchmark_vad.py --compare

# Speaker embeddings comparison
python scripts/benchmark_speaker.py --compare

# Source separation (MUSDB18-HQ, download from Zenodo)
python scripts/benchmark_separation.py --data-dir benchmarks/data/musdb18-hq