Diarized transcription.
Every speaker named.
From a meeting recording or a call file to a fully-attributed transcript — speech recognition, speaker diarization, and speaker identification stitched into one on-device pipeline. No cloud APIs, no per-minute pricing, no data leaving the device.
Four shapes of the same pipeline.
Each shape stitches an ASR engine + a diarizer + an optional speaker-ID enrolment store. The components are interchangeable; what you choose depends on the audio source and your latency budget.
"Alice said …" / "Bob said …" attribution from a single Zoom export.
Agent vs. caller turns, sentiment per speaker, on-device for compliance.
Host + guests identified across the episode, word-level timestamps.
Court-grade attribution with no audio ever leaving the device.
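However the pieces are arranged, the stitching step itself is the same: each ASR segment is assigned to the diarization turn it overlaps most in time. Below is a minimal sketch of that merge; the `Segment`/`Turn` structures and the `attribute` helper are illustrative, not this project's actual output schema.

```python
# Illustrative only: how ASR segments and diarization turns are typically stitched.
# Segment/Turn are hypothetical structures, not this project's real schema.
from dataclasses import dataclass

@dataclass
class Segment:          # one ASR segment
    start: float
    end: float
    text: str

@dataclass
class Turn:             # one diarization turn
    start: float
    end: float
    speaker: str        # e.g. "Speaker_1"

def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def attribute(segments: list[Segment], turns: list[Turn]) -> list[dict]:
    """Assign each ASR segment to the diarization turn it overlaps most."""
    out = []
    for seg in segments:
        best = max(turns, key=lambda t: overlap(seg.start, seg.end, t.start, t.end), default=None)
        matched = best is not None and overlap(seg.start, seg.end, best.start, best.end) > 0
        out.append({
            "start": seg.start,
            "end": seg.end,
            "speaker": best.speaker if matched else "unknown",
            "text": seg.text,
        })
    return out
```

Diarization turns rarely line up exactly with ASR segment boundaries, which is why an overlap vote is used rather than an exact match. The anonymous labels (`Speaker_1`, `Speaker_2`, …) come straight from the diarizer; speaker identification only renames them afterwards.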
Each stage runs on-device.
Defaults are tuned for accuracy on cleanly recorded audio. Swap in the alternatives when your input differs (noisy room, streaming pipeline, long-tail language, iOS Neural Engine constraints).
| Stage | Default | Alternatives | When |
|---|---|---|---|
| Pre-clean (optional) | DeepFilterNet3 | — | Noisy room, phone audio, far-field mics. |
| Speech recognition | Qwen3-ASR 0.6B | Qwen3-ASR 1.7B, Parakeet TDT, Omnilingual, Nemotron | 1.7B for lowest WER, Parakeet for iOS, Omnilingual for long-tail languages, Nemotron for streaming. |
| Speaker diarization | Pyannote | Sortformer (end-to-end) | Sortformer runs on the Neural Engine via CoreML. |
| Speaker identification | WeSpeaker ResNet34 | CAM++ | 256-d vs 192-d embeddings — cosine match against enrolled voices. |
| Word-level alignment | Forced Aligner | Built into Qwen3-ASR / Parakeet outputs | Use forced alignment for retroactive word timing. |
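For orientation, the default diarizer follows the upstream pyannote.audio pipeline API. The sketch below uses the publicly documented `pyannote/speaker-diarization-3.1` checkpoint, which may not be exactly what this project bundles.

```python
# Standard upstream pyannote.audio usage; the checkpoint name is the public
# "speaker-diarization-3.1" pipeline and may differ from what ships here.
from pyannote.audio import Pipeline

# Gated model: may require a Hugging Face access token.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = pipeline("meeting.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:7.2f} {turn.end:7.2f}  {speaker}")
```

Sortformer replaces this clustering-style pipeline with a single end-to-end model, which is what allows it to run on the Neural Engine via CoreML.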
From audio to attributed JSON.
The simplest path: ASR + diarization with anonymous speaker labels (`Speaker_1`, `Speaker_2`, …).
```bash
# 1. Basic: ASR + diarization (anonymous speakers)
speech transcribe meeting.wav --diarize --engine qwen3 -o meeting.json

# 2. Enrol known voices (one-time)
speech embed-speaker alice-sample.wav --engine wespeaker -o speakers/alice.npy
speech embed-speaker bob-sample.wav --engine wespeaker -o speakers/bob.npy

# 3. Transcribe + diarize + match enrolled speakers
speech transcribe meeting.wav --diarize \
  --enrolled-speakers speakers/ -o meeting.json
```
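Conceptually, `--enrolled-speakers` embeds each diarized turn and cosine-matches it against the saved `.npy` vectors, keeping the best hit above a threshold. A sketch of that lookup; the helper names and the 0.6 threshold are illustrative, not the CLI's actual defaults.

```python
# Illustrative cosine matching against enrolled .npy speaker embeddings.
# The 0.6 threshold and helper names are assumptions, not the project's defaults.
from pathlib import Path
import numpy as np

def load_enrolled(folder: str) -> dict[str, np.ndarray]:
    """Map enrolled speaker names (file stems) to their embedding vectors."""
    return {p.stem: np.load(p) for p in Path(folder).glob("*.npy")}

def match(embedding: np.ndarray, enrolled: dict[str, np.ndarray], threshold: float = 0.6) -> str:
    """Return the enrolled name with the highest cosine similarity above threshold."""
    best_name, best_score = "unknown", threshold
    for name, ref in enrolled.items():
        score = float(np.dot(embedding, ref) / (np.linalg.norm(embedding) * np.linalg.norm(ref)))
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# e.g. match(turn_embedding, load_enrolled("speakers/")) -> "alice" or "unknown"
```

In this sketch an unmatched turn falls back to "unknown"; the real pipeline presumably keeps the anonymous `Speaker_N` label instead.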
Pick the engine that matches your audio.
All ASR models work with the same diarization + speaker-ID pipeline, so swapping is a one-flag change.
| Need | Pick | Why | Backend |
|---|---|---|---|
| Best accuracy | Qwen3-ASR 1.7B 8-bit | Lowest WER across 52 languages | MLX |
| Real-time streaming | Nemotron Streaming | RNN-T, sub-100 ms per chunk, native punctuation | CoreML |
| iOS / Neural Engine | Parakeet TDT | FastConformer, 32× real-time, runs on the Apple Neural Engine | CoreML |
| Long-tail languages | Omnilingual ASR | 1,672 languages via Meta wav2vec2 + CTC | CoreML, MLX |
| Android / Linux | Parakeet TDT v3 | 114 languages, INT8, NNAPI | ONNX Runtime |
Component guides.
Architecture, CLI, Swift API, benchmarks.
Streaming + iOS Neural Engine.
1,672 languages, 300M–7B variants.
Pyannote pipeline vs. Sortformer end-to-end.
WeSpeaker / CAM++ for enrolment + ID.
Word-level timestamps via CTC, 80 ms.
DeepFilterNet3 at 48 kHz.
RNN-T with native punctuation.
