Use case · Pipeline

Diarized transcription.
Every speaker named.

From a meeting recording or a call file to a fully attributed transcript — speech recognition, speaker diarization, and speaker identification stitched into one on-device pipeline. No cloud APIs, no per-minute pricing, no data leaving the device.

What you can build

Four shapes of the same pipeline.

Each shape stitches together an ASR model, a diarizer, and an optional speaker-ID enrolment store. The components are interchangeable; which ones you choose depends on the audio source and your latency budget. The stitching step itself is sketched after the use cases below.

Meeting minutes

"Alice said …" / "Bob said …" attribution from a single Zoom export.

Call-center analytics

Agent vs. caller turns, sentiment per speaker, on-device for compliance.

Podcast transcripts

Host + guests identified across the episode, word-level timestamps.

Legal / interview records

Court-grade attribution with no audio ever leaving the device.
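
Whatever the shape, the stitch itself is the same join: assign each transcribed segment to the diarizer turn it overlaps most in time. A minimal, self-contained Python sketch with toy data (real ASR and diarizer output carries more fields; the function and variable names here are illustrative, not the product's API):

```python
# Stitch ASR segments to diarizer turns by maximum temporal overlap.
# Toy data stands in for real ASR / diarizer output.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def attribute(segments, turns):
    """Label each ASR segment with the speaker whose turn overlaps it most."""
    labelled = []
    for seg_start, seg_end, text in segments:
        best_turn = max(turns, key=lambda t: overlap(seg_start, seg_end, t[0], t[1]))
        labelled.append((seg_start, seg_end, best_turn[2], text))
    return labelled

segments = [(0.0, 2.1, "Good morning, everyone."),   # from the ASR stage
            (2.3, 4.0, "Thanks, let's get started.")]
turns = [(0.0, 2.2, "Speaker_1"),                     # from the diarizer
         (2.2, 5.0, "Speaker_2")]

for start, end, speaker, text in attribute(segments, turns):
    print(f"[{start:4.1f}-{end:4.1f}] {speaker}: {text}")
```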

Recommended stack

Each stage runs on-device.

Defaults are tuned for accuracy on cleanly recorded audio. Swap in the alternatives when your input differs (noisy room, streaming pipeline, long-tail language, iOS Neural Engine constraints).

| Stage | Default | Alternatives | When |
| --- | --- | --- | --- |
| Pre-clean (optional) | DeepFilterNet3 | | Noisy room, phone audio, far-field mics. |
| Speech recognition | Qwen3-ASR 0.6B | Qwen3-ASR 1.7B, Parakeet TDT, Omnilingual, Nemotron | 1.7B for lowest WER, Parakeet for iOS, Omnilingual for long-tail languages, Nemotron for streaming. |
| Speaker diarization | Pyannote | Sortformer (end-to-end) | Sortformer runs on the Neural Engine via CoreML. |
| Speaker identification | WeSpeaker ResNet34 | CAM++ | 256-d vs. 192-d embeddings; cosine match against enrolled voices. |
| Word-level alignment | Forced Aligner | Built into Qwen3-ASR / Parakeet outputs | Use forced alignment for retroactive word timing. |
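
The optional pre-clean stage can run as a standalone pass before transcription. A minimal sketch using DeepFilterNet's Python package (`pip install deepfilternet`); the filenames are illustrative, and which DeepFilterNet model `init_df()` loads depends on the installed version:

```python
# Denoise a recording with DeepFilterNet before handing it to the ASR stage.
# Model weights are downloaded on first run.
from df.enhance import enhance, init_df, load_audio, save_audio

model, df_state, _ = init_df()                              # load model + filter state
audio, _ = load_audio("meeting-raw.wav", sr=df_state.sr())  # resample to the model's rate
cleaned = enhance(model, df_state, audio)                   # run the denoiser
save_audio("meeting.wav", cleaned, df_state.sr())           # write the pre-cleaned file
```
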
Quickstart

From audio to attributed JSON.

The simplest path: ASR + diarization with anonymous speaker labels (`Speaker_1`, `Speaker_2`, …).

```bash
# 1. Basic: ASR + diarization (anonymous speakers)
speech transcribe meeting.wav --diarize --engine qwen3 -o meeting.json
```
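
The segment schema below is an assumption for illustration (a list of objects with `start`, `end`, `speaker`, and `text` fields), not the CLI's documented output format; a sketch of rendering it as a readable transcript:

```python
import json

# Illustrative only: assumes meeting.json holds segments shaped like
# {"start": 0.0, "end": 2.1, "speaker": "Speaker_1", "text": "..."}.
with open("meeting.json") as f:
    segments = json.load(f)

for seg in segments:
    print(f"{seg['speaker']} [{seg['start']:.1f}s]: {seg['text']}")
```
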
To turn anonymous labels into named speakers, enrol each known voice once with the speaker-embeddings model, then re-run with the enrolment directory:
```bash
# 2. Enrol known voices (one-time)
speech embed-speaker alice-sample.wav --engine wespeaker -o speakers/alice.npy
speech embed-speaker bob-sample.wav   --engine wespeaker -o speakers/bob.npy

# 3. Transcribe + diarize + match enrolled speakers
speech transcribe meeting.wav --diarize \
  --enrolled-speakers speakers/ -o meeting.json
```
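
Under the hood, naming a speaker is a cosine comparison between that speaker's embedding and each enrolled `.npy` vector. A minimal NumPy sketch of the match (the 0.5 threshold is an illustrative value, not the CLI's default):

```python
import glob
import os
import numpy as np

def best_match(query, enrolled_dir="speakers/", threshold=0.5):
    """Return the enrolled name whose embedding is most cosine-similar to
    `query` (e.g. 'alice' from speakers/alice.npy), or None below threshold."""
    best_name, best_score = None, threshold
    for path in glob.glob(os.path.join(enrolled_dir, "*.npy")):
        enrolled = np.load(path)
        score = np.dot(query, enrolled) / (np.linalg.norm(query) * np.linalg.norm(enrolled))
        if score > best_score:
            best_name, best_score = os.path.splitext(os.path.basename(path))[0], score
    return best_name

# `query` would be the 256-d WeSpeaker embedding of one diarized speaker.
```
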
Choosing an ASR engine

Pick the engine that matches your audio.

All ASR models work with the same diarization + speaker-ID pipeline, so swapping engines is a one-flag change (`--engine`).

| Need | Pick | Why | Backend |
| --- | --- | --- | --- |
| Best accuracy | Qwen3-ASR 1.7B 8-bit | Lowest WER across 52 languages | MLX |
| Real-time streaming | Nemotron Streaming | RNN-T, sub-100 ms per chunk, native punctuation | CoreML |
| iOS / Neural Engine | Parakeet TDT | FastConformer, 32× real-time, runs on the Apple Neural Engine | CoreML |
| Long-tail languages | Omnilingual ASR | 1,672 languages via Meta wav2vec2 + CTC | CoreML, MLX |
| Android / Linux | Parakeet TDT v3 | 114 languages, INT8, NNAPI | ONNX Runtime |
Deeper reading

Component guides.