Diarized transcription.
Every speaker named.
From a meeting recording or a call file to a fully-attributed transcript — speech recognition, speaker diarization, and speaker identification stitched into one on-device pipeline. No cloud APIs, no per-minute pricing, no data leaving the device.
Four shapes of the same pipeline.
Each shape stitches an ASR engine + a diarizer + an optional speaker-ID enrolment store. The components are interchangeable; what you choose depends on the audio source and your latency budget.
"Alice said …" / "Bob said …" attribution from a single Zoom export.
Agent vs. caller turns, sentiment per speaker, on-device for compliance.
Host + guests identified across the episode, word-level timestamps.
Court-grade attribution with no audio ever leaving the device.
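However the pieces are arranged, the stitching step itself is the same: each ASR segment is assigned to the diarization turn it overlaps most in time. Below is a minimal sketch of that merge; the `Segment`/`Turn` structures and the `attribute` helper are illustrative, not this project's actual output schema.

```python
# Illustrative only: how ASR segments and diarization turns are typically stitched.
# Segment/Turn are hypothetical structures, not this project's real schema.
from dataclasses import dataclass

@dataclass
class Segment:          # one ASR segment
    start: float
    end: float
    text: str

@dataclass
class Turn:             # one diarization turn
    start: float
    end: float
    speaker: str        # e.g. "Speaker_1"

def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def attribute(segments: list[Segment], turns: list[Turn]) -> list[dict]:
    """Assign each ASR segment to the diarization turn it overlaps most."""
    out = []
    for seg in segments:
        best = max(turns, key=lambda t: overlap(seg.start, seg.end, t.start, t.end), default=None)
        matched = best is not None and overlap(seg.start, seg.end, best.start, best.end) > 0
        out.append({
            "start": seg.start,
            "end": seg.end,
            "speaker": best.speaker if matched else "unknown",
            "text": seg.text,
        })
    return out
```

Diarization turns rarely line up exactly with ASR segment boundaries, which is why an overlap vote is used rather than an exact match. The anonymous labels (`Speaker_1`, `Speaker_2`, …) come straight from the diarizer; speaker identification only renames them afterwards.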
Each stage runs on-device.
Defaults are tuned for accuracy on cleanly recorded audio. Swap in the alternatives when your input differs (noisy room, streaming pipeline, long-tail language, iOS Neural Engine constraints).
| Stage | Default | Alternatives | When |
|---|---|---|---|
| Pre-clean (optional) | DeepFilterNet3 | — | Noisy room, phone audio, far-field mics. |
| Speech recognition | Qwen3-ASR 0.6B | Qwen3-ASR 1.7B, Parakeet TDT, Omnilingual, Nemotron | 1.7B for lowest WER, Parakeet for iOS, Omnilingual for long-tail languages, Nemotron for streaming. |
| Speaker diarization | Pyannote | Sortformer (end-to-end) | Sortformer runs on the Neural Engine via CoreML. |
| Speaker identification | WeSpeaker ResNet34 | CAM++ | 256-d vs 192-d embeddings — cosine match against enrolled voices. |
| Word-level alignment | Forced Aligner | Built into Qwen3-ASR / Parakeet outputs | Use forced alignment for retroactive word timing. |
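For orientation, the default diarizer follows the upstream pyannote.audio pipeline API. The sketch below uses the publicly documented `pyannote/speaker-diarization-3.1` checkpoint, which may not be exactly what this project bundles.

```python
# Standard upstream pyannote.audio usage; the checkpoint name is the public
# "speaker-diarization-3.1" pipeline and may differ from what ships here.
from pyannote.audio import Pipeline

# Gated model: may require a Hugging Face access token.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = pipeline("meeting.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:7.2f} {turn.end:7.2f}  {speaker}")
```

Sortformer replaces this clustering-style pipeline with a single end-to-end model, which is what allows it to run on the Neural Engine via CoreML.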
From audio to attributed JSON.
The simplest path: ASR + diarization with anonymous speaker labels (`Speaker_1`, `Speaker_2`, …).
```bash
# 1. Basic: ASR + diarization (anonymous speakers)
speech transcribe meeting.wav --diarize --engine qwen3 -o meeting.json

# 2. Enrol known voices (one-time)
speech embed-speaker alice-sample.wav --engine wespeaker -o speakers/alice.npy
speech embed-speaker bob-sample.wav --engine wespeaker -o speakers/bob.npy

# 3. Transcribe + diarize + match enrolled speakers
speech transcribe meeting.wav --diarize \
  --enrolled-speakers speakers/ -o meeting.json
```
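Conceptually, `--enrolled-speakers` embeds each diarized turn and cosine-matches it against the saved `.npy` vectors, keeping the best hit above a threshold. A sketch of that lookup; the helper names and the 0.6 threshold are illustrative, not the CLI's actual defaults.

```python
# Illustrative cosine matching against enrolled .npy speaker embeddings.
# The 0.6 threshold and helper names are assumptions, not the project's defaults.
from pathlib import Path
import numpy as np

def load_enrolled(folder: str) -> dict[str, np.ndarray]:
    """Map enrolled speaker names (file stems) to their embedding vectors."""
    return {p.stem: np.load(p) for p in Path(folder).glob("*.npy")}

def match(embedding: np.ndarray, enrolled: dict[str, np.ndarray], threshold: float = 0.6) -> str:
    """Return the enrolled name with the highest cosine similarity above threshold."""
    best_name, best_score = "unknown", threshold
    for name, ref in enrolled.items():
        score = float(np.dot(embedding, ref) / (np.linalg.norm(embedding) * np.linalg.norm(ref)))
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# e.g. match(turn_embedding, load_enrolled("speakers/")) -> "alice" or "unknown"
```

In this sketch an unmatched turn falls back to "unknown"; the real pipeline presumably keeps the anonymous `Speaker_N` label instead.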
Pick the engine that matches your audio.
All ASR models work with the same diarization + speaker-ID pipeline, so swapping is a one-flag change.
| Need | Pick | Why | Backend |
|---|---|---|---|
| Best accuracy | Qwen3-ASR 1.7B 8-bit | Lowest WER across 52 languages | MLX |
| Real-time streaming | Nemotron Streaming | RNN-T, sub-100 ms per chunk, native punctuation | CoreML |
| iOS / Neural Engine | Parakeet TDT | FastConformer, 32× real-time, runs on the Apple Neural Engine | CoreML |
| Long-tail languages | Omnilingual ASR | 1,672 languages via Meta wav2vec2 + CTC | CoreML, MLX |
| Android / Linux | Parakeet TDT v3 | 114 languages, INT8, NNAPI | ONNX Runtime |
Component guides.
Architecture, CLI, Swift API, benchmarks.
Streaming + iOS Neural Engine.
1,672 languages, 300M–7B variants.
Pyannote pipeline vs. Sortformer end-to-end.
WeSpeaker / CAM++ for enrolment + ID.
Word-level timestamps via CTC, 80 ms.
DeepFilterNet3 at 48 kHz.
RNN-T with native punctuation.
