Speaker Diarization
Identify who spoke when in a multi-speaker recording. Two diarization engines are available: a two-stage Pyannote pipeline (segmentation plus activity-based speaker chaining, followed by post-hoc embedding extraction) and an end-to-end Sortformer model (CoreML, Neural Engine).
Engines
Select the engine with --engine pyannote (default) or --engine sortformer.
Pyannote (default)
Two-stage pipeline: Pyannote segmentation processes overlapping windows with activity-based speaker chaining (Pearson correlation in overlap zones) to assign global speaker labels. Post-hoc WeSpeaker embedding extraction enables target speaker identification via enrollment audio.
Sortformer (CoreML)
NVIDIA's end-to-end neural diarization model. Directly predicts per-frame speaker activity for up to 4 speakers without separate embedding or clustering stages. Runs on Neural Engine via CoreML with streaming state buffers (FIFO + speaker cache).
Sortformer does not produce speaker embeddings. The --target-speaker and --embedding-engine flags are only available with the Pyannote engine.
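As a rough illustration of how per-frame speaker activity predictions become timed segments, here is a thresholding sketch in Swift. The frame duration, threshold, and all names are assumptions for illustration, not Sortformer's actual post-processing:

```swift
import Foundation

/// A diarized segment: one speaker, one time span.
struct Segment: Equatable {
    let speaker: Int
    let start: Double
    let end: Double
}

/// Convert per-frame speaker activity probabilities into timed segments
/// by thresholding each speaker's track independently.
func segmentsFromActivity(
    _ activity: [[Double]],        // [frame][speaker] probabilities
    frameDuration: Double = 0.08,  // assumed frame hop in seconds
    threshold: Double = 0.5
) -> [Segment] {
    guard let speakerCount = activity.first?.count else { return [] }
    var segments: [Segment] = []
    for spk in 0..<speakerCount {
        var start: Int? = nil
        for (i, frame) in activity.enumerated() {
            let active = frame[spk] >= threshold
            if active && start == nil { start = i }
            if !active, let s = start {
                segments.append(Segment(speaker: spk,
                                        start: Double(s) * frameDuration,
                                        end: Double(i) * frameDuration))
                start = nil
            }
        }
        if let s = start {  // segment still open at end of audio
            segments.append(Segment(speaker: spk,
                                    start: Double(s) * frameDuration,
                                    end: Double(activity.count) * frameDuration))
        }
    }
    return segments.sorted { $0.start < $1.start }
}
```

Because each speaker track is thresholded independently, overlapping speech naturally yields overlapping segments.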
Pyannote Pipeline
The default pipeline runs in two stages:
Stage 1: Segmentation + Speaker Chaining
Pyannote segmentation-3.0 processes 10-second sliding windows with 50% overlap. A powerset decoder converts the 7-class output into per-speaker probabilities (up to 3 local speakers per window). Adjacent windows share a 5-second overlap — speaker identity is propagated across windows by computing Pearson correlation between probability tracks in the overlap zone, with greedy exclusive matching for consistent global speaker IDs.
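The chaining step can be sketched as follows. This is an illustrative Swift reconstruction of Pearson correlation plus greedy exclusive matching, not the pipeline's actual implementation; all names are assumptions:

```swift
import Foundation

/// Pearson correlation between two equal-length probability tracks.
func pearson(_ x: [Double], _ y: [Double]) -> Double {
    let n = Double(x.count)
    let mx = x.reduce(0, +) / n
    let my = y.reduce(0, +) / n
    var num = 0.0, dx = 0.0, dy = 0.0
    for i in 0..<x.count {
        num += (x[i] - mx) * (y[i] - my)
        dx += (x[i] - mx) * (x[i] - mx)
        dy += (y[i] - my) * (y[i] - my)
    }
    let denom = (dx * dy).squareRoot()
    return denom > 0 ? num / denom : 0
}

/// Greedy exclusive matching: repeatedly pair the highest-correlation
/// (previous-window, current-window) speaker tracks in the overlap zone,
/// so each local speaker maps to at most one global speaker.
func matchSpeakers(previous: [[Double]], current: [[Double]]) -> [Int: Int] {
    var pairs: [(p: Int, c: Int, r: Double)] = []
    for (p, pt) in previous.enumerated() {
        for (c, ct) in current.enumerated() {
            pairs.append((p, c, pearson(pt, ct)))
        }
    }
    pairs.sort { $0.r > $1.r }
    var mapping: [Int: Int] = [:]  // current index -> previous index
    var usedP = Set<Int>(), usedC = Set<Int>()
    for pair in pairs where !usedP.contains(pair.p) && !usedC.contains(pair.c) {
        mapping[pair.c] = pair.p
        usedP.insert(pair.p)
        usedC.insert(pair.c)
    }
    return mapping
}
```

Unmatched current-window speakers (absent from the returned mapping) would receive fresh global IDs.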
Stage 2: Post-hoc Embedding
After diarization, WeSpeaker ResNet34-LM extracts a 256-dimensional centroid embedding per speaker. These embeddings enable target speaker extraction (--target-speaker) but do not drive the speaker assignment itself.
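A centroid embedding can be formed by averaging per-segment embeddings and L2-normalizing the result. This is a minimal sketch of that pooling idea; the pipeline's actual pooling details are internal:

```swift
import Foundation

/// Average a speaker's per-segment embeddings into one L2-normalized centroid.
func centroid(of embeddings: [[Double]]) -> [Double] {
    guard let dim = embeddings.first?.count else { return [] }
    var sum = [Double](repeating: 0, count: dim)
    for e in embeddings {
        for i in 0..<dim { sum[i] += e[i] }
    }
    // L2-normalize so downstream cosine similarity is well-behaved.
    let norm = sum.map { $0 * $0 }.reduce(0, +).squareRoot()
    return norm > 0 ? sum.map { $0 / norm } : sum
}
```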
CLI Usage
# Basic diarization (pyannote, default)
.build/release/audio diarize meeting.wav
# End-to-end Sortformer (CoreML)
.build/release/audio diarize meeting.wav --engine sortformer
# RTTM output format (for evaluation)
.build/release/audio diarize meeting.wav --rttm
# JSON output
.build/release/audio diarize meeting.wav --json
Target Speaker Extraction
Provide enrollment audio of a known speaker to extract only their segments from a recording. The pipeline embeds the enrollment audio and selects the diarized speaker whose centroid embedding has the highest cosine similarity to it.
# Extract segments for a specific speaker
.build/release/audio diarize meeting.wav --target-speaker enrollment.wav
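The selection step amounts to a cosine-similarity argmax over speaker centroids. A minimal sketch (function names are illustrative, not the library's API):

```swift
import Foundation

/// Cosine similarity between two embeddings of equal dimension.
func cosineSimilarity(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).map(*).reduce(0, +)
    let na = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let nb = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    return (na > 0 && nb > 0) ? dot / (na * nb) : 0
}

/// Pick the diarized speaker whose centroid best matches the enrollment
/// embedding; returns nil when no centroids are available.
func bestSpeaker(enrollment: [Double], centroids: [Int: [Double]]) -> Int? {
    centroids.max { cosineSimilarity(enrollment, $0.value) <
                    cosineSimilarity(enrollment, $1.value) }?.key
}
```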
DER Scoring
Evaluate diarization quality by scoring against a reference RTTM file. The pipeline computes the Diarization Error Rate (DER): the fraction of reference speech time that is misattributed, covering missed speech, false-alarm speech, and speaker confusion.
# Score against reference RTTM
.build/release/audio diarize meeting.wav --score-against reference.rttm
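The standard DER formula is (missed speech + false alarm + speaker confusion) / total reference speech. Here is a frame-level Swift sketch under deliberately simplified assumptions: one speaker per frame (0 = silence), labels already optimally mapped, and no scoring collar. Real scorers handle overlap and collars:

```swift
import Foundation

/// Simplified frame-level DER: reference and hypothesis are per-frame
/// speaker labels, where 0 means silence and nonzero is a speaker ID.
func diarizationErrorRate(reference: [Int], hypothesis: [Int]) -> Double {
    var miss = 0, falseAlarm = 0, confusion = 0, refSpeech = 0
    for (r, h) in zip(reference, hypothesis) {
        if r != 0 { refSpeech += 1 }
        switch (r != 0, h != 0) {
        case (true, false):  miss += 1        // speech marked as silence
        case (false, true):  falseAlarm += 1  // silence marked as speech
        case (true, true):   if r != h { confusion += 1 }
        case (false, false): break
        }
    }
    guard refSpeech > 0 else { return 0 }
    return Double(miss + falseAlarm + confusion) / Double(refSpeech)
}
```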
RTTM Output
The --rttm flag produces Rich Transcription Time Marked output, a standard format used for diarization evaluation. Each line follows the format:
SPEAKER filename 1 start_time duration <NA> <NA> speaker_id <NA> <NA>
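Producing such a line is simple string formatting. A sketch following the 10-field layout above (the function name is illustrative):

```swift
import Foundation

/// Format one RTTM SPEAKER line: type, file ID, channel, onset, duration,
/// then <NA> placeholders around the speaker label.
func rttmLine(file: String, start: Double, duration: Double,
              speaker: String) -> String {
    String(format: "SPEAKER %@ 1 %.3f %.3f <NA> <NA> %@ <NA> <NA>",
           file, start, duration, speaker)
}
```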
Options
| Option | Description |
|---|---|
| --target-speaker | Enrollment audio for target speaker extraction (pyannote only) |
| --embedding-engine | Speaker embedding engine: mlx or coreml (pyannote only) |
| --vad-filter | Pre-filter with Silero VAD (pyannote only) |
| --rttm | Output in RTTM format |
| --json | JSON output format |
| --score-against | Reference RTTM file for DER evaluation |
Diarization works best with recordings that have clear speaker turns. Highly overlapping speech may reduce accuracy. Speaker count is determined automatically.
Model Downloads
Models are downloaded automatically on first use:
| Component | Model | Size | HuggingFace |
|---|---|---|---|
| Segmentation | Pyannote-Segmentation-3.0 | ~5.7 MB | aufklarer/Pyannote-Segmentation-MLX |
| Speaker Embedding | WeSpeaker-ResNet34-LM (MLX) | ~25 MB | aufklarer/WeSpeaker-ResNet34-LM-MLX |
| Speaker Embedding | WeSpeaker-ResNet34-LM (CoreML) | ~25 MB | aufklarer/WeSpeaker-ResNet34-LM-CoreML |
| Sortformer | Sortformer Diarization (CoreML) | ~240 MB | aufklarer/Sortformer-Diarization-CoreML |
Swift API
import SpeechVAD
let pipeline = try await DiarizationPipeline.fromPretrained()
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
print("Speaker \(seg.speakerId): [\(seg.startTime)s - \(seg.endTime)s]")
}
// Target speaker extraction
let targetEmb = pipeline.embeddingModel.embed(audio: enrollmentAudio, sampleRate: 16000)
let segments = pipeline.extractSpeaker(
audio: meetingAudio, sampleRate: 16000,
targetEmbedding: targetEmb
)