Speaker Diarization

Identify who spoke when in a multi-speaker recording. Two diarization engines are available: a two-stage Pyannote pipeline (segmentation + activity-based speaker chaining, then post-hoc embedding) and an end-to-end Sortformer model (CoreML, Neural Engine).

Engines

Select the engine with --engine pyannote (default) or --engine sortformer.

Pyannote (default)

Two-stage pipeline: Pyannote segmentation processes overlapping windows with activity-based speaker chaining (Pearson correlation in overlap zones) to assign global speaker labels. Post-hoc WeSpeaker embedding extraction enables target speaker identification via enrollment audio.

Sortformer (CoreML)

NVIDIA's end-to-end neural diarization model. Directly predicts per-frame speaker activity for up to 4 speakers without separate embedding or clustering stages. Runs on Neural Engine via CoreML with streaming state buffers (FIFO + speaker cache).

Note

Sortformer does not produce speaker embeddings. The --target-speaker and --embedding-engine flags are only available with the Pyannote engine.

Pyannote Pipeline

The default pipeline runs in two stages:

Stage 1: Segmentation + Speaker Chaining

Pyannote segmentation-3.0 processes 10-second sliding windows with 50% overlap. A powerset decoder converts the 7-class output into per-speaker probabilities (up to 3 local speakers per window). Adjacent windows share a 5-second overlap — speaker identity is propagated across windows by computing Pearson correlation between probability tracks in the overlap zone, with greedy exclusive matching for consistent global speaker IDs.
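The chaining step described above can be sketched as follows. This is a minimal illustration, not the library's implementation; the function names (`pearson`, `matchSpeakers`) and the per-speaker probability-track representation are assumptions for the example.

```swift
import Foundation

// Pearson correlation between two speaker-activity tracks of equal length.
func pearson(_ x: [Double], _ y: [Double]) -> Double {
    let n = Double(x.count)
    let mx = x.reduce(0, +) / n, my = y.reduce(0, +) / n
    var num = 0.0, dx = 0.0, dy = 0.0
    for i in 0..<x.count {
        num += (x[i] - mx) * (y[i] - my)
        dx  += (x[i] - mx) * (x[i] - mx)
        dy  += (y[i] - my) * (y[i] - my)
    }
    let denom = (dx * dy).squareRoot()
    return denom > 0 ? num / denom : 0
}

// Greedy exclusive matching: pair each local speaker in the new window with
// the previous window's speaker whose overlap-zone track correlates best.
// Returns a mapping from new-window index to previous-window index.
func matchSpeakers(prev: [[Double]], next: [[Double]]) -> [Int: Int] {
    var pairs: [(score: Double, p: Int, q: Int)] = []
    for (p, a) in prev.enumerated() {
        for (q, b) in next.enumerated() {
            pairs.append((pearson(a, b), p, q))
        }
    }
    var mapping: [Int: Int] = [:]
    var usedPrev = Set<Int>(), usedNext = Set<Int>()
    for pair in pairs.sorted(by: { $0.score > $1.score }) {
        if !usedPrev.contains(pair.p), !usedNext.contains(pair.q) {
            mapping[pair.q] = pair.p
            usedPrev.insert(pair.p)
            usedNext.insert(pair.q)
        }
    }
    return mapping
}
```

Unmatched local speakers (a voice appearing for the first time) would receive fresh global IDs.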

Stage 2: Post-hoc Embedding

After diarization, WeSpeaker ResNet34-LM extracts a 256-dimensional centroid embedding per speaker. These embeddings enable target speaker extraction (--target-speaker) but do not drive the speaker assignment itself.

CLI Usage

# Basic diarization (pyannote, default)
.build/release/audio diarize meeting.wav

# End-to-end Sortformer (CoreML)
.build/release/audio diarize meeting.wav --engine sortformer

# RTTM output format (for evaluation)
.build/release/audio diarize meeting.wav --rttm

# JSON output
.build/release/audio diarize meeting.wav --json

Target Speaker Extraction

Provide enrollment audio of a known speaker to extract only their segments from a recording. The pipeline computes the speaker embedding of the enrollment audio and finds the cluster with the highest cosine similarity.

# Extract segments for a specific speaker
.build/release/audio diarize meeting.wav --target-speaker enrollment.wav
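The matching step amounts to a nearest-centroid search under cosine similarity. A minimal sketch (the helper names `cosine` and `bestSpeaker` are illustrative, not the library API):

```swift
import Foundation

// Cosine similarity between two speaker embeddings.
func cosine(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0, na: Float = 0, nb: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        na  += a[i] * a[i]
        nb  += b[i] * b[i]
    }
    return dot / (na.squareRoot() * nb.squareRoot())
}

// Pick the diarized speaker whose centroid embedding is most similar
// to the enrollment embedding.
func bestSpeaker(enrollment: [Float], centroids: [Int: [Float]]) -> Int? {
    centroids.max { cosine(enrollment, $0.value) < cosine(enrollment, $1.value) }?.key
}
```

Only segments attributed to the winning speaker ID are then emitted.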

DER Scoring

Evaluate diarization quality by scoring against a reference RTTM file. The pipeline computes the Diarization Error Rate (DER), which measures the proportion of time that is incorrectly attributed.

# Score against reference RTTM
.build/release/audio diarize meeting.wav --score-against reference.rttm
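DER is conventionally defined as the sum of missed speech, false alarm, and speaker confusion time over the total reference speech time. A sketch of that definition (real scorers additionally handle forgiveness collars and overlapped speech, which this omits):

```swift
// Diarization Error Rate from its three timed error components.
// All durations are in seconds.
func der(missed: Double, falseAlarm: Double, confusion: Double,
         totalReference: Double) -> Double {
    (missed + falseAlarm + confusion) / totalReference
}
```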

RTTM Output

The --rttm flag produces Rich Transcription Time Marked output, a standard format used for diarization evaluation. Each line follows the format:

SPEAKER filename 1 start_time duration <NA> <NA> speaker_id <NA> <NA>
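For example, a 3.2-second turn by speaker_0 starting 0.5 seconds into meeting.wav would be written as (values illustrative):

SPEAKER meeting 1 0.500 3.200 <NA> <NA> speaker_0 <NA> <NA>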

Options

Option               Description
--target-speaker     Enrollment audio for target speaker extraction (pyannote only)
--embedding-engine   Speaker embedding engine: mlx or coreml (pyannote only)
--vad-filter         Pre-filter with Silero VAD (pyannote only)
--rttm               Output in RTTM format
--json               JSON output format
--score-against      Reference RTTM file for DER evaluation

Important

Diarization works best with recordings that have clear speaker turns. Highly overlapping speech may reduce accuracy. Speaker count is determined automatically.

Model Downloads

Models are downloaded automatically on first use:

Component          Model                             Size      HuggingFace
Segmentation       Pyannote-Segmentation-3.0         ~5.7 MB   aufklarer/Pyannote-Segmentation-MLX
Speaker Embedding  WeSpeaker-ResNet34-LM (MLX)       ~25 MB    aufklarer/WeSpeaker-ResNet34-LM-MLX
Speaker Embedding  WeSpeaker-ResNet34-LM (CoreML)    ~25 MB    aufklarer/WeSpeaker-ResNet34-LM-CoreML
Sortformer         Sortformer Diarization (CoreML)   ~240 MB   aufklarer/Sortformer-Diarization-CoreML

Swift API

import SpeechVAD

let pipeline = try await DiarizationPipeline.fromPretrained()
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
    print("Speaker \(seg.speakerId): [\(seg.startTime)s - \(seg.endTime)s]")
}

// Target speaker extraction
let targetEmb = pipeline.embeddingModel.embed(audio: enrollmentAudio, sampleRate: 16000)
let segments = pipeline.extractSpeaker(
    audio: meetingAudio, sampleRate: 16000,
    targetEmbedding: targetEmb
)