Speaker Diarization

Identify who spoke when in a multi-speaker recording. Two diarization engines are available: a two-stage Pyannote pipeline (segmentation + activity-based speaker chaining, then post-hoc embedding) and an end-to-end Sortformer model (CoreML, Neural Engine).

Engines

Select the engine with --engine pyannote (default) or --engine sortformer.

Pyannote (default)

Two-stage pipeline: Pyannote segmentation processes overlapping windows with activity-based speaker chaining (Pearson correlation in overlap zones) to assign global speaker labels. Post-hoc WeSpeaker embedding extraction enables target speaker identification via enrollment audio.

Sortformer (CoreML)

NVIDIA's end-to-end neural diarization model. Directly predicts per-frame speaker activity for up to 4 speakers without separate embedding or clustering stages. Runs on Neural Engine via CoreML with streaming state buffers (FIFO + speaker cache).

Note

Sortformer does not produce speaker embeddings. The --target-speaker and --embedding-engine flags are only available with the Pyannote engine.

Pyannote Pipeline

The default pipeline runs in two stages:

Stage 1: Segmentation + Speaker Chaining

Pyannote segmentation-3.0 processes 10-second sliding windows with 50% overlap. A powerset decoder converts the 7-class output into per-speaker probabilities (up to 3 local speakers per window). Adjacent windows share a 5-second overlap — speaker identity is propagated across windows by computing Pearson correlation between probability tracks in the overlap zone, with greedy exclusive matching for consistent global speaker IDs.
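The chaining step described above can be sketched as follows. This is a minimal illustration, not the library's implementation; the function names (`pearson`, `matchSpeakers`) and the per-speaker probability-track representation are assumptions for the example.

```swift
import Foundation

// Pearson correlation between two speaker-activity tracks of equal length.
func pearson(_ x: [Double], _ y: [Double]) -> Double {
    let n = Double(x.count)
    let mx = x.reduce(0, +) / n, my = y.reduce(0, +) / n
    var num = 0.0, dx = 0.0, dy = 0.0
    for i in 0..<x.count {
        num += (x[i] - mx) * (y[i] - my)
        dx  += (x[i] - mx) * (x[i] - mx)
        dy  += (y[i] - my) * (y[i] - my)
    }
    let denom = (dx * dy).squareRoot()
    return denom > 0 ? num / denom : 0
}

// Greedy exclusive matching: pair each local speaker in the new window with
// the previous window's speaker whose overlap-zone track correlates best.
// Returns a mapping from new-window index to previous-window index.
func matchSpeakers(prev: [[Double]], next: [[Double]]) -> [Int: Int] {
    var pairs: [(score: Double, p: Int, q: Int)] = []
    for (p, a) in prev.enumerated() {
        for (q, b) in next.enumerated() {
            pairs.append((pearson(a, b), p, q))
        }
    }
    var mapping: [Int: Int] = [:]
    var usedPrev = Set<Int>(), usedNext = Set<Int>()
    for pair in pairs.sorted(by: { $0.score > $1.score }) {
        if !usedPrev.contains(pair.p), !usedNext.contains(pair.q) {
            mapping[pair.q] = pair.p
            usedPrev.insert(pair.p)
            usedNext.insert(pair.q)
        }
    }
    return mapping
}
```

Unmatched local speakers (a voice appearing for the first time) would receive fresh global IDs.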

Stage 2: Post-hoc Embedding

After diarization, WeSpeaker ResNet34-LM extracts a 256-dimensional centroid embedding per speaker. These embeddings enable target speaker extraction (--target-speaker) but do not drive the speaker assignment itself.

CLI Usage

# Basic diarization (pyannote, default)
.build/release/audio diarize meeting.wav

# End-to-end Sortformer (CoreML)
.build/release/audio diarize meeting.wav --engine sortformer

# RTTM output format (for evaluation)
.build/release/audio diarize meeting.wav --rttm

# JSON output
.build/release/audio diarize meeting.wav --json

Target Speaker Extraction

Provide enrollment audio of a known speaker to extract only their segments from a recording. The pipeline computes the speaker embedding of the enrollment audio and finds the cluster with the highest cosine similarity.

# Extract segments for a specific speaker
.build/release/audio diarize meeting.wav --target-speaker enrollment.wav
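The matching step amounts to a nearest-centroid search under cosine similarity. A minimal sketch (the helper names `cosine` and `bestSpeaker` are illustrative, not the library API):

```swift
import Foundation

// Cosine similarity between two speaker embeddings.
func cosine(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0, na: Float = 0, nb: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        na  += a[i] * a[i]
        nb  += b[i] * b[i]
    }
    return dot / (na.squareRoot() * nb.squareRoot())
}

// Pick the diarized speaker whose centroid embedding is most similar
// to the enrollment embedding.
func bestSpeaker(enrollment: [Float], centroids: [Int: [Float]]) -> Int? {
    centroids.max { cosine(enrollment, $0.value) < cosine(enrollment, $1.value) }?.key
}
```

Only segments attributed to the winning speaker ID are then emitted.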

DER Scoring

Evaluate diarization quality by scoring against a reference RTTM file. The pipeline computes the Diarization Error Rate (DER), which measures the proportion of time that is incorrectly attributed.

# Score against reference RTTM
.build/release/audio diarize meeting.wav --score-against reference.rttm
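DER is conventionally defined as the sum of missed speech, false alarm, and speaker confusion time over the total reference speech time. A sketch of that definition (real scorers additionally handle forgiveness collars and overlapped speech, which this omits):

```swift
// Diarization Error Rate from its three timed error components.
// All durations are in seconds.
func der(missed: Double, falseAlarm: Double, confusion: Double,
         totalReference: Double) -> Double {
    (missed + falseAlarm + confusion) / totalReference
}
```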

RTTM Output

The --rttm flag produces Rich Transcription Time Marked output, a standard format used for diarization evaluation. Each line follows the format:

SPEAKER filename 1 start_time duration <NA> <NA> speaker_id <NA> <NA>
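For example, a 3.2-second turn by speaker_0 starting 0.5 seconds into meeting.wav would be written as (values illustrative):

SPEAKER meeting 1 0.500 3.200 <NA> <NA> speaker_0 <NA> <NA>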

Options

Option               Description
--target-speaker     Enrollment audio for target speaker extraction (pyannote only)
--embedding-engine   Speaker embedding engine: mlx or coreml (pyannote only)
--vad-filter         Pre-filter with Silero VAD (pyannote only)
--rttm               Output in RTTM format
--json               JSON output format
--score-against      Reference RTTM file for DER evaluation

Important

Diarization works best with recordings that have clear speaker turns. Highly overlapping speech may reduce accuracy. Speaker count is determined automatically.

Model Downloads

Models are downloaded automatically on first use:

Component          Model                             Size      HuggingFace
Segmentation       Pyannote-Segmentation-3.0         ~5.7 MB   aufklarer/Pyannote-Segmentation-MLX
Speaker Embedding  WeSpeaker-ResNet34-LM (MLX)       ~25 MB    aufklarer/WeSpeaker-ResNet34-LM-MLX
Speaker Embedding  WeSpeaker-ResNet34-LM (CoreML)    ~25 MB    aufklarer/WeSpeaker-ResNet34-LM-CoreML
Sortformer         Sortformer Diarization (CoreML)   ~240 MB   aufklarer/Sortformer-Diarization-CoreML

Swift API

import SpeechVAD

let pipeline = try await DiarizationPipeline.fromPretrained()
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
    print("Speaker \(seg.speakerId): [\(seg.startTime)s - \(seg.endTime)s]")
}

// Target speaker extraction
let targetEmb = pipeline.embeddingModel.embed(audio: enrollmentAudio, sampleRate: 16000)
let segments = pipeline.extractSpeaker(
    audio: meetingAudio, sampleRate: 16000,
    targetEmbedding: targetEmb
)