Voice Activity Detection

Two VAD models are available: Pyannote segmentation for high-accuracy offline batch processing, and Silero VAD v5 for low-latency streaming detection. Both run entirely on-device.

Pyannote (Offline)

Pyannote segmentation-3.0 provides high-accuracy VAD using a PyanNet architecture. It processes audio in 10-second sliding windows with a 1-second step, then aggregates overlapping predictions and applies hysteresis smoothing.
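
The sliding-window aggregation can be sketched as follows. This is an illustrative Python model of the averaging step, not the package's actual code; function and parameter names are hypothetical:

```python
# Illustrative sketch: each window position produces per-frame speech
# probabilities; because windows overlap (10 s window, 1 s step), a
# frame's final score is the mean over every window that covered it.
def aggregate(window_scores, step_frames, total_frames):
    """window_scores: list of per-window score lists, one per window position."""
    acc = [0.0] * total_frames
    cnt = [0] * total_frames
    for i, scores in enumerate(window_scores):
        for j, p in enumerate(scores):
            t = i * step_frames + j  # absolute frame index
            if t < total_frames:
                acc[t] += p
                cnt[t] += 1
    return [a / max(c, 1) for a, c in zip(acc, cnt)]
```

Frames covered by several windows are smoothed toward the consensus, which is why the overlap (10:1 window-to-step ratio) matters for accuracy.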

Architecture

| Stage   | Details                                                  |
|---------|----------------------------------------------------------|
| SincNet | 40 learned bandpass filters (80 total: 40 cos + 40 sin)  |
| BiLSTM  | 4 layers, hidden=128, bidirectional (256-dim output)     |
| Linear  | 2 linear layers with LeakyReLU (negative_slope=0.01)     |
| Output  | 7-class softmax with hysteresis post-processing          |

Model size: ~1.49M parameters, ~5.7 MB on disk.
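
The hysteresis post-processing mentioned above can be sketched like this (an illustrative two-threshold binarizer, not the package's implementation; the 0.6/0.3 values mirror the CLI example below and are not the documented defaults):

```python
# Illustrative hysteresis binarization: a frame sequence enters "speech"
# when the probability rises above `onset`, and leaves it only when the
# probability falls below `offset` (offset < onset). Using two thresholds
# suppresses rapid on/off toggling around a single threshold.
def hysteresis(probs, onset=0.6, offset=0.3):
    segments, start, in_speech = [], 0, False
    for t, p in enumerate(probs):
        if not in_speech and p > onset:
            in_speech, start = True, t
        elif in_speech and p < offset:
            segments.append((start, t))  # end at first sub-offset frame
            in_speech = False
    if in_speech:
        segments.append((start, len(probs)))
    return segments
```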

Default Thresholds

CLI Usage

# Offline VAD
.build/release/audio vad recording.wav

# JSON output
.build/release/audio vad recording.wav --json

# Custom thresholds
.build/release/audio vad recording.wav --onset 0.6 --offset 0.3

Silero VAD v5 (Streaming)

Silero VAD v5 is a lightweight streaming model that processes 512-sample chunks (32 ms at 16 kHz). It runs at 23x real-time in release mode, making it suitable for live audio applications.
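
The chunk arithmetic is simple to verify: 512 samples at 16 kHz is 512 / 16000 = 0.032 s. A minimal chunking sketch (illustrative, not the package's code):

```python
# Split a 16 kHz sample buffer into fixed 512-sample chunks (32 ms each),
# zero-padding the final partial chunk. One second of audio (16000 samples)
# yields 31 full chunks plus a 128-sample remainder.
CHUNK = 512

def chunks(samples):
    out = []
    for i in range(0, len(samples), CHUNK):
        c = samples[i:i + CHUNK]
        out.append(c + [0.0] * (CHUNK - len(c)))  # pad last chunk
    return out
```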

Architecture

| Stage   | Details                                                     |
|---------|-------------------------------------------------------------|
| STFT    | Conv1d (1 to 258 channels), right-only reflection pad of 64 |
| Encoder | 4x Conv1d + ReLU                                            |
| LSTM    | Hidden size 128, state carried across chunks                |
| Decoder | Conv1d (128 to 1) on LSTM hidden state, sigmoid output      |

Model size: ~309K parameters, ~1.2 MB on disk.

Streaming State Machine

The streaming VAD processor uses a 4-state machine to produce clean speech segments:

  1. silence — no speech detected
  2. pendingSpeech — onset threshold crossed, waiting for minimum speech duration
  3. speech — confirmed speech segment in progress
  4. pendingSilence — offset threshold crossed, waiting for minimum silence duration
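
The four states above can be sketched as a small Python model (illustrative only, not the Swift implementation; thresholds and durations are hypothetical values, and durations are counted in chunks rather than seconds for brevity):

```python
# Illustrative 4-state VAD segmenter. Feed one speech probability per
# chunk; a finished (start_chunk, end_chunk) segment is returned once
# silence is confirmed, where end_chunk is the first confirmed-silent chunk.
class VADStateMachine:
    def __init__(self, onset=0.6, offset=0.3, min_speech=4, min_silence=3):
        self.onset, self.offset = onset, offset
        self.min_speech, self.min_silence = min_speech, min_silence
        self.state, self.start, self.count, self.t = "silence", 0, 0, -1

    def step(self, p):
        self.t += 1
        if self.state == "silence":
            if p > self.onset:
                self.state, self.start, self.count = "pendingSpeech", self.t, 1
        elif self.state == "pendingSpeech":
            if p > self.onset:
                self.count += 1
                if self.count >= self.min_speech:
                    self.state = "speech"  # minimum duration reached
            else:
                self.state = "silence"     # too short: discard
        elif self.state == "speech":
            if p < self.offset:
                self.state, self.count = "pendingSilence", 1
        elif self.state == "pendingSilence":
            if p < self.offset:
                self.count += 1
                if self.count >= self.min_silence:
                    self.state = "silence"
                    return (self.start, self.t - self.min_silence + 1)
            else:
                self.state = "speech"      # speech resumed before timeout
        return None
```

The two pending states are what turn raw per-chunk probabilities into clean segments: brief blips shorter than the minimum speech duration are discarded, and brief pauses shorter than the minimum silence duration do not split a segment.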

Default Thresholds

CLI Usage

# Streaming VAD
.build/release/audio vad-stream recording.wav

# Custom thresholds
.build/release/audio vad-stream recording.wav --onset 0.6 --offset 0.3

# Minimum durations
.build/release/audio vad-stream recording.wav --min-speech 0.5 --min-silence 0.2

# Choose engine
.build/release/audio vad-stream recording.wav --engine coreml

Options

| Option        | Applies To | Description                                       |
|---------------|------------|---------------------------------------------------|
| --onset       | Both       | Speech onset probability threshold                |
| --offset      | Both       | Speech offset probability threshold               |
| --min-speech  | Streaming  | Minimum speech segment duration (seconds)         |
| --min-silence | Streaming  | Minimum silence duration to end segment (seconds) |
| --engine      | Streaming  | Inference engine: mlx or coreml                   |
| --json        | Both       | JSON output format                                |
Important

For real-time applications, use audio vad-stream with Silero VAD. The Pyannote model requires the full audio file and is better suited for offline batch processing where accuracy is the priority.

Model Downloads

| Model                     | Backend | Size    | HuggingFace                         |
|---------------------------|---------|---------|-------------------------------------|
| Silero-VAD-v5             | MLX     | ~1.2 MB | aufklarer/Silero-VAD-v5-MLX         |
| Silero-VAD-v5             | CoreML  | ~1.2 MB | aufklarer/Silero-VAD-v5-CoreML      |
| Pyannote-Segmentation-3.0 | MLX     | ~5.7 MB | aufklarer/Pyannote-Segmentation-MLX |

Swift API

import SpeechVAD

// Offline VAD (Pyannote)
let pyannote = try await PyannoteVAD.loadFromHub()
let segments = try await pyannote.detectSpeech(audioFile: "recording.wav")
for segment in segments {
    print("\(segment.start)s - \(segment.end)s")
}

// Streaming VAD (Silero)
let silero = try await SileroVAD.loadFromHub()
let processor = StreamingVADProcessor(model: silero, config: .sileroDefault)
for chunk in audioChunks {
    if let segment = try processor.process(chunk: chunk) {
        print("Speech: \(segment.start)s - \(segment.end)s")
    }
}