# Voice Activity Detection
Two VAD models are available: Pyannote segmentation-3.0 for high-accuracy offline batch processing, and Silero VAD v5 for low-latency streaming detection. Both run entirely on-device.
## Pyannote (Offline)
Pyannote segmentation-3.0 provides high-accuracy VAD using a PyanNet architecture. It processes audio in 10-second sliding windows with a 1-second step, then aggregates overlapping predictions and applies hysteresis smoothing.
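The sliding-window aggregation described above can be sketched as follows. This is an illustrative sketch, not the library's internal implementation; the helper names (`windowStarts`, `aggregate`) are hypothetical. Each 10-second window yields per-sample speech probabilities, and overlapping predictions are averaged before thresholding.

```swift
// Sliding-window parameters from the text: 10 s windows, 1 s step, 16 kHz audio.
let windowSeconds = 10.0
let stepSeconds = 1.0
let sampleRate = 16_000.0

// Start offsets (in samples) of each analysis window.
func windowStarts(totalSamples: Int) -> [Int] {
    let window = Int(windowSeconds * sampleRate)
    let step = Int(stepSeconds * sampleRate)
    // Short files get a single window covering the whole signal.
    guard totalSamples > window else { return [0] }
    return Array(stride(from: 0, through: totalSamples - window, by: step))
}

// Average overlapping per-sample probabilities from each window.
func aggregate(probabilities: [[Float]], starts: [Int], totalSamples: Int) -> [Float] {
    var sum = [Float](repeating: 0, count: totalSamples)
    var count = [Float](repeating: 0, count: totalSamples)
    for (probs, start) in zip(probabilities, starts) {
        for (i, p) in probs.enumerated() where start + i < totalSamples {
            sum[start + i] += p
            count[start + i] += 1
        }
    }
    return zip(sum, count).map { $1 > 0 ? $0 / $1 : 0 }
}
```

Averaging overlaps smooths boundary artifacts at window edges, since every sample (except near the file ends) is scored by up to ten windows.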
### Architecture
| Stage | Details |
|---|---|
| SincNet | 40 learned bandpass filters (80 total: 40 cos + 40 sin) |
| BiLSTM | 4 layers, hidden=128, bidirectional (256-dim output) |
| Linear | 2 linear layers with LeakyReLU (negative_slope=0.01) |
| Output | 7-class softmax with hysteresis post-processing |
Model size: ~1.49M parameters, ~5.7 MB on disk.
### Default Thresholds

- Onset: `0.767` — probability above which speech is detected
- Offset: `0.377` — probability below which speech ends
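Hysteresis means the two thresholds act as a latch: speech begins only when the probability rises above the onset value and ends only once it falls below the lower offset value, which prevents flicker around a single cutoff. A minimal sketch (illustrative only; `hysteresisSegments` and the frame duration are assumptions, not the SpeechVAD API):

```swift
// Binarize per-frame speech probabilities with onset/offset hysteresis.
// `frameDuration` is a hypothetical frame step in seconds.
func hysteresisSegments(
    probs: [Float],
    onset: Float = 0.767,
    offset: Float = 0.377,
    frameDuration: Double = 1.0
) -> [(start: Double, end: Double)] {
    var segments: [(start: Double, end: Double)] = []
    var inSpeech = false
    var segmentStart = 0.0
    for (i, p) in probs.enumerated() {
        let t = Double(i) * frameDuration
        if !inSpeech && p > onset {
            inSpeech = true          // latch on: crossed the high threshold
            segmentStart = t
        } else if inSpeech && p < offset {
            inSpeech = false         // latch off: fell below the low threshold
            segments.append((start: segmentStart, end: t))
        }
    }
    if inSpeech {
        segments.append((start: segmentStart, end: Double(probs.count) * frameDuration))
    }
    return segments
}
```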
### CLI Usage

```bash
# Offline VAD
.build/release/audio vad recording.wav

# JSON output
.build/release/audio vad recording.wav --json

# Custom thresholds
.build/release/audio vad recording.wav --onset 0.6 --offset 0.3
```
## Silero VAD v5 (Streaming)
Silero VAD v5 is a lightweight streaming model that processes 512-sample chunks (32 ms at 16 kHz). It runs at 23x real-time in release mode, making it suitable for live audio applications.
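The chunk arithmetic works out as follows: 512 samples at 16 kHz is 512 / 16000 = 0.032 s of audio per inference call, so worst-case detection latency from chunking alone is one chunk. A small framing sketch (the `chunks` helper is illustrative, not part of the SpeechVAD API):

```swift
// Silero v5 consumes fixed 512-sample chunks at 16 kHz.
let chunkSize = 512
let sampleRate = 16_000.0
let chunkDuration = Double(chunkSize) / sampleRate  // 0.032 s per chunk

// Split a buffer into full chunks; a trailing partial chunk is dropped
// (a real streaming caller would buffer it until more audio arrives).
func chunks(of samples: [Float]) -> [[Float]] {
    let usable = samples.count - samples.count % chunkSize
    return stride(from: 0, to: usable, by: chunkSize)
        .map { Array(samples[$0..<$0 + chunkSize]) }
}
```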
### Architecture
| Stage | Details |
|---|---|
| STFT | Conv1d (1 to 258 channels), right-only reflection pad of 64 |
| Encoder | 4x Conv1d + ReLU |
| LSTM | Hidden size 128, state carried across chunks |
| Decoder | Conv1d (128 to 1) on LSTM hidden state, sigmoid output |
Model size: ~309K parameters, ~1.2 MB on disk.
### Streaming State Machine
The streaming VAD processor uses a 4-state machine to produce clean speech segments:
- `silence` — no speech detected
- `pendingSpeech` — onset threshold crossed, waiting for the minimum speech duration
- `speech` — confirmed speech segment in progress
- `pendingSilence` — offset threshold crossed, waiting for the minimum silence duration
### Default Thresholds

- Onset: `0.5`
- Offset: `0.35`
- Minimum speech duration: `0.25s`
- Minimum silence duration: `0.1s`
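The four states and the default thresholds above can be combined into a minimal segmenter sketch. This is illustrative only; the actual `StreamingVADProcessor` internals may differ, and the `Segmenter` type here is hypothetical. The two minimum durations debounce short blips in either direction.

```swift
enum VADState { case silence, pendingSpeech, speech, pendingSilence }

struct Segmenter {
    var state: VADState = .silence
    var stateStart = 0.0    // when the current pending state began
    var segmentStart = 0.0  // confirmed start of the in-progress segment
    var onset: Float = 0.5
    var offset: Float = 0.35
    var minSpeech = 0.25
    var minSilence = 0.1

    // Feed one probability per chunk; returns a finished segment, if any.
    mutating func step(prob: Float, time: Double) -> (start: Double, end: Double)? {
        switch state {
        case .silence:
            if prob > onset { state = .pendingSpeech; stateStart = time }
        case .pendingSpeech:
            if prob < offset {
                state = .silence                      // blip too short, discard
            } else if time - stateStart >= minSpeech {
                state = .speech; segmentStart = stateStart
            }
        case .speech:
            if prob < offset { state = .pendingSilence; stateStart = time }
        case .pendingSilence:
            if prob > onset {
                state = .speech                       // brief dip, keep segment
            } else if time - stateStart >= minSilence {
                state = .silence
                return (start: segmentStart, end: stateStart)
            }
        }
        return nil
    }
}
```

Note that a segment is only emitted once the silence has persisted for `minSilence`, so the reported end time is when speech actually dropped, not when the state machine committed.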
### CLI Usage

```bash
# Streaming VAD
.build/release/audio vad-stream recording.wav

# Custom thresholds
.build/release/audio vad-stream recording.wav --onset 0.6 --offset 0.3

# Minimum durations
.build/release/audio vad-stream recording.wav --min-speech 0.5 --min-silence 0.2

# Choose engine
.build/release/audio vad-stream recording.wav --engine coreml
```
## Options

| Option | Applies To | Description |
|---|---|---|
| `--onset` | Both | Speech onset probability threshold |
| `--offset` | Both | Speech offset probability threshold |
| `--min-speech` | Streaming | Minimum speech segment duration (seconds) |
| `--min-silence` | Streaming | Minimum silence duration to end a segment (seconds) |
| `--engine` | Streaming | Inference engine: `mlx` or `coreml` |
| `--json` | Both | Emit results as JSON |
For real-time applications, use `audio vad-stream` with Silero VAD. The Pyannote model requires the full audio file up front and is better suited to offline batch processing where accuracy is the priority.
## Model Downloads

| Model | Backend | Size | HuggingFace |
|---|---|---|---|
| Silero-VAD-v5 | MLX | ~1.2 MB | `aufklarer/Silero-VAD-v5-MLX` |
| Silero-VAD-v5 | CoreML | ~1.2 MB | `aufklarer/Silero-VAD-v5-CoreML` |
| Pyannote-Segmentation-3.0 | MLX | ~5.7 MB | `aufklarer/Pyannote-Segmentation-MLX` |
## Swift API

```swift
import SpeechVAD

// Offline VAD (Pyannote)
let pyannote = try await PyannoteVAD.loadFromHub()
let segments = try await pyannote.detectSpeech(audioFile: "recording.wav")
for segment in segments {
    print("\(segment.start)s - \(segment.end)s")
}

// Streaming VAD (Silero)
let silero = try await SileroVAD.loadFromHub()
let processor = StreamingVADProcessor(model: silero, config: .sileroDefault)
for chunk in audioChunks {
    if let segment = try processor.process(chunk: chunk) {
        print("Speech: \(segment.start)s - \(segment.end)s")
    }
}
```