# Voice Activity Detection
Two VAD models are available: Pyannote segmentation-3.0 for high-accuracy offline batch processing, and Silero VAD v5 for low-latency streaming detection. Both run entirely on-device.
## Pyannote (Offline)
Pyannote segmentation-3.0 provides high-accuracy VAD using a PyanNet architecture. It processes audio in 10-second sliding windows with a 1-second step, then aggregates overlapping predictions and applies hysteresis smoothing.
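The sliding-window aggregation described above can be sketched as follows. This is an illustrative sketch, not the library's internal implementation; the helper names (`windowStarts`, `aggregate`) are hypothetical. Each 10-second window yields per-sample speech probabilities, and overlapping predictions are averaged before thresholding.

```swift
// Sliding-window parameters from the text: 10 s windows, 1 s step, 16 kHz audio.
let windowSeconds = 10.0
let stepSeconds = 1.0
let sampleRate = 16_000.0

// Start offsets (in samples) of each analysis window.
func windowStarts(totalSamples: Int) -> [Int] {
    let window = Int(windowSeconds * sampleRate)
    let step = Int(stepSeconds * sampleRate)
    // Short files get a single window covering the whole signal.
    guard totalSamples > window else { return [0] }
    return Array(stride(from: 0, through: totalSamples - window, by: step))
}

// Average overlapping per-sample probabilities from each window.
func aggregate(probabilities: [[Float]], starts: [Int], totalSamples: Int) -> [Float] {
    var sum = [Float](repeating: 0, count: totalSamples)
    var count = [Float](repeating: 0, count: totalSamples)
    for (probs, start) in zip(probabilities, starts) {
        for (i, p) in probs.enumerated() where start + i < totalSamples {
            sum[start + i] += p
            count[start + i] += 1
        }
    }
    return zip(sum, count).map { $1 > 0 ? $0 / $1 : 0 }
}
```

Averaging overlaps smooths boundary artifacts at window edges, since every sample (except near the file ends) is scored by up to ten windows.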
### Architecture
| Stage | Details |
|---|---|
| SincNet | 40 learned bandpass filters (80 total: 40 cos + 40 sin) |
| BiLSTM | 4 layers, hidden=128, bidirectional (256-dim output) |
| Linear | 2 linear layers with LeakyReLU (negative_slope=0.01) |
| Output | 7-class softmax with hysteresis post-processing |
Model size: ~1.49M parameters, ~5.7 MB on disk.
### Default Thresholds

- Onset: `0.767` — probability above which speech is detected
- Offset: `0.377` — probability below which speech ends
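Hysteresis means the two thresholds act as a latch: speech begins only when the probability rises above the onset value and ends only once it falls below the lower offset value, which prevents flicker around a single cutoff. A minimal sketch (illustrative only; `hysteresisSegments` and the frame duration are assumptions, not the SpeechVAD API):

```swift
// Binarize per-frame speech probabilities with onset/offset hysteresis.
// `frameDuration` is a hypothetical frame step in seconds.
func hysteresisSegments(
    probs: [Float],
    onset: Float = 0.767,
    offset: Float = 0.377,
    frameDuration: Double = 1.0
) -> [(start: Double, end: Double)] {
    var segments: [(start: Double, end: Double)] = []
    var inSpeech = false
    var segmentStart = 0.0
    for (i, p) in probs.enumerated() {
        let t = Double(i) * frameDuration
        if !inSpeech && p > onset {
            inSpeech = true          // latch on: crossed the high threshold
            segmentStart = t
        } else if inSpeech && p < offset {
            inSpeech = false         // latch off: fell below the low threshold
            segments.append((start: segmentStart, end: t))
        }
    }
    if inSpeech {
        segments.append((start: segmentStart, end: Double(probs.count) * frameDuration))
    }
    return segments
}
```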
### CLI Usage

```bash
# Offline VAD
.build/release/audio vad recording.wav

# JSON output
.build/release/audio vad recording.wav --json

# Custom thresholds
.build/release/audio vad recording.wav --onset 0.6 --offset 0.3
```
## Silero VAD v5 (Streaming)
Silero VAD v5 is a lightweight streaming model that processes 512-sample chunks (32 ms at 16 kHz). It runs at 23x real-time in release mode, making it suitable for live audio applications.
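The chunk arithmetic works out as follows: 512 samples at 16 kHz is 512 / 16000 = 0.032 s of audio per inference call, so worst-case detection latency from chunking alone is one chunk. A small framing sketch (the `chunks` helper is illustrative, not part of the SpeechVAD API):

```swift
// Silero v5 consumes fixed 512-sample chunks at 16 kHz.
let chunkSize = 512
let sampleRate = 16_000.0
let chunkDuration = Double(chunkSize) / sampleRate  // 0.032 s per chunk

// Split a buffer into full chunks; a trailing partial chunk is dropped
// (a real streaming caller would buffer it until more audio arrives).
func chunks(of samples: [Float]) -> [[Float]] {
    let usable = samples.count - samples.count % chunkSize
    return stride(from: 0, to: usable, by: chunkSize)
        .map { Array(samples[$0..<$0 + chunkSize]) }
}
```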
### Architecture
| Stage | Details |
|---|---|
| STFT | Conv1d (1 to 258 channels), right-only reflection pad of 64 |
| Encoder | 4x Conv1d + ReLU |
| LSTM | Hidden size 128, state carried across chunks |
| Decoder | Conv1d (128 to 1) on LSTM hidden state, sigmoid output |
Model size: ~309K parameters, ~1.2 MB on disk.
### Streaming State Machine
The streaming VAD processor uses a 4-state machine to produce clean speech segments:
- `silence` — no speech detected
- `pendingSpeech` — onset threshold crossed, waiting for the minimum speech duration
- `speech` — confirmed speech segment in progress
- `pendingSilence` — offset threshold crossed, waiting for the minimum silence duration
### Default Thresholds

- Onset: `0.5`
- Offset: `0.35`
- Minimum speech duration: `0.25s`
- Minimum silence duration: `0.1s`
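The four states and the default thresholds above can be combined into a minimal segmenter sketch. This is illustrative only; the actual `StreamingVADProcessor` internals may differ, and the `Segmenter` type here is hypothetical. The two minimum durations debounce short blips in either direction.

```swift
enum VADState { case silence, pendingSpeech, speech, pendingSilence }

struct Segmenter {
    var state: VADState = .silence
    var stateStart = 0.0    // when the current pending state began
    var segmentStart = 0.0  // confirmed start of the in-progress segment
    var onset: Float = 0.5
    var offset: Float = 0.35
    var minSpeech = 0.25
    var minSilence = 0.1

    // Feed one probability per chunk; returns a finished segment, if any.
    mutating func step(prob: Float, time: Double) -> (start: Double, end: Double)? {
        switch state {
        case .silence:
            if prob > onset { state = .pendingSpeech; stateStart = time }
        case .pendingSpeech:
            if prob < offset {
                state = .silence                      // blip too short, discard
            } else if time - stateStart >= minSpeech {
                state = .speech; segmentStart = stateStart
            }
        case .speech:
            if prob < offset { state = .pendingSilence; stateStart = time }
        case .pendingSilence:
            if prob > onset {
                state = .speech                       // brief dip, keep segment
            } else if time - stateStart >= minSilence {
                state = .silence
                return (start: segmentStart, end: stateStart)
            }
        }
        return nil
    }
}
```

Note that a segment is only emitted once the silence has persisted for `minSilence`, so the reported end time is when speech actually dropped, not when the state machine committed.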
### CLI Usage

```bash
# Streaming VAD
.build/release/audio vad-stream recording.wav

# Custom thresholds
.build/release/audio vad-stream recording.wav --onset 0.6 --offset 0.3

# Minimum durations
.build/release/audio vad-stream recording.wav --min-speech 0.5 --min-silence 0.2

# Choose engine
.build/release/audio vad-stream recording.wav --engine coreml
```
## Options

| Option | Applies To | Description |
|---|---|---|
| `--onset` | Both | Speech onset probability threshold |
| `--offset` | Both | Speech offset probability threshold |
| `--min-speech` | Streaming | Minimum speech segment duration (seconds) |
| `--min-silence` | Streaming | Minimum silence duration to end a segment (seconds) |
| `--engine` | Streaming | Inference engine: `mlx` or `coreml` |
| `--json` | Both | Emit results as JSON |
For real-time applications, use `audio vad-stream` with Silero VAD. The Pyannote model requires the full audio file up front and is better suited to offline batch processing where accuracy is the priority.
## Model Downloads

| Model | Backend | Size | HuggingFace |
|---|---|---|---|
| Silero-VAD-v5 | MLX | ~1.2 MB | `aufklarer/Silero-VAD-v5-MLX` |
| Silero-VAD-v5 | CoreML | ~1.2 MB | `aufklarer/Silero-VAD-v5-CoreML` |
| Pyannote-Segmentation-3.0 | MLX | ~5.7 MB | `aufklarer/Pyannote-Segmentation-MLX` |
## Swift API

```swift
import SpeechVAD

// Offline VAD (Pyannote)
let pyannote = try await PyannoteVAD.loadFromHub()
let segments = try await pyannote.detectSpeech(audioFile: "recording.wav")
for segment in segments {
    print("\(segment.start)s - \(segment.end)s")
}

// Streaming VAD (Silero)
let silero = try await SileroVAD.loadFromHub()
let processor = StreamingVADProcessor(model: silero, config: .sileroDefault)
for chunk in audioChunks {
    if let segment = try processor.process(chunk: chunk) {
        print("Speech: \(segment.start)s - \(segment.end)s")
    }
}
```