Wake-Word / Keyword Spotting
The SpeechWakeWord module runs an on-device keyword spotter: you register a list of phrases, push audio chunks in, and receive detections. It is based on icefall's streaming Zipformer transducer (3.49M params, Apache-2.0), compiled to CoreML with INT8 palettization.
The shipped checkpoint is the GigaSpeech KWS fine-tune. Non-English keywords need a separate icefall fine-tune and re-export.
Architecture
| Stage | Details |
|---|---|
| fbank | kaldi-compatible (25 ms / 10 ms, Povey window, 80 mel bins, high_freq=-400, no CMVN) |
| Encoder | 6-stage causal Zipformer2 (128-dim), 45 mel frames in → 8 frames out (40 ms / frame) — 3.3 MB INT8 |
| Decoder | Stateless transducer, BPE-500 vocab, context size 2 — 525 KB FP16 |
| Joiner | Linear + tanh output projection — 160 KB INT8 |
| Decode | Modified beam search (beam=4) over an Aho-Corasick ContextGraph of user keywords |
Compiled size on disk: ~4 MB total (encoder.mlmodelc + decoder.mlmodelc + joiner.mlmodelc). Runtime memory: ~6 MB including encoder state caches.
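The decode stage's keyword matching can be pictured as classic Aho-Corasick over BPE token sequences. Below is a minimal Python sketch of that matching step (toy code for illustration only; the real ContextGraph also carries per-arc boost scores and integrates with the beam search):

```python
from collections import deque

def build_automaton(keywords):
    """Toy Aho-Corasick automaton over token sequences.
    keywords: dict of name -> list of token strings."""
    trie = [{}]       # state -> {token: next state}
    out = [set()]     # state -> keyword names ending here
    fail = [0]        # state -> failure link
    for name, toks in keywords.items():
        s = 0
        for t in toks:
            if t not in trie[s]:
                trie.append({}); out.append(set()); fail.append(0)
                trie[s][t] = len(trie) - 1
            s = trie[s][t]
        out[s].add(name)
    q = deque(trie[0].values())  # depth-1 states keep fail = 0
    while q:
        s = q.popleft()
        for t, nxt in trie[s].items():
            q.append(nxt)
            f = fail[s]
            while f and t not in trie[f]:
                f = fail[f]
            fail[nxt] = trie[f].get(t, 0)
            out[nxt] |= out[fail[nxt]]  # inherit shorter suffix matches
    return trie, out, fail

def scan(automaton, token_stream):
    """Report (index, keyword) for every match in a token stream."""
    trie, out, fail = automaton
    s, hits = 0, []
    for i, t in enumerate(token_stream):
        while s and t not in trie[s]:
            s = fail[s]
        s = trie[s].get(t, 0)
        hits.extend((i, name) for name in sorted(out[s]))
    return hits
```

Because failure links share suffixes between keywords, all registered phrases are matched in a single pass over the emitted tokens.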
Performance
| Metric | Value | Notes |
|---|---|---|
| RTF (CPU + Neural Engine) | 0.04 | 26× real-time on M-series |
| Recall (12 keywords) | 88% | LibriSpeech test-clean, 158 positive utterances |
| False positives / utterance | 0.27 | 60 negative utterances |
| CoreML INT8 vs PyTorch FP32 | 99% | Emission agreement |
Tuned defaults: `acThreshold=0.15`, `contextScore=0.5`, `numTrailingBlanks=1`. Per-keyword overrides supported.
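As a rough illustration of how the two knobs interact (hypothetical helper functions, not the module's internals): a matched span is accepted when its mean per-token acoustic probability clears `acThreshold`, while the context boost is added to each in-keyword token's search score so partial matches survive the beam.

```python
def span_accepted(token_probs, ac_threshold=0.15):
    """A matched keyword span passes when the mean per-token
    acoustic probability clears acThreshold (illustrative)."""
    return sum(token_probs) / len(token_probs) >= ac_threshold

def boosted_path_score(token_log_probs, boost=0.5):
    """The context boost is added to each in-keyword token's
    search score, keeping partial matches alive in the beam."""
    return sum(lp + boost for lp in token_log_probs)
```

Raising a keyword's boost makes it easier to trigger without lowering the acoustic bar, which is why the two are tuned separately.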
CLI Usage
Plain-phrase form (greedy BPE — works well for common words):
```bash
audio wake recording.wav --keywords "hey soniqo"
audio wake recording.wav --keywords "hey soniqo:0.15:0.5" "cancel"
```
Pre-tokenized form (sherpa-onnx style — recommended when you know the exact decomposition the model was trained on):
```bash
# Format: "phrase|piece1 piece2 ...:threshold:boost"
audio wake recording.wav \
    --keywords "LIGHT UP|▁ L IGHT ▁UP:0.25:2.0"

# Multiple keywords + JSON output
audio wake recording.wav \
    --keywords "LIGHT UP|▁ L IGHT ▁UP:0.25:2.0" \
    "LOVELY CHILD|▁LOVE LY ▁CHI L D:0.25:2.0" \
    --json
```
Or a keyword file, one entry per line (# for comments):
```bash
audio wake recording.wav --keywords-file keywords.txt
```
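Each keyword entry packs up to three optional fields after the phrase. A small Python sketch of a parser for the `phrase[|pieces][:threshold[:boost]]` shape implied by the examples above (a guess at the grammar, not the CLI's actual parser):

```python
def _is_float(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def parse_keyword(entry):
    """Parse one keyword entry of the shape
    'phrase[|pieces][:threshold[:boost]]' into its fields.
    Sketch of the grammar implied by the CLI examples."""
    phrase, bar, rest = entry.partition("|")
    body = rest if bar else entry
    fields = body.split(":")
    tail = []  # numeric suffix fields, popped right-to-left
    while len(fields) > 1 and len(tail) < 2 and _is_float(fields[-1]):
        tail.append(float(fields.pop()))
    # tail is [boost, threshold] when both are present, else [threshold].
    return {
        "phrase": phrase if bar else ":".join(fields),
        "tokens": ":".join(fields).split() if bar else None,
        "threshold": tail[-1] if tail else None,
        "boost": tail[0] if len(tail) == 2 else None,
    }
```

Parsing the numeric tail right-to-left keeps phrases containing `:` unambiguous as long as their trailing fields are non-numeric.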
Swift API
```swift
import SpeechWakeWord

// Load the model with your keyword list.
let detector = try await WakeWordDetector.fromPretrained(
    keywords: [
        KeywordSpec(phrase: "hey soniqo", acThreshold: 0.15, boost: 0.5),
        KeywordSpec(phrase: "cancel")
    ]
)

// Streaming: push chunks, consume detections as they fire.
let session = try detector.createSession()
for chunk in micAudioChunks { // Float32 @ 16 kHz
    for detection in try session.pushAudio(chunk) {
        print("[\(detection.time(frameShiftSeconds: 0.04))s] \(detection.phrase)")
    }
}

// Batch: single shot over a full buffer.
let detections = try detector.detect(audio: samples, sampleRate: 16000)
```
KeywordSpec
| Field | Meaning |
|---|---|
| phrase | Display phrase, e.g. "hey soniqo". Also used as the source for greedy BPE encoding when tokens is nil. |
| acThreshold | Mean acoustic probability required over the matched span. 0 → use the tuned default (0.15). |
| boost | Per-token context boost. Positive values make the phrase easier to trigger. 0 → use the tuned default (0.5). |
| tokens | Optional explicit BPE piece list. When non-nil, the detector looks each piece up in the model's tokens.txt and bypasses the greedy BPE encoder. |
The icefall KWS vocabulary is uppercase BPE. Greedy tokenization of a phrase can pick a different BPE decomposition from the one the model was trained to emit: "LIGHT UP" greedy-encodes to ▁LI GHT ▁UP, but the training decomposition is ▁ L IGHT ▁UP. When detection on TTS-synthesised or clean read speech misses obvious matches, try the sherpa-onnx-style pre-tokenized form.
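The mismatch is easy to reproduce with a toy longest-match-first encoder over a SentencePiece-style vocabulary (toy vocab and function for illustration; the real model uses a 500-piece BPE vocabulary):

```python
def greedy_bpe(text, vocab):
    """Longest-match-first tokenization over a SentencePiece-style
    vocabulary (▁ marks a word boundary). Toy stand-in for a
    greedy BPE encoder."""
    s = "▁" + text.replace(" ", "▁")
    pieces, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):  # try the longest piece first
            if s[i:j] in vocab:
                pieces.append(s[i:j])
                i = j
                break
        else:
            pieces.append(s[i])  # fall back to a single character
            i += 1
    return pieces

# A vocabulary containing both decompositions of "LIGHT":
vocab = {"▁", "L", "IGHT", "▁LI", "GHT", "▁UP"}
```

With this vocabulary, `greedy_bpe("LIGHT UP", vocab)` prefers the longer `▁LI` piece and never recovers the trained `▁ L IGHT ▁UP` decomposition, which is exactly why an explicit `tokens` list can rescue missed detections.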
Model Downloads
| Model | Params | Size | HuggingFace |
|---|---|---|---|
| KWS-Zipformer-3M | 3.49M | ~4 MB | aufklarer/KWS-Zipformer-3M-CoreML-INT8 |
Pipeline Integration
The module exposes a WakeWordProvider protocol that mirrors StreamingVADProvider, so a voice pipeline can gate activation on VAD, wake-word, or both. WakeWordStreamingAdapter wraps a loaded detector and a single session into a reusable provider object.
```swift
let adapter = try WakeWordStreamingAdapter(detector: detector)
// pipeline.configure(wakeWord: adapter)
```
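One way gating on both providers could behave, sketched in Python as a hypothetical policy (the actual pipeline wires providers through the Swift API above): an activation fires only when a wake-word detection arrives while VAD currently reports speech.

```python
def gate_activations(events):
    """Fire an activation only when a wake-word detection lands
    inside a span that VAD marks as speech. Hypothetical gating
    policy, shown for illustration.
    events: list of (time_s, kind), kind in {'vad_on', 'vad_off', 'wake'}."""
    in_speech = False
    activations = []
    for t, kind in events:
        if kind == "vad_on":
            in_speech = True
        elif kind == "vad_off":
            in_speech = False
        elif kind == "wake" and in_speech:
            activations.append(t)
    return activations
```

Gating on both signals suppresses wake-word hits triggered by non-speech noise, at the cost of depending on VAD latency.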
Source
- Sources/SpeechWakeWord — Swift module
- docs/models/kws-zipformer.md — architecture notes
- docs/inference/wake-word.md — inference pipeline
- Upstream: k2-fsa/icefall KWS recipe / pkufool/keyword-spotting-models