Wake-Word / Keyword Spotting

The SpeechWakeWord module runs an on-device keyword spotter: you register a list of phrases, push audio chunks in, and receive detections. It is based on icefall's streaming Zipformer transducer (3.49M params, Apache-2.0), compiled to CoreML with INT8 palettization.

English only

The shipped checkpoint is the GigaSpeech KWS fine-tune. Non-English keywords require a separate icefall fine-tune and re-export.

Architecture

| Stage | Details |
|---|---|
| fbank | kaldi-compatible (25 ms / 10 ms, Povey window, 80 mel bins, high_freq=-400, no CMVN) |
| Encoder | 6-stage causal Zipformer2 (128-dim), 45 mel frames in → 8 frames out (40 ms / frame) — 3.3 MB INT8 |
| Decoder | Stateless transducer, BPE-500 vocab, context size 2 — 525 KB FP16 |
| Joiner | Linear + tanh output projection — 160 KB INT8 |
| Decode | Modified beam search (beam=4) over an Aho-Corasick ContextGraph of user keywords |
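The ContextGraph in the Decode stage can be pictured as a keyword trie over BPE token IDs, with a per-token boost accumulated along a matched path. Below is a minimal sketch of that idea; it is a hypothetical simplification (the real icefall ContextGraph also maintains Aho-Corasick fallback links and per-node partial scores), not the module's implementation.

```swift
// Sketch of a keyword context graph: a trie over BPE token IDs.
// Simplified vs. icefall: no Aho-Corasick suffix links, single global boost.
struct ContextGraph {
    private struct Node {
        var next: [Int: Int] = [:]   // token ID -> child node index
        var isEnd = false            // a full keyword ends at this node
    }
    private var nodes = [Node()]     // index 0 is the root
    let boost: Double

    init(keywords: [[Int]], boost: Double = 0.5) {
        self.boost = boost
        for tokens in keywords {
            var cur = 0
            for t in tokens {
                if let child = nodes[cur].next[t] {
                    cur = child
                } else {
                    nodes.append(Node())
                    nodes[cur].next[t] = nodes.count - 1
                    cur = nodes.count - 1
                }
            }
            nodes[cur].isEnd = true
        }
    }

    // Advance from `state` on `token`. Falling off the trie resets to the
    // root and cancels the boost accumulated so far on this partial match.
    func step(state: Int, token: Int, accumulated: Double)
        -> (state: Int, score: Double, matched: Bool) {
        if let child = nodes[state].next[token] {
            return (child, boost, nodes[child].isEnd)
        }
        return (0, -accumulated, false)
    }
}
```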

Compiled size on disk: ~4 MB total (encoder.mlmodelc + decoder.mlmodelc + joiner.mlmodelc). Runtime memory: ~6 MB including encoder state caches.

Performance

| Metric | Value | Notes |
|---|---|---|
| RTF (CPU + Neural Engine) | 0.04 | 26× real-time on M-series |
| Recall (12 keywords) | 88% | LibriSpeech test-clean, 158 positive utterances |
| False positives / utterance | 0.27 | 60 negative utterances |
| CoreML INT8 vs PyTorch FP32 | 99% | Emission agreement |

Tuned defaults: acThreshold=0.15, contextScore=0.5, numTrailingBlanks=1. Per-keyword overrides supported.
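How these knobs interact can be condensed into a final acceptance test. The function below is a hypothetical sketch for intuition only; in the real decoder these quantities are tracked per beam hypothesis during search, not evaluated by a standalone helper.

```swift
// Sketch: a keyword fires when the mean acoustic probability over its
// matched token span clears acThreshold AND enough blank frames follow
// the last token (evidence the phrase has actually ended).
func acceptsDetection(tokenProbs: [Double],
                      trailingBlanks: Int,
                      acThreshold: Double = 0.15,
                      numTrailingBlanks: Int = 1) -> Bool {
    guard !tokenProbs.isEmpty else { return false }
    let mean = tokenProbs.reduce(0, +) / Double(tokenProbs.count)
    return mean >= acThreshold && trailingBlanks >= numTrailingBlanks
}
```

Raising acThreshold trades recall for fewer false positives; boost works in the opposite direction by inflating the search-time score of in-keyword tokens.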

CLI Usage

Plain-phrase form (greedy BPE — works well for common words):

audio wake recording.wav --keywords "hey soniqo"

audio wake recording.wav --keywords "hey soniqo:0.15:0.5" "cancel"

Pre-tokenized form (sherpa-onnx style — recommended when you know the exact decomposition the model was trained on):

# Format: "phrase|piece1 piece2 ...:threshold:boost"
audio wake recording.wav \
    --keywords "LIGHT UP|▁ L IGHT ▁UP:0.25:2.0"

# Multiple keywords + JSON output
audio wake recording.wav \
    --keywords "LIGHT UP|▁ L IGHT ▁UP:0.25:2.0" \
               "LOVELY CHILD|▁LOVE LY ▁CHI L D:0.25:2.0" \
    --json

Or a keyword file, one entry per line (# for comments):

audio wake recording.wav --keywords-file keywords.txt
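A keywords.txt might look like the following. This is an assumed layout: the document does not spell out the file syntax, so the sketch presumes each line accepts the same plain-phrase and pre-tokenized forms as --keywords.

```
# Lines starting with "#" are comments.
hey soniqo:0.15:0.5
cancel
LIGHT UP|▁ L IGHT ▁UP:0.25:2.0
```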

Swift API

import SpeechWakeWord

// Load the model with your keyword list.
let detector = try await WakeWordDetector.fromPretrained(
    keywords: [
        KeywordSpec(phrase: "hey soniqo", acThreshold: 0.15, boost: 0.5),
        KeywordSpec(phrase: "cancel")
    ]
)

// Streaming: push chunks, consume detections as they fire.
let session = try detector.createSession()
for chunk in micAudioChunks {                   // Float32 @ 16 kHz
    for detection in try session.pushAudio(chunk) {
        print("[\(detection.time(frameShiftSeconds: 0.04))s] \(detection.phrase)")
    }
}

// Batch: single shot over a full buffer.
let detections = try detector.detect(audio: samples, sampleRate: 16000)
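The streaming loop above consumes fixed-size chunks; to drive it from a prerecorded buffer, a small splitter is enough. This helper is a sketch, not part of the module's API (1600 samples is 100 ms at 16 kHz).

```swift
// Split a Float32 sample buffer into fixed-size chunks for streaming.
// The final chunk may be shorter than samplesPerChunk.
func chunked(_ samples: [Float], samplesPerChunk: Int = 1600) -> [[Float]] {
    stride(from: 0, to: samples.count, by: samplesPerChunk).map {
        Array(samples[$0 ..< min($0 + samplesPerChunk, samples.count)])
    }
}
```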

KeywordSpec

| Field | Meaning |
|---|---|
| phrase | Display phrase, e.g. "hey soniqo". Also used as the source for greedy BPE encoding when tokens is nil. |
| acThreshold | Mean acoustic probability required over the matched span. 0 → use tuned default (0.15). |
| boost | Per-token context boost. Positive values make the phrase easier to trigger. 0 → use tuned default (0.5). |
| tokens | Optional explicit BPE piece list. When non-nil, the detector looks each piece up in the model's tokens.txt and bypasses the greedy BPE encoder. |

When to use pre-tokenized tokens

The icefall KWS vocabulary is uppercase BPE. Greedy tokenization of a phrase can pick a different BPE decomposition from the one the model was trained to emit — "LIGHT UP" greedy-encodes to ▁LI GHT ▁UP but the training decomposition is ▁ L IGHT ▁UP. When detection on TTS-synthesised or clean read speech misses obvious matches, try the sherpa-onnx-style pre-tokenized form.
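The mismatch is easy to reproduce with a toy greedy longest-match encoder. This is an illustration only, not the model's actual SentencePiece encoder, and the tiny vocabulary is invented: once "▁LI" is in the vocabulary, a greedy pass consumes it first and can never produce the "▁ L IGHT" split the model was trained on.

```swift
// Toy greedy longest-match BPE encoder over a piece vocabulary.
// SentencePiece convention: word boundaries are marked with "▁".
func greedyEncode(_ phrase: String, vocab: Set<String>) -> [String] {
    var text = "▁" + phrase.split(separator: " ").joined(separator: "▁")
    var pieces: [String] = []
    while !text.isEmpty {
        var match = String(text.prefix(1))   // fall back to one character
        var len = 1
        for end in 1...text.count {
            let candidate = String(text.prefix(end))
            if vocab.contains(candidate) {
                match = candidate            // keep the longest known prefix
                len = end
            }
        }
        pieces.append(match)
        text = String(text.dropFirst(len))
    }
    return pieces
}
```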

Model Downloads

| Model | Params | Size | HuggingFace |
|---|---|---|---|
| KWS-Zipformer-3M | 3.49M | ~4 MB | aufklarer/KWS-Zipformer-3M-CoreML-INT8 |

Pipeline integration

The module exposes a WakeWordProvider protocol that mirrors StreamingVADProvider, so a voice pipeline can gate activation on VAD, wake-word, or both. WakeWordStreamingAdapter wraps a loaded detector + a single session into a reusable provider object.

let adapter = try WakeWordStreamingAdapter(detector: detector)
// pipeline.configure(wakeWord: adapter)

Source