Wake-Word / Keyword Spotting
The SpeechWakeWord module runs an on-device keyword spotter: you register a list of phrases, push audio chunks in, and receive detections. It is based on icefall's streaming Zipformer transducer (3.49M params, Apache-2.0), compiled to CoreML with INT8 palettization.
The shipped checkpoint is the GigaSpeech KWS fine-tune. Non-English keywords need a separate icefall fine-tune and re-export.
Architecture
| Stage | Details |
|---|---|
| fbank | kaldi-compatible (25 ms / 10 ms, Povey window, 80 mel bins, high_freq=-400, no CMVN) |
| Encoder | 6-stage causal Zipformer2 (128-dim), 45 mel frames in → 8 frames out (40 ms / frame) — 3.3 MB INT8 |
| Decoder | Stateless transducer, BPE-500 vocab, context size 2 — 525 KB FP16 |
| Joiner | Linear + tanh output projection — 160 KB INT8 |
| Decode | Modified beam search (beam=4) over an Aho-Corasick ContextGraph of user keywords |
Compiled size on disk: ~4 MB total (encoder.mlmodelc + decoder.mlmodelc + joiner.mlmodelc). Runtime memory: ~6 MB including encoder state caches.
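The decode stage's keyword matching can be pictured as classic Aho-Corasick over BPE token sequences. Below is a minimal Python sketch of that matching step (toy code for illustration only; the real ContextGraph also carries per-arc boost scores and integrates with the beam search):

```python
from collections import deque

def build_automaton(keywords):
    """Toy Aho-Corasick automaton over token sequences.
    keywords: dict of name -> list of token strings."""
    trie = [{}]       # state -> {token: next state}
    out = [set()]     # state -> keyword names ending here
    fail = [0]        # state -> failure link
    for name, toks in keywords.items():
        s = 0
        for t in toks:
            if t not in trie[s]:
                trie.append({}); out.append(set()); fail.append(0)
                trie[s][t] = len(trie) - 1
            s = trie[s][t]
        out[s].add(name)
    q = deque(trie[0].values())  # depth-1 states keep fail = 0
    while q:
        s = q.popleft()
        for t, nxt in trie[s].items():
            q.append(nxt)
            f = fail[s]
            while f and t not in trie[f]:
                f = fail[f]
            fail[nxt] = trie[f].get(t, 0)
            out[nxt] |= out[fail[nxt]]  # inherit shorter suffix matches
    return trie, out, fail

def scan(automaton, token_stream):
    """Report (index, keyword) for every match in a token stream."""
    trie, out, fail = automaton
    s, hits = 0, []
    for i, t in enumerate(token_stream):
        while s and t not in trie[s]:
            s = fail[s]
        s = trie[s].get(t, 0)
        hits.extend((i, name) for name in sorted(out[s]))
    return hits
```

Because failure links share suffixes between keywords, all registered phrases are matched in a single pass over the emitted tokens.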
Performance
| Metric | Value | Notes |
|---|---|---|
| RTF (CPU + Neural Engine) | 0.04 | 26× real-time on M-series |
| Recall (12 keywords) | 88% | LibriSpeech test-clean, 158 positive utterances |
| False positives / utterance | 0.27 | 60 negative utterances |
| CoreML INT8 vs PyTorch FP32 | 99% | Emission agreement |
Tuned defaults: `acThreshold=0.15`, `contextScore=0.5`, `numTrailingBlanks=1`. Per-keyword overrides supported.
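As a rough illustration of how the two knobs interact (hypothetical helper functions, not the module's internals): a matched span is accepted when its mean per-token acoustic probability clears `acThreshold`, while the context boost is added to each in-keyword token's search score so partial matches survive the beam.

```python
def span_accepted(token_probs, ac_threshold=0.15):
    """A matched keyword span passes when the mean per-token
    acoustic probability clears acThreshold (illustrative)."""
    return sum(token_probs) / len(token_probs) >= ac_threshold

def boosted_path_score(token_log_probs, boost=0.5):
    """The context boost is added to each in-keyword token's
    search score, keeping partial matches alive in the beam."""
    return sum(lp + boost for lp in token_log_probs)
```

Raising a keyword's boost makes it easier to trigger without lowering the acoustic bar, which is why the two are tuned separately.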
CLI Usage
Plain-phrase form (greedy BPE — works well for common words):
```bash
audio wake recording.wav --keywords "hey soniqo"
audio wake recording.wav --keywords "hey soniqo:0.15:0.5" "cancel"
```
Pre-tokenized form (sherpa-onnx style — recommended when you know the exact decomposition the model was trained on):
```bash
# Format: "phrase|piece1 piece2 ...:threshold:boost"
audio wake recording.wav \
    --keywords "LIGHT UP|▁ L IGHT ▁UP:0.25:2.0"

# Multiple keywords + JSON output
audio wake recording.wav \
    --keywords "LIGHT UP|▁ L IGHT ▁UP:0.25:2.0" \
    "LOVELY CHILD|▁LOVE LY ▁CHI L D:0.25:2.0" \
    --json
```
Or a keyword file, one entry per line (# for comments):
```bash
audio wake recording.wav --keywords-file keywords.txt
```
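Each keyword entry packs up to three optional fields after the phrase. A small Python sketch of a parser for the `phrase[|pieces][:threshold[:boost]]` shape implied by the examples above (a guess at the grammar, not the CLI's actual parser):

```python
def _is_float(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def parse_keyword(entry):
    """Parse one keyword entry of the shape
    'phrase[|pieces][:threshold[:boost]]' into its fields.
    Sketch of the grammar implied by the CLI examples."""
    phrase, bar, rest = entry.partition("|")
    body = rest if bar else entry
    fields = body.split(":")
    tail = []  # numeric suffix fields, popped right-to-left
    while len(fields) > 1 and len(tail) < 2 and _is_float(fields[-1]):
        tail.append(float(fields.pop()))
    # tail is [boost, threshold] when both are present, else [threshold].
    return {
        "phrase": phrase if bar else ":".join(fields),
        "tokens": ":".join(fields).split() if bar else None,
        "threshold": tail[-1] if tail else None,
        "boost": tail[0] if len(tail) == 2 else None,
    }
```

Parsing the numeric tail right-to-left keeps phrases containing `:` unambiguous as long as their trailing fields are non-numeric.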
Swift API
```swift
import SpeechWakeWord

// Load the model with your keyword list.
let detector = try await WakeWordDetector.fromPretrained(
    keywords: [
        KeywordSpec(phrase: "hey soniqo", acThreshold: 0.15, boost: 0.5),
        KeywordSpec(phrase: "cancel")
    ]
)

// Streaming: push chunks, consume detections as they fire.
let session = try detector.createSession()
for chunk in micAudioChunks { // Float32 @ 16 kHz
    for detection in try session.pushAudio(chunk) {
        print("[\(detection.time(frameShiftSeconds: 0.04))s] \(detection.phrase)")
    }
}

// Batch: single shot over a full buffer.
let detections = try detector.detect(audio: samples, sampleRate: 16000)
```
KeywordSpec
| Field | Meaning |
|---|---|
| phrase | Display phrase, e.g. "hey soniqo". Also used as the source for greedy BPE encoding when tokens is nil. |
| acThreshold | Mean acoustic probability required over the matched span. 0 → use the tuned default (0.15). |
| boost | Per-token context boost. Positive values make the phrase easier to trigger. 0 → use the tuned default (0.5). |
| tokens | Optional explicit BPE piece list. When non-nil, the detector looks each piece up in the model's tokens.txt and bypasses the greedy BPE encoder. |
The icefall KWS vocabulary is uppercase BPE. Greedy tokenization of a phrase can pick a different BPE decomposition from the one the model was trained to emit: "LIGHT UP" greedy-encodes to ▁LI GHT ▁UP, but the training decomposition is ▁ L IGHT ▁UP. When detection on TTS-synthesised or clean read speech misses obvious matches, try the sherpa-onnx-style pre-tokenized form.
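The mismatch is easy to reproduce with a toy longest-match-first encoder over a SentencePiece-style vocabulary (toy vocab and function for illustration; the real model uses a 500-piece BPE vocabulary):

```python
def greedy_bpe(text, vocab):
    """Longest-match-first tokenization over a SentencePiece-style
    vocabulary (▁ marks a word boundary). Toy stand-in for a
    greedy BPE encoder."""
    s = "▁" + text.replace(" ", "▁")
    pieces, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):  # try the longest piece first
            if s[i:j] in vocab:
                pieces.append(s[i:j])
                i = j
                break
        else:
            pieces.append(s[i])  # fall back to a single character
            i += 1
    return pieces

# A vocabulary containing both decompositions of "LIGHT":
vocab = {"▁", "L", "IGHT", "▁LI", "GHT", "▁UP"}
```

With this vocabulary, `greedy_bpe("LIGHT UP", vocab)` prefers the longer `▁LI` piece and never recovers the trained `▁ L IGHT ▁UP` decomposition, which is exactly why an explicit `tokens` list can rescue missed detections.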
Model Downloads
| Model | Params | Size | HuggingFace |
|---|---|---|---|
| KWS-Zipformer-3M | 3.49M | ~4 MB | aufklarer/KWS-Zipformer-3M-CoreML-INT8 |
Pipeline Integration
The module exposes a WakeWordProvider protocol that mirrors StreamingVADProvider, so a voice pipeline can gate activation on VAD, wake-word, or both. WakeWordStreamingAdapter wraps a loaded detector and a single session into a reusable provider object.
```swift
let adapter = try WakeWordStreamingAdapter(detector: detector)
// pipeline.configure(wakeWord: adapter)
```
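One way gating on both providers could behave, sketched in Python as a hypothetical policy (the actual pipeline wires providers through the Swift API above): an activation fires only when a wake-word detection arrives while VAD currently reports speech.

```python
def gate_activations(events):
    """Fire an activation only when a wake-word detection lands
    inside a span that VAD marks as speech. Hypothetical gating
    policy, shown for illustration.
    events: list of (time_s, kind), kind in {'vad_on', 'vad_off', 'wake'}."""
    in_speech = False
    activations = []
    for t, kind in events:
        if kind == "vad_on":
            in_speech = True
        elif kind == "vad_off":
            in_speech = False
        elif kind == "wake" and in_speech:
            activations.append(t)
    return activations
```

Gating on both signals suppresses wake-word hits triggered by non-speech noise, at the cost of depending on VAD latency.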
Source
- Sources/SpeechWakeWord — Swift module
- docs/models/kws-zipformer.md — architecture notes
- docs/inference/wake-word.md — inference pipeline
- Upstream: k2-fsa/icefall KWS recipe / pkufool/keyword-spotting-models