Streaming Dictation
Parakeet-EOU-120M is a small RNN-T streaming ASR model with an explicit end-of-utterance (EOU) head, built for real-time dictation on Apple Silicon's Neural Engine. This guide also covers DictateDemo, the macOS menu-bar reference app that wires the streaming model together with Silero VAD for hands-free, paste-anywhere dictation.
What it is
- Live partials — text updates as you speak, ~340 ms after each chunk
- Explicit EOU — model decides when an utterance ends, no manual button
- VAD-driven force-finalize — Silero backstop commits utterances even when EOU stalls on background noise
- 120 MB INT8 CoreML — runs on the Neural Engine, leaves the GPU free for other models
- 25 European languages — same vocabulary family as upstream NeMo Parakeet TDT
Architecture
Three CoreML models pipelined per audio chunk:
| Component | Description |
|---|---|
| Encoder | Cache-aware Conformer. Takes a 64-frame mel chunk (640 ms) plus six state tensors — attention KV cache, depthwise conv cache, and a pre_cache mel loopback that prepends recent-past audio so the FFT sees continuous signal across chunk boundaries. |
| Decoder | Single-step LSTM prediction network. Consumes the previous non-blank token, emits an embedding plus updated (h, c) state. |
| Joint + EOU head | Fuses encoder and decoder outputs into logits over vocab + blank + EOU. The EOU class is the model's hard signal that an utterance is finished. |
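The vocab + blank + EOU layout implies a simple greedy rule per joint output: argmax over all classes, then branch on whether the winner is a token, blank, or EOU. A minimal sketch, assuming blank and EOU sit after the vocabulary (the model's actual index layout may differ):

```swift
// Hedged sketch of one greedy decode step over the joint's logits.
// Class layout assumed (not confirmed by the model card):
// [0..<vocabSize) = tokens, then blank, then EOU as the final class.
struct DecodeStep {
    let vocabSize: Int
    var blankID: Int { vocabSize }      // first extra class
    var eouID: Int { vocabSize + 1 }    // last class: end-of-utterance

    enum Outcome: Equatable {
        case token(Int)      // non-blank: feed back into the decoder LSTM
        case blank           // advance to the next encoder frame
        case endOfUtterance  // hard signal: commit the partial
    }

    func step(logits: [Float]) -> Outcome {
        precondition(logits.count == vocabSize + 2)
        // argmax over vocab + blank + EOU
        let best = logits.indices.max { logits[$0] < logits[$1] }!
        switch best {
        case eouID:   return .endOfUtterance
        case blankID: return .blank
        default:      return .token(best)
        }
    }
}
```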
Why a separate EOU token
Plain RNN-T emits blanks during silence, which the decoder happily absorbs without signaling "utterance finished." A dedicated EOU head lets the model make a hard cut for committing the partial to a final, resetting punctuation/capitalization state, and triggering downstream actions like paste-to-app.
Keyboard clicks, mouse movement, and room tone during a "silent" pause can make the joint occasionally emit a non-blank token, resetting the EOU debounce timer and stalling the commit. Production pipelines pair the joint EOU with an external VAD-driven forceEndOfUtterance() backstop — see DictateDemo below.
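The stall can be pictured as a silence counter that any spurious non-blank emission resets, so noisy "silence" never lets the counter reach its commit threshold. A hedged sketch of such a debounce, with hypothetical names (the package's internal logic may differ):

```swift
// Illustrative EOU debounce, ticked once per decoded chunk.
// Names and threshold semantics are assumptions for this sketch.
struct EOUDebounce {
    let requiredSilentChunks: Int
    var silentChunks = 0

    // Returns true when enough consecutive silent chunks
    // have accumulated to safely commit the utterance.
    mutating func observe(emittedNonBlank: Bool) -> Bool {
        if emittedNonBlank {
            silentChunks = 0   // a keyboard click or room-tone token resets the timer
        } else {
            silentChunks += 1
        }
        return silentChunks >= requiredSilentChunks
    }
}
```

This is exactly why an external VAD backstop helps: the VAD's silence decision is independent of what the joint happens to emit.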
Model
| Model | Size | HuggingFace |
|---|---|---|
| Parakeet-EOU-120M (CoreML INT8) | ~120 MB | aufklarer/Parakeet-EOU-120M-CoreML-INT8 |
Performance
| Metric | Value |
|---|---|
| Weight memory | ~120 MB (INT8) |
| Peak inference memory | ~200 MB |
| Chunk latency (M-series) | ~30 ms compute / 640 ms of audio (RTF ~0.056) |
| Partial latency end-to-end | ~340 ms (one chunk) |
| Commit latency (VAD path) | ~1 s after speech stops |
| Compute target | Neural Engine (CoreML) |
Quick start — batch transcription
The streaming model also conforms to SpeechRecognitionModel, so it works as a drop-in for any code that takes a generic STT model:
```swift
import ParakeetStreamingASR

let model = try await ParakeetStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)
```
Quick start — async streaming
```swift
for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    if partial.isFinal { print("FINAL: \(partial.text)") }
    else { print("... \(partial.text)") }
}
```
Each PartialTranscript carries text, isFinal, confidence, eouDetected (joint fired vs force-finalized), and a monotonic segmentIndex.
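Those fields suggest a shape like the following. This is an illustrative sketch of the documented surface, not the package's actual declaration:

```swift
// Hedged sketch of PartialTranscript based on the fields described above;
// modifiers, order, and exact types in the package may differ.
struct PartialTranscript {
    let text: String
    let isFinal: Bool
    let confidence: Float
    let eouDetected: Bool   // true: joint fired EOU; false: force-finalized via VAD
    let segmentIndex: Int   // monotonic, increments per committed utterance
}
```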
Long-lived session API (mic input)
For live dictation, create a session once and feed it chunks as they arrive from the mic. The session buffers internally and runs the encoder when enough samples accumulate, so you can push arbitrary chunk sizes:
```swift
let session = try model.createSession()

// each mic chunk:
let partials = try session.pushAudio(float32Chunk16kHz)
for p in partials {
    if p.isFinal { commit(p.text) }
    else { showPartial(p.text) }
}

// when the stream ends:
let trailing = try session.finalize()
```
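The internal buffering can be pictured as accumulating arbitrary-size chunks until a full 640 ms window (10,240 samples at 16 kHz) is available, then draining whole windows to the encoder. A minimal sketch with hypothetical names, not the session's actual implementation:

```swift
// Illustrative accumulator: push arbitrary-size Float32 chunks,
// get back zero or more fixed 640 ms encoder windows.
struct ChunkAccumulator {
    static let window = 10_240   // 640 ms * 16_000 Hz
    var buffer: [Float] = []

    mutating func push(_ samples: [Float]) -> [[Float]] {
        buffer.append(contentsOf: samples)
        var windows: [[Float]] = []
        // Emit every complete window; keep the remainder for the next push.
        while buffer.count >= Self.window {
            windows.append(Array(buffer.prefix(Self.window)))
            buffer.removeFirst(Self.window)
        }
        return windows
    }
}
```

This is why mic callbacks of any size work: a 4,800-sample chunk simply sits in the buffer until later pushes complete a window.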
VAD force-finalize pattern
When a Silero VAD is already running in your pipeline, use it to drive a fallback commit so background noise can't stall the EOU debounce timer:
```swift
if hasPendingUtterance && !vadSpeechActive && vadSilentChunks >= 30 {
    // ~960 ms of sustained silence per Silero
    if let forced = session.forceEndOfUtterance() {
        commit(forced.text)
    }
    hasPendingUtterance = false
}

// guardrail: don't double-commit if joint already fired EOU
if partials.contains(where: { $0.isFinal }) {
    hasPendingUtterance = false
}
```
DictateDemo — macOS menu-bar reference app
DictateDemo is a complete macOS menu-bar agent built on top of the streaming session. It runs as a background app, transcribes from the mic with live partials, auto-commits utterances on EOU or VAD silence, and pastes results into the frontmost app.
- Menu-bar app with global `Cmd+Shift+D` hotkey
- Live partials with floating HUD and audio level indicator
- VAD-guarded force-finalize (the production pattern above)
- Paste-to-frontmost-app with `Cmd+Shift+V`
- Model auto-downloads on first launch (~120 MB)
```sh
cd Examples/DictateDemo
swift build
.build/debug/DictateDemo
```
The full implementation lives in Examples/DictateDemo/DictateDemo/DictateViewModel.swift: an off-main audio sink with a lock-protected buffer, a 300 ms timer tick that drains it, Silero VAD with leftover-sample carry-over, and a guarded force-finalize. The matching regression tests in Examples/DictateDemo/Tests/DictateDemoTests.swift cover multi-utterance, stuck-EOU, and noisy-silence scenarios.
Streaming vs batch Parakeet
| | Parakeet-EOU-120M (streaming) | Parakeet TDT 0.6B (batch) |
|---|---|---|
| Use case | Live dictation, real-time captioning | File transcription, offline jobs |
| Decoder | RNN-T + EOU head | Token-and-Duration Transducer |
| Chunk size | 640 ms streaming | Whole-file batch |
| Weight memory | ~120 MB | 500 MB |
| Throughput | ~18x real-time | ~32x real-time |
| Latency | ~340 ms partials | End-of-file only |
Choose the streaming model when you need partials before the user finishes speaking. For batch transcription of audio files, the larger Parakeet TDT 0.6B is faster end-to-end and more accurate. The two models share the same SentencePiece vocabulary, so you can swap between them without changing tokenization.