Streaming Dictation

Parakeet-EOU-120M is a small RNN-T streaming ASR model with an explicit end-of-utterance (EOU) head, built for real-time dictation on Apple Silicon's Neural Engine. This guide also covers DictateDemo, the macOS menu-bar reference app that wires the streaming model together with Silero VAD for hands-free, paste-anywhere dictation.

What it is

Architecture

Three CoreML models pipelined per audio chunk:

| Component | Description |
|---|---|
| Encoder | Cache-aware Conformer. Takes a 64-frame mel chunk (640 ms) plus six state tensors — attention KV cache, depthwise conv cache, and a pre_cache mel loopback that prepends recent-past audio so the FFT sees continuous signal across chunk boundaries. |
| Decoder | Single-step LSTM prediction network. Consumes the previous non-blank token, emits an embedding plus updated (h, c) state. |
| Joint + EOU head | Fuses encoder and decoder outputs into logits over vocab + blank + EOU. The EOU class is the model's hard signal that an utterance is finished. |
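
The chunking arithmetic can be sketched in plain Swift. The 10 ms mel hop (so 64 frames = 640 ms) and the pre_cache length used here are illustrative assumptions, not figures from the model card:

```swift
// Sketch of the per-chunk windowing implied by the encoder's inputs.
// Assumptions (not from the model card): 10 ms mel hop, illustrative
// pre_cache length. The real tensor shapes may differ.
let sampleRate = 16_000
let melHopMs = 10                                  // assumed hop: 64 frames per 640 ms chunk
let framesPerChunk = 64
let chunkMs = framesPerChunk * melHopMs            // 640 ms
let samplesPerChunk = sampleRate * chunkMs / 1000  // 10_240 samples

// pre_cache loopback: prepend the tail of the previous chunk so the FFT
// sees continuous signal across the chunk boundary.
func windowWithPreCache(previous: [Float], current: [Float], preCacheSamples: Int) -> [Float] {
    let tail = previous.suffix(preCacheSamples)
    return Array(tail) + current
}

let prev = [Float](repeating: 0.1, count: samplesPerChunk)
let cur = [Float](repeating: 0.2, count: samplesPerChunk)
let windowed = windowWithPreCache(previous: prev, current: cur, preCacheSamples: 1_600) // assumed 100 ms
print(samplesPerChunk, windowed.count) // 10240 11840
```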

Why a separate EOU token

Plain RNN-T emits blanks during silence, which the decoder happily absorbs without signaling "utterance finished." A dedicated EOU head lets the model make a hard cut: commit the partial to a final, reset punctuation/capitalization state, and trigger downstream actions like paste-to-app.
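
The decode rule this implies can be sketched in plain Swift: take the argmax over the joint's output classes and treat the two extra classes as blank and EOU. The class layout (blank then EOU after the vocab) is an assumption for illustration:

```swift
// Sketch of greedy classification over the joint's output classes.
// Assumption: the last two logit slots are blank and EOU, in that order.
enum JointSymbol: Equatable {
    case token(Int)  // vocab id: advances the decoder state
    case blank       // consume next encoder frame, decoder unchanged
    case eou         // hard end-of-utterance: commit the partial, reset state
}

func classify(logits: [Float], vocabSize: Int) -> JointSymbol {
    let best = logits.indices.max(by: { logits[$0] < logits[$1] })!
    if best == vocabSize { return .blank }
    if best == vocabSize + 1 { return .eou }
    return .token(best)
}

// vocabSize = 4: slots 0-3 are tokens, 4 is blank, 5 is EOU
print(classify(logits: [0.1, 2.0, 0.3, 0.1, 0.5, 0.2], vocabSize: 4)) // token(1)
print(classify(logits: [0.1, 0.2, 0.3, 0.1, 0.5, 2.0], vocabSize: 4)) // eou
```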

EOU is noisy in the real world

Keyboard clicks, mouse movement, and room tone during a "silent" pause can make the joint occasionally emit a non-blank token, resetting the EOU debounce timer and stalling the commit. Production pipelines pair the joint EOU with an external VAD-driven forceEndOfUtterance() backstop — see DictateDemo below.

Model

| Model | Size | HuggingFace |
|---|---|---|
| Parakeet-EOU-120M (CoreML INT8) | ~120 MB | aufklarer/Parakeet-EOU-120M-CoreML-INT8 |

Performance

| Metric | Value |
|---|---|
| Weight memory | ~120 MB (INT8) |
| Peak inference memory | ~200 MB |
| Chunk latency (M-series) | ~30 ms compute / 640 ms of audio (RTF ~0.056) |
| Partial latency (end-to-end) | ~340 ms (one chunk) |
| Commit latency (VAD path) | ~1 s after speech stops |
| Compute target | Neural Engine (CoreML) |
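
RTF and throughput are two views of the same ratio; a quick check using the table's RTF:

```swift
// Real-time factor (RTF) = compute time / audio duration; throughput in
// "x real-time" is its reciprocal. With the quoted RTF of ~0.056:
let rtf = 0.056
let throughput = 1.0 / rtf
print(throughput) // ≈ 17.9, matching the ~18x real-time figure in the streaming-vs-batch comparison
```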

Quick start — batch transcription

The streaming model also conforms to SpeechRecognitionModel, so it works as a drop-in for any code that takes a generic STT model:

```swift
import ParakeetStreamingASR

let model = try await ParakeetStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)
```

Quick start — async streaming

```swift
for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    if partial.isFinal { print("FINAL: \(partial.text)") }
    else               { print("... \(partial.text)") }
}
```

Each PartialTranscript carries text, isFinal, confidence, eouDetected (joint fired vs force-finalized), and a monotonic segmentIndex.

Long-lived session API (mic input)

For live dictation, create a session once and feed it chunks as they arrive from the mic. The session buffers internally and runs the encoder when enough samples accumulate, so you can push arbitrary chunk sizes:

```swift
let session = try model.createSession()

// each mic chunk:
let partials = try session.pushAudio(float32Chunk16kHz)
for p in partials {
    if p.isFinal { commit(p.text) }
    else         { showPartial(p.text) }
}

// when the stream ends:
let trailing = try session.finalize()
```
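
The internal buffering described above can be sketched as a simple accumulator that turns arbitrary-size mic pushes into fixed 640 ms (10,240-sample) encoder chunks. `ChunkAccumulator` is a hypothetical reduction, not the library's actual implementation:

```swift
// Sketch: accumulate arbitrary-size mic chunks, emit fixed-size encoder chunks.
struct ChunkAccumulator {
    var buffer: [Float] = []
    let chunkSize: Int

    mutating func push(_ samples: [Float]) -> [[Float]] {
        buffer.append(contentsOf: samples)
        var chunks: [[Float]] = []
        while buffer.count >= chunkSize {
            chunks.append(Array(buffer.prefix(chunkSize)))
            buffer.removeFirst(chunkSize)  // leftover samples stay buffered
        }
        return chunks
    }
}

var acc = ChunkAccumulator(chunkSize: 10_240)
let a = acc.push([Float](repeating: 0, count: 4_000))  // too few samples: no chunk yet
let b = acc.push([Float](repeating: 0, count: 8_000))  // 12_000 buffered: one chunk out
print(a.count, b.count) // 0 1
```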

VAD force-finalize pattern

When a Silero VAD is already running in your pipeline, use it to drive a fallback commit so background noise can't stall the EOU debounce timer:

```swift
if hasPendingUtterance && !vadSpeechActive && vadSilentChunks >= 30 {
    // ~960 ms of sustained silence per Silero
    if let forced = session.forceEndOfUtterance() {
        commit(forced.text)
    }
    hasPendingUtterance = false
}

// guardrail: don't double-commit if joint already fired EOU
if partials.contains(where: { $0.isFinal }) {
    hasPendingUtterance = false
}
```

DictateDemo — macOS menu-bar reference app

DictateDemo is a complete macOS menu-bar agent built on top of the streaming session. It runs as a background app, transcribes from the mic with live partials, auto-commits utterances on EOU or VAD silence, and pastes results into the frontmost app.

```bash
cd Examples/DictateDemo
swift build
.build/debug/DictateDemo
```

The full implementation lives in Examples/DictateDemo/DictateDemo/DictateViewModel.swift: an off-main audio sink with a lock-protected buffer, a 300 ms timer tick that drains it, Silero VAD with leftover-sample carry-over, and a guarded force-finalize. The matching regression tests in Examples/DictateDemo/Tests/DictateDemoTests.swift cover multi-utterance, stuck-EOU, and noisy-silence scenarios.
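
The sink-and-tick pattern can be sketched with Foundation alone; `AudioSink` here is a hypothetical reduction of the real view model, not its actual code:

```swift
import Foundation

// Sketch of the off-main audio sink: the capture callback appends into a
// lock-protected buffer, and a periodic timer tick drains it in one shot.
final class AudioSink {
    private let lock = NSLock()
    private var buffer: [Float] = []

    func append(_ samples: [Float]) {  // called from the audio thread
        lock.lock(); defer { lock.unlock() }
        buffer.append(contentsOf: samples)
    }

    func drain() -> [Float] {          // called from the 300 ms timer tick
        lock.lock(); defer { lock.unlock() }
        let out = buffer
        buffer.removeAll(keepingCapacity: true)
        return out
    }
}

let sink = AudioSink()
sink.append([1, 2, 3])
sink.append([4, 5])
let drained = sink.drain()
print(drained.count, sink.drain().count) // 5 0
```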

Streaming vs batch Parakeet

| | Parakeet-EOU-120M (streaming) | Parakeet TDT 0.6B (batch) |
|---|---|---|
| Use case | Live dictation, real-time captioning | File transcription, offline jobs |
| Decoder | RNN-T + EOU head | Token-and-Duration Transducer |
| Chunk size | 640 ms streaming | Whole-file batch |
| Weight memory | ~120 MB | 500 MB |
| Throughput | ~18x real-time | ~32x real-time |
| Latency | ~340 ms partials | End-of-file only |

Pick the streaming model when…

…you need partials before the user finishes speaking. For batch transcription of audio files, the larger Parakeet TDT 0.6B is faster end-to-end and more accurate. The two models share the same SentencePiece vocabulary, so you can swap between them without changing tokenization.
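
The drop-in property can be sketched with mocks. The protocol shape here is assumed from the batch quick start (the real library's `SpeechRecognitionModel` may differ), so a stand-in name is used:

```swift
// Sketch: code written against a generic STT protocol runs with either
// model. `SpeechToText` and the mocks are hypothetical stand-ins.
protocol SpeechToText {
    func transcribeAudio(_ samples: [Float], sampleRate: Int) throws -> String
}

struct StreamingMock: SpeechToText {
    func transcribeAudio(_ samples: [Float], sampleRate: Int) throws -> String { "from streaming model" }
}
struct BatchMock: SpeechToText {
    func transcribeAudio(_ samples: [Float], sampleRate: Int) throws -> String { "from batch model" }
}

// Caller code is identical for both models.
func dictate(with model: any SpeechToText, samples: [Float]) throws -> String {
    try model.transcribeAudio(samples, sampleRate: 16_000)
}

let samples = [Float](repeating: 0, count: 16_000)
print(try dictate(with: StreamingMock(), samples: samples))
print(try dictate(with: BatchMock(), samples: samples))
```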