Streaming Dictation
Parakeet-EOU-120M is a small RNN-T streaming ASR model with an explicit end-of-utterance (EOU) head, built for real-time dictation on Apple Silicon's Neural Engine. This guide also covers DictateDemo, the macOS menu-bar reference app that wires the streaming model together with Silero VAD for hands-free, paste-anywhere dictation.
What it is
- Live partials — text updates as you speak, ~340 ms after each chunk
- Explicit EOU — model decides when an utterance ends, no manual button
- VAD-driven force-finalize — Silero backstop commits utterances even when EOU stalls on background noise
- 120 MB INT8 CoreML — runs on the Neural Engine, leaves the GPU free for other models
- 25 European languages — same vocabulary family as upstream NeMo Parakeet TDT
Architecture
Three CoreML models pipelined per audio chunk:
| Component | Description |
|---|---|
| Encoder | Cache-aware Conformer. Takes a 64-frame mel chunk (640 ms) plus six state tensors — attention KV cache, depthwise conv cache, and a pre_cache mel loopback that prepends recent-past audio so the FFT sees continuous signal across chunk boundaries. |
| Decoder | Single-step LSTM prediction network. Consumes the previous non-blank token, emits an embedding plus updated (h, c) state. |
| Joint + EOU head | Fuses encoder and decoder outputs into logits over vocab + blank + EOU. The EOU class is the model's hard signal that an utterance is finished. |
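The vocab + blank + EOU layout implies a simple greedy rule per joint output: argmax over all classes, then branch on whether the winner is a token, blank, or EOU. A minimal sketch, assuming blank and EOU sit after the vocabulary (the model's actual index layout may differ):

```swift
// Hedged sketch of one greedy decode step over the joint's logits.
// Class layout assumed (not confirmed by the model card):
// [0..<vocabSize) = tokens, then blank, then EOU as the final class.
struct DecodeStep {
    let vocabSize: Int
    var blankID: Int { vocabSize }      // first extra class
    var eouID: Int { vocabSize + 1 }    // last class: end-of-utterance

    enum Outcome: Equatable {
        case token(Int)      // non-blank: feed back into the decoder LSTM
        case blank           // advance to the next encoder frame
        case endOfUtterance  // hard signal: commit the partial
    }

    func step(logits: [Float]) -> Outcome {
        precondition(logits.count == vocabSize + 2)
        // argmax over vocab + blank + EOU
        let best = logits.indices.max { logits[$0] < logits[$1] }!
        switch best {
        case eouID:   return .endOfUtterance
        case blankID: return .blank
        default:      return .token(best)
        }
    }
}
```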
Why a separate EOU token
Plain RNN-T emits blanks during silence, which the decoder happily absorbs without signaling "utterance finished." A dedicated EOU head lets the model make a hard cut for committing the partial to a final, resetting punctuation/capitalization state, and triggering downstream actions like paste-to-app.
Keyboard clicks, mouse movement, and room tone during a "silent" pause can make the joint occasionally emit a non-blank token, resetting the EOU debounce timer and stalling the commit. Production pipelines pair the joint EOU with an external VAD-driven forceEndOfUtterance() backstop — see DictateDemo below.
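The stall can be pictured as a silence counter that any spurious non-blank emission resets, so noisy "silence" never lets the counter reach its commit threshold. A hedged sketch of such a debounce, with hypothetical names (the package's internal logic may differ):

```swift
// Illustrative EOU debounce, ticked once per decoded chunk.
// Names and threshold semantics are assumptions for this sketch.
struct EOUDebounce {
    let requiredSilentChunks: Int
    var silentChunks = 0

    // Returns true when enough consecutive silent chunks
    // have accumulated to safely commit the utterance.
    mutating func observe(emittedNonBlank: Bool) -> Bool {
        if emittedNonBlank {
            silentChunks = 0   // a keyboard click or room-tone token resets the timer
        } else {
            silentChunks += 1
        }
        return silentChunks >= requiredSilentChunks
    }
}
```

This is exactly why an external VAD backstop helps: the VAD's silence decision is independent of what the joint happens to emit.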
Model
| Model | Size | HuggingFace |
|---|---|---|
| Parakeet-EOU-120M (CoreML INT8) | ~120 MB | aufklarer/Parakeet-EOU-120M-CoreML-INT8 |
Performance
| Metric | Value |
|---|---|
| Weight memory | ~120 MB (INT8) |
| Peak inference memory | ~200 MB |
| Chunk latency (M-series) | ~30 ms compute / 640 ms of audio (RTF ~0.056) |
| Partial latency end-to-end | ~340 ms (one chunk) |
| Commit latency (VAD path) | ~1 s after speech stops |
| Compute target | Neural Engine (CoreML) |
Quick start — batch transcription
The streaming model also conforms to SpeechRecognitionModel, so it works as a drop-in for any code that takes a generic STT model:
```swift
import ParakeetStreamingASR

let model = try await ParakeetStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)
```
Quick start — async streaming
```swift
for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    if partial.isFinal { print("FINAL: \(partial.text)") }
    else { print("... \(partial.text)") }
}
```
Each PartialTranscript carries text, isFinal, confidence, eouDetected (joint fired vs force-finalized), and a monotonic segmentIndex.
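Those fields suggest a shape like the following. This is an illustrative sketch of the documented surface, not the package's actual declaration:

```swift
// Hedged sketch of PartialTranscript based on the fields described above;
// modifiers, order, and exact types in the package may differ.
struct PartialTranscript {
    let text: String
    let isFinal: Bool
    let confidence: Float
    let eouDetected: Bool   // true: joint fired EOU; false: force-finalized via VAD
    let segmentIndex: Int   // monotonic, increments per committed utterance
}
```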
Long-lived session API (mic input)
For live dictation, create a session once and feed it chunks as they arrive from the mic. The session buffers internally and runs the encoder when enough samples accumulate, so you can push arbitrary chunk sizes:
```swift
let session = try model.createSession()

// each mic chunk:
let partials = try session.pushAudio(float32Chunk16kHz)
for p in partials {
    if p.isFinal { commit(p.text) }
    else { showPartial(p.text) }
}

// when the stream ends:
let trailing = try session.finalize()
```
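The internal buffering can be pictured as accumulating arbitrary-size chunks until a full 640 ms window (10,240 samples at 16 kHz) is available, then draining whole windows to the encoder. A minimal sketch with hypothetical names, not the session's actual implementation:

```swift
// Illustrative accumulator: push arbitrary-size Float32 chunks,
// get back zero or more fixed 640 ms encoder windows.
struct ChunkAccumulator {
    static let window = 10_240   // 640 ms * 16_000 Hz
    var buffer: [Float] = []

    mutating func push(_ samples: [Float]) -> [[Float]] {
        buffer.append(contentsOf: samples)
        var windows: [[Float]] = []
        // Emit every complete window; keep the remainder for the next push.
        while buffer.count >= Self.window {
            windows.append(Array(buffer.prefix(Self.window)))
            buffer.removeFirst(Self.window)
        }
        return windows
    }
}
```

This is why mic callbacks of any size work: a 4,800-sample chunk simply sits in the buffer until later pushes complete a window.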
VAD force-finalize pattern
When a Silero VAD is already running in your pipeline, use it to drive a fallback commit so background noise can't stall the EOU debounce timer:
```swift
if hasPendingUtterance && !vadSpeechActive && vadSilentChunks >= 30 {
    // ~960 ms of sustained silence per Silero
    if let forced = session.forceEndOfUtterance() {
        commit(forced.text)
    }
    hasPendingUtterance = false
}

// guardrail: don't double-commit if joint already fired EOU
if partials.contains(where: { $0.isFinal }) {
    hasPendingUtterance = false
}
```
DictateDemo — macOS menu-bar reference app
DictateDemo is a complete macOS menu-bar agent built on top of the streaming session. It runs as a background app, transcribes from the mic with live partials, auto-commits utterances on EOU or VAD silence, and pastes results into the frontmost app.
- Menu-bar app with global `Cmd+Shift+D` hotkey
- Live partials with floating HUD and audio level indicator
- VAD-guarded force-finalize (the production pattern above)
- Paste-to-frontmost-app with `Cmd+Shift+V`
- Model auto-downloads on first launch (~120 MB)
```sh
cd Examples/DictateDemo
swift build
.build/debug/DictateDemo
```
The full implementation lives in Examples/DictateDemo/DictateDemo/DictateViewModel.swift: an off-main audio sink with a lock-protected buffer, a 300 ms timer tick that drains it, Silero VAD with leftover-sample carry-over, and a guarded force-finalize. The matching regression tests in Examples/DictateDemo/Tests/DictateDemoTests.swift cover multi-utterance, stuck-EOU, and noisy-silence scenarios.
Streaming vs batch Parakeet
| | Parakeet-EOU-120M (streaming) | Parakeet TDT 0.6B (batch) |
|---|---|---|
| Use case | Live dictation, real-time captioning | File transcription, offline jobs |
| Decoder | RNN-T + EOU head | Token-and-Duration Transducer |
| Chunk size | 640 ms streaming | Whole-file batch |
| Weight memory | ~120 MB | 500 MB |
| Throughput | ~18x real-time | ~32x real-time |
| Latency | ~340 ms partials | End-of-file only |
Choose the streaming model when you need partials before the user finishes speaking. For batch transcription of audio files, the larger Parakeet TDT 0.6B is faster end-to-end and more accurate. The two models share the same SentencePiece vocabulary, so you can swap between them without changing tokenization.