# Nemotron Streaming
Nemotron-Speech-Streaming-0.6B is NVIDIA's low-latency English streaming ASR: a cache-aware FastConformer encoder paired with an RNN-T decoder, with native punctuation and capitalization emitted as regular BPE tokens. The CoreML bundle on this site ships with an INT8-palettized encoder and runs on the Apple Neural Engine.
## What it is
- Native punctuation and capitalization — no post-processor needed; periods, commas, and case are part of the vocabulary
- 600M parameters — larger than Parakeet-EOU (120M), so transcription quality is materially higher on challenging audio
- Cache-aware FastConformer — 24 encoder layers with attention + conv caches flowing chunk-to-chunk for continuous context
- Four chunk sizes — 80, 160, 560, 1120 ms per inference step (160 ms is the default and the variant published today)
- English only — trained on English speech; for multilingual use see Qwen3-ASR or Omnilingual ASR
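At 16 kHz mono input, each chunk size maps directly to a fixed number of samples the caller must buffer per inference step. A small sketch of that arithmetic (the helper name is illustrative, not part of the published API):

```swift
// Samples the caller must buffer per inference step at 16 kHz mono.
// Hypothetical helper for illustration -- not part of the published API.
func samplesPerChunk(milliseconds: Int, sampleRate: Int = 16_000) -> Int {
    milliseconds * sampleRate / 1_000
}

let chunkSizes = [80, 160, 560, 1120].map { samplesPerChunk(milliseconds: $0) }
print(chunkSizes)  // [1280, 2560, 8960, 17920]
```

So the default 160 ms variant consumes 2,560 samples per step.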
## Architecture
Three CoreML models pipelined per audio chunk:
| Component | Description |
|---|---|
| Encoder | 24-layer cache-aware FastConformer, 1024 hidden. Takes a 17-frame mel chunk (160 ms default) plus five state tensors — attention KV cache [24, 1, 70, 1024], depthwise conv cache [24, 1, 1024, 8], and a pre_cache mel loopback that prepends recent-past audio so chunk boundaries stay continuous. |
| Decoder | Two-layer LSTM prediction network, 640 hidden. Consumes the previous non-blank token, emits an embedding plus updated (h, c) state. |
| Joint | Fuses encoder and decoder outputs into logits over 1024 BPE tokens + blank. Punctuation and capitalization are just more tokens in the BPE vocab — no extra heads. |
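The three models are tied together by the standard greedy RNN-T loop: for each encoder frame, query the joint; while it emits non-blank tokens, append them and advance the decoder state; on blank, move to the next frame. A minimal sketch of that control flow, with the decoder + joint CoreML calls collapsed into a mock closure (all names here are illustrative, not the bundle's API):

```swift
// Greedy RNN-T decoding skeleton. `joint` stands in for the decoder + joint
// CoreML calls: given the current encoder frame index and the last emitted
// token, it returns logits over the vocabulary plus blank.
func greedyDecode(
    frameCount: Int,
    blankId: Int,
    maxSymbolsPerFrame: Int = 4,          // guard against runaway emission
    joint: (_ frame: Int, _ lastToken: Int) -> [Float]
) -> [Int] {
    var tokens: [Int] = []
    var lastToken = blankId               // decoder starts from blank/SOS
    for frame in 0..<frameCount {
        for _ in 0..<maxSymbolsPerFrame {
            let logits = joint(frame, lastToken)
            let best = logits.indices.max(by: { logits[$0] < logits[$1] })!
            if best == blankId { break }  // blank: advance to the next frame
            tokens.append(best)           // non-blank: emit and update decoder input
            lastToken = best
        }
    }
    return tokens
}
```

Because punctuation and capitalization are ordinary BPE tokens, they fall out of this loop like any other symbol; no separate head or post-processing pass runs.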
### No EOU head
Unlike Parakeet-EOU, Nemotron does not emit a dedicated end-of-utterance token. Two ways to segment continuous audio into utterances:
- External VAD — pair the session with Silero VAD; on sustained silence, call `finalize()` to commit the current utterance and `createSession()` for the next one.
- Punctuation boundary — when the partial transcript ends in `.`, `?`, or `!`, treat that as a natural commit cue. No extra model, but depends on the audio actually inducing terminal punctuation.
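The punctuation-boundary strategy reduces to a one-line check on the latest partial. A sketch (the helper name is illustrative, not part of the published API):

```swift
// Commit cue for the punctuation-boundary strategy: treat a partial that
// ends in terminal punctuation as a finished utterance.
// Illustrative helper -- not part of the published API.
func endsUtterance(_ partial: String) -> Bool {
    guard let last = partial.last(where: { !$0.isWhitespace }) else { return false }
    return last == "." || last == "?" || last == "!"
}
```

In a session loop you would test each partial with this and, on a hit, call `finalize()` and start a fresh session.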
## Model
| Component | Size | HuggingFace |
|---|---|---|
| Encoder (INT8) | 562 MB | aufklarer/Nemotron-Speech-Streaming-0.6B-CoreML-INT8 |
| Decoder | 14 MB | |
| Joint | 3.3 MB | |
Upstream: `nvidia/nemotron-speech-streaming-en-0.6b` (NeMo `.nemo` checkpoint).
## Quick start — batch transcription
Conforms to `SpeechRecognitionModel`, so it drops into any code path that takes a generic STT model:
```swift
import NemotronStreamingASR

let model = try await NemotronStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)
```
## Quick start — async streaming
```swift
for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    if partial.isFinal { print("FINAL: \(partial.text)") }
    else { print("... \(partial.text)") }
}
```
Each `PartialTranscript` carries `text`, `isFinal` (true only for the last partial after `finalize()`), `confidence`, and a monotonic `segmentIndex`.
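Those four fields are enough to assemble a full transcript from a stream of partials. A sketch, assuming the struct shape described above (the real type lives in `NemotronStreamingASR`; the helper is illustrative):

```swift
// Mirrors the fields described above; the real type lives in NemotronStreamingASR.
struct PartialTranscript {
    let text: String
    let isFinal: Bool
    let confidence: Float
    let segmentIndex: Int
}

// Drop interim partials, keep finals, and join them in segment order.
func assembleTranscript(_ partials: [PartialTranscript]) -> String {
    partials
        .filter { $0.isFinal }
        .sorted { $0.segmentIndex < $1.segmentIndex }
        .map(\.text)
        .joined(separator: " ")
}
```

Interim partials for a segment are superseded by that segment's final, so only finals need to be retained.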
## Long-lived session API (mic input)
```swift
let session = try model.createSession()

// each mic chunk:
let partials = try session.pushAudio(float32Chunk16kHz)
for p in partials { showPartial(p.text) }   // isFinal is false mid-stream

// when the utterance ends (VAD silence or explicit stop):
let trailing = try session.finalize()
for p in trailing { commit(p.text) }
```
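If you don't want to pull in a full VAD, a crude energy gate can decide when to call `finalize()`. A sketch with assumed thresholds (both numbers are illustrative; a real app should prefer a proper VAD such as Silero):

```swift
// Crude energy-based silence gate: call finalize() once observe() returns true.
// Thresholds are illustrative assumptions, not tuned values.
struct SilenceGate {
    let rmsThreshold: Float = 0.01    // below this, a chunk counts as silent
    let chunksToTrigger: Int = 8      // e.g. 8 x 160 ms chunks ~= 1.3 s of silence
    var silentRun = 0

    mutating func observe(_ chunk: [Float]) -> Bool {
        let meanSquare = chunk.reduce(0) { $0 + $1 * $1 } / Float(max(chunk.count, 1))
        let rms = meanSquare.squareRoot()
        silentRun = rms < rmsThreshold ? silentRun + 1 : 0
        return silentRun >= chunksToTrigger
    }
}
```

Feed each mic chunk to both `pushAudio` and the gate; when the gate fires, finalize and start a new session.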
## CLI
```shell
audio transcribe recording.wav --engine nemotron                     # batch
audio transcribe recording.wav --engine nemotron --stream            # streaming final
audio transcribe recording.wav --engine nemotron --stream --partial  # with partials
```
## Nemotron vs Parakeet-EOU
| | Nemotron Streaming 0.6B | Parakeet-EOU 120M |
|---|---|---|
| Parameters | 600M | 120M |
| Encoder | 24-layer FastConformer, 1024 hidden | 17-layer FastConformer, 512 hidden |
| Decoder | 2-layer LSTM, RNN-T | 1-layer LSTM, RNN-T |
| EOU detection | External (VAD or punctuation) | Built-in <EOU> token |
| Punctuation | Native inline BPE tokens | No (post-process) |
| Languages | English only | 25 European |
| Default chunk | 160 ms | 320 ms |
| Bundle size | ~580 MB | ~150 MB |
Choose Nemotron when you want a higher-quality English transcript with punctuation and capitalization out of the box, and you're OK segmenting utterances yourself (VAD or punctuation cue). For constrained-device iOS dictation with a built-in EOU signal, Parakeet-EOU is still the smaller and simpler choice.