Nemotron Streaming

Nemotron-Speech-Streaming-0.6B is NVIDIA's low-latency English streaming ASR: a cache-aware FastConformer encoder paired with an RNN-T decoder, with native punctuation and capitalization emitted as regular BPE tokens. The CoreML bundle on this site ships with an INT8-palettized encoder and runs on the Apple Neural Engine.

What it is

Architecture

Three CoreML models pipelined per audio chunk:

| Component | Description |
| --- | --- |
| Encoder | 24-layer cache-aware FastConformer, 1024 hidden. Takes a 17-frame mel chunk (160 ms default) plus five state tensors: attention KV cache `[24, 1, 70, 1024]`, depthwise conv cache `[24, 1, 1024, 8]`, and a `pre_cache` mel loopback that prepends recent-past audio so chunk boundaries stay continuous. |
| Decoder | Two-layer LSTM prediction network, 640 hidden. Consumes the previous non-blank token, emits an embedding plus updated (h, c) state. |
| Joint | Fuses encoder and decoder outputs into logits over 1024 BPE tokens + blank. Punctuation and capitalization are just more tokens in the BPE vocab; no extra heads. |
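The three models cooperate in a standard greedy RNN-T loop: for each encoder frame, the joint is queried repeatedly; a blank advances to the next frame, a non-blank token is emitted and fed back through the decoder. A minimal sketch of that loop, with a hypothetical toy `joint` function standing in for the real CoreML calls:

```swift
// Toy greedy RNN-T decode over one encoder chunk. The real encoder/decoder/
// joint are CoreML models; `joint` here is a hypothetical stand-in that
// returns a hard-coded argmax so the control flow is visible.

let blankID = 1024                       // last index: 1024 BPE tokens + blank

// Fake joint: emits token 7 once on frame 0, then blank everywhere.
func joint(_ frame: Int, _ prevToken: Int) -> Int {
    (frame == 0 && prevToken == blankID) ? 7 : blankID
}

func greedyDecode(encoderFrames: Int, maxSymbolsPerFrame: Int = 5) -> [Int] {
    var tokens: [Int] = []
    var prev = blankID                   // decoder starts from the blank token
    for frame in 0..<encoderFrames {
        var emitted = 0
        while emitted < maxSymbolsPerFrame {
            let tok = joint(frame, prev) // argmax over 1025 logits in reality
            if tok == blankID { break }  // blank: advance to the next frame
            tokens.append(tok)           // non-blank: emit and stay on frame
            prev = tok
            emitted += 1
        }
    }
    return tokens
}

let out = greedyDecode(encoderFrames: 17)
```

Because punctuation and capitalization are ordinary BPE tokens, they fall out of this same loop with no extra decoding pass.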

No EOU head

Unlike Parakeet-EOU, Nemotron does not emit a dedicated end-of-utterance token. Two ways to segment continuous audio into utterances:

  1. External VAD — pair the session with Silero VAD; on sustained silence, call finalize() to commit the current utterance and createSession() for the next one.
  2. Punctuation boundary — when the partial transcript ends in ., ?, or !, treat that as a natural commit cue. No extra model, but depends on the audio actually inducing terminal punctuation.
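Option 2 needs no extra model, only a check on the partial text. A minimal sketch, where `shouldCommit` is a hypothetical helper and not part of the NemotronStreamingASR API:

```swift
import Foundation

// Treat a trailing ".", "?", or "!" in the partial transcript as an
// utterance-commit cue. Hypothetical helper; the caller would then invoke
// finalize() and start a fresh session.
func shouldCommit(_ partialText: String) -> Bool {
    guard let last = partialText.trimmingCharacters(in: .whitespaces).last else {
        return false
    }
    return last == "." || last == "?" || last == "!"
}
```

In practice you would debounce this (e.g. require the punctuation to survive a couple of partials) since a partial hypothesis can be revised by later audio.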

Model

| Component | Size | HuggingFace |
| --- | --- | --- |
| Encoder (INT8) | 562 MB | aufklarer/Nemotron-Speech-Streaming-0.6B-CoreML-INT8 |
| Decoder | 14 MB | |
| Joint | 3.3 MB | |

Upstream: nvidia/nemotron-speech-streaming-en-0.6b (NeMo .nemo checkpoint).

Quick start — batch transcription

Conforms to SpeechRecognitionModel, so it drops into any code path that takes a generic STT model:

```swift
import NemotronStreamingASR

let model = try await NemotronStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000)
```

Quick start — async streaming

```swift
for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    if partial.isFinal { print("FINAL: \(partial.text)") }
    else               { print("... \(partial.text)") }
}
```

Each PartialTranscript carries text, isFinal (true only for the last partial after finalize()), confidence, and a monotonic segmentIndex.

Long-lived session API (mic input)

```swift
let session = try model.createSession()

// each mic chunk:
let partials = try session.pushAudio(float32Chunk16kHz)
for p in partials { showPartial(p.text) }   // isFinal is false mid-stream

// when the utterance ends (VAD silence or explicit stop):
let trailing = try session.finalize()
for p in trailing { commit(p.text) }
```
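Mic callbacks rarely deliver audio at the encoder's chunk length. A small accumulator that regroups arbitrary frames into fixed 160 ms chunks (2560 samples at 16 kHz, Nemotron's default) can sit in front of the session. This is a sketch: `ChunkBuffer` is hypothetical, and whether `pushAudio()` requires exact chunk sizes is an assumption here.

```swift
// Accumulates Float samples and hands out fixed 2560-sample chunks
// (0.160 s × 16_000 Hz). Hypothetical helper, not part of the package API.
struct ChunkBuffer {
    static let chunkSize = 2560
    private var pending: [Float] = []

    // Append a mic frame of any length; returns zero or more full chunks.
    mutating func append(_ samples: [Float]) -> [[Float]] {
        pending.append(contentsOf: samples)
        var chunks: [[Float]] = []
        while pending.count >= Self.chunkSize {
            chunks.append(Array(pending.prefix(Self.chunkSize)))
            pending.removeFirst(Self.chunkSize)
        }
        return chunks
    }
}
```

Each returned chunk would then be passed to `session.pushAudio(_:)`; leftover samples stay buffered for the next callback.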

CLI

```bash
audio transcribe recording.wav --engine nemotron                    # batch
audio transcribe recording.wav --engine nemotron --stream           # streaming final
audio transcribe recording.wav --engine nemotron --stream --partial # with partials
```

Nemotron vs Parakeet-EOU

| | Nemotron Streaming 0.6B | Parakeet-EOU 120M |
| --- | --- | --- |
| Parameters | 600M | 120M |
| Encoder | 24-layer FastConformer, 1024 hidden | 17-layer FastConformer, 512 hidden |
| Decoder | 2-layer LSTM, RNN-T | 1-layer LSTM, RNN-T |
| EOU detection | External (VAD or punctuation) | Built-in `<EOU>` token |
| Punctuation | Native inline BPE tokens | No (post-process) |
| Languages | English only | 25 European |
| Default chunk | 160 ms | 320 ms |
| Bundle size | ~580 MB | ~150 MB |

Pick Nemotron when…

…you want a higher-quality English transcript with punctuation and capitalization out of the box, and you're OK segmenting utterances yourself (VAD or punctuation cue). For constrained-device iOS dictation with a built-in EOU signal, Parakeet-EOU is still the smaller and simpler choice.