Nemotron Streaming

Two NVIDIA streaming ASR models share the NemotronStreamingASR Swift target. Both are 600M-parameter cache-aware FastConformer encoders paired with an RNN-T decoder, both emit native punctuation and capitalization as regular BPE tokens, both run on the Apple Neural Engine via CoreML, and the multilingual variant additionally has MLX bundles for GPU-resident inference. Pick the one that matches your application:

Variant	Coverage	Default chunk	Upstream
Nemotron 3.5 Multilingual	40 language-locales (en-US, en-GB, es-ES, fr-FR, de-DE, ja-JP, ko-KR, zh-CN, hi-IN, ar, …)	320 ms	nvidia/nemotron-3.5-asr-streaming-0.6b
Nemotron Speech Streaming (English)	English only — smaller bundle, lower default latency	160 ms	nvidia/nemotron-speech-streaming-en-0.6b

Shared properties

Native punctuation and capitalization — no post-processor needed; periods, commas, and case are part of the vocabulary
600M parameters — cache-aware 24-layer FastConformer encoder + 2-layer LSTM RNN-T decoder + joint network
Cache-aware streaming — attention and conv caches flow chunk-to-chunk for continuous context across chunk boundaries
Multiple chunk sizes — multilingual: 80, 320, 560, 1120 ms (320 ms default); English: 80, 160, 560, 1120 ms (160 ms default)
Multilingual-only: a prompt kernel (Linear 1152→2048→1024) folds a one-hot language slot into every encoded frame, so the same weights serve all 40 language-locales. The English-only variant has no prompt kernel.

Architecture

Three CoreML models pipelined per audio chunk:

Component	Description
Encoder	24-layer cache-aware FastConformer, 1024 hidden. Takes a 32-frame mel chunk (320 ms default) plus six state tensors — attention KV cache [24, 1, 56, 1024], depthwise conv cache [24, 1, 1024, 8], `pre_cache` mel loopback, and a `language_mask` 128-slot one-hot that drives the prompt kernel.
Prompt kernel	Linear(1152→2048) → ReLU → Linear(2048→1024) — folds the language one-hot into every encoded frame so the same 600M weights serve all 40 language-locales.
Decoder	Two-layer LSTM prediction network, 640 hidden. Consumes the previous non-blank token, emits an embedding plus updated `(h, c)` state.
Joint	Fuses encoder and decoder outputs into logits over 13 087 BPE tokens + blank. Punctuation, capitalization, and per-language tags are just more tokens in the BPE vocab — no extra heads.

No EOU head

Unlike Parakeet-EOU, Nemotron does not emit a dedicated end-of-utterance token. Two ways to segment continuous audio into utterances:

External VAD — pair the session with Silero VAD; on sustained silence, call finalize() to commit the current utterance and createSession() for the next one.
Punctuation boundary — when the partial transcript ends in ., ?, or !, treat that as a natural commit cue. No extra model, but depends on the audio actually inducing terminal punctuation.

Bundles

Four published variants of Nemotron-3.5-ASR-Streaming-0.6B, plus the older English-only model on the same Swift target:

Variant	On-disk	Streaming peak (M5 Pro)	HuggingFace
CoreML INT8 (default)	612 MB	1238 MB	aufklarer/Nemotron-3.5-ASR-Streaming-0.6B-CoreML-INT8
MLX bf16	1217 MB	1474 MB	aufklarer/…MLX-bf16
MLX 8-bit	732 MB	997 MB	aufklarer/…MLX-8bit
MLX 4-bit	473 MB	747 MB	aufklarer/…MLX-4bit
English-only (CoreML INT8)	~580 MB	—	aufklarer/Nemotron-Speech-Streaming-0.6B-CoreML-INT8

Upstream: nvidia/nemotron-3.5-asr-streaming-0.6b (multilingual) and nvidia/nemotron-speech-streaming-en-0.6b (English-only).

Quantization is essentially lossless: CoreML INT8, MLX bf16, and MLX 8-bit are within ±0.3 pp WER of the fp32 NeMo source. MLX 4-bit costs ~6 pp avg WER for the smallest disk and streaming RSS.

Quick start — batch transcription

Conforms to SpeechRecognitionModel, so it drops into any code path that takes a generic STT model:

import NemotronStreamingASR

let model = try await NemotronStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000, language: "en-US")

Quick start — async streaming

for await partial in model.transcribeStream(audio: samples, sampleRate: 16000) {
    if partial.isFinal { print("FINAL: \(partial.text)") }
    else               { print("... \(partial.text)") }
}

Each PartialTranscript carries text, isFinal (true only for the last partial after finalize()), confidence, and a monotonic segmentIndex.

Long-lived session API (mic input)

let session = try model.createSession()

// each mic chunk:
let partials = try session.pushAudio(float32Chunk16kHz)
for p in partials { showPartial(p.text) }   // isFinal is false mid-stream

// when the utterance ends (VAD silence or explicit stop):
let trailing = try session.finalize()
for p in trailing { commit(p.text) }

CLI

speech transcribe recording.wav --engine nemotron --language en-US                    # batch
speech transcribe recording.wav --engine nemotron --language en-US --stream           # streaming final
speech transcribe recording.wav --engine nemotron --language ja-JP --stream --partial # partials, Japanese
speech transcribe meeting.wav   --engine nemotron --language de-DE                    # any of the 40 locales

Nemotron vs Parakeet-EOU

	Nemotron Streaming 0.6B	Parakeet-EOU 120M
Parameters	600M	120M
Encoder	24-layer FastConformer, 1024 hidden	17-layer FastConformer, 512 hidden
Decoder	2-layer LSTM, RNN-T	1-layer LSTM, RNN-T
EOU detection	External (VAD or punctuation)	Built-in `<EOU>` token
Punctuation	Native inline BPE tokens	No (post-process)
Languages	40 language-locales (en-US, en-GB, es-ES, fr-FR, de-DE, ja-JP, ko-KR, zh-CN, hi-IN, ar, …)	25 European
Default chunk	320 ms	320 ms
Bundle size	612 MB (CoreML INT8); 473 MB (MLX 4-bit)	~150 MB

Pick Nemotron when…

…you need multilingual streaming transcription (any of 40 language-locales) with punctuation and capitalization out of the box, and you're OK segmenting utterances yourself (VAD or punctuation cue). For constrained-device iOS dictation that's English-only with a built-in EOU signal, Parakeet-EOU is still the smaller choice.