Nemotron Streaming
Two NVIDIA streaming ASR models share the NemotronStreamingASR Swift target. Both are 600M-parameter cache-aware FastConformer encoders paired with an RNN-T decoder, both emit native punctuation and capitalization as regular BPE tokens, both run on the Apple Neural Engine via CoreML, and the multilingual variant additionally has MLX bundles for GPU-resident inference. Pick the one that matches your application:
| Variant | Coverage | Default chunk | Upstream |
|---|---|---|---|
| Nemotron 3.5 Multilingual | 40 language-locales (en-US, en-GB, es-ES, fr-FR, de-DE, ja-JP, ko-KR, zh-CN, hi-IN, ar, …) | 320 ms | nvidia/nemotron-3.5-asr-streaming-0.6b |
| Nemotron Speech Streaming (English) | English only — smaller bundle, lower default latency | 160 ms | nvidia/nemotron-speech-streaming-en-0.6b |
Shared properties
- Native punctuation and capitalization — no post-processor needed; periods, commas, and case are part of the vocabulary
- 600M parameters — cache-aware 24-layer FastConformer encoder + 2-layer LSTM RNN-T decoder + joint network
- Cache-aware streaming — attention and conv caches flow chunk-to-chunk for continuous context across chunk boundaries
- Multiple chunk sizes — multilingual: 80, 320, 560, 1120 ms (320 ms default); English: 80, 160, 560, 1120 ms (160 ms default)
- Multilingual-only: a prompt kernel (Linear 1152→2048→1024) folds a one-hot language slot into every encoded frame, so the same weights serve all 40 language-locales. The English-only variant has no prompt kernel.
Architecture
Three CoreML models pipelined per audio chunk:
| Component | Description |
|---|---|
| Encoder | 24-layer cache-aware FastConformer, 1024 hidden. Takes a 32-frame mel chunk (320 ms default) plus six state tensors — attention KV cache [24, 1, 56, 1024], depthwise conv cache [24, 1, 1024, 8], pre_cache mel loopback, and a language_mask 128-slot one-hot that drives the prompt kernel. |
| Prompt kernel | Linear(1152→2048) → ReLU → Linear(2048→1024) — folds the language one-hot into every encoded frame so the same 600M weights serve all 40 language-locales. |
| Decoder | Two-layer LSTM prediction network, 640 hidden. Consumes the previous non-blank token, emits an embedding plus updated (h, c) state. |
| Joint | Fuses encoder and decoder outputs into logits over 13 087 BPE tokens + blank. Punctuation, capitalization, and per-language tags are just more tokens in the BPE vocab — no extra heads. |
No EOU head
Unlike Parakeet-EOU, Nemotron does not emit a dedicated end-of-utterance token. Two ways to segment continuous audio into utterances:
- External VAD — pair the session with Silero VAD; on sustained silence, call
finalize()to commit the current utterance andcreateSession()for the next one. - Punctuation boundary — when the partial transcript ends in
.,?, or!, treat that as a natural commit cue. No extra model, but depends on the audio actually inducing terminal punctuation.
Bundles
Four published variants of Nemotron-3.5-ASR-Streaming-0.6B, plus the older English-only model on the same Swift target:
| Variant | On-disk | Streaming peak (M5 Pro) | HuggingFace |
|---|---|---|---|
| CoreML INT8 (default) | 612 MB | 1238 MB | aufklarer/Nemotron-3.5-ASR-Streaming-0.6B-CoreML-INT8 |
| MLX bf16 | 1217 MB | 1474 MB | aufklarer/…MLX-bf16 |
| MLX 8-bit | 732 MB | 997 MB | aufklarer/…MLX-8bit |
| MLX 4-bit | 473 MB | 747 MB | aufklarer/…MLX-4bit |
| English-only (CoreML INT8) | ~580 MB | — | aufklarer/Nemotron-Speech-Streaming-0.6B-CoreML-INT8 |
Upstream: nvidia/nemotron-3.5-asr-streaming-0.6b (multilingual) and nvidia/nemotron-speech-streaming-en-0.6b (English-only).
Quantization is essentially lossless: CoreML INT8, MLX bf16, and MLX 8-bit are within ±0.3 pp WER of the fp32 NeMo source. MLX 4-bit costs ~6 pp avg WER for the smallest disk and streaming RSS.
Quick start — batch transcription
Conforms to SpeechRecognitionModel, so it drops into any code path that takes a generic STT model:
import NemotronStreamingASR
let model = try await NemotronStreamingASRModel.fromPretrained()
let text = try model.transcribeAudio(audioSamples, sampleRate: 16000, language: "en-US")
Quick start — async streaming
for await partial in model.transcribeStream(audio: samples, sampleRate: 16000, language: "ja-JP") {
if partial.isFinal { print("FINAL: \(partial.text)") }
else { print("... \(partial.text)") }
}
Each PartialTranscript carries text, isFinal (true only for the last partial after finalize()), confidence, and a monotonic segmentIndex.
Long-lived session API (mic input)
let session = try model.createSession(language: "en-US")
// each mic chunk:
let partials = try session.pushAudio(float32Chunk16kHz)
for p in partials { showPartial(p.text) } // isFinal is false mid-stream
// when the utterance ends (VAD silence or explicit stop):
let trailing = try session.finalize()
for p in trailing { commit(p.text) }
CLI
speech transcribe recording.wav --engine nemotron --language en-US # batch
speech transcribe recording.wav --engine nemotron --language en-US --stream # streaming final
speech transcribe recording.wav --engine nemotron --language ja-JP --stream --partial # partials, Japanese
speech transcribe meeting.wav --engine nemotron --language de-DE # any of the 40 locales
Nemotron vs Parakeet-EOU
| Nemotron Streaming 0.6B | Parakeet-EOU 120M | |
|---|---|---|
| Parameters | 600M | 120M |
| Encoder | 24-layer FastConformer, 1024 hidden | 17-layer FastConformer, 512 hidden |
| Decoder | 2-layer LSTM, RNN-T | 1-layer LSTM, RNN-T |
| EOU detection | External (VAD or punctuation) | Built-in <EOU> token |
| Punctuation | Native inline BPE tokens | No (post-process) |
| Languages | 40 language-locales (en-US, en-GB, es-ES, fr-FR, de-DE, ja-JP, ko-KR, zh-CN, hi-IN, ar, …) | 25 European |
| Default chunk | 320 ms | 320 ms |
| Bundle size | 612 MB (CoreML INT8); 473 MB (MLX 4-bit) | ~150 MB |
…you need multilingual streaming transcription (any of 40 language-locales) with punctuation and capitalization out of the box, and you're OK segmenting utterances yourself (VAD or punctuation cue). For constrained-device iOS dictation that's English-only with a built-in EOU signal, Parakeet-EOU is still the smaller choice.