Magpie-TTS Multilingual
NVIDIA Magpie-TTS Multilingual 357M in Swift on Apple Silicon — an autoregressive multi-codebook text-to-speech model over NeMo's 22.05 kHz Nano-Codec. Nine languages (English, Spanish, German, French, Italian, Vietnamese, Hindi, Mandarin Chinese, Japanese) with five baked speaker identities. Shipped quantized at INT4 (~247 MB) or INT8 (~411 MB). Streaming-ready, ~120 ms first-packet latency.
Magpie is the right pick when you need a single small bundle that speaks nine languages with a consistent voice. The five baked speakers stay identity-stable across languages — useful for multilingual assistants, education apps, or audiobook narration with code-switching. For zero-shot voice cloning use CosyVoice3, Qwen3-TTS Base, or VoxCPM2 instead.
Architecture
Magpie is a 4-bundle MLX pipeline: text encoder → cross-attention decoder → LocalTransformer codebook head → Causal HiFi-GAN audio codec. The bundles share decoder weights between prefill and step entry points to match upstream FluidInference's CoreML layout.
| Stage | Module | Details |
|---|---|---|
| 1. Tokenisation | MagpieTokenizer | Per-language G2P (IPA dict / byT5 bytes / pinyin / katakana), shared 2360-token vocab with per-tokenizer offsets, always-appended EOS |
| 2. Text encoder | MagpieTextEncoder | 6 causal Transformer layers, d=768, k=3 conv FFN |
| 3. Decoder prefill | MagpieDecoder | 12 causal layers with cross-attention. Seeds the 110-frame baked speaker context + BOS into the KV cache. |
| 4. LocalTransformer | MagpieLocalTransformer | 1-layer codebook AR head, d=256. Samples the 8 codebooks per frame sequentially given the decoder hidden. |
| 5. Decoder step | MagpieDecoder | One AR step per frame until EOS or 500-frame cap (~23 s). |
| 6. NanoCodec | MagpieNanoCodec | FSQ inverse → causal HiFi-GAN → 22.05 kHz mono waveform. |
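The 500-frame cap and its ~23 s figure follow from the codec's frame rate. The sketch below reproduces that arithmetic, assuming the nominal 21.5 frames per second given in the codec's name (nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps). The constant and function names are illustrative and not part of the MagpieTTS module.

```swift
import Foundation

// Nominal frame rate taken from the codec's name; the exact rate in the
// shipped bundle may differ slightly.
let nominalFramesPerSecond = 21.5
// Eight codebooks are sampled per frame (see the LocalTransformer row).
let codebooksPerFrame = 8

/// Approximate audio duration, in seconds, for a number of codec frames.
func approximateDuration(frames: Int) -> Double {
    Double(frames) / nominalFramesPerSecond
}

/// Number of codebook tokens sampled for `frames` frames.
func codeTokens(frames: Int) -> Int {
    frames * codebooksPerFrame
}

let capSeconds = approximateDuration(frames: 500)
let capTokens = codeTokens(frames: 500)
print("cap ≈ \((capSeconds * 10).rounded() / 10) s, \(capTokens) code tokens")
```

At the cap, the decoder runs 500 AR steps and the codebook head samples eight tokens for each of them.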
Languages and G2P
All nine languages round-trip through Qwen3-ASR in the SDK's testMultilingualRoundTrip. Each gets a tailored pipeline:
| Language | Code | G2P pipeline |
|---|---|---|
| English | en | CMU IPA dict (125 k entries, bundled) |
| Spanish | es | Spanish IPA dict (bundled) |
| German | de | German IPA dict (bundled) |
| French | fr | byT5 UTF-8 byte encoder |
| Italian | it | byT5 UTF-8 byte encoder |
| Vietnamese | vi | byT5 UTF-8 byte encoder |
| Hindi | hi | Devanagari codepoint lookup + last-wins sub-vocab |
| Mandarin Chinese | zh | NLTokenizer(.simplifiedChinese) word seg + Apple .mandarinToLatin + bundled pinyin → IPA dict + #N tone markers |
| Japanese | ja | CFStringTokenizer kanji reading + NFC-preserved dakuten + heiban pitch markers + particle/greeting overrides |
The shared 2360-entry vocab concatenates each language's sub-tokeniser end-to-end with a per-language offset (recorded in MagpieSubVocab). The text-embedding adds two extra rows past the vocab for BOS / EOS; eos_id = 2361 is appended to every input sequence.
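A minimal sketch of how per-language offsets and the appended EOS could fit together. The SubVocab struct here is a hypothetical stand-in (the real MagpieSubVocab type may look different), and the tables and offsets are toy values rather than the shipped ones.

```swift
// Hypothetical sub-vocabulary description, for illustration only.
struct SubVocab {
    let offset: Int            // where this language's block starts in the shared vocab
    let table: [String: Int]   // symbol -> local id inside that block
}

let sharedVocabSize = 2360
// eos_id = 2361 per the text above (the second extra embedding row).
let eosID = sharedVocabSize + 1

/// Map symbols to global ids and append EOS. Unknown symbols are skipped.
func encode(_ symbols: [String], with vocab: SubVocab) -> [Int] {
    var ids: [Int] = []
    for symbol in symbols {
        if let local = vocab.table[symbol] {
            ids.append(local + vocab.offset)
        }
    }
    ids.append(eosID)
    return ids
}

// Toy tables and offsets; not the values in the bundle.
let english = SubVocab(offset: 0, table: ["a": 0, "b": 1])
let spanish = SubVocab(offset: 100, table: ["a": 0, "ñ": 1])

let enIDs = encode(["a", "b"], with: english)
let esIDs = encode(["a", "ñ"], with: spanish)
print(enIDs, esIDs)
```

The same symbol "a" lands on different global ids depending on the language, which is why the lookup has to go through the selected language's block rather than a single global map.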
Baked Speakers
The checkpoint embeds five speaker contexts (110 frames × 768 dim each) used as the prefix of every AR decode. Speaker identity is consistent across all nine languages.
| Index | CLI name | Identity |
|---|---|---|
| 0 | sofia | Sofia (default) |
| 1 | aria | Aria |
| 2 | jason | Jason |
| 3 | leo | Leo |
| 4 | john | John Van Stan |
Model Variants
| Variant | Disk | RAM (load + decode) | HuggingFace |
|---|---|---|---|
| INT4 (default) | ~247 MB | ~1.3 GB | aufklarer/Magpie-TTS-Multilingual-357M-MLX-4bit |
| INT8 | ~411 MB | ~1.6 GB | aufklarer/Magpie-TTS-Multilingual-357M-MLX-8bit |
Both bundles use MLX flat affine quantisation (mlx_affine_flat, group size 64) and dequantise to FP32 at load time — runtime activations are full precision. INT4 is audibly indistinguishable from INT8 for this model; pick INT4 unless you have storage to spare.
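For readers unfamiliar with affine group quantisation, each group of weights shares one scale and one bias, and a weight is restored as scale × code + bias. The sketch below assumes the integer codes are already unpacked to one value per array element. The actual on-disk packing of mlx_affine_flat is not described here, and the function name is illustrative.

```swift
/// Dequantise affine-quantised values: w ≈ scale * q + bias, with one
/// (scale, bias) pair per group of `groupSize` consecutive values.
/// Assumes unpacked codes; bit-packing is out of scope for this sketch.
func dequantize(codes: [UInt8], scales: [Float], biases: [Float],
                groupSize: Int = 64) -> [Float] {
    precondition(groupSize > 0 && codes.count % groupSize == 0,
                 "codes must fill whole groups")
    precondition(scales.count == codes.count / groupSize
                 && biases.count == scales.count,
                 "one scale and one bias per group")
    var weights: [Float] = []
    weights.reserveCapacity(codes.count)
    for (index, q) in codes.enumerated() {
        let group = index / groupSize
        weights.append(scales[group] * Float(q) + biases[group])
    }
    return weights
}

// Toy example with a group size of 2 so the values are easy to follow.
let restored = dequantize(codes: [0, 15, 3, 7],
                          scales: [0.5, 2.0],
                          biases: [-1.0, 0.0],
                          groupSize: 2)
print(restored)
```

Because this step runs once at load time, the choice between INT4 and INT8 affects disk size and weight precision, not the precision of runtime activations.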
CLI Usage
# English, greedy decoding
speech speak "Hello, world." --engine magpie --magpie-speaker aria \
    --magpie-temperature 0 -o out.wav

# Spanish (any of the 9 languages; pick with --language)
speech speak "Hola, mundo." --engine magpie --language es \
    --magpie-speaker aria -o out.wav

# Japanese: needs stochastic decoding (greedy gets stuck on first phrase)
speech speak "こんにちは世界、これは音声合成システムです。" \
    --engine magpie --language ja --magpie-temperature 0.6 \
    --magpie-top-k 80 --seed 42 -o out.wav

# Streaming synthesis with playback
speech speak "Streaming test" --engine magpie --stream --play

# List the 5 baked speakers
speech speak --engine magpie --list-speakers

# Pre-phonemised IPA bypasses the per-language G2P
speech speak "həˈloʊ" --engine magpie --magpie-prephonemized -o out.wav
Options
| Option | Default | Description |
|---|---|---|
| --magpie-variant | int4 | Quantisation variant: int4 or int8 |
| --magpie-speaker | sofia | Baked speaker: sofia, aria, jason, leo, john |
| --magpie-temperature | 0.6 | Sampling temperature (0 = greedy) |
| --magpie-top-k | 80 | Top-k filter for sampling |
| --magpie-max-frames | 500 | Hard cap on codec frames (~23 s) |
| --magpie-min-frames | 4 | Minimum frames before EOS allowed |
| --magpie-prephonemized | off | Treat input as IPA / phoneme stream; skip per-language G2P |
| --language | english | Picks the per-language tokeniser pipeline |
| --stream | off | Emit AsyncStream<AudioChunk> instead of a single WAV |
| --seed | none | Reproducible Gumbel sampling |
Japanese inputs longer than a single word need stochastic decoding (--magpie-temperature 0.6 --magpie-top-k 80 --seed 42 mirrors NeMo's reference test). Greedy gets stuck on the first phrase because the heiban pitch-accent heuristic deviates from per-word truth.
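To make the temperature, top-k, and seed options concrete, here is a minimal sketch of top-k sampling with the Gumbel-max trick and a seeded generator. It is not the module's sampler. SplitMix64 and sampleToken are names introduced for this example, and the real implementation may order or filter candidates differently.

```swift
import Foundation

/// Small seedable generator (SplitMix64) so runs are reproducible.
struct SplitMix64: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { state = seed }
    mutating func next() -> UInt64 {
        state &+= 0x9E37_79B9_7F4A_7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58_476D_1CE4_E5B9
        z = (z ^ (z >> 27)) &* 0x94D0_49BB_1331_11EB
        return z ^ (z >> 31)
    }
}

/// Pick one token id from `logits`. Temperature 0 (or top-k 1) is greedy.
func sampleToken<G: RandomNumberGenerator>(
    logits: [Double], temperature: Double, topK: Int, using rng: inout G
) -> Int {
    precondition(!logits.isEmpty, "logits must not be empty")
    // Indices ordered from most to least likely.
    let ranked = logits.indices.sorted { logits[$0] > logits[$1] }
    if temperature <= 0 || topK <= 1 { return ranked[0] }

    var best = ranked[0]
    var bestScore = -Double.infinity
    for index in ranked.prefix(topK) {
        // Gumbel(0, 1) noise: the argmax of (logit / T + noise) is a
        // sample from softmax(logit / T) over the kept candidates.
        let u = max(Double.random(in: 0..<1, using: &rng), 1e-12)
        let score = logits[index] / temperature - log(-log(u))
        if score > bestScore {
            bestScore = score
            best = index
        }
    }
    return best
}

let logits = [0.1, 2.5, 0.3, 1.9]
var rngA = SplitMix64(seed: 42)
var rngB = SplitMix64(seed: 42)
let greedy = sampleToken(logits: logits, temperature: 0, topK: 80, using: &rngA)
let first = sampleToken(logits: logits, temperature: 0.6, topK: 2, using: &rngA)
let second = sampleToken(logits: logits, temperature: 0.6, topK: 2, using: &rngB)
print(greedy, first, second)
```

Greedy decoding always returns the same index for the same logits, which is how a decoder can lock into a repeating phrase. With a fixed seed the sampled path is still repeatable from run to run.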
Voice cloning — not supported
Magpie has no zero-shot speaker conditioning in the model; only the 5 baked identities ship in the bundle. The CLI rejects the shared --voice-sample, --speaker, and --instruct flags with an actionable error pointing users at the --magpie-speaker flag or the engines that do support cloning (Qwen3-TTS Base, CosyVoice3, VoxCPM2).
Performance (M4 Pro)
| Setting | Audio | Wall | RTF |
|---|---|---|---|
| Batch, INT4, greedy, short prompt | 2.8 s | 0.88 s | 0.32 |
| Batch, INT4, greedy, sentence | 5.8 s | 1.35 s | 0.23 |
| Batch, INT4, sampled, 23 s output | 23 s | 5.6 s | 0.24 |
| Streaming, INT4, sampled | 23 s | 21.6 s | 0.93 |
First-packet latency in streaming mode is ≈120 ms after model load. Streaming RTF is higher because the codec is re-invoked on the full code buffer at every chunk emission (a future revision can cache codec state).
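The gap between batch and streaming RTF can be estimated with a little arithmetic. The sketch below computes RTF as wall time over audio time and counts how many frames the codec processes when it re-decodes the whole buffer at every emission. The chunk sizes (8 frames first, then 25 per chunk) are borrowed from the Swift API example; whether they match the settings behind the table is an assumption, and both function names are illustrative.

```swift
/// Real-time factor: wall-clock seconds per second of audio (lower is faster).
func realTimeFactor(wallSeconds: Double, audioSeconds: Double) -> Double {
    precondition(audioSeconds > 0, "audio duration must be positive")
    return wallSeconds / audioSeconds
}

/// Total frames pushed through the codec when every emission re-decodes
/// the full code buffer accumulated so far.
func codecFramesDecoded(totalFrames: Int, firstChunkFrames: Int,
                        framesPerChunk: Int) -> Int {
    precondition(totalFrames > 0 && firstChunkFrames > 0 && framesPerChunk > 0)
    var bufferLength = min(firstChunkFrames, totalFrames)
    var decoded = 0
    while true {
        decoded += bufferLength               // full buffer decoded again
        if bufferLength >= totalFrames { break }
        bufferLength = min(bufferLength + framesPerChunk, totalFrames)
    }
    return decoded
}

let batchRTF = realTimeFactor(wallSeconds: 5.6, audioSeconds: 23)
let streamingFrames = codecFramesDecoded(totalFrames: 500,
                                         firstChunkFrames: 8,
                                         framesPerChunk: 25)
print(batchRTF, streamingFrames, Double(streamingFrames) / 500.0)
```

Under these assumptions the codec processes 5,410 frames to emit 500, close to eleven times the batch workload. Caching codec state would bring that back to one pass over the frames.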
Swift API
import MagpieTTS

let model = try await MagpieTTS.fromPretrained(variant: .int4)

// Batch synthesis (en/es/de/fr/it/vi/hi/zh; greedy works)
let audio = try model.synthesize(
    text: "Hello, world.",
    speaker: .aria,
    language: .english,
    params: MagpieTTSParams(temperature: 0, topK: 1, maxSteps: 500))

// Japanese: use stochastic sampling
let audioJA = try model.synthesize(
    text: "こんにちは世界、これは音声合成システムです。",
    speaker: .aria,
    language: .japanese,
    params: MagpieTTSParams(temperature: 0.6, topK: 80,
                            maxSteps: 300, seed: 42))

// Streaming (AsyncStream<AudioChunk>)
let stream = model.synthesizeStream(
    text: "Streaming text",
    speaker: .aria,
    language: .english,
    firstChunkFrames: 8,
    framesPerChunk: 25)
for try await chunk in stream {
    // chunk.samples is 22.05 kHz mono Float32
}
CoreML backend (--engine magpie-coreml)
Alongside the MLX bundle, Magpie ships a CoreML bundle (aufklarer/Magpie-TTS-Multilingual-357M-CoreML-8bit, ~342 MB INT8). Four .mlmodelc packages — text_encoder, decoder_prefill, decoder_step, nanocodec_decoder — run on ANE / GPU; a Swift-side FSQ inverse turns the sampled codes into the 32-dim latents the codec consumes.
# 8 languages (no Japanese), 5 baked speakers
speech speak "Hello world." --engine magpie-coreml --magpie-speaker aria -o hi.wav
speech speak "Hola mundo." --engine magpie-coreml --language es --magpie-speaker leo -o es.wav
# --language ja auto-routes to the MLX backend (stderr note)
speech speak "こんにちは" --engine magpie-coreml --language ja -o ja.wav
Caveats vs --engine magpie:
- Hybrid pipeline today. The 1-layer LocalTransformer (the actual codebook sampling head NeMo trains) and the 8 audio embedding tables aren't shipped inside the CoreML bundle. On first synthesis the CoreML engine lazy-loads the MLX INT4 bundle to drive both pieces. ASR round-trip is bit-for-bit identical to the MLX backend; the difference is that this engine pulls in the MLX bundle too. A pure CoreML path for ANE-only iOS deployment needs the bundle to ship local_transformer/*.npy + audio_embedding_*.npy and a Swift Accelerate LT (tracked follow-up).
- No streaming. nanocodec_decoder.mlmodelc is traced at a fixed 64-frame window. We chunk longer sequences internally, but first-packet latency would be ~3 s if we emitted at chunk boundaries. --stream is rejected with an actionable error.
- No Japanese tokenizer. The CoreML bundle doesn't ship JA tokenizer JSONs yet. --language ja with this engine auto-falls-back to the MLX backend.
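The routing behaviour described in these caveats can be summarised in a few lines. This is a sketch of the decision logic only: the type and function names are hypothetical, and the order of the two checks (Japanese fallback before the streaming rejection) is an assumption that the text does not settle.

```swift
enum Backend {
    case mlx
    case coreML
}

enum RoutingError: Error, Equatable {
    case streamingUnsupported   // the CoreML engine rejects --stream
}

/// Decide which backend serves a request made with --engine magpie-coreml.
func resolveCoreMLRequest(languageCode: String,
                          stream: Bool) -> Result<Backend, RoutingError> {
    // No Japanese tokenizer files in the CoreML bundle: fall back to MLX.
    if languageCode == "ja" { return .success(.mlx) }
    // The codec is traced at a fixed window, so streaming is refused.
    if stream { return .failure(.streamingUnsupported) }
    return .success(.coreML)
}

print(resolveCoreMLRequest(languageCode: "es", stream: false))
print(resolveCoreMLRequest(languageCode: "ja", stream: false))
print(resolveCoreMLRequest(languageCode: "en", stream: true))
```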
Speaker ordering matches the CoreML bundle's speaker_info.json (0=John, 1=Sofia, 2=Aria, 3=Jason, 4=Leo — different from MLX), and the speaker enum maps internally so the CLI names work for both engines.
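One way to keep the two orderings straight is to hang both indices off a single enum so callers only ever deal with names. The enum below is a hypothetical illustration built from the two orderings listed in this document, not the module's actual speaker type.

```swift
/// Hypothetical speaker enum; case names follow the CLI names.
enum Speaker: String, CaseIterable {
    case sofia, aria, jason, leo, john

    /// Index into the MLX bundle's baked speaker contexts.
    var mlxIndex: Int {
        switch self {
        case .sofia: return 0
        case .aria: return 1
        case .jason: return 2
        case .leo: return 3
        case .john: return 4
        }
    }

    /// Index following the CoreML bundle's speaker_info.json ordering.
    var coreMLIndex: Int {
        switch self {
        case .john: return 0
        case .sofia: return 1
        case .aria: return 2
        case .jason: return 3
        case .leo: return 4
        }
    }
}

if let aria = Speaker(rawValue: "aria") {
    print(aria.mlxIndex, aria.coreMLIndex)
}
```

Passing a raw MLX index to the CoreML bundle would silently select a different voice (index 1 is Aria in MLX but Sofia in CoreML), so the mapping belongs in one place.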
Implementation notes
Three bugs worth knowing about if you're porting NeMo-style multilingual TTS:
- FSQ floor division: MLX-swift's / is true division (mlx_divide); NeMo's FSQ inverse uses Python //. Use MLX.floorDivide(...) or every FSQ slot decodes to fractional offsets and the codec smears the audio.
- Sub-vocab offsets: NeMo's AggregatedTTSTokenizer concatenates per-language vocabs with offsets. A naive global first-occurrence map always lands in the English region and produces nonsense audio for other languages.
- Hindi last-wins dedup: HindiCharsTokenizer emits duplicate Devanagari entries (CHARSET overlaps PUNCT_LIST). Python's {l: i for i, l in enumerate(tokens)} dict-comp keeps the last assignment; mirror that, not first-occurrence.
All three fixes are documented inline in the Swift module.
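Two of these pitfalls are easy to reproduce in plain Swift without MLX. The sketch below decodes one FSQ code index into per-dimension digits with floor division, shows the fractional value that true division would produce for the same slot, and builds a last-wins symbol map. The level sizes [8, 7, 6, 6] are illustrative values rather than ones read from the bundle, and all function names are introduced for this example.

```swift
/// Decode one FSQ code index into per-dimension digits, mirroring
/// Python's `//` and `%`. Swift's Int `/` truncates, which equals
/// floor division for the non-negative operands used here.
func fsqDigits(index: Int, levels: [Int]) -> [Int] {
    var digits: [Int] = []
    var basis = 1
    for level in levels {
        digits.append((index / basis) % level)
        basis *= level
    }
    return digits
}

/// What true (floating-point) division gives for one slot: a fractional
/// value instead of an integer digit.
func fractionalDigit(index: Int, basis: Int, level: Int) -> Double {
    (Double(index) / Double(basis)).truncatingRemainder(dividingBy: Double(level))
}

/// Symbol -> id map where a repeated symbol keeps its LAST index, like
/// Python's `{l: i for i, l in enumerate(tokens)}`.
func lastWinsMap(_ tokens: [String]) -> [String: Int] {
    var map: [String: Int] = [:]
    for (i, token) in tokens.enumerated() {
        map[token] = i   // later duplicates overwrite earlier ones
    }
    return map
}

let digits = fsqDigits(index: 100, levels: [8, 7, 6, 6])
let smeared = fractionalDigit(index: 100, basis: 8, level: 7)
let vocab = lastWinsMap(["क", "।", "ख", "।"])
print(digits, smeared, vocab["।"] ?? -1)
```

The second slot decodes to the integer 5 with floor division but to 5.5 with true division, which is the kind of fractional offset the first bullet describes. In the map, the duplicated "।" resolves to index 3 rather than 1.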
Source
- Upstream weights: nvidia/magpie_tts_multilingual_357m (NVIDIA Open Model License)
- Codec: nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps
- Paper: NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference (2025)
- Reference CoreML port: FluidInference/mobius
- Swift modules: MagpieTTS (MLX) + MagpieTTSCoreML (CoreML)
- CoreML bundle: aufklarer/Magpie-TTS-Multilingual-357M-CoreML-8bit
License
- Model weights: NVIDIA Open Model License (commercial use permitted; see linked PDF on the HuggingFace page)
- Swift port + bundled IPA / pinyin dictionaries: same as upstream NeMo (Apache 2.0 for the dictionaries, NVIDIA OML for the model)