Magpie-TTS Multilingual

NVIDIA Magpie-TTS Multilingual 357M in Swift on Apple Silicon — an autoregressive multi-codebook text-to-speech model over NeMo's 22.05 kHz Nano-Codec. Nine languages (English, Spanish, German, French, Italian, Vietnamese, Hindi, Mandarin Chinese, Japanese) with five baked speaker identities. Shipped quantized at INT4 (~247 MB) or INT8 (~411 MB). Streaming-ready, ~120 ms first-packet latency.

When to reach for Magpie

Magpie is the right pick when you need a single small bundle that speaks nine languages with a consistent voice. The five baked speakers stay identity-stable across languages — useful for multilingual assistants, education apps, or audiobook narration with code-switching. For zero-shot voice cloning use CosyVoice3, Qwen3-TTS Base, or VoxCPM2 instead.

Architecture

Magpie is a 4-bundle MLX pipeline: text encoder → cross-attention decoder → LocalTransformer codebook head → Causal HiFi-GAN audio codec. The bundles share decoder weights between prefill and step entry points to match upstream FluidInference's CoreML layout.

Stage	Module	Details
1. Tokenisation	MagpieTokenizer	Per-language G2P (IPA dict / ByT5 bytes / pinyin / katakana), shared 2360-token vocab with per-tokenizer offsets, always-appended EOS
2. Text encoder	MagpieTextEncoder	6 causal Transformer layers, d=768, k=3 conv FFN
3. Decoder prefill	MagpieDecoder	12 causal layers with cross-attention. Seeds the 110-frame baked speaker context + BOS into the KV cache.
4. LocalTransformer	MagpieLocalTransformer	1-layer codebook AR head, d=256. Samples the 8 codebooks of each frame sequentially, conditioned on the decoder hidden state.
5. Decoder step	MagpieDecoder	One AR step per frame until EOS or 500-frame cap (~23 s).
6. NanoCodec	MagpieNanoCodec	FSQ inverse → causal HiFi-GAN → 22.05 kHz mono waveform.
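
The frame counts in the table translate to audio durations through the codec frame rate. The sketch below assumes the nominal 21.5 frames per second from the codec's name (nemo-nano-codec-22khz-1.89kbps-21.5fps); the helper and constant names are illustrative and not part of the MagpieTTS module.

```swift
import Foundation

// Nominal frame rate taken from the codec name; the exact rate inside the
// codec may differ slightly, so treat the results as approximate.
let nominalFramesPerSecond = 21.5

/// Converts a count of codec frames to an approximate audio duration in seconds.
func approximateSeconds(forFrames frames: Int,
                        framesPerSecond: Double = nominalFramesPerSecond) -> Double {
    Double(frames) / framesPerSecond
}

let capSeconds = approximateSeconds(forFrames: 500)      // decoder frame cap
let contextSeconds = approximateSeconds(forFrames: 110)  // baked speaker context
print(String(format: "cap: %.1f s, speaker context: %.1f s", capSeconds, contextSeconds))
```

Under that assumption the 500-frame cap works out to roughly 23 s, which is where the figure in the table comes from.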

Languages and G2P

All nine languages round-trip through Qwen3-ASR in the SDK's testMultilingualRoundTrip. Each gets a tailored pipeline:

Language	Code	G2P pipeline
English	`en`	CMU IPA dict (125 k entries, bundled)
Spanish	`es`	Spanish IPA dict (bundled)
German	`de`	German IPA dict (bundled)
French	`fr`	ByT5 UTF-8 byte encoder
Italian	`it`	ByT5 UTF-8 byte encoder
Vietnamese	`vi`	ByT5 UTF-8 byte encoder
Hindi	`hi`	Devanagari codepoint lookup + last-wins sub-vocab
Mandarin Chinese	`zh`	`NLTokenizer(.simplifiedChinese)` word seg + Apple `.mandarinToLatin` + bundled pinyin → IPA dict + `#N` tone markers
Japanese	`ja`	`CFStringTokenizer` kanji reading + NFC-preserved dakuten + heiban pitch markers + particle/greeting overrides

The shared 2360-entry vocab concatenates each language's sub-tokeniser end-to-end with a per-language offset (recorded in MagpieSubVocab). The text-embedding adds two extra rows past the vocab for BOS / EOS; eos_id = 2361 is appended to every input sequence.
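
A minimal sketch of that offset scheme follows. The sub-vocab sizes are made up so that they sum to 2360 (the real sizes live in the bundle and are not reproduced here), the function names are hypothetical, and placing BOS at row 2360 is an inference from "two extra rows" plus eos_id = 2361.

```swift
// Hypothetical sub-vocab sizes, chosen only so that they sum to 2360.
let subVocabSizes: [(language: String, size: Int)] = [
    ("en", 400), ("es", 300), ("de", 300), ("fr", 260), ("it", 260),
    ("vi", 260), ("hi", 180), ("zh", 200), ("ja", 200),
]

/// Assigns each sub-vocab a starting offset by concatenating them end to end.
func offsets(for sizes: [(language: String, size: Int)]) -> (offsets: [String: Int], total: Int) {
    var result: [String: Int] = [:]
    var next = 0
    for entry in sizes {
        result[entry.language] = next
        next += entry.size
    }
    return (result, next)
}

let (offsetByLanguage, vocabSize) = offsets(for: subVocabSizes)
let bosID = vocabSize       // assumed: first extra embedding row
let eosID = vocabSize + 1   // matches eos_id = 2361 from the text

/// Shifts ids local to one language's sub-vocab into the shared id space,
/// then appends EOS, as the tokenizer does for every input.
func globalIDs(localIDs: [Int], language: String) -> [Int]? {
    guard let offset = offsetByLanguage[language] else { return nil }
    return localIDs.map { $0 + offset } + [eosID]
}
```

The point of the per-language offset is that the same character can appear in several sub-vocabs; looking it up inside the right region keeps it from resolving to another language's row.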

Baked Speakers

The checkpoint embeds five speaker contexts (110 frames × 768 dim each) used as the prefix of every AR decode. Speaker identity is consistent across all nine languages.

Index	CLI name	Identity
0	`sofia`	Sofia (default)
1	`aria`	Aria
2	`jason`	Jason
3	`leo`	Leo
4	`john`	John Van Stan

Model Variants

Variant	Disk	RAM (load + decode)	HuggingFace
INT4 (default)	~247 MB	~1.3 GB	aufklarer/Magpie-TTS-Multilingual-357M-MLX-4bit
INT8	~411 MB	~1.6 GB	aufklarer/Magpie-TTS-Multilingual-357M-MLX-8bit

Both bundles use MLX flat affine quantisation (mlx_affine_flat, group size 64) and dequantise to FP32 at load time — runtime activations are full precision. INT4 is audibly indistinguishable from INT8 for this model; pick INT4 unless you have storage to spare.
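
For readers unfamiliar with group-wise affine quantisation, the sketch below shows the dequantisation arithmetic in plain Swift: each group of 64 weights shares one scale and one bias, and a weight is recovered as scale * q + bias. It assumes the integer codes are already unpacked to one value per weight; the on-disk packing of mlx_affine_flat is not described here, and the function name is illustrative.

```swift
/// Dequantises affine-quantised weights: w = scale * q + bias, with one
/// (scale, bias) pair per group of `groupSize` consecutive values.
/// Assumes `codes` holds one unpacked integer per weight.
func dequantize(codes: [UInt8], scales: [Float], biases: [Float],
                groupSize: Int = 64) -> [Float] {
    precondition(groupSize > 0 && codes.count % groupSize == 0,
                 "weights must fill whole groups")
    precondition(scales.count == codes.count / groupSize && biases.count == scales.count,
                 "need one scale and one bias per group")
    return codes.enumerated().map { index, q in
        let group = index / groupSize
        return scales[group] * Float(q) + biases[group]
    }
}
```

With 4-bit codes q ranges over 0...15 and with 8-bit codes over 0...255; the arithmetic is otherwise identical, which is why both variants can share one load path.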

CLI Usage

# English, greedy decoding
speech speak "Hello, world." --engine magpie --magpie-speaker aria \
    --magpie-temperature 0 -o out.wav

# Spanish (any of the 9 languages — pick with --language)
speech speak "Hola, mundo." --engine magpie --language es \
    --magpie-speaker aria -o out.wav

# Japanese — needs stochastic decoding (greedy gets stuck on first phrase)
speech speak "こんにちは世界、これは音声合成システムです。" \
    --engine magpie --language ja --magpie-temperature 0.6 \
    --magpie-top-k 80 --seed 42 -o out.wav

# Streaming synthesis with playback
speech speak "Streaming test" --engine magpie --stream --play

# List the 5 baked speakers
speech speak --engine magpie --list-speakers

# Pre-phonemised IPA bypasses the per-language G2P
speech speak "həˈloʊ" --engine magpie --magpie-prephonemized -o out.wav

Options

Option	Default	Description
`--magpie-variant`	`int4`	Quantisation variant: `int4` or `int8`
`--magpie-speaker`	`sofia`	Baked speaker: `sofia`, `aria`, `jason`, `leo`, `john`
`--magpie-temperature`	`0.6`	Sampling temperature (0 = greedy)
`--magpie-top-k`	`80`	Top-k filter for sampling
`--magpie-max-frames`	`500`	Hard cap on codec frames (~23 s)
`--magpie-min-frames`	`4`	Minimum frames before EOS allowed
`--magpie-prephonemized`	off	Treat input as IPA / phoneme stream; skip per-language G2P
`--language`	`english`	Picks the per-language tokeniser pipeline
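`--stream`	off	Synthesise incrementally and emit audio chunks as they are decoded instead of a single WAV at the end (the Swift API exposes this as `AsyncStream<AudioChunk>`)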
`--seed`	—	Reproducible Gumbel sampling

Japanese sampling tip

Japanese inputs longer than a single word need stochastic decoding (--magpie-temperature 0.6 --magpie-top-k 80 --seed 42 mirrors NeMo's reference test). Greedy decoding gets stuck on the first phrase because the heiban pitch-accent heuristic deviates from the true per-word accent.
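
To make the sampling options concrete, here is a minimal sketch of temperature plus top-k sampling via the Gumbel-max trick with a seeded generator. It is a generic illustration, not the module's sampler: the SplitMix64 generator, the type name and the function name are all assumptions, and the real implementation operates on MLX arrays.

```swift
import Foundation

/// Small deterministic generator (SplitMix64) so that a seed reproduces the same draws.
struct SeededGenerator: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { state = seed }
    mutating func next() -> UInt64 {
        state &+= 0x9E37_79B9_7F4A_7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58_476D_1CE4_E5B9
        z = (z ^ (z >> 27)) &* 0x94D0_49BB_1331_11EB
        return z ^ (z >> 31)
    }
}

/// Picks one index from raw logits.
/// temperature == 0 is greedy (argmax). Otherwise keep the top-k logits,
/// divide by the temperature, add Gumbel noise, and take the argmax, which
/// is equivalent to sampling from the softmax over those k candidates.
func sampleIndex(logits: [Float], temperature: Float, topK: Int,
                 using rng: inout SeededGenerator) -> Int {
    precondition(!logits.isEmpty && topK > 0 && temperature >= 0)
    let ranked = logits.indices.sorted { logits[$0] > logits[$1] }
    if temperature == 0 { return ranked[0] }
    var best = ranked[0]
    var bestScore = -Float.infinity
    for index in ranked.prefix(topK) {
        let u = Float.random(in: Float.leastNormalMagnitude..<1, using: &rng)
        let gumbel = -log(-log(u))
        let score = logits[index] / temperature + gumbel
        if score > bestScore {
            bestScore = score
            best = index
        }
    }
    return best
}
```

Two generators created with the same seed produce the same sequence of picks, which is the behaviour --seed is documented to provide; temperature 0 never touches the generator at all.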

Voice cloning — not supported

Magpie has no zero-shot speaker conditioning in the model; only the 5 baked identities ship in the bundle. The CLI rejects the shared --voice-sample, --speaker, and --instruct flags with an actionable error pointing users at the --magpie-speaker flag or the engines that do support cloning (Qwen3-TTS Base, CosyVoice3, VoxCPM2).

Performance (M4 Pro)

Setting	Audio duration	Wall time	RTF (wall ÷ audio)
Batch, INT4, greedy, short prompt	2.8 s	0.88 s	0.32
Batch, INT4, greedy, sentence	5.8 s	1.35 s	0.23
Batch, INT4, sampled, 23 s output	23 s	5.6 s	0.24
Streaming, INT4, sampled	23 s	21.6 s	0.93

First-packet latency in streaming mode is ≈120 ms after model load. Streaming RTF is higher because the codec is re-invoked on the full code buffer at every chunk emission (a future revision can cache codec state).
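
The cost of re-running the codec on the full buffer can be estimated with a little arithmetic. The sketch below counts how many frames pass through the codec under an assumed emission schedule (one first chunk, then fixed-size chunks, with a shorter final chunk); the schedule and the function names are illustrative, and the chunk sizes 8 and 25 are taken from the Swift API example in this document.

```swift
/// Real-time factor: wall-clock time divided by audio duration
/// (values below 1 mean faster than real time).
func realTimeFactor(wallSeconds: Double, audioSeconds: Double) -> Double {
    wallSeconds / audioSeconds
}

/// Total codec frames decoded when the codec is re-run on the whole code
/// buffer at every chunk emission. Assumed schedule: one chunk of
/// `firstChunkFrames`, then chunks of `framesPerChunk`.
func codecFramesDecoded(totalFrames: Int, firstChunkFrames: Int, framesPerChunk: Int) -> Int {
    precondition(firstChunkFrames > 0 && framesPerChunk > 0)
    var bufferLength = 0   // frames generated so far
    var decoded = 0        // frames pushed through the codec, counting repeats
    var chunk = firstChunkFrames
    while bufferLength < totalFrames {
        bufferLength = min(bufferLength + chunk, totalFrames)
        decoded += bufferLength   // the full buffer is decoded again
        chunk = framesPerChunk
    }
    return decoded
}

let repeated = codecFramesDecoded(totalFrames: 500, firstChunkFrames: 8, framesPerChunk: 25)
let overhead = Double(repeated) / 500.0   // relative to decoding each frame once
```

For a 500-frame utterance this schedule pushes 5410 frames through the codec, about 11 times the single-pass amount, and the total grows quadratically with utterance length. Caching codec state would bring it back to one pass over the frames.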

Swift API

import MagpieTTS

let model = try await MagpieTTS.fromPretrained(variant: .int4)

// Batch synthesis (en/es/de/fr/it/vi/hi/zh — greedy works)
let audio = try model.synthesize(
    text: "Hello, world.",
    speaker: .aria,
    language: .english,
    params: MagpieTTSParams(temperature: 0, topK: 1, maxSteps: 500))

// Japanese — use stochastic sampling
let audioJA = try model.synthesize(
    text: "こんにちは世界、これは音声合成システムです。",
    speaker: .aria,
    language: .japanese,
    params: MagpieTTSParams(temperature: 0.6, topK: 80,
                              maxSteps: 300, seed: 42))

// Streaming (AsyncStream<AudioChunk>)
let stream = model.synthesizeStream(
    text: "Streaming text",
    speaker: .aria,
    language: .english,
    firstChunkFrames: 8,
    framesPerChunk: 25)
for try await chunk in stream {
    // chunk.samples is 22.05 kHz mono Float32
}
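
Since the samples arrive as 22.05 kHz mono Float32, a caller that wants 16-bit PCM (for a WAV file or an audio queue) has to convert them. The helpers below are a hypothetical sketch of that step in plain Swift; they are not part of the MagpieTTS module and assume finite sample values.

```swift
/// Converts Float32 samples in [-1, 1] to 16-bit PCM, clamping out-of-range values.
/// Assumes every sample is finite (NaN would trap in the Int16 conversion).
func pcm16(from samples: [Float]) -> [Int16] {
    samples.map { sample in
        let clamped = max(-1, min(1, sample))
        return Int16((clamped * 32767).rounded())
    }
}

/// Duration in seconds of a mono buffer at the model's 22.05 kHz output rate.
func duration(ofSampleCount count: Int, sampleRate: Double = 22_050) -> Double {
    Double(count) / sampleRate
}
```

In a streaming loop these would be applied per chunk, for example by appending `pcm16(from: chunk.samples)` to an output buffer as each chunk arrives.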

CoreML backend (`--engine magpie-coreml`)

Alongside the MLX bundle, Magpie ships a CoreML bundle (aufklarer/Magpie-TTS-Multilingual-357M-CoreML-8bit, ~342 MB INT8). Four .mlmodelc packages — text_encoder, decoder_prefill, decoder_step, nanocodec_decoder — run on ANE / GPU; a Swift-side FSQ inverse turns the sampled codes into the 32-dim latents the codec consumes.

# 8 languages (no Japanese), 5 baked speakers
speech speak "Hello world." --engine magpie-coreml --magpie-speaker aria -o hi.wav
speech speak "Hola mundo." --engine magpie-coreml --language es --magpie-speaker leo -o es.wav

# --language ja auto-routes to the MLX backend (stderr note)
speech speak "こんにちは" --engine magpie-coreml --language ja -o ja.wav

Caveats vs --engine magpie:

Hybrid pipeline today. The 1-layer LocalTransformer (the actual codebook sampling head NeMo trains) and the 8 audio embedding tables aren't shipped inside the CoreML bundle. On first synthesis the CoreML engine lazy-loads the MLX INT4 bundle to drive both pieces. The ASR round-trip result is bit-for-bit identical to the MLX backend; the only difference is that this engine also pulls in the MLX bundle. A pure CoreML path for ANE-only iOS deployment needs the bundle to ship local_transformer/*.npy + audio_embedding_*.npy and a Swift Accelerate implementation of the LocalTransformer (tracked as a follow-up).
No streaming. nanocodec_decoder.mlmodelc is traced at a fixed 64-frame window. We chunk longer sequences internally, but first-packet latency would be ~3 s if we emitted at chunk boundaries. --stream is rejected with an actionable error.
No Japanese tokenizer. The CoreML bundle doesn't ship the Japanese tokenizer JSONs yet, so --language ja with this engine automatically falls back to the MLX backend.
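
The fixed 64-frame codec window mentioned in the streaming caveat implies some bookkeeping for longer sequences. The sketch below shows one plausible way to split a frame sequence into windows and track how much of the last window is padding. It is an assumption about the general approach, ignores any overlap or left context the real chunking may need, and uses hypothetical names.

```swift
/// One fixed-size codec invocation.
struct CodecWindow: Equatable {
    let start: Int        // index of the first frame covered by this window
    let validFrames: Int  // real frames in the window; the rest is padding
}

/// Splits `frameCount` codec frames into windows of `windowSize` frames.
func windows(forFrameCount frameCount: Int, windowSize: Int = 64) -> [CodecWindow] {
    precondition(windowSize > 0 && frameCount >= 0)
    return stride(from: 0, to: frameCount, by: windowSize).map { start in
        CodecWindow(start: start, validFrames: min(windowSize, frameCount - start))
    }
}

// Audio covered by one full window at the nominal 21.5 frames per second.
let windowSeconds = Double(64) / 21.5
```

One full window covers just under 3 s of audio, which matches the first-packet latency quoted above for emitting at window boundaries.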

Speaker ordering matches the CoreML bundle's speaker_info.json (0=John, 1=Sofia, 2=Aria, 3=Jason, 4=Leo — different from MLX), and the speaker enum maps internally so the CLI names work for both engines.
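
The two orderings can be reconciled with a small lookup keyed by CLI name. The enum below is a sketch of that idea using the indices listed in this document; the type and property names are hypothetical and may not match the module's actual speaker enum.

```swift
/// CLI speaker names shared by both engines, with the per-backend row index.
enum BakedSpeaker: String, CaseIterable {
    case sofia, aria, jason, leo, john

    /// Row in the MLX checkpoint's speaker table.
    var mlxIndex: Int {
        switch self {
        case .sofia: return 0
        case .aria: return 1
        case .jason: return 2
        case .leo: return 3
        case .john: return 4
        }
    }

    /// Row according to the CoreML bundle's speaker_info.json ordering.
    var coreMLIndex: Int {
        switch self {
        case .john: return 0
        case .sofia: return 1
        case .aria: return 2
        case .jason: return 3
        case .leo: return 4
        }
    }
}
```

Resolving the name first and the index second means a value such as `BakedSpeaker(rawValue: "john")` selects the same voice on either backend, even though the underlying row differs.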

Implementation notes

Three bugs worth knowing about if you're porting NeMo-style multilingual TTS:

FSQ floor division — MLX-swift's / is true division (mlx_divide); NeMo's FSQ inverse uses Python //. Use MLX.floorDivide(...) or every FSQ slot decodes to fractional offsets and the codec smears the audio.
Sub-vocab offsets — NeMo's AggregatedTTSTokenizer concatenates per-language vocabs with offsets. A naive global first-occurrence map always lands in the English region and produces nonsense audio for other languages.
Hindi last-wins dedup — HindiCharsTokenizer emits duplicate Devanagari entries (CHARSET overlaps PUNCT_LIST). Python's {l: i for i, l in enumerate(tokens)} dict-comp keeps the last assignment; mirror that, not first-occurrence.

All three fixes are documented inline in the Swift module.
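
Two of these are easy to reproduce in isolation. The sketch below uses plain Swift rather than MLX arrays: it decodes a codebook index into per-dimension FSQ digits with floor division, shows how true division yields fractional digits, and contrasts last-wins with first-wins token maps. The level list [8, 7, 6, 6] in the usage note is an illustrative assumption, and all function names are hypothetical.

```swift
/// Integer FSQ digits for one codebook index, using floor division as Python's `//` does.
/// `levels` lists the number of quantisation levels per latent dimension.
func fsqDigits(index: Int, levels: [Int]) -> [Int] {
    var base = 1
    return levels.map { level in
        let digit = (index / base) % level   // Int division floors for non-negative values
        base *= level
        return digit
    }
}

/// The same decode with true division, reproducing the bug: digits come out fractional.
func fsqDigitsTrueDivision(index: Int, levels: [Int]) -> [Double] {
    var base = 1.0
    return levels.map { level in
        let digit = (Double(index) / base).truncatingRemainder(dividingBy: Double(level))
        base *= Double(level)
        return digit
    }
}

/// Token-to-id map where a duplicate token keeps its last index,
/// like Python's `{l: i for i, l in enumerate(tokens)}`.
func lastWinsMap(_ tokens: [String]) -> [String: Int] {
    var map: [String: Int] = [:]
    for (index, token) in tokens.enumerated() {
        map[token] = index
    }
    return map
}

/// The first-occurrence variant, which assigns different ids to duplicated tokens.
func firstWinsMap(_ tokens: [String]) -> [String: Int] {
    var map: [String: Int] = [:]
    for (index, token) in tokens.enumerated() where map[token] == nil {
        map[token] = index
    }
    return map
}
```

With levels [8, 7, 6, 6], index 100 decodes to the digits [4, 5, 1, 0] under floor division, while true division gives 5.5 for the second digit. For a token list such as ["a", "।", "b", "।"], the last-wins map assigns the danda id 3 and the first-wins map assigns it id 1.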

Source

Upstream weights: nvidia/magpie_tts_multilingual_357m (NVIDIA Open Model License)
Codec: nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps
Paper: NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference (2025)
Reference CoreML port: FluidInference/mobius
Swift modules: MagpieTTS (MLX) + MagpieTTSCoreML (CoreML)
CoreML bundle: aufklarer/Magpie-TTS-Multilingual-357M-CoreML-8bit

License

Model weights: NVIDIA Open Model License (commercial use permitted; see linked PDF on the HuggingFace page)
Swift port + bundled IPA / pinyin dictionaries: same as upstream NeMo (Apache 2.0 for the dictionaries, NVIDIA OML for the model)