Magpie-TTS Multilingual

NVIDIA Magpie-TTS Multilingual 357M in Swift on Apple Silicon — an autoregressive multi-codebook text-to-speech model over NeMo's 22.05 kHz Nano-Codec. Nine languages (English, Spanish, German, French, Italian, Vietnamese, Hindi, Mandarin Chinese, Japanese) with five baked speaker identities. Shipped quantized at INT4 (~247 MB) or INT8 (~411 MB). Streaming-ready, ~120 ms first-packet latency.

When to reach for Magpie

Magpie is the right pick when you need a single small bundle that speaks nine languages with a consistent voice. The five baked speakers stay identity-stable across languages — useful for multilingual assistants, education apps, or audiobook narration with code-switching. For zero-shot voice cloning use CosyVoice3, Qwen3-TTS Base, or VoxCPM2 instead.

Architecture

Magpie is a 4-bundle MLX pipeline: text encoder → cross-attention decoder → LocalTransformer codebook head → causal HiFi-GAN audio codec. The decoder's prefill and step entry points share one set of weights, which matches upstream FluidInference's CoreML layout.

| Stage | Module | Details |
|---|---|---|
| 1. Tokenisation | MagpieTokenizer | Per-language G2P (IPA dict / byT5 bytes / pinyin / katakana), shared 2360-token vocab with per-tokenizer offsets, always-appended EOS |
| 2. Text encoder | MagpieTextEncoder | 6 causal Transformer layers, d=768, k=3 conv FFN |
| 3. Decoder prefill | MagpieDecoder | 12 causal layers with cross-attention. Seeds the 110-frame baked speaker context + BOS into the KV cache. |
| 4. LocalTransformer | MagpieLocalTransformer | 1-layer codebook AR head, d=256. Samples the 8 codebooks per frame sequentially given the decoder hidden. |
| 5. Decoder step | MagpieDecoder | One AR step per frame until EOS or 500-frame cap (~23 s). |
| 6. NanoCodec | MagpieNanoCodec | FSQ inverse → causal HiFi-GAN → 22.05 kHz mono waveform. |
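The frame loop in stages 4 and 5 can be sketched as plain control flow. This is a minimal illustration, not the SDK's implementation: `FrameStep` and `decodeFrames` are hypothetical names, and the closure stands in for one decoder step plus the LocalTransformer pass. The 4-frame minimum and 500-frame cap are the defaults listed under Options.

```swift
import Foundation

/// Hypothetical stand-in for one decoder step + LocalTransformer pass.
/// Returns the 8 codebook indices for a frame and whether EOS was predicted.
typealias FrameStep = (_ frameIndex: Int) -> (codes: [Int], isEOS: Bool)

/// Runs the frame loop: stop on EOS (once minFrames is reached) or at maxFrames.
func decodeFrames(minFrames: Int = 4, maxFrames: Int = 500, step: FrameStep) -> [[Int]] {
    var frames: [[Int]] = []
    for index in 0..<maxFrames {
        let (codes, isEOS) = step(index)
        // EOS is ignored until the minimum frame count has been reached.
        if isEOS && frames.count >= minFrames { break }
        frames.append(codes)
    }
    return frames
}

// Fake model that "predicts" EOS at frame 2 (too early) and again at frame 10.
let frames = decodeFrames { index in
    (codes: Array(repeating: index % 4, count: 8), isEOS: index == 2 || index == 10)
}
print(frames.count) // 10: the early EOS is skipped, the second one ends decoding
```

The collected frames (8 codes each) are what the NanoCodec stage would turn into audio.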

Languages and G2P

All nine languages round-trip through Qwen3-ASR in the SDK's testMultilingualRoundTrip. Each gets a tailored pipeline:

| Language | Code | G2P pipeline |
|---|---|---|
| English | en | CMU IPA dict (125 k entries, bundled) |
| Spanish | es | Spanish IPA dict (bundled) |
| German | de | German IPA dict (bundled) |
| French | fr | byT5 UTF-8 byte encoder |
| Italian | it | byT5 UTF-8 byte encoder |
| Vietnamese | vi | byT5 UTF-8 byte encoder |
| Hindi | hi | Devanagari codepoint lookup + last-wins sub-vocab |
| Mandarin Chinese | zh | NLTokenizer(.simplifiedChinese) word seg + Apple .mandarinToLatin + bundled pinyin → IPA dict + #N tone markers |
| Japanese | ja | CFStringTokenizer kanji reading + NFC-preserved dakuten + heiban pitch markers + particle/greeting overrides |

The shared 2360-entry vocab concatenates each language's sub-tokeniser vocabulary end to end, with a per-language offset recorded in MagpieSubVocab. The text embedding table adds two extra rows past the vocab for BOS and EOS; eos_id = 2361 is appended to every input sequence.
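A toy version of the offset scheme, assuming made-up sub-vocabulary sizes (the real ones live in MagpieSubVocab) and assuming the byte-level tokeniser maps each UTF-8 byte straight to a local id with no reserved slots. `buildOffsets` and `encodeBytes` are hypothetical helpers. With the real 2360-entry vocab, the same rule yields the documented eos_id of 2361.

```swift
import Foundation

/// Hypothetical sub-vocabulary sizes, for illustration only.
let subVocabSizes: [(name: String, size: Int)] = [
    (name: "english_ipa", size: 100),
    (name: "byte_level", size: 256),
    (name: "pinyin_ipa", size: 60),
]

/// Lay the sub-vocabularies end to end and record where each one starts.
func buildOffsets(_ sizes: [(name: String, size: Int)]) -> (offsets: [String: Int], total: Int) {
    var offsets: [String: Int] = [:]
    var next = 0
    for entry in sizes {
        offsets[entry.name] = next
        next += entry.size
    }
    return (offsets, next)
}

let (offsets, vocabSize) = buildOffsets(subVocabSizes)
let bosID = vocabSize       // first extra embedding row
let eosID = vocabSize + 1   // second extra embedding row

/// Byte-level encoding: each UTF-8 byte is shifted by the sub-vocab offset, then EOS is appended.
func encodeBytes(_ text: String, offset: Int, eosID: Int) -> [Int] {
    text.utf8.map { offset + Int($0) } + [eosID]
}

// "né" with a precomposed é is three UTF-8 bytes: 0x6E, 0xC3, 0xA9.
let ids = encodeBytes("n\u{E9}", offset: offsets["byte_level"]!, eosID: eosID)
print(ids) // [210, 295, 269, 417]
```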

Baked Speakers

The checkpoint embeds five speaker contexts (110 frames × 768 dim each) used as the prefix of every AR decode. Speaker identity is consistent across all nine languages.

| Index | CLI name | Identity |
|---|---|---|
| 0 | sofia | Sofia (default) |
| 1 | aria | Aria |
| 2 | jason | Jason |
| 3 | leo | Leo |
| 4 | john | John Van Stan |

Model Variants

| Variant | Disk | RAM (load + decode) | HuggingFace |
|---|---|---|---|
| INT4 (default) | ~247 MB | ~1.3 GB | aufklarer/Magpie-TTS-Multilingual-357M-MLX-4bit |
| INT8 | ~411 MB | ~1.6 GB | aufklarer/Magpie-TTS-Multilingual-357M-MLX-8bit |

Both bundles use MLX flat affine quantisation (mlx_affine_flat, group size 64) and dequantise to FP32 at load time — runtime activations are full precision. INT4 is audibly indistinguishable from INT8 for this model; pick INT4 unless you have storage to spare.
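For intuition, here is generic per-group affine quantisation (one scale and one bias per group of 64 weights) written in plain Swift. It is a sketch of the general technique only: the actual mlx_affine_flat bit packing and MLX's kernels are not reproduced, and `quantiseGroup` / `dequantiseGroup` are hypothetical names.

```swift
import Foundation

/// Quantise one group of weights to `bits` bits with a per-group scale and bias.
func quantiseGroup(_ weights: [Float], bits: Int) -> (codes: [UInt8], scale: Float, bias: Float) {
    let lo = weights.min() ?? 0
    let hi = weights.max() ?? 0
    let levels = Float((1 << bits) - 1)
    // Guard against a constant group, where hi == lo.
    let scale = hi > lo ? (hi - lo) / levels : 1
    let codes = weights.map { UInt8((($0 - lo) / scale).rounded()) }
    return (codes, scale, lo)
}

/// Dequantise back to floating point: w ≈ code * scale + bias.
func dequantiseGroup(codes: [UInt8], scale: Float, bias: Float) -> [Float] {
    codes.map { Float($0) * scale + bias }
}

// One group of 64 evenly spaced weights in [-0.5, 0.5].
let group: [Float] = (0..<64).map { Float($0) / 63 - 0.5 }
let q = quantiseGroup(group, bits: 4)
let restored = dequantiseGroup(codes: q.codes, scale: q.scale, bias: q.bias)
let maxError = zip(group, restored).map { abs($0 - $1) }.max()!
print(maxError, q.scale / 2) // rounding error stays within about half a quantisation step
```

Because the weights are expanded once at load time, the bit width changes download size and reconstruction error, not the arithmetic used during decoding.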

CLI Usage

# English, greedy decoding
speech speak "Hello, world." --engine magpie --magpie-speaker aria \
    --magpie-temperature 0 -o out.wav

# Spanish (any of the 9 languages — pick with --language)
speech speak "Hola, mundo." --engine magpie --language es \
    --magpie-speaker aria -o out.wav

# Japanese — needs stochastic decoding (greedy gets stuck on first phrase)
speech speak "こんにちは世界、これは音声合成システムです。" \
    --engine magpie --language ja --magpie-temperature 0.6 \
    --magpie-top-k 80 --seed 42 -o out.wav

# Streaming synthesis with playback
speech speak "Streaming test" --engine magpie --stream --play

# List the 5 baked speakers
speech speak --engine magpie --list-speakers

# Pre-phonemised IPA bypasses the per-language G2P
speech speak "həˈloʊ" --engine magpie --magpie-prephonemized -o out.wav

Options

| Option | Default | Description |
|---|---|---|
| --magpie-variant | int4 | Quantisation variant: int4 or int8 |
| --magpie-speaker | sofia | Baked speaker: sofia, aria, jason, leo, john |
| --magpie-temperature | 0.6 | Sampling temperature (0 = greedy) |
| --magpie-top-k | 80 | Top-k filter for sampling |
| --magpie-max-frames | 500 | Hard cap on codec frames (~23 s) |
| --magpie-min-frames | 4 | Minimum frames before EOS allowed |
| --magpie-prephonemized | off | Treat input as IPA / phoneme stream; skip per-language G2P |
| --language | english | Picks the per-language tokeniser pipeline |
| --stream | off | Emit AsyncStream<AudioChunk> instead of a single WAV |
| --seed | | Reproducible Gumbel sampling |

Japanese sampling tip

Japanese inputs longer than a single word need stochastic decoding (--magpie-temperature 0.6 --magpie-top-k 80 --seed 42 mirrors NeMo's reference test). Greedy gets stuck on the first phrase because the heiban pitch-accent heuristic deviates from per-word truth.
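How temperature, top-k, and a seed interact can be sketched with Gumbel-max sampling over the k highest logits. This is an assumption-laden illustration: the SDK's actual sampler and random generator are not documented here, `SplitMix64` is simply a small well-known deterministic generator, and `sample` is a hypothetical function.

```swift
import Foundation

/// Small deterministic generator (SplitMix64) so that a seed reproduces the same draws.
struct SplitMix64: RandomNumberGenerator {
    var state: UInt64
    init(seed: UInt64) { state = seed }
    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

/// Pick a token index from logits. A temperature of 0 means greedy argmax.
func sample<G: RandomNumberGenerator>(logits: [Float], temperature: Float,
                                      topK: Int, using rng: inout G) -> Int {
    precondition(!logits.isEmpty, "logits must not be empty")
    // Indices ordered from highest to lowest logit.
    let ranked = logits.indices.sorted { logits[$0] > logits[$1] }
    if temperature <= 0 { return ranked[0] }
    var bestIndex = ranked[0]
    var bestScore = -Float.infinity
    // Gumbel-max: add Gumbel noise to each scaled logit and keep the argmax.
    for index in ranked.prefix(max(1, topK)) {
        let u = Float.random(in: Float.leastNonzeroMagnitude..<1, using: &rng)
        let score = logits[index] / temperature - log(-log(u))
        if score > bestScore {
            bestScore = score
            bestIndex = index
        }
    }
    return bestIndex
}

let logits: [Float] = [0.1, 2.0, 0.3, 1.5]
var rngA = SplitMix64(seed: 42)
var rngB = SplitMix64(seed: 42)
let greedy = sample(logits: logits, temperature: 0, topK: 80, using: &rngA)
let drawA = sample(logits: logits, temperature: 0.6, topK: 2, using: &rngA)
let drawB = sample(logits: logits, temperature: 0.6, topK: 2, using: &rngB)
print(greedy, drawA == drawB) // 1 true
```

Greedy decoding always follows the single highest logit, so once the model settles on a path it cannot leave it; the noise term is what lets sampled decoding move on.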

Voice cloning — not supported

Magpie has no zero-shot speaker conditioning in the model; only the 5 baked identities ship in the bundle. The CLI rejects the shared --voice-sample, --speaker, and --instruct flags with an actionable error pointing users at the --magpie-speaker flag or the engines that do support cloning (Qwen3-TTS Base, CosyVoice3, VoxCPM2).

Performance (M4 Pro)

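| Setting | Audio duration | Wall time | RTF (wall / audio) |
|---|---|---|---|
| Batch, INT4, greedy, short prompt | 2.8 s | 0.88 s | 0.32 |
| Batch, INT4, greedy, sentence | 5.8 s | 1.35 s | 0.23 |
| Batch, INT4, sampled, 23 s output | 23 s | 5.6 s | 0.24 |
| Streaming, INT4, sampled | 23 s | 21.6 s | 0.93 |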

First-packet latency in streaming mode is ≈120 ms after model load. Streaming RTF is higher because the codec is re-invoked on the full code buffer at every chunk emission (a future revision can cache codec state).
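A rough way to see the cost of re-decoding the full buffer is to count how many frames pass through the codec. The sketch assumes a 500-frame utterance, the chunk sizes used in the Swift API example (firstChunkFrames 8, framesPerChunk 25), and one codec call over the whole buffer per emission; `emissionPoints` is a hypothetical helper, and the real schedule may differ.

```swift
import Foundation

/// Buffer lengths (in frames) at which a chunk is emitted, for an assumed schedule.
func emissionPoints(totalFrames: Int, firstChunkFrames: Int, framesPerChunk: Int) -> [Int] {
    var points: [Int] = []
    var next = firstChunkFrames
    while next < totalFrames {
        points.append(next)
        next += framesPerChunk
    }
    points.append(totalFrames) // final flush of the remaining frames
    return points
}

let totalFrames = 500
let points = emissionPoints(totalFrames: totalFrames, firstChunkFrames: 8, framesPerChunk: 25)

// Re-decoding the whole buffer at every emission costs the sum of the buffer lengths.
let fullBufferCost = points.reduce(0, +)
// A codec that cached its state would decode each frame only once.
let cachedCost = totalFrames

print(points.count, fullBufferCost, Double(fullBufferCost) / Double(cachedCost))
// 21 emissions, 5410 frames decoded, about 10.8x the cached cost
```

The codec work grows roughly quadratically with utterance length under this schedule, which is consistent with streaming RTF being well above batch RTF.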

Swift API

import MagpieTTS

let model = try await MagpieTTS.fromPretrained(variant: .int4)

// Batch synthesis (en/es/de/fr/it/vi/hi/zh — greedy works)
let audio = try model.synthesize(
    text: "Hello, world.",
    speaker: .aria,
    language: .english,
    params: MagpieTTSParams(temperature: 0, topK: 1, maxSteps: 500))

// Japanese — use stochastic sampling
let audioJA = try model.synthesize(
    text: "こんにちは世界、これは音声合成システムです。",
    speaker: .aria,
    language: .japanese,
    params: MagpieTTSParams(temperature: 0.6, topK: 80,
                              maxSteps: 300, seed: 42))

// Streaming (AsyncStream<AudioChunk>)
let stream = model.synthesizeStream(
    text: "Streaming text",
    speaker: .aria,
    language: .english,
    firstChunkFrames: 8,
    framesPerChunk: 25)
for try await chunk in stream {
    // chunk.samples is 22.05 kHz mono Float32
}
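To keep streamed audio, the chunks can be collected and written out as a WAV file. The helper below is a self-contained sketch using only Foundation; `wavData` is a hypothetical function rather than part of the MagpieTTS module, and it assumes the samples are mono Float values in [-1, 1] at 22.05 kHz, as the comment in the loop above states.

```swift
import Foundation

/// Encode mono Float samples in [-1, 1] as a 16-bit PCM WAV file in memory.
func wavData(samples: [Float], sampleRate: Int = 22_050) -> Data {
    let bytesPerSample = 2
    let dataSize = samples.count * bytesPerSample
    var data = Data()

    // Append an integer in little-endian byte order, as the WAV format requires.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }

    data.append(contentsOf: Array("RIFF".utf8))
    append(UInt32(36 + dataSize))               // remaining file size after this field
    data.append(contentsOf: Array("WAVEfmt ".utf8))
    append(UInt32(16))                          // fmt chunk size
    append(UInt16(1))                           // format tag: PCM
    append(UInt16(1))                           // channels: mono
    append(UInt32(sampleRate))
    append(UInt32(sampleRate * bytesPerSample)) // byte rate
    append(UInt16(bytesPerSample))              // block align
    append(UInt16(16))                          // bits per sample
    data.append(contentsOf: Array("data".utf8))
    append(UInt32(dataSize))

    for sample in samples {
        // Clamp before scaling so out-of-range values cannot overflow Int16.
        let clamped = max(-1, min(1, sample))
        append(Int16((clamped * 32767).rounded()))
    }
    return data
}

let wav = wavData(samples: [0, 1, -1])
print(wav.count) // 50: a 44-byte header plus three 2-byte samples
```

In the streaming loop, appending each chunk's samples to one `[Float]` array and calling `try wavData(samples: all).write(to: url)` afterwards would produce a playable file, assuming `chunk.samples` is a `[Float]`.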

CoreML backend (--engine magpie-coreml)

Alongside the MLX bundle, Magpie ships a CoreML bundle (aufklarer/Magpie-TTS-Multilingual-357M-CoreML-8bit, ~342 MB INT8). Four .mlmodelc packages — text_encoder, decoder_prefill, decoder_step, nanocodec_decoder — run on ANE / GPU; a Swift-side FSQ inverse turns the sampled codes into the 32-dim latents the codec consumes.

# 8 languages (no Japanese), 5 baked speakers
speech speak "Hello world." --engine magpie-coreml --magpie-speaker aria -o hi.wav
speech speak "Hola mundo." --engine magpie-coreml --language es --magpie-speaker leo -o es.wav

# --language ja auto-routes to the MLX backend (stderr note)
speech speak "こんにちは" --engine magpie-coreml --language ja -o ja.wav

Caveats vs --engine magpie:

Speaker ordering follows the CoreML bundle's speaker_info.json (0=John, 1=Sofia, 2=Aria, 3=Jason, 4=Leo), which differs from the MLX ordering. The speaker enum maps between the two internally, so the CLI names work for both engines.
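One way such a mapping could look is an enum keyed by CLI name that carries one index per backend. The `Speaker` type and its property names are hypothetical; only the two orderings come from this document.

```swift
import Foundation

/// Hypothetical speaker enum; the raw values are the CLI names.
enum Speaker: String, CaseIterable {
    case sofia, aria, jason, leo, john

    /// Index into the MLX bundle's baked speaker table.
    var mlxIndex: Int {
        switch self {
        case .sofia: return 0
        case .aria: return 1
        case .jason: return 2
        case .leo: return 3
        case .john: return 4
        }
    }

    /// Index in the CoreML bundle's speaker_info.json ordering.
    var coreMLIndex: Int {
        switch self {
        case .john: return 0
        case .sofia: return 1
        case .aria: return 2
        case .jason: return 3
        case .leo: return 4
        }
    }
}

let speaker = Speaker(rawValue: "aria")!
print(speaker.mlxIndex, speaker.coreMLIndex) // 1 2
```

Resolving the name first and the index second keeps a flag like --magpie-speaker aria pointing at the same voice regardless of which engine runs.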

Implementation notes

Three bugs worth knowing about if you're porting NeMo-style multilingual TTS:

All three fixes are documented inline in the Swift module.

Source

License