# Qwen3-TTS
Qwen3-TTS is a codec language model (12.5 Hz token rate) paired with a Mimi decoder for high-quality text-to-speech synthesis. Quantized variants (4-bit and 8-bit) run faster than real time on Apple Silicon.
## Pipeline
Speech synthesis follows a three-stage pipeline:
- **Talker** — 28-layer transformer that converts input text into first-codebook tokens at 12.5 Hz
- **Code Predictor** — 5-layer transformer that predicts the remaining 15 codebooks from the first codebook's hidden states
- **Mimi Codec Decoder** — converts all 16 codebook tokens into a 24 kHz audio waveform
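The data flow above can be sketched as three stub functions. This is purely illustrative: the names and bodies are stand-ins, not the actual Qwen3TTS API, and the real stages are neural networks.

```swift
// Toy sketch of the three-stage pipeline (stub implementations).
struct Frame { var codebooks: [Int] }  // 16 codebook indices per 80 ms frame

func talker(_ text: String) -> [Int] {
    // Stage 1 (stub): text -> one first-codebook token per 12.5 Hz frame.
    text.split(separator: " ").map { $0.count }
}

func codePredictor(firstCodebook: [Int]) -> [Frame] {
    // Stage 2 (stub): fill in the remaining 15 codebooks per frame.
    firstCodebook.map { t in Frame(codebooks: [t] + Array(repeating: 0, count: 15)) }
}

func mimiDecode(_ frames: [Frame]) -> [Float] {
    // Stage 3 (stub): each 12.5 Hz frame expands to 1,920 samples at 24 kHz.
    Array(repeating: 0, count: frames.count * 1920)
}

let frames = codePredictor(firstCodebook: talker("hello brave new world"))
let audio = mimiDecode(frames)  // 4 frames -> 7,680 samples (0.32 s at 24 kHz)
```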
## Architecture

### Talker

The Talker is the core autoregressive model that generates codec tokens from text input.

| Parameter | Value |
|---|---|
| Layers | 28 |
| Hidden dimension | 1024 |
| Query heads | 16 |
| Key/Value heads | 8 (GQA) |
| MLP | SwiGLU |
| Position encoding | RoPE |
### Code Predictor
A lightweight 5-layer transformer that takes the hidden states from the first codebook and predicts the remaining 15 codebooks in parallel. This avoids running the full Talker 16 times per step.
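The savings come from all 15 remaining codebooks being read off the same hidden state in one pass. A toy illustration (dummy values; the real heads are linear projections over the codebook vocabulary, and the vocabulary size of 2048 is an assumption here):

```swift
// 15 independent heads read the same hidden state, so the remaining
// codebooks come out in a single forward pass rather than 15 extra
// autoregressive steps.
let hidden: [Double] = Array(repeating: 0.5, count: 8)  // stand-in for a 1024-dim state

func head(_ index: Int, _ h: [Double]) -> Int {
    // Stub for "linear projection + argmax over the codebook vocabulary".
    (index * h.count) % 2048
}

let remaining = (1...15).map { head($0, hidden) }
// One token each for codebooks 2...16, all derived from the same state.
```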
### Mimi Codec Decoder
The Mimi decoder converts quantized codec tokens back into audio:
- RVQ decode (16 codebooks)
- Pre-convolution (512 to 1024 channels)
- Pre-transformer (1024 to 512 bottleneck, 8 layers, SwiGLU + LayerScale)
- Upsample (2x, 2x)
- SEANet decoder (8x, 5x, 4x, 3x upsample stages)
- 24 kHz waveform output
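The upsampling factors multiply out exactly to the output sample rate, which is a quick sanity check on the stage list above:

```swift
// The pre-transformer upsamples 2x then 2x, and the SEANet decoder
// upsamples 8x, 5x, 4x, 3x. Together they turn one 12.5 Hz frame into
// precisely the right number of 24 kHz samples.
let upsampleFactors = [2, 2, 8, 5, 4, 3]
let samplesPerFrame = upsampleFactors.reduce(1, *)    // 1920 samples per frame
let frameRate = 12.5                                  // frames per second
let sampleRate = Double(samplesPerFrame) * frameRate  // 24,000 Hz
```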
## Model Variants
| Model | Size | HuggingFace |
|---|---|---|
| Qwen3-TTS-0.6B Base (4-bit) | 1.7 GB | aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-4bit |
| Qwen3-TTS-0.6B Base (8-bit) | 2.4 GB | aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-8bit |
| Qwen3-TTS-0.6B CustomVoice (4-bit) | 1.7 GB | aufklarer/Qwen3-TTS-12Hz-0.6B-CustomVoice-MLX-4bit |
| Qwen3-TTS-1.7B Base (4-bit) | 3.2 GB | aufklarer/Qwen3-TTS-12Hz-1.7B-Base-MLX-4bit |
| Qwen3-TTS-1.7B Base (8-bit) | 4.8 GB | aufklarer/Qwen3-TTS-12Hz-1.7B-Base-MLX-8bit |
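To pick one of these variants from Swift, something like the following may work. Note the `id:` parameter of `loadFromHub` is an assumption made for illustration (only the zero-argument form appears in the Swift API section); check the package for the actual signature.

```swift
import Qwen3TTS

// Hypothetical: load a specific quantized variant by Hugging Face repo id.
// The repo names come from the Model Variants table; the `id:` label is assumed.
let model = try await Qwen3TTSModel.loadFromHub(
    id: "aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-4bit"
)
```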
## CLI Usage
Generate speech from text:
```shell
.build/release/audio speak "Hello, world!" --output hello.wav
```
### Options
| Flag | Description |
|---|---|
| `--engine` | TTS engine to use (default: `qwen3`) |
| `--output`, `-o` | Output WAV file path |
| `--language` | Language (default: `english`); omit to use the speaker's native dialect |
| `--model` | Model variant: `base` or `customVoice` |
| `--speaker` | Speaker voice (requires `--model customVoice`) |
| `--temperature` | Sampling temperature (default: 0.3) |
| `--top-k` | Top-k sampling parameter |
| `--max-tokens` | Maximum number of tokens to generate (default: 500) |
| `--stream` | Enable streaming: emits audio chunks during generation |
| `--first-chunk-frames` | Number of frames in the first streamed chunk |
| `--chunk-frames` | Number of frames per subsequent streamed chunk |
| `--batch-file` | Path to a text file with one utterance per line for batch synthesis |
| `--batch-size` | Number of parallel utterances in batch mode |
### Examples
```shell
# Basic synthesis
.build/release/audio speak "The quick brown fox." -o fox.wav

# Streaming output
.build/release/audio speak "Long passage of text..." --stream -o stream.wav

# Batch synthesis from file
.build/release/audio speak --batch-file sentences.txt --batch-size 4 -o output_dir/
```
## Streaming
The `--stream` flag enables chunked audio output during generation. Instead of waiting for the entire utterance to complete, audio is emitted in chunks as tokens are produced. Use `--first-chunk-frames` and `--chunk-frames` to control the size of each chunk.
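A rough latency estimate for the first chunk: each token corresponds to one 80 ms frame of audio (1 / 12.5 Hz), and a decode step takes about 37 ms on an M2 Max (see Performance). The `5` below is an example value for the first chunk size, not a documented default.

```swift
// Back-of-envelope: time to first audio when streaming.
let msPerStep = 37.0            // approximate decode-step time on an M2 Max
let msOfAudioPerFrame = 80.0    // 1 / 12.5 Hz
let firstChunkFrames = 5.0      // example value, not the documented default

let timeToFirstAudioMs = firstChunkFrames * msPerStep        // 185 ms to generate
let firstChunkAudioMs = firstChunkFrames * msOfAudioPerFrame // 400 ms of audio
```

Because each step produces more audio (80 ms) than it takes to compute (~37 ms), playback can keep up with generation once the first chunk arrives.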
## Batch Mode

To synthesize multiple utterances, use `--batch-file` with a text file containing one utterance per line. The `--batch-size` flag controls how many utterances are processed in parallel.
## Performance
On an M2 Max, Qwen3-TTS achieves a real-time factor (RTF) of approximately 0.55, i.e. it generates speech faster than real time. With `compile()` warmup, each decode step takes about 37 ms.

The default maximum is 500 tokens, which produces roughly 40 seconds of audio at 12.5 Hz. Higher values risk exceeding the Metal GPU watchdog timeout, which can cause a system reboot on Apple Silicon because the GPU is shared with the compositor.
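These numbers are consistent with each other, as a little arithmetic shows. The decode loop alone works out to an RTF near 0.46; the measured 0.55 is presumably the decode loop plus codec decoding and other overhead.

```swift
// How the token budget, step time, and RTF relate.
let tokensPerSecond = 12.5
let maxTokens = 500.0
let audioSeconds = maxTokens / tokensPerSecond       // 40 s of audio

let stepMs = 37.0
let generationSeconds = maxTokens * stepMs / 1000.0  // 18.5 s to generate

let rtf = generationSeconds / audioSeconds           // 0.4625 for the decode loop
```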
## Languages
Qwen3-TTS supports multilingual text-to-speech synthesis. The model automatically detects the input language and generates speech accordingly.
## Swift API
```swift
import Qwen3TTS

// Download (if needed) and load the default model from the Hugging Face Hub.
let model = try await Qwen3TTSModel.loadFromHub()

// Synthesize speech and write a 24 kHz WAV file.
let audio = try await model.speak("Hello, world!")
try audio.write(to: "hello.wav")
```