# Qwen3-TTS
Qwen3-TTS is a codec language model (12.5 Hz token rate) paired with a Mimi decoder for high-quality text-to-speech synthesis. Quantized variants (4-bit and 8-bit) run faster than real time on Apple Silicon.
## Pipeline
Speech synthesis follows a three-stage pipeline:
- **Talker** — 28-layer transformer that converts input text into first-codebook tokens at 12.5 Hz
- **Code Predictor** — 5-layer transformer that predicts the remaining 15 codebooks from the first codebook's hidden states
- **Mimi Codec Decoder** — converts all 16 codebook tokens into a 24 kHz audio waveform
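The data flow above can be sketched as three stub functions. This is purely illustrative: the names and bodies are stand-ins, not the actual Qwen3TTS API, and the real stages are neural networks.

```swift
// Toy sketch of the three-stage pipeline (stub implementations).
struct Frame { var codebooks: [Int] }  // 16 codebook indices per 80 ms frame

func talker(_ text: String) -> [Int] {
    // Stage 1 (stub): text -> one first-codebook token per 12.5 Hz frame.
    text.split(separator: " ").map { $0.count }
}

func codePredictor(firstCodebook: [Int]) -> [Frame] {
    // Stage 2 (stub): fill in the remaining 15 codebooks per frame.
    firstCodebook.map { t in Frame(codebooks: [t] + Array(repeating: 0, count: 15)) }
}

func mimiDecode(_ frames: [Frame]) -> [Float] {
    // Stage 3 (stub): each 12.5 Hz frame expands to 1,920 samples at 24 kHz.
    Array(repeating: 0, count: frames.count * 1920)
}

let frames = codePredictor(firstCodebook: talker("hello brave new world"))
let audio = mimiDecode(frames)  // 4 frames -> 7,680 samples (0.32 s at 24 kHz)
```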
## Architecture

### Talker

The Talker is the core autoregressive model that generates codec tokens from text input.

| Parameter | Value |
|---|---|
| Layers | 28 |
| Hidden dimension | 1024 |
| Query heads | 16 |
| Key/Value heads | 8 (GQA) |
| MLP | SwiGLU |
| Position encoding | RoPE |
### Code Predictor
A lightweight 5-layer transformer that takes the hidden states from the first codebook and predicts the remaining 15 codebooks in parallel. This avoids running the full Talker 16 times per step.
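The savings come from all 15 remaining codebooks being read off the same hidden state in one pass. A toy illustration (dummy values; the real heads are linear projections over the codebook vocabulary, and the vocabulary size of 2048 is an assumption here):

```swift
// 15 independent heads read the same hidden state, so the remaining
// codebooks come out in a single forward pass rather than 15 extra
// autoregressive steps.
let hidden: [Double] = Array(repeating: 0.5, count: 8)  // stand-in for a 1024-dim state

func head(_ index: Int, _ h: [Double]) -> Int {
    // Stub for "linear projection + argmax over the codebook vocabulary".
    (index * h.count) % 2048
}

let remaining = (1...15).map { head($0, hidden) }
// One token each for codebooks 2...16, all derived from the same state.
```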
### Mimi Codec Decoder
The Mimi decoder converts quantized codec tokens back into audio:
- RVQ decode (16 codebooks)
- Pre-convolution (512 to 1024 channels)
- Pre-transformer (1024 to 512 bottleneck, 8 layers, SwiGLU + LayerScale)
- Upsample (2x, 2x)
- SEANet decoder (8x, 5x, 4x, 3x upsample stages)
- 24 kHz waveform output
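The upsampling factors multiply out exactly to the output sample rate, which is a quick sanity check on the stage list above:

```swift
// The pre-transformer upsamples 2x then 2x, and the SEANet decoder
// upsamples 8x, 5x, 4x, 3x. Together they turn one 12.5 Hz frame into
// precisely the right number of 24 kHz samples.
let upsampleFactors = [2, 2, 8, 5, 4, 3]
let samplesPerFrame = upsampleFactors.reduce(1, *)    // 1920 samples per frame
let frameRate = 12.5                                  // frames per second
let sampleRate = Double(samplesPerFrame) * frameRate  // 24,000 Hz
```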
## Model Variants
| Model | Size | HuggingFace |
|---|---|---|
| Qwen3-TTS-0.6B Base (4-bit) | 1.7 GB | aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-4bit |
| Qwen3-TTS-0.6B Base (8-bit) | 2.4 GB | aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-8bit |
| Qwen3-TTS-0.6B CustomVoice (4-bit) | 1.7 GB | aufklarer/Qwen3-TTS-12Hz-0.6B-CustomVoice-MLX-4bit |
| Qwen3-TTS-1.7B Base (4-bit) | 3.2 GB | aufklarer/Qwen3-TTS-12Hz-1.7B-Base-MLX-4bit |
| Qwen3-TTS-1.7B Base (8-bit) | 4.8 GB | aufklarer/Qwen3-TTS-12Hz-1.7B-Base-MLX-8bit |
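To pick one of these variants from Swift, something like the following may work. Note the `id:` parameter of `loadFromHub` is an assumption made for illustration (only the zero-argument form appears in the Swift API section); check the package for the actual signature.

```swift
import Qwen3TTS

// Hypothetical: load a specific quantized variant by Hugging Face repo id.
// The repo names come from the Model Variants table; the `id:` label is assumed.
let model = try await Qwen3TTSModel.loadFromHub(
    id: "aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-4bit"
)
```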
## CLI Usage
Generate speech from text:
```shell
.build/release/audio speak "Hello, world!" --output hello.wav
```
### Options
| Flag | Description |
|---|---|
| `--engine` | TTS engine to use (default: `qwen3`) |
| `--output`, `-o` | Output WAV file path |
| `--language` | Language (default: `english`); omit to use the speaker's native dialect |
| `--model` | Model variant: `base` or `customVoice` |
| `--speaker` | Speaker voice (requires `--model customVoice`) |
| `--temperature` | Sampling temperature (default: 0.3) |
| `--top-k` | Top-k sampling parameter |
| `--max-tokens` | Maximum number of tokens to generate (default: 500) |
| `--stream` | Enable streaming: emits audio chunks during generation |
| `--first-chunk-frames` | Number of frames in the first streamed chunk |
| `--chunk-frames` | Number of frames per subsequent streamed chunk |
| `--batch-file` | Path to a text file with one utterance per line for batch synthesis |
| `--batch-size` | Number of parallel utterances in batch mode |
### Examples
```shell
# Basic synthesis
.build/release/audio speak "The quick brown fox." -o fox.wav

# Streaming output
.build/release/audio speak "Long passage of text..." --stream -o stream.wav

# Batch synthesis from file
.build/release/audio speak --batch-file sentences.txt --batch-size 4 -o output_dir/
```
## Streaming
The `--stream` flag enables chunked audio output during generation. Instead of waiting for the entire utterance to complete, audio is emitted in chunks as tokens are produced. Use `--first-chunk-frames` and `--chunk-frames` to control the size of each chunk.
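A rough latency estimate for the first chunk: each token corresponds to one 80 ms frame of audio (1 / 12.5 Hz), and a decode step takes about 37 ms on an M2 Max (see Performance). The `5` below is an example value for the first chunk size, not a documented default.

```swift
// Back-of-envelope: time to first audio when streaming.
let msPerStep = 37.0            // approximate decode-step time on an M2 Max
let msOfAudioPerFrame = 80.0    // 1 / 12.5 Hz
let firstChunkFrames = 5.0      // example value, not the documented default

let timeToFirstAudioMs = firstChunkFrames * msPerStep        // 185 ms to generate
let firstChunkAudioMs = firstChunkFrames * msOfAudioPerFrame // 400 ms of audio
```

Because each step produces more audio (80 ms) than it takes to compute (~37 ms), playback can keep up with generation once the first chunk arrives.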
## Batch Mode

To synthesize multiple utterances, use `--batch-file` with a text file containing one utterance per line. The `--batch-size` flag controls how many utterances are processed in parallel.
## Performance
On an M2 Max, Qwen3-TTS achieves a real-time factor (RTF) of approximately 0.55, i.e. it generates speech faster than real time. With `compile()` warmup, each decode step takes about 37 ms.

The default maximum is 500 tokens, which produces roughly 40 seconds of audio at 12.5 Hz. Higher values risk exceeding the Metal GPU watchdog timeout, which can cause a system reboot on Apple Silicon because the GPU is shared with the compositor.
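These numbers are consistent with each other, as a little arithmetic shows. The decode loop alone works out to an RTF near 0.46; the measured 0.55 is presumably the decode loop plus codec decoding and other overhead.

```swift
// How the token budget, step time, and RTF relate.
let tokensPerSecond = 12.5
let maxTokens = 500.0
let audioSeconds = maxTokens / tokensPerSecond       // 40 s of audio

let stepMs = 37.0
let generationSeconds = maxTokens * stepMs / 1000.0  // 18.5 s to generate

let rtf = generationSeconds / audioSeconds           // 0.4625 for the decode loop
```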
## Languages
Qwen3-TTS supports multilingual text-to-speech synthesis. The model automatically detects the input language and generates speech accordingly.
## Swift API
```swift
import Qwen3TTS

// Download (if needed) and load the default model from the Hugging Face Hub.
let model = try await Qwen3TTSModel.loadFromHub()

// Synthesize speech and write a 24 kHz WAV file.
let audio = try await model.speak("Hello, world!")
try audio.write(to: "hello.wav")
```