Qwen3-TTS

Qwen3-TTS is a codec language model that generates tokens at 12.5 Hz, paired with a Mimi decoder for high-quality text-to-speech synthesis. 4-bit and 8-bit quantized variants are available, and the model runs faster than real-time on Apple Silicon.

Pipeline

Speech synthesis follows a three-stage pipeline:

  1. Talker — 28-layer transformer that converts input text into first codebook tokens at 12.5 Hz
  2. Code Predictor — 5-layer transformer that predicts the remaining 15 codebooks from the first codebook hidden states
  3. Mimi Codec Decoder — Converts all 16 codebook tokens into a 24 kHz audio waveform
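The three stages above can be sketched as a data flow. This is an illustrative outline only, with hypothetical names and stubbed bodies, not the actual Qwen3TTS API:

```swift
typealias Token = Int

// Stage 1: Talker emits one first-codebook token per autoregressive step (stub).
func talkerStep(_ text: String, _ history: [Token]) -> Token { 0 }

// Stage 2: Code Predictor fills codebooks 2-16 from the Talker's hidden state (stub).
func predictRemaining(_ hidden: [Float]) -> [Token] {
    Array(repeating: 0, count: 15)
}

// Stage 3: Mimi decoder turns each 16-token frame into 1920 samples at 24 kHz (stub).
func mimiDecode(_ frames: [[Token]]) -> [Float] {
    Array(repeating: 0, count: frames.count * 1920)
}

// One generation step assembles a full 16-codebook frame.
let first = talkerStep("Hello", [])
let frame = [first] + predictRemaining([])
```

Each pass through this loop produces one 12.5 Hz frame; decoding a frame yields 1920 samples, i.e. 80 ms of 24 kHz audio.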

Architecture

Talker

The Talker is the core autoregressive model that generates codec tokens from text input.

| Parameter         | Value   |
|-------------------|---------|
| Layers            | 28      |
| Hidden dimension  | 1024    |
| Query heads       | 16      |
| Key/Value heads   | 8 (GQA) |
| MLP               | SwiGLU  |
| Position encoding | RoPE    |

Code Predictor

A lightweight 5-layer transformer that takes the hidden states from the first codebook and predicts the remaining 15 codebooks in parallel. This avoids running the full Talker 16 times per step.
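The saving is easy to see from layer counts alone. This is a rough comparison that ignores per-layer width differences between the two models:

```swift
let talkerLayers = 28
let predictorLayers = 5
let codebooks = 16

// Naive alternative: run the full Talker once per codebook, per frame.
let naiveLayerPasses = talkerLayers * codebooks        // 448

// Actual design: one Talker pass plus one Code Predictor pass per frame,
// since the predictor emits the remaining 15 codebooks in parallel.
let actualLayerPasses = talkerLayers + predictorLayers // 33
```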

Mimi Codec Decoder

The Mimi decoder converts quantized codec tokens back into audio:

  1. RVQ decode (16 codebooks)
  2. Pre-convolution (512 to 1024 channels)
  3. Pre-transformer (1024 to 512 bottleneck, 8 layers, SwiGLU + LayerScale)
  4. Upsample (2x, 2x)
  5. SEANet decoder (8x, 5x, 4x, 3x upsample stages)
  6. 24 kHz waveform output
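The upsampling stages above multiply out exactly to the codec's frame size, which is a useful sanity check:

```swift
// Total upsampling from one 12.5 Hz codec frame to 24 kHz samples.
let seanetUpsample = 8 * 5 * 4 * 3                 // SEANet stages: 480x
let preUpsample = 2 * 2                            // upsample stages: 4x
let samplesPerFrame = seanetUpsample * preUpsample // 1920

// Check against the sample and frame rates: 24000 / 12.5 = 1920.
let expected = 24_000.0 / 12.5
```

So each decoded frame contributes 1920 samples, i.e. 80 ms of audio.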

Model Variants

| Model                              | Size   | HuggingFace                                    |
|------------------------------------|--------|------------------------------------------------|
| Qwen3-TTS-0.6B Base (4-bit)        | 1.7 GB | aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-4bit        |
| Qwen3-TTS-0.6B Base (8-bit)        | 2.4 GB | aufklarer/Qwen3-TTS-12Hz-0.6B-Base-MLX-8bit        |
| Qwen3-TTS-0.6B CustomVoice (4-bit) | 1.7 GB | aufklarer/Qwen3-TTS-12Hz-0.6B-CustomVoice-MLX-4bit |
| Qwen3-TTS-1.7B Base (4-bit)        | 3.2 GB | aufklarer/Qwen3-TTS-12Hz-1.7B-Base-MLX-4bit        |
| Qwen3-TTS-1.7B Base (8-bit)        | 4.8 GB | aufklarer/Qwen3-TTS-12Hz-1.7B-Base-MLX-8bit        |

CLI Usage

Generate speech from text:

.build/release/audio speak "Hello, world!" --output hello.wav

Options

| Flag                 | Description                                                      |
|----------------------|------------------------------------------------------------------|
| --engine             | TTS engine to use (default: qwen3)                               |
| --output, -o         | Output WAV file path                                             |
| --language           | Language (default: english). Omit to use speaker's native dialect. |
| --model              | Model variant: base or customVoice                               |
| --speaker            | Speaker voice (requires --model customVoice)                     |
| --temperature        | Sampling temperature (default: 0.3)                              |
| --top-k              | Top-k sampling parameter                                         |
| --max-tokens         | Maximum number of tokens to generate (default: 500)              |
| --stream             | Enable streaming; emits audio chunks during generation           |
| --first-chunk-frames | Number of frames in the first streamed chunk                     |
| --chunk-frames       | Number of frames per subsequent streamed chunk                   |
| --batch-file         | Path to a text file with one utterance per line for batch synthesis |
| --batch-size         | Number of parallel utterances in batch mode                      |

Examples

# Basic synthesis
.build/release/audio speak "The quick brown fox." -o fox.wav

# Streaming output
.build/release/audio speak "Long passage of text..." --stream -o stream.wav

# Batch synthesis from file
.build/release/audio speak --batch-file sentences.txt --batch-size 4 -o output_dir/

Streaming

The --stream flag enables chunked audio output during generation. Instead of waiting for the entire utterance to complete, audio is emitted in chunks as tokens are produced. Use --first-chunk-frames and --chunk-frames to control the size of each chunk.
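Chunk sizes map directly to latency. Assuming the frame counts refer to 12.5 Hz codec frames (an assumption, not stated explicitly above), each frame is 80 ms of audio:

```swift
// One 12.5 Hz codec frame covers 80 ms of 24 kHz audio.
let secondsPerFrame = 1.0 / 12.5          // 0.08 s

// Example: a first chunk of 5 frames arrives after ~0.4 s of audio is ready.
let firstChunkFrames = 5                  // illustrative value, not a default
let firstChunkSeconds = Double(firstChunkFrames) * secondsPerFrame
```

A smaller --first-chunk-frames lowers time-to-first-audio at the cost of more frequent chunk emission.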

Batch Mode

For synthesizing multiple utterances, use --batch-file with a text file containing one line per utterance. The --batch-size flag controls how many utterances are processed in parallel.

Performance

On an M2 Max, Qwen3-TTS achieves an RTF (real-time factor) of approximately 0.55, meaning it generates speech faster than real-time. With compile() warmup, each step takes about 37 ms.
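To make the RTF figure concrete, here is the arithmetic behind it, using the quoted numbers:

```swift
// RTF = generation time / audio duration; below 1.0 means faster than real-time.
let rtf = 0.55
let audioSeconds = 40.0
let wallSeconds = audioSeconds * rtf      // ~22 s to generate 40 s of audio

// The 37 ms step time against the 80 ms of audio produced per 12.5 Hz step
// gives the Talker-only contribution; the overall 0.55 presumably also
// includes the Code Predictor and Mimi decoding.
let stepMs = 37.0
let frameMs = 1000.0 / 12.5               // 80 ms of audio per step
let talkerRTF = stepMs / frameMs          // 0.4625
```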

Safety Limit

The default maximum is 500 tokens, which produces roughly 40 seconds of audio at 12.5 Hz. Setting higher values risks exceeding the Metal GPU watchdog timeout, which can cause a system reboot on Apple Silicon since the GPU is shared with the compositor.
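The 40-second figure follows directly from the token rate. A small budgeting sketch (the helper function is illustrative, not part of the CLI):

```swift
let tokensPerSecond = 12.5
let maxSecondsAtDefault = 500.0 / tokensPerSecond   // 40 s at the default limit

// Illustrative helper: tokens needed for a target audio duration.
func tokenBudget(forSeconds s: Double) -> Int {
    Int((s * tokensPerSecond).rounded(.up))
}
```

For example, a 10-second utterance needs about 125 tokens, well under the default limit.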

Languages

Qwen3-TTS supports multilingual text-to-speech synthesis. The model automatically detects the input language and generates speech accordingly.

Swift API

import Qwen3TTS

let model = try await Qwen3TTSModel.loadFromHub()
let audio = try await model.speak("Hello, world!")
try audio.write(to: "hello.wav")