# CosyVoice3
Fun-CosyVoice3-0.5B is a 9-language streaming text-to-speech model. It uses a three-stage pipeline — LLM token generation, DiT flow matching, and HiFi-GAN vocoding — to produce natural 24 kHz speech from text input.
## Supported Languages
| Language | Code |
|---|---|
| Chinese | chinese |
| English | english |
| Japanese | japanese |
| Korean | korean |
| German | german |
| Spanish | spanish |
| French | french |
| Italian | italian |
| Russian | russian |
## Pipeline
CosyVoice3 synthesizes speech in three stages:
- LLM — Qwen2.5-0.5B backbone generates FSQ (Finite Scalar Quantization) speech tokens from text
- DiT Flow Matching — A 22-layer Diffusion Transformer converts speech tokens into mel spectrograms via Euler ODE integration
- HiFi-GAN — Neural Source Filter vocoder converts mel spectrograms into 24 kHz waveforms
## Architecture

### LLM (Qwen2.5-0.5B)
The language model is 4-bit quantized and generates discrete speech tokens autoregressively.
| Parameter | Value |
|---|---|
| Layers | 24 |
| Hidden dimension | 896 |
| Query heads | 14 |
| Key/Value heads | 2 (GQA) |
| FSQ vocabulary | 6561 |
| Quantization | 4-bit |
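The FSQ vocabulary of 6561 equals 3^8, consistent with eight latent dimensions quantized to three levels each and packed into a single token index by mixed-radix encoding. A minimal sketch of that packing (the exact per-dimension level configuration is an assumption, not taken from this document):

```python
# Assumed FSQ configuration: 8 dims, 3 levels each -> 3**8 = 6561 codes.
LEVELS = [3] * 8

def fsq_encode(z):
    """Quantize each latent dim in [-1, 1] to its nearest level, return a flat index."""
    idx = 0
    for val, L in zip(z, LEVELS):
        q = int(round((val + 1) / 2 * (L - 1)))  # map [-1, 1] -> {0, ..., L-1}
        q = max(0, min(L - 1, q))
        idx = idx * L + q                        # mixed-radix accumulation
    return idx

def fsq_decode(idx):
    """Invert the mixed-radix index back to per-dim level centers in [-1, 1]."""
    digits = []
    for L in reversed(LEVELS):
        digits.append(idx % L)
        idx //= L
    digits.reverse()
    return [2 * d / (L - 1) - 1 for d, L in zip(digits, LEVELS)]

vocab = 1
for L in LEVELS:
    vocab *= L
print(vocab)  # 6561
```

The round trip is exact: every index in [0, 6560] decodes to level centers that encode back to the same index.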
### DiT Flow Matching
The Diffusion Transformer refines speech tokens into mel spectrograms using conditional flow matching with classifier-free guidance.
| Parameter | Value |
|---|---|
| Layers | 22 |
| Dimension | 1024 |
| Attention heads | 16 |
| Conditioning | AdaLN (Adaptive Layer Norm) |
| ODE solver | Euler, 10 steps |
| CFG rate | 0.7 |
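The decode loop amounts to Euler integration of the learned velocity field, with classifier-free guidance blending the conditional and unconditional predictions at each of the 10 steps. A toy sketch with a stand-in velocity function in place of the 22-layer DiT:

```python
CFG_RATE = 0.7
N_STEPS = 10

def dit_velocity(x, t, cond):
    # Stand-in for the DiT forward pass; the real model predicts the
    # flow-matching velocity v(x, t | tokens, speaker) with 22 transformer layers.
    return cond - x  # toy straight-line flow toward the conditioning target

def euler_cfg_decode(x, cond, uncond):
    """10-step Euler ODE integration with classifier-free guidance."""
    dt = 1.0 / N_STEPS
    t = 0.0
    for _ in range(N_STEPS):
        v_c = dit_velocity(x, t, cond)
        v_u = dit_velocity(x, t, uncond)
        # CFG: extrapolate the conditional velocity past the unconditional one
        v = (1.0 + CFG_RATE) * v_c - CFG_RATE * v_u
        x = x + dt * v
        t += dt
    return x

x = euler_cfg_decode(0.0, 1.0, 0.0)
```

With this toy field the iterate converges toward the guided target (1 + 0.7) * cond - 0.7 * uncond; the real decoder does the same integration over mel-spectrogram tensors.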
### HiFi-GAN Vocoder
A Neural Source Filter (NSF) vocoder that converts mel spectrograms to waveforms.
| Parameter | Value |
|---|---|
| Harmonics | 8 |
| Upsample ratio | 480× (8 × 5 × 3 upsampling, then ×4 via ISTFT) |
| ISTFT | n_fft=16, hop=4 |
| Output sample rate | 24 kHz |
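Taken together, the stages expand each mel frame into 8 × 5 × 3 × 4 = 480 samples, which at 24 kHz implies a 50 Hz mel frame rate (derived from the table, not stated above). A sketch of that arithmetic plus an NSF-style voiced excitation built from the 8 harmonics:

```python
import math

SR = 24_000
N_HARMONICS = 8
UPSAMPLES = [8, 5, 3]   # transposed-conv stages
ISTFT_HOP = 4           # ISTFT head: n_fft=16, hop=4

# Total upsampling per mel frame: 8 * 5 * 3 * 4 = 480 samples.
total = ISTFT_HOP
for r in UPSAMPLES:
    total *= r

mel_frame_rate = SR / total  # 50 mel frames per second

def harmonic_source(f0, n_samples):
    """NSF-style voiced excitation: normalized sum of the first 8 harmonics of f0."""
    return [
        sum(math.sin(2 * math.pi * k * f0 * n / SR)
            for k in range(1, N_HARMONICS + 1)) / N_HARMONICS
        for n in range(n_samples)
    ]

src = harmonic_source(100.0, 240)  # 10 ms of excitation at f0 = 100 Hz
```

The real NSF source adds noise and unvoiced branches; this only illustrates the harmonic part named in the table.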
## Model Weights
| Model | Size | HuggingFace |
|---|---|---|
| CosyVoice3-0.5B (4-bit LLM) | 1.2 GB | aufklarer/CosyVoice3-0.5B-MLX-4bit |
Includes LLM (4-bit quantized), DiT flow matching, and HiFi-GAN vocoder weights.
## CLI Usage

```bash
.build/release/audio speak "Hallo Welt" --engine cosyvoice --language german -o output.wav
```
### Examples

```bash
# English
.build/release/audio speak "Hello, how are you?" --engine cosyvoice -o hello_en.wav

# Chinese
.build/release/audio speak "你好世界" --engine cosyvoice --language chinese -o hello_cn.wav

# Spanish
.build/release/audio speak "Hola, buenos días" --engine cosyvoice --language spanish -o hello_es.wav

# French
.build/release/audio speak "Bonjour le monde" --engine cosyvoice --language french -o hello_fr.wav
```
## Voice Cloning
Clone any voice from a short reference audio sample using the --voice-sample flag. CosyVoice3 uses the CAM++ speaker encoder to extract a 192-dim embedding that conditions the DiT flow model.
```bash
# Voice cloning
.build/release/audio speak "Hello in your voice" --engine cosyvoice --voice-sample reference.wav -o cloned.wav

# Cross-language: clone voice, speak in German
.build/release/audio speak "Guten Tag" --engine cosyvoice --voice-sample reference.wav --language german -o german.wav
```
### How It Works

- CAM++ speaker encoder extracts a 192-dim embedding from the reference audio via CoreML (Neural Engine)
- Affine projection (192 → 80) conditions the DiT flow matching decoder on the target voice
- HiFi-GAN vocoder converts the speaker-conditioned mel spectrogram to 24 kHz audio
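The conditioning step above is a single affine map from the 192-dim CAM++ embedding space into the 80-channel mel conditioning space. A shape-level sketch (the random weights and the L2-normalization of the embedding are illustrative assumptions, not the released parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for the 192 -> 80 affine speaker projection.
W = rng.normal(size=(192, 80)).astype(np.float32)
b = np.zeros(80, dtype=np.float32)

def project_speaker(embedding):
    """Project a CAM++ speaker embedding to the DiT's 80-dim conditioning space."""
    e = embedding / np.linalg.norm(embedding)  # assumed: embeddings are L2-normalized
    return e @ W + b

spk_embedding = rng.normal(size=192).astype(np.float32)
cond = project_speaker(spk_embedding)
print(cond.shape)  # (80,)
```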
### Speaker Encoder
| Property | Value |
|---|---|
| Model | CAM++ (Context-Aware Masking++) |
| Embedding | 192 dimensions |
| Backend | CoreML (Neural Engine, FP16) |
| Size | ~14 MB |
| HuggingFace | aufklarer/CamPlusPlus-Speaker-CoreML |
The CAM++ model is downloaded automatically on first use of --voice-sample. See the Voice Cloning guide for reference audio tips and the Swift API.
## Multi-Speaker Dialogue
Synthesize conversations between multiple speakers using inline speaker tags. Each speaker is assigned a voice from a reference audio file via the --speakers flag.
```bash
# Two-speaker dialogue with voice cloning
.build/release/audio speak "[S1] Hello there! [S2] Hey, how are you?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o dialogue.wav

# Three speakers
.build/release/audio speak "[A] Welcome. [B] Thanks! [C] Glad to be here." \
    --engine cosyvoice --speakers a=host.wav,b=guest1.wav,c=guest2.wav -o panel.wav
```
Speaker names in tags are case-insensitive and matched to the mapping keys. A configurable silence gap (default 0.2s) is inserted between turns.
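The segmentation can be sketched as a regex split on [Name] markers with a case-insensitive lookup into the speaker mapping; the function names and signatures below are illustrative, not the tool's actual API:

```python
import re

TAG = re.compile(r"\[([A-Za-z0-9_]+)\]")
GAP_SAMPLES = int(0.2 * 24_000)  # default 0.2 s of silence inserted between turns

def parse_dialogue(text, speakers):
    """Split '[S1] hi [S2] hey' into (reference_wav, utterance) turns.

    `speakers` maps lowercase names to reference wav paths; tag matching is
    case-insensitive, mirroring the --speakers behavior described above.
    """
    turns = []
    parts = TAG.split(text)
    # parts = [leading_text, name1, text1, name2, text2, ...]
    for name, utterance in zip(parts[1::2], parts[2::2]):
        key = name.lower()
        if key not in speakers:
            raise ValueError(f"no reference audio for speaker tag [{name}]")
        turns.append((speakers[key], utterance.strip()))
    return turns

turns = parse_dialogue(
    "[S1] Hello there! [S2] Hey, how are you?",
    {"s1": "alice.wav", "s2": "bob.wav"},
)
print(turns)  # [('alice.wav', 'Hello there!'), ('bob.wav', 'Hey, how are you?')]
```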
| Option | Default | Description |
|---|---|---|
| --speakers | (none) | Speaker mapping: s1=file.wav,s2=file.wav |
| --turn-gap | 0.2 | Silence between turns (seconds) |
| --crossfade | 0.0 | Crossfade overlap between turns (seconds) |
## Emotion & Style Tags
Control the speaking style per segment using inline emotion tags. CosyVoice3 uses the text prefix before the <|endofprompt|> token as a style instruction — emotion tags map to natural language instructions that replace this prefix.
```bash
# Emotion tags
.build/release/audio speak "(excited) Wow, amazing! (sad) But I have to go..." \
    --engine cosyvoice -o emotion.wav

# Combined with speakers
.build/release/audio speak "[S1] (happy) Great news! [S2] (surprised) Really?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o combined.wav

# Freeform instruction as tag
.build/release/audio speak "(Speak like a pirate) Ahoy matey!" \
    --engine cosyvoice -o pirate.wav

# Global instruction (applies to all segments without emotion tags)
.build/release/audio speak "Hello world" \
    --engine cosyvoice --cosy-instruct "Speak cheerfully" -o cheerful.wav
```
### Built-in Emotion Tags

| Tag | Instruction |
|---|---|
| happy / excited | Speak happily and with excitement. |
| sad | Speak sadly with a melancholic tone. |
| angry | Speak with anger and intensity. |
| whispers / whispering | Speak in a soft, gentle whisper. |
| laughs / laughing | Speak while laughing. |
| calm | Speak calmly and peacefully. |
| surprised | Speak with surprise and amazement. |
| serious | Speak in a serious, formal tone. |
Unknown tags pass through as freeform instructions, so (Speak in a slow, dramatic voice) works as-is.
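Resolution is then a dictionary lookup with freeform fall-through, and the resulting instruction becomes the prefix before the <|endofprompt|> token. A sketch using the instructions from the table above (function names are illustrative):

```python
EMOTION_INSTRUCTIONS = {
    "happy": "Speak happily and with excitement.",
    "excited": "Speak happily and with excitement.",
    "sad": "Speak sadly with a melancholic tone.",
    "angry": "Speak with anger and intensity.",
    "whispers": "Speak in a soft, gentle whisper.",
    "whispering": "Speak in a soft, gentle whisper.",
    "laughs": "Speak while laughing.",
    "laughing": "Speak while laughing.",
    "calm": "Speak calmly and peacefully.",
    "surprised": "Speak with surprise and amazement.",
    "serious": "Speak in a serious, formal tone.",
}

def resolve_tag(tag):
    """Map a built-in emotion tag to its instruction; unknown tags pass through."""
    return EMOTION_INSTRUCTIONS.get(tag.lower().strip(), tag)

def build_prompt(instruction, text):
    # The style instruction is the prefix before the <|endofprompt|> token.
    return f"{instruction}<|endofprompt|>{text}"

print(build_prompt(resolve_tag("sad"), "But I have to go..."))
```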
## Sampling
The LLM stage uses the following sampling configuration:
| Parameter | Value |
|---|---|
| Top-k | 25 |
| Top-p | 0.8 |
| Repetition Aware Sampling | Enabled (window=10, tau_r=0.1) |
Repetition Aware Sampling (RAS), from VALL-E 2, penalizes tokens that appeared in the last 10 generated tokens. This prevents repetitive audio artifacts and improves output stability.
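A sketch of that rule, following the VALL-E 2 formulation (the detail of falling back to a draw from the unfiltered distribution comes from VALL-E 2, not from this document, and the signatures are illustrative):

```python
import random

WINDOW = 10   # look back over the last 10 generated tokens
TAU_R = 0.1   # repetition-ratio threshold

def ras_sample(nucleus_probs, full_probs, history, rng):
    """Repetition Aware Sampling, sketched.

    Draw from the top-k/top-p filtered distribution; if the drawn token's
    frequency in the last WINDOW tokens reaches TAU_R, redraw from the
    unfiltered distribution to break the repetition loop.
    """
    def draw(dist):
        toks, weights = zip(*dist.items())
        return rng.choices(toks, weights=weights)[0]

    tok = draw(nucleus_probs)
    recent = history[-WINDOW:]
    if recent and recent.count(tok) / len(recent) >= TAU_R:
        tok = draw(full_probs)
    return tok

rng = random.Random(0)
# The nucleus keeps only token 7, and the model is stuck repeating it:
print(ras_sample({7: 1.0}, {8: 1.0}, [7] * 10, rng))  # 8 (fallback triggered)
```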
## Performance
On an M2 Max, CosyVoice3 achieves a real-time factor (RTF) of approximately 0.5, i.e. it synthesizes audio in about half the audio's duration.
| Stage | Latency |
|---|---|
| LLM (compiled) | ~13 ms/token |
| DiT Flow Matching | 370 - 520 ms |
| HiFi-GAN | 50 - 170 ms |
The LLM stage uses compile(shapeless: true) for the autoregressive loop, which eliminates recompilation overhead across varying sequence lengths. Batch-doubled CFG halves the number of DiT forward passes from 20 to 10.
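Batch-doubled CFG stacks the conditional and unconditional inputs along the batch axis so each Euler step costs one forward pass instead of two. A toy numpy sketch with a stand-in model that counts its own calls:

```python
import numpy as np

calls = {"n": 0}

def dit_forward(x_batch, cond_batch):
    """Stand-in DiT: one forward pass over the whole batch."""
    calls["n"] += 1
    return cond_batch - x_batch  # toy velocity field

def euler_cfg_batched(x, cond, uncond, steps=10, cfg=0.7):
    dt = 1.0 / steps
    for _ in range(steps):
        # Stack [conditional; unconditional] into a batch of 2 ...
        xb = np.stack([x, x])
        cb = np.stack([cond, uncond])
        # ... so one forward pass yields both velocities.
        v_c, v_u = dit_forward(xb, cb)
        x = x + dt * ((1.0 + cfg) * v_c - cfg * v_u)
    return x

x = euler_cfg_batched(np.zeros(4), np.ones(4), np.zeros(4))
print(calls["n"])  # 10 forward passes for 10 steps, not 20
```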