CosyVoice3
Fun-CosyVoice3-0.5B is a 9-language streaming text-to-speech model. It uses a three-stage pipeline — LLM token generation, DiT flow matching, and HiFi-GAN vocoding — to produce natural 24 kHz speech from text input. The model — also written CosyVoice 3 — is the latest of the FunAudioLLM CosyVoice family.
Supported Languages
| Language | Code |
|---|---|
| Chinese | chinese |
| English | english |
| Japanese | japanese |
| Korean | korean |
| German | german |
| Spanish | spanish |
| French | french |
| Italian | italian |
| Russian | russian |
Pipeline
CosyVoice3 synthesizes speech in three stages:
- LLM — Qwen2.5-0.5B backbone generates FSQ (Finite Scalar Quantization) speech tokens from text
- DiT Flow Matching — A 22-layer Diffusion Transformer converts speech tokens into mel spectrograms via Euler ODE integration
- HiFi-GAN — Neural Source Filter vocoder converts mel spectrograms into 24 kHz waveforms
Architecture
LLM (Qwen2.5-0.5B)
The language model generates discrete speech tokens autoregressively. The runtime ships in four quantization variants — 4-bit, 8-bit, 8-bit-full (int8 LLM + int8 DiT), and bf16 (unquantized) — picked per call via --cosyvoice-variant.
| Parameter | Value |
|---|---|
| Layers | 24 |
| Hidden dimension | 896 |
| Query heads | 14 |
| Key/Value heads | 2 (GQA) |
| FSQ vocabulary | 6561 |
| Quantization | 4-bit (default) / 8-bit / bf16 |
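The table's numbers imply a few derived quantities that are worth spelling out. The sketch below computes the per-head width, the number of query heads sharing each key/value head, and the KV-cache footprint per generated token. The 2-byte (bf16) cache element size is an assumption for illustration; this page does not state how the runtime stores its cache.

```python
# Values taken from the parameter table above.
LAYERS = 24
HIDDEN = 896
Q_HEADS = 14
KV_HEADS = 2

# Assumption: KV-cache entries are stored as bf16 (2 bytes each).
BYTES_PER_VALUE = 2

head_dim = HIDDEN // Q_HEADS          # width of one attention head
queries_per_kv = Q_HEADS // KV_HEADS  # query heads sharing one K/V head (GQA)

# One K and one V vector per KV head, per layer, per token.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * head_dim * BYTES_PER_VALUE
# The same cache if every query head kept its own K/V head (no GQA).
mha_bytes_per_token = 2 * LAYERS * Q_HEADS * head_dim * BYTES_PER_VALUE

print(head_dim, queries_per_kv, kv_bytes_per_token, mha_bytes_per_token)
```

Under these assumptions, grouped-query attention shrinks the per-token cache by the same factor as the head ratio (14 / 2 = 7).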
DiT Flow Matching
The Diffusion Transformer refines speech tokens into mel spectrograms using conditional flow matching with classifier-free guidance.
| Parameter | Value |
|---|---|
| Layers | 22 |
| Dimension | 1024 |
| Attention heads | 16 |
| Conditioning | AdaLN (Adaptive Layer Norm) |
| ODE solver | Euler, 10 steps |
| CFG rate | 0.7 |
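To make the solver settings concrete, here is a minimal sketch of fixed-step Euler integration with classifier-free guidance. It is not the runtime's implementation: the uniform time grid, the guidance rule `(1 + w) * v_cond - w * v_uncond`, and the toy velocity field are all assumptions chosen for illustration, and the real DiT forward pass is replaced by a plain Python callable.

```python
from typing import Callable, List

Vector = List[float]
# velocity(x, t, conditioned) -> dx/dt; a stand-in for the DiT forward pass.
Velocity = Callable[[Vector, float, bool], Vector]


def euler_cfg(x0: Vector, velocity: Velocity,
              steps: int = 10, cfg_rate: float = 0.7) -> Vector:
    """Integrate dx/dt from t=0 to t=1 with fixed-step Euler and CFG."""
    x = list(x0)
    dt = 1.0 / steps  # assumption: uniform time grid
    for i in range(steps):
        t = i * dt
        v_cond = velocity(x, t, True)     # conditioning present
        v_uncond = velocity(x, t, False)  # conditioning dropped
        # Assumed CFG rule: move the conditional velocity away from the
        # unconditional one by a factor of cfg_rate.
        v = [(1.0 + cfg_rate) * c - cfg_rate * u
             for c, u in zip(v_cond, v_uncond)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vector, t: float, conditioned: bool) -> Vector:
    # Hypothetical field: constant drift of 1.0 when conditioned, else 0.0.
    return [1.0 if conditioned else 0.0 for _ in x]


result = euler_cfg([0.0, 2.0], toy_velocity)
print(result)
```

With the toy field, every element moves by roughly 1.7 over the ten steps, which is the guided velocity `1.7 * 1.0 - 0.7 * 0.0` integrated over the unit interval.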
HiFi-GAN Vocoder
A Neural Source Filter (NSF) vocoder that converts mel spectrograms to waveforms.
| Parameter | Value |
|---|---|
| Harmonics | 8 |
| Upsample ratio | 480x (8 x 5 x 3 x ISTFT 4) |
| ISTFT | n_fft=16, hop=4 |
| Output sample rate | 24 kHz |
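The upsample ratio fixes the relationship between mel frames and audio samples. The short calculation below works it out from the table values alone; the variable names are illustrative.

```python
from math import prod

SAMPLE_RATE = 24_000      # Hz, from the table
UPSAMPLE_STAGES = (8, 5, 3)  # upsampling factors listed in the table
ISTFT_HOP = 4             # hop size of the final ISTFT

samples_per_mel_frame = prod(UPSAMPLE_STAGES) * ISTFT_HOP
mel_frames_per_second = SAMPLE_RATE / samples_per_mel_frame
ms_per_mel_frame = 1000 / mel_frames_per_second

print(samples_per_mel_frame, mel_frames_per_second, ms_per_mel_frame)
```

Each mel frame therefore expands to 480 samples, or 20 ms of audio at 24 kHz.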
Model Weights
| Variant | LLM | DiT | Size | HuggingFace |
|---|---|---|---|---|
| 4bit (default) | int4, group=64 | bf16 | ~1.2 GB | aufklarer/CosyVoice3-0.5B-MLX-4bit |
| 8bit | int8, group=64 | bf16 | ~1.4 GB | aufklarer/CosyVoice3-0.5B-MLX-8bit |
| 8bit-full | int8, group=64 | int8, group=64 | ~1.6 GB | aufklarer/CosyVoice3-0.5B-MLX-8bit-full |
| bf16 | bf16 | bf16 | ~2.1 GB | aufklarer/CosyVoice3-0.5B-MLX-bf16 |
Every bundle includes the LLM, the DiT flow-matching decoder, the HiFi-GAN vocoder, and the S3-Tokenizer reference encoder needed for zero-shot voice cloning. Pick a smaller bundle to reduce download size and disk footprint; pick bf16 when LLM/DiT quantization noise becomes a problem (long-form synthesis, voice cloning fidelity).
CLI Usage
# Default 4-bit bundle
.build/release/speech speak "Hallo Welt" --engine cosyvoice --language german -o output.wav
# Pick a variant via --cosyvoice-variant
.build/release/speech speak "Hallo Welt" --engine cosyvoice --cosyvoice-variant bf16 --language german -o output.wav
Examples
# English
.build/release/speech speak "Hello, how are you?" --engine cosyvoice -o hello_en.wav
# Chinese
.build/release/speech speak "你好世界" --engine cosyvoice --language chinese -o hello_cn.wav
# Spanish
.build/release/speech speak "Hola, buenos días" --engine cosyvoice --language spanish -o hello_es.wav
# French
.build/release/speech speak "Bonjour le monde" --engine cosyvoice --language french -o hello_fr.wav
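When driving the CLI from a script, it can help to assemble the argument list in one place and validate it before launching the binary. The helper below is a hypothetical wrapper, not part of the project: it only uses flags that appear on this page, and it assumes the accepted `--cosyvoice-variant` values match the variant names in the Model Weights table.

```python
from typing import List, Optional

LANGUAGES = {"chinese", "english", "japanese", "korean", "german",
             "spanish", "french", "italian", "russian"}
# Assumption: flag values match the variant names in the weights table.
VARIANTS = {"4bit", "8bit", "8bit-full", "bf16"}


def build_speak_command(text: str, output: str,
                        language: Optional[str] = None,
                        variant: Optional[str] = None,
                        binary: str = ".build/release/speech") -> List[str]:
    """Build an argv list for `speech speak` with the CosyVoice engine."""
    if language is not None and language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    if variant is not None and variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    cmd = [binary, "speak", text, "--engine", "cosyvoice"]
    if variant is not None:
        cmd += ["--cosyvoice-variant", variant]
    if language is not None:
        cmd += ["--language", language]
    cmd += ["-o", output]
    return cmd


cmd = build_speak_command("Hallo Welt", "output.wav",
                          language="german", variant="bf16")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with non-ASCII or punctuated text.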
Voice Cloning
Clone any voice from a short reference audio sample using the --voice-sample flag. CosyVoice3 uses the CAM++ speaker encoder to extract a 192-dim embedding that conditions the DiT flow model.
# Voice cloning
.build/release/speech speak "Hello in your voice" --engine cosyvoice --voice-sample reference.wav -o cloned.wav
# Cross-language: clone voice, speak in German
.build/release/speech speak "Guten Tag" --engine cosyvoice --voice-sample reference.wav --language german -o german.wav
How It Works
- CAM++ speaker encoder extracts a 192-dim embedding from the reference audio via CoreML (Neural Engine)
- Affine projection (192 → 80) conditions the DiT flow matching decoder on the target voice
- HiFi-GAN vocoder converts the speaker-conditioned mel spectrogram to 24 kHz audio
Speaker Encoder
| Property | Value |
|---|---|
| Model | CAM++ (Context-Aware Masking++) |
| Embedding | 192 dimensions |
| Backend | CoreML (Neural Engine, FP16) |
| Size | ~14 MB |
| HuggingFace | aufklarer/CamPlusPlus-Speaker-CoreML |
The CAM++ model is downloaded automatically on first use of --voice-sample. See the Voice Cloning guide for reference audio tips and the Swift API.
Multi-Speaker Dialogue
Synthesize conversations between multiple speakers using inline speaker tags. Each speaker is assigned a voice from a reference audio file via the --speakers flag.
# Two-speaker dialogue with voice cloning
.build/release/speech speak "[S1] Hello there! [S2] Hey, how are you?" \
--engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o dialogue.wav
# Three speakers
.build/release/speech speak "[A] Welcome. [B] Thanks! [C] Glad to be here." \
--engine cosyvoice --speakers a=host.wav,b=guest1.wav,c=guest2.wav -o panel.wav
Speaker names in tags are case-insensitive and matched to the mapping keys. A configurable silence gap (default 0.2 s) is inserted between turns.
| Option | Default | Description |
|---|---|---|
| --speakers | | Speaker mapping: s1=file.wav,s2=file.wav |
| --turn-gap | 0.2 | Silence between turns (seconds) |
| --crossfade | 0.0 | Crossfade overlap between turns (seconds) |
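The tag-and-mapping scheme described above can be sketched in a few lines. This is an illustrative parser, not the runtime's own: the function names are hypothetical, and it only mirrors the documented behaviour (bracketed tags, case-insensitive matching against the `--speakers` keys, a fixed gap between turns).

```python
import re
from typing import Dict, List, Tuple

TAG = re.compile(r"\[([^\]]+)\]")


def parse_speakers(mapping: str) -> Dict[str, str]:
    """Parse 's1=alice.wav,s2=bob.wav' into {'s1': 'alice.wav', ...}."""
    pairs = (item.split("=", 1) for item in mapping.split(","))
    return {name.strip().lower(): path.strip() for name, path in pairs}


def split_turns(text: str, speakers: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (reference_file, utterance) pairs in speaking order."""
    # re.split with a capturing group yields [lead, tag1, text1, tag2, text2, ...]
    parts = TAG.split(text)
    turns = []
    for tag, utterance in zip(parts[1::2], parts[2::2]):
        key = tag.strip().lower()  # tags are matched case-insensitively
        if key not in speakers:
            raise KeyError(f"no reference audio for speaker {tag!r}")
        turns.append((speakers[key], utterance.strip()))
    return turns


def total_gap_seconds(num_turns: int, turn_gap: float = 0.2) -> float:
    """Total silence inserted between consecutive turns."""
    return max(num_turns - 1, 0) * turn_gap


speakers = parse_speakers("s1=alice.wav,s2=bob.wav")
turns = split_turns("[S1] Hello there! [S2] Hey, how are you?", speakers)
print(turns, total_gap_seconds(len(turns)))
```

Text before the first tag is ignored in this sketch; how the actual CLI treats untagged leading text is not specified on this page.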
Emotion & Style Tags
Control the speaking style per segment using inline emotion tags. CosyVoice3 uses the text prefix before the <|endofprompt|> token as a style instruction — emotion tags map to natural language instructions that replace this prefix.
# Emotion tags
.build/release/speech speak "(excited) Wow, amazing! (sad) But I have to go..." \
--engine cosyvoice -o emotion.wav
# Combined with speakers
.build/release/speech speak "[S1] (happy) Great news! [S2] (surprised) Really?" \
--engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o combined.wav
# Freeform instruction as tag
.build/release/speech speak "(Speak like a pirate) Ahoy matey!" \
--engine cosyvoice -o pirate.wav
# Global instruction (applies to all segments without emotion tags)
.build/release/speech speak "Hello world" \
--engine cosyvoice --cosy-instruct "Speak cheerfully" -o cheerful.wav
Built-in Emotion Tags
| Tag | Instruction |
|---|---|
| happy / excited | Speak happily and with excitement. |
| sad | Speak sadly with a melancholic tone. |
| angry | Speak with anger and intensity. |
| whispers / whispering | Speak in a soft, gentle whisper. |
| laughs / laughing | Speak while laughing. |
| calm | Speak calmly and peacefully. |
| surprised | Speak with surprise and amazement. |
| serious | Speak in a serious, formal tone. |
Unknown tags pass through as freeform instructions, so (Speak in a slow, dramatic voice) works as-is.
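The tag-to-instruction mapping, including the pass-through rule for unknown tags, can be illustrated as follows. This is a sketch under stated assumptions rather than the runtime's parser: the function name is hypothetical, built-in tags are assumed to match case-insensitively, and any parenthesised span is treated as a tag.

```python
import re
from typing import List, Optional, Tuple

# Built-in tags and their instructions, copied from the table above.
BUILT_IN = {
    "happy": "Speak happily and with excitement.",
    "excited": "Speak happily and with excitement.",
    "sad": "Speak sadly with a melancholic tone.",
    "angry": "Speak with anger and intensity.",
    "whispers": "Speak in a soft, gentle whisper.",
    "whispering": "Speak in a soft, gentle whisper.",
    "laughs": "Speak while laughing.",
    "laughing": "Speak while laughing.",
    "calm": "Speak calmly and peacefully.",
    "surprised": "Speak with surprise and amazement.",
    "serious": "Speak in a serious, formal tone.",
}

TAG = re.compile(r"\(([^)]+)\)")


def split_styled_segments(
    text: str, default_instruction: Optional[str] = None
) -> List[Tuple[Optional[str], str]]:
    """Return (instruction, text) pairs; untagged lead text gets the default."""
    parts = TAG.split(text)  # [lead, tag1, text1, tag2, text2, ...]
    segments: List[Tuple[Optional[str], str]] = []
    lead = parts[0].strip()
    if lead:
        segments.append((default_instruction, lead))
    for tag, body in zip(parts[1::2], parts[2::2]):
        tag = tag.strip()
        # Unknown tags pass through unchanged as freeform instructions.
        instruction = BUILT_IN.get(tag.lower(), tag)
        segments.append((instruction, body.strip()))
    return segments


print(split_styled_segments("(excited) Wow, amazing! (sad) But I have to go..."))
print(split_styled_segments("(Speak like a pirate) Ahoy matey!"))
print(split_styled_segments("Hello world", default_instruction="Speak cheerfully"))
```

The `default_instruction` argument plays the role of `--cosy-instruct`: it applies only to text that carries no tag of its own.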
Model Control Tokens (fl_ tokens)
Internally, the CosyVoice3 LLM uses special control tokens — prefixed fl_ — to switch between modes (zero-shot cloning, instructed synthesis, saving a speaker, etc.). These tokens are part of the upstream FunAudioLLM tokenizer; the Soniqo runtime emits the correct one automatically based on the CLI flag or Swift API call you use, so you never write them by hand.
| Control token | Mode | How to invoke from Soniqo |
|---|---|---|
| <\|fl_speaker_clone\|> | Zero-shot voice cloning from a reference audio sample | Pass --voice-sample reference.wav on the CLI, or set voiceSample: on the Swift API. |
| <\|fl_speaker_instruct\|> | Instruction- or style-conditioned synthesis with a default voice | Pass --cosy-instruct "Speak cheerfully" or use an inline (happy) tag without --voice-sample. |
| <\|fl_speaker_instruct2\|> | Instruction synthesis combined with a cloned reference voice | Combine --voice-sample reference.wav with --cosy-instruct "..." (or an inline emotion tag) in the same call. |
| <\|fl_save_speaker\|> | Persist a speaker's embedding for re-use without re-encoding the reference audio each call | Not directly exposed in the Soniqo CLI; embeddings are computed per call. To cache, extract the 192-dim CAM++ vector yourself via the Speaker Embeddings module and pass it forward. |
| <\|fl_speaker_clone_zh\|>, <\|fl_speaker_clone_en\|>, … | Language-specific zero-shot cloning hints used by the upstream tokenizer | Combine --voice-sample with --language (german, spanish, chinese, ...). Soniqo selects the correct language hint from the --language flag. |
The table above maps each upstream fl_ control token to its Soniqo equivalent. You never need to splice fl_ tokens into your prompt yourself: pass the high-level CLI flags or Swift API arguments and the runtime emits the token that matches the selected mode (clone, instruct, or instruct2).
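Read as a decision rule, the table reduces to two questions: was a reference voice supplied, and was an instruction (or emotion tag) supplied? The function below is a hypothetical illustration of that rule only; it is not the runtime's code, and it ignores the language-specific clone tokens and `fl_save_speaker`.

```python
from typing import Optional


def control_token(voice_sample: bool, instruct: bool) -> Optional[str]:
    """Pick the control token implied by the table for a given flag combination."""
    if voice_sample and instruct:
        return "<|fl_speaker_instruct2|>"  # cloned voice plus instruction
    if voice_sample:
        return "<|fl_speaker_clone|>"      # zero-shot cloning only
    if instruct:
        return "<|fl_speaker_instruct|>"   # instruction with a default voice
    return None  # the table lists no token for plain synthesis


for flags in [(True, True), (True, False), (False, True), (False, False)]:
    print(flags, control_token(*flags))
```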
Sampling
The LLM stage uses the following sampling configuration:
| Parameter | Value |
|---|---|
| Top-k | 25 |
| Top-p | 0.8 |
| Repetition Aware Sampling | Enabled (window=10, tau_r=0.1) |
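Repetition Aware Sampling (RAS), from VALL-E 2, guards against tokens that already appeared in the last 10 generated tokens. As described there, the sampler first draws a candidate with top-k / top-p sampling; if that candidate's share of the window reaches tau_r, it is discarded and a replacement is drawn from the full distribution. With window=10 and tau_r=0.1, a single prior occurrence is enough to trigger the redraw. This prevents repetitive audio artifacts and improves output stability.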
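A minimal sketch of the sampling scheme, using the parameter values from the table, is shown below. It follows the general RAS recipe from VALL-E 2 (nucleus draw, repetition check, fallback to the untruncated distribution) and operates on a plain list of probabilities; the function names are illustrative, and the runtime's actual implementation may differ in detail.

```python
import random
from typing import List, Optional, Sequence


def nucleus_sample(probs: Sequence[float], rng: random.Random,
                   top_k: int = 25, top_p: float = 0.8) -> int:
    """Draw a token id from the top-k candidates covering top_p probability mass."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept: List[int] = []
    mass = 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def is_repetitive(candidate: int, history: Sequence[int],
                  window: int = 10, tau_r: float = 0.1) -> bool:
    """True if the candidate fills at least tau_r of the recent window."""
    recent = list(history[-window:])
    return recent.count(candidate) >= window * tau_r


def ras_sample(probs: Sequence[float], history: Sequence[int],
               window: int = 10, tau_r: float = 0.1,
               rng: Optional[random.Random] = None) -> int:
    rng = rng if rng is not None else random.Random()
    candidate = nucleus_sample(probs, rng)
    if is_repetitive(candidate, history, window, tau_r):
        # Fall back to sampling from the full, untruncated distribution.
        candidate = rng.choices(range(len(probs)), weights=list(probs), k=1)[0]
    return candidate


probs = [0.05, 0.6, 0.3, 0.05]
token = ras_sample(probs, history=[1, 1, 2, 0], rng=random.Random(0))
print(token)
```

Note that the fallback does not forbid the repeated token; it only widens the distribution so that lower-ranked tokens get a chance to break the loop.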
Performance
On an M2 Max, CosyVoice3 achieves an RTF of approximately 0.5 — faster than real-time.
| Stage | Latency |
|---|---|
| LLM (compiled) | ~13 ms/token |
| DiT Flow Matching | 370 - 520 ms |
| HiFi-GAN | 50 - 170 ms |
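For readers unfamiliar with the metric, real-time factor (RTF) is processing time divided by the duration of the audio produced, so values below 1.0 mean synthesis outpaces playback. The snippet below shows the calculation; the timings are hypothetical and were chosen only to reproduce the 0.5 figure quoted above.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the synthesized audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timings: 2 s of compute for a 4 s clip.
rtf = real_time_factor(processing_seconds=2.0, audio_seconds=4.0)
print(rtf, "faster than real time" if rtf < 1.0 else "slower than real time")
```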
The quantized LLM variants (4-bit / 8-bit / 8-bit-full) use compile(shapeless: true) for the autoregressive loop, which eliminates recompilation overhead across varying sequence lengths. The bf16 variant skips that compile — MLX-Swift's shapeless tracer cannot infer the output shape of the bias-fused matmul that plain Linear uses — and runs the generation loop eagerly. Batch-doubled CFG halves the number of DiT forward passes from 20 to 10 in all variants.
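The batch-doubling idea can be shown with a toy model that simply counts forward passes. Instead of running the network once with conditioning and once without, the conditioned and unconditioned inputs are stacked into one batch of two. The `CountingModel` class is a stand-in invented for this sketch; it is not the DiT and says nothing about the runtime's API.

```python
from typing import List, Tuple

Vector = List[float]


class CountingModel:
    """Stand-in for the DiT: counts forward passes, one per batch."""

    def __init__(self) -> None:
        self.calls = 0

    def forward(self, batch: List[Vector], conditioned: List[bool]) -> List[Vector]:
        self.calls += 1
        # Hypothetical velocity: 1.0 per element when conditioned, else 0.0.
        return [[1.0 if c else 0.0 for _ in row]
                for row, c in zip(batch, conditioned)]


def two_pass_step(model: CountingModel, x: Vector) -> Tuple[Vector, Vector]:
    v_cond = model.forward([x], [True])[0]
    v_uncond = model.forward([x], [False])[0]
    return v_cond, v_uncond


def batched_step(model: CountingModel, x: Vector) -> Tuple[Vector, Vector]:
    # Stack the same input twice and drop the conditioning on the second copy.
    v_cond, v_uncond = model.forward([x, x], [True, False])
    return v_cond, v_uncond


STEPS = 10  # Euler steps, from the DiT parameter table
naive, batched = CountingModel(), CountingModel()
for _ in range(STEPS):
    # Both strategies yield the same pair of velocities.
    assert two_pass_step(naive, [0.0, 0.0]) == batched_step(batched, [0.0, 0.0])

print(naive.calls, batched.calls)
```

Each batched pass does twice the work of a single pass, so the benefit comes from fewer kernel launches and better hardware utilization rather than from fewer floating-point operations.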