CosyVoice3

Fun-CosyVoice3-0.5B is a 9-language streaming text-to-speech model. It uses a three-stage pipeline — LLM token generation, DiT flow matching, and HiFi-GAN vocoding — to produce natural 24 kHz speech from text input.

Supported Languages

| Language | Code |
|----------|------|
| Chinese  | chinese |
| English  | english |
| Japanese | japanese |
| Korean   | korean |
| German   | german |
| Spanish  | spanish |
| French   | french |
| Italian  | italian |
| Russian  | russian |

Pipeline

CosyVoice3 synthesizes speech in three stages:

  1. LLM — Qwen2.5-0.5B backbone generates FSQ (Finite Scalar Quantization) speech tokens from text
  2. DiT Flow Matching — A 22-layer Diffusion Transformer converts speech tokens into mel spectrograms via Euler ODE integration
  3. HiFi-GAN — Neural Source Filter vocoder converts mel spectrograms into 24 kHz waveforms
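
The three stages compose end to end. A minimal sketch of that composition (the function names, the token-to-frame ratio, and all shapes here are hypothetical stand-ins, not the real models):

```python
import numpy as np

# Hypothetical stand-ins for the three stages; the real components are a
# Qwen2.5-0.5B LLM, a 22-layer DiT, and a HiFi-GAN NSF vocoder.
def llm_generate_tokens(text: str) -> list[int]:
    """Stage 1: autoregressively map text to FSQ speech tokens (vocab 6561)."""
    return [hash((text, i)) % 6561 for i in range(len(text) * 2)]

def dit_flow_matching(tokens: list[int]) -> np.ndarray:
    """Stage 2: decode speech tokens into a mel spectrogram (80 bins assumed)."""
    return np.zeros((len(tokens) * 2, 80))  # 2 mel frames per token is assumed

def hifigan_vocode(mel: np.ndarray) -> np.ndarray:
    """Stage 3: mel frames -> 24 kHz waveform (480 samples per frame)."""
    return np.zeros(mel.shape[0] * 480)

def synthesize(text: str) -> np.ndarray:
    return hifigan_vocode(dit_flow_matching(llm_generate_tokens(text)))

wav = synthesize("Hello")  # 24 kHz float samples
```

The key point is the data flow: text → discrete tokens → mel frames → waveform, each stage consuming only the previous stage's output.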

Architecture

LLM (Qwen2.5-0.5B)

The language model is 4-bit quantized and generates discrete speech tokens autoregressively.

| Parameter | Value |
|-----------|-------|
| Layers | 24 |
| Hidden dimension | 896 |
| Query heads | 14 |
| Key/Value heads | 2 (GQA) |
| FSQ vocabulary | 6561 |
| Quantization | 4-bit |

DiT Flow Matching

The Diffusion Transformer refines speech tokens into mel spectrograms using conditional flow matching with classifier-free guidance.

| Parameter | Value |
|-----------|-------|
| Layers | 22 |
| Dimension | 1024 |
| Attention heads | 16 |
| Conditioning | AdaLN (Adaptive Layer Norm) |
| ODE solver | Euler, 10 steps |
| CFG rate | 0.7 |
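
The Euler/CFG sampling step can be sketched as a generic conditional flow-matching loop. This is an illustration, not the decoder's actual code; the guidance rule shown, (1 + rate) * cond - rate * uncond, is a common CFG formulation and is an assumption here.

```python
import numpy as np

def euler_cfg_sample(velocity_fn, x0, steps=10, cfg_rate=0.7):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler,
    applying classifier-free guidance at every step.

    velocity_fn(x, t, cond) returns the predicted velocity field;
    cond selects the conditional or unconditional branch.
    """
    x = x0
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_cond = velocity_fn(x, t, cond=True)
        v_uncond = velocity_fn(x, t, cond=False)
        # CFG: push the prediction away from the unconditional branch
        v = (1 + cfg_rate) * v_cond - cfg_rate * v_uncond
        x = x + dt * v
    return x

# Toy velocity field: 1 when conditioned, 0 otherwise
mel = euler_cfg_sample(
    lambda x, t, cond: np.ones_like(x) if cond else np.zeros_like(x),
    np.zeros(4),
)
```

With 10 steps and this naive loop, each step costs two model evaluations (conditional and unconditional); the Compilation section below notes how batching removes that factor of two.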

HiFi-GAN Vocoder

A Neural Source Filter (NSF) vocoder that converts mel spectrograms to waveforms.

| Parameter | Value |
|-----------|-------|
| Harmonics | 8 |
| Upsample ratio | 480x (8 x 5 x 3 x ISTFT 4) |
| ISTFT | n_fft=16, hop=4 |
| Output sample rate | 24 kHz |
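
An NSF source module builds its excitation signal as a sum of sine harmonics of the F0 track, with noise in unvoiced regions. A minimal sketch (the 8-harmonic count comes from the table above; the amplitude and noise level are made-up values):

```python
import numpy as np

def nsf_harmonic_source(f0, sample_rate=24000, n_harmonics=8, amp=0.1):
    """NSF-style excitation: sum of sine harmonics of the per-sample F0
    track, plus low-level noise; pure noise where unvoiced (f0 == 0)."""
    f0 = np.asarray(f0, dtype=np.float64)
    voiced = f0 > 0
    source = np.zeros_like(f0)
    for k in range(1, n_harmonics + 1):
        # instantaneous phase of the k-th harmonic
        phase = 2 * np.pi * np.cumsum(k * f0 / sample_rate)
        source += amp * np.sin(phase)
    noise = 0.003 * np.random.randn(len(f0))
    return np.where(voiced, source + noise, noise)

excitation = nsf_harmonic_source(np.full(24000, 100.0))  # 1 s at 100 Hz
```

In the full vocoder this excitation is filtered by the neural network conditioned on the mel spectrogram; the sketch only covers the source half of the source-filter pair.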

Model Weights

| Model | Size | HuggingFace |
|-------|------|-------------|
| CosyVoice3-0.5B (4-bit LLM) | 1.2 GB | aufklarer/CosyVoice3-0.5B-MLX-4bit |

Includes LLM (4-bit quantized), DiT flow matching, and HiFi-GAN vocoder weights.

CLI Usage

.build/release/audio speak "Hallo Welt" --engine cosyvoice --language german -o output.wav

Examples

# English
.build/release/audio speak "Hello, how are you?" --engine cosyvoice -o hello_en.wav

# Chinese
.build/release/audio speak "你好世界" --engine cosyvoice --language chinese -o hello_cn.wav

# Spanish
.build/release/audio speak "Hola, buenos días" --engine cosyvoice --language spanish -o hello_es.wav

# French
.build/release/audio speak "Bonjour le monde" --engine cosyvoice --language french -o hello_fr.wav

Voice Cloning

Clone any voice from a short reference audio sample using the --voice-sample flag. CosyVoice3 uses the CAM++ speaker encoder to extract a 192-dim embedding that conditions the DiT flow model.

# Voice cloning
.build/release/audio speak "Hello in your voice" --engine cosyvoice --voice-sample reference.wav -o cloned.wav

# Cross-language: clone voice, speak in German
.build/release/audio speak "Guten Tag" --engine cosyvoice --voice-sample reference.wav --language german -o german.wav

How It Works

  1. CAM++ speaker encoder extracts a 192-dim embedding from the reference audio via CoreML (Neural Engine)
  2. Affine projection (192 → 80) conditions the DiT flow matching decoder on the target voice
  3. HiFi-GAN vocoder converts the speaker-conditioned mel spectrogram to 24 kHz audio
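
Step 2 is a plain affine map. Sketched with random placeholder weights (the real 80x192 matrix and bias are learned parameters shipped with the model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder weights; in the real model these are learned.
W = rng.standard_normal((80, 192)) * 0.01  # projection matrix
b = np.zeros(80)                           # bias

def project_speaker_embedding(emb: np.ndarray) -> np.ndarray:
    """Affine map from the 192-dim CAM++ speaker embedding to the
    80-dim conditioning vector consumed by the DiT decoder."""
    assert emb.shape == (192,)
    return W @ emb + b

speaker_cond = project_speaker_embedding(rng.standard_normal(192))
```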

Speaker Encoder

| Property | Value |
|----------|-------|
| Model | CAM++ (Context-Aware Masking++) |
| Embedding | 192 dimensions |
| Backend | CoreML (Neural Engine, FP16) |
| Size | ~14 MB |
| HuggingFace | aufklarer/CamPlusPlus-Speaker-CoreML |

The CAM++ model is downloaded automatically on first use of --voice-sample. See the Voice Cloning guide for reference audio tips and the Swift API.

Multi-Speaker Dialogue

Synthesize conversations between multiple speakers using inline speaker tags. Each speaker is assigned a voice from a reference audio file via the --speakers flag.

# Two-speaker dialogue with voice cloning
.build/release/audio speak "[S1] Hello there! [S2] Hey, how are you?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o dialogue.wav

# Three speakers
.build/release/audio speak "[A] Welcome. [B] Thanks! [C] Glad to be here." \
    --engine cosyvoice --speakers a=host.wav,b=guest1.wav,c=guest2.wav -o panel.wav

Speaker names in tags are case-insensitive and matched to the mapping keys. A configurable silence gap (default 0.2s) is inserted between turns.

| Option | Default | Description |
|--------|---------|-------------|
| --speakers | — | Speaker mapping: s1=file.wav,s2=file.wav |
| --turn-gap | 0.2 | Silence between turns (seconds) |
| --crossfade | 0.0 | Crossfade overlap between turns (seconds) |
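
Turn splitting can be sketched as a small parser: find each [TAG], lower-case it, and look it up in the --speakers mapping. This illustrates the documented behavior, not the CLI's actual implementation:

```python
import re

def parse_dialogue(text: str, speakers: dict[str, str]):
    """Split '[S1] hi [S2] hey' into (speaker_file, segment_text) turns.
    Tag matching is case-insensitive against the mapping keys."""
    turns = []
    # capture '[TAG]' followed by text up to the next tag or end of string
    for tag, segment in re.findall(r"\[(\w+)\]\s*([^\[]*)", text):
        key = tag.lower()
        if key in speakers:
            turns.append((speakers[key], segment.strip()))
    return turns

turns = parse_dialogue(
    "[S1] Hello there! [S2] Hey, how are you?",
    {"s1": "alice.wav", "s2": "bob.wav"},
)
```

Each turn would then be synthesized with its speaker's cloned voice and joined with the configured turn gap or crossfade.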

Emotion & Style Tags

Control the speaking style per segment using inline emotion tags. CosyVoice3 uses the text prefix before the <|endofprompt|> token as a style instruction — emotion tags map to natural language instructions that replace this prefix.

# Emotion tags
.build/release/audio speak "(excited) Wow, amazing! (sad) But I have to go..." \
    --engine cosyvoice -o emotion.wav

# Combined with speakers
.build/release/audio speak "[S1] (happy) Great news! [S2] (surprised) Really?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o combined.wav

# Freeform instruction as tag
.build/release/audio speak "(Speak like a pirate) Ahoy matey!" \
    --engine cosyvoice -o pirate.wav

# Global instruction (applies to all segments without emotion tags)
.build/release/audio speak "Hello world" \
    --engine cosyvoice --cosy-instruct "Speak cheerfully" -o cheerful.wav

Built-in Emotion Tags

| Tag | Instruction |
|-----|-------------|
| happy / excited | Speak happily and with excitement. |
| sad | Speak sadly with a melancholic tone. |
| angry | Speak with anger and intensity. |
| whispers / whispering | Speak in a soft, gentle whisper. |
| laughs / laughing | Speak while laughing. |
| calm | Speak calmly and peacefully. |
| surprised | Speak with surprise and amazement. |
| serious | Speak in a serious, formal tone. |

Unknown tags pass through as freeform instructions, so (Speak in a slow, dramatic voice) works as-is.

Sampling

The LLM stage uses the following sampling configuration:

| Parameter | Value |
|-----------|-------|
| Top-k | 25 |
| Top-p | 0.8 |
| Repetition Aware Sampling | Enabled (window=10, tau_r=0.1) |

Repetition Aware Sampling (RAS), from VALL-E 2, penalizes tokens that appeared in the last 10 generated tokens. This prevents repetitive audio artifacts and improves output stability.
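
A simplified RAS sketch, following the VALL-E 2 description: draw a token, measure its frequency within the recent window, and fall back to a fresh draw when the ratio exceeds tau_r. The real sampler combines this with the top-k/top-p configuration above; the plain weighted draw here is a simplification.

```python
import random

def ras_sample(probs, history, window=10, tau_r=0.1):
    """Repetition Aware Sampling, simplified: draw a token from the
    distribution, and if its frequency within the last `window` tokens
    exceeds tau_r, replace it with a fresh random draw."""
    tokens = list(range(len(probs)))
    tok = random.choices(tokens, weights=probs)[0]
    recent = history[-window:]
    if recent and recent.count(tok) / len(recent) > tau_r:
        tok = random.choices(tokens, weights=probs)[0]  # fall-back draw
    return tok

tok = ras_sample([0.0, 1.0], history=[1] * 10)
```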

Performance

On an M2 Max, CosyVoice3 achieves an RTF of approximately 0.5 — faster than real-time.

| Stage | Latency |
|-------|---------|
| LLM (compiled) | ~13 ms/token |
| DiT Flow Matching | 370 - 520 ms |
| HiFi-GAN | 50 - 170 ms |

Compilation

The LLM stage uses compile(shapeless: true) for the autoregressive loop, which eliminates recompilation overhead across varying sequence lengths. Batch-doubled CFG halves the number of DiT forward passes from 20 to 10.
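
Batch-doubled CFG can be sketched like this: stack the conditioned and null-conditioned inputs along the batch axis so one forward call produces both branches. A generic illustration only; `model` is a stand-in, and the (1 + rate) * cond - rate * uncond guidance rule is an assumption.

```python
import numpy as np

def batched_cfg_velocity(model, x, t, cond, cfg_rate=0.7):
    """One forward pass for both CFG branches: duplicate x in the batch,
    pair it with the real conditioning and a zeroed (null) conditioning,
    then split and combine. With 10 Euler steps this means 10 model
    calls instead of 20."""
    x2 = np.concatenate([x, x], axis=0)
    c2 = np.concatenate([cond, np.zeros_like(cond)], axis=0)
    v2 = model(x2, t, c2)                  # single batched forward pass
    v_cond, v_uncond = np.split(v2, 2, axis=0)
    return (1 + cfg_rate) * v_cond - cfg_rate * v_uncond

# Toy model whose output equals its conditioning
v = batched_cfg_velocity(lambda x, t, c: c, np.ones((1, 4)), 0.0, np.ones((1, 4)))
```

The trade-off is doubled batch size per call in exchange for half the call count, which is usually a win on GPU-style hardware where a wider batch is nearly free.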