CosyVoice3

Fun-CosyVoice3-0.5B is a nine-language streaming text-to-speech model. It uses a three-stage pipeline (LLM token generation, DiT flow matching, and HiFi-GAN vocoding) to produce natural 24 kHz speech from text input. The model, also written CosyVoice 3, is the latest in the FunAudioLLM CosyVoice family.

Supported Languages

Language	Code
Chinese	chinese
English	english
Japanese	japanese
Korean	korean
German	german
Spanish	spanish
French	french
Italian	italian
Russian	russian

Pipeline

CosyVoice3 synthesizes speech in three stages:

LLM — Qwen2.5-0.5B backbone generates FSQ (Finite Scalar Quantization) speech tokens from text
DiT Flow Matching — A 22-layer Diffusion Transformer converts speech tokens into mel spectrograms via Euler ODE integration
HiFi-GAN — Neural Source Filter vocoder converts mel spectrograms into 24 kHz waveforms

Architecture

LLM (Qwen2.5-0.5B)

The language model generates discrete speech tokens autoregressively. The runtime ships four weight variants: 4-bit (the default), 8-bit, 8-bit-full (int8 LLM + int8 DiT), and bf16 (unquantized). Select one per call with --cosyvoice-variant.

Parameter	Value
Layers	24
Hidden dimension	896
Query heads	14
Key/Value heads	2 (GQA)
FSQ vocabulary	6561
Quantization	4-bit (default) / 8-bit / 8-bit-full / bf16
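
The attention figures in the table imply a few derived quantities. The sketch below works them out in Python. It assumes the head size equals hidden dimension divided by query heads and that the KV cache is stored in bf16 (2 bytes per value); neither is stated in the table. The note about 6561 being 3 to the power 8 is likewise an inference about the FSQ layout, not something the table confirms.

```python
# Values taken from the table above.
LAYERS = 24
HIDDEN = 896
QUERY_HEADS = 14
KV_HEADS = 2
FSQ_VOCAB = 6561

# Assumption: head size is hidden / query heads (not stated in the table).
head_dim = HIDDEN // QUERY_HEADS               # 64
queries_per_kv_head = QUERY_HEADS // KV_HEADS  # 7 query heads share each KV head


def kv_cache_bytes_per_token(kv_heads, bytes_per_value=2):
    """Keys plus values for one token across all layers (bf16 assumed)."""
    return 2 * LAYERS * kv_heads * head_dim * bytes_per_value


gqa_bytes = kv_cache_bytes_per_token(KV_HEADS)     # grouped-query attention
mha_bytes = kv_cache_bytes_per_token(QUERY_HEADS)  # if every query head had its own KV head

# Inference, not from the table: 6561 == 3**8, which would match
# 8 FSQ dimensions with 3 levels each.
fsq_matches_3_pow_8 = (3 ** 8 == FSQ_VOCAB)

print(head_dim, queries_per_kv_head, gqa_bytes, mha_bytes, fsq_matches_3_pow_8)
```

Under these assumptions, GQA with 2 KV heads keeps the per-token cache 7 times smaller than full multi-head attention would.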

DiT Flow Matching

The Diffusion Transformer refines speech tokens into mel spectrograms using conditional flow matching with classifier-free guidance.

Parameter	Value
Layers	22
Dimension	1024
Attention heads	16
Conditioning	AdaLN (Adaptive Layer Norm)
ODE solver	Euler, 10 steps
CFG rate	0.7
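
To make "Euler, 10 steps" and "CFG rate 0.7" concrete, here is a minimal Python sketch of fixed-step Euler integration with classifier-free guidance. It is an illustration, not the runtime's code: the guidance formula `(1 + w) * v_cond - w * v_uncond` and the uniform time grid are assumptions, and `toy_velocity` is a hypothetical stand-in for the 22-layer DiT.

```python
def guided_velocity(v_cond, v_uncond, cfg_rate):
    """Classifier-free guidance: push the conditional velocity away from the unconditional one."""
    return [(1.0 + cfg_rate) * c - cfg_rate * u for c, u in zip(v_cond, v_uncond)]


def euler_sample(velocity_fn, x0, cond, steps=10, cfg_rate=0.7):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with equal-size Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_c = velocity_fn(x, t, cond)   # conditioned on the speech tokens
        v_u = velocity_fn(x, t, None)   # conditioning dropped
        v = guided_velocity(v_c, v_u, cfg_rate)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x, t, cond):
    """Hypothetical stand-in for the DiT: a constant velocity that ignores x and t."""
    return [1.0, 2.0] if cond is not None else [0.0, 0.0]


mel = euler_sample(toy_velocity, x0=[0.0, 0.0], cond="tokens")
print(mel)
```

With this toy field the guided velocity is 1.7 times the conditional one, so the result lands near [1.7, 3.4]. A real DiT returns a velocity that depends on `x` and `t`.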

HiFi-GAN Vocoder

A Neural Source Filter (NSF) vocoder that converts mel spectrograms to waveforms.

Parameter	Value
Harmonics	8
Upsample ratio	480x (8 x 5 x 3 x ISTFT 4)
ISTFT	n_fft=16, hop=4
Output sample rate	24 kHz
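
The upsample ratio and sample rate in the table fix the mel frame rate. The short Python sketch below does the arithmetic; the constant names are illustrative.

```python
from math import prod

SAMPLE_RATE = 24_000          # Hz, from the table
UPSAMPLE_FACTORS = [8, 5, 3]  # upsampling stages, from the table
ISTFT_HOP = 4                 # final ISTFT hop, from the table

samples_per_frame = prod(UPSAMPLE_FACTORS) * ISTFT_HOP  # 480 samples per mel frame
frames_per_second = SAMPLE_RATE // samples_per_frame    # 50 mel frames per second
frame_ms = 1000 * samples_per_frame / SAMPLE_RATE       # 20.0 ms of audio per frame


def samples_for_frames(n_frames):
    """Number of waveform samples produced from n_frames mel frames."""
    return n_frames * samples_per_frame


print(samples_per_frame, frames_per_second, frame_ms, samples_for_frames(100))
```

So each mel frame covers 20 ms of audio, and 100 frames yield 48,000 samples, or two seconds at 24 kHz.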

Model Weights

Variant	LLM	DiT	Size	HuggingFace
`4bit` (default)	int4, group=64	bf16	~1.2 GB	aufklarer/CosyVoice3-0.5B-MLX-4bit
`8bit`	int8, group=64	bf16	~1.4 GB	aufklarer/CosyVoice3-0.5B-MLX-8bit
`8bit-full`	int8, group=64	int8, group=64	~1.6 GB	aufklarer/CosyVoice3-0.5B-MLX-8bit-full
`bf16`	bf16	bf16	~2.1 GB	aufklarer/CosyVoice3-0.5B-MLX-bf16

Every bundle includes the LLM, the DiT flow-matching decoder, the HiFi-GAN vocoder, and the S3-Tokenizer reference encoder needed for zero-shot voice cloning. Pick a smaller bundle for a smaller download and disk footprint; pick bf16 when LLM/DiT quantization noise becomes a problem, for example in long-form synthesis or when voice cloning fidelity matters.
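
The "int4, group=64" entries refer to group-wise quantization: weights are split into groups of 64 and each group gets its own scale and offset. The Python sketch below shows the general technique and where the quantization noise comes from. It is not the MLX implementation; the min/max affine scheme and the 16-bit scale and bias per group are assumptions, and the sample weights are made up.

```python
def quantize_group(weights, bits=4):
    """Affine-quantize one group so that w is approximately scale * q + bias."""
    levels = (1 << bits) - 1                      # 15 for 4-bit, 255 for 8-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0             # guard against a constant group
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo


def dequantize_group(q, scale, bias):
    """Reconstruct approximate weights from integer codes."""
    return [scale * qi + bias for qi in q]


def bits_per_weight(bits, group_size=64, meta_bits=16):
    """Storage cost per weight, including one scale and one bias per group (assumed 16-bit each)."""
    return bits + 2 * meta_bits / group_size


group = [i / 63 - 0.5 for i in range(64)]         # hypothetical weights in [-0.5, 0.5]
q, scale, bias = quantize_group(group, bits=4)
restored = dequantize_group(q, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print(max_err, bits_per_weight(4), bits_per_weight(8))
```

The rounding error is bounded by half a quantization step per weight, which is the noise the bf16 bundle avoids.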

CLI Usage

# Default 4-bit bundle
.build/release/speech speak "Hallo Welt" --engine cosyvoice --language german -o output.wav

# Pick a variant via --cosyvoice-variant
.build/release/speech speak "Hallo Welt" --engine cosyvoice --cosyvoice-variant bf16 --language german -o output.wav

Examples

# English
.build/release/speech speak "Hello, how are you?" --engine cosyvoice -o hello_en.wav

# Chinese
.build/release/speech speak "你好世界" --engine cosyvoice --language chinese -o hello_cn.wav

# Spanish
.build/release/speech speak "Hola, buenos días" --engine cosyvoice --language spanish -o hello_es.wav

# French
.build/release/speech speak "Bonjour le monde" --engine cosyvoice --language french -o hello_fr.wav

Voice Cloning

Clone any voice from a short reference audio sample using the --voice-sample flag. CosyVoice3 uses the CAM++ speaker encoder to extract a 192-dim embedding that conditions the DiT flow model.

# Voice cloning
.build/release/speech speak "Hello in your voice" --engine cosyvoice --voice-sample reference.wav -o cloned.wav

# Cross-language: clone voice, speak in German
.build/release/speech speak "Guten Tag" --engine cosyvoice --voice-sample reference.wav --language german -o german.wav

How It Works

CAM++ speaker encoder extracts a 192-dim embedding from the reference audio via CoreML (Neural Engine)
Affine projection (192 → 80) conditions the DiT flow matching decoder on the target voice
HiFi-GAN vocoder converts the speaker-conditioned mel spectrogram to 24 kHz audio
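
The 192 → 80 affine projection in the second step can be sketched in plain Python as a matrix-vector product plus a bias. The weights below are random placeholders (the real ones ship with the model bundle), and the L2 normalization before the projection is an assumption rather than something this page states.

```python
import math
import random

EMBED_DIM, MEL_DIM = 192, 80


def l2_normalize(v):
    """Scale a vector to unit length (assumed preprocessing step)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def affine_project(x, weight, bias):
    """y = W x + b, with W of shape (80, 192) and b of shape (80,)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


rng = random.Random(0)
# Random placeholders standing in for the trained projection weights.
weight = [[rng.gauss(0, 0.05) for _ in range(EMBED_DIM)] for _ in range(MEL_DIM)]
bias = [0.0] * MEL_DIM

speaker_embedding = l2_normalize([rng.gauss(0, 1) for _ in range(EMBED_DIM)])
conditioning = affine_project(speaker_embedding, weight, bias)
print(len(speaker_embedding), len(conditioning))
```

The 80-dimensional output matches the projection size given above and is what conditions the DiT on the target voice.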

Speaker Encoder

Property	Value
Model	CAM++ (Context-Aware Masking++)
Embedding	192 dimensions
Backend	CoreML (Neural Engine, FP16)
Size	~14 MB
HuggingFace	`aufklarer/CamPlusPlus-Speaker-CoreML`

The CAM++ model is downloaded automatically on first use of --voice-sample. See the Voice Cloning guide for reference audio tips and the Swift API.

Multi-Speaker Dialogue

Synthesize conversations between multiple speakers using inline speaker tags. Each speaker is assigned a voice from a reference audio file via the --speakers flag.

# Two-speaker dialogue with voice cloning
.build/release/speech speak "[S1] Hello there! [S2] Hey, how are you?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o dialogue.wav

# Three speakers
.build/release/speech speak "[A] Welcome. [B] Thanks! [C] Glad to be here." \
    --engine cosyvoice --speakers a=host.wav,b=guest1.wav,c=guest2.wav -o panel.wav

Speaker names in tags are case-insensitive and matched to the mapping keys. A configurable silence gap (default 0.2 s) is inserted between turns.

Option	Default	Description
`--speakers`		Speaker mapping: `s1=file.wav,s2=file.wav`
`--turn-gap`	`0.2`	Silence between turns (seconds)
`--crossfade`	`0.0`	Crossfade overlap between turns (seconds)

Emotion & Style Tags

Control the speaking style per segment using inline emotion tags. CosyVoice3 uses the text prefix before the <|endofprompt|> token as a style instruction — emotion tags map to natural language instructions that replace this prefix.

# Emotion tags
.build/release/speech speak "(excited) Wow, amazing! (sad) But I have to go..." \
    --engine cosyvoice -o emotion.wav

# Combined with speakers
.build/release/speech speak "[S1] (happy) Great news! [S2] (surprised) Really?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o combined.wav

# Freeform instruction as tag
.build/release/speech speak "(Speak like a pirate) Ahoy matey!" \
    --engine cosyvoice -o pirate.wav

# Global instruction (applies to all segments without emotion tags)
.build/release/speech speak "Hello world" \
    --engine cosyvoice --cosy-instruct "Speak cheerfully" -o cheerful.wav

Built-in Emotion Tags

Tag	Instruction
`happy` / `excited`	Speak happily and with excitement.
`sad`	Speak sadly with a melancholic tone.
`angry`	Speak with anger and intensity.
`whispers` / `whispering`	Speak in a soft, gentle whisper.
`laughs` / `laughing`	Speak while laughing.
`calm`	Speak calmly and peacefully.
`surprised`	Speak with surprise and amazement.
`serious`	Speak in a serious, formal tone.

Unknown tags pass through as freeform instructions, so (Speak in a slow, dramatic voice) works as-is.
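
The lookup-with-fallback behaviour can be sketched in Python. The instruction strings are copied from the table above; the function names, the case-insensitive lookup, and the exact prompt layout (instruction, then `<|endofprompt|>`, then text) are assumptions made for illustration.

```python
import re

# Instruction strings copied from the built-in tag table.
EMOTIONS = {
    "happy": "Speak happily and with excitement.",
    "excited": "Speak happily and with excitement.",
    "sad": "Speak sadly with a melancholic tone.",
    "angry": "Speak with anger and intensity.",
    "whispers": "Speak in a soft, gentle whisper.",
    "whispering": "Speak in a soft, gentle whisper.",
    "laughs": "Speak while laughing.",
    "laughing": "Speak while laughing.",
    "calm": "Speak calmly and peacefully.",
    "surprised": "Speak with surprise and amazement.",
    "serious": "Speak in a serious, formal tone.",
}
END_OF_PROMPT = "<|endofprompt|>"
LEADING_TAG = re.compile(r"^\s*\(([^)]+)\)\s*")


def resolve_instruction(segment, global_instruct=None):
    """Return (instruction, text) for one segment; unknown tags become freeform instructions."""
    match = LEADING_TAG.match(segment)
    if match is None:
        # No inline tag: fall back to the global instruction, if any.
        return global_instruct, segment.strip()
    tag = match.group(1).strip()
    instruction = EMOTIONS.get(tag.lower(), tag)
    return instruction, segment[match.end():].strip()


def render_prompt(instruction, text):
    """Assumed layout: the style instruction is the prefix before <|endofprompt|>."""
    return f"{instruction}{END_OF_PROMPT}{text}"


print(render_prompt(*resolve_instruction("(sad) But I have to go...")))
print(render_prompt(*resolve_instruction("(Speak like a pirate) Ahoy matey!")))
```

A segment without a tag keeps the global instruction, which mirrors how --cosy-instruct applies only to segments that carry no emotion tag.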

Model Control Tokens (`fl_` tokens)

Internally, the CosyVoice3 LLM uses special control tokens — prefixed fl_ — to switch between modes (zero-shot cloning, instructed synthesis, saving a speaker, etc.). These tokens are part of the upstream FunAudioLLM tokenizer; the Soniqo runtime emits the correct one automatically based on the CLI flag or Swift API call you use, so you never write them by hand.

Control token	Mode	How to invoke from Soniqo
`<|fl_speaker_clone|>`	Zero-shot voice cloning from a reference audio sample	Pass `--voice-sample reference.wav` on the CLI, or set `voiceSample:` on the Swift API.
`<|fl_speaker_instruct|>`	Instruction- or style-conditioned synthesis with a default voice	Pass `--cosy-instruct "Speak cheerfully"` or use an inline `(happy)` tag without `--voice-sample`.
`<|fl_speaker_instruct2|>`	Instruction synthesis combined with a cloned reference voice	Combine `--voice-sample reference.wav` with `--cosy-instruct "..."` (or an inline emotion tag) in the same call.
`<|fl_save_speaker|>`	Persist a speaker's embedding for re-use without re-encoding the reference audio each call	Not directly exposed in the Soniqo CLI; embeddings are computed per call. To cache, extract the 192-dim CAM++ vector yourself via the Speaker Embeddings module and pass it forward.
`<|fl_speaker_clone_zh|>`, `<|fl_speaker_clone_en|>`, …	Language-specific zero-shot cloning hints used by the upstream tokenizer	Combine `--voice-sample` with `--language german|spanish|chinese|...`. Soniqo selects the correct language hint from the `--language` flag.
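
Read as a decision rule, the first three rows of the table reduce to a check on which inputs are present. The Python sketch below is a hypothetical restatement of the table, not the runtime's code. It leaves out the language-specific hints, and it returns None when neither input is given because the table does not cover that case.

```python
def select_mode(voice_sample=None, instruct=None):
    """Return the control-token name implied by the table, or None if no row applies."""
    if voice_sample and instruct:
        return "fl_speaker_instruct2"   # cloned voice plus instruction
    if voice_sample:
        return "fl_speaker_clone"       # zero-shot cloning only
    if instruct:
        return "fl_speaker_instruct"    # instruction with the default voice
    return None                         # not covered by the table


print(select_mode(voice_sample="reference.wav"))
print(select_mode(instruct="Speak cheerfully"))
print(select_mode(voice_sample="reference.wav", instruct="Speak cheerfully"))
```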

If you're porting from FunAudioLLM/CosyVoice

The table above maps each upstream fl_ control token to its Soniqo equivalent. You never need to splice fl_ tokens into your prompt yourself: pass the high-level CLI flags or Swift API arguments and the runtime emits the matching token for clone, instruct, or instruct2 mode. The save_speaker mode has no CLI equivalent; cache the CAM++ embedding yourself as described in the table.

Sampling

The LLM stage uses the following sampling configuration:

Parameter	Value
Top-k	25
Top-p	0.8
Repetition Aware Sampling	Enabled (window=10, tau_r=0.1)

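Repetition Aware Sampling (RAS), introduced in VALL-E 2, guards against repeated tokens. As described there, each token drawn by top-k/top-p sampling is compared with the last 10 generated tokens; if its share of that window reaches tau_r, the token is resampled from the full distribution instead. With window=10 and tau_r=0.1, a single earlier occurrence in the window is enough to trigger the resample. This prevents repetitive audio artifacts and improves output stability.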
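
A minimal Python sketch of the sampling configuration above, using only the standard library. It follows the resample-on-repetition formulation of RAS from VALL-E 2 (draw from the truncated distribution first, then fall back to the full distribution if the token is over-represented in the recent window). Whether the runtime implements it exactly this way is an assumption, and the function names are illustrative.

```python
import random


def nucleus_sample(probs, rng, top_k=25, top_p=0.8):
    """Sample a token id from the top-k tokens, truncated at cumulative mass top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def is_repetitive(token, history, window=10, tau_r=0.1):
    """True when `token` makes up at least tau_r of the last `window` tokens."""
    recent = history[-window:]
    return recent.count(token) >= window * tau_r


def ras_sample(probs, history, rng, window=10, tau_r=0.1):
    """Top-k/top-p sample, resampled from the full distribution if it repeats too often."""
    token = nucleus_sample(probs, rng)
    if is_repetitive(token, history, window, tau_r):
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


rng = random.Random(0)
probs = [0.5, 0.3, 0.1, 0.1]        # hypothetical next-token distribution
history = []
for _ in range(20):
    history.append(ras_sample(probs, history, rng))
print(history)
```

With window=10 and tau_r=0.1 the threshold is one occurrence, so any token already present in the last 10 positions triggers the fallback draw.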

Performance

On an M2 Max, CosyVoice3 achieves a real-time factor (RTF) of approximately 0.5, meaning synthesis takes about half as long as the audio it produces.

Stage	Latency
LLM (compiled)	~13 ms/token
DiT Flow Matching	370 - 520 ms
HiFi-GAN	50 - 170 ms
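
For reference, RTF is synthesis time divided by the duration of the audio produced. The Python sketch below shows the calculation; the utterance timings are hypothetical, and only the 13 ms/token figure comes from the table.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF below 1.0 means synthesis is faster than playback."""
    return synthesis_seconds / audio_seconds


def llm_seconds(n_tokens, ms_per_token=13.0):
    """Time spent in the LLM stage for n_tokens speech tokens."""
    return n_tokens * ms_per_token / 1000.0


# Hypothetical utterance: 4 s of audio synthesized in 2 s of wall-clock time.
rtf = real_time_factor(2.0, 4.0)
print(rtf, llm_seconds(100))
```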

Compilation

The quantized LLM variants (4-bit / 8-bit / 8-bit-full) use compile(shapeless: true) for the autoregressive loop, which eliminates recompilation overhead across varying sequence lengths. The bf16 variant skips that compile step and runs the generation loop eagerly, because MLX-Swift's shapeless tracer cannot infer the output shape of the bias-fused matmul that plain Linear uses. In all variants, batch-doubled CFG evaluates the conditional and unconditional branches in a single batched pass per Euler step, which halves the number of DiT forward passes from 20 to 10.
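
The 20-versus-10 count follows from running 10 Euler steps with two guidance branches per step. The Python sketch below only counts calls; `CountingModel` is a hypothetical stand-in for the DiT and does no real computation.

```python
class CountingModel:
    """Stand-in for the DiT that only counts forward passes."""

    def __init__(self):
        self.calls = 0

    def forward(self, batch):
        self.calls += 1
        return [0.0 for _ in batch]  # one dummy velocity per batch row


def run_separate(model, steps=10):
    """Conditional and unconditional branches as two passes per step."""
    for _ in range(steps):
        model.forward(["cond"])
        model.forward(["uncond"])
    return model.calls


def run_batched(model, steps=10):
    """Both branches stacked into one batch of two per step."""
    for _ in range(steps):
        model.forward(["cond", "uncond"])
    return model.calls


print(run_separate(CountingModel()), run_batched(CountingModel()))
```

Each batched pass processes twice as many rows, so the saving comes from fewer passes rather than less arithmetic.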