CosyVoice3

Fun-CosyVoice3-0.5B is a 9-language streaming text-to-speech model. It uses a three-stage pipeline (LLM token generation, DiT flow matching, and HiFi-GAN vocoding) to produce natural 24 kHz speech from text input. The model, also written CosyVoice 3, is the latest in the FunAudioLLM CosyVoice family.

Supported Languages

| Language | Code |
|---|---|
| Chinese | chinese |
| English | english |
| Japanese | japanese |
| Korean | korean |
| German | german |
| Spanish | spanish |
| French | french |
| Italian | italian |
| Russian | russian |

Pipeline

CosyVoice3 synthesizes speech in three stages:

  1. LLM — Qwen2.5-0.5B backbone generates FSQ (Finite Scalar Quantization) speech tokens from text
  2. DiT Flow Matching — A 22-layer Diffusion Transformer converts speech tokens into mel spectrograms via Euler ODE integration
  3. HiFi-GAN — Neural Source Filter vocoder converts mel spectrograms into 24 kHz waveforms

Architecture

LLM (Qwen2.5-0.5B)

The language model generates discrete speech tokens autoregressively. The runtime ships in four quantization variants — 4-bit, 8-bit, 8-bit-full (int8 LLM + int8 DiT), and bf16 (unquantized) — picked per call via --cosyvoice-variant.

| Parameter | Value |
|---|---|
| Layers | 24 |
| Hidden dimension | 896 |
| Query heads | 14 |
| Key/Value heads | 2 (GQA) |
| FSQ vocabulary | 6561 |
| Quantization | 4-bit (default) / 8-bit / bf16 |
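
To make the grouped-query attention (GQA) figures concrete, the following sketch derives the head width and the per-token key/value cache size from the table. It is illustrative arithmetic in Python, not code from the runtime. It assumes every query head has the same width and that the cache stores 16-bit values; neither detail is stated above.

```python
# Values copied from the parameter table.
LAYERS = 24
HIDDEN = 896
Q_HEADS = 14
KV_HEADS = 2

# Assumption: all query heads have equal width.
HEAD_DIM = HIDDEN // Q_HEADS        # width of one attention head
GROUP_SIZE = Q_HEADS // KV_HEADS    # query heads sharing one K/V head


def kv_cache_bytes_per_token(kv_heads: int, bytes_per_value: int = 2) -> int:
    """Bytes of K and V kept per generated token, summed over all layers.

    bytes_per_value=2 assumes a 16-bit cache (an assumption, not a documented setting).
    """
    return 2 * LAYERS * kv_heads * HEAD_DIM * bytes_per_value


gqa = kv_cache_bytes_per_token(KV_HEADS)   # grouped-query attention, as in the table
mha = kv_cache_bytes_per_token(Q_HEADS)    # hypothetical cache if every query head had its own K/V
print(HEAD_DIM, GROUP_SIZE, gqa, mha, mha // gqa)
```

Under these assumptions, sharing two key/value heads across 14 query heads shrinks the cache by the group size, which matters for long autoregressive generations.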

DiT Flow Matching

The Diffusion Transformer refines speech tokens into mel spectrograms using conditional flow matching with classifier-free guidance.

| Parameter | Value |
|---|---|
| Layers | 22 |
| Dimension | 1024 |
| Attention heads | 16 |
| Conditioning | AdaLN (Adaptive Layer Norm) |
| ODE solver | Euler, 10 steps |
| CFG rate | 0.7 |
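
The solver settings in the table can be sketched as a fixed-step Euler loop with classifier-free guidance. This is a toy Python illustration, not the runtime's implementation. The `velocity` callable is a hypothetical stand-in for one DiT forward pass, the uniform time grid is an assumption, and the guidance mix `(1 + w) * cond - w * uncond` is a common CFG form assumed here rather than confirmed by this page.

```python
from typing import Callable, List

Vector = List[float]


def euler_cfg(
    x0: Vector,
    velocity: Callable[[Vector, float, bool], Vector],
    steps: int = 10,
    cfg_rate: float = 0.7,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler.

    velocity(x, t, conditioned) plays the role of one model forward pass.
    """
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        cond = velocity(x, t, True)      # conditioned prediction
        uncond = velocity(x, t, False)   # unconditioned prediction
        # Assumed guidance mix: push the conditioned prediction away from the unconditioned one.
        guided = [(1.0 + cfg_rate) * c - cfg_rate * u for c, u in zip(cond, uncond)]
        x = [xi + dt * gi for xi, gi in zip(x, guided)]
    return x


calls = 0


def toy_velocity(x: Vector, t: float, conditioned: bool) -> Vector:
    """Constant toy field that also counts how often it is evaluated."""
    global calls
    calls += 1
    return [1.0, -2.0] if conditioned else [0.5, -1.0]


result = euler_cfg([0.0, 0.0], toy_velocity)
print(result, calls)
```

The sketch evaluates the field twice per step, so 10 steps cost 20 evaluations. Stacking the conditioned and unconditioned inputs into one batch would serve both predictions with a single pass per step.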

HiFi-GAN Vocoder

A Neural Source Filter (NSF) vocoder that converts mel spectrograms to waveforms.

| Parameter | Value |
|---|---|
| Harmonics | 8 |
| Upsample ratio | 480x (8 x 5 x 3 x ISTFT 4) |
| ISTFT | n_fft=16, hop=4 |
| Output sample rate | 24 kHz |
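
The upsample ratio follows directly from the table: three upsampling stages multiplied by the ISTFT hop. The short Python sketch below works through the arithmetic and the resulting frame duration. It assumes no trimming or padding at the waveform edges, which this page does not specify.

```python
from math import prod

SAMPLE_RATE = 24_000
UPSAMPLE_FACTORS = (8, 5, 3)   # upsampling stages from the table
ISTFT_HOP = 4                  # hop size of the final ISTFT

samples_per_frame = prod(UPSAMPLE_FACTORS) * ISTFT_HOP   # waveform samples per mel frame
frame_ms = 1000 * samples_per_frame / SAMPLE_RATE        # duration of one mel frame
frames_per_second = SAMPLE_RATE // samples_per_frame     # mel frame rate


def waveform_samples(mel_frames: int) -> int:
    """Waveform length for a mel spectrogram, ignoring any edge trimming (assumption)."""
    return mel_frames * samples_per_frame


print(samples_per_frame, frame_ms, frames_per_second, waveform_samples(150))
```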

Model Weights

| Variant | LLM | DiT | Size | HuggingFace |
|---|---|---|---|---|
| 4bit (default) | int4, group=64 | bf16 | ~1.2 GB | aufklarer/CosyVoice3-0.5B-MLX-4bit |
| 8bit | int8, group=64 | bf16 | ~1.4 GB | aufklarer/CosyVoice3-0.5B-MLX-8bit |
| 8bit-full | int8, group=64 | int8, group=64 | ~1.6 GB | aufklarer/CosyVoice3-0.5B-MLX-8bit-full |
| bf16 | bf16 | bf16 | ~2.1 GB | aufklarer/CosyVoice3-0.5B-MLX-bf16 |

Every bundle includes the LLM, the DiT flow-matching decoder, the HiFi-GAN vocoder, and the S3-Tokenizer reference encoder needed for zero-shot voice cloning. Pick a smaller bundle to reduce download size and disk footprint; pick bf16 when LLM or DiT quantization noise becomes a problem, for example in long-form synthesis or when voice cloning fidelity matters.
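
As a small aid for choosing a bundle, the sketch below lists the variants that fit a given disk budget. It is a hypothetical Python helper, not part of the runtime. The sizes are the approximate figures from the table, and the keys mirror the Variant column; whether the CLI accepts exactly these spellings (other than bf16, which appears in the usage examples) is an assumption.

```python
from typing import Dict, List

# Approximate bundle sizes in GB, copied from the table above.
BUNDLE_SIZES_GB: Dict[str, float] = {
    "4bit": 1.2,
    "8bit": 1.4,
    "8bit-full": 1.6,
    "bf16": 2.1,
}


def variants_within(budget_gb: float) -> List[str]:
    """Return the variants whose bundle fits the budget, smallest first."""
    fitting = [(size, name) for name, size in BUNDLE_SIZES_GB.items() if size <= budget_gb]
    return [name for _, name in sorted(fitting)]


print(variants_within(1.5))
print(variants_within(3.0))
```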

CLI Usage

# Default 4-bit bundle
.build/release/speech speak "Hallo Welt" --engine cosyvoice --language german -o output.wav

# Pick a variant via --cosyvoice-variant
.build/release/speech speak "Hallo Welt" --engine cosyvoice --cosyvoice-variant bf16 --language german -o output.wav

Examples

# English
.build/release/speech speak "Hello, how are you?" --engine cosyvoice -o hello_en.wav

# Chinese
.build/release/speech speak "你好世界" --engine cosyvoice --language chinese -o hello_cn.wav

# Spanish
.build/release/speech speak "Hola, buenos días" --engine cosyvoice --language spanish -o hello_es.wav

# French
.build/release/speech speak "Bonjour le monde" --engine cosyvoice --language french -o hello_fr.wav
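
For batch jobs, the invocations above can be assembled programmatically. The following Python sketch builds the argument list using only the flags shown in this section. The helper name `build_speak_command` is hypothetical, and the sketch prints the commands instead of running them.

```python
import shlex
from typing import List, Optional


def build_speak_command(
    text: str,
    output: str,
    language: Optional[str] = None,
    variant: Optional[str] = None,
    binary: str = ".build/release/speech",
) -> List[str]:
    """Build an argv list for one synthesis call (hypothetical helper)."""
    cmd = [binary, "speak", text, "--engine", "cosyvoice"]
    if variant:
        cmd += ["--cosyvoice-variant", variant]
    if language:
        cmd += ["--language", language]
    cmd += ["-o", output]
    return cmd


jobs = [
    ("Hello, how are you?", None, "hello_en.wav"),
    ("Hola, buenos días", "spanish", "hello_es.wav"),
    ("Bonjour le monde", "french", "hello_fr.wav"),
]
for text, language, output in jobs:
    # Print a shell-safe rendering; pass the list to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(build_speak_command(text, output, language=language)))
```

Passing the list form to `subprocess.run` avoids shell quoting problems with punctuation or non-ASCII text.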

Voice Cloning

Clone any voice from a short reference audio sample using the --voice-sample flag. CosyVoice3 uses the CAM++ speaker encoder to extract a 192-dim embedding that conditions the DiT flow model.

# Voice cloning
.build/release/speech speak "Hello in your voice" --engine cosyvoice --voice-sample reference.wav -o cloned.wav

# Cross-language: clone voice, speak in German
.build/release/speech speak "Guten Tag" --engine cosyvoice --voice-sample reference.wav --language german -o german.wav

How It Works

  1. CAM++ speaker encoder extracts a 192-dim embedding from the reference audio via CoreML (Neural Engine)
  2. Affine projection (192 → 80) conditions the DiT flow matching decoder on the target voice
  3. HiFi-GAN vocoder converts the speaker-conditioned mel spectrogram to 24 kHz audio
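
Step 2 above is a plain affine map, y = Wx + b, from the 192-dim speaker embedding to an 80-dim conditioning vector. The Python sketch below shows only the shapes involved. The weights and the embedding are random placeholders, not values from the model, and any normalization applied before the projection is not covered.

```python
import random
from typing import List

EMBED_DIM = 192   # CAM++ embedding size
COND_DIM = 80     # size after the affine projection


def affine(x: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Compute y = W x + b, where weight holds one row per output dimension."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


rng = random.Random(0)
# Placeholder parameters: a real model would load trained values instead.
weight = [[rng.gauss(0.0, 0.05) for _ in range(EMBED_DIM)] for _ in range(COND_DIM)]
bias = [0.0] * COND_DIM
embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]

cond = affine(embedding, weight, bias)
print(len(embedding), "->", len(cond))
```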

Speaker Encoder

| Property | Value |
|---|---|
| Model | CAM++ (Context-Aware Masking++) |
| Embedding | 192 dimensions |
| Backend | CoreML (Neural Engine, FP16) |
| Size | ~14 MB |
| HuggingFace | aufklarer/CamPlusPlus-Speaker-CoreML |

The CAM++ model is downloaded automatically on first use of --voice-sample. See the Voice Cloning guide for reference audio tips and the Swift API.

Multi-Speaker Dialogue

Synthesize conversations between multiple speakers using inline speaker tags. Each speaker is assigned a voice from a reference audio file via the --speakers flag.

# Two-speaker dialogue with voice cloning
.build/release/speech speak "[S1] Hello there! [S2] Hey, how are you?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o dialogue.wav

# Three speakers
.build/release/speech speak "[A] Welcome. [B] Thanks! [C] Glad to be here." \
    --engine cosyvoice --speakers a=host.wav,b=guest1.wav,c=guest2.wav -o panel.wav

Speaker names in tags are case-insensitive and matched to the mapping keys. A configurable silence gap (default 0.2s) is inserted between turns.

| Option | Default | Description |
|---|---|---|
| --speakers | | Speaker mapping: s1=file.wav,s2=file.wav |
| --turn-gap | 0.2 | Silence between turns (seconds) |
| --crossfade | 0.0 | Crossfade overlap between turns (seconds) |
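
The tag syntax described above can be illustrated with a small parser. This Python sketch is not the runtime's parser. It splits a dialogue string into turns, matches tags to the mapping keys case-insensitively, and converts the turn gap into a sample count at 24 kHz. How the runtime treats a tag with no mapping entry is not documented here; the sketch simply raises `KeyError`.

```python
import re
from typing import Dict, List, Tuple

TAG = re.compile(r"\[([^\]]+)\]")


def parse_speaker_map(spec: str) -> Dict[str, str]:
    """Parse 's1=alice.wav,s2=bob.wav' into a dict with lowercased keys."""
    mapping = {}
    for item in spec.split(","):
        name, _, path = item.partition("=")
        mapping[name.strip().lower()] = path.strip()
    return mapping


def split_turns(text: str, speakers: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (reference_file, utterance) pairs in spoken order."""
    # Splitting on a capturing group yields ['', tag1, text1, tag2, text2, ...].
    parts = TAG.split(text)
    turns = []
    for tag, utterance in zip(parts[1::2], parts[2::2]):
        turns.append((speakers[tag.strip().lower()], utterance.strip()))
    return turns


def gap_samples(turn_gap_s: float = 0.2, sample_rate: int = 24_000) -> int:
    """Number of silent samples inserted between two turns."""
    return round(turn_gap_s * sample_rate)


speakers = parse_speaker_map("s1=alice.wav,s2=bob.wav")
turns = split_turns("[S1] Hello there! [S2] Hey, how are you?", speakers)
print(turns, gap_samples())
```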

Emotion & Style Tags

Control the speaking style per segment using inline emotion tags. CosyVoice3 uses the text prefix before the <|endofprompt|> token as a style instruction — emotion tags map to natural language instructions that replace this prefix.

# Emotion tags
.build/release/speech speak "(excited) Wow, amazing! (sad) But I have to go..." \
    --engine cosyvoice -o emotion.wav

# Combined with speakers
.build/release/speech speak "[S1] (happy) Great news! [S2] (surprised) Really?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o combined.wav

# Freeform instruction as tag
.build/release/speech speak "(Speak like a pirate) Ahoy matey!" \
    --engine cosyvoice -o pirate.wav

# Global instruction (applies to all segments without emotion tags)
.build/release/speech speak "Hello world" \
    --engine cosyvoice --cosy-instruct "Speak cheerfully" -o cheerful.wav

Built-in Emotion Tags

| Tag | Instruction |
|---|---|
| happy / excited | Speak happily and with excitement. |
| sad | Speak sadly with a melancholic tone. |
| angry | Speak with anger and intensity. |
| whispers / whispering | Speak in a soft, gentle whisper. |
| laughs / laughing | Speak while laughing. |
| calm | Speak calmly and peacefully. |
| surprised | Speak with surprise and amazement. |
| serious | Speak in a serious, formal tone. |

Unknown tags pass through as freeform instructions, so (Speak in a slow, dramatic voice) works as-is.
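
The tag-to-instruction mapping and the pass-through rule can be sketched as a lookup with a fallback. This is an illustrative Python sketch, not the runtime's parser. Matching built-in tags case-insensitively is an assumption, and the sketch ignores speaker tags and any text before the first emotion tag.

```python
import re
from typing import Dict, List, Tuple

# Built-in tags and their instructions, copied from the table above.
BUILTIN: Dict[str, str] = {
    "happy": "Speak happily and with excitement.",
    "excited": "Speak happily and with excitement.",
    "sad": "Speak sadly with a melancholic tone.",
    "angry": "Speak with anger and intensity.",
    "whispers": "Speak in a soft, gentle whisper.",
    "whispering": "Speak in a soft, gentle whisper.",
    "laughs": "Speak while laughing.",
    "laughing": "Speak while laughing.",
    "calm": "Speak calmly and peacefully.",
    "surprised": "Speak with surprise and amazement.",
    "serious": "Speak in a serious, formal tone.",
}

# A parenthesized tag followed by the text it applies to.
SEGMENT = re.compile(r"\(([^)]+)\)\s*([^()]*)")


def resolve(tag: str) -> str:
    """Map a built-in tag to its instruction; unknown tags pass through unchanged."""
    cleaned = tag.strip()
    return BUILTIN.get(cleaned.lower(), cleaned)


def split_emotions(text: str) -> List[Tuple[str, str]]:
    """Return (instruction, segment_text) pairs in order of appearance."""
    return [(resolve(tag), body.strip()) for tag, body in SEGMENT.findall(text)]


print(split_emotions("(excited) Wow, amazing! (sad) But I have to go..."))
print(split_emotions("(Speak like a pirate) Ahoy matey!"))
```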

Model Control Tokens (fl_ tokens)

Internally, the CosyVoice3 LLM uses special control tokens — prefixed fl_ — to switch between modes (zero-shot cloning, instructed synthesis, saving a speaker, etc.). These tokens are part of the upstream FunAudioLLM tokenizer; the Soniqo runtime emits the correct one automatically based on the CLI flag or Swift API call you use, so you never write them by hand.

Control token: <|fl_speaker_clone|>
  Mode: Zero-shot voice cloning from a reference audio sample
  How to invoke from Soniqo: Pass --voice-sample reference.wav on the CLI, or set voiceSample: on the Swift API.

Control token: <|fl_speaker_instruct|>
  Mode: Instruction- or style-conditioned synthesis with a default voice
  How to invoke from Soniqo: Pass --cosy-instruct "Speak cheerfully" or use an inline (happy) tag without --voice-sample.

Control token: <|fl_speaker_instruct2|>
  Mode: Instruction synthesis combined with a cloned reference voice
  How to invoke from Soniqo: Combine --voice-sample reference.wav with --cosy-instruct "..." (or an inline emotion tag) in the same call.

Control token: <|fl_save_speaker|>
  Mode: Persist a speaker's embedding for re-use without re-encoding the reference audio each call
  How to invoke from Soniqo: Not directly exposed in the Soniqo CLI; embeddings are computed per call. To cache, extract the 192-dim CAM++ vector yourself via the Speaker Embeddings module and pass it forward.

Control token: <|fl_speaker_clone_zh|>, <|fl_speaker_clone_en|>, …
  Mode: Language-specific zero-shot cloning hints used by the upstream tokenizer
  How to invoke from Soniqo: Combine --voice-sample with --language german|spanish|chinese|.... Soniqo selects the correct language hint from the --language flag.

If you're porting from FunAudioLLM/CosyVoice

The table above maps each upstream fl_ control token to its Soniqo equivalent. You never need to splice fl_ tokens into your prompt yourself: pass the high-level CLI flags or Swift API arguments and the runtime emits the token for the matching mode (clone, instruct, or instruct2). The save_speaker token is the exception, since it has no direct CLI equivalent.
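
The flag-to-token mapping described above amounts to a small decision rule. The Python sketch below restates it for readers porting upstream prompts. It is not the runtime's code; the function name is hypothetical, and the language-specific clone hints are left out.

```python
from typing import Optional


def control_token(voice_sample: Optional[str], instruct: Optional[str]) -> Optional[str]:
    """Pick the control token implied by the flags, following the mapping above.

    voice_sample mirrors --voice-sample; instruct mirrors --cosy-instruct or an inline emotion tag.
    """
    if voice_sample and instruct:
        return "<|fl_speaker_instruct2|>"   # cloned voice plus instruction
    if voice_sample:
        return "<|fl_speaker_clone|>"       # zero-shot cloning only
    if instruct:
        return "<|fl_speaker_instruct|>"    # instruction with the default voice
    return None                             # no token is listed for plain synthesis


print(control_token("reference.wav", None))
print(control_token(None, "Speak cheerfully"))
print(control_token("reference.wav", "Speak cheerfully"))
```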

Sampling

The LLM stage uses the following sampling configuration:

| Parameter | Value |
|---|---|
| Top-k | 25 |
| Top-p | 0.8 |
| Repetition Aware Sampling | Enabled (window=10, tau_r=0.1) |

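Repetition Aware Sampling (RAS), from VALL-E 2, checks how often the sampled token already occurs in the last 10 generated tokens. When that share reaches tau_r, the token is redrawn from the full distribution instead of the truncated top-k/top-p set. This breaks repetition loops, which prevents repetitive audio artifacts and improves output stability.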
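
A minimal Python sketch of this sampling scheme follows, using the parameters from the table. It follows the RAS procedure as described in VALL-E 2: draw with top-k/top-p, then fall back to a draw from the full distribution when the drawn token repeats too often in the recent window. It is not taken from the runtime, and details such as tie handling are assumptions.

```python
import random
from typing import Dict, Sequence


def nucleus_sample(probs: Dict[int, float], rng: random.Random,
                   top_k: int = 25, top_p: float = 0.8) -> int:
    """Sample from the smallest high-probability set within top-k that covers top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    tokens, weights = zip(*kept)
    return rng.choices(tokens, weights=weights, k=1)[0]


def ras_sample(probs: Dict[int, float], history: Sequence[int], rng: random.Random,
               window: int = 10, tau_r: float = 0.1) -> int:
    """Repetition-aware draw: resample from the full distribution on excess repetition."""
    token = nucleus_sample(probs, rng)
    repeats = list(history[-window:]).count(token)
    if repeats >= window * tau_r:
        # Fallback: ignore the top-k/top-p truncation so rarer tokens can break the loop.
        tokens = list(probs)
        token = rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]
    return token


rng = random.Random(0)
probs = {0: 0.9, 1: 0.05, 2: 0.05}
print(ras_sample(probs, [], rng))      # no history, so the nucleus draw is kept
print(ras_sample(probs, [0], rng))     # token 0 is in the window, so the fallback draw runs
```

With window=10 and tau_r=0.1 the threshold is a single occurrence, so any token already present in the last 10 positions triggers the fallback draw.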

Performance

On an M2 Max, CosyVoice3 achieves a real-time factor (RTF) of approximately 0.5, so synthesis takes about half as long as the audio it produces.
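
RTF is the ratio of synthesis time to the duration of the generated audio, so values below 1 mean faster than real time. The Python sketch below shows the arithmetic; the function names are hypothetical and the 0.5 figure is the approximate value quoted above.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    return synthesis_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds: float, rtf: float = 0.5) -> float:
    """Rough synthesis time for a clip of the given length at a given RTF."""
    return audio_seconds * rtf


print(real_time_factor(5.0, 10.0))
print(estimated_synthesis_seconds(60.0))
```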

| Stage | Latency |
|---|---|
| LLM (compiled) | ~13 ms/token |
| DiT Flow Matching | 370 - 520 ms |
| HiFi-GAN | 50 - 170 ms |

Compilation

The quantized LLM variants (4-bit / 8-bit / 8-bit-full) use compile(shapeless: true) for the autoregressive loop, which eliminates recompilation overhead across varying sequence lengths. The bf16 variant skips that compile — MLX-Swift's shapeless tracer cannot infer the output shape of the bias-fused matmul that plain Linear uses — and runs the generation loop eagerly. Batch-doubled CFG halves the number of DiT forward passes from 20 to 10 in all variants.