Clone a voice in 30 seconds.
Synthesise for hours.
Zero-shot voice cloning on Apple Silicon. Provide a 5–30 second reference clip and its transcript; CosyVoice 3 generates speech in that voice across nine languages, fully offline. No fine-tuning, no per-character pricing, no audio ever leaving the device.
Where on-device cloning changes the math.
Cloud TTS bills per character and locks the voice on a server. On-device cloning lets you ship apps and creator tools where the voice belongs to your user and the marginal synthesis cost is zero.
YouTube voiceovers, course modules, marketing clips — your founder's voice across every episode.
Same voice, nine languages. Cross-lingual cloning keeps the speaker identity even when the words switch.
Let the listener pick their narrator — a celebrity, a parent, a family voice — from a 30-second sample.
Lecturers can record once, then have lessons re-narrated automatically as their text scripts evolve.
One CLI call. One reference clip. One transcript.
For the best clone quality the model needs both the reference's acoustic prefix AND its text transcript. Skipping the transcript falls back to the legacy CAM++-only path with materially lower identity capture.
```bash
speech speak "Welcome to the demo." \
  --engine cosyvoice \
  --voice-sample ref.wav \
  --cosy-reference-transcript "Transcript of ref.wav (its text content)..." \
  --output out.wav
```

The same flow from Swift:

```swift
import CosyVoiceTTS
import AudioCommon

// Load the CosyVoice 3 model weights.
let model = try await CosyVoiceTTSModel.fromPretrained()

// Load the reference clip at the 16 kHz rate the tokenizer expects.
let refAudio = try AudioFileLoader.load(
    url: URL(fileURLWithPath: "ref.wav"), targetSampleRate: 16_000)

// Locate the speech tokenizer weights inside the downloaded bundle.
let cacheDir = try HuggingFaceDownloader.getCacheDirectory(
    for: "aufklarer/CosyVoice3-0.5B-MLX-4bit")
let tokenizer = try SpeechTokenizerModel.fromSafetensors(
    at: cacheDir.appendingPathComponent("speech_tokenizer.safetensors"))

// Extract the voice profile once; it is reusable across synthesis calls.
let profile = try model.extractVoiceProfile(
    audio: refAudio, sampleRate: 16_000,
    speechTokenizer: tokenizer,
    referenceTranscript: "Transcript of the reference clip."
)

// Synthesize new text in the cloned voice.
let audio = model.synthesize(
    text: "Welcome to the demo.",
    voiceProfile: profile,
    language: "english"
)
```

Three cloning paths, one stack.
CosyVoice 3 zero-shot is the recommended default. Qwen3-TTS ICL is the English-first alternative. The legacy CAM++-only path is what older bundles used before zero-shot landed.
| Engine | How | Languages | Best for |
|---|---|---|---|
| CosyVoice 3 zero-shot | prompt_token + prompt_feat + transcript | 9 (zh/en/ja/ko/de/es/fr/it/ru) | Default. Highest identity capture, multilingual creative work. |
| Qwen3-TTS ICL | In-context-learning with reference audio | EN, ZH | English-first projects, when you don't need the other 7 languages. |
| CosyVoice 3 (legacy) | 192-d CAM++ speaker embedding only | 9 | Bundles without the speech tokenizer. Material drop in identity match. |
Pick a bundle by disk vs perceptual quality.
Both bundles ship the same flow, vocoder, and speech tokenizer; they differ only in LLM quantisation. The 8-bit LLM produces materially cleaner audio with fewer mid-segment drifts on long output; the 4-bit LLM is the size-sensitive default.
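As a sketch, selecting a bundle at load time could look like this. The `id:` parameter on `fromPretrained` and the `-8bit` repo name are assumptions for illustration, not confirmed API:

```swift
import CosyVoiceTTS

let preferQuality = true  // trade disk for cleaner long-form audio

// Hypothetical: assumes fromPretrained accepts a Hub repo id and that
// the 8-bit bundle is published under a "-8bit" suffix.
let repoId = preferQuality
    ? "aufklarer/CosyVoice3-0.5B-MLX-8bit"  // cleaner, fewer mid-segment drifts
    : "aufklarer/CosyVoice3-0.5B-MLX-4bit"  // size-sensitive default
let model = try await CosyVoiceTTSModel.fromPretrained(id: repoId)
```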
Five rules from the production tests.
1. Use 15–30 s of clean speech. No background music, no overlapping voices, no heavy compression. The reference embeds the speaker; anything else in the clip leaks into the clone.
2. Always pass the transcript. `--cosy-reference-transcript` gives the LLM the linguistic context for the prompt audio. Skipping it costs accuracy and produces mid-utterance drifts.
3. Long-form text is auto-segmented. The synthesiser splits on sentence boundaries and reuses the same voice profile across segments, so the voice stays consistent across chapters or full podcasts.
4. The 8-bit bundle sounds cleaner. Same architecture, lower quantisation noise in the LLM logits: it picks more text-aligned speech tokens and drifts less on long output.
5. Reuse `voiceProfile` across calls. `extractVoiceProfile(…)` runs the S3 tokenizer, mel extractor, and CAM++ once per reference. Reuse the resulting struct across every synthesis call.
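A minimal sketch of the profile-reuse rule for a multi-chapter job, using only the calls shown in the quickstart above; the chapter strings and output handling are placeholders:

```swift
import CosyVoiceTTS
import AudioCommon

let model = try await CosyVoiceTTSModel.fromPretrained()
let refAudio = try AudioFileLoader.load(
    url: URL(fileURLWithPath: "ref.wav"), targetSampleRate: 16_000)
let cacheDir = try HuggingFaceDownloader.getCacheDirectory(
    for: "aufklarer/CosyVoice3-0.5B-MLX-4bit")
let tokenizer = try SpeechTokenizerModel.fromSafetensors(
    at: cacheDir.appendingPathComponent("speech_tokenizer.safetensors"))

// Expensive step: S3 tokenizer + mel extractor + CAM++ run once here.
let profile = try model.extractVoiceProfile(
    audio: refAudio, sampleRate: 16_000,
    speechTokenizer: tokenizer,
    referenceTranscript: "Transcript of the reference clip."
)

// Cheap steps: every chapter reuses the same profile, so the voice stays
// consistent and the reference processing is never repeated.
for chapter in ["Chapter one text…", "Chapter two text…"] {
    let audio = model.synthesize(
        text: chapter,
        voiceProfile: profile,
        language: "english"
    )
    // write `audio` out per chapter…
}
```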
