Use case · Content creation

Clone a voice in 30 seconds.
Synthesise for hours.

Zero-shot voice cloning on Apple Silicon. Provide a 5–30 second reference clip and its transcript; CosyVoice 3 generates speech in that voice across nine languages, fully offline. No fine-tuning, no per-character pricing, no audio ever leaving the device.

What you can build

Where on-device cloning changes the math.

Cloud TTS bills per character and locks the voice on a server. On-device cloning lets you ship apps and creator tools where the voice belongs to your user and the per-use synthesis cost is zero.

AI narrators for video

YouTube voiceovers, course modules, marketing clips — your founder's voice across every episode.

Multilingual creatives

Same voice, nine languages. Cross-lingual cloning keeps the speaker identity even when the words switch.

Personalised audiobooks

Let the listener pick their narrator — a celebrity, a parent, a family voice — from a 30-second sample.

Educational content

Lecturers can record once, then have lessons re-narrated automatically as their text scripts evolve.

Quickstart

One CLI call. One reference clip. One transcript.

For the best clone quality, the model needs both the reference's acoustic prefix and its text transcript. Skipping the transcript falls back to the legacy CAM++-only path, with materially lower identity capture.

speech speak "Welcome to the demo." \
  --engine cosyvoice \
  --voice-sample ref.wav \
  --cosy-reference-transcript "Transcript of ref.wav (its text content)..." \
  --output out.wav
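
The transcript flag is the only difference between the two paths: drop it and the same command still runs, but through the legacy CAM++-only route described under Engines, with the quality cost noted above.

speech speak "Welcome to the demo." \
  --engine cosyvoice \
  --voice-sample ref.wav \
  --output out.wav
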
From Swift — extract the voice profile once, reuse it across many synthesis calls without re-encoding the reference each time:
import Foundation
import CosyVoiceTTS
import AudioCommon

// Load the pretrained CosyVoice 3 model.
let model = try await CosyVoiceTTSModel.fromPretrained()

// Decode the reference clip at 16 kHz, the rate the speech tokenizer expects.
let refAudio = try AudioFileLoader.load(
    url: URL(fileURLWithPath: "ref.wav"), targetSampleRate: 16_000)

// Locate the bundle in the local HuggingFace cache and load its speech tokenizer.
let cacheDir = try HuggingFaceDownloader.getCacheDirectory(
    for: "aufklarer/CosyVoice3-0.5B-MLX-4bit")
let tokenizer = try SpeechTokenizerModel.fromSafetensors(
    at: cacheDir.appendingPathComponent("speech_tokenizer.safetensors"))

// Encode the reference once: speech tokens, mel features, and the CAM++ embedding.
let profile = try model.extractVoiceProfile(
    audio: refAudio, sampleRate: 16_000,
    speechTokenizer: tokenizer,
    referenceTranscript: "Transcript of the reference clip."
)

// Synthesise with the cached profile; reuse `profile` for every later call.
let audio = model.synthesize(
    text: "Welcome to the demo.",
    voiceProfile: profile,
    language: "english"
)
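
A minimal sketch of that reuse, using the `model` and `profile` values from the snippet above; the chapter strings and the write-out step are placeholders:

// Only the profile extraction is per-reference work; each call below
// skips the S3 tokenizer, mel extractor, and CAM++ entirely.
let chapters = [
    "Chapter one. The reference clip is never re-encoded.",
    "Chapter two. Only the later synthesis stages run per call.",
]
for chapter in chapters {
    let audio = model.synthesize(
        text: chapter,
        voiceProfile: profile,
        language: "english"
    )
    _ = audio // write the segment out with your audio I/O of choice (not shown)
}
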
Engines

Three cloning paths, one stack.

CosyVoice 3 zero-shot is the recommended default. Qwen3-TTS ICL is the English-first alternative. The legacy CAM++-only path is what older bundles used before zero-shot landed.

Engine                | How                                      | Languages                      | Best for
CosyVoice 3 zero-shot | prompt_token + prompt_feat + transcript  | 9 (zh/en/ja/ko/de/es/fr/it/ru) | Default. Highest identity capture, multilingual creative work.
Qwen3-TTS ICL         | In-context learning with reference audio | EN, ZH                         | English-first projects, when you don't need the other 7 languages.
CosyVoice 3 (legacy)  | 192-d CAM++ speaker embedding only       | 9                              | Bundles without the speech tokenizer. Material drop in identity match.
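
At the CLI, the path is selected with the same `--engine` flag as in the Quickstart. Only `cosyvoice` is confirmed on this page; the Qwen3-TTS identifier below is an assumption, so verify it against your build before relying on it:

# "qwen3-tts" is a hypothetical engine name; only "cosyvoice" appears above.
speech speak "Welcome to the demo." \
  --engine qwen3-tts \
  --voice-sample ref.wav \
  --output out.wav
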
HuggingFace bundles

Pick a bundle by trading disk footprint against perceptual quality.

The 4-bit and 8-bit bundles ship the same flow, vocoder, and speech tokenizer; they differ only in LLM quantisation. The 8-bit LLM produces materially cleaner audio with fewer mid-segment drifts on long output; the 4-bit LLM is the size-sensitive default.
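
A sketch of selecting the 8-bit bundle from Swift. Two assumptions here, neither confirmed on this page: that `fromPretrained` accepts a HuggingFace repo id (the Quickstart shows only the no-argument form), and that the 8-bit bundle mirrors the 4-bit naming:

// Both the repo-id parameter and the "-8bit" suffix are unverified assumptions.
let model = try await CosyVoiceTTSModel.fromPretrained(
    "aufklarer/CosyVoice3-0.5B-MLX-8bit")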

Quality tips

Five rules from the production tests.

1. 5–30 s of clean speech. No background music, no overlapping voices, no heavy compression. The reference embeds the speaker — anything else in the clip leaks into the clone. (A prep sketch follows this list.)
2. Always pass the transcript. --cosy-reference-transcript gives the LLM the linguistic context for the prompt audio. Skipping it costs accuracy and produces mid-utterance drifts.
3. Long-form text is auto-segmented. The synthesiser splits on sentence boundaries and reuses the same voice profile across segments, so the voice stays consistent across chapters or full podcasts.
4. The 8-bit bundle sounds cleaner. Same architecture, lower quantisation noise in the LLM logits — it picks more text-aligned speech tokens and drifts less on long output.
5. Reuse voiceProfile across calls. extractVoiceProfile(…) runs the S3 tokenizer, mel extractor, and CAM++ once per reference; reuse the resulting struct across every synthesis call, as in the loop under Quickstart.
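
One way to prepare a clip that satisfies rule 1, using plain ffmpeg rather than anything in the speech CLI (the input filename is a placeholder; the Swift loader resamples anyway, so this mainly enforces length and channel count):

# Mono, 16 kHz, capped at 30 s, matching the loader settings in the Quickstart.
ffmpeg -i raw_recording.m4a -ac 1 -ar 16000 -t 30 ref.wav
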
Deeper reading

Component guides.