Clone a voice in 30 seconds.
Synthesise for hours.
Zero-shot voice cloning on Apple Silicon. Provide a 5–30 second reference clip and its transcript; CosyVoice 3 generates speech in that voice across nine languages, fully offline. No fine-tuning, no per-character pricing, no audio ever leaving the device.
Where on-device cloning changes the math.
Cloud TTS bills per character and locks the voice on a server. On-device cloning lets you ship apps and creator tools where the voice belongs to your user and the marginal synthesis cost is zero.
YouTube voiceovers, course modules, marketing clips — your founder's voice across every episode.
Same voice, nine languages. Cross-lingual cloning keeps the speaker identity even when the words switch.
Let the listener pick their narrator — a celebrity, a parent, a family voice — from a 30-second sample.
Lecturers can record once, then have lessons re-narrated automatically as their text scripts evolve.
One CLI call. One reference clip. One transcript.
For the best clone quality the model needs both the reference's acoustic prefix AND its text transcript. Skipping the transcript falls back to the legacy CAM++-only path with materially lower identity capture.
```bash
speech speak "Welcome to the demo." \
  --engine cosyvoice \
  --voice-sample ref.wav \
  --cosy-reference-transcript "Transcript of ref.wav (its text content)..." \
  --output out.wav
```

The same flow from Swift:

```swift
import CosyVoiceTTS
import AudioCommon

// Load the CosyVoice 3 model weights.
let model = try await CosyVoiceTTSModel.fromPretrained()

// Load the reference clip at the 16 kHz rate the tokenizer expects.
let refAudio = try AudioFileLoader.load(
    url: URL(fileURLWithPath: "ref.wav"), targetSampleRate: 16_000)

// Locate the speech tokenizer weights inside the downloaded bundle.
let cacheDir = try HuggingFaceDownloader.getCacheDirectory(
    for: "aufklarer/CosyVoice3-0.5B-MLX-4bit")
let tokenizer = try SpeechTokenizerModel.fromSafetensors(
    at: cacheDir.appendingPathComponent("speech_tokenizer.safetensors"))

// Extract the voice profile once; it is reusable across synthesis calls.
let profile = try model.extractVoiceProfile(
    audio: refAudio, sampleRate: 16_000,
    speechTokenizer: tokenizer,
    referenceTranscript: "Transcript of the reference clip."
)

// Synthesize new text in the cloned voice.
let audio = model.synthesize(
    text: "Welcome to the demo.",
    voiceProfile: profile,
    language: "english"
)
```

Three cloning paths, one stack.
CosyVoice 3 zero-shot is the recommended default. Qwen3-TTS ICL is the English-first alternative. The legacy CAM++-only path is what older bundles used before zero-shot landed.
| Engine | How | Languages | Best for |
|---|---|---|---|
| CosyVoice 3 zero-shot | prompt_token + prompt_feat + transcript | 9 (zh/en/ja/ko/de/es/fr/it/ru) | Default. Highest identity capture, multilingual creative work. |
| Qwen3-TTS ICL | In-context-learning with reference audio | EN, ZH | English-first projects, when you don't need the other 7 languages. |
| CosyVoice 3 (legacy) | 192-d CAM++ speaker embedding only | 9 | Bundles without the speech tokenizer. Material drop in identity match. |
Pick a bundle by disk vs perceptual quality.
Both bundles ship the same flow, vocoder, and speech tokenizer; they differ only in LLM quantisation. The 8-bit LLM produces materially cleaner audio with fewer mid-segment drifts on long output; the 4-bit LLM is the size-sensitive default.
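As a sketch, selecting a bundle at load time could look like this. The `id:` parameter on `fromPretrained` and the `-8bit` repo name are assumptions for illustration, not confirmed API:

```swift
import CosyVoiceTTS

let preferQuality = true  // trade disk for cleaner long-form audio

// Hypothetical: assumes fromPretrained accepts a Hub repo id and that
// the 8-bit bundle is published under a "-8bit" suffix.
let repoId = preferQuality
    ? "aufklarer/CosyVoice3-0.5B-MLX-8bit"  // cleaner, fewer mid-segment drifts
    : "aufklarer/CosyVoice3-0.5B-MLX-4bit"  // size-sensitive default
let model = try await CosyVoiceTTSModel.fromPretrained(id: repoId)
```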
Five rules from the production tests.
1. Use 15–30 s of clean speech. No background music, no overlapping voices, no heavy compression. The reference embeds the speaker; anything else in the clip leaks into the clone.
2. Always pass the transcript. `--cosy-reference-transcript` gives the LLM the linguistic context for the prompt audio. Skipping it costs accuracy and produces mid-utterance drifts.
3. Long-form text is auto-segmented. The synthesiser splits on sentence boundaries and reuses the same voice profile across segments, so the voice stays consistent across chapters or full podcasts.
4. The 8-bit bundle sounds cleaner. Same architecture, lower quantisation noise in the LLM logits: it picks more text-aligned speech tokens and drifts less on long output.
5. Reuse `voiceProfile` across calls. `extractVoiceProfile(…)` runs the S3 tokenizer, mel extractor, and CAM++ once per reference. Reuse the resulting struct across every synthesis call.
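A minimal sketch of the profile-reuse rule for a multi-chapter job, using only the calls shown in the quickstart above; the chapter strings and output handling are placeholders:

```swift
import CosyVoiceTTS
import AudioCommon

let model = try await CosyVoiceTTSModel.fromPretrained()
let refAudio = try AudioFileLoader.load(
    url: URL(fileURLWithPath: "ref.wav"), targetSampleRate: 16_000)
let cacheDir = try HuggingFaceDownloader.getCacheDirectory(
    for: "aufklarer/CosyVoice3-0.5B-MLX-4bit")
let tokenizer = try SpeechTokenizerModel.fromSafetensors(
    at: cacheDir.appendingPathComponent("speech_tokenizer.safetensors"))

// Expensive step: S3 tokenizer + mel extractor + CAM++ run once here.
let profile = try model.extractVoiceProfile(
    audio: refAudio, sampleRate: 16_000,
    speechTokenizer: tokenizer,
    referenceTranscript: "Transcript of the reference clip."
)

// Cheap steps: every chapter reuses the same profile, so the voice stays
// consistent and the reference processing is never repeated.
for chapter in ["Chapter one text…", "Chapter two text…"] {
    let audio = model.synthesize(
        text: chapter,
        voiceProfile: profile,
        language: "english"
    )
    // write `audio` out per chapter…
}
```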
