Use case · Content creation

Clone a voice in 30 seconds.
Synthesise for hours.

Zero-shot voice cloning on Apple Silicon. Provide a 5–30 second reference clip and its transcript; CosyVoice 3 generates speech in that voice across nine languages, fully offline. No fine-tuning, no per-character pricing, no audio ever leaving the device.
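As a rough illustration of the input requirement above, the sketch below checks that a reference clip falls in the 5–30 second window and that a transcript is supplied before anything is handed to a synthesis step. It uses only Python's standard `wave` module and assumes an uncompressed PCM WAV file. The function names `clip_duration_seconds` and `is_usable_reference` are hypothetical helpers for this example, not part of CosyVoice 3 or any other library.

```python
import wave

# Bounds taken from the 5-30 second reference clip guidance above.
MIN_SECONDS = 5.0
MAX_SECONDS = 30.0


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str, transcript: str) -> bool:
    """Hypothetical pre-flight check: clip length in range and transcript non-empty."""
    duration = clip_duration_seconds(path)
    return MIN_SECONDS <= duration <= MAX_SECONDS and bool(transcript.strip())
```

A check like this fails fast on a clip that is too short, too long, or missing its transcript, which is cheaper than discovering the problem after a long render.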

What you can build

Five voice-cloning recipes.

Each recipe centres on CosyVoice 3 for the actual synthesis but mixes in different pre- and post-processing components: speaker embeddings for voice matching, denoising to clean up the reference clip, and Qwen3-TTS in-context learning (ICL) when you only have audio.

Audiobook narration

Clone the author's voice, or any voice you choose, once, then render hours of consistent narration.

Dubbing & localisation

Keep a presenter's voice across translated tracks, in nine languages.

Character voices

Two to four custom voices per scene, assigned via inline speaker tags.

Personal-voice TTS

Restore a familiar voice for users who can no longer speak naturally.

Brand voice

A single consistent narrator across an entire product line.

Deeper reading

Component guides.