Use case · Content creation

Any voice.
Any length.

Three shapes of speech generation: clone a voice in seconds from a short reference clip, render high-quality neutral TTS faster than real time, or produce hour-long audiobooks and multi-speaker podcasts. All on-device.

Three sub-use-cases

Three flavours of synthesis.

Zero-shot cloning for personalised voices, fast neutral TTS for app UI, or long-form synthesis for narration and dialogue. Different engines, same on-device stack.

Voice cloning

Clone a voice from a short reference clip. Use IndexTTS2 for native MLX cloning with emotion and tempo controls, CosyVoice 3 for transcript-conditioned multilingual zero-shot cloning, Chatterbox Flash for Core ML, or Qwen3-TTS ICL for English and Chinese.

Standard TTS

High-quality neutral speech, faster than real time. Compact bundles for app UI, accessibility, and in-app narration.

Long-form & multi-speaker

Audiobook chapters with a consistent narrator, or multi-speaker podcasts of up to 90 minutes with inline speaker tags.

Deeper reading