Use case · Content creation
Any voice.
Any length.
Three shapes of speech generation: clone a voice in seconds from a short reference clip, render high-quality neutral TTS faster than real time, or produce hour-long audiobooks and multi-speaker podcasts. All on-device.
Three sub-use-cases
Three flavours of synthesis.
Zero-shot cloning for personalised voices, fast neutral TTS for app UI, or long-form synthesis for narration and dialogue. Different engines, same on-device stack.
Voice cloning
Clone a voice from a 5–30 s reference clip. Zero-shot, no fine-tuning, across nine languages.
Standard TTS
High-quality neutral speech, faster than real time. Compact bundles for app UI, accessibility, and in-app narration.
Long-form & multi-speaker
Audiobook chapters with a consistent narrator, or multi-speaker podcasts up to 90 min with inline speaker tags.
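"Faster than real time" in the Standard TTS card can be made concrete with the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where a value below 1 means speech is generated faster than it plays back. The sketch below is illustrative only. The function names and the sample timings are hypothetical and are not measurements or APIs from this product.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return compute time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def is_faster_than_real_time(synthesis_seconds: float, audio_seconds: float) -> bool:
    """True when the audio is produced in less time than it takes to play."""
    return real_time_factor(synthesis_seconds, audio_seconds) < 1.0


# Hypothetical timing: 2.0 s of compute to produce 10.0 s of speech.
rtf = real_time_factor(2.0, 10.0)
print(f"RTF = {rtf:.2f}")  # RTF = 0.20
```

To check the claim on a given device, time one synthesis call with a wall clock and divide by the length of the returned audio.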
Deeper reading
