Use case · Content creation

Any voice.
Any length.

Three shapes of speech generation: clone a voice in seconds from a short reference clip, render high-quality neutral TTS faster than real time, or produce hour-long audiobooks and multi-speaker podcasts. All on-device.

Three sub-use-cases

Three flavours of synthesis.

Zero-shot cloning for personalised voices, fast neutral TTS for app UI, or long-form for narration and dialogue. Different engines, same on-device stack.

Voice cloning

Clone a voice from a 5–30 s reference clip. Zero-shot, no fine-tuning, across nine languages.

bf16 / 8-bit MLX bundles · 9 languages · zero-shot from a 5–30 s clip

CosyVoice 3 zero-shot

Standard TTS

High-quality neutral speech, faster than real time. Compact bundles for app UI, accessibility, and in-app narration. Kokoro is the recommended default; reach for Qwen3-TTS when you need English/Mandarin in-context cloning from a short audio clip.

~80 MB · 50 voices · ~45 ms inference on Neural Engine

Studio-quality 48 kHz output, the only engine in the on-device stack at that sample rate, plus instruction-driven voice design ("a young woman, warm and gentle"). A 2B-parameter diffusion-autoregressive model covering 30 languages. Production bundles ship in bf16 and int8.

48 kHz · 30 languages · voice design + cloning · int8 bundle ~3 GB

VoxCPM2

Long-form & multi-speaker

Audiobook chapters with a consistent narrator, or multi-speaker podcasts up to 90 min with inline speaker tags.

Up to 90 min · 4 voices in one pass · RTFx 1.48 (about 1.5× faster than real time) on M2 Max, INT4

VibeVoice 1.5B

CosyVoice 3 segmented

Quickstart · Standard TTS

One brew install. Speech in any voice.

The CLI ships with every TTS engine pre-wired. Kokoro is the default for general-purpose synthesis; change the --engine flag to switch to CosyVoice 3 (cloning) or VibeVoice (long-form, multi-speaker).

brew install speech

# Standard TTS — pick a Kokoro voice and synthesize
speech speak "Hello from on-device synthesis." \
  --engine kokoro --voice af_alloy --output hello.wav

# Voice cloning from a 5–30 s reference clip + its transcript
speech speak "Welcome to my podcast." \
  --engine cosyvoice \
  --voice-sample reference.wav \
  --cosy-reference-transcript "Reference clip transcript here." \
  --output cloned.wav

# Long-form, multi-speaker — inline speaker tags
speech speak --engine vibevoice --output episode.wav <<'EOF'
[1] Welcome to the show. Today we're talking about on-device speech.
[2] Thanks for having me.
[1] Let's start with the basics.
EOF
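
The inline speaker tags in the long-form example can also be generated from a structured transcript. The following is a minimal sketch, assuming a hypothetical tab-separated input with one speaker<TAB>text turn per line; the to_tagged_script helper is illustrative and not part of the CLI. It rejects speaker ids outside 1 to 4, matching the four-voice limit quoted for long-form synthesis.

```shell
#!/bin/sh
# Convert "speaker<TAB>text" lines on stdin into the "[n] text" tag format
# used in the long-form quickstart. The TSV input layout is an assumption
# made for this sketch, not something the CLI defines.
to_tagged_script() {
  awk -F '\t' '
    NF < 2 { next }                         # skip blank or malformed lines
    $1 !~ /^[1-4]$/ {                       # 4 voices in one pass
      printf "line %d: speaker must be 1-4, got \"%s\"\n", NR, $1 > "/dev/stderr"
      bad = 1
      exit
    }
    {
      text = $0
      sub(/^[^\t]*\t/, "", text)            # keep everything after the first tab
      printf "[%s] %s\n", $1, text
    }
    END { exit (bad ? 1 : 0) }
  '
}
```

If the CLI reads its script from standard input, as the heredoc example suggests, the output could be piped straight in: to_tagged_script < turns.tsv | speech speak --engine vibevoice --output episode.wav (turns.tsv is a placeholder file name).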

On-device performance

Numbers from M2 Max.

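Kokoro, Qwen3-TTS, and CosyVoice 3 run faster than real time on Apple Silicon, and VoxCPM2 (int8) runs at roughly real time (RTF ~1.0, where RTF is processing time divided by audio duration). Kokoro and VibeVoice Realtime also fit on iOS. Pick by audio quality vs. bundle size; latency is rarely the deciding factor.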

Kokoro 82M

~45 ms

50 voices · Core ML / ONNX · iOS-friendly

Qwen3-TTS

37 ms/step

12 Hz codec LM · faster than real-time · EN/ZH ICL

CosyVoice 3 (8-bit)

RTF ~0.45

9 languages · zero-shot cloning · MLX

VoxCPM2 (int8)

RTF ~1.0

48 kHz · 30 languages · voice design + cloning · MLX
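
To turn these figures into expected wait times, a short sketch of the arithmetic follows. It assumes the usual definitions: RTF is processing time divided by audio duration (lower is faster), and RTFx is its inverse (higher is faster). It also assumes that a "12 Hz codec LM" needs 12 decoding steps per second of audio. The function names are illustrative only.

```shell
#!/bin/sh
# Estimated wall-clock seconds to synthesize AUDIO_SECONDS of speech at a given RTF.
# usage: synth_seconds AUDIO_SECONDS RTF
synth_seconds() {
  awk -v audio="$1" -v rtf="$2" 'BEGIN { printf "%.1f\n", audio * rtf }'
}

# Convert a speed factor (RTFx) to RTF, assuming RTFx = 1 / RTF.
# usage: rtfx_to_rtf RTFX
rtfx_to_rtf() {
  awk -v rtfx="$1" 'BEGIN { printf "%.3f\n", 1 / rtfx }'
}

# Approximate RTF of a codec LM from its per-step latency and frame rate,
# assuming one decoding step per codec frame.
# usage: step_rtf MS_PER_STEP FRAMES_PER_SECOND
step_rtf() {
  awk -v ms="$1" -v hz="$2" 'BEGIN { printf "%.3f\n", ms * hz / 1000 }'
}
```

Under these assumptions, synth_seconds 60 0.45 gives 27.0 s for one minute of CosyVoice 3 audio, rtfx_to_rtf 1.48 gives an RTF of 0.676 for VibeVoice, and step_rtf 37 12 gives 0.444 for Qwen3-TTS, which is consistent with its "faster than real-time" label.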