Use case · Content creation

Any voice.
Any length.

Three shapes of speech generation: clone a voice in seconds from a short reference clip, render high-quality neutral TTS faster than real time, or produce hour-long audiobooks and multi-speaker podcasts. All on-device.

Three sub-use-cases

Three flavours of synthesis.

Zero-shot cloning for personalised voices, fast neutral TTS for app UI, or long-form for narration and dialogue. Different engines, same on-device stack.

Quickstart · Standard TTS

One brew install. Speech in any voice.

The CLI ships with every TTS engine pre-wired. Kokoro is the default for general-purpose synthesis; set the --engine flag to switch to CosyVoice 3 (cloning) or VibeVoice (long-form, multi-speaker).

brew install soniqo/tap/speech

# Standard TTS — pick a Kokoro voice and synthesize
speech speak "Hello from on-device synthesis." \
  --engine kokoro --voice af_alloy --output hello.wav

# Voice cloning from a 5–30 s reference clip + its transcript
speech speak "Welcome to my podcast." \
  --engine cosyvoice \
  --voice-sample reference.wav \
  --cosy-reference-transcript "Reference clip transcript here." \
  --output cloned.wav

# Long-form, multi-speaker — inline speaker tags
speech speak --engine vibevoice --output episode.wav <<'EOF'
[1] Welcome to the show. Today we're talking about on-device speech.
[2] Thanks for having me.
[1] Let's start with the basics.
EOF
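
For longer episodes the tagged script is easier to generate than to type. The sketch below assumes the dialogue is kept as tab-separated lines (speaker number, then text) and converts them to the [n] tags used in the example above. tag_turns is a hypothetical helper, not part of the speech CLI, and it assumes the text itself contains no tab characters.

```shell
# tag_turns: hypothetical helper, not part of the speech CLI.
# Reads "speaker<TAB>text" lines on stdin and prints "[speaker] text" lines.
# Lines without a numeric speaker and a text field are skipped.
tag_turns() {
  awk -F '\t' 'NF >= 2 && $1 ~ /^[0-9]+$/ { printf "[%s] %s\n", $1, $2 }'
}
```

Because the helper only writes to stdout, its output should be usable in place of the heredoc, for example: tag_turns < episode.tsv | speech speak --engine vibevoice --output episode.wav (episode.tsv is a placeholder file name).
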

On-device performance

Numbers from M2 Max.

The engines listed here run at or faster than real time on Apple Silicon; Kokoro and VibeVoice Realtime also fit on iOS. Pick by audio quality versus bundle size: since every engine keeps up with playback, latency is not the deciding factor.
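
The figures in this section mix two conventions, RTF and RTFx. The sketch below turns either into an estimated render time, assuming the usual definitions: RTF is compute time divided by audio duration (lower is faster) and RTFx is the inverse (higher is faster). Both function names are hypothetical helpers, not part of the speech CLI.

```shell
# Hypothetical helpers for reading RTF / RTFx figures; not part of the speech CLI.
# Assumes RTF  = compute seconds / audio seconds (lower is faster)
#     and RTFx = audio seconds / compute seconds (higher is faster).

# render_seconds_rtf RTF AUDIO_SECONDS -> estimated compute seconds
render_seconds_rtf() {
  awk -v rtf="$1" -v dur="$2" 'BEGIN { printf "%.1f\n", rtf * dur }'
}

# render_seconds_rtfx RTFX AUDIO_SECONDS -> estimated compute seconds
render_seconds_rtfx() {
  awk -v rtfx="$1" -v dur="$2" 'BEGIN { printf "%.1f\n", dur / rtfx }'
}
```

Under those assumptions, a 60-second clip at RTF 0.45 takes about 27 seconds to render, and a 90-minute episode at RTFx 1.48 takes about 61 minutes.
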

Kokoro 82M
~45 ms
50 voices · CoreML / ONNX · iOS-friendly

Qwen3-TTS
37 ms/step
12 Hz codec LM · faster than real-time · EN/ZH ICL

CosyVoice 3 (8-bit)
RTF ~0.45
9 languages · zero-shot cloning · MLX

VoxCPM2 (int8)
RTF ~1.0
48 kHz · 30 languages · voice design + cloning · MLX

VibeVoice 1.5B
RTFx 1.48
M2 Max INT4 · up to 90 min · 4 voices

VibeVoice Realtime
~350 MB INT4
EN-only streaming · pre-baked voice caches

speech-server
POST /speak
Local REST · any TTS engine selectable per request
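
Since the engine is selectable per request, a client only has to vary one field between calls. The sketch below builds a request body for POST /speak. The JSON field names (text, engine), the base URL variable and the idea that the response is an audio file are all illustrative assumptions, not the documented speech-server schema.

```shell
# Hypothetical client-side helpers; the request schema is assumed, not documented here.

# json_escape VALUE -> VALUE with backslashes and double quotes escaped.
# Newlines and control characters are not handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# speak_body TEXT ENGINE -> JSON body with assumed field names "text" and "engine".
speak_body() {
  printf '{"text":"%s","engine":"%s"}\n' "$(json_escape "$1")" "$(json_escape "$2")"
}

# Possible use, with SPEECH_SERVER_URL as a placeholder for the local server address:
# curl -X POST "$SPEECH_SERVER_URL/speak" \
#   -H 'Content-Type: application/json' \
#   -d "$(speak_body 'Hello from on-device synthesis.' kokoro)" \
#   --output hello.wav
```

Keeping the body construction in one function means switching from Kokoro to another engine is a one-argument change on the client side.
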

Deeper reading

Component guides.