Any voice.
Any length.
Three shapes of speech generation — clone a voice in seconds from a short reference clip, render high-quality neutral TTS at faster-than-real-time, or produce hour-long audiobooks and multi-speaker podcasts. All on-device.
Three flavours of synthesis.
Zero-shot cloning for personalised voices, fast neutral TTS for app UI, or long-form for narration and dialogue. Different engines, same on-device stack.
Clone a voice from a 5–30 s reference clip. Zero-shot, no fine-tuning, across nine languages.
High-quality neutral speech, faster than real-time. Compact bundles for app UI, accessibility, in-app narration. Kokoro is the recommended default; reach for Qwen3-TTS when you need English/Mandarin in-context cloning from a short audio clip.
Studio-quality output at 48 kHz, the only sample rate that high in the on-device stack, plus instruction-driven voice design ("a young woman, warm and gentle"). 2B-parameter diffusion-AR model with 30-language coverage. Three quantised bundles (bf16/int8/int4).
Audiobook chapters with a consistent narrator, or multi-speaker podcasts up to 90 min with inline speaker tags.
One brew install. Speech in any voice.
The CLI ships with every TTS engine pre-wired. Kokoro is the default for general-purpose synthesis; change the --engine flag to switch to CosyVoice 3 (cloning) or VibeVoice (long-form, multi-speaker).
brew install soniqo/tap/speech
# Standard TTS — pick a Kokoro voice and synthesise
speech speak "Hello from on-device synthesis." \
--engine kokoro --voice af_alloy --output hello.wav
# Voice cloning from a 5–30 s reference clip + its transcript
speech speak "Welcome to my podcast." \
--engine cosyvoice \
--voice-sample reference.wav \
--cosy-reference-transcript "Reference clip transcript here." \
--output cloned.wav
# Long-form, multi-speaker — inline speaker tags
speech speak --engine vibevoice --output episode.wav <<'EOF'
[1] Welcome to the show. Today we're talking about on-device speech.
[2] Thanks for having me.
[1] Let's start with the basics.
EOF
Numbers from M2 Max.
All four engines run faster than real-time on Apple Silicon; Kokoro and VibeVoice Realtime also fit on iOS. Pick by audio quality vs. bundle size — there's no latency dimension to optimise.
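"Faster than real-time" is usually expressed as a real-time factor (RTF): synthesis time divided by the duration of the audio produced, where anything below 1.0 is faster than real-time. The sketch below is a hypothetical helper, not part of the speech CLI. It assumes you have already measured the elapsed seconds yourself (for example by timing a speech speak run) and obtained the clip duration from an audio tool of your choice.

```shell
# rtf ELAPSED_SECONDS AUDIO_SECONDS
# Hypothetical helper (not shipped with the CLI): prints the real-time
# factor, elapsed / audio, rounded to two decimals.
# Values below 1.00 mean synthesis ran faster than real-time.
rtf() {
  awk -v e="$1" -v a="$2" 'BEGIN {
    if (a <= 0) {
      print "audio duration must be positive" > "/dev/stderr"
      exit 1
    }
    printf "%.2f\n", e / a
  }'
}

# Made-up numbers for illustration: 1.5 s of compute for a 6 s clip.
rtf 1.5 6
```

An RTF of 0.25 would mean the clip took a quarter of its own playback time to generate. The inputs here are placeholders, not measurements from any engine.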
