Hours of audio.
One consistent voice.
Audiobook chapters, podcast episodes, training narration — rendered on-device on Apple Silicon, Android, or embedded Linux. Automatic segmentation keeps the voice stable across hours; multi-speaker mode handles dialogue between named characters.
Five long-form shapes.
Each engine has a sweet spot. Audiobooks lean on CosyVoice 3 for narrator fidelity. Multi-speaker podcasts lean on VibeVoice for episode-length context. Real-time / streaming uses the smaller VibeVoice Realtime.
Full-chapter passes with one consistent narrator voice. Automatic sentence-level segmentation, no manual stitching.
Inline speaker tags drive turn-taking. Cast two to four voices for an episode-length scripted show.
Generate as the listener listens. VibeVoice Realtime keeps the latency low enough for live conversations.
Newsletter-length articles, blog posts, internal docs — rendered as natural narration without screen-reader pacing.
Long-form content access for users with print or visual impairments, fully offline.
Pick one engine per use case.
All four engines run on Apple Silicon via MLX or CoreML; VibeVoice 1.5B and CosyVoice 3 are the workhorses for anything past five minutes.
| Engine | Max length | Multi-speaker | Best for |
|---|---|---|---|
| CosyVoice 3 segmented | Unlimited (auto sentence-split) | Inline [S1] / [S2] tags | Audiobooks, narrator-led content, 9 languages. |
| VibeVoice 1.5B | 90 minutes | Up to 4 speakers | Episode-length podcasts, multi-voice dialogue. |
| VibeVoice Realtime 0.5B | 90 minutes | Yes | Streaming output for live podcast generation. |
| Kokoro 82M | Short-to-medium | No — single voice per call | Cheap baseline narration, 50 voices, ~45 ms / utterance. |
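
For quick, single-voice narration with no reference sample, Kokoro is the cheapest path. A minimal sketch only: the `kokoro` engine identifier and the no-sample default voice are assumptions here (the documented examples below only show `--engine cosyvoice`), so verify the exact values against the CLI's help output.

```bash
# Baseline narration with the small Kokoro engine, no reference audio.
# "kokoro" as an --engine value is an assumption; only "cosyvoice" appears
# in the documented examples, and voice-selection flags are omitted here.
speech speak "Chapter summaries are ready for review." \
  --engine kokoro \
  --output summary.wav
```

At roughly 45 ms per utterance it is also the natural choice when you need many short clips in a loop rather than one long pass.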
One narrator, automatic sentence split.
Pass a whole chapter in as one argument: the synthesiser splits on sentence boundaries and stitches the audio with a short silence gap. The voice stays consistent across the entire chapter because the same VoiceProfile is reused for every segment.
speech speak "$(cat chapter1.txt)" \
--engine cosyvoice \
--voice-sample narrator.wav \
--cosy-reference-transcript "..." \
--output chapter1.wavInline speaker tags for dialogue.
Two flavours: CosyVoice 3 for a tag-per-line dialogue, or VibeVoice 1.5B when you want an episode-length conversation with up to four voices.
speech speak "[Host] Welcome back. [Guest] Thanks for having me." \
--engine cosyvoice \
--speakers host=alice.wav,guest=bob.wav \
--output episode.wavspeech vibevoice script.txt \
--output podcast.wav \
--speakers Alice=alice.wav,Bob=bob.wavWhy long output stays coherent.
LLM-based TTS drifts when the autoregressive context grows past its training window. CosyVoice 3 segments the input text on sentence boundaries and re-anchors the voice profile per segment, so the model always works inside its reliable range. Four mechanisms make that work:
1. Sentence-boundary split. Input is broken on `.` / `!` / `?`, with fall-back clause splits at commas and semicolons for over-long sentences. Short fragments merge forward so segments never drop below ~4 words.
2. Profile reused per segment. The same speaker embedding, prompt FSQ codes, and prompt mel feed every segment, so the voice timbre stays identical from minute one to hour two.
3. Stitched with short silence. A default 200 ms gap between segments sounds like a natural breath, masks any boundary artifact, and matches the cadence of a human read.
4. Per-segment `maxTokens` scales to content length. A short sentence gets a small token budget; a long one gets more. This prevents the LLM from filling time with repetitions when the content runs out.
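
The same pipeline can be approximated by hand with the flags shown above plus standard command-line tools, which makes the mechanism easier to inspect. This is only a sketch, not what the CLI does internally: it assumes GNU sed and SoX are installed, and it skips the fragment-merging and per-segment token-budget steps.

```bash
# Manual approximation of the segmented mode: sentence split, per-segment
# synthesis with one reused narrator sample, then stitching with 200 ms gaps.
# Assumes GNU sed and SoX; skips fragment merging and token budgeting.

# 1. Naive sentence split: start a new line after ., ! or ? followed by a space.
sed 's/\([.!?]\) /\1\n/g' chapter1.txt > sentences.txt

# 2. Synthesise each sentence with the same voice sample so the timbre
#    is identical in every segment.
i=0
while IFS= read -r sentence; do
  [ -n "$sentence" ] || continue
  speech speak "$sentence" \
    --engine cosyvoice \
    --voice-sample narrator.wav \
    --cosy-reference-transcript "..." \
    --output "$(printf 'seg-%04d.wav' "$i")" < /dev/null
  i=$((i + 1))
done < sentences.txt

# 3. Pad each segment with 200 ms of trailing silence, then concatenate.
for f in seg-*.wav; do sox "$f" "padded-$f" pad 0 0.2; done
sox padded-seg-*.wav chapter1-manual.wav
```

The built-in segmented mode does all of this in a single pass, so the manual version is mainly useful for checking where segment boundaries fall when a chapter sounds off.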
