Hours of audio.
One consistent voice.
Audiobook chapters, podcast episodes, training narration — rendered on-device on Apple Silicon, Android, or embedded Linux. Automatic segmentation keeps the voice stable across hours; multi-speaker mode handles dialogue between named characters.
Five long-form shapes.
Each engine has a sweet spot. Audiobooks lean on CosyVoice 3 for narrator fidelity. Multi-speaker podcasts lean on VibeVoice for episode-length context. Real-time / streaming uses the smaller VibeVoice Realtime.
Full-chapter passes with one consistent narrator voice. Automatic sentence-level segmentation, no manual stitching.
Inline speaker tags drive turn-taking. Cast two to four voices for an episode-length scripted show.
Generate as the listener listens. VibeVoice Realtime keeps the latency low enough for live conversations.
Newsletter-length articles, blog posts, internal docs — rendered as natural narration without screen-reader pacing.
Long-form content access for users with print or visual impairments, fully offline.
Pick one engine per use case.
All four engines run on Apple Silicon via MLX or CoreML; VibeVoice 1.5B and CosyVoice 3 are the workhorses for anything past five minutes.
| Engine | Max length | Multi-speaker | Best for |
|---|---|---|---|
| CosyVoice 3 segmented | Unlimited (auto sentence-split) | Inline [S1] / [S2] tags | Audiobooks, narrator-led content, 9 languages. |
| VibeVoice 1.5B | 90 minutes | Up to 4 speakers | Episode-length podcasts, multi-voice dialogue. |
| VibeVoice Realtime 0.5B | 90 minutes | Yes | Streaming output for live podcast generation. |
| Kokoro 82M | Short-to-medium | No — single voice per call | Cheap baseline narration, 50 voices, ~45 ms / utterance. |
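
For quick, single-voice narration with no reference sample, Kokoro is the cheapest path. A minimal sketch only: the `kokoro` engine identifier and the no-sample default voice are assumptions here (the documented examples below only show `--engine cosyvoice`), so verify the exact values against the CLI's help output.

```bash
# Baseline narration with the small Kokoro engine, no reference audio.
# "kokoro" as an --engine value is an assumption; only "cosyvoice" appears
# in the documented examples, and voice-selection flags are omitted here.
speech speak "Chapter summaries are ready for review." \
  --engine kokoro \
  --output summary.wav
```

At roughly 45 ms per utterance it is also the natural choice when you need many short clips in a loop rather than one long pass.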
One narrator, automatic sentence split.
Pass a whole chapter in as one argument: the synthesiser splits on sentence boundaries and stitches the audio with a short silence gap. The voice stays consistent across the entire chapter because the same VoiceProfile is reused for every segment.
speech speak "$(cat chapter1.txt)" \
--engine cosyvoice \
--voice-sample narrator.wav \
--cosy-reference-transcript "..." \
--output chapter1.wavInline speaker tags for dialogue.
Two flavours: CosyVoice 3 for a tag-per-line dialogue, or VibeVoice 1.5B when you want an episode-length conversation with up to four voices.
speech speak "[Host] Welcome back. [Guest] Thanks for having me." \
--engine cosyvoice \
--speakers host=alice.wav,guest=bob.wav \
--output episode.wavspeech vibevoice script.txt \
--output podcast.wav \
--speakers Alice=alice.wav,Bob=bob.wavWhy long output stays coherent.
LLM-based TTS drifts when the autoregressive context grows past its training window. CosyVoice 3 segments the input text on sentence boundaries and re-anchors the voice profile per segment, so the model always works inside its reliable range. Four mechanisms make that work:
1. Sentence-boundary split. Input is broken on `.` / `!` / `?`, with fall-back clause splits at commas and semicolons for over-long sentences. Short fragments merge forward so segments never drop below ~4 words.
2. Profile reused per segment. The same speaker embedding, prompt FSQ codes, and prompt mel feed every segment, so the voice timbre stays identical from minute one to hour two.
3. Stitched with short silence. A default 200 ms gap between segments sounds like a natural breath, masks any boundary artifact, and matches the cadence of a human read.
4. Per-segment `maxTokens` scales to content length. A short sentence gets a small token budget; a long one gets more. This prevents the LLM from filling time with repetitions when the content runs out.
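
The same pipeline can be approximated by hand with the flags shown above plus standard command-line tools, which makes the mechanism easier to inspect. This is only a sketch, not what the CLI does internally: it assumes GNU sed and SoX are installed, and it skips the fragment-merging and per-segment token-budget steps.

```bash
# Manual approximation of the segmented mode: sentence split, per-segment
# synthesis with one reused narrator sample, then stitching with 200 ms gaps.
# Assumes GNU sed and SoX; skips fragment merging and token budgeting.

# 1. Naive sentence split: start a new line after ., ! or ? followed by a space.
sed 's/\([.!?]\) /\1\n/g' chapter1.txt > sentences.txt

# 2. Synthesise each sentence with the same voice sample so the timbre
#    is identical in every segment.
i=0
while IFS= read -r sentence; do
  [ -n "$sentence" ] || continue
  speech speak "$sentence" \
    --engine cosyvoice \
    --voice-sample narrator.wav \
    --cosy-reference-transcript "..." \
    --output "$(printf 'seg-%04d.wav' "$i")" < /dev/null
  i=$((i + 1))
done < sentences.txt

# 3. Pad each segment with 200 ms of trailing silence, then concatenate.
for f in seg-*.wav; do sox "$f" "padded-$f" pad 0 0.2; done
sox padded-seg-*.wav chapter1-manual.wav
```

The built-in segmented mode does all of this in a single pass, so the manual version is mainly useful for checking where segment boundaries fall when a chapter sounds off.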
