# VibeVoice
Microsoft VibeVoice is a long-form, multi-speaker text-to-speech model for English and Chinese. Unlike short-utterance TTS, it's designed to generate podcast-length dialogue, audiobook narration, and multi-speaker scenes in a single pass — up to 90 minutes with up to 4 distinct voices and consistent identity throughout. Two variants ship: Realtime-0.5B for low-latency streaming and 1.5B for long-form flagship quality.
## What it is
- Long-form in one pass — up to 90 minutes of audio with consistent voices across the whole output; no per-sentence handoff
- Multi-speaker dialogue — 4 distinct speakers at once, each conditioned by its own voice cache
- English + Chinese — trained audio data is EN/ZH only; other languages are not supported (tokenizer accepts them but output is unintelligible)
- 24 kHz mono output — Float32 PCM, drop-in for `AudioCommon.WAVWriter` and `StreamingAudioPlayer`
- MIT license — model weights and our Swift port are both MIT; INT4 quantized derivatives are allowed
## Architecture
Four cooperating components produce audio one 7.5 Hz latent at a time:
| Component | Description |
|---|---|
| Split Qwen2 backbone | 24-layer Qwen2.5 decoder (896 hidden, GQA 14/2 for Realtime-0.5B). The model is split: the lower 4 layers form a text LM, the upper 20 layers run as the TTS LM. Text windows (5 tokens at a time) flow through both; generated speech latents flow only through the TTS LM. |
| σ-VAE acoustic tokenizer | Streaming conv stack that encodes 24 kHz audio to a 64-dim latent at 7.5 Hz (3200× temporal downsample) and decodes latents back to waveform. Used for both voice-cache creation and final audio decode. |
| Diffusion head | Small 4-layer DDPM head with adaLN modulation. Samples each speech latent via 20-step DPM-Solver with classifier-free guidance (cfg = 1.3 default for Realtime-0.5B, 1.5 for 1.5B). |
| EOS classifier | Per-step binary classifier on the TTS LM's last hidden state. When sigmoid probability exceeds 0.5, generation stops. |
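Putting the four components together, one generation step can be sketched as follows. This is illustrative pseudocode only; the names below are not the port's actual API:

```
# One generation step (illustrative, not the actual port API)
while not done:
    feed next 5-token text window through the text LM (lower 4 layers)
    hidden = TTS LM (upper 20 layers) over text states + prior speech latents
    latent = diffusion_head.sample(hidden,
                                   steps=20,   # DPM-Solver
                                   cfg=1.3)    # 1.5 for the 1.5B preset
    append latent to context                   # 64-dim, 7.5 Hz
    done = sigmoid(eos_classifier(hidden)) > 0.5
audio = sigma_vae.decode(all_latents)          # 24 kHz mono Float32
```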
## Voice cloning via voice-cache
Speaker identity does not come from a reference waveform at generation time. Instead, each voice ships as a precomputed .safetensors voice cache containing the conditioning KV caches and hidden states for a specific speaker — produced by running reference audio through the encoder path offline. Loading a voice cache is instantaneous at runtime; one model instance can swap voices cheaply between generations.
Example voice caches (MIT-licensed): `mzbac/vibevoice.swift/voice_cache` — 7 English voices: Carter, Davis, Emma, Frank, Grace, Mike, and Indian-accented Samuel.
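Since each voice is just a `.safetensors` file on disk, enumerating the available voices is plain file-system work. A small helper might look like this (an illustrative sketch, not part of the package's API):

```swift
import Foundation

// List the voice-cache files in a directory, sorted by name.
// Illustrative helper; not part of the VibeVoiceTTS API.
func voiceCaches(in dir: String) throws -> [String] {
    try FileManager.default.contentsOfDirectory(atPath: dir)
        .filter { $0.hasSuffix(".safetensors") }
        .sorted()
        .map { dir + "/" + $0 }
}
```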
## Model
| Bundle | Quantization | Size | HuggingFace |
|---|---|---|---|
| Realtime-0.5B | BF16 (source) | ~1 GB | microsoft/VibeVoice-Realtime-0.5B |
| Realtime-0.5B INT4 | Qwen2 INT4, tokenizer + diffusion FP16 | ~350 MB | aufklarer/VibeVoice-Realtime-0.5B-MLX-INT4 |
| Realtime-0.5B INT8 | Qwen2 INT8 | ~570 MB | aufklarer/VibeVoice-Realtime-0.5B-MLX-INT8 |
| 1.5B long-form | BF16 (source) | ~3 GB | microsoft/VibeVoice-1.5B |
| 1.5B INT4 | Qwen2 INT4 | ~1 GB | aufklarer/VibeVoice-1.5B-MLX-INT4 |
Quantization is produced by `models/vibevoice/export/convert.py` using MLX group-wise affine quantization (group size 32). Embeddings, norms, acoustic-tokenizer convolutions, and the EOS classifier stay in their source dtype.
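Group-wise affine quantization stores, per 32-value group, a low-bit code for each weight plus one scale and one bias. The sketch below shows the idea for a single group; it is a simplified illustration, not the `convert.py` implementation (which packs codes and operates on MLX arrays):

```swift
import Foundation

// Quantize one group of weights to `bits`-bit codes with an affine
// (scale, bias) pair — a simplified sketch of group-wise affine quant.
func quantizeGroup(_ values: [Float], bits: Int = 4) -> (codes: [UInt8], scale: Float, bias: Float) {
    let qmax = Float((1 << bits) - 1)          // 15 for INT4
    let lo = values.min() ?? 0
    let hi = values.max() ?? 0
    let scale = max(hi - lo, 1e-8) / qmax      // per-group step size
    let bias = lo                              // per-group offset
    let codes = values.map { v -> UInt8 in
        let q = ((v - bias) / scale).rounded()
        return UInt8(min(max(q, 0), qmax))
    }
    return (codes, scale, bias)
}

// Reconstruct approximate weights from codes + (scale, bias).
func dequantize(_ codes: [UInt8], scale: Float, bias: Float) -> [Float] {
    codes.map { Float($0) * scale + bias }
}
```

The round-trip error per weight is bounded by half the group's scale, which is why keeping sensitive tensors (embeddings, norms, the EOS classifier) in their source dtype matters.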
## Quick start
```swift
import VibeVoiceTTS

let tts = try await VibeVoiceTTSModel.fromPretrained()
try tts.loadVoice(from: "/path/to/voice_cache/en-Mike_man.safetensors")
let pcm = try await tts.generate(text: "Hello world.")
// pcm: [Float] at 24 kHz mono
```
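The returned `[Float]` is raw 24 kHz mono PCM. In practice you would hand it to `AudioCommon.WAVWriter`; as a standalone illustration of what that entails, a minimal 16-bit WAV encoder looks like this (a sketch, not the package's writer):

```swift
import Foundation

// Encode Float32 PCM as a minimal 16-bit mono WAV file.
// Standalone sketch; AudioCommon.WAVWriter covers this in the package.
func wavData(from pcm: [Float], sampleRate: UInt32 = 24_000) -> Data {
    var data = Data()
    func append(_ s: String) { data.append(s.data(using: .ascii)!) }
    func append32(_ v: UInt32) { withUnsafeBytes(of: v.littleEndian) { data.append(contentsOf: $0) } }
    func append16(_ v: UInt16) { withUnsafeBytes(of: v.littleEndian) { data.append(contentsOf: $0) } }

    // Clamp to [-1, 1] and scale to Int16 range.
    let samples = pcm.map { Int16(max(-1, min(1, $0)) * 32767) }
    let dataSize = UInt32(samples.count * 2)

    append("RIFF"); append32(36 + dataSize); append("WAVE")
    append("fmt "); append32(16)
    append16(1)                    // PCM format tag
    append16(1)                    // mono
    append32(sampleRate)
    append32(sampleRate * 2)       // byte rate = rate * channels * 2
    append16(2)                    // block align
    append16(16)                   // bits per sample
    append("data"); append32(dataSize)
    for s in samples { withUnsafeBytes(of: s.littleEndian) { data.append(contentsOf: $0) } }
    return data
}
```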
### Long-form 1.5B preset
let config = VibeVoiceTTSModel.Configuration.longForm1_5B
let tts = try await VibeVoiceTTSModel.fromPretrained(configuration: config)
try tts.loadVoice(from: "voices/narrator.safetensors")
let pcm = try await tts.generate(text: longTranscript) // up to ~90 min
The `longForm1_5B` preset bumps `maxSpeechTokens` to 4000 and `cfgScale` to 1.5 for higher-fidelity long-form output.
### Swap voices between generations
```swift
try tts.loadVoice(from: "en-Mike_man.safetensors")
let a = try await tts.generate(text: "First speaker line.")
try tts.loadVoice(from: "en-Emma_woman.safetensors")
let b = try await tts.generate(text: "Second speaker line.")
```
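Because each generation returns plain 24 kHz Float32 PCM, stitching the per-speaker clips into one track is simple array math. A small helper (an illustrative sketch, not part of the package) that inserts a short pause between turns:

```swift
// Concatenate per-speaker clips with a pause between turns.
// Illustrative helper; operates on the [Float] PCM generate(text:) returns.
func stitch(_ clips: [[Float]], pauseSeconds: Double = 0.3, sampleRate: Int = 24_000) -> [Float] {
    let gap = [Float](repeating: 0, count: Int(pauseSeconds * Double(sampleRate)))
    var out: [Float] = []
    for (i, clip) in clips.enumerated() {
        if i > 0 { out += gap }   // silence between speaker turns
        out += clip
    }
    return out
}
```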
## CLI
```sh
audio vibevoice "Hello world." \
  --voice-cache voice_cache/en-Mike_man.safetensors \
  --output hello.wav

# Long-form 1.5B
audio vibevoice "Long paragraph ..." \
  --voice-cache voices/narrator.safetensors \
  --long-form \
  --max-tokens 4000 \
  --output episode.wav
```
Flags: `--steps` (DPM-Solver steps), `--cfg` (guidance), `--model` / `--tokenizer` to override HuggingFace IDs, `--long-form` to switch to the 1.5B preset, `--verbose` for timing.
## Picking among speech-swift TTS modules
| | Kokoro-82M | Qwen3-TTS | CosyVoice3 | VibeVoice Realtime | VibeVoice 1.5B |
|---|---|---|---|---|---|
| Params | 82M | 7B | 7B | 500M | 1.5B |
| Backend | CoreML (ANE) | MLX | MLX | MLX | MLX |
| Languages | 8 | 10+ | 10+ | EN/ZH | EN/ZH |
| Voice cloning | Fixed presets | ICL reference | Zero-shot reference | Voice cache | Voice cache |
| Long-form | Short/medium | Streaming | Streaming | Streaming | Up to 90 min / 4 speakers |
Choose VibeVoice when you need long-form, multi-speaker, or podcast/audiobook output in English or Chinese, with consistent voice identity across minutes of audio. For short-form multilingual TTS, Qwen3-TTS or CosyVoice3 are better fits. For iOS-native short utterances, Kokoro is the smallest option.