VibeVoice
Microsoft VibeVoice is a long-form, multi-speaker text-to-speech model for English and Chinese. Unlike short-utterance TTS, it's designed to generate podcast-length dialogue, audiobook narration, and multi-speaker scenes in a single pass — up to 90 minutes with up to 4 distinct voices and consistent identity throughout. Two variants ship: Realtime-0.5B for low-latency streaming and 1.5B for long-form flagship quality.
What it is
- Long-form in one pass: up to 90 minutes of audio with consistent voices across the whole output; no per-sentence handoff
- Multi-speaker dialogue: 4 distinct speakers at once, each conditioned by its own voice cache
- English + Chinese: the training audio is EN/ZH only; other languages are not supported (the tokenizer accepts them, but the output is unintelligible)
- 24 kHz mono output: Float32 PCM, drop-in for AudioCommon.WAVWriter and StreamingAudioPlayer
- MIT license: model weights and our Swift port are both MIT; INT4 quantized derivatives are allowed
Architecture
Four cooperating components produce audio one 7.5 Hz latent at a time:
| Component | Description |
|---|---|
| Split Qwen2 backbone | 24-layer Qwen2.5 decoder (896 hidden, GQA 14/2 for Realtime-0.5B). The model is split: the lower 4 layers form a text LM, the upper 20 layers run as the TTS LM. Text windows (5 tokens at a time) flow through both; generated speech latents flow only through the TTS LM. |
| σ-VAE acoustic tokenizer | Streaming conv stack that encodes 24 kHz audio to a 64-dim latent at 7.5 Hz (3200× temporal downsample) and decodes latents back to waveform. Used for both voice-cache creation and final audio decode. |
| Diffusion head | Small 4-layer DDPM head with adaLN modulation. Samples each speech latent via 20-step DPM-Solver with classifier-free guidance (cfg = 1.3 default for Realtime-0.5B, 1.5 for 1.5B). |
| EOS classifier | Per-step binary classifier on the TTS LM's last hidden state. When sigmoid probability exceeds 0.5, generation stops. |
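The figures in the table above can be checked with a few lines of plain Swift. This is an illustrative sketch, not the port's code: the names `VibeVoiceRates`, `samplesPerLatent`, `latentCount`, `guided`, and `shouldStop` are hypothetical, and the guidance formula is the standard classifier-free guidance blend, which is assumed here rather than confirmed against the diffusion head.

```swift
import Foundation

/// Rates quoted in the architecture table.
enum VibeVoiceRates {
    static let sampleRate = 24_000.0   // Hz, output audio
    static let latentRate = 7.5        // Hz, one speech latent per step
}

/// Audio samples covered by one latent (the 3200x downsample figure).
func samplesPerLatent() -> Int {
    Int(VibeVoiceRates.sampleRate / VibeVoiceRates.latentRate)
}

/// Number of latents needed to cover `seconds` of audio, rounded up.
func latentCount(forSeconds seconds: Double) -> Int {
    Int((seconds * VibeVoiceRates.latentRate).rounded(.up))
}

/// Standard classifier-free guidance (assumed): start from the unconditional
/// prediction and move toward the conditional one, scaled by `cfg`.
func guided(cond: [Float], uncond: [Float], cfg: Float) -> [Float] {
    precondition(cond.count == uncond.count, "predictions must have equal length")
    return zip(cond, uncond).map { c, u in u + cfg * (c - u) }
}

/// EOS rule from the table: stop once sigmoid(logit) exceeds the threshold.
func shouldStop(eosLogit: Float, threshold: Float = 0.5) -> Bool {
    let probability = 1 / (1 + exp(-eosLogit))
    return probability > threshold
}
```

At 7.5 Hz, a 90-minute output corresponds to `latentCount(forSeconds: 90 * 60)` diffusion steps, which is why the per-latent cost of the 20-step solver dominates long-form runtime.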
Languages
Per Microsoft's model cards: Realtime-0.5B is English only (the upstream demo ships nine non-EN voice prompts as exploratory; quality is not guaranteed). 1.5B supports English and Chinese; other languages may produce plausible-sounding but unfaithful audio and should be considered experimental.
Voice identity — two distinct paths
The two variants take very different approaches to speaker conditioning, and each path has constraints worth knowing up-front.
Realtime-0.5B — pre-built voice caches
Speaker identity comes from a precomputed .safetensors voice cache containing the conditioning KV caches and hidden states for a specific speaker. Loading a cache is instantaneous; one model instance can swap voices cheaply between generations.
The Realtime-0.5B checkpoint is distributed inference-only — Microsoft does not ship the acoustic encoder, so voice caches cannot be minted from raw audio against this model. The supported source is one of Microsoft's pre-built .pt voice caches (Carter, Davis, Emma, Frank, Grace, Mike, Samuel for English, plus exploratory de/fr/it/jp/kr/nl/pl/pt voices), flattened into the .safetensors layout this loader expects.
1.5B long-form — voice cloning from raw audio
The 1.5B checkpoint does ship the acoustic encoder, so it can clone an arbitrary speaker from a reference waveform + transcript in one shot. The encoding is inlined on every synthesis call — there is no separate voice-cache file to manage.
speech vibevoice-encode-voice is gated
The CLI command for offline voice-cache creation against Realtime-0.5B fails fast with a pointer to the 1.5B raw-audio workflow, because the encoder weights aren't present in the 0.5B checkpoint. Until Microsoft ships them, the 1.5B raw-audio workflow is the only end-to-end path for cloning a custom speaker.
Model
| Bundle | Quantization | Size | HuggingFace |
|---|---|---|---|
| Realtime-0.5B | BF16 (source) | ~1 GB | microsoft/VibeVoice-Realtime-0.5B |
| Realtime-0.5B INT4 | Qwen2 INT4, tokenizer + diffusion FP16 | ~350 MB | aufklarer/VibeVoice-Realtime-0.5B-MLX-INT4 |
| Realtime-0.5B INT8 | Qwen2 INT8 | ~570 MB | aufklarer/VibeVoice-Realtime-0.5B-MLX-INT8 |
| 1.5B long-form | BF16 (source) | ~3 GB | microsoft/VibeVoice-1.5B |
| 1.5B INT4 (production) | Qwen2 INT4 + dual encoders | ~1 GB | aufklarer/VibeVoice-1.5B-MLX-INT4 |
Quantization uses MLX group-wise affine quantization (group size 32). Embeddings, norms, acoustic-tokenizer convolutions, and the EOS classifier stay in their source dtype.
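To make the group-wise affine scheme concrete, here is a minimal plain-Swift sketch. It assumes that "32-group" means a group size of 32 and that each group stores a scale and a bias (the group minimum) so that a weight is recovered as `code * scale + bias`. The names `QuantizedGroup`, `quantizeAffine`, and `dequantize` are hypothetical; this illustrates the idea and is not MLX's kernel or packing format.

```swift
import Foundation

/// One quantized group: integer codes plus the per-group scale and bias.
struct QuantizedGroup {
    let codes: [UInt8]   // each value in 0...(2^bits - 1)
    let scale: Float
    let bias: Float      // the group's minimum weight
}

/// Quantizes `weights` in consecutive groups of `groupSize` values.
func quantizeAffine(_ weights: [Float], groupSize: Int = 32, bits: Int = 4) -> [QuantizedGroup] {
    precondition((1...8).contains(bits), "codes are stored in UInt8")
    precondition(groupSize > 0 && weights.count % groupSize == 0,
                 "weights must fill whole groups")
    let levels = Float((1 << bits) - 1)   // 15 for INT4
    return stride(from: 0, to: weights.count, by: groupSize).map { start -> QuantizedGroup in
        let group = weights[start..<(start + groupSize)]
        let lo = group.min()!
        let hi = group.max()!
        // A constant group has zero range; fall back to scale 1 to avoid dividing by zero.
        let scale = hi > lo ? (hi - lo) / levels : 1
        let codes = group.map { value -> UInt8 in
            let code = ((value - lo) / scale).rounded()
            return UInt8(min(max(code, 0), levels))   // clamp against rounding drift
        }
        return QuantizedGroup(codes: codes, scale: scale, bias: lo)
    }
}

/// Reconstructs approximate weights from the quantized groups.
func dequantize(_ groups: [QuantizedGroup]) -> [Float] {
    groups.flatMap { group in
        group.codes.map { Float($0) * group.scale + group.bias }
    }
}
```

The round-trip error per weight is bounded by half the group's scale, so smaller groups (32 rather than a larger size) track local weight ranges more tightly at the cost of storing more scales and biases.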
Quick start
import VibeVoiceTTS
let tts = try await VibeVoiceTTSModel.fromPretrained()
try tts.loadVoice(from: "/path/to/voice_cache/en-Mike_man.safetensors")
let pcm = try await tts.generate(text: "Hello world.")
// pcm: [Float] at 24 kHz mono
Long-form 1.5B (different API)
1.5B has a different architecture (unified Qwen2 LM, dual encoders, LM token sampling) so it ships as a separate class — VibeVoice15BTTSModel. Reference audio + text go in a single call:
let tts = try await VibeVoice15BTTSModel.fromPretrained()
let pcm = try await tts.generate(
    text: "Long English script.",
    referenceAudio: refSamples,   // [Float] mono speech, any rate
    referenceTranscript: "",
    sampleRate: 24000
)
No voice cache is needed: the model encodes the reference audio through both acoustic_tokenizer (64-dim) and semantic_tokenizer (128-dim, ASR-trained) and sums them at the audio prompt positions. Generation samples LM tokens and branches on <speech_diffusion> / <speech_end> / text; it diffuses an acoustic latent only when the LM emits <speech_diffusion>.
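The branching described above can be sketched as a small control loop. This is a simplified illustration under stated assumptions: the real model samples one token at a time from the LM, whereas this sketch walks a pre-scripted token list, and `SampledToken`, `runBranchingLoop`, and the `diffuse` closure are hypothetical names, not part of the VibeVoice15BTTSModel API.

```swift
/// Hypothetical token kinds; the real model identifies these by tokenizer ID.
enum SampledToken: Equatable {
    case speechDiffusion   // <speech_diffusion>: diffuse one acoustic latent
    case speechEnd         // <speech_end>: stop generating
    case text(Int)         // any ordinary text token ID
}

/// Walks a token stream and calls `diffuse` once per <speech_diffusion> token.
/// Returns the latents produced before <speech_end>, the end of the stream,
/// or the `maxTokens` budget, whichever comes first.
func runBranchingLoop(
    tokens: [SampledToken],
    maxTokens: Int,
    diffuse: () -> [Float]
) -> [[Float]] {
    precondition(maxTokens >= 0, "token budget must be non-negative")
    var latents: [[Float]] = []
    for token in tokens.prefix(maxTokens) {
        switch token {
        case .speechDiffusion:
            latents.append(diffuse())
        case .speechEnd:
            return latents
        case .text:
            continue   // text tokens advance the LM but produce no audio
        }
    }
    return latents
}
```

With a stand-in `diffuse` that returns a 64-element zero vector, a stream of one text token, two speech tokens, and an end token yields exactly two latents, and anything after the end token is ignored.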
ASR-verified on M2 Max with INT4 (RTFx 1.48): for the input "Hello world. This is the one point five billion VibeVoice variant of the Microsoft text to speech model.", Nemotron transcribed the output as "hello world, this is the one point five billion via voice variant of the microsoft texas speech model". The only mismatches are two acoustic substitutions, VibeVoice → via voice and text to → texas; every other word matched.
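For reproducing a throughput figure like the one above, the following sketch computes RTFx from a generated buffer. It assumes RTFx means seconds of audio produced per second of wall-clock time (so values above 1 are faster than real time); the document does not define the metric, and `durationSeconds` and `rtfx` are hypothetical helper names.

```swift
import Foundation

/// Duration in seconds of a mono PCM buffer with `sampleCount` samples.
func durationSeconds(sampleCount: Int, sampleRate: Double = 24_000) -> Double {
    Double(sampleCount) / sampleRate
}

/// Assumed definition of RTFx: audio seconds generated per wall-clock second.
func rtfx(sampleCount: Int, wallSeconds: Double, sampleRate: Double = 24_000) -> Double {
    precondition(wallSeconds > 0, "wall-clock time must be positive")
    return durationSeconds(sampleCount: sampleCount, sampleRate: sampleRate) / wallSeconds
}
```

In practice, `wallSeconds` would come from timing the `generate` call, and `sampleCount` is `pcm.count` of the returned buffer.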
Swap voices between generations
try tts.loadVoice(from: "en-Mike_man.safetensors")
let a = try await tts.generate(text: "First speaker line.")
try tts.loadVoice(from: "en-Emma_woman.safetensors")
let b = try await tts.generate(text: "Second speaker line.")
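Since each `generate` call returns a plain `[Float]` buffer, per-speaker segments such as `a` and `b` above can be joined into one dialogue track with ordinary array code. The helper below is a minimal sketch: `stitch` is a hypothetical name, and the 0.25 s pause between turns is an arbitrary assumption, not a value from the library.

```swift
/// Concatenates PCM segments in order, inserting `gapSeconds` of silence
/// between consecutive segments (but not before the first or after the last).
func stitch(_ segments: [[Float]], gapSeconds: Double = 0.25, sampleRate: Int = 24_000) -> [Float] {
    precondition(gapSeconds >= 0 && sampleRate > 0, "gap and sample rate must be valid")
    let gap = [Float](repeating: 0, count: Int(gapSeconds * Double(sampleRate)))
    var output: [Float] = []
    for (index, segment) in segments.enumerated() {
        if index > 0 { output.append(contentsOf: gap) }
        output.append(contentsOf: segment)
    }
    return output
}
```

For example, `let dialogue = stitch([a, b])` produces a single 24 kHz mono buffer with a quarter-second pause between the two speakers.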
CLI
Realtime-0.5B with a converted Microsoft voice cache (English):
speech vibevoice "Hello world." \
  --voice-cache voice_cache/en-Mike_man.safetensors \
  --output hello.wav
1.5B long-form with raw reference audio + transcript (cloning an arbitrary speaker, English or Chinese):
speech vibevoice "Long paragraph ..." \
  --long-form \
  --reference-audio voice.wav \
  --reference-transcript "what was actually said in voice.wav" \
  --max-tokens 4000 \
  --output episode.wav
Other flags: --steps (number of DPM-Solver steps), --cfg (guidance scale), --model / --tokenizer to override the HuggingFace IDs, --long-form to switch to the 1.5B preset, and --verbose for timing output.
Picking among speech-swift TTS modules
| | Kokoro-82M | Qwen3-TTS | CosyVoice3 | VibeVoice Realtime | VibeVoice 1.5B |
|---|---|---|---|---|---|
| Params | 82M | 7B | 7B | 500M | 1.5B |
| Backend | CoreML (ANE) | MLX | MLX | MLX | MLX |
| Languages | 8 | 10+ | 10+ | EN only | EN + ZH |
| Voice cloning | Fixed presets | ICL reference | Zero-shot reference | Pre-built voice caches only | Raw audio + transcript |
| Long-form | Short/medium | Streaming | Streaming | Streaming | Up to 90 min / 4 speakers |
Choose VibeVoice when you need long-form, multi-speaker, or podcast/audiobook output in English or Chinese, with consistent voice identity across minutes of audio. For short-form multilingual TTS, Qwen3-TTS or CosyVoice3 is a better fit. For iOS-native short utterances, Kokoro is the smallest option.