VibeVoice

Microsoft VibeVoice is a long-form, multi-speaker text-to-speech model for English and Chinese. Unlike short-utterance TTS, it's designed to generate podcast-length dialogue, audiobook narration, and multi-speaker scenes in a single pass — up to 90 minutes with up to 4 distinct voices and consistent identity throughout. Two variants ship: Realtime-0.5B for low-latency streaming and 1.5B for flagship long-form quality.

What it is

Architecture

Four cooperating components produce audio one 7.5 Hz latent at a time:

| Component | Description |
|---|---|
| Split Qwen2 backbone | 24-layer Qwen2.5 decoder (896 hidden, GQA 14/2 for Realtime-0.5B). The model is split: the lower 4 layers form a text LM, the upper 20 layers run as the TTS LM. Text windows (5 tokens at a time) flow through both; generated speech latents flow only through the TTS LM. |
| σ-VAE acoustic tokenizer | Streaming conv stack that encodes 24 kHz audio to a 64-dim latent at 7.5 Hz (3200× temporal downsample) and decodes latents back to waveform. Used for both voice-cache creation and final audio decode. |
| Diffusion head | Small 4-layer DDPM head with adaLN modulation. Samples each speech latent via 20-step DPM-Solver with classifier-free guidance (cfg = 1.3 default for Realtime-0.5B, 1.5 for 1.5B). |
| EOS classifier | Per-step binary classifier on the TTS LM's last hidden state. When the sigmoid probability exceeds 0.5, generation stops. |
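The numbers above hang together: at 24 kHz output and a 3200× temporal downsample, the tokenizer emits 24000 / 3200 = 7.5 latents per second, and each latent costs 20 DPM-Solver steps. A quick sketch in plain Swift (no library dependencies; the function name is illustrative, only the constants come from the table):

```swift
import Foundation

// Acoustic-tokenizer timing constants from the table above.
let sampleRate = 24_000.0                  // Hz, model output
let downsample = 3_200.0                   // temporal downsample factor
let latentRate = sampleRate / downsample   // 7.5 latents per second

// Cost of one minute of audio: 450 latents, each sampled
// with 20 DPM-Solver steps by the diffusion head.
let latentsPerMinute = latentRate * 60          // 450
let dpmStepsPerMinute = latentsPerMinute * 20   // 9_000

// The EOS classifier's stopping rule: sigmoid(logit) > 0.5,
// which is equivalent to logit > 0.
func shouldStop(eosLogit: Double) -> Bool {
    1.0 / (1.0 + exp(-eosLogit)) > 0.5
}
```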

Voice cloning via voice-cache

Speaker identity does not come from a reference waveform at generation time. Instead, each voice ships as a precomputed .safetensors voice cache containing the conditioning KV caches and hidden states for a specific speaker — produced by running reference audio through the encoder path offline. Loading a voice cache is instantaneous at runtime; one model instance can swap voices cheaply between generations.

Example voice caches (MIT-licensed): mzbac/vibevoice.swift/voice_cache — 7 English voices including Carter, Davis, Emma, Frank, Grace, Mike, and Indian-accented Samuel.

Model

| Bundle | Quantization | Size | HuggingFace |
|---|---|---|---|
| Realtime-0.5B | BF16 (source) | ~1 GB | microsoft/VibeVoice-Realtime-0.5B |
| Realtime-0.5B INT4 | Qwen2 INT4, tokenizer + diffusion FP16 | ~350 MB | aufklarer/VibeVoice-Realtime-0.5B-MLX-INT4 |
| Realtime-0.5B INT8 | Qwen2 INT8 | ~570 MB | aufklarer/VibeVoice-Realtime-0.5B-MLX-INT8 |
| 1.5B long-form | BF16 (source) | ~3 GB | microsoft/VibeVoice-1.5B |
| 1.5B INT4 | Qwen2 INT4 | ~1 GB | aufklarer/VibeVoice-1.5B-MLX-INT4 |

Quantized bundles are produced by models/vibevoice/export/convert.py using MLX group-wise affine quantization (group size 32). Embeddings, norms, acoustic-tokenizer convolutions, and the EOS classifier stay in their source dtype.
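For intuition, group-wise affine quantization gives each group of 32 consecutive weights its own scale and offset, then rounds every weight in the group to a small integer code. A minimal self-contained sketch of the idea (illustrative only; the actual kernels are MLX's, and these type and function names are not from convert.py):

```swift
import Foundation

// One group of 32 weights shares a scale and an offset (the group minimum).
// Codes are 4-bit values (0...15) for INT4, stored one per byte here.
struct QuantizedGroup {
    let scale: Float
    let zero: Float
    let codes: [UInt8]
}

func quantize(_ weights: [Float], groupSize: Int = 32, bits: Int = 4) -> [QuantizedGroup] {
    let levels = Float((1 << bits) - 1)   // 15 for INT4
    return stride(from: 0, to: weights.count, by: groupSize).map { start in
        let group = Array(weights[start ..< min(start + groupSize, weights.count)])
        let lo = group.min()!, hi = group.max()!
        let scale = max(hi - lo, .leastNormalMagnitude) / levels
        let codes = group.map { UInt8((($0 - lo) / scale).rounded()) }
        return QuantizedGroup(scale: scale, zero: lo, codes: codes)
    }
}

func dequantize(_ groups: [QuantizedGroup]) -> [Float] {
    groups.flatMap { g in g.codes.map { g.zero + Float($0) * g.scale } }
}
```

The round-trip error per weight is bounded by half a quantization step (scale / 2), which is why small, scale-sensitive modules like norms and the EOS classifier are kept in their source dtype.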

Quick start

import VibeVoiceTTS

let tts = try await VibeVoiceTTSModel.fromPretrained()
try tts.loadVoice(from: "/path/to/voice_cache/en-Mike_man.safetensors")
let pcm = try await tts.generate(text: "Hello world.")
// pcm: [Float] at 24 kHz mono

Long-form 1.5B preset

let config = VibeVoiceTTSModel.Configuration.longForm1_5B
let tts = try await VibeVoiceTTSModel.fromPretrained(configuration: config)
try tts.loadVoice(from: "voices/narrator.safetensors")
let pcm = try await tts.generate(text: longTranscript)  // up to ~90 min

The longForm1_5B preset bumps maxSpeechTokens to 4000 and cfgScale to 1.5 for higher-fidelity long-form output.
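Since speech latents are produced at 7.5 Hz, maxSpeechTokens is effectively a duration budget for a generation. A back-of-envelope conversion (illustrative helpers, not part of the VibeVoiceTTS API; only the 7.5 Hz rate comes from the architecture above):

```swift
// Latents are generated at 7.5 per second, so a speech-token
// budget converts directly to seconds of audio and back.
let latentRateHz = 7.5

func secondsOfAudio(forTokens tokens: Int) -> Double {
    Double(tokens) / latentRateHz
}

func tokensNeeded(forSeconds seconds: Double) -> Int {
    Int((seconds * latentRateHz).rounded(.up))
}
```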

Swap voices between generations

try tts.loadVoice(from: "en-Mike_man.safetensors")
let a = try await tts.generate(text: "First speaker line.")
try tts.loadVoice(from: "en-Emma_woman.safetensors")
let b = try await tts.generate(text: "Second speaker line.")
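Because generate returns mono [Float] PCM at a fixed 24 kHz, assembling the two lines above into one dialogue clip is plain array concatenation. A small sketch (the helper and its names are illustrative, not part of the API):

```swift
// Stitch per-speaker PCM segments into one buffer, inserting a short
// silence between turns. All segments are mono Float PCM at 24 kHz.
let outputSampleRate = 24_000

func stitch(_ segments: [[Float]], gapSeconds: Double = 0.3) -> [Float] {
    let gap = [Float](repeating: 0, count: Int(gapSeconds * Double(outputSampleRate)))
    return segments.enumerated().flatMap { i, segment in
        i == 0 ? segment : gap + segment
    }
}
```

With the snippet above, `stitch([a, b])` yields the full exchange as one 24 kHz buffer ready to write to a WAV file.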

CLI

audio vibevoice "Hello world." \
    --voice-cache voice_cache/en-Mike_man.safetensors \
    --output hello.wav

# Long-form 1.5B
audio vibevoice "Long paragraph ..." \
    --voice-cache voices/narrator.safetensors \
    --long-form \
    --max-tokens 4000 \
    --output episode.wav

Flags: --steps (DPM-Solver steps), --cfg (guidance), --model / --tokenizer to override HuggingFace IDs, --long-form to switch to the 1.5B preset, --verbose for timing.

Picking among speech-swift TTS modules

| | Kokoro-82M | Qwen3-TTS | CosyVoice3 | VibeVoice Realtime | VibeVoice 1.5B |
|---|---|---|---|---|---|
| Params | 82M | 7B | 7B | 500M | 1.5B |
| Backend | CoreML (ANE) | MLX | MLX | MLX | MLX |
| Languages | 8 | 10+ | 10+ | EN/ZH | EN/ZH |
| Voice cloning | Fixed presets | ICL reference | Zero-shot reference | Voice cache | Voice cache |
| Long-form | Short/medium | Streaming | Streaming | Streaming | Up to 90 min / 4 speakers |
Pick VibeVoice when…

…you need long-form, multi-speaker, or podcast/audiobook output in English or Chinese, with consistent voice identity across minutes of audio. For short-form multilingual TTS, Qwen3-TTS or CosyVoice3 are better fits. For iOS-native short utterances, Kokoro is the smallest option.