VibeVoice

Microsoft VibeVoice is a long-form, multi-speaker text-to-speech model for English and Chinese. Unlike short-utterance TTS, it's designed to generate podcast-length dialogue, audiobook narration, and multi-speaker scenes in a single pass — up to 90 minutes with up to 4 distinct voices and consistent identity throughout. Two variants ship: Realtime-0.5B for low-latency streaming and 1.5B for long-form flagship quality.

What it is

Architecture

Four cooperating components produce audio one 7.5 Hz latent at a time:

| Component | Description |
| --- | --- |
| Split Qwen2 backbone | 24-layer Qwen2.5 decoder (896 hidden, GQA 14/2 for Realtime-0.5B). The model is split: the lower 4 layers form a text LM, the upper 20 layers run as the TTS LM. Text windows (5 tokens at a time) flow through both; generated speech latents flow only through the TTS LM. |
| σ-VAE acoustic tokenizer | Streaming conv stack that encodes 24 kHz audio to a 64-dim latent at 7.5 Hz (3200× temporal downsample) and decodes latents back to waveform. Used for both voice-cache creation and final audio decode. |
| Diffusion head | Small 4-layer DDPM head with adaLN modulation. Samples each speech latent via 20-step DPM-Solver with classifier-free guidance (cfg = 1.3 default for Realtime-0.5B, 1.5 for 1.5B). |
| EOS classifier | Per-step binary classifier on the TTS LM's last hidden state. When sigmoid probability exceeds 0.5, generation stops. |
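
The numbers in the table can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical and not part of the VibeVoiceTTS API, and the guidance blend is the standard classifier-free guidance formula, which is assumed (not confirmed by this page) to be what the diffusion head applies.

```swift
import Foundation

// Constants taken from the component table above.
let sampleRate = 24_000.0                  // Hz, acoustic tokenizer audio rate
let downsample = 3_200.0                   // temporal downsample factor
let latentRate = sampleRate / downsample   // latents per second of audio

/// Number of speech latents needed to cover `seconds` of audio.
func latentCount(forSeconds seconds: Double) -> Int {
    Int((seconds * latentRate).rounded(.up))
}

/// Logistic sigmoid that turns an EOS logit into a probability.
func sigmoid(_ x: Double) -> Double { 1.0 / (1.0 + exp(-x)) }

/// Stop when the EOS probability exceeds the threshold (0.5 in the table).
func shouldStop(eosLogit: Double, threshold: Double = 0.5) -> Bool {
    sigmoid(eosLogit) > threshold
}

/// Standard classifier-free guidance blend (assumed formulation):
/// guided = uncond + scale * (cond - uncond).
func guided(cond: [Double], uncond: [Double], scale: Double) -> [Double] {
    zip(cond, uncond).map { c, u in u + scale * (c - u) }
}

print(latentRate)                          // 7.5
print(latentCount(forSeconds: 90 * 60))    // 40500
```

Under these figures a full 90-minute generation corresponds to 40,500 diffusion-sampled latents, which is why the per-latent step count and guidance scale matter for throughput.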

Languages

Per Microsoft's model cards: Realtime-0.5B is English only (the upstream demo ships nine non-EN voice prompts as exploratory; quality is not guaranteed). 1.5B supports English and Chinese; other languages may produce plausible-sounding but unfaithful audio and should be considered experimental.

Voice identity — two distinct paths

The two variants take very different approaches to speaker conditioning, and each path has constraints worth knowing up-front.

Realtime-0.5B — pre-built voice caches

Speaker identity comes from a precomputed .safetensors voice cache containing the conditioning KV caches and hidden states for a specific speaker. Loading a cache is instantaneous; one model instance can swap voices cheaply between generations.

The Realtime-0.5B checkpoint is distributed inference-only — Microsoft does not ship the acoustic encoder, so voice caches cannot be minted from raw audio against this model. The supported source is one of Microsoft's pre-built .pt voice caches (Carter, Davis, Emma, Frank, Grace, Mike, Samuel for English, plus exploratory de/fr/it/jp/kr/nl/pl/pt voices), flattened into the .safetensors layout this loader expects.

1.5B long-form — voice cloning from raw audio

The 1.5B checkpoint does ship the acoustic encoder, so it can clone an arbitrary speaker from a reference waveform + transcript in one shot. The encoding is inlined on every synthesis call — there is no separate voice-cache file to manage.

FYI: speech vibevoice-encode-voice is gated

The CLI command for offline voice-cache creation against Realtime-0.5B fails fast with a pointer to the 1.5B raw-audio workflow, because the encoder weights aren't present in the 0.5B checkpoint. Until Microsoft ships them, the 1.5B workflow is the only end-to-end path for cloning a custom speaker.

Model

| Bundle | Quantization | Size | HuggingFace |
| --- | --- | --- | --- |
| Realtime-0.5B | BF16 (source) | ~1 GB | microsoft/VibeVoice-Realtime-0.5B |
| Realtime-0.5B INT4 | Qwen2 INT4, tokenizer + diffusion FP16 | ~350 MB | aufklarer/VibeVoice-Realtime-0.5B-MLX-INT4 |
| Realtime-0.5B INT8 | Qwen2 INT8 | ~570 MB | aufklarer/VibeVoice-Realtime-0.5B-MLX-INT8 |
| 1.5B long-form | BF16 (source) | ~3 GB | microsoft/VibeVoice-1.5B |
| 1.5B INT4 (production) | Qwen2 INT4 + dual encoders | ~1 GB | aufklarer/VibeVoice-1.5B-MLX-INT4 |

Quantization uses MLX group-wise affine quantization (group size 32). Embeddings, norms, acoustic-tokenizer convolutions, and the EOS classifier stay in their source dtype.
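
To make the idea of group-wise affine quantization concrete, here is a minimal plain-Swift sketch. It is not MLX's kernel or storage layout: it assumes groups of 32 weights and 4-bit codes, stores one scale and one bias per group, and reconstructs each weight as scale × code + bias. All type and function names are hypothetical.

```swift
import Foundation

/// One quantized group: integer codes plus the per-group scale and bias.
struct QuantizedGroup {
    let codes: [UInt8]
    let scale: Float
    let bias: Float
}

/// Affine-quantize `weights` in groups of `groupSize` to `bits` bits.
/// Assumes weights.count is a multiple of groupSize.
func quantize(_ weights: [Float], groupSize: Int = 32, bits: Int = 4) -> [QuantizedGroup] {
    precondition(weights.count % groupSize == 0, "length must be a multiple of groupSize")
    let levels = Float((1 << bits) - 1)   // 15 for 4-bit codes
    return stride(from: 0, to: weights.count, by: groupSize).map { start -> QuantizedGroup in
        let group = weights[start..<start + groupSize]
        let lo = group.min()!, hi = group.max()!
        // A constant group has zero range; use scale 1 so every code is 0.
        let scale = hi > lo ? (hi - lo) / levels : 1
        let codes = group.map { UInt8((($0 - lo) / scale).rounded()) }
        return QuantizedGroup(codes: codes, scale: scale, bias: lo)
    }
}

/// Reconstruct approximate weights: w ≈ scale * code + bias.
func dequantize(_ groups: [QuantizedGroup]) -> [Float] {
    groups.flatMap { g in g.codes.map { g.scale * Float($0) + g.bias } }
}

/// Effective storage per weight, assuming FP16 scale and bias per group
/// (an assumption about the metadata dtype, not a documented fact).
func bitsPerWeight(bits: Int, groupSize: Int) -> Double {
    Double(bits) + Double(2 * 16) / Double(groupSize)
}

let weights: [Float] = (0..<64).map { Float($0) / 63 - 0.5 }
let restored = dequantize(quantize(weights))
let maxError = zip(weights, restored).map { abs($0 - $1) }.max()!
print(maxError)   // stays within about half a quantization step
```

Smaller groups track the local weight range more closely at the cost of more per-group metadata; under the FP16 assumption above, 4-bit codes with groups of 32 cost about 5 bits per weight.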

Quick start

import VibeVoiceTTS

let tts = try await VibeVoiceTTSModel.fromPretrained()
try tts.loadVoice(from: "/path/to/voice_cache/en-Mike_man.safetensors")
let pcm = try await tts.generate(text: "Hello world.")
// pcm: [Float] at 24 kHz mono
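
Since generate returns raw [Float] samples, a common next step is writing them to a WAV file. The helper below is a sketch, not part of the VibeVoiceTTS module: wavData is a hypothetical name, and it assumes mono samples in the range -1...1 that should be stored as 16-bit integer PCM.

```swift
import Foundation

/// Encode mono Float samples in [-1, 1] as an in-memory 16-bit PCM WAV file.
func wavData(from samples: [Float], sampleRate: Int = 24_000) -> Data {
    var data = Data()
    // Append any fixed-width integer in little-endian byte order.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    let bytesPerSample = 2
    let payloadSize = samples.count * bytesPerSample

    data.append(contentsOf: Array("RIFF".utf8))
    append(UInt32(36 + payloadSize))             // size of everything after this field
    data.append(contentsOf: Array("WAVE".utf8))
    data.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))                           // fmt chunk size
    append(UInt16(1))                            // format tag: integer PCM
    append(UInt16(1))                            // channels: mono
    append(UInt32(sampleRate))
    append(UInt32(sampleRate * bytesPerSample))  // byte rate
    append(UInt16(bytesPerSample))               // block align
    append(UInt16(16))                           // bits per sample
    data.append(contentsOf: Array("data".utf8))
    append(UInt32(payloadSize))

    for sample in samples {
        // Clip to [-1, 1]; treat non-finite samples as silence.
        let clipped = sample.isFinite ? max(-1, min(1, sample)) : 0
        append(Int16((clipped * 32767).rounded()))
    }
    return data
}
```

With the quick-start result in hand, `try wavData(from: pcm).write(to: URL(fileURLWithPath: "hello.wav"))` would save the audio at 24 kHz.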

Long-form 1.5B (different API)

1.5B has a different architecture (unified Qwen2 LM, dual encoders, LM token sampling) so it ships as a separate class — VibeVoice15BTTSModel. Reference audio + text go in a single call:

let tts = try await VibeVoice15BTTSModel.fromPretrained()
let pcm = try await tts.generate(
    text: "Long English script.",
    referenceAudio: refSamples,    // [Float] mono speech, any rate
    referenceTranscript: "",
    sampleRate: 24000
)

No voice cache is needed: the model encodes the reference audio through both acoustic_tokenizer (64-dim) and semantic_tokenizer (128-dim, ASR-trained) and sums the results at the audio prompt positions. Generation uses LM token sampling and branches on the sampled token (<speech_diffusion>, <speech_end>, or text); an acoustic latent is diffused only when the LM emits <speech_diffusion>.
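
The branching described above can be sketched as a small control loop. This is a simplified illustration rather than the module's implementation: SampledToken, generateLatents, and the two closures are hypothetical stand-ins for the real LM sampler and diffusion head, and the sketch ignores how each latent is fed back into the LM.

```swift
/// Hypothetical token categories mirroring the three branches in the text.
enum SampledToken: Equatable {
    case speechDiffusion   // <speech_diffusion>: emit one acoustic latent
    case speechEnd         // <speech_end>: stop generating
    case text(Int)         // any ordinary text token id
}

/// Runs only the branch logic. `sampleToken` and `diffuseLatent` are
/// placeholders for the real LM sampler and diffusion head.
func generateLatents(
    maxTokens: Int,
    sampleToken: () -> SampledToken,
    diffuseLatent: () -> [Float]
) -> [[Float]] {
    var latents: [[Float]] = []
    for _ in 0..<maxTokens {
        switch sampleToken() {
        case .speechDiffusion:
            latents.append(diffuseLatent())   // the only branch that produces audio
        case .speechEnd:
            return latents                    // model signalled the end of speech
        case .text:
            continue                          // text tokens add no latent
        }
    }
    return latents                            // token budget exhausted
}

// Scripted token stream standing in for the LM.
var script: [SampledToken] = [.text(7), .speechDiffusion, .speechDiffusion, .speechEnd, .speechDiffusion]
let latents = generateLatents(
    maxTokens: 10,
    sampleToken: { script.removeFirst() },
    diffuseLatent: { [Float](repeating: 0, count: 64) }
)
print(latents.count)   // 2
```

The token budget plays the same role as a maximum-token limit: generation ends either when the end-of-speech token appears or when the budget runs out.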

ASR-verified on M2 Max with the INT4 bundle (RTFx 1.48): for the input "Hello world. This is the one point five billion VibeVoice variant of the Microsoft text to speech model.", Nemotron transcribed the output as "hello world, this is the one point five billion via voice variant of the microsoft texas speech model". Every other content word matched; the only errors are two acoustic substitutions, VibeVoice → via voice and text to → texas.
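
A round-trip check like this one can be scored automatically. The sketch below computes a word-level edit distance between the input text and the ASR transcript, plus a real-time factor. It assumes RTFx means seconds of audio produced per second of compute, and all function names are hypothetical; the page does not say how the verification was scripted.

```swift
/// Lowercase and split on anything that is not a letter or digit.
func words(_ text: String) -> [String] {
    text.lowercased()
        .split(whereSeparator: { !($0.isLetter || $0.isNumber) })
        .map(String.init)
}

/// Word-level Levenshtein distance (substitutions, insertions, deletions).
func wordEditDistance(_ reference: [String], _ hypothesis: [String]) -> Int {
    var previous = Array(0...hypothesis.count)
    for (i, refWord) in reference.enumerated() {
        var current = [i + 1] + Array(repeating: 0, count: hypothesis.count)
        for (j, hypWord) in hypothesis.enumerated() {
            let substitution = previous[j] + (refWord == hypWord ? 0 : 1)
            current[j + 1] = min(substitution, previous[j + 1] + 1, current[j] + 1)
        }
        previous = current
    }
    return previous[hypothesis.count]
}

/// Real-time factor (assumed definition): audio seconds per wall-clock second.
func rtfx(audioSeconds: Double, wallSeconds: Double) -> Double {
    audioSeconds / wallSeconds
}

let reference = words("Hello world. This is the one point five billion VibeVoice variant of the Microsoft text to speech model.")
let hypothesis = words("hello world, this is the one point five billion via voice variant of the microsoft texas speech model")
let errors = wordEditDistance(reference, hypothesis)
let wer = Double(errors) / Double(reference.count)
print(errors, reference.count)   // 4 18
```

At the word level the two acoustic substitutions count as four edits (one substitution plus one insertion, then one substitution plus one deletion), so a strict word error rate overstates how far the audio drifted from the script.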

Swap voices between generations

try tts.loadVoice(from: "en-Mike_man.safetensors")
let a = try await tts.generate(text: "First speaker line.")
try tts.loadVoice(from: "en-Emma_woman.safetensors")
let b = try await tts.generate(text: "Second speaker line.")
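
Because each call returns a separate sample buffer, a two-speaker exchange has to be assembled by the caller. The helper below is one possible sketch: joinTurns is a hypothetical name, and the 0.3-second pause is an arbitrary choice, not a value taken from the library.

```swift
/// Join per-speaker segments into one track, inserting silence between turns.
func joinTurns(_ turns: [[Float]], pauseSeconds: Double = 0.3, sampleRate: Int = 24_000) -> [Float] {
    // Silence buffer reused between consecutive turns.
    let pauseLength = Int((pauseSeconds * Double(sampleRate)).rounded())
    let pause = [Float](repeating: 0, count: pauseLength)
    var track: [Float] = []
    for (index, turn) in turns.enumerated() {
        if index > 0 { track += pause }   // no leading or trailing silence
        track += turn
    }
    return track
}
```

With the buffers from the snippet above, `joinTurns([a, b])` would yield a single 24 kHz track with a short gap between the two voices.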

CLI

Realtime-0.5B with a converted Microsoft voice cache (English):

speech vibevoice "Hello world." \
    --voice-cache voice_cache/en-Mike_man.safetensors \
    --output hello.wav

1.5B long-form with raw reference audio + transcript (cloning an arbitrary speaker, English or Chinese):

speech vibevoice "Long paragraph ..." \
    --long-form \
    --reference-audio voice.wav \
    --reference-transcript "what was actually said in voice.wav" \
    --max-tokens 4000 \
    --output episode.wav

Flags: --steps (DPM-Solver steps), --cfg (guidance), --model / --tokenizer to override HuggingFace IDs, --long-form to switch to the 1.5B preset, --verbose for timing.

Picking among speech-swift TTS modules

| | Kokoro-82M | Qwen3-TTS | CosyVoice3 | VibeVoice Realtime | VibeVoice 1.5B |
| --- | --- | --- | --- | --- | --- |
| Params | 82M | 7B | 7B | 500M | 1.5B |
| Backend | CoreML (ANE) | MLX | MLX | MLX | MLX |
| Languages | 8 | 10+ | 10+ | EN only | EN + ZH |
| Voice cloning | Fixed presets | ICL reference | Zero-shot reference | Pre-built voice caches only | Raw audio + transcript |
| Long-form | Short/medium | Streaming | Streaming | Streaming | Up to 90 min / 4 speakers |

Pick VibeVoice when…

…you need long-form, multi-speaker, or podcast/audiobook output in English or Chinese, with consistent voice identity across minutes of audio. For short-form multilingual TTS, Qwen3-TTS or CosyVoice3 are better fits. For iOS-native short utterances, Kokoro is the smallest option.