VibeVoice

Microsoft VibeVoice is a long-form, multi-speaker text-to-speech model for English and Chinese. Unlike short-utterance TTS, it's designed to generate podcast-length dialogue, audiobook narration, and multi-speaker scenes in a single pass — up to 90 minutes with up to 4 distinct voices and consistent identity throughout. Two variants ship: Realtime-0.5B for low-latency streaming and 1.5B for long-form flagship quality.

What it is

Long-form in one pass — up to 90 minutes of audio with consistent voices across the whole output; no per-sentence handoff
Multi-speaker dialogue — 4 distinct speakers at once, each conditioned by its own voice cache
English + Chinese — trained audio data is EN/ZH only; other languages are not supported (tokenizer accepts them but output is unintelligible)
24 kHz mono output — Float32 PCM, drop-in for AudioCommon.WAVWriter and StreamingAudioPlayer
MIT license — model weights and our Swift port are both MIT; INT4 quantized derivatives are allowed
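
To put the 90-minute ceiling in concrete terms, the arithmetic below estimates how large a single-pass output buffer becomes at the stated format (24 kHz, mono, Float32). It is a back-of-the-envelope sketch; the function names are illustrative and not part of the package.

```swift
// Output format taken from the list above: 24 kHz, mono, Float32.
let sampleRate = 24_000
let bytesPerSample = MemoryLayout<Float>.size  // 4 bytes per Float32 sample

/// Number of PCM samples needed for `minutes` of mono audio.
func sampleCount(minutes: Int) -> Int {
    minutes * 60 * sampleRate
}

/// Size in bytes of that audio when held in memory as a `[Float]` buffer.
func bufferBytes(minutes: Int) -> Int {
    sampleCount(minutes: minutes) * bytesPerSample
}

let maxSamples = sampleCount(minutes: 90)  // 129_600_000 samples
let maxBytes = bufferBytes(minutes: 90)    // 518_400_000 bytes, roughly 518 MB
```

A full-length generation held entirely in memory is therefore on the order of half a gigabyte, which is worth keeping in mind when choosing between buffering and streaming playback.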

Architecture

Four cooperating components produce audio one latent at a time, at a rate of 7.5 latents per second:

Component	Description
Split Qwen2 backbone	24-layer Qwen2.5 decoder (896 hidden, GQA 14/2 for Realtime-0.5B). The model is split: the lower 4 layers form a text LM, the upper 20 layers run as the TTS LM. Text windows (5 tokens at a time) flow through both; generated speech latents flow only through the TTS LM.
σ-VAE acoustic tokenizer	Streaming conv stack that encodes 24 kHz audio to a 64-dim latent at 7.5 Hz (3200× temporal downsample) and decodes latents back to waveform. Used for both voice-cache creation and final audio decode.
Diffusion head	Small 4-layer DDPM head with adaLN modulation. Samples each speech latent via 20-step DPM-Solver with classifier-free guidance (cfg = 1.3 default for Realtime-0.5B, 1.5 for 1.5B).
EOS classifier	Per-step binary classifier on the TTS LM's last hidden state. When sigmoid probability exceeds 0.5, generation stops.
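
The per-step numbers in the table can be sketched in a few lines. The guidance function below uses the textbook classifier-free guidance combination; whether the port combines the two predictions in exactly this form is an assumption, and all function names here are illustrative rather than the package's API.

```swift
import Foundation

// Figures from the component table: 24 kHz audio, 7.5 Hz latents.
let audioRate = 24_000.0
let latentRate = 7.5
let samplesPerLatent = Int(audioRate / latentRate)  // 3200, the temporal downsample factor

/// Latents needed to cover a given duration in seconds.
func latentCount(seconds: Double) -> Int {
    Int((seconds * latentRate).rounded(.up))
}

/// Classifier-free guidance in its textbook form (assumed, not confirmed by the source):
/// guided = uncond + scale * (cond - uncond).
func applyCFG(cond: [Float], uncond: [Float], scale: Float) -> [Float] {
    precondition(cond.count == uncond.count, "prediction sizes must match")
    return zip(cond, uncond).map { c, u in u + scale * (c - u) }
}

/// EOS check: stop once sigmoid(logit) exceeds the threshold.
func shouldStop(eosLogit: Float, threshold: Float = 0.5) -> Bool {
    let probability = 1 / (1 + exp(-eosLogit))
    return probability > threshold
}
```

With a threshold of 0.5 the EOS test is equivalent to checking whether the raw logit is positive, and a 90-minute output corresponds to `latentCount(seconds: 5400)` diffusion-sampled latents.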

Languages

Per Microsoft's model cards: Realtime-0.5B is English only (the upstream demo ships nine non-EN voice prompts as exploratory; quality is not guaranteed). 1.5B supports English and Chinese; other languages may produce plausible-sounding but unfaithful audio and should be considered experimental.

Voice identity — two distinct paths

The two variants take very different approaches to speaker conditioning, and each path has constraints worth knowing up-front.

Realtime-0.5B — pre-built voice caches

Speaker identity comes from a precomputed .safetensors voice cache containing the conditioning KV caches and hidden states for a specific speaker. Loading a cache is instantaneous; one model instance can swap voices cheaply between generations.

The Realtime-0.5B checkpoint is distributed inference-only — Microsoft does not ship the acoustic encoder, so voice caches cannot be minted from raw audio against this model. The supported source is one of Microsoft's pre-built .pt voice caches (Carter, Davis, Emma, Frank, Grace, Mike, Samuel for English, plus exploratory de/fr/it/jp/kr/nl/pl/pt voices), flattened into the .safetensors layout this loader expects.
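
When several converted caches sit in one directory, a small helper can index them by language and speaker. The sketch below assumes file names follow the pattern seen in this page's examples (`en-Mike_man.safetensors`, i.e. language, hyphen, speaker name, underscore, descriptor); that pattern is inferred from those examples, and `VoiceCacheInfo`, `parseVoiceCache`, and `listVoiceCaches` are hypothetical names, not part of the loader.

```swift
import Foundation

/// Hypothetical descriptor for a converted voice cache file.
struct VoiceCacheInfo: Equatable {
    let language: String  // e.g. "en"
    let speaker: String   // e.g. "Mike"
    let path: String
}

/// Parses names shaped like "en-Mike_man.safetensors".
/// Returns nil for anything that does not fit that assumed pattern.
func parseVoiceCache(path: String) -> VoiceCacheInfo? {
    let url = URL(fileURLWithPath: path)
    guard url.pathExtension == "safetensors" else { return nil }
    let stem = url.deletingPathExtension().lastPathComponent
    let parts = stem.split(separator: "-", maxSplits: 1)
    guard parts.count == 2 else { return nil }
    // Everything before the first underscore is taken as the speaker name.
    guard let speaker = parts[1].split(separator: "_").first.map({ String($0) }) else {
        return nil
    }
    return VoiceCacheInfo(language: String(parts[0]), speaker: speaker, path: path)
}

/// Lists the caches in a directory that match the assumed naming scheme.
func listVoiceCaches(in directory: String) -> [VoiceCacheInfo] {
    let names = (try? FileManager.default.contentsOfDirectory(atPath: directory)) ?? []
    return names.sorted().compactMap { parseVoiceCache(path: directory + "/" + $0) }
}
```

The resulting `path` can be passed to `loadVoice(from:)` as shown in the Quick start section.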

1.5B long-form — voice cloning from raw audio

The 1.5B checkpoint does ship the acoustic encoder, so it can clone an arbitrary speaker from a reference waveform + transcript in one shot. The encoding is inlined on every synthesis call — there is no separate voice-cache file to manage.

FYI: speech vibevoice-encode-voice is gated

The CLI command for offline voice-cache creation fails fast when pointed at Realtime-0.5B and directs you to the 1.5B raw-audio workflow, because the encoder weights are not present in the 0.5B checkpoint. Until Microsoft ships them, the 1.5B workflow is the only end-to-end path for cloning a custom speaker.
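
The two conditioning paths can be summarised as a small compatibility check. This is a sketch of the rules described in this section, not the package's own validation code: the type and function names are hypothetical, and the rejection of cache files on 1.5B is an assumption drawn from the statement that it has no separate voice-cache file.

```swift
enum VibeVoiceVariant { case realtime05B, longForm15B }

enum VoiceSource {
    case prebuiltCache(path: String)
    case rawAudio(samples: [Float], transcript: String)
}

/// Returns nil when the combination is supported, otherwise a short reason.
func rejectionReason(for source: VoiceSource, on variant: VibeVoiceVariant) -> String? {
    switch (variant, source) {
    case (.realtime05B, .prebuiltCache), (.longForm15B, .rawAudio):
        return nil
    case (.realtime05B, .rawAudio):
        // The 0.5B checkpoint ships without the acoustic encoder.
        return "Realtime-0.5B cannot encode raw audio; use the 1.5B raw-audio workflow."
    case (.longForm15B, .prebuiltCache):
        // Assumption: 1.5B encodes the reference on every call and reads no cache file.
        return "1.5B takes reference audio plus transcript, not a voice-cache file."
    }
}
```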

Model

Bundle	Quantization	Size	HuggingFace
Realtime-0.5B	BF16 (source)	~1 GB	microsoft/VibeVoice-Realtime-0.5B
Realtime-0.5B INT4	Qwen2 INT4, tokenizer + diffusion FP16	~350 MB	aufklarer/VibeVoice-Realtime-0.5B-MLX-INT4
Realtime-0.5B INT8	Qwen2 INT8	~570 MB	aufklarer/VibeVoice-Realtime-0.5B-MLX-INT8
1.5B long-form	BF16 (source)	~3 GB	microsoft/VibeVoice-1.5B
1.5B INT4 (production)	Qwen2 INT4 + dual encoders	~1 GB	aufklarer/VibeVoice-1.5B-MLX-INT4

Quantization uses MLX group-wise affine quantization with a group size of 32. Embeddings, norms, acoustic-tokenizer convolutions, and the EOS classifier stay in their source dtype.
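
For readers unfamiliar with group-wise affine quantization, the idea is that every run of 32 consecutive weights gets its own scale and bias, and each weight is stored as a small integer code. The sketch below illustrates that scheme in plain Swift on `[Float]`; it is a didactic approximation and not MLX's kernel, and the type and function names are made up for this example.

```swift
/// One quantized group: integer codes plus the affine parameters to undo them.
struct QuantizedGroup {
    let codes: [UInt8]  // each value in 0...(2^bits - 1)
    let scale: Float
    let bias: Float
}

/// Affine quantization of one group, so that w ≈ scale * q + bias.
func quantize(group: [Float], bits: Int = 4) -> QuantizedGroup {
    precondition(!group.isEmpty && (1...8).contains(bits))
    let levels = Float((1 << bits) - 1)  // 15 for INT4
    let lo = group.min()!
    let hi = group.max()!
    // A constant group has zero range; fall back to scale 1 to avoid dividing by zero.
    let scale = hi > lo ? (hi - lo) / levels : 1
    let codes = group.map { w -> UInt8 in
        let q = ((w - lo) / scale).rounded()
        return UInt8(min(max(q, 0), levels))
    }
    return QuantizedGroup(codes: codes, scale: scale, bias: lo)
}

/// Reconstructs approximate weights from a quantized group.
func dequantize(_ g: QuantizedGroup) -> [Float] {
    g.codes.map { Float($0) * g.scale + g.bias }
}

/// Splits a weight row into groups of `groupSize` and quantizes each independently.
func quantizeRow(_ row: [Float], groupSize: Int = 32, bits: Int = 4) -> [QuantizedGroup] {
    stride(from: 0, to: row.count, by: groupSize).map { start in
        let end = min(start + groupSize, row.count)
        return quantize(group: Array(row[start..<end]), bits: bits)
    }
}
```

Because each group carries its own range, the worst-case rounding error per weight is half of that group's scale, which is why small groups preserve quality better than one scale per tensor.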

Quick start

import VibeVoiceTTS

let tts = try await VibeVoiceTTSModel.fromPretrained()
try tts.loadVoice(from: "/path/to/voice_cache/en-Mike_man.safetensors")
let pcm = try await tts.generate(text: "Hello world.")
// pcm: [Float] at 24 kHz mono
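
To get the returned samples onto disk without depending on any package API, a Foundation-only encoder is enough. The list at the top of this page names AudioCommon.WAVWriter as the intended route, but its signature is not shown here, so the sketch below is a standalone alternative: `wavData(from:sampleRate:)` is a hypothetical helper that converts the Float samples to 16-bit PCM and prepends a canonical 44-byte WAV header.

```swift
import Foundation

/// Encodes mono Float samples in [-1, 1] as a 16-bit PCM WAV file image.
func wavData(from samples: [Float], sampleRate: Int = 24_000) -> Data {
    var data = Data()
    // Appends any fixed-width integer in little-endian byte order.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    let byteCount = samples.count * 2
    data.append(contentsOf: Array("RIFF".utf8)); append(UInt32(36 + byteCount))
    data.append(contentsOf: Array("WAVE".utf8))
    data.append(contentsOf: Array("fmt ".utf8)); append(UInt32(16))
    append(UInt16(1))               // format tag: integer PCM
    append(UInt16(1))               // channels: mono
    append(UInt32(sampleRate))
    append(UInt32(sampleRate * 2))  // byte rate = rate * channels * 2 bytes
    append(UInt16(2))               // block align
    append(UInt16(16))              // bits per sample
    data.append(contentsOf: Array("data".utf8)); append(UInt32(byteCount))
    for s in samples {
        let clamped = max(-1, min(1, s))  // guard against out-of-range samples
        append(Int16((clamped * 32767).rounded()))
    }
    return data
}
```

Usage would look like `try wavData(from: pcm).write(to: URL(fileURLWithPath: "hello.wav"))`. The conversion to 16-bit loses precision relative to the Float32 buffer; keep the Float samples if further processing is planned.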

Long-form 1.5B (different API)

1.5B has a different architecture (unified Qwen2 LM, dual encoders, LM token sampling) so it ships as a separate class — VibeVoice15BTTSModel. Reference audio + text go in a single call:

let tts = try await VibeVoice15BTTSModel.fromPretrained()
let pcm = try await tts.generate(
    text: "Long English script.",
    referenceAudio: refSamples,    // [Float] mono speech, any rate
    referenceTranscript: "",
    sampleRate: 24000
)

No voice cache is needed: the model encodes the reference audio through both acoustic_tokenizer (64-dim) and semantic_tokenizer (128-dim, ASR-trained) and sums them at the audio prompt positions. Generation samples LM tokens and branches three ways, on <speech_diffusion>, <speech_end>, or a text token; an acoustic latent is diffused only when the LM emits <speech_diffusion>.
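
The branching described above can be pictured as a simple loop. The sketch below stands in closures for the real LM and diffusion head, so it only illustrates control flow; `LMToken` and `generateLatents` are hypothetical names, and treating <speech_end> as the point where the loop returns is an assumption about the port's behavior.

```swift
/// Hypothetical view of what the LM can emit at each step.
enum LMToken: Equatable {
    case speechDiffusion  // <speech_diffusion>
    case speechEnd        // <speech_end>
    case text(Int)        // any other token id
}

/// Drives the loop with caller-supplied closures standing in for the real model.
func generateLatents(
    maxTokens: Int,
    nextToken: () -> LMToken,
    diffuseLatent: () -> [Float]
) -> [[Float]] {
    var latents: [[Float]] = []
    for _ in 0..<maxTokens {
        switch nextToken() {
        case .speechDiffusion:
            latents.append(diffuseLatent())  // one acoustic latent per speech token
        case .speechEnd:
            return latents                   // assumed to end the speech segment
        case .text:
            continue                         // text tokens produce no audio
        }
    }
    return latents  // token budget exhausted before <speech_end>
}
```

The `maxTokens` bound plays the same role as the CLI's --max-tokens flag in the example further down: it caps the loop even if the end token never arrives.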

ASR-verified on M2 Max with the INT4 bundle (RTFx 1.48): for input "Hello world. This is the one point five billion VibeVoice variant of the Microsoft text to speech model.", Nemotron transcribed the output as "hello world, this is the one point five billion via voice variant of the microsoft texas speech model". Every content word matched except for two acoustic substitutions: VibeVoice → via voice, and text to → texas.
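
A round-trip check like this one can be automated by scoring the ASR transcript against the input text. The sketch below computes a word-level edit distance and word error rate after light normalization, plus a real-time factor helper. RTFx is read here as audio duration divided by wall-clock synthesis time, which is a common convention rather than something this page defines; all function names are illustrative.

```swift
import Foundation

/// Lowercases and strips punctuation so "world." and "world," compare equal.
func normalizedWords(_ text: String) -> [String] {
    text.lowercased()
        .components(separatedBy: CharacterSet.alphanumerics.inverted)
        .filter { !$0.isEmpty }
}

/// Word-level Levenshtein distance (substitutions, insertions, deletions).
func wordEditDistance(_ ref: [String], _ hyp: [String]) -> Int {
    var previous = Array(0...hyp.count)
    for (i, r) in ref.enumerated() {
        var current = [i + 1] + Array(repeating: 0, count: hyp.count)
        for (j, h) in hyp.enumerated() {
            let substitution = previous[j] + (r == h ? 0 : 1)
            current[j + 1] = min(substitution, previous[j + 1] + 1, current[j] + 1)
        }
        previous = current
    }
    return previous[hyp.count]
}

/// Edit distance divided by the number of reference words.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = normalizedWords(reference)
    let hyp = normalizedWords(hypothesis)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    return Double(wordEditDistance(ref, hyp)) / Double(ref.count)
}

/// Assumed definition: values above 1 mean faster than real time.
func realTimeFactor(audioSeconds: Double, wallSeconds: Double) -> Double {
    audioSeconds / wallSeconds
}
```

Note that a strict word-level metric counts a split such as "via voice" as two edits even though it is a single acoustic confusion, so the score is best read alongside the transcript itself.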

Swap voices between generations

try tts.loadVoice(from: "en-Mike_man.safetensors")
let a = try await tts.generate(text: "First speaker line.")
try tts.loadVoice(from: "en-Emma_woman.safetensors")
let b = try await tts.generate(text: "Second speaker line.")
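
The same pattern extends to a scripted dialogue: walk the lines in order, reload the cache only when the speaker changes, and join the clips with a short silence. The sketch below is written against a hypothetical protocol, `VoiceSwappingTTS`, that mirrors the two calls used above; making VibeVoiceTTSModel conform to it would need a one-line extension and is an assumption about its exact signatures. `DialogueLine` and `renderDialogue` are likewise illustrative.

```swift
/// Hypothetical protocol capturing just the two calls used above.
protocol VoiceSwappingTTS {
    func loadVoice(from path: String) throws
    func generate(text: String) async throws -> [Float]
}

struct DialogueLine {
    let voiceCache: String  // path to a .safetensors voice cache
    let text: String
}

/// Renders lines in order and concatenates them with a silent gap between clips.
func renderDialogue<T: VoiceSwappingTTS>(
    _ lines: [DialogueLine],
    with tts: T,
    gapSeconds: Double = 0.3,
    sampleRate: Int = 24_000
) async throws -> [Float] {
    var output: [Float] = []
    var loadedCache: String? = nil
    let gapSamples = Int((gapSeconds * Double(sampleRate)).rounded())
    let gap = [Float](repeating: 0, count: gapSamples)
    for (index, line) in lines.enumerated() {
        // Skip the reload when consecutive lines share a speaker.
        if loadedCache != line.voiceCache {
            try tts.loadVoice(from: line.voiceCache)
            loadedCache = line.voiceCache
        }
        if index > 0 { output += gap }
        output += try await tts.generate(text: line.text)
    }
    return output
}
```

This stitches separate generations together, so it suits short exchanges on Realtime-0.5B; it is not a substitute for the single-pass multi-speaker generation that the long-form model is designed for.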

CLI

Realtime-0.5B with a converted Microsoft voice cache (English):

speech vibevoice "Hello world." \
    --voice-cache voice_cache/en-Mike_man.safetensors \
    --output hello.wav

1.5B long-form with raw reference audio + transcript (cloning an arbitrary speaker, English or Chinese):

speech vibevoice "Long paragraph ..." \
    --long-form \
    --reference-audio voice.wav \
    --reference-transcript "what was actually said in voice.wav" \
    --max-tokens 4000 \
    --output episode.wav

Flags: --steps (DPM-Solver steps), --cfg (guidance), --model / --tokenizer to override HuggingFace IDs, --long-form to switch to the 1.5B preset, --verbose for timing.
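
When the CLI is driven from Swift, for example through Foundation's Process, it can help to assemble the argument list in one place. The sketch below uses only flags that appear on this page; `VibeVoiceInvocation` is a hypothetical helper, and the choice to emit --long-form whenever reference audio is set is an assumption made for the example.

```swift
/// Hypothetical builder for the argument list of `speech vibevoice`.
struct VibeVoiceInvocation {
    var text: String
    var output: String
    var voiceCache: String? = nil           // Realtime-0.5B path
    var referenceAudio: String? = nil       // 1.5B path
    var referenceTranscript: String? = nil
    var maxTokens: Int? = nil
    var steps: Int? = nil
    var cfg: Double? = nil
    var verbose = false

    /// Arguments to pass after the `speech` executable name.
    var arguments: [String] {
        var args = ["vibevoice", text]
        if let voiceCache { args += ["--voice-cache", voiceCache] }
        if let referenceAudio {
            // Assumption: raw reference audio implies the 1.5B preset.
            args += ["--long-form", "--reference-audio", referenceAudio]
            if let referenceTranscript {
                args += ["--reference-transcript", referenceTranscript]
            }
        }
        if let maxTokens { args += ["--max-tokens", String(maxTokens)] }
        if let steps { args += ["--steps", String(steps)] }
        if let cfg { args += ["--cfg", String(cfg)] }
        if verbose { args.append("--verbose") }
        args += ["--output", output]
        return args
    }
}
```

Passing arguments as an array rather than a shell string sidesteps quoting problems with long scripts and transcripts that contain spaces or quotes.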

Picking among speech-swift TTS modules

	Kokoro-82M	Qwen3-TTS	CosyVoice3	VibeVoice Realtime	VibeVoice 1.5B
Params	82M	7B	7B	500M	1.5B
Backend	CoreML (ANE)	MLX	MLX	MLX	MLX
Languages	8	10+	10+	EN only	EN + ZH
Voice cloning	Fixed presets	ICL reference	Zero-shot reference	Pre-built voice caches only	Raw audio + transcript
Long-form	Short/medium	Streaming	Streaming	Streaming	Up to 90 min / 4 speakers

Pick VibeVoice when…

…you need long-form, multi-speaker, or podcast/audiobook output in English or Chinese, with consistent voice identity across minutes of audio. For short-form multilingual TTS, Qwen3-TTS or CosyVoice3 are better fits. For iOS-native short utterances, Kokoro is the smallest option.