VibeVoice
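The English version of this page was updated on 2026-05-10 (language support scope, voice-cloning workflow); this translation will follow. See the English version for the latest content.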
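Microsoft VibeVoice is a long-form, multi-speaker text-to-speech model for English and Chinese. Unlike short-utterance TTS, it is designed to generate podcast-length dialogue, audiobook narration, and multi-character scenes in a single pass: up to 90 minutes of audio with up to 4 distinct voices, each keeping a consistent identity across the whole output. Two variants are available: Realtime-0.5B for low-latency streaming, and 1.5B for flagship-quality long-form generation.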
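Overview
- Single-pass long-form generation: up to 90 minutes of audio with a consistent voice throughout, with no sentence-by-sentence stitching
- Multi-speaker dialogue: up to 4 distinct speakers at once, each conditioned by its own voice cache
- English + Chinese: the training audio is English and Chinese only; other languages are unsupported (the tokenizer accepts them, but the output is unintelligible)
- 24 kHz mono output: Float32 PCM that can be passed directly to AudioCommon.WAVWriter and the streaming AudioPlayer
- MIT license: both the model weights and our Swift port are MIT-licensed; INT4-quantized derivative works are permitted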
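To make the output format concrete, the sketch below works out how many samples and bytes a generation occupies at 24 kHz mono Float32. The helper names (`sampleCount`, `pcmByteCount`) are illustrative only and are not part of the library.

```swift
import Foundation

// Output format stated in the overview: 24 kHz, mono, Float32 samples.
let sampleRate = 24_000
let bytesPerSample = MemoryLayout<Float>.size  // 4 bytes

/// Number of samples needed to hold `seconds` of mono audio.
func sampleCount(seconds: Double) -> Int {
    Int((seconds * Double(sampleRate)).rounded())
}

/// Size in bytes of the raw Float32 buffer for `seconds` of audio.
func pcmByteCount(seconds: Double) -> Int {
    sampleCount(seconds: seconds) * bytesPerSample
}

// The 90-minute upper bound from the overview.
let ninetyMinutes = 90.0 * 60.0
let samples = sampleCount(seconds: ninetyMinutes)
let megabytes = Double(pcmByteCount(seconds: ninetyMinutes)) / 1_000_000
print("samples: \(samples), raw PCM: \(megabytes) MB")
```

A full 90-minute pass comes to 129,600,000 samples, or roughly 518 MB of raw Float32 PCM, so for very long outputs it may be worth streaming to a player or file rather than keeping everything in one array.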
Architecture
Four cooperating components generate audio latents one at a time, at a rate of 7.5 Hz:
| Component | Description |
|---|---|
| Split Qwen2 backbone | 24-layer Qwen2.5 decoder (896 hidden, GQA 14/2 for Realtime-0.5B). The model is split: the lower 4 layers form a text LM, the upper 20 layers run as the TTS LM. Text windows (5 tokens at a time) flow through both; generated speech latents flow only through the TTS LM. |
| σ-VAE acoustic tokenizer | Streaming conv stack that encodes 24 kHz audio to a 64-dim latent at 7.5 Hz (3200× temporal downsample) and decodes latents back to waveform. Used for both voice-cache creation and final audio decode. |
| Diffusion head | Small 4-layer DDPM head with adaLN modulation. Samples each speech latent via 20-step DPM-Solver with classifier-free guidance (cfg = 1.3 default for Realtime-0.5B, 1.5 for 1.5B). |
| EOS classifier | Per-step binary classifier on the TTS LM's last hidden state. When sigmoid probability exceeds 0.5, generation stops. |
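The numbers in the table can be checked with a little arithmetic. The sketch below also shows the textbook form of classifier-free guidance and the EOS threshold rule. It is a minimal illustration under stated assumptions: the function names are hypothetical, and the guidance formula is the commonly used one, not necessarily the exact expression in the model code.

```swift
import Foundation

let sampleRate = 24_000.0
let latentRate = 7.5

/// Audio samples covered by one latent frame: 24_000 / 7.5 = 3_200.
let samplesPerLatent = Int(sampleRate / latentRate)

/// Latent frames needed for `seconds` of audio, rounded up.
func latentFrames(seconds: Double) -> Int {
    Int((seconds * latentRate).rounded(.up))
}

/// Classifier-free guidance in its common form (assumed here):
/// guided = uncond + scale * (cond - uncond).
func applyCFG(cond: [Float], uncond: [Float], scale: Float) -> [Float] {
    precondition(cond.count == uncond.count, "prediction sizes must match")
    return zip(cond, uncond).map { $1 + scale * ($0 - $1) }
}

/// EOS rule from the table: stop once sigmoid(logit) exceeds 0.5.
func shouldStop(eosLogit: Float, threshold: Float = 0.5) -> Bool {
    let probability = 1 / (1 + exp(-eosLogit))
    return probability > threshold
}

print(samplesPerLatent)                 // samples per latent frame
print(latentFrames(seconds: 90 * 60))   // frames in a 90-minute output
print(shouldStop(eosLogit: 2.0))        // positive logit -> stop
```

With a 0.5 threshold the EOS test is equivalent to checking whether the classifier logit is positive. A 90-minute output corresponds to 40,500 latent frames, each decoded by the tokenizer into 3,200 audio samples.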
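Voice cloning via voice cache
At generation time, speaker identity does not come from a reference waveform. Each voice is supplied as a precomputed .safetensors voice cache that holds the speaker-specific conditioning KV cache and hidden states, produced offline by running reference audio through the encoder path. Loading a voice cache at runtime is near-instant, so a single model instance can switch voices cheaply between generations.
Sample voice caches (MIT license): mzbac/vibevoice.swift/voice_cache, which contains 7 English voices: Carter, Davis, Emma, Frank, Grace, Mike, and the Indian-accented Samuel.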
Models
| Package | Quantization | Size | HuggingFace |
|---|---|---|---|
| Realtime-0.5B | BF16 (source) | ~1 GB | microsoft/VibeVoice-Realtime-0.5B |
| Realtime-0.5B INT4 | Qwen2 INT4, tokenizer + diffusion FP16 | ~350 MB | aufklarer/VibeVoice-Realtime-0.5B-MLX-INT4 |
| Realtime-0.5B INT8 | Qwen2 INT8 | ~570 MB | aufklarer/VibeVoice-Realtime-0.5B-MLX-INT8 |
| 1.5B long-form | BF16 (source) | ~3 GB | microsoft/VibeVoice-1.5B |
| 1.5B INT4 (production) | Qwen2 INT4 + dual encoders | ~1 GB | aufklarer/VibeVoice-1.5B-MLX-INT4 |
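Quantized weights are produced with MLX group-wise affine quantization (group size 32). Embeddings, normalization layers, the acoustic-tokenizer convolution layers, and the EOS classifier are kept in the source data type.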
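For readers unfamiliar with group-wise affine quantization, here is a generic sketch in plain Swift. It illustrates the idea (each group of weights gets its own scale and bias, and weights are stored as small integer codes); it is not MLX's kernel, and the type and function names are hypothetical. The storage estimate assumes a 16-bit scale and a 16-bit bias per group.

```swift
import Foundation

/// One quantized group: integer codes plus the scale and bias needed to undo them.
struct QuantizedGroup {
    let codes: [UInt8]   // each value lies in 0...(2^bits - 1)
    let scale: Float
    let bias: Float
}

/// Affine-quantize one group so that w ≈ scale * code + bias.
func quantize(group: [Float], bits: Int) -> QuantizedGroup {
    precondition(!group.isEmpty && (1...8).contains(bits))
    let levels = Float((1 << bits) - 1)
    let lo = group.min()!
    let hi = group.max()!
    // A constant group would give scale 0; fall back to 1 to avoid dividing by zero.
    let scale = hi > lo ? (hi - lo) / levels : 1
    let codes = group.map { w -> UInt8 in
        UInt8(min(max(((w - lo) / scale).rounded(), 0), levels))
    }
    return QuantizedGroup(codes: codes, scale: scale, bias: lo)
}

/// Reconstruct approximate weights from a quantized group.
func dequantize(_ q: QuantizedGroup) -> [Float] {
    q.codes.map { q.scale * Float($0) + q.bias }
}

/// Split a weight row into groups of `groupSize` and quantize each one.
func quantize(row: [Float], groupSize: Int = 32, bits: Int = 4) -> [QuantizedGroup] {
    precondition(row.count % groupSize == 0, "row length must be a multiple of the group size")
    return stride(from: 0, to: row.count, by: groupSize).map {
        quantize(group: Array(row[$0..<($0 + groupSize)]), bits: bits)
    }
}

/// Effective bits per weight, assuming a 16-bit scale and 16-bit bias per group.
func bitsPerWeight(bits: Int, groupSize: Int) -> Double {
    Double(bits) + 32.0 / Double(groupSize)
}

let demoRow: [Float] = (0..<64).map { Float($0) * 0.1 - 3 }
let demoGroups = quantize(row: demoRow)
print("groups: \(demoGroups.count), bits/weight: \(bitsPerWeight(bits: 4, groupSize: 32))")
```

Under these assumptions, 4-bit codes with a group size of 32 cost about 5 bits per weight once the per-group scale and bias are counted, and the reconstruction error of each weight is at most half of its group's scale.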
Quick start
import VibeVoiceTTS
let tts = try await VibeVoiceTTSModel.fromPretrained()
try tts.loadVoice(from: "/path/to/voice_cache/en-Mike_man.safetensors")
let pcm = try await tts.generate(text: "Hello world.")
// pcm: [Float] at 24 kHz mono
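If you want to inspect the returned samples without going through the library's own writer, a 16-bit PCM WAV file is easy to assemble by hand. The sketch below is a standalone encoder built on Foundation only; `wavData` is a hypothetical helper, not the `AudioCommon.WAVWriter` API, and a synthetic tone stands in for the `pcm` array.

```swift
import Foundation

/// Encode mono Float samples as a 16-bit PCM WAV file held in memory.
func wavData(samples: [Float], sampleRate: Int = 24_000) -> Data {
    let channels = 1
    let bitsPerSample = 16
    let blockAlign = channels * bitsPerSample / 8
    let dataSize = samples.count * blockAlign

    var data = Data()
    // WAV fields are little-endian.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }

    data.append(contentsOf: Array("RIFF".utf8))
    append(UInt32(36 + dataSize))              // RIFF chunk size
    data.append(contentsOf: Array("WAVE".utf8))
    data.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))                         // fmt chunk size
    append(UInt16(1))                          // format tag 1 = integer PCM
    append(UInt16(channels))
    append(UInt32(sampleRate))
    append(UInt32(sampleRate * blockAlign))    // byte rate
    append(UInt16(blockAlign))
    append(UInt16(bitsPerSample))
    data.append(contentsOf: Array("data".utf8))
    append(UInt32(dataSize))

    for sample in samples {
        // Clamp to [-1, 1] before scaling so loud samples do not overflow Int16.
        let clamped = max(-1, min(1, sample))
        append(Int16((clamped * 32767).rounded()))
    }
    return data
}

// Stand-in for the `pcm` array: one second of a 440 Hz tone at half amplitude.
let tone: [Float] = (0..<24_000).map { i in
    let phase = 2.0 * Double.pi * 440.0 * Double(i) / 24_000.0
    return Float(sin(phase)) * 0.5
}
let wav = wavData(samples: tone)
print("WAV bytes: \(wav.count)")
// To save it: try wav.write(to: URL(fileURLWithPath: "hello.wav"))
```

Converting to 16-bit halves the file size relative to Float32 and is accepted by most players; keep the Float32 array if you plan further processing.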
Long-form 1.5B (different API)
The 1.5B model has a different architecture (unified Qwen2 LM, dual encoders, LM token sampling), so it ships as a separate class, VibeVoice15BTTSModel. Reference audio and text go in a single call:
let tts = try await VibeVoice15BTTSModel.fromPretrained()
let pcm = try await tts.generate(
    text: "Long English script.",
    referenceAudio: refSamples,  // [Float] mono speech, any rate
    referenceTranscript: "",
    sampleRate: 24000
)
No voice cache is needed: the model encodes the reference audio through both acoustic_tokenizer (64-dim) and semantic_tokenizer (128-dim, ASR-trained) and sums them at the audio prompt positions. Generation samples LM tokens and branches on <speech_diffusion> / <speech_end> / text, diffusing an acoustic latent only when the LM emits the speech token.
ASR-verified on M2 Max with INT4 (RTFx 1.48): for the input "Hello world. This is the one point five billion VibeVoice variant of the Microsoft text to speech model.", Nemotron transcribed the output as "hello world, this is the one point five billion via voice variant of the microsoft texas speech model". Every other word matched; the only differences are two acoustic substitutions, VibeVoice → via voice and text to → texas.
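A check like this can be scripted. The sketch below shows one way to compute RTFx (seconds of audio produced per second of compute) and a word-level edit distance between the input text and an ASR transcript. All function names are hypothetical, and the text normalization (lowercasing, dropping punctuation) is an assumption rather than the procedure used for the figure above.

```swift
import Foundation

/// Inverse real-time factor: seconds of audio produced per second of compute.
/// Values above 1 mean faster than real time.
func rtfx(audioSeconds: Double, wallSeconds: Double) -> Double {
    audioSeconds / wallSeconds
}

/// Lowercase, drop punctuation, and split on whitespace.
func words(_ text: String) -> [String] {
    text.lowercased()
        .filter { $0.isLetter || $0.isNumber || $0.isWhitespace }
        .split(whereSeparator: { $0.isWhitespace })
        .map(String.init)
}

/// Word-level Levenshtein distance (substitutions + insertions + deletions).
func wordEditDistance(_ reference: [String], _ hypothesis: [String]) -> Int {
    var previous = Array(0...hypothesis.count)
    for (i, refWord) in reference.enumerated() {
        var current = [i + 1] + Array(repeating: 0, count: hypothesis.count)
        for (j, hypWord) in hypothesis.enumerated() {
            let substitution = previous[j] + (refWord == hypWord ? 0 : 1)
            current[j + 1] = min(substitution, previous[j + 1] + 1, current[j] + 1)
        }
        previous = current
    }
    return previous[hypothesis.count]
}

let reference = words("Hello world. This is the one point five billion VibeVoice variant of the Microsoft text to speech model.")
let hypothesis = words("hello world, this is the one point five billion via voice variant of the microsoft texas speech model")
let edits = wordEditDistance(reference, hypothesis)
print("reference words: \(reference.count), word edits: \(edits)")
```

Because the metric counts whole-word edits, a split such as VibeVoice → via voice costs two edits (one substitution, one insertion) even though it sounds nearly identical, which is why acoustic substitutions are usually reported separately from content errors.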
Switching voices between generations
try tts.loadVoice(from: "en-Mike_man.safetensors")
let a = try await tts.generate(text: "First speaker line.")
try tts.loadVoice(from: "en-Emma_woman.safetensors")
let b = try await tts.generate(text: "Second speaker line.")
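For a scripted dialogue, the same pattern can be driven from a speaker-labeled text. The sketch below parses a hypothetical "Speaker: line" script format into turns and looks up a voice-cache path for each speaker. The script format, the `Turn` type, and the `voiceFiles` mapping are assumptions made for illustration; the model calls are left as comments because they need the loaded model from the snippet above.

```swift
import Foundation

/// One line of dialogue attributed to a speaker.
struct Turn: Equatable {
    let speaker: String
    let text: String
}

/// Parse lines of the form "Speaker: text"; blank or unlabeled lines are skipped.
func parseScript(_ script: String) -> [Turn] {
    script.split(whereSeparator: { $0.isNewline }).compactMap { line -> Turn? in
        guard let colon = line.firstIndex(of: ":") else { return nil }
        let speaker = line[..<colon].trimmingCharacters(in: .whitespaces)
        let text = line[line.index(after: colon)...].trimmingCharacters(in: .whitespaces)
        return speaker.isEmpty || text.isEmpty ? nil : Turn(speaker: speaker, text: text)
    }
}

// Hypothetical mapping from speaker label to voice-cache file.
let voiceFiles = [
    "Mike": "voice_cache/en-Mike_man.safetensors",
    "Emma": "voice_cache/en-Emma_woman.safetensors",
]

let turns = parseScript("""
Mike: First speaker line.
Emma: Second speaker line.
""")

for turn in turns {
    guard let path = voiceFiles[turn.speaker] else { continue }
    // With a loaded model this would become:
    //   try tts.loadVoice(from: path)
    //   let pcm = try await tts.generate(text: turn.text)
    print("\(turn.speaker) -> \(path): \(turn.text)")
}
```

Generating turn by turn keeps each speaker tied to exactly one voice cache; the resulting sample arrays can then be concatenated in script order.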
CLI
speech vibevoice "Hello world." \
    --voice-cache voice_cache/en-Mike_man.safetensors \
    --output hello.wav
# Long-form 1.5B
speech vibevoice "Long paragraph ..." \
    --long-form \
    --reference-audio reference_speech.wav \
    --reference-transcript "exact transcript of the reference" \
    --max-tokens 500 --steps 20 \
    --output episode.wav
Optional flags: --steps (number of DPM-Solver steps), --cfg (guidance strength), --model / --tokenizer to override the HuggingFace IDs, --long-form to switch to the 1.5B preset, and --verbose to print timings.
Choosing between speech-swift TTS modules
| | Kokoro-82M | Qwen3-TTS | CosyVoice3 | VibeVoice Realtime | VibeVoice 1.5B |
|---|---|---|---|---|---|
| Parameters | 82M | 7B | 7B | 500M | 1.5B |
| Backend | CoreML (ANE) | MLX | MLX | MLX | MLX |
| Languages | 8 | 10+ | 10+ | EN/ZH | EN/ZH |
| Voice cloning | fixed presets | ICL reference audio | zero-shot reference | voice cache | reference audio |
| Long-form | short/medium | streaming | streaming | streaming | up to 90 min / 4 speakers |
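Choose VibeVoice when you need long-form, multi-speaker, or podcast / audiobook output in English or Chinese, and the audio must keep a consistent voice identity over many minutes. For short-form multilingual TTS, prefer Qwen3-TTS or CosyVoice3. For iOS-native short-utterance synthesis, prefer Kokoro, which has the smallest footprint.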