VibeVoice

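Microsoft VibeVoice is a long-form, multi-speaker text-to-speech model for English and Chinese. Unlike short-utterance TTS, it is designed to generate podcast-length conversations, audiobook narration, and multi-character scenes in a single pass: up to 90 minutes and up to 4 distinct voices, with consistent speaker identity across the whole output. Two variants are available: Realtime-0.5B for low-latency streaming, and 1.5B as the flagship-quality long-form model.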

Overview

Architecture

Four cooperating components generate audio latents one at a time, at a frame rate of 7.5 Hz:

| Component | Description |
| --- | --- |
| Split Qwen2 backbone | 24-layer Qwen2.5 decoder (896 hidden, GQA 14/2 for Realtime-0.5B). The model is split: the lower 4 layers form a text LM, the upper 20 layers run as the TTS LM. Text windows (5 tokens at a time) flow through both; generated speech latents flow only through the TTS LM. |
| σ-VAE acoustic tokenizer | Streaming conv stack that encodes 24 kHz audio to a 64-dim latent at 7.5 Hz (3200× temporal downsample) and decodes latents back to waveform. Used for both voice-cache creation and final audio decode. |
| Diffusion head | Small 4-layer DDPM head with adaLN modulation. Samples each speech latent via 20-step DPM-Solver with classifier-free guidance (cfg = 1.3 default for Realtime-0.5B, 1.5 for 1.5B). |
| EOS classifier | Per-step binary classifier on the TTS LM's last hidden state. When the sigmoid probability exceeds 0.5, generation stops. |
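
The numbers in the table can be tied together with a short sketch. It shows where the 3200× downsample factor comes from, how a classifier-free guidance scale is commonly applied, and how a sigmoid threshold turns an EOS logit into a stop decision. All names here are illustrative and not part of the library, and the guidance mix uses the common formulation (unconditional plus scaled difference), which is an assumption about how the diffusion head applies it.

```swift
import Foundation

// Values taken from the component table above.
let audioSampleRate = 24_000.0
let acousticFrameRate = 7.5

/// Audio samples covered by one speech latent (the temporal downsample factor).
let samplesPerLatent = Int(audioSampleRate / acousticFrameRate)

/// Common classifier-free guidance mix (assumed, not confirmed for this model):
/// start from the unconditional prediction and move toward the conditional one.
func guided(conditional: [Float], unconditional: [Float], cfgScale: Float) -> [Float] {
    zip(conditional, unconditional).map { c, u in u + cfgScale * (c - u) }
}

/// Hypothetical stop rule: apply a sigmoid to the EOS logit and compare
/// the resulting probability against the threshold.
func shouldStop(eosLogit: Float, threshold: Float = 0.5) -> Bool {
    let probability = 1 / (1 + exp(-eosLogit))
    return probability > threshold
}
```

With a scale of 1.0 the guided prediction equals the conditional one; larger values such as 1.3 or 1.5 push the sample further toward the conditioning signal.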

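Voice cloning via voice cache

At generation time, speaker identity does not come from a reference waveform. Each voice is supplied as a precomputed .safetensors voice cache that contains the speaker-specific conditioning KV cache and hidden states, obtained offline by running reference audio through the encoder path. Loading a voice cache at runtime is near-instant, so a single model instance can switch voices cheaply between generations.

Example voice caches (MIT license): mzbac/vibevoice.swift/voice_cache, which contains 7 English voices: Carter, Davis, Emma, Frank, Grace, Mike, and the Indian-accented Samuel.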
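
To see what a given voice cache actually contains, one option is to read the safetensors header directly. The sketch below relies only on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header) and makes no assumption about which tensor names a VibeVoice cache uses. The function and error names are hypothetical.

```swift
import Foundation

enum SafetensorsHeaderError: Error {
    case truncated
    case invalidJSON
}

/// Returns tensor name -> shape, read from the JSON header of a safetensors blob.
func tensorShapes(inSafetensors data: Data) throws -> [String: [Int]] {
    guard data.count >= 8 else { throw SafetensorsHeaderError.truncated }

    // The first 8 bytes hold the header length as a little-endian UInt64.
    var headerLength: UInt64 = 0
    for (i, byte) in data.prefix(8).enumerated() {
        headerLength |= UInt64(byte) << UInt64(8 * i)
    }
    guard headerLength <= UInt64(data.count - 8) else {
        throw SafetensorsHeaderError.truncated
    }

    let start = data.startIndex + 8
    let json = data.subdata(in: start ..< start + Int(headerLength))
    guard let header = try JSONSerialization.jsonObject(with: json) as? [String: Any] else {
        throw SafetensorsHeaderError.invalidJSON
    }

    var shapes: [String: [Int]] = [:]
    for (name, value) in header where name != "__metadata__" {
        // Each tensor entry carries "dtype", "shape", and "data_offsets".
        guard let entry = value as? [String: Any],
              let dims = entry["shape"] as? [Any] else { continue }
        shapes[name] = dims.compactMap { ($0 as? NSNumber)?.intValue }
    }
    return shapes
}
```

For a file on disk, the input could come from Data(contentsOf:) with the cache path. This reads the whole file into memory, which is acceptable for a quick inspection but not meant as a loader.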

Models

| Model | Quantization | Size | HuggingFace |
| --- | --- | --- | --- |
| Realtime-0.5B | BF16 (source) | ~1 GB | microsoft/VibeVoice-Realtime-0.5B |
| Realtime-0.5B INT4 | Qwen2 INT4, tokenizer + diffusion FP16 | ~350 MB | aufklarer/VibeVoice-Realtime-0.5B-MLX-INT4 |
| Realtime-0.5B INT8 | Qwen2 INT8 | ~570 MB | aufklarer/VibeVoice-Realtime-0.5B-MLX-INT8 |
| 1.5B long-form | BF16 (source) | ~3 GB | microsoft/VibeVoice-1.5B |
| 1.5B INT4 | Qwen2 INT4 | ~1 GB | aufklarer/VibeVoice-1.5B-MLX-INT4 |

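The quantized variants are produced by models/vibevoice/export/convert.py using MLX group-wise affine quantization (group size 32). Embeddings, normalization layers, the acoustic-tokenizer convolution layers, and the EOS classifier are kept in the source data type.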
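
For readers unfamiliar with group-wise affine quantization, here is a conceptual sketch of the idea: each group of weights gets its own scale and bias, and every weight is stored as a small integer code. This is not the code in convert.py or MLX's kernel; it assumes groups of 32 consecutive elements, and all names are illustrative.

```swift
/// One quantized group: integer codes plus the affine parameters that undo them.
struct QuantizedGroup {
    let codes: [UInt8]
    let scale: Float
    let bias: Float
}

/// Maps a group of weights onto 2^bits evenly spaced levels between its min and max.
func quantizeGroup(_ weights: [Float], bits: Int) -> QuantizedGroup {
    precondition((1...8).contains(bits), "codes are stored in UInt8")
    let levels = Float((1 << bits) - 1)
    let lo = weights.min() ?? 0
    let hi = weights.max() ?? 0
    let scale = hi > lo ? (hi - lo) / levels : 1
    let codes = weights.map { w in
        UInt8(min(max(((w - lo) / scale).rounded(), 0), levels))
    }
    return QuantizedGroup(codes: codes, scale: scale, bias: lo)
}

/// Reconstructs approximate weights: w ≈ code * scale + bias.
func dequantize(_ group: QuantizedGroup) -> [Float] {
    group.codes.map { Float($0) * group.scale + group.bias }
}

/// Splits a flat weight array into consecutive groups and quantizes each one.
func quantize(_ weights: [Float], bits: Int = 4, groupSize: Int = 32) -> [QuantizedGroup] {
    stride(from: 0, to: weights.count, by: groupSize).map { start in
        let end = min(start + groupSize, weights.count)
        return quantizeGroup(Array(weights[start..<end]), bits: bits)
    }
}
```

Smaller groups track local weight ranges more closely at the cost of storing more scale and bias values, which is one reason the quantized checkpoints are larger than a naive bits-per-weight estimate.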

Quick start

import VibeVoiceTTS

let tts = try await VibeVoiceTTSModel.fromPretrained()
try tts.loadVoice(from: "/path/to/voice_cache/en-Mike_man.safetensors")
let pcm = try await tts.generate(text: "Hello world.")
// pcm: [Float] at 24 kHz mono
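
Since generate returns raw samples, a common next step is writing them to a file. The following is a minimal sketch, assuming 16-bit PCM WAV output is acceptable; wavData(fromMonoPCM:) is a hypothetical helper and not part of VibeVoiceTTS.

```swift
import Foundation

/// Encodes mono Float samples in [-1, 1] as a 16-bit PCM WAV file image.
func wavData(fromMonoPCM samples: [Float], sampleRate: UInt32 = 24_000) -> Data {
    var data = Data()

    // Appends an integer in little-endian byte order, as RIFF requires.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }

    let byteCount = UInt32(samples.count * 2)
    data.append(contentsOf: Array("RIFF".utf8))
    append(36 + byteCount)              // size of everything after this field
    data.append(contentsOf: Array("WAVEfmt ".utf8))
    append(UInt32(16))                  // fmt chunk size
    append(UInt16(1))                   // format tag: integer PCM
    append(UInt16(1))                   // channels: mono
    append(sampleRate)
    append(sampleRate * 2)              // byte rate = rate * channels * 2 bytes
    append(UInt16(2))                   // block align
    append(UInt16(16))                  // bits per sample
    data.append(contentsOf: Array("data".utf8))
    append(byteCount)

    for sample in samples {
        // Replace non-finite values with silence and clip to the valid range.
        let clipped = sample.isFinite ? max(-1, min(1, sample)) : 0
        append(Int16((clipped * 32767).rounded()))
    }
    return data
}
```

The result can be saved with write(to:) on the returned Data, for example to a URL ending in hello.wav.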

Long-form 1.5B preset

let config = VibeVoiceTTSModel.Configuration.longForm1_5B
let tts = try await VibeVoiceTTSModel.fromPretrained(configuration: config)
try tts.loadVoice(from: "voices/narrator.safetensors")
let pcm = try await tts.generate(text: longTranscript)  // up to ~90 min

The longForm1_5B preset raises maxSpeechTokens to 4000 and cfgScale to 1.5 for higher-fidelity long-form output.
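
As a rough sanity check on the token budget, the sketch below converts between speech tokens and audio duration. It assumes that maxSpeechTokens counts acoustic latents at the 7.5 Hz rate given in the architecture section; the helper names are illustrative and not part of the library.

```swift
/// Speech latents generated per second of audio (from the architecture section).
let latentsPerSecond = 7.5

/// Audio duration, in seconds, covered by a given number of speech tokens.
func seconds(forSpeechTokens tokens: Int) -> Double {
    Double(tokens) / latentsPerSecond
}

/// Speech tokens needed to cover a given duration, rounded up.
func speechTokens(forMinutes minutes: Double) -> Int {
    Int((minutes * 60 * latentsPerSecond).rounded(.up))
}
```

Under that assumption, 4000 tokens correspond to roughly 533 seconds (a little under 9 minutes), while a full 90 minutes would correspond to about 40,500 latents, so very long transcripts may need a larger limit or several calls.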

Switching voices between generations

try tts.loadVoice(from: "en-Mike_man.safetensors")
let a = try await tts.generate(text: "First speaker line.")
try tts.loadVoice(from: "en-Emma_woman.safetensors")
let b = try await tts.generate(text: "Second speaker line.")
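
For a scripted dialogue, the same pattern can be driven from a transcript. The sketch below only covers the text side: it splits a script into speaker turns, using a hypothetical "Name: line" format that this document does not define. Each resulting turn could then be rendered with loadVoice(from:) and generate(text:) as shown above, with the PCM arrays concatenated in order.

```swift
import Foundation

/// One line of dialogue attributed to a named speaker.
struct Turn: Equatable {
    let speaker: String
    let text: String
}

/// Splits a script into turns. Lines without a "Name:" prefix are skipped.
/// Only the first colon separates speaker from text, so colons may appear in the line.
func parseTurns(_ script: String) -> [Turn] {
    script.split(separator: "\n").compactMap { line -> Turn? in
        guard let colon = line.firstIndex(of: ":") else { return nil }
        let speaker = line[..<colon].trimmingCharacters(in: .whitespaces)
        let text = line[line.index(after: colon)...].trimmingCharacters(in: .whitespaces)
        return speaker.isEmpty || text.isEmpty ? nil : Turn(speaker: speaker, text: text)
    }
}

/// Pairs each turn with a voice-cache path, dropping speakers that have no cache.
func plan(_ turns: [Turn], voices: [String: String]) -> [(voicePath: String, text: String)] {
    turns.compactMap { turn in
        voices[turn.speaker].map { (voicePath: $0, text: turn.text) }
    }
}
```

Grouping consecutive turns by the same speaker before rendering would avoid reloading a cache that is already active.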

CLI

audio vibevoice "Hello world." \
    --voice-cache voice_cache/en-Mike_man.safetensors \
    --output hello.wav

# Long-form 1.5B
audio vibevoice "Long paragraph ..." \
    --voice-cache voices/narrator.safetensors \
    --long-form \
    --max-tokens 4000 \
    --output episode.wav

Optional flags: --steps (number of DPM-Solver steps), --cfg (guidance scale), --model / --tokenizer to override the HuggingFace IDs, --long-form to switch to the 1.5B preset, and --verbose to print timing information.

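Choosing between speech-swift TTS modules

| | Kokoro-82M | Qwen3-TTS | CosyVoice3 | VibeVoice Realtime | VibeVoice 1.5B |
| --- | --- | --- | --- | --- | --- |
| Parameters | 82M | 7B | 7B | 500M | 1.5B |
| Backend | CoreML (ANE) | MLX | MLX | MLX | MLX |
| Languages | 8 | 10+ | 10+ | EN/ZH | EN/ZH |
| Voice cloning | Fixed presets | ICL reference audio | Zero-shot reference | voice cache | voice cache |
| Long-form | Short/medium | Streaming | Streaming | Streaming | Up to 90 min / 4 speakers |

When to choose VibeVoice

Choose VibeVoice when you need long-form, multi-speaker, or podcast / audiobook output in English or Chinese, and the audio must keep a consistent voice identity over many minutes. For short-form multilingual TTS, Qwen3-TTS or CosyVoice3 is recommended. For iOS-native short-utterance synthesis, Kokoro is recommended as the smallest option.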