Voice Cloning

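Clone any voice from a short reference audio clip. Both Qwen3-TTS and CosyVoice3 support voice cloning, but they use different speaker encoders: ECAPA-TDNN (1024-dim) and CAM++ (192-dim), respectively.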

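How It Works

Reference audio: record or provide a short audio sample of the target voice.
Speaker embedding extraction: the speaker encoder converts the reference audio into a fixed-dimension embedding vector.
Embedding injection: the speaker embedding is fed to the TTS model as a conditioning input during synthesis.
Speech synthesis: the TTS model generates speech whose acoustic characteristics match the reference sample.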

Engines

Both TTS engines support voice cloning, each with its own speaker encoder:

Engine	Speaker encoder	Embedding	Backend
Qwen3-TTS	ECAPA-TDNN	1024-dim x-vector	MLX (GPU)
CosyVoice3	CAM++	192-dim	CoreML (Neural Engine)

CosyVoice3 + CAM++

CosyVoice3 uses the CAM++ (Context-Aware Masking++) speaker encoder from Alibaba's 3D-Speaker project. The 192-dim embedding conditions the DiT flow model through an affine projection layer (192 → 80) that is trained jointly with CosyVoice3.
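The affine projection described above is an ordinary y = Wx + b map from 192 to 80 dimensions. The following is a minimal sketch in plain Swift, not the library's implementation: `AffineProjection` is a hypothetical type, and the weights are placeholders, since the real layer's values are trained jointly with CosyVoice3 and ship with the model.

```swift
/// Hypothetical illustration of an affine layer y = W x + b.
/// Not part of the CosyVoiceTTS API.
struct AffineProjection {
    let weight: [[Float]]   // shape [outDim][inDim]
    let bias: [Float]       // shape [outDim]

    func callAsFunction(_ x: [Float]) -> [Float] {
        precondition(weight.count == bias.count, "one bias value per output row")
        var out: [Float] = []
        for (row, b) in zip(weight, bias) {
            precondition(row.count == x.count, "input dimension mismatch")
            var acc = b
            for (w, xi) in zip(row, x) { acc += w * xi }
            out.append(acc)
        }
        return out
    }
}

// Placeholder values only; the trained weights are not reproduced here.
let inDim = 192
let outDim = 80
let projection = AffineProjection(
    weight: Array(repeating: Array(repeating: Float(0.01), count: inDim), count: outDim),
    bias: Array(repeating: 0.5, count: outDim)
)

let speakerEmbedding = [Float](repeating: 1.0, count: inDim)  // stand-in for a CAM++ embedding
let conditioning = projection(speakerEmbedding)
print(conditioning.count)  // 80
```

The point of the sketch is the shape change: a 192-value speaker vector becomes an 80-value conditioning vector before it reaches the flow model.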

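CAM++ Architecture

Stage	Description
FCM	Front-end convolution module (Conv2d + 2 ResBlocks, 32 channels)
TDNN	Time-delay neural network (320 to 128 channels, kernel size 5)
D-TDNN blocks	3 densely connected blocks (12/24/16 layers) with context-aware masking
Stats Pool	Mean + standard deviation pooling (global statistics)
Dense	Linear projection to the 192-dim embedding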
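The Stats Pool stage is what turns a variable-length sequence of frames into a fixed-size vector. The sketch below illustrates mean + standard deviation pooling in plain Swift. `statsPool` is a hypothetical helper, and the use of the population standard deviation is an assumption; the CoreML model may compute it differently.

```swift
/// Mean + standard-deviation pooling over time (illustrative sketch).
/// `frames` has shape [T][C]; the result has length 2 * C regardless of T.
func statsPool(_ frames: [[Float]]) -> [Float] {
    precondition(!frames.isEmpty, "need at least one frame")
    let channels = frames[0].count
    let t = Float(frames.count)

    // Per-channel mean over all frames.
    var mean = [Float](repeating: 0, count: channels)
    for frame in frames {
        precondition(frame.count == channels, "all frames must have the same width")
        for c in 0..<channels { mean[c] += frame[c] / t }
    }

    // Per-channel population variance (assumption: divide by T, not T - 1).
    var variance = [Float](repeating: 0, count: channels)
    for frame in frames {
        for c in 0..<channels {
            let d = frame[c] - mean[c]
            variance[c] += d * d / t
        }
    }

    // Concatenate [mean, std].
    return mean + variance.map { $0.squareRoot() }
}

let short = statsPool([[1, 2], [3, 6]])
let long = statsPool([[1, 2], [3, 6], [1, 2], [3, 6]])
print(short)  // [2.0, 4.0, 1.0, 2.0]
print(long)   // [2.0, 4.0, 1.0, 2.0]
```

Both inputs produce a vector of the same length even though one has twice as many frames, which is why reference clips of different durations all map to a 192-dim embedding after the final Dense layer.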

The CoreML model (~14 MB, FP16) runs on the Neural Engine. It is downloaded automatically from aufklarer/CamPlusPlus-Speaker-CoreML on first use.

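Qwen3-TTS Voice Cloning

Qwen3-TTS supports two voice cloning modes:

ICL Mode (Recommended)

In-Context Learning mode encodes the reference audio into codec tokens with the Mimi speech tokenizer encoder and prepends the reference transcript to them. This gives the model full acoustic context, which yields higher quality and more reliable EOS (fixing issues with short texts and non-English languages).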

let (model, encoder) = try await Qwen3TTSModel.fromPretrainedWithEncoder()
let audio = model.synthesizeWithVoiceCloneICL(
    text: "Target text to synthesize.",
    referenceAudio: refSamples,
    referenceSampleRate: 24000,
    referenceText: "Exact transcript of reference audio.",
    language: "english",
    codecEncoder: encoder
)

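X-Vector Mode

Uses the ECAPA-TDNN encoder to produce a 1024-dim x-vector. No transcript is required, but quality is lower, and the model may fail to emit EOS on short texts or in some languages.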

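ECAPA-TDNN Architecture

Stage	Description
TDNN	Time-delay neural network (128 to 512 channels, kernel size 5)
SE-Res2Net blocks	3 blocks with Squeeze-and-Excitation (512 channels, dilation 2/3/4)
MFA	Multi-layer feature aggregation (1536 channels + ReLU)
ASP	Attentive Statistics Pooling (1536 channels, softmax over time)
FC	Fully connected layer (3072 to 1024 dimensions)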

The weights (76 parameter tensors) are included in the Qwen3-TTS safetensors, so no separate download is needed.
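The ASP row in the table above explains why the FC layer takes 3072 inputs: the pooling emits a weighted mean and a weighted standard deviation for each of the 1536 channels. The sketch below shows the idea in plain Swift. `attentiveStatsPool` is a hypothetical helper, and the attention scores are passed in directly, whereas in the real encoder they come from a small learned network that is not reproduced here.

```swift
import Foundation  // for exp

/// Attentive statistics pooling (illustrative sketch).
/// `frames` and `scores` both have shape [T][C]; the result has length 2 * C.
func attentiveStatsPool(frames: [[Float]], scores: [[Float]]) -> [Float] {
    precondition(!frames.isEmpty && frames.count == scores.count)
    let channels = frames[0].count
    var mean = [Float](repeating: 0, count: channels)
    var std = [Float](repeating: 0, count: channels)

    for c in 0..<channels {
        // Softmax over the time axis for channel c, stabilized by subtracting the max.
        let logits = scores.map { $0[c] }
        let maxLogit = logits.max()!
        let exps = logits.map { exp($0 - maxLogit) }
        let total = exps.reduce(0, +)
        let weights = exps.map { $0 / total }

        // Attention-weighted mean.
        var m: Float = 0
        for (w, frame) in zip(weights, frames) { m += w * frame[c] }

        // Attention-weighted variance around that mean.
        var v: Float = 0
        for (w, frame) in zip(weights, frames) {
            let d = frame[c] - m
            v += w * d * d
        }

        mean[c] = m
        std[c] = max(v, 1e-9).squareRoot()  // clamp to avoid sqrt of a tiny negative
    }
    return mean + std
}

let frames: [[Float]] = [[1, 2], [3, 6]]

// Uniform scores reduce to plain mean + std pooling.
let uniform = attentiveStatsPool(frames: frames, scores: [[0, 0], [0, 0]])

// A strongly peaked score makes the output follow the first frame.
let peaked = attentiveStatsPool(frames: frames, scores: [[50, 50], [0, 0]])
print(uniform, peaked)
```

With 1536 channels in place of the 2 used here, the output has 3072 values, matching the input width of the FC layer.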

CLI Usage

# CosyVoice3 voice cloning (CAM++, CoreML Neural Engine)
.build/release/speech speak "Text in the cloned voice" \
    --engine cosyvoice --voice-sample reference.wav -o output.wav

# Qwen3-TTS voice cloning (ECAPA-TDNN, MLX GPU)
.build/release/speech speak "Text in the cloned voice" \
    --voice-sample reference.wav -o output.wav

Examples

# CosyVoice3: multilingual voice cloning (9 languages)
.build/release/speech speak "Hello, this is my cloned voice." \
    --engine cosyvoice --voice-sample my_voice.wav -o cloned_hello.wav

# CosyVoice3: clone the voice in a different language
.build/release/speech speak "Guten Tag, das ist meine geklonte Stimme." \
    --engine cosyvoice --voice-sample my_voice.wav --language german -o german.wav

# Qwen3-TTS: English voice cloning
.build/release/speech speak "The quick brown fox jumps over the lazy dog." \
    --voice-sample recording_15s.wav -o cloned_fox.wav

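Multi-Speaker Dialogue

CosyVoice3 supports multi-speaker dialogue with independent voice cloning for each speaker. Use the --speakers flag to map speaker labels to reference audio files:

# Two-speaker dialogue with voice cloning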
.build/release/speech speak "[S1] Hello there! [S2] Hey, how are you?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o dialogue.wav

# Dialogue with emotion tags + voice cloning
.build/release/speech speak "[S1] (happy) Great news! [S2] (surprised) Really? Tell me more." \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o emotional_dialogue.wav

# Adjust the silence between turns
.build/release/speech speak "[S1] First line. [S2] Second line." \
    --engine cosyvoice --speakers s1=a.wav,s2=b.wav --turn-gap 0.5 -o gapped.wav

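Each speaker's reference audio is passed through the CAM++ encoder to extract a 192-dim embedding. The model is loaded once and reused for all speakers. See the CosyVoice3 guide for full details on dialogue syntax and emotion tags.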
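To make the tag grammar in the examples above concrete, here is a small illustrative parser for scripts of the form `[S1] (happy) text [S2] text`. It is a sketch under stated assumptions, not the library's DialogueParser: `SketchSegment` and `parseDialogue` are hypothetical names, and the handling of edge cases (text before the first tag, empty turns) is a guess at reasonable behavior.

```swift
import Foundation  // for trimmingCharacters(in:)

/// One dialogue turn (hypothetical type, for illustration only).
struct SketchSegment: Equatable {
    let speaker: String
    let emotion: String?
    let text: String
}

/// Splits a script into turns on "[label]" tags, with an optional "(emotion)" prefix.
func parseDialogue(_ script: String) -> [SketchSegment] {
    var segments: [SketchSegment] = []
    for chunk in script.split(separator: "[") {
        // Each chunk should look like "S1] (happy) Great news! ".
        guard let close = chunk.firstIndex(of: "]") else { continue }  // untagged text is skipped
        let speaker = String(chunk[..<close])
        var body = chunk[chunk.index(after: close)...]
            .trimmingCharacters(in: .whitespacesAndNewlines)

        // Optional emotion tag at the start of the turn.
        var emotion: String? = nil
        if body.hasPrefix("("), let end = body.firstIndex(of: ")") {
            emotion = String(body[body.index(after: body.startIndex)..<end])
            body = body[body.index(after: end)...]
                .trimmingCharacters(in: .whitespacesAndNewlines)
        }

        if !body.isEmpty {
            segments.append(SketchSegment(speaker: speaker, emotion: emotion, text: body))
        }
    }
    return segments
}

let turns = parseDialogue("[S1] (happy) Great news! [S2] Really? Tell me more.")
for turn in turns {
    print(turn.speaker, turn.emotion ?? "neutral", turn.text)
}
```

Each parsed turn carries a speaker label, which is the key used to look up that speaker's embedding before synthesis.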

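Reference Audio Tips

Duration: 5 to 15 seconds of speech works best. Shorter clips may not capture enough acoustic characteristics; longer clips give diminishing returns.
Single speaker: the reference audio should contain only one speaker. Multi-speaker audio produces unpredictable results.
Clean audio: minimize background noise, music, and reverberation. Noisy reference audio can be cleaned with the speech enhancement module before cloning.
Natural speech: use conversational, natural speech rather than whispering, shouting, or singing.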
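Some of these recommendations can be checked mechanically before a clip is handed to the encoder. The sketch below checks duration against the 5 to 15 second range from the list above. The clipping check and its 0.999 threshold are assumptions added for illustration, and `ReferenceIssue` and `checkReference` are hypothetical names, not part of either library.

```swift
/// Problems a simple pre-flight check can detect (hypothetical type).
enum ReferenceIssue: Equatable {
    case tooShort   // under 5 seconds
    case tooLong    // over 15 seconds
    case clipped    // peak at or near full scale (assumed threshold)
}

/// Checks mono float samples in the range -1...1 against the guidelines.
func checkReference(samples: [Float], sampleRate: Int) -> [ReferenceIssue] {
    precondition(sampleRate > 0, "sample rate must be positive")
    var issues: [ReferenceIssue] = []

    let seconds = Double(samples.count) / Double(sampleRate)
    if seconds < 5 { issues.append(.tooShort) }
    if seconds > 15 { issues.append(.tooLong) }

    // Assumption: a peak at or above 0.999 suggests the recording clipped.
    let peak = samples.map { abs($0) }.max() ?? 0
    if peak >= 0.999 { issues.append(.clipped) }

    return issues
}

let tenSeconds = [Float](repeating: 0.1, count: 10 * 16_000)
let threeSeconds = [Float](repeating: 0.1, count: 3 * 16_000)
print(checkReference(samples: tenSeconds, sampleRate: 16_000))    // []
print(checkReference(samples: threeSeconds, sampleRate: 16_000))  // [.tooShort] (printed by case name)
```

Speaker count, background noise, and speaking style cannot be verified this simply and still need a listen.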

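Important

For Qwen3-TTS, voice cloning is available only on the base model; the customVoice model does not support it. For CosyVoice3, voice cloning works with the default model.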

Swift API

import CosyVoiceTTS

// CosyVoice3 voice cloning
let model = try await CosyVoiceTTSModel.fromPretrained()
let speaker = try await CamPlusPlusSpeaker.fromPretrained()

// Extract a 192-dim speaker embedding from the reference audio
let embedding = try speaker.embed(audio: refSamples, sampleRate: 16000)

// Synthesize with the cloned voice
let audio = model.synthesize(
    text: "Hello in a cloned voice!",
    speakerEmbedding: embedding
)

// Custom instruction + speaker embedding
let styledAudio = model.synthesize(
    text: "Hello!",
    instruction: "Speak happily and with excitement.",
    speakerEmbedding: embedding
)

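// Multi-speaker dialogue (aliceEmbedding and bobEmbedding are per-speaker
// embeddings obtained from speaker.embed, as shown above)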
let segments = DialogueParser.parse("[S1] (happy) Hi! [S2] Hey there.")
let embeddings = ["S1": aliceEmbedding, "S2": bobEmbedding]
let dialogueAudio = DialogueSynthesizer.synthesize(
    segments: segments,
    speakerEmbeddings: embeddings,
    model: model,
    language: "english"
)

import Qwen3TTS

// Qwen3-TTS voice cloning
let model = try await Qwen3TTSModel.fromPretrained()
let audio = model.synthesizeWithVoiceClone(
    text: "Hello in a cloned voice!",
    referenceAudio: refSamples,
    referenceSampleRate: 24000
)

Reference Audio Caching

Both synthesizeWithVoiceClone (x-vector) and synthesizeWithVoiceCloneICL (ICL) cache their per-reference preprocessing across calls on the same model instance. The x-vector path caches the ECAPA-TDNN speaker embedding; the ICL path additionally caches the Mimi codec encoder output. The cache is content-addressed (hash of raw samples + sample rate) and bounded to a small LRU (default 4 entries), so repeated generations against the same reference waveform skip the mel + encoder passes without unbounded memory growth.

let tts = try await Qwen3TTSModel.fromPretrained()

// First call: runs ECAPA-TDNN, caches the embedding
_ = tts.synthesizeWithVoiceClone(text: "Hello", referenceAudio: ref, ...)

// Subsequent calls with the same reference: cache hit
_ = tts.synthesizeWithVoiceClone(text: "How are you?", referenceAudio: ref, ...)

// Explicit eviction (rarely needed — LRU handles capacity)
tts.clearReferenceAudioCache()
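The caching behavior described above can be pictured as a small content-addressed LRU. The sketch below is an illustration under assumptions, not the library's implementation: `ReferenceKey` and `ReferenceCache` are hypothetical types, and Swift's `Hasher` stands in for whatever hash the library actually uses.

```swift
/// Cache key derived from the raw samples and sample rate (hypothetical type).
struct ReferenceKey: Hashable {
    let contentHash: Int
    let sampleRate: Int

    init(samples: [Float], sampleRate: Int) {
        // Assumption: any stable in-process hash of the samples will do for a sketch.
        var hasher = Hasher()
        for s in samples { hasher.combine(s) }
        self.contentHash = hasher.finalize()
        self.sampleRate = sampleRate
    }
}

/// Bounded least-recently-used cache (hypothetical type, default 4 entries).
struct ReferenceCache<Value> {
    private var storage: [ReferenceKey: Value] = [:]
    private var order: [ReferenceKey] = []  // least recently used first
    let capacity: Int

    init(capacity: Int = 4) {
        precondition(capacity > 0, "capacity must be positive")
        self.capacity = capacity
    }

    var count: Int { storage.count }

    /// Returns the cached value, or computes, stores, and returns a new one.
    mutating func value(for key: ReferenceKey, compute: () -> Value) -> Value {
        if let hit = storage[key] {
            // Cache hit: mark the key as most recently used.
            if let i = order.firstIndex(of: key) { order.remove(at: i) }
            order.append(key)
            return hit
        }
        let fresh = compute()
        storage[key] = fresh
        order.append(key)
        if order.count > capacity {
            // Evict the least recently used entry to keep memory bounded.
            let evicted = order.removeFirst()
            storage[evicted] = nil
        }
        return fresh
    }

    mutating func clear() {
        storage.removeAll()
        order.removeAll()
    }
}

var cache = ReferenceCache<[Float]>()
var encoderRuns = 0
let refSamplesForCache: [Float] = [0.1, -0.2, 0.3]
let refKey = ReferenceKey(samples: refSamplesForCache, sampleRate: 24_000)

for _ in 0..<3 {
    _ = cache.value(for: refKey) {
        encoderRuns += 1  // stands in for the expensive encoder pass
        return [Float](repeating: 0, count: 1024)
    }
}
print(encoderRuns)  // 1
```

Keying on content rather than on a file path or object identity means two arrays holding the same waveform share one entry, while the fixed capacity keeps long-running sessions from accumulating embeddings for every reference ever seen.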