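Voice Cloning

Clone any voice from a short reference audio sample. Both Qwen3-TTS and CosyVoice3 support voice cloning, each with a different speaker encoder: ECAPA-TDNN (1024-dimensional) and CAM++ (192-dimensional), respectively.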

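How It Works

Reference audio: record or provide a reference audio sample of the target voice
Speaker embedding extraction: the speaker encoder processes the reference audio into a fixed-dimensional embedding vector
Embedding injection: the speaker embedding conditions the TTS model during synthesis
Speech synthesis: the TTS model generates speech that matches the vocal characteristics of the reference sample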
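
Because the encoder produces a vector of fixed size regardless of clip length, embeddings from different recordings can be compared directly. The sketch below uses cosine similarity, a common measure for speaker embeddings, for example to check how closely a synthesized clip matches its reference. `cosineSimilarity` is a hypothetical helper written for illustration; it is not part of either library.

```swift
/// Cosine similarity between two equal-length embeddings (hypothetical helper).
/// Returns nil when the lengths differ, the vectors are empty, or a norm is zero.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float? {
    guard a.count == b.count, !a.isEmpty else { return nil }
    var dot: Float = 0
    var normA: Float = 0
    var normB: Float = 0
    for (x, y) in zip(a, b) {
        dot += x * y
        normA += x * x
        normB += y * y
    }
    guard normA > 0, normB > 0 else { return nil }
    return dot / (normA.squareRoot() * normB.squareRoot())
}
```

Values close to 1 indicate similar voices. Embeddings from different encoders (1024-dimensional vs. 192-dimensional) are not comparable, which is why the helper returns nil on a length mismatch.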

Engines

Voice cloning is available with both TTS engines. Each uses a different speaker encoder:

Engine	Speaker encoder	Embedding	Backend
Qwen3-TTS	ECAPA-TDNN	1024-dim x-vector	MLX (GPU)
CosyVoice3	CAM++	192-dim	CoreML (Neural Engine)

CosyVoice3 + CAM++

CosyVoice3 uses the CAM++ (Context-Aware Masking++) speaker encoder from Alibaba's 3D-Speaker project. The 192-dimensional embedding conditions the DiT flow model through an affine projection layer (192 → 80) that was trained jointly with CosyVoice3.
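
An affine projection is a matrix multiply plus a bias, y = Wx + b. The following is a minimal sketch of that operation with the shapes named above; `project` and its parameters are hypothetical, and the actual trained weights live inside the CosyVoice3 model.

```swift
/// Affine projection y = W x + b (illustrative only).
/// `weight` has one row per output value; each row has one entry per input value.
func project(_ x: [Float], weight: [[Float]], bias: [Float]) -> [Float] {
    precondition(weight.count == bias.count, "one bias per output row")
    var y = bias
    for (i, row) in weight.enumerated() {
        precondition(row.count == x.count, "input dimension mismatch")
        for (w, v) in zip(row, x) {
            y[i] += w * v
        }
    }
    return y
}
```

With an 80 × 192 weight matrix and an 80-element bias, a 192-element speaker embedding maps to 80 conditioning values.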

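CAM++ Architecture

Stage	Description
FCM	Front-end convolution module (Conv2d + 2 ResBlocks, 32 channels)
TDNN	Time Delay Neural Network (320 to 128 channels, kernel size 5)
D-TDNN blocks	Three densely connected blocks with context-aware masking (12/24/16 layers)
Statistics pooling	Mean + standard deviation pooling (global statistics)
Dense	Linear projection to the 192-dimensional embedding

The CoreML model (about 14 MB, FP16) runs on the Neural Engine. It is downloaded automatically from aufklarer/CamPlusPlus-Speaker-CoreML on first use.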
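
The statistics pooling stage is what turns a variable-length sequence of frames into a fixed-size vector. Below is a minimal sketch of mean + standard deviation pooling over the time axis. It assumes the population standard deviation (dividing by the number of frames) and concatenates mean and standard deviation, which is the usual convention; the exact variant used in the CoreML model is not confirmed here, and `statisticsPooling` is a hypothetical name.

```swift
/// Mean + standard deviation pooling over time (illustrative only).
/// `frames` is indexed [time][channel]; the result holds the per-channel means
/// followed by the per-channel standard deviations (2 * channels values).
func statisticsPooling(_ frames: [[Float]]) -> [Float] {
    guard let channels = frames.first?.count, channels > 0 else { return [] }
    let frameCount = Float(frames.count)

    var mean = [Float](repeating: 0, count: channels)
    for frame in frames {
        precondition(frame.count == channels, "all frames need the same width")
        for c in 0..<channels {
            mean[c] += frame[c] / frameCount
        }
    }

    var variance = [Float](repeating: 0, count: channels)
    for frame in frames {
        for c in 0..<channels {
            let d = frame[c] - mean[c]
            variance[c] += d * d / frameCount
        }
    }
    return mean + variance.map { $0.squareRoot() }
}
```

The output length depends only on the channel count, never on how many frames the reference clip produced.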

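Qwen3-TTS Voice Cloning

Qwen3-TTS supports two voice cloning modes:

ICL Mode (Recommended)

In-Context Learning mode encodes the reference audio into codec tokens with the Mimi audio tokenizer encoder and prepends them, together with the reference transcript, to the input. This gives the model the full acoustic context, which yields higher quality and reliable EOS (it fixes the problems seen with short texts and non-English languages).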

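import Qwen3TTS

// Load the TTS model together with the Mimi codec encoder used by ICL mode.
let (model, encoder) = try await Qwen3TTSModel.fromPretrainedWithEncoder()

// refSamples holds the reference audio at the sample rate given below.
let audio = model.synthesizeWithVoiceCloneICL(
    text: "Target text to synthesize.",
    referenceAudio: refSamples,
    referenceSampleRate: 24000,
    referenceText: "Exact transcript of reference audio.",
    language: "english",
    codecEncoder: encoder
)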

X-Vector Mode

Uses the ECAPA-TDNN encoder, which produces a 1024-dimensional x-vector. No transcript is required, but quality is lower. This mode may fail to emit EOS for short texts or in certain languages.

ECAPA-TDNN Architecture

Stage	Description
TDNN	Time Delay Neural Network (128 to 512 channels, kernel size 5)
SE-Res2Net blocks	Three blocks with Squeeze-and-Excitation (512 channels, dilation 2/3/4)
MFA	Multi-layer Feature Aggregation (1536 channels + ReLU)
ASP	Attentive Statistics Pooling (1536 channels, softmax over the time axis)
FC	Fully connected layer (3072 to 1024 dimensions)

The weights (76 parameters) are included in the Qwen3-TTS safetensors; no separate download is needed.
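
Attentive statistics pooling replaces the plain average with a weighted one: per-channel attention scores are normalized with a softmax over time, then used to compute a weighted mean and weighted standard deviation. Concatenating the two doubles the channel count, which is why the FC layer takes 3072 inputs for 1536 channels. The sketch below follows the standard formulation and takes the scores as an input; in the real encoder they come from a learned attention layer. All names are hypothetical.

```swift
import Foundation

/// Softmax over a 1-D array, stabilized by subtracting the maximum.
func softmax(_ x: [Float]) -> [Float] {
    guard let m = x.max() else { return [] }
    let e = x.map { exp($0 - m) }
    let sum = e.reduce(0, +)
    return e.map { $0 / sum }
}

/// Attentive statistics pooling for features indexed [time][channel].
/// `scores` has the same shape as `features` (assumed to come from an attention layer).
/// Returns weighted means followed by weighted standard deviations.
func attentiveStatisticsPooling(features: [[Float]], scores: [[Float]]) -> [Float] {
    guard let channels = features.first?.count, channels > 0 else { return [] }
    precondition(scores.count == features.count, "one score row per frame")

    var mean = [Float](repeating: 0, count: channels)
    var std = [Float](repeating: 0, count: channels)
    for c in 0..<channels {
        // Attention weights for this channel sum to 1 across time.
        let alpha = softmax(scores.map { $0[c] })
        var mu: Float = 0
        var secondMoment: Float = 0
        for (t, frame) in features.enumerated() {
            mu += alpha[t] * frame[c]
            secondMoment += alpha[t] * frame[c] * frame[c]
        }
        mean[c] = mu
        // Clamp at zero to guard against small negative values from rounding.
        std[c] = max(secondMoment - mu * mu, 0).squareRoot()
    }
    return mean + std
}
```

With uniform scores the result reduces to ordinary mean and standard deviation pooling; non-uniform scores let the model emphasize frames that carry more speaker information.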

CLI Usage

# CosyVoice3 voice cloning (CAM++, CoreML Neural Engine)
.build/release/speech speak "Text in the cloned voice" \
    --engine cosyvoice --voice-sample reference.wav -o output.wav

# Qwen3-TTS voice cloning (ECAPA-TDNN, MLX GPU)
.build/release/speech speak "Text in the cloned voice" \
    --voice-sample reference.wav -o output.wav

Examples

# CosyVoice3: multilingual voice cloning (9 languages)
.build/release/speech speak "Hello, this is my cloned voice." \
    --engine cosyvoice --voice-sample my_voice.wav -o cloned_hello.wav

# CosyVoice3: clone a voice in a different language
.build/release/speech speak "Guten Tag, das ist meine geklonte Stimme." \
    --engine cosyvoice --voice-sample my_voice.wav --language german -o german.wav

# Qwen3-TTS: English voice cloning
.build/release/speech speak "The quick brown fox jumps over the lazy dog." \
    --voice-sample recording_15s.wav -o cloned_fox.wav

Multi-Speaker Dialogue

CosyVoice3 supports multi-speaker dialogue with per-speaker voice cloning. Use the --speakers flag to map speaker tags to reference audio files:

# Two-speaker dialogue with voice cloning
.build/release/speech speak "[S1] Hello there! [S2] Hey, how are you?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o dialogue.wav

# Dialogue with emotion tags + voice cloning
.build/release/speech speak "[S1] (happy) Great news! [S2] (surprised) Really? Tell me more." \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o emotional_dialogue.wav

# Adjust the silence between turns
.build/release/speech speak "[S1] First line. [S2] Second line." \
    --engine cosyvoice --speakers s1=a.wav,s2=b.wav --turn-gap 0.5 -o gapped.wav

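Each speaker's reference audio is processed by the CAM++ encoder to extract a 192-dimensional embedding. The model is loaded once and reused for all speakers. See the CosyVoice3 guide for full details on dialogue syntax and emotion tags.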
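
To make the tag-to-file mapping concrete, here is a minimal sketch of how a tagged script and a --speakers value could be split into turns and a lookup table. It is not the library's DialogueParser. `Turn`, `parseTurns`, and `parseSpeakerMap` are hypothetical, and the case-insensitive matching of `s1` to `[S1]` is an assumption drawn from the examples above. Emotion tags such as `(happy)` are left in the turn text.

```swift
import Foundation

/// One dialogue turn: a speaker tag and the text to speak (hypothetical type).
struct Turn: Equatable {
    let speaker: String
    let text: String
}

/// Splits "[S1] Hello [S2] Hi" into turns. Text before the first tag is ignored.
func parseTurns(_ script: String) -> [Turn] {
    var turns: [Turn] = []
    // Every chunk after the first "[" has the form "TAG] text".
    for chunk in script.components(separatedBy: "[").dropFirst() {
        let parts = chunk.components(separatedBy: "]")
        guard parts.count >= 2 else { continue }
        let speaker = parts[0].trimmingCharacters(in: .whitespaces)
        let text = parts.dropFirst()
            .joined(separator: "]")
            .trimmingCharacters(in: .whitespaces)
        if !speaker.isEmpty && !text.isEmpty {
            turns.append(Turn(speaker: speaker, text: text))
        }
    }
    return turns
}

/// Parses "s1=alice.wav,s2=bob.wav" into ["S1": "alice.wav", "S2": "bob.wav"].
/// Keys are uppercased so they match the tags in the script (an assumption).
func parseSpeakerMap(_ spec: String) -> [String: String] {
    var map: [String: String] = [:]
    for pair in spec.split(separator: ",") {
        let keyValue = pair.split(separator: "=", maxSplits: 1)
        guard keyValue.count == 2 else { continue }
        map[keyValue[0].uppercased()] = String(keyValue[1])
    }
    return map
}
```

Each turn's speaker can then be looked up in the map to find the reference file whose embedding should condition that turn.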

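Reference Audio Tips

Duration: 5 to 15 seconds of audio works best. Shorter clips may not capture enough vocal characteristics, and longer clips offer diminishing returns.
Single speaker: the reference should contain only one speaker. Multi-speaker audio produces unpredictable results.
Clean audio: minimize background noise, music, and reverberation. Use the speech enhancement module to clean up a noisy reference before cloning.
Natural speech: use conversational, natural speech rather than whispering, shouting, or singing.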
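
The duration guideline is easy to check before cloning, since a clip's length is its sample count divided by its sample rate. A minimal sketch follows; `ReferenceDurationCheck` and `checkReferenceDuration` are hypothetical names, not part of the library.

```swift
/// Result of checking a clip against the 5 to 15 second guideline (hypothetical type).
enum ReferenceDurationCheck: Equatable {
    case ok
    case tooShort(seconds: Double)
    case longerThanNeeded(seconds: Double)
}

/// Computes the clip length from the sample count and compares it with the guideline.
func checkReferenceDuration(sampleCount: Int, sampleRate: Int) -> ReferenceDurationCheck {
    precondition(sampleRate > 0, "sample rate must be positive")
    let seconds = Double(sampleCount) / Double(sampleRate)
    if seconds < 5 { return .tooShort(seconds: seconds) }
    if seconds > 15 { return .longerThanNeeded(seconds: seconds) }
    return .ok
}
```

A clip flagged as longer than needed still works; the check only signals that trimming it is unlikely to hurt quality.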

Important

With Qwen3-TTS, voice cloning works only with the base model, not with customVoice. CosyVoice3 voice cloning works with the default model.

Swift API

import CosyVoiceTTS

// CosyVoice3 voice cloning
let model = try await CosyVoiceTTSModel.fromPretrained()
let speaker = try await CamPlusPlusSpeaker.fromPretrained()

// Extract a 192-dimensional speaker embedding from the reference audio
let embedding = try speaker.embed(audio: refSamples, sampleRate: 16000)

// Synthesize in the cloned voice
let audio = model.synthesize(
    text: "Hello in a cloned voice!",
    speakerEmbedding: embedding
)

// Custom instruction + speaker embedding
let styledAudio = model.synthesize(
    text: "Hello!",
    instruction: "Speak happily and with excitement.",
    speakerEmbedding: embedding
)

// Multi-speaker dialogue
let segments = DialogueParser.parse("[S1] (happy) Hi! [S2] Hey there.")
let embeddings = ["S1": aliceEmbedding, "S2": bobEmbedding]
let dialogueAudio = DialogueSynthesizer.synthesize(
    segments: segments,
    speakerEmbeddings: embeddings,
    model: model,
    language: "english"
)

import Qwen3TTS

// Qwen3-TTS voice cloning (x-vector mode)
let model = try await Qwen3TTSModel.fromPretrained()
let audio = model.synthesizeWithVoiceClone(
    text: "Hello in a cloned voice!",
    referenceAudio: refSamples,
    referenceSampleRate: 24000
)

Reference Audio Caching

Both synthesizeWithVoiceClone (x-vector) and synthesizeWithVoiceCloneICL (ICL) cache their per-reference preprocessing across calls on the same model instance. The x-vector path caches the ECAPA-TDNN speaker embedding; the ICL path additionally caches the Mimi codec encoder output. The cache is content-addressed (hash of raw samples + sample rate) and bounded to a small LRU (default 4 entries), so repeated generations against the same reference waveform skip the mel + encoder passes without unbounded memory growth.

let tts = try await Qwen3TTSModel.fromPretrained()

// First call: runs ECAPA-TDNN, caches the embedding
_ = tts.synthesizeWithVoiceClone(text: "Hello", referenceAudio: ref, ...)

// Subsequent calls with the same reference: cache hit
_ = tts.synthesizeWithVoiceClone(text: "How are you?", referenceAudio: ref, ...)

// Explicit eviction (rarely needed — LRU handles capacity)
tts.clearReferenceAudioCache()
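
To illustrate the caching scheme described above, here is a minimal sketch of a content-addressed key combined with a small LRU cache. It is not the library's implementation: `referenceKey` and `LRUCache` are hypothetical, and the use of Swift's `Hasher` is an assumption (it is seeded per process, so the keys are only valid in memory).

```swift
/// Hypothetical cache key: a hash of the raw samples plus the sample rate.
func referenceKey(samples: [Float], sampleRate: Int) -> Int {
    var hasher = Hasher()
    hasher.combine(sampleRate)
    for sample in samples {
        hasher.combine(sample.bitPattern)
    }
    return hasher.finalize()
}

/// Small LRU cache: once capacity is exceeded, the least recently used entry is evicted.
struct LRUCache<Value> {
    private var storage: [Int: Value] = [:]
    private var order: [Int] = []   // least recently used key first
    let capacity: Int

    init(capacity: Int = 4) {
        self.capacity = capacity
    }

    var count: Int { storage.count }

    /// Returns the cached value and marks the key as most recently used.
    mutating func value(for key: Int) -> Value? {
        guard let cached = storage[key] else { return nil }
        touch(key)
        return cached
    }

    /// Stores a value, evicting the oldest entry if the cache is over capacity.
    mutating func insert(_ value: Value, for key: Int) {
        storage[key] = value
        touch(key)
        if order.count > capacity {
            storage.removeValue(forKey: order.removeFirst())
        }
    }

    mutating func removeAll() {
        storage.removeAll()
        order.removeAll()
    }

    private mutating func touch(_ key: Int) {
        order.removeAll { $0 == key }
        order.append(key)
    }
}
```

Keying on the sample content rather than on a file path means the same waveform hits the cache however it was loaded, while the fixed capacity keeps memory bounded.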