Voice Cloning

Clone any voice from a short reference audio sample. Both Qwen3-TTS and CosyVoice3 support voice cloning with different speaker encoders — ECAPA-TDNN (1024-dim) and CAM++ (192-dim) respectively.

How It Works

  1. Record or provide a reference audio sample of the target voice
  2. Speaker embedding extraction — a speaker encoder processes the reference audio into a fixed-dimensional embedding vector
  3. Embedding injection — the speaker embedding conditions the TTS model during synthesis
  4. Speech synthesis — the TTS model generates speech that matches the vocal characteristics of the reference sample
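
The four steps above can be sketched as a shape contract. `SpeakerEncoder` and `TTSModel` below are illustrative stand-ins, not the library's actual types; the real encoders (CAM++, ECAPA-TDNN) run neural networks instead of returning placeholders.

```swift
import Foundation

struct SpeakerEncoder {
    let dimension: Int  // 192 for CAM++, 1024 for ECAPA-TDNN

    // Step 2: reduce a variable-length waveform to a fixed-dimensional embedding.
    func embed(_ samples: [Float]) -> [Float] {
        Array(repeating: 0, count: dimension)  // placeholder for the network output
    }
}

struct TTSModel {
    // Steps 3-4: synthesis conditioned on the speaker embedding.
    func synthesize(text: String, speakerEmbedding: [Float]) -> [Float] {
        []  // placeholder for the generated waveform
    }
}

let reference: [Float] = []                   // step 1: reference audio samples
let encoder = SpeakerEncoder(dimension: 192)
let embedding = encoder.embed(reference)      // step 2
let audio = TTSModel().synthesize(            // steps 3-4
    text: "Hello!",
    speakerEmbedding: embedding
)
```

The key invariant is that the embedding has a fixed dimension regardless of reference length, which is what lets it condition the model.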

Engines

Voice cloning is available with both TTS engines. Each uses a different speaker encoder:

| Engine | Speaker Encoder | Embedding | Backend |
|---|---|---|---|
| Qwen3-TTS | ECAPA-TDNN | 1024-dim x-vector | MLX (GPU) |
| CosyVoice3 | CAM++ | 192-dim | CoreML (Neural Engine) |

CosyVoice3 + CAM++

CosyVoice3 uses the CAM++ (Context-Aware Masking++) speaker encoder from Alibaba's 3D-Speaker project. The 192-dim embedding conditions the DiT flow model via an affine projection layer (192 → 80) that was jointly trained with CosyVoice3.
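
That projection is a plain affine map, which can be sketched as follows. The weight and bias here are placeholders; the trained values ship inside the CosyVoice3 checkpoint.

```swift
// Affine projection: maps a 192-dim speaker embedding into the
// 80-dim conditioning space of the DiT flow model (y = Wx + b).
func affineProject(_ x: [Float], weight: [[Float]], bias: [Float]) -> [Float] {
    precondition(weight.count == bias.count)
    return weight.indices.map { row in
        zip(weight[row], x).reduce(bias[row]) { $0 + $1.0 * $1.1 }
    }
}

let embedding = [Float](repeating: 0.1, count: 192)             // CAM++ output
let w = [[Float]](repeating: [Float](repeating: 0, count: 192), count: 80)
let b = [Float](repeating: 0, count: 80)
let conditioning = affineProject(embedding, weight: w, bias: b) // 80 values
```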

CAM++ Architecture

| Stage | Description |
|---|---|
| FCM | Front-end convolutional module (Conv2d + 2 ResBlocks, 32 channels) |
| TDNN | Time Delay Neural Network (320 to 128 channels, kernel size 5) |
| D-TDNN blocks | 3 densely-connected blocks (12/24/16 layers) with context-aware masking |
| Stats Pool | Mean + standard deviation pooling (global statistics) |
| Dense | Linear projection to 192-dim embedding |

The CoreML model (~14 MB, FP16) runs on the Neural Engine. It is downloaded automatically from aufklarer/CamPlusPlus-Speaker-CoreML on first use.

Qwen3-TTS Voice Cloning

Qwen3-TTS supports two voice cloning modes:

ICL Mode (Recommended)

In-Context Learning mode encodes the reference audio into codec tokens via the Mimi speech-tokenizer encoder and prepends them, together with the reference transcript, to the synthesis prompt. This gives the model full acoustic context, which yields higher quality and reliable EOS emission (fixing issues with short texts and non-English languages).

```swift
let (model, encoder) = try await Qwen3TTSModel.fromPretrainedWithEncoder()
let audio = model.synthesizeWithVoiceCloneICL(
    text: "Target text to synthesize.",
    referenceAudio: refSamples,
    referenceSampleRate: 24000,
    referenceText: "Exact transcript of reference audio.",
    language: "english",
    codecEncoder: encoder
)
```

X-Vector Mode

Uses an ECAPA-TDNN encoder that produces a 1024-dim x-vector. No transcript is needed, but quality is lower, and the model may fail to emit EOS on short texts or certain languages.

ECAPA-TDNN Architecture

| Stage | Description |
|---|---|
| TDNN | Time Delay Neural Network (128 to 512 channels, kernel size 5) |
| SE-Res2Net blocks | 3 blocks with Squeeze-and-Excitation (512 channels, dilation 2/3/4) |
| MFA | Multi-layer Feature Aggregation (1536 channels + ReLU) |
| ASP | Attentive Statistics Pooling (1536 channels, softmax over time) |
| FC | Fully connected layer (3072 to 1024 dimensions) |
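
The ASP stage is why 1536 channels become 3072: attention scores are softmaxed over time and used to compute a weighted mean and a weighted standard deviation per channel, and concatenating both doubles the channel count. A sketch, assuming the attention scores are already computed (the real encoder derives them with a small network):

```swift
import Foundation

// Attentive statistics pooling: frames is [T][C], scores is [T].
// Returns [2 * C]: weighted means followed by weighted std-devs.
func attentiveStatsPool(frames: [[Double]], scores: [Double]) -> [Double] {
    let expScores = scores.map { exp($0) }
    let total = expScores.reduce(0, +)
    let weights = expScores.map { $0 / total }  // softmax over time
    let channels = frames[0].count

    var mean = [Double](repeating: 0, count: channels)
    for (t, frame) in frames.enumerated() {
        for c in 0..<channels { mean[c] += weights[t] * frame[c] }
    }
    var variance = [Double](repeating: 0, count: channels)
    for (t, frame) in frames.enumerated() {
        for c in 0..<channels {
            let d = frame[c] - mean[c]
            variance[c] += weights[t] * d * d
        }
    }
    return mean + variance.map { sqrt(max($0, 1e-9)) }
}

let pooled = attentiveStatsPool(
    frames: [[1, 2], [3, 4], [5, 6]],  // 3 frames, 2 channels
    scores: [0, 0, 0]                  // uniform attention
)
```

With uniform scores this degenerates to plain mean + std pooling, which is exactly the Stats Pool stage CAM++ uses.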

The weights (76 parameters) are included in the Qwen3-TTS safetensors — no separate download required.

CLI Usage

```bash
# CosyVoice3 voice cloning (CAM++, CoreML Neural Engine)
.build/release/audio speak "Text in the cloned voice" \
    --engine cosyvoice --voice-sample reference.wav -o output.wav

# Qwen3-TTS voice cloning (ECAPA-TDNN, MLX GPU)
.build/release/audio speak "Text in the cloned voice" \
    --voice-sample reference.wav -o output.wav
```

Examples

```bash
# CosyVoice3: multilingual voice cloning (9 languages)
.build/release/audio speak "Hello, this is my cloned voice." \
    --engine cosyvoice --voice-sample my_voice.wav -o cloned_hello.wav

# CosyVoice3: clone voice in a different language
.build/release/audio speak "Guten Tag, das ist meine geklonte Stimme." \
    --engine cosyvoice --voice-sample my_voice.wav --language german -o german.wav

# Qwen3-TTS: English voice cloning
.build/release/audio speak "The quick brown fox jumps over the lazy dog." \
    --voice-sample recording_15s.wav -o cloned_fox.wav
```

Multi-Speaker Dialogue

CosyVoice3 supports multi-speaker dialogue with per-speaker voice cloning. Use the --speakers flag to map speaker tags to reference audio files:

```bash
# Two-speaker dialogue with voice cloning
.build/release/audio speak "[S1] Hello there! [S2] Hey, how are you?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o dialogue.wav

# Dialogue with emotion tags + voice cloning
.build/release/audio speak "[S1] (happy) Great news! [S2] (surprised) Really? Tell me more." \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o emotional_dialogue.wav

# Adjust silence between turns
.build/release/audio speak "[S1] First line. [S2] Second line." \
    --engine cosyvoice --speakers s1=a.wav,s2=b.wav --turn-gap 0.5 -o gapped.wav
```

Each speaker's reference audio is processed through the CAM++ encoder to extract a 192-dim embedding. The model is loaded once and reused for all speakers. See the CosyVoice3 guide for full details on dialogue syntax and emotion tags.

Reference Audio Tips

> Important: For Qwen3-TTS, voice cloning works with the base model only, not customVoice. CosyVoice3 voice cloning works with the default model.

Swift API

```swift
import CosyVoiceTTS

// CosyVoice3 voice cloning
let model = try await CosyVoiceTTSModel.fromPretrained()
let speaker = try await CamPlusPlusSpeaker.fromPretrained()

// Extract 192-dim speaker embedding from reference audio
let embedding = try speaker.embed(audio: refSamples, sampleRate: 16000)

// Synthesize with cloned voice
let audio = model.synthesize(
    text: "Hello in a cloned voice!",
    speakerEmbedding: embedding
)

// With custom instruction + speaker embedding
let styledAudio = model.synthesize(
    text: "Hello!",
    instruction: "Speak happily and with excitement.",
    speakerEmbedding: embedding
)

// Multi-speaker dialogue
let segments = DialogueParser.parse("[S1] (happy) Hi! [S2] Hey there.")
let embeddings = ["S1": aliceEmbedding, "S2": bobEmbedding]
let dialogueAudio = DialogueSynthesizer.synthesize(
    segments: segments,
    speakerEmbeddings: embeddings,
    model: model,
    language: "english"
)
```

```swift
import Qwen3TTS

// Qwen3-TTS voice cloning
let model = try await Qwen3TTSModel.fromPretrained()
let audio = model.synthesizeWithVoiceClone(
    text: "Hello in a cloned voice!",
    referenceAudio: refSamples,
    referenceSampleRate: 24000
)
```

Reference Audio Caching

Both synthesizeWithVoiceClone (x-vector) and synthesizeWithVoiceCloneICL (ICL) cache their per-reference preprocessing across calls on the same model instance. The x-vector path caches the ECAPA-TDNN speaker embedding; the ICL path additionally caches the Mimi codec encoder output. The cache is content-addressed (hash of raw samples + sample rate) and bounded to a small LRU (default 4 entries), so repeated generations against the same reference waveform skip the mel + encoder passes without unbounded memory growth.

```swift
let tts = try await Qwen3TTSModel.fromPretrained()

// First call: runs ECAPA-TDNN, caches the embedding
_ = tts.synthesizeWithVoiceClone(text: "Hello", referenceAudio: ref, ...)

// Subsequent calls with the same reference: cache hit
_ = tts.synthesizeWithVoiceClone(text: "How are you?", referenceAudio: ref, ...)

// Explicit eviction (rarely needed — LRU handles capacity)
tts.clearReferenceAudioCache()
```
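
The content-addressed LRU described above can be sketched as follows. This is illustrative, not the library's internals: the key is a hash of the raw samples plus the sample rate, and capacity defaults to 4 entries.

```swift
import Foundation
import CryptoKit

// Content-addressed LRU cache for per-reference preprocessing results.
struct ReferenceCache<Value> {
    private var storage: [String: Value] = [:]
    private var order: [String] = []  // least-recently used first
    let capacity: Int

    init(capacity: Int = 4) { self.capacity = capacity }

    // Content-addressed key: SHA-256 over raw samples + sample rate.
    static func key(samples: [Float], sampleRate: Int) -> String {
        var data = samples.withUnsafeBytes { Data($0) }
        withUnsafeBytes(of: sampleRate) { data.append(contentsOf: $0) }
        return SHA256.hash(data: data).map { String(format: "%02x", $0) }.joined()
    }

    mutating func value(forKey key: String, compute: () -> Value) -> Value {
        if let hit = storage[key] {
            order.removeAll { $0 == key }
            order.append(key)  // refresh recency on hit
            return hit
        }
        let value = compute()  // miss: run the expensive mel + encoder pass
        storage[key] = value
        order.append(key)
        if order.count > capacity {  // bounded: evict least-recently used
            storage[order.removeFirst()] = nil
        }
        return value
    }
}

var cache = ReferenceCache<[Float]>()
var computeCount = 0
let key = ReferenceCache<[Float]>.key(samples: [0.1, 0.2], sampleRate: 24000)
_ = cache.value(forKey: key) { computeCount += 1; return [0] }  // miss
_ = cache.value(forKey: key) { computeCount += 1; return [1] }  // hit
```

Because the key is derived from content rather than identity, two distinct arrays holding the same waveform at the same rate hit the same cache entry.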