Voice Cloning

Clone any voice from a short reference audio sample. Both Qwen3-TTS and CosyVoice3 support voice cloning with different speaker encoders — ECAPA-TDNN (1024-dim) and CAM++ (192-dim) respectively.

How It Works

  1. Record or provide a reference audio sample of the target voice
  2. Speaker embedding extraction — a speaker encoder processes the reference audio into a fixed-dimensional embedding vector
  3. Embedding injection — the speaker embedding conditions the TTS model during synthesis
  4. Speech synthesis — the TTS model generates speech that matches the vocal characteristics of the reference sample
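
The four steps above can be sketched as a shape contract. `SpeakerEncoder` and `TTSModel` below are illustrative stand-ins, not the library's actual types; the real encoders (CAM++, ECAPA-TDNN) run neural networks instead of returning placeholders.

```swift
import Foundation

struct SpeakerEncoder {
    let dimension: Int  // 192 for CAM++, 1024 for ECAPA-TDNN

    // Step 2: reduce a variable-length waveform to a fixed-dimensional embedding.
    func embed(_ samples: [Float]) -> [Float] {
        Array(repeating: 0, count: dimension)  // placeholder for the network output
    }
}

struct TTSModel {
    // Steps 3-4: synthesis conditioned on the speaker embedding.
    func synthesize(text: String, speakerEmbedding: [Float]) -> [Float] {
        []  // placeholder for the generated waveform
    }
}

let reference: [Float] = []                   // step 1: reference audio samples
let encoder = SpeakerEncoder(dimension: 192)
let embedding = encoder.embed(reference)      // step 2
let audio = TTSModel().synthesize(            // steps 3-4
    text: "Hello!",
    speakerEmbedding: embedding
)
```

The key invariant is that the embedding has a fixed dimension regardless of reference length, which is what lets it condition the model.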

Engines

Voice cloning is available with both TTS engines. Each uses a different speaker encoder:

| Engine | Speaker Encoder | Embedding | Backend |
|---|---|---|---|
| Qwen3-TTS | ECAPA-TDNN | 1024-dim x-vector | MLX (GPU) |
| CosyVoice3 | CAM++ | 192-dim | CoreML (Neural Engine) |

CosyVoice3 + CAM++

CosyVoice3 uses the CAM++ (Context-Aware Masking++) speaker encoder from Alibaba's 3D-Speaker project. The 192-dim embedding conditions the DiT flow model via an affine projection layer (192 → 80) that was jointly trained with CosyVoice3.
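
That projection is a plain affine map, which can be sketched as follows. The weight and bias here are placeholders; the trained values ship inside the CosyVoice3 checkpoint.

```swift
// Affine projection: maps a 192-dim speaker embedding into the
// 80-dim conditioning space of the DiT flow model (y = Wx + b).
func affineProject(_ x: [Float], weight: [[Float]], bias: [Float]) -> [Float] {
    precondition(weight.count == bias.count)
    return weight.indices.map { row in
        zip(weight[row], x).reduce(bias[row]) { $0 + $1.0 * $1.1 }
    }
}

let embedding = [Float](repeating: 0.1, count: 192)             // CAM++ output
let w = [[Float]](repeating: [Float](repeating: 0, count: 192), count: 80)
let b = [Float](repeating: 0, count: 80)
let conditioning = affineProject(embedding, weight: w, bias: b) // 80 values
```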

CAM++ Architecture

| Stage | Description |
|---|---|
| FCM | Front-end convolutional module (Conv2d + 2 ResBlocks, 32 channels) |
| TDNN | Time Delay Neural Network (320 to 128 channels, kernel size 5) |
| D-TDNN blocks | 3 densely-connected blocks (12/24/16 layers) with context-aware masking |
| Stats Pool | Mean + standard deviation pooling (global statistics) |
| Dense | Linear projection to 192-dim embedding |

The CoreML model (~14 MB, FP16) runs on the Neural Engine. It is downloaded automatically from aufklarer/CamPlusPlus-Speaker-CoreML on first use.

Qwen3-TTS Voice Cloning

Qwen3-TTS supports two voice cloning modes:

ICL Mode (Recommended)

In-Context Learning mode encodes the reference audio into codec tokens via the Mimi speech-tokenizer encoder and prepends them, together with the reference transcript, to the synthesis prompt. This gives the model full acoustic context, which yields higher quality and reliable EOS emission (fixing issues with short texts and non-English languages).

```swift
let (model, encoder) = try await Qwen3TTSModel.fromPretrainedWithEncoder()
let audio = model.synthesizeWithVoiceCloneICL(
    text: "Target text to synthesize.",
    referenceAudio: refSamples,
    referenceSampleRate: 24000,
    referenceText: "Exact transcript of reference audio.",
    language: "english",
    codecEncoder: encoder
)
```

X-Vector Mode

Uses an ECAPA-TDNN encoder that produces a 1024-dim x-vector. No transcript is needed, but quality is lower, and the model may fail to emit EOS on short texts or certain languages.

ECAPA-TDNN Architecture

| Stage | Description |
|---|---|
| TDNN | Time Delay Neural Network (128 to 512 channels, kernel size 5) |
| SE-Res2Net blocks | 3 blocks with Squeeze-and-Excitation (512 channels, dilation 2/3/4) |
| MFA | Multi-layer Feature Aggregation (1536 channels + ReLU) |
| ASP | Attentive Statistics Pooling (1536 channels, softmax over time) |
| FC | Fully connected layer (3072 to 1024 dimensions) |
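
The ASP stage is why 1536 channels become 3072: attention scores are softmaxed over time and used to compute a weighted mean and a weighted standard deviation per channel, and concatenating both doubles the channel count. A sketch, assuming the attention scores are already computed (the real encoder derives them with a small network):

```swift
import Foundation

// Attentive statistics pooling: frames is [T][C], scores is [T].
// Returns [2 * C]: weighted means followed by weighted std-devs.
func attentiveStatsPool(frames: [[Double]], scores: [Double]) -> [Double] {
    let expScores = scores.map { exp($0) }
    let total = expScores.reduce(0, +)
    let weights = expScores.map { $0 / total }  // softmax over time
    let channels = frames[0].count

    var mean = [Double](repeating: 0, count: channels)
    for (t, frame) in frames.enumerated() {
        for c in 0..<channels { mean[c] += weights[t] * frame[c] }
    }
    var variance = [Double](repeating: 0, count: channels)
    for (t, frame) in frames.enumerated() {
        for c in 0..<channels {
            let d = frame[c] - mean[c]
            variance[c] += weights[t] * d * d
        }
    }
    return mean + variance.map { sqrt(max($0, 1e-9)) }
}

let pooled = attentiveStatsPool(
    frames: [[1, 2], [3, 4], [5, 6]],  // 3 frames, 2 channels
    scores: [0, 0, 0]                  // uniform attention
)
```

With uniform scores this degenerates to plain mean + std pooling, which is exactly the Stats Pool stage CAM++ uses.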

The weights (76 parameters) are included in the Qwen3-TTS safetensors — no separate download required.

CLI Usage

```bash
# CosyVoice3 voice cloning (CAM++, CoreML Neural Engine)
.build/release/audio speak "Text in the cloned voice" \
    --engine cosyvoice --voice-sample reference.wav -o output.wav

# Qwen3-TTS voice cloning (ECAPA-TDNN, MLX GPU)
.build/release/audio speak "Text in the cloned voice" \
    --voice-sample reference.wav -o output.wav
```

Examples

```bash
# CosyVoice3: multilingual voice cloning (9 languages)
.build/release/audio speak "Hello, this is my cloned voice." \
    --engine cosyvoice --voice-sample my_voice.wav -o cloned_hello.wav

# CosyVoice3: clone voice in a different language
.build/release/audio speak "Guten Tag, das ist meine geklonte Stimme." \
    --engine cosyvoice --voice-sample my_voice.wav --language german -o german.wav

# Qwen3-TTS: English voice cloning
.build/release/audio speak "The quick brown fox jumps over the lazy dog." \
    --voice-sample recording_15s.wav -o cloned_fox.wav
```

Multi-Speaker Dialogue

CosyVoice3 supports multi-speaker dialogue with per-speaker voice cloning. Use the --speakers flag to map speaker tags to reference audio files:

```bash
# Two-speaker dialogue with voice cloning
.build/release/audio speak "[S1] Hello there! [S2] Hey, how are you?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o dialogue.wav

# Dialogue with emotion tags + voice cloning
.build/release/audio speak "[S1] (happy) Great news! [S2] (surprised) Really? Tell me more." \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o emotional_dialogue.wav

# Adjust silence between turns
.build/release/audio speak "[S1] First line. [S2] Second line." \
    --engine cosyvoice --speakers s1=a.wav,s2=b.wav --turn-gap 0.5 -o gapped.wav
```

Each speaker's reference audio is processed through the CAM++ encoder to extract a 192-dim embedding. The model is loaded once and reused for all speakers. See the CosyVoice3 guide for full details on dialogue syntax and emotion tags.

Reference Audio Tips

> Important: For Qwen3-TTS, voice cloning works with the base model only, not customVoice. CosyVoice3 voice cloning works with the default model.

Swift API

```swift
import CosyVoiceTTS

// CosyVoice3 voice cloning
let model = try await CosyVoiceTTSModel.fromPretrained()
let speaker = try await CamPlusPlusSpeaker.fromPretrained()

// Extract 192-dim speaker embedding from reference audio
let embedding = try speaker.embed(audio: refSamples, sampleRate: 16000)

// Synthesize with cloned voice
let audio = model.synthesize(
    text: "Hello in a cloned voice!",
    speakerEmbedding: embedding
)

// With custom instruction + speaker embedding
let styledAudio = model.synthesize(
    text: "Hello!",
    instruction: "Speak happily and with excitement.",
    speakerEmbedding: embedding
)

// Multi-speaker dialogue
let segments = DialogueParser.parse("[S1] (happy) Hi! [S2] Hey there.")
let embeddings = ["S1": aliceEmbedding, "S2": bobEmbedding]
let dialogueAudio = DialogueSynthesizer.synthesize(
    segments: segments,
    speakerEmbeddings: embeddings,
    model: model,
    language: "english"
)
```

```swift
import Qwen3TTS

// Qwen3-TTS voice cloning
let model = try await Qwen3TTSModel.fromPretrained()
let audio = model.synthesizeWithVoiceClone(
    text: "Hello in a cloned voice!",
    referenceAudio: refSamples,
    referenceSampleRate: 24000
)
```

Reference Audio Caching

Both synthesizeWithVoiceClone (x-vector) and synthesizeWithVoiceCloneICL (ICL) cache their per-reference preprocessing across calls on the same model instance. The x-vector path caches the ECAPA-TDNN speaker embedding; the ICL path additionally caches the Mimi codec encoder output. The cache is content-addressed (hash of raw samples + sample rate) and bounded to a small LRU (default 4 entries), so repeated generations against the same reference waveform skip the mel + encoder passes without unbounded memory growth.

```swift
let tts = try await Qwen3TTSModel.fromPretrained()

// First call: runs ECAPA-TDNN, caches the embedding
_ = tts.synthesizeWithVoiceClone(text: "Hello", referenceAudio: ref, ...)

// Subsequent calls with the same reference: cache hit
_ = tts.synthesizeWithVoiceClone(text: "How are you?", referenceAudio: ref, ...)

// Explicit eviction (rarely needed — LRU handles capacity)
tts.clearReferenceAudioCache()
```
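
The content-addressed LRU described above can be sketched as follows. This is illustrative, not the library's internals: the key is a hash of the raw samples plus the sample rate, and capacity defaults to 4 entries.

```swift
import Foundation
import CryptoKit

// Content-addressed LRU cache for per-reference preprocessing results.
struct ReferenceCache<Value> {
    private var storage: [String: Value] = [:]
    private var order: [String] = []  // least-recently used first
    let capacity: Int

    init(capacity: Int = 4) { self.capacity = capacity }

    // Content-addressed key: SHA-256 over raw samples + sample rate.
    static func key(samples: [Float], sampleRate: Int) -> String {
        var data = samples.withUnsafeBytes { Data($0) }
        withUnsafeBytes(of: sampleRate) { data.append(contentsOf: $0) }
        return SHA256.hash(data: data).map { String(format: "%02x", $0) }.joined()
    }

    mutating func value(forKey key: String, compute: () -> Value) -> Value {
        if let hit = storage[key] {
            order.removeAll { $0 == key }
            order.append(key)  // refresh recency on hit
            return hit
        }
        let value = compute()  // miss: run the expensive mel + encoder pass
        storage[key] = value
        order.append(key)
        if order.count > capacity {  // bounded: evict least-recently used
            storage[order.removeFirst()] = nil
        }
        return value
    }
}

var cache = ReferenceCache<[Float]>()
var computeCount = 0
let key = ReferenceCache<[Float]>.key(samples: [0.1, 0.2], sampleRate: 24000)
_ = cache.value(forKey: key) { computeCount += 1; return [0] }  // miss
_ = cache.value(forKey: key) { computeCount += 1; return [1] }  // hit
```

Because the key is derived from content rather than identity, two distinct arrays holding the same waveform at the same rate hit the same cache entry.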