# Voice Cloning
Clone any voice from a short reference audio sample. Both Qwen3-TTS and CosyVoice3 support voice cloning, each with its own speaker encoder: ECAPA-TDNN (1024-dim) and CAM++ (192-dim), respectively.
## How It Works

1. Reference audio — record or provide a sample of the target voice
2. Speaker embedding extraction — a speaker encoder processes the reference audio into a fixed-dimensional embedding vector
3. Embedding injection — the speaker embedding conditions the TTS model during synthesis
4. Speech synthesis — the TTS model generates speech that matches the vocal characteristics of the reference sample
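The four steps amount to a length-invariant encoder followed by conditioned generation. A toy NumPy sketch of the data flow (the pooling, projection, and synthesis below are stand-ins for illustration, not either engine's actual math):

```python
import numpy as np

def extract_embedding(audio: np.ndarray, n_mels: int = 80, dim: int = 192) -> np.ndarray:
    """Toy speaker encoder: frame the audio, pool over time, project to a fixed dim."""
    frames = audio[: len(audio) // n_mels * n_mels].reshape(-1, n_mels)  # fake "mel" frames
    pooled = np.concatenate([frames.mean(axis=0), frames.std(axis=0)])   # (2 * n_mels,)
    rng = np.random.default_rng(0)                                       # frozen stand-in weights
    proj = rng.standard_normal((dim, 2 * n_mels)) / np.sqrt(2 * n_mels)
    return proj @ pooled                                                 # (dim,)

def synthesize(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    """Toy TTS: the speaker embedding biases every generated frame."""
    rng = np.random.default_rng(1)
    content = rng.standard_normal((10 * len(text), len(speaker_embedding)))
    return content + speaker_embedding   # conditioning on the cloned voice

rng = np.random.default_rng(2)
e_short = extract_embedding(rng.standard_normal(16000 * 5))    # 5 s reference
e_long = extract_embedding(rng.standard_normal(16000 * 15))    # 15 s reference
assert e_short.shape == e_long.shape == (192,)  # clip length never changes the dim
```

The key property the sketch demonstrates: the embedding has the same shape regardless of reference length, so any downstream conditioning is independent of how long the clip was.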
## Engines
Voice cloning is available with both TTS engines. Each uses a different speaker encoder:
| Engine | Speaker Encoder | Embedding | Backend |
|---|---|---|---|
| Qwen3-TTS | ECAPA-TDNN | 1024-dim x-vector | MLX (GPU) |
| CosyVoice3 | CAM++ | 192-dim | CoreML (Neural Engine) |
### CosyVoice3 + CAM++
CosyVoice3 uses the CAM++ (Context-Aware Masking++) speaker encoder from Alibaba's 3D-Speaker project. The 192-dim embedding conditions the DiT flow model via an affine projection layer (192 → 80) that was jointly trained with CosyVoice3.
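That projection is a single learned affine map. A minimal NumPy sketch with placeholder weights (the real 80×192 matrix is trained jointly with the flow model):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((80, 192)) * 0.05   # placeholder for the trained weights
b = np.zeros(80)

def project_speaker(embedding: np.ndarray) -> np.ndarray:
    """Affine 192 -> 80 projection that conditions the DiT flow model."""
    return W @ embedding + b

cond = project_speaker(rng.standard_normal(192))
assert cond.shape == (80,)
```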
#### CAM++ Architecture
| Stage | Description |
|---|---|
| FCM | Front-end convolutional module (Conv2d + 2 ResBlocks, 32 channels) |
| TDNN | Time Delay Neural Network (320 to 128 channels, kernel size 5) |
| D-TDNN blocks | 3 densely-connected blocks (12/24/16 layers) with context-aware masking |
| Stats Pool | Mean + standard deviation pooling (global statistics) |
| Dense | Linear projection to 192-dim embedding |
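The Stats Pool stage is what turns a variable-length frame sequence into a fixed-size vector; a NumPy sketch:

```python
import numpy as np

def stats_pool(features: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """features: (T, C) frame-level features -> (2C,) mean ++ std over time."""
    mean = features.mean(axis=0)
    std = np.sqrt(features.var(axis=0) + eps)   # eps guards zero-variance channels
    return np.concatenate([mean, std])

pooled = stats_pool(np.random.default_rng(0).standard_normal((200, 512)))
assert pooled.shape == (1024,)
# the Dense stage then maps this fixed vector down to the 192-dim embedding
```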
The CoreML model (~14 MB, FP16) runs on the Neural Engine. It is downloaded automatically from aufklarer/CamPlusPlus-Speaker-CoreML on first use.
### Qwen3-TTS + ECAPA-TDNN
Qwen3-TTS uses an ECAPA-TDNN encoder (Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Networks) that produces a 1024-dim x-vector. The embedding is injected between think tokens and pad/bos tokens during generation.
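Conceptually, the injection is a row spliced into the prompt's embedding sequence. A toy NumPy sketch (the surrounding token groups are placeholders, and the exact prompt layout is an assumption based on the description above, not the model's verbatim format):

```python
import numpy as np

d = 1024                                   # x-vector dimension
rng = np.random.default_rng(0)
think = rng.standard_normal((2, d))        # placeholder "think" token embeddings
pad_bos = rng.standard_normal((2, d))      # placeholder pad/bos token embeddings
x_vector = rng.standard_normal(d)          # speaker embedding from ECAPA-TDNN

# splice the speaker vector between the two token groups
prompt = np.vstack([think, x_vector[None, :], pad_bos])
assert prompt.shape == (5, d)
assert np.allclose(prompt[2], x_vector)    # x-vector sits between think and pad/bos
```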
#### ECAPA-TDNN Architecture
| Stage | Description |
|---|---|
| TDNN | Time Delay Neural Network (128 to 512 channels, kernel size 5) |
| SE-Res2Net blocks | 3 blocks with Squeeze-and-Excitation (512 channels, dilation 2/3/4) |
| MFA | Multi-layer Feature Aggregation (1536 channels + ReLU) |
| ASP | Attentive Statistics Pooling (1536 channels, softmax over time) |
| FC | Fully connected layer (3072 to 1024 dimensions) |
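ASP differs from plain stats pooling by weighting frames with softmax attention before computing the statistics. A NumPy sketch (the real attention scores come from a small learned network; here they are simply an input):

```python
import numpy as np

def attentive_stats_pool(h: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """h: (T, C) features, scores: (T,) -> softmax over time -> weighted mean ++ std."""
    a = np.exp(scores - scores.max())
    a /= a.sum()                                    # attention weights over time
    mean = (a[:, None] * h).sum(axis=0)
    var = (a[:, None] * (h - mean) ** 2).sum(axis=0)
    return np.concatenate([mean, np.sqrt(var + 1e-6)])

rng = np.random.default_rng(0)
pooled = attentive_stats_pool(rng.standard_normal((300, 1536)), rng.standard_normal(300))
assert pooled.shape == (3072,)   # matches the 3072 -> 1024 FC input above
```

With uniform scores this degenerates to plain mean/std pooling; learned scores let the model emphasize the most speaker-discriminative frames.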
The weights (76 parameters) are included in the Qwen3-TTS safetensors — no separate download required.
## CLI Usage

```bash
# CosyVoice3 voice cloning (CAM++, CoreML Neural Engine)
.build/release/audio speak "Text in the cloned voice" \
  --engine cosyvoice --voice-sample reference.wav -o output.wav

# Qwen3-TTS voice cloning (ECAPA-TDNN, MLX GPU)
.build/release/audio speak "Text in the cloned voice" \
  --voice-sample reference.wav -o output.wav
```
### Examples

```bash
# CosyVoice3: multilingual voice cloning (9 languages)
.build/release/audio speak "Hello, this is my cloned voice." \
  --engine cosyvoice --voice-sample my_voice.wav -o cloned_hello.wav

# CosyVoice3: clone voice in a different language
.build/release/audio speak "Guten Tag, das ist meine geklonte Stimme." \
  --engine cosyvoice --voice-sample my_voice.wav --language german -o german.wav

# Qwen3-TTS: English voice cloning
.build/release/audio speak "The quick brown fox jumps over the lazy dog." \
  --voice-sample recording_15s.wav -o cloned_fox.wav
```
## Multi-Speaker Dialogue
CosyVoice3 supports multi-speaker dialogue with per-speaker voice cloning. Use the `--speakers` flag to map speaker tags to reference audio files:
```bash
# Two-speaker dialogue with voice cloning
.build/release/audio speak "[S1] Hello there! [S2] Hey, how are you?" \
  --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o dialogue.wav

# Dialogue with emotion tags + voice cloning
.build/release/audio speak "[S1] (happy) Great news! [S2] (surprised) Really? Tell me more." \
  --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o emotional_dialogue.wav

# Adjust silence between turns
.build/release/audio speak "[S1] First line. [S2] Second line." \
  --engine cosyvoice --speakers s1=a.wav,s2=b.wav --turn-gap 0.5 -o gapped.wav
```
Each speaker's reference audio is processed through the CAM++ encoder to extract a 192-dim embedding. The model is loaded once and reused for all speakers. See the CosyVoice3 guide for full details on dialogue syntax and emotion tags.
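Both the `--speakers` mapping and the `[S1]`/`[S2]` tags are simple to parse. An illustrative Python sketch of the syntax (the helper names are hypothetical, not the package's API):

```python
import re

def parse_speakers(arg: str) -> dict[str, str]:
    """'s1=alice.wav,s2=bob.wav' -> {'S1': 'alice.wav', 'S2': 'bob.wav'}"""
    return {k.strip().upper(): v.strip()
            for k, v in (pair.split("=", 1) for pair in arg.split(","))}

def parse_dialogue(text: str) -> list[tuple[str, str]]:
    """'[S1] Hi! [S2] Hey.' -> [('S1', 'Hi!'), ('S2', 'Hey.')]"""
    parts = re.split(r"\[(S\d+)\]", text)
    return [(tag, seg.strip()) for tag, seg in zip(parts[1::2], parts[2::2])]

speakers = parse_speakers("s1=alice.wav,s2=bob.wav")
segments = parse_dialogue("[S1] Hello there! [S2] Hey, how are you?")
assert speakers == {"S1": "alice.wav", "S2": "bob.wav"}
assert segments == [("S1", "Hello there!"), ("S2", "Hey, how are you?")]
```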
## Reference Audio Tips
- Duration: 5 to 15 seconds of speech works best. Shorter clips may not capture enough vocal characteristics; longer clips provide diminishing returns.
- Single speaker: The reference should contain only one speaker. Multi-speaker audio will produce unpredictable results.
- Clean audio: Minimize background noise, music, and reverberation. Use the speech enhancement module to clean noisy references before cloning.
- Natural speech: Use conversational, natural-sounding speech rather than whispers, shouts, or singing.
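These tips can be checked programmatically before cloning. A hedged Python sketch (the thresholds mirror the guidance above; the function name is illustrative):

```python
import numpy as np

def check_reference(samples: np.ndarray, sample_rate: int) -> list[str]:
    """Return a list of warnings for a mono reference clip."""
    warnings = []
    duration = len(samples) / sample_rate
    if not 5.0 <= duration <= 15.0:
        warnings.append(f"duration {duration:.1f}s outside the 5-15s sweet spot")
    if samples.ndim != 1:
        warnings.append("expected mono audio (single channel)")
    peak = np.abs(samples).max()
    if peak >= 1.0:
        warnings.append("clipping detected; re-record at lower gain")
    if peak > 0 and 20 * np.log10(np.sqrt(np.mean(samples ** 2)) / peak) < -30:
        warnings.append("level is very low relative to peaks; clip may be mostly silence")
    return warnings

ok = np.sin(np.linspace(0, 2000, 16000 * 8)) * 0.5       # 8 s, moderate level
assert check_reference(ok, 16000) == []
too_short = np.full(16000 * 2, 0.1)                       # only 2 s long
assert any("duration" in w for w in check_reference(too_short, 16000))
```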
For Qwen3-TTS, voice cloning works only with the base model, not with `customVoice`. CosyVoice3 voice cloning works with the default model.
## Swift API

```swift
import CosyVoiceTTS

// CosyVoice3 voice cloning
let model = try await CosyVoiceTTSModel.fromPretrained()
let speaker = try await CamPlusPlusSpeaker.fromPretrained()

// Extract 192-dim speaker embedding from reference audio
let embedding = try speaker.embed(audio: refSamples, sampleRate: 16000)

// Synthesize with cloned voice
let audio = model.synthesize(
    text: "Hello in a cloned voice!",
    speakerEmbedding: embedding
)

// With custom instruction + speaker embedding
let styledAudio = model.synthesize(
    text: "Hello!",
    instruction: "Speak happily and with excitement.",
    speakerEmbedding: embedding
)

// Multi-speaker dialogue
let segments = DialogueParser.parse("[S1] (happy) Hi! [S2] Hey there.")
let embeddings = ["S1": aliceEmbedding, "S2": bobEmbedding]
let dialogueAudio = DialogueSynthesizer.synthesize(
    segments: segments,
    speakerEmbeddings: embeddings,
    model: model,
    language: "english"
)
```
```swift
import Qwen3TTS

// Qwen3-TTS voice cloning
let model = try await Qwen3TTSModel.fromPretrained()
let audio = model.synthesizeWithVoiceClone(
    text: "Hello in a cloned voice!",
    referenceAudio: refSamples,
    referenceSampleRate: 24000
)
```