Voice cloning benchmarks
July 2, 2026

Voice cloning models, measured across five languages.

We cloned one dataset reference speaker per language, generated the same short benchmark sentence in each language, then scored speaker similarity and ASR recovery. Every row below has the reference clip and the generated output next to the numbers.

Best average match
OmniVoice int8

0.707 mean speaker cosine across all five languages, with 0.0% ASR error.

Cleanest text recovery
OmniVoice

Exact text recovery (0.0% WER/CER) on English, German, Arabic, Spanish, and Chinese.

Fastest all-language run
OmniVoice

Mean RTF 0.45 across all five languages. Chatterbox is next at 0.90 on its four supported rows, roughly twice OmniVoice's figure.

English

FLEURS test/en_us/1042003289011443756.wav

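| Model | Precision | Cosine | ASR | Audio | RTF |
| --- | --- | --- | --- | --- | --- |
| OmniVoice | int8 | 0.701 | 0.0% WER | 3.80 s | 0.48 |
| VoxCPM2 | bf16 | 0.624 | 0.0% WER | 3.68 s | 1.84 |
| Chatterbox Multilingual | fp16 | 0.625 | 0.0% WER | 3.80 s | 0.94 |
| Fish Audio S2 Pro | fp16 | 0.590 | 0.0% WER | 3.44 s | 3.15 |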

German

FLEURS test/de_de/10342213717361642954.wav

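| Model | Precision | Cosine | ASR | Audio | RTF |
| --- | --- | --- | --- | --- | --- |
| OmniVoice | int8 | 0.837 | 0.0% WER | 3.47 s | 0.50 |
| Fish Audio S2 Pro | fp16 | 0.749 | 0.0% WER | 3.76 s | 3.08 |
| VoxCPM2 | bf16 | 0.733 | 9.1% WER | 4.16 s | 1.75 |
| Chatterbox Multilingual | fp16 | 0.727 | 9.1% WER | 3.92 s | 0.87 |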

Modern Standard Arabic

FLEURS test/ar_eg/10863341459609935739.wav

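| Model | Precision | Cosine | ASR | Audio | RTF |
| --- | --- | --- | --- | --- | --- |
| VoxCPM2 | bf16 | 0.757 | 0.0% WER | 3.20 s | 1.82 |
| Fish Audio S2 Pro | fp16 | 0.686 | 14.3% WER | 3.48 s | 3.13 |
| Chatterbox Multilingual | fp16 | 0.674 | 0.0% WER | 4.40 s | 0.88 |
| OmniVoice | int8 | 0.621 | 0.0% WER | 3.97 s | 0.47 |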

Spanish

FLEURS test/es_419/16388069031423373053.wav

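| Model | Precision | Cosine | ASR | Audio | RTF |
| --- | --- | --- | --- | --- | --- |
| OmniVoice | int8 | 0.684 | 0.0% WER | 5.26 s | 0.36 |
| Chatterbox Multilingual | fp16 | 0.670 | 0.0% WER | 4.02 s | 0.92 |
| VoxCPM2 | bf16 | 0.658 | 0.0% WER | 3.36 s | 1.80 |
| Fish Audio S2 Pro | fp16 | 0.584 | 0.0% WER | 3.53 s | 3.15 |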

Chinese

FLEURS test/cmn_hans_cn/5479411876618006152.wav

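| Model | Precision | Cosine | ASR | Audio | RTF |
| --- | --- | --- | --- | --- | --- |
| OmniVoice | int8 | 0.690 | 0.0% CER | 4.00 s | 0.42 |
| VoxCPM2 | bf16 | 0.658 | 0.0% CER | 2.88 s | 1.90 |
| Fish Audio S2 Pro | fp16 | 0.598 | 0.0% CER | 3.11 s | 3.20 |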

Higher speaker cosine means the generated clip is closer to the FLEURS reference speaker embedding. Lower WER/CER means Qwen3-ASR recovered the requested text more cleanly. Lower RTF is faster. These are engineering regression metrics, not a human MOS panel. Precision is listed per row: VoxCPM2’s public full-precision Swift path is bf16, while OmniVoice is shown with the published int8 bundle because the fp16 backbone was not used for this published run.
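As a rough illustration of the two numeric columns, the sketch below computes a cosine similarity between two embedding vectors and a real-time factor. It assumes RTF is wall-clock generation time divided by the duration of the generated audio, which is the usual definition but is not spelled out in this post. The function names and the toy vectors are hypothetical; the real embeddings come from the speaker encoder, not from this code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero-norm embedding")
    return dot / (norm_a * norm_b)


def real_time_factor(generation_seconds, audio_seconds):
    """Assumed definition: generation time divided by generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds


# Toy 3-dimensional vectors, not real speaker embeddings.
reference = [1.0, 2.0, 2.0]
generated = [2.0, 1.0, 2.0]
print(round(cosine_similarity(reference, generated), 3))  # 8/9, prints 0.889

# A clip of 3.8 s generated in 1.9 s would score RTF 0.5 under this definition.
print(real_time_factor(1.9, 3.8))
```

Under that definition an RTF below 1.0 means the engine produces audio faster than it plays back.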

Reference audio and generated clones

English reference

English

FLEURS test/en_us/1042003289011443756.wav

Reference transcript: The Internet combines elements of both mass and interpersonal communication.

Generated text: This is a short voice cloning benchmark for on-device speech.

OmniVoice, int8 clone from the English reference: cosine 0.701, ASR 0.0% WER, RTF 0.48

VoxCPM2, bf16 clone from the English reference: cosine 0.624, ASR 0.0% WER, RTF 1.84

Chatterbox Multilingual, fp16 clone from the English reference: cosine 0.625, ASR 0.0% WER, RTF 0.94

Fish Audio S2 Pro, fp16 clone from the English reference: cosine 0.590, ASR 0.0% WER, RTF 3.15

German reference

German

FLEURS test/de_de/10342213717361642954.wav

Reference transcript: Es ist also möglich, dass der Vermerk einfach als Kennzeichnung hinzugefügt wurde.

Generated text: Dies ist ein kurzer Benchmark für lokale Sprachklonung auf dem Gerät.

OmniVoice, int8 clone from the German reference: cosine 0.837, ASR 0.0% WER, RTF 0.50

Fish Audio S2 Pro, fp16 clone from the German reference: cosine 0.749, ASR 0.0% WER, RTF 3.08

VoxCPM2, bf16 clone from the German reference: cosine 0.733, ASR 9.1% WER, RTF 1.75

Chatterbox Multilingual, fp16 clone from the German reference: cosine 0.727, ASR 9.1% WER, RTF 0.87

Modern Standard Arabic reference

Modern Standard Arabic

FLEURS test/ar_eg/10863341459609935739.wav

Reference transcript: لا تشوه الموقع بوضع علامات أو الكتابات الخادشة على الجدران في المباني.

Generated text: هذا اختبار قصير لاستنساخ الصوت على الجهاز.

VoxCPM2, bf16 clone from the Modern Standard Arabic reference: cosine 0.757, ASR 0.0% WER, RTF 1.82

Fish Audio S2 Pro, fp16 clone from the Modern Standard Arabic reference: cosine 0.686, ASR 14.3% WER, RTF 3.13

Chatterbox Multilingual, fp16 clone from the Modern Standard Arabic reference: cosine 0.674, ASR 0.0% WER, RTF 0.88

OmniVoice, int8 clone from the Modern Standard Arabic reference: cosine 0.621, ASR 0.0% WER, RTF 0.47

Spanish reference

Spanish

FLEURS test/es_419/16388069031423373053.wav

Reference transcript: Internet une y mezcla componentes propios de la comunicación masiva y entre personas.

Generated text: Esta es una breve prueba de clonación de voz en el dispositivo.

OmniVoice, int8 clone from the Spanish reference: cosine 0.684, ASR 0.0% WER, RTF 0.36

Chatterbox Multilingual, fp16 clone from the Spanish reference: cosine 0.670, ASR 0.0% WER, RTF 0.92

VoxCPM2, bf16 clone from the Spanish reference: cosine 0.658, ASR 0.0% WER, RTF 1.80

Fish Audio S2 Pro, fp16 clone from the Spanish reference: cosine 0.584, ASR 0.0% WER, RTF 3.15

Chinese reference

Chinese

FLEURS test/cmn_hans_cn/5479411876618006152.wav

Reference transcript: 互联网结合了大众传播和人际传播的要素。

Generated text: 这是一个简短的本地语音克隆测试。

OmniVoice, int8 clone from the Chinese reference: cosine 0.690, ASR 0.0% CER, RTF 0.42

VoxCPM2, bf16 clone from the Chinese reference: cosine 0.658, ASR 0.0% CER, RTF 1.90

Fish Audio S2 Pro, fp16 clone from the Chinese reference: cosine 0.598, ASR 0.0% CER, RTF 3.20

Method

Dataset references, not hand-picked demos

References are single clips from the Google FLEURS test split: English, German, Arabic, Spanish, and Mandarin Chinese. For engines that accept a reference transcript, the exact FLEURS transcript was passed with the audio prompt.

The scoring mirrors the objective half of VoxCPM-style voice cloning evaluation: intelligibility via WER/CER, and cloning quality via speaker-embedding cosine similarity. The speaker encoder here is Soniqo’s `speech embed-speaker --engine mlx`, so compare rows within these tables rather than against SIM percentages reported in papers.
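For readers who want to see how the intelligibility numbers are typically derived, here is a minimal WER/CER sketch based on Levenshtein edit distance. The normalization step (lowercasing and dropping Unicode punctuation) is an assumption, since the post does not describe the text normalization used, and the misrecognized German transcript is a made-up example, not the actual ASR output from this run.

```python
import unicodedata


def normalize(text):
    """Assumed normalization: lowercase and drop Unicode punctuation."""
    return "".join(
        ch for ch in text.lower()
        if not unicodedata.category(ch).startswith("P")
    )


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(target, transcript):
    """Word error rate: word-level edits divided by target word count."""
    ref = normalize(target).split()
    return edit_distance(ref, normalize(transcript).split()) / len(ref)


def cer(target, transcript):
    """Character error rate, ignoring whitespace; used for Chinese."""
    ref = list(normalize(target).replace(" ", ""))
    hyp = list(normalize(transcript).replace(" ", ""))
    return edit_distance(ref, hyp) / len(ref)


german = "Dies ist ein kurzer Benchmark für lokale Sprachklonung auf dem Gerät."
# Hypothetical transcript with one substituted word ("dem" -> "den").
hypothetical = "Dies ist ein kurzer Benchmark für lokale Sprachklonung auf den Gerät."
print(f"{wer(german, hypothetical):.1%}")  # one error in 11 words, prints 9.1%

chinese = "这是一个简短的本地语音克隆测试。"
print(f"{cer(chinese, chinese):.1%}")  # identical text, prints 0.0%
```

With sentences this short, a single wrong word moves the score a long way: one error in the 11-word German sentence is 9.1%, and one error in the 7-word Arabic sentence is 14.3%, which matches the nonzero values in the tables.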

Chatterbox’s upstream language list includes Chinese, but the current Swift frontend only supports the direct tokenizer path for `en`, `ar`, `hi`, `de`, `es`, `fr`, `it`, and `pt`; the Chinese row is intentionally omitted until that frontend lands.
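The row set that results from this restriction can be sketched as a simple filter. The language codes for the five benchmark languages (in particular `zh` for Chinese) and all names below are illustrative assumptions, not identifiers from the speech-swift code base; only the Chatterbox language list is taken from the paragraph above.

```python
# Benchmark languages as short codes; "zh" for Chinese is an assumed code.
BENCHMARK_LANGUAGES = ["en", "de", "ar", "es", "zh"]

MODELS = ["OmniVoice", "VoxCPM2", "Chatterbox Multilingual", "Fish Audio S2 Pro"]

# Languages the current Swift frontend supports for Chatterbox, per the text above.
# Models missing from this mapping are treated as supporting every benchmark language.
RESTRICTED_LANGUAGES = {
    "Chatterbox Multilingual": {"en", "ar", "hi", "de", "es", "fr", "it", "pt"},
}


def planned_rows():
    """Return (language, model) pairs, skipping unsupported combinations."""
    rows = []
    for language in BENCHMARK_LANGUAGES:
        for model in MODELS:
            allowed = RESTRICTED_LANGUAGES.get(model)
            if allowed is not None and language not in allowed:
                continue
            rows.append((language, model))
    return rows


rows = planned_rows()
print(len(rows))  # 4 models x 5 languages minus the Chatterbox Chinese row: 19
```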

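# Public speech-swift CLI example for one generated row (VoxCPM2 bf16, Arabic).
# TEXT is the Arabic benchmark sentence listed with the Arabic clips above.
TEXT="هذا اختبار قصير لاستنساخ الصوت على الجهاز."

# Generate the clone from the reference clip.
speech speak "$TEXT" \
  --engine voxcpm2 \
  --voxcpm2-variant bf16 \
  --voxcpm2-ref-audio reference.wav \
  --language arabic \
  --output generated.wav

# Embed both clips with the same speaker encoder, then transcribe the clone for WER.
speech embed-speaker reference.wav --engine mlx --json
speech embed-speaker generated.wav --engine mlx --json
speech transcribe generated.wav --engine qwen3 --model 0.6B --language arabic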

Why two scores?

A clone can sound like the speaker but say the wrong text, or say the text clearly while missing the speaker. Speaker cosine and ASR error catch different failure modes, so both need to be visible.
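The Arabic rows from this run make the point concrete. The sketch below, with illustrative variable names, ranks the four rows by speaker cosine and separately lists rows with nonzero ASR error; the numbers are copied from the Arabic table.

```python
# (model, speaker cosine, WER in percent) from the Modern Standard Arabic table.
ARABIC_ROWS = [
    ("VoxCPM2", 0.757, 0.0),
    ("Fish Audio S2 Pro", 0.686, 14.3),
    ("Chatterbox Multilingual", 0.674, 0.0),
    ("OmniVoice", 0.621, 0.0),
]

# Ranking by speaker similarity alone.
by_cosine = [name for name, _, _ in
             sorted(ARABIC_ROWS, key=lambda row: row[1], reverse=True)]

# Rows where the ASR did not recover the requested text exactly.
text_misses = [name for name, _, wer_pct in ARABIC_ROWS if wer_pct > 0.0]

# Ranking by speaker similarity among rows with clean text recovery.
clean_by_cosine = [name for name in by_cosine if name not in text_misses]

print(by_cosine)
print(text_misses)      # ['Fish Audio S2 Pro']
print(clean_by_cosine)
```

Cosine alone places Fish Audio S2 Pro second for Arabic, while the ASR column shows it is the only Arabic row with a text error. OmniVoice shows the opposite pattern: the lowest Arabic cosine with clean text recovery.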

Emotion and style attribution

OmniVoice
Broad style hints

Good when you want a cloned speaker with simple delivery guidance, such as a calmer, younger, lower-pitched, or whispered read. This benchmark used a neutral delivery.

Chatterbox Multilingual
Expressiveness strength

Useful when you want the same speaker to sound more restrained or more animated without writing emotion tags into the text. This benchmark kept expressiveness neutral.

VoxCPM2
Voice direction in plain words

Strong fit when you want to describe the target voice or delivery in natural language while still cloning from a reference clip. This benchmark used the reference clip only.

Fish Audio S2 Pro
Acted delivery cues

Best when the script needs explicit moments like laughing, whispering, excitement, or sadness. This benchmark used plain text with no acting cues.

The benchmark intentionally leaves these controls neutral. That keeps speaker similarity tied to the FLEURS reference instead of rewarding a model for adding extra emotion, whispering, shouting, or laughter.

Reading the result

OmniVoice has the best all-around results: the highest average speaker cosine (0.707), 0.0% ASR error in every tested language, and the fastest average RTF (0.45). VoxCPM2 bf16 is the strongest Arabic speaker match in this run (0.757) and fixes the Arabic ASR miss seen in the earlier int8 pass.

Fish Audio is the slowest engine here, with RTF above 3 on every row, but it posts the second-best speaker similarity in German (0.749) and Arabic (0.686). Chatterbox is competitive on Arabic and Spanish, but the current Swift tokenizer frontend needs more work before Chinese can be benchmarked.
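The per-model averages behind this summary can be reproduced from the per-language tables. The sketch below is a plain unweighted mean over each model's rows; the post does not state the exact aggregation, so treat that as an assumption. Note that Chatterbox is averaged over four rows, not five.

```python
from collections import defaultdict
from statistics import mean

# (language, model, speaker cosine, RTF), copied from the per-language tables.
ROWS = [
    ("English", "OmniVoice", 0.701, 0.48),
    ("English", "VoxCPM2", 0.624, 1.84),
    ("English", "Chatterbox Multilingual", 0.625, 0.94),
    ("English", "Fish Audio S2 Pro", 0.590, 3.15),
    ("German", "OmniVoice", 0.837, 0.50),
    ("German", "Fish Audio S2 Pro", 0.749, 3.08),
    ("German", "VoxCPM2", 0.733, 1.75),
    ("German", "Chatterbox Multilingual", 0.727, 0.87),
    ("Arabic", "VoxCPM2", 0.757, 1.82),
    ("Arabic", "Fish Audio S2 Pro", 0.686, 3.13),
    ("Arabic", "Chatterbox Multilingual", 0.674, 0.88),
    ("Arabic", "OmniVoice", 0.621, 0.47),
    ("Spanish", "OmniVoice", 0.684, 0.36),
    ("Spanish", "Chatterbox Multilingual", 0.670, 0.92),
    ("Spanish", "VoxCPM2", 0.658, 1.80),
    ("Spanish", "Fish Audio S2 Pro", 0.584, 3.15),
    ("Chinese", "OmniVoice", 0.690, 0.42),
    ("Chinese", "VoxCPM2", 0.658, 1.90),
    ("Chinese", "Fish Audio S2 Pro", 0.598, 3.20),
]


def summarize(rows):
    """Group rows by model and take unweighted means of cosine and RTF."""
    grouped = defaultdict(list)
    for _language, model, cosine, rtf in rows:
        grouped[model].append((cosine, rtf))
    return {
        model: {
            "rows": len(values),
            "mean_cosine": mean([c for c, _ in values]),
            "mean_rtf": mean([r for _, r in values]),
        }
        for model, values in grouped.items()
    }


summary = summarize(ROWS)
for model, stats in sorted(summary.items(), key=lambda kv: -kv[1]["mean_cosine"]):
    print(f"{model}: rows={stats['rows']} "
          f"cosine={stats['mean_cosine']:.3f} rtf={stats['mean_rtf']:.2f}")
```

With these inputs OmniVoice comes out at 0.707 mean cosine and 0.45 mean RTF, and Chatterbox at 0.90 mean RTF over its four rows, matching the headline figures.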