Voice cloning benchmarks

July 2, 2026

Voice cloning models, measured across five languages.

For each language we cloned one reference speaker from the dataset, generated a short benchmark sentence in that language, and scored speaker similarity and ASR text recovery. Every row below pairs the numbers with the reference clip and the generated output.

Best average match

OmniVoice int8

0.707 mean speaker cosine across all five languages, with 0.0% ASR error.

Cleanest text recovery

OmniVoice

0.0% WER/CER on English, German, Arabic, Spanish, and Chinese.

Fastest all-language run

OmniVoice

Mean RTF 0.45. Chatterbox is next at 0.90 on its four supported rows.

English

FLEURS test/en_us/1042003289011443756.wav

Model	Precision	Cosine	ASR	Audio	RTF
OmniVoice	int8	0.701	0.0% WER	3.80 s	0.48
VoxCPM2	bf16	0.624	0.0% WER	3.68 s	1.84
Chatterbox Multilingual	fp16	0.625	0.0% WER	3.80 s	0.94
Fish Audio S2 Pro	fp16	0.590	0.0% WER	3.44 s	3.15

German

FLEURS test/de_de/10342213717361642954.wav

Model	Precision	Cosine	ASR	Audio	RTF
OmniVoice	int8	0.837	0.0% WER	3.47 s	0.50
Fish Audio S2 Pro	fp16	0.749	0.0% WER	3.76 s	3.08
VoxCPM2	bf16	0.733	9.1% WER	4.16 s	1.75
Chatterbox Multilingual	fp16	0.727	9.1% WER	3.92 s	0.87

Modern Standard Arabic

FLEURS test/ar_eg/10863341459609935739.wav

Model	Precision	Cosine	ASR	Audio	RTF
VoxCPM2	bf16	0.757	0.0% WER	3.20 s	1.82
Fish Audio S2 Pro	fp16	0.686	14.3% WER	3.48 s	3.13
Chatterbox Multilingual	fp16	0.674	0.0% WER	4.40 s	0.88
OmniVoice	int8	0.621	0.0% WER	3.97 s	0.47

Spanish

FLEURS test/es_419/16388069031423373053.wav

Model	Precision	Cosine	ASR	Audio	RTF
OmniVoice	int8	0.684	0.0% WER	5.26 s	0.36
Chatterbox Multilingual	fp16	0.670	0.0% WER	4.02 s	0.92
VoxCPM2	bf16	0.658	0.0% WER	3.36 s	1.80
Fish Audio S2 Pro	fp16	0.584	0.0% WER	3.53 s	3.15

Chinese

FLEURS test/cmn_hans_cn/5479411876618006152.wav

Model	Precision	Cosine	ASR	Audio	RTF
OmniVoice	int8	0.690	0.0% CER	4.00 s	0.42
VoxCPM2	bf16	0.658	0.0% CER	2.88 s	1.90
Fish Audio S2 Pro	fp16	0.598	0.0% CER	3.11 s	3.20

Higher speaker cosine means the generated clip is closer to the FLEURS reference speaker embedding. Lower WER/CER means Qwen3-ASR recovered the requested text more cleanly. Lower RTF (real-time factor) means faster generation. These are engineering regression metrics, not a human MOS panel. Precision is listed per row: VoxCPM2’s public full-precision Swift path is bf16, while OmniVoice is shown with its published int8 bundle because the fp16 backbone was not used for this run.
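As a rough guide to reading the RTF column, here is a minimal sketch that assumes the usual definition of real-time factor: generation time divided by the duration of the generated audio. The page does not spell out how timing was measured, so the function names and the definition itself are assumptions for illustration.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def implied_synthesis_seconds(rtf: float, audio_seconds: float) -> float:
    """Invert the definition to estimate generation time from a table row."""
    return rtf * audio_seconds


# English rows from the table: (model, audio seconds, RTF)
for model, audio_s, rtf in [("OmniVoice", 3.80, 0.48), ("Fish Audio S2 Pro", 3.44, 3.15)]:
    print(f"{model}: about {implied_synthesis_seconds(rtf, audio_s):.2f} s to generate {audio_s:.2f} s of audio")
```

Under that assumption, an RTF below 1.0 means the clip was generated faster than it takes to play back.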

Reference audio and generated clones

English reference

English

FLEURS test/en_us/1042003289011443756.wav

Reference transcript: The Internet combines elements of both mass and interpersonal communication.

Generated text: This is a short voice cloning benchmark for on-device speech.

Model	Precision	Cosine	ASR	RTF
OmniVoice	int8	0.701	0.0% WER	0.48
VoxCPM2	bf16	0.624	0.0% WER	1.84
Chatterbox Multilingual	fp16	0.625	0.0% WER	0.94
Fish Audio S2 Pro	fp16	0.590	0.0% WER	3.15

German reference

German

FLEURS test/de_de/10342213717361642954.wav

Reference transcript: Es ist also möglich, dass der Vermerk einfach als Kennzeichnung hinzugefügt wurde.

Generated text: Dies ist ein kurzer Benchmark für lokale Sprachklonung auf dem Gerät.

Model	Precision	Cosine	ASR	RTF
OmniVoice	int8	0.837	0.0% WER	0.50
Fish Audio S2 Pro	fp16	0.749	0.0% WER	3.08
VoxCPM2	bf16	0.733	9.1% WER	1.75
Chatterbox Multilingual	fp16	0.727	9.1% WER	0.87

Modern Standard Arabic reference

Modern Standard Arabic

FLEURS test/ar_eg/10863341459609935739.wav

Reference transcript: لا تشوه الموقع بوضع علامات أو الكتابات الخادشة على الجدران في المباني.

Generated text: هذا اختبار قصير لاستنساخ الصوت على الجهاز.

Model	Precision	Cosine	ASR	RTF
VoxCPM2	bf16	0.757	0.0% WER	1.82
Fish Audio S2 Pro	fp16	0.686	14.3% WER	3.13
Chatterbox Multilingual	fp16	0.674	0.0% WER	0.88
OmniVoice	int8	0.621	0.0% WER	0.47

Spanish reference

Spanish

FLEURS test/es_419/16388069031423373053.wav

Reference transcript: Internet une y mezcla componentes propios de la comunicación masiva y entre personas.

Generated text: Esta es una breve prueba de clonación de voz en el dispositivo.

Model	Precision	Cosine	ASR	RTF
OmniVoice	int8	0.684	0.0% WER	0.36
Chatterbox Multilingual	fp16	0.670	0.0% WER	0.92
VoxCPM2	bf16	0.658	0.0% WER	1.80
Fish Audio S2 Pro	fp16	0.584	0.0% WER	3.15

Chinese reference

Chinese

FLEURS test/cmn_hans_cn/5479411876618006152.wav

Reference transcript: 互联网结合了大众传播和人际传播的要素。

Generated text: 这是一个简短的本地语音克隆测试。

Model	Precision	Cosine	ASR	RTF
OmniVoice	int8	0.690	0.0% CER	0.42
VoxCPM2	bf16	0.658	0.0% CER	1.90
Fish Audio S2 Pro	fp16	0.598	0.0% CER	3.20

Method

Dataset references, not hand-picked demos

References are single clips from the Google FLEURS test split: English, German, Arabic, Spanish, and Mandarin Chinese. For engines that accept a reference transcript, the exact FLEURS transcript was passed with the audio prompt.

The scoring mirrors the objective side of VoxCPM-style voice cloning evaluation: intelligibility via WER/CER, and cloning quality via speaker-embedding cosine similarity. The speaker encoder here is Soniqo’s `speech embed-speaker --engine mlx`, so compare rows within these tables rather than directly against SIM percentages reported in papers.

Chatterbox’s upstream language list includes Chinese, but the current Swift frontend only supports the direct tokenizer path for `en`, `ar`, `hi`, `de`, `es`, `fr`, `it`, and `pt`; the Chinese row is intentionally omitted until that frontend lands.

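```shell
# Public speech-swift CLI example for one generated row (Arabic, VoxCPM2 bf16).
# TEXT is the sentence to synthesize; here, the Arabic benchmark sentence.
TEXT="هذا اختبار قصير لاستنساخ الصوت على الجهاز."

# Clone the reference speaker and synthesize the sentence.
speech speak "$TEXT" \
  --engine voxcpm2 \
  --voxcpm2-variant bf16 \
  --voxcpm2-ref-audio reference.wav \
  --language arabic \
  --output generated.wav

# Embed both clips with the same speaker encoder for the cosine score.
speech embed-speaker reference.wav --engine mlx --json
speech embed-speaker generated.wav --engine mlx --json

# Transcribe the generated clip to score WER against the requested text.
speech transcribe generated.wav --engine qwen3 --model 0.6B --language arabic
```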
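Once both embeddings are available as plain lists of floats, the speaker score is a cosine between them. The sketch below assumes exactly that and nothing more: the JSON layout printed by `speech embed-speaker --json` is not described here, so loading the vectors is left out, and the two short vectors are made-up placeholders rather than real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 4-dimensional stand-ins for the reference and generated embeddings.
reference_embedding = [0.2, -0.1, 0.7, 0.4]
generated_embedding = [0.1, 0.0, 0.6, 0.5]
print(f"speaker cosine: {cosine_similarity(reference_embedding, generated_embedding):.3f}")
```

Because the cosine ignores vector length, only the direction of the embedding matters, which is why scores from different speaker encoders are not directly comparable.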

Why two scores?

A clone can sound like the speaker but say the wrong text, or say the text clearly while missing the speaker. Speaker cosine and ASR error catch different failure modes, so both need to be visible.
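The ASR side can be sketched as a plain Levenshtein alignment over whitespace-split words (WER) or individual characters (CER). This is an illustration only: the text normalization applied before scoring is not stated on this page, so none is applied here, and the misheard word in the example is hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance; substitutions, insertions, and deletions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # match or substitution
        prev = cur
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """WER when unit='word', CER when unit='char' (spaces ignored)."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        ref, hyp = list(reference.replace(" ", "")), list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference text is empty")
    return edit_distance(ref, hyp) / len(ref)


target = "Dies ist ein kurzer Benchmark für lokale Sprachklonung auf dem Gerät"
heard = "Dies ist ein kurzer Benchmark für lokale Sprachklonen auf dem Gerät"  # hypothetical ASR output
print(f"{error_rate(target, heard):.1%}")  # 9.1%
```

One word error in the 11-word German sentence gives 1/11, or about 9.1%, which is consistent with the German rows for VoxCPM2 and Chatterbox. Likewise, one error in the 7-word Arabic sentence gives about 14.3%, the size of the Fish Audio miss.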

Emotion and style attribution

OmniVoice

Broad style hints

Good when you want a cloned speaker with simple delivery guidance, such as a calmer, younger, lower-pitched, or whispered read. This benchmark used a neutral delivery.

Chatterbox Multilingual

Expressiveness strength

Useful when you want the same speaker to sound more restrained or more animated without writing emotion tags into the text. This benchmark kept expressiveness neutral.

VoxCPM2

Voice direction in plain words

Strong fit when you want to describe the target voice or delivery in natural language while still cloning from a reference clip. This benchmark used the reference clip only.

Fish Audio S2 Pro

Acted delivery cues

Best when the script needs explicit moments like laughing, whispering, excitement, or sadness. This benchmark used plain text with no acting cues.

The benchmark intentionally leaves these controls neutral. That keeps speaker similarity tied to the FLEURS reference instead of rewarding a model for adding extra emotion, whispering, shouting, or laughter.

Reading the result

OmniVoice has the best all-around results: the highest average speaker cosine, 0.0% ASR error in every requested language, and the fastest average RTF. VoxCPM2 bf16 is the strongest Arabic speaker-match row in this run and fixes the Arabic ASR miss seen in the earlier int8 pass.

Fish Audio is slower here, but it posts strong German and Arabic speaker similarity. Chatterbox is competitive on Arabic and Spanish, but the current Swift tokenizer frontend needs more work before Chinese can be benchmarked.
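The averages behind this reading can be recomputed from the per-language tables. The sketch below copies the cosine and RTF columns in table order (English, German, Arabic, Spanish, Chinese) and takes a simple unweighted mean per model. The `summarize` helper and the dictionary layout are illustrative, not part of the benchmark tooling, and Chatterbox has four rows because its Chinese row is omitted.

```python
from statistics import mean

# Values copied from the per-language tables above.
ROWS = {
    "OmniVoice": {"cosine": [0.701, 0.837, 0.621, 0.684, 0.690],
                  "rtf": [0.48, 0.50, 0.47, 0.36, 0.42]},
    "VoxCPM2": {"cosine": [0.624, 0.733, 0.757, 0.658, 0.658],
                "rtf": [1.84, 1.75, 1.82, 1.80, 1.90]},
    "Chatterbox Multilingual": {"cosine": [0.625, 0.727, 0.674, 0.670],
                                "rtf": [0.94, 0.87, 0.88, 0.92]},
    "Fish Audio S2 Pro": {"cosine": [0.590, 0.749, 0.686, 0.584, 0.598],
                          "rtf": [3.15, 3.08, 3.13, 3.15, 3.20]},
}


def summarize(rows: dict) -> dict:
    """Unweighted per-model means of speaker cosine and RTF."""
    return {
        model: {"cosine": mean(m["cosine"]), "rtf": mean(m["rtf"]), "n": len(m["cosine"])}
        for model, m in rows.items()
    }


for model, s in summarize(ROWS).items():
    print(f"{model}: mean cosine {s['cosine']:.3f}, mean RTF {s['rtf']:.2f} over {s['n']} rows")
```

This reproduces the headline figures of 0.707 mean cosine and 0.45 mean RTF for OmniVoice, and 0.90 mean RTF for Chatterbox on its four supported languages.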

Try the stack

Speech Studio

Local desktop app for cloning voices and rendering multi-speaker scripts on your machine.

Open Speech Studio

Soniqo Cloud

Hosted endpoint for testing the same speech stack before wiring it into a product.

Open cloud.soniqo.audio

Voice cloning docs VoxCPM evaluation paper