May 17, 2026

Cloning a voice at 48 kHz with VoxCPM2.

A new TTS model just landed in Soniqo. It runs on your laptop, outputs studio-quality 48 kHz audio, and clones a voice from a single short clip. This post walks through what you can build with it, the three ways it lets you clone a voice, and a friendly look at how the model works inside.

What you can build

Four things that change when cloning runs locally.

Running cloning on the device unlocks four properties at once — privacy, offline use, no per-call cost, and full voice ownership. Each of these opens a class of product that's awkward to build any other way.

Personal audiobook narrators

Record 30 seconds of a parent reading. The audiobook app then narrates any chapter in their voice — same warmth, same accent, locally generated each session.

Multilingual creator content

YouTubers and podcasters keep one consistent voice across 30 languages. Record once in English, ship the same episode in Japanese, Spanish, and Hindi without a vocal cast.

Accessibility & voice banking

People facing voice loss can bank their voice in a short clip and keep speaking through assistive tech that sounds like them — not like a generic TTS engine.

Product voices on demand

Describe the voice you want — "young woman, gentle and warm" — and the model designs it without a reference recording. Useful for game NPCs, kiosk prompts, or A/B testing brand voices.

On-device vs hosted

How VoxCPM2 compares to ElevenLabs.

ElevenLabs is the obvious cloud-API alternative. The trade-off is what runs where — and who owns the voice afterwards.

For products that need privacy guarantees, offline operation, or zero per-call cost, on-device cloning is the only option — every ElevenLabs call uploads audio to their servers.

| | VoxCPM2 (Soniqo) | ElevenLabs |
|---|---|---|
| Where it runs | On the user’s device | Hosted API |
| Audio leaves the device | No | Yes (uploaded to ElevenLabs) |
| Offline use | Yes | No (requires internet) |
| Per-call cost | None | Per-character billing |
| Model licence | Apache 2.0, open weights | Proprietary, SaaS only |
| Max output sample rate | 48 kHz native | 48 kHz (Pro tier and above) |
| Languages | 30 | 29 (Multilingual v2) · 70+ (Eleven v3) |
| Reference clip required | 5–30 s | 1 min (Instant) · 30 min (Professional) |
| Voice design from text | Yes | Yes |

Both engines reach 48 kHz; both support a similar language spread for everyday cloning; both expose voice design from a text description. The genuine difference is whether the audio ever leaves the device.

Three cloning modes

One model, three ways in.

The model is the same in every call. What changes is which arguments you pass — that decides whether you're designing a voice from a description, copying a recorded one, or preserving an accent.

Voice design
When you don't have a reference recording.

Describe the voice in natural language. The model picks a matching voice and stays consistent across calls.

// Voice design: no reference audio, only a natural-language description.
let audio = try await tts.generateVoxCPM2(
    text: "Welcome to the show.",
    instruct: "A young woman, gentle and warm voice."
)

Reference cloning
When you have a short clip of the target speaker.

Pass any 5–30 s of clean speech. The model copies the timbre and rhythm and synthesises new text in that voice.

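// Load the reference clip, resampled to 16 kHz.
let ref = try AudioFileLoader.load(
    url: URL(fileURLWithPath: "speaker.wav"),
    targetSampleRate: 16000
)
// Reference cloning: synthesise new text in the voice of the clip.
let audio = try await tts.generateVoxCPM2(
    text: "This is a cloned voice.",
    refAudio: ref
)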
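
Since the model expects 5–30 s of speech, it can help to check a clip's length before passing it in. The sketch below is a hypothetical helper, not part of the Soniqo API: `ReferenceClipCheck` and `checkReferenceClip` are names invented for illustration, and it assumes the loader hands back a mono clip whose sample count you can read, which the post does not confirm.

```swift
/// Hypothetical result type; not part of the Soniqo API.
enum ReferenceClipCheck: Equatable {
    case ok(seconds: Double)
    case tooShort(seconds: Double)
    case tooLong(seconds: Double)
}

/// Classifies a mono clip against the 5–30 s window mentioned in the post.
/// The 16 kHz default mirrors the `targetSampleRate` used when loading the clip.
func checkReferenceClip(sampleCount: Int, sampleRate: Double = 16_000) -> ReferenceClipCheck {
    let seconds = Double(sampleCount) / sampleRate
    if seconds < 5 { return .tooShort(seconds: seconds) }
    if seconds > 30 { return .tooLong(seconds: seconds) }
    return .ok(seconds: seconds)
}

// 12 s of audio at 16 kHz is 192,000 samples, which falls inside the window.
let check = checkReferenceClip(sampleCount: 192_000)
print(check == .ok(seconds: 12.0))
```

The check only covers duration. Whether the clip is "clean" (no music, no overlapping speakers) still has to be judged by ear or by a separate tool.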
Ultimate cloning
When the speaker has a distinctive accent and you want it preserved.

Pass the clip AND its transcript. The model can now line up acoustic features with phonemes — accent and vowel choices carry through.

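// Ultimate cloning: the same clip is passed as refAudio and promptAudio,
// and promptText holds the exact transcript of that clip.
let audio = try await tts.generateVoxCPM2(
    text: "Hello from the cloned voice.",
    refAudio: ref,
    promptText: "this is what the reference clip actually said",
    promptAudio: ref
)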

Figure: Three cloning modes, same model. Each mode arranges different pieces in the input sequence before the model. Voice design adds a written description, reference cloning adds an audio prefix, and ultimate cloning adds a paired audio-and-transcript example.
Voice design: (description), text to say
Reference cloning: reference audio, text to say
Ultimate cloning: reference audio, transcript, text to say, prompt audio
Legend: audio frames, text condition, text to synthesise

The same input slot, filled with different pieces. The model never sees a flag — it reads the sequence.
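
To make the "which arguments you pass" idea concrete, here is a minimal sketch of how an app might name the mode a call falls into. `CloningMode` and `inferMode` are hypothetical names, not part of the Soniqo API, and the precedence used when several arguments are supplied together is an assumption for illustration only.

```swift
/// Hypothetical labels for the three modes described in the post.
enum CloningMode: Equatable {
    case voiceDesign
    case referenceCloning
    case ultimateCloning
}

/// Infers the mode from which optional arguments are present.
/// Assumed precedence: clip + transcript, then clip alone, then description.
/// Returns nil when there is neither a description nor a reference clip.
func inferMode(hasInstruct: Bool, hasRefAudio: Bool, hasPromptText: Bool) -> CloningMode? {
    if hasRefAudio && hasPromptText { return .ultimateCloning }
    if hasRefAudio { return .referenceCloning }
    if hasInstruct { return .voiceDesign }
    return nil
}

// A clip plus its transcript maps to ultimate cloning.
print(inferMode(hasInstruct: false, hasRefAudio: true, hasPromptText: true) == .ultimateCloning)
```

A label like this is only useful on the app side, for logging or UI. As the post says, the model itself reads the sequence rather than a flag.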

Under the hood

How VoxCPM2 produces audio.

Four cooperating modules (LocEnc, TSLM, RALM, and LocDiT) feed a final decoder, AudioVAE V2. You don't need to know any of this to use the model, but if you're curious where the 48 kHz comes from, here it is.

Figure: VoxCPM2 architecture. Text and optional voice prompts feed an autoregressive language model and a residual refiner. A local diffusion transformer produces audio latents which the AudioVAE V2 decodes to a 48 kHz waveform.
Inputs: text, voice prompt audio, voice instruction, prompt transcript
LocEnc: audio + text fused into one stream
TSLM (MiniCPM-4 backbone): 28-layer autoregressive LM, decides what audio patch comes next
RALM: refines each patch for prosodic detail
LocDiT (diffusion estimator): paints the audio latent in each slot
AudioVAE V2: decodes to a 48 kHz waveform

The pipeline starts with a local encoder (LocEnc) that fuses text tokens and (optional) reference audio into one stream of vectors. That stream feeds the TSLM — a 28-layer MiniCPM-4 language model that decides what audio "patch" should come next, the same way a text LM picks the next token. A second pass through the RALM refines each patch.

Up to this point everything is a transformer. The interesting twist is the LocDiT: instead of choosing from a fixed vocabulary of discrete audio tokens, it runs a small diffusion process to paint the audio latent in each slot. No discrete codec means no quantisation bottleneck — which is what lets the final stage, AudioVAE V2, decode straight to 48 kHz. Every other on-device engine in this stack tops out at 24 kHz.
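
For a sense of what the jump from 24 kHz to 48 kHz means in numbers, the short sketch below works through the standard sampling arithmetic. The function names are made up for illustration; the only facts used are that a sample rate can represent frequencies up to half its value (the Nyquist limit) and that sample count is duration times rate.

```swift
/// Highest frequency a given sample rate can represent (the Nyquist limit).
func nyquistHz(sampleRate: Double) -> Double {
    sampleRate / 2
}

/// Number of samples in a clip of the given duration.
func sampleCount(seconds: Double, sampleRate: Double) -> Int {
    Int((seconds * sampleRate).rounded())
}

print(nyquistHz(sampleRate: 48_000))                // 24000.0
print(nyquistHz(sampleRate: 24_000))                // 12000.0
print(sampleCount(seconds: 10, sampleRate: 48_000)) // 480000
```

So a 48 kHz output can carry content up to 24 kHz, twice the ceiling of a 24 kHz engine, at the cost of twice as many samples per second of audio.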

The split is worth noting: the autoregressive LM is great at deciding what should come next (content, rhythm, length); the diffusion head is great at painting acoustic detail (phase, spectrum). VoxCPM2 lets each do what it's good at. That's why the model holds its own at only 2B parameters — the architecture earns the perceptual quality, not the size.
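
The order in which data moves through the stages can be sketched with toy stand-ins. Everything below is illustrative: `ToyPipeline` and its closures are invented for this example, the numbers carry no meaning, and two details are assumptions rather than statements from the post (each generated latent is fed back as context, and the number of patches is fixed up front instead of being decided by the model).

```swift
/// Toy stand-ins for the stages named in the post. The real modules are
/// neural networks; these closures only mimic the order of the data flow.
struct ToyPipeline {
    var locEnc: ([String]) -> [Double]    // fuse text and prompt pieces into one stream
    var tslm: ([Double]) -> Double        // propose the next patch from the context
    var ralm: (Double) -> Double          // refine the proposed patch
    var locDiT: (Double) -> Double        // turn the refined patch into an audio latent
    var audioVAE: ([Double]) -> [Double]  // decode all latents into a waveform

    func generate(inputs: [String], patches: Int) -> [Double] {
        var context = locEnc(inputs)
        var latents: [Double] = []
        for _ in 0..<patches {
            let proposed = tslm(context)
            let refined = ralm(proposed)
            let latent = locDiT(refined)
            latents.append(latent)
            context.append(latent)  // assumption: each latent becomes new context
        }
        return audioVAE(latents)
    }
}

let toy = ToyPipeline(
    locEnc: { pieces in pieces.map { Double($0.count) } },
    tslm: { context in context.reduce(0, +) / Double(max(context.count, 1)) },
    ralm: { patch in patch * 0.5 },
    locDiT: { patch in patch + 1 },
    audioVAE: { latents in latents.flatMap { [$0, $0] } }  // two "samples" per latent
)

let waveform = toy.generate(inputs: ["reference audio", "text to say"], patches: 3)
print(waveform.count)  // 6: three latents, two toy samples each
```

The point of the sketch is the split of responsibilities: the loop body decides and refines one patch at a time, while decoding to a waveform happens once at the end.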

The paper

Read the original research.

VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
OpenBMB · arXiv:2509.24650 · Sept 2025

Bundles

Three sizes. Pick by your disk budget.

All three bundles run the same architecture; they differ only in how aggressively the language model is quantised. The int8 bundle is the recommended default — it matches the upstream Python pipeline on the 8-sentence round-trip benchmark while being faster and 40% smaller than bf16.

| Bundle | Size | Best for |
|---|---|---|
| bf16 | ~5.0 GB | Reference / debugging. |
| int8 (default) | ~3.0 GB | Everyday cloning, audiobooks, podcasts. |
| int4 | ~1.9 GB | Disk-constrained deployments. |
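
The size figures in the table can be turned into a quick sanity check and a disk-budget picker. This is a sketch under stated assumptions: `ModelBundle` and `pickBundle` are hypothetical names, the sizes are the approximate values from the table, and the policy (prefer the int8 default, fall back to int4, never auto-select bf16) is one reasonable reading of the recommendations rather than anything Soniqo ships.

```swift
/// Approximate bundle sizes taken from the table, in GB.
enum ModelBundle: String, CaseIterable {
    case bf16, int8, int4

    var sizeGB: Double {
        switch self {
        case .bf16: return 5.0
        case .int8: return 3.0
        case .int4: return 1.9
        }
    }

    /// Size reduction relative to bf16, rounded to a whole percentage.
    var percentSmallerThanBF16: Int {
        let reference = ModelBundle.bf16.sizeGB
        return Int(((reference - sizeGB) / reference * 100).rounded())
    }
}

/// Hypothetical picker: prefer the int8 default, fall back to int4 when
/// disk is tight, and return nil when even int4 does not fit.
func pickBundle(freeDiskGB: Double) -> ModelBundle? {
    if freeDiskGB >= ModelBundle.int8.sizeGB { return .int8 }
    if freeDiskGB >= ModelBundle.int4.sizeGB { return .int4 }
    return nil
}

print(ModelBundle.int8.percentSmallerThanBF16)  // 40
print(ModelBundle.int4.percentSmallerThanBF16)  // 62
```

A real picker would also leave headroom for the generated audio and any other assets, so treat the thresholds as a lower bound.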
Try it

From the CLI.

# Voice design — no reference clip
speech speak "Welcome to the show." \
    --engine voxcpm2 \
    --voxcpm2-instruct "A young woman, gentle and warm voice." \
    --output design.wav

# Reference cloning — 5–30 s clean clip
speech speak "This is a cloned voice." \
    --engine voxcpm2 \
    --voxcpm2-variant int8 \
    --voice-sample speaker.wav \
    --output clone.wav