Cloning a voice at 48 kHz with VoxCPM2.
A new TTS model just landed in Soniqo. It runs on your laptop, outputs studio-quality 48 kHz audio, and clones a voice from a single short clip. This post walks through what you can build with it, the three ways it lets you clone a voice, and a friendly look at how the model works inside.
Four things that change when cloning runs locally.
Running cloning on the device unlocks four properties at once — privacy, offline use, no per-call cost, and full voice ownership. Each of these opens a class of product that's awkward to build any other way.
Record 30 seconds of a parent reading. The audiobook app then narrates any chapter in their voice — same warmth, same accent, locally generated each session.
YouTubers and podcasters keep one consistent voice across 30 languages. Record once in English, ship the same episode in Japanese, Spanish, and Hindi without a vocal cast.
People facing voice loss can bank their voice in a short clip and keep speaking through assistive tech that sounds like them — not like a generic TTS engine.
Describe the voice you want — "young woman, gentle and warm" — and the model designs it without a reference recording. Useful for game NPCs, kiosk prompts, or A/B testing brand voices.
How VoxCPM2 compares to ElevenLabs.
ElevenLabs is the obvious cloud-API alternative. The trade-off is what runs where — and who owns the voice afterwards.
For products that need privacy guarantees, offline operation, or zero per-call cost, on-device cloning is the only option — every ElevenLabs call uploads audio to their servers.
| | VoxCPM2 (Soniqo) | ElevenLabs |
|---|---|---|
| Where it runs | On the user’s device | Hosted API |
| Audio leaves the device | No | Yes (uploaded to ElevenLabs) |
| Offline use | Yes | No (requires internet) |
| Per-call cost | None | Per-character billing |
| Model licence | Apache 2.0, open weights | Proprietary, SaaS only |
| Max output sample rate | 48 kHz native | 48 kHz (Pro tier and above) |
| Languages | 30 | 29 (Multilingual v2) · 70+ (Eleven v3) |
| Reference clip required | 5–30 s | 1 min (Instant) · 30 min (Professional) |
| Voice design from text | Yes | Yes |
Both engines reach 48 kHz; both support a similar language spread for everyday cloning; both expose voice design from a text description. The genuine difference is whether the audio ever leaves the device.
One model, three ways in.
The model is the same in every call. What changes is which arguments you pass — that decides whether you're designing a voice from a description, copying a recorded one, or preserving an accent.
Describe the voice in natural language. The model picks a matching voice and stays consistent across calls.

    let audio = try await tts.generateVoxCPM2(
        text: "Welcome to the show.",
        instruct: "A young woman, gentle and warm voice."
    )

Pass any 5–30 s of clean speech. The model copies the timbre and rhythm and synthesises new text in that voice.

    // Load the reference clip, resampled to 16 kHz.
    let ref = try AudioFileLoader.load(
        url: URL(fileURLWithPath: "speaker.wav"),
        targetSampleRate: 16000
    )
    // Synthesise new text in the reference speaker's voice.
    let audio = try await tts.generateVoxCPM2(
        text: "This is a cloned voice.",
        refAudio: ref
    )

Pass the clip AND its transcript. The model can now line up acoustic features with phonemes, so accent and vowel choices carry through.

    // `ref` is the clip loaded earlier; promptText is the transcript of that clip.
    let audio = try await tts.generateVoxCPM2(
        text: "Hello from the cloned voice.",
        refAudio: ref,
        promptText: "this is what the reference clip actually said",
        promptAudio: ref
    )

The same input slot, filled with different pieces. The model never sees a flag; it reads the sequence.
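
To make the "which arguments you pass" idea concrete, here is a minimal sketch of how an app might classify a request before calling the model. Everything in it is hypothetical and not part of Soniqo: the `VoiceMode` enum, `resolveMode`, and `isUsableReferenceLength` are illustrative names, and the length check simply encodes the 5–30 s guidance from this post at the 16 kHz rate used when loading the clip.

```swift
import Foundation

/// Hypothetical labels for the three entry points described in this post.
enum VoiceMode: Equatable {
    case design      // text description only, no reference clip
    case reference   // reference clip, no transcript
    case transcript  // reference clip plus its transcript
}

/// Illustrative only: maps an argument combination to a mode.
/// Returns nil when neither a description nor a clip is supplied.
func resolveMode(instruct: String?, hasRefAudio: Bool, promptText: String?) -> VoiceMode? {
    if hasRefAudio {
        // A non-empty transcript turns plain reference cloning into the third mode.
        if let text = promptText, !text.isEmpty { return .transcript }
        return .reference
    }
    if let instruct = instruct, !instruct.isEmpty { return .design }
    return nil
}

/// Checks clip length only (5–30 s); it says nothing about how clean the audio is.
func isUsableReferenceLength(sampleCount: Int, sampleRate: Int = 16_000) -> Bool {
    guard sampleRate > 0 else { return false }
    let seconds = Double(sampleCount) / Double(sampleRate)
    return (5.0...30.0).contains(seconds)
}
```

A guard like this is cheap to run before synthesis and lets the UI explain why a one-second clip was rejected instead of returning a poor clone.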
How VoxCPM2 produces audio.
Four cooperating modules. You don't need to know any of this to use the model, but if you're curious where the 48 kHz comes from — here it is.
The pipeline starts with a local encoder (LocEnc) that fuses text tokens and (optional) reference audio into one stream of vectors. That stream feeds the TSLM — a 28-layer MiniCPM-4 language model that decides what audio "patch" should come next, the same way a text LM picks the next token. A second pass through the RALM refines each patch.
Up to this point everything is a transformer. The interesting twist is the LocDiT: instead of choosing from a fixed vocabulary of discrete audio tokens, it runs a small diffusion process to paint the audio latent in each slot. No discrete codec means no quantisation bottleneck — which is what lets the final stage, AudioVAE V2, decode straight to 48 kHz. Every other on-device engine in this stack tops out at 24 kHz.
The split is worth noting: the autoregressive LM is great at deciding what should come next (content, rhythm, length); the diffusion head is great at painting acoustic detail (phase, spectrum). VoxCPM2 lets each do what it's good at. That's why the model holds its own at only 2B parameters — the architecture earns the perceptual quality, not the size.
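
The division of labour described above can be sketched as a toy loop. This is control flow only, under loud assumptions: `planNextPatch` stands in for the language-model stages and `paintPatch` for the diffusion head, both are invented for illustration, the arithmetic is arbitrary, and the refinement starts from a fixed latent rather than real noise. None of it reflects the actual VoxCPM2 implementation.

```swift
import Foundation

/// Toy stand-in for the language-model stages: decides what the next slot should hold.
/// The "plan" here is just a damped mean of the patches generated so far.
func planNextPatch(history: [[Double]], width: Int) -> [Double] {
    guard !history.isEmpty else { return [Double](repeating: 0.5, count: width) }
    var mean = [Double](repeating: 0, count: width)
    for patch in history {
        for i in 0..<width { mean[i] += patch[i] / Double(history.count) }
    }
    return mean.map { $0 * 0.9 }
}

/// Toy stand-in for the diffusion head: refines a continuous latent toward the plan
/// over several steps. No discrete token is ever chosen.
func paintPatch(plan: [Double], steps: Int) -> [Double] {
    var latent = [Double](repeating: 0, count: plan.count) // fixed start, not real noise
    for step in 0..<steps {
        let remaining = Double(steps - step)
        for i in 0..<latent.count {
            // Close an equal share of the remaining gap on every step.
            latent[i] += (plan[i] - latent[i]) / remaining
        }
    }
    return latent
}

/// Outer loop is autoregressive: each new patch is planned from the patches before it.
func generatePatches(count: Int, width: Int, steps: Int) -> [[Double]] {
    var patches: [[Double]] = []
    for _ in 0..<count {
        let plan = planNextPatch(history: patches, width: width)
        patches.append(paintPatch(plan: plan, steps: steps))
    }
    return patches
}
```

The point of the sketch is the shape: one sequential decision per slot, then an inner iterative refinement that stays in continuous values from start to finish.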
Read the original research.
Three sizes. Pick by your disk budget.
All three bundles run the same architecture; they differ only in how aggressively the language model is quantised. The int8 bundle is the recommended default — it matches the upstream Python pipeline on the 8-sentence round-trip benchmark while being faster and 40% smaller than bf16.
| Bundle | Size | Best for |
|---|---|---|
| bf16 | ~5.0 GB | Reference / debugging. |
| int8 (default) | ~3.0 GB | Everyday cloning, audiobooks, podcasts. |
| int4 | ~1.9 GB | Disk-constrained deployments. |
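
As a rough illustration of picking by disk budget, the sketch below encodes the approximate sizes from the table and prefers the recommended int8 bundle when it fits. `ModelBundle`, `recommendedBundle`, and `savingsVersusBF16` are hypothetical helpers, not Soniqo API, and the check ignores any extra headroom a real download or cache might need.

```swift
import Foundation

/// Approximate bundle sizes in GB, copied from the table above.
enum ModelBundle: String, CaseIterable {
    case bf16, int8, int4

    var sizeGB: Double {
        switch self {
        case .bf16: return 5.0
        case .int8: return 3.0
        case .int4: return 1.9
        }
    }
}

/// Prefers the int8 default, falls back to int4, and returns nil if neither fits.
/// bf16 is never picked automatically because the post reserves it for reference and debugging.
func recommendedBundle(freeDiskGB: Double) -> ModelBundle? {
    if freeDiskGB >= ModelBundle.int8.sizeGB { return .int8 }
    if freeDiskGB >= ModelBundle.int4.sizeGB { return .int4 }
    return nil
}

/// Fraction of disk saved relative to the bf16 bundle.
func savingsVersusBF16(_ bundle: ModelBundle) -> Double {
    return 1.0 - bundle.sizeGB / ModelBundle.bf16.sizeGB
}
```

With the table's figures, int8 comes out 40% smaller than bf16 (1 − 3.0/5.0) and int4 about 62% smaller (1 − 1.9/5.0).
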
From the CLI.

    # Voice design: no reference clip
    speech speak "Welcome to the show." \
        --engine voxcpm2 \
        --voxcpm2-instruct "A young woman, gentle and warm voice." \
        --output design.wav

    # Reference cloning: 5–30 s clean clip
    speech speak "This is a cloned voice." \
        --engine voxcpm2 \
        --voxcpm2-variant int8 \
        --voice-sample speaker.wav \
        --output clone.wav

Keep reading.
Every CLI flag, every generation mode, responsible-use notes.
Cross-engine comparison: VoxCPM2, CosyVoice3, Qwen3-TTS ICL.
Install Soniqo on Apple Silicon and run your first synthesis.
More posts on on-device speech as they land.
