# VoxCPM2
VoxCPM2 is a 2B-parameter, tokenizer-free diffusion-autoregressive TTS model from OpenBMB. It synthesises 48 kHz studio-quality audio in 30 languages with three production modes: zero-shot synthesis, single-reference voice cloning, and natural-language voice design ("a young female voice, warm and gentle"). On Apple Silicon it runs natively via MLX in bf16, int8, or int4 — the int8 bundle round-trips through Qwen3-ASR with 0% WER on the 8-sentence test harness at an RTF of ~1.0.
## What it is
- 48 kHz output — the only on-device engine in this stack with studio sample rate. Every other TTS module maxes at 24 kHz.
- Voice design — natural-language style control: `--voxcpm2-instruct "young female voice, warm and gentle"`. None of the other engines expose this.
- Voice cloning — single-reference cloning from a 16 kHz clip; "ultimate cloning" (reference audio + transcript) for prosody preservation.
- 30 languages — English, Chinese, Indonesian, Japanese, Korean, and more. Auto-detected from text.
- Apache 2.0 — model weights inherit upstream openbmb licence; our Swift port is the same.
## Architecture
Five cooperating components produce a 48 kHz waveform:
| Component | Description |
|---|---|
| MiniCPM-4 base LM | 28-layer MiniCPM-4 with LongRoPE, GQA (16 Q / 2 KV heads, 128 head dim), and SwiGLU MLP. Conditions on text tokens + audio latents. |
| Residual LM | 8-layer MiniCPM-4 variant without rotary embeddings. Refines the base LM hidden state per generated audio patch. |
| FSQ + Local DiT estimator | Scalar-quantised hidden states drive a 12-layer Diffusion Transformer (V2) operating on 64-dim audio latents in patches of 4. CFG-zero-star Euler solver, 10 timesteps default. |
| AudioVAE V2 | Causal convolutional decoder. Reads 16 kHz reference audio and emits a 48 kHz waveform (3× upsampling baked in). |
| Stop head | Per-step binary classifier on the LM hidden state. Argmax = 1 ends generation after minTokens patches. |
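The FSQ step above is worth unpacking: each hidden dimension is squashed to a bounded range and snapped to a small fixed grid of scalar levels, so the DiT conditions on a discrete-but-dense code without a learned codebook. A minimal sketch (the level count is illustrative, not VoxCPM2's actual configuration):

```swift
import Foundation

/// Finite Scalar Quantisation (FSQ) sketch: squash each dimension to (-1, 1)
/// and round to one of `levels` evenly spaced grid points.
/// `levels = 9` is an illustrative choice, not VoxCPM2's real setting.
func fsqQuantise(_ x: [Float], levels: Int = 9) -> [Float] {
    let half = Float(levels - 1) / 2  // 9 levels -> grid step 1/4 over [-1, 1]
    return x.map { v in
        let bounded = tanh(v)                     // squash to (-1, 1)
        return (bounded * half).rounded() / half  // snap to nearest grid point
    }
}
```

Because the grid is fixed, quantisation is a pure element-wise op with no codebook lookup, which is what keeps this stage cheap inside the per-patch loop.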
## Bundles
Three quantisation variants, all converted from upstream PyTorch via the openbmb/VoxCPM2 checkpoints. Quantisation applies to the Linear projections inside the LM / residual LM / DiT estimator / projection heads; the AudioVAE vocoder stays at fp16/bf16 because quantising it hurts audio quality.
| Bundle | Quantisation | Size | HuggingFace |
|---|---|---|---|
| bf16 | None (reference) | ~5.0 GB | aufklarer/VoxCPM2-MLX-bf16 |
| int8 | MLX QuantizedLinear, group size 64 | ~3.0 GB | aufklarer/VoxCPM2-MLX-int8 |
| int4 | MLX QuantizedLinear, group size 64 | ~1.9 GB | aufklarer/VoxCPM2-MLX-int4 |
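The group-wise scheme behind the int8 / int4 bundles can be illustrated with a plain affine quantiser over one group of 64 weights — each group carries its own scale and bias, which is what keeps per-group error small. This is a sketch of the idea, not MLX's actual `QuantizedLinear` kernel:

```swift
import Foundation

/// Affine quantisation of one weight group (group size 64, as in the int8 bundle).
/// Illustrative only; MLX's real kernel packs codes and vectorises this.
func quantiseGroup(_ w: [Float], bits: Int = 8) -> (q: [UInt8], scale: Float, bias: Float) {
    let lo = w.min()!, hi = w.max()!
    let levels = Float((1 << bits) - 1)       // 255 codes for 8 bits
    let scale = max(hi - lo, 1e-8) / levels   // step between adjacent codes
    let q = w.map { UInt8((($0 - lo) / scale).rounded()) }
    return (q, scale, lo)
}

/// Reconstruct approximate weights from codes + per-group scale and bias.
func dequantiseGroup(_ q: [UInt8], scale: Float, bias: Float) -> [Float] {
    q.map { Float($0) * scale + bias }
}
```

Round-trip error per weight is bounded by half a quantisation step, and the step shrinks with the group's dynamic range — hence per-group rather than per-tensor scales.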
Round-trip ASR (Qwen3-ASR 0.6B, 8-sentence harness, M-series Apple Silicon):
| Variant | WER | RTF |
|---|---|---|
| bf16 | 2.04 % | 1.38 |
| int8 | 0.00 % | 1.02 |
| int4 | 4.08 % | 0.83 |
int8 is the recommended default — it matches the upstream Python pipeline bit-for-bit on the LM path while being faster than bf16 and 40 % smaller. int4 is the smallest bundle, with WER still acceptable for casual use.
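For reference, the WER figures above are word error rate: the word-level Levenshtein distance between the ASR transcript and the reference text, divided by the reference length. A minimal sketch:

```swift
import Foundation

/// Word error rate: word-level edit distance / reference length,
/// the metric used in the round-trip ASR table.
func wer(reference: String, hypothesis: String) -> Double {
    let ref = reference.lowercased().split(separator: " ").map(String.init)
    let hyp = hypothesis.lowercased().split(separator: " ").map(String.init)
    guard !ref.isEmpty, !hyp.isEmpty else {
        return ref.isEmpty && hyp.isEmpty ? 0 : 1
    }
    var dist = Array(0...hyp.count)  // DP row: edits against empty reference prefix
    for i in 1...ref.count {
        var prev = dist[0]
        dist[0] = i
        for j in 1...hyp.count {
            let sub = prev + (ref[i - 1] == hyp[j - 1] ? 0 : 1)
            prev = dist[j]
            dist[j] = min(sub, dist[j] + 1, dist[j - 1] + 1)  // sub / del / ins
        }
    }
    return Double(dist[hyp.count]) / Double(ref.count)
}
```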
## Quick start
```swift
import VoxCPM2TTS

let tts = try await VoxCPM2TTSModel.fromPretrained() // defaults to bf16
let audio = try await tts.generate(text: "Hello from VoxCPM2.", language: "english")
// audio: [Float] at 48 kHz mono
```
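The returned samples are plain floats, so to audition them you need to write them to disk yourself. A minimal hand-rolled 16-bit PCM WAV writer — a Foundation sketch, not part of the VoxCPM2TTS API:

```swift
import Foundation

/// Write mono Float samples as a 16-bit PCM WAV file.
/// Hand-rolled sketch; in an app you would likely use AVAudioFile instead.
func writeWAV(_ samples: [Float], sampleRate: UInt32 = 48_000, to url: URL) throws {
    var data = Data()
    func append<T: FixedWidthInteger>(_ v: T) {
        withUnsafeBytes(of: v.littleEndian) { data.append(contentsOf: $0) }
    }
    let pcm = samples.map { Int16(max(-1, min(1, $0)) * 32767) }  // clamp, scale
    let byteCount = UInt32(pcm.count * 2)
    data.append("RIFF".data(using: .ascii)!); append(36 + byteCount)
    data.append("WAVE".data(using: .ascii)!)
    data.append("fmt ".data(using: .ascii)!); append(UInt32(16))
    append(UInt16(1)); append(UInt16(1))         // PCM format, mono
    append(sampleRate); append(sampleRate * 2)   // byte rate = rate * block align
    append(UInt16(2)); append(UInt16(16))        // block align, bits per sample
    data.append("data".data(using: .ascii)!); append(byteCount)
    pcm.forEach { append($0) }
    try data.write(to: url)
}
```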
Pass an explicit model ID to pick the int8 / int4 bundle:
```swift
let tts = try await VoxCPM2TTSModel.fromPretrained(
    modelId: "aufklarer/VoxCPM2-MLX-int8"
)
```
## Voice design (instruction-driven)
Pass a natural-language style description; the model conditions synthesis on it without a reference audio sample:
```swift
let audio = try await tts.generateVoxCPM2(
    text: "Welcome to the show.",
    instruct: "A young woman, gentle and warm voice."
)
```
## Voice cloning
Single-reference cloning from a 16 kHz mono clip:
```swift
let ref = try AudioFileLoader.load(url: URL(fileURLWithPath: "speaker.wav"),
                                   targetSampleRate: 16000)
let audio = try await tts.generateVoxCPM2(
    text: "This is a cloned voice.",
    refAudio: ref
)
```
Ultimate cloning — pass both reference audio and the matching transcript so the LM can also condition on the lexical context, preserving prosody and accent more faithfully:
```swift
let audio = try await tts.generateVoxCPM2(
    text: "Hello from the cloned voice.",
    refAudio: ref,
    promptText: "this is what the reference clip actually said",
    promptAudio: ref
)
```
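Reference clips must be 16 kHz mono; AudioFileLoader resamples for you. If you are feeding raw Float buffers from elsewhere, a naive linear-interpolation resampler is enough for a quick test (no anti-aliasing filter, so prefer a proper resampler for production):

```swift
import Foundation

/// Naive linear-interpolation resampler for raw Float buffers,
/// e.g. 44.1 kHz capture -> the 16 kHz the reference path expects.
/// No low-pass filtering; a sketch for quick experiments only.
func resample(_ x: [Float], from srcRate: Double, to dstRate: Double) -> [Float] {
    guard x.count > 1 else { return x }
    let n = Int(Double(x.count) * dstRate / srcRate)
    return (0..<n).map { i in
        let pos = Double(i) * srcRate / dstRate   // fractional source index
        let j = min(Int(pos), x.count - 2)        // left neighbour (clamped)
        let t = Float(pos - Double(j))            // interpolation weight
        return x[j] * (1 - t) + x[j + 1] * t
    }
}
```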
## CLI
```bash
speech speak "Hello there." \
  --engine voxcpm2 \
  --voxcpm2-variant int8 \
  --output hello.wav

# Voice design
speech speak "Welcome to the show." \
  --engine voxcpm2 \
  --voxcpm2-instruct "A young woman, gentle and warm voice." \
  --output design.wav

# Voice cloning
speech speak "This is a cloned voice." \
  --engine voxcpm2 \
  --voice-sample speaker.wav \
  --output clone.wav
```
Flags: `--voxcpm2-variant {bf16,int8,int4}`, `--voxcpm2-instruct`, `--voxcpm2-ref-audio`, `--voxcpm2-prompt-audio` + `--voxcpm2-prompt-text`, `--voxcpm2-cfg-value`, `--voxcpm2-timesteps`, `--voxcpm2-max-tokens`, `--voxcpm2-min-tokens`, and `--seed` for reproducible synthesis.
## Picking among speech-swift TTS modules
| | Kokoro-82M | Qwen3-TTS | CosyVoice3 | VoxCPM2 | VibeVoice 1.5B |
|---|---|---|---|---|---|
| Params | 82M | 0.6 / 1.7 B | 0.5B | 2B | 1.5B |
| Sample rate | 24 kHz | 24 kHz | 24 kHz | 48 kHz | 24 kHz |
| Backend | CoreML (ANE) | MLX, CoreML | MLX | MLX | MLX |
| Languages | 10 | 10 | 9 | 30 | EN + ZH |
| Voice design | Fixed presets | — | — | Instruction-driven | — |
| Voice cloning | — | ICL reference | Zero-shot reference | Reference + ultimate | Raw audio + transcript |
| Long-form | Short/medium | Streaming | Streaming | Streaming patches | Up to 90 min |
Choose VoxCPM2 when you need 48 kHz output (music / broadcast applications) or natural-language voice design without a reference clip. For short-form English TTS with a smaller download, CosyVoice3 or Qwen3-TTS are lighter. For long-form podcast / audiobook work in EN/ZH, VibeVoice 1.5B is purpose-built.
## Responsible use
Voice cloning is included. Obtain consent for any voice you clone and don't use the model to impersonate individuals, generate disinformation, or commit fraud. The full safety guidance from openbmb/VoxCPM2 applies.