# VoxCPM2
VoxCPM2 is a 2B-parameter, tokenizer-free diffusion-autoregressive TTS model from OpenBMB. It synthesises 48 kHz studio-quality audio in 30 languages with three production modes: zero-shot synthesis, single-reference voice cloning, and natural-language voice design ("a young female voice, warm and gentle"). On Apple Silicon it runs natively via MLX in bf16, int8, or int4 — the int8 bundle round-trips through Qwen3-ASR with 0% WER on the 8-sentence test harness at an RTF of ~1.0.
## What it is
- 48 kHz output — the only on-device engine in this stack with studio sample rate. Every other TTS module maxes at 24 kHz.
- Voice design — natural-language style control: `--voxcpm2-instruct "young female voice, warm and gentle"`. None of the other engines expose this.
- Voice cloning — single-reference cloning from a 16 kHz clip; "ultimate cloning" (reference audio + transcript) for prosody preservation.
- 30 languages — English, Chinese, Indonesian, Japanese, Korean, and more. Auto-detected from text.
- Apache 2.0 — model weights inherit upstream openbmb licence; our Swift port is the same.
## Architecture
Five cooperating components produce a 48 kHz waveform:
| Component | Description |
|---|---|
| MiniCPM-4 base LM | 28-layer MiniCPM-4 with LongRoPE, GQA (16 Q / 2 KV heads, 128 head dim), and SwiGLU MLP. Conditions on text tokens + audio latents. |
| Residual LM | 8-layer MiniCPM-4 variant without rotary embeddings. Refines the base LM hidden state per generated audio patch. |
| FSQ + Local DiT estimator | Scalar-quantised hidden states drive a 12-layer Diffusion Transformer (V2) operating on 64-dim audio latents in patches of 4. CFG-zero-star Euler solver, 10 timesteps default. |
| AudioVAE V2 | Causal convolutional decoder. Reads 16 kHz reference audio and emits a 48 kHz waveform (3× upsampling baked in). |
| Stop head | Per-step binary classifier on the LM hidden state. Argmax = 1 ends generation after minTokens patches. |
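The FSQ step above is worth unpacking: each hidden dimension is squashed to a bounded range and snapped to a small fixed grid of scalar levels, so the DiT conditions on a discrete-but-dense code without a learned codebook. A minimal sketch (the level count is illustrative, not VoxCPM2's actual configuration):

```swift
import Foundation

/// Finite Scalar Quantisation (FSQ) sketch: squash each dimension to (-1, 1)
/// and round to one of `levels` evenly spaced grid points.
/// `levels = 9` is an illustrative choice, not VoxCPM2's real setting.
func fsqQuantise(_ x: [Float], levels: Int = 9) -> [Float] {
    let half = Float(levels - 1) / 2  // 9 levels -> grid step 1/4 over [-1, 1]
    return x.map { v in
        let bounded = tanh(v)                     // squash to (-1, 1)
        return (bounded * half).rounded() / half  // snap to nearest grid point
    }
}
```

Because the grid is fixed, quantisation is a pure element-wise op with no codebook lookup, which is what keeps this stage cheap inside the per-patch loop.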
## Bundles
Three quantisation variants, all converted from upstream PyTorch via the openbmb/VoxCPM2 checkpoints. Quantisation applies to the Linear projections inside the LM / residual LM / DiT estimator / projection heads; the AudioVAE vocoder stays at fp16/bf16 because quantising it hurts audio quality.
| Bundle | Quantisation | Size | HuggingFace |
|---|---|---|---|
| bf16 | None (reference) | ~5.0 GB | aufklarer/VoxCPM2-MLX-bf16 |
| int8 | MLX QuantizedLinear, group size 64 | ~3.0 GB | aufklarer/VoxCPM2-MLX-int8 |
| int4 | MLX QuantizedLinear, group size 64 | ~1.9 GB | aufklarer/VoxCPM2-MLX-int4 |
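The group-wise scheme behind the int8 / int4 bundles can be illustrated with a plain affine quantiser over one group of 64 weights — each group carries its own scale and bias, which is what keeps per-group error small. This is a sketch of the idea, not MLX's actual `QuantizedLinear` kernel:

```swift
import Foundation

/// Affine quantisation of one weight group (group size 64, as in the int8 bundle).
/// Illustrative only; MLX's real kernel packs codes and vectorises this.
func quantiseGroup(_ w: [Float], bits: Int = 8) -> (q: [UInt8], scale: Float, bias: Float) {
    let lo = w.min()!, hi = w.max()!
    let levels = Float((1 << bits) - 1)       // 255 codes for 8 bits
    let scale = max(hi - lo, 1e-8) / levels   // step between adjacent codes
    let q = w.map { UInt8((($0 - lo) / scale).rounded()) }
    return (q, scale, lo)
}

/// Reconstruct approximate weights from codes + per-group scale and bias.
func dequantiseGroup(_ q: [UInt8], scale: Float, bias: Float) -> [Float] {
    q.map { Float($0) * scale + bias }
}
```

Round-trip error per weight is bounded by half a quantisation step, and the step shrinks with the group's dynamic range — hence per-group rather than per-tensor scales.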
Round-trip ASR (Qwen3-ASR 0.6B, 8-sentence harness, M-series Apple Silicon):
| Variant | WER | RTF |
|---|---|---|
| bf16 | 2.04 % | 1.38 |
| int8 | 0.00 % | 1.02 |
| int4 | 4.08 % | 0.83 |
int8 is the recommended default — it matches the upstream Python pipeline bit-for-bit on the LM path while being faster than bf16 and 40 % smaller. int4 is the smallest bundle, with WER still acceptable for casual use.
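For reference, the WER figures above are word error rate: the word-level Levenshtein distance between the ASR transcript and the reference text, divided by the reference length. A minimal sketch:

```swift
import Foundation

/// Word error rate: word-level edit distance / reference length,
/// the metric used in the round-trip ASR table.
func wer(reference: String, hypothesis: String) -> Double {
    let ref = reference.lowercased().split(separator: " ").map(String.init)
    let hyp = hypothesis.lowercased().split(separator: " ").map(String.init)
    guard !ref.isEmpty, !hyp.isEmpty else {
        return ref.isEmpty && hyp.isEmpty ? 0 : 1
    }
    var dist = Array(0...hyp.count)  // DP row: edits against empty reference prefix
    for i in 1...ref.count {
        var prev = dist[0]
        dist[0] = i
        for j in 1...hyp.count {
            let sub = prev + (ref[i - 1] == hyp[j - 1] ? 0 : 1)
            prev = dist[j]
            dist[j] = min(sub, dist[j] + 1, dist[j - 1] + 1)  // sub / del / ins
        }
    }
    return Double(dist[hyp.count]) / Double(ref.count)
}
```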
## Quick start
```swift
import VoxCPM2TTS

let tts = try await VoxCPM2TTSModel.fromPretrained() // defaults to bf16
let audio = try await tts.generate(text: "Hello from VoxCPM2.", language: "english")
// audio: [Float] at 48 kHz mono
```
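The returned samples are plain floats, so to audition them you need to write them to disk yourself. A minimal hand-rolled 16-bit PCM WAV writer — a Foundation sketch, not part of the VoxCPM2TTS API:

```swift
import Foundation

/// Write mono Float samples as a 16-bit PCM WAV file.
/// Hand-rolled sketch; in an app you would likely use AVAudioFile instead.
func writeWAV(_ samples: [Float], sampleRate: UInt32 = 48_000, to url: URL) throws {
    var data = Data()
    func append<T: FixedWidthInteger>(_ v: T) {
        withUnsafeBytes(of: v.littleEndian) { data.append(contentsOf: $0) }
    }
    let pcm = samples.map { Int16(max(-1, min(1, $0)) * 32767) }  // clamp, scale
    let byteCount = UInt32(pcm.count * 2)
    data.append("RIFF".data(using: .ascii)!); append(36 + byteCount)
    data.append("WAVE".data(using: .ascii)!)
    data.append("fmt ".data(using: .ascii)!); append(UInt32(16))
    append(UInt16(1)); append(UInt16(1))         // PCM format, mono
    append(sampleRate); append(sampleRate * 2)   // byte rate = rate * block align
    append(UInt16(2)); append(UInt16(16))        // block align, bits per sample
    data.append("data".data(using: .ascii)!); append(byteCount)
    pcm.forEach { append($0) }
    try data.write(to: url)
}
```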
Pass an explicit model ID to pick the int8 / int4 bundle:
```swift
let tts = try await VoxCPM2TTSModel.fromPretrained(
    modelId: "aufklarer/VoxCPM2-MLX-int8"
)
```
## Voice design (instruction-driven)
Pass a natural-language style description; the model conditions synthesis on it without a reference audio sample:
```swift
let audio = try await tts.generateVoxCPM2(
    text: "Welcome to the show.",
    instruct: "A young woman, gentle and warm voice."
)
```
## Voice cloning
Single-reference cloning from a 16 kHz mono clip:
```swift
let ref = try AudioFileLoader.load(url: URL(fileURLWithPath: "speaker.wav"),
                                   targetSampleRate: 16000)
let audio = try await tts.generateVoxCPM2(
    text: "This is a cloned voice.",
    refAudio: ref
)
```
Ultimate cloning — pass both reference audio and the matching transcript so the LM can also condition on the lexical context, preserving prosody and accent more faithfully:
```swift
let audio = try await tts.generateVoxCPM2(
    text: "Hello from the cloned voice.",
    refAudio: ref,
    promptText: "this is what the reference clip actually said",
    promptAudio: ref
)
```
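Reference clips must be 16 kHz mono; AudioFileLoader resamples for you. If you are feeding raw Float buffers from elsewhere, a naive linear-interpolation resampler is enough for a quick test (no anti-aliasing filter, so prefer a proper resampler for production):

```swift
import Foundation

/// Naive linear-interpolation resampler for raw Float buffers,
/// e.g. 44.1 kHz capture -> the 16 kHz the reference path expects.
/// No low-pass filtering; a sketch for quick experiments only.
func resample(_ x: [Float], from srcRate: Double, to dstRate: Double) -> [Float] {
    guard x.count > 1 else { return x }
    let n = Int(Double(x.count) * dstRate / srcRate)
    return (0..<n).map { i in
        let pos = Double(i) * srcRate / dstRate   // fractional source index
        let j = min(Int(pos), x.count - 2)        // left neighbour (clamped)
        let t = Float(pos - Double(j))            // interpolation weight
        return x[j] * (1 - t) + x[j + 1] * t
    }
}
```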
## CLI
```bash
speech speak "Hello there." \
  --engine voxcpm2 \
  --voxcpm2-variant int8 \
  --output hello.wav

# Voice design
speech speak "Welcome to the show." \
  --engine voxcpm2 \
  --voxcpm2-instruct "A young woman, gentle and warm voice." \
  --output design.wav

# Voice cloning
speech speak "This is a cloned voice." \
  --engine voxcpm2 \
  --voice-sample speaker.wav \
  --output clone.wav
```
Flags: `--voxcpm2-variant {bf16,int8,int4}`, `--voxcpm2-instruct`, `--voxcpm2-ref-audio`, `--voxcpm2-prompt-audio` + `--voxcpm2-prompt-text`, `--voxcpm2-cfg-value`, `--voxcpm2-timesteps`, `--voxcpm2-max-tokens`, `--voxcpm2-min-tokens`, and `--seed` for reproducible synthesis.
## Picking among speech-swift TTS modules
| | Kokoro-82M | Qwen3-TTS | CosyVoice3 | VoxCPM2 | VibeVoice 1.5B |
|---|---|---|---|---|---|
| Params | 82M | 0.6 / 1.7 B | 0.5B | 2B | 1.5B |
| Sample rate | 24 kHz | 24 kHz | 24 kHz | 48 kHz | 24 kHz |
| Backend | CoreML (ANE) | MLX, CoreML | MLX | MLX | MLX |
| Languages | 10 | 10 | 9 | 30 | EN + ZH |
| Voice design | Fixed presets | — | — | Instruction-driven | — |
| Voice cloning | — | ICL reference | Zero-shot reference | Reference + ultimate | Raw audio + transcript |
| Long-form | Short/medium | Streaming | Streaming | Streaming patches | Up to 90 min |
Choose VoxCPM2 when you need 48 kHz output (music / broadcast applications) or natural-language voice design without a reference clip. For short-form English TTS with a smaller download, CosyVoice3 or Qwen3-TTS are lighter. For long-form podcast / audiobook work in EN/ZH, VibeVoice 1.5B is purpose-built.
## Responsible use
Voice cloning is included. Obtain consent for any voice you clone and don't use the model to impersonate individuals, generate disinformation, or commit fraud. The full safety guidance from openbmb/VoxCPM2 applies.