VoxCPM2

VoxCPM2 is a 2B-parameter, tokenizer-free diffusion-autoregressive TTS model from OpenBMB. It synthesises 48 kHz studio-quality audio in 30 languages with three production modes: zero-shot synthesis, single-reference voice cloning, and natural-language voice design ("a young female voice, warm and gentle"). On Apple Silicon it runs natively via MLX in bf16, int8, or int4 — the int8 bundle round-trips through Qwen3-ASR with 0 % WER on the 8-sentence test harness at an RTF of ~1.0.

What it is

Architecture

Five cooperating components produce a 48 kHz waveform:

| Component | Description |
| --- | --- |
| MiniCPM-4 base LM | 28-layer MiniCPM-4 with LongRoPE, GQA (16 Q / 2 KV heads, 128 head dim), and SwiGLU MLP. Conditions on text tokens + audio latents. |
| Residual LM | 8-layer MiniCPM-4 variant without rotary embeddings. Refines the base LM hidden state per generated audio patch. |
| FSQ + Local DiT estimator | Scalar-quantised hidden states drive a 12-layer Diffusion Transformer (V2) operating on 64-dim audio latents in patches of 4. CFG-zero-star Euler solver, 10 timesteps default. |
| AudioVAE V2 | Causal convolutional decoder. Reads 16 kHz reference audio and emits a 48 kHz waveform (3× upsample baked in). |
| Stop head | Per-step binary classifier on the LM hidden state. Argmax = 1 ends generation after `minTokens` patches. |
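To make the estimator's role concrete, here is a toy sketch of a fixed-step Euler sampler over one patch of 4 × 64-dim latents. The `velocity` function is a hypothetical stand-in for the DiT (which in reality is conditioned on LM hidden states, with CFG mixing); only the 10-step integration schedule mirrors the description above.

```swift
// Hypothetical stand-in for the DiT velocity estimator: the real model
// predicts a conditioned velocity field; here we simply decay toward zero.
func velocity(_ x: [[Double]], t: Double) -> [[Double]] {
    x.map { frame in frame.map { -$0 } }
}

// Euler sampler: integrate from noise (t = 0) toward data (t = 1)
// in `steps` uniform steps — the estimator defaults to 10 timesteps.
func eulerSample(noise: [[Double]], steps: Int = 10) -> [[Double]] {
    var x = noise
    let dt = 1.0 / Double(steps)
    for i in 0..<steps {
        let t = Double(i) * dt
        let v = velocity(x, t: t)
        for f in 0..<x.count {
            for d in 0..<x[f].count {
                x[f][d] += dt * v[f][d]
            }
        }
    }
    return x
}

// One "patch" is 4 frames of 64-dim latents.
let patch = (0..<4).map { _ in (0..<64).map { _ in Double.random(in: -1...1) } }
let sample = eulerSample(noise: patch)
```

The real solver additionally mixes conditional and unconditional velocities (CFG-zero-star) before each Euler step; that mixing is omitted here for brevity.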

Bundles

Three quantisation variants, all converted from upstream PyTorch via the openbmb/VoxCPM2 checkpoints. Quantisation applies to the Linear projections inside the LM / residual LM / DiT estimator / projection heads; the AudioVAE vocoder stays at fp16/bf16 because quantising it hurts audio quality.

| Bundle | Quantisation | Size | HuggingFace |
| --- | --- | --- | --- |
| bf16 | None (reference) | ~5.0 GB | aufklarer/VoxCPM2-MLX-bf16 |
| int8 | MLX QuantizedLinear, group size 64 | ~3.0 GB | aufklarer/VoxCPM2-MLX-int8 |
| int4 | MLX QuantizedLinear, group size 64 | ~1.9 GB | aufklarer/VoxCPM2-MLX-int4 |

Round-trip ASR (Qwen3-ASR 0.6B, 8-sentence harness, M-series Apple Silicon):

| Variant | WER | RTF |
| --- | --- | --- |
| bf16 | 2.04 % | 1.38 |
| int8 | 0.00 % | 1.02 |
| int4 | 4.08 % | 0.83 |
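For context, WER in a harness like this is word-level edit distance (substitutions + insertions + deletions) divided by the reference word count. A minimal scorer — an illustrative sketch, not the actual harness code:

```swift
// Word error rate: Levenshtein distance over words, normalised by the
// number of reference words. Uses a single rolling DP row.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.split(separator: " ").map { $0.lowercased() }
    let hyp = hypothesis.split(separator: " ").map { $0.lowercased() }
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }
    guard !hyp.isEmpty else { return 1 }
    var dist = Array(0...hyp.count)
    for i in 1...ref.count {
        var prev = dist[0]          // holds dist[i-1][j-1]
        dist[0] = i
        for j in 1...hyp.count {
            let cur = dist[j]
            if ref[i - 1] == hyp[j - 1] {
                dist[j] = prev
            } else {
                dist[j] = min(prev, dist[j], dist[j - 1]) + 1
            }
            prev = cur
        }
    }
    return Double(dist[hyp.count]) / Double(ref.count)
}
```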

int8 is the recommended default — it matches the upstream Python pipeline bit-for-bit on the LM path while being faster and 40 % smaller than bf16. int4 is the smallest bundle with acceptable WER for casual use.
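RTF in the table above is read with the usual convention — wall-clock generation time divided by the duration of the audio produced, so lower is faster and values below 1.0 are faster than real time (an assumption stated here, not spelled out upstream):

```swift
// Real-time factor: generation time over audio duration; RTF < 1 means
// the model synthesises audio faster than it plays back.
func realTimeFactor(generationSeconds: Double, audioSeconds: Double) -> Double {
    generationSeconds / audioSeconds
}
```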

Quick start

```swift
import VoxCPM2TTS

let tts = try await VoxCPM2TTSModel.fromPretrained()  // defaults to bf16
let audio = try await tts.generate(text: "Hello from VoxCPM2.", language: "english")
// audio: [Float] at 48 kHz mono
```
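`generate` returns raw samples; persisting them is up to the caller. A minimal 16-bit PCM WAV encoder for the 48 kHz mono output — an illustrative helper, not part of the library:

```swift
import Foundation

// Encode mono Float samples as a 16-bit PCM WAV blob (44-byte RIFF header
// followed by little-endian sample data).
func wavData(samples: [Float], sampleRate: UInt32 = 48_000) -> Data {
    let pcm = samples.map { s -> Int16 in
        Int16(max(-1, min(1, s)) * 32767)   // clip, then scale to Int16
    }
    let dataSize = UInt32(pcm.count * 2)
    var out = Data()
    func append<T: FixedWidthInteger>(_ v: T) {
        withUnsafeBytes(of: v.littleEndian) { out.append(contentsOf: $0) }
    }
    out.append(contentsOf: Array("RIFF".utf8))
    append(UInt32(36 + dataSize))            // RIFF chunk size
    out.append(contentsOf: Array("WAVEfmt ".utf8))
    append(UInt32(16))                       // fmt chunk size
    append(UInt16(1))                        // PCM format
    append(UInt16(1))                        // mono
    append(sampleRate)
    append(sampleRate * 2)                   // byte rate (mono, 16-bit)
    append(UInt16(2))                        // block align
    append(UInt16(16))                       // bits per sample
    out.append(contentsOf: Array("data".utf8))
    append(dataSize)
    for s in pcm { append(s) }
    return out
}
```

Write the result with `try wavData(samples: audio).write(to: url)`.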

Pass an explicit model ID to pick the int8 / int4 bundle:

```swift
let tts = try await VoxCPM2TTSModel.fromPretrained(
    modelId: "aufklarer/VoxCPM2-MLX-int8"
)
```

Voice design (instruction-driven)

Pass a natural-language style description; the model conditions synthesis on it without a reference audio sample:

```swift
let audio = try await tts.generateVoxCPM2(
    text: "Welcome to the show.",
    instruct: "A young woman, gentle and warm voice."
)
```

Voice cloning

Single-reference cloning from a 16 kHz mono clip:

```swift
let ref = try AudioFileLoader.load(url: URL(fileURLWithPath: "speaker.wav"),
                                   targetSampleRate: 16000)
let audio = try await tts.generateVoxCPM2(
    text: "This is a cloned voice.",
    refAudio: ref
)
```
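`AudioFileLoader` handles the sample-rate conversion above; if you load audio some other way, the reference must still end up as 16 kHz mono. A naive linear-interpolation resampler showing what that conversion amounts to (illustrative only — prefer a band-limited resampler for production quality):

```swift
// Naive linear-interpolation resampler: map each output sample position
// back into the source and blend its two nearest neighbours.
func resample(_ input: [Float], from srcRate: Int, to dstRate: Int) -> [Float] {
    guard !input.isEmpty, srcRate != dstRate else { return input }
    let ratio = Double(srcRate) / Double(dstRate)
    let outCount = Int(Double(input.count) / ratio)
    return (0..<outCount).map { i in
        let pos = Double(i) * ratio
        let idx = Int(pos)
        let frac = Float(pos - Double(idx))
        let next = min(idx + 1, input.count - 1)
        return input[idx] * (1 - frac) + input[next] * frac
    }
}
```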

Ultimate cloning — pass both reference audio and the matching transcript so the LM can also condition on the lexical context, preserving prosody and accent more faithfully:

```swift
let audio = try await tts.generateVoxCPM2(
    text: "Hello from the cloned voice.",
    refAudio: ref,
    promptText: "this is what the reference clip actually said",
    promptAudio: ref
)
```

CLI

```sh
speech speak "Hello there." \
    --engine voxcpm2 \
    --voxcpm2-variant int8 \
    --output hello.wav

# Voice design
speech speak "Welcome to the show." \
    --engine voxcpm2 \
    --voxcpm2-instruct "A young woman, gentle and warm voice." \
    --output design.wav

# Voice cloning
speech speak "This is a cloned voice." \
    --engine voxcpm2 \
    --voice-sample speaker.wav \
    --output clone.wav
```

Flags: `--voxcpm2-variant {bf16,int8,int4}`, `--voxcpm2-instruct`, `--voxcpm2-ref-audio`, `--voxcpm2-prompt-audio` + `--voxcpm2-prompt-text`, `--voxcpm2-cfg-value`, `--voxcpm2-timesteps`, `--voxcpm2-max-tokens`, `--voxcpm2-min-tokens`, and `--seed` for reproducible synthesis.

Picking among speech-swift TTS modules

| | Kokoro-82M | Qwen3-TTS | CosyVoice3 | VoxCPM2 | VibeVoice 1.5B |
| --- | --- | --- | --- | --- | --- |
| Params | 82M | 0.6 / 1.7 B | 0.5B | 2B | 1.5B |
| Sample rate | 24 kHz | 24 kHz | 24 kHz | 48 kHz | 24 kHz |
| Backend | CoreML (ANE) | MLX, CoreML | MLX | MLX | MLX |
| Languages | 10 | 10 | 9 | 30 | EN + ZH |
| Voice design | Fixed presets | — | — | Instruction-driven | — |
| Voice cloning | — | ICL reference | Zero-shot reference | Reference + ultimate | Raw audio + transcript |
| Long-form | Short/medium | Streaming | Streaming | Streaming patches | Up to 90 min |
Pick VoxCPM2 when…

…you need 48 kHz output (music / broadcast applications) or natural-language voice design without a reference clip. For short-form English TTS where download size matters, CosyVoice3 and Qwen3-TTS are lighter options. For long-form podcast / audiobook work in EN/ZH, VibeVoice 1.5B is purpose-built.

Responsible use

Voice cloning is included. Obtain consent for any voice you clone and don't use the model to impersonate individuals, generate disinformation, or commit fraud. The full safety guidance from openbmb/VoxCPM2 applies.