PersonaPlex

Full-duplex speech-to-speech dialogue model based on the Moshi architecture (Kyutai). PersonaPlex 7B generates spoken responses directly from spoken input, with no intermediate text pipeline required. The model ships with 18 voice presets and is available in 8-bit (recommended) and 4-bit quantization. 8-bit is the default: it is roughly 30% faster and produces coherent responses, while 4-bit degrades output quality.

Architecture

PersonaPlex is a multi-stream autoregressive model with three core components:

| Component | Details |
|---|---|
| Temporal Transformer | 32 layers, dim=4096, 32 heads, SwiGLU (hidden_scale=4.125), RoPE, 8-bit quantized (default) |
| Depformer | 6 layers, dim=1024, 16 heads, MultiLinear (weights_per_step=true), dep_q=16 |
| Mimi Codec | 16 codebooks, 12.5 Hz frame rate, 24 kHz audio output |

The model processes 17 streams simultaneously: 1 text stream + 8 user audio streams + 8 agent audio streams. This architecture enables full-duplex conversation where the model can listen and speak at the same time.
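The per-step stream layout can be sketched as follows. Only the 1 text + 8 user + 8 agent split comes from the description above; the concrete index ordering here is an assumption for illustration.

```swift
// Illustrative layout of the 17 parallel streams at each step.
// The 1 + 8 + 8 split is documented; the index assignments are assumed.
enum StreamLayout {
    static let text = 0               // inner-monologue text token
    static let userAudio = 1...8      // user audio codebook streams
    static let agentAudio = 9...16    // agent audio codebook streams
    static let total = 17             // 1 text + 8 user + 8 agent
}
```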

Voice Presets

PersonaPlex includes 18 built-in voice presets across natural and varied styles:

| Category | Presets |
|---|---|
| Natural Female | NATF0, NATF1, NATF2, NATF3 |
| Natural Male | NATM0, NATM1, NATM2, NATM3 |
| Varied Female | VARF0, VARF1, VARF2, VARF3, VARF4 |
| Varied Male | VARM0, VARM1, VARM2, VARM3, VARM4 |

Inner Monologue

PersonaPlex generates two parallel streams at every step: 8 audio codebook tokens for the Mimi codec and one text token for the model’s internal monologue. The text stream is what the model is “thinking” as it speaks — it can diverge slightly from the final audio, but in practice it mirrors the spoken response closely enough to use as a live transcript.

The text tokens come back as raw SentencePiece piece IDs. Decode them with the SentencePieceDecoder that PersonaPlex ships:

```swift
import PersonaPlex
import AudioCommon

let model = try await PersonaPlexModel.fromPretrained()
let decoder = try model.makeTextDecoder()  // SentencePieceDecoder

let result = model.respondWithTranscript(userAudio: userSamples, voice: .NATM0)
let transcript = decoder.decode(result.textTokens)
print(transcript)        // "Sure, I can help with that..."
playAudio(result.audio)  // 24 kHz mono Float32
```

In streaming mode, respondStream emits textTokens chunks as they are produced — decode them incrementally to drive a live caption view while the audio is still generating. The --transcript CLI flag does exactly this behind the scenes.
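A rough sketch of that streaming loop, assuming respondStream yields an async sequence of chunks whose fields mirror the non-streaming result (textTokens, audio); the exact signature, and the captionView and player helpers, are assumptions:

```swift
// Hedged sketch: respondStream is described in the prose above; the
// chunk field names and the UI/player helpers are illustrative only.
var caption = ""
for try await chunk in model.respondStream(userAudio: userSamples, voice: .NATM0) {
    caption += decoder.decode(chunk.textTokens)  // incremental transcript
    captionView.text = caption                   // drive a live caption view
    player.enqueue(chunk.audio)                  // play audio as it arrives
}
```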

Why it matters: SentencePieceDecoder is built on the shared AudioCommon.SentencePieceModel protobuf reader, so PersonaPlex, OmnilingualASR and any future SentencePiece-based model decode through the same tokenizer implementation. See the SentencePieceModel reference.

System Prompts

Pass any custom system prompt as a plain string — no external tokenization needed:

```swift
let response = model.respond(
    userAudio: audio,
    voice: .NATM0,
    systemPrompt: "You enjoy having a good conversation."
)
```

Or use a built-in preset:
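For example, assuming the CLI presets (assistant, focused, customer-service, teacher) map to Swift enum cases; .assistant appears in the Swift API section, while .customerService is an assumed spelling of the customer-service preset:

```swift
// .assistant is shown in the Swift API section below;
// .customerService is an assumed mapping of the customer-service preset.
let response = model.respond(
    userAudio: audio,
    voice: .NATM0,
    systemPrompt: .customerService
)
```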

CLI Usage

Generate a spoken response from an audio input:

```shell
# Basic speech-to-speech
.build/release/audio respond --input question.wav

# Choose a voice preset
.build/release/audio respond --input question.wav --voice NATM0

# Stream audio output during generation
.build/release/audio respond --input question.wav --stream

# Custom system prompt text
.build/release/audio respond --input question.wav --system-prompt-text "You enjoy having a good conversation."

# Use a preset system prompt
.build/release/audio respond --input question.wav --system-prompt customer-service

# Get transcript alongside audio
.build/release/audio respond --input question.wav --transcript

# JSON output with metadata
.build/release/audio respond --input question.wav --json
```

Options

| Option | Description |
|---|---|
| --input | Input audio file (WAV, required) |
| --voice | Voice preset name (e.g., NATM0, VARF2) |
| --system-prompt | System prompt preset: assistant, focused, customer-service, teacher |
| --system-prompt-text | Custom system prompt text (overrides --system-prompt) |
| --max-steps | Maximum generation steps |
| --stream | Emit audio chunks during generation |
| --compile | Use MLX compiled inference for faster generation |
| --transcript | Output the text transcript alongside audio |
| --json | JSON output with metadata |

Sampling parameters can also be overridden:

| Option | Default | Description |
|---|---|---|
| --audio-temp | 0.8 | Audio token sampling temperature |
| --audio-top-k | 250 | Audio token top-k sampling |
| --text-temp | 0.7 | Text token sampling temperature |
| --text-top-k | 25 | Text token top-k sampling |
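For example, lowering the temperatures and tightening top-k yields more conservative sampling on both streams (flags as listed above; the specific values here are illustrative, not recommendations):

```shell
# More deterministic output: lower temperatures, smaller audio top-k
.build/release/audio respond --input question.wav \
  --audio-temp 0.6 --audio-top-k 100 --text-temp 0.5
```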

Streaming

The --stream flag enables real-time audio output. Audio chunks are emitted as they are generated, so playback can begin before the full response is complete. This is particularly useful for interactive applications where low latency matters.

Performance

| Metric | Value |
|---|---|
| Real-time factor (RTF) | ~1.4 (8-bit, near real-time) |
| Step latency | ~112 ms/step on M2 Max (8-bit) |
| Model size (8-bit) | ~9.1 GB |
| Peak RAM (8-bit) | ~11 GB |
| Model size (4-bit) | ~4.9 GB |
| Peak RAM (4-bit) | ~7 GB |

The RTF follows from the step latency: at the codec's 12.5 Hz frame rate each step produces 80 ms of audio, and 112 ms / 80 ms ≈ 1.4.
Important

PersonaPlex 7B (8-bit) requires at least 24 GB of RAM. The 4-bit variant fits on 16 GB devices but produces degraded output. On 8 GB devices, neither variant will fit. Use --compile for best performance on supported hardware.

Model Variants

| Model | Size | HuggingFace |
|---|---|---|
| PersonaPlex-7B (8-bit) recommended | 9.1 GB | aufklarer/PersonaPlex-7B-MLX-8bit |
| PersonaPlex-7B (4-bit) | 4.9 GB | aufklarer/PersonaPlex-7B-MLX-4bit |

Swift API

```swift
import PersonaPlex

let model = try await PersonaPlexModel.loadFromHub()
let response = try await model.respond(
    audioFile: "question.wav",
    voice: .NATM0,
    systemPrompt: .assistant
)
try response.audio.write(to: "answer.wav")
```