PersonaPlex

Full-duplex speech-to-speech dialogue model based on the Moshi architecture (Kyutai). PersonaPlex 7B generates spoken responses directly from spoken input, with no intermediate text pipeline required. The model ships with 18 voice presets and is available in 8-bit (recommended) and 4-bit quantization. 8-bit is the default: it is roughly 30% faster and produces coherent responses, while 4-bit degrades output quality.

Architecture

PersonaPlex is a multi-stream autoregressive model with three core components:

| Component | Details |
|---|---|
| Temporal Transformer | 32 layers, dim=4096, 32 heads, SwiGLU (hidden_scale=4.125), RoPE, 8-bit quantized (default) |
| Depformer | 6 layers, dim=1024, 16 heads, MultiLinear (weights_per_step=true), dep_q=16 |
| Mimi Codec | 16 codebooks, 12.5 Hz frame rate, 24 kHz audio output |

The model processes 17 streams simultaneously: 1 text stream + 8 user audio streams + 8 agent audio streams. This architecture enables full-duplex conversation where the model can listen and speak at the same time.
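The per-step stream layout can be sketched as follows. Only the 1 text + 8 user + 8 agent split comes from the description above; the concrete index ordering here is an assumption for illustration.

```swift
// Illustrative layout of the 17 parallel streams at each step.
// The 1 + 8 + 8 split is documented; the index assignments are assumed.
enum StreamLayout {
    static let text = 0               // inner-monologue text token
    static let userAudio = 1...8      // user audio codebook streams
    static let agentAudio = 9...16    // agent audio codebook streams
    static let total = 17             // 1 text + 8 user + 8 agent
}
```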

Voice Presets

PersonaPlex includes 18 built-in voice presets across natural and varied styles:

| Category | Presets |
|---|---|
| Natural Female | NATF0, NATF1, NATF2, NATF3 |
| Natural Male | NATM0, NATM1, NATM2, NATM3 |
| Varied Female | VARF0, VARF1, VARF2, VARF3, VARF4 |
| Varied Male | VARM0, VARM1, VARM2, VARM3, VARM4 |

Inner Monologue

PersonaPlex generates two parallel streams at every step: 8 audio codebook tokens for the Mimi codec and one text token for the model’s internal monologue. The text stream is what the model is “thinking” as it speaks — it can diverge slightly from the final audio, but in practice it mirrors the spoken response closely enough to use as a live transcript.

The text tokens come back as raw SentencePiece piece IDs. Decode them with the SentencePieceDecoder that PersonaPlex ships:

```swift
import PersonaPlex
import AudioCommon

let model = try await PersonaPlexModel.fromPretrained()
let decoder = try model.makeTextDecoder()  // SentencePieceDecoder

let result = model.respondWithTranscript(userAudio: userSamples, voice: .NATM0)
let transcript = decoder.decode(result.textTokens)
print(transcript)        // "Sure, I can help with that..."
playAudio(result.audio)  // 24 kHz mono Float32
```

In streaming mode, respondStream emits textTokens chunks as they are produced — decode them incrementally to drive a live caption view while the audio is still generating. The --transcript CLI flag does exactly this behind the scenes.
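A rough sketch of that streaming loop, assuming respondStream yields an async sequence of chunks whose fields mirror the non-streaming result (textTokens, audio); the exact signature, and the captionView and player helpers, are assumptions:

```swift
// Hedged sketch: respondStream is described in the prose above; the
// chunk field names and the UI/player helpers are illustrative only.
var caption = ""
for try await chunk in model.respondStream(userAudio: userSamples, voice: .NATM0) {
    caption += decoder.decode(chunk.textTokens)  // incremental transcript
    captionView.text = caption                   // drive a live caption view
    player.enqueue(chunk.audio)                  // play audio as it arrives
}
```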

Why it matters: SentencePieceDecoder is built on the shared AudioCommon.SentencePieceModel protobuf reader, so PersonaPlex, OmnilingualASR and any future SentencePiece-based model decode through the same tokenizer implementation. See the SentencePieceModel reference.

System Prompts

Pass any custom system prompt as a plain string — no external tokenization needed:

```swift
let response = model.respond(
    userAudio: audio,
    voice: .NATM0,
    systemPrompt: "You enjoy having a good conversation."
)
```

Or use a built-in preset:
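For example, assuming the CLI presets (assistant, focused, customer-service, teacher) map to Swift enum cases; .assistant appears in the Swift API section, while .customerService is an assumed spelling of the customer-service preset:

```swift
// .assistant is shown in the Swift API section below;
// .customerService is an assumed mapping of the customer-service preset.
let response = model.respond(
    userAudio: audio,
    voice: .NATM0,
    systemPrompt: .customerService
)
```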

CLI Usage

Generate a spoken response from an audio input:

```shell
# Basic speech-to-speech
.build/release/audio respond --input question.wav

# Choose a voice preset
.build/release/audio respond --input question.wav --voice NATM0

# Stream audio output during generation
.build/release/audio respond --input question.wav --stream

# Custom system prompt text
.build/release/audio respond --input question.wav --system-prompt-text "You enjoy having a good conversation."

# Use a preset system prompt
.build/release/audio respond --input question.wav --system-prompt customer-service

# Get transcript alongside audio
.build/release/audio respond --input question.wav --transcript

# JSON output with metadata
.build/release/audio respond --input question.wav --json
```

Options

| Option | Description |
|---|---|
| --input | Input audio file (WAV, required) |
| --voice | Voice preset name (e.g., NATM0, VARF2) |
| --system-prompt | System prompt preset: assistant, focused, customer-service, teacher |
| --system-prompt-text | Custom system prompt text (overrides --system-prompt) |
| --max-steps | Maximum generation steps |
| --stream | Emit audio chunks during generation |
| --compile | Use MLX compiled inference for faster generation |
| --transcript | Output the text transcript alongside audio |
| --json | JSON output with metadata |

Sampling parameters can also be overridden:

| Option | Default | Description |
|---|---|---|
| --audio-temp | 0.8 | Audio token sampling temperature |
| --audio-top-k | 250 | Audio token top-k sampling |
| --text-temp | 0.7 | Text token sampling temperature |
| --text-top-k | 25 | Text token top-k sampling |
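For example, lowering the temperatures and tightening top-k yields more conservative sampling on both streams (flags as listed above; the specific values here are illustrative, not recommendations):

```shell
# More deterministic output: lower temperatures, smaller audio top-k
.build/release/audio respond --input question.wav \
  --audio-temp 0.6 --audio-top-k 100 --text-temp 0.5
```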

Streaming

The --stream flag enables real-time audio output. Audio chunks are emitted as they are generated, so playback can begin before the full response is complete. This is particularly useful for interactive applications where low latency matters.

Performance

| Metric | Value |
|---|---|
| Real-time factor (RTF) | ~1.4 (8-bit, near real-time) |
| Step latency | ~112 ms/step on M2 Max (8-bit) |
| Model size (8-bit) | ~9.1 GB |
| Peak RAM (8-bit) | ~11 GB |
| Model size (4-bit) | ~4.9 GB |
| Peak RAM (4-bit) | ~7 GB |

The RTF follows from the step latency: at the codec's 12.5 Hz frame rate each step produces 80 ms of audio, and 112 ms / 80 ms ≈ 1.4.
Important

PersonaPlex 7B (8-bit) requires at least 24 GB of RAM. The 4-bit variant fits on 16 GB devices but produces degraded output. On 8 GB devices, neither variant will fit. Use --compile for best performance on supported hardware.

Model Variants

| Model | Size | HuggingFace |
|---|---|---|
| PersonaPlex-7B (8-bit) recommended | 9.1 GB | aufklarer/PersonaPlex-7B-MLX-8bit |
| PersonaPlex-7B (4-bit) | 4.9 GB | aufklarer/PersonaPlex-7B-MLX-4bit |

Swift API

```swift
import PersonaPlex

let model = try await PersonaPlexModel.loadFromHub()
let response = try await model.respond(
    audioFile: "question.wav",
    voice: .NATM0,
    systemPrompt: .assistant
)
try response.audio.write(to: "answer.wav")
```