# PersonaPlex
Full-duplex speech-to-speech dialogue model based on the Moshi architecture (Kyutai). PersonaPlex 7B generates spoken responses directly from spoken input — no intermediate text pipeline required. The model is 4-bit quantized and ships with 18 voice presets.
## Architecture
PersonaPlex is a multi-stream autoregressive model with three core components:
| Component | Details |
|---|---|
| Temporal Transformer | 32 layers, dim=4096, 32 heads, SwiGLU (hidden_scale=4.125), RoPE, 4-bit quantized |
| Depformer | 6 layers, dim=1024, 16 heads, MultiLinear (weights_per_step=true), dep_q=16 |
| Mimi Codec | 16 codebooks, 12.5 Hz frame rate, 24 kHz audio output |
The model processes 17 streams simultaneously: 1 text stream + 8 user audio streams + 8 agent audio streams. This architecture enables full-duplex conversation where the model can listen and speak at the same time.
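The per-step arithmetic implied by the tables above can be sanity-checked in a few lines of Swift (a standalone sketch; the constant names are illustrative, not part of the PersonaPlex API):

```swift
// Mimi codec timing, from the architecture table above.
let frameRateHz = 12.5                            // codec frames per second
let sampleRateHz = 24_000.0                       // output audio sample rate

let framePeriodMs = 1_000.0 / frameRateHz         // 80 ms of audio per generation step
let samplesPerFrame = Int(sampleRateHz / frameRateHz)  // 1920 PCM samples per frame

// Streams processed per step: 1 text + 8 user audio + 8 agent audio.
let totalStreams = 1 + 8 + 8                      // 17

print(framePeriodMs, samplesPerFrame, totalStreams)
```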
## Voice Presets
PersonaPlex includes 18 built-in voice presets across natural and varied styles:
| Category | Presets |
|---|---|
| Natural Female | NATF0, NATF1, NATF2, NATF3 |
| Natural Male | NATM0, NATM1, NATM2, NATM3 |
| Varied Female | VARF0, VARF1, VARF2, VARF3, VARF4 |
| Varied Male | VARM0, VARM1, VARM2, VARM3, VARM4 |
## System Prompts
Built-in system prompts steer the model's conversational behavior:
- `assistant` — General-purpose helpful assistant (default)
- `focused` — Concise, direct responses
- `customer-service` — Polite, solution-oriented support agent
- `teacher` — Patient, explanatory teaching style
## CLI Usage
Generate a spoken response from an audio input:
```bash
# Basic speech-to-speech
.build/release/audio respond --input question.wav

# Choose a voice preset
.build/release/audio respond --input question.wav --voice NATM0

# Stream audio output during generation
.build/release/audio respond --input question.wav --stream

# Use a specific system prompt
.build/release/audio respond --input question.wav --system-prompt customer-service

# Get transcript alongside audio
.build/release/audio respond --input question.wav --transcript

# JSON output with metadata
.build/release/audio respond --input question.wav --json
```
### Options
| Option | Description |
|---|---|
| `--input` | Input audio file (WAV, required) |
| `--voice` | Voice preset name (e.g., `NATM0`, `VARF2`) |
| `--system-prompt` | System prompt: `assistant`, `focused`, `customer-service`, `teacher` |
| `--max-steps` | Maximum generation steps |
| `--stream` | Emit audio chunks during generation |
| `--compile` | Use MLX compiled inference for faster generation |
| `--transcript` | Output the text transcript alongside audio |
| `--json` | JSON output with metadata |
Sampling parameters can also be overridden:
| Option | Default | Description |
|---|---|---|
| `--audio-temp` | 0.8 | Audio token sampling temperature |
| `--audio-top-k` | 250 | Audio token top-k sampling |
| `--text-temp` | 0.7 | Text token sampling temperature |
| `--text-top-k` | 25 | Text token top-k sampling |
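Sampling overrides compose with the other flags. A sketch combining the documented options (the input file name is illustrative):

```shell
# Lower temperatures and tighter top-k for more deterministic output
.build/release/audio respond --input question.wav \
  --voice VARF2 \
  --audio-temp 0.6 --audio-top-k 100 \
  --text-temp 0.5 --text-top-k 10 \
  --json
```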
## Streaming
The `--stream` flag enables real-time audio output. Audio chunks are emitted as they are generated, so playback can begin before the full response is complete. This is particularly useful for interactive applications where low latency matters.
## Performance
| Metric | Value |
|---|---|
| Real-time factor (RTF) | ~0.87 (faster than real-time) |
| Step latency | ~68 ms/step on M2 Max |
| Model size (4-bit) | ~5.5 GB |
| Peak RAM | ~7 GB |
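These figures are mutually consistent: each generation step produces one Mimi frame of 80 ms of audio (12.5 Hz), so a ~68 ms step comes out just under real time, close to the reported ~0.87 RTF (a back-of-envelope sketch, ignoring warm-up and I/O):

```swift
let framePeriodMs = 1_000.0 / 12.5   // 80 ms of audio produced per generation step
let stepLatencyMs = 68.0             // measured step latency on M2 Max (from the table)

// Real-time factor: generation time divided by audio duration; below 1.0 is
// faster than real-time.
let rtf = stepLatencyMs / framePeriodMs

print(rtf)
```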
PersonaPlex 7B requires at least 16 GB of RAM; it will not fit in memory on 8 GB devices. Use `--compile` for best performance on supported hardware.
## Model Variants
| Model | Size | HuggingFace |
|---|---|---|
| PersonaPlex-7B (4-bit) | 4.9 GB | aufklarer/PersonaPlex-7B-MLX-4bit |
| PersonaPlex-7B (8-bit) | 9.1 GB | aufklarer/PersonaPlex-7B-MLX-8bit |
## Swift API
```swift
import PersonaPlex

let model = try await PersonaPlexModel.loadFromHub()
let response = try await model.respond(
    audioFile: "question.wav",
    voice: .NATM0,
    systemPrompt: .assistant
)
try response.audio.write(to: "answer.wav")
```