# PersonaPlex

A full-duplex speech-to-speech dialogue model based on Kyutai's Moshi architecture. PersonaPlex 7B generates spoken responses directly from spoken input, with no intermediate text pipeline. The model ships 4-bit quantized with 18 voice presets.

## Architecture

PersonaPlex is a multi-stream autoregressive model with three core components:

| Component | Details |
|---|---|
| Temporal Transformer | 32 layers, dim=4096, 32 heads, SwiGLU (hidden_scale=4.125), RoPE, 4-bit quantized |
| Depformer | 6 layers, dim=1024, 16 heads, MultiLinear (weights_per_step=true), dep_q=16 |
| Mimi Codec | 16 codebooks, 12.5 Hz frame rate, 24 kHz audio output |

The model processes 17 streams simultaneously: 1 text stream + 8 user audio streams + 8 agent audio streams. This architecture enables full-duplex conversation where the model can listen and speak at the same time.
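The stream layout and Mimi frame rate pin down a few useful constants. A quick back-of-envelope sketch (Python used here purely for illustration, not project code):

```python
# Stream layout from the architecture above: 1 text stream plus
# 8 user and 8 agent audio streams, all advanced together each step.
text_streams = 1
user_audio_streams = 8
agent_audio_streams = 8
total_streams = text_streams + user_audio_streams + agent_audio_streams  # 17

# Mimi codec timing: a 12.5 Hz frame rate at 24 kHz output means each
# generation step covers 80 ms of audio, i.e. 1920 samples per frame.
frame_rate_hz = 12.5
sample_rate_hz = 24_000
frame_duration_ms = 1000 / frame_rate_hz            # 80.0 ms per step
samples_per_frame = sample_rate_hz / frame_rate_hz  # 1920.0 samples

print(total_streams, frame_duration_ms, samples_per_frame)
```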

## Voice Presets

PersonaPlex includes 18 built-in voice presets across natural and varied styles:

| Category | Presets |
|---|---|
| Natural Female | NATF0, NATF1, NATF2, NATF3 |
| Natural Male | NATM0, NATM1, NATM2, NATM3 |
| Varied Female | VARF0, VARF1, VARF2, VARF3, VARF4 |
| Varied Male | VARM0, VARM1, VARM2, VARM3, VARM4 |

## System Prompts

Built-in system prompts steer the model's conversational behavior:

- `assistant`
- `focused`
- `customer-service`
- `teacher`

## CLI Usage

Generate a spoken response from an audio input:

```bash
# Basic speech-to-speech
.build/release/audio respond --input question.wav

# Choose a voice preset
.build/release/audio respond --input question.wav --voice NATM0

# Stream audio output during generation
.build/release/audio respond --input question.wav --stream

# Use a specific system prompt
.build/release/audio respond --input question.wav --system-prompt customer-service

# Get transcript alongside audio
.build/release/audio respond --input question.wav --transcript

# JSON output with metadata
.build/release/audio respond --input question.wav --json
```

### Options

| Option | Description |
|---|---|
| `--input` | Input audio file (WAV, required) |
| `--voice` | Voice preset name (e.g., `NATM0`, `VARF2`) |
| `--system-prompt` | System prompt: `assistant`, `focused`, `customer-service`, `teacher` |
| `--max-steps` | Maximum generation steps |
| `--stream` | Emit audio chunks during generation |
| `--compile` | Use MLX compiled inference for faster generation |
| `--transcript` | Output the text transcript alongside audio |
| `--json` | JSON output with metadata |

Sampling parameters can also be overridden:

| Option | Default | Description |
|---|---|---|
| `--audio-temp` | 0.8 | Audio token sampling temperature |
| `--audio-top-k` | 250 | Audio token top-k sampling |
| `--text-temp` | 0.7 | Text token sampling temperature |
| `--text-top-k` | 25 | Text token top-k sampling |

## Streaming

The `--stream` flag enables real-time audio output. Audio chunks are emitted as they are generated, so playback can begin before the full response is complete. This is particularly useful for interactive applications where low latency matters.
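The latency win is easy to quantify: at 12.5 Hz each step yields 80 ms of audio, so a buffered response makes the listener wait for every step before hearing anything, while streaming starts playback after the first. A rough comparison using the ~68 ms/step figure from the Performance section below (an illustrative Python sketch, ignoring codec-decode and I/O overhead):

```python
# Illustrative time-to-first-audio for streaming vs. buffered output.
# step_latency is the ~68 ms/step M2 Max figure from the Performance
# table; real first-chunk latency also includes decode, ignored here.
step_latency_s = 0.068    # compute time per generation step
frame_duration_s = 0.080  # audio produced per step at 12.5 Hz

def time_to_first_audio(streaming: bool, response_seconds: float) -> float:
    steps = response_seconds / frame_duration_s
    if streaming:
        return step_latency_s      # playback starts after the first step
    return steps * step_latency_s  # wait for the entire response first

# For a 10-second reply: ~0.068 s to first audio when streaming,
# versus ~8.5 s (125 steps x 68 ms) when buffering the whole response.
print(time_to_first_audio(True, 10.0), time_to_first_audio(False, 10.0))
```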

## Performance

| Metric | Value |
|---|---|
| Real-time factor (RTF) | ~0.87 (faster than real-time) |
| Step latency | ~68 ms/step on M2 Max |
| Model size (4-bit) | ~5.5 GB |
| Peak RAM | ~7 GB |
> [!IMPORTANT]
> PersonaPlex 7B requires at least 16 GB of RAM. On 8 GB devices, the model will not fit in memory. Use `--compile` for best performance on supported hardware.
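The RTF figure can be sanity-checked from the other two numbers in the table: each step produces 80 ms of audio and costs about 68 ms of compute, giving 68/80 ≈ 0.85, in the same ballpark as the ~0.87 measured end to end (a Python sketch for illustration only):

```python
# Real-time factor = compute time per frame / audio time per frame.
# An RTF below 1.0 means audio is generated faster than it plays back.
step_latency_ms = 68.0    # per-step latency on M2 Max, from the table
frame_duration_ms = 80.0  # 12.5 Hz frame rate -> 80 ms of audio per step

rtf = step_latency_ms / frame_duration_ms
print(round(rtf, 3))  # 0.85, close to the ~0.87 measured in practice
```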

## Model Variants

| Model | Size | HuggingFace |
|---|---|---|
| PersonaPlex-7B (4-bit) | 4.9 GB | `aufklarer/PersonaPlex-7B-MLX-4bit` |
| PersonaPlex-7B (8-bit) | 9.1 GB | `aufklarer/PersonaPlex-7B-MLX-8bit` |

## Swift API

```swift
import Foundation
import PersonaPlex

let model = try await PersonaPlexModel.loadFromHub()
let response = try await model.respond(
    audioFile: "question.wav",
    voice: .NATM0,
    systemPrompt: .assistant
)
try response.audio.write(to: URL(fileURLWithPath: "answer.wav"))
```