# PersonaPlex

Full-duplex speech-to-speech dialogue model based on the Moshi architecture (Kyutai). PersonaPlex 7B generates spoken responses directly from spoken input, with no intermediate text pipeline. The model ships with 18 voice presets and is available in 8-bit (recommended) and 4-bit quantization. 8-bit is the default: it is roughly 30% faster than the 4-bit variant and produces coherent responses, while 4-bit degrades output quality.
## Architecture
PersonaPlex is a multi-stream autoregressive model with three core components:
| Component | Details |
|---|---|
| Temporal Transformer | 32 layers, dim=4096, 32 heads, SwiGLU (hidden_scale=4.125), RoPE, 8-bit quantized (default) |
| Depformer | 6 layers, dim=1024, 16 heads, MultiLinear (weights_per_step=true), dep_q=16 |
| Mimi Codec | 16 codebooks, 12.5 Hz frame rate, 24 kHz audio output |
The model processes 17 streams simultaneously: 1 text stream + 8 user audio streams + 8 agent audio streams. This architecture enables full-duplex conversation where the model can listen and speak at the same time.
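As a rough illustration of the stream layout, one step at the 12.5 Hz frame rate advances all 17 streams together (illustrative types only, not the library's API):

```swift
// One 80 ms frame (12.5 Hz): all 17 streams advance in lockstep.
// Illustrative layout; field names are not the shipped API.
struct Frame {
    var text: Int32            // 1 inner-monologue text token
    var userAudio: [Int32]     // 8 user audio codebook tokens (what the model hears)
    var agentAudio: [Int32]    // 8 agent audio codebook tokens (what the model says)
}
// Separate user and agent streams are what allow listening and speaking
// to overlap within the same autoregressive sequence.
```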
## Voice Presets
PersonaPlex includes 18 built-in voice presets across natural and varied styles:
| Category | Presets |
|---|---|
| Natural Female | NATF0, NATF1, NATF2, NATF3 |
| Natural Male | NATM0, NATM1, NATM2, NATM3 |
| Varied Female | VARF0, VARF1, VARF2, VARF3, VARF4 |
| Varied Male | VARM0, VARM1, VARM2, VARM3, VARM4 |
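Any of these names can be passed as the voice parameter of respond (the call shape is the one shown under Swift API below; preset cases are assumed to mirror the table names, as .NATM0 does elsewhere in these docs):

```swift
// Pick a varied female voice; any preset from the table above should work the same way.
let response = try await model.respond(
    audioFile: "question.wav",
    voice: .VARF2,           // assumed case name for the VARF2 preset
    systemPrompt: .assistant
)
```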
## Inner Monologue
PersonaPlex generates two parallel streams at every step: 8 audio codebook tokens for the Mimi codec and one text token for the model’s internal monologue. The text stream is what the model is “thinking” as it speaks — it can diverge slightly from the final audio, but in practice it mirrors the spoken response closely enough to use as a live transcript.
The text tokens come back as raw SentencePiece piece IDs. Decode them with the SentencePieceDecoder that ships with PersonaPlex:
```swift
import PersonaPlex
import AudioCommon

let model = try await PersonaPlexModel.fromPretrained()
let decoder = try model.makeTextDecoder() // SentencePieceDecoder

// userSamples: 24 kHz mono input audio; playAudio: any playback helper
let result = model.respondWithTranscript(userAudio: userSamples, voice: .NATM0)

let transcript = decoder.decode(result.textTokens)
print(transcript)        // "Sure, I can help with that..."
playAudio(result.audio)  // 24 kHz mono Float32
```
In streaming mode, respondStream emits textTokens chunks as they are produced; decode them incrementally to drive a live caption view while the audio is still generating, as sketched below. The --transcript CLI flag does exactly this behind the scenes.
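A minimal caption-loop sketch, assuming respondStream yields an async sequence of chunks whose textTokens field matches the batch result above (the chunk shape and updateCaptionView are assumptions, not the shipped API):

```swift
// Hedged sketch: incremental transcript decoding during streaming generation.
var caption = ""
for try await chunk in model.respondStream(userAudio: userSamples, voice: .NATM0) {
    caption += decoder.decode(chunk.textTokens) // decode only the newly arrived piece IDs
    updateCaptionView(caption)                  // hypothetical UI hook
}
```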
Why it matters: SentencePieceDecoder is built on the shared AudioCommon.SentencePieceModel protobuf reader, so PersonaPlex, OmnilingualASR and any future SentencePiece-based model decode through the same tokenizer implementation. See the SentencePieceModel reference.
## System Prompts

Pass any custom system prompt as a plain string (no external tokenization needed):
```swift
let response = model.respond(
    userAudio: audio,
    voice: .NATM0,
    systemPrompt: "You enjoy having a good conversation."
)
```
Or use a built-in preset:
- assistant: General-purpose helpful assistant (default)
- focused: Concise, direct responses
- customer-service: Polite, solution-oriented support agent
- teacher: Patient, explanatory teaching style
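From Swift, the same presets are passed through the systemPrompt parameter shown under Swift API below; a sketch assuming the hyphenated preset maps to a camel-cased case (.customerService is an assumed name):

```swift
// Sketch: use the "customer-service" preset; the exact case name is assumed.
let response = try await model.respond(
    audioFile: "complaint.wav",
    voice: .NATF1,
    systemPrompt: .customerService // assumption: Swift case for "customer-service"
)
```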
## CLI Usage
Generate a spoken response from an audio input:
```bash
# Basic speech-to-speech
.build/release/audio respond --input question.wav

# Choose a voice preset
.build/release/audio respond --input question.wav --voice NATM0

# Stream audio output during generation
.build/release/audio respond --input question.wav --stream

# Custom system prompt text
.build/release/audio respond --input question.wav --system-prompt-text "You enjoy having a good conversation."

# Use a preset system prompt
.build/release/audio respond --input question.wav --system-prompt customer-service

# Get transcript alongside audio
.build/release/audio respond --input question.wav --transcript

# JSON output with metadata
.build/release/audio respond --input question.wav --json
```
### Options
| Option | Description |
|---|---|
| --input | Input audio file (WAV, required) |
| --voice | Voice preset name (e.g., NATM0, VARF2) |
| --system-prompt | System prompt preset: assistant, focused, customer-service, teacher |
| --system-prompt-text | Custom system prompt text (overrides --system-prompt) |
| --max-steps | Maximum generation steps |
| --stream | Emit audio chunks during generation |
| --compile | Use MLX compiled inference for faster generation |
| --transcript | Output the text transcript alongside audio |
| --json | JSON output with metadata |
Sampling parameters can also be overridden:
| Option | Default | Description |
|---|---|---|
| --audio-temp | 0.8 | Audio token sampling temperature |
| --audio-top-k | 250 | Audio token top-k sampling |
| --text-temp | 0.7 | Text token sampling temperature |
| --text-top-k | 25 | Text token top-k sampling |
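Lower temperatures and tighter top-k make responses more deterministic; higher values add variety. For example (flags as listed above, values illustrative):

```bash
# More deterministic output than the defaults
.build/release/audio respond --input question.wav \
  --audio-temp 0.6 --audio-top-k 100 \
  --text-temp 0.5 --text-top-k 10
```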
### Streaming
The --stream flag enables real-time audio output. Audio chunks are emitted as they are generated, so playback can begin before the full response is complete. This is particularly useful for interactive applications where low latency matters.
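The same incremental behavior is available from Swift through respondStream (used for live captions above); a minimal sketch, assuming chunks also carry an audio field (the chunk shape is an assumption, not the shipped API):

```swift
// Hedged sketch: start playback as soon as the first chunk arrives.
for try await chunk in model.respondStream(userAudio: userSamples, voice: .NATM0) {
    playAudio(chunk.audio) // same playAudio helper as in the Inner Monologue example
}
```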
## Performance
| Metric | Value |
|---|---|
| Real-time factor (RTF) | ~1.4 (8-bit, near real-time) |
| Step latency | ~112 ms/step on M2 Max (8-bit) |
| Model size (8-bit) | ~9.1 GB |
| Peak RAM (8-bit) | ~11 GB |
| Model size (4-bit) | ~4.9 GB |
| Peak RAM (4-bit) | ~7 GB |
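The two latency figures are consistent: each step emits one Mimi frame at 12.5 Hz, i.e. 80 ms of audio, so 112 ms of compute per step gives 112 / 80 = 1.4, the quoted RTF (on this convention, values above 1 mean generation runs slightly slower than real time).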
PersonaPlex 7B (8-bit) requires at least 24 GB of RAM. The 4-bit variant fits on 16 GB devices but produces degraded output. On 8 GB devices, neither variant will fit. Use --compile for best performance on supported hardware.
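A minimal sketch for picking a variant from installed RAM, using the thresholds above (the selection logic is illustrative; loading itself goes through the loaders shown in the Swift API section):

```swift
import Foundation

// Choose a quantization variant from physical RAM (thresholds from the text above).
let ramGB = Double(ProcessInfo.processInfo.physicalMemory) / 1_073_741_824
let repoID: String
if ramGB >= 24 {
    repoID = "aufklarer/PersonaPlex-7B-MLX-8bit" // recommended
} else if ramGB >= 16 {
    repoID = "aufklarer/PersonaPlex-7B-MLX-4bit" // fits, but degraded output
} else {
    fatalError("PersonaPlex 7B does not fit on devices with less than 16 GB of RAM")
}
```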
## Model Variants
| Model | Size | HuggingFace |
|---|---|---|
| PersonaPlex-7B (8-bit, recommended) | 9.1 GB | aufklarer/PersonaPlex-7B-MLX-8bit |
| PersonaPlex-7B (4-bit) | 4.9 GB | aufklarer/PersonaPlex-7B-MLX-4bit |
## Swift API
```swift
import PersonaPlex

let model = try await PersonaPlexModel.loadFromHub()
let response = try await model.respond(
    audioFile: "question.wav",
    voice: .NATM0,
    systemPrompt: .assistant
)
try response.audio.write(to: "answer.wav")
```