Use case · Conversational

Voice in.
Voice out.

Three shapes of voice-first interfaces — a single full-duplex speech-to-speech model, a compositional wake → VAD → ASR → LLM → TTS pipeline you fully control, and wake-word activation for hands-free entry. All on-device, no cloud APIs, no audio leaving the device.

Three sub-use-cases

Pick the shape that fits your product.

Drop-in dialogue model, compositional pipeline with per-stage control, or a thin wake-word trigger. Each runs entirely on-device.

Quickstart · Full-duplex

Drop in for OpenAI Realtime clients.

speech-server exposes /v1/realtime — the same session.update / input_audio_buffer.append / response.audio.delta event shape as the OpenAI Realtime API. Existing clients written against the OpenAI SDK keep working when pointed at the local URL.

brew install soniqo/tap/speech
# Start the local Realtime-compatible server (PersonaPlex backend)
speech serve --model personaplex --port 8080

# In your existing OpenAI Realtime client, swap the base URL:
#   wss://api.openai.com/v1/realtime  →  ws://localhost:8080/v1/realtime
# session.update / input_audio_buffer.append / response.audio.delta — identical event schema
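As a sketch of what that swap looks like from code, a raw WebSocket client can talk to the local endpoint directly. This is a minimal illustration, not the product's client SDK: the event type names come from the Realtime schema above, but the session fields and the `append` helper are assumptions for demonstration.

```swift
import Foundation

// Minimal sketch: open a raw WebSocket against the local server.
// Event names mirror the OpenAI Realtime schema; the session fields
// below are illustrative, not the full configuration surface.
let url = URL(string: "ws://localhost:8080/v1/realtime")!
let socket = URLSession.shared.webSocketTask(with: url)
socket.resume()

// Configure the session before streaming audio.
let sessionUpdate = #"{"type": "session.update", "session": {"modalities": ["audio"]}}"#
socket.send(.string(sessionUpdate)) { error in
    if let error { print("send failed:", error) }
}

// Hypothetical helper: append microphone audio as base64-encoded PCM.
func append(_ base64Audio: String) {
    let event = #"{"type": "input_audio_buffer.append", "audio": "\#(base64Audio)"}"#
    socket.send(.string(event)) { _ in }
}
```

The same swap works one level higher: any client built on the OpenAI SDK only needs its base URL changed.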
Quickstart · Compositional


Compose your own pipeline in Swift.

The Apple-side VoicePipeline (in speech-swift) wraps the same orchestrator. Feed it any conforming STT / LLM / TTS / VAD implementation; it runs the five-state turn detector, fires events for each transition, and handles deferred interruption when the user speaks over the agent.

import SpeechCore
import SileroVAD
import ParakeetSTT
import Qwen35Chat
import KokoroTTS

let pipeline = VoicePipeline(
    vad: SileroVAD.streaming(),
    stt: try await ParakeetSTT.fromPretrained(),
    llm: try await Qwen35Chat.fromPretrained(systemPrompt: "..."),
    tts: try await KokoroTTSModel.fromPretrained(voice: "af_alloy"),
    config: .init(
        maxUtteranceDuration: 15,
        minInterruptionDuration: 0.4
    )
)

for await event in pipeline.events {
    switch event {
    case .userSpeechStarted: print("listening")
    case .transcriptionCompleted(let text): print("user:", text)
    case .responseAudioDelta(let chunk): player.enqueue(chunk)
    case .interruption: player.cancel()
    default: break
    }
}
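The five-state turn detector mentioned above never appears in the snippet. A hypothetical shape for it, purely for orientation (the state names here are assumptions; the real definition lives in speech-core), might look like:

```swift
// Hypothetical sketch of the five turn-detector states; the actual
// enum and its transitions are defined in speech-core.
enum TurnState {
    case idle           // no speech, agent silent
    case listening      // VAD fired, buffering user audio
    case processing     // utterance ended, STT → LLM running
    case speaking       // TTS streaming agent audio out
    case interrupted    // user spoke over the agent; deferred cancel
}
```

Deferred interruption corresponds to the last state: the pipeline waits for `minInterruptionDuration` of sustained user speech before cancelling agent playback.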

The full state machine, AEC integration, and tool-call loop are documented in speech-core/docs/pipeline.md.

On-device performance

Numbers from M2 Max.

Stage-by-stage latency budget you can compose. PersonaPlex needs ~24 GB RAM for the 8-bit bundle; everything else fits on iPhone-class hardware.

PersonaPlex 7B: ~112 ms/step (M2 Max · 8-bit · RTF 1.4)
Streaming Dictation: 340 ms partial (Parakeet-EOU · 25 langs · 30 ms compute/chunk)
Silero VAD v5: 32 ms chunks (~1.2 MB · 23× real-time)
Wake-Word: 26× real-time (~4 MB INT8 · CoreML / ONNX)
Qwen3.5 Chat: ~15 tok/s (INT4 MLX on M2 Max · ~65 ms/token)
speech-server: /v1/realtime WS (OpenAI-Realtime SDK clients work unchanged)
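To compose a budget from these figures, a rough serial sum gives an upper bound on time-to-first-response. This is an illustrative back-of-envelope calculation from the numbers above, not a measurement; a real pipeline overlaps these stages, so the streamed latency is lower.

```swift
// Rough serial sum of the compositional-pipeline figures above.
// Streaming overlaps these stages, so treat this as an upper bound.
let vadChunk      = 0.032  // Silero VAD chunk size
let sttPartial    = 0.340  // Parakeet streaming partial latency
let llmFirstToken = 0.065  // Qwen3.5 per-token cost (INT4 MLX)

let toFirstToken = vadChunk + sttPartial + llmFirstToken
print(String(format: "~%.0f ms to first LLM token", toFirstToken * 1000))
// roughly 437 ms before TTS begins producing audio
```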
Deeper reading


Component guides.