Voice in.
Voice out.
Three shapes of voice-first interfaces: a single full-duplex speech-to-speech model, a compositional wake → VAD → STT → LLM → TTS pipeline you fully control, and wake-word activation for hands-free entry. All on-device, no cloud APIs, no audio leaving the device.
Pick the shape that fits your product.
Drop-in dialogue model, compositional pipeline with per-stage control, or a thin wake-word trigger. Each runs entirely on-device.
A single model takes mic input and produces voice output. Drop-in OpenAI-Realtime-compatible WebSocket; minimal code, opaque internals.
AEC → enhance → VAD → STT → LLM → TTS → audio, with a five-state turn detector, deferred interruption handling, and tool-call loop. The canonical orchestrator lives in speech-core.
Hands-free trigger for any voice flow. Custom keywords with per-phrase thresholds that plug into the same VAD pipeline as compositional agents.
Drop in for OpenAI Realtime clients.
speech-server exposes /v1/realtime — the same session.update / input_audio_buffer.append / response.audio.delta event shape as the OpenAI Realtime API. Existing clients written against the OpenAI SDK keep working when pointed at the local URL.
brew install soniqo/tap/speech
# Start the local Realtime-compatible server (PersonaPlex backend)
speech serve --model personaplex --port 8080
# In your existing OpenAI Realtime client, swap the base URL:
# wss://api.openai.com/v1/realtime → ws://localhost:8080/v1/realtime
# session.update / input_audio_buffer.append / response.audio.delta — identical event schema
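No client SDK is required either. As a minimal sketch, assuming only what the shell example above shows (a local server on port 8080 speaking the stock OpenAI Realtime event schema), a raw URLSession WebSocket can drive a session; the payload fields follow the OpenAI Realtime docs, not anything specific to speech-server:

import Foundation

// Point any WebSocket client at the local endpoint; the event schema is
// unchanged from the hosted OpenAI Realtime API.
let url = URL(string: "ws://localhost:8080/v1/realtime")!
let socket = URLSession.shared.webSocketTask(with: url)
socket.resume()

// Configure the session first; audio then streams as base64
// input_audio_buffer.append frames, exactly as against the hosted API.
let sessionUpdate = #"{"type":"session.update","session":{"modalities":["audio","text"]}}"#
socket.send(.string(sessionUpdate)) { error in
    if let error { print("send failed:", error) }
}

// response.audio.delta events arrive as JSON text frames.
func receiveNext() {
    socket.receive { result in
        if case .success(.string(let event)) = result {
            print("event:", event)
            receiveNext()
        }
    }
}
receiveNext()

RunLoop.main.run()  // keep the script alive for incoming frames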
Compose your own pipeline in Swift.
The Apple-side VoicePipeline (in speech-swift) wraps the same orchestrator. Feed it any conforming STT / LLM / TTS / VAD implementation; it runs the five-state turn detector, fires events for each transition, and handles deferred interruption when the user speaks over the agent.
import SpeechCore
import SileroVAD
import ParakeetSTT
import Qwen3Chat
import KokoroTTS
let pipeline = VoicePipeline(
    vad: SileroVAD.streaming(),
    stt: try await ParakeetSTT.fromPretrained(),
    llm: try await Qwen3Chat.fromPretrained(systemPrompt: "..."),
    tts: try await KokoroTTSModel.fromPretrained(voice: "af_alloy"),
    config: .init(
        maxUtteranceDuration: 15,       // cap a single user turn at 15 s
        minInterruptionDuration: 0.4    // overlap shorter than 0.4 s is not a barge-in
    )
)
for await event in pipeline.events {
    switch event {
    case .userSpeechStarted: print("listening")
    case .transcriptionCompleted(let text): print("user:", text)
    case .responseAudioDelta(let chunk): player.enqueue(chunk)
    case .interruption: player.cancel()
    default: break
    }
}
Read the full state machine, AEC integration, and tool-call loop in speech-core/docs/pipeline.md ↗.
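Any stage accepts your own implementation. As a hedged sketch of what a conforming stage could look like: the protocol name STTService, the transcribe(_:) requirement, and the sample-buffer shape below are illustrative assumptions, not the shipped API; check speech-core for the real protocol requirements.

import SpeechCore
import ParakeetSTT

// Hypothetical decorator around the bundled STT stage. All names here
// except ParakeetSTT are assumed for illustration.
struct TimedSTT: STTService {
    let inner: ParakeetSTT

    func transcribe(_ samples: [Float]) async throws -> String {
        let start = ContinuousClock.now
        let text = try await inner.transcribe(samples)
        print("stt latency: \(ContinuousClock.now - start)")  // feeds the latency budget below
        return text
    }
}

Pass it as the stt: argument to VoicePipeline; the orchestrator, turn detector, and interruption handling stay unchanged.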
Numbers from M2 Max.
Stage-by-stage latency budget you can compose. PersonaPlex needs ~24 GB RAM for the 8-bit bundle; everything else fits on iPhone-class hardware.
