Use case · Conversational

Voice in.
Voice out.

Three shapes of voice-first interfaces — a single full-duplex speech-to-speech model, a compositional wake → VAD → ASR → LLM → TTS pipeline you fully control, and wake-word activation for hands-free entry. All on-device, no cloud APIs, no audio leaving the device.

Three sub-use-cases

Pick the shape that fits your product.

Drop-in dialogue model, compositional pipeline with per-stage control, or a thin wake-word trigger. Each runs entirely on-device.

Quickstart · Full-duplex

Drop in for OpenAI Realtime clients.

speech-server exposes /v1/realtime — the same session.update / input_audio_buffer.append / response.audio.delta event shape as the OpenAI Realtime API. Existing clients written against the OpenAI SDK keep working when pointed at the local URL.

brew install soniqo/tap/speech
# Start the local Realtime-compatible server (PersonaPlex backend)
speech serve --model personaplex --port 8080

# In your existing OpenAI Realtime client, swap the base URL:
#   wss://api.openai.com/v1/realtime  →  ws://localhost:8080/v1/realtime
# session.update / input_audio_buffer.append / response.audio.delta — identical event schema
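As a sketch of what that swap looks like from code, a raw WebSocket client can talk to the local endpoint directly. This is a minimal illustration, not the product's client SDK: the event type names come from the Realtime schema above, but the session fields and the `append` helper are assumptions for demonstration.

```swift
import Foundation

// Minimal sketch: open a raw WebSocket against the local server.
// Event names mirror the OpenAI Realtime schema; the session fields
// below are illustrative, not the full configuration surface.
let url = URL(string: "ws://localhost:8080/v1/realtime")!
let socket = URLSession.shared.webSocketTask(with: url)
socket.resume()

// Configure the session before streaming audio.
let sessionUpdate = #"{"type": "session.update", "session": {"modalities": ["audio"]}}"#
socket.send(.string(sessionUpdate)) { error in
    if let error { print("send failed:", error) }
}

// Hypothetical helper: append microphone audio as base64-encoded PCM.
func append(_ base64Audio: String) {
    let event = #"{"type": "input_audio_buffer.append", "audio": "\#(base64Audio)"}"#
    socket.send(.string(event)) { _ in }
}
```

The same swap works one level higher: any client built on the OpenAI SDK only needs its base URL changed.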
Quickstart · Compositional


Compose your own pipeline in Swift.

The Apple-side VoicePipeline (in speech-swift) wraps the same orchestrator. Feed it any conforming STT / LLM / TTS / VAD implementation; it runs the five-state turn detector, fires events for each transition, and handles deferred interruption when the user speaks over the agent.

import SpeechCore
import SileroVAD
import ParakeetSTT
import Qwen35Chat
import KokoroTTS

let pipeline = VoicePipeline(
    vad: SileroVAD.streaming(),
    stt: try await ParakeetSTT.fromPretrained(),
    llm: try await Qwen35Chat.fromPretrained(systemPrompt: "..."),
    tts: try await KokoroTTSModel.fromPretrained(voice: "af_alloy"),
    config: .init(
        maxUtteranceDuration: 15,
        minInterruptionDuration: 0.4
    )
)

for await event in pipeline.events {
    switch event {
    case .userSpeechStarted: print("listening")
    case .transcriptionCompleted(let text): print("user:", text)
    case .responseAudioDelta(let chunk): player.enqueue(chunk)
    case .interruption: player.cancel()
    default: break
    }
}
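The five-state turn detector mentioned above never appears in the snippet. A hypothetical shape for it, purely for orientation (the state names here are assumptions; the real definition lives in speech-core), might look like:

```swift
// Hypothetical sketch of the five turn-detector states; the actual
// enum and its transitions are defined in speech-core.
enum TurnState {
    case idle           // no speech, agent silent
    case listening      // VAD fired, buffering user audio
    case processing     // utterance ended, STT → LLM running
    case speaking       // TTS streaming agent audio out
    case interrupted    // user spoke over the agent; deferred cancel
}
```

Deferred interruption corresponds to the last state: the pipeline waits for `minInterruptionDuration` of sustained user speech before cancelling agent playback.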

The full state machine, AEC integration, and tool-call loop are documented in speech-core/docs/pipeline.md.

On-device performance

Numbers from M2 Max.

Stage-by-stage latency budget you can compose. PersonaPlex needs ~24 GB RAM for the 8-bit bundle; everything else fits on iPhone-class hardware.

PersonaPlex 7B: ~112 ms/step (M2 Max · 8-bit · RTF 1.4)
Streaming Dictation: 340 ms partial (Parakeet-EOU · 25 langs · 30 ms compute/chunk)
Silero VAD v5: 32 ms chunks (~1.2 MB · 23× real-time)
Wake-Word: 26× real-time (~4 MB INT8 · CoreML / ONNX)
Qwen3.5 Chat: ~15 tok/s (INT4 MLX on M2 Max · ~65 ms/token)
speech-server: /v1/realtime WS (OpenAI-Realtime SDK clients work unchanged)
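To compose a budget from these figures, a rough serial sum gives an upper bound on time-to-first-response. This is an illustrative back-of-envelope calculation from the numbers above, not a measurement; a real pipeline overlaps these stages, so the streamed latency is lower.

```swift
// Rough serial sum of the compositional-pipeline figures above.
// Streaming overlaps these stages, so treat this as an upper bound.
let vadChunk      = 0.032  // Silero VAD chunk size
let sttPartial    = 0.340  // Parakeet streaming partial latency
let llmFirstToken = 0.065  // Qwen3.5 per-token cost (INT4 MLX)

let toFirstToken = vadChunk + sttPartial + llmFirstToken
print(String(format: "~%.0f ms to first LLM token", toFirstToken * 1000))
// roughly 437 ms before TTS begins producing audio
```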
Deeper reading


Component guides.