API & Protocols

The AudioCommon module defines model-agnostic protocols and shared types. Any conforming model can be used interchangeably through these interfaces.

Protocol Overview

┌─────────────────────────────────────────────────────────┐
│                    AudioCommon                          │
│                                                         │
│  AudioChunk          SpeechGenerationModel (TTS)        │
│  AlignedWord         SpeechRecognitionModel (STT)       │
│  SpeechSegment       ForcedAlignmentModel               │
│                      SpeechToSpeechModel                │
│                      VoiceActivityDetectionModel (VAD)  │
│                      SpeakerEmbeddingModel              │
│                      SpeakerDiarizationModel            │
│                      SpeakerExtractionCapable           │
└─────────────────────────────────────────────────────────┘

SpeechRecognitionModel

Protocol for speech-to-text models.

public protocol SpeechRecognitionModel: AnyObject {
    var inputSampleRate: Int { get }
    func transcribe(audio: [Float], sampleRate: Int, language: String?) -> String
    func transcribeWithLanguage(audio: [Float], sampleRate: Int, language: String?) -> TranscriptionResult
}

Conforming types: Qwen3ASRModel, ParakeetASRModel, ParakeetStreamingASRModel, OmnilingualASRModel (CoreML), OmnilingualASRMLXModel (MLX)
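
A minimal model-agnostic usage sketch; the helper is not part of the library, and the caller is assumed to have loaded a conforming model and decoded the file into Float PCM:

import AudioCommon

// Illustrative helper: works with any conforming ASR model.
// Resampling to model.inputSampleRate is left to the caller in this sketch.
func transcribe(_ samples: [Float], sampleRate: Int,
                using model: any SpeechRecognitionModel) -> String {
    precondition(sampleRate == model.inputSampleRate,
                 "resample to \(model.inputSampleRate) Hz first")
    return model.transcribe(audio: samples, sampleRate: sampleRate, language: nil)
}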

SpeechGenerationModel

Protocol for text-to-speech models.

public protocol SpeechGenerationModel: AnyObject {
    var sampleRate: Int { get }
    func generate(text: String, language: String?) async throws -> [Float]
    func generateStream(text: String, language: String?) -> AsyncThrowingStream<AudioChunk, Error>  // has default impl
}

generateStream() has a default implementation that wraps generate() as a single chunk. Models with true streaming (e.g. Qwen3-TTS) override it.

Conforming types: Qwen3TTSModel, CosyVoiceTTSModel, KokoroTTSModel, Qwen35MLXChat
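
A sketch of what the single-chunk default can look like; the shipped implementation may differ, and AudioChunk is assumed to have a memberwise initializer:

extension SpeechGenerationModel {
    public func generateStream(text: String, language: String?) -> AsyncThrowingStream<AudioChunk, Error> {
        AsyncThrowingStream { continuation in
            Task {
                do {
                    // Run the one-shot generator and emit the result as a single chunk
                    let samples = try await self.generate(text: text, language: language)
                    continuation.yield(AudioChunk(samples: samples, sampleRate: self.sampleRate))
                    continuation.finish()
                } catch {
                    continuation.finish(throwing: error)
                }
            }
        }
    }
}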

ForcedAlignmentModel

Protocol for word-level timestamp alignment.

public protocol ForcedAlignmentModel: AnyObject {
    func align(audio: [Float], text: String, sampleRate: Int, language: String?) -> [AlignedWord]
}
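
Illustrative call, assuming a conforming aligner is already loaded and the audio decoded to 16 kHz PCM:

// aligner: any ForcedAlignmentModel, pcm: [Float] at 16 kHz (placeholders)
let words = aligner.align(audio: pcm, text: "hello world",
                          sampleRate: 16000, language: "en")
for word in words {
    // AlignedWord carries start/end times in seconds (see Shared Types below)
    print("\(word.startTime)s - \(word.endTime)s  \(word.text)")
}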

SpeechToSpeechModel

Protocol for speech-to-speech dialogue models.

public protocol SpeechToSpeechModel: AnyObject {
    var sampleRate: Int { get }
    func respond(userAudio: [Float]) -> [Float]
    func respondStream(userAudio: [Float]) -> AsyncThrowingStream<AudioChunk, Error>
}

Conforming types: PersonaPlexModel
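
Consuming the streaming variant is a plain for-await loop; the playback closure here is a placeholder:

func playResponse(to userAudio: [Float],
                  model: any SpeechToSpeechModel,
                  play: (AudioChunk) -> Void) async throws {
    // Chunks arrive as the model produces them; hand each to the player
    for try await chunk in model.respondStream(userAudio: userAudio) {
        play(chunk)
    }
}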

VoiceActivityDetectionModel

Protocol for voice activity detection.

public protocol VoiceActivityDetectionModel: AnyObject {
    var inputSampleRate: Int { get }
    func detectSpeech(audio: [Float], sampleRate: Int) -> [SpeechSegment]
}
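
A common pattern is to run VAD first and transcribe only the detected speech. This helper is illustrative, not a library API:

func transcribeSpeechOnly(audio: [Float], sampleRate: Int,
                          vad: any VoiceActivityDetectionModel,
                          asr: any SpeechRecognitionModel) -> [String] {
    vad.detectSpeech(audio: audio, sampleRate: sampleRate).map { segment -> String in
        // Convert segment times (seconds) into sample indices, clamped to the buffer
        let start = max(0, Int(segment.startTime * Float(sampleRate)))
        let end = max(start, min(audio.count, Int(segment.endTime * Float(sampleRate))))
        return asr.transcribe(audio: Array(audio[start..<end]),
                              sampleRate: sampleRate, language: nil)
    }
}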

SpeakerEmbeddingModel

Protocol for speaker embedding extraction.

public protocol SpeakerEmbeddingModel: AnyObject {
    var inputSampleRate: Int { get }
    var embeddingDimension: Int { get }
    func embed(audio: [Float], sampleRate: Int) -> [Float]
}

Conforming types: WeSpeakerModel

SpeakerDiarizationModel

Protocol for speaker diarization models that assign speaker labels to audio segments.

public protocol SpeakerDiarizationModel: AnyObject {
    var inputSampleRate: Int { get }
    func diarize(audio: [Float], sampleRate: Int) -> [DiarizedSegment]
}

Conforming types: DiarizationPipeline (Pyannote), SortformerDiarizer

SpeakerExtractionCapable

Extended diarization protocol for engines that support extracting a target speaker's segments using a reference embedding. Not all engines support this (Sortformer is end-to-end and does not produce speaker embeddings).

public protocol SpeakerExtractionCapable: SpeakerDiarizationModel {
    func extractSpeaker(audio: [Float], sampleRate: Int, targetEmbedding: [Float]) -> [SpeechSegment]
}

Conforming types: DiarizationPipeline (Pyannote only)
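
A hedged sketch combining the two speaker protocols; model construction is omitted and the reference clip is a placeholder:

func segmentsForTargetSpeaker(referenceAudio: [Float], mixtureAudio: [Float],
                              sampleRate: Int,
                              embedder: any SpeakerEmbeddingModel,
                              diarizer: any SpeakerExtractionCapable) -> [SpeechSegment] {
    // Embed a short reference clip of the target speaker (e.g. 256-dim for WeSpeaker)
    let target = embedder.embed(audio: referenceAudio, sampleRate: sampleRate)
    // Ask the extraction-capable diarizer for that speaker's segments in the mixture
    return diarizer.extractSpeaker(audio: mixtureAudio, sampleRate: sampleRate,
                                   targetEmbedding: target)
}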

Shared Types

AudioChunk

public struct AudioChunk {
    public let samples: [Float]   // PCM samples
    public let sampleRate: Int    // Sample rate (e.g. 24000)
}

SpeechSegment

public struct SpeechSegment {
    public let startTime: Float   // Start time in seconds
    public let endTime: Float     // End time in seconds
}

AlignedWord

public struct AlignedWord {
    public let text: String       // The word
    public let startTime: Float   // Start time in seconds
    public let endTime: Float     // End time in seconds
}

DiarizedSegment

public struct DiarizedSegment {
    public let startTime: Float   // Start time in seconds
    public let endTime: Float     // End time in seconds
    public let speakerId: Int     // Speaker identifier (0-based)
}
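
A small illustrative example of working with the 0-based speaker ids, here totalling speaking time per speaker (not a library API):

func speakingTime(in segments: [DiarizedSegment]) -> [Int: Float] {
    segments.reduce(into: [:]) { totals, segment in
        totals[segment.speakerId, default: 0] += segment.endTime - segment.startTime
    }
}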

DialogueSegment

A parsed segment of multi-speaker dialogue text with optional speaker and emotion tags. Used with DialogueParser and DialogueSynthesizer for CosyVoice3 dialogue synthesis.

public struct DialogueSegment: Sendable, Equatable {
    public let speaker: String?   // Speaker identifier ("S1", "S2"), nil for untagged
    public let emotion: String?   // Emotion tag ("happy", "whispers"), nil if none
    public let text: String       // Cleaned text to synthesize
}

DialogueParser

Parses multi-speaker dialogue text with inline speaker tags such as [S1] and emotion tags such as (happy).

public enum DialogueParser {
    static func parse(_ text: String) -> [DialogueSegment]
    static func emotionToInstruction(_ emotion: String) -> String
}

Built-in emotions: happy/excited, sad, angry, whispers/whispering, laughs/laughing, calm, surprised, serious. Unknown tags pass through as freeform instructions.
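
Illustrative parse of a short script; the exact cleaned text values are an assumption about the parser's output:

let script = "[S1] (happy) Welcome back! [S2] (whispers) Keep your voice down."
let segments = DialogueParser.parse(script)
// e.g. segments[0].speaker == "S1", segments[0].emotion == "happy"
for segment in segments {
    let instruction = segment.emotion.map(DialogueParser.emotionToInstruction)
    print(segment.speaker ?? "-", instruction ?? "(no instruction)", segment.text)
}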

DialogueSynthesizer

Orchestrates multi-segment dialogue synthesis with per-speaker voice cloning, silence gaps, and crossfade.

public enum DialogueSynthesizer {
    static func synthesize(
        segments: [DialogueSegment],
        speakerEmbeddings: [String: [Float]],
        model: CosyVoiceTTSModel,
        language: String,
        config: DialogueSynthesisConfig,
        verbose: Bool
    ) -> [Float]
}

DialogueSynthesisConfig

public struct DialogueSynthesisConfig: Sendable {
    public var turnGapSeconds: Float      // Default: 0.2
    public var crossfadeSeconds: Float    // Default: 0.0
    public var defaultInstruction: String // Default: "You are a helpful assistant."
    public var maxTokensPerSegment: Int   // Default: 500
}
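
A hedged end-to-end sketch: the speaker embeddings and loaded CosyVoiceTTSModel are placeholders obtained elsewhere (e.g. via WeSpeakerModel and the CosyVoice loader), and DialogueSynthesisConfig is assumed to have a default initializer:

var config = DialogueSynthesisConfig()
config.turnGapSeconds = 0.3        // slightly longer pause between turns

let embeddings: [String: [Float]] = [
    "S1": aliceEmbedding,          // placeholder speaker embeddings
    "S2": bobEmbedding
]

let pcm = DialogueSynthesizer.synthesize(
    segments: DialogueParser.parse(script),   // script from the example above
    speakerEmbeddings: embeddings,
    model: cosyVoiceModel,                    // placeholder loaded model
    language: "en",
    config: config,
    verbose: false
)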

PipelineLLM

Protocol for language model integration with voice pipelines. Bridges an LLM to the VoicePipeline's ASR → LLM → TTS flow.

public protocol PipelineLLM: AnyObject {
    func chat(messages: [(role: MessageRole, content: String)],
              onToken: @escaping (String, Bool) -> Void)
    func cancel()
}

Built-in adapter: Qwen3PipelineLLM bridges Qwen35MLXChat to this protocol with token cleanup, cancellation, and pending phrase accumulation.
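
A hypothetical conformer for illustration (EchoLLM is not part of the library); the Bool passed to onToken is assumed to mark the final token:

final class EchoLLM: PipelineLLM {
    func chat(messages: [(role: MessageRole, content: String)],
              onToken: @escaping (String, Bool) -> Void) {
        // Stream the last user message back word by word
        let reply = messages.last?.content ?? ""
        let words = reply.split(separator: " ").map(String.init)
        for (index, word) in words.enumerated() {
            onToken(word + " ", index == words.count - 1)
        }
    }

    func cancel() {
        // Nothing to cancel in this toy adapter
    }
}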

AudioIO

Reusable audio I/O manager that eliminates AVAudioEngine boilerplate. Handles mic capture, resampling, playback, and audio level metering.

let audio = AudioIO()
// Capture mic input resampled to 16 kHz and feed it downstream
try audio.startMicrophone(targetSampleRate: 16000) { samples in
    pipeline.pushAudio(samples)
}
// Schedule synthesized audio on the built-in streaming player
audio.player.scheduleChunk(ttsOutput)
audio.stopMicrophone()

AudioIO includes a StreamingAudioPlayer for TTS output and an AudioRingBuffer for thread-safe audio transfer between capture and inference threads.

SentencePieceModel

Shared protobuf reader for SentencePiece .model files; it lives in AudioCommon. Every module that needs to decode SentencePiece pieces (PersonaPlex, OmnilingualASR, future ASR / TTS ports) builds its own decoder on top of this single reader instead of re-implementing the protobuf wire format.

public struct SentencePieceModel: Sendable {
    public struct Piece: Sendable, Equatable {
        public let text: String
        public let score: Float
        public let type: Int32
        public var pieceType: PieceType? { get }
        public var isControlOrUnknown: Bool { get }
    }
    public enum PieceType: Int32 {
        case normal = 1, unknown = 2, control = 3,
             userDefined = 4, unused = 5, byte = 6
    }
    public let pieces: [Piece]
    public var count: Int { get }
    public subscript(_ id: Int) -> Piece? { get }
    public init(contentsOf url: URL) throws
    public init(modelPath: String) throws
    public init(data: Data) throws
}

Used by: OmnilingualASR.OmnilingualVocabulary, PersonaPlex.SentencePieceDecoder. Covered by 7 unit tests in Tests/AudioCommonTests/SentencePieceModelTests.
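
A hedged usage sketch: the model path is a placeholder, and the byte-level details of a real decoder (byte pieces, special tokens) are deliberately ignored here:

import Foundation

let sp = try SentencePieceModel(modelPath: "/path/to/tokenizer.model")
print("vocabulary size:", sp.count)

// Toy id-to-text mapping: skip control/unknown pieces and turn the
// SentencePiece "▁" word-boundary marker into a space.
func decode(_ ids: [Int], with sp: SentencePieceModel) -> String {
    ids.compactMap { sp[$0] }
       .filter { !$0.isControlOrUnknown }
       .map { $0.text.replacingOccurrences(of: "\u{2581}", with: " ") }
       .joined()
       .trimmingCharacters(in: .whitespaces)
}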

MLXCommon.SDPA

Scaled dot-product attention helpers shared across every MLX attention module (Qwen3-ASR / Qwen3-TTS / Qwen3-Chat / CosyVoice / PersonaPlex / OmnilingualASR). Each module keeps its own projections — SDPA only handles the reshape → attention → merge boilerplate.

public enum SDPA {
    // Flat [B, T, H*D] input: project/reshape happens inside
    public static func multiHead(
        q: MLXArray, k: MLXArray, v: MLXArray,
        numHeads: Int, headDim: Int, scale: Float,
        mask: MLXArray? = nil
    ) -> MLXArray

    // GQA / MQA variant with separate query and KV head counts
    public static func multiHead(
        q: MLXArray, k: MLXArray, v: MLXArray,
        numQueryHeads: Int, numKVHeads: Int, headDim: Int, scale: Float,
        mask: MLXArray? = nil
    ) -> MLXArray

    // Already-shaped [B, H, T, D] (RoPE / KV cache paths)
    public static func attendAndMerge(
        qHeads: MLXArray, kHeads: MLXArray, vHeads: MLXArray,
        scale: Float,
        mask: MLXArray? = nil
    ) -> MLXArray

    // Same, with ScaledDotProductAttentionMaskMode enum (newer API)
    public static func attendAndMerge(
        qHeads: MLXArray, kHeads: MLXArray, vHeads: MLXArray,
        scale: Float,
        mask: MLXFast.ScaledDotProductAttentionMaskMode
    ) -> MLXArray

    // Low-level head merge: [B, H, T, D] → [B, T, H*D]
    public static func mergeHeads(_ attn: MLXArray) -> MLXArray
}

All reshape calls use -1 for the batch dimension so the helpers compose with MLX.compile(shapeless:) graphs that vary batch at runtime (e.g. Qwen3-TTS Talker autoregressive decode).
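
A hedged sketch of a caller, where the input tensors and weights are placeholders and only the SDPA.multiHead call reflects the API above:

import MLX
import MLXCommon

// x: [B, T, H*D]; wq/wk/wv: [H*D, H*D] projection weights (placeholders)
func toyAttention(x: MLXArray, wq: MLXArray, wk: MLXArray, wv: MLXArray,
                  numHeads: Int, headDim: Int) -> MLXArray {
    // The caller keeps its own projections...
    let q = matmul(x, wq)
    let k = matmul(x, wk)
    let v = matmul(x, wv)
    // ...and SDPA handles the reshape, attention, and head-merge boilerplate
    return SDPA.multiHead(q: q, k: k, v: v,
                          numHeads: numHeads, headDim: headDim,
                          scale: 1.0 / Float(headDim).squareRoot())
}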

HTTP API Server

The audio-server binary exposes every model in speech-swift as HTTP REST endpoints plus a WebSocket endpoint that implements the OpenAI Realtime API. Models are loaded lazily on first request; pass --preload to warm them all at startup.

swift build -c release
.build/release/audio-server --port 8080

# Preload every model at startup
.build/release/audio-server --port 8080 --preload

REST Endpoints

Endpoint        Method  Request                                     Response
/transcribe     POST    audio/wav body                              JSON { text } (Qwen3-ASR)
/speak          POST    JSON { text, engine?, language?, voice? }   audio/wav body (Qwen3-TTS, CosyVoice, Kokoro)
/respond        POST    audio/wav body                              audio/wav body (PersonaPlex)
/enhance        POST    audio/wav body                              audio/wav body (DeepFilterNet3)
/vad            POST    audio/wav body                              JSON segment list
/diarize        POST    audio/wav body                              JSON DiarizedSegment list
/embed-speaker  POST    audio/wav body                              JSON [Float] (256-dim)

# Transcribe a file
curl -X POST http://localhost:8080/transcribe \
  --data-binary @recording.wav \
  -H "Content-Type: audio/wav"

# Synthesize speech
curl -X POST http://localhost:8080/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "engine": "cosyvoice"}' \
  -o output.wav

# Full speech-to-speech round trip
curl -X POST http://localhost:8080/respond \
  --data-binary @question.wav \
  -o response.wav

OpenAI Realtime API (/v1/realtime)

The WebSocket endpoint at ws://host:port/v1/realtime implements the OpenAI Realtime protocol. All messages are JSON with a type discriminator; audio payloads are base64-encoded PCM16 at 24 kHz mono.

Client → Server events

Event                        Purpose
session.update               Configure engine, language, voice, and audio format
input_audio_buffer.append    Append a base64 PCM16 chunk to the input buffer
input_audio_buffer.commit    Commit the buffered audio for transcription
input_audio_buffer.clear     Discard the current input buffer
response.create              Request TTS synthesis for the supplied text/instructions

Server → Client events

Event                                                   Meaning
session.created                                         Handshake complete, default config emitted
session.updated                                         Most recent session.update acknowledged
input_audio_buffer.committed                            Audio accepted and queued for transcription
conversation.item.input_audio_transcription.completed   ASR result with final transcript text
response.audio.delta                                    Base64 PCM16 chunk of synthesized audio
response.audio.done                                     No more audio chunks for this response
response.done                                           Response finalized (metadata + latency stats)
error                                                   Error envelope with type and message

const ws = new WebSocket('ws://localhost:8080/v1/realtime');

// ASR: push audio, request transcription
ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64PCM16 }));
ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
// → conversation.item.input_audio_transcription.completed

// TTS: request synthesis and stream audio deltas
ws.send(JSON.stringify({
  type: 'response.create',
  response: { modalities: ['audio', 'text'], instructions: 'Hello world' }
}));
// → response.audio.delta (repeated), response.audio.done, response.done

The server lives in the AudioServer SPM product. An example browser client is shipped at Examples/websocket-client.html — open it alongside a running server to drive the full ASR + TTS round trip.

Model Downloads

All models are downloaded from HuggingFace on first use and cached in ~/Library/Caches/qwen3-speech/. The AudioCommon module provides a shared HuggingFaceDownloader that handles download, caching, and integrity verification.