API & Protocols

The AudioCommon module defines model-agnostic protocols and shared types. Any conforming model can be used interchangeably through these interfaces.

Protocol Overview

┌─────────────────────────────────────────────────────────┐
│                    AudioCommon                          │
│                                                         │
│  AudioChunk          SpeechGenerationModel (TTS)        │
│  AlignedWord         SpeechRecognitionModel (STT)       │
│  SpeechSegment       ForcedAlignmentModel               │
│                      SpeechToSpeechModel                │
│                      VoiceActivityDetectionModel (VAD)  │
│                      SpeakerEmbeddingModel              │
│                      SpeakerDiarizationModel            │
│                      SpeakerExtractionCapable           │
└─────────────────────────────────────────────────────────┘

SpeechRecognitionModel

Protocol for speech-to-text models.

public protocol SpeechRecognitionModel: AnyObject {
    var inputSampleRate: Int { get }
    func transcribe(audio: [Float], sampleRate: Int, language: String?) -> String
    func transcribeWithLanguage(audio: [Float], sampleRate: Int, language: String?) -> TranscriptionResult
}

Conforming types: Qwen3ASRModel, ParakeetASRModel, ParakeetStreamingASRModel, OmnilingualASRModel (CoreML), OmnilingualASRMLXModel (MLX)
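
A minimal model-agnostic usage sketch; the helper is not part of the library, and the caller is assumed to have loaded a conforming model and decoded the file into Float PCM:

import AudioCommon

// Illustrative helper: works with any conforming ASR model.
// Resampling to model.inputSampleRate is left to the caller in this sketch.
func transcribe(_ samples: [Float], sampleRate: Int,
                using model: any SpeechRecognitionModel) -> String {
    precondition(sampleRate == model.inputSampleRate,
                 "resample to \(model.inputSampleRate) Hz first")
    return model.transcribe(audio: samples, sampleRate: sampleRate, language: nil)
}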

SpeechGenerationModel

Protocol for text-to-speech models.

public protocol SpeechGenerationModel: AnyObject {
    var sampleRate: Int { get }
    func generate(text: String, language: String?) async throws -> [Float]
    func generateStream(text: String, language: String?) -> AsyncThrowingStream<AudioChunk, Error>  // has default impl
}

generateStream() has a default implementation that wraps generate() as a single chunk. Models with true streaming (e.g. Qwen3-TTS) override it.

Conforming types: Qwen3TTSModel, CosyVoiceTTSModel, KokoroTTSModel, Qwen35MLXChat
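
A sketch of what the single-chunk default can look like; the shipped implementation may differ, and AudioChunk is assumed to have a memberwise initializer:

extension SpeechGenerationModel {
    public func generateStream(text: String, language: String?) -> AsyncThrowingStream<AudioChunk, Error> {
        AsyncThrowingStream { continuation in
            Task {
                do {
                    // Run the one-shot generator and emit the result as a single chunk
                    let samples = try await self.generate(text: text, language: language)
                    continuation.yield(AudioChunk(samples: samples, sampleRate: self.sampleRate))
                    continuation.finish()
                } catch {
                    continuation.finish(throwing: error)
                }
            }
        }
    }
}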

ForcedAlignmentModel

Protocol for word-level timestamp alignment.

public protocol ForcedAlignmentModel: AnyObject {
    func align(audio: [Float], text: String, sampleRate: Int, language: String?) -> [AlignedWord]
}
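
Illustrative call, assuming a conforming aligner is already loaded and the audio decoded to 16 kHz PCM:

// aligner: any ForcedAlignmentModel, pcm: [Float] at 16 kHz (placeholders)
let words = aligner.align(audio: pcm, text: "hello world",
                          sampleRate: 16000, language: "en")
for word in words {
    // AlignedWord carries start/end times in seconds (see Shared Types below)
    print("\(word.startTime)s - \(word.endTime)s  \(word.text)")
}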

SpeechToSpeechModel

Protocol for speech-to-speech dialogue models.

public protocol SpeechToSpeechModel: AnyObject {
    var sampleRate: Int { get }
    func respond(userAudio: [Float]) -> [Float]
    func respondStream(userAudio: [Float]) -> AsyncThrowingStream<AudioChunk, Error>
}

Conforming types: PersonaPlexModel
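
Consuming the streaming variant is a plain for-await loop; the playback closure here is a placeholder:

func playResponse(to userAudio: [Float],
                  model: any SpeechToSpeechModel,
                  play: (AudioChunk) -> Void) async throws {
    // Chunks arrive as the model produces them; hand each to the player
    for try await chunk in model.respondStream(userAudio: userAudio) {
        play(chunk)
    }
}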

VoiceActivityDetectionModel

Protocol for voice activity detection.

public protocol VoiceActivityDetectionModel: AnyObject {
    var inputSampleRate: Int { get }
    func detectSpeech(audio: [Float], sampleRate: Int) -> [SpeechSegment]
}
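
A common pattern is to run VAD first and transcribe only the detected speech. This helper is illustrative, not a library API:

func transcribeSpeechOnly(audio: [Float], sampleRate: Int,
                          vad: any VoiceActivityDetectionModel,
                          asr: any SpeechRecognitionModel) -> [String] {
    vad.detectSpeech(audio: audio, sampleRate: sampleRate).map { segment -> String in
        // Convert segment times (seconds) into sample indices, clamped to the buffer
        let start = max(0, Int(segment.startTime * Float(sampleRate)))
        let end = max(start, min(audio.count, Int(segment.endTime * Float(sampleRate))))
        return asr.transcribe(audio: Array(audio[start..<end]),
                              sampleRate: sampleRate, language: nil)
    }
}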

SpeakerEmbeddingModel

Protocol for speaker embedding extraction.

public protocol SpeakerEmbeddingModel: AnyObject {
    var inputSampleRate: Int { get }
    var embeddingDimension: Int { get }
    func embed(audio: [Float], sampleRate: Int) -> [Float]
}

Conforming types: WeSpeakerModel

SpeakerDiarizationModel

Protocol for speaker diarization models that assign speaker labels to audio segments.

public protocol SpeakerDiarizationModel: AnyObject {
    var inputSampleRate: Int { get }
    func diarize(audio: [Float], sampleRate: Int) -> [DiarizedSegment]
}

Conforming types: DiarizationPipeline (Pyannote), SortformerDiarizer

SpeakerExtractionCapable

Extended diarization protocol for engines that support extracting a target speaker's segments using a reference embedding. Not all engines support this (Sortformer is end-to-end and does not produce speaker embeddings).

public protocol SpeakerExtractionCapable: SpeakerDiarizationModel {
    func extractSpeaker(audio: [Float], sampleRate: Int, targetEmbedding: [Float]) -> [SpeechSegment]
}

Conforming types: DiarizationPipeline (Pyannote only)
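
A hedged sketch combining the two speaker protocols; model construction is omitted and the reference clip is a placeholder:

func segmentsForTargetSpeaker(referenceAudio: [Float], mixtureAudio: [Float],
                              sampleRate: Int,
                              embedder: any SpeakerEmbeddingModel,
                              diarizer: any SpeakerExtractionCapable) -> [SpeechSegment] {
    // Embed a short reference clip of the target speaker (e.g. 256-dim for WeSpeaker)
    let target = embedder.embed(audio: referenceAudio, sampleRate: sampleRate)
    // Ask the extraction-capable diarizer for that speaker's segments in the mixture
    return diarizer.extractSpeaker(audio: mixtureAudio, sampleRate: sampleRate,
                                   targetEmbedding: target)
}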

Shared Types

AudioChunk

public struct AudioChunk {
    public let samples: [Float]   // PCM samples
    public let sampleRate: Int    // Sample rate (e.g. 24000)
}

SpeechSegment

public struct SpeechSegment {
    public let startTime: Float   // Start time in seconds
    public let endTime: Float     // End time in seconds
}

AlignedWord

public struct AlignedWord {
    public let text: String       // The word
    public let startTime: Float   // Start time in seconds
    public let endTime: Float     // End time in seconds
}

DiarizedSegment

public struct DiarizedSegment {
    public let startTime: Float   // Start time in seconds
    public let endTime: Float     // End time in seconds
    public let speakerId: Int     // Speaker identifier (0-based)
}
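
A small illustrative example of working with the 0-based speaker ids, here totalling speaking time per speaker (not a library API):

func speakingTime(in segments: [DiarizedSegment]) -> [Int: Float] {
    segments.reduce(into: [:]) { totals, segment in
        totals[segment.speakerId, default: 0] += segment.endTime - segment.startTime
    }
}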

DialogueSegment

A parsed segment of multi-speaker dialogue text with optional speaker and emotion tags. Used with DialogueParser and DialogueSynthesizer for CosyVoice3 dialogue synthesis.

public struct DialogueSegment: Sendable, Equatable {
    public let speaker: String?   // Speaker identifier ("S1", "S2"), nil for untagged
    public let emotion: String?   // Emotion tag ("happy", "whispers"), nil if none
    public let text: String       // Cleaned text to synthesize
}

DialogueParser

Parses multi-speaker dialogue text with inline speaker tags such as [S1] and emotion tags such as (happy).

public enum DialogueParser {
    static func parse(_ text: String) -> [DialogueSegment]
    static func emotionToInstruction(_ emotion: String) -> String
}

Built-in emotions: happy/excited, sad, angry, whispers/whispering, laughs/laughing, calm, surprised, serious. Unknown tags pass through as freeform instructions.
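
Illustrative parse of a short script; the exact cleaned text values are an assumption about the parser's output:

let script = "[S1] (happy) Welcome back! [S2] (whispers) Keep your voice down."
let segments = DialogueParser.parse(script)
// e.g. segments[0].speaker == "S1", segments[0].emotion == "happy"
for segment in segments {
    let instruction = segment.emotion.map(DialogueParser.emotionToInstruction)
    print(segment.speaker ?? "-", instruction ?? "(no instruction)", segment.text)
}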

DialogueSynthesizer

Orchestrates multi-segment dialogue synthesis with per-speaker voice cloning, silence gaps, and crossfade.

public enum DialogueSynthesizer {
    static func synthesize(
        segments: [DialogueSegment],
        speakerEmbeddings: [String: [Float]],
        model: CosyVoiceTTSModel,
        language: String,
        config: DialogueSynthesisConfig,
        verbose: Bool
    ) -> [Float]
}

DialogueSynthesisConfig

public struct DialogueSynthesisConfig: Sendable {
    public var turnGapSeconds: Float      // Default: 0.2
    public var crossfadeSeconds: Float    // Default: 0.0
    public var defaultInstruction: String // Default: "You are a helpful assistant."
    public var maxTokensPerSegment: Int   // Default: 500
}
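
A hedged end-to-end sketch: the speaker embeddings and loaded CosyVoiceTTSModel are placeholders obtained elsewhere (e.g. via WeSpeakerModel and the CosyVoice loader), and DialogueSynthesisConfig is assumed to have a default initializer:

var config = DialogueSynthesisConfig()
config.turnGapSeconds = 0.3        // slightly longer pause between turns

let embeddings: [String: [Float]] = [
    "S1": aliceEmbedding,          // placeholder speaker embeddings
    "S2": bobEmbedding
]

let pcm = DialogueSynthesizer.synthesize(
    segments: DialogueParser.parse(script),   // script from the example above
    speakerEmbeddings: embeddings,
    model: cosyVoiceModel,                    // placeholder loaded model
    language: "en",
    config: config,
    verbose: false
)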

PipelineLLM

Protocol for language model integration with voice pipelines. Bridges an LLM to the VoicePipeline's ASR → LLM → TTS flow.

public protocol PipelineLLM: AnyObject {
    func chat(messages: [(role: MessageRole, content: String)],
              onToken: @escaping (String, Bool) -> Void)
    func cancel()
}

Built-in adapter: Qwen3PipelineLLM bridges Qwen35MLXChat to this protocol with token cleanup, cancellation, and pending phrase accumulation.
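
A hypothetical conformer for illustration (EchoLLM is not part of the library); the Bool passed to onToken is assumed to mark the final token:

final class EchoLLM: PipelineLLM {
    func chat(messages: [(role: MessageRole, content: String)],
              onToken: @escaping (String, Bool) -> Void) {
        // Stream the last user message back word by word
        let reply = messages.last?.content ?? ""
        let words = reply.split(separator: " ").map(String.init)
        for (index, word) in words.enumerated() {
            onToken(word + " ", index == words.count - 1)
        }
    }

    func cancel() {
        // Nothing to cancel in this toy adapter
    }
}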

AudioIO

Reusable audio I/O manager that eliminates AVAudioEngine boilerplate. Handles mic capture, resampling, playback, and audio level metering.

let audio = AudioIO()
// Capture mic input resampled to 16 kHz and feed it downstream
try audio.startMicrophone(targetSampleRate: 16000) { samples in
    pipeline.pushAudio(samples)
}
// Schedule synthesized audio on the built-in streaming player
audio.player.scheduleChunk(ttsOutput)
audio.stopMicrophone()

AudioIO includes a StreamingAudioPlayer for TTS output and an AudioRingBuffer for thread-safe audio transfer between capture and inference threads.

SentencePieceModel

Shared protobuf reader for SentencePiece .model files; it lives in AudioCommon. Every module that needs to decode SentencePiece pieces (PersonaPlex, OmnilingualASR, future ASR / TTS ports) builds its own decoder on top of this single reader instead of re-implementing the protobuf wire format.

public struct SentencePieceModel: Sendable {
    public struct Piece: Sendable, Equatable {
        public let text: String
        public let score: Float
        public let type: Int32
        public var pieceType: PieceType? { get }
        public var isControlOrUnknown: Bool { get }
    }
    public enum PieceType: Int32 {
        case normal = 1, unknown = 2, control = 3,
             userDefined = 4, unused = 5, byte = 6
    }
    public let pieces: [Piece]
    public var count: Int { get }
    public subscript(_ id: Int) -> Piece? { get }
    public init(contentsOf url: URL) throws
    public init(modelPath: String) throws
    public init(data: Data) throws
}

Used by: OmnilingualASR.OmnilingualVocabulary, PersonaPlex.SentencePieceDecoder. Covered by 7 unit tests in Tests/AudioCommonTests/SentencePieceModelTests.
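
A hedged usage sketch: the model path is a placeholder, and the byte-level details of a real decoder (byte pieces, special tokens) are deliberately ignored here:

import Foundation

let sp = try SentencePieceModel(modelPath: "/path/to/tokenizer.model")
print("vocabulary size:", sp.count)

// Toy id-to-text mapping: skip control/unknown pieces and turn the
// SentencePiece "▁" word-boundary marker into a space.
func decode(_ ids: [Int], with sp: SentencePieceModel) -> String {
    ids.compactMap { sp[$0] }
       .filter { !$0.isControlOrUnknown }
       .map { $0.text.replacingOccurrences(of: "\u{2581}", with: " ") }
       .joined()
       .trimmingCharacters(in: .whitespaces)
}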

MLXCommon.SDPA

Scaled dot-product attention helpers shared across every MLX attention module (Qwen3-ASR / Qwen3-TTS / Qwen3-Chat / CosyVoice / PersonaPlex / OmnilingualASR). Each module keeps its own projections — SDPA only handles the reshape → attention → merge boilerplate.

public enum SDPA {
    // Flat [B, T, H*D] input: project/reshape happens inside
    public static func multiHead(
        q: MLXArray, k: MLXArray, v: MLXArray,
        numHeads: Int, headDim: Int, scale: Float,
        mask: MLXArray? = nil
    ) -> MLXArray

    // GQA / MQA variant with separate query and KV head counts
    public static func multiHead(
        q: MLXArray, k: MLXArray, v: MLXArray,
        numQueryHeads: Int, numKVHeads: Int, headDim: Int, scale: Float,
        mask: MLXArray? = nil
    ) -> MLXArray

    // Already-shaped [B, H, T, D] (RoPE / KV cache paths)
    public static func attendAndMerge(
        qHeads: MLXArray, kHeads: MLXArray, vHeads: MLXArray,
        scale: Float,
        mask: MLXArray? = nil
    ) -> MLXArray

    // Same, with ScaledDotProductAttentionMaskMode enum (newer API)
    public static func attendAndMerge(
        qHeads: MLXArray, kHeads: MLXArray, vHeads: MLXArray,
        scale: Float,
        mask: MLXFast.ScaledDotProductAttentionMaskMode
    ) -> MLXArray

    // Low-level head merge: [B, H, T, D] → [B, T, H*D]
    public static func mergeHeads(_ attn: MLXArray) -> MLXArray
}

All reshape calls use -1 for the batch dimension so the helpers compose with MLX.compile(shapeless:) graphs that vary batch at runtime (e.g. Qwen3-TTS Talker autoregressive decode).
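
A hedged sketch of a caller, where the input tensors and weights are placeholders and only the SDPA.multiHead call reflects the API above:

import MLX
import MLXCommon

// x: [B, T, H*D]; wq/wk/wv: [H*D, H*D] projection weights (placeholders)
func toyAttention(x: MLXArray, wq: MLXArray, wk: MLXArray, wv: MLXArray,
                  numHeads: Int, headDim: Int) -> MLXArray {
    // The caller keeps its own projections...
    let q = matmul(x, wq)
    let k = matmul(x, wk)
    let v = matmul(x, wv)
    // ...and SDPA handles the reshape, attention, and head-merge boilerplate
    return SDPA.multiHead(q: q, k: k, v: v,
                          numHeads: numHeads, headDim: headDim,
                          scale: 1.0 / Float(headDim).squareRoot())
}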

HTTP API Server

The audio-server binary exposes every model in speech-swift as HTTP REST endpoints plus a WebSocket endpoint that implements the OpenAI Realtime API. Models are loaded lazily on first request; pass --preload to warm them all at startup.

swift build -c release
.build/release/audio-server --port 8080

# Preload every model at startup
.build/release/audio-server --port 8080 --preload

REST Endpoints

Endpoint        Method  Request                                     Response
/transcribe     POST    audio/wav body                              JSON { text } (Qwen3-ASR)
/speak          POST    JSON { text, engine?, language?, voice? }   audio/wav body (Qwen3-TTS, CosyVoice, Kokoro)
/respond        POST    audio/wav body                              audio/wav body (PersonaPlex)
/enhance        POST    audio/wav body                              audio/wav body (DeepFilterNet3)
/vad            POST    audio/wav body                              JSON segment list
/diarize        POST    audio/wav body                              JSON DiarizedSegment list
/embed-speaker  POST    audio/wav body                              JSON [Float] (256-dim)

# Transcribe a file
curl -X POST http://localhost:8080/transcribe \
  --data-binary @recording.wav \
  -H "Content-Type: audio/wav"

# Synthesize speech
curl -X POST http://localhost:8080/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "engine": "cosyvoice"}' \
  -o output.wav

# Full speech-to-speech round trip
curl -X POST http://localhost:8080/respond \
  --data-binary @question.wav \
  -o response.wav

OpenAI Realtime API (/v1/realtime)

The WebSocket endpoint at ws://host:port/v1/realtime implements the OpenAI Realtime protocol. All messages are JSON with a type discriminator; audio payloads are base64-encoded PCM16 at 24 kHz mono.

Client → Server events

Event                        Purpose
session.update               Configure engine, language, voice, and audio format
input_audio_buffer.append    Append a base64 PCM16 chunk to the input buffer
input_audio_buffer.commit    Commit the buffered audio for transcription
input_audio_buffer.clear     Discard the current input buffer
response.create              Request TTS synthesis for the supplied text/instructions

Server → Client events

Event                                                   Meaning
session.created                                         Handshake complete, default config emitted
session.updated                                         Most recent session.update acknowledged
input_audio_buffer.committed                            Audio accepted and queued for transcription
conversation.item.input_audio_transcription.completed   ASR result with final transcript text
response.audio.delta                                    Base64 PCM16 chunk of synthesized audio
response.audio.done                                     No more audio chunks for this response
response.done                                           Response finalized (metadata + latency stats)
error                                                   Error envelope with type and message

const ws = new WebSocket('ws://localhost:8080/v1/realtime');

// ASR: push audio, request transcription
ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64PCM16 }));
ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
// → conversation.item.input_audio_transcription.completed

// TTS: request synthesis and stream audio deltas
ws.send(JSON.stringify({
  type: 'response.create',
  response: { modalities: ['audio', 'text'], instructions: 'Hello world' }
}));
// → response.audio.delta (repeated), response.audio.done, response.done

The server lives in the AudioServer SPM product. An example browser client is shipped at Examples/websocket-client.html — open it alongside a running server to drive the full ASR + TTS round trip.

Model Downloads

All models are downloaded from HuggingFace on first use and cached in ~/Library/Caches/qwen3-speech/. The AudioCommon module provides a shared HuggingFaceDownloader that handles download, caching, and integrity verification.