API & Protocols
The AudioCommon module defines model-agnostic protocols and shared types. Any conforming model can be used interchangeably through these interfaces.
Protocol Overview
┌────────────────────────────────────────────────────────────┐
│                        AudioCommon                         │
│                                                            │
│   AudioChunk       SpeechGenerationModel (TTS)             │
│   AlignedWord      SpeechRecognitionModel (STT)            │
│   SpeechSegment    ForcedAlignmentModel                    │
│                    SpeechToSpeechModel                     │
│                    VoiceActivityDetectionModel (VAD)       │
│                    SpeakerEmbeddingModel                   │
│                    SpeakerDiarizationModel                 │
│                    SpeakerExtractionCapable                │
└────────────────────────────────────────────────────────────┘
SpeechRecognitionModel
Protocol for speech-to-text models.
public protocol SpeechRecognitionModel: AnyObject {
var inputSampleRate: Int { get }
func transcribe(audio: [Float], sampleRate: Int, language: String?) -> String
func transcribeWithLanguage(audio: [Float], sampleRate: Int, language: String?) -> TranscriptionResult
}
Conforming types: Qwen3ASRModel, ParakeetASRModel, ParakeetStreamingASRModel, OmnilingualASRModel (CoreML), OmnilingualASRMLXModel (MLX)
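A minimal usage sketch (the helper is illustrative; any conformer works the same way, given mono Float PCM at the model's expected rate):

```swift
// Transcribe a clip with any SpeechRecognitionModel conformer.
// `samples` is mono Float PCM; resample to asr.inputSampleRate beforehand if needed.
func transcribeClip(_ asr: SpeechRecognitionModel, samples: [Float]) -> String {
    asr.transcribe(audio: samples,
                   sampleRate: asr.inputSampleRate,
                   language: nil)   // nil is assumed to mean auto-detect / model default
}
```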
SpeechGenerationModel
Protocol for text-to-speech models.
public protocol SpeechGenerationModel: AnyObject {
var sampleRate: Int { get }
func generate(text: String, language: String?) async throws -> [Float]
func generateStream(text: String, language: String?) -> AsyncThrowingStream<AudioChunk, Error> // has default impl
}
generateStream() has a default implementation that wraps generate() as a single chunk. Models with true streaming (e.g. Qwen3-TTS) override it.
Conforming types: Qwen3TTSModel, CosyVoiceTTSModel, KokoroTTSModel, Qwen35MLXChat
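A sketch of consuming the streaming API; the same loop works whether the model streams natively or falls back to the single-chunk default:

```swift
// Collect streamed TTS output into a single PCM buffer.
func synthesize(_ tts: SpeechGenerationModel, text: String) async throws -> [Float] {
    var pcm: [Float] = []
    for try await chunk in tts.generateStream(text: text, language: nil) {
        pcm.append(contentsOf: chunk.samples)   // chunk.sampleRate is typically tts.sampleRate
    }
    return pcm
}
```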
ForcedAlignmentModel
Protocol for word-level timestamp alignment.
public protocol ForcedAlignmentModel: AnyObject {
func align(audio: [Float], text: String, sampleRate: Int, language: String?) -> [AlignedWord]
}
SpeechToSpeechModel
Protocol for speech-to-speech dialogue models.
public protocol SpeechToSpeechModel: AnyObject {
var sampleRate: Int { get }
func respond(userAudio: [Float]) -> [Float]
func respondStream(userAudio: [Float]) -> AsyncThrowingStream<AudioChunk, Error>
}
Conforming types: PersonaPlexModel
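A sketch of streaming a spoken reply to a playback callback (the callback is a placeholder for a real audio player):

```swift
// Forward reply chunks to a player as they arrive.
func streamReply(_ s2s: SpeechToSpeechModel,
                 userAudio: [Float],
                 play: ([Float], Int) -> Void) async throws {
    for try await chunk in s2s.respondStream(userAudio: userAudio) {
        play(chunk.samples, chunk.sampleRate)
    }
}
```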
VoiceActivityDetectionModel
Protocol for voice activity detection.
public protocol VoiceActivityDetectionModel: AnyObject {
var inputSampleRate: Int { get }
func detectSpeech(audio: [Float], sampleRate: Int) -> [SpeechSegment]
}
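For example, total speech duration can be derived directly from the returned segments (a small illustrative helper):

```swift
// Seconds of detected speech in a clip (mono Float PCM at vad.inputSampleRate).
func speechSeconds(_ vad: VoiceActivityDetectionModel, audio: [Float]) -> Float {
    vad.detectSpeech(audio: audio, sampleRate: vad.inputSampleRate)
        .reduce(0) { $0 + ($1.endTime - $1.startTime) }
}
```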
SpeakerEmbeddingModel
Protocol for speaker embedding extraction.
public protocol SpeakerEmbeddingModel: AnyObject {
var inputSampleRate: Int { get }
var embeddingDimension: Int { get }
func embed(audio: [Float], sampleRate: Int) -> [Float]
}
Conforming types: WeSpeakerModel
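Embeddings can be compared with cosine similarity, e.g. for speaker verification (an illustrative helper; thresholds are application-specific):

```swift
// Cosine similarity between the embeddings of two clips
// (mono Float PCM at embedder.inputSampleRate).
func speakerSimilarity(_ embedder: SpeakerEmbeddingModel, a: [Float], b: [Float]) -> Float {
    let ea = embedder.embed(audio: a, sampleRate: embedder.inputSampleRate)
    let eb = embedder.embed(audio: b, sampleRate: embedder.inputSampleRate)
    let dot = zip(ea, eb).reduce(0) { $0 + $1.0 * $1.1 }
    let norm = { (v: [Float]) in v.reduce(0) { $0 + $1 * $1 }.squareRoot() }
    return dot / (norm(ea) * norm(eb))
}
```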
SpeakerDiarizationModel
Protocol for speaker diarization models that assign speaker labels to audio segments.
public protocol SpeakerDiarizationModel: AnyObject {
var inputSampleRate: Int { get }
func diarize(audio: [Float], sampleRate: Int) -> [DiarizedSegment]
}
Conforming types: DiarizationPipeline (Pyannote), SortformerDiarizer
SpeakerExtractionCapable
Extended diarization protocol for engines that support extracting a target speaker's segments using a reference embedding. Not all engines support this (Sortformer is end-to-end and does not produce speaker embeddings).
public protocol SpeakerExtractionCapable: SpeakerDiarizationModel {
func extractSpeaker(audio: [Float], sampleRate: Int, targetEmbedding: [Float]) -> [SpeechSegment]
}
Conforming types: DiarizationPipeline (Pyannote only)
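A sketch of preferring extraction when the engine supports it and falling back to plain diarization otherwise:

```swift
// (start, end) ranges in seconds for one target speaker, or all diarized
// segments when the engine cannot extract by embedding.
func targetSpeakerRanges(in audio: [Float],
                         targetEmbedding: [Float],
                         using model: SpeakerDiarizationModel) -> [(start: Float, end: Float)] {
    let rate = model.inputSampleRate
    if let extractor = model as? SpeakerExtractionCapable {
        return extractor.extractSpeaker(audio: audio, sampleRate: rate,
                                        targetEmbedding: targetEmbedding)
            .map { (start: $0.startTime, end: $0.endTime) }
    }
    return model.diarize(audio: audio, sampleRate: rate)
        .map { (start: $0.startTime, end: $0.endTime) }
}
```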
Shared Types
AudioChunk
public struct AudioChunk {
public let samples: [Float] // PCM samples
public let sampleRate: Int // Sample rate (e.g. 24000)
}
SpeechSegment
public struct SpeechSegment {
public let startTime: Float // Start time in seconds
public let endTime: Float // End time in seconds
}
AlignedWord
public struct AlignedWord {
public let text: String // The word
public let startTime: Float // Start time in seconds
public let endTime: Float // End time in seconds
}
DiarizedSegment
public struct DiarizedSegment {
public let startTime: Float // Start time in seconds
public let endTime: Float // End time in seconds
public let speakerId: Int // Speaker identifier (0-based)
}
DialogueSegment
A parsed segment of multi-speaker dialogue text with optional speaker and emotion tags. Used with DialogueParser and DialogueSynthesizer for CosyVoice3 dialogue synthesis.
public struct DialogueSegment: Sendable, Equatable {
public let speaker: String? // Speaker identifier ("S1", "S2"), nil for untagged
public let emotion: String? // Emotion tag ("happy", "whispers"), nil if none
public let text: String // Cleaned text to synthesize
}
DialogueParser
Parses multi-speaker dialogue text with inline speaker tags ([S1]) and emotion tags ((happy)).
public enum DialogueParser {
static func parse(_ text: String) -> [DialogueSegment]
static func emotionToInstruction(_ emotion: String) -> String
}
Built-in emotions: happy/excited, sad, angry, whispers/whispering, laughs/laughing, calm, surprised, serious. Unknown tags pass through as freeform instructions.
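For example (the parsed values in the comments illustrate the tag syntax; exact output depends on the parser):

```swift
let script = """
[S1] (happy) Welcome back to the show!
[S2] (whispers) Thanks for having me.
"""
let segments = DialogueParser.parse(script)
// segments[0]: speaker "S1", emotion "happy",    text "Welcome back to the show!"
// segments[1]: speaker "S2", emotion "whispers", text "Thanks for having me."
let instruction = DialogueParser.emotionToInstruction("whispers")
```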
DialogueSynthesizer
Orchestrates multi-segment dialogue synthesis with per-speaker voice cloning, silence gaps, and crossfade.
public enum DialogueSynthesizer {
static func synthesize(
segments: [DialogueSegment],
speakerEmbeddings: [String: [Float]],
model: CosyVoiceTTSModel,
language: String,
config: DialogueSynthesisConfig,
verbose: Bool
) -> [Float]
}
DialogueSynthesisConfig
public struct DialogueSynthesisConfig: Sendable {
public var turnGapSeconds: Float // Default: 0.2
public var crossfadeSeconds: Float // Default: 0.0
public var defaultInstruction: String // Default: "You are a helpful assistant."
public var maxTokensPerSegment: Int // Default: 500
}
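Putting the pieces together, a sketch of dialogue synthesis; the model, the per-speaker embeddings, and the config's default initializer are assumed to be available, and the names are placeholders:

```swift
// Parse tagged dialogue text and render it with per-speaker voice cloning.
let segments = DialogueParser.parse("[S1] Hello! [S2] (happy) Hi there!")
let embeddings: [String: [Float]] = ["S1": aliceEmbedding, "S2": bobEmbedding]

var config = DialogueSynthesisConfig()   // assumed default initializer
config.turnGapSeconds = 0.3              // slightly longer pause between turns

let pcm = DialogueSynthesizer.synthesize(
    segments: segments,
    speakerEmbeddings: embeddings,
    model: cosyVoice,                     // a loaded CosyVoiceTTSModel
    language: "en",
    config: config,
    verbose: false
)
```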
PipelineLLM
Protocol for language model integration with voice pipelines. Bridges an LLM to the VoicePipeline's ASR → LLM → TTS flow.
public protocol PipelineLLM: AnyObject {
func chat(messages: [(role: MessageRole, content: String)],
onToken: @escaping (String, Bool) -> Void)
func cancel()
}
Built-in adapter: Qwen3PipelineLLM bridges Qwen35MLXChat to this protocol with token cleanup, cancellation, and pending phrase accumulation.
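A minimal illustrative conformer (the Bool passed to onToken is assumed to flag the final token):

```swift
// Toy adapter: replies with the last message as a single, final token.
// A real adapter (like Qwen3PipelineLLM) streams tokens from a model and
// honors cancel() mid-generation.
final class EchoPipelineLLM: PipelineLLM {
    func chat(messages: [(role: MessageRole, content: String)],
              onToken: @escaping (String, Bool) -> Void) {
        let lastMessage = messages.last?.content ?? ""
        onToken("You said: " + lastMessage, true)   // true assumed to mean "final"
    }
    func cancel() { /* nothing in flight to cancel */ }
}
```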
AudioIO
Reusable audio I/O manager that eliminates AVAudioEngine boilerplate. Handles mic capture, resampling, playback, and audio level metering.
let audio = AudioIO()
try audio.startMicrophone(targetSampleRate: 16000) { samples in
pipeline.pushAudio(samples)
}
audio.player.scheduleChunk(ttsOutput)
audio.stopMicrophone()
AudioIO includes a StreamingAudioPlayer for TTS output and an AudioRingBuffer for thread-safe audio transfer between capture and inference threads.
SentencePieceModel
A shared protobuf reader for SentencePiece .model files that lives in AudioCommon. Every module that needs to decode SentencePiece pieces (PersonaPlex, OmnilingualASR, future ASR / TTS ports) builds its own decoder on top of this single reader instead of re-implementing the protobuf wire format.
public struct SentencePieceModel: Sendable {
public struct Piece: Sendable, Equatable {
public let text: String
public let score: Float
public let type: Int32
public var pieceType: PieceType? { get }
public var isControlOrUnknown: Bool { get }
}
public enum PieceType: Int32 {
case normal = 1, unknown = 2, control = 3,
userDefined = 4, unused = 5, byte = 6
}
public let pieces: [Piece]
public var count: Int { get }
public subscript(_ id: Int) -> Piece? { get }
public init(contentsOf url: URL) throws
public init(modelPath: String) throws
public init(data: Data) throws
}
Used by: OmnilingualASR.OmnilingualVocabulary, PersonaPlex.SentencePieceDecoder. Covered by 7 unit tests in Tests/AudioCommonTests/SentencePieceModelTests.
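A sketch of loading a .model file and turning token ids back into text; the path and ids are placeholders, and ▁ (U+2581) is the standard SentencePiece word-boundary marker:

```swift
import AudioCommon
import Foundation

let spm = try SentencePieceModel(modelPath: "/path/to/tokenizer.model")

let ids: [Int] = [17, 422, 9]                 // produced by some decoder
let text = ids
    .compactMap { spm[$0] }                   // id → Piece (nil if out of range)
    .filter { !$0.isControlOrUnknown }        // drop control / unknown pieces
    .map(\.text)
    .joined()
    .replacingOccurrences(of: "\u{2581}", with: " ")   // ▁ → space
    .trimmingCharacters(in: .whitespaces)
```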
MLXCommon.SDPA
Scaled dot-product attention helpers shared across every MLX attention module (Qwen3-ASR / Qwen3-TTS / Qwen3-Chat / CosyVoice / PersonaPlex / OmnilingualASR). Each module keeps its own projections — SDPA only handles the reshape → attention → merge boilerplate.
public enum SDPA {
// Flat [B, T, H*D] q/k/v: the head split / reshape happens inside
public static func multiHead(
q: MLXArray, k: MLXArray, v: MLXArray,
numHeads: Int, headDim: Int, scale: Float,
mask: MLXArray? = nil
) -> MLXArray
// GQA / MQA variant with separate query and KV head counts
public static func multiHead(
q: MLXArray, k: MLXArray, v: MLXArray,
numQueryHeads: Int, numKVHeads: Int, headDim: Int, scale: Float,
mask: MLXArray? = nil
) -> MLXArray
// Already-shaped [B, H, T, D] (RoPE / KV cache paths)
public static func attendAndMerge(
qHeads: MLXArray, kHeads: MLXArray, vHeads: MLXArray,
scale: Float,
mask: MLXArray? = nil
) -> MLXArray
// Same, with ScaledDotProductAttentionMaskMode enum (newer API)
public static func attendAndMerge(
qHeads: MLXArray, kHeads: MLXArray, vHeads: MLXArray,
scale: Float,
mask: MLXFast.ScaledDotProductAttentionMaskMode
) -> MLXArray
// Low-level head merge: [B, H, T, D] → [B, T, H*D]
public static func mergeHeads(_ attn: MLXArray) -> MLXArray
}
All reshape calls use -1 for the batch dimension so the helpers compose with MLX.compile(shapeless:) graphs that vary batch at runtime (e.g. Qwen3-TTS Talker autoregressive decode).
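A sketch of how an attention layer can delegate to the flat-input helper; the projection layers and dimensions are illustrative, and each real module keeps its own weights:

```swift
import MLX
import MLXCommon
import MLXNN

// Minimal self-attention step built on SDPA.multiHead.
func selfAttention(_ x: MLXArray,                    // [B, T, numHeads * headDim]
                   qProj: Linear, kProj: Linear, vProj: Linear, oProj: Linear,
                   numHeads: Int, headDim: Int,
                   mask: MLXArray? = nil) -> MLXArray {
    let scale = 1.0 / Float(headDim).squareRoot()
    let attended = SDPA.multiHead(
        q: qProj(x), k: kProj(x), v: vProj(x),       // flat [B, T, H*D] projections
        numHeads: numHeads, headDim: headDim,
        scale: scale, mask: mask
    )                                                 // head split → SDPA → merge inside
    return oProj(attended)                            // output projection, [B, T, H*D]
}
```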
HTTP API Server
The audio-server binary exposes every model in speech-swift as HTTP REST endpoints plus a WebSocket endpoint that implements the OpenAI Realtime API. Models are loaded lazily on first request; pass --preload to warm them all at startup.
swift build -c release
.build/release/audio-server --port 8080
# Preload every model at startup
.build/release/audio-server --port 8080 --preload
REST Endpoints
| Endpoint | Method | Request | Response |
|---|---|---|---|
| /transcribe | POST | audio/wav body | JSON { text } (Qwen3-ASR) |
| /speak | POST | JSON { text, engine?, language?, voice? } | audio/wav body (Qwen3-TTS, CosyVoice, Kokoro) |
| /respond | POST | audio/wav body | audio/wav body (PersonaPlex) |
| /enhance | POST | audio/wav body | audio/wav body (DeepFilterNet3) |
| /vad | POST | audio/wav body | JSON segment list |
| /diarize | POST | audio/wav body | JSON DiarizedSegment list |
| /embed-speaker | POST | audio/wav body | JSON [Float] (256-dim) |
# Transcribe a file
curl -X POST http://localhost:8080/transcribe \
--data-binary @recording.wav \
-H "Content-Type: audio/wav"
# Synthesize speech
curl -X POST http://localhost:8080/speak \
-H "Content-Type: application/json" \
-d '{"text": "Hello world", "engine": "cosyvoice"}' \
-o output.wav
# Full speech-to-speech round trip
curl -X POST http://localhost:8080/respond \
--data-binary @question.wav \
-o response.wav
OpenAI Realtime API (/v1/realtime)
The WebSocket endpoint at ws://host:port/v1/realtime implements the OpenAI Realtime protocol. All messages are JSON with a type discriminator; audio payloads are base64-encoded PCM16 at 24 kHz mono.
Client → Server events
| Event | Purpose |
|---|---|
| session.update | Configure engine, language, voice, and audio format |
| input_audio_buffer.append | Append a base64 PCM16 chunk to the input buffer |
| input_audio_buffer.commit | Commit the buffered audio for transcription |
| input_audio_buffer.clear | Discard the current input buffer |
| response.create | Request TTS synthesis for the supplied text/instructions |
Server → Client events
| Event | Meaning |
|---|---|
| session.created | Handshake complete, default config emitted |
| session.updated | Most recent session.update acknowledged |
| input_audio_buffer.committed | Audio accepted and queued for transcription |
| conversation.item.input_audio_transcription.completed | ASR result with final transcript text |
| response.audio.delta | Base64 PCM16 chunk of synthesized audio |
| response.audio.done | No more audio chunks for this response |
| response.done | Response finalized (metadata + latency stats) |
| error | Error envelope with type and message |
const ws = new WebSocket('ws://localhost:8080/v1/realtime');
// ASR: push audio, request transcription
ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64PCM16 }));
ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
// → conversation.item.input_audio_transcription.completed
// TTS: request synthesis and stream audio deltas
ws.send(JSON.stringify({
type: 'response.create',
response: { modalities: ['audio', 'text'], instructions: 'Hello world' }
}));
// → response.audio.delta (repeated), response.audio.done, response.done
The server lives in the AudioServer SPM product. An example browser client is shipped at Examples/websocket-client.html — open it alongside a running server to drive the full ASR + TTS round trip.
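For native clients, the same handshake can be driven from Swift with Foundation's URLSessionWebSocketTask; a minimal sketch (the exact shape of the session payload is illustrative):

```swift
import Foundation

let url = URL(string: "ws://localhost:8080/v1/realtime")!
let ws = URLSession.shared.webSocketTask(with: url)
ws.resume()

// Configure the session (fields per the table above; the nesting under
// "session" follows the OpenAI Realtime convention and is assumed here).
let sessionUpdate = """
{"type":"session.update","session":{"engine":"cosyvoice","language":"en","voice":"default"}}
"""
ws.send(.string(sessionUpdate)) { error in
    if let error { print("session.update failed: \(error)") }
}

// Read the next server event (session.created, session.updated, ...).
ws.receive { result in
    if case .success(.string(let event)) = result { print(event) }
}
```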
Model Downloads
All models are downloaded from HuggingFace on first use and cached in ~/Library/Caches/qwen3-speech/. The AudioCommon module provides a shared HuggingFaceDownloader that handles download, caching, and integrity verification.