API & Protocols

The AudioCommon module defines model-agnostic protocols and shared types. Any conforming model can be used interchangeably through these interfaces.

Protocol Overview

┌─────────────────────────────────────────────────────────┐
│                    AudioCommon                          │
│                                                         │
│  AudioChunk          SpeechGenerationModel (TTS)        │
│  AlignedWord         SpeechRecognitionModel (STT)       │
│  SpeechSegment       ForcedAlignmentModel               │
│                      SpeechToSpeechModel                │
│                      VoiceActivityDetectionModel (VAD)  │
│                      SpeakerEmbeddingModel              │
│                      SpeakerDiarizationModel            │
│                      SpeakerExtractionCapable           │
└─────────────────────────────────────────────────────────┘

SpeechRecognitionModel

Protocol for speech-to-text models.

public protocol SpeechRecognitionModel: AnyObject {
    var inputSampleRate: Int { get }
    func transcribe(audio: [Float], sampleRate: Int, language: String?) -> String
    func transcribeWithLanguage(audio: [Float], sampleRate: Int, language: String?) -> TranscriptionResult
}

Conforming types: Qwen3ASRModel, ParakeetASRModel
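
Because callers depend only on the protocol, any conforming model can be swapped in. A minimal self-contained sketch with a stand-in model (FakeASR and runASR are illustrative only; transcribeWithLanguage is omitted for brevity):

```swift
public protocol SpeechRecognitionModel: AnyObject {
    var inputSampleRate: Int { get }
    func transcribe(audio: [Float], sampleRate: Int, language: String?) -> String
}

// Stand-in conformance used only for illustration.
final class FakeASR: SpeechRecognitionModel {
    let inputSampleRate = 16_000
    func transcribe(audio: [Float], sampleRate: Int, language: String?) -> String {
        "(\(audio.count) samples at \(sampleRate) Hz)"
    }
}

// Callers accept any SpeechRecognitionModel, so models are interchangeable.
func runASR(_ model: SpeechRecognitionModel, audio: [Float]) -> String {
    model.transcribe(audio: audio, sampleRate: model.inputSampleRate, language: nil)
}

let text = runASR(FakeASR(), audio: [Float](repeating: 0, count: 160))
```

Swapping Qwen3ASRModel for ParakeetASRModel requires no changes at the call site.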

SpeechGenerationModel

Protocol for text-to-speech models.

public protocol SpeechGenerationModel: AnyObject {
    var sampleRate: Int { get }
    func generate(text: String, language: String?) async throws -> [Float]
    func generateStream(text: String, language: String?) -> AsyncThrowingStream<AudioChunk, Error>  // has default impl
}

generateStream() has a default implementation that calls generate() once and emits the result as a single chunk. Models with true streaming (e.g. Qwen3-TTS) override it.

Conforming types: Qwen3TTSModel, CosyVoiceTTSModel, KokoroTTSModel, Qwen3ChatModel
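
A plausible shape for that default implementation, reproduced here as a self-contained sketch (the actual extension in AudioCommon may differ in detail):

```swift
public struct AudioChunk {
    public let samples: [Float]
    public let sampleRate: Int
}

public protocol SpeechGenerationModel: AnyObject {
    var sampleRate: Int { get }
    func generate(text: String, language: String?) async throws -> [Float]
    func generateStream(text: String, language: String?) -> AsyncThrowingStream<AudioChunk, Error>
}

public extension SpeechGenerationModel {
    // Default: run generate() once and emit the result as a single chunk.
    // Streaming models override this to yield chunks as they are decoded.
    func generateStream(text: String, language: String?) -> AsyncThrowingStream<AudioChunk, Error> {
        AsyncThrowingStream { continuation in
            Task {
                do {
                    let samples = try await self.generate(text: text, language: language)
                    continuation.yield(AudioChunk(samples: samples, sampleRate: self.sampleRate))
                    continuation.finish()
                } catch {
                    continuation.finish(throwing: error)
                }
            }
        }
    }
}
```

Callers can therefore always consume the streaming API, regardless of whether the model streams natively.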

ForcedAlignmentModel

Protocol for word-level timestamp alignment.

public protocol ForcedAlignmentModel: AnyObject {
    func align(audio: [Float], text: String, sampleRate: Int, language: String?) -> [AlignedWord]
}
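
A typical consumer of align() output looks up which word is being spoken at a given playback time, e.g. for karaoke-style highlighting. A sketch (wordAt is an illustrative helper, not part of AudioCommon):

```swift
public struct AlignedWord {
    public let text: String
    public let startTime: Float
    public let endTime: Float
}

// Illustrative helper: the word whose [startTime, endTime) interval covers t.
func wordAt(_ t: Float, in words: [AlignedWord]) -> String? {
    words.first { t >= $0.startTime && t < $0.endTime }?.text
}

let words = [
    AlignedWord(text: "hello", startTime: 0.00, endTime: 0.42),
    AlignedWord(text: "world", startTime: 0.50, endTime: 0.95),
]
```

Gaps between words (here 0.42–0.50 s) simply yield nil.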

SpeechToSpeechModel

Protocol for speech-to-speech dialogue models.

public protocol SpeechToSpeechModel: AnyObject {
    var sampleRate: Int { get }
    func respond(userAudio: [Float]) -> [Float]
    func respondStream(userAudio: [Float]) -> AsyncThrowingStream<AudioChunk, Error>
}

Conforming types: PersonaPlexModel

VoiceActivityDetectionModel

Protocol for voice activity detection.

public protocol VoiceActivityDetectionModel: AnyObject {
    var inputSampleRate: Int { get }
    func detectSpeech(audio: [Float], sampleRate: Int) -> [SpeechSegment]
}
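
SpeechSegment times are in seconds, so converting them to sample ranges is a common follow-up step, e.g. to feed only speech regions to an ASR model. A self-contained sketch (sliceSegments is illustrative):

```swift
public struct SpeechSegment {
    public let startTime: Float
    public let endTime: Float
}

// Illustrative: cut detected speech regions out of the raw sample buffer.
func sliceSegments(_ audio: [Float], sampleRate: Int,
                   segments: [SpeechSegment]) -> [[Float]] {
    segments.compactMap { seg in
        let start = max(0, Int(seg.startTime * Float(sampleRate)))
        let end = min(audio.count, Int(seg.endTime * Float(sampleRate)))
        guard start < end else { return nil }
        return Array(audio[start..<end])
    }
}

let audio = [Float](repeating: 0.1, count: 16_000)   // 1 s at 16 kHz
let slices = sliceSegments(audio, sampleRate: 16_000,
                           segments: [SpeechSegment(startTime: 0.25, endTime: 0.75)])
```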

SpeakerEmbeddingModel

Protocol for speaker embedding extraction.

public protocol SpeakerEmbeddingModel: AnyObject {
    var inputSampleRate: Int { get }
    var embeddingDimension: Int { get }
    func embed(audio: [Float], sampleRate: Int) -> [Float]
}

Conforming types: WeSpeakerModel
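
Embeddings returned by embed() are typically compared with cosine similarity, e.g. to decide whether two clips are the same speaker. A sketch (the helper and any threshold you pick are illustrative, not part of the API):

```swift
// Illustrative: cosine similarity between two fixed-length embeddings.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "embeddings must share a dimension")
    var dot: Float = 0, na: Float = 0, nb: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    // Small epsilon guards against zero-norm inputs.
    return dot / (na.squareRoot() * nb.squareRoot() + 1e-9)
}
```

Values near 1 suggest the same speaker; near 0, unrelated speakers.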

SpeakerDiarizationModel

Protocol for speaker diarization models that assign speaker labels to audio segments.

public protocol SpeakerDiarizationModel: AnyObject {
    var inputSampleRate: Int { get }
    func diarize(audio: [Float], sampleRate: Int) -> [DiarizedSegment]
}

Conforming types: DiarizationPipeline (Pyannote), SortformerDiarizer

SpeakerExtractionCapable

Extended diarization protocol for engines that support extracting a target speaker's segments using a reference embedding. Not all engines support this (Sortformer is end-to-end and does not produce speaker embeddings).

public protocol SpeakerExtractionCapable: SpeakerDiarizationModel {
    func extractSpeaker(audio: [Float], sampleRate: Int, targetEmbedding: [Float]) -> [SpeechSegment]
}

Conforming types: DiarizationPipeline (Pyannote only)
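
Because SpeakerExtractionCapable refines SpeakerDiarizationModel, code holding the base protocol can probe for the extra capability with a conditional cast. A self-contained sketch with stand-in types (EndToEndDiarizer and segmentsForTarget are illustrative):

```swift
public struct SpeechSegment {
    public let startTime: Float
    public let endTime: Float
}
public struct DiarizedSegment {
    public let startTime: Float
    public let endTime: Float
    public let speakerId: Int
}

public protocol SpeakerDiarizationModel: AnyObject {
    var inputSampleRate: Int { get }
    func diarize(audio: [Float], sampleRate: Int) -> [DiarizedSegment]
}
public protocol SpeakerExtractionCapable: SpeakerDiarizationModel {
    func extractSpeaker(audio: [Float], sampleRate: Int,
                        targetEmbedding: [Float]) -> [SpeechSegment]
}

// Stand-in for an end-to-end engine (like Sortformer) with no embeddings.
final class EndToEndDiarizer: SpeakerDiarizationModel {
    let inputSampleRate = 16_000
    func diarize(audio: [Float], sampleRate: Int) -> [DiarizedSegment] { [] }
}

func segmentsForTarget(_ engine: SpeakerDiarizationModel,
                       audio: [Float], embedding: [Float]) -> [SpeechSegment]? {
    // Probe for the optional capability; unsupported engines return nil here.
    guard let extractor = engine as? SpeakerExtractionCapable else { return nil }
    return extractor.extractSpeaker(audio: audio,
                                    sampleRate: engine.inputSampleRate,
                                    targetEmbedding: embedding)
}
```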

Shared Types

AudioChunk

public struct AudioChunk {
    public let samples: [Float]   // PCM samples
    public let sampleRate: Int    // Sample rate (e.g. 24000)
}

SpeechSegment

public struct SpeechSegment {
    public let startTime: Float   // Start time in seconds
    public let endTime: Float     // End time in seconds
}

AlignedWord

public struct AlignedWord {
    public let text: String       // The word
    public let startTime: Float   // Start time in seconds
    public let endTime: Float     // End time in seconds
}

DiarizedSegment

public struct DiarizedSegment {
    public let startTime: Float   // Start time in seconds
    public let endTime: Float     // End time in seconds
    public let speakerId: Int     // Speaker identifier (0-based)
}

DialogueSegment

A parsed segment of multi-speaker dialogue text with optional speaker and emotion tags. Used with DialogueParser and DialogueSynthesizer for CosyVoice3 dialogue synthesis.

public struct DialogueSegment: Sendable, Equatable {
    public let speaker: String?   // Speaker identifier ("S1", "S2"), nil for untagged
    public let emotion: String?   // Emotion tag ("happy", "whispers"), nil if none
    public let text: String       // Cleaned text to synthesize
}

DialogueParser

Parses multi-speaker dialogue text with inline speaker tags ([S1]) and emotion tags ((happy)).

public enum DialogueParser {
    public static func parse(_ text: String) -> [DialogueSegment]
    public static func emotionToInstruction(_ emotion: String) -> String
}

Built-in emotions: happy/excited, sad, angry, whispers/whispering, laughs/laughing, calm, surprised, serious. Unknown tags pass through as freeform instructions.
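
The tag format can be illustrated with a minimal single-line parser sketch. The real DialogueParser handles more cases (multi-line input, mid-line tags), so treat this only as a picture of the grammar; the types are redeclared here for self-containment:

```swift
import Foundation

struct DialogueSegment: Equatable {
    let speaker: String?
    let emotion: String?
    let text: String
}

// Illustrative sketch: an optional leading [S#] speaker tag, then an
// optional (emotion) tag, then the text to synthesize.
func parseLine(_ line: String) -> DialogueSegment {
    var rest = line.trimmingCharacters(in: .whitespaces)
    var speaker: String? = nil
    var emotion: String? = nil
    if rest.hasPrefix("["), let close = rest.firstIndex(of: "]") {
        speaker = String(rest[rest.index(after: rest.startIndex)..<close])
        rest = String(rest[rest.index(after: close)...])
            .trimmingCharacters(in: .whitespaces)
    }
    if rest.hasPrefix("("), let close = rest.firstIndex(of: ")") {
        emotion = String(rest[rest.index(after: rest.startIndex)..<close])
        rest = String(rest[rest.index(after: close)...])
            .trimmingCharacters(in: .whitespaces)
    }
    return DialogueSegment(speaker: speaker, emotion: emotion, text: rest)
}
```

So "[S1] (happy) Hello there" yields speaker "S1", emotion "happy", text "Hello there", while untagged lines yield nil for both tags.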

DialogueSynthesizer

Orchestrates multi-segment dialogue synthesis with per-speaker voice cloning, silence gaps, and crossfade.

public enum DialogueSynthesizer {
    public static func synthesize(
        segments: [DialogueSegment],
        speakerEmbeddings: [String: [Float]],
        model: CosyVoiceTTSModel,
        language: String,
        config: DialogueSynthesisConfig,
        verbose: Bool
    ) -> [Float]
}

DialogueSynthesisConfig

public struct DialogueSynthesisConfig: Sendable {
    public var turnGapSeconds: Float      // Default: 0.2
    public var crossfadeSeconds: Float    // Default: 0.0
    public var defaultInstruction: String // Default: "You are a helpful assistant."
    public var maxTokensPerSegment: Int   // Default: 500
}
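
How turnGapSeconds and crossfadeSeconds combine adjacent segments can be sketched as a pairwise join. This is an illustration of the arithmetic, not the synthesizer's actual code:

```swift
// Illustrative: join two synthesized segments, either with a silence gap
// or with a linear crossfade over the overlapping region.
func join(_ a: [Float], _ b: [Float], sampleRate: Int,
          gapSeconds: Float, crossfadeSeconds: Float) -> [Float] {
    if crossfadeSeconds > 0 {
        let n = min(Int(crossfadeSeconds * Float(sampleRate)), a.count, b.count)
        var out = Array(a.dropLast(n))
        for i in 0..<n {
            let t = Float(i) / Float(n)          // 0 → 1 across the overlap
            out.append(a[a.count - n + i] * (1 - t) + b[i] * t)
        }
        out.append(contentsOf: b.dropFirst(n))
        return out
    } else {
        let gap = [Float](repeating: 0, count: Int(gapSeconds * Float(sampleRate)))
        return a + gap + b
    }
}
```

With a gap, the result is len(a) + gap + len(b) samples; with a crossfade, the overlap is absorbed, giving len(a) + len(b) − n samples.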

PipelineLLM

Protocol for language model integration with voice pipelines. Bridges an LLM to the VoicePipeline's ASR → LLM → TTS flow.

public protocol PipelineLLM: AnyObject {
    func chat(messages: [(role: MessageRole, content: String)],
              onToken: @escaping (String, Bool) -> Void)
    func cancel()
}

Built-in adapter: Qwen3PipelineLLM bridges Qwen3ChatModel to this protocol with token cleanup, cancellation, and pending phrase accumulation.
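
The onToken callback contract (partial tokens with false, then a final call with true) can be seen with a canned-response conformance. EchoLLM and the MessageRole cases below are illustrative stand-ins, not part of AudioCommon:

```swift
enum MessageRole { case system, user, assistant }   // assumed shape

protocol PipelineLLM: AnyObject {
    func chat(messages: [(role: MessageRole, content: String)],
              onToken: @escaping (String, Bool) -> Void)
    func cancel()
}

// Illustrative conformance: streams the last user message back word by word,
// then signals completion with an empty final token.
final class EchoLLM: PipelineLLM {
    func chat(messages: [(role: MessageRole, content: String)],
              onToken: @escaping (String, Bool) -> Void) {
        let reply = messages.last?.content ?? ""
        for word in reply.split(separator: " ") {
            onToken(String(word) + " ", false)   // streaming token
        }
        onToken("", true)                        // done
    }
    func cancel() {}
}
```

The pipeline forwards each non-final token toward TTS as phrases accumulate, and flushes on the final call.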

AudioIO

Reusable audio I/O manager that eliminates AVAudioEngine boilerplate. Handles mic capture, resampling, playback, and audio level metering.

let audio = AudioIO()
try audio.startMicrophone(targetSampleRate: 16000) { samples in
    pipeline.pushAudio(samples)
}
audio.player.scheduleChunk(ttsOutput)
audio.stopMicrophone()

AudioIO includes a StreamingAudioPlayer for TTS output and an AudioRingBuffer for thread-safe audio transfer between capture and inference threads.

Model Downloads

All models are downloaded from HuggingFace on first use and cached in ~/Library/Caches/qwen3-speech/. The AudioCommon module provides a shared HuggingFaceDownloader that handles download, caching, and integrity verification.