API & Protocols
The AudioCommon module defines model-agnostic protocols and shared types. Any conforming model can be used interchangeably through these interfaces.
Protocol Overview
┌─────────────────────────────────────────────────────────┐
│                       AudioCommon                       │
│                                                         │
│  AudioChunk        SpeechGenerationModel (TTS)          │
│  AlignedWord       SpeechRecognitionModel (STT)         │
│  SpeechSegment     ForcedAlignmentModel                 │
│                    SpeechToSpeechModel                  │
│                    VoiceActivityDetectionModel (VAD)    │
│                    SpeakerEmbeddingModel                │
│                    SpeakerDiarizationModel              │
│                    SpeakerExtractionCapable             │
└─────────────────────────────────────────────────────────┘
SpeechRecognitionModel
Protocol for speech-to-text models.
public protocol SpeechRecognitionModel: AnyObject {
    var inputSampleRate: Int { get }
    func transcribe(audio: [Float], sampleRate: Int, language: String?) -> String
    func transcribeWithLanguage(audio: [Float], sampleRate: Int, language: String?) -> TranscriptionResult
}
Conforming types: Qwen3ASRModel, ParakeetASRModel
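Because callers depend only on the protocol, any conforming model can be swapped in without code changes. A minimal self-contained sketch (the protocol is re-declared locally for illustration, and MockASR is a hypothetical stand-in for Qwen3ASRModel or ParakeetASRModel; TranscriptionResult is omitted):

```swift
// Sketch: programming against the protocol rather than a concrete model.
protocol SpeechRecognitionModelSketch: AnyObject {
    var inputSampleRate: Int { get }
    func transcribe(audio: [Float], sampleRate: Int, language: String?) -> String
}

final class MockASR: SpeechRecognitionModelSketch {
    let inputSampleRate = 16000
    func transcribe(audio: [Float], sampleRate: Int, language: String?) -> String {
        // A real model runs inference here; the mock just reports what it saw.
        return "\(audio.count) samples @ \(sampleRate) Hz"
    }
}

// Call sites hold the protocol type, so MockASR could be any conformer.
let model: SpeechRecognitionModelSketch = MockASR()
let text = model.transcribe(audio: [Float](repeating: 0, count: 16000),
                            sampleRate: model.inputSampleRate,
                            language: "en")
```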
SpeechGenerationModel
Protocol for text-to-speech models.
public protocol SpeechGenerationModel: AnyObject {
    var sampleRate: Int { get }
    func generate(text: String, language: String?) async throws -> [Float]
    func generateStream(text: String, language: String?) -> AsyncThrowingStream<AudioChunk, Error> // has default impl
}
generateStream() has a default implementation that wraps generate() as a single chunk. Models with true streaming (e.g. Qwen3-TTS) override it.
Conforming types: Qwen3TTSModel, CosyVoiceTTSModel, KokoroTTSModel, Qwen3ChatModel
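The default wrapping behavior can be sketched with a protocol extension: run generate() once and emit the result as a single chunk. This is an illustrative reconstruction, not the library's source; the Sketch types mirror the declarations above and MockTTS is hypothetical.

```swift
import Foundation

struct AudioChunkSketch { let samples: [Float]; let sampleRate: Int }

protocol SpeechGenerationModelSketch: AnyObject {
    var sampleRate: Int { get }
    func generate(text: String, language: String?) async throws -> [Float]
}

extension SpeechGenerationModelSketch {
    // Default: run generate() to completion and yield one chunk.
    // A true streaming model would override this to yield incrementally.
    func generateStream(text: String, language: String?) -> AsyncThrowingStream<AudioChunkSketch, Error> {
        AsyncThrowingStream { continuation in
            Task {
                do {
                    let samples = try await self.generate(text: text, language: language)
                    continuation.yield(AudioChunkSketch(samples: samples, sampleRate: self.sampleRate))
                    continuation.finish()
                } catch {
                    continuation.finish(throwing: error)
                }
            }
        }
    }
}

final class MockTTS: SpeechGenerationModelSketch {
    let sampleRate = 24000
    func generate(text: String, language: String?) async throws -> [Float] {
        [Float](repeating: 0, count: text.count) // placeholder "audio"
    }
}
```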
ForcedAlignmentModel
Protocol for word-level timestamp alignment.
public protocol ForcedAlignmentModel: AnyObject {
    func align(audio: [Float], text: String, sampleRate: Int, language: String?) -> [AlignedWord]
}
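Alignment output is a flat word list; a small hypothetical helper can render it as readable per-word timestamps (the struct mirrors the AlignedWord shared type defined below):

```swift
import Foundation

struct AlignedWordSketch { let text: String; let startTime: Float; let endTime: Float }

// Render each aligned word as "start-end  word", one per line.
func formatAlignment(_ words: [AlignedWordSketch]) -> String {
    words.map { w in
        String(format: "%6.2f-%6.2f  %@", w.startTime, w.endTime, w.text)
    }.joined(separator: "\n")
}

let demo = [
    AlignedWordSketch(text: "hello", startTime: 0.12, endTime: 0.48),
    AlignedWordSketch(text: "world", startTime: 0.55, endTime: 0.97),
]
```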
SpeechToSpeechModel
Protocol for speech-to-speech dialogue models.
public protocol SpeechToSpeechModel: AnyObject {
    var sampleRate: Int { get }
    func respond(userAudio: [Float]) -> [Float]
    func respondStream(userAudio: [Float]) -> AsyncThrowingStream<AudioChunk, Error>
}
Conforming types: PersonaPlexModel
VoiceActivityDetectionModel
Protocol for voice activity detection.
public protocol VoiceActivityDetectionModel: AnyObject {
    var inputSampleRate: Int { get }
    func detectSpeech(audio: [Float], sampleRate: Int) -> [SpeechSegment]
}
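A common next step is slicing the original buffer by the returned segments before handing each region to an ASR model. A minimal sketch (SpeechSegmentSketch mirrors the SpeechSegment shared type below):

```swift
struct SpeechSegmentSketch { let startTime: Float; let endTime: Float }

// Convert second-based segments into per-segment sample arrays,
// clamping to the buffer bounds.
func slice(audio: [Float], segments: [SpeechSegmentSketch], sampleRate: Int) -> [[Float]] {
    segments.map { seg in
        let lo = max(0, Int(seg.startTime * Float(sampleRate)))
        let hi = min(audio.count, Int(seg.endTime * Float(sampleRate)))
        return lo < hi ? Array(audio[lo..<hi]) : []
    }
}
```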
SpeakerEmbeddingModel
Protocol for speaker embedding extraction.
public protocol SpeakerEmbeddingModel: AnyObject {
    var inputSampleRate: Int { get }
    var embeddingDimension: Int { get }
    func embed(audio: [Float], sampleRate: Int) -> [Float]
}
Conforming types: WeSpeakerModel
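Embeddings are typically compared with cosine similarity to decide whether two clips come from the same speaker. A self-contained sketch (the threshold and helper name are illustrative, not part of the library):

```swift
// Cosine similarity between two equal-length embeddings: 1.0 for
// identical direction, 0.0 for orthogonal vectors.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "embeddings must share a dimension")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<a.count {
        dot   += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denom = normA.squareRoot() * normB.squareRoot()
    return denom > 0 ? dot / denom : 0
}
```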
SpeakerDiarizationModel
Protocol for speaker diarization models that assign speaker labels to audio segments.
public protocol SpeakerDiarizationModel: AnyObject {
    var inputSampleRate: Int { get }
    func diarize(audio: [Float], sampleRate: Int) -> [DiarizedSegment]
}
Conforming types: DiarizationPipeline (Pyannote), SortformerDiarizer
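Raw diarization output is often over-segmented; a common post-processing step merges consecutive same-speaker segments into turns. A hypothetical sketch (the struct mirrors the DiarizedSegment shared type below, and the 0.5 s gap threshold is an arbitrary example):

```swift
struct DiarizedSegmentSketch { var startTime: Float; var endTime: Float; let speakerId: Int }

// Collapse consecutive segments from the same speaker when the silence
// between them is at most `maxGap` seconds.
func mergeTurns(_ segments: [DiarizedSegmentSketch], maxGap: Float = 0.5) -> [DiarizedSegmentSketch] {
    var result: [DiarizedSegmentSketch] = []
    for seg in segments {
        if var last = result.last,
           last.speakerId == seg.speakerId,
           seg.startTime - last.endTime <= maxGap {
            last.endTime = max(last.endTime, seg.endTime)
            result[result.count - 1] = last
        } else {
            result.append(seg)
        }
    }
    return result
}
```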
SpeakerExtractionCapable
Extended diarization protocol for engines that support extracting a target speaker's segments using a reference embedding. Not all engines support this (Sortformer is end-to-end and does not produce speaker embeddings).
public protocol SpeakerExtractionCapable: SpeakerDiarizationModel {
    func extractSpeaker(audio: [Float], sampleRate: Int, targetEmbedding: [Float]) -> [SpeechSegment]
}
Conforming types: DiarizationPipeline (Pyannote only)
Shared Types
AudioChunk
public struct AudioChunk {
    public let samples: [Float]  // PCM samples
    public let sampleRate: Int   // Sample rate (e.g. 24000)
}
SpeechSegment
public struct SpeechSegment {
    public let startTime: Float  // Start time in seconds
    public let endTime: Float    // End time in seconds
}
AlignedWord
public struct AlignedWord {
    public let text: String      // The word
    public let startTime: Float  // Start time in seconds
    public let endTime: Float    // End time in seconds
}
DiarizedSegment
public struct DiarizedSegment {
    public let startTime: Float  // Start time in seconds
    public let endTime: Float    // End time in seconds
    public let speakerId: Int    // Speaker identifier (0-based)
}
DialogueSegment
A parsed segment of multi-speaker dialogue text with optional speaker and emotion tags. Used with DialogueParser and DialogueSynthesizer for CosyVoice3 dialogue synthesis.
public struct DialogueSegment: Sendable, Equatable {
    public let speaker: String?  // Speaker identifier ("S1", "S2"), nil for untagged
    public let emotion: String?  // Emotion tag ("happy", "whispers"), nil if none
    public let text: String      // Cleaned text to synthesize
}
DialogueParser
Parses multi-speaker dialogue text with inline speaker tags ([S1]) and emotion tags ((happy)).
public enum DialogueParser {
    static func parse(_ text: String) -> [DialogueSegment]
    static func emotionToInstruction(_ emotion: String) -> String
}
Built-in emotions: happy/excited, sad, angry, whispers/whispering, laughs/laughing, calm, surprised, serious. Unknown tags pass through as freeform instructions.
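To make the tag format concrete, here is a minimal hypothetical parser for a single line with an optional leading [S#] speaker tag and an optional (emotion) tag. This is NOT the library's implementation (DialogueParser handles full multi-segment text); it only illustrates the tag syntax described above.

```swift
import Foundation

struct DialogueSegmentSketch: Equatable {
    let speaker: String?
    let emotion: String?
    let text: String
}

// Parse one line of the form "[S1] (happy) Hello there!" into its parts.
func parseLine(_ line: String) -> DialogueSegmentSketch {
    var rest = line.trimmingCharacters(in: .whitespaces)
    var speaker: String? = nil
    var emotion: String? = nil
    if rest.hasPrefix("["), let close = rest.firstIndex(of: "]") {
        speaker = String(rest[rest.index(after: rest.startIndex)..<close])
        rest = String(rest[rest.index(after: close)...]).trimmingCharacters(in: .whitespaces)
    }
    if rest.hasPrefix("("), let close = rest.firstIndex(of: ")") {
        emotion = String(rest[rest.index(after: rest.startIndex)..<close])
        rest = String(rest[rest.index(after: close)...]).trimmingCharacters(in: .whitespaces)
    }
    return DialogueSegmentSketch(speaker: speaker, emotion: emotion, text: rest)
}
```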
DialogueSynthesizer
Orchestrates multi-segment dialogue synthesis with per-speaker voice cloning, silence gaps, and crossfade.
public enum DialogueSynthesizer {
    static func synthesize(
        segments: [DialogueSegment],
        speakerEmbeddings: [String: [Float]],
        model: CosyVoiceTTSModel,
        language: String,
        config: DialogueSynthesisConfig,
        verbose: Bool
    ) -> [Float]
}
DialogueSynthesisConfig
public struct DialogueSynthesisConfig: Sendable {
    public var turnGapSeconds: Float       // Default: 0.2
    public var crossfadeSeconds: Float     // Default: 0.0
    public var defaultInstruction: String  // Default: "You are a helpful assistant."
    public var maxTokensPerSegment: Int    // Default: 500
}
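The crossfadeSeconds setting blends the tail of one segment into the head of the next. A minimal linear-crossfade sketch of that idea (a hypothetical helper, not the library's code; fadeSamples would be crossfadeSeconds * sampleRate):

```swift
// Concatenate two segments, blending the last `fadeSamples` of `a`
// with the first `fadeSamples` of `b` on a linear ramp.
func crossfade(_ a: [Float], _ b: [Float], fadeSamples: Int) -> [Float] {
    let n = min(fadeSamples, a.count, b.count)
    guard n > 0 else { return a + b }
    var out = Array(a.dropLast(n))
    for i in 0..<n {
        let t = Float(i + 1) / Float(n + 1)  // ramp from 0 toward 1
        out.append(a[a.count - n + i] * (1 - t) + b[i] * t)
    }
    out.append(contentsOf: b.dropFirst(n))
    return out
}
```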
PipelineLLM
Protocol for language model integration with voice pipelines. Bridges an LLM to the VoicePipeline's ASR → LLM → TTS flow.
public protocol PipelineLLM: AnyObject {
    func chat(messages: [(role: MessageRole, content: String)],
              onToken: @escaping (String, Bool) -> Void)
    func cancel()
}
Built-in adapter: Qwen3PipelineLLM bridges Qwen3ChatModel to this protocol with token cleanup, cancellation, and pending phrase accumulation.
AudioIO
Reusable audio I/O manager that eliminates AVAudioEngine boilerplate. Handles mic capture, resampling, playback, and audio level metering.
let audio = AudioIO()
try audio.startMicrophone(targetSampleRate: 16000) { samples in
    pipeline.pushAudio(samples)
}
audio.player.scheduleChunk(ttsOutput)
audio.stopMicrophone()
AudioIO includes a StreamingAudioPlayer for TTS output and an AudioRingBuffer for thread-safe audio transfer between capture and inference threads.
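The producer/consumer contract behind that ring buffer can be sketched as a lock-protected FIFO. This simplified stand-in is not AudioRingBuffer itself (the real type is presumably preallocated and bounded), but it shows the capture-thread push / inference-thread pop pattern:

```swift
import Foundation

// Minimal locked FIFO in the spirit of AudioRingBuffer.
final class SimpleAudioFIFO {
    private var buffer: [Float] = []
    private let lock = NSLock()

    // Called from the capture callback (producer thread).
    func push(_ samples: [Float]) {
        lock.lock(); defer { lock.unlock() }
        buffer.append(contentsOf: samples)
    }

    // Called from the inference loop (consumer thread); returns up to
    // `count` samples, fewer if the buffer holds less.
    func pop(_ count: Int) -> [Float] {
        lock.lock(); defer { lock.unlock() }
        let n = min(count, buffer.count)
        let out = Array(buffer.prefix(n))
        buffer.removeFirst(n)
        return out
    }
}
```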
Model Downloads
All models are downloaded from HuggingFace on first use and cached in ~/Library/Caches/qwen3-speech/. The AudioCommon module provides a shared HuggingFaceDownloader that handles download, caching, and integrity verification.