API and Protocols

The AudioCommon module defines model-agnostic protocols and shared types. Any model that conforms to them can be used interchangeably through these interfaces.

Protocol Overview

┌─────────────────────────────────────────────────────────┐
│                    AudioCommon                          │
│                                                         │
│  AudioChunk          SpeechGenerationModel (TTS)        │
│  AlignedWord         SpeechRecognitionModel (STT)       │
│  SpeechSegment       ForcedAlignmentModel               │
│                      SpeechToSpeechModel                │
│                      VoiceActivityDetectionModel (VAD)  │
│                      SpeakerEmbeddingModel              │
│                      SpeakerDiarizationModel            │
│                      SpeakerExtractionCapable           │
└─────────────────────────────────────────────────────────┘

SpeechRecognitionModel

Protocol for speech-to-text models.

public protocol SpeechRecognitionModel: AnyObject {
    var inputSampleRate: Int { get }
    func transcribe(audio: [Float], sampleRate: Int, language: String?) -> String
    func transcribeWithLanguage(audio: [Float], sampleRate: Int, language: String?) -> TranscriptionResult
}

Conforming types: Qwen3ASRModel, ParakeetASRModel, ParakeetStreamingASRModel, OmnilingualASRModel (CoreML), OmnilingualASRMLXModel (MLX)

SpeechGenerationModel

Protocol for text-to-speech models.

public protocol SpeechGenerationModel: AnyObject {
    var sampleRate: Int { get }
    func generate(text: String, language: String?) async throws -> [Float]
    func generateStream(text: String, language: String?) -> AsyncThrowingStream<AudioChunk, Error>  // has default impl
}

generateStream() has a default implementation that wraps the result of generate() in a single chunk. Models that support true streaming (e.g. Qwen3-TTS) override it.

Conforming types: Qwen3TTSModel, CosyVoiceTTSModel, KokoroTTSModel, Qwen35MLXChat
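
The default streaming behavior described above can be sketched as a protocol extension. This is an illustration under assumptions, not the library's source: the types are local stand-ins that mirror the declarations shown here, `SilentTTS` and `collectChunks` are hypothetical names, and the code targets the Swift 5 language mode (strict Swift 6 concurrency checking would require additional `Sendable` annotations).

```swift
import Foundation

// Local stand-ins mirroring the declarations above so the sketch compiles on its own.
struct AudioChunk {
    let samples: [Float]
    let sampleRate: Int
}

protocol SpeechGenerationModel: AnyObject {
    var sampleRate: Int { get }
    func generate(text: String, language: String?) async throws -> [Float]
    func generateStream(text: String, language: String?) -> AsyncThrowingStream<AudioChunk, Error>
}

extension SpeechGenerationModel {
    // Assumed shape of the default: run generate() once, yield one chunk, then finish.
    func generateStream(text: String, language: String?) -> AsyncThrowingStream<AudioChunk, Error> {
        AsyncThrowingStream { continuation in
            let task = Task {
                do {
                    let samples = try await self.generate(text: text, language: language)
                    continuation.yield(AudioChunk(samples: samples, sampleRate: self.sampleRate))
                    continuation.finish()
                } catch {
                    continuation.finish(throwing: error)
                }
            }
            // Stop the work if the consumer abandons the stream.
            continuation.onTermination = { _ in task.cancel() }
        }
    }
}

// Hypothetical model: one sample of silence per input character.
final class SilentTTS: SpeechGenerationModel {
    let sampleRate = 24000
    func generate(text: String, language: String?) async throws -> [Float] {
        [Float](repeating: 0, count: text.count)
    }
}

// Consumes any conforming model through the streaming interface.
func collectChunks(from model: any SpeechGenerationModel, text: String) async throws -> [AudioChunk] {
    var chunks: [AudioChunk] = []
    for try await chunk in model.generateStream(text: text, language: nil) {
        chunks.append(chunk)
    }
    return chunks
}
```

Because callers only see `generateStream()`, the same consuming loop works whether a model emits one chunk or many.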

ForcedAlignmentModel

Protocol for word-level timestamp alignment.

public protocol ForcedAlignmentModel: AnyObject {
    func align(audio: [Float], text: String, sampleRate: Int, language: String?) -> [AlignedWord]
}

SpeechToSpeechModel

Protocol for speech-to-speech dialogue models.

public protocol SpeechToSpeechModel: AnyObject {
    var sampleRate: Int { get }
    func respond(userAudio: [Float]) -> [Float]
    func respondStream(userAudio: [Float]) -> AsyncThrowingStream<AudioChunk, Error>
}

Conforming types: PersonaPlexModel

VoiceActivityDetectionModel

Protocol for voice activity detection.

public protocol VoiceActivityDetectionModel: AnyObject {
    var inputSampleRate: Int { get }
    func detectSpeech(audio: [Float], sampleRate: Int) -> [SpeechSegment]
}

SpeakerEmbeddingModel

Protocol for speaker embedding extraction.

public protocol SpeakerEmbeddingModel: AnyObject {
    var inputSampleRate: Int { get }
    var embeddingDimension: Int { get }
    func embed(audio: [Float], sampleRate: Int) -> [Float]
}

Conforming types: WeSpeakerModel
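
Speaker embeddings are commonly compared with cosine similarity. The helper below is a minimal sketch of that comparison on the `[Float]` vectors returned by `embed`; `cosineSimilarity` is a hypothetical name, and this page does not say whether the embeddings are already normalized or which decision threshold is appropriate.

```swift
import Foundation

/// Cosine similarity between two embeddings of equal length.
/// Returns nil when the lengths differ, the vectors are empty, or either has zero norm.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float? {
    guard a.count == b.count, !a.isEmpty else { return nil }
    var dot: Float = 0
    var normA: Float = 0
    var normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    guard normA > 0, normB > 0 else { return nil }
    return dot / (normA.squareRoot() * normB.squareRoot())
}

// Toy 4-dimensional vectors; real embeddings have embeddingDimension entries.
let enrolled: [Float] = [1, 0, 1, 0]
let probe: [Float] = [2, 0, 2, 0]      // same direction as `enrolled`
let other: [Float] = [0, 3, 0, 3]      // orthogonal to `enrolled`
let sameSpeakerScore = cosineSimilarity(enrolled, probe)
let otherSpeakerScore = cosineSimilarity(enrolled, other)
```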

SpeakerDiarizationModel

Protocol for speaker diarization models, which assign speaker labels to audio segments.

public protocol SpeakerDiarizationModel: AnyObject {
    var inputSampleRate: Int { get }
    func diarize(audio: [Float], sampleRate: Int) -> [DiarizedSegment]
}

Conforming types: DiarizationPipeline (Pyannote), SortformerDiarizer

SpeakerExtractionCapable

Extended diarization protocol for engines that can extract a target speaker's segments from a reference embedding. Not every engine supports it (Sortformer is end-to-end and does not produce speaker embeddings).

public protocol SpeakerExtractionCapable: SpeakerDiarizationModel {
    func extractSpeaker(audio: [Float], sampleRate: Int, targetEmbedding: [Float]) -> [SpeechSegment]
}

Conforming types: DiarizationPipeline (Pyannote only)
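
Since only some engines conform, calling code can check for the capability at runtime. The sketch below shows one way to do that with a conditional cast; the protocols and structs are local stand-ins mirroring the declarations on this page, and `EndToEndStub`, `EmbeddingStub`, and `targetSegments` are hypothetical names used only for illustration.

```swift
// Local stand-ins mirroring the declarations on this page.
struct SpeechSegment { let startTime: Float; let endTime: Float }
struct DiarizedSegment { let startTime: Float; let endTime: Float; let speakerId: Int }

protocol SpeakerDiarizationModel: AnyObject {
    var inputSampleRate: Int { get }
    func diarize(audio: [Float], sampleRate: Int) -> [DiarizedSegment]
}

protocol SpeakerExtractionCapable: SpeakerDiarizationModel {
    func extractSpeaker(audio: [Float], sampleRate: Int, targetEmbedding: [Float]) -> [SpeechSegment]
}

/// Returns the target speaker's segments, or nil when the engine
/// does not support extraction by embedding.
func targetSegments(in audio: [Float], sampleRate: Int, targetEmbedding: [Float],
                    using model: any SpeakerDiarizationModel) -> [SpeechSegment]? {
    guard let extractor = model as? any SpeakerExtractionCapable else { return nil }
    return extractor.extractSpeaker(audio: audio, sampleRate: sampleRate,
                                    targetEmbedding: targetEmbedding)
}

// Hypothetical engine without embedding support.
final class EndToEndStub: SpeakerDiarizationModel {
    let inputSampleRate = 16000
    func diarize(audio: [Float], sampleRate: Int) -> [DiarizedSegment] { [] }
}

// Hypothetical engine that supports extraction and always reports one segment.
final class EmbeddingStub: SpeakerExtractionCapable {
    let inputSampleRate = 16000
    func diarize(audio: [Float], sampleRate: Int) -> [DiarizedSegment] { [] }
    func extractSpeaker(audio: [Float], sampleRate: Int, targetEmbedding: [Float]) -> [SpeechSegment] {
        [SpeechSegment(startTime: 0, endTime: 1)]
    }
}

let unsupported = targetSegments(in: [], sampleRate: 16000, targetEmbedding: [], using: EndToEndStub())
let supported = targetSegments(in: [], sampleRate: 16000, targetEmbedding: [], using: EmbeddingStub())
```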

Shared Types

AudioChunk

public struct AudioChunk {
    public let samples: [Float]   // PCM samples
    public let sampleRate: Int    // Sample rate (e.g. 24000)
}

SpeechSegment

public struct SpeechSegment {
    public let startTime: Float   // Start time in seconds
    public let endTime: Float     // End time in seconds
}
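
Segment times are in seconds, so cutting the corresponding audio out of a sample buffer requires a conversion to sample indices. The following is a small sketch of that conversion; `sampleRange` is a hypothetical helper, and rounding to the nearest sample with clamping to the buffer is an assumption rather than documented behavior.

```swift
// Local stand-in mirroring the struct above.
struct SpeechSegment { let startTime: Float; let endTime: Float }

/// Converts a segment's start/end times to a sample range clamped to the buffer.
func sampleRange(of segment: SpeechSegment, sampleRate: Int, totalSamples: Int) -> Range<Int> {
    let start = Int((segment.startTime * Float(sampleRate)).rounded())
    let end = Int((segment.endTime * Float(sampleRate)).rounded())
    let lower = min(max(start, 0), totalSamples)
    let upper = min(max(end, lower), totalSamples)
    return lower..<upper
}

let audio = [Float](repeating: 0, count: 16000)            // 1 s at 16 kHz
let segment = SpeechSegment(startTime: 0.25, endTime: 0.75)
let range = sampleRange(of: segment, sampleRate: 16000, totalSamples: audio.count)
let speech = Array(audio[range])                            // samples inside the segment
```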

AlignedWord

public struct AlignedWord {
    public let text: String       // The word
    public let startTime: Float   // Start time in seconds
    public let endTime: Float     // End time in seconds
}

DiarizedSegment

public struct DiarizedSegment {
    public let startTime: Float   // Start time in seconds
    public let endTime: Float     // End time in seconds
    public let speakerId: Int     // Speaker identifier (0-based)
}
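
As an example of consuming diarization output, the sketch below sums speaking time per `speakerId`. `talkTime` is a hypothetical helper, and the segment values are made up for illustration.

```swift
// Local stand-in mirroring the struct above.
struct DiarizedSegment {
    let startTime: Float
    let endTime: Float
    let speakerId: Int
}

/// Total speaking time in seconds for each speaker id.
func talkTime(of segments: [DiarizedSegment]) -> [Int: Float] {
    var totals: [Int: Float] = [:]
    for segment in segments {
        // Ignore malformed segments whose end precedes their start.
        totals[segment.speakerId, default: 0] += max(0, segment.endTime - segment.startTime)
    }
    return totals
}

let diarized = [
    DiarizedSegment(startTime: 0.0, endTime: 2.0, speakerId: 0),
    DiarizedSegment(startTime: 2.0, endTime: 3.5, speakerId: 1),
    DiarizedSegment(startTime: 3.5, endTime: 4.0, speakerId: 0),
]
let totals = talkTime(of: diarized)
```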

DialogueSegment

A parsed segment of multi-speaker dialogue text with optional speaker and emotion tags. Used together with DialogueParser and DialogueSynthesizer for CosyVoice3 dialogue synthesis.

public struct DialogueSegment: Sendable, Equatable {
    public let speaker: String?   // Speaker identifier ("S1", "S2"), nil for untagged
    public let emotion: String?   // Emotion tag ("happy", "whispers"), nil if none
    public let text: String       // Cleaned text to synthesize
}

DialogueParser

Parses multi-speaker dialogue text that contains inline speaker tags ([S1]) and emotion tags ((happy)).

public enum DialogueParser {
    static func parse(_ text: String) -> [DialogueSegment]
    static func emotionToInstruction(_ emotion: String) -> String
}

Built-in emotions: happy/excited, sad, angry, whispers/whispering, laughs/laughing, calm, surprised, serious. Unknown tags are passed through as free-form instructions.
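
To make the tag format concrete, here is a toy parser for a simplified grammar. It is not DialogueParser: it assumes one turn per line, with an optional `[speaker]` tag followed by an optional `(emotion)` tag at the start of the line, whereas the real parser may accept tags elsewhere. `parseDialogue` and `takeTag` are hypothetical names.

```swift
import Foundation

// Local stand-in mirroring DialogueSegment above.
struct DialogueSegment: Equatable {
    let speaker: String?
    let emotion: String?
    let text: String
}

/// Removes one leading `open...close` tag from `line` and returns its contents,
/// or returns nil and leaves `line` untouched when no such tag is present.
func takeTag(_ line: inout Substring, open: Character, close: Character) -> String? {
    guard line.first == open, let end = line.firstIndex(of: close) else { return nil }
    let tag = String(line[line.index(after: line.startIndex)..<end])
    line = line[line.index(after: end)...].drop(while: { $0 == " " })
    return tag
}

/// Toy parser: one turn per line, tags only at the start of the line.
func parseDialogue(_ text: String) -> [DialogueSegment] {
    var result: [DialogueSegment] = []
    for rawLine in text.split(separator: "\n") {
        var line = rawLine.drop(while: { $0 == " " })
        let speaker = takeTag(&line, open: "[", close: "]")
        let emotion = takeTag(&line, open: "(", close: ")")
        let cleaned = line.trimmingCharacters(in: .whitespaces)
        if !cleaned.isEmpty {
            result.append(DialogueSegment(speaker: speaker, emotion: emotion, text: cleaned))
        }
    }
    return result
}

let script = """
[S1] (happy) Hi there!
[S2] Hello.
Narration line.
"""
let parsed = parseDialogue(script)
```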

DialogueSynthesizer

Orchestrates multi-segment dialogue synthesis, including per-speaker voice cloning, silence gaps, and crossfades.

public enum DialogueSynthesizer {
    static func synthesize(
        segments: [DialogueSegment],
        speakerEmbeddings: [String: [Float]],
        model: CosyVoiceTTSModel,
        language: String,
        config: DialogueSynthesisConfig,
        verbose: Bool
    ) -> [Float]
}

DialogueSynthesisConfig

public struct DialogueSynthesisConfig: Sendable {
    public var turnGapSeconds: Float      // Default: 0.2
    public var crossfadeSeconds: Float    // Default: 0.0
    public var defaultInstruction: String // Default: "You are a helpful assistant."
    public var maxTokensPerSegment: Int   // Default: 500
}
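
The effect of `turnGapSeconds` can be pictured as silence inserted between consecutive turns. The sketch below models only that gap, under the assumption that it is plain zero-valued samples; crossfading and the synthesizer's actual mixing logic are not covered, and `joinTurns` is a hypothetical helper.

```swift
/// Concatenates per-turn audio with `gapSeconds` of silence between turns.
func joinTurns(_ turns: [[Float]], sampleRate: Int, gapSeconds: Float) -> [Float] {
    let gapSamples = max(0, Int((gapSeconds * Float(sampleRate)).rounded()))
    let silence = [Float](repeating: 0, count: gapSamples)
    var output: [Float] = []
    for (index, turn) in turns.enumerated() {
        if index > 0 { output += silence }   // gap before every turn except the first
        output += turn
    }
    return output
}

let firstTurn = [Float](repeating: 0.1, count: 24000)    // 1.0 s at 24 kHz
let secondTurn = [Float](repeating: 0.2, count: 12000)   // 0.5 s at 24 kHz
// The default gap of 0.2 s is 4800 samples at 24 kHz.
let joined = joinTurns([firstTurn, secondTurn], sampleRate: 24000, gapSeconds: 0.2)
```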

PipelineLLM

Protocol for integrating a language model into the voice pipeline. It plugs an LLM into VoicePipeline's ASR → LLM → TTS flow.

public protocol PipelineLLM: AnyObject {
    func chat(messages: [(role: MessageRole, content: String)],
              onToken: @escaping (String, Bool) -> Void)
    func cancel()
}

Built-in adapter: Qwen3PipelineLLM bridges Qwen35MLXChat to this protocol, handling token cleanup, cancellation, and pending-phrase accumulation.

AudioIO

A reusable audio I/O manager that removes the AVAudioEngine boilerplate. It handles microphone capture, resampling, playback, and audio level metering.

let audio = AudioIO()
try audio.startMicrophone(targetSampleRate: 16000) { samples in
    pipeline.pushAudio(samples)
}
audio.player.scheduleChunk(ttsOutput)
audio.stopMicrophone()

AudioIO includes StreamingAudioPlayer for TTS output and AudioRingBuffer for thread-safe audio transfer between the capture and inference threads.
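
For readers unfamiliar with the pattern, the following is a generic sketch of a lock-guarded ring buffer for handing samples from a capture thread to a consumer thread. It is not the library's AudioRingBuffer: the class name `SampleRingBuffer`, its methods, and the overwrite-oldest policy when full are all assumptions made for illustration.

```swift
import Foundation

/// Fixed-capacity FIFO of samples guarded by a lock.
/// When the buffer is full, the oldest samples are overwritten.
final class SampleRingBuffer {
    private var storage: [Float]
    private var readIndex = 0
    private var count = 0
    private let lock = NSLock()

    init(capacity: Int) {
        precondition(capacity > 0, "capacity must be positive")
        storage = [Float](repeating: 0, count: capacity)
    }

    /// Appends samples; called from the producer (capture) thread.
    func write(_ samples: [Float]) {
        lock.lock(); defer { lock.unlock() }
        for sample in samples {
            let writeIndex = (readIndex + count) % storage.count
            storage[writeIndex] = sample
            if count == storage.count {
                readIndex = (readIndex + 1) % storage.count   // full: drop the oldest sample
            } else {
                count += 1
            }
        }
    }

    /// Removes and returns up to `maxCount` samples in arrival order.
    func read(upTo maxCount: Int) -> [Float] {
        lock.lock(); defer { lock.unlock() }
        let n = min(max(maxCount, 0), count)
        var out: [Float] = []
        out.reserveCapacity(n)
        for i in 0..<n {
            out.append(storage[(readIndex + i) % storage.count])
        }
        readIndex = (readIndex + n) % storage.count
        count -= n
        return out
    }
}

let ring = SampleRingBuffer(capacity: 4)
ring.write([1, 2, 3])
let firstRead = ring.read(upTo: 2)
ring.write([4, 5, 6])                 // wraps around the end of the storage
let secondRead = ring.read(upTo: 10)
```

A lock keeps the sketch short; real-time audio callbacks often prefer lock-free designs to avoid blocking the audio thread.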

SentencePieceModel

A shared protobuf reader for SentencePiece .model files, located in AudioCommon. Every module that needs to decode SentencePiece pieces (PersonaPlex, OmnilingualASR, future ASR / TTS ports) builds its own decoder on top of this single reader instead of reimplementing the protobuf wire format.

public struct SentencePieceModel: Sendable {
    public struct Piece: Sendable, Equatable {
        public let text: String
        public let score: Float
        public let type: Int32
        public var pieceType: PieceType? { get }
        public var isControlOrUnknown: Bool { get }
    }
    public enum PieceType: Int32 {
        case normal = 1, unknown = 2, control = 3,
             userDefined = 4, unused = 5, byte = 6
    }
    public let pieces: [Piece]
    public var count: Int { get }
    public subscript(_ id: Int) -> Piece? { get }
    public init(contentsOf url: URL) throws
    public init(modelPath: String) throws
    public init(data: Data) throws
}

Used by: OmnilingualASR.OmnilingualVocabulary, PersonaPlex.SentencePieceDecoder. Covered by 7 unit tests in Tests/AudioCommonTests/SentencePieceModelTests.
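
A decoder built on top of such a reader typically concatenates piece texts and turns the SentencePiece word-boundary marker "▁" (U+2581) into spaces. The sketch below shows that step on a local stand-in `Piece` type with a made-up vocabulary. It is not the decoder used by either module: treating types 2 and 3 as skippable is an assumption about what `isControlOrUnknown` covers, and byte-fallback pieces (type 6) are not handled.

```swift
import Foundation

// Minimal stand-in for SentencePieceModel.Piece; type values follow PieceType above.
struct Piece {
    let text: String
    let type: Int32   // 1 = normal, 2 = unknown, 3 = control
}

/// Joins token ids into text, skipping control/unknown pieces and out-of-range ids,
/// then maps the word-boundary marker U+2581 to a space.
func decode(ids: [Int], vocabulary: [Piece]) -> String {
    var text = ""
    for id in ids where vocabulary.indices.contains(id) {
        let piece = vocabulary[id]
        if piece.type == 2 || piece.type == 3 { continue }
        text += piece.text
    }
    let spaced = text.replacingOccurrences(of: "\u{2581}", with: " ")
    return spaced.trimmingCharacters(in: .whitespaces)
}

// Hypothetical six-entry vocabulary.
let vocabulary = [
    Piece(text: "<unk>", type: 2),
    Piece(text: "<s>", type: 3),
    Piece(text: "</s>", type: 3),
    Piece(text: "\u{2581}hello", type: 1),
    Piece(text: "\u{2581}wor", type: 1),
    Piece(text: "ld", type: 1),
]
let decoded = decode(ids: [1, 3, 4, 5, 2], vocabulary: vocabulary)
```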

MLXCommon.SDPA

A scaled dot-product attention helper shared by all MLX attention modules (Qwen3-ASR / Qwen3-TTS / Qwen3-Chat / CosyVoice / PersonaPlex / OmnilingualASR). Each module keeps its own projections; SDPA handles only the reshape → attention → merge boilerplate.

public enum SDPA {
    // Flat [B, T, H*D] input: project/reshape happens inside
    public static func multiHead(
        q: MLXArray, k: MLXArray, v: MLXArray,
        numHeads: Int, headDim: Int, scale: Float,
        mask: MLXArray? = nil
    ) -> MLXArray

    // GQA / MQA variant with separate query and KV head counts
    public static func multiHead(
        q: MLXArray, k: MLXArray, v: MLXArray,
        numQueryHeads: Int, numKVHeads: Int, headDim: Int, scale: Float,
        mask: MLXArray? = nil
    ) -> MLXArray

    // Already-shaped [B, H, T, D] (RoPE / KV cache paths)
    public static func attendAndMerge(
        qHeads: MLXArray, kHeads: MLXArray, vHeads: MLXArray,
        scale: Float,
        mask: MLXArray? = nil
    ) -> MLXArray

    // Same, with ScaledDotProductAttentionMaskMode enum (newer API)
    public static func attendAndMerge(
        qHeads: MLXArray, kHeads: MLXArray, vHeads: MLXArray,
        scale: Float,
        mask: MLXFast.ScaledDotProductAttentionMaskMode
    ) -> MLXArray

    // Low-level head merge: [B, H, T, D] → [B, T, H*D]
    public static func mergeHeads(_ attn: MLXArray) -> MLXArray
}

Every reshape call uses -1 for the batch dimension, so the helpers compose with MLX.compile(shapeless:) graphs whose batch size varies at runtime (e.g. the Qwen3-TTS Talker autoregressive decode).
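
The head-merge step, [B, H, T, D] → [B, T, H*D], amounts to swapping the head and time axes and then flattening the last two. The sketch below spells out that index mapping on a flat row-major `[Float]` buffer. It only illustrates the shape contract and does not use MLX; the free function and its explicit dimension parameters are hypothetical.

```swift
/// Rearranges a row-major [B, H, T, D] buffer into row-major [B, T, H*D].
func mergeHeads(_ x: [Float], batch: Int, heads: Int, time: Int, dim: Int) -> [Float] {
    precondition(x.count == batch * heads * time * dim, "buffer size must match the shape")
    var out = [Float](repeating: 0, count: x.count)
    for b in 0..<batch {
        for h in 0..<heads {
            for t in 0..<time {
                for d in 0..<dim {
                    let src = ((b * heads + h) * time + t) * dim + d        // index in [B, H, T, D]
                    let dst = (b * time + t) * (heads * dim) + h * dim + d  // index in [B, T, H*D]
                    out[dst] = x[src]
                }
            }
        }
    }
    return out
}

// B = 1, H = 2, T = 2, D = 2.
// Head 0 holds [0, 1] at t0 and [2, 3] at t1; head 1 holds [4, 5] at t0 and [6, 7] at t1.
let perHead: [Float] = [0, 1, 2, 3, 4, 5, 6, 7]
let merged = mergeHeads(perHead, batch: 1, heads: 2, time: 2, dim: 2)
```

After merging, each time step holds the features of all heads side by side, which is the layout an output projection expects.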

HTTP API Server

The speech-server binary exposes every speech-swift model through HTTP REST endpoints, plus a WebSocket endpoint that implements the OpenAI Realtime API. Models are loaded lazily on first request; passing --preload warms them all up at startup.

swift build -c release
.build/release/speech-server --port 8080

# Preload all models at startup
.build/release/speech-server --port 8080 --preload

REST Endpoints

Endpoint	Method	Request	Response
`/transcribe`	POST	`audio/wav` body	JSON `{ text }` (Qwen3-ASR)
`/speak`	POST	JSON `{ text, engine?, language?, voice? }`	`audio/wav` body (Qwen3-TTS, CosyVoice, Kokoro)
`/respond`	POST	`audio/wav` body	`audio/wav` body (PersonaPlex)
`/enhance`	POST	`audio/wav` body	`audio/wav` body (DeepFilterNet3)
`/vad`	POST	`audio/wav` body	JSON list of segments
`/diarize`	POST	`audio/wav` body	JSON list of `DiarizedSegment`
`/embed-speaker`	POST	`audio/wav` body	JSON `[Float]` (256-dimensional)

# Transcribe a file
curl -X POST http://localhost:8080/transcribe \
  --data-binary @recording.wav \
  -H "Content-Type: audio/wav"

# Synthesize speech
curl -X POST http://localhost:8080/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "engine": "cosyvoice"}' \
  -o output.wav

# Full speech-to-speech round trip
curl -X POST http://localhost:8080/respond \
  --data-binary @question.wav \
  -o response.wav
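
The same `/speak` call can be prepared from Swift. The sketch below only builds the request, mirroring the curl example, and does not send it; `makeSpeakRequest` is a hypothetical helper, and the base URL assumes a server started locally on port 8080 as shown above.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking   // URLRequest lives here on Linux
#endif

/// Builds (but does not send) a POST /speak request with a JSON body.
func makeSpeakRequest(baseURL: URL, text: String,
                      engine: String? = nil, language: String? = nil) throws -> URLRequest {
    var payload: [String: String] = ["text": text]
    if let engine = engine { payload["engine"] = engine }
    if let language = language { payload["language"] = language }

    var request = URLRequest(url: baseURL.appendingPathComponent("speak"))
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: payload)
    return request
}

let speakRequest = try makeSpeakRequest(
    baseURL: URL(string: "http://localhost:8080")!,
    text: "Hello world",
    engine: "cosyvoice"
)
```

Sending it with `URLSession` and writing the returned bytes to a `.wav` file would complete the round trip.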

OpenAI Realtime API (`/v1/realtime`)

The WebSocket endpoint at ws://host:port/v1/realtime implements the OpenAI Realtime protocol. Every message is JSON with a type discriminator, and audio payloads are base64-encoded PCM16 at 24 kHz mono.
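
Producing such a payload from `[Float]` samples takes two steps: quantize to 16-bit integers, then base64-encode the bytes. The sketch below assumes little-endian byte order and samples in the range [-1, 1], neither of which is stated on this page; `base64PCM16` and `appendEvent` are hypothetical helpers.

```swift
import Foundation

/// Converts Float samples to little-endian PCM16 and base64-encodes the bytes.
/// Out-of-range samples are clamped to [-1, 1]; non-finite samples become 0.
func base64PCM16(from samples: [Float]) -> String {
    var data = Data(capacity: samples.count * 2)
    for sample in samples {
        let safe = sample.isFinite ? max(-1, min(1, sample)) : 0
        let value = Int16((safe * 32767).rounded())
        let bits = UInt16(bitPattern: value)
        data.append(UInt8(bits & 0xFF))   // low byte first (little-endian, assumed)
        data.append(UInt8(bits >> 8))
    }
    return data.base64EncodedString()
}

/// Builds the JSON text of an `input_audio_buffer.append` event.
func appendEvent(samples: [Float]) throws -> String {
    let message: [String: String] = [
        "type": "input_audio_buffer.append",
        "audio": base64PCM16(from: samples),
    ]
    let json = try JSONSerialization.data(withJSONObject: message)
    return String(decoding: json, as: UTF8.self)
}

let encoded = base64PCM16(from: [0, 1, -1])
let eventText = try appendEvent(samples: [0, 1, -1])
```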

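Client → Server Events

Event	Purpose
`session.update`	Configure the engine, language, voice, and audio format
`input_audio_buffer.append`	Append a base64 PCM16 chunk to the input buffer
`input_audio_buffer.commit`	Commit the buffered audio for transcription
`input_audio_buffer.clear`	Discard the current input buffer
`response.create`	Request TTS synthesis for the supplied text/instructions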

Server → Client Events

Event	Meaning
`session.created`	Handshake complete; default configuration sent
`session.updated`	Acknowledges the most recent `session.update`
`input_audio_buffer.committed`	Audio accepted and queued for transcription
`conversation.item.input_audio_transcription.completed`	ASR result containing the final transcript text
`response.audio.delta`	Base64 PCM16 chunk of synthesized audio
`response.audio.done`	End of audio chunks for this response
`response.done`	Response finalized (metadata + latency statistics)
`error`	Error envelope containing `type` and `message`

const ws = new WebSocket('ws://localhost:8080/v1/realtime');

// ASR: send audio, then request transcription
ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64PCM16 }));
ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
// → conversation.item.input_audio_transcription.completed

// TTS: request synthesis and stream audio deltas
ws.send(JSON.stringify({
  type: 'response.create',
  response: { modalities: ['audio', 'text'], instructions: 'Hello world' }
}));
// → response.audio.delta (repeated), response.audio.done, response.done

The server ships in the AudioServer SPM product. An example browser client is provided at Examples/websocket-client.html; open it alongside a running server to drive a full ASR + TTS round trip.

Model Downloads

All models are downloaded from HuggingFace on first use and cached in ~/Library/Caches/qwen3-speech/. The AudioCommon module provides a shared HuggingFaceDownloader that handles downloading, caching, and integrity verification.