APIs and Protocols

The AudioCommon module defines model-agnostic protocols and shared data types. Any model that conforms to these protocols can be swapped for another through those interfaces.

Protocol Overview

┌─────────────────────────────────────────────────────────┐
│                    AudioCommon                          │
│                                                         │
│  AudioChunk          SpeechGenerationModel (TTS)        │
│  AlignedWord         SpeechRecognitionModel (STT)       │
│  SpeechSegment       ForcedAlignmentModel               │
│                      SpeechToSpeechModel                │
│                      VoiceActivityDetectionModel (VAD)  │
│                      SpeakerEmbeddingModel              │
│                      SpeakerDiarizationModel            │
│                      SpeakerExtractionCapable           │
└─────────────────────────────────────────────────────────┘

SpeechRecognitionModel

Protocol for speech-to-text models.

public protocol SpeechRecognitionModel: AnyObject {
    var inputSampleRate: Int { get }
    func transcribe(audio: [Float], sampleRate: Int, language: String?) -> String
    func transcribeWithLanguage(audio: [Float], sampleRate: Int, language: String?) -> TranscriptionResult
}

Conforming types: Qwen3ASRModel, WhisperASRModel, ParakeetASRModel, ParakeetStreamingASRModel, OmnilingualASRModel (CoreML), OmnilingualASRMLXModel (MLX)

SpeechGenerationModel

Protocol for text-to-speech models.

public protocol SpeechGenerationModel: AnyObject {
    var sampleRate: Int { get }
    func generate(text: String, language: String?) async throws -> [Float]
    func generateStream(text: String, language: String?) -> AsyncThrowingStream<AudioChunk, Error>  // has default impl
}

generateStream() has a default implementation that wraps generate() and emits the result as a single chunk. Models with true streaming support (for example Qwen3-TTS) override it.
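
The fallback described above can be sketched as a protocol extension. This is a minimal sketch, not the library's actual source: the protocol and AudioChunk are re-declared locally so the snippet compiles on its own, SilenceTTS and collectChunks are hypothetical names introduced for illustration, and the extension body is an assumption about how a single-chunk wrapper might look (Swift 5 language mode assumed).

```swift
import Foundation

// Local copies of the documented declarations so the sketch compiles standalone.
struct AudioChunk {
    let samples: [Float]
    let sampleRate: Int
}

protocol SpeechGenerationModel: AnyObject {
    var sampleRate: Int { get }
    func generate(text: String, language: String?) async throws -> [Float]
    func generateStream(text: String, language: String?) -> AsyncThrowingStream<AudioChunk, Error>
}

// Assumed shape of the default: run generate() once and yield one chunk.
extension SpeechGenerationModel {
    func generateStream(text: String, language: String?) -> AsyncThrowingStream<AudioChunk, Error> {
        AsyncThrowingStream { continuation in
            let task = Task {
                do {
                    let samples = try await self.generate(text: text, language: language)
                    continuation.yield(AudioChunk(samples: samples, sampleRate: self.sampleRate))
                    continuation.finish()
                } catch {
                    continuation.finish(throwing: error)
                }
            }
            // Cancel the generation if the consumer stops iterating early.
            continuation.onTermination = { _ in task.cancel() }
        }
    }
}

// Hypothetical stub model: 10 ms of silence per input character.
final class SilenceTTS: SpeechGenerationModel {
    let sampleRate = 24_000
    func generate(text: String, language: String?) async throws -> [Float] {
        [Float](repeating: 0, count: text.count * sampleRate / 100)
    }
}

// Collects every chunk a model's stream produces.
func collectChunks(from model: any SpeechGenerationModel, text: String) async throws -> [AudioChunk] {
    var chunks: [AudioChunk] = []
    for try await chunk in model.generateStream(text: text, language: nil) {
        chunks.append(chunk)
    }
    return chunks
}
```

Because SilenceTTS does not override generateStream(), a caller iterating the stream receives exactly one chunk holding the full output; a model with real streaming would yield several smaller chunks through the same interface.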

Conforming types: Qwen3TTSModel, CosyVoiceTTSModel, VoxCPM2TTSModel, KokoroTTSModel, IndexTTS2TTSModel

IndexTTS2TTSModel adds a generate overload that takes reference audio for zero-shot cloning, and uses IndexTTS2SynthesisOptions to control speaking rate and pauses.

ForcedAlignmentModel

Protocol for word-level timestamp alignment.

public protocol ForcedAlignmentModel: AnyObject {
    func align(audio: [Float], text: String, sampleRate: Int, language: String?) -> [AlignedWord]
}

SpeechToSpeechModel

Protocol for speech-to-speech conversational models.

public protocol SpeechToSpeechModel: AnyObject {
    var sampleRate: Int { get }
    func respond(userAudio: [Float]) -> [Float]
    func respondStream(userAudio: [Float]) -> AsyncThrowingStream<AudioChunk, Error>
}

Conforming types: PersonaPlexModel

VoiceActivityDetectionModel

Protocol for voice activity detection.

public protocol VoiceActivityDetectionModel: AnyObject {
    var inputSampleRate: Int { get }
    func detectSpeech(audio: [Float], sampleRate: Int) -> [SpeechSegment]
}

SpeakerEmbeddingModel

Protocol for speaker embedding extraction.

public protocol SpeakerEmbeddingModel: AnyObject {
    var inputSampleRate: Int { get }
    var embeddingDimension: Int { get }
    func embed(audio: [Float], sampleRate: Int) -> [Float]
}

Conforming types: WeSpeakerModel

SpeakerDiarizationModel

Protocol for speaker diarization models, which assign a speaker label to each audio segment.

public protocol SpeakerDiarizationModel: AnyObject {
    var inputSampleRate: Int { get }
    func diarize(audio: [Float], sampleRate: Int) -> [DiarizedSegment]
}

Conforming types: DiarizationPipeline (Pyannote), SortformerDiarizer

SpeakerExtractionCapable

Extension protocol for diarization engines that can extract the segments of a target speaker from a reference embedding. Not every engine supports this (Sortformer runs end-to-end and does not produce speaker embeddings).

public protocol SpeakerExtractionCapable: SpeakerDiarizationModel {
    func extractSpeaker(audio: [Float], sampleRate: Int, targetEmbedding: [Float]) -> [SpeechSegment]
}

Conforming types: DiarizationPipeline (Pyannote only)
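
Since only some engines conform, calling code has to probe for the capability at runtime. The sketch below shows one way to do that with a conditional cast. The protocols and structs are re-declared locally so the snippet is self-contained, and EndToEndStub, EmbeddingStub, and targetSegments are hypothetical names that stand in for real engines and helper code.

```swift
// Local copies of the documented declarations.
struct SpeechSegment { let startTime: Float; let endTime: Float }
struct DiarizedSegment { let startTime: Float; let endTime: Float; let speakerId: Int }

protocol SpeakerDiarizationModel: AnyObject {
    var inputSampleRate: Int { get }
    func diarize(audio: [Float], sampleRate: Int) -> [DiarizedSegment]
}

protocol SpeakerExtractionCapable: SpeakerDiarizationModel {
    func extractSpeaker(audio: [Float], sampleRate: Int, targetEmbedding: [Float]) -> [SpeechSegment]
}

// Hypothetical stub without extraction support (in the spirit of an end-to-end engine).
final class EndToEndStub: SpeakerDiarizationModel {
    let inputSampleRate = 16_000
    func diarize(audio: [Float], sampleRate: Int) -> [DiarizedSegment] { [] }
}

// Hypothetical stub with extraction support; it reports the whole clip as one segment.
final class EmbeddingStub: SpeakerExtractionCapable {
    let inputSampleRate = 16_000
    func diarize(audio: [Float], sampleRate: Int) -> [DiarizedSegment] { [] }
    func extractSpeaker(audio: [Float], sampleRate: Int, targetEmbedding: [Float]) -> [SpeechSegment] {
        [SpeechSegment(startTime: 0, endTime: Float(audio.count) / Float(sampleRate))]
    }
}

// Returns nil when the engine cannot extract a target speaker.
func targetSegments(
    audio: [Float], sampleRate: Int, embedding: [Float],
    engine: any SpeakerDiarizationModel
) -> [SpeechSegment]? {
    guard let extractor = engine as? any SpeakerExtractionCapable else { return nil }
    return extractor.extractSpeaker(audio: audio, sampleRate: sampleRate, targetEmbedding: embedding)
}
```

Returning nil rather than an empty array lets the caller distinguish "engine cannot extract" from "target speaker not found".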

Shared Types

AudioChunk

public struct AudioChunk {
    public let samples: [Float]   // PCM samples
    public let sampleRate: Int    // Sample rate (e.g. 24000)
}

SpeechSegment

public struct SpeechSegment {
    public let startTime: Float   // Start time in seconds
    public let endTime: Float     // End time in seconds
}

AlignedWord

public struct AlignedWord {
    public let text: String       // The word
    public let startTime: Float   // Start time in seconds
    public let endTime: Float     // End time in seconds
}

DiarizedSegment

public struct DiarizedSegment {
    public let startTime: Float   // Start time in seconds
    public let endTime: Float     // End time in seconds
    public let speakerId: Int     // Speaker identifier (0-based)
}
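
As a small illustration of consuming diarization output, the sketch below totals speaking time per speaker. DiarizedSegment is re-declared locally to keep the snippet standalone, and talkTime(in:) is a hypothetical helper that is not part of AudioCommon.

```swift
// Local copy of the documented struct.
struct DiarizedSegment {
    let startTime: Float   // seconds
    let endTime: Float     // seconds
    let speakerId: Int     // 0-based
}

// Hypothetical helper: sums segment durations per speaker ID.
func talkTime(in segments: [DiarizedSegment]) -> [Int: Float] {
    var totals: [Int: Float] = [:]
    for segment in segments {
        // Guard against malformed segments whose end precedes their start.
        totals[segment.speakerId, default: 0] += max(0, segment.endTime - segment.startTime)
    }
    return totals
}
```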

DialogueSegment

A segment parsed from multi-speaker dialogue text, with optional speaker and emotion tags. Used together with DialogueParser and DialogueSynthesizer for CosyVoice3 dialogue synthesis.

public struct DialogueSegment: Sendable, Equatable {
    public let speaker: String?   // Speaker identifier ("S1", "S2"), nil for untagged
    public let emotion: String?   // Emotion tag ("happy", "whispers"), nil if none
    public let text: String       // Cleaned text to synthesize
}

DialogueParser

Parses multi-speaker dialogue text with inline speaker tags ([S1]) and emotion tags ((happy)).

public enum DialogueParser {
    static func parse(_ text: String) -> [DialogueSegment]
    static func emotionToInstruction(_ emotion: String) -> String
}

Built-in emotions: happy/excited, sad, angry, whispers/whispering, laughs/laughing, calm, surprised, serious. Unrecognized tags are passed through unchanged as free-form instructions.
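
To make the tag syntax concrete, here is a minimal sketch of a parser for the same notation. It is not DialogueParser's implementation: sketchParse is a hypothetical function, DialogueSegment is re-declared locally, and the rules are assumptions (every [tag] starts a new segment, and only a (tag) at the very start of a segment is treated as its emotion).

```swift
import Foundation

// Local copy of the documented struct.
struct DialogueSegment: Equatable {
    let speaker: String?
    let emotion: String?
    let text: String
}

// Hypothetical parser: splits on [speaker] tags and reads one leading (emotion) tag.
func sketchParse(_ text: String) -> [DialogueSegment] {
    var segments: [DialogueSegment] = []
    var speaker: String? = nil
    var buffer = ""

    // Emits the buffered text as a segment for the current speaker.
    func flush() {
        var body = buffer.trimmingCharacters(in: .whitespacesAndNewlines)
        var emotion: String? = nil
        if body.hasPrefix("("), let close = body.firstIndex(of: ")") {
            emotion = String(body[body.index(after: body.startIndex)..<close])
            body = String(body[body.index(after: close)...])
                .trimmingCharacters(in: .whitespacesAndNewlines)
        }
        if !body.isEmpty {
            segments.append(DialogueSegment(speaker: speaker, emotion: emotion, text: body))
        }
        buffer = ""
    }

    var index = text.startIndex
    while index < text.endIndex {
        if text[index] == "[", let close = text[index...].firstIndex(of: "]") {
            flush()  // the previous speaker's turn ends here
            speaker = String(text[text.index(after: index)..<close])
            index = text.index(after: close)
        } else {
            buffer.append(text[index])
            index = text.index(after: index)
        }
    }
    flush()
    return segments
}
```

Text that appears before the first speaker tag ends up in a segment with a nil speaker, matching the "nil for untagged" note on DialogueSegment.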

DialogueSynthesizer

Orchestrates multi-segment dialogue synthesis with per-speaker voice cloning, silence gaps between turns, and crossfade.

public enum DialogueSynthesizer {
    static func synthesize(
        segments: [DialogueSegment],
        speakerEmbeddings: [String: [Float]],
        model: CosyVoiceTTSModel,
        language: String,
        config: DialogueSynthesisConfig,
        verbose: Bool
    ) -> [Float]
}

DialogueSynthesisConfig

public struct DialogueSynthesisConfig: Sendable {
    public var turnGapSeconds: Float      // Default: 0.2
    public var crossfadeSeconds: Float    // Default: 0.0
    public var defaultInstruction: String // Default: "You are a helpful assistant."
    public var maxTokensPerSegment: Int   // Default: 500
}
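
The effect of turnGapSeconds can be pictured as inserting silence between consecutive turns. The sketch below shows only that idea, under the assumption that crossfade is disabled (the documented default of 0.0). joinTurns is a hypothetical helper, not the code DialogueSynthesizer runs.

```swift
// Hypothetical helper: concatenates per-turn audio with a fixed silence gap.
func joinTurns(_ turns: [[Float]], sampleRate: Int, turnGapSeconds: Float = 0.2) -> [Float] {
    // Number of zero samples between turns; never negative.
    let gapCount = max(0, Int((Float(sampleRate) * turnGapSeconds).rounded()))
    let gap = [Float](repeating: 0, count: gapCount)
    var output: [Float] = []
    for (index, turn) in turns.enumerated() {
        if index > 0 { output.append(contentsOf: gap) }  // no gap before the first turn
        output.append(contentsOf: turn)
    }
    return output
}
```

At 24 kHz, the default 0.2 s gap corresponds to 4800 zero samples between turns.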

PipelineLLM

Protocol for integrating a language model into speech pipelines. It bridges an LLM into the ASR → LLM → TTS flow of VoicePipeline.

public protocol PipelineLLM: AnyObject {
    func chat(messages: [(role: MessageRole, content: String)],
              onToken: @escaping (String, Bool) -> Void)
    func cancel()
}

Built-in adapter: Qwen3PipelineLLM connects Qwen35MLXChat to this protocol, with token cleanup, cancellation, and pending-phrase accumulation.

AudioIO

A reusable audio I/O manager that removes AVAudioEngine boilerplate. It handles microphone capture, resampling, playback, and audio level metering.

let audio = AudioIO()
try audio.startMicrophone(targetSampleRate: 16000) { samples in
    pipeline.pushAudio(samples)
}
audio.player.scheduleChunk(ttsOutput)
audio.stopMicrophone()

AudioIO includes StreamingAudioPlayer for TTS output and AudioRingBuffer for thread-safe audio transfer between the capture thread and the inference thread.
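
For readers unfamiliar with the pattern, a lock-protected ring buffer for handing samples from a capture thread to an inference thread might look like the sketch below. SketchRingBuffer is a hypothetical type. AudioRingBuffer's real API and overflow policy are not documented here, so the overwrite-oldest behavior is an assumption.

```swift
import Foundation

// Hypothetical fixed-capacity ring buffer guarded by a lock.
final class SketchRingBuffer {
    private var storage: [Float]
    private var readIndex = 0
    private var count = 0
    private let lock = NSLock()

    init(capacity: Int) {
        precondition(capacity > 0, "capacity must be positive")
        storage = [Float](repeating: 0, count: capacity)
    }

    // Producer side: appends samples, overwriting the oldest ones when full.
    func write(_ samples: [Float]) {
        lock.lock(); defer { lock.unlock() }
        for sample in samples {
            storage[(readIndex + count) % storage.count] = sample
            if count == storage.count {
                readIndex = (readIndex + 1) % storage.count  // oldest sample was overwritten
            } else {
                count += 1
            }
        }
    }

    // Consumer side: removes and returns up to maxCount samples, oldest first.
    func read(maxCount: Int) -> [Float] {
        lock.lock(); defer { lock.unlock() }
        let n = min(max(0, maxCount), count)
        var out: [Float] = []
        out.reserveCapacity(n)
        for offset in 0..<n {
            out.append(storage[(readIndex + offset) % storage.count])
        }
        readIndex = (readIndex + n) % storage.count
        count -= n
        return out
    }
}
```

A lock is the simplest correct choice for a sketch; real-time audio callbacks often prefer lock-free designs to avoid blocking the capture thread.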

SentencePieceModel

A shared protobuf reader for SentencePiece .model files, housed in AudioCommon. Every module that needs to decode SentencePiece pieces (PersonaPlex, OmnilingualASR, future ASR / TTS ports) builds its own decoder on top of this single reader instead of reimplementing the protobuf wire format.

public struct SentencePieceModel: Sendable {
    public struct Piece: Sendable, Equatable {
        public let text: String
        public let score: Float
        public let type: Int32
        public var pieceType: PieceType? { get }
        public var isControlOrUnknown: Bool { get }
    }
    public enum PieceType: Int32 {
        case normal = 1, unknown = 2, control = 3,
             userDefined = 4, unused = 5, byte = 6
    }
    public let pieces: [Piece]
    public var count: Int { get }
    public subscript(_ id: Int) -> Piece? { get }
    public init(contentsOf url: URL) throws
    public init(modelPath: String) throws
    public init(data: Data) throws
}

Used by: OmnilingualASR.OmnilingualVocabulary, PersonaPlex.SentencePieceDecoder. Covered by 7 unit tests in Tests/AudioCommonTests/SentencePieceModelTests.
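
A module-level decoder layered on the reader could be as small as the sketch below. It uses a simplified local Piece struct instead of SentencePieceModel.Piece so that it runs standalone, and decodePieces is a hypothetical function. The only SentencePiece convention relied on is that U+2581 ("▁") marks a word boundary.

```swift
import Foundation

// Simplified stand-in for SentencePieceModel.Piece.
struct Piece {
    let text: String
    let isControlOrUnknown: Bool
}

// Hypothetical decoder: joins piece texts and restores spaces.
func decodePieces(ids: [Int], pieces: [Piece]) -> String {
    var joined = ""
    for id in ids where pieces.indices.contains(id) {  // skip out-of-range IDs
        let piece = pieces[id]
        if piece.isControlOrUnknown { continue }        // drop <s>, </s>, <unk>, ...
        joined += piece.text
    }
    // "▁" (U+2581) marks a word boundary in SentencePiece vocabularies.
    let spaced = joined.replacingOccurrences(of: "\u{2581}", with: " ")
    return spaced.hasPrefix(" ") ? String(spaced.dropFirst()) : spaced
}
```

With the real reader, the pieces array would come from the pieces property and the filter from isControlOrUnknown. Byte-fallback pieces need extra handling that this sketch leaves out.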

MLXCommon.SDPA

Scaled dot-product attention helpers shared across every MLX attention module (Qwen3-ASR / Qwen3-TTS / Qwen3-Chat / CosyVoice / PersonaPlex / OmnilingualASR). Each module keeps its own projections; SDPA handles only the reshape → attention → merge boilerplate.

public enum SDPA {
    // Flat [B, T, H*D] input: project/reshape happens inside
    public static func multiHead(
        q: MLXArray, k: MLXArray, v: MLXArray,
        numHeads: Int, headDim: Int, scale: Float,
        mask: MLXArray? = nil
    ) -> MLXArray

    // GQA / MQA variant with separate query and KV head counts
    public static func multiHead(
        q: MLXArray, k: MLXArray, v: MLXArray,
        numQueryHeads: Int, numKVHeads: Int, headDim: Int, scale: Float,
        mask: MLXArray? = nil
    ) -> MLXArray

    // Already-shaped [B, H, T, D] (RoPE / KV cache paths)
    public static func attendAndMerge(
        qHeads: MLXArray, kHeads: MLXArray, vHeads: MLXArray,
        scale: Float,
        mask: MLXArray? = nil
    ) -> MLXArray

    // Same, with ScaledDotProductAttentionMaskMode enum (newer API)
    public static func attendAndMerge(
        qHeads: MLXArray, kHeads: MLXArray, vHeads: MLXArray,
        scale: Float,
        mask: MLXFast.ScaledDotProductAttentionMaskMode
    ) -> MLXArray

    // Low-level head merge: [B, H, T, D] → [B, T, H*D]
    public static func mergeHeads(_ attn: MLXArray) -> MLXArray
}

All reshapes use -1 for the batch dimension, so these helpers compose with MLX.compile(shapeless:) graphs whose batch size varies at runtime (for example, autoregressive decoding in the Qwen3-TTS Talker).
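
The layout change performed by mergeHeads ([B, H, T, D] → [B, T, H*D]) can be illustrated without MLX using nested Swift arrays. This is only a picture of the index mapping: mergeHeadsSketch is a hypothetical function, and the real helper operates on MLXArray values.

```swift
// Hypothetical pure-Swift illustration of [B, H, T, D] → [B, T, H*D].
func mergeHeadsSketch(_ attn: [[[[Float]]]]) -> [[[Float]]] {
    attn.map { heads -> [[Float]] in
        let steps = heads.first?.count ?? 0  // T, read from the first head
        return (0..<steps).map { t in
            // For time step t, concatenate every head's D-vector in head order.
            heads.flatMap { $0[t] }
        }
    }
}
```

Each output row is the per-head vectors for one time step laid end to end, the flat form that an output projection layer typically consumes.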

HTTP API Server

The speech-server executable exposes every model in speech-swift as HTTP REST endpoints plus a WebSocket endpoint that implements the OpenAI Realtime API. Models are loaded lazily on the first request; pass --preload to warm up all of them at startup.

swift build -c release
.build/release/speech-server --port 8080

# Preload every model at startup
.build/release/speech-server --port 8080 --preload

REST Endpoints

Endpoint	Method	Request	Response
`/transcribe`	POST	`audio/wav` body	JSON `{ text }` (Qwen3-ASR)
`/speak`	POST	JSON `{ text, engine?, language?, voice? }`	`audio/wav` body (Qwen3-TTS, CosyVoice, Kokoro)
`/respond`	POST	`audio/wav` body	`audio/wav` body (PersonaPlex)
`/enhance`	POST	`audio/wav` body	`audio/wav` body (DeepFilterNet3)
`/vad`	POST	`audio/wav` body	JSON list of segments
`/diarize`	POST	`audio/wav` body	JSON list of `DiarizedSegment`
`/embed-speaker`	POST	`audio/wav` body	JSON `[Float]` (256 dimensions)

# Transcribe a file
curl -X POST http://localhost:8080/transcribe \
  --data-binary @recording.wav \
  -H "Content-Type: audio/wav"

# Synthesize speech
curl -X POST http://localhost:8080/speak \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "engine": "cosyvoice"}' \
  -o output.wav

# Full speech-to-speech round trip
curl -X POST http://localhost:8080/respond \
  --data-binary @question.wav \
  -o response.wav
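
The same /speak call can be prepared from Swift with Foundation alone. The sketch below only builds the request. SpeakRequest and makeSpeakRequest are hypothetical names; the JSON fields mirror the endpoint table, and the base URL assumes a server started with --port 8080.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives here on Linux
#endif

// Hypothetical request body mirroring { text, engine?, language?, voice? }.
struct SpeakRequest: Encodable {
    let text: String
    var engine: String? = nil
    var language: String? = nil
    var voice: String? = nil
}

// Builds a POST /speak request; nil fields are omitted from the JSON body.
func makeSpeakRequest(baseURL: URL, body: SpeakRequest) throws -> URLRequest {
    var request = URLRequest(url: baseURL.appendingPathComponent("speak"))
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(body)
    return request
}
```

Sending the request, for example with URLSession.shared.data(for:) in an async context, should return the audio/wav body listed in the table, which can then be written to disk or decoded for playback.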

OpenAI Realtime API (`/v1/realtime`)

The WebSocket endpoint at ws://host:port/v1/realtime implements the OpenAI Realtime protocol. Every message is JSON with a type discriminator field; audio payloads are base64-encoded PCM16 at 24 kHz mono.
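
Clients that hold audio as Float samples need to convert it before sending. The sketch below shows one plausible conversion to base64 PCM16. encodePCM16Base64 is a hypothetical helper, and both the little-endian byte order and the 32767 scale factor are assumptions, since the text above specifies only "PCM16 at 24 kHz mono". Resampling to 24 kHz is left to the caller.

```swift
import Foundation

// Hypothetical helper: Float samples in [-1, 1] → base64 of little-endian Int16 PCM.
func encodePCM16Base64(_ samples: [Float]) -> String {
    var data = Data(capacity: samples.count * 2)
    for sample in samples {
        // Treat NaN/inf as silence and clamp to the valid range to avoid overflow.
        let safe = sample.isFinite ? max(-1.0, min(1.0, sample)) : 0
        let value = Int16((safe * 32767.0).rounded())
        let bits = UInt16(bitPattern: value)
        data.append(UInt8(bits & 0xFF))  // low byte first (little-endian)
        data.append(UInt8(bits >> 8))
    }
    return data.base64EncodedString()
}
```

The returned string is what would go in the audio field of an input_audio_buffer.append message.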

During cold model loads or long generations, the server emits lightweight realtime.keepalive JSON events and WebSocket pong control frames roughly every 15 seconds until model output is ready. Clients can ignore these events or use them as a liveness indicator.

Client → Server Events

Event	Purpose
`session.update`	Configure the engine, language, voice, and audio format
`input_audio_buffer.append`	Append a base64 PCM16 chunk to the input buffer
`input_audio_buffer.commit`	Commit the buffered audio for transcription
`input_audio_buffer.clear`	Discard the current input buffer
`response.create`	Request TTS synthesis for the supplied text/instructions

Server → Client Events

Event	Meaning
`session.created`	Handshake complete; default configuration emitted
`session.updated`	The most recent `session.update` was acknowledged
`input_audio_buffer.committed`	Audio was accepted and queued for transcription
`conversation.item.input_audio_transcription.completed`	ASR result with the final transcript
`response.audio.delta`	Base64 PCM16 chunk of synthesized audio
`response.audio.done`	No more audio chunks for this response
`response.done`	Response complete (metadata + latency statistics)
`error`	Error frame with `type` and `message`

const ws = new WebSocket('ws://localhost:8080/v1/realtime');

// ASR: push audio, request transcription
ws.send(JSON.stringify({ type: 'input_audio_buffer.append', audio: base64PCM16 }));
ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
// → conversation.item.input_audio_transcription.completed

// TTS: request synthesis and stream audio deltas
ws.send(JSON.stringify({
  type: 'response.create',
  response: { modalities: ['audio', 'text'], instructions: 'Hello world' }
}));
// → response.audio.delta (repeated), response.audio.done, response.done

The server lives in the AudioServer SPM product. An example browser client is provided at Examples/websocket-client.html; open it alongside a running server to exercise the full ASR + TTS loop.

Model Downloads

All models are downloaded from HuggingFace on first use and cached in ~/Library/Caches/qwen3-speech/. The AudioCommon module provides a shared HuggingFaceDownloader that handles downloading, caching, and integrity verification.