Getting Started

speech-swift provides on-device AI speech processing for macOS on Apple Silicon. Models run locally using MLX (Metal GPU) and CoreML (Neural Engine); after the initial model download, no network calls are required.

Requirements

Installation

Swift Package Manager

Add speech-swift to your Package.swift dependencies:

dependencies: [
    .package(url: "https://github.com/soniqo/speech-swift", from: "1.0.0")
]

Then add the modules you need to your target:

.target(
    name: "MyApp",
    dependencies: [
        .product(name: "Qwen3ASR", package: "speech-swift"),
        .product(name: "Qwen3TTS", package: "speech-swift"),
        .product(name: "SpeechVAD", package: "speech-swift"),
        // ... add any modules you need
    ]
)

Available Modules

Module             Description
Qwen3ASR           Speech-to-text (Qwen3-ASR)
ParakeetASR        Speech-to-text (Parakeet TDT, CoreML)
Qwen3TTS           Text-to-speech (Qwen3-TTS)
CosyVoiceTTS       Text-to-speech (CosyVoice3, multilingual)
KokoroTTS          Text-to-speech (Kokoro-82M, CoreML, iOS-ready)
PersonaPlex        Speech-to-speech (PersonaPlex 7B)
SpeechVAD          VAD (Silero + Pyannote), diarization, speaker embeddings
SpeechEnhancement  Noise suppression (DeepFilterNet3, CoreML)
AudioCommon        Shared protocols, audio I/O, HuggingFace downloader

Building from Source

Clone the repository and build:

git clone https://github.com/soniqo/speech-swift.git
cd speech-swift
make build

Important: make build compiles the MLX Metal shader library automatically. Without it, GPU inference runs ~5x slower due to JIT shader compilation.

Quick Start: Transcribe Audio

CLI

# Transcribe a WAV file
.build/release/audio transcribe recording.wav

Swift API

import Qwen3ASR

let model = try await Qwen3ASRModel.loadFromHub()
let result = try await model.transcribe(audioFile: "recording.wav")
print(result.text)

Models are downloaded automatically from HuggingFace on first use and cached in ~/Library/Caches/qwen3-speech/.
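
The two calls above can be wrapped with basic error handling and timing to separate first-use download cost from transcription cost. A minimal sketch: only loadFromHub() and transcribe(audioFile:) come from the API shown above; the do/catch structure, function name, and timing are illustrative.

```swift
import Foundation
import Qwen3ASR

// Sketch: wrap the documented loadFromHub()/transcribe(audioFile:) calls
// with error handling and rough wall-clock timing. The first loadFromHub()
// call includes the HuggingFace download; later calls hit the local cache.
func transcribeWithTiming(_ path: String) async {
    do {
        let start = Date()
        let model = try await Qwen3ASRModel.loadFromHub()
        let loaded = Date()
        let result = try await model.transcribe(audioFile: path)
        print("load: \(loaded.timeIntervalSince(start))s, " +
              "transcribe: \(Date().timeIntervalSince(loaded))s")
        print(result.text)
    } catch {
        print("Transcription failed: \(error)")
    }
}
```

Loading once and reusing the model across files avoids paying the load cost per transcription.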

Quick Start: Text-to-Speech

CLI

# Generate speech
.build/release/audio speak "Hello, world!" --output hello.wav

Swift API

import Qwen3TTS

let model = try await Qwen3TTSModel.loadFromHub()
let audio = try await model.speak("Hello, world!")
try audio.write(to: "hello.wav")
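
Because speak(_:) returns an audio object with a write(to:) method, batch synthesis is a natural extension. A hedged sketch: only loadFromHub(), speak(_:), and write(to:) are from the API shown above; the loop and output file naming are illustrative.

```swift
import Qwen3TTS

// Sketch: reuse one loaded model to synthesize several phrases,
// writing each result to its own WAV file.
let model = try await Qwen3TTSModel.loadFromHub()
let phrases = ["Hello, world!", "Goodbye, world!"]
for (i, phrase) in phrases.enumerated() {
    let audio = try await model.speak(phrase)
    try audio.write(to: "phrase-\(i).wav")  // phrase-0.wav, phrase-1.wav, ...
}
```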

Model Downloads

All models are downloaded from HuggingFace on first use. Approximate sizes:

ModelSizeRAM Usage
Qwen3-ASR 0.6B (4-bit)680 MB~2.2 GB peak
Qwen3-ASR 0.6B (8-bit)1.0 GB~2.5 GB peak
Qwen3-ASR 1.7B (4-bit)2.1 GB~4 GB peak
Parakeet-TDT (CoreML INT4)315 MB~400 MB peak
Qwen3-TTS 0.6B (4-bit)1.7 GB~2 GB peak
Qwen3-TTS 1.7B (4-bit)3.2 GB~4 GB peak
CosyVoice3 (4-bit LLM)1.2 GB~1.5 GB peak
Kokoro-82M (CoreML)325 MB~500 MB peak
PersonaPlex 7B (4-bit)4.9 GB~6.5 GB peak
Pyannote VAD5.7 MB~20 MB peak
Silero VAD v51.2 MB~5 MB peak
WeSpeaker ResNet3425 MB~50 MB peak
DeepFilterNet3 (FP16)4.2 MB~10 MB peak
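
Given the cache location noted above, standard shell tools can report how much disk the downloaded models occupy. A sketch: only the cache path comes from this page; the per-model directory name in the comment is illustrative.

```shell
# Show per-model disk usage in the local cache (path from this page).
CACHE_DIR="$HOME/Library/Caches/qwen3-speech"
du -sh "$CACHE_DIR"/* 2>/dev/null || echo "cache is empty"

# To force a re-download, delete a cached model directory
# (directory name below is illustrative, not a documented layout):
# rm -rf "$CACHE_DIR/<model-directory>"
```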

Next Steps