# Getting Started
speech-swift provides on-device AI speech processing for macOS on Apple Silicon. Models run locally using MLX (Metal GPU) and CoreML (Neural Engine), with no network calls required for inference.
## Requirements
- macOS 14+ (Sonoma or later)
- Apple Silicon (M1, M2, M3, M4 series)
- Xcode 15.4+ / Swift 6.0+
- 8 GB RAM minimum (16 GB recommended for larger models)
## Installation

### Swift Package Manager
Add speech-swift to your `Package.swift` dependencies:
```swift
dependencies: [
    .package(url: "https://github.com/soniqo/speech-swift", from: "1.0.0")
]
```
Then add the modules you need to your target:
```swift
.target(
    name: "MyApp",
    dependencies: [
        .product(name: "Qwen3ASR", package: "speech-swift"),
        .product(name: "Qwen3TTS", package: "speech-swift"),
        .product(name: "SpeechVAD", package: "speech-swift"),
        // ... add any modules you need
    ]
)
```
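For orientation, the two snippets above combine into a complete manifest along these lines (a sketch: the package name, platform, and module selection are placeholders for your own project):

```swift
// swift-tools-version:6.0
import PackageDescription

let package = Package(
    name: "MyApp",              // placeholder: your package name
    platforms: [.macOS(.v14)],  // matches the macOS 14+ requirement
    dependencies: [
        .package(url: "https://github.com/soniqo/speech-swift", from: "1.0.0")
    ],
    targets: [
        .executableTarget(
            name: "MyApp",
            dependencies: [
                // Pull in only the modules you use; each adds its model code
                .product(name: "Qwen3ASR", package: "speech-swift"),
                .product(name: "Qwen3TTS", package: "speech-swift"),
                .product(name: "SpeechVAD", package: "speech-swift")
            ]
        )
    ]
)
```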
### Available Modules
| Module | Description |
|---|---|
| Qwen3ASR | Speech-to-text (Qwen3-ASR) |
| ParakeetASR | Speech-to-text (Parakeet TDT, CoreML) |
| Qwen3TTS | Text-to-speech (Qwen3-TTS) |
| CosyVoiceTTS | Text-to-speech (CosyVoice3, multilingual) |
| KokoroTTS | Text-to-speech (Kokoro-82M, CoreML, iOS-ready) |
| PersonaPlex | Speech-to-speech (PersonaPlex 7B) |
| SpeechVAD | VAD (Silero + Pyannote), diarization, speaker embeddings |
| SpeechEnhancement | Noise suppression (DeepFilterNet3, CoreML) |
| AudioCommon | Shared protocols, audio I/O, Hugging Face downloader |
## Building from Source
Clone the repository and build:
```shell
git clone https://github.com/soniqo/speech-swift.git
cd speech-swift
make build
```
> **Important:** `make build` compiles the MLX Metal shader library automatically. Without the precompiled shaders, GPU inference falls back to JIT shader compilation and runs roughly 5x slower.
## Quick Start: Transcribe Audio

### CLI
```shell
# Transcribe a WAV file
.build/release/audio transcribe recording.wav
```
### Swift API
```swift
import Qwen3ASR

let model = try await Qwen3ASRModel.loadFromHub()
let result = try await model.transcribe(audioFile: "recording.wav")
print(result.text)
```
Models are downloaded automatically from Hugging Face on first use and cached in `~/Library/Caches/qwen3-speech/`.
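Cached models can total several gigabytes (see the table below), so it is worth knowing how to inspect the cache. A small shell sketch, assuming the default cache location above:

```shell
CACHE_DIR="$HOME/Library/Caches/qwen3-speech"

if [ -d "$CACHE_DIR" ]; then
    # Show total disk usage of all cached models
    du -sh "$CACHE_DIR"
else
    echo "no cached models yet"
fi
```

Deleting the directory is always safe: any model you use again is simply re-downloaded on next load.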
## Quick Start: Text-to-Speech

### CLI
```shell
# Generate speech
.build/release/audio speak "Hello, world!" --output hello.wav
```
### Swift API
```swift
import Qwen3TTS

let model = try await Qwen3TTSModel.loadFromHub()
let audio = try await model.speak("Hello, world!")
try audio.write(to: "hello.wav")
```
## Model Downloads
All models are downloaded from Hugging Face on first use. Approximate sizes:
| Model | Size | RAM Usage |
|---|---|---|
| Qwen3-ASR 0.6B (4-bit) | 680 MB | ~2.2 GB peak |
| Qwen3-ASR 0.6B (8-bit) | 1.0 GB | ~2.5 GB peak |
| Qwen3-ASR 1.7B (4-bit) | 2.1 GB | ~4 GB peak |
| Parakeet-TDT (CoreML INT4) | 315 MB | ~400 MB peak |
| Qwen3-TTS 0.6B (4-bit) | 1.7 GB | ~2 GB peak |
| Qwen3-TTS 1.7B (4-bit) | 3.2 GB | ~4 GB peak |
| CosyVoice3 (4-bit LLM) | 1.2 GB | ~1.5 GB peak |
| Kokoro-82M (CoreML) | 325 MB | ~500 MB peak |
| PersonaPlex 7B (4-bit) | 4.9 GB | ~6.5 GB peak |
| Pyannote VAD | 5.7 MB | ~20 MB peak |
| Silero VAD v5 | 1.2 MB | ~5 MB peak |
| WeSpeaker ResNet34 | 25 MB | ~50 MB peak |
| DeepFilterNet3 (FP16) | 4.2 MB | ~10 MB peak |
## Next Steps
- CLI Reference — all available commands and options
- Qwen3-ASR Guide — detailed speech-to-text documentation
- Qwen3-TTS Guide — detailed text-to-speech documentation
- API & Protocols — shared protocols and types