Getting Started

speech-swift provides on-device AI speech processing for macOS and iOS on Apple Silicon. Models run locally using MLX (Metal GPU) and CoreML (Neural Engine).

Requirements

An Apple Silicon Mac is required; Intel Macs are not supported. The Swift packages build for both macOS and iOS, while the CLI and server run on macOS only. Homebrew installs additionally require native ARM Homebrew (/opt/homebrew).

Installation

Homebrew (CLI)

Homebrew is the fastest way to try speech-swift on macOS. The tap installs both the audio CLI and audio-server, an HTTP/WebSocket server exposing an OpenAI-compatible /v1/realtime endpoint. It requires native ARM Homebrew (/opt/homebrew).

brew tap soniqo/speech https://github.com/soniqo/speech-swift
brew install speech

After install, both binaries are on your PATH:

audio transcribe recording.wav
audio speak "Hello, world!" --output hello.wav
audio-server --port 8080            # local HTTP / WebSocket server
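To confirm the install linked both binaries, you can resolve them with standard shell builtins (nothing here is specific to speech-swift):

```shell
# Both should resolve under /opt/homebrew/bin on Apple Silicon
for bin in audio audio-server; do
  if command -v "$bin" >/dev/null 2>&1; then
    echo "$bin: $(command -v "$bin")"
  else
    echo "$bin: not on PATH"
  fi
done
```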

Swift Package Manager

Add speech-swift to your Package.swift dependencies:

dependencies: [
    .package(url: "https://github.com/soniqo/speech-swift", from: "0.0.9")
]

Then add the modules you need to your target:

.target(
    name: "MyApp",
    dependencies: [
        .product(name: "Qwen3ASR", package: "speech-swift"),
        .product(name: "Qwen3TTS", package: "speech-swift"),
        .product(name: "SpeechVAD", package: "speech-swift"),
        // ... add any modules you need
    ]
)

Available Modules

Module              Description
Qwen3ASR            Speech-to-text (Qwen3-ASR)
ParakeetASR         Speech-to-text (Parakeet TDT, CoreML)
Qwen3TTS            Text-to-speech (Qwen3-TTS)
CosyVoiceTTS        Text-to-speech (CosyVoice3, multilingual)
KokoroTTS           Text-to-speech (Kokoro-82M, CoreML, iOS-ready)
Qwen3Chat           On-device LLM chat (Qwen3.5-0.8B, MLX + CoreML)
PersonaPlex         Speech-to-speech (PersonaPlex 7B)
SpeechVAD           VAD (Silero + Pyannote), diarization, speaker embeddings
SpeechEnhancement   Noise suppression (DeepFilterNet3, CoreML)
AudioCommon         Shared protocols, audio I/O, HuggingFace downloader

Building from Source

Clone the repository and build:

git clone https://github.com/soniqo/speech-swift.git
cd speech-swift
make build
Important

make build compiles the MLX Metal shader library automatically. Without the precompiled shaders, GPU inference runs roughly 5x slower because each shader is JIT-compiled at runtime.

Quick Start: Transcribe Audio

CLI

# Transcribe a WAV file
.build/release/audio transcribe recording.wav
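The same command works in a loop for batch jobs. A minimal sketch, assuming a hypothetical recordings/ directory and that the transcript is printed to stdout as in the example above:

```shell
# Batch-transcribe every WAV in a directory, writing one .txt per file
AUDIO=.build/release/audio   # or just `audio` if installed via Homebrew

for f in recordings/*.wav; do
  [ -e "$f" ] || continue                    # nothing to do if the folder is empty
  "$AUDIO" transcribe "$f" > "${f%.wav}.txt"
done
echo "batch done"
```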

Swift API

import Qwen3ASR

let model = try await Qwen3ASRModel.loadFromHub()
let result = try await model.transcribe(audioFile: "recording.wav")
print(result.text)

Models are downloaded automatically from HuggingFace on first use and cached in ~/Library/Caches/qwen3-speech/.
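You can inspect (or clear) that cache directly; these are ordinary filesystem commands on the cache path named above:

```shell
# Where speech-swift caches downloaded models
CACHE_DIR="$HOME/Library/Caches/qwen3-speech"

if [ -d "$CACHE_DIR" ]; then
  du -sh "$CACHE_DIR"        # total disk used by cached models
  ls "$CACHE_DIR"            # one entry per downloaded model
else
  echo "no models cached yet"
fi

# Deleting the directory (or a single model inside it) forces a
# fresh download on next use:
#   rm -rf "$CACHE_DIR"
```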

Quick Start: Text-to-Speech

CLI

# Generate speech
.build/release/audio speak "Hello, world!" --output hello.wav

Swift API

import Qwen3TTS

let model = try await Qwen3TTSModel.loadFromHub()
let audio = try await model.speak("Hello, world!")
try audio.write(to: "hello.wav")

Model Downloads

All models are downloaded from HuggingFace on first use. Approximate sizes:

Model                                 Size     RAM Usage
Qwen3-ASR 0.6B (4-bit)                680 MB   ~2.2 GB peak
Qwen3-ASR 0.6B (8-bit)                1.0 GB   ~2.5 GB peak
Qwen3-ASR 1.7B (4-bit)                2.1 GB   ~4 GB peak
Parakeet-TDT (CoreML INT8)            500 MB   ~600 MB peak
Qwen3-TTS 0.6B (4-bit)                1.7 GB   ~2 GB peak
Qwen3-TTS 1.7B (4-bit)                3.2 GB   ~4 GB peak
CosyVoice3 (4-bit LLM)                1.2 GB   ~1.5 GB peak
Kokoro-82M (CoreML INT8)              89 MB    ~200 MB peak
Qwen3.5-Chat 0.8B (INT4 MLX)          418 MB   ~700 MB peak
Qwen3.5-Chat 0.8B (INT8 CoreML)       981 MB   ~1.2 GB peak
PersonaPlex 7B (8-bit, recommended)   9.1 GB   ~11 GB peak
PersonaPlex 7B (4-bit)                4.9 GB   ~6.5 GB peak
Pyannote VAD                          5.7 MB   ~20 MB peak
Silero VAD v5                         1.2 MB   ~5 MB peak
WeSpeaker ResNet34                    25 MB    ~50 MB peak
DeepFilterNet3 (FP16)                 4.2 MB   ~10 MB peak
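The table makes it easy to budget disk space before downloading. A back-of-envelope sketch for one plausible speech-to-speech stack, using the download sizes above (values in MB, VAD rounded up):

```shell
# Download-size estimate for ASR + TTS + VAD (sizes from the table above)
ASR=680      # Qwen3-ASR 0.6B (4-bit)
TTS=1700     # Qwen3-TTS 0.6B (4-bit), 1.7 GB
VAD=2        # Silero VAD v5, 1.2 MB
TOTAL=$((ASR + TTS + VAD))
echo "approx download: ${TOTAL} MB"   # prints: approx download: 2382 MB
```

Remember that peak RAM (right-hand column) is what matters at runtime when several models are loaded at once.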

Next Steps