Getting Started

speech-swift provides on-device AI speech processing for macOS and iOS on Apple Silicon. Models run locally using MLX (Metal GPU) and CoreML (Neural Engine).

Requirements

Installation

Homebrew (CLI)

Homebrew is the fastest way to try speech-swift on macOS. The formula installs both the speech CLI and speech-server, an HTTP/WebSocket server with an OpenAI-compatible /v1/realtime endpoint. It requires native ARM Homebrew, installed under /opt/homebrew.

brew install speech

After install, both binaries are on your PATH:

speech transcribe recording.wav
speech speak "Hello, world!" --output hello.wav
speech-server --port 8080            # local HTTP / WebSocket server

Swift Package Manager

Add speech-swift to your Package.swift dependencies:

dependencies: [
    .package(url: "https://github.com/soniqo/speech-swift", branch: "main")
]

Then add the modules you need to your target:

.target(
    name: "MyApp",
    dependencies: [
        .product(name: "Qwen3ASR", package: "speech-swift"),
        .product(name: "Qwen3TTS", package: "speech-swift"),
        .product(name: "SpeechVAD", package: "speech-swift"),
        // ... add any modules you need
    ]
)
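
Put together, the two fragments above might sit in a manifest like the following sketch. The package name MyApp and the three products come from the snippets above; the tools version and the platform minimums are assumptions made for illustration, so check the repository for the versions it actually supports.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyApp",
    // Assumption: these deployment targets are illustrative only; use the
    // minimums that speech-swift actually requires.
    platforms: [.macOS(.v14), .iOS(.v17)],
    dependencies: [
        // Same dependency line as above, tracking the main branch.
        .package(url: "https://github.com/soniqo/speech-swift", branch: "main")
    ],
    targets: [
        .target(
            name: "MyApp",
            dependencies: [
                .product(name: "Qwen3ASR", package: "speech-swift"),
                .product(name: "Qwen3TTS", package: "speech-swift"),
                .product(name: "SpeechVAD", package: "speech-swift"),
            ]
        )
    ]
)
```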

Available Modules

Module             Description
Qwen3ASR           Speech-to-text (Qwen3-ASR)
ParakeetASR        Speech-to-text (Parakeet TDT, CoreML)
Qwen3TTS           Text-to-speech (Qwen3-TTS)
CosyVoiceTTS       Text-to-speech (CosyVoice3, multilingual)
KokoroTTS          Text-to-speech (Kokoro-82M, CoreML, iOS-ready)
Qwen3Chat          On-device LLM chat (Qwen3.5-0.8B, MLX + CoreML)
PersonaPlex        Speech-to-speech (PersonaPlex 7B)
SpeechVAD          VAD (Silero + Pyannote), diarization, speaker embeddings
SpeechEnhancement  Noise suppression (DeepFilterNet3, CoreML)
AudioCommon        Shared protocols, audio I/O, HuggingFace downloader

Building from Source

Clone the repository and build:

git clone https://github.com/soniqo/speech-swift.git
cd speech-swift
make build

Important: make build compiles the MLX Metal shader library automatically. Without the precompiled library, GPU inference runs about 5x slower because shaders are JIT-compiled at runtime.

Quick Start: Transcribe Audio

CLI

# Transcribe a WAV file
.build/release/speech transcribe recording.wav

Swift API

import Qwen3ASR

let model = try await Qwen3ASRModel.fromPretrained()
// audioSamples: [Float] PCM at 16 kHz (e.g. decoded from a WAV)
let text = model.transcribe(audio: audioSamples, sampleRate: 16000)
print(text)
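
The transcribe call above takes [Float] samples. If your decoder hands you 16-bit integer PCM instead, a conversion along these lines may help. This is only a sketch: floatSamples is a hypothetical helper, not part of speech-swift, and it assumes the model expects samples normalized to the range -1.0...1.0, which is the common convention but is not stated here.

```swift
/// Hypothetical helper (not part of speech-swift): converts signed 16-bit PCM
/// to Float samples in -1.0...1.0 (assumed normalization).
func floatSamples(fromInt16 pcm: [Int16]) -> [Float] {
    // Dividing by 32768 maps Int16.min to exactly -1.0.
    pcm.map { Float($0) / 32768.0 }
}

let audioSamples = floatSamples(fromInt16: [0, 16384, -32768])
print(audioSamples) // [0.0, 0.5, -1.0]
```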

Models are downloaded automatically from HuggingFace on first use and cached in ~/Library/Caches/qwen3-speech/.
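
To see how much disk space the downloaded models occupy, you could sum the files under that cache directory. The following is a minimal sketch using only Foundation. directorySize is a hypothetical helper, and the path assumes a non-sandboxed macOS process; a sandboxed app resolves its caches directory inside its own container.

```swift
import Foundation

/// Hypothetical helper: total size in bytes of all regular files under `directory`.
/// Returns 0 when the directory is missing or empty.
func directorySize(at directory: URL) -> Int64 {
    let fm = FileManager.default
    guard let enumerator = fm.enumerator(atPath: directory.path) else { return 0 }
    var total: Int64 = 0
    for case let relativePath as String in enumerator {
        let path = directory.appendingPathComponent(relativePath).path
        // Count regular files only; directories and symlinks are skipped.
        guard let attrs = try? fm.attributesOfItem(atPath: path),
              (attrs[.type] as? FileAttributeType) == .typeRegular,
              let size = attrs[.size] as? NSNumber else { continue }
        total += size.int64Value
    }
    return total
}

// Cache location taken from the text above (macOS, non-sandboxed).
let cacheDir = FileManager.default.homeDirectoryForCurrentUser
    .appendingPathComponent("Library/Caches/qwen3-speech")
let bytes = directorySize(at: cacheDir)
print(ByteCountFormatter.string(fromByteCount: bytes, countStyle: .file))
```

Deleting the directory should free the space, at the cost of a fresh download on the next run.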

Quick Start: Text-to-Speech

CLI

# Generate speech
.build/release/speech speak "Hello, world!" --output hello.wav

Swift API

import Qwen3TTS
import AudioCommon

let model = try await Qwen3TTSModel.fromPretrained()
let audio = model.synthesize(text: "Hello, world!", language: "english")
try WAVWriter.write(samples: audio, sampleRate: 24000, to: URL(filePath: "hello.wav"))
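
Because the synthesized audio is written at 24 kHz, the clip length follows directly from the sample count. The sketch below assumes audio is a mono [Float] array, as the WAVWriter call above suggests; durationSeconds is a hypothetical helper, not a library function.

```swift
/// Hypothetical helper: clip length in seconds for mono PCM samples.
func durationSeconds(sampleCount: Int, sampleRate: Int) -> Double {
    precondition(sampleRate > 0, "sampleRate must be positive")
    return Double(sampleCount) / Double(sampleRate)
}

// 36,000 samples at 24 kHz is a 1.5 second clip.
print(durationSeconds(sampleCount: 36_000, sampleRate: 24_000)) // 1.5
```

With the example above, you would pass audio.count as the sample count.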

Model Downloads

All models are downloaded from HuggingFace on first use. Approximate sizes:

Model                                Size    RAM Usage
Qwen3-ASR 0.6B (4-bit MLX)           680 MB  ~1.0 GB peak
Qwen3-ASR 0.6B (8-bit MLX)           1.0 GB  ~1.3 GB peak
Qwen3-ASR 0.6B (CoreML INT8)         180 MB  ~1.4 GB peak
Qwen3-ASR 1.7B (4-bit MLX)           2.1 GB  ~3 GB peak
Qwen3-ASR 1.7B (8-bit MLX)           3.2 GB  ~2.7 GB peak
Parakeet-TDT v3 (CoreML INT8)        500 MB  ~900 MB peak
Omnilingual CTC 300M (4-bit MLX)     193 MB  ~400 MB peak
Omnilingual CTC 300M (CoreML INT8)   312 MB  ~550 MB peak
Qwen3-TTS 0.6B (4-bit)               1.7 GB  ~2 GB peak
Qwen3-TTS 1.7B (4-bit)               3.2 GB  ~4 GB peak
CosyVoice3 (4-bit LLM)               1.2 GB  ~1.5 GB peak
Kokoro-82M (CoreML INT8)             89 MB   ~200 MB peak
Qwen3.5-Chat 0.8B (INT4 MLX)         418 MB  ~700 MB peak
Qwen3.5-Chat 0.8B (INT8 CoreML)      981 MB  ~1.2 GB peak
PersonaPlex 7B (8-bit), recommended  9.1 GB  ~11 GB peak
PersonaPlex 7B (4-bit)               4.9 GB  ~6.5 GB peak
Pyannote VAD                         5.7 MB  ~20 MB peak
Silero VAD v5                        1.2 MB  ~5 MB peak
WeSpeaker ResNet34                   25 MB   ~50 MB peak
DeepFilterNet3 (FP16)                4.2 MB  ~10 MB peak
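
The peak-RAM column can guide which variant to load on a given machine. As a rough sketch, the code below picks between the two PersonaPlex variants from the table based on installed memory. ModelVariant, pickVariant, and the 0.6 usable-memory fraction are illustrative assumptions rather than anything speech-swift provides; only the peak-RAM figures come from the table.

```swift
import Foundation

/// Hypothetical type holding a row of the table above (peak RAM in GB, approximate).
struct ModelVariant {
    let name: String
    let peakRAMGB: Double
}

// Listed in order of preference: the recommended 8-bit variant first.
let personaPlexVariants = [
    ModelVariant(name: "PersonaPlex 7B (8-bit)", peakRAMGB: 11.0),
    ModelVariant(name: "PersonaPlex 7B (4-bit)", peakRAMGB: 6.5),
]

/// Returns the first variant whose peak RAM fits within `usableFraction`
/// of physical memory, or nil if none fits.
/// `usableFraction` is an illustrative assumption, not a documented limit.
func pickVariant(from variants: [ModelVariant],
                 physicalMemoryBytes: UInt64,
                 usableFraction: Double = 0.6) -> ModelVariant? {
    let budgetGB = Double(physicalMemoryBytes) / 1_000_000_000 * usableFraction
    return variants.first { $0.peakRAMGB <= budgetGB }
}

let choice = pickVariant(from: personaPlexVariants,
                         physicalMemoryBytes: ProcessInfo.processInfo.physicalMemory)
print(choice?.name ?? "No PersonaPlex variant fits in the memory budget")
```

The fraction leaves headroom for the operating system and other apps; tune it to your own workload.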

Next Steps