# Getting Started
speech-swift provides on-device AI speech processing for macOS on Apple Silicon. Models run locally using MLX (Metal GPU) and CoreML (Neural Engine), with no network calls required for inference.
## Requirements
- macOS 14+ (Sonoma or later)
- Apple Silicon (M1, M2, M3, M4 series)
- Xcode 15.4+ / Swift 6.0+
- 8 GB RAM minimum (16 GB recommended for larger models)
## Installation

### Swift Package Manager
Add speech-swift to your `Package.swift` dependencies:
```swift
dependencies: [
    .package(url: "https://github.com/soniqo/speech-swift", from: "1.0.0")
]
```
Then add the modules you need to your target:
```swift
.target(
    name: "MyApp",
    dependencies: [
        .product(name: "Qwen3ASR", package: "speech-swift"),
        .product(name: "Qwen3TTS", package: "speech-swift"),
        .product(name: "SpeechVAD", package: "speech-swift"),
        // ... add any modules you need
    ]
)
```
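For orientation, the two snippets above combine into a complete manifest along these lines (a sketch: the package name, platform, and module selection are placeholders for your own project):

```swift
// swift-tools-version:6.0
import PackageDescription

let package = Package(
    name: "MyApp",              // placeholder: your package name
    platforms: [.macOS(.v14)],  // matches the macOS 14+ requirement
    dependencies: [
        .package(url: "https://github.com/soniqo/speech-swift", from: "1.0.0")
    ],
    targets: [
        .executableTarget(
            name: "MyApp",
            dependencies: [
                // Pull in only the modules you use; each adds its model code
                .product(name: "Qwen3ASR", package: "speech-swift"),
                .product(name: "Qwen3TTS", package: "speech-swift"),
                .product(name: "SpeechVAD", package: "speech-swift")
            ]
        )
    ]
)
```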
### Available Modules
| Module | Description |
|---|---|
| Qwen3ASR | Speech-to-text (Qwen3-ASR) |
| ParakeetASR | Speech-to-text (Parakeet TDT, CoreML) |
| Qwen3TTS | Text-to-speech (Qwen3-TTS) |
| CosyVoiceTTS | Text-to-speech (CosyVoice3, multilingual) |
| KokoroTTS | Text-to-speech (Kokoro-82M, CoreML, iOS-ready) |
| PersonaPlex | Speech-to-speech (PersonaPlex 7B) |
| SpeechVAD | VAD (Silero + Pyannote), diarization, speaker embeddings |
| SpeechEnhancement | Noise suppression (DeepFilterNet3, CoreML) |
| AudioCommon | Shared protocols, audio I/O, Hugging Face downloader |
## Building from Source
Clone the repository and build:
```shell
git clone https://github.com/soniqo/speech-swift.git
cd speech-swift
make build
```
> **Important:** `make build` compiles the MLX Metal shader library automatically. Without the precompiled shaders, GPU inference falls back to JIT shader compilation and runs roughly 5x slower.
## Quick Start: Transcribe Audio

### CLI
```shell
# Transcribe a WAV file
.build/release/audio transcribe recording.wav
```
### Swift API
```swift
import Qwen3ASR

let model = try await Qwen3ASRModel.loadFromHub()
let result = try await model.transcribe(audioFile: "recording.wav")
print(result.text)
```
Models are downloaded automatically from Hugging Face on first use and cached in `~/Library/Caches/qwen3-speech/`.
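Cached models can total several gigabytes (see the table below), so it is worth knowing how to inspect the cache. A small shell sketch, assuming the default cache location above:

```shell
CACHE_DIR="$HOME/Library/Caches/qwen3-speech"

if [ -d "$CACHE_DIR" ]; then
    # Show total disk usage of all cached models
    du -sh "$CACHE_DIR"
else
    echo "no cached models yet"
fi
```

Deleting the directory is always safe: any model you use again is simply re-downloaded on next load.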
## Quick Start: Text-to-Speech

### CLI
```shell
# Generate speech
.build/release/audio speak "Hello, world!" --output hello.wav
```
### Swift API
```swift
import Qwen3TTS

let model = try await Qwen3TTSModel.loadFromHub()
let audio = try await model.speak("Hello, world!")
try audio.write(to: "hello.wav")
```
## Model Downloads
All models are downloaded from Hugging Face on first use. Approximate sizes:
| Model | Size | RAM Usage |
|---|---|---|
| Qwen3-ASR 0.6B (4-bit) | 680 MB | ~2.2 GB peak |
| Qwen3-ASR 0.6B (8-bit) | 1.0 GB | ~2.5 GB peak |
| Qwen3-ASR 1.7B (4-bit) | 2.1 GB | ~4 GB peak |
| Parakeet-TDT (CoreML INT4) | 315 MB | ~400 MB peak |
| Qwen3-TTS 0.6B (4-bit) | 1.7 GB | ~2 GB peak |
| Qwen3-TTS 1.7B (4-bit) | 3.2 GB | ~4 GB peak |
| CosyVoice3 (4-bit LLM) | 1.2 GB | ~1.5 GB peak |
| Kokoro-82M (CoreML) | 325 MB | ~500 MB peak |
| PersonaPlex 7B (4-bit) | 4.9 GB | ~6.5 GB peak |
| Pyannote VAD | 5.7 MB | ~20 MB peak |
| Silero VAD v5 | 1.2 MB | ~5 MB peak |
| WeSpeaker ResNet34 | 25 MB | ~50 MB peak |
| DeepFilterNet3 (FP16) | 4.2 MB | ~10 MB peak |
## Next Steps
- CLI Reference — all available commands and options
- Qwen3-ASR Guide — detailed speech-to-text documentation
- Qwen3-TTS Guide — detailed text-to-speech documentation
- API & Protocols — shared protocols and types