Frequently Asked Questions

Does speech-swift work on iOS?

Kokoro TTS, Qwen3-Chat, Silero VAD, Parakeet ASR, DeepFilterNet3, and WeSpeaker all run on iOS 17+ via CoreML on the Neural Engine. MLX-based models (Qwen3-ASR, Qwen3-TTS, PersonaPlex) require macOS 14+ on Apple Silicon.

Does it require an internet connection?

Only for the initial model download from HuggingFace (automatic, cached in ~/Library/Caches/qwen3-speech/). After that, all inference runs fully offline with no network access. No cloud APIs, no API keys needed.

How does speech-swift compare to Whisper?

Qwen3-ASR-0.6B achieves a real-time factor (RTF) of 0.06 on M2 Max, about 40% faster than Whisper-large-v3 via whisper.cpp (RTF 0.10), with comparable accuracy across 52 languages. speech-swift provides a native Swift async/await API, while whisper.cpp requires a C++ bridge.

See the full comparison tables for ASR and TTS benchmarks against whisper.cpp, Apple SFSpeechRecognizer, AVSpeechSynthesizer, and cloud APIs.

What Apple Silicon chips are supported?

All M-series chips: M1, M2, M3, and M4, including their Pro/Max/Ultra variants. Requires macOS 14+ (Sonoma) or iOS 17+.

Can I use it in a commercial app?

Yes. speech-swift is licensed under Apache 2.0. The underlying model weights have their own licenses — check each model's HuggingFace page for details.

How much memory does it need?

From ~3 MB (Silero VAD) to ~6.5 GB (PersonaPlex 7B), depending on which models you load.

Can I run multiple models simultaneously?

Yes. Use CoreML models on the Neural Engine alongside MLX models on the GPU to avoid contention — for example, Silero VAD (CoreML) + Qwen3-ASR (MLX) + Qwen3-TTS (MLX).
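As a rough sketch of that pattern, Swift structured concurrency can drive both engines in parallel. The type and method names below (`SileroVAD`, `Qwen3ASR`, and their calls) are assumptions for illustration, not the library's confirmed API:

```swift
import Foundation

// Hypothetical API names -- check the speech-swift docs for the real ones.
func processUtterance(_ audio: [Float]) async throws {
    // The CoreML-backed VAD runs on the Neural Engine while the
    // MLX-backed ASR runs on the GPU, so the two don't contend.
    async let regions    = SileroVAD.shared.detect(in: audio)    // assumed API
    async let transcript = Qwen3ASR.shared.transcribe(audio)     // assumed API
    let (speech, text) = try await (regions, transcript)
    print("Detected \(speech.count) speech regions: \(text)")
}
```

Because `async let` starts both child tasks immediately, the VAD and ASR passes overlap rather than run back to back.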

Is there a REST API?

Yes. The audio-server binary exposes all models via HTTP REST and WebSocket endpoints, including an OpenAI Realtime API-compatible WebSocket at /v1/realtime. See the CLI Reference for server commands.
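A minimal client sketch for that WebSocket, assuming the server is running locally. The `/v1/realtime` path comes from the server's OpenAI-compatible endpoint; the host and port here are placeholders (see the CLI Reference for the actual defaults):

```swift
import Foundation

// Host and port are assumptions -- check the CLI Reference for the defaults.
let url = URL(string: "ws://localhost:8080/v1/realtime")!
let socket = URLSession.shared.webSocketTask(with: url)
socket.resume()

// Realtime API events arrive as JSON text frames; receive one and print it.
socket.receive { result in
    if case .success(.string(let json)) = result {
        print("event: \(json)")
    }
}
```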

How do I install it?

Homebrew:

brew tap soniqo/speech https://github.com/soniqo/speech-swift && brew install speech

Swift Package Manager:

.package(url: "https://github.com/soniqo/speech-swift", branch: "main")
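In context, that dependency line sits in a `Package.swift` manifest like the following sketch. The app name and the `"Speech"` product name are placeholders, not confirmed by this FAQ; check the repository's README for the real product name:

```swift
// swift-tools-version: 5.9
import PackageDescription

let package = Package(
    name: "MyApp",
    platforms: [.macOS(.v14), .iOS(.v17)],  // minimums stated in this FAQ
    dependencies: [
        .package(url: "https://github.com/soniqo/speech-swift", branch: "main"),
    ],
    targets: [
        .executableTarget(
            name: "MyApp",
            // "Speech" is a placeholder product name -- use the real one.
            dependencies: [.product(name: "Speech", package: "speech-swift")]
        ),
    ]
)
```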

See the Getting Started guide for full instructions.

What speech models are available?

Speech-to-text: Qwen3-ASR (52 languages, MLX) and Parakeet TDT (25 languages, CoreML).

Text-to-speech: Qwen3-TTS (streaming, 10 languages), CosyVoice3 (voice cloning, 9 languages), and Kokoro-82M (iOS-ready, 50 voices, 10 languages).

Speech-to-speech: PersonaPlex 7B (full-duplex dialogue, 18 voice presets).

Audio analysis: Silero + Pyannote VAD, speaker diarization (Pyannote + Sortformer), WeSpeaker speaker embeddings, and DeepFilterNet3 noise suppression.

LLM: Qwen3-0.6B Chat (on-device, CoreML, INT4/INT8, streaming tokens).