Frequently Asked Questions

Does speech-swift work on iOS?

Kokoro TTS, Qwen3.5-Chat, Silero VAD, Parakeet ASR, DeepFilterNet3, and WeSpeaker all run on iOS 18+ via CoreML on the Neural Engine. MLX-based models (Qwen3-ASR, Qwen3-TTS, Qwen3.5-Chat MLX, PersonaPlex) require macOS 15+ on Apple Silicon.

Does it require an internet connection?

Only for the initial model download from HuggingFace (automatic, cached in ~/Library/Caches/qwen3-speech/). After that, all inference runs fully offline with no network access. No cloud APIs, no API keys needed.

How does speech-swift compare to Whisper?

Qwen3-ASR-0.6B achieves an RTF of 0.06 on M2 Max (real-time factor: processing time divided by audio duration, so lower is faster) — 40% faster than Whisper-large-v3 via whisper.cpp (RTF 0.10) — with comparable accuracy across 52 languages. speech-swift provides a native Swift async/await API, while whisper.cpp requires a C++ bridge.

See the full comparison tables for ASR and TTS benchmarks against whisper.cpp, Apple SFSpeechRecognizer, AVSpeechSynthesizer, and cloud APIs.

What Apple Silicon chips are supported?

All M-series chips: M1, M2, M3, M4 and their Pro/Max/Ultra variants. Requires macOS 15+ (Sequoia) or iOS 18+.

Why does it require macOS 15 / iOS 18?

The floor comes from MLState — Apple's persistent ANE state API, introduced in macOS 15 and iOS 18. The CoreML pipelines (Qwen3-ASR, Qwen3-Chat, Qwen3-TTS) use MLState to keep KV caches resident on the Neural Engine across token steps, instead of shuttling them in and out each step. This cut per-token CoreML latency by 30–50% versus the earlier stateless approach.
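In CoreML terms, the pattern looks roughly like this — a minimal sketch using Apple's MLState API, where the model URL, feature names, and decode loop are illustrative placeholders, not speech-swift's internals:

```swift
import CoreML

// Load a compiled stateful model (path is a placeholder).
let model = try MLModel(contentsOf: compiledModelURL)

// Allocate the model's state (e.g. the KV cache) once; CoreML keeps
// it resident on the Neural Engine between prediction calls.
let state = model.makeState()

for token in tokenIDs {
    let input = try MLDictionaryFeatureProvider(
        dictionary: ["tokenID": MLFeatureValue(int64: Int64(token))])
    // Each call reads and updates the persistent state in place,
    // instead of copying the cache to and from the ANE every step.
    let output = try model.prediction(from: input, using: state)
    _ = output
}
```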

Can I use it in a commercial app?

Yes. speech-swift is licensed under Apache 2.0. The underlying model weights have their own licenses — check each model's HuggingFace page for details.

How much memory does it need?

Memory footprint ranges from ~3 MB (Silero VAD) to ~6.5 GB (PersonaPlex 7B), depending on the model.

Can I run multiple models simultaneously?

Yes. Use CoreML models on the Neural Engine alongside MLX models on the GPU to avoid contention — for example, Silero VAD (CoreML) + Qwen3-ASR (MLX) + Qwen3-TTS (MLX).
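Sketched with Swift structured concurrency — the `SileroVAD`, `QwenASR`, and `QwenTTS` types and their methods below are hypothetical stand-ins for illustration, not speech-swift's actual API:

```swift
// Hypothetical model types — names and methods are illustrative only.
// VAD runs on the Neural Engine (CoreML); ASR and TTS use the GPU (MLX).
let vad = try await SileroVAD.load()   // CoreML → ANE
let asr = try await QwenASR.load()     // MLX → GPU
let tts = try await QwenTTS.load()     // MLX → GPU

// Independent inference calls can run concurrently without
// contending for the same compute unit.
async let speechDetected = vad.detect(in: micBuffer)
async let transcript = asr.transcribe(micBuffer)
let (isSpeech, text) = try await (speechDetected, transcript)

if isSpeech {
    for try await chunk in tts.stream(text: text) {
        play(chunk)
    }
}
```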

Is there a REST API?

Yes. The audio-server binary exposes all models via HTTP REST and WebSocket endpoints, including an OpenAI Realtime API-compatible WebSocket at /v1/realtime. See the CLI Reference for server commands.
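A client can connect to the realtime endpoint with Foundation's URLSessionWebSocketTask — note the port and message handling below are assumptions, not documented defaults; check the CLI Reference for the server's actual configuration:

```swift
import Foundation

// Port 8080 is an assumed default — verify against the CLI Reference.
let url = URL(string: "ws://localhost:8080/v1/realtime")!
let task = URLSession.shared.webSocketTask(with: url)
task.resume()

// Receive one event from the Realtime-style stream.
task.receive { result in
    switch result {
    case .success(.string(let json)):
        print("event:", json)          // JSON event payload
    case .success(.data(let data)):
        print("binary frame:", data.count, "bytes")
    case .failure(let error):
        print("socket error:", error)
    default:
        break
    }
}
```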

How do I install it?

Homebrew:

brew tap soniqo/speech https://github.com/soniqo/speech-swift && brew install speech

This installs both the audio CLI and the audio-server HTTP/WebSocket server on your PATH.

Swift Package Manager:

.package(url: "https://github.com/soniqo/speech-swift", branch: "main")
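In a full Package.swift manifest this looks like the following — the product name "Speech" is an assumption, so verify it against the package's actual manifest:

```swift
// swift-tools-version:6.0
import PackageDescription

let package = Package(
    name: "MyApp",
    // Matches the library's platform floor (macOS 15 / iOS 18).
    platforms: [.macOS(.v15), .iOS(.v18)],
    dependencies: [
        .package(url: "https://github.com/soniqo/speech-swift", branch: "main")
    ],
    targets: [
        .executableTarget(
            name: "MyApp",
            dependencies: [
                // Product name assumed — check the package manifest.
                .product(name: "Speech", package: "speech-swift")
            ]
        )
    ]
)
```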

See the Getting Started guide for full instructions.

What speech models are available?

Speech-to-text: Qwen3-ASR (52 languages, MLX) and Parakeet TDT (25 languages, CoreML).

Text-to-speech: Qwen3-TTS (streaming, 10 languages), CosyVoice3 (voice cloning, 9 languages), and Kokoro-82M (iOS-ready, 50 voices, 10 languages).

Speech-to-speech: PersonaPlex 7B (full-duplex dialogue, 18 voice presets).

Audio analysis: Silero + Pyannote VAD, speaker diarization (Pyannote + Sortformer), WeSpeaker speaker embeddings, and DeepFilterNet3 noise suppression.

LLM: Qwen3.5-0.8B Chat (on-device, INT4 MLX + INT8 CoreML, streaming tokens).

Does Soniqo work on Android?

Yes. The speech-android SDK provides a Kotlin API with ONNX Runtime and NNAPI hardware acceleration. Supports arm64-v8a on Android 8+ (API 26). Models auto-download from HuggingFace on first use (~1.2 GB). See Getting Started — Android for setup instructions.

Does Soniqo work on Linux?

Yes. The speech-android project includes a C API for embedded and automotive Linux (Yocto, edge devices). Uses ONNX Runtime with optional QNN acceleration for Qualcomm hardware. Supports ARM64 and x86_64. See Getting Started — Linux for setup instructions.

Can I share models between platforms?

The core models (Parakeet, Kokoro, Silero, DeepFilterNet3) use ONNX format on both Android and Linux; Apple platforms use CoreML/MLX formats. The underlying weights are the same — only the export format differs, optimized for each platform's hardware acceleration.