Architecture
speech-swift is organized as a modular Swift package with shared protocols, independent model modules, and a unified CLI. All inference runs on-device using MLX (Metal GPU) or CoreML (Neural Engine), with Accelerate handling CPU-side signal processing.
Module Dependency Graph
┌──────────┐
│ AudioCLI │ (entry point)
└────┬─────┘
│
┌──────┴──────┐
│ AudioCLILib │ (commands)
└──────┬──────┘
│
┌─────────┬───────┼───────┬──────────┬──────────────┐
│ │ │ │ │ │
┌────┴───┐ ┌──┴──┐ ┌──┴──┐ ┌─┴────┐ ┌───┴────┐ ┌──────┴───────┐
│Qwen3ASR│ │Qwen3│ │Cosy │ │Perso-│ │Speech- │ │ Speech- │
│Parakeet│ │ TTS │ │Voice│ │naPlex│ │ VAD │ │Enhancement │
└────┬───┘ └──┬──┘ └──┬──┘ └──┬───┘ └───┬───┘ └──────┬───────┘
│ │ │ │ │ │
└────────┴───────┼───────┴─────────┘ │
│ │
┌──────┴──────┐ │
│ Qwen3Common │ (shared layers) │
└──────┬──────┘ │
│ │
┌──────┴──────┐ │
│ AudioCommon │ ◄──────────────────────┘
└─────────────┘ (protocols, audio I/O)
Inference Backends
| Backend | Hardware | Models |
|---|---|---|
| MLX | Metal GPU | Qwen3-ASR, Qwen3-TTS, CosyVoice3, PersonaPlex, Pyannote, Silero VAD, WeSpeaker |
| CoreML | Neural Engine | Qwen3-ASR encoder (hybrid), Parakeet TDT, Kokoro-82M, Sortformer diarization, DeepFilterNet3, Silero VAD (optional), WeSpeaker (optional) |
| Accelerate | CPU (SIMD) | Audio preprocessing (STFT, mel, FFT), signal processing |
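To illustrate how the shared layer can abstract over the two inference backends, here is a minimal sketch. The protocol and type names (`InferenceBackend`, `SpeechModel`, `Transcriber`) are illustrative assumptions, not the actual AudioCommon API:

```swift
import Foundation

// Illustrative sketch only: real AudioCommon protocol names may differ.
// Shows how a shared protocol layer lets the CLI treat MLX- and
// CoreML-backed models uniformly.
enum InferenceBackend: String {
    case mlx      // Metal GPU
    case coreML   // Neural Engine
}

protocol SpeechModel {
    var backend: InferenceBackend { get }
    var expectedSampleRate: Double { get }
}

protocol Transcriber: SpeechModel {
    func transcribe(_ samples: [Float]) throws -> String
}

// A stub conforming type, standing in for e.g. a Qwen3-ASR wrapper.
struct StubASR: Transcriber {
    let backend = InferenceBackend.mlx
    let expectedSampleRate = 16_000.0
    func transcribe(_ samples: [Float]) throws -> String {
        "(\(samples.count) samples @ \(Int(expectedSampleRate)) Hz)"
    }
}
```

With this shape, a CLI command only needs a `Transcriber`; whether the model dispatches to Metal or the Neural Engine stays an implementation detail of the conforming module.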
Model Weight Format
MLX models use the safetensors format with 4-bit or 8-bit quantization (group size 64). CoreML models use the compiled `.mlmodelc` format. Conversion scripts in `scripts/` convert from PyTorch checkpoints.
| Model | Params | Quantization | Size on Disk |
|---|---|---|---|
| Qwen3-ASR 0.6B (MLX) | ~600M | 4-bit / 8-bit | 680 MB / 1.0 GB |
| Qwen3-ASR 0.6B (CoreML) | ~186M (encoder) | INT8 | ~180 MB |
| Qwen3-ASR 1.7B (MLX) | ~1.7B | 4-bit / 8-bit | 2.1 GB / 3.2 GB |
| Parakeet-TDT 0.6B (CoreML) | ~600M | INT4 / INT8 | 315 MB / 500 MB |
| Qwen3-ForcedAligner 0.6B (MLX) | ~600M | 4-bit / 8-bit | 979 MB / 1.4 GB |
| Qwen3-ForcedAligner 0.6B (CoreML) | ~600M | INT4 / INT8 | 630 MB / 1.0 GB |
| Qwen3-TTS 0.6B (MLX) | ~600M | 4-bit / 8-bit | 1.7 GB / 2.4 GB |
| Qwen3-TTS 1.7B (MLX) | ~1.7B | 4-bit / 8-bit | 3.2 GB / 4.8 GB |
| CosyVoice3 0.5B (MLX) | ~500M | 4-bit LLM | ~1.2 GB |
| Kokoro-82M (CoreML) | 82M | float16 | ~325 MB |
| PersonaPlex 7B (MLX) | ~7B | 4-bit / 8-bit | 4.9 GB / 9.1 GB |
| Pyannote VAD (MLX) | ~1.49M | float32 | ~5.7 MB |
| Silero VAD v5 | ~309K | float32 | ~1.2 MB (MLX & CoreML) |
| WeSpeaker ResNet34 | ~6.6M | float32 | ~25 MB (MLX & CoreML) |
| Sortformer (CoreML) | — | float16 | ~50 MB |
| DeepFilterNet3 (CoreML) | ~2.1M | float16 | ~4.2 MB |
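As a rough sanity check on the quantized sizes, group-wise quantization costs the packed weight bits plus per-group metadata. A minimal sketch, assuming one float16 scale and one float16 bias per group (the exact MLX packing may differ, and components such as embeddings or codec layers often stay at higher precision, which is why on-disk sizes in the table run larger):

```swift
// Rough bytes-per-parameter for group-wise quantization: packed weight
// bits plus one float16 scale and one float16 bias per group.
// Order-of-magnitude estimate only.
func bytesPerParam(bits: Int, groupSize: Int) -> Double {
    Double(bits) / 8.0 + 4.0 / Double(groupSize)  // 4 bytes = fp16 scale + bias
}

let fourBit = bytesPerParam(bits: 4, groupSize: 64)   // 0.5625 bytes/param
let eightBit = bytesPerParam(bits: 8, groupSize: 64)  // 1.0625 bytes/param
```

At group size 64 the metadata overhead is only ~6 % of a float32 parameter, which is why the 4-bit files land near half the size of the 8-bit ones.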
Performance Optimizations
- MLX compile() — Kernel fusion for autoregressive loops. The Talker uses `compile(shapeless: true)`; the Code Predictor uses `compile(shapeless: false)` with fixed cache sizes.
- Metal shader library — A pre-compiled metallib avoids ~5x JIT compilation overhead. Built via `scripts/build_mlx_metallib.sh`.
- Chunked codec decode — The TTS decoder processes audio in 25-frame chunks with 10-frame context overlap to avoid GPU timeouts.
- Batch-doubled CFG — CosyVoice3's DiT halves the number of flow-matching passes by batching the conditional and unconditional inputs together.
- Fused RoPE — Uses `MLXNN.RoPE`, backed by a Metal kernel, instead of manual rotation.
- BN fusion — WeSpeaker batch normalization is fused into Conv2d weights at conversion time.
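The chunked codec decode above can be sketched as follows. This is an illustrative reconstruction of the scheme (25-frame chunks, 10 frames of left context whose output samples are discarded); `decodeFrames` is a hypothetical stand-in for the real codec decoder:

```swift
// Decode codec frames in fixed-size chunks, prepending left context to each
// chunk and dropping the context's output samples, so each chunk stays small
// enough to avoid GPU timeouts while remaining continuous at the seams.
func chunkedDecode(
    frames: [Int],
    chunkSize: Int = 25,
    context: Int = 10,
    decodeFrames: ([Int]) -> [Float]
) -> [Float] {
    var audio: [Float] = []
    var start = 0
    while start < frames.count {
        let end = min(start + chunkSize, frames.count)
        let ctxStart = max(0, start - context)
        let decoded = decodeFrames(Array(frames[ctxStart..<end]))
        // Keep only the samples belonging to this chunk's own frames.
        let samplesPerFrame = decoded.count / (end - ctxStart)
        audio.append(contentsOf: decoded.suffix((end - start) * samplesPerFrame))
        start = end
    }
    return audio
}
```

The overlap trades a little redundant compute (10 extra frames per chunk) for seam-free audio, since the decoder sees enough left context to reproduce the waveform it would have produced in a single pass.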
Audio Processing
All audio I/O uses Float32 PCM. Internal resampling handles format conversion:
| Model | Expected Rate | Format |
|---|---|---|
| Qwen3-ASR | 16 kHz | Mono Float32 |
| Qwen3-TTS | 24 kHz output | Mono Float32 |
| CosyVoice3 | 24 kHz output | Mono Float32 |
| Kokoro-82M | 24 kHz output | Mono Float32 |
| PersonaPlex | 24 kHz I/O | Mono Float32 |
| Pyannote VAD | 16 kHz | Mono Float32 |
| Silero VAD | 16 kHz | Mono Float32 |
| WeSpeaker | 16 kHz | Mono Float32 |
| DeepFilterNet3 | 48 kHz | Mono Float32 |
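Since models expect different rates (16 kHz for ASR/VAD, 24 kHz for TTS I/O, 48 kHz for DeepFilterNet3), resampling sits on the hot path. A minimal linear-interpolation sketch of the idea; the package's actual resampler (and its filter quality) may well differ:

```swift
// Minimal mono Float32 resampler using linear interpolation.
// Illustrative only: production resamplers typically band-limit first.
func resample(_ input: [Float], from srcRate: Double, to dstRate: Double) -> [Float] {
    guard srcRate != dstRate, input.count > 1 else { return input }
    let ratio = srcRate / dstRate
    let outCount = Int(Double(input.count) / ratio)
    var output = [Float](repeating: 0, count: outCount)
    for i in 0..<outCount {
        let pos = Double(i) * ratio
        let idx = Int(pos)
        let frac = Float(pos - Double(idx))
        let next = min(idx + 1, input.count - 1)
        // Blend the two neighboring samples by the fractional position.
        output[i] = input[idx] * (1 - frac) + input[next] * frac
    }
    return output
}
```

For example, feeding 48 kHz audio to a 16 kHz model keeps every third interpolated sample's worth of signal, shrinking the buffer by 3x.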
Source Structure
Sources/
AudioCommon/ Shared protocols, audio I/O, HuggingFace downloader
Qwen3Common/ Shared model components (KV cache, RoPE, quantization)
Qwen3ASR/ Qwen3-ASR speech-to-text
ParakeetASR/ Parakeet TDT speech-to-text (CoreML)
Qwen3TTS/ Qwen3-TTS text-to-speech
CosyVoiceTTS/ CosyVoice3 text-to-speech
KokoroTTS/ Kokoro-82M text-to-speech (CoreML)
PersonaPlex/ PersonaPlex speech-to-speech
SpeechVAD/ VAD (Silero + Pyannote), diarization, speaker embeddings
SpeechEnhancement/ DeepFilterNet3 noise suppression (CoreML)
AudioCLILib/ CLI command implementations
AudioCLI/ CLI entry point
scripts/ Model conversion (PyTorch → MLX/CoreML), benchmarking
Tests/ Unit and integration tests
Examples/ Demo apps (PersonaPlexDemo, SpeechDemo)