Architecture

speech-swift is organized as a modular Swift package with shared protocols, independent model modules, and a unified CLI. All inference runs on-device using MLX (Metal GPU) or CoreML (Neural Engine).

Module Dependency Graph

                    ┌──────────┐
                    │ AudioCLI │  (entry point)
                    └────┬─────┘
                         │
                  ┌──────┴──────┐
                  │ AudioCLILib │  (commands)
                  └──────┬──────┘
                         │
       ┌─────────┬───────┼───────┬──────────┬──────────────┐
       │         │       │       │          │              │
  ┌────┴───┐ ┌──┴──┐ ┌──┴──┐ ┌─┴────┐ ┌───┴────┐ ┌──────┴───────┐
  │Qwen3ASR│ │Qwen3│ │Cosy │ │Perso-│ │Speech- │ │  Speech-     │
  │Parakeet│ │ TTS │ │Voice│ │naPlex│ │  VAD   │ │Enhancement   │
  └────┬───┘ └──┬──┘ └──┬──┘ └──┬───┘ └───┬───┘ └──────┬───────┘
       │        │       │       │         │             │
       └────────┴───────┼───────┴─────────┘             │
                        │                               │
                 ┌──────┴──────┐                        │
                 │ Qwen3Common │  (shared layers)       │
                 └──────┬──────┘                        │
                        │                               │
                 ┌──────┴──────┐                        │
                 │ AudioCommon │ ◄──────────────────────┘
                 └─────────────┘  (protocols, audio I/O)
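
AudioCommon sits at the base of the graph and defines the shared protocols that every model module conforms to. As an illustration of that pattern only (the actual protocol and method names in speech-swift may differ), a transcription-facing protocol could look like:

```swift
import Foundation

// Illustrative sketch only: the real protocol names and signatures in
// AudioCommon may differ. It shows the pattern of a shared base protocol
// that the model modules refine for their tasks.
public protocol SpeechModel {
    /// Sample rate (Hz) the model expects for mono Float32 PCM input.
    static var expectedSampleRate: Double { get }
}

/// A speech-to-text model such as Qwen3-ASR or Parakeet would refine
/// the base protocol with its task-specific entry point.
public protocol SpeechTranscriber: SpeechModel {
    func transcribe(_ samples: [Float]) throws -> String
}
```

Because the CLI layer depends only on such protocols, AudioCLILib can dispatch to any concrete model (MLX or CoreML backed) without knowing its implementation details.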

Inference Backends

| Backend    | Hardware      | Models |
|------------|---------------|--------|
| MLX        | Metal GPU     | Qwen3-ASR, Qwen3-TTS, CosyVoice3, PersonaPlex, Pyannote, Silero VAD, WeSpeaker |
| CoreML     | Neural Engine | Qwen3-ASR encoder (hybrid), Parakeet TDT, Kokoro-82M, Sortformer diarization, DeepFilterNet3, Silero VAD (optional), WeSpeaker (optional) |
| Accelerate | CPU (SIMD)    | Audio preprocessing (STFT, mel, FFT), signal processing |
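
All three backends consume features produced by the Accelerate-based preprocessing stage. As a minimal sketch of the framing step of an STFT (function names here are illustrative; the package presumably runs the FFT itself through vDSP):

```swift
import Foundation

// Sketch of STFT framing: split a signal into overlapping, Hann-windowed
// frames. Only the framing/windowing math is shown; the FFT itself would
// run through Accelerate (vDSP) in the real pipeline.
func hannWindow(_ length: Int) -> [Float] {
    (0..<length).map { i in
        Float(0.5 * (1 - cos(2 * Double.pi * Double(i) / Double(length))))
    }
}

func stftFrames(_ signal: [Float], frameLength: Int, hop: Int) -> [[Float]] {
    let window = hannWindow(frameLength)
    var frames: [[Float]] = []
    var start = 0
    while start + frameLength <= signal.count {
        let frame = Array(signal[start..<(start + frameLength)])
        frames.append(zip(frame, window).map(*))  // apply window per sample
        start += hop
    }
    return frames
}
```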

Model Weight Format

MLX models use the safetensors format with 4-bit or 8-bit quantization (group size 64); CoreML models use the compiled .mlmodelc format. Scripts in scripts/ convert the original PyTorch checkpoints to both formats.

| Model                              | Params           | Quantization  | Size on Disk      |
|------------------------------------|------------------|---------------|-------------------|
| Qwen3-ASR 0.6B (MLX)               | ~600M            | 4-bit / 8-bit | 680 MB / 1.0 GB   |
| Qwen3-ASR 0.6B (CoreML)            | ~186M (encoder)  | INT8          | ~180 MB           |
| Qwen3-ASR 1.7B (MLX)               | ~1.7B            | 4-bit / 8-bit | 2.1 GB / 3.2 GB   |
| Parakeet-TDT 0.6B (CoreML)         | ~600M            | INT4 / INT8   | 315 MB / 500 MB   |
| Qwen3-ForcedAligner 0.6B (MLX)     | ~600M            | 4-bit / 8-bit | 979 MB / 1.4 GB   |
| Qwen3-ForcedAligner 0.6B (CoreML)  | ~600M            | INT4 / INT8   | 630 MB / 1.0 GB   |
| Qwen3-TTS 0.6B (MLX)               | ~600M            | 4-bit / 8-bit | 1.7 GB / 2.4 GB   |
| Qwen3-TTS 1.7B (MLX)               | ~1.7B            | 4-bit / 8-bit | 3.2 GB / 4.8 GB   |
| CosyVoice3 0.5B (MLX)              | ~500M            | 4-bit LLM     | ~1.2 GB           |
| Kokoro-82M (CoreML)                | 82M              | float16       | ~325 MB           |
| PersonaPlex 7B (MLX)               | ~7B              | 4-bit / 8-bit | 4.9 GB / 9.1 GB   |
| Pyannote VAD (MLX)                 | ~1.49M           | float32       | ~5.7 MB           |
| Silero VAD v5 (MLX & CoreML)       | ~309K            | float32       | ~1.2 MB           |
| WeSpeaker ResNet34 (MLX & CoreML)  | ~6.6M            | float32       | ~25 MB            |
| Sortformer (CoreML)                |                  | float16       | ~50 MB            |
| DeepFilterNet3 (CoreML)            | ~2.1M            | float16       | ~4.2 MB           |
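
A back-of-envelope check on the quantized sizes, assuming each group of 64 weights carries a float16 scale and float16 zero-point (4 bytes of per-group overhead; the exact metadata layout in the real checkpoints may differ). On-disk figures above run larger because checkpoints also include unquantized components such as embeddings and norms:

```swift
// Hedged estimate: packed `bits`-bit weights plus, per group of
// `groupSize` weights, a float16 scale and float16 zero-point (4 bytes).
func quantizedBytes(params: Int, bits: Int, groupSize: Int = 64) -> Int {
    let weightBytes = params * bits / 8
    let groups = (params + groupSize - 1) / groupSize
    return weightBytes + groups * 4
}

let mb = Double(quantizedBytes(params: 600_000_000, bits: 4)) / 1_000_000
// ≈ 338 MB of quantized weight data for a 600M-parameter model at 4-bit
```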

Performance Optimizations

Audio Processing

All audio I/O uses Float32 PCM. Internal resampling handles format conversion:

| Model          | Expected Rate  | Format       |
|----------------|----------------|--------------|
| Qwen3-ASR      | 16 kHz         | Mono Float32 |
| Qwen3-TTS      | 24 kHz output  | Mono Float32 |
| CosyVoice3     | 24 kHz output  | Mono Float32 |
| Kokoro-82M     | 24 kHz output  | Mono Float32 |
| PersonaPlex    | 24 kHz I/O     | Mono Float32 |
| Pyannote VAD   | 16 kHz         | Mono Float32 |
| Silero VAD     | 16 kHz         | Mono Float32 |
| WeSpeaker      | 16 kHz         | Mono Float32 |
| DeepFilterNet3 | 48 kHz         | Mono Float32 |
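
Since the models disagree on sample rate (16 kHz for the analysis models, 24 kHz or 48 kHz elsewhere), conversion has to happen internally. A minimal linear-interpolation resampler illustrates the idea; the real pipeline would likely use a proper band-limited filter via Accelerate:

```swift
// Naive linear-interpolation resampler: fine as an illustration, but a
// production path would low-pass filter first to avoid aliasing when
// downsampling (e.g. 48 kHz -> 16 kHz).
func resample(_ input: [Float], from inRate: Double, to outRate: Double) -> [Float] {
    guard inRate != outRate, input.count > 1 else { return input }
    let step = inRate / outRate
    let outCount = Int(Double(input.count) / step)
    var output = [Float]()
    output.reserveCapacity(outCount)
    for i in 0..<outCount {
        let pos = Double(i) * step
        let idx = Int(pos)
        let frac = Float(pos - Double(idx))
        let next = min(idx + 1, input.count - 1)
        output.append(input[idx] * (1 - frac) + input[next] * frac)
    }
    return output
}
```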

Source Structure

Sources/
  AudioCommon/        Shared protocols, audio I/O, HuggingFace downloader
  Qwen3Common/        Shared model components (KV cache, RoPE, quantization)
  Qwen3ASR/           Qwen3-ASR speech-to-text
  ParakeetASR/        Parakeet TDT speech-to-text (CoreML)
  Qwen3TTS/           Qwen3-TTS text-to-speech
  CosyVoiceTTS/       CosyVoice3 text-to-speech
  KokoroTTS/          Kokoro-82M text-to-speech (CoreML)
  PersonaPlex/        PersonaPlex speech-to-speech
  SpeechVAD/          VAD (Silero + Pyannote), diarization, speaker embeddings
  SpeechEnhancement/  DeepFilterNet3 noise suppression (CoreML)
  AudioCLILib/        CLI command implementations
  AudioCLI/           CLI entry point

scripts/              Model conversion (PyTorch → MLX/CoreML), benchmarking
Tests/                Unit and integration tests
Examples/             Demo apps (PersonaPlexDemo, SpeechDemo)