Architecture

speech-swift is organized as a modular Swift package with shared protocols, independent model modules, and a unified CLI. All inference runs on-device using MLX (Metal GPU) or CoreML (Neural Engine).
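
To make the "shared protocols, independent modules" idea concrete, here is a minimal sketch of how a CLI command might depend only on a protocol rather than on a concrete model module. Every name in it (SpeechRecognizer, StubMLXRecognizer, StubCoreMLRecognizer, runTranscribe) is hypothetical and chosen for illustration; the actual protocols in AudioCommon are not shown in this document and may look different.

```swift
// Hypothetical protocol for illustration; the actual protocol names and
// signatures in AudioCommon are not shown in this document.
protocol SpeechRecognizer {
    /// Sample rate the model expects, in Hz.
    var expectedSampleRate: Int { get }
    /// Transcribes mono Float32 PCM samples into text.
    func transcribe(_ samples: [Float]) -> String
}

// Stand-ins for two independent model modules. Real modules would run
// MLX or CoreML inference here instead of returning a fixed string.
struct StubMLXRecognizer: SpeechRecognizer {
    let expectedSampleRate = 16_000
    func transcribe(_ samples: [Float]) -> String {
        "mlx: \(samples.count) samples"
    }
}

struct StubCoreMLRecognizer: SpeechRecognizer {
    let expectedSampleRate = 16_000
    func transcribe(_ samples: [Float]) -> String {
        "coreml: \(samples.count) samples"
    }
}

// A CLI command only needs the protocol, not the concrete module.
// Returns nil when the requested engine name is unknown.
func runTranscribe(engine: String, samples: [Float]) -> String? {
    let engines: [String: any SpeechRecognizer] = [
        "mlx": StubMLXRecognizer(),
        "coreml": StubCoreMLRecognizer(),
    ]
    return engines[engine]?.transcribe(samples)
}
```

With this shape, adding a model means adding one conforming type and one registry entry; the command code itself does not change.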

Module dependency graph

                    ┌──────────┐
                    │ AudioCLI │  (entry point)
                    └────┬─────┘
                         │
                  ┌──────┴──────┐
                  │ AudioCLILib │  (commands)
                  └──────┬──────┘
                         │
       ┌─────────┬───────┼───────┬──────────┬──────────────┐
       │         │       │       │          │              │
  ┌────┴───┐ ┌──┴──┐ ┌──┴──┐ ┌─┴────┐ ┌───┴────┐ ┌──────┴───────┐
  │Qwen3ASR│ │Qwen3│ │Cosy │ │Perso-│ │Speech- │ │  Speech-     │
  │Parakeet│ │ TTS │ │Voice│ │naPlex│ │  VAD   │ │Enhancement   │
  └────┬───┘ └──┬──┘ └──┬──┘ └──┬───┘ └───┬────┘ └──────┬───────┘
       │        │       │       │         │             │
       └────────┴───────┼───────┴─────────┘             │
                        │                               │
                 ┌──────┴──────┐                        │
                 │ Qwen3Common │  (shared layers)       │
                 └──────┬──────┘                        │
                        │                               │
                 ┌──────┴──────┐                        │
                 │ AudioCommon │ ◄──────────────────────┘
                 └─────────────┘  (protocols, audio I/O)
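
The layering in the diagram can also be read as data. The sketch below transcribes the edges into a dictionary and derives an order in which every module appears after the modules it depends on. It is an illustration only: the module names are taken from the Sources/ layout where the diagram abbreviates them, the edges are copied from the diagram as drawn, and buildOrder is a hypothetical helper, not part of the package.

```swift
// Edges transcribed from the diagram: module -> direct dependencies.
// Names follow the Sources/ layout where the diagram abbreviates them.
let dependencies: [String: [String]] = [
    "AudioCLI": ["AudioCLILib"],
    "AudioCLILib": ["Qwen3ASR", "ParakeetASR", "Qwen3TTS", "CosyVoiceTTS",
                    "PersonaPlex", "SpeechVAD", "SpeechEnhancement"],
    "Qwen3ASR": ["Qwen3Common"],
    "ParakeetASR": ["Qwen3Common"],
    "Qwen3TTS": ["Qwen3Common"],
    "CosyVoiceTTS": ["Qwen3Common"],
    "PersonaPlex": ["Qwen3Common"],
    "SpeechVAD": ["Qwen3Common"],
    "SpeechEnhancement": ["AudioCommon"],
    "Qwen3Common": ["AudioCommon"],
    "AudioCommon": [],
]

/// Depth-first topological sort: each module is emitted only after all of
/// its dependencies. Assumes the graph has no cycles.
func buildOrder(_ graph: [String: [String]]) -> [String] {
    var visited = Set<String>()
    var order: [String] = []

    func visit(_ module: String) {
        // `insert` reports whether the module was newly added.
        guard visited.insert(module).inserted else { return }
        for dependency in (graph[module] ?? []).sorted() {
            visit(dependency)
        }
        order.append(module)
    }

    // Sorted keys keep the result deterministic across runs.
    for module in graph.keys.sorted() {
        visit(module)
    }
    return order
}
```

Calling buildOrder(dependencies) places AudioCommon first and AudioCLI last, which matches the bottom-to-top reading of the diagram.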

Inference backends

Backend     Hardware        Models
MLX         Metal GPU       Qwen3-ASR, Qwen3-TTS, CosyVoice3, Qwen3.5-Chat, PersonaPlex,
                            Omnilingual ASR (300M / 1B / 3B / 7B), Pyannote, Silero VAD,
                            WeSpeaker
CoreML      Neural Engine   Qwen3-ASR encoder (hybrid), Parakeet TDT, Parakeet EOU streaming,
                            Omnilingual ASR 300M, Kokoro-82M, Qwen3.5-Chat (optional),
                            Sortformer diarization, DeepFilterNet3, Silero VAD (optional),
                            WeSpeaker (optional)
Accelerate  CPU (SIMD)      Audio preprocessing (STFT, mel, FFT), signal processing

Model weight format

MLX models use the safetensors format with 4-bit or 8-bit quantization (group size 64). CoreML models use the compiled .mlmodelc format. Conversion scripts in scripts/ convert from PyTorch checkpoints.

Model                               Parameters        Quantization     Size on disk
Qwen3-ASR 0.6B (MLX)                ~600M             4-bit / 8-bit    680 MB / 1.0 GB
Qwen3-ASR 0.6B (CoreML)             ~186M (encoder)   INT8             ~180 MB
Qwen3-ASR 1.7B (MLX)                ~1.7B             4-bit / 8-bit    2.1 GB / 3.2 GB
Parakeet-TDT 0.6B (CoreML)          ~600M             INT8             500 MB
Parakeet-EOU 120M (CoreML)          ~120M             INT8             ~120 MB
Omnilingual-ASR-CTC 300M (CoreML)   326M              INT8             312 MB
Omnilingual-ASR-CTC 300M (MLX)      326M              4-bit / 8-bit    193 MB / 342 MB
Omnilingual-ASR-CTC 1B (MLX)        1.01B             4-bit / 8-bit    549 MB / 1006 MB
Omnilingual-ASR-CTC 3B (MLX)        ~3B               4-bit / 8-bit    1.71 GB / 3.16 GB
Omnilingual-ASR-CTC 7B (MLX)        ~7B               4-bit / 8-bit    3.55 GB / 6.63 GB
Qwen3-ForcedAligner 0.6B (MLX)      ~600M             4-bit / 8-bit    979 MB / 1.4 GB
Qwen3-ForcedAligner 0.6B (CoreML)   ~600M             INT4 / INT8      630 MB / 1.0 GB
Qwen3-TTS 0.6B (MLX)                ~600M             4-bit / 8-bit    1.7 GB / 2.4 GB
Qwen3-TTS 1.7B (MLX)                ~1.7B             4-bit / 8-bit    3.2 GB / 4.8 GB
CosyVoice3 0.5B (MLX)               ~500M             LLM 4-bit        ~1.2 GB
Kokoro-82M (CoreML)                 82M               INT8 (1 bucket)  ~89 MB
Qwen3.5-Chat 0.8B (MLX)             ~800M             INT4             418 MB
Qwen3.5-Chat 0.8B (CoreML)          ~800M             INT8             981 MB
PersonaPlex 7B (MLX)                ~7B               4-bit / 8-bit    4.9 GB / 9.1 GB
Pyannote VAD (MLX)                  ~1.49M            float32          ~5.7 MB
Silero VAD v5                       ~309K             float32          ~1.2 MB (MLX and CoreML)
WeSpeaker ResNet34                  ~6.6M             float32          ~25 MB (MLX and CoreML)
Sortformer (CoreML)                 -                 float16          ~50 MB
DeepFilterNet3 (CoreML)             ~2.1M             FP16             ~4.2 MB
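
For readers unfamiliar with what "4-bit quantization (group size 64)" means in practice, the following is a minimal sketch of group-wise affine quantization in plain Swift. It assumes the common scheme in which each group of 64 weights stores one scale and one bias (the group minimum) next to its integer codes. The types and functions here (QuantizedGroup, quantize, dequantize) are hypothetical and written for illustration; they are not the package's or MLX's actual implementation.

```swift
/// One quantized group: integer codes plus the scale and bias needed to
/// decode them. Codes are stored one per byte here for clarity; a real
/// format would pack two 4-bit codes into each byte.
struct QuantizedGroup {
    let codes: [UInt8]
    let scale: Float
    let bias: Float
}

/// Affine quantization of `weights`, one scale and bias per group.
func quantize(_ weights: [Float], groupSize: Int = 64, bits: Int = 4) -> [QuantizedGroup] {
    precondition(groupSize > 0 && weights.count % groupSize == 0,
                 "weight count must be a multiple of the group size")
    precondition(bits >= 1 && bits <= 8, "codes are stored in UInt8")
    let levels = Float((1 << bits) - 1)   // 15 for 4-bit, 255 for 8-bit

    return stride(from: 0, to: weights.count, by: groupSize).map { start -> QuantizedGroup in
        let group = weights[start..<(start + groupSize)]
        let lo = group.min()!
        let hi = group.max()!
        // A constant group has no range; any non-zero scale decodes it exactly.
        let scale = hi > lo ? (hi - lo) / levels : 1
        let codes = group.map { weight -> UInt8 in
            let code = ((weight - lo) / scale).rounded()
            return UInt8(min(levels, max(0, code)))
        }
        return QuantizedGroup(codes: codes, scale: scale, bias: lo)
    }
}

/// Reconstructs approximate weights: w ≈ code * scale + bias.
func dequantize(_ groups: [QuantizedGroup]) -> [Float] {
    groups.flatMap { group in
        group.codes.map { Float($0) * group.scale + group.bias }
    }
}
```

Because each group has its own scale, the rounding error of any weight is bounded by half of that group's scale, so smaller groups track local weight ranges more closely at the cost of storing more scales and biases.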

Performance optimizations

Audio processing

All audio I/O uses Float32 PCM. Internal resampling handles format conversion:

Model           Expected rate   Format
Qwen3-ASR       16 kHz          Mono Float32
Qwen3-TTS       24 kHz output   Mono Float32
CosyVoice3      24 kHz output   Mono Float32
Kokoro-82M      24 kHz output   Mono Float32
PersonaPlex     24 kHz I/O      Mono Float32
Pyannote VAD    16 kHz          Mono Float32
Silero VAD      16 kHz          Mono Float32
WeSpeaker       16 kHz          Mono Float32
DeepFilterNet3  48 kHz          Mono Float32
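
As an illustration of the kind of conversion the table implies, the sketch below downmixes interleaved audio to mono and resamples it to a target rate. It uses plain linear interpolation as a stand-in, because the package's actual resampler is not described in this document. The function names (downmixToMono, resampleLinear) are hypothetical.

```swift
/// Averages interleaved multi-channel samples down to a single channel.
func downmixToMono(_ interleaved: [Float], channels: Int) -> [Float] {
    precondition(channels > 0 && interleaved.count % channels == 0,
                 "sample count must be a multiple of the channel count")
    let frames = interleaved.count / channels
    return (0..<frames).map { frame -> Float in
        var sum: Float = 0
        for channel in 0..<channels {
            sum += interleaved[frame * channels + channel]
        }
        return sum / Float(channels)
    }
}

/// Linear-interpolation resampler. It applies no anti-aliasing filter, so
/// downsampling real audio this way can alias; production code would
/// low-pass filter first or use a dedicated converter.
func resampleLinear(_ input: [Float], from sourceRate: Double, to targetRate: Double) -> [Float] {
    precondition(sourceRate > 0 && targetRate > 0, "rates must be positive")
    guard !input.isEmpty, sourceRate != targetRate else { return input }

    let outputCount = Int((Double(input.count) * targetRate / sourceRate).rounded(.down))
    let step = sourceRate / targetRate   // input samples advanced per output sample

    return (0..<outputCount).map { i -> Float in
        let position = Double(i) * step
        let index = min(Int(position), input.count - 1)
        let next = min(index + 1, input.count - 1)   // clamp at the last sample
        let fraction = Float(position - Double(index))
        return input[index] * (1 - fraction) + input[next] * fraction
    }
}
```

For example, one second of 48 kHz stereo audio destined for a 16 kHz model would first go through downmixToMono(_:channels:) with channels set to 2, then through resampleLinear(_:from:to:) from 48_000 to 16_000, producing 16,000 mono samples.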

Source code structure

Sources/
  AudioCommon/            Shared protocols, audio I/O, HuggingFace downloader,
                          SentencePieceModel (protobuf reader)
  MLXCommon/              MLX utilities: weight loading, QuantizedLinear helpers,
                          SDPA multi-head attention helper, metal budget
  Qwen3Common/            Shared model components (KV cache, RoPE, quantization)
  Qwen3ASR/               Qwen3-ASR speech-to-text
  ParakeetASR/            Parakeet TDT speech-to-text (CoreML)
  ParakeetStreamingASR/   Parakeet EOU 120M streaming dictation (CoreML)
  OmnilingualASR/         Meta wav2vec2 + CTC, 1,672 languages
                          (CoreML 300M + MLX 300M / 1B / 3B / 7B)
  Qwen3TTS/               Qwen3-TTS text-to-speech
  CosyVoiceTTS/           CosyVoice3 text-to-speech
  KokoroTTS/              Kokoro-82M text-to-speech (CoreML)
  Qwen3Chat/              Qwen3.5-0.8B on-device LLM chat (MLX + CoreML)
  PersonaPlex/            PersonaPlex speech-to-speech
  SpeechVAD/              VAD (Silero + Pyannote), diarization, speaker embeddings
  SpeechEnhancement/      DeepFilterNet3 noise suppression (CoreML)
  AudioCLILib/            CLI command implementations
  AudioCLI/               CLI entry point

scripts/              Model conversion (PyTorch → MLX/CoreML), benchmarking
Tests/                Unit and integration tests
Examples/             Demo apps (PersonaPlexDemo, SpeechDemo)