Architecture — Android

speech-android is a thin Kotlin SDK + JNI bridge over the speech-core C++ engine. All ML inference and pipeline orchestration live in speech-core; speech-android handles only the Android packaging. Linux / automotive (Yocto, Qualcomm SA8295P/SA8255P with QNN) is hosted directly in speech-core/examples/linux.

Stack

The model wrappers (Silero VAD, Parakeet STT, Kokoro TTS, DeepFilterNet3) directly implement the speech-core interfaces (VADInterface, STTInterface, TTSInterface, EnhancerInterface), so the JNI bridge constructs them and hands references to speech_core::VoicePipeline with no C-vtable adapter boilerplate.

┌──────────────────────────────────────────────┐
│     SpeechPipeline (Kotlin public API)       │
│             ↓ JNI                            │
│     jni_bridge.cpp (~250 lines)              │
└──────────────────┬───────────────────────────┘
                   │
┌──────────────────┴───────────────────────────┐
│       speech_core_models (git submodule)     │
│   Silero / Parakeet / Kokoro / DeepFilter    │
│       speech_core                            │
│   Turn detection · Interruptions · Context   │
└──┬────────┬────────┬────────┬────────────────┘
   │        │        │        │  direct interface impl
┌──┴──┐  ┌──┴──┐  ┌──┴──┐  ┌──┴───────┐
│ VAD │  │ STT │  │ TTS │  │ Enhancer │
└──┬──┘  └──┬──┘  └──┬──┘  └──┬───────┘
   └────────┴────────┴────────┘
       ONNX Runtime (CPU / NNAPI)
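
To make the "direct interface implementation" idea concrete, here is a minimal C++ sketch. It is not the speech-core code: the VADInterface signature, the EnergyVad toy model, and the MiniPipeline class are all assumptions invented for illustration. The point it tries to show is that a wrapper derives from the interface itself, so the pipeline can hold a plain reference to it without an intermediate table of C function pointers.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for speech-core's VADInterface; the real signature may differ.
struct VADInterface {
    virtual ~VADInterface() = default;
    // Returns a speech score in [0, 1] for one chunk of mono PCM samples.
    virtual float process(const std::vector<float>& chunk) = 0;
};

// Toy wrapper (not Silero): it derives from the interface directly, so no
// adapter struct of C function pointers sits between model and pipeline.
class EnergyVad final : public VADInterface {
public:
    float process(const std::vector<float>& chunk) override {
        if (chunk.empty()) return 0.0f;
        float energy = 0.0f;
        for (float s : chunk) energy += s * s;
        const float mean = energy / static_cast<float>(chunk.size());
        return mean > 1.0f ? 1.0f : mean;  // clamp so the score stays in [0, 1]
    }
};

// Hypothetical pipeline: stores a reference to the interface, never the concrete type.
class MiniPipeline {
public:
    MiniPipeline(VADInterface& vad, float threshold) : vad_(vad), threshold_(threshold) {
        assert(threshold >= 0.0f && threshold <= 1.0f);
    }
    bool is_speech(const std::vector<float>& chunk) {
        return vad_.process(chunk) >= threshold_;
    }

private:
    VADInterface& vad_;
    float threshold_;
};
```

Under these assumptions, swapping in a different backend only requires another class that derives from the same interface; the pipeline code does not change.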

Pipeline

The speech pipeline runs three stages in sequence: VAD → STT → TTS. Voice activity detection triggers recording, STT transcribes the captured audio, and TTS synthesizes the spoken response. Barge-in support lets the user interrupt TTS playback by starting to speak mid-response.

speech-core manages turn detection, interruption handling, and conversation context. The model wrappers implement speech-core's interfaces directly — no C-vtable adapter layer — making it equally easy to plug in non-ONNX backends (e.g. the CoreML / MLX implementations in speech-swift) that conform to the same interfaces.
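
The turn flow described in this section, including barge-in, can be pictured as a small state machine. The sketch below is an assumption-laden illustration, not the speech-core implementation: the State and Event names, the TurnController type, and the replay helper are all hypothetical.

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical states and events; speech-core's own names may differ.
enum class State { Idle, Listening, Transcribing, Speaking };
enum class Event { SpeechStart, SpeechEnd, TranscriptReady, PlaybackDone };

struct TurnController {
    State state = State::Idle;
    int barge_ins = 0;  // how many times the user interrupted TTS playback

    void on(Event e) {
        switch (state) {
        case State::Idle:          // VAD detects speech, so recording starts
            if (e == Event::SpeechStart) state = State::Listening;
            break;
        case State::Listening:     // end of speech hands the audio to STT
            if (e == Event::SpeechEnd) state = State::Transcribing;
            break;
        case State::Transcribing:  // transcript is ready, TTS playback begins
            if (e == Event::TranscriptReady) state = State::Speaking;
            break;
        case State::Speaking:
            if (e == Event::SpeechStart) {  // barge-in: drop playback, listen again
                ++barge_ins;
                state = State::Listening;
            } else if (e == Event::PlaybackDone) {
                state = State::Idle;
            }
            break;
        }
    }
};

// Feeds a sequence of events to the controller and returns the final state.
State replay(TurnController& c, std::initializer_list<Event> events) {
    assert(events.size() > 0);
    for (Event e : events) c.on(e);
    return c.state;
}
```

In this sketch a SpeechStart event during Speaking is the only transition that skips PlaybackDone, which is what distinguishes an interrupted turn from a completed one.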

Models

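All models ship in ONNX format. INT8 quantization is the default for the two large models (Parakeet and Kokoro); Silero VAD and DeepFilterNet3 stay at 32-bit and 16-bit floating point respectively. Models are hosted on HuggingFace under the aufklarer org and auto-download on first use via ModelManager.kt.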

Model            Task                                 Quantization  Size
Parakeet TDT v3  STT (114 languages, 8192 BPE vocab)  INT8          ~500 MB
Kokoro-82M       TTS                                  INT8          ~89 MB
Silero VAD v5    Voice Activity Detection             FP32          ~1.2 MB
DeepFilterNet3   Noise Cancellation                   FP16          ~4.2 MB

Total model download is approximately 1.2 GB. After the initial download, all inference runs fully offline.

Inference: OnnxEngine

The onnx_engine.h wrapper provides platform-aware execution provider (EP) selection. It probes the available EPs at runtime and falls back to CPU execution when no accelerated path is available:

Platform     Chipset               Acceleration
Android      Snapdragon 8 Gen 1+   NNAPI → Hexagon NPU
Android      Samsung Exynos 2200+  NNAPI → Samsung NPU
Android      Google Tensor G2+     NNAPI → Google TPU
Any Android  CPU fallback          XNNPACK

For automotive Qualcomm SA8295P / SA8255P with QNN (Hexagon DSP), see speech-core/examples/linux.
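
The probe-and-fall-back behaviour described for OnnxEngine can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of onnx_engine.h: select_provider, the ProbeFn callback, and the provider name strings are hypothetical placeholders rather than ONNX Runtime identifiers, and the probe is injected so the logic can run without a device.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical probe: returns true when the named execution provider
// can be initialised on the current device.
using ProbeFn = std::function<bool(const std::string&)>;

// Walks the preference list in order and returns the first provider whose
// probe succeeds. If none succeeds, plain CPU execution is the last resort.
std::string select_provider(const std::vector<std::string>& preferred,
                            const ProbeFn& probe) {
    assert(static_cast<bool>(probe));  // the probe callback must be callable
    for (const auto& name : preferred) {
        if (probe(name)) return name;
    }
    return "CPU";
}
```

With a preference list such as {"NNAPI", "XNNPACK"}, a device without a usable NNAPI path would resolve to "XNNPACK", and one where every probe fails would resolve to "CPU".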

Key C++ Files

File                   Purpose
jni_bridge.cpp         Constructs speech_core::* model wrappers and hands references to VoicePipeline
parakeet_stt.cpp       STT with TDT greedy decoder and per-feature mel normalization
kokoro_tts.cpp         TTS with E2E model and attention mask
kokoro_phonemizer.cpp  Dictionary-based phonemizer for TTS input
silero_vad.cpp         Voice activity detection
deepfilter.cpp         Noise cancellation with STFT/ERB processing
onnx_engine.h          Platform-aware ONNX Runtime wrapper (NNAPI on Android, QNN on Linux, CPU fallback)

The model wrappers and onnx_engine.h moved into speech-core in the model-extraction refactor; see docs/models.md for the full inventory.
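
As an illustration of the per-feature mel normalization mentioned for parakeet_stt.cpp, the sketch below normalises each mel bin independently to zero mean and unit variance over time. It is a generic version of the technique, not the code in that file: the MelSpectrogram layout, the function name, the epsilon value, and the choice of population variance are all assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Assumed layout: mel[f][t], one row per mel bin, one column per frame.
using MelSpectrogram = std::vector<std::vector<float>>;

// Normalises every mel bin (row) to zero mean and unit variance across time.
// eps keeps the division finite for constant rows; its value is an assumption.
void normalize_per_feature(MelSpectrogram& mel, float eps = 1e-5f) {
    assert(eps > 0.0f);
    for (auto& row : mel) {
        if (row.empty()) continue;
        const float n = static_cast<float>(row.size());

        float mean = 0.0f;
        for (float v : row) mean += v;
        mean /= n;

        float var = 0.0f;
        for (float v : row) var += (v - mean) * (v - mean);
        var /= n;  // population variance; an implementation may divide by n - 1

        const float inv_std = 1.0f / (std::sqrt(var) + eps);
        for (float& v : row) v = (v - mean) * inv_std;
    }
}
```

Normalising per bin rather than over the whole spectrogram keeps quiet high-frequency bins on the same scale as loud low-frequency ones, which is the usual motivation for this step.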

Source Structure

speech-android/
  speech-core/              C++ engine + ONNX model wrappers (git submodule)
  sdk/src/main/
    cpp/jni_bridge.cpp      Thin JNI bridge over speech_core::VoicePipeline
    cpp/CMakeLists.txt      Pulls speech-core via add_subdirectory() with SPEECH_CORE_WITH_ONNX=ON
    kotlin/.../speech/      Kotlin public SDK (SpeechPipeline, ModelManager)
  sdk/src/androidTest/      Instrumented e2e tests
  app/                      Demo application

Linux / automotive (C ABI, ALSA demo, CLI tools) lives at:
  speech-core/examples/linux/

Source code: github.com/soniqo/speech-android