Architecture — Android

speech-android is a thin Kotlin SDK + JNI bridge over the speech-core C++ engine. All ML inference and pipeline orchestration live in speech-core; speech-android adds only the Android-specific layer (Kotlin API, JNI bridge, packaging). Linux / automotive support (Yocto, Qualcomm SA8295P/SA8255P with QNN) is hosted directly in speech-core/examples/linux.

Stack

The model wrappers (Silero VAD, Parakeet STT, Kokoro TTS, DeepFilterNet3) directly implement the speech-core interfaces (VADInterface, STTInterface, TTSInterface, EnhancerInterface), so the JNI bridge constructs them and hands references to speech_core::VoicePipeline with no C-vtable adapter boilerplate.

┌──────────────────────────────────────────────┐
│     SpeechPipeline (Kotlin public API)       │
│             ↓ JNI                            │
│     jni_bridge.cpp (~250 lines)              │
└──────────────────┬───────────────────────────┘
                   │
┌──────────────────┴───────────────────────────┐
│       speech_core_models (git submodule)     │
│   Silero / Parakeet / Kokoro / DeepFilter    │
│       speech_core                            │
│   Turn detection · Interruptions · Context   │
└──┬────────┬────────┬────────┬────────────────┘
   │        │        │        │  direct interface impl
┌──┴──┐  ┌──┴──┐  ┌──┴──┐  ┌─┴────────┐
│ VAD │  │ STT │  │ TTS │  │ Enhancer │
└──┬──┘  └──┬──┘  └──┬──┘  └─┬────────┘
   └────────┴────────┴────────┘
       ONNX Runtime (CPU / NNAPI)
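The direct-interface arrangement in the diagram can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the method names (process, transcribe, push), the toy EnergyVad and EchoStt classes, and the sketch namespace are hypothetical, and the real speech-core signatures may differ. The only point it shows is that a model class deriving from the interface can be passed to the pipeline by reference, with no adapter in between.

```cpp
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical, simplified stand-ins for the speech-core interfaces.
struct VADInterface {
    virtual ~VADInterface() = default;
    // Returns a speech score in [0, 1] for one chunk of PCM samples.
    virtual float process(const std::vector<float>& chunk) = 0;
};

struct STTInterface {
    virtual ~STTInterface() = default;
    virtual std::string transcribe(const std::vector<float>& audio) = 0;
};

// Toy energy-threshold VAD that implements the interface directly.
class EnergyVad final : public VADInterface {
public:
    float process(const std::vector<float>& chunk) override {
        if (chunk.empty()) return 0.0f;
        float energy = 0.0f;
        for (float s : chunk) energy += s * s;
        energy /= static_cast<float>(chunk.size());
        return energy > 0.01f ? 1.0f : 0.0f;
    }
};

// Toy STT that only reports how many samples it received.
class EchoStt final : public STTInterface {
public:
    std::string transcribe(const std::vector<float>& audio) override {
        return "samples:" + std::to_string(audio.size());
    }
};

// The pipeline stores references; the caller (here, the bridge) owns the models.
class VoicePipeline {
public:
    VoicePipeline(VADInterface& vad, STTInterface& stt) : vad_(vad), stt_(stt) {}

    // Returns a transcript when the chunk is classified as speech, else "".
    std::string push(const std::vector<float>& chunk) {
        return vad_.process(chunk) >= 0.5f ? stt_.transcribe(chunk) : std::string();
    }

private:
    VADInterface& vad_;
    STTInterface& stt_;
};

}  // namespace sketch
```

In this sketch, swapping in a different backend only requires another class that derives from the same interface; the pipeline code does not change.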

Pipeline

The speech pipeline runs three stages sequentially: VAD → STT → TTS. Voice Activity Detection triggers recording, STT transcribes the captured audio, and TTS synthesizes the spoken response. With barge-in, TTS playback is interrupted as soon as the user starts speaking mid-response.

speech-core manages turn detection, interruption handling, and conversation context. Because the model wrappers implement speech-core's interfaces directly, non-ONNX backends that conform to the same interfaces (e.g. the CoreML / MLX implementations in speech-swift) plug in the same way.
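The turn flow described above, including barge-in, can be pictured as a small state machine. This is a sketch under stated assumptions, not speech-core's actual implementation: the State and Event names and the TurnMachine type are hypothetical.

```cpp
// Hypothetical states and events for one conversational turn.
enum class State { Idle, Listening, Transcribing, Speaking };
enum class Event { SpeechStart, SpeechEnd, TranscriptReady, PlaybackDone };

struct TurnMachine {
    State state = State::Idle;
    int interruptions = 0;  // number of barge-ins seen so far

    void on(Event e) {
        switch (state) {
        case State::Idle:
            // VAD detects speech: start recording.
            if (e == Event::SpeechStart) state = State::Listening;
            break;
        case State::Listening:
            // VAD detects end of speech: hand the audio to STT.
            if (e == Event::SpeechEnd) state = State::Transcribing;
            break;
        case State::Transcribing:
            // Transcript is available: start TTS playback.
            if (e == Event::TranscriptReady) state = State::Speaking;
            break;
        case State::Speaking:
            if (e == Event::SpeechStart) {
                // Barge-in: the user spoke during playback, so stop and listen.
                ++interruptions;
                state = State::Listening;
            } else if (e == Event::PlaybackDone) {
                state = State::Idle;
            }
            break;
        }
    }
};
```

Events that do not apply to the current state are ignored, which keeps the sketch small; a production engine would likely also handle timeouts and errors.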

Models

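All models ship in ONNX format. INT8 quantization is the default for the two large models (Parakeet and Kokoro); Silero VAD stays in float32 and DeepFilterNet3 in FP16. Models are hosted on HuggingFace under the aufklarer org and auto-download on first use via ModelManager.kt.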

Model	Task	Quantization	Size
Parakeet TDT v3	STT (114 languages, 8192 BPE vocab)	INT8	~500 MB
Kokoro-82M	TTS	INT8	~89 MB
Silero VAD v5	Voice Activity Detection	float32	~1.2 MB
DeepFilterNet3	Noise Cancellation	FP16	~4.2 MB

Total model download is approximately 1.2 GB. After the initial download, all inference runs fully offline.

Inference: OnnxEngine

The onnx_engine.h wrapper provides platform-aware execution provider (EP) selection. It probes the available EPs at runtime and falls back to CPU execution when no accelerator is available:

Platform	Chipset	Acceleration
Android	Snapdragon 8 Gen 1+	NNAPI → Hexagon NPU
Android	Samsung Exynos 2200+	NNAPI → Samsung NPU
Android	Google Tensor G2+	NNAPI → Google TPU
Android	Any (CPU fallback)	XNNPACK

For automotive Qualcomm SA8295P / SA8255P with QNN (Hexagon DSP), see speech-core/examples/linux.
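The probe-and-fall-back behaviour in the table above can be sketched as a pure function over provider names. This is an illustration, not the contents of onnx_engine.h: select_provider and android_preference are hypothetical helpers, and the preference order is an assumption read off the table. The provider strings are the names ONNX Runtime uses, and in a real build the available list could come from Ort::GetAvailableProviders().

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns the first entry of `preferred` that also appears in `available`,
// or "CPUExecutionProvider" when none of the preferred providers is present.
std::string select_provider(const std::vector<std::string>& available,
                            const std::vector<std::string>& preferred) {
    for (const auto& name : preferred) {
        if (std::find(available.begin(), available.end(), name) != available.end()) {
            return name;
        }
    }
    return "CPUExecutionProvider";
}

// Assumed Android preference order: NNAPI first, then XNNPACK on CPU.
std::vector<std::string> android_preference() {
    return {"NnapiExecutionProvider", "XnnpackExecutionProvider"};
}
```

Note that a provider being listed as available only means it was compiled into the runtime; whether NNAPI actually reaches an NPU depends on the device and its drivers.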

Key C++ Files

File	Purpose
`jni_bridge.cpp`	Constructs `speech_core::*` model wrappers and hands references to `VoicePipeline`
`parakeet_stt.cpp`	STT with TDT greedy decoder and per-feature mel normalization
`kokoro_tts.cpp`	TTS with E2E model and attention mask
`kokoro_phonemizer.cpp`	Dictionary-based phonemizer for TTS input
`silero_vad.cpp`	Voice activity detection
`deepfilter.cpp`	Noise cancellation with STFT/ERB processing
`onnx_engine.h`	Platform-aware ONNX Runtime wrapper (NNAPI on Android, QNN on Linux, CPU fallback)

The model wrappers and onnx_engine.h moved into speech-core in the model-extraction refactor; see docs/models.md for the full inventory.
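As an illustration of the per-feature mel normalization mentioned for parakeet_stt.cpp, the sketch below normalizes each mel bin to zero mean and unit variance across time frames. It is written under assumptions and is not taken from the source file: the function name, the [bin][frame] layout, the unbiased (n - 1) variance, and the epsilon value are all choices made for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// mel[f][t]: f indexes the mel bin, t the time frame.
// Normalizes every bin in place using its own mean and standard deviation.
void normalize_per_feature(std::vector<std::vector<float>>& mel, float eps = 1e-5f) {
    for (auto& row : mel) {
        const std::size_t n = row.size();
        if (n == 0) continue;

        double mean = 0.0;
        for (float v : row) mean += v;
        mean /= static_cast<double>(n);

        double var = 0.0;
        for (float v : row) var += (v - mean) * (v - mean);
        // Assumption: unbiased estimate (n - 1) when more than one frame exists.
        var /= static_cast<double>(n > 1 ? n - 1 : 1);

        // eps keeps the division finite for constant (silent) bins.
        const double denom = std::sqrt(var) + eps;
        for (float& v : row) v = static_cast<float>((v - mean) / denom);
    }
}
```

Normalizing each bin separately removes per-band gain differences between recordings, which is the usual motivation for this step before an acoustic model.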

Source Structure

speech-android/
  speech-core/              C++ engine + ONNX model wrappers (git submodule)
  sdk/src/main/
    cpp/jni_bridge.cpp      Thin JNI bridge over speech_core::VoicePipeline
    cpp/CMakeLists.txt      Pulls speech-core via add_subdirectory() with SPEECH_CORE_WITH_ONNX=ON
    kotlin/.../speech/      Kotlin public SDK (SpeechPipeline, ModelManager)
  sdk/src/androidTest/      Instrumented e2e tests
  app/                      Demo application

Linux / automotive (C ABI, ALSA demo, CLI tools) lives at:
  speech-core/examples/linux/

Source code: github.com/soniqo/speech-android