Architecture — Android
speech-android is a thin Kotlin SDK + JNI bridge over the speech-core C++ engine. All ML inference and pipeline orchestration live in speech-core; speech-android handles only the Android packaging. Linux / automotive (Yocto, Qualcomm SA8295P/SA8255P with QNN) is hosted directly in speech-core/examples/linux.
Stack
The model wrappers (Silero VAD, Parakeet STT, Kokoro TTS, DeepFilterNet3) directly implement the speech-core interfaces (VADInterface, STTInterface, TTSInterface, EnhancerInterface), so the JNI bridge constructs them and hands references to speech_core::VoicePipeline with no C-vtable adapter boilerplate.
┌──────────────────────────────────────────────┐
│  SpeechPipeline (Kotlin public API)          │
│       ↓ JNI                                  │
│  jni_bridge.cpp (~250 lines)                 │
└──────────────────┬───────────────────────────┘
                   │
┌──────────────────┴───────────────────────────┐
│  speech_core_models (git submodule)          │
│  Silero / Parakeet / Kokoro / DeepFilter     │
│  speech_core                                 │
│  Turn detection · Interruptions · Context    │
└──┬────────┬────────┬────────┬────────────────┘
   │        │        │        │  direct interface impl
┌──┴──┐  ┌──┴──┐  ┌──┴──┐   ┌─┴────────┐
│ VAD │  │ STT │  │ TTS │   │ Enhancer │
└──┬──┘  └──┬──┘  └──┬──┘   └─┬────────┘
   └────────┴────────┴────────┘
    ONNX Runtime (CPU / NNAPI)
Pipeline
The speech pipeline runs three stages sequentially: VAD → STT → TTS. Voice Activity Detection triggers recording, audio is transcribed by STT, and TTS generates the response. Barge-in support allows interrupting TTS playback when the user starts speaking mid-response.
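The turn flow described above, including barge-in, can be sketched as a small state machine. This is a minimal illustration only: the state names, event names, and the `TurnMachine` type are hypothetical and are not taken from speech-core, whose real turn handling may be structured differently.

```cpp
#include <cassert>

// Hypothetical states and events for illustration; speech-core's real
// names and transitions may differ.
enum class TurnState { Idle, Listening, Transcribing, Speaking };
enum class Event { SpeechStart, SpeechEnd, TranscriptReady, PlaybackDone };

struct TurnMachine {
    TurnState state = TurnState::Idle;
    bool tts_active = false;  // true only while a response is playing
    int barge_ins = 0;        // how many times the user interrupted playback

    void on(Event e) {
        switch (state) {
        case TurnState::Idle:
            // VAD reports speech: start recording.
            if (e == Event::SpeechStart) state = TurnState::Listening;
            break;
        case TurnState::Listening:
            // VAD reports end of speech: hand the recorded audio to STT.
            if (e == Event::SpeechEnd) state = TurnState::Transcribing;
            break;
        case TurnState::Transcribing:
            // Transcript is ready: TTS starts playing the response.
            if (e == Event::TranscriptReady) {
                state = TurnState::Speaking;
                tts_active = true;
            }
            break;
        case TurnState::Speaking:
            if (e == Event::SpeechStart) {
                // Barge-in: cancel playback and listen to the user again.
                tts_active = false;
                ++barge_ins;
                state = TurnState::Listening;
            } else if (e == Event::PlaybackDone) {
                tts_active = false;
                state = TurnState::Idle;
            }
            break;
        }
        // Invariant: TTS plays only in the Speaking state.
        assert(tts_active == (state == TurnState::Speaking));
    }
};
```

Feeding the events `SpeechStart`, `SpeechEnd`, `TranscriptReady`, `SpeechStart` leaves the machine in `Listening` with playback cancelled and one barge-in counted. Events that do not apply to the current state are ignored in this sketch.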
speech-core manages turn detection, interruption handling, and conversation context. The model wrappers implement speech-core's interfaces directly — no C-vtable adapter layer — making it equally easy to plug in non-ONNX backends (e.g. the CoreML / MLX implementations in speech-swift) that conform to the same interfaces.
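To illustrate what "implementing the interfaces directly" means in practice, here is a minimal sketch. Only the interface names `VADInterface` and `STTInterface` come from the text; the method signatures, the `sketch` namespace, the `Pipeline` class, and both toy backends are assumptions made for this example and do not reflect the real speech-core headers.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace sketch {  // deliberately not the real speech_core namespace

// Hypothetical method signatures; the real interfaces may look different.
struct VADInterface {
    virtual ~VADInterface() = default;
    virtual bool is_speech(const std::vector<float>& frame) = 0;
};

struct STTInterface {
    virtual ~STTInterface() = default;
    virtual std::string transcribe(const std::vector<float>& audio) = 0;
};

// Toy backend: a mean-energy threshold standing in for a real VAD model.
class EnergyVAD final : public VADInterface {
public:
    explicit EnergyVAD(float threshold) : threshold_(threshold) {
        assert(threshold > 0.0f);
    }
    bool is_speech(const std::vector<float>& frame) override {
        if (frame.empty()) return false;
        float energy = 0.0f;
        for (float s : frame) energy += s * s;
        return energy / static_cast<float>(frame.size()) > threshold_;
    }
private:
    float threshold_;
};

// Toy backend: always returns the same transcript.
class FixedSTT final : public STTInterface {
public:
    std::string transcribe(const std::vector<float>&) override { return "hello"; }
};

// The pipeline stores references to the interfaces, so any conforming
// backend can be passed in without an adapter struct in between.
class Pipeline {
public:
    Pipeline(VADInterface& vad, STTInterface& stt) : vad_(vad), stt_(stt) {}

    // Returns a transcript when the frame contains speech, else "".
    std::string process(const std::vector<float>& frame) {
        return vad_.is_speech(frame) ? stt_.transcribe(frame) : std::string();
    }
private:
    VADInterface& vad_;
    STTInterface& stt_;
};

}  // namespace sketch
```

Swapping `EnergyVAD` for another class derived from `VADInterface` requires no change to `Pipeline`, which is the property the paragraph above relies on for non-ONNX backends.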
Models
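All models ship in ONNX format. INT8 quantization is the default for the two large models (Parakeet and Kokoro); Silero VAD stays in float32 and DeepFilterNet3 in FP16, as listed below. Models are hosted on Hugging Face under the aufklarer org and download automatically on first use via ModelManager.kt.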
| Model | Task | Quantization | Size |
|---|---|---|---|
| Parakeet TDT v3 | STT (114 languages, 8192 BPE vocab) | INT8 | ~500 MB |
| Kokoro-82M | TTS | INT8 | ~89 MB |
| Silero VAD v5 | Voice Activity Detection | float32 | ~1.2 MB |
| DeepFilterNet3 | Noise Cancellation | FP16 | ~4.2 MB |
Total model download is approximately 1.2 GB. After the initial download, all inference runs fully offline.
Inference: OnnxEngine
The onnx_engine.h wrapper provides platform-aware execution provider (EP) selection. It probes available EPs at runtime and falls back gracefully:
| Platform | Chipset | Acceleration |
|---|---|---|
| Android | Snapdragon 8 Gen 1+ | NNAPI → Hexagon NPU |
| Android | Samsung Exynos 2200+ | NNAPI → Samsung NPU |
| Android | Google Tensor G2+ | NNAPI → Google TPU |
| Any Android | CPU fallback | XNNPACK |
For automotive Qualcomm SA8295P / SA8255P with QNN (Hexagon DSP), see speech-core/examples/linux.
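The probe-and-fall-back behaviour can be sketched as a pure function over provider names. This is not the code in onnx_engine.h: `select_provider` and `android_preference` are hypothetical helpers, and the preference order (NNAPI, then XNNPACK, then plain CPU) is an assumption derived from the table above. The strings are ONNX Runtime's execution provider names.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns the first entry of `preferred` that also appears in `available`,
// or "CPUExecutionProvider" when none of the preferred providers is present.
std::string select_provider(const std::vector<std::string>& available,
                            const std::vector<std::string>& preferred) {
    const std::string cpu = "CPUExecutionProvider";
    // Assumed precondition: the runtime always reports its CPU provider.
    assert(std::find(available.begin(), available.end(), cpu) != available.end());

    for (const auto& name : preferred) {
        if (std::find(available.begin(), available.end(), name) != available.end())
            return name;
    }
    return cpu;
}

// Assumed Android preference order: NPU via NNAPI first, then XNNPACK.
std::vector<std::string> android_preference() {
    return {"NnapiExecutionProvider", "XnnpackExecutionProvider"};
}
```

In a real build the `available` list would typically come from ONNX Runtime's `Ort::GetAvailableProviders()`. A provider being listed only means it was compiled in, so a production wrapper would probably also need to handle session creation failing on a given device and retry with the next provider.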
Key C++ Files
| File | Purpose |
|---|---|
| jni_bridge.cpp | Constructs speech_core::* model wrappers and hands references to VoicePipeline |
| parakeet_stt.cpp | STT with TDT greedy decoder and per-feature mel normalization |
| kokoro_tts.cpp | TTS with E2E model and attention mask |
| kokoro_phonemizer.cpp | Dictionary-based phonemizer for TTS input |
| silero_vad.cpp | Voice activity detection |
| deepfilter.cpp | Noise cancellation with STFT/ERB processing |
| onnx_engine.h | Platform-aware ONNX Runtime wrapper (NNAPI on Android, QNN on Linux, CPU fallback) |
The model wrappers and onnx_engine.h moved into speech-core in the model-extraction refactor; see docs/models.md for the full inventory.
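The "per-feature mel normalization" listed for parakeet_stt.cpp is usually understood as normalizing each mel bin independently across time. The following is a minimal sketch of that general technique, not the code in parakeet_stt.cpp: the function name, the `[n_mels][n_frames]` layout, the `eps` value, and the use of the sample standard deviation are all assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalizes each mel bin (row) to zero mean and unit variance over time.
// `mel` is laid out as [n_mels][n_frames]; `eps` avoids division by zero
// for bins that are constant across all frames.
void normalize_per_feature(std::vector<std::vector<float>>& mel,
                           float eps = 1e-5f) {
    for (auto& row : mel) {
        // The spectrogram must be rectangular: every bin has the same length.
        assert(row.size() == mel.front().size());
        if (row.empty()) continue;

        const float n = static_cast<float>(row.size());
        float mean = 0.0f;
        for (float v : row) mean += v;
        mean /= n;

        float var = 0.0f;
        for (float v : row) var += (v - mean) * (v - mean);
        // Sample variance (n - 1); divide by n when only one frame exists.
        var /= (row.size() > 1 ? n - 1.0f : n);

        const float inv_std = 1.0f / (std::sqrt(var) + eps);
        for (float& v : row) v = (v - mean) * inv_std;
    }
}
```

Normalizing per bin rather than over the whole spectrogram keeps loud low-frequency bins from dominating the scale of quieter high-frequency ones. Whether the model expects the sample or the population standard deviation depends on how it was trained, so that detail should be checked against the model's preprocessing configuration.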
Source Structure
speech-android/
  speech-core/             C++ engine + ONNX model wrappers (git submodule)
  sdk/src/main/
    cpp/jni_bridge.cpp     Thin JNI bridge over speech_core::VoicePipeline
    cpp/CMakeLists.txt     Pulls speech-core via add_subdirectory(SPEECH_CORE_WITH_ONNX=ON)
    kotlin/.../speech/     Kotlin public SDK (SpeechPipeline, ModelManager)
  sdk/src/androidTest/     Instrumented e2e tests
  app/                     Demo application
Linux / automotive (C ABI, ALSA demo, CLI tools) lives at:
speech-core/examples/linux/
Source code: github.com/soniqo/speech-android