Architecture — Android & Linux
speech-android provides on-device speech processing for Android and embedded Linux through a shared C++ core with platform-specific frontends. All inference runs locally using ONNX Runtime with hardware-accelerated execution providers.
Cross-Platform Stack
Android and Linux share the speech-core C++ submodule, which orchestrates the full speech pipeline. Each platform provides a thin frontend that delegates to speech-core through vtable-based interfaces:
┌──────────────────────────────────────────────┐
│  Android:  SpeechPipeline (Kotlin/JNI)       │
│  Linux:    speech.h (C API)                  │
└──────────────────┬───────────────────────────┘
                   │
┌──────────────────┴───────────────────────────┐
│  speech-core (C++ submodule)                 │
│  Turn detection · Interruptions · Context    │
└──┬────────┬────────┬────────┬────────────────┘
   │        │        │        │ vtables
┌──┴──┐  ┌──┴──┐  ┌──┴──┐   ┌─┴────────┐
│ VAD │  │ STT │  │ TTS │   │ Enhancer │
└──┬──┘  └──┬──┘  └──┬──┘   └─┬────────┘
   └────────┴────────┴────────┘
   ONNX Runtime (CPU / NNAPI / QNN)
Platform Paths
Android
The Kotlin SDK (SpeechPipeline.kt) provides the public API. It calls through JNI into jni_bridge.cpp, which registers vtable callbacks with speech-core. ONNX Runtime runs with the NNAPI execution provider for hardware acceleration on Qualcomm, Samsung, and Google chipsets.
Kotlin SDK → JNI bridge → speech-core → ONNX Runtime (NNAPI)
Linux
The C API (speech.h) exposes the same pipeline for embedded Linux targets (automotive, Yocto). On Qualcomm automotive platforms (SA8295P, SA8255P), ONNX Runtime uses the QNN execution provider for Hexagon DSP acceleration.
C API → speech-core → ONNX Runtime (QNN)
Pipeline
The speech pipeline runs three stages in sequence: VAD → STT → TTS. Voice Activity Detection (VAD) detects speech and triggers recording, STT transcribes the captured audio, and TTS synthesizes the spoken response. Barge-in support lets the user interrupt TTS playback by speaking mid-response.
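The turn flow described above, including barge-in, can be sketched as a small state machine. This is only an illustration: the state names, event names, and the `step` function are hypothetical and are not taken from speech-core, whose actual turn logic is not shown here.

```cpp
// Hypothetical turn states; speech-core's real state names may differ.
enum class TurnState { Idle, Listening, Transcribing, Speaking };

// Hypothetical events a VAD/STT/TTS pipeline might emit.
enum class Event { SpeechStart, SpeechEnd, TranscriptReady, PlaybackDone };

struct TurnResult {
    TurnState next;   // state after handling the event
    bool cancel_tts;  // true only when the transition is a barge-in
};

// Pure transition function: no audio I/O, only the turn logic.
inline TurnResult step(TurnState state, Event event) {
    switch (state) {
    case TurnState::Idle:
        if (event == Event::SpeechStart) return {TurnState::Listening, false};
        break;
    case TurnState::Listening:
        if (event == Event::SpeechEnd) return {TurnState::Transcribing, false};
        break;
    case TurnState::Transcribing:
        if (event == Event::TranscriptReady) return {TurnState::Speaking, false};
        break;
    case TurnState::Speaking:
        // Barge-in: the user starts talking while TTS is still playing.
        if (event == Event::SpeechStart) return {TurnState::Listening, true};
        if (event == Event::PlaybackDone) return {TurnState::Idle, false};
        break;
    }
    return {state, false};  // events that do not apply in this state are ignored
}
```

Keeping the transition function free of side effects makes the barge-in rule easy to unit test: the caller inspects `cancel_tts` and stops playback itself.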
speech-core manages turn detection, interruption handling, and conversation context. The model implementations (VAD, STT, TTS, Enhancer) are plugged in through C vtable interfaces, making the core pipeline logic platform-agnostic.
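To make the vtable idea concrete, here is a minimal sketch of how a model backend might be plugged in through a struct of function pointers. The names `VadVtable`, `VadHandle`, `EnergyVad`, and `core_is_speech` are hypothetical, and the energy threshold stands in for a real model; the actual speech-core interface is not reproduced here.

```cpp
#include <cstddef>

// Hypothetical vtable for a VAD backend: a plain struct of function pointers,
// so the layout stays C-compatible.
struct VadVtable {
    // Returns a speech probability in [0, 1] for one frame of PCM samples.
    float (*process)(void* self, const float* samples, std::size_t count);
    void (*reset)(void* self);
};

// What the core holds: a vtable plus an opaque pointer to the backend state.
struct VadHandle {
    const VadVtable* vtable;
    void* self;
};

// Toy energy-threshold backend standing in for an ONNX-backed model.
struct EnergyVad {
    float threshold;
    int frames_seen;
};

static float energy_process(void* self, const float* samples, std::size_t count) {
    auto* vad = static_cast<EnergyVad*>(self);
    vad->frames_seen++;
    if (count == 0) return 0.0f;
    float energy = 0.0f;
    for (std::size_t i = 0; i < count; ++i) energy += samples[i] * samples[i];
    energy /= static_cast<float>(count);  // mean squared amplitude
    return energy > vad->threshold ? 1.0f : 0.0f;
}

static void energy_reset(void* self) {
    static_cast<EnergyVad*>(self)->frames_seen = 0;
}

static const VadVtable kEnergyVadVtable = {energy_process, energy_reset};

// Core-side code sees only the handle, never the concrete backend type.
inline bool core_is_speech(const VadHandle& handle, const float* samples, std::size_t count) {
    return handle.vtable->process(handle.self, samples, count) >= 0.5f;
}
```

Because the core calls only through the handle, the same pipeline logic can be driven by a JNI-registered backend on Android or a native one on Linux.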
Models
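All models are in ONNX format. INT8 quantization is the default for the larger models (STT and TTS); Silero VAD ships as float32 and DeepFilterNet3 as FP16. Models are hosted on HuggingFace under the aufklarer org and are downloaded automatically on first use by ModelManager.kt.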
| Model | Task | Quantization | Size |
|---|---|---|---|
| Parakeet TDT v3 | STT (114 languages, 8192 BPE vocab) | INT8 | ~500 MB |
| Kokoro-82M | TTS | INT8 | ~89 MB |
| Silero VAD v5 | Voice Activity Detection | float32 | ~1.2 MB |
| DeepFilterNet3 | Noise Cancellation | FP16 | ~4.2 MB |
Total model download is approximately 1.2 GB. After the initial download, all inference runs fully offline.
Inference: OnnxEngine
The onnx_engine.h wrapper selects an ONNX Runtime execution provider (EP) per platform. It probes the available EPs at runtime and falls back to the CPU when no accelerator is available:
| Platform | Chipset | Acceleration |
|---|---|---|
| Android | Snapdragon 8 Gen 1+ | NNAPI → Hexagon NPU |
| Android | Samsung Exynos 2200+ | NNAPI → Samsung NPU |
| Android | Google Tensor G2+ | NNAPI → Google TPU |
| Automotive | SA8295P / SA8255P | QNN → Hexagon DSP |
| Any | CPU fallback | XNNPACK |
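The probe-and-fall-back behavior in the table can be sketched as a preference list matched against whatever providers the runtime reports. This is not the code in onnx_engine.h: the function names and preference orders below are hypothetical. The provider strings follow ONNX Runtime's usual naming, and in a real build the `available` list would typically come from the runtime (for example `Ort::GetAvailableProviders()`); here it is passed in so the sketch has no dependencies.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns the first preferred provider that is also available,
// or the plain CPU provider when none of the preferred ones is.
inline std::string select_provider(const std::vector<std::string>& preferred,
                                   const std::vector<std::string>& available) {
    for (const auto& name : preferred) {
        if (std::find(available.begin(), available.end(), name) != available.end()) {
            return name;
        }
    }
    return "CPUExecutionProvider";
}

// Hypothetical preference orders mirroring the table: accelerator first, XNNPACK next.
inline std::vector<std::string> android_preference() {
    return {"NnapiExecutionProvider", "XnnpackExecutionProvider"};
}

inline std::vector<std::string> automotive_preference() {
    return {"QNNExecutionProvider", "XnnpackExecutionProvider"};
}
```

A provider being listed as available does not guarantee that a given model runs on it, so real code would likely also need to handle session creation failing and retry with the next candidate.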
Key C++ Files
| File | Purpose |
|---|---|
| jni_bridge.cpp | Wires ONNX model implementations to speech-core C API via vtables |
| parakeet_stt.cpp | STT with TDT greedy decoder and per-feature mel normalization |
| kokoro_tts.cpp | TTS with E2E model and attention mask |
| kokoro_phonemizer.cpp | Dictionary-based phonemizer for TTS input |
| silero_vad.cpp | Voice activity detection |
| deepfilter.cpp | Noise cancellation with STFT/ERB processing |
| onnx_engine.h | Platform-aware ONNX Runtime wrapper (NNAPI on Android, QNN on Linux, CPU fallback) |
| linux/src/speech.cpp | Linux C API implementation |
| linux/include/speech.h | Linux public C header |
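As an illustration of the per-feature mel normalization mentioned for parakeet_stt.cpp, the sketch below normalizes each mel bin to zero mean and unit variance across time. It is not the project's code: the function name, the row-major `[n_mels][n_frames]` layout, the unbiased variance estimate, and the epsilon value are all assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Normalizes each mel bin (row) independently over the time axis.
// Assumed layout: mel[m * n_frames + t], i.e. row-major [n_mels][n_frames].
inline void normalize_per_feature(std::vector<float>& mel,
                                  std::size_t n_mels,
                                  std::size_t n_frames,
                                  float eps = 1e-5f) {
    if (n_frames == 0 || mel.size() != n_mels * n_frames) return;  // nothing sensible to do
    for (std::size_t m = 0; m < n_mels; ++m) {
        float* row = mel.data() + m * n_frames;

        double sum = 0.0;
        for (std::size_t t = 0; t < n_frames; ++t) sum += row[t];
        const double mean = sum / static_cast<double>(n_frames);

        double sq = 0.0;
        for (std::size_t t = 0; t < n_frames; ++t) {
            const double d = row[t] - mean;
            sq += d * d;
        }
        // Unbiased estimate when there is more than one frame (an assumption).
        const double denom = n_frames > 1 ? static_cast<double>(n_frames - 1) : 1.0;
        const double stddev = std::sqrt(sq / denom);

        // eps keeps constant rows from dividing by zero.
        for (std::size_t t = 0; t < n_frames; ++t) {
            row[t] = static_cast<float>((row[t] - mean) / (stddev + eps));
        }
    }
}
```

Normalizing per bin rather than over the whole spectrogram keeps quiet and loud frequency bands on a comparable scale before the features reach the encoder.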
Source Structure
speech-android/
  speech-core/              C++ submodule (pipeline orchestration)
  sdk/src/main/
    cpp/                    ONNX Runtime model implementations, JNI bridge, audio DSP
    kotlin/.../speech/      Kotlin public SDK (SpeechPipeline, ModelManager)
  sdk/src/androidTest/      Instrumented e2e tests (23 tests, 5 suites)
  linux/
    include/speech.h        Public C header
    src/speech.cpp          Linux C API implementation
    tests/                  Linux test suite (11 tests)
  app/                      Demo application
Source code: github.com/soniqo/speech-android