Architecture — Android & Linux

speech-android provides on-device speech processing for Android and embedded Linux through a shared C++ core with platform-specific frontends. All inference runs locally using ONNX Runtime with hardware-accelerated execution providers.

Cross-Platform Stack

Android and Linux share the speech-core C++ submodule, which orchestrates the full speech pipeline. Each platform provides a thin frontend that delegates to speech-core through vtable-based interfaces:

┌──────────────────────────────────────────────┐
│   Android: SpeechPipeline (Kotlin/JNI)       │
│   Linux:   speech.h (C API)                  │
└──────────────────┬───────────────────────────┘
                   │
┌──────────────────┴───────────────────────────┐
│          speech-core (C++ submodule)         │
│   Turn detection · Interruptions · Context   │
└──┬────────┬────────┬────────┬────────────────┘
   │        │        │        │  vtables
┌──┴──┐  ┌──┴──┐  ┌──┴──┐  ┌──┴───────┐
│ VAD │  │ STT │  │ TTS │  │ Enhancer │
└──┬──┘  └──┬──┘  └──┬──┘  └──┬───────┘
   └────────┴────────┴────────┘
       ONNX Runtime (CPU / NNAPI / QNN)

Platform Paths

Android

The Kotlin SDK (SpeechPipeline.kt) provides the public API. It calls through JNI into jni_bridge.cpp, which registers vtable callbacks with speech-core. ONNX Runtime runs with the NNAPI execution provider for hardware acceleration on Qualcomm, Samsung, and Google chipsets.

Kotlin SDK → JNI bridge → speech-core → ONNX Runtime (NNAPI)

Linux

The C API (speech.h) exposes the same pipeline for embedded Linux targets (automotive, Yocto). On Qualcomm automotive platforms (SA8295P, SA8255P), ONNX Runtime uses the QNN execution provider for Hexagon DSP acceleration.

C API → speech-core → ONNX Runtime (QNN)

Pipeline

The speech pipeline runs three stages in sequence: VAD → STT → TTS. Voice Activity Detection (VAD) detects the start of speech and triggers recording, STT transcribes the captured audio, and TTS synthesizes the spoken response. Barge-in support interrupts TTS playback when the user starts speaking mid-response.
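The stage ordering and barge-in behaviour can be pictured as a small state machine. The sketch below is only an illustration: the state names, event names, and the `next_state` function are hypothetical and are not taken from speech-core, and it assumes the transcript leads directly to playback.

```cpp
// Hypothetical pipeline states; names are illustrative, not speech-core's.
enum class State { Idle, Listening, Transcribing, Speaking };

// Hypothetical events that drive the pipeline.
enum class Event { SpeechStart, SpeechEnd, TranscriptReady, PlaybackDone };

// Returns the next state for a (state, event) pair.
// Pairs that are not handled leave the state unchanged.
State next_state(State s, Event e) {
    switch (s) {
        case State::Idle:
            if (e == Event::SpeechStart) return State::Listening;
            break;
        case State::Listening:
            if (e == Event::SpeechEnd) return State::Transcribing;
            break;
        case State::Transcribing:
            if (e == Event::TranscriptReady) return State::Speaking;
            break;
        case State::Speaking:
            // Barge-in: the user speaks while TTS is playing, so playback
            // is abandoned and the pipeline goes back to listening.
            if (e == Event::SpeechStart) return State::Listening;
            if (e == Event::PlaybackDone) return State::Idle;
            break;
    }
    return s;
}
```

Keeping the transitions in one pure function makes the barge-in rule easy to test in isolation from audio I/O.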

speech-core manages turn detection, interruption handling, and conversation context. The model implementations (VAD, STT, TTS, Enhancer) are plugged in through C vtable interfaces, making the core pipeline logic platform-agnostic.
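To make the vtable idea concrete, here is a minimal sketch of a C-style interface for a VAD backend. Everything in it is an assumption for illustration: `VadVTable`, `is_speech`, and the toy `EnergyVad` backend are hypothetical and do not reflect the actual speech-core headers.

```cpp
#include <cstddef>

// Hypothetical C-style vtable for a VAD backend. Field names and
// signatures are illustrative assumptions, not the speech-core API.
struct VadVTable {
    void* ctx;  // opaque backend state, owned by the platform frontend
    // Returns a speech score in [0, 1] for one frame of PCM samples.
    float (*process)(void* ctx, const float* samples, std::size_t count);
};

// Core-side logic: depends only on the vtable, never on a concrete model.
bool is_speech(const VadVTable& vad, const float* samples, std::size_t count,
               float threshold = 0.5f) {
    return vad.process(vad.ctx, samples, count) >= threshold;
}

// Toy backend: mean absolute amplitude, scaled by a gain and clamped to 1.
struct EnergyVad { float gain; };

float energy_vad_process(void* ctx, const float* samples, std::size_t count) {
    const auto* self = static_cast<const EnergyVad*>(ctx);
    if (count == 0) return 0.0f;
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        sum += samples[i] < 0.0f ? -samples[i] : samples[i];
    }
    const float score = self->gain * (sum / static_cast<float>(count));
    return score > 1.0f ? 1.0f : score;
}
```

A frontend would fill in the struct once, for example `VadVTable vt{&backend, &energy_vad_process};`, and hand it to the core. Swapping in an ONNX-backed model then means changing only the function pointer and context, not the pipeline logic.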

Models

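All models ship in ONNX format. INT8 quantization is the default; Silero VAD and DeepFilterNet3 are the exceptions and ship as float32 and FP16 respectively. Models are hosted on HuggingFace under the aufklarer org and download automatically on first use via ModelManager.kt.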

Model            Task                                  Quantization  Size
---------------  ------------------------------------  ------------  -------
Parakeet TDT v3  STT (114 languages, 8192 BPE vocab)   INT8          ~500 MB
Kokoro-82M       TTS                                   INT8          ~89 MB
Silero VAD v5    Voice Activity Detection              float32       ~1.2 MB
DeepFilterNet3   Noise Cancellation                    FP16          ~4.2 MB

Total model download is approximately 1.2 GB. After the initial download, all inference runs fully offline.

Inference: OnnxEngine

The onnx_engine.h wrapper selects an execution provider (EP) per platform. It probes the available EPs at runtime, prefers the hardware-accelerated provider for the platform, and falls back to CPU execution when that provider is unavailable:

Platform    Chipset               Acceleration
----------  --------------------  -------------------
Android     Snapdragon 8 Gen 1+   NNAPI → Hexagon NPU
Android     Samsung Exynos 2200+  NNAPI → Samsung NPU
Android     Google Tensor G2+     NNAPI → Google TPU
Automotive  SA8295P / SA8255P     QNN → Hexagon DSP
Any         CPU fallback          XNNPACK
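The probe-and-fall-back behaviour could look roughly like the sketch below. It is not the contents of onnx_engine.h: `Platform` and `select_provider` are hypothetical names, and the preference order is an assumption read off the table above. The provider strings are the names ONNX Runtime uses for these EPs. In a real build the `available` list would presumably come from a runtime query such as `Ort::GetAvailableProviders()`; here it is passed in so the logic stays self-contained.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Target platform, modelled as a hypothetical setting for this sketch.
enum class Platform { Android, Linux };

// Picks the first preferred execution provider that is actually available.
// `available` stands in for the list a runtime probe would return.
std::string select_provider(Platform platform,
                            const std::vector<std::string>& available) {
    std::vector<std::string> preferred;
    if (platform == Platform::Android) {
        preferred = {"NnapiExecutionProvider", "XnnpackExecutionProvider"};
    } else {
        preferred = {"QNNExecutionProvider", "XnnpackExecutionProvider"};
    }
    for (const auto& name : preferred) {
        if (std::find(available.begin(), available.end(), name) != available.end()) {
            return name;
        }
    }
    // Last resort: ONNX Runtime's default CPU provider.
    return "CPUExecutionProvider";
}
```

Ordering the candidates in a list keeps the fallback chain explicit, so adding another accelerator is a one-line change.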

Key C++ Files

File                    Purpose
----------------------  ---------------------------------------------------------------------------------
jni_bridge.cpp          Wires ONNX model implementations to speech-core C API via vtables
parakeet_stt.cpp        STT with TDT greedy decoder and per-feature mel normalization
kokoro_tts.cpp          TTS with E2E model and attention mask
kokoro_phonemizer.cpp   Dictionary-based phonemizer for TTS input
silero_vad.cpp          Voice activity detection
deepfilter.cpp          Noise cancellation with STFT/ERB processing
onnx_engine.h           Platform-aware ONNX Runtime wrapper (NNAPI on Android, QNN on Linux, CPU fallback)
linux/src/speech.cpp    Linux C API implementation
linux/include/speech.h  Linux public C header
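As an illustration of the per-feature mel normalization mentioned for parakeet_stt.cpp, the sketch below normalizes each mel bin to zero mean and unit variance across time. It is a generic version of the technique, not the code in that file: the function name, the [n_mels][n_frames] layout, the use of sample variance, and the epsilon value are all assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Normalizes each mel bin (row) to zero mean and unit variance across time.
// `mel` is laid out as [n_mels][n_frames]; `eps` guards against division
// by zero for constant rows.
void normalize_per_feature(std::vector<std::vector<float>>& mel,
                           float eps = 1e-5f) {
    for (auto& row : mel) {
        const std::size_t n = row.size();
        if (n == 0) continue;

        float mean = 0.0f;
        for (float v : row) mean += v;
        mean /= static_cast<float>(n);

        float var = 0.0f;
        for (float v : row) var += (v - mean) * (v - mean);
        // Sample variance (n - 1) when possible; an assumption of this sketch.
        var /= static_cast<float>(n > 1 ? n - 1 : 1);

        const float inv_std = 1.0f / (std::sqrt(var) + eps);
        for (float& v : row) v = (v - mean) * inv_std;
    }
}
```

Normalizing per bin rather than over the whole spectrogram removes channel- and gain-dependent offsets in each frequency band before the features reach the encoder.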

Source Structure

speech-android/
  speech-core/              C++ submodule (pipeline orchestration)
  sdk/src/main/
    cpp/                    ONNX Runtime model implementations, JNI bridge, audio DSP
    kotlin/.../speech/      Kotlin public SDK (SpeechPipeline, ModelManager)
  sdk/src/androidTest/      Instrumented e2e tests (23 tests, 5 suites)
  linux/
    include/speech.h        Public C header
    src/speech.cpp          Linux C API implementation
    tests/                  Linux test suite (11 tests)
  app/                      Demo application

Source code: github.com/soniqo/speech-android