Speech Core

Open-source C++17 speech engine for voice agents — voice activity detection, batch and real-time streaming speech-to-text, speaker diarization, and text-to-speech, all running on-device on Linux, Windows, and Android. Apache 2.0.


What it is

Speech Core is a small orchestration core (state machine, turn detection, interruption handling, and audio utilities, with zero ML dependencies) plus a set of abstract interfaces for speech models. Inference runs on-device: audio never leaves the machine, and there is no Python at inference time. Model inference is opt-in through two optional backends that you can enable independently, or you can bring your own implementations of the interfaces.
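To make the bring-your-own-implementation idea concrete, here is a minimal sketch of what plugging a custom model behind an abstract interface could look like. The names `VadInterface`, `EnergyVad`, and `speech_probability` are invented for this example and are not taken from Speech Core's headers; check the repository for the real interface definitions.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical interface, for illustration only; not Speech Core's real API.
class VadInterface {
public:
    virtual ~VadInterface() = default;
    // Returns a speech score in [0, 1] for one chunk of mono float samples.
    virtual float speech_probability(const float* samples, std::size_t n) = 0;
};

// A toy implementation that maps RMS energy to a score, with no ML runtime.
class EnergyVad final : public VadInterface {
public:
    explicit EnergyVad(float full_scale_rms = 0.1f)
        : full_scale_rms_(full_scale_rms) {}

    float speech_probability(const float* samples, std::size_t n) override {
        if (samples == nullptr || n == 0) return 0.0f;
        double sum_sq = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            sum_sq += static_cast<double>(samples[i]) * samples[i];
        }
        const float rms = static_cast<float>(std::sqrt(sum_sq / static_cast<double>(n)));
        const float score = rms / full_scale_rms_;
        return score > 1.0f ? 1.0f : score;  // clamp to the [0, 1] range
    }

private:
    float full_scale_rms_;  // RMS level that maps to a score of 1.0
};
```

Code that depends only on the interface can then swap the toy detector for a model-backed one without changes, which is the property the orchestration core relies on.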

Voice-agent orchestration — VoicePipeline composes VAD, streaming STT, an LLM, and TTS into a full-duplex agent loop with barge-in, turn detection, and a tool-call loop. See docs/pipeline.md and the voice agents overview.
Speaker diarization in pure C++ — DiarizationPipeline composes a segmenter and an embedder into speaker-labelled segments, with no ML-runtime dependency of its own.
Powers the rest of the stack — speech-android is a Kotlin SDK + JNI bridge over Speech Core, and Speech Studio uses its LiteRT VoxCPM2 engine on Windows and Linux. On Apple platforms, the sibling library is speech-swift.
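The turn detection and barge-in behaviour mentioned above can be pictured as a small state machine driven by per-frame VAD decisions. The sketch below is an illustration under assumptions, not VoicePipeline's actual implementation: the state names, the `TurnDetector` class, and the silence-frame threshold are all hypothetical.

```cpp
// Hypothetical agent states; names are illustrative, not Speech Core's own.
enum class AgentState { Idle, Listening, Responding };

// What the caller should do after feeding one VAD frame.
enum class Action { None, StartCapture, EndOfTurn, BargeIn };

class TurnDetector {
public:
    // silence_frames_to_end: consecutive non-speech frames that close a user turn.
    explicit TurnDetector(int silence_frames_to_end)
        : silence_needed_(silence_frames_to_end) {}

    AgentState state() const { return state_; }

    // Feed one VAD decision per audio frame.
    Action on_frame(bool is_speech) {
        switch (state_) {
        case AgentState::Idle:
            if (is_speech) {
                state_ = AgentState::Listening;
                silence_run_ = 0;
                return Action::StartCapture;
            }
            return Action::None;
        case AgentState::Listening:
            if (is_speech) {
                silence_run_ = 0;
                return Action::None;
            }
            if (++silence_run_ >= silence_needed_) {
                state_ = AgentState::Responding;  // hand the utterance to STT/LLM/TTS
                return Action::EndOfTurn;
            }
            return Action::None;
        case AgentState::Responding:
            if (is_speech) {                      // user talks over the agent
                state_ = AgentState::Listening;
                silence_run_ = 0;
                return Action::BargeIn;           // caller should stop TTS playback
            }
            return Action::None;
        }
        return Action::None;
    }

    // Call when the agent finishes speaking without being interrupted.
    void on_response_done() {
        if (state_ == AgentState::Responding) state_ = AgentState::Idle;
    }

private:
    AgentState state_ = AgentState::Idle;
    int silence_needed_;
    int silence_run_ = 0;
};
```

In a full-duplex setup the VAD input would normally be taken after echo cancellation, so that the agent's own TTS output does not trigger a false barge-in.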

Platforms & backends

Backend	Platforms	Hardware acceleration
ONNX Runtime (`SPEECH_CORE_WITH_ONNX`)	Linux, macOS, Windows, Android	NNAPI on Android, QNN on Qualcomm Linux, optional NVIDIA CUDA / TensorRT (`-DSPEECH_CORE_WITH_CUDA=ON`)
LiteRT (`SPEECH_CORE_WITH_LITERT`)	Linux x86_64, Windows x86_64, Android, macOS arm64	CPU today

Enable either backend, both, or neither — the orchestration core builds with no ML runtime at all.
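Application code that has to compile against any of these configurations could guard backend-specific paths at compile time. The sketch below assumes the CMake option names are also passed through as preprocessor definitions of the same name; that is an assumption for illustration, so verify it against the project's CMake files before relying on it. The `enabled_backends` helper is hypothetical.

```cpp
#include <string>

// Assumption: the build defines SPEECH_CORE_WITH_ONNX / SPEECH_CORE_WITH_LITERT
// as preprocessor macros when the matching CMake option is ON.
inline std::string enabled_backends() {
    std::string out;
#if defined(SPEECH_CORE_WITH_ONNX)
    out += "onnx ";
#endif
#if defined(SPEECH_CORE_WITH_LITERT)
    out += "litert ";
#endif
    if (out.empty()) return "none (orchestration only)";
    out.pop_back();  // drop the trailing space
    return out;
}
```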

Supported models

Model	Task	ONNX	LiteRT
Silero VAD v5	Voice activity detection	✓	✓
Parakeet TDT v3 (0.6B)	Speech-to-text (25 European languages)	✓	✓
Nemotron Speech Streaming (0.6B)	Streaming speech-to-text (English)	✓	✓
Nemotron-3.5 ASR Streaming Multilingual (0.6B)	Streaming speech-to-text (multilingual, prompt-conditioned)	✓	✓
Whisper Small v3	Speech-to-text (multilingual Whisper v3)	✓	—
Whisper Medium v3	Speech-to-text (multilingual Whisper v3)	✓	—
Whisper Large v3	Speech-to-text (multilingual Whisper v3)	✓	—
Whisper Large-v3 Turbo	Speech-to-text (multilingual Whisper v3)	✓	—
Omnilingual ASR CTC (300M)	Speech-to-text (multilingual)	—	✓
FunctionGemma 270M	On-device LLM — structured function / tool calls (LiteRT-LM; CoreML on Apple)	—	✓
Pyannote Segmentation 3.0	Diarization (segmentation)	—	✓
WeSpeaker ResNet34-LM	Speaker embedding	—	✓
VoxCPM2 (2B)	Text-to-speech (48 kHz, voice cloning)	—	✓
Kokoro 82M	Text-to-speech	✓	—
Pocket TTS 100M	Streaming text-to-speech (English, fixed Alba voice)	✓	—
DeepFilterNet3	Speech enhancement	✓	—
PersonaPlex 7B	Full-duplex speech-to-speech (CUDA)	✓	—

Quick start

Build the core plus the LiteRT backend (the runtime library is extracted from the ai-edge-litert wheel — no TensorFlow build):

git clone https://github.com/soniqo/speech-core && cd speech-core
scripts/fetch_litert.sh build/litert
cmake -B build -DCMAKE_BUILD_TYPE=Release \
    -DSPEECH_CORE_WITH_LITERT=ON -DLITERT_DIR=$PWD/build/litert
cmake --build build

Then link the targets you need:

target_link_libraries(my_app PRIVATE speech_core)                            # orchestration only
target_link_libraries(my_app PRIVATE speech_core speech_core_models)         # + ONNX models
target_link_libraries(my_app PRIVATE speech_core speech_core_models_litert)  # + LiteRT models

Transcribing an audio buffer is a few lines:

#include <speech_core/models/litert_parakeet_stt.h>

speech_core::LiteRTParakeetStt stt(
    "parakeet-encoder.tflite", "parakeet-decoder-joint.tflite", "vocab.json");

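// `audio` and `n_samples` describe a caller-provided sample buffer; 16000 is its sample rate in Hz.
auto r = stt.transcribe(audio, n_samples, 16000);   // r.text / r.language / r.confidence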
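Captured audio often arrives as interleaved 16-bit PCM rather than a ready-made sample buffer. The helper below is a sketch of one way to prepare it, assuming `transcribe` accepts mono float samples in the [-1, 1] range; the snippet above does not state the sample type, so treat that as an assumption. `pcm16_to_mono_float` is a hypothetical name, and the helper does not resample, so the input must already be at 16 kHz.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts interleaved 16-bit PCM to mono float samples in [-1, 1).
// Channels are averaged; no resampling is performed.
inline std::vector<float> pcm16_to_mono_float(const std::int16_t* pcm,
                                              std::size_t n_frames,
                                              int channels) {
    std::vector<float> out;
    if (pcm == nullptr || channels <= 0) return out;
    const std::size_t ch = static_cast<std::size_t>(channels);
    out.reserve(n_frames);
    for (std::size_t f = 0; f < n_frames; ++f) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < ch; ++c) {
            acc += static_cast<float>(pcm[f * ch + c]) / 32768.0f;
        }
        out.push_back(acc / static_cast<float>(channels));
    }
    return out;
}
```

Under the same assumption, the result could be passed along as `stt.transcribe(mono.data(), mono.size(), 16000)`.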

Embedded & automotive Linux

A reference Linux build — libspeech.so with a small C ABI, an ALSA demo CLI, and transcribe/synthesize/phonemize tools — lives at examples/linux. It targets embedded ARM64 (Yocto, Qualcomm SA8295P / SA8255P) and any Linux dev box. Setup steps are in the Linux getting-started guide.

Building for Android or Apple?

On Android, use speech-android — a Kotlin SDK that packages Speech Core behind a JNI bridge (implementation("audio.soniqo:speech:0.0.9")). On macOS and iOS, use speech-swift, which runs the models on CoreML, MLX, and the Apple Neural Engine.

Documentation

docs/ — full in-repo documentation
docs/pipeline.md — the VoicePipeline state machine, AEC integration, and tool-call loop
docs/models.md — full model inventory
Soniqo ONNX collection — Whisper, Kokoro, DeepFilterNet3, PersonaPlex, and other ONNX bundles
huggingface.co/soniqo — converted model weights (ONNX, LiteRT)

Feedback

Open an issue at github.com/soniqo/speech-core/issues, or join the Discord.