Speech Core
Open-source C++17 speech engine for voice agents — voice activity detection, batch and real-time streaming speech-to-text, speaker diarization, and text-to-speech, all running on-device on Linux, Windows, and Android. Apache 2.0.
What it is
Speech Core is a small orchestration core — state machine, turn detection, interruption handling, audio utilities, with zero ML dependencies — plus a set of abstract interfaces for speech models. Inference runs locally on CPU; audio never leaves the machine, and there is no Python at inference time. Model inference is opt-in through two interchangeable backends you can enable independently, or you can bring your own implementations of the interfaces.
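To make "bring your own implementations" concrete, here is a minimal sketch of implementing a model interface with no ML runtime behind it. The names `VadInterface`, `EnergyVad`, and `is_speech` are hypothetical, invented for this illustration; the real interface names and signatures are defined in the Speech Core headers and may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical interface for illustration only; not Speech Core's actual API.
class VadInterface {
public:
    virtual ~VadInterface() = default;
    // Returns true if the frame of mono float samples is judged to contain speech.
    virtual bool is_speech(const float* samples, std::size_t n) = 0;
};

// A model-free implementation: thresholds the RMS energy of each frame.
class EnergyVad final : public VadInterface {
public:
    explicit EnergyVad(float rms_threshold) : threshold_(rms_threshold) {
        assert(rms_threshold >= 0.0f);
    }

    bool is_speech(const float* samples, std::size_t n) override {
        if (n == 0) return false;  // an empty frame carries no speech
        double sum_sq = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            sum_sq += static_cast<double>(samples[i]) * samples[i];
        }
        const double rms = std::sqrt(sum_sq / static_cast<double>(n));
        return rms > threshold_;
    }

private:
    float threshold_;
};
```

Code that holds only a `VadInterface&` cannot tell whether a neural model or a simple energy gate sits behind it, which is the property that lets an orchestration layer build without any ML dependency.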
- Voice-agent orchestration: `VoicePipeline` composes VAD, streaming STT, an LLM, and TTS into a full-duplex agent loop with barge-in, turn detection, and a tool-call loop. See docs/pipeline.md and the voice agents overview.
- Speaker diarization in pure C++: `DiarizationPipeline` composes a segmenter and an embedder into speaker-labelled segments, with no ML-runtime dependency of its own.
- Powers the rest of the stack: speech-android is a Kotlin SDK + JNI bridge over Speech Core, and Speech Studio uses its LiteRT VoxCPM2 engine on Windows and Linux. On Apple platforms, the sibling library is speech-swift.
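The agent loop described in the first bullet can be pictured as a small state machine in which user speech during agent output triggers a barge-in. The sketch below is an illustration under assumptions, not the actual `VoicePipeline` design: the state and event names, `Transition`, `step`, and `run` are all hypothetical, and turn detection and the tool-call loop are left out.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical states and events; the real VoicePipeline may model these differently.
enum class AgentState { Idle, Listening, Thinking, Speaking };
enum class AgentEvent { SpeechStarted, TurnEnded, ResponseReady, PlaybackFinished };

struct Transition {
    AgentState next;     // state after handling the event
    bool cancel_output;  // true when pending LLM/TTS output should be dropped
};

// Pure transition function: maps (state, event) to the next state.
constexpr Transition step(AgentState s, AgentEvent e) {
    switch (e) {
    case AgentEvent::SpeechStarted:
        // Speech while the agent is thinking or speaking is a barge-in.
        return {AgentState::Listening,
                s == AgentState::Thinking || s == AgentState::Speaking};
    case AgentEvent::TurnEnded:
        if (s == AgentState::Listening) return {AgentState::Thinking, false};
        break;
    case AgentEvent::ResponseReady:
        if (s == AgentState::Thinking) return {AgentState::Speaking, false};
        break;
    case AgentEvent::PlaybackFinished:
        if (s == AgentState::Speaking) return {AgentState::Idle, false};
        break;
    }
    return {s, false};  // events that do not apply in this state are ignored
}

// Applies a sequence of events in order and returns the final state.
inline AgentState run(AgentState s, const AgentEvent* events, std::size_t n) {
    assert(events != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) s = step(s, events[i]).next;
    return s;
}
```

Keeping the transition function pure makes interruption handling easy to unit-test without audio hardware or models: feed in events and check the resulting state and the cancel flag.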
Platforms & backends
| Backend | Platforms | Hardware acceleration |
|---|---|---|
| ONNX Runtime (SPEECH_CORE_WITH_ONNX) | Linux, macOS, Windows, Android | NNAPI on Android, QNN on Qualcomm Linux, optional NVIDIA CUDA / TensorRT (-DSPEECH_CORE_WITH_CUDA=ON) |
| LiteRT (SPEECH_CORE_WITH_LITERT) | Linux x86_64, Windows x86_64, Android, macOS arm64 | CPU today |
Enable either backend, both, or neither — the orchestration core builds with no ML runtime at all.
Supported models
| Model | Task | ONNX | LiteRT |
|---|---|---|---|
| Silero VAD v5 | Voice activity detection | ✓ | ✓ |
| Parakeet TDT v3 (0.6B) | Speech-to-text (25 European languages) | ✓ | ✓ |
| Nemotron Speech Streaming (0.6B) | Streaming speech-to-text (English) | ✓ | ✓ |
| Nemotron-3.5 ASR Streaming Multilingual (0.6B) | Streaming speech-to-text (multilingual, prompt-conditioned) | ✓ | ✓ |
| Omnilingual ASR CTC (300M) | Speech-to-text (multilingual) | — | ✓ |
| Pyannote Segmentation 3.0 | Diarization (segmentation) | — | ✓ |
| WeSpeaker ResNet34-LM | Speaker embedding | — | ✓ |
| VoxCPM2 (2B) | Text-to-speech (48 kHz, voice cloning) | — | ✓ |
| Kokoro 82M | Text-to-speech | ✓ | — |
| DeepFilterNet3 | Speech enhancement | ✓ | — |
| PersonaPlex 7B | Full-duplex speech-to-speech (CUDA) | ✓ | — |
Quick start
Build the core plus the LiteRT backend (the runtime library is extracted from the ai-edge-litert wheel — no TensorFlow build):
```shell
git clone https://github.com/soniqo/speech-core && cd speech-core
scripts/fetch_litert.sh build/litert
cmake -B build -DCMAKE_BUILD_TYPE=Release \
      -DSPEECH_CORE_WITH_LITERT=ON -DLITERT_DIR=$PWD/build/litert
cmake --build build
```
Then link the targets you need:
```cmake
target_link_libraries(my_app PRIVATE speech_core)                           # orchestration only
target_link_libraries(my_app PRIVATE speech_core speech_core_models)        # + ONNX models
target_link_libraries(my_app PRIVATE speech_core speech_core_models_litert) # + LiteRT models
```
Transcribing an audio buffer is a few lines:
```cpp
#include <speech_core/models/litert_parakeet_stt.h>

speech_core::LiteRTParakeetStt stt(
    "parakeet-encoder.tflite", "parakeet-decoder-joint.tflite", "vocab.json");
auto r = stt.transcribe(audio, n_samples, 16000);  // r.text / r.language / r.confidence
```
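The call above takes a sample buffer, a sample count, and what appears to be the sample rate in Hz. Capture APIs and WAV files often deliver interleaved 16-bit integer PCM, so a conversion step may be needed first. The sketch below assumes, without confirmation from the source, that `transcribe` expects mono floating-point samples in the range [-1, 1]; `pcm16_to_mono_float` is a hypothetical helper and not part of Speech Core.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: converts interleaved 16-bit PCM to mono float in [-1, 1).
// `frames` is the number of sample frames (samples per channel).
std::vector<float> pcm16_to_mono_float(const std::int16_t* pcm,
                                       std::size_t frames,
                                       std::size_t channels) {
    assert(channels > 0);
    assert(pcm != nullptr || frames == 0);
    std::vector<float> mono(frames);
    for (std::size_t f = 0; f < frames; ++f) {
        float sum = 0.0f;
        for (std::size_t c = 0; c < channels; ++c) {
            // Scale by 2^15 so that -32768 maps to exactly -1.0.
            sum += static_cast<float>(pcm[f * channels + c]) / 32768.0f;
        }
        mono[f] = sum / static_cast<float>(channels);  // average channels to downmix
    }
    return mono;
}
```

If the float-buffer assumption holds, the result could be passed along as `stt.transcribe(samples.data(), samples.size(), 16000)`. The helper does not resample, so audio captured at another rate would still need converting to the rate the model expects.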
A reference Linux build — libspeech.so with a small C ABI, an ALSA demo CLI, and transcribe/synthesize/phonemize tools — lives at examples/linux. It targets embedded ARM64 (Yocto, Qualcomm SA8295P / SA8255P) and any Linux dev box. Setup steps are in the Linux getting-started guide.
On Android, use speech-android — a Kotlin SDK that packages Speech Core behind a JNI bridge (implementation("audio.soniqo:speech:0.0.9")). On macOS and iOS, use speech-swift, which runs the models on CoreML, MLX, and the Apple Neural Engine.
Documentation
- docs/: full in-repo documentation
- docs/pipeline.md: the `VoicePipeline` state machine, AEC integration, and tool-call loop
- docs/models.md: full model inventory
- huggingface.co/soniqo: converted model weights (ONNX, LiteRT)
Feedback
Open an issue at github.com/soniqo/speech-core/issues, or join the Discord.