Open source · Apache 2.0 · Fully offline

On-device speech.
For real products.

Diarized transcription, zero-shot voice cloning, long-form speech synthesis — running on Apple Silicon, Android, Windows, and embedded Linux. No cloud APIs, no per-minute pricing, no data leaving the device.

Get started · GitHub

Apple · Homebrew

brew install speech

Android · Gradle

implementation("audio.soniqo:speech:0.0.9")

Local Speech AI on a MacBook — 4-minute library tour

A four-minute open-source library tour: realtime transcription with Nemotron Streaming, local speech-to-speech with PersonaPlex, and 48 kHz voice cloning with VoxCPM2 — every demo runs on the laptop.

Watch

July 7, 2026 · Soniqo’s Blog

On-device voice agents: one pipeline, three memory budgets.

VAD → STT → LLM → TTS on phone and Mac · ~1.2 GB iPhone · ~1.5 GB S23 · <4 GB desktop

The same VAD → STT → LLM → TTS voice-agent pipeline on iPhone, Galaxy S23, and a Mac — with measured on-device memory for each budget.

Read
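The VAD → STT → LLM → TTS chain described above can be sketched as four pluggable stages. This is a minimal illustration, not the Soniqo API: the `VoiceAgent` class, the `energy_vad` helper, and the stub stages are all hypothetical names made up for this example, and the energy threshold is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical stage signatures for illustration only (not Soniqo APIs).
Vad = Callable[[List[float]], bool]        # audio samples -> speech present?
Stt = Callable[[List[float]], str]         # audio samples -> transcript
Llm = Callable[[str], str]                 # transcript -> reply text
Tts = Callable[[str], List[float]]         # reply text -> audio samples


@dataclass
class VoiceAgent:
    vad: Vad
    stt: Stt
    llm: Llm
    tts: Tts

    def respond(self, audio: List[float]) -> Optional[List[float]]:
        """Run one turn. Returns None when the VAD finds no speech,
        so the heavier STT, LLM, and TTS stages are skipped."""
        if not self.vad(audio):
            return None
        transcript = self.stt(audio)
        reply = self.llm(transcript)
        return self.tts(reply)


def energy_vad(audio: List[float], threshold: float = 0.01) -> bool:
    """Toy VAD: mean squared amplitude above an assumed threshold."""
    if not audio:
        return False
    energy = sum(s * s for s in audio) / len(audio)
    return energy > threshold


# Stub stages stand in for real models.
agent = VoiceAgent(
    vad=energy_vad,
    stt=lambda audio: "hello",
    llm=lambda text: f"you said: {text}",
    tts=lambda text: [0.0] * len(text),
)

silence = [0.0] * 160
speech = [0.5] * 160
print(agent.respond(silence))       # no speech detected
print(len(agent.respond(speech)))   # one sample per reply character in this stub
```

Keeping each stage behind a plain callable is one way to swap a smaller or larger model per memory budget without touching the rest of the pipeline.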

July 2, 2026 · Soniqo’s Blog

Voice cloning models, measured across five languages.

8 engines: cosine, WER, UTMOS, RTF, peak RSS · New rows: F5-TTS, Higgs TTS 3, IndexTTS2 · Reference vs clone audio per language

Ten FLEURS reference/target pairs per language: speaker similarity, WER/CER, UTMOS quality, speed, and memory for every local cloning engine, with audio to verify by ear.

Read
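Two of the metrics named above, WER and cosine speaker similarity, can be computed as in the sketch below. It uses the conventional definitions (word-level edit distance divided by reference length, and the cosine of the angle between two embedding vectors); the benchmark's own tooling, text normalization, and speaker-embedding model are not given here, so treat the function names and details as assumptions.

```python
import math
from typing import Sequence


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length.
    Splits on whitespace only; real evaluations usually normalize text first."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker-embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# One substituted word out of three reference words.
print(wer("the cat sat", "the dog sat"))
# Identical directions give the maximum similarity.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

Lower WER means the cloned audio is more intelligible to the recognizer; higher cosine similarity means the clone's embedding sits closer to the reference speaker's.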

YouTube · 30 sec

Which voice is real? Local clone vs ElevenLabs, in 30 seconds

A 30-second blind comparison: a real voice, the same voice cloned locally by Speech Studio on a MacBook, and the ElevenLabs cloud clone.

Watch

What you can build

Three on-device use-case groups.

Each group spans several sub-use-cases stitched from Soniqo components. Drop in your audio, get conversation, transcripts, or generated speech back — locally, in real time.

Conversational

Voice Agents

Build voice-first interfaces — from full-duplex speech-to-speech to wake-word-driven compositional pipelines, all running locally.

Learn more

Audio understanding

Transcription

Turn audio into structured text: realtime streaming for live captions and dictation, high-accuracy batch processing for archives, and diarization to label each speaker.

Learn more

Content creation

Speech Generation

Synthesize speech in any voice — clone a voice in seconds, narrate audiobooks for hours, or cast multi-speaker podcasts, fully offline.

Learn more

All components

Thirty-plus models. One stack.

The use-case pipelines above are stitched from these models. Pick a component to read its architecture, CLI, Swift API, and benchmarks. All run on Apple Silicon, most also on Android and Linux.

Speech-to-Text

Qwen3-ASR

52 langs, RTF 0.06, 4-/8-bit
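To put the RTF figure in concrete terms: assuming the conventional definition of real-time factor (processing time divided by audio duration), an RTF of 0.06 means one hour of audio takes roughly 216 seconds to transcribe. The helper name below is hypothetical, and actual throughput depends on hardware and quantization.

```python
def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated processing time, assuming RTF = processing time / audio duration."""
    if audio_seconds < 0 or rtf < 0:
        raise ValueError("audio_seconds and rtf must be non-negative")
    return audio_seconds * rtf


one_hour = 60 * 60
estimate = processing_seconds(one_hour, rtf=0.06)
print(f"{estimate:.0f} s for one hour of audio")
```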

Text-to-Speech

Audio Analysis

Music & Audio Production

LLM & Speech-to-Speech

Avatar
