Speech Studio
Open-source Mac app for local voice cloning and multi-speaker dialog generation. Drop a voice sample, clone it, write a scene, synthesize — all on your laptop. No API keys, no cloud, no per-character pricing.
A 30-second blind test: a real voice, the same voice cloned locally by Speech Studio, and the same voice cloned by ElevenLabs in the cloud. Can you tell which is which?
What it does
- Voice cloning from a short reference — drop in a few seconds of speech, clone the voice locally.
- Multi-speaker dialog generation — write a scene with multiple speakers, synthesize all of them in one pass.
- Runs entirely on your Mac — VoxCPM2 via MLX, DeepFilterNet3 for noise suppression, no network required.
- Open source under Apache 2.0 — fork it, embed it, build on it.
Requirements
- macOS 15+ (Sequoia or later)
- Apple Silicon (M1, M2, M3, M4 series)
- 8 GB RAM minimum (16 GB recommended)
- ~3 GB disk for the voice cloning + denoising models (downloaded from Hugging Face on first use)
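A quick way to check a machine against the list above is a few lines of shell. This is a minimal sketch, not something that ships with Speech Studio: the function names and messages are made up for illustration, and it skips the disk-space check. It relies on the standard macOS tools sw_vers, uname, and sysctl. Note that a terminal running under Rosetta reports x86_64 even on Apple Silicon.

```shell
#!/usr/bin/env bash
# Sketch: compare a machine against the requirements listed above.
# Function names and messages are illustrative, not part of Speech Studio.

# Succeeds when the macOS major version is 15 or newer.
macos_ok() {
  local major="${1%%.*}"
  [ "$major" -ge 15 ]
}

# Succeeds on Apple Silicon, which `uname -m` reports as arm64.
arch_ok() {
  [ "$1" = "arm64" ]
}

# Succeeds when physical memory (in bytes) is at least 8 GB.
ram_ok() {
  [ "$1" -ge $((8 * 1024 * 1024 * 1024)) ]
}

# Print one line per requirement for the given values.
report() {
  local version="$1" arch="$2" mem_bytes="$3"
  if macos_ok "$version"; then echo "macOS $version: ok"; else echo "macOS $version: needs 15+"; fi
  if arch_ok "$arch"; then echo "CPU $arch: ok"; else echo "CPU $arch: needs Apple Silicon"; fi
  if ram_ok "$mem_bytes"; then echo "RAM: ok"; else echo "RAM: below 8 GB"; fi
}

# On a Mac, feed in the real values; elsewhere this block is skipped.
if command -v sw_vers >/dev/null 2>&1; then
  report "$(sw_vers -productVersion)" "$(uname -m)" "$(sysctl -n hw.memsize)"
fi
```

Passing the values in as arguments keeps the checks easy to try with made-up inputs, for example report 14.6 x86_64 4294967296.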
Install
Download the latest .dmg from GitHub Releases, open it, drag Speech Studio to /Applications, and launch it.
On first launch, macOS Gatekeeper warns that the developer can't be verified. Until notarized builds ship, open the app via System Settings → Privacy & Security → Open Anyway. The first run also downloads ~2.75 GB of VoxCPM2 weights from Hugging Face into ~/.cache/huggingface/hub/; subsequent launches reuse the cache.
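To see whether the first-run download has already happened, you can look at the size of that cache directory. The helper below is a small sketch; hf_cache_size is a made-up name, and it measures the whole hub cache, so other Hugging Face models stored there count toward the total.

```shell
# Sketch: report the size of the Hugging Face hub cache, or "missing" if absent.
# hf_cache_size is an illustrative helper name, not part of any tool.
hf_cache_size() {
  local dir="${1:-$HOME/.cache/huggingface/hub}"
  if [ -d "$dir" ]; then
    du -sh "$dir" | cut -f1   # du prints "<size><TAB><path>"; keep the size
  else
    echo "missing"
  fi
}

echo "Model cache: $(hf_cache_size)"
```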
The same voice cloning pipeline ships in the speech CLI, which is useful for scripting or pre-rendering batches. Install it with brew install soniqo/tap/speech, then run speech speak --engine voxcpm2 --voxcpm2-ref-audio reference.wav -o cloned.wav "Hello, this is my cloned voice." See the voice cloning guide for the full flow.
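For batch pre-rendering, one approach is to loop over a text file and call the CLI once per line. The sketch below uses only the flags quoted above. The render_lines function, the DRY_RUN variable, and the line-NNN.wav naming are assumptions made for illustration, not CLI features.

```shell
#!/usr/bin/env bash
# Sketch: render one WAV per non-empty line of a text file with the speech CLI.
# render_lines, DRY_RUN, and the output naming are illustrative assumptions.

render_lines() {
  local ref="$1" script="$2" outdir="$3"
  local n=0 line out
  mkdir -p "$outdir"
  # The `|| [ -n "$line" ]` part keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    if [ -z "$line" ]; then continue; fi   # skip blank lines
    n=$((n + 1))
    out=$(printf '%s/line-%03d.wav' "$outdir" "$n")
    if [ -n "${DRY_RUN:-}" ]; then
      # Dry run: print the command instead of executing it.
      printf 'speech speak --engine voxcpm2 --voxcpm2-ref-audio %s -o %s "%s"\n' \
        "$ref" "$out" "$line"
    else
      # </dev/null stops the CLI from consuming the loop's input file.
      speech speak --engine voxcpm2 --voxcpm2-ref-audio "$ref" -o "$out" "$line" </dev/null
    fi
  done < "$script"
}
```

With a lines.txt that holds one sentence per line, DRY_RUN=1 render_lines reference.wav lines.txt out prints the commands it would run; drop DRY_RUN=1 to synthesize for real.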
Speech Studio is in active preview (v0.0.2). The source repo at github.com/soniqo/speech-studio tracks the GUI app; star or watch it to be notified when notarized releases ship. Linux and Windows builds already compile via speech-core's LiteRT VoxCPM2 engine; the on-device runtime is wired up but has not yet been validated on hardware.
What it's built on
Speech Studio is a thin GUI on top of speech-swift, the open-source Swift library that ships every model used in the demo:
- VoxCPM2 — the voice cloning model (zero-shot, short reference)
- DeepFilterNet3 — denoise the reference + cloned output
- Qwen3-ASR — align speech to text (used in the demo's blind-test build pipeline)
- Forced Alignment — word-level timestamps for editing
- Voice Cloning guide — full overview of the pipeline
Roadmap
- Today: Mac (Apple Silicon).
- Next: Linux (CUDA + CPU), Windows.
- After that: deeper editing surface, plugin support for swappable cloning models.
Feedback
Open an issue at github.com/soniqo/speech-studio/issues — every one gets read.