CLI Reference

The `speech` binary is the main entry point for all speech processing tasks. Build it with `make build`, then run it from `.build/release/speech`.

transcribe

Transcribe audio files to text.

speech transcribe <file> [options]

Option	Default	Description
`<file>`		Audio file to transcribe (WAV, M4A, MP3, CAF)
`--engine`	`qwen3`	ASR engine: `qwen3`, `qwen3-coreml`, `parakeet`, `nemotron`, or `omnilingual`
`--model, -m`	`0.6B`	Model variant: `0.6B`, `1.7B`, or full HuggingFace model ID (qwen3 only)
`--language`		Language hint (optional, ignored by omnilingual)
`--window`	`10`	`[omnilingual]` CoreML window size in seconds: `5` or `10`
`--backend`	`coreml`	`[omnilingual]` Backend: `coreml` (Neural Engine) or `mlx` (Metal GPU)
`--variant`	`300M`	`[omnilingual mlx]` Size: `300M`, `1B`, `3B`, or `7B`
`--bits`	`4`	`[omnilingual mlx]` Quantisation bits: `4` or `8`
`--stream`		Enable streaming transcription with VAD
`--max-segment`	`10`	Maximum segment duration in seconds (streaming)
`--partial`		Emit partial results during speech (streaming)

Examples:

# Basic transcription
speech transcribe recording.wav

# Use larger model
speech transcribe recording.wav --model 1.7B

# CoreML encoder (Neural Engine) + MLX decoder
speech transcribe recording.wav --engine qwen3-coreml

# Use Parakeet (CoreML) engine
speech transcribe recording.wav --engine parakeet

# Use Nemotron Streaming (CoreML, English with native punctuation)
speech transcribe recording.wav --engine nemotron                                 # batch
speech transcribe recording.wav --engine nemotron --stream --partial              # streaming

# Omnilingual (CoreML, 1,672 languages)
speech transcribe recording.wav --engine omnilingual                              # 10 s window
speech transcribe recording.wav --engine omnilingual --window 5                     # 5 s window

# Omnilingual (MLX, any length up to 40 s)
speech transcribe recording.wav --engine omnilingual --backend mlx                              # 300M @ 4-bit
speech transcribe recording.wav --engine omnilingual --backend mlx --variant 1B                  # 1B @ 4-bit
speech transcribe recording.wav --engine omnilingual --backend mlx --variant 3B --bits 8         # 3B @ 8-bit
speech transcribe recording.wav --engine omnilingual --backend mlx --variant 7B                  # 7B @ 4-bit

# Streaming with VAD
speech transcribe recording.wav --stream --partial
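
To transcribe a whole directory, the single-file command can be wrapped in a loop. This is a minimal sketch, not part of the CLI: the `transcribe_dir` helper and the `SPEECH` variable are hypothetical names, and it assumes that `speech transcribe` writes the transcript to standard output.

```shell
# Path to the binary; defaults to the build location given at the top of this page.
SPEECH="${SPEECH:-.build/release/speech}"

# transcribe_dir <dir> [options]: hypothetical helper, not a speech subcommand.
# Assumption: `speech transcribe` prints the transcript on stdout.
transcribe_dir() {
    dir="$1"
    shift
    for f in "$dir"/*.wav; do
        # With no matching files the glob stays literal, so skip it.
        [ -e "$f" ] || continue
        # Remaining arguments are passed through unchanged.
        "$SPEECH" transcribe "$f" "$@" > "${f%.wav}.txt"
    done
}
```

Calling `transcribe_dir recordings --engine parakeet` would then leave one `.txt` file next to each `.wav` file in `recordings`.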

align

Word-level forced alignment — get precise timestamps for every word.

speech align <file> [options]

Option	Default	Description
`<file>`		Audio file
`--text, -t`		Text to align (if omitted, transcribes first)
`--model, -m`	`0.6B`	ASR model for transcription: `0.6B`, `1.7B`, or full ID
`--aligner-model`		Forced aligner model ID
`--language`		Language hint

Examples:

# Auto-transcribe then align
speech align recording.wav

# Align with known text
speech align recording.wav --text "Can you guarantee that the replacement part will be shipped tomorrow?"

speak

Text-to-speech synthesis.

speech speak "<text>" [options]

Option	Default	Description
`<text>`		Text to synthesize (optional if using `--batch-file`)
`--engine`	`qwen3`	TTS engine: `qwen3`, `cosyvoice`, `voxcpm2`, `magpie`, or `magpie-coreml`
`--output, -o`	`output.wav`	Output WAV file path
`--language`	`english`	Language. Omit to use speaker's native dialect when `--speaker` is set.
`--stream`		Enable streaming synthesis
`--voice-sample`		Reference audio for voice cloning (accepted by the `qwen3`, `cosyvoice`, and `voxcpm2` engines; rejected by Magpie)
`--verbose`		Show detailed timing info

Qwen3-TTS Options

Option	Default	Description
`--model`	`base`	Model variant: `base`, `customVoice`, or full HF model ID
`--speaker`		Speaker voice (requires `--model customVoice`)
`--instruct`		Style instruction (CustomVoice model)
`--list-speakers`		List available speakers and exit
`--temperature`	`0.3`	Sampling temperature
`--top-k`	`50`	Top-k sampling
`--max-tokens`	`500`	Maximum tokens (500 = ~40s audio)
`--batch-file`		File with one text per line for batch synthesis
`--batch-size`	`4`	Max batch size for parallel generation
`--first-chunk-frames`	`3`	Codec frames in first streamed chunk
`--chunk-frames`	`25`	Codec frames per streamed chunk

CosyVoice3 Options

Option	Default	Description
`--speakers`		Speaker mapping for multi-speaker dialogue: `s1=alice.wav,s2=bob.wav`
`--cosy-instruct`		Style instruction (overrides default). Controls voice style for CosyVoice3.
`--turn-gap`	`0.2`	Silence gap between dialogue turns in seconds
`--crossfade`	`0.0`	Crossfade overlap between turns in seconds
`--model-id`		HuggingFace model ID

VoxCPM2 Options

Option	Default	Description
`--voxcpm2-variant`	`bf16`	Quantisation variant: `bf16`, `int8`, or `int4`. Resolves to `aufklarer/VoxCPM2-MLX-<variant>`.
`--voxcpm2-instruct`		Natural-language voice description (voice design), e.g. "a young woman, warm and gentle".
`--voxcpm2-ref-audio`		Reference audio file for cloning (16 kHz mono, resampled internally).
`--voxcpm2-prompt-audio` / `--voxcpm2-prompt-text`		"Ultimate cloning" pair — reference clip + its transcript for prosody-preserving cloning.
`--voxcpm2-cfg-value`	`2.0`	Classifier-free guidance scale for the diffusion sampler.
`--voxcpm2-timesteps`	`10`	Euler solver steps per generated audio patch.
`--voxcpm2-max-tokens`	`2000`	Max generated patches before forced stop.
`--voxcpm2-min-tokens`	`2`	Min patches before the stop head is allowed to fire.
`--seed`		Seed MLX RNG before synthesis (deterministic across runs).

Magpie Options

NVIDIA Magpie-TTS Multilingual 357M: 9 languages with 5 baked speakers. Pick the backend with `--engine magpie` (MLX, default) or `--engine magpie-coreml` (CoreML for the big models, with MLX driving the LocalTransformer and audio embeddings). See the Magpie guide for the full per-language G2P breakdown. Voice cloning is not supported: `--voice-sample`, `--speaker`, and `--instruct` are rejected with an error that points to `--magpie-speaker` instead.

Option	Default	Description
`--magpie-variant`	`int4`	MLX-only. Quantisation: `int4` (247 MB) or `int8` (411 MB). Resolves to `aufklarer/Magpie-TTS-Multilingual-357M-MLX-<variant>`. The CoreML engine uses the INT8 CoreML bundle and ignores this flag.
`--magpie-speaker`	`sofia`	Baked speaker: `sofia`, `aria`, `jason`, `leo`, or `john`. Identity is consistent across all 9 languages and both backends.
`--magpie-temperature`	`0.6`	Sampling temperature (0 = greedy). Use `0.6` for Japanese — greedy gets stuck on the first phrase.
`--magpie-top-k`	`80`	Top-k filter for sampling.
`--magpie-max-frames`	`500`	Hard cap on codec frames (~23 s).
`--magpie-min-frames`	`4`	Minimum frames before EOS allowed.
`--magpie-prephonemized`		Treat input as IPA / phoneme stream; skip per-language G2P.
`--list-speakers`		Print the 5 baked speakers and exit.

`magpie-coreml` caveats: the bundled NanoCodec is traced at a fixed 64-frame window, so `--stream` is rejected. `--language ja` auto-routes to the MLX backend with a stderr note (the CoreML bundle doesn't ship JA tokenizer assets yet). The CoreML engine lazy-loads the MLX bundle on first synthesis to drive the LocalTransformer and average audio embeddings; pure-CoreML deployment is tracked as a follow-up.

Examples:

# Basic TTS
speech speak "Hello, world!" --output hello.wav

# Voice cloning (Qwen3-TTS)
speech speak "Hello in your voice" --voice-sample reference.wav -o cloned.wav

# Voice cloning (CosyVoice)
speech speak "Hello in your voice" --engine cosyvoice --voice-sample reference.wav -o cloned.wav

# CosyVoice multilingual
speech speak "Hallo Welt" --engine cosyvoice --language german -o hallo.wav

# Multi-speaker dialogue
speech speak "[S1] Hello there! [S2] Hey, how are you?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o dialogue.wav

# Inline emotion/style tags
speech speak "(excited) Wow, amazing! (sad) But I have to go..." \
    --engine cosyvoice -o emotion.wav

# Combined: dialogue + emotions + voice cloning
speech speak "[S1] (happy) Great news! [S2] (surprised) Really?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o combined.wav

# Custom style instruction
speech speak "Hello world" --engine cosyvoice --cosy-instruct "Speak cheerfully" -o cheerful.wav

# Magpie multilingual TTS — same Aria voice across 9 languages
speech speak "Hello, world." --engine magpie --magpie-speaker aria \
    --magpie-temperature 0 -o en.wav
speech speak "Hola mundo." --engine magpie --language es --magpie-speaker aria \
    --magpie-temperature 0 -o es.wav
# Japanese needs stochastic sampling
speech speak "こんにちは世界、これは音声合成システムです。" \
    --engine magpie --language ja --magpie-temperature 0.6 \
    --magpie-top-k 80 --seed 42 -o ja.wav
speech speak --engine magpie --list-speakers

# Magpie CoreML backend (ANE-accelerated, 8 languages, no streaming)
speech speak "Hello world." --engine magpie-coreml --magpie-speaker aria -o en.wav
speech speak "Hola mundo." --engine magpie-coreml --language es \
    --magpie-speaker leo -o es.wav
# Japanese auto-routes to MLX (CoreML bundle has no JA tokenizer)
speech speak "こんにちは。" --engine magpie-coreml --language ja -o ja.wav

# Streaming synthesis
speech speak "Long text here..." --stream

# Batch synthesis from file
speech speak --batch-file texts.txt --batch-size 4

# VoxCPM2 — 48 kHz studio output
speech speak "Hello there." --engine voxcpm2 --voxcpm2-variant int8 -o hi.wav

# VoxCPM2 — voice design
speech speak "Welcome to the show." --engine voxcpm2 \
    --voxcpm2-instruct "A young woman, warm and gentle voice." -o design.wav

# VoxCPM2 — single-reference cloning
speech speak "This is a cloned voice." --engine voxcpm2 \
    --voice-sample speaker.wav -o clone.wav

kokoro

Lightweight text-to-speech using Kokoro-82M on Neural Engine (CoreML). Non-autoregressive — single forward pass, ~45ms latency.

speech kokoro "<text>" [options]

Option	Default	Description
`<text>`		Text to synthesize
`--voice`	`af_heart`	Voice preset (50 available across 10 languages)
`--language`	`en`	Language code: en, es, fr, hi, it, ja, pt, zh, ko, de
`--output, -o`	`kokoro_output.wav`	Output WAV file path
`--list-voices`		List all available voices and exit
`--model, -m`		HuggingFace model ID

Examples:

# Basic Kokoro TTS
speech kokoro "Hello, world!" --voice af_heart -o hello.wav

# French voice
speech kokoro "Bonjour le monde" --voice ff_siwis --language fr -o bonjour.wav

# List all 50 voices
speech kokoro --list-voices

respond

Full-duplex speech-to-speech dialogue using PersonaPlex 7B.

speech respond [options]

Option	Default	Description
`--input, -i`		Input audio WAV file (24kHz mono) (required)
`--output, -o`	`response.wav`	Output response WAV file
`--voice`	`NATM0`	Voice preset (e.g. NATM0, NATF1, VARF0)
`--system-prompt`	`assistant`	Preset: `assistant`, `focused`, `customer-service`, `teacher`
`--system-prompt-text`		Custom system prompt text (overrides preset)
`--max-steps`	`200`	Max generation steps at 12.5Hz (~16s)
`--stream`		Emit audio chunks during generation
`--compile`		Enable compiled transformer (warmup + kernel fusion)
`--list-voices`		List available voice presets
`--list-prompts`		List available system prompt presets
`--transcript`		Print the model's inner monologue text
`--json`		Output as JSON (transcript, latency, audio path)
`--verbose`		Show detailed timing info

Sampling Overrides

Option	Default	Description
`--audio-temp`	`0.8`	Audio sampling temperature
`--text-temp`	`0.7`	Text sampling temperature
`--audio-top-k`	`250`	Audio top-k candidates
`--repetition-penalty`	`1.2`	Audio repetition penalty (1.0 = disabled)
`--text-repetition-penalty`	`1.2`	Text repetition penalty (1.0 = disabled)
`--repetition-window`	`30`	Repetition penalty window in frames
`--silence-early-stop`	`15`	Silence frames before early stop (0 = disabled)
`--entropy-threshold`	`0`	Text entropy threshold for early stop (0 = disabled)
`--entropy-window`	`10`	Consecutive low-entropy steps before early stop

Examples:

# Basic speech-to-speech
speech respond --input question.wav

# Use a female voice with compiled transformer
speech respond -i question.wav --voice NATF1 --compile

# Stream response and show transcript
speech respond -i question.wav --stream --transcript --verbose
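
The `--max-steps` default of 200 corresponds to roughly 16 s because generation runs at 12.5 steps per second. The sketch below converts between steps and seconds at that rate. Both function names are hypothetical helpers, and the arithmetic handles whole seconds only.

```shell
# 12.5 steps per second equals 25 steps every 2 seconds.

# Steps needed to cover a whole number of seconds, rounded up.
steps_for_seconds() {
    echo $(( ($1 * 25 + 1) / 2 ))
}

# Approximate duration in whole seconds for a step budget, rounded down.
seconds_for_steps() {
    echo $(( $1 * 2 / 25 ))
}

steps_for_seconds 16     # prints 200
steps_for_seconds 30     # prints 375
seconds_for_steps 200    # prints 16
```

A 30 s response would therefore need `--max-steps 375`.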

audio-translate

Streaming speech-to-speech translation using Kyutai Hibiki Zero-3B. FR / ES / PT / DE → EN, single binary, no cloud.

speech audio-translate <input.wav> [options]

Option	Default	Description
`<input>`		Source audio WAV file (mono, resampled to 24 kHz internally) (required)
`--output, -o`	`translated.wav`	Output 24 kHz English WAV file
`--source-lang`	`fr`	Source language hint (`fr`, `es`, `pt`, `de`). The language is auto-detected, so this flag is metadata only. FR and ES are covered by strict end-to-end canaries; PT and DE are best-effort.
`--quantization`	`4bit`	Variant: `4bit` (~2.7 GB) or `8bit` (~3.9 GB)
`--model-id`		HuggingFace model id override (takes precedence over `--quantization`)
`--compile`		Run the temporal transformer warm-up pass before translating
`--verbose`		Print per-phase timings (Mimi encode, generation, Mimi decode)
`--transcript`		Print the model's inner monologue as raw SentencePiece (SPM) token IDs; decoding them to text is a follow-up

Environment Variables

Variable	Effect
`HIBIKI_GREEDY=1`	Force argmax decoding for text + target audio. Reproducible — used by the strict CI canaries.
`HIBIKI_MODEL_ID`	Override the default `aufklarer/Hibiki-Zero-3B-MLX-4bit` repo at runtime.

Examples:

# Translate a French clip to English
speech audio-translate input_fr.wav -o out_en.wav --source-lang fr

# Spanish, 8-bit, verbose
speech audio-translate input_es.wav -o out.wav --source-lang es --quantization 8bit --verbose

# Deterministic mode (matches the CI regression canaries)
HIBIKI_GREEDY=1 speech audio-translate input_fr.wav -o out.wav --source-lang fr

vad

Offline voice activity detection using Pyannote segmentation.

speech vad <file> [options]

Option	Description
`<file>`	Audio file to analyze
`--model, -m`	HuggingFace model ID
`--onset`	Onset threshold (speech start)
`--offset`	Offset threshold (speech end)
`--min-speech`	Minimum speech duration in seconds
`--min-silence`	Minimum silence duration in seconds
`--json`	Output as JSON

vad-stream

Streaming voice activity detection using Silero VAD v5. Processes audio in 32 ms chunks.

speech vad-stream <file> [options]

Option	Description
`<file>`	Audio file to analyze
`--engine`	VAD engine: `mlx` (default) or `coreml`
`--model, -m`	HuggingFace model ID (auto-selected by engine)
`--onset`	Onset threshold
`--offset`	Offset threshold
`--min-speech`	Minimum speech duration in seconds
`--min-silence`	Minimum silence duration in seconds
`--json`	Output as JSON

wake

On-device wake-word / keyword spotting using the KWS Zipformer (3.49M params, CoreML INT8, 26× real-time, English only).

speech wake <file> [options]

Option	Description
`<file>`	Audio file to analyze
`--keywords`	One or more keywords. Formats: `"hey soniqo"` (greedy BPE), `"hey soniqo:0.15:0.5"` (with threshold/boost), or `"LIGHT UP|▁ L IGHT ▁UP:0.25:2.0"` (sherpa-onnx-style explicit BPE pieces)
`--keywords-file`	Keyword file, one entry per line (same syntax as `--keywords`); `#` for comments
`--model, -m`	HuggingFace model ID. Defaults to `aufklarer/KWS-Zipformer-3M-CoreML-INT8`
`--json`	Output as JSON

Examples:

# Plain phrase, tuned defaults
speech wake recording.wav --keywords "hey soniqo"

# Explicit BPE pieces for phrases the greedy tokenizer gets wrong
speech wake recording.wav --keywords "LIGHT UP|▁ L IGHT ▁UP:0.25:2.0"

# Multiple phrases + JSON output
speech wake recording.wav \
  --keywords "lovely child|▁LOVE LY ▁CHI L D:0.25:2.0" \
             "for ever|▁FOR E VER:0.25:2.0" \
  --json
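
When the keyword list grows, the same entries can live in a file for `--keywords-file`. The sketch below writes such a file with one entry per line and `#` comment lines, as the option table describes. The file name `keywords.txt` and the `KEYWORDS_FILE` variable are illustrative choices, and the entries are the ones used in the examples above.

```shell
# Illustrative file name; any path works with --keywords-file.
KEYWORDS_FILE="${KEYWORDS_FILE:-keywords.txt}"

# Quoted heredoc delimiter: the shell leaves the pipe and colons untouched.
cat > "$KEYWORDS_FILE" <<'EOF'
# Plain phrase with explicit threshold and boost
hey soniqo:0.15:0.5
# Explicit BPE pieces, sherpa-onnx style
LIGHT UP|▁ L IGHT ▁UP:0.25:2.0
EOF

# Count the keyword entries, ignoring comment lines.
grep -c -v '^#' "$KEYWORDS_FILE"    # prints 2
```

The file can then be passed as `speech wake recording.wav --keywords-file keywords.txt --json`.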

diarize

Speaker diarization — identify who spoke when.

speech diarize <file> [options]

Option	Default	Description
`<file>`		Audio file to analyze
`--engine`	`pyannote`	Diarization engine: `pyannote` (segmentation + speaker chaining) or `sortformer` (end-to-end CoreML)
`--target-speaker`		Enrollment audio for target speaker extraction (pyannote only)
`--embedding-engine`	`mlx`	Speaker embedding engine: `mlx` or `coreml` (pyannote only)
`--vad-filter`		Pre-filter with Silero VAD (pyannote only)
`--rttm`		Output in RTTM format
`--json`		Output as JSON
`--score-against`		Reference RTTM file to compute DER

Examples:

# Basic diarization (pyannote, default)
speech diarize meeting.wav

# End-to-end Sortformer (CoreML, Neural Engine)
speech diarize meeting.wav --engine sortformer

# RTTM output for evaluation
speech diarize meeting.wav --rttm

# Target speaker extraction (pyannote only)
speech diarize meeting.wav --target-speaker enrollment.wav

# Score against reference
speech diarize meeting.wav --score-against reference.rttm
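
RTTM is a plain-text format with one `SPEAKER` line per turn, so it is easy to post-process. The sketch below sums speaking time per speaker with `awk`, using the standard RTTM field layout (field 5 is the turn duration, field 8 the speaker name). The `rttm_speaker_totals` helper, the sample turns, and the speaker labels are made up for illustration. The labels that `speech diarize` actually emits may differ.

```shell
# Hypothetical helper: reads RTTM on stdin, prints "<speaker> <seconds>" per speaker.
rttm_speaker_totals() {
    awk '$1 == "SPEAKER" { total[$8] += $5 }
         END { for (s in total) printf "%s %.2f\n", s, total[s] }' | sort
}

# Illustrative input, not real tool output.
rttm_speaker_totals <<'EOF'
SPEAKER meeting 1 0.00 4.50 <NA> <NA> spk0 <NA> <NA>
SPEAKER meeting 1 4.50 2.25 <NA> <NA> spk1 <NA> <NA>
SPEAKER meeting 1 6.75 3.00 <NA> <NA> spk0 <NA> <NA>
EOF
# prints:
# spk0 7.50
# spk1 2.25
```

If `--rttm` writes to standard output (an assumption, since this page does not say), the helper can be fed directly: `speech diarize meeting.wav --rttm | rttm_speaker_totals`.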

embed-speaker

Extract a speaker embedding vector from audio.

speech embed-speaker <file> [options]

Option	Description
`<file>`	Audio file containing speaker voice
`--engine`	Inference engine: `mlx` (default), `coreml` (WeSpeaker 256-dim), or `camplusplus` (CAM++ CoreML 192-dim)
`--json`	Output as JSON

denoise

Remove background noise using DeepFilterNet3 on Neural Engine.

speech denoise <file> [options]

Option	Default	Description
`<file>`		Input audio file
`--output, -o`	`input_clean.wav`	Output file path
`--model, -m`		HuggingFace model ID

Example:

speech denoise noisy-recording.wav -o clean.wav

compose

Generate 30 s of music from a text prompt using MAGNeT on MLX.

speech compose <prompt> [options]

Option	Default	Description
`<prompt>`		Text prompt describing the music to generate (e.g. "happy rock")
`--output, -o`	`magnet.wav`	Output WAV path (32 kHz mono)
`--variant`	`small-int4`	Model variant: `small-int4`, `small-int8`, `medium-int4`, or `medium-int8`. Resolves to `aufklarer/MAGNeT-{Small,Medium}-30secs-MLX-{4,8}bit`.
`--temperature`	`3.0`	Sampling temperature, annealed linearly per stage.
`--top-p`	`0.9`	Nucleus sampling threshold.
`--cfg-max`	`10.0`	Max classifier-free guidance coefficient.
`--cfg-min`	`1.0`	Min CFG coefficient (annealed alongside the mask schedule).
`--steps`	`20,10,10,10`	Comma-separated decoding iterations per codebook (4 values).
`--seed`		Random seed for reproducible output.

Examples:

# Default: small-int4, ~10 s wall-clock on M-series for a 30 s clip
speech compose "happy rock" -o happy_rock.wav

# Larger model — better prompt following, slower
speech compose "lo-fi hip hop with mellow piano" --variant medium-int4 -o lofi.wav

# Reproducible
speech compose "energetic EDM with synth lead" --seed 42 -o edm.wav
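
The `--steps` value must contain four comma-separated counts, one per codebook. The sketch below checks a value before it is passed to `speech compose` and reports the total number of decoding iterations. It is a hypothetical pre-flight check written for POSIX `sh` or `bash`, not something the CLI requires.

```shell
steps="20,10,10,10"    # the documented default

count=0
total=0
old_ifs=$IFS
IFS=,                  # split the value on commas
for n in $steps; do
    count=$((count + 1))
    total=$((total + n))
done
IFS=$old_ifs

if [ "$count" -ne 4 ]; then
    echo "expected 4 values, got $count" >&2
else
    echo "$count codebooks, $total iterations in total"
fi
# prints: 4 codebooks, 50 iterations in total
```

More iterations generally cost more decoding time, so the total is a quick way to compare two `--steps` settings.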