# CLI Reference

The `audio` binary is the main entry point for all speech-processing tasks. Build it with `make build`, then run it from `.build/release/audio`.
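A minimal build-and-run sketch of the flow described above. This assumes the repository's Makefile exposes the `build` target mentioned here; `--help` is an assumption (most ArgumentParser-style Swift CLIs provide it) and is not documented below.

```shell
# Build the release binary, then invoke it from the build directory
make build
.build/release/audio --help                       # list subcommands (assumed flag)
.build/release/audio transcribe recording.wav     # or run a subcommand directly
```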
## transcribe

Transcribe audio files to text.

```shell
audio transcribe <file> [options]
```

| Option | Default | Description |
|---|---|---|
| `<file>` | | Audio file to transcribe (WAV, M4A, MP3, CAF) |
| `--engine` | `qwen3` | ASR engine: `qwen3`, `qwen3-coreml`, or `parakeet` |
| `--model`, `-m` | `0.6B` | Model variant: `0.6B`, `1.7B`, or full HuggingFace model ID (`qwen3` only) |
| `--language` | | Language hint (optional) |
| `--stream` | | Enable streaming transcription with VAD |
| `--max-segment` | `10` | Maximum segment duration in seconds (streaming) |
| `--partial` | | Emit partial results during speech (streaming) |
Examples:

```shell
# Basic transcription
audio transcribe recording.wav

# Use larger model
audio transcribe recording.wav --model 1.7B

# CoreML encoder (Neural Engine + MLX decoder)
audio transcribe recording.wav --engine qwen3-coreml

# Use Parakeet (CoreML) engine
audio transcribe recording.wav --engine parakeet

# Streaming with VAD
audio transcribe recording.wav --stream --partial
```
## align

Word-level forced alignment: get precise timestamps for every word.

```shell
audio align <file> [options]
```

| Option | Default | Description |
|---|---|---|
| `<file>` | | Audio file |
| `--text`, `-t` | | Text to align (if omitted, transcribes first) |
| `--model`, `-m` | `0.6B` | ASR model for transcription: `0.6B`, `1.7B`, or full HuggingFace model ID |
| `--aligner-model` | | Forced-aligner model ID |
| `--language` | | Language hint |
Examples:

```shell
# Auto-transcribe then align
audio align recording.wav

# Align with known text
audio align recording.wav --text "Can you guarantee that the replacement part will be shipped tomorrow?"
```
## speak

Text-to-speech synthesis.

```shell
audio speak "<text>" [options]
```

| Option | Default | Description |
|---|---|---|
| `<text>` | | Text to synthesize (optional if using `--batch-file`) |
| `--engine` | `qwen3` | TTS engine: `qwen3` or `cosyvoice` |
| `--output`, `-o` | `output.wav` | Output WAV file path |
| `--language` | `english` | Language. Omit to use the speaker's native dialect when `--speaker` is set. |
| `--stream` | | Enable streaming synthesis |
| `--voice-sample` | | Reference audio for voice cloning (works with both `qwen3` and `cosyvoice` engines) |
| `--verbose` | | Show detailed timing info |
### Qwen3-TTS Options

| Option | Default | Description |
|---|---|---|
| `--model` | `base` | Model variant: `base`, `customVoice`, or full HF model ID |
| `--speaker` | | Speaker voice (requires `--model customVoice`) |
| `--instruct` | | Style instruction (CustomVoice model) |
| `--list-speakers` | | List available speakers and exit |
| `--temperature` | `0.3` | Sampling temperature |
| `--top-k` | `50` | Top-k sampling |
| `--max-tokens` | `500` | Maximum tokens (500 = ~40s audio) |
| `--batch-file` | | File with one text per line for batch synthesis |
| `--batch-size` | `4` | Max batch size for parallel generation |
| `--first-chunk-frames` | `3` | Codec frames in the first streamed chunk |
| `--chunk-frames` | `25` | Codec frames per streamed chunk |
### CosyVoice3 Options

| Option | Default | Description |
|---|---|---|
| `--speakers` | | Speaker mapping for multi-speaker dialogue, e.g. `s1=alice.wav,s2=bob.wav` |
| `--cosy-instruct` | | Style instruction for CosyVoice3 (overrides the default voice style) |
| `--turn-gap` | `0.2` | Silence gap between dialogue turns in seconds |
| `--crossfade` | `0.0` | Crossfade overlap between turns in seconds |
| `--model-id` | | HuggingFace model ID |
Examples:

```shell
# Basic TTS
audio speak "Hello, world!" --output hello.wav

# Voice cloning (Qwen3-TTS)
audio speak "Hello in your voice" --voice-sample reference.wav -o cloned.wav

# Voice cloning (CosyVoice)
audio speak "Hello in your voice" --engine cosyvoice --voice-sample reference.wav -o cloned.wav

# CosyVoice multilingual
audio speak "Hallo Welt" --engine cosyvoice --language german -o hallo.wav

# Multi-speaker dialogue
audio speak "[S1] Hello there! [S2] Hey, how are you?" \
  --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o dialogue.wav

# Inline emotion/style tags
audio speak "(excited) Wow, amazing! (sad) But I have to go..." \
  --engine cosyvoice -o emotion.wav

# Combined: dialogue + emotions + voice cloning
audio speak "[S1] (happy) Great news! [S2] (surprised) Really?" \
  --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o combined.wav

# Custom style instruction
audio speak "Hello world" --engine cosyvoice --cosy-instruct "Speak cheerfully" -o cheerful.wav

# Streaming synthesis
audio speak "Long text here..." --stream

# Batch synthesis from file
audio speak --batch-file texts.txt --batch-size 4
```
## kokoro

Lightweight text-to-speech using Kokoro-82M on the Neural Engine (CoreML). Non-autoregressive: a single forward pass, ~45ms latency.

```shell
audio kokoro "<text>" [options]
```

| Option | Default | Description |
|---|---|---|
| `<text>` | | Text to synthesize |
| `--voice` | `af_heart` | Voice preset (50 available across 10 languages) |
| `--language` | `en` | Language code: `en`, `es`, `fr`, `hi`, `it`, `ja`, `pt`, `zh`, `ko`, `de` |
| `--output`, `-o` | `kokoro_output.wav` | Output WAV file path |
| `--list-voices` | | List all available voices and exit |
| `--model`, `-m` | | HuggingFace model ID |
Examples:

```shell
# Basic Kokoro TTS
audio kokoro "Hello, world!" --voice af_heart -o hello.wav

# French voice
audio kokoro "Bonjour le monde" --voice ff_siwis --language fr -o bonjour.wav

# List all 50 voices
audio kokoro --list-voices
```
## respond

Full-duplex speech-to-speech dialogue using PersonaPlex 7B.

```shell
audio respond [options]
```

| Option | Default | Description |
|---|---|---|
| `--input`, `-i` | | Input audio WAV file (24kHz mono); required |
| `--output`, `-o` | `response.wav` | Output response WAV file |
| `--voice` | `NATM0` | Voice preset (e.g. `NATM0`, `NATF1`, `VARF0`) |
| `--system-prompt` | `assistant` | Preset: `assistant`, `focused`, `customer-service`, `teacher` |
| `--max-steps` | `200` | Max generation steps at 12.5Hz (~16s) |
| `--stream` | | Emit audio chunks during generation |
| `--compile` | | Enable compiled transformer (warmup + kernel fusion) |
| `--list-voices` | | List available voice presets |
| `--list-prompts` | | List available system-prompt presets |
| `--transcript` | | Print the model's inner-monologue text |
| `--json` | | Output as JSON (transcript, latency, audio path) |
| `--verbose` | | Show detailed timing info |
### Sampling Overrides

| Option | Default | Description |
|---|---|---|
| `--audio-temp` | `0.8` | Audio sampling temperature |
| `--text-temp` | `0.7` | Text sampling temperature |
| `--audio-top-k` | `250` | Audio top-k candidates |
| `--repetition-penalty` | `1.2` | Audio repetition penalty (1.0 = disabled) |
| `--text-repetition-penalty` | `1.2` | Text repetition penalty (1.0 = disabled) |
| `--repetition-window` | `30` | Repetition-penalty window in frames |
| `--silence-early-stop` | `15` | Silence frames before early stop (0 = disabled) |
| `--entropy-threshold` | `0` | Text entropy threshold for early stop (0 = disabled) |
| `--entropy-window` | `10` | Consecutive low-entropy steps before early stop |
Examples:

```shell
# Basic speech-to-speech
audio respond --input question.wav

# Use a female voice with compiled transformer
audio respond -i question.wav --voice NATF1 --compile

# Stream response and show transcript
audio respond -i question.wav --stream --transcript --verbose
```
## vad

Offline voice activity detection using Pyannote segmentation.

```shell
audio vad <file> [options]
```

| Option | Description |
|---|---|
| `<file>` | Audio file to analyze |
| `--model`, `-m` | HuggingFace model ID |
| `--onset` | Onset threshold (speech start) |
| `--offset` | Offset threshold (speech end) |
| `--min-speech` | Minimum speech duration in seconds |
| `--min-silence` | Minimum silence duration in seconds |
| `--json` | Output as JSON |
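Sketched invocations using only the flags documented above; the threshold and duration values are illustrative, not documented defaults.

```shell
# Detect speech regions with default settings
audio vad meeting.wav

# Tune sensitivity and emit machine-readable output
# (0.6 / 0.4 / 0.25 are example values, not defaults)
audio vad meeting.wav --onset 0.6 --offset 0.4 --min-speech 0.25 --json
```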
## vad-stream

Streaming voice activity detection using Silero VAD v5. Processes audio in 32ms chunks.

```shell
audio vad-stream <file> [options]
```

| Option | Description |
|---|---|
| `<file>` | Audio file to analyze |
| `--engine` | VAD engine: `mlx` (default) or `coreml` |
| `--model`, `-m` | HuggingFace model ID (auto-selected by engine) |
| `--onset` | Onset threshold |
| `--offset` | Offset threshold |
| `--min-speech` | Minimum speech duration in seconds |
| `--min-silence` | Minimum silence duration in seconds |
| `--json` | Output as JSON |
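Sketched invocations using only the flags documented above:

```shell
# Streaming VAD with the default MLX engine
audio vad-stream meeting.wav

# CoreML engine with JSON output
audio vad-stream meeting.wav --engine coreml --json
```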
## diarize

Speaker diarization: identify who spoke when.

```shell
audio diarize <file> [options]
```

| Option | Default | Description |
|---|---|---|
| `<file>` | | Audio file to analyze |
| `--engine` | `pyannote` | Diarization engine: `pyannote` (segmentation + speaker chaining) or `sortformer` (end-to-end CoreML) |
| `--target-speaker` | | Enrollment audio for target-speaker extraction (`pyannote` only) |
| `--embedding-engine` | `mlx` | Speaker embedding engine: `mlx` or `coreml` (`pyannote` only) |
| `--vad-filter` | | Pre-filter with Silero VAD (`pyannote` only) |
| `--rttm` | | Output in RTTM format |
| `--json` | | Output as JSON |
| `--score-against` | | Reference RTTM file to compute DER |
Examples:

```shell
# Basic diarization (pyannote, default)
audio diarize meeting.wav

# End-to-end Sortformer (CoreML, Neural Engine)
audio diarize meeting.wav --engine sortformer

# RTTM output for evaluation
audio diarize meeting.wav --rttm

# Target speaker extraction (pyannote only)
audio diarize meeting.wav --target-speaker enrollment.wav

# Score against reference
audio diarize meeting.wav --score-against reference.rttm
```
## embed-speaker

Extract a 256-dimensional speaker embedding vector.

```shell
audio embed-speaker <file> [options]
```

| Option | Description |
|---|---|
| `<file>` | Audio file containing the speaker's voice |
| `--engine` | Inference engine: `mlx` (default) or `coreml` |
| `--json` | Output as JSON |
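Sketched invocations using only the flags documented above:

```shell
# Extract a 256-dimensional embedding for the given recording
audio embed-speaker alice.wav

# CoreML engine with JSON output (e.g. for piping into other tools)
audio embed-speaker alice.wav --engine coreml --json
```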
## denoise

Remove background noise using DeepFilterNet3 on the Neural Engine.

```shell
audio denoise <file> [options]
```

| Option | Default | Description |
|---|---|---|
| `<file>` | | Input audio file |
| `--output`, `-o` | `input_clean.wav` | Output file path |
| `--model`, `-m` | | HuggingFace model ID |

Example:

```shell
audio denoise noisy-recording.wav -o clean.wav
```