# CLI Reference

The `audio` binary is the main entry point for all speech processing tasks. Build with `make build`, then run it from `.build/release/audio`.
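For instance, a fresh build followed by a first run looks like this (assuming the repository's Makefile drives the release build, as described above; `recording.wav` is a placeholder file name):

```shell
# Build the release binary
make build

# Run a subcommand from the build products directory
.build/release/audio transcribe recording.wav
```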

## transcribe

Transcribe audio files to text.

```shell
audio transcribe <file> [options]
```

| Option | Default | Description |
| --- | --- | --- |
| `<file>` | | Audio file to transcribe (WAV, M4A, MP3, CAF) |
| `--engine` | `qwen3` | ASR engine: `qwen3`, `qwen3-coreml`, or `parakeet` |
| `--model, -m` | `0.6B` | Model variant: `0.6B`, `1.7B`, or a full HuggingFace model ID (qwen3 only) |
| `--language` | | Language hint (optional) |
| `--stream` | | Enable streaming transcription with VAD |
| `--max-segment` | `10` | Maximum segment duration in seconds (streaming) |
| `--partial` | | Emit partial results during speech (streaming) |

Examples:

```shell
# Basic transcription
audio transcribe recording.wav

# Use larger model
audio transcribe recording.wav --model 1.7B

# CoreML encoder (Neural Engine + MLX decoder)
audio transcribe recording.wav --engine qwen3-coreml

# Use Parakeet (CoreML) engine
audio transcribe recording.wav --engine parakeet

# Streaming with VAD
audio transcribe recording.wav --stream --partial
```

## align

Word-level forced alignment — get precise timestamps for every word.

```shell
audio align <file> [options]
```

| Option | Default | Description |
| --- | --- | --- |
| `<file>` | | Audio file |
| `--text, -t` | | Text to align (if omitted, transcribes first) |
| `--model, -m` | `0.6B` | ASR model for transcription: `0.6B`, `1.7B`, or a full model ID |
| `--aligner-model` | | Forced aligner model ID |
| `--language` | | Language hint |

Examples:

```shell
# Auto-transcribe then align
audio align recording.wav

# Align with known text
audio align recording.wav --text "Can you guarantee that the replacement part will be shipped tomorrow?"
```

## speak

Text-to-speech synthesis.

```shell
audio speak "<text>" [options]
```

| Option | Default | Description |
| --- | --- | --- |
| `<text>` | | Text to synthesize (optional when using `--batch-file`) |
| `--engine` | `qwen3` | TTS engine: `qwen3` or `cosyvoice` |
| `--output, -o` | `output.wav` | Output WAV file path |
| `--language` | `english` | Language. Omit to use the speaker's native dialect when `--speaker` is set. |
| `--stream` | | Enable streaming synthesis |
| `--voice-sample` | | Reference audio for voice cloning (works with both engines) |
| `--verbose` | | Show detailed timing info |

### Qwen3-TTS Options

| Option | Default | Description |
| --- | --- | --- |
| `--model` | `base` | Model variant: `base`, `customVoice`, or a full HF model ID |
| `--speaker` | | Speaker voice (requires `--model customVoice`) |
| `--instruct` | | Style instruction (CustomVoice model) |
| `--list-speakers` | | List available speakers and exit |
| `--temperature` | `0.3` | Sampling temperature |
| `--top-k` | `50` | Top-k sampling |
| `--max-tokens` | `500` | Maximum tokens (500 ≈ 40s of audio) |
| `--batch-file` | | File with one text per line for batch synthesis |
| `--batch-size` | `4` | Max batch size for parallel generation |
| `--first-chunk-frames` | `3` | Codec frames in the first streamed chunk |
| `--chunk-frames` | `25` | Codec frames per streamed chunk |

### CosyVoice3 Options

| Option | Default | Description |
| --- | --- | --- |
| `--speakers` | | Speaker mapping for multi-speaker dialogue: `s1=alice.wav,s2=bob.wav` |
| `--cosy-instruct` | | Style instruction for CosyVoice3 (overrides the default) |
| `--turn-gap` | `0.2` | Silence gap between dialogue turns in seconds |
| `--crossfade` | `0.0` | Crossfade overlap between turns in seconds |
| `--model-id` | | HuggingFace model ID |

Examples:

```shell
# Basic TTS
audio speak "Hello, world!" --output hello.wav

# Voice cloning (Qwen3-TTS)
audio speak "Hello in your voice" --voice-sample reference.wav -o cloned.wav

# Voice cloning (CosyVoice)
audio speak "Hello in your voice" --engine cosyvoice --voice-sample reference.wav -o cloned.wav

# CosyVoice multilingual
audio speak "Hallo Welt" --engine cosyvoice --language german -o hallo.wav

# Multi-speaker dialogue
audio speak "[S1] Hello there! [S2] Hey, how are you?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o dialogue.wav

# Inline emotion/style tags
audio speak "(excited) Wow, amazing! (sad) But I have to go..." \
    --engine cosyvoice -o emotion.wav

# Combined: dialogue + emotions + voice cloning
audio speak "[S1] (happy) Great news! [S2] (surprised) Really?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o combined.wav

# Custom style instruction
audio speak "Hello world" --engine cosyvoice --cosy-instruct "Speak cheerfully" -o cheerful.wav

# Streaming synthesis
audio speak "Long text here..." --stream

# Batch synthesis from file
audio speak --batch-file texts.txt --batch-size 4
```

## kokoro

Lightweight text-to-speech using Kokoro-82M on Neural Engine (CoreML). Non-autoregressive — single forward pass, ~45ms latency.

```shell
audio kokoro "<text>" [options]
```

| Option | Default | Description |
| --- | --- | --- |
| `<text>` | | Text to synthesize |
| `--voice` | `af_heart` | Voice preset (50 available across 10 languages) |
| `--language` | `en` | Language code: `en`, `es`, `fr`, `hi`, `it`, `ja`, `pt`, `zh`, `ko`, `de` |
| `--output, -o` | `kokoro_output.wav` | Output WAV file path |
| `--list-voices` | | List all available voices and exit |
| `--model, -m` | | HuggingFace model ID |

Examples:

```shell
# Basic Kokoro TTS
audio kokoro "Hello, world!" --voice af_heart -o hello.wav

# French voice
audio kokoro "Bonjour le monde" --voice ff_siwis --language fr -o bonjour.wav

# List all 50 voices
audio kokoro --list-voices
```

## respond

Full-duplex speech-to-speech dialogue using PersonaPlex 7B.

```shell
audio respond [options]
```

| Option | Default | Description |
| --- | --- | --- |
| `--input, -i` | | Input audio WAV file (24kHz mono, required) |
| `--output, -o` | `response.wav` | Output response WAV file |
| `--voice` | `NATM0` | Voice preset (e.g. `NATM0`, `NATF1`, `VARF0`) |
| `--system-prompt` | `assistant` | Preset: `assistant`, `focused`, `customer-service`, `teacher` |
| `--max-steps` | `200` | Max generation steps at 12.5Hz (~16s) |
| `--stream` | | Emit audio chunks during generation |
| `--compile` | | Enable the compiled transformer (warmup + kernel fusion) |
| `--list-voices` | | List available voice presets |
| `--list-prompts` | | List available system prompt presets |
| `--transcript` | | Print the model's inner monologue text |
| `--json` | | Output as JSON (transcript, latency, audio path) |
| `--verbose` | | Show detailed timing info |

### Sampling Overrides

| Option | Default | Description |
| --- | --- | --- |
| `--audio-temp` | `0.8` | Audio sampling temperature |
| `--text-temp` | `0.7` | Text sampling temperature |
| `--audio-top-k` | `250` | Audio top-k candidates |
| `--repetition-penalty` | `1.2` | Audio repetition penalty (1.0 = disabled) |
| `--text-repetition-penalty` | `1.2` | Text repetition penalty (1.0 = disabled) |
| `--repetition-window` | `30` | Repetition penalty window in frames |
| `--silence-early-stop` | `15` | Silence frames before early stop (0 = disabled) |
| `--entropy-threshold` | `0` | Text entropy threshold for early stop (0 = disabled) |
| `--entropy-window` | `10` | Consecutive low-entropy steps before early stop |

Examples:

```shell
# Basic speech-to-speech
audio respond --input question.wav

# Use a female voice with compiled transformer
audio respond -i question.wav --voice NATF1 --compile

# Stream response and show transcript
audio respond -i question.wav --stream --transcript --verbose
```

## vad

Offline voice activity detection using Pyannote segmentation.

```shell
audio vad <file> [options]
```

| Option | Description |
| --- | --- |
| `<file>` | Audio file to analyze |
| `--model, -m` | HuggingFace model ID |
| `--onset` | Onset threshold (speech start) |
| `--offset` | Offset threshold (speech end) |
| `--min-speech` | Minimum speech duration in seconds |
| `--min-silence` | Minimum silence duration in seconds |
| `--json` | Output as JSON |
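These options combine freely. A couple of illustrative invocations (the threshold values here are examples for demonstration, not tuned recommendations):

```shell
# Print detected speech segments as JSON
audio vad meeting.wav --json

# Stricter segmentation: ignore very short speech bursts and brief pauses
audio vad meeting.wav --min-speech 0.5 --min-silence 0.3
```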

## vad-stream

Streaming voice activity detection using Silero VAD v5. Processes audio in 32ms chunks.

```shell
audio vad-stream <file> [options]
```

| Option | Description |
| --- | --- |
| `<file>` | Audio file to analyze |
| `--engine` | VAD engine: `mlx` (default) or `coreml` |
| `--model, -m` | HuggingFace model ID (auto-selected by engine) |
| `--onset` | Onset threshold |
| `--offset` | Offset threshold |
| `--min-speech` | Minimum speech duration in seconds |
| `--min-silence` | Minimum silence duration in seconds |
| `--json` | Output as JSON |
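Example invocations (file names are placeholders):

```shell
# Stream 32ms chunks through the default MLX engine
audio vad-stream recording.wav

# Use the CoreML engine and emit results as JSON
audio vad-stream recording.wav --engine coreml --json
```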

## diarize

Speaker diarization — identify who spoke when.

```shell
audio diarize <file> [options]
```

| Option | Default | Description |
| --- | --- | --- |
| `<file>` | | Audio file to analyze |
| `--engine` | `pyannote` | Diarization engine: `pyannote` (segmentation + speaker chaining) or `sortformer` (end-to-end CoreML) |
| `--target-speaker` | | Enrollment audio for target speaker extraction (pyannote only) |
| `--embedding-engine` | `mlx` | Speaker embedding engine: `mlx` or `coreml` (pyannote only) |
| `--vad-filter` | | Pre-filter with Silero VAD (pyannote only) |
| `--rttm` | | Output in RTTM format |
| `--json` | | Output as JSON |
| `--score-against` | | Reference RTTM file to compute DER |

Examples:

```shell
# Basic diarization (pyannote, default)
audio diarize meeting.wav

# End-to-end Sortformer (CoreML, Neural Engine)
audio diarize meeting.wav --engine sortformer

# RTTM output for evaluation
audio diarize meeting.wav --rttm

# Target speaker extraction (pyannote only)
audio diarize meeting.wav --target-speaker enrollment.wav

# Score against reference
audio diarize meeting.wav --score-against reference.rttm
```

## embed-speaker

Extract a 256-dimensional speaker embedding vector.

```shell
audio embed-speaker <file> [options]
```

| Option | Description |
| --- | --- |
| `<file>` | Audio file containing speaker voice |
| `--engine` | Inference engine: `mlx` (default) or `coreml` |
| `--json` | Output as JSON |
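Example (the file name is a placeholder; the JSON vector can be fed to downstream similarity checks):

```shell
# Extract a 256-dim speaker embedding as JSON
audio embed-speaker alice.wav --json

# Same, but run on the CoreML engine
audio embed-speaker alice.wav --engine coreml --json
```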

## denoise

Remove background noise using DeepFilterNet3 on Neural Engine.

```shell
audio denoise <file> [options]
```

| Option | Default | Description |
| --- | --- | --- |
| `<file>` | | Input audio file |
| `--output, -o` | `input_clean.wav` | Output file path |
| `--model, -m` | | HuggingFace model ID |

Example:

```shell
audio denoise noisy-recording.wav -o clean.wav
```