CLI Reference

The speech binary is the main entry point for all speech-processing tasks. Build it with make build, then run it from .build/release/speech.

transcribe

Transcribe an audio file to text.

speech transcribe <file> [options]

Option	Default	Description
`<file>`		Audio file to transcribe (WAV, M4A, MP3, CAF)
`--engine`	`qwen3`	ASR engine: `qwen3`, `qwen3-coreml`, `parakeet`, `nemotron`, or `omnilingual`
`--model, -m`	`0.6B`	Model variant: `0.6B`, `1.7B`, or a full HuggingFace model ID (qwen3 only)
`--language`		Language hint (optional; ignored by omnilingual)
`--window`	`10`	`[omnilingual]` CoreML window size in seconds: `5` or `10`
`--backend`	`coreml`	`[omnilingual]` Backend: `coreml` (Neural Engine) or `mlx` (Metal GPU)
`--variant`	`300M`	`[omnilingual mlx]` Model size: `300M`, `1B`, `3B`, or `7B`
`--bits`	`4`	`[omnilingual mlx]` Quantization bit width: `4` or `8`
`--stream`		Enable VAD-based streaming transcription
`--max-segment`	`10`	Maximum segment duration in seconds (streaming)
`--partial`		Emit partial results while speech is in progress (streaming)

Examples:

# Basic transcription
speech transcribe recording.wav

# Use larger model
speech transcribe recording.wav --model 1.7B

# CoreML encoder (Neural Engine + MLX decoder)
speech transcribe recording.wav --engine qwen3-coreml

# Use Parakeet (CoreML) engine
speech transcribe recording.wav --engine parakeet

# Use Nemotron Streaming (CoreML, English with native punctuation)
speech transcribe recording.wav --engine nemotron                                 # batch
speech transcribe recording.wav --engine nemotron --stream --partial              # streaming

# Omnilingual (CoreML, 1,672 languages)
speech transcribe recording.wav --engine omnilingual                              # 10 s window
speech transcribe recording.wav --engine omnilingual --window 5                     # 5 s window

# Omnilingual (MLX, any length up to 40 s)
speech transcribe recording.wav --engine omnilingual --backend mlx                              # 300M @ 4-bit
speech transcribe recording.wav --engine omnilingual --backend mlx --variant 1B                  # 1B @ 4-bit
speech transcribe recording.wav --engine omnilingual --backend mlx --variant 3B --bits 8         # 3B @ 8-bit
speech transcribe recording.wav --engine omnilingual --backend mlx --variant 7B                  # 7B @ 4-bit

# Streaming with VAD
speech transcribe recording.wav --stream --partial
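
To transcribe a whole folder of recordings, the basic command can be wrapped in a loop. The sketch below is illustrative only: the function name `transcribe_dir` and the `SPEECH_BIN` and `DRY_RUN` variables are assumptions made for this example, not features of the CLI. It matches the four container formats listed in the options table by lowercase extension, and with `DRY_RUN=1` it prints each command instead of running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run `speech transcribe` on every supported file in a directory.
# SPEECH_BIN and DRY_RUN are illustrative variables, not CLI features.
SPEECH_BIN="${SPEECH_BIN:-.build/release/speech}"

transcribe_dir() {
    local dir="$1" engine="${2:-qwen3}" f
    for f in "$dir"/*.wav "$dir"/*.m4a "$dir"/*.mp3 "$dir"/*.caf; do
        [ -e "$f" ] || continue          # skip patterns that matched nothing
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Print the command that would be executed
            printf '%s transcribe %s --engine %s\n' "$SPEECH_BIN" "$f" "$engine"
        else
            "$SPEECH_BIN" transcribe "$f" --engine "$engine"
        fi
    done
}
```

For example, `DRY_RUN=1 transcribe_dir ./recordings parakeet` lists the commands for review before anything is run.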

align

Word-level forced alignment: get precise timestamps for each word.

speech align <file> [options]

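Option	Default	Description
`<file>`		Audio file
`--text, -t`		Text to align (if omitted, the audio is transcribed first)
`--model, -m`	`0.6B`	ASR model used for transcription: `0.6B`, `1.7B`, or a full model ID
`--aligner-model`		Forced-aligner model ID
`--language`		Language hint

Examples: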

# Auto-transcribe then align
speech align recording.wav

# Align with known text
speech align recording.wav --text "Can you guarantee that the replacement part will be shipped tomorrow?"

speak

Text-to-speech synthesis.

speech speak "<text>" [options]

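Option	Default	Description
`<text>`		Text to synthesize (optional when using `--batch-file`)
`--engine`	`qwen3`	TTS engine: `qwen3` or `cosyvoice`
`--output, -o`	`output.wav`	Output WAV file path
`--language`	`english`	Language. If omitted while `--speaker` is set, the speaker's native dialect is used.
`--stream`		Enable streaming synthesis
`--voice-sample`		Reference audio for voice cloning (supported by both the `qwen3` and `cosyvoice` engines)
`--verbose`		Show detailed timing information

Qwen3-TTS options

Option	Default	Description
`--model`	`base`	Model variant: `base`, `customVoice`, or a full HF model ID
`--speaker`		Speaker voice (requires `--model customVoice`)
`--instruct`		Style instruction (CustomVoice model)
`--list-speakers`		List available speakers and exit
`--temperature`	`0.3`	Sampling temperature
`--top-k`	`50`	Top-k sampling
`--max-tokens`	`500`	Maximum number of tokens (500 ≈ 40 s of audio)
`--batch-file`		Batch input file with one text per line
`--batch-size`	`4`	Maximum batch size for parallel generation
`--first-chunk-frames`	`3`	Codec frames in the first streaming chunk
`--chunk-frames`	`25`	Codec frames per streaming chunk

CosyVoice3 options

Option	Default	Description
`--speakers`		Speaker mapping for multi-speaker dialogue: `s1=alice.wav,s2=bob.wav`
`--cosy-instruct`		Style instruction (overrides the default). Controls the CosyVoice3 speaking style.
`--turn-gap`	`0.2`	Silence between dialogue turns, in seconds
`--crossfade`	`0.0`	Crossfade overlap between turns, in seconds
`--model-id`		HuggingFace model ID

Examples: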

# Basic TTS
speech speak "Hello, world!" --output hello.wav

# Voice cloning (Qwen3-TTS)
speech speak "Hello in your voice" --voice-sample reference.wav -o cloned.wav

# Voice cloning (CosyVoice)
speech speak "Hello in your voice" --engine cosyvoice --voice-sample reference.wav -o cloned.wav

# CosyVoice multilingual
speech speak "Hallo Welt" --engine cosyvoice --language german -o hallo.wav

# Multi-speaker dialogue
speech speak "[S1] Hello there! [S2] Hey, how are you?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o dialogue.wav

# Inline emotion/style tags
speech speak "(excited) Wow, amazing! (sad) But I have to go..." \
    --engine cosyvoice -o emotion.wav

# Combined: dialogue + emotions + voice cloning
speech speak "[S1] (happy) Great news! [S2] (surprised) Really?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o combined.wav

# Custom style instruction
speech speak "Hello world" --engine cosyvoice --cosy-instruct "Speak cheerfully" -o cheerful.wav

# Streaming synthesis
speech speak "Long text here..." --stream

# Batch synthesis from file
speech speak --batch-file texts.txt --batch-size 4

kokoro

Lightweight text-to-speech using Kokoro-82M on the Neural Engine (CoreML). Non-autoregressive: a single forward pass, with about 45 ms latency.

speech kokoro "<text>" [options]

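Option	Default	Description
`<text>`		Text to synthesize
`--voice`	`af_heart`	Voice preset (50 voices across 10 languages)
`--language`	`en`	Language code: en, es, fr, hi, it, ja, pt, zh, ko, de
`--output, -o`	`kokoro_output.wav`	Output WAV file path
`--list-voices`		List all available voices and exit
`--model, -m`		HuggingFace model ID

Examples: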

# Basic Kokoro TTS
speech kokoro "Hello, world!" --voice af_heart -o hello.wav

# French voice
speech kokoro "Bonjour le monde" --voice ff_siwis --language fr -o bonjour.wav

# List all 50 voices
speech kokoro --list-voices

respond

Full-duplex speech-to-speech conversation based on PersonaPlex 7B.

speech respond [options]

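Option	Default	Description
`--input, -i`		Input WAV audio file (24 kHz mono) (required)
`--output, -o`	`response.wav`	Output response WAV file
`--voice`	`NATM0`	Voice preset (e.g. NATM0, NATF1, VARF0)
`--system-prompt`	`assistant`	Preset: `assistant`, `focused`, `customer-service`, `teacher`
`--system-prompt-text`		Custom system prompt (overrides the preset)
`--max-steps`	`200`	Maximum generation steps at 12.5 Hz (about 16 s)
`--stream`		Emit audio chunks during generation
`--compile`		Enable the compiled transformer (warmup + kernel fusion)
`--list-voices`		List available voice presets
`--list-prompts`		List available system prompt presets
`--transcript`		Print the model's inner-monologue text
`--json`		Output as JSON (transcript, latency, audio path)
`--verbose`		Show detailed timing information

Sampling overrides

Option	Default	Description
`--audio-temp`	`0.8`	Audio sampling temperature
`--text-temp`	`0.7`	Text sampling temperature
`--audio-top-k`	`250`	Audio top-k candidates
`--repetition-penalty`	`1.2`	Audio repetition penalty (1.0 = disabled)
`--text-repetition-penalty`	`1.2`	Text repetition penalty (1.0 = disabled)
`--repetition-window`	`30`	Repetition penalty window (frames)
`--silence-early-stop`	`15`	Silent frames before early stopping (0 = disabled)
`--entropy-threshold`	`0`	Text entropy threshold for early stopping (0 = disabled)
`--entropy-window`	`10`	Consecutive low-entropy steps before early stopping

Examples: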

# Basic speech-to-speech
speech respond --input question.wav

# Use a female voice with compiled transformer
speech respond -i question.wav --voice NATF1 --compile

# Stream response and show transcript
speech respond -i question.wav --stream --transcript --verbose
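
The `--max-steps` default of 200 corresponds to about 16 s because generation runs at 12.5 steps per second. A small helper can turn a target length in seconds into a step count. The function name `steps_for_seconds` is hypothetical and chosen for this sketch; the result is only an upper bound, since early stopping on silence can end generation sooner.

```shell
# Convert a target response length in whole seconds to a --max-steps value,
# using the 12.5 Hz step rate from the options table (12.5 = 25/2).
steps_for_seconds() {
    local seconds="$1"
    echo $(( (seconds * 25 + 1) / 2 ))   # integer ceiling of seconds * 12.5
}
```

For a response of up to 30 s, this could be used as `speech respond -i question.wav --max-steps "$(steps_for_seconds 30)"`.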

vad

Offline voice activity detection using Pyannote segmentation.

speech vad <file> [options]

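Option	Description
`<file>`	Audio file to analyze
`--model, -m`	HuggingFace model ID
`--onset`	Onset threshold (speech start)
`--offset`	Offset threshold (speech end)
`--min-speech`	Minimum speech duration in seconds
`--min-silence`	Minimum silence duration in seconds
`--json`	Output as JSON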

vad-stream

Streaming voice activity detection using Silero VAD v5. Processes audio in 32 ms chunks.

speech vad-stream <file> [options]

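Option	Description
`<file>`	Audio file to analyze
`--engine`	VAD engine: `mlx` (default) or `coreml`
`--model, -m`	HuggingFace model ID (selected automatically by engine)
`--onset`	Onset threshold
`--offset`	Offset threshold
`--min-speech`	Minimum speech duration in seconds
`--min-silence`	Minimum silence duration in seconds
`--json`	Output as JSON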
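
To get a feel for the 32 ms chunk size, the arithmetic below shows how many chunks cover a clip and how many samples one chunk holds. The function names are hypothetical. The 16 kHz rate in the usage note is an assumption for illustration, since this reference does not state the sample rate that `vad-stream` operates at.

```shell
CHUNK_MS=32   # chunk length stated in this section

# Number of chunks needed to cover an audio duration given in milliseconds.
chunks_for_ms() {
    echo $(( ($1 + CHUNK_MS - 1) / CHUNK_MS ))   # integer ceiling
}

# Samples in one chunk for a given sample rate in Hz.
samples_per_chunk() {
    echo $(( $1 * CHUNK_MS / 1000 ))
}
```

Under the 16 kHz assumption, `samples_per_chunk 16000` gives the per-chunk sample count, and `chunks_for_ms 1000` gives the number of chunks in one second of audio.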

wake

On-device keyword spotting with KWS Zipformer (3.49M parameters, CoreML INT8, 26× real time, English only).

speech wake <file> [options]

Option	Description
`<file>`	Audio file to analyze
`--keywords`	One or more keywords. Formats: `"hey soniqo"`, `"hey soniqo:0.15:0.5"`, or `"LIGHT UP|▁ L IGHT ▁UP:0.25:2.0"` (sherpa-onnx-style explicit BPE pieces)
`--keywords-file`	Keywords file, one entry per line
`--model, -m`	HuggingFace model ID. Default: `aufklarer/KWS-Zipformer-3M-CoreML-INT8`
`--json`	Output as JSON

diarize

Speaker diarization: identify who spoke when.

speech diarize <file> [options]

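Option	Default	Description
`<file>`		Audio file to analyze
`--engine`	`pyannote`	Diarization engine: `pyannote` (segmentation + speaker chaining) or `sortformer` (end-to-end CoreML)
`--target-speaker`		Enrollment audio for target-speaker extraction (pyannote only)
`--embedding-engine`	`mlx`	Speaker embedding engine: `mlx` or `coreml` (pyannote only)
`--vad-filter`		Pre-filter with Silero VAD (pyannote only)
`--rttm`		Output in RTTM format
`--json`		Output as JSON
`--score-against`		Reference RTTM file for computing DER

Examples: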

# Basic diarization (pyannote, default)
speech diarize meeting.wav

# End-to-end Sortformer (CoreML, Neural Engine)
speech diarize meeting.wav --engine sortformer

# RTTM output for evaluation
speech diarize meeting.wav --rttm

# Target speaker extraction (pyannote only)
speech diarize meeting.wav --target-speaker enrollment.wav

# Score against reference
speech diarize meeting.wav --score-against reference.rttm
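
RTTM output is easy to post-process with standard tools. The sketch below sums speaking time per speaker. It assumes that `--rttm` emits conventional RTTM `SPEAKER` lines (field 5 is the segment duration in seconds, field 8 is the speaker label) and that they are written to standard output; neither detail is confirmed by this reference. The function name `rttm_talk_time` is hypothetical.

```shell
# Sum speaking time per speaker from RTTM lines on stdin.
# Assumes standard RTTM: field 1 = "SPEAKER", field 5 = duration (s), field 8 = speaker label.
rttm_talk_time() {
    awk '$1 == "SPEAKER" { total[$8] += $5 }
         END { for (s in total) printf "%s %.2f\n", s, total[s] }' | sort
}
```

Under those assumptions it could be used as `speech diarize meeting.wav --rttm | rttm_talk_time`, printing one line per speaker with the total seconds spoken.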

embed-speaker

Extract a speaker embedding vector from audio.

speech embed-speaker <file> [options]

Option	Description
`<file>`	Audio file containing the speaker's voice
`--engine`	Inference engine: `mlx` (default), `coreml` (WeSpeaker, 256-dim), or `camplusplus` (CAM++ CoreML, 192-dim)
`--json`	Output as JSON

denoise

Remove background noise using DeepFilterNet3 on the Neural Engine.

speech denoise <file> [options]

Option	Default	Description
`<file>`		Input audio file
`--output, -o`	`input_clean.wav`	Output file path
`--model, -m`		HuggingFace model ID

Example:

speech denoise noisy-recording.wav -o clean.wav

compose

Generate 30 s of music from a text prompt using MAGNeT on MLX.

speech compose <prompt> [options]

Option	Default	Description
`<prompt>`		Text prompt describing the music to generate (e.g. "happy rock")
`--output, -o`	`magnet.wav`	Output WAV path (32 kHz mono)
`--variant`	`small-int4`	Model variant: `small-int4`, `small-int8`, `medium-int4`, or `medium-int8`. Resolves to `aufklarer/MAGNeT-{Small,Medium}-30secs-MLX-{4,8}bit`.
`--temperature`	`3.0`	Sampling temperature, annealed linearly per stage.
`--top-p`	`0.9`	Nucleus sampling threshold.
`--cfg-max`	`10.0`	Max classifier-free guidance coefficient.
`--cfg-min`	`1.0`	Min CFG coefficient (annealed alongside the mask schedule).
`--steps`	`20,10,10,10`	Comma-separated decoding iterations per codebook (4 values).
`--seed`		Random seed for reproducible output.

Examples:

# Default: small-int4, ~10 s wall on M-series for a 30 s clip
speech compose "happy rock" -o happy_rock.wav

# Larger model — better prompt following, slower
speech compose "lo-fi hip hop with mellow piano" --variant medium-int4 -o lofi.wav

# Reproducible
speech compose "energetic EDM with synth lead" --seed 42 -o edm.wav
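
The `--variant` values map onto model IDs through the pattern given in the options table. The sketch below spells that mapping out, which can help when pre-downloading or caching models. The function name `magnet_model_id` is hypothetical, and the mapping is inferred from the documented pattern rather than taken from the CLI source.

```shell
# Map a --variant value to the model ID pattern given in the options table.
magnet_model_id() {
    local size bits
    case "$1" in
        small-int4)  size=Small;  bits=4 ;;
        small-int8)  size=Small;  bits=8 ;;
        medium-int4) size=Medium; bits=4 ;;
        medium-int8) size=Medium; bits=8 ;;
        *) echo "unknown variant: $1" >&2; return 1 ;;
    esac
    echo "aufklarer/MAGNeT-${size}-30secs-MLX-${bits}bit"
}
```

Calling `magnet_model_id medium-int8` prints the ID that the pattern yields for that variant; an unrecognized variant returns a non-zero status.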