CLI Reference

The speech binary is the main entry point for all speech-processing tasks. Build it with make build, then run it from .build/release/speech.

transcribe

Transcribes an audio file to text.

speech transcribe <file> [options]

Option	Default	Description
`<file>`		Audio file to transcribe (WAV, M4A, MP3, CAF)
`--engine`	`qwen3`	ASR engine: `qwen3`, `qwen3-coreml`, `parakeet`, `nemotron`, `omnilingual`
`--model, -m`	`0.6B`	Model variant: `0.6B`, `1.7B`, or a full HuggingFace model ID (qwen3 only)
`--language`		Language hint (optional; ignored by omnilingual)
`--window`	`10`	`[omnilingual]` CoreML window size in seconds: `5` or `10`
`--backend`	`coreml`	`[omnilingual]` Backend: `coreml` (Neural Engine) or `mlx` (Metal GPU)
`--variant`	`300M`	`[omnilingual mlx]` Model size: `300M`, `1B`, `3B`, `7B`
`--bits`	`4`	`[omnilingual mlx]` Quantization bits: `4` or `8`
`--stream`		Enable streaming transcription with VAD
`--max-segment`	`10`	Maximum segment length in seconds (streaming)
`--partial`		Emit partial results while speech is in progress (streaming)

Examples:

# Basic transcription
speech transcribe recording.wav

# Use the larger model
speech transcribe recording.wav --model 1.7B

# CoreML encoder (Neural Engine + MLX decoder)
speech transcribe recording.wav --engine qwen3-coreml

# Use the Parakeet (CoreML) engine
speech transcribe recording.wav --engine parakeet

# Omnilingual (CoreML, 1,672 languages)
speech transcribe recording.wav --engine omnilingual                              # 10 s window
speech transcribe recording.wav --engine omnilingual --window 5                     # 5 s window

# Omnilingual (MLX, any length up to 40 s)
speech transcribe recording.wav --engine omnilingual --backend mlx                              # 300M @ 4-bit
speech transcribe recording.wav --engine omnilingual --backend mlx --variant 1B                  # 1B @ 4-bit
speech transcribe recording.wav --engine omnilingual --backend mlx --variant 3B --bits 8         # 3B @ 8-bit
speech transcribe recording.wav --engine omnilingual --backend mlx --variant 7B                  # 7B @ 4-bit

# VAD-based streaming
speech transcribe recording.wav --stream --partial
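
For a folder of recordings, the single-file invocations above can be wrapped in a loop. The following is a minimal sketch, not part of the tool: `transcribe_dir` and the `SPEECH` variable are hypothetical names, and the sketch assumes that `speech transcribe` prints the transcript to standard output.

```shell
# Path to the binary; override by setting SPEECH before calling the function.
SPEECH="${SPEECH:-.build/release/speech}"

# transcribe_dir DIR [ENGINE]
# Runs `speech transcribe` on every .wav file in DIR and saves each result
# next to the audio as <name>.txt.
# Assumption: the transcript is written to stdout.
transcribe_dir() {
    dir="$1"
    engine="${2:-qwen3}"
    for f in "$dir"/*.wav; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        "$SPEECH" transcribe "$f" --engine "$engine" > "${f%.wav}.txt"
    done
}
```

A call such as `transcribe_dir ./recordings parakeet` would then leave one `.txt` file beside each `.wav` file.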

align

Word-level forced alignment: get precise timestamps for every word.

speech align <file> [options]

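Option	Default	Description
`<file>`		Audio file
`--text, -t`		Text to align (if omitted, the audio is transcribed first)
`--model, -m`	`0.6B`	ASR model used for transcription: `0.6B`, `1.7B`, or a full model ID
`--aligner-model`		Forced-alignment model ID
`--language`		Language hint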

Examples:

# Transcribe automatically, then align
speech align recording.wav

# Align against known text
speech align recording.wav --text "Can you guarantee that the replacement part will be shipped tomorrow?"

speak

Text-to-speech synthesis.

speech speak "<text>" [options]

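Option	Default	Description
`<text>`		Text to synthesize (optional when `--batch-file` is used)
`--engine`	`qwen3`	TTS engine: `qwen3` or `cosyvoice`
`--output, -o`	`output.wav`	Output WAV file path
`--language`	`english`	Language. If omitted while `--speaker` is set, the speaker's native dialect is used.
`--stream`		Enable streaming synthesis
`--voice-sample`		Reference audio for voice cloning (supported by both the `qwen3` and `cosyvoice` engines)
`--verbose`		Show detailed timing information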

Qwen3-TTS options

Option	Default	Description
`--model`	`base`	Model variant: `base`, `customVoice`, or a full HF model ID
`--speaker`		Speaker voice (requires `--model customVoice`)
`--instruct`		Style instruction (CustomVoice model)
`--list-speakers`		List the available speakers and exit
`--temperature`	`0.3`	Sampling temperature
`--top-k`	`50`	Top-k sampling
`--max-tokens`	`500`	Maximum number of tokens (500 = about 40 s of audio)
`--batch-file`		File for batch synthesis (one text per line)
`--batch-size`	`4`	Maximum batch size for parallel generation
`--first-chunk-frames`	`3`	Number of codec frames in the first streaming chunk
`--chunk-frames`	`25`	Number of codec frames per streaming chunk
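
The token and frame counts in this table can be related to audio duration. The sketch below assumes a codec frame rate of 12.5 Hz, inferred from the table's own figure (500 tokens for about 40 s), and further assumes that one token corresponds to one codec frame. `frames_to_seconds` is a hypothetical helper, not part of the CLI.

```shell
# Assumed codec frame rate, inferred from "500 tokens = about 40 s" (500 / 40 = 12.5).
FRAME_RATE_HZ=12.5

# frames_to_seconds N: prints the audio duration covered by N codec frames.
frames_to_seconds() {
    LC_ALL=C awk -v n="$1" -v hz="$FRAME_RATE_HZ" 'BEGIN { printf "%.2f\n", n / hz }'
}

frames_to_seconds 500   # --max-tokens default
frames_to_seconds 3     # --first-chunk-frames default
frames_to_seconds 25    # --chunk-frames default
```

Under these assumptions, the default first streaming chunk carries roughly a quarter of a second of audio and each later chunk about two seconds, which suggests a smaller `--first-chunk-frames` trades chunk size for earlier first audio.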

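CosyVoice3 options

Option	Default	Description
`--speakers`		Speaker mapping for multi-speaker dialogue: `s1=alice.wav,s2=bob.wav`
`--cosy-instruct`		Style instruction (overrides the default). Controls the CosyVoice3 voice style.
`--turn-gap`	`0.2`	Silence gap between dialogue turns, in seconds
`--crossfade`	`0.0`	Crossfade overlap between turns, in seconds
`--model-id`		HuggingFace model ID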

Examples:

# Basic TTS
speech speak "Hello, world!" --output hello.wav

# Voice cloning (Qwen3-TTS)
speech speak "Hello in your voice" --voice-sample reference.wav -o cloned.wav

# Voice cloning (CosyVoice)
speech speak "Hello in your voice" --engine cosyvoice --voice-sample reference.wav -o cloned.wav

# CosyVoice multilingual
speech speak "Hallo Welt" --engine cosyvoice --language german -o hallo.wav

# Multi-speaker dialogue
speech speak "[S1] Hello there! [S2] Hey, how are you?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o dialogue.wav

# Inline emotion/style tags
speech speak "(excited) Wow, amazing! (sad) But I have to go..." \
    --engine cosyvoice -o emotion.wav

# Combined: dialogue + emotion + voice cloning
speech speak "[S1] (happy) Great news! [S2] (surprised) Really?" \
    --engine cosyvoice --speakers s1=alice.wav,s2=bob.wav -o combined.wav

# Custom style instruction
speech speak "Hello world" --engine cosyvoice --cosy-instruct "Speak cheerfully" -o cheerful.wav

# Streaming synthesis
speech speak "Long text here..." --stream

# Batch synthesis from a file
speech speak --batch-file texts.txt --batch-size 4

kokoro

Lightweight text-to-speech using Kokoro-82M on the Neural Engine (CoreML). Non-autoregressive: a single forward pass with about 45 ms of latency.

speech kokoro "<text>" [options]

Option	Default	Description
`<text>`		Text to synthesize
`--voice`	`af_heart`	Voice preset (50 available across 10 languages)
`--language`	`en`	Language code: en, es, fr, hi, it, ja, pt, zh, ko, de
`--output, -o`	`kokoro_output.wav`	Output WAV file path
`--list-voices`		List all available voices and exit
`--model, -m`		HuggingFace model ID

Examples:

# Basic Kokoro TTS
speech kokoro "Hello, world!" --voice af_heart -o hello.wav

# French voice
speech kokoro "Bonjour le monde" --voice ff_siwis --language fr -o bonjour.wav

# List all 50 voices
speech kokoro --list-voices

respond

Full-duplex speech-to-speech conversation using PersonaPlex 7B.

speech respond [options]

Option	Default	Description
`--input, -i`		Input audio WAV file, 24 kHz mono (required)
`--output, -o`	`response.wav`	Output response WAV file
`--voice`	`NATM0`	Voice preset (e.g. NATM0, NATF1, VARF0)
`--system-prompt`	`assistant`	Preset: `assistant`, `focused`, `customer-service`, `teacher`
`--system-prompt-text`		Custom system prompt text (overrides the preset)
`--max-steps`	`200`	Maximum number of generation steps at 12.5 Hz (200 = about 16 s)
`--stream`		Stream audio chunks as they are generated
`--compile`		Enable the compiled transformer (warmup + kernel fusion)
`--list-voices`		List the available voice presets
`--list-prompts`		List the available system prompt presets
`--transcript`		Print the model's inner-monologue text
`--json`		Output as JSON (transcript, latency, audio path)
`--verbose`		Show detailed timing information

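Sampling overrides

Option	Default	Description
`--audio-temp`	`0.8`	Audio sampling temperature
`--text-temp`	`0.7`	Text sampling temperature
`--audio-top-k`	`250`	Audio top-k candidates
`--repetition-penalty`	`1.2`	Audio repetition penalty (1.0 = disabled)
`--text-repetition-penalty`	`1.2`	Text repetition penalty (1.0 = disabled)
`--repetition-window`	`30`	Repetition penalty window, in frames
`--silence-early-stop`	`15`	Number of silent frames before early stop (0 = disabled)
`--entropy-threshold`	`0`	Text entropy threshold for early stop (0 = disabled)
`--entropy-window`	`10`	Number of consecutive low-entropy steps before early stop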

Examples:

# Basic speech-to-speech
speech respond --input question.wav

# Female voice with the compiled transformer
speech respond -i question.wav --voice NATF1 --compile

# Stream the response and show the transcript
speech respond -i question.wav --stream --transcript --verbose
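
Because `--max-steps` is counted at 12.5 Hz, a target response length translates into a step budget by simple arithmetic. The helper below is a hypothetical sketch (the name `steps_for_seconds` is not part of the CLI) that rounds up to the next whole step.

```shell
# steps_for_seconds SECONDS: smallest --max-steps value that covers SECONDS
# of audio, using the 12.5 Hz step rate stated for --max-steps.
steps_for_seconds() {
    LC_ALL=C awk -v s="$1" 'BEGIN {
        x = s * 12.5
        n = int(x)
        if (n < x) n++      # round up to a whole step
        print n
    }'
}
```

For example, `speech respond -i question.wav --max-steps "$(steps_for_seconds 30)"` would request a budget of up to 30 s. This is only an upper bound: the early-stop options in the sampling table may end generation sooner.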

vad

Offline voice activity detection using Pyannote segmentation.

speech vad <file> [options]

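Option	Description
`<file>`	Audio file to analyze
`--model, -m`	HuggingFace model ID
`--onset`	Onset threshold (speech start)
`--offset`	Offset threshold (speech end)
`--min-speech`	Minimum speech duration, in seconds
`--min-silence`	Minimum silence duration, in seconds
`--json`	Output as JSON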

vad-stream

Streaming voice activity detection using Silero VAD v5. Processes audio in 32 ms chunks.

speech vad-stream <file> [options]

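Option	Description
`<file>`	Audio file to analyze
`--engine`	VAD engine: `mlx` (default) or `coreml`
`--model, -m`	HuggingFace model ID (selected automatically based on the engine)
`--onset`	Onset threshold
`--offset`	Offset threshold
`--min-speech`	Minimum speech duration, in seconds
`--min-silence`	Minimum silence duration, in seconds
`--json`	Output as JSON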
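
The 32 ms chunk size fixes how many samples each VAD decision covers. The sketch below works this out under the assumption of 16 kHz input, the rate at which Silero VAD v5 uses 512-sample windows. How the CLI resamples other inputs or treats a trailing partial chunk is not documented here, so `full_chunks` counts complete chunks only. Both function names are hypothetical.

```shell
CHUNK_MS=32   # chunk length stated for vad-stream

# samples_per_chunk RATE_HZ: samples in one chunk at the given sample rate.
samples_per_chunk() {
    echo $(( $1 * CHUNK_MS / 1000 ))
}

# full_chunks NUM_SAMPLES RATE_HZ: number of complete chunks in a signal.
full_chunks() {
    echo $(( $1 / $(samples_per_chunk "$2") ))
}

samples_per_chunk 16000        # samples per 32 ms chunk at 16 kHz
full_chunks 160000 16000       # complete chunks in 10 s of 16 kHz audio
```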

wake

On-device keyword detection using KWS Zipformer (3.49M parameters, CoreML INT8, 26x real time, English only).

speech wake <file> [options]

Option	Description
`<file>`	Audio file to analyze
`--keywords`	One or more keywords. Format: `"hey soniqo"`, `"hey soniqo:0.15:0.5"`, or `"LIGHT UP\|▁ L IGHT ▁UP:0.25:2.0"` (sherpa-onnx style with explicit BPE pieces)
`--keywords-file`	Keywords file, one entry per line
`--model, -m`	HuggingFace model ID. Default: `aufklarer/KWS-Zipformer-3M-CoreML-INT8`
`--json`	Output as JSON

diarize

Speaker diarization: identifies who spoke when.

speech diarize <file> [options]

Option	Default	Description
`<file>`		Audio file to analyze
`--engine`	`pyannote`	Diarization engine: `pyannote` (segmentation + speaker chaining) or `sortformer` (end-to-end CoreML)
`--target-speaker`		Enrollment audio for target-speaker extraction (pyannote only)
`--embedding-engine`	`mlx`	Speaker embedding engine: `mlx` or `coreml` (pyannote only)
`--vad-filter`		Pre-filter with Silero VAD (pyannote only)
`--rttm`		Output in RTTM format
`--json`		Output as JSON
`--score-against`		Reference RTTM file for computing DER

Examples:

# Basic diarization (pyannote, the default)
speech diarize meeting.wav

# End-to-end Sortformer (CoreML, Neural Engine)
speech diarize meeting.wav --engine sortformer

# RTTM output for evaluation
speech diarize meeting.wav --rttm

# Target-speaker extraction (pyannote only)
speech diarize meeting.wav --target-speaker enrollment.wav

# Score against a reference
speech diarize meeting.wav --score-against reference.rttm
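
RTTM output is easy to post-process with standard tools. The sketch below totals speaking time per speaker. It assumes the `--rttm` output follows the conventional RTTM layout, in which `SPEAKER` lines carry the segment duration in field 5 and the speaker label in field 8; `rttm_talk_time` is a hypothetical helper, not part of the CLI.

```shell
# rttm_talk_time FILE: prints "<speaker> <total seconds>" for each speaker,
# sorted by speaker label.
# Assumption: standard RTTM fields (5 = duration in seconds, 8 = speaker name).
rttm_talk_time() {
    LC_ALL=C awk '$1 == "SPEAKER" { total[$8] += $5 }
        END { for (s in total) printf "%s %.2f\n", s, total[s] }' "$1" | LC_ALL=C sort
}
```

Assuming the RTTM lines go to standard output, one way to use it is `speech diarize meeting.wav --rttm > meeting.rttm` followed by `rttm_talk_time meeting.rttm`.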

embed-speaker

Extracts a speaker embedding vector from audio.

speech embed-speaker <file> [options]

Option	Description
`<file>`	Audio file containing the speaker's voice
`--engine`	Inference engine: `mlx` (default), `coreml` (WeSpeaker, 256-dim), `camplusplus` (CAM++ CoreML, 192-dim)
`--json`	Output as JSON

denoise

Removes background noise using DeepFilterNet3 on the Neural Engine.

speech denoise <file> [options]

Option	Default	Description
`<file>`		Input audio file
`--output, -o`	`input_clean.wav`	Output file path
`--model, -m`		HuggingFace model ID

Example:

speech denoise noisy-recording.wav -o clean.wav

compose

Generate 30 s of music from a text prompt using MAGNeT on MLX.

speech compose <prompt> [options]

Option	Default	Description
`<prompt>`		Text prompt describing the music to generate (e.g. "happy rock")
`--output, -o`	`magnet.wav`	Output WAV path (32 kHz mono)
`--variant`	`small-int4`	Model variant: `small-int4`, `small-int8`, `medium-int4`, or `medium-int8`. Resolves to `aufklarer/MAGNeT-{Small,Medium}-30secs-MLX-{4,8}bit`.
`--temperature`	`3.0`	Sampling temperature, annealed linearly per stage.
`--top-p`	`0.9`	Nucleus sampling threshold.
`--cfg-max`	`10.0`	Max classifier-free guidance coefficient.
`--cfg-min`	`1.0`	Min CFG coefficient (annealed alongside the mask schedule).
`--steps`	`20,10,10,10`	Comma-separated decoding iterations per codebook (4 values).
`--seed`		Random seed for reproducible output.

Examples:

# Default: small-int4, ~10 s wall on M-series for a 30 s clip
speech compose "happy rock" -o happy_rock.wav

# Larger model — better prompt following, slower
speech compose "lo-fi hip hop with mellow piano" --variant medium-int4 -o lofi.wav

# Reproducible
speech compose "energetic EDM with synth lead" --seed 42 -o edm.wav
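
The `--variant` description gives the pattern each variant resolves to. Spelled out as a lookup, that pattern might look like the sketch below. `magnet_model_id` is a hypothetical helper for illustration; the CLI performs this resolution internally and its actual implementation is not shown here.

```shell
# magnet_model_id VARIANT: prints the model ID that a --variant value resolves
# to, following the pattern aufklarer/MAGNeT-{Small,Medium}-30secs-MLX-{4,8}bit.
magnet_model_id() {
    case "$1" in
        small-int4)  echo "aufklarer/MAGNeT-Small-30secs-MLX-4bit" ;;
        small-int8)  echo "aufklarer/MAGNeT-Small-30secs-MLX-8bit" ;;
        medium-int4) echo "aufklarer/MAGNeT-Medium-30secs-MLX-4bit" ;;
        medium-int8) echo "aufklarer/MAGNeT-Medium-30secs-MLX-8bit" ;;
        *) echo "unknown variant: $1" >&2; return 1 ;;
    esac
}
```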