Hibiki Zero-3B Speech Translation (FR / ES / PT / DE → EN)
Hibiki Zero-3B is Kyutai's streaming speech-to-speech translation model. Input is a 24 kHz audio stream in French, Spanish, Portuguese, or German; output is a 24 kHz English audio stream plus a parallel English text transcript, both produced at the 12.5 Hz codec frame rate. It is built on the Moshi/Mimi multistream architecture: a single decoder-only transformer jointly models the source-audio codec stream and the target text and audio streams, so there is no separate ASR + MT + TTS pipeline. The Soniqo build runs as quantized MLX safetensors (INT4 default, INT8 available) entirely on Apple Silicon. License: CC-BY-4.0.
Pipe-style ASR + MADLAD (speech transcribe | speech translate) gets you 400+ languages but adds the round-trip latency of three models. Hibiki is one model end-to-end and preserves prosody — pick it when you need live speech in the target language rather than just text.
Quick Start
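import HibikiTranslate
import AudioCommon
// Load the default model.
let model = try await HibikiTranslateModel.fromPretrained()
// Read the source clip as 24 kHz PCM; `input` is the caller's source file URL.
let pcm = try AudioFileLoader.load(url: input, targetSampleRate: 24000)
// Translate to English audio plus the parallel text token IDs.
let (englishAudio, textTokens) = model.translate(
    sourceAudio: pcm,
    sourceLanguage: .fr  // one of .fr / .es / .pt / .de; auto-detected, passed as metadata only
)
// Write the English audio; `output` is the caller's destination file URL.
try WAVWriter.write(samples: englishAudio, sampleRate: 24000, to: output)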
CLI
speech audio-translate input_fr.wav -o out_en.wav --source-lang fr
speech audio-translate input_es.wav -o out_en.wav --source-lang es --quantization 8bit
speech audio-translate input_pt.wav -o out_en.wav --source-lang pt --verbose
# Deterministic mode (used by the CI regression canaries)
HIBIKI_GREEDY=1 speech audio-translate input_fr.wav -o out_en.wav --source-lang fr
# Inner-monologue text token IDs (raw — SPM decode is a follow-up)
speech audio-translate input.wav -o out.wav --transcript
Architecture
Hibiki Zero-3B is a 3.1B-parameter decoder-only multistream transformer. The model jointly attends over 33 streams per frame: one text stream, 16 target-audio codebooks (the agent's output), and 16 source-audio codebooks (the user's input). At each 80 ms frame the model samples one text token plus 16 audio codes via a small 6-layer depformer that runs 16 sub-steps per frame, one per target codebook, with a 9-slice scheduled MultiLinear projection.
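The stream count and frame timing above follow from simple arithmetic. The sketch below works through it; the flat stream ordering (text first, then target, then source) and all names are illustrative assumptions, not taken from the Swift driver.

```swift
// Sketch only: the flat stream ordering used here is an assumption.
let sampleRate = 24_000      // Hz, input and output PCM
let frameRate = 12.5         // Hz, Mimi codec frame rate
let codebooksPerSide = 16    // target and source each use 16 codebooks

// 24 000 / 12.5 = 1920 PCM samples per 80 ms frame.
let samplesPerFrame = Int(Double(sampleRate) / frameRate)

// 1 text stream + 16 target codebooks + 16 source codebooks = 33.
let streamCount = 1 + 2 * codebooksPerSide

enum StreamKind: Equatable {
    case text
    case targetAudio(codebook: Int)
    case sourceAudio(codebook: Int)
}

/// Maps a flat stream index (0..<33) to its role under the assumed ordering.
func streamKind(at index: Int) -> StreamKind? {
    switch index {
    case 0:
        return .text
    case 1...codebooksPerSide:
        return .targetAudio(codebook: index - 1)
    case (codebooksPerSide + 1)..<streamCount:
        return .sourceAudio(codebook: index - codebooksPerSide - 1)
    default:
        return nil
    }
}

/// Number of codec frames needed to cover `sampleCount` PCM samples (rounded up).
func frameCount(forSamples sampleCount: Int) -> Int {
    (sampleCount + samplesPerFrame - 1) / samplesPerFrame
}
```

For the 3.54 s clip in the performance table, 84,960 samples at 24 kHz come to 45 frames once the last partial frame is rounded up.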
The audio codec is Mimi at 12.5 Hz / 16 codebooks. Source audio is encoded into the 16 source-stream codebooks (delay [0, 2, 2, …, 2]); generated target audio fills the 16 target-stream codebooks (same delay pattern); per-codebook un-shift is applied before Mimi decodes the target back to 24 kHz English PCM. The temporal backbone is 28 GQA layers (dim = 2048, 16 query heads, 8 KV heads, kv_repeat = 2, split-half RoPE rope_concat, no conditioner — Zero is the unconditional variant).
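To make the delay pattern and the per-codebook un-shift concrete, here is a minimal sketch on plain integer arrays. The function names, the `padToken` filler, and the array-of-arrays layout are assumptions for illustration; the real driver operates on MLX tensors.

```swift
// Sketch only: names and data layout are illustrative assumptions.
// Delay pattern from the text: codebook 0 has delay 0, the other 15 have delay 2.
let delays = [0] + Array(repeating: 2, count: 15)

/// Applies the per-codebook delay: codebook k's code for frame t moves to frame t + delays[k].
/// Every row is padded to the same length (original length + largest delay).
func shift(_ codes: [[Int]], delays: [Int], padToken: Int) -> [[Int]] {
    let maxDelay = delays.max() ?? 0
    var shifted: [[Int]] = []
    for (row, delay) in zip(codes, delays) {
        let lead = [Int](repeating: padToken, count: delay)
        let tail = [Int](repeating: padToken, count: maxDelay - delay)
        shifted.append(lead + row + tail)
    }
    return shifted
}

/// Inverts `shift`: strips each codebook's leading delay (and trailing padding)
/// so that all rows are frame-aligned again before decoding.
/// Assumes every row holds at least `maxDelay` entries.
func unshift(_ delayed: [[Int]], delays: [Int]) -> [[Int]] {
    let maxDelay = delays.max() ?? 0
    var aligned: [[Int]] = []
    for (row, delay) in zip(delayed, delays) {
        let end = row.count - (maxDelay - delay)
        aligned.append(Array(row[delay..<end]))
    }
    return aligned
}
```

With two codebooks and delays `[0, 2]`, `shift([[1, 2, 3], [4, 5, 6]], delays: [0, 2], padToken: -1)` gives `[[1, 2, 3, -1, -1], [-1, -1, 4, 5, 6]]`, and `unshift` restores the original rows.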
Decode Loop
Hibiki emits SPM padding tokens (id 3) while it accumulates enough source context to translate, then content text tokens with matching target audio, and finally text-EOS (id 2). The Swift driver runs until EOS is sampled past the source window, capped at max(tSrc × 5/2, tSrc + 20) steps as a safety bound. Output runs roughly 1.0–1.6× the input duration on FLEURS-style clips; callers should not assume output_duration == input_duration.
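The safety bound is easy to evaluate by hand. The sketch below does it in code; the function name is hypothetical, and integer division (rounding down) is an assumption about how the driver evaluates 5/2.

```swift
/// Safety cap on decode steps for a source of `tSrc` codec frames:
/// max(tSrc × 5/2, tSrc + 20).
/// Assumption: integer arithmetic that rounds down; the driver may round differently.
func maxDecodeSteps(sourceFrames tSrc: Int) -> Int {
    max(tSrc * 5 / 2, tSrc + 20)
}

/// Converts a step count to seconds of audio at the 12.5 Hz frame rate (80 ms per step).
func seconds(forSteps steps: Int) -> Double {
    Double(steps) / 12.5
}
```

Under these assumptions the `tSrc + 20` term only matters for very short inputs: the two terms cross between 13 and 14 frames (about 1.1 s). A 45-frame source is capped at 112 steps, roughly 9 s of output, which leaves headroom above the 1.6× upper end of the observed output range.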
The autoregressive feedback path is non-obvious: at each decode step the transformer reads tokens at cache index step (uniform across all 33 streams, with init-token substitution when step ≤ delays[k]); the sampled text token and 16 target codes are then written at index step + 1. This mirrors upstream Moshi lm.py, where state.offsets += 1 happens before the cache scatter. The text_emb row for EOS (id 2) is aliased to row 3 (PAD) at weight-load time, mirroring Kyutai's loaders.py:312 "implicitly replace early EOS with PAD" patch. As a result, any EOS sampled during the audio-streaming window is harmless; only post-source EOS terminates the loop.
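The EOS handling described above can be sketched as a small termination rule. Everything here is illustrative: the function names are hypothetical, the exact boundary (`step >= sourceFrames`) is an assumption, and the embedding alias is shown as a lookup remap even though the port applies it once to the weights at load time.

```swift
// Sketch only: names and the exact source-window boundary are assumptions.
let padTokenID = 3   // SPM padding token
let eosTokenID = 2   // text-EOS token

/// Embedding row used for a text token. Equivalent in effect to aliasing the
/// EOS row to the PAD row at weight-load time: an early EOS is embedded as PAD.
func embeddingRow(forTextToken token: Int) -> Int {
    token == eosTokenID ? padTokenID : token
}

/// EOS only terminates once the source window has been fully consumed.
func shouldStop(sampled token: Int, step: Int, sourceFrames: Int) -> Bool {
    token == eosTokenID && step >= sourceFrames
}

/// Minimal text-only decode loop: `sample` stands in for the model's sampler,
/// and `maxSteps` is the safety cap on the number of steps.
func decodeText(sourceFrames: Int, maxSteps: Int, sample: (Int) -> Int) -> [Int] {
    var emitted: [Int] = []
    for step in 0..<maxSteps {
        let token = sample(step)
        emitted.append(token)
        if shouldStop(sampled: token, step: step, sourceFrames: sourceFrames) {
            break
        }
    }
    return emitted
}
```

With a scripted sampler that yields EOS at step 1 and again at step 5, and a 3-frame source window, the loop ignores the first EOS and stops at the second.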
Model Variants
| Variant | Quantization | Size | Compute | HuggingFace |
|---|---|---|---|---|
| Hibiki Zero-3B | INT4 | ~2.7 GB | Metal GPU (MLX) | aufklarer/Hibiki-Zero-3B-MLX-4bit |
| Hibiki Zero-3B | INT8 | ~3.9 GB | Metal GPU (MLX) | aufklarer/Hibiki-Zero-3B-MLX-8bit |
Language Coverage
Hibiki Zero-3B is trained on French, Spanish, Portuguese, and German → English. The Swift driver auto-detects the source language; the --source-lang flag is metadata only.
| Source | Status | Sample greedy output |
|---|---|---|
| FR | Strict E2E canary | "so it's a ski route." (from "Pensez à l'itinéraire de ski…") |
| ES | Strict E2E canary | "gentlemen, the data is worrying." (Hibiki europarl sample) |
| PT | Warn-only (content-faithful, lower keyword recall) | "the fifth c is p of the martyr." (FLEURS PT) |
| DE | Warn-only (content-faithful, lower keyword recall) | "that didn't seem to me to be useful." (FLEURS DE) |
16 kHz human-recorded FLEURS Spanish clips trigger degenerate generation in both the Python upstream and the Swift port (Python emits 1643 steps / ~131 s of broken audio without sampling EOS). The Swift ES regression canary instead uses a 5 s trimmed excerpt of 24 kHz TTS-generated audio from Kyutai's own samples space (kyutai/hibiki-zero-samples), which matches the training distribution and produces clean English. If you're feeding Hibiki Spanish in production, pre-resample to 24 kHz and stick to longer clips (5 s+).
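A caller could turn the Spanish guidance above into a pre-flight check before invoking the model. This is a hypothetical helper, not part of the package; the 24 kHz and 5 s thresholds come from the text, and everything else is an assumption.

```swift
/// Hypothetical pre-flight warnings; not part of the HibikiTranslate API.
enum InputWarning: Equatable {
    case needsResample(from: Int)      // source is not already 24 kHz
    case shortClip(seconds: Double)    // shorter than the suggested 5 s
}

/// Checks a clip against the guidance above: 24 kHz input, 5 s or longer.
func preflight(sampleCount: Int, sampleRate: Int) -> [InputWarning] {
    guard sampleRate > 0 else { return [] }
    var warnings: [InputWarning] = []
    if sampleRate != 24_000 {
        warnings.append(.needsResample(from: sampleRate))
    }
    let seconds = Double(sampleCount) / Double(sampleRate)
    if seconds < 5.0 {
        warnings.append(.shortClip(seconds: seconds))
    }
    return warnings
}
```

A 3 s clip recorded at 16 kHz trips both warnings, while a 5 s clip at 24 kHz passes cleanly. Resampling alone may not be enough, since the degenerate clips were human recordings while the working canary is TTS-generated, so treat a clean pre-flight as necessary rather than sufficient.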
Environment Variables
| Variable | Effect |
|---|---|
| HIBIKI_GREEDY=1 | Force argmax decoding for both text and target audio. Reproducible; used by the strict CI canaries. |
| HIBIKI_E2E=1 | Enable the E2E test cases (requires the ~2.7 GB model download). |
| HIBIKI_STRICT_ALL=1 | Promote PT/DE tests from warn-only to strict. |
| HIBIKI_LENIENT=1 | Demote FR/ES tests from strict to warn-only (debugging only). |
| HIBIKI_MODEL_ID=<repo> | Override the default aufklarer/Hibiki-Zero-3B-MLX-4bit model id. |
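For tooling that wraps the CLI or the tests, the variables in the table can be collected into one value. The `HibikiEnv` type below is a hypothetical sketch, not the package's own parsing; it assumes that only the literal value `1` enables a flag.

```swift
/// Hypothetical container for the variables listed above.
/// Assumption: a flag is on only when its value is exactly "1".
struct HibikiEnv: Equatable {
    var greedy = false
    var e2e = false
    var strictAll = false
    var lenient = false
    var modelID = "aufklarer/Hibiki-Zero-3B-MLX-4bit"

    /// Takes a plain dictionary so it can be exercised without touching the process environment.
    init(environment env: [String: String]) {
        greedy = env["HIBIKI_GREEDY"] == "1"
        e2e = env["HIBIKI_E2E"] == "1"
        strictAll = env["HIBIKI_STRICT_ALL"] == "1"
        lenient = env["HIBIKI_LENIENT"] == "1"
        if let id = env["HIBIKI_MODEL_ID"], !id.isEmpty {
            modelID = id
        }
    }
}
```

In a real program the dictionary would typically come from Foundation's `ProcessInfo.processInfo.environment`.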
Performance (M2 Max, MLX 4-bit)
| Metric | Greedy | Sampled |
|---|---|---|
| Per-step latency | ~75 ms | ~95 ms |
| Wall-clock for 3.54 s FR source | ~5 s | ~7 s |
| Output duration | 1.0–1.6× source | 1.0–1.6× source |
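The wall-clock figures are roughly what the per-step latency predicts. The sketch below is a back-of-the-envelope estimate only: it assumes one decode step per 80 ms output frame and ignores model load, Mimi encode/decode, and file I/O.

```swift
/// Rough decode-time estimate: steps = output frames at 12.5 Hz (rounded up),
/// each costing `perStepSeconds`. Ignores load time, codec time, and I/O.
func estimatedDecodeSeconds(sourceSeconds: Double,
                            outputRatio: Double,
                            perStepSeconds: Double) -> Double {
    let steps = (sourceSeconds * outputRatio * 12.5).rounded(.up)
    return steps * perStepSeconds
}

/// Per-step latency relative to the 80 ms frame budget; values above 1 are slower than real time.
func realTimeFactor(perStepSeconds: Double) -> Double {
    perStepSeconds / 0.08
}
```

For the 3.54 s French clip at ~75 ms per step, the estimate spans about 3.4 s (1.0× output) to 5.3 s (1.6× output), in line with the ~5 s greedy figure. By the same arithmetic, ~75 ms per step sits just under the 80 ms frame budget, while ~95 ms per step in sampled mode is slower than real time.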
Known Limitations
- translateStream() emits a single final chunk. The streaming entry point currently wraps the offline translate(). True per-chunk Mimi streaming decode is a v2 follow-up.
- No SentencePiece text decoder. The --transcript flag prints raw token IDs. SPM decode wiring is a follow-up.
- Sampled mode is noticeably noisier than greedy. Use HIBIKI_GREEDY=1 for reproducible runs.
- Quantization-only. The repo currently ships Zero-3B 4-bit and 8-bit; the 1B and 2B Hibiki variants exist in the converter (models/hibiki/export/convert.py) but the Swift driver targets Zero-3B's GQA + rope_concat + non-conditioned layout.
References
- Paper: High-Fidelity Simultaneous Speech-to-Speech Translation (Kyutai, 2025)
- Upstream code: kyutai-labs/hibiki
- Samples: kyutai/hibiki-zero-samples
- License: CC-BY-4.0