# MADLAD-400 Translation (On-Device, 400+ Languages)
MADLAD-400-3B-MT is a T5 v1.1 encoder-decoder model from Google trained for many-to-many machine translation across 400+ languages (Apache 2.0). The Soniqo build runs as quantized MLX safetensors (INT4 / INT8) on Apple Silicon with no cloud calls. Drop it after ASR for live captioning, in front of TTS for multilingual voice agents, or use it standalone.
```bash
audio transcribe meeting.wav | audio translate --to es
```

Same binary, and the target language is the only required input: the encoder auto-detects the source language, so you only specify the language you want the text translated into.
## Quick Start

```swift
import MADLADTranslation

let translator = try await MADLADTranslator.fromPretrained()

// Greedy decode (recommended default)
let es = try translator.translate("Hello, how are you?", to: "es")
// → "Hola, ¿cómo estás?"

let zh = try translator.translate("Where is the library?", to: "zh")
// → "图书馆在哪里?"

// Streaming
for try await piece in translator.translateStream("Good morning", to: "fr") {
    print(piece, terminator: "")
}
```
## CLI

```bash
audio translate "Hello, how are you?" --to es
audio translate "Bonjour" --to en --quantization int8
audio translate "Hello world" --to es --stream
audio translate --to fr --json   # JSON with timing metrics

# Pipe from ASR
audio transcribe meeting.wav | audio translate --to es
```
## Architecture
T5 v1.1 encoder-decoder, ~3B parameters. 32 encoder + 32 decoder layers, d_model = 1024, d_kv = 128, num_heads = 16, gated GeLU FFN (d_ff = 8192). Position information arrives via a learned relative position bias (32 buckets, max distance 128) instead of position embeddings — bidirectional in the encoder, unidirectional (past-only) in the decoder. The bias table lives only on the first layer of each stack and is propagated through subsequent layers. Attention scores are not scaled by 1/√d_k — that's a T5 quirk.
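The relative position bias works by mapping each query/key offset to one of the 32 buckets. A minimal sketch of the bucketing, mirroring the standard T5 scheme (exact buckets for short distances, log-spaced buckets out to the max distance); the function and parameter names are illustrative, not this package's API:

```swift
import Foundation

// Map a relative position (keyIndex - queryIndex) to a bias-table bucket.
// Bidirectional (encoder): half the buckets for positive offsets, half for
// negative. Unidirectional (decoder): only past positions get distinct buckets.
func relativePositionBucket(_ relativePosition: Int,
                            bidirectional: Bool,
                            numBuckets: Int = 32,
                            maxDistance: Int = 128) -> Int {
    var numBuckets = numBuckets
    var bucket = 0
    var n = relativePosition
    if bidirectional {
        numBuckets /= 2
        if n > 0 { bucket += numBuckets }   // future positions use the upper half
        n = abs(n)
    } else {
        n = max(-n, 0)                      // decoder: clamp future offsets to 0
    }
    let maxExact = numBuckets / 2
    if n < maxExact {
        return bucket + n                   // short distances: one bucket each
    }
    // Longer distances share log-spaced buckets up to maxDistance.
    let logRatio = log(Double(n) / Double(maxExact))
                 / log(Double(maxDistance) / Double(maxExact))
    let largeBucket = maxExact + Int(logRatio * Double(numBuckets - maxExact))
    return bucket + min(largeBucket, numBuckets - 1)
}
```

Because buckets saturate at `maxDistance`, the bias table stays fixed-size regardless of sequence length, which is why one table on the first layer of each stack suffices.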
Cross-attention K/V is computed once from the encoder output and reused for every decode step (cached in DecoderLayerCache.crossAttn). Decoder self-attention KV cache grows per generated token. Greedy decoding is the default and recommended for translation; temperature / top-k / top-p sampling is exposed for paraphrase-style use.
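The cache split and greedy loop described above can be sketched as follows. Apart from the `crossAttn` field name mentioned in this document, every type and name here is an illustrative assumption, and real MLX tensors would replace the nested `Float` arrays:

```swift
// Illustrative cache layout for one decoder layer.
struct DecoderLayerCache {
    // Cross-attention K/V: computed once from the encoder output,
    // reused unchanged at every decode step.
    var crossAttn: (keys: [[Float]], values: [[Float]])?
    // Self-attention K/V: grows by one entry per generated token.
    var selfAttnKeys: [[Float]] = []
    var selfAttnValues: [[Float]] = []
}

// Greedy loop skeleton. `step` runs one decoder forward pass for a single
// token, mutating the cache in place, and returns next-token logits.
func greedyDecode(step: (Int, inout DecoderLayerCache) -> [Float],
                  bosToken: Int, eosToken: Int, maxTokens: Int) -> [Int] {
    var cache = DecoderLayerCache()
    var token = bosToken
    var output: [Int] = []
    for _ in 0..<maxTokens {
        let logits = step(token, &cache)
        token = logits.indices.max(by: { logits[$0] < logits[$1] }) ?? eosToken
        if token == eosToken { break }
        output.append(token)
    }
    return output
}
```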
## Model Variants
| Variant | Quantization | Size | Compute | HuggingFace |
|---|---|---|---|---|
| MADLAD-400-3B-MT | INT4 | ~1.7 GB | Metal GPU (MLX) | aufklarer/MADLAD400-3B-MT-MLX (int4/) |
| MADLAD-400-3B-MT | INT8 | ~3.1 GB | Metal GPU (MLX) | aufklarer/MADLAD400-3B-MT-MLX (int8/) |
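In Swift, picking a variant might look like the sketch below. The `quantization:` parameter is a hypothetical mirror of the CLI's --quantization flag, not confirmed API:

```swift
// Hypothetical load-time selection (parameter name is an assumption).
let compact = try await MADLADTranslator.fromPretrained(quantization: .int4) // ~1.7 GB
let precise = try await MADLADTranslator.fromPretrained(quantization: .int8) // ~3.1 GB
```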
## Target Languages
Specify the target as the language code used in MADLAD's vocabulary (typically ISO 639-1 like es, fr, zh, ja, plus 400+ regional variants such as yue for Cantonese, min_nan for Hokkien). The tokenizer resolves <2{lang}> via direct vocabulary lookup and throws MADLADTranslationError.unsupportedLanguage if the code isn't recognized. Source language is auto-detected from the input — you do not specify it.
```swift
// Errors out if the language code isn't in MADLAD's vocab.
do {
    let _ = try translator.translate("Hello", to: "xx")
} catch MADLADTranslationError.unsupportedLanguage(let code) {
    print("MADLAD doesn't support: \(code)")
}
```
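Under the hood, the lookup described above amounts to resolving a target-prefix token. A minimal sketch with an illustrative vocabulary dictionary and a local copy of the error type; in the package the tokenizer owns this table:

```swift
// Mirrors the package's error for the sake of a self-contained sketch.
enum MADLADTranslationError: Error {
    case unsupportedLanguage(String)
}

// Resolve "<2{lang}>" by direct vocabulary lookup, e.g. "es" -> "<2es>".
func languageTokenID(for code: String, vocab: [String: Int]) throws -> Int {
    let token = "<2\(code)>"
    guard let id = vocab[token] else {
        throw MADLADTranslationError.unsupportedLanguage(code)
    }
    return id
}
```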
## Sampling Configuration

```swift
let sampling = TranslationSamplingConfig(
    temperature: 0.0,       // greedy (default, recommended for MT)
    topK: 0,                // disabled
    topP: 1.0,              // disabled
    maxTokens: 256,
    repetitionPenalty: 1.0
)
let result = try translator.translate("Long-form text…", to: "es", sampling: sampling)
```
| Parameter | Default | Description |
|---|---|---|
| temperature | 0.0 | 0 = greedy. Bump to 0.6–0.8 for paraphrase variation. |
| topK | 0 | Top-K cutoff (0 = disabled). |
| topP | 1.0 | Nucleus sampling cutoff (1.0 = disabled). |
| maxTokens | 256 | Hard cap on output length. |
| repetitionPenalty | 1.0 | >1 penalizes recently generated tokens. |
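How temperature and repetition penalty act on the logits can be sketched as a standard logit-processing step; this illustrates the common technique, not the package's internal implementation:

```swift
// Apply repetition penalty and temperature before token selection.
func processLogits(_ logits: [Float],
                   recentTokens: Set<Int>,
                   temperature: Float,
                   repetitionPenalty: Float) -> [Float] {
    var out = logits
    // Repetition penalty: shrink positive logits of recently generated
    // tokens and push negative ones further down.
    if repetitionPenalty != 1.0 {
        for t in recentTokens {
            out[t] = out[t] > 0 ? out[t] / repetitionPenalty
                                : out[t] * repetitionPenalty
        }
    }
    // temperature == 0 means greedy: the caller takes argmax directly.
    if temperature > 0 {
        for i in out.indices { out[i] /= temperature }
    }
    return out
}
```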
## Conversion
The MLX safetensors at aufklarer/MADLAD400-3B-MT-MLX are quantized from the original google/madlad400-3b-mt via mx.quantize (group size 64). The q/k/v/o, wi_0/wi_1/wo, lm_head, and shared embedding are quantized; layer norm scales and the relative-position-bias table are kept as fp16. MADLAD's only HF embedding key (decoder.embed_tokens.weight) is renamed to shared.weight at conversion time so both encoder and decoder reuse it.
## License
Apache 2.0 (inherited from google/madlad400-3b-mt). Model card lists the full set of supported language codes.