MADLAD-400 Translation (On-Device, 400+ Languages)

MADLAD-400-3B-MT is a T5 v1.1 encoder-decoder model from Google trained for many-to-many machine translation across 400+ languages. Apache 2.0. The Soniqo build runs as quantized MLX safetensors (INT4 / INT8) on Apple Silicon with no cloud calls. Drop it after ASR for live captioning, after TTS prep for multilingual voice agents, or use it standalone.

Pipe from ASR

audio transcribe meeting.wav | audio translate --to es — same binary. The target language is the only required input; the encoder auto-detects the source language, so you only ever specify the language you want out.

Quick Start

import MADLADTranslation

let translator = try await MADLADTranslator.fromPretrained()

// Greedy decode (recommended default)
let es = try translator.translate("Hello, how are you?", to: "es")
// → "Hola, ¿cómo estás?"

let zh = try translator.translate("Where is the library?", to: "zh")
// → "图书馆在哪里?"

// Streaming
for try await piece in translator.translateStream("Good morning", to: "fr") {
    print(piece, terminator: "")
}

CLI

audio translate "Hello, how are you?" --to es
audio translate "Bonjour" --to en --quantization int8
audio translate "Hello world" --to es --stream
audio translate --to fr --json    # JSON with timing metrics

# Pipe from ASR
audio transcribe meeting.wav | audio translate --to es

Architecture

T5 v1.1 encoder-decoder, ~3B parameters. 32 encoder + 32 decoder layers, d_model = 1024, d_kv = 128, num_heads = 16, gated GeLU FFN (d_ff = 8192). Position information arrives via a learned relative position bias (32 buckets, max distance 128) instead of position embeddings — bidirectional in the encoder, unidirectional (past-only) in the decoder. The bias table lives only on the first layer of each stack and is propagated through subsequent layers. Attention scores are not scaled by 1/√d_k — that's a T5 quirk.
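As a concrete illustration of the relative position bias described above, here is a standalone sketch of T5-style position bucketing (32 buckets, max distance 128): nearby offsets get exact buckets, larger distances map logarithmically into the rest. This is an illustrative reimplementation, not code from the library.

```swift
import Foundation

// Map a relative position (key index minus query index) to one of
// `numBuckets` bias buckets, T5-style. Bidirectional = encoder;
// unidirectional (past-only) = decoder.
func relativePositionBucket(_ relativePosition: Int,
                            bidirectional: Bool,
                            numBuckets: Int = 32,
                            maxDistance: Int = 128) -> Int {
    var bucket = 0
    var n = relativePosition
    var buckets = numBuckets
    if bidirectional {
        buckets /= 2                 // half the buckets for each direction
        if n > 0 { bucket += buckets }
        n = abs(n)
    } else {
        n = max(-n, 0)               // decoder: future positions collapse to 0
    }
    let maxExact = buckets / 2
    if n < maxExact {
        return bucket + n            // small distances: one bucket each
    }
    // Larger distances share buckets on a log scale up to maxDistance.
    let logBucket = maxExact + Int(
        log(Double(n) / Double(maxExact)) /
        log(Double(maxDistance) / Double(maxExact)) * Double(buckets - maxExact)
    )
    return bucket + min(logBucket, buckets - 1)
}
```

Because the bias depends only on the bucket, the table on the first layer of each stack is all that's needed; later layers reuse the same bias values.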

Cross-attention K/V is computed once from the encoder output and reused for every decode step (cached in DecoderLayerCache.crossAttn). Decoder self-attention KV cache grows per generated token. Greedy decoding is the default and recommended for translation; temperature / top-k / top-p sampling is exposed for paraphrase-style use.
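The caching pattern above can be sketched in a few lines. The types and field names here are simplified stand-ins (not the library's actual `DecoderLayerCache`): cross-attention K/V is filled once from the encoder output, while the self-attention cache appends one entry per generated token.

```swift
// Simplified sketch of per-layer decoder caching.
struct LayerCacheSketch {
    var crossK: [Float]? = nil       // fixed after the first decode step
    var crossV: [Float]? = nil
    var selfK: [[Float]] = []        // grows by one entry per token
    var selfV: [[Float]] = []
}

func decodeStep(cache: inout LayerCacheSketch,
                encoderK: [Float], encoderV: [Float],
                newSelfK: [Float], newSelfV: [Float]) {
    if cache.crossK == nil {         // compute-once: encoder K/V never changes
        cache.crossK = encoderK
        cache.crossV = encoderV
    }
    cache.selfK.append(newSelfK)     // per-step: self-attention KV accumulates
    cache.selfV.append(newSelfV)
}
```

The payoff is that the encoder projection cost is paid once per translation rather than once per output token.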

Model Variants

| Variant | Quantization | Size | Compute | HuggingFace |
|---|---|---|---|---|
| MADLAD-400-3B-MT | INT4 | ~1.7 GB | Metal GPU (MLX) | aufklarer/MADLAD400-3B-MT-MLX (int4/) |
| MADLAD-400-3B-MT | INT8 | ~3.1 GB | Metal GPU (MLX) | aufklarer/MADLAD400-3B-MT-MLX (int8/) |

Target Languages

Specify the target as the language code used in MADLAD's vocabulary (typically ISO 639-1 like es, fr, zh, ja, plus 400+ regional variants such as yue for Cantonese, min_nan for Hokkien). The tokenizer resolves <2{lang}> via direct vocabulary lookup and throws MADLADTranslationError.unsupportedLanguage if the code isn't recognized. Source language is auto-detected from the input — you do not specify it.

// Errors out if the language code isn't in MADLAD's vocab.
do {
    let _ = try translator.translate("Hello", to: "xx")
} catch MADLADTranslationError.unsupportedLanguage(let code) {
    print("MADLAD doesn't support: \(code)")
}
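The `<2{lang}>` resolution described above amounts to a direct vocabulary lookup. The sketch below shows the idea with a stand-in vocabulary and error type (`TranslationErrorSketch` is hypothetical; the library's actual error is MADLADTranslationError.unsupportedLanguage):

```swift
// Hedged sketch: form the MADLAD target-language tag and validate it
// against a vocabulary. `vocab` is an illustrative stand-in, not the
// real MADLAD vocabulary.
enum TranslationErrorSketch: Error {
    case unsupportedLanguage(String)
}

func languageToken(for code: String, vocab: Set<String>) throws -> String {
    let token = "<2\(code)>"          // e.g. "es" -> "<2es>"
    guard vocab.contains(token) else {
        throw TranslationErrorSketch.unsupportedLanguage(code)
    }
    return token
}
```

Because the check is a plain set lookup, unsupported codes fail fast before any model compute runs.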

Sampling Configuration

let sampling = TranslationSamplingConfig(
    temperature: 0.0,        // greedy (default, recommended for MT)
    topK: 0,                 // disabled
    topP: 1.0,               // disabled
    maxTokens: 256,
    repetitionPenalty: 1.0
)
let result = try translator.translate("Long-form text…", to: "es", sampling: sampling)

| Parameter | Default | Description |
|---|---|---|
| temperature | 0.0 | 0 = greedy. Bump to 0.6–0.8 for paraphrase variation. |
| topK | 0 | Top-K cutoff (0 = disabled). |
| topP | 1.0 | Nucleus sampling cutoff (1.0 = disabled). |
| maxTokens | 256 | Hard cap on output length. |
| repetitionPenalty | 1.0 | >1 penalizes recently-generated tokens. |
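A minimal sketch of how the temperature and topK knobs interact at decode time, under the convention above that temperature 0 short-circuits to greedy argmax. This is an illustration of the general technique, not the library's internal sampler:

```swift
import Foundation

// Pick a token index from raw logits. temperature == 0 -> argmax (greedy);
// otherwise scale by temperature, optionally keep only the top-k logits,
// softmax, and sample.
func pickToken(logits: [Float], temperature: Float, topK: Int) -> Int {
    if temperature == 0 {
        return logits.indices.max { logits[$0] < logits[$1] }!
    }
    var scaled = logits.map { $0 / temperature }
    if topK > 0 {
        // Mask everything below the k-th largest logit.
        let cutoff = scaled.sorted(by: >)[min(topK, scaled.count) - 1]
        scaled = scaled.map { $0 >= cutoff ? $0 : -.infinity }
    }
    // Numerically stable softmax, then sample from the distribution.
    let maxLogit = scaled.max()!
    let exps = scaled.map { exp(Double($0 - maxLogit)) }
    var r = Double.random(in: 0..<exps.reduce(0, +))
    for (i, e) in exps.enumerated() {
        r -= e
        if r < 0 { return i }
    }
    return exps.count - 1
}
```

With topK = 1 the distribution collapses back to argmax, which is why greedy remains the safe default for translation.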

Conversion

The MLX safetensors at aufklarer/MADLAD400-3B-MT-MLX are quantized from the original google/madlad400-3b-mt via mx.quantize (group size 64). The q/k/v/o, wi_0/wi_1/wo, lm_head, and shared embedding are quantized; layer norm scales and the relative-position-bias table are kept as fp16. MADLAD's only HF embedding key (decoder.embed_tokens.weight) is renamed to shared.weight at conversion time so both encoder and decoder reuse it.
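A back-of-the-envelope check that the sizes in the variants table are consistent with group-size-64 quantization. Assuming each group of 64 weights carries one fp16 scale and one fp16 bias (32 extra bits per group, which matches MLX's affine quantization scheme), the effective cost is bits + 32/64 per weight:

```swift
// Approximate on-disk size in GB for a quantized model.
// Assumption: fp16 scale + fp16 bias per group of `groupSize` weights;
// ignores the small fp16 layer-norm and bias-table tensors.
func quantizedGigabytes(params: Double, bits: Double,
                        groupSize: Double = 64) -> Double {
    let bitsPerWeight = bits + 32.0 / groupSize   // 4 -> 4.5, 8 -> 8.5
    return params * bitsPerWeight / 8 / 1e9
}
// quantizedGigabytes(params: 3e9, bits: 4)  ~ 1.69 GB (table: ~1.7 GB)
// quantizedGigabytes(params: 3e9, bits: 8)  ~ 3.19 GB (table: ~3.1 GB)
```

The small gap versus the table is expected from the unquantized fp16 tensors and safetensors metadata.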

License

Apache 2.0 (inherited from google/madlad400-3b-mt). Model card lists the full set of supported language codes.