Omnilingual ASR
Meta's Omnilingual ASR is a language-agnostic speech recognition family covering 1,672 languages across 32 distinct scripts — the broadest language coverage of any on-device ASR model on Apple Silicon. Soniqo ports the CTC variant to both CoreML (Neural Engine) and MLX (Metal GPU), with four model sizes from 300M to 7B parameters.
Unlike Qwen3-ASR or Parakeet TDT, Omnilingual CTC does not take a language hint at inference time — it uses a shared 10,288-entry SentencePiece vocabulary spanning every supported language. Pass any audio in any supported language and the model produces the correct script automatically.
Architecture
Omnilingual CTC is a supervised fine-tune of Meta's wav2vec 2.0 backbone with a linear CTC head over a shared multilingual vocabulary. The pipeline is parallel and non-autoregressive — one forward pass per utterance, no decoder loop.
raw audio [1, samples]
→ wav2vec2 feature extractor (7 strided CNN layers, ×320 downsample)
→ weight-normalised Conv1d positional encoder
→ N × pre-norm Transformer encoder layers
→ final layer norm
→ linear CTC head → logits [1, T, 10288]
At 16 kHz input, the 320× downsampling yields a 50 Hz frame rate; a 10-second clip produces 499 logit frames (the strided convolutions trim the count just below 500). Greedy CTC decoding collapses consecutive duplicates and skips special tokens via the SentencePiece tokenizer.
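The collapse-and-skip step can be sketched in a few lines. This is a minimal greedy CTC decode over raw logits, not the port's actual implementation; the blank-token ID of 0 is an illustrative assumption:

```swift
// Greedy CTC decode: argmax per frame, collapse consecutive repeats,
// drop blank tokens. Blank ID 0 is an assumption for illustration.
func greedyCTCDecode(logits: [[Float]], blankID: Int = 0) -> [Int] {
    var tokens: [Int] = []
    var previous = -1
    for frame in logits {
        // Argmax over the vocabulary dimension of this frame.
        let best = frame.indices.max(by: { frame[$0] < frame[$1] })!
        if best != blankID && best != previous {
            tokens.append(best)
        }
        previous = best
    }
    return tokens
}
```

In the real pipeline the surviving token IDs would then be detokenised through the shared SentencePiece vocabulary.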
Model Variants
Ten variants are published under the aufklarer namespace on HuggingFace — two CoreML window sizes plus eight MLX builds (four model sizes × two quantisation levels):
| Variant | Layers | Dim | Heads | Size | Runtime |
|---|---|---|---|---|---|
| CTC-300M-CoreML-INT8 (5 s window) | 24 | 1024 | 16 | 312 MB | Neural Engine |
| CTC-300M-CoreML-INT8 (10 s window) | 24 | 1024 | 16 | 312 MB | Neural Engine |
| CTC-300M-MLX-4bit | 24 | 1024 | 16 | 193 MB | Metal GPU |
| CTC-300M-MLX-8bit | 24 | 1024 | 16 | 342 MB | Metal GPU |
| CTC-1B-MLX-4bit | 48 | 1280 | 20 | 549 MB | Metal GPU |
| CTC-1B-MLX-8bit | 48 | 1280 | 20 | 1006 MB | Metal GPU |
| CTC-3B-MLX-4bit | 60 | 2048 | 32 | 1.71 GB | Metal GPU |
| CTC-3B-MLX-8bit | 60 | 2048 | 32 | 3.16 GB | Metal GPU |
| CTC-7B-MLX-4bit | 128 | 2048 | 32 | 3.55 GB | Metal GPU |
| CTC-7B-MLX-8bit | 128 | 2048 | 32 | 6.63 GB | Metal GPU |
CLI Usage
CoreML (Neural Engine)
# 10 s window (default)
.build/release/audio transcribe recording.wav --engine omnilingual
# 5 s window (lower memory, faster cold start)
.build/release/audio transcribe recording.wav --engine omnilingual --window 5
MLX (Metal GPU)
# 300M @ 4-bit (default MLX variant)
.build/release/audio transcribe recording.wav --engine omnilingual --backend mlx
# 1B @ 4-bit — higher accuracy
.build/release/audio transcribe recording.wav --engine omnilingual --backend mlx --variant 1B
# 3B @ 8-bit — approaching reference quality
.build/release/audio transcribe recording.wav --engine omnilingual --backend mlx --variant 3B --bits 8
# 7B @ 4-bit — largest CTC variant, best accuracy
.build/release/audio transcribe recording.wav --engine omnilingual --backend mlx --variant 7B
Swift API
CoreML backend
import OmnilingualASR
import AudioCommon
let model = try await OmnilingualASRModel.fromPretrained()
let audio = try AudioFileLoader.load(url: url, targetSampleRate: 16000)
let text = try model.transcribeAudio(audio, sampleRate: 16000)
print(text)
MLX backend
import OmnilingualASR
// Default: 300M @ 4-bit
let model = try await OmnilingualASRMLXModel.fromPretrained()
// Larger variants
let big = try await OmnilingualASRMLXModel.fromPretrained(variant: .b1, bits: 4)
let huge = try await OmnilingualASRMLXModel.fromPretrained(variant: .b3, bits: 8)
let max = try await OmnilingualASRMLXModel.fromPretrained(variant: .b7, bits: 4)
let text = try model.transcribeAudio(audio, sampleRate: 16000)
CoreML vs MLX
Both backends produce essentially identical transcripts, typically differing by one or two characters due to quantisation and runtime differences. Choose based on your deployment target:
| CoreML | MLX | |
|---|---|---|
| Compute target | Neural Engine | Metal GPU |
| Input length | Fixed window (5 s or 10 s) | Any length up to 40 s |
| Model sizes | 300M only | 300M / 1B / 3B / 7B |
| Quantisation | INT8 palettization | 4-bit or 8-bit QuantizedLinear |
| Runs concurrently with GPU TTS | Yes (ANE is independent) | Contends with GPU TTS |
| iOS support | iOS 17+ | Any Apple Silicon iOS |
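The table above boils down to a small decision rule. A hedged sketch — the enum and criteria names here are illustrative, not part of the Soniqo API:

```swift
// Illustrative backend chooser mirroring the comparison table.
// Names are hypothetical, not Soniqo API.
enum ASRBackend { case coreML, mlx }

func chooseBackend(audioSeconds: Double,
                   needsConcurrentGPUTTS: Bool,
                   wantsLargerThan300M: Bool) -> ASRBackend {
    // MLX is required for >300M models and handles any length up to 40 s;
    // CoreML is capped at a fixed 5 s / 10 s window and the 300M model.
    if wantsLargerThan300M || audioSeconds > 10 { return .mlx }
    // The ANE runs independently of the GPU, so CoreML coexists with GPU TTS.
    if needsConcurrentGPUTTS { return .coreML }
    return .coreML
}
```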
Preprocessing Details
Omnilingual requires utterance-level layer-norm on the raw waveform, matching fairseq2's apply_audio_normalization:
normalized = (audio - mean(audio)) / sqrt(variance(audio) + 1e-5)
The Swift port normalises the real audio content before zero-padding to the CoreML fixed window, so sub-window inputs match reference pipeline statistics exactly. This is the single most common port pitfall — HuggingFace's wav2vec2 implementation does group-norm per-feature, not utterance layer-norm.
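The normalise-then-pad ordering can be sketched as follows. This is a simplified standalone version, not the port's actual code; the default window length (10 s at 16 kHz) is an assumption for illustration:

```swift
// Utterance-level layer-norm matching the fairseq2-style formula above,
// applied to the real samples BEFORE zero-padding to the fixed window.
// windowSamples = 160_000 (10 s at 16 kHz) is an illustrative default.
func normalizeThenPad(_ audio: [Float], windowSamples: Int = 160_000) -> [Float] {
    let n = Float(audio.count)
    let mean = audio.reduce(0, +) / n
    let variance = audio.reduce(0) { $0 + ($1 - mean) * ($1 - mean) } / n
    let scale = 1.0 / (variance + 1e-5).squareRoot()
    var out = audio.map { ($0 - mean) * scale }
    // Pad AFTER normalisation so the zeros don't skew mean/variance.
    if out.count < windowSamples {
        out.append(contentsOf: [Float](repeating: 0, count: windowSamples - out.count))
    }
    return out
}
```

Padding first would drag the mean toward zero and shrink the variance, which is exactly the statistic mismatch the port avoids.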
The reference pipeline enforces a 40-second hard cap (MAX_ALLOWED_AUDIO_SEC) on input audio. The Swift port enforces the same limit — longer inputs throw a clear error pointing to SpeechVAD or ParakeetStreamingASR for segmented processing.
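The length guard amounts to a simple duration check before inference. A minimal sketch — the error type and its shape are hypothetical, not the Swift port's actual error:

```swift
// Enforce the reference pipeline's 40 s hard cap before inference.
// The error type is illustrative; the port's actual error may differ.
enum OmnilingualInputError: Error {
    case audioTooLong(seconds: Double)
}

func checkDuration(sampleCount: Int,
                   sampleRate: Int = 16_000,
                   maxSeconds: Double = 40) throws {
    let seconds = Double(sampleCount) / Double(sampleRate)
    guard seconds <= maxSeconds else {
        throw OmnilingualInputError.audioTooLong(seconds: seconds)
    }
}
```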
Language Coverage
Omnilingual supports 1,672 languages across 32 scripts, including 500+ low-resource languages added via community data collection. Sample coverage:
| Script | Languages | Sample codes |
|---|---|---|
| Latin | 1,398 | eng_Latn, fra_Latn, spa_Latn, pcm_Latn, swh_Latn, zul_Latn, … |
| Arabic | 70 | arb_Arab, arz_Arab, ary_Arab, arq_Arab, fas_Arab, urd_Arab, ckb_Arab, uig_Arab, … |
| Devanagari | 65 | hin_Deva, mar_Deva, nep_Deva, bho_Deva, mai_Deva, awa_Deva, brx_Deva, … |
| Cyrillic | 51 | rus_Cyrl, ukr_Cyrl, bel_Cyrl, bul_Cyrl, srp_Cyrl, mkd_Cyrl, kaz_Cyrl, … |
| Ethiopic | 10 | amh_Ethi, tir_Ethi, wal_Ethi, sgw_Ethi, … |
| Bengali | 8 | ben_Beng, asm_Beng, mni_Beng, … |
| Thai / Lao / Myanmar / Tibetan | 9 / 1 / 3 / 6 | tha_Thai, lao_Laoo, mya_Mymr, bod_Tibt, dzo_Tibt, … |
| Han (simplified / traditional) | 6 | cmn_Hans, cmn_Hant, yue_Hans, yue_Hant, cdo_Hans, … |
| Japanese / Korean | 1 / 1 | jpn_Jpan, kor_Hang |
| Armenian, Georgian, Hebrew, Greek, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Sinhala, Tamil, Telugu, Tifinagh, Thaana, plus 4 more | 48 | See full list → |
The full ISO 639-3 + ISO 15924 code list with English names lives in the lang_ids.py source; the model card groups the languages by script with country hints.
Verified Output
Transcripts from the Swift port on the FLEURS benchmark, CoreML 300M:
| Language | Reference | Output |
|---|---|---|
| English | Fellow wrestlers also paid tribute to Luna. | fellow wrestlers also paid tribute to luna |
| Arabic | كما أثنى الزملاء المصارعون على لونا | كما أثنى الزملاء المصارعون على لونا |
| Hindi | लूना को साथी पहलवानों ने भी श्रद्धांजलि दी | लूना को साथी पहलवानों ने भी सरधांजलीदी |
| French | Pensez à l'itinéraire de ski comme à un itinéraire de randonnée similaire. | pense à létineraire desqui comme un étineraire de rent donner similaire |
The MLX 300M-4bit variant produces essentially identical output modulo 1-2 characters. Larger variants (1B, 3B, 7B) reduce residual errors progressively.
Reference
- Paper: Omnilingual ASR (arXiv 2511.09690)
- Upstream implementation: facebookresearch/omnilingual-asr (Apache 2.0)
- Swift port source: speech-swift/Sources/OmnilingualASR
- Model card: docs/models/omnilingual-asr.md