# Qwen3-ASR

Qwen3-ASR is a state-of-the-art multilingual automatic speech recognition model. It runs on-device using Metal GPU acceleration via MLX, with 4-bit quantization for efficient memory usage. Available in 0.6B and 1.7B parameter variants.

## Pipeline

The Qwen3-ASR inference pipeline processes audio through four stages:

| Stage | Description |
| --- | --- |
| Audio Input | Raw audio resampled to 16 kHz mono |
| Mel Spectrogram | 128-bin mel filterbank features extracted from the waveform |
| Audio Encoder | 18-layer transformer with block attention; processes mel frames into audio embeddings |
| Text Decoder | 28-layer Qwen3 transformer with grouped-query attention (GQA) and rotary position embeddings (RoPE); autoregressively generates text tokens |
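To get a feel for the feature geometry in the first two stages, the frame arithmetic can be sketched in Swift. The 10 ms hop size is a typical ASR front-end default and an assumption here, not a confirmed Qwen3-ASR parameter:

```swift
// Feature-geometry arithmetic for the front end.
// The hop size is an assumed typical ASR default, not a confirmed Qwen3-ASR value.
let sampleRate = 16_000          // Stage 1: input resampled to 16 kHz mono
let hopSamples = 160             // assumed 10 ms hop
let audioSeconds = 10.0
let totalSamples = Int(audioSeconds * Double(sampleRate))
let melFrames = totalSamples / hopSamples
print("~\(melFrames) mel frames of 128 bins for a \(Int(audioSeconds)) s clip")
// ~1000 frames: under these assumptions the encoder sees roughly 100 frames per second
```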

## Performance

| Backend | RTF | Peak Memory | Notes |
| --- | --- | --- | --- |
| MLX (GPU) | ~0.06 | ~2.2 GB | Default; fastest single-model configuration |
| CoreML + MLX (hybrid) | ~0.09 | ~400 MB (encoder) | Encoder on Neural Engine, decoder on GPU |

Measured on an M2 Max, 64 GB. RTF < 1.0 means faster than real time.
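Concretely, RTF (real-time factor) is just processing time divided by audio duration; a minimal Swift sketch:

```swift
import Foundation

// RTF (real-time factor) = processing time / audio duration.
// RTF < 1.0 means the model transcribes faster than the audio plays.
func realTimeFactor(processingSeconds: Double, audioSeconds: Double) -> Double {
    processingSeconds / audioSeconds
}

// Example: a 60 s clip transcribed in 3.6 s matches the MLX row above.
let rtf = realTimeFactor(processingSeconds: 3.6, audioSeconds: 60.0)
print(String(format: "RTF = %.2f", rtf))  // RTF = 0.06
```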

## Model Variants

| Model | Backend | Size | HuggingFace |
| --- | --- | --- | --- |
| Qwen3-ASR-0.6B (4-bit) | MLX | 680 MB | `aufklarer/Qwen3-ASR-0.6B-MLX-4bit` |
| Qwen3-ASR-0.6B (8-bit) | MLX | 1.0 GB | `aufklarer/Qwen3-ASR-0.6B-MLX-8bit` |
| Qwen3-ASR-0.6B (CoreML INT8) | CoreML | 180 MB | `aufklarer/Qwen3-ASR-CoreML` |
| Qwen3-ASR-1.7B (4-bit) | MLX | 2.1 GB | `aufklarer/Qwen3-ASR-1.7B-MLX-4bit` |
| Qwen3-ASR-1.7B (8-bit) | MLX | 3.2 GB | `aufklarer/Qwen3-ASR-1.7B-MLX-8bit` |

## CLI Usage

Transcribe an audio file with the default Qwen3-ASR model:

```sh
.build/release/audio transcribe recording.wav
```

### Options

```sh
# Use the larger 1.7B model
.build/release/audio transcribe recording.wav --model 1.7b

# Specify language
.build/release/audio transcribe recording.wav --language en

# Streaming mode with partial results
.build/release/audio transcribe recording.wav --stream --partial
```

## Swift API

Use the `Qwen3ASR` module to transcribe audio programmatically:

```swift
import Qwen3ASR

// Load the model (downloads from HuggingFace on first use)
let model = try await Qwen3ASRModel.loadFromHub()

// Transcribe an audio file
let result = try await model.transcribe(audioFile: "recording.wav")
print(result.text)
```

## CoreML Encoder (Neural Engine)

Run the audio encoder on the Neural Engine via CoreML, with the text decoder on GPU via MLX. This hybrid approach lowers power consumption and frees the GPU for concurrent workloads.

```swift
import Qwen3ASR

let encoder = try await CoreMLASREncoder.fromPretrained()
let model = try await Qwen3ASRModel.fromPretrained()
let text = try model.transcribe(
    audio: samples, sampleRate: 16000,
    coremlEncoder: encoder
)
```

From the CLI:

```sh
.build/release/audio transcribe recording.wav --engine qwen3-coreml
```

The INT8 palettized encoder (180 MB, cosine similarity > 0.999 against the float reference) is the default. An INT4 variant (90 MB) is also available for size-constrained deployments.
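If the package exposes variant selection, opting into the smaller INT4 encoder might look like the sketch below. The `variant` parameter is hypothetical, not a documented signature; check the package for the real mechanism:

```swift
import Qwen3ASR

// Hypothetical API sketch: the `variant` parameter is an assumption,
// not a confirmed signature.
let smallEncoder = try await CoreMLASREncoder.fromPretrained(variant: "int4")  // ~90 MB
let model = try await Qwen3ASRModel.fromPretrained()
let text = try model.transcribe(
    audio: samples, sampleRate: 16000,
    coremlEncoder: smallEncoder
)
```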

## Streaming Mode

Streaming mode uses VAD (voice activity detection) to segment audio into chunks and transcribe them incrementally. This is useful for long recordings or real-time processing.

```sh
# Stream with default segment size
.build/release/audio transcribe recording.wav --stream

# Control maximum segment duration
.build/release/audio transcribe recording.wav --stream --max-segment 15

# Show partial (in-progress) results as they arrive
.build/release/audio transcribe recording.wav --stream --partial
```

The `--max-segment` flag controls the maximum chunk duration in seconds. The `--partial` flag enables partial result output, showing words as they are decoded.
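On the Swift side, a streaming consumer might look like the following sketch. `transcribeStream`, `maxSegmentSeconds`, and the event cases are assumptions mirroring the CLI flags, not a documented API:

```swift
import Qwen3ASR

// Hypothetical streaming sketch — method and event names are assumptions
// modeled on the CLI flags above, not confirmed API.
let model = try await Qwen3ASRModel.loadFromHub()
for try await event in model.transcribeStream(audioFile: "recording.wav",
                                              maxSegmentSeconds: 15) {
    switch event {
    case .partial(let text): print("…\(text)")  // in-progress hypothesis
    case .final(let text):   print(text)        // committed segment
    }
}
```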

## Supported Formats

Qwen3-ASR accepts common audio file formats; all input is automatically resampled to 16 kHz mono internally.
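The internal normalization step can be illustrated with AVFoundation. This is a sketch of what resampling to 16 kHz mono involves, not the library's actual implementation:

```swift
import AVFoundation

// Illustration only: decode a file and convert it to 16 kHz mono Float32,
// the shape the pipeline expects. Not the library's internal code.
func load16kMono(url: URL) throws -> [Float] {
    let file = try AVAudioFile(forReading: url)
    guard let outFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                        sampleRate: 16_000,
                                        channels: 1,
                                        interleaved: false),
          let converter = AVAudioConverter(from: file.processingFormat, to: outFormat)
    else { throw NSError(domain: "audio", code: -1) }

    let inBuffer = AVAudioPCMBuffer(pcmFormat: file.processingFormat,
                                    frameCapacity: AVAudioFrameCount(file.length))!
    try file.read(into: inBuffer)

    let ratio = 16_000 / file.processingFormat.sampleRate
    let outCapacity = AVAudioFrameCount(Double(inBuffer.frameLength) * ratio) + 1
    let outBuffer = AVAudioPCMBuffer(pcmFormat: outFormat, frameCapacity: outCapacity)!

    // Feed the whole input buffer once, then signal end of stream.
    var fed = false
    var conversionError: NSError?
    _ = converter.convert(to: outBuffer, error: &conversionError) { _, status in
        if fed { status.pointee = .endOfStream; return nil }
        fed = true
        status.pointee = .haveData
        return inBuffer
    }
    if let conversionError { throw conversionError }
    return Array(UnsafeBufferPointer(start: outBuffer.floatChannelData![0],
                                     count: Int(outBuffer.frameLength)))
}
```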

> [!IMPORTANT]
> Models are downloaded from HuggingFace on first use and cached in `~/Library/Caches/qwen3-speech/`. The 4-bit 0.6B model download is approximately 680 MB.