# Qwen3-ASR

Qwen3-ASR is a state-of-the-art multilingual automatic speech recognition model. It runs on-device using Metal GPU acceleration via MLX, with 4-bit quantization for efficient memory usage. Available in 0.6B and 1.7B parameter variants.

## Pipeline

The Qwen3-ASR inference pipeline processes audio through four stages:

| Stage | Description |
| --- | --- |
| Audio Input | Raw audio resampled to 16 kHz mono |
| Mel Spectrogram | 128-bin mel filterbank features extracted from the waveform |
| Audio Encoder | 18-layer transformer with block attention; processes mel frames into audio embeddings |
| Text Decoder | 28-layer Qwen3 transformer with grouped-query attention (GQA) and rotary position embeddings (RoPE); autoregressively generates text tokens |
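To get a feel for the feature geometry in the first two stages, the frame arithmetic can be sketched in Swift. The 10 ms hop size is a typical ASR front-end default and an assumption here, not a confirmed Qwen3-ASR parameter:

```swift
// Feature-geometry arithmetic for the front end.
// The hop size is an assumed typical ASR default, not a confirmed Qwen3-ASR value.
let sampleRate = 16_000          // Stage 1: input resampled to 16 kHz mono
let hopSamples = 160             // assumed 10 ms hop
let audioSeconds = 10.0
let totalSamples = Int(audioSeconds * Double(sampleRate))
let melFrames = totalSamples / hopSamples
print("~\(melFrames) mel frames of 128 bins for a \(Int(audioSeconds)) s clip")
// ~1000 frames: under these assumptions the encoder sees roughly 100 frames per second
```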

## Performance

| Backend | RTF | Peak Memory | Notes |
| --- | --- | --- | --- |
| MLX (GPU) | ~0.06 | ~2.2 GB | Default; fastest single-model configuration |
| CoreML + MLX (hybrid) | ~0.09 | ~400 MB (encoder) | Encoder on Neural Engine, decoder on GPU |

Measured on an M2 Max, 64 GB. RTF < 1.0 means faster than real time.
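Concretely, RTF (real-time factor) is just processing time divided by audio duration; a minimal Swift sketch:

```swift
import Foundation

// RTF (real-time factor) = processing time / audio duration.
// RTF < 1.0 means the model transcribes faster than the audio plays.
func realTimeFactor(processingSeconds: Double, audioSeconds: Double) -> Double {
    processingSeconds / audioSeconds
}

// Example: a 60 s clip transcribed in 3.6 s matches the MLX row above.
let rtf = realTimeFactor(processingSeconds: 3.6, audioSeconds: 60.0)
print(String(format: "RTF = %.2f", rtf))  // RTF = 0.06
```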

## Model Variants

| Model | Backend | Size | HuggingFace |
| --- | --- | --- | --- |
| Qwen3-ASR-0.6B (4-bit) | MLX | 680 MB | `aufklarer/Qwen3-ASR-0.6B-MLX-4bit` |
| Qwen3-ASR-0.6B (8-bit) | MLX | 1.0 GB | `aufklarer/Qwen3-ASR-0.6B-MLX-8bit` |
| Qwen3-ASR-0.6B (CoreML INT8) | CoreML | 180 MB | `aufklarer/Qwen3-ASR-CoreML` |
| Qwen3-ASR-1.7B (4-bit) | MLX | 2.1 GB | `aufklarer/Qwen3-ASR-1.7B-MLX-4bit` |
| Qwen3-ASR-1.7B (8-bit) | MLX | 3.2 GB | `aufklarer/Qwen3-ASR-1.7B-MLX-8bit` |

## CLI Usage

Transcribe an audio file with the default Qwen3-ASR model:

```sh
.build/release/audio transcribe recording.wav
```

### Options

```sh
# Use the larger 1.7B model
.build/release/audio transcribe recording.wav --model 1.7b

# Specify language
.build/release/audio transcribe recording.wav --language en

# Streaming mode with partial results
.build/release/audio transcribe recording.wav --stream --partial
```

## Swift API

Use the `Qwen3ASR` module to transcribe audio programmatically:

```swift
import Qwen3ASR

// Load the model (downloads from HuggingFace on first use)
let model = try await Qwen3ASRModel.loadFromHub()

// Transcribe an audio file
let result = try await model.transcribe(audioFile: "recording.wav")
print(result.text)
```

## CoreML Encoder (Neural Engine)

Run the audio encoder on the Neural Engine via CoreML, with the text decoder on GPU via MLX. This hybrid approach lowers power consumption and frees the GPU for concurrent workloads.

```swift
import Qwen3ASR

let encoder = try await CoreMLASREncoder.fromPretrained()
let model = try await Qwen3ASRModel.fromPretrained()
let text = try model.transcribe(
    audio: samples, sampleRate: 16000,
    coremlEncoder: encoder
)
```

From the CLI:

```sh
.build/release/audio transcribe recording.wav --engine qwen3-coreml
```

The INT8 palettized encoder (180 MB, cosine similarity > 0.999 against the float reference) is the default. An INT4 variant (90 MB) is also available for size-constrained deployments.
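If the package exposes variant selection, opting into the smaller INT4 encoder might look like the sketch below. The `variant` parameter is hypothetical, not a documented signature; check the package for the real mechanism:

```swift
import Qwen3ASR

// Hypothetical API sketch: the `variant` parameter is an assumption,
// not a confirmed signature.
let smallEncoder = try await CoreMLASREncoder.fromPretrained(variant: "int4")  // ~90 MB
let model = try await Qwen3ASRModel.fromPretrained()
let text = try model.transcribe(
    audio: samples, sampleRate: 16000,
    coremlEncoder: smallEncoder
)
```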

## Streaming Mode

Streaming mode uses VAD (voice activity detection) to segment audio into chunks and transcribe them incrementally. This is useful for long recordings or real-time processing.

```sh
# Stream with default segment size
.build/release/audio transcribe recording.wav --stream

# Control maximum segment duration
.build/release/audio transcribe recording.wav --stream --max-segment 15

# Show partial (in-progress) results as they arrive
.build/release/audio transcribe recording.wav --stream --partial
```

The `--max-segment` flag controls the maximum chunk duration in seconds. The `--partial` flag enables partial result output, showing words as they are decoded.
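On the Swift side, a streaming consumer might look like the following sketch. `transcribeStream`, `maxSegmentSeconds`, and the event cases are assumptions mirroring the CLI flags, not a documented API:

```swift
import Qwen3ASR

// Hypothetical streaming sketch — method and event names are assumptions
// modeled on the CLI flags above, not confirmed API.
let model = try await Qwen3ASRModel.loadFromHub()
for try await event in model.transcribeStream(audioFile: "recording.wav",
                                              maxSegmentSeconds: 15) {
    switch event {
    case .partial(let text): print("…\(text)")  // in-progress hypothesis
    case .final(let text):   print(text)        // committed segment
    }
}
```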

## Supported Formats

Qwen3-ASR accepts common audio file formats; all input is automatically resampled to 16 kHz mono internally.
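The internal normalization step can be illustrated with AVFoundation. This is a sketch of what resampling to 16 kHz mono involves, not the library's actual implementation:

```swift
import AVFoundation

// Illustration only: decode a file and convert it to 16 kHz mono Float32,
// the shape the pipeline expects. Not the library's internal code.
func load16kMono(url: URL) throws -> [Float] {
    let file = try AVAudioFile(forReading: url)
    guard let outFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                        sampleRate: 16_000,
                                        channels: 1,
                                        interleaved: false),
          let converter = AVAudioConverter(from: file.processingFormat, to: outFormat)
    else { throw NSError(domain: "audio", code: -1) }

    let inBuffer = AVAudioPCMBuffer(pcmFormat: file.processingFormat,
                                    frameCapacity: AVAudioFrameCount(file.length))!
    try file.read(into: inBuffer)

    let ratio = 16_000 / file.processingFormat.sampleRate
    let outCapacity = AVAudioFrameCount(Double(inBuffer.frameLength) * ratio) + 1
    let outBuffer = AVAudioPCMBuffer(pcmFormat: outFormat, frameCapacity: outCapacity)!

    // Feed the whole input buffer once, then signal end of stream.
    var fed = false
    var conversionError: NSError?
    _ = converter.convert(to: outBuffer, error: &conversionError) { _, status in
        if fed { status.pointee = .endOfStream; return nil }
        fed = true
        status.pointee = .haveData
        return inBuffer
    }
    if let conversionError { throw conversionError }
    return Array(UnsafeBufferPointer(start: outBuffer.floatChannelData![0],
                                     count: Int(outBuffer.frameLength)))
}
```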

> [!IMPORTANT]
> Models are downloaded from HuggingFace on first use and cached in `~/Library/Caches/qwen3-speech/`. The 4-bit 0.6B model download is approximately 680 MB.