# Qwen3-ASR
Qwen3-ASR is a state-of-the-art multilingual automatic speech recognition model. It runs on-device using Metal GPU acceleration via MLX, with 4-bit quantization for efficient memory usage, and is available in 0.6B and 1.7B parameter variants.
## Pipeline
The Qwen3-ASR inference pipeline processes audio through four stages:
| Stage | Description |
|---|---|
| Audio Input | Raw audio resampled to 16 kHz mono |
| Mel Spectrogram | 128-bin mel filterbank features extracted from the waveform |
| Audio Encoder | 18-layer transformer with block attention, processes mel frames into audio embeddings |
| Text Decoder | 28-layer Qwen3 transformer with grouped-query attention (GQA) and rotary position embeddings (RoPE), autoregressively generates text tokens |
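The frame geometry of the mel front-end can be sketched. The 25 ms window and 10 ms hop below are typical defaults for mel filterbank extraction, not values confirmed by the model card:

```swift
import Foundation

/// Sketch of the front-end frame math, assuming a 25 ms window and
/// 10 ms hop (400 / 160 samples at 16 kHz) -- common defaults,
/// not confirmed by the Qwen3-ASR source.
func melFrameCount(sampleCount: Int, windowSize: Int = 400, hop: Int = 160) -> Int {
    guard sampleCount >= windowSize else { return 0 }
    return 1 + (sampleCount - windowSize) / hop
}

// 10 seconds of 16 kHz audio -> roughly 100 frames per second.
let frames = melFrameCount(sampleCount: 10 * 16_000)
print(frames)  // 998
```

Under these assumptions the encoder sees about 100 mel frames per second of audio, which it then compresses into audio embeddings for the decoder.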
## Performance
| Backend | RTF | Peak Memory | Notes |
|---|---|---|---|
| MLX (GPU) | ~0.06 | ~2.2 GB | Default, fastest single-model |
| CoreML + MLX (hybrid) | ~0.09 | ~400 MB (encoder) | Encoder on Neural Engine, decoder on GPU |
Measured on an M2 Max with 64 GB unified memory. RTF < 1.0 means faster than real time.
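RTF (real-time factor) is processing time divided by audio duration; a minimal helper makes the definition concrete:

```swift
import Foundation

/// Real-time factor: processing time divided by audio duration.
/// RTF < 1.0 means the model keeps up with (or beats) real time.
func realTimeFactor(processingSeconds: Double, audioSeconds: Double) -> Double {
    precondition(audioSeconds > 0, "audio duration must be positive")
    return processingSeconds / audioSeconds
}

// At RTF ~0.06, a 60 s recording takes about 3.6 s to transcribe.
print(realTimeFactor(processingSeconds: 3.6, audioSeconds: 60))
```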
## Model Variants
| Model | Backend | Size | HuggingFace |
|---|---|---|---|
| Qwen3-ASR-0.6B (4-bit) | MLX | 680 MB | aufklarer/Qwen3-ASR-0.6B-MLX-4bit |
| Qwen3-ASR-0.6B (8-bit) | MLX | 1.0 GB | aufklarer/Qwen3-ASR-0.6B-MLX-8bit |
| Qwen3-ASR-0.6B (CoreML INT8) | CoreML | 180 MB | aufklarer/Qwen3-ASR-CoreML |
| Qwen3-ASR-1.7B (4-bit) | MLX | 2.1 GB | aufklarer/Qwen3-ASR-1.7B-MLX-4bit |
| Qwen3-ASR-1.7B (8-bit) | MLX | 3.2 GB | aufklarer/Qwen3-ASR-1.7B-MLX-8bit |
## CLI Usage
Transcribe an audio file with the default Qwen3-ASR model:

```bash
.build/release/audio transcribe recording.wav
```
### Options
```bash
# Use the larger 1.7B model
.build/release/audio transcribe recording.wav --model 1.7b

# Specify language
.build/release/audio transcribe recording.wav --language en

# Streaming mode with partial results
.build/release/audio transcribe recording.wav --stream --partial
```
## Swift API
Use the `Qwen3ASR` module to transcribe audio programmatically:

```swift
import Qwen3ASR

// Load the model (downloads from HuggingFace on first use)
let model = try await Qwen3ASRModel.loadFromHub()

// Transcribe an audio file
let result = try await model.transcribe(audioFile: "recording.wav")
print(result.text)
```
## CoreML Encoder (Neural Engine)
Run the audio encoder on the Neural Engine via CoreML, with the text decoder on GPU via MLX. This hybrid approach lowers power consumption and frees the GPU for concurrent workloads.
```swift
import Qwen3ASR

let encoder = try await CoreMLASREncoder.fromPretrained()
let model = try await Qwen3ASRModel.fromPretrained()
let text = try await model.transcribe(
    audio: samples, sampleRate: 16000,
    coremlEncoder: encoder
)
```

```bash
# CLI
.build/release/audio transcribe recording.wav --engine qwen3-coreml
```
The INT8 palettized encoder (180 MB, cosine similarity > 0.999 against the MLX reference) is the default. An INT4 variant (90 MB) is also available for size-constrained deployments.
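The cosine-similarity figure quoted above compares quantized encoder outputs against the reference; a minimal sketch of the comparison itself (not the actual validation harness):

```swift
import Foundation

/// Cosine similarity between two embedding vectors -- the metric used
/// above to validate the quantized CoreML encoder against the MLX
/// reference. Values near 1.0 mean the outputs are nearly parallel.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count && !a.isEmpty)
    var dot: Float = 0, na: Float = 0, nb: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    return dot / (na.squareRoot() * nb.squareRoot())
}
```

In practice the check would be run over encoder outputs for a batch of validation clips, asserting the similarity stays above 0.999.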
## Streaming Mode
Streaming mode uses VAD (voice activity detection) to segment audio into chunks and transcribe them incrementally. This is useful for long recordings or real-time processing.
```bash
# Stream with default segment size
.build/release/audio transcribe recording.wav --stream

# Control maximum segment duration
.build/release/audio transcribe recording.wav --stream --max-segment 15

# Show partial (in-progress) results as they arrive
.build/release/audio transcribe recording.wav --stream --partial
```
The `--max-segment` flag controls the maximum chunk duration in seconds. The `--partial` flag enables partial result output, showing words as they are decoded.
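The segmentation logic can be sketched as follows. This is a hypothetical illustration of VAD-driven chunking; the frame size, energy threshold, and minimum segment length are illustrative assumptions, not the actual Qwen3ASR implementation:

```swift
import Foundation

/// Hypothetical sketch of VAD-driven segmentation: cut at low-energy
/// (silent) frames once a segment is long enough, and force a cut at
/// the maximum segment duration. Thresholds are illustrative only.
func segmentBoundaries(samples: [Float],
                       sampleRate: Int = 16_000,
                       maxSegmentSeconds: Double = 15) -> [Range<Int>] {
    let frame = sampleRate / 100                     // 10 ms analysis frames
    let minLen = sampleRate                          // don't cut segments under 1 s
    let maxLen = Int(maxSegmentSeconds * Double(sampleRate))
    let threshold: Float = 1e-4                      // illustrative energy floor
    var segments: [Range<Int>] = []
    var start = 0, i = 0
    while i + frame <= samples.count {
        let energy = samples[i..<i+frame].reduce(Float(0)) { $0 + $1 * $1 } / Float(frame)
        let len = i + frame - start
        if (energy < threshold && len >= minLen) || len >= maxLen {
            segments.append(start..<(i + frame))
            start = i + frame
        }
        i += frame
    }
    if start < samples.count { segments.append(start..<samples.count) }
    return segments
}
```

Each returned range would then be transcribed independently, with `--partial` emitting decoder output as each chunk is processed.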
## Supported Formats
Qwen3-ASR accepts the following audio formats. All input is automatically resampled to 16 kHz mono internally.
- WAV — uncompressed PCM
- M4A — AAC-encoded audio
- MP3 — MPEG Layer III
- CAF — Apple Core Audio Format
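Whatever the container, the audio must end up as 16 kHz mono. A minimal sketch of that normalization, using channel averaging for the downmix and linear interpolation for the resample (the actual pipeline likely uses a higher-quality windowed-sinc resampler):

```swift
import Foundation

/// Downmix interleaved multichannel audio to mono by averaging channels,
/// then resample to the target rate with linear interpolation.
/// Simplified sketch; production code would use a better resampler.
func toMono16k(interleaved: [Float], channels: Int, sourceRate: Int,
               targetRate: Int = 16_000) -> [Float] {
    // 1. Downmix: average the channels of each frame.
    let frames = interleaved.count / channels
    var mono = [Float](repeating: 0, count: frames)
    for f in 0..<frames {
        var sum: Float = 0
        for c in 0..<channels { sum += interleaved[f * channels + c] }
        mono[f] = sum / Float(channels)
    }
    guard sourceRate != targetRate else { return mono }
    // 2. Resample: linear interpolation between neighboring samples.
    let ratio = Double(sourceRate) / Double(targetRate)
    let outCount = Int(Double(frames) / ratio)
    var out = [Float](repeating: 0, count: outCount)
    for i in 0..<outCount {
        let pos = Double(i) * ratio
        let j = min(Int(pos), frames - 2)
        let t = Float(pos - Double(j))
        out[i] = mono[j] * (1 - t) + mono[j + 1] * t
    }
    return out
}
```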
Models are downloaded from HuggingFace on first use and cached in `~/Library/Caches/qwen3-speech/`. The 4-bit 0.6B model is approximately 680 MB.