Qwen3 Chat (On-Device LLM)

Qwen3-0.6B is a compact language model quantized to INT4 (318 MB) or INT8 (571 MB) for CoreML deployment. It runs on the iPhone and Mac Neural Engine with streaming token generation, and is designed for voice pipelines where an on-device LLM provides the "brain" between ASR and TTS.

Voice Pipeline Ready

Qwen3Chat integrates with the SpeechCore VoicePipeline as the LLM component in ASR → LLM → TTS chains. Thinking mode support allows the model to reason before responding.
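The ASR → LLM → TTS chain can be sketched as three stages wired in sequence. This is an illustrative sketch only: the protocol and type names below are assumptions for the example, not the SpeechCore VoicePipeline API.

```swift
// Illustrative wiring of an ASR → LLM → TTS chain with stub components;
// these protocol/type names are assumptions, not SpeechCore's real API.
protocol SpeechToText { func transcribe(_ audio: [Float]) -> String }
protocol TextToSpeech { func synthesize(_ text: String) -> [Float] }

struct Pipeline {
    let asr: SpeechToText
    let llm: (String) -> String      // e.g. a closure wrapping the chat model
    let tts: TextToSpeech

    // Audio in, audio out: transcribe, generate a reply, synthesize it.
    func respond(to audio: [Float]) -> [Float] {
        tts.synthesize(llm(asr.transcribe(audio)))
    }
}
```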

Quick Start

```swift
import Qwen3Chat

let chat = try await Qwen3ChatModel.fromPretrained()

// Single response
let response = try chat.generate("What is Swift?", systemPrompt: "Answer briefly.")
print(response)

// Streaming tokens
let stream = chat.chatStream("Tell me a joke", systemPrompt: "Be funny.")
for try await token in stream {
    print(token, terminator: "")
}
```

Architecture

Qwen3Chat uses a dual-model architecture optimized for CoreML: a prefill model for batch prompt processing (throughput-optimized) and a decode model for single-token generation (latency-optimized). KV cache with prompt caching preserves the system prompt across turns. Sampling is configurable with temperature, top-k, top-p, and repetition penalty.
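The control flow of the dual-model design can be sketched with stubbed models: prefill consumes the whole prompt in one batched pass and seeds the KV cache, then decode extends the sequence one token at a time, reusing and growing that cache. The function bodies below are toy stand-ins, not the real CoreML calls.

```swift
// Sketch of the prefill/decode generation loop. The "models" here are
// toy stubs (each predicts token + 1); only the loop structure mirrors
// the dual-model architecture described above.
struct KVCache { var length: Int }   // stand-in for per-layer tensors

func prefill(promptIDs: [Int32]) -> (cache: KVCache, next: Int32) {
    // One batched pass over all prompt tokens (throughput-optimized).
    (KVCache(length: promptIDs.count), promptIDs.last! &+ 1)
}

func decodeStep(token: Int32, cache: inout KVCache) -> Int32 {
    // One single-token pass (latency-optimized); the cache grows by one.
    cache.length += 1
    return token &+ 1
}

func generate(promptIDs: [Int32], maxTokens: Int) -> [Int32] {
    var (cache, next) = prefill(promptIDs: promptIDs)
    var output: [Int32] = []
    for _ in 0..<maxTokens {
        output.append(next)
        next = decodeStep(token: next, cache: &cache)
    }
    return output
}
```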

Model I/O

| Direction | Name | Shape | Description |
|-----------|------|-------|-------------|
| Input | input_ids | [1, seq_len] | Token IDs (Int32) |
| Input | attention_mask | [1, seq_len] | Attention mask (Int32) |
| Input | kv_cache | per-layer | Key-value cache state |
| Output | logits | [1, 1, 151936] | Next-token logits (Float16) |
| Output | kv_cache_out | per-layer | Updated KV cache |
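Shaping the two token inputs can be illustrated with plain arrays standing in for CoreML MLMultiArray values. The pad token ID of 0 here is an assumption for the sketch; the real tokenizer defines its own padding.

```swift
// Sketch of building the [1, seq_len] input_ids and attention_mask from
// the I/O table above, with plain Int32 arrays in place of MLMultiArray.
// Pad ID 0 is an assumption for illustration.
func makeInputs(tokenIDs: [Int32], seqLen: Int) -> (inputIDs: [[Int32]], attentionMask: [[Int32]]) {
    let padCount = max(0, seqLen - tokenIDs.count)
    let padded = tokenIDs + Array(repeating: Int32(0), count: padCount)
    let mask = tokenIDs.map { _ in Int32(1) } + Array(repeating: Int32(0), count: padCount)
    // Leading dimension of 1 is the batch axis from the table.
    return ([Array(padded.prefix(seqLen))], [Array(mask.prefix(seqLen))])
}
```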

Model Variants

| Variant | Quantization | Size | Compute | HuggingFace |
|---------|--------------|------|---------|-------------|
| Qwen3-0.6B Chat | INT4 | 318 MB | Neural Engine | aufklarer/Qwen3-0.6B-Chat-CoreML |
| Qwen3-0.6B Chat | INT8 | 571 MB | Neural Engine | aufklarer/Qwen3-0.6B-Chat-CoreML |

Sampling Configuration

```swift
let config = ChatSamplingConfig(
    temperature: 0.7,
    topK: 40,
    topP: 0.9,
    maxTokens: 128,
    repetitionPenalty: 1.1,
    disableThinking: false,
    maxThinkingTokens: 50
)
let response = try chat.generate("Explain gravity", sampling: config)
```

| Parameter | Default | Description |
|-----------|---------|-------------|
| temperature | 0.6 | Randomness (0 = greedy, 1 = creative) |
| topK | 50 | Keep top K candidates |
| topP | 0.95 | Nucleus sampling threshold |
| maxTokens | 512 | Max response tokens |
| repetitionPenalty | 1.1 | Penalize repeated tokens |
| disableThinking | false | Skip thinking mode |
| maxThinkingTokens | 100 | Cap thinking tokens |
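The sampling chain these parameters control (repetition penalty, temperature scaling, top-k, then top-p) can be sketched in a few lines. This is a generic, self-contained sketch of the standard technique, not the Qwen3Chat internals.

```swift
import Foundation

// Minimal sketch of repetition penalty → temperature → top-k → top-p
// sampling over raw logits. Names and structure are illustrative.
func sampleToken(
    logits: [Double],
    history: Set<Int>,
    temperature: Double = 0.6,
    topK: Int = 50,
    topP: Double = 0.95,
    repetitionPenalty: Double = 1.1
) -> Int {
    var adjusted = logits
    // Repetition penalty: push down logits of already-generated tokens.
    for id in history {
        adjusted[id] = adjusted[id] > 0 ? adjusted[id] / repetitionPenalty
                                        : adjusted[id] * repetitionPenalty
    }
    // temperature == 0 means greedy decoding (argmax).
    guard temperature > 0 else {
        return adjusted.indices.max { adjusted[$0] < adjusted[$1] }!
    }
    let scaled = adjusted.map { $0 / temperature }
    // Keep only the top-k candidates, highest logit first.
    let ranked = scaled.indices.sorted { scaled[$0] > scaled[$1] }.prefix(topK)
    // Softmax over the kept candidates.
    let maxLogit = scaled[ranked.first!]
    let exps = ranked.map { exp(scaled[$0] - maxLogit) }
    let total = exps.reduce(0, +)
    var probs = exps.map { $0 / total }
    // Top-p: cut the tail once cumulative probability reaches topP.
    var cumulative = 0.0
    var cutoff = probs.count
    for (i, p) in probs.enumerated() {
        cumulative += p
        if cumulative >= topP { cutoff = i + 1; break }
    }
    let kept = Array(ranked.prefix(cutoff))
    probs = Array(probs.prefix(cutoff))
    // Sample from the renormalized truncated distribution.
    let r = Double.random(in: 0..<probs.reduce(0, +))
    var acc = 0.0
    for (i, p) in probs.enumerated() {
        acc += p
        if r < acc { return kept[i] }
    }
    return kept.last!
}
```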

Multi-turn Conversation

```swift
let chat = try await Qwen3ChatModel.fromPretrained()

let r1 = try chat.generate("My name is Alex", systemPrompt: "Remember the user's name.")
print(r1)  // "Nice to meet you, Alex!"

let r2 = try chat.generate("What's my name?")
print(r2)  // "Your name is Alex!"

chat.resetConversation()  // Clear history and KV cache
```
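Under the hood, multi-turn chat amounts to rendering the accumulated history into one prompt before each generation. Qwen models use ChatML-style turns, and Qwen3Chat manages this internally; the exact template string below is an assumption for illustration.

```swift
// Illustrative sketch of rendering multi-turn history into a single
// prompt. The ChatML-style template here is an assumption; Qwen3Chat
// handles templating internally.
struct Turn { let role: String; let content: String }

func renderPrompt(system: String, history: [Turn]) -> String {
    var prompt = "<|im_start|>system\n\(system)<|im_end|>\n"
    for turn in history {
        prompt += "<|im_start|>\(turn.role)\n\(turn.content)<|im_end|>\n"
    }
    prompt += "<|im_start|>assistant\n"   // the model completes from here
    return prompt
}
```

Because the system prompt is always the leading segment, its KV-cache entries can be preserved across turns, which is what the prompt caching described under Architecture exploits.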

Memory Management

```swift
// Check memory state
print(chat.isLoaded)        // true
print(chat.memoryFootprint) // 333447168 (~318 MB)

// Free memory under pressure
chat.unload()
print(chat.isLoaded)        // false

// Reload when needed (a fresh instance, since `chat` was declared with let)
let reloadedChat = try await Qwen3ChatModel.fromPretrained()
```

iOS Memory Tip

On iPhone, unloading the LLM before TTS inference frees ~318 MB, preventing jetsam termination when running full ASR → LLM → TTS pipelines.

Performance

| Device | Prefill | Decode | Tokens/sec |
|--------|---------|--------|------------|
| M2 Max | ~50 ms | ~65 ms/tok | ~15 tok/s |
| iPhone 16 Pro | ~1.5 s | ~450 ms/tok | ~2.2 tok/s |
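The tokens/sec column follows directly from the per-token decode latency, and time to first response token is roughly prefill plus one decode step:

```swift
// Throughput is the reciprocal of per-token decode latency.
func tokensPerSecond(decodeSecondsPerToken: Double) -> Double {
    1.0 / decodeSecondsPerToken
}

let m2Max = tokensPerSecond(decodeSecondsPerToken: 0.065)   // 1 / 0.065 ≈ 15.4 tok/s
let iphone = tokensPerSecond(decodeSecondsPerToken: 0.45)   // 1 / 0.45 ≈ 2.2 tok/s
let firstTokenM2 = 0.050 + 0.065                            // ≈ 115 ms to first token
```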

Conversion

The CoreML model is converted from the original Qwen3-0.6B weights with the convert_qwen3_chat_coreml.py script, which supports --quantize int4 (318 MB, faster) and --quantize int8 (571 MB, higher quality). Pre-converted weights are available on HuggingFace at aufklarer/Qwen3-0.6B-Chat-CoreML.