# Qwen3 Chat (On-Device LLM)
Qwen3-0.6B is a compact language model quantized to INT4 (318 MB) or INT8 (571 MB) for CoreML deployment. It runs on the Neural Engine on iPhone and Mac with streaming token generation, and is designed for voice pipelines where an on-device LLM provides the "brain" between ASR and TTS.
Qwen3Chat integrates with the SpeechCore VoicePipeline as the LLM component in ASR → LLM → TTS chains. Thinking mode support allows the model to reason before responding.
## Quick Start
```swift
import Qwen3Chat

let chat = try await Qwen3ChatModel.fromPretrained()

// Single response
let response = try chat.generate("What is Swift?", systemPrompt: "Answer briefly.")
print(response)

// Streaming tokens
let stream = chat.chatStream("Tell me a joke", systemPrompt: "Be funny.")
for try await token in stream {
    print(token, terminator: "")
}
```
## Architecture
Qwen3Chat uses a dual-model architecture optimized for CoreML: a prefill model for batch prompt processing (throughput-optimized) and a decode model for single-token generation (latency-optimized). KV cache with prompt caching preserves the system prompt across turns. Sampling is configurable with temperature, top-k, top-p, and repetition penalty.
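The prefill/decode split can be sketched as a generation loop. This is a hypothetical outline of the internals, not the actual Qwen3Chat implementation; `prefillModel`, `decodeModel`, `sample`, and `eosTokenID` are illustrative names:

```swift
// Sketch of the dual-model generation loop (hypothetical internals).
func generateTokens(promptIDs: [Int32], maxTokens: Int) -> [Int32] {
    // 1. Prefill: run the whole prompt through the throughput-optimized
    //    model once, populating the KV cache and producing the first logits.
    var (logits, kvCache) = prefillModel.run(tokenIDs: promptIDs)
    var output: [Int32] = []
    for _ in 0..<maxTokens {
        // 2. Sample the next token from the logits (temperature/top-k/top-p).
        let next = sample(logits)
        if next == eosTokenID { break }
        output.append(next)
        // 3. Decode: feed only the new token through the latency-optimized
        //    model, reusing and extending the KV cache.
        (logits, kvCache) = decodeModel.run(tokenID: next, cache: kvCache)
    }
    return output
}
```

Prompt caching means step 1 can be skipped for the system-prompt portion on later turns, since its KV entries are already in the cache.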
## Model I/O
| Direction | Name | Shape | Description |
|---|---|---|---|
| Input | input_ids | [1, seq_len] | Token IDs (Int32) |
| Input | attention_mask | [1, seq_len] | Attention mask (Int32) |
| Input | kv_cache | per-layer | Key-value cache state |
| Output | logits | [1, 1, 151936] | Next-token logits (Float16) |
| Output | kv_cache_out | per-layer | Updated KV cache |
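If you drive the underlying CoreML models directly, the inputs in the table above can be built with `MLMultiArray`. A minimal sketch, assuming the feature shapes match the table (the token IDs shown are arbitrary examples):

```swift
import CoreML

// Build input_ids and attention_mask tensors for a 4-token prompt,
// matching the [1, seq_len] Int32 shapes in the I/O table.
let seqLen = 4
let tokenIDs: [Int32] = [100, 200, 300, 400] // example token IDs

let inputIDs = try MLMultiArray(shape: [1, NSNumber(value: seqLen)], dataType: .int32)
let attentionMask = try MLMultiArray(shape: [1, NSNumber(value: seqLen)], dataType: .int32)
for i in 0..<seqLen {
    inputIDs[i] = NSNumber(value: tokenIDs[i])
    attentionMask[i] = 1 // attend to every prompt token
}
```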
## Model Variants
| Variant | Quantization | Size | Compute | HuggingFace |
|---|---|---|---|---|
| Qwen3-0.6B Chat | INT4 | 318 MB | Neural Engine | aufklarer/Qwen3-0.6B-Chat-CoreML |
| Qwen3-0.6B Chat | INT8 | 571 MB | Neural Engine | aufklarer/Qwen3-0.6B-Chat-CoreML |
## Sampling Configuration
```swift
let config = ChatSamplingConfig(
    temperature: 0.7,
    topK: 40,
    topP: 0.9,
    maxTokens: 128,
    repetitionPenalty: 1.1,
    disableThinking: false,
    maxThinkingTokens: 50
)
let response = try chat.generate("Explain gravity", sampling: config)
```
| Parameter | Default | Description |
|---|---|---|
| temperature | 0.6 | Randomness (0 = greedy, higher = more creative) |
| topK | 50 | Keep only the top K candidate tokens |
| topP | 0.95 | Nucleus sampling probability threshold |
| maxTokens | 512 | Maximum response tokens |
| repetitionPenalty | 1.1 | Penalty applied to already-generated tokens |
| disableThinking | false | Skip thinking mode |
| maxThinkingTokens | 100 | Cap on thinking-mode tokens |
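To make the parameters concrete, here is a toy next-token sampler combining temperature, top-k, top-p, and repetition penalty. It is illustrative only, not the actual Qwen3Chat sampling code:

```swift
import Foundation

// Toy sampler over a logits vector (one entry per vocabulary token).
func sampleToken(logits: [Double], previous: Set<Int>,
                 temperature: Double = 0.6, topK: Int = 50,
                 topP: Double = 0.95, repetitionPenalty: Double = 1.1) -> Int {
    // 1. Repetition penalty: push down tokens we have already emitted.
    var adjusted = logits
    for id in previous {
        adjusted[id] = adjusted[id] > 0 ? adjusted[id] / repetitionPenalty
                                        : adjusted[id] * repetitionPenalty
    }
    // 2. Temperature 0 means greedy argmax.
    guard temperature > 0 else {
        return adjusted.indices.max { adjusted[$0] < adjusted[$1] }!
    }
    // 3. Softmax over temperature-scaled logits (max-subtracted for stability).
    let scaled = adjusted.map { $0 / temperature }
    let maxL = scaled.max()!
    let exps = scaled.map { exp($0 - maxL) }
    let total = exps.reduce(0, +)
    var candidates = exps.enumerated()
        .map { (id: $0.offset, p: $0.element / total) }
        .sorted { $0.p > $1.p }
    // 4. Top-k: keep only the K most likely tokens.
    candidates = Array(candidates.prefix(topK))
    // 5. Top-p: keep the smallest prefix whose probability mass reaches topP.
    var cumulative = 0.0
    var nucleus: [(id: Int, p: Double)] = []
    for c in candidates {
        nucleus.append(c)
        cumulative += c.p
        if cumulative >= topP { break }
    }
    // 6. Draw proportionally from the surviving nucleus.
    var r = Double.random(in: 0..<cumulative)
    for c in nucleus {
        r -= c.p
        if r <= 0 { return c.id }
    }
    return nucleus.last!.id
}
```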
## Multi-turn Conversation
```swift
let chat = try await Qwen3ChatModel.fromPretrained()

let r1 = try chat.generate("My name is Alex", systemPrompt: "Remember the user's name.")
print(r1) // "Nice to meet you, Alex!"

let r2 = try chat.generate("What's my name?")
print(r2) // "Your name is Alex!"

chat.resetConversation() // Clear history and KV cache
```
## Memory Management
```swift
// Check memory state
print(chat.isLoaded)        // true
print(chat.memoryFootprint) // 333447168 (~318 MB)

// Free memory under pressure
chat.unload()
print(chat.isLoaded)        // false

// Reload when needed
let reloaded = try await Qwen3ChatModel.fromPretrained()
```
On iPhone, unloading the LLM before TTS inference frees ~318 MB, preventing jetsam termination when running full ASR → LLM → TTS pipelines.
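A sketch of that unload-before-TTS pattern in a turn handler; `asr.transcribe` and `tts.synthesize` are hypothetical stand-ins for your ASR/TTS components, not part of Qwen3Chat:

```swift
// One voice turn: ASR → LLM → TTS, unloading the LLM before synthesis
// so peak memory stays under the jetsam limit.
let userText = try await asr.transcribe(audio)   // hypothetical ASR call
let reply = try chat.generate(userText, systemPrompt: "Answer briefly.")
chat.unload()                                    // frees ~318 MB before TTS
let speech = try await tts.synthesize(reply)     // hypothetical TTS call
```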
## Performance
| Device | Prefill | Decode | Tokens/sec |
|---|---|---|---|
| M2 Max | ~50ms | ~65ms/tok | ~15 tok/s |
| iPhone 16 Pro | ~1.5s | ~450ms/tok | ~2.2 tok/s |
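These figures give a rough end-to-end latency estimate: prefill time plus response length times per-token decode time. A quick back-of-the-envelope helper:

```swift
// Rough response-time estimate from the table above (all times in seconds).
func estimatedLatency(prefill: Double, perToken: Double, tokens: Int) -> Double {
    prefill + perToken * Double(tokens)
}

let m2Max  = estimatedLatency(prefill: 0.05, perToken: 0.065, tokens: 50) // ≈ 3.3 s
let iPhone = estimatedLatency(prefill: 1.5,  perToken: 0.45,  tokens: 50) // ≈ 24 s
```

This is why the maxTokens and maxThinkingTokens caps matter on iPhone: at ~450 ms/token, every extra token adds nearly half a second of latency.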
## Conversion
The CoreML model is converted from the original Qwen3-0.6B weights using the `convert_qwen3_chat_coreml.py` script. It supports `--quantize int4` (318 MB, faster) and `--quantize int8` (571 MB, higher quality). Pre-converted weights are available on HuggingFace at aufklarer/Qwen3-0.6B-Chat-CoreML.
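Assuming the script takes only the quantization flag described above (any input-weights arguments it requires may differ), a conversion run would look like:

```shell
# INT4 (smaller, faster)
python convert_qwen3_chat_coreml.py --quantize int4

# INT8 (larger, higher quality)
python convert_qwen3_chat_coreml.py --quantize int8
```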