Qwen3.5 Chat (On-Device LLM)
Qwen3.5-0.8B is a hybrid DeltaNet (linear attention) + GatedAttention model with 24 layers (18 DeltaNet + 6 GatedAttention), quantized to INT4 for MLX (Metal GPU) and INT8 for CoreML (Neural Engine). It runs on Mac via MLX, and on both iPhone and Mac via CoreML, with streaming token generation in either backend. It is designed for voice pipelines where an on-device LLM provides the "brain" between ASR and TTS.
Qwen3.5 Chat integrates with the SpeechCore VoicePipeline as the LLM component in ASR → LLM → TTS chains. The hybrid DeltaNet architecture provides efficient linear-time attention for long contexts.
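A minimal sketch of that chain, with hypothetical `transcribe`/`speak` stand-ins for the ASR and TTS stages (only `Qwen35MLXChat` below comes from this package; substitute your SpeechCore components):
import Qwen3Chat
// Hypothetical ASR/TTS stand-ins; replace with real SpeechCore components.
func transcribe() async throws -> String { "What is linear attention?" }  // ASR
func speak(_ text: String) async throws { print("🔊 \(text)") }           // TTS
let chat = try await Qwen35MLXChat.fromPretrained()
let userText = try await transcribe()                                     // ASR
let reply = try chat.generate(userText, systemPrompt: "Answer in one sentence.") // LLM
try await speak(reply)                                                    // TTS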
Quick Start
import Qwen3Chat
let chat = try await Qwen35MLXChat.fromPretrained()
// Single response
let response = try chat.generate("What is Swift?", systemPrompt: "Answer briefly.")
print(response)
// Streaming tokens
let stream = chat.chatStream("Tell me a joke", systemPrompt: "Be funny.")
for try await token in stream {
    print(token, terminator: "")
}
Architecture
Qwen3.5-0.8B is a hybrid model with 24 layers: 18 DeltaNet layers (linear attention with gated delta rule recurrence and RMSNormGated) and 6 GatedAttention layers (standard scaled dot-product attention). The MLX backend runs inference on the Metal GPU with safetensors weights. The CoreML backend uses a dual-model architecture (prefill + decode) optimized for the Neural Engine. Both support KV cache with prompt caching and configurable sampling (temperature, top-k, top-p, repetition penalty).
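The document doesn't spell out the interleaving; assuming the common 3:1 repeating pattern (three DeltaNet layers, then one GatedAttention layer), the layer schedule would look like this sketch:
enum LayerKind { case deltaNet, gatedAttention }
// Assumed schedule: every 4th layer is full attention (18 + 6 = 24 total).
let schedule: [LayerKind] = (0..<24).map { i in
    (i + 1) % 4 == 0 ? .gatedAttention : .deltaNet
}
// `schedule` yields 18 .deltaNet and 6 .gatedAttention layers.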
Model I/O
| Direction | Name | Shape | Description |
|---|---|---|---|
| Input | input_ids | [1, seq_len] | Token IDs (Int32) |
| Input | attention_mask | [1, seq_len] | Attention mask (Int32) |
| Input | kv_cache | per-layer | Key-value cache state |
| Output | logits | [1, 1, 151936] | Next-token logits (Float16) |
| Output | kv_cache_out | per-layer | Updated KV cache |
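As a toy illustration of this I/O contract, a greedy decode step just takes the argmax over the 151,936-entry logits vector and feeds it back as the next input_ids value (the real backends sample instead; see the sampling configuration below):
// Toy greedy step: pick the highest-scoring token from [1, 1, 151936] logits,
// flattened to [Float].
func nextToken(fromLogits logits: [Float]) -> Int32 {
    precondition(logits.count == 151_936, "expected full-vocab logits")
    var best = 0
    for i in 1..<logits.count where logits[i] > logits[best] { best = i }
    return Int32(best) // feed back as the next input_ids value
}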
Model Variants
| Variant | Quantization | Size | Compute | HuggingFace |
|---|---|---|---|---|
| Qwen3.5-0.8B Chat | INT4 | 418 MB | Metal GPU (MLX) | aufklarer/Qwen3.5-0.8B-Chat-MLX |
| Qwen3.5-0.8B Chat | INT8 | 981 MB | Neural Engine (CoreML) | aufklarer/Qwen3.5-0.8B-Chat-CoreML |
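Only the MLX class appears in this document; assuming a parallel `Qwen35CoreMLChat` type (a hypothetical name, not confirmed here) for the CoreML variant, per-platform selection could look like:
#if os(iOS)
// iPhone: prefer the INT8 CoreML variant on the Neural Engine.
// NOTE: `Qwen35CoreMLChat` is an assumed name, not confirmed by this document.
let chat = try await Qwen35CoreMLChat.fromPretrained()
#else
// Mac: prefer the INT4 MLX variant on the Metal GPU.
let chat = try await Qwen35MLXChat.fromPretrained()
#endif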
Sampling Configuration
let config = ChatSamplingConfig(
    temperature: 0.7,
    topK: 40,
    topP: 0.9,
    maxTokens: 128,
    repetitionPenalty: 1.1,
    disableThinking: false,
    maxThinkingTokens: 50
)
let response = try chat.generate("Explain gravity", sampling: config)
| Parameter | Default | Description |
|---|---|---|
| temperature | 0.6 | Randomness (0 = greedy, 1 = creative) |
| topK | 50 | Keep top K candidates |
| topP | 0.95 | Nucleus sampling threshold |
| maxTokens | 512 | Max response tokens |
| repetitionPenalty | 1.1 | Penalize repeated tokens |
| disableThinking | false | Skip thinking mode |
| maxThinkingTokens | 100 | Cap thinking tokens |
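To make the knobs concrete, here is an illustrative sampler showing how they typically interact (repetition penalty, then temperature, then top-k, then top-p); this is a sketch, not the package's internal implementation:
import Foundation
// Illustrative sampler; shows how the parameters interact, not the package's code.
func sample(logits: [Float], recent: Set<Int>,
            temperature: Float = 0.6, topK: Int = 50,
            topP: Float = 0.95, repetitionPenalty: Float = 1.1) -> Int {
    var scores = logits
    // 1. Repetition penalty: push down tokens that already appeared.
    for t in recent {
        scores[t] = scores[t] > 0 ? scores[t] / repetitionPenalty
                                  : scores[t] * repetitionPenalty
    }
    // 2. Temperature: <1 sharpens, >1 flattens; 0 would mean greedy argmax.
    if temperature > 0 {
        for i in scores.indices { scores[i] /= temperature }
    }
    // 3. Top-k: keep the k highest-scoring candidates.
    let ranked = Array(scores.indices.sorted { scores[$0] > scores[$1] }.prefix(topK))
    // 4. Softmax over the survivors, then top-p (nucleus) cutoff.
    let maxScore = scores[ranked[0]]
    let weights = ranked.map { exp(scores[$0] - maxScore) }
    let z = weights.reduce(0, +)
    var cumulative: Float = 0
    var nucleus: [(token: Int, p: Float)] = []
    for (tok, w) in zip(ranked, weights) {
        let p = w / z
        nucleus.append((tok, p))
        cumulative += p
        if cumulative >= topP { break }
    }
    // 5. Draw from the (renormalized) nucleus.
    var r = Float.random(in: 0..<cumulative)
    for (tok, p) in nucleus {
        r -= p
        if r <= 0 { return tok }
    }
    return nucleus[nucleus.count - 1].token
}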
Multi-turn Conversation
let chat = try await Qwen35MLXChat.fromPretrained()
let r1 = try chat.generate("My name is Alex", systemPrompt: "Remember the user's name.")
print(r1) // "Nice to meet you, Alex!"
let r2 = try chat.generate("What's my name?")
print(r2) // "Your name is Alex!"
chat.resetConversation() // Clear history and KV cache
Memory Management
// Check memory state
print(chat.isLoaded)        // true
print(chat.memoryFootprint) // 438304768 bytes (~418 MB)

// Free memory under pressure
chat.unload()
print(chat.isLoaded)        // false

// Reload when needed (fromPretrained creates a fresh instance)
let chat = try await Qwen35MLXChat.fromPretrained()
On iPhone, unloading the LLM before TTS inference frees ~418 MB (INT4 MLX) or ~981 MB (INT8 CoreML), preventing jetsam (out-of-memory) termination when running full ASR → LLM → TTS pipelines.
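One way to wire this up is to unload on the system memory-pressure signal and lazily reload before the next turn. A sketch (the holder type and policy are illustrative, not part of the package):
import Dispatch
import Qwen3Chat
final class ChatHolder {
    private var chat: Qwen35MLXChat?
    private let pressure = DispatchSource.makeMemoryPressureSource(
        eventMask: [.warning, .critical], queue: .main)

    init() {
        // Drop the model's ~418 MB the moment the OS signals pressure.
        pressure.setEventHandler { [weak self] in self?.chat?.unload() }
        pressure.resume()
    }

    // Lazily (re)load before each use.
    func current() async throws -> Qwen35MLXChat {
        if let chat, chat.isLoaded { return chat }
        let fresh = try await Qwen35MLXChat.fromPretrained()
        chat = fresh
        return fresh
    }
}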
Performance
| Device | Prefill latency | Decode latency | Throughput |
|---|---|---|---|
| M2 Max | ~50ms | ~65ms/tok | ~15 tok/s |
| iPhone 16 Pro | ~1.5s | ~450ms/tok | ~2.2 tok/s |
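To reproduce the throughput figure on your own hardware, you can time the streaming API directly (a rough measurement: it counts stream elements and includes prefill in the elapsed time, so it reads slightly below the pure decode rate):
import Foundation
import Qwen3Chat
let chat = try await Qwen35MLXChat.fromPretrained()
let start = Date()
var tokens = 0
for try await _ in chat.chatStream("Count from 1 to 20.", systemPrompt: "Be terse.") {
    tokens += 1
}
let seconds = Date().timeIntervalSince(start)
print(String(format: "%d tokens in %.2fs ≈ %.1f tok/s",
             tokens, seconds, Double(tokens) / seconds))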
Conversion
MLX weights are converted from the original Qwen3.5-0.8B checkpoint using the MLX conversion script. CoreML models use a separate conversion script for Neural Engine deployment. Pre-converted weights are available on HuggingFace at aufklarer/Qwen3.5-0.8B-Chat-MLX (INT4: 418 MB) and aufklarer/Qwen3.5-0.8B-Chat-CoreML (INT8: 981 MB).