Qwen3.5 Chat (On-Device LLM)

Qwen3.5-0.8B is a hybrid DeltaNet (linear attention) + GatedAttention model with 24 layers (18 DeltaNet + 6 GatedAttention), quantized to INT4 for MLX (Metal GPU) and INT8 for CoreML (Neural Engine). Runs on Mac via MLX or on iPhone and Mac via CoreML with streaming token generation. Designed for voice pipelines where an on-device LLM provides the "brain" between ASR and TTS.

Voice Pipeline Ready

Qwen3.5 Chat integrates with the SpeechCore VoicePipeline as the LLM component in ASR → LLM → TTS chains. The hybrid DeltaNet architecture provides efficient linear-time attention for long contexts.
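A minimal sketch of that chain, with placeholder ASR/TTS protocols standing in for the real SpeechCore types (the protocol and type names below are illustrative, not the actual VoicePipeline API):

```swift
// Illustrative ASR/TTS stand-ins; the real SpeechCore protocols will differ.
protocol SpeechToText { func transcribe(_ audio: [Float]) -> String }
protocol TextToSpeech { func synthesize(_ text: String) -> [Float] }

struct VoiceTurn {
    let asr: SpeechToText
    let respond: (String) -> String   // the LLM stage, e.g. a wrapper around chat.generate
    let tts: TextToSpeech

    // One conversational turn: audio in, audio out.
    func run(_ audio: [Float]) -> [Float] {
        let userText = asr.transcribe(audio)   // ASR
        let reply = respond(userText)          // LLM "brain"
        return tts.synthesize(reply)           // TTS
    }
}
```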

Quick Start

import Qwen3Chat

let chat = try await Qwen35MLXChat.fromPretrained()

// Single response
let response = try chat.generate("What is Swift?", systemPrompt: "Answer briefly.")
print(response)

// Streaming tokens
let stream = chat.chatStream("Tell me a joke", systemPrompt: "Be funny.")
for try await token in stream {
    print(token, terminator: "")
}

Architecture

Qwen3.5-0.8B is a hybrid model with 24 layers: 18 DeltaNet layers (linear attention with gated delta rule recurrence and RMSNormGated) and 6 GatedAttention layers (standard scaled dot-product attention). The MLX backend runs inference on the Metal GPU with safetensors weights. The CoreML backend uses a dual-model architecture (prefill + decode) optimized for the Neural Engine. Both support KV cache with prompt caching and configurable sampling (temperature, top-k, top-p, repetition penalty).
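The 18 + 6 split can be pictured as an interleaved stack. The even one-in-four placement below is an assumption for illustration; the checkpoint's config defines the real layer positions:

```swift
// Hybrid layer layout sketch: 24 layers, 18 DeltaNet + 6 GatedAttention.
// Assumption: a GatedAttention layer every 4th position; the actual
// placement comes from the model config, not this code.
enum LayerKind { case deltaNet, gatedAttention }

let layout: [LayerKind] = (0..<24).map { i in
    i % 4 == 3 ? .gatedAttention : .deltaNet
}

let attentionCount = layout.filter { $0 == .gatedAttention }.count  // 6
let deltaNetCount = layout.count - attentionCount                   // 18
```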

Model I/O

| Direction | Name | Shape | Description |
|---|---|---|---|
| Input | input_ids | [1, seq_len] | Token IDs (Int32) |
| Input | attention_mask | [1, seq_len] | Attention mask (Int32) |
| Input | kv_cache | per-layer | Key-value cache state |
| Output | logits | [1, 1, 151936] | Next-token logits (Float16) |
| Output | kv_cache_out | per-layer | Updated KV cache |
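A quick sketch of the input shapes from the table, using nested arrays to stand in for tensors (the token IDs are made-up values, not a real tokenization):

```swift
// Decode-step inputs shaped as the I/O table describes.
let tokenIDs: [Int32] = [9707, 11, 1879]                 // illustrative IDs only
let inputIDs: [[Int32]] = [tokenIDs]                     // [1, seq_len]
let attentionMask: [[Int32]] = [tokenIDs.map { _ in 1 }] // [1, seq_len], all ones for a fresh prompt
// The model returns logits of shape [1, 1, 151936]:
// one score per vocabulary entry, for the next position only.
let vocabSize = 151_936
```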

Model Variants

| Variant | Quantization | Size | Compute | HuggingFace |
|---|---|---|---|---|
| Qwen3.5-0.8B Chat | INT4 | 418 MB | Metal GPU (MLX) | aufklarer/Qwen3.5-0.8B-Chat-MLX |
| Qwen3.5-0.8B Chat | INT8 | 981 MB | Neural Engine (CoreML) | aufklarer/Qwen3.5-0.8B-Chat-CoreML |

Sampling Configuration

let config = ChatSamplingConfig(
    temperature: 0.7,
    topK: 40,
    topP: 0.9,
    maxTokens: 128,
    repetitionPenalty: 1.1,
    disableThinking: false,
    maxThinkingTokens: 50
)
let response = try chat.generate("Explain gravity", sampling: config)

| Parameter | Default | Description |
|---|---|---|
| temperature | 0.6 | Randomness (0 = greedy, 1 = creative) |
| topK | 50 | Keep top K candidates |
| topP | 0.95 | Nucleus sampling threshold |
| maxTokens | 512 | Max response tokens |
| repetitionPenalty | 1.1 | Penalize repeated tokens |
| disableThinking | false | Skip thinking mode |
| maxThinkingTokens | 100 | Cap thinking tokens |
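A self-contained sketch of how these knobs typically compose (repetition penalty → temperature → softmax → top-k → top-p). This illustrates the standard sampling technique, not the package's internal implementation:

```swift
import Foundation

// Filter next-token probabilities the way the config's knobs describe.
func filteredProbabilities(logits: [Double], previousTokens: Set<Int>,
                           temperature: Double, topK: Int, topP: Double,
                           repetitionPenalty: Double) -> [Int: Double] {
    // 1. Repetition penalty: make already-generated tokens less likely.
    var adjusted = logits
    for t in previousTokens {
        adjusted[t] = adjusted[t] > 0 ? adjusted[t] / repetitionPenalty
                                      : adjusted[t] * repetitionPenalty
    }
    // 2. Temperature scaling, then a numerically stable softmax.
    let scaled = adjusted.map { $0 / max(temperature, 1e-6) }
    let maxLogit = scaled.max()!
    let exps = scaled.map { exp($0 - maxLogit) }
    let total = exps.reduce(0, +)
    let probs = exps.map { $0 / total }
    // 3. Top-k: keep only the K most likely candidates.
    let ranked = probs.indices.sorted { probs[$0] > probs[$1] }.prefix(topK)
    // 4. Top-p: truncate once cumulative mass reaches the nucleus threshold.
    var nucleus: [Int] = []
    var mass = 0.0
    for i in ranked {
        nucleus.append(i)
        mass += probs[i]
        if mass >= topP { break }
    }
    // Renormalize over the surviving candidates, then sample from this map.
    return Dictionary(uniqueKeysWithValues: nucleus.map { ($0, probs[$0] / mass) })
}
```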

Multi-turn Conversation

let chat = try await Qwen35MLXChat.fromPretrained()

let r1 = try chat.generate("My name is Alex", systemPrompt: "Remember the user's name.")
print(r1)  // "Nice to meet you, Alex!"

let r2 = try chat.generate("What's my name?")
print(r2)  // "Your name is Alex!"

chat.resetConversation()  // Clear history and KV cache
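Under the hood, multi-turn history is flattened into a single prompt. Qwen-family models use a ChatML-style template; the exact special tokens for Qwen3.5 should be verified against the shipped tokenizer config, so treat this as a sketch:

```swift
// Build a ChatML-style prompt from accumulated turns.
// Assumption: <|im_start|>/<|im_end|> delimiters as in earlier Qwen releases.
struct Message { let role: String; let content: String }

func buildPrompt(system: String, turns: [Message]) -> String {
    var parts = ["<|im_start|>system\n\(system)<|im_end|>"]
    for m in turns {
        parts.append("<|im_start|>\(m.role)\n\(m.content)<|im_end|>")
    }
    parts.append("<|im_start|>assistant\n")  // cue the model to reply
    return parts.joined(separator: "\n")
}
```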

Memory Management

// Check memory state
print(chat.isLoaded)        // true
print(chat.memoryFootprint) // 438304768 (~418 MB)

// Free memory under pressure
chat.unload()
print(chat.isLoaded)        // false

// Reload later when needed (fromPretrained returns a fresh instance)
let reloaded = try await Qwen35MLXChat.fromPretrained()

iOS Memory Tip

On iPhone, unloading the LLM before TTS inference frees ~418 MB (INT4 MLX) or ~981 MB (INT8 CoreML), preventing jetsam termination when running full ASR → LLM → TTS pipelines.
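One way to act on this tip is a simple budget check before loading the TTS model. The function and figures below are illustrative (jetsam limits vary by device and OS version):

```swift
// Decide whether the LLM must be unloaded to make room for the TTS model.
// freeBytes: memory currently available while the LLM is resident;
// unloading the LLM reclaims llmFootprint bytes.
func shouldUnloadLLM(freeBytes: Int, llmFootprint: Int, ttsRequirement: Int) -> Bool {
    // Unload only when TTS doesn't fit now but would fit once the LLM is gone.
    ttsRequirement > freeBytes && ttsRequirement <= freeBytes + llmFootprint
}

let mb = 1 << 20
// Example: 600 MB free, INT4 LLM resident (~418 MB), TTS needs ~900 MB.
let mustUnload = shouldUnloadLLM(freeBytes: 600 * mb, llmFootprint: 418 * mb,
                                 ttsRequirement: 900 * mb)  // true: unload first
```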

Performance

| Device | Prefill | Decode | Tokens/sec |
|---|---|---|---|
| M2 Max | ~50 ms | ~65 ms/tok | ~15 tok/s |
| iPhone 16 Pro | ~1.5 s | ~450 ms/tok | ~2.2 tok/s |
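The tokens/sec column is just the reciprocal of the steady-state decode latency:

```swift
// tok/s = 1000 / (ms per token)
let decodeMsPerToken: [Double] = [65, 450]        // M2 Max, iPhone 16 Pro
let tokensPerSec = decodeMsPerToken.map { 1000 / $0 }
// ≈ 15.4 and ≈ 2.2, matching the table (which rounds to ~15 tok/s).
```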

Conversion

MLX weights are converted from the original Qwen3.5-0.8B checkpoint using the MLX conversion script. CoreML models use a separate conversion script for Neural Engine deployment. Pre-converted weights are available on HuggingFace at aufklarer/Qwen3.5-0.8B-Chat-MLX (INT4: 418 MB) and aufklarer/Qwen3.5-0.8B-Chat-CoreML (INT8: 981 MB).