# Qwen3 Chat (On-Device LLM)
Qwen3-0.6B is a compact language model quantized to INT4 (318 MB) or INT8 (571 MB) for CoreML deployment. It runs on the Neural Engine on iPhone and Mac with streaming token generation, and is designed for voice pipelines where an on-device LLM provides the "brain" between ASR and TTS.
Qwen3Chat integrates with the SpeechCore VoicePipeline as the LLM component in ASR → LLM → TTS chains. Thinking mode support allows the model to reason before responding.
## Quick Start
```swift
import Qwen3Chat

let chat = try await Qwen3ChatModel.fromPretrained()

// Single response
let response = try chat.generate("What is Swift?", systemPrompt: "Answer briefly.")
print(response)

// Streaming tokens
let stream = chat.chatStream("Tell me a joke", systemPrompt: "Be funny.")
for try await token in stream {
    print(token, terminator: "")
}
```
## Architecture
Qwen3Chat uses a dual-model architecture optimized for CoreML: a prefill model for batch prompt processing (throughput-optimized) and a decode model for single-token generation (latency-optimized). KV cache with prompt caching preserves the system prompt across turns. Sampling is configurable with temperature, top-k, top-p, and repetition penalty.
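The prefill/decode split can be sketched as a generation loop. This is a hypothetical outline of the internals, not the actual Qwen3Chat implementation; `prefillModel`, `decodeModel`, `sample`, and `eosTokenID` are illustrative names:

```swift
// Sketch of the dual-model generation loop (hypothetical internals).
func generateTokens(promptIDs: [Int32], maxTokens: Int) -> [Int32] {
    // 1. Prefill: run the whole prompt through the throughput-optimized
    //    model once, populating the KV cache and producing the first logits.
    var (logits, kvCache) = prefillModel.run(tokenIDs: promptIDs)
    var output: [Int32] = []
    for _ in 0..<maxTokens {
        // 2. Sample the next token from the logits (temperature/top-k/top-p).
        let next = sample(logits)
        if next == eosTokenID { break }
        output.append(next)
        // 3. Decode: feed only the new token through the latency-optimized
        //    model, reusing and extending the KV cache.
        (logits, kvCache) = decodeModel.run(tokenID: next, cache: kvCache)
    }
    return output
}
```

Prompt caching means step 1 can be skipped for the system-prompt portion on later turns, since its KV entries are already in the cache.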
## Model I/O
| Direction | Name | Shape | Description |
|---|---|---|---|
| Input | input_ids | [1, seq_len] | Token IDs (Int32) |
| Input | attention_mask | [1, seq_len] | Attention mask (Int32) |
| Input | kv_cache | per-layer | Key-value cache state |
| Output | logits | [1, 1, 151936] | Next-token logits (Float16) |
| Output | kv_cache_out | per-layer | Updated KV cache |
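If you drive the underlying CoreML models directly, the inputs in the table above can be built with `MLMultiArray`. A minimal sketch, assuming the feature shapes match the table (the token IDs shown are arbitrary examples):

```swift
import CoreML

// Build input_ids and attention_mask tensors for a 4-token prompt,
// matching the [1, seq_len] Int32 shapes in the I/O table.
let seqLen = 4
let tokenIDs: [Int32] = [100, 200, 300, 400] // example token IDs

let inputIDs = try MLMultiArray(shape: [1, NSNumber(value: seqLen)], dataType: .int32)
let attentionMask = try MLMultiArray(shape: [1, NSNumber(value: seqLen)], dataType: .int32)
for i in 0..<seqLen {
    inputIDs[i] = NSNumber(value: tokenIDs[i])
    attentionMask[i] = 1 // attend to every prompt token
}
```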
## Model Variants
| Variant | Quantization | Size | Compute | HuggingFace |
|---|---|---|---|---|
| Qwen3-0.6B Chat | INT4 | 318 MB | Neural Engine | aufklarer/Qwen3-0.6B-Chat-CoreML |
| Qwen3-0.6B Chat | INT8 | 571 MB | Neural Engine | aufklarer/Qwen3-0.6B-Chat-CoreML |
## Sampling Configuration
```swift
let config = ChatSamplingConfig(
    temperature: 0.7,
    topK: 40,
    topP: 0.9,
    maxTokens: 128,
    repetitionPenalty: 1.1,
    disableThinking: false,
    maxThinkingTokens: 50
)
let response = try chat.generate("Explain gravity", sampling: config)
```
| Parameter | Default | Description |
|---|---|---|
| temperature | 0.6 | Randomness (0 = greedy, higher = more creative) |
| topK | 50 | Keep only the top K candidate tokens |
| topP | 0.95 | Nucleus sampling probability threshold |
| maxTokens | 512 | Maximum response tokens |
| repetitionPenalty | 1.1 | Penalty applied to already-generated tokens |
| disableThinking | false | Skip thinking mode |
| maxThinkingTokens | 100 | Cap on thinking-mode tokens |
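To make the parameters concrete, here is a toy next-token sampler combining temperature, top-k, top-p, and repetition penalty. It is illustrative only, not the actual Qwen3Chat sampling code:

```swift
import Foundation

// Toy sampler over a logits vector (one entry per vocabulary token).
func sampleToken(logits: [Double], previous: Set<Int>,
                 temperature: Double = 0.6, topK: Int = 50,
                 topP: Double = 0.95, repetitionPenalty: Double = 1.1) -> Int {
    // 1. Repetition penalty: push down tokens we have already emitted.
    var adjusted = logits
    for id in previous {
        adjusted[id] = adjusted[id] > 0 ? adjusted[id] / repetitionPenalty
                                        : adjusted[id] * repetitionPenalty
    }
    // 2. Temperature 0 means greedy argmax.
    guard temperature > 0 else {
        return adjusted.indices.max { adjusted[$0] < adjusted[$1] }!
    }
    // 3. Softmax over temperature-scaled logits (max-subtracted for stability).
    let scaled = adjusted.map { $0 / temperature }
    let maxL = scaled.max()!
    let exps = scaled.map { exp($0 - maxL) }
    let total = exps.reduce(0, +)
    var candidates = exps.enumerated()
        .map { (id: $0.offset, p: $0.element / total) }
        .sorted { $0.p > $1.p }
    // 4. Top-k: keep only the K most likely tokens.
    candidates = Array(candidates.prefix(topK))
    // 5. Top-p: keep the smallest prefix whose probability mass reaches topP.
    var cumulative = 0.0
    var nucleus: [(id: Int, p: Double)] = []
    for c in candidates {
        nucleus.append(c)
        cumulative += c.p
        if cumulative >= topP { break }
    }
    // 6. Draw proportionally from the surviving nucleus.
    var r = Double.random(in: 0..<cumulative)
    for c in nucleus {
        r -= c.p
        if r <= 0 { return c.id }
    }
    return nucleus.last!.id
}
```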
## Multi-turn Conversation
```swift
let chat = try await Qwen3ChatModel.fromPretrained()

let r1 = try chat.generate("My name is Alex", systemPrompt: "Remember the user's name.")
print(r1) // "Nice to meet you, Alex!"

let r2 = try chat.generate("What's my name?")
print(r2) // "Your name is Alex!"

chat.resetConversation() // Clear history and KV cache
```
## Memory Management
```swift
// Check memory state
print(chat.isLoaded)        // true
print(chat.memoryFootprint) // 333447168 (~318 MB)

// Free memory under pressure
chat.unload()
print(chat.isLoaded)        // false

// Reload when needed
let reloaded = try await Qwen3ChatModel.fromPretrained()
```
On iPhone, unloading the LLM before TTS inference frees ~318 MB, preventing jetsam termination when running full ASR → LLM → TTS pipelines.
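A sketch of that unload-before-TTS pattern in a turn handler; `asr.transcribe` and `tts.synthesize` are hypothetical stand-ins for your ASR/TTS components, not part of Qwen3Chat:

```swift
// One voice turn: ASR → LLM → TTS, unloading the LLM before synthesis
// so peak memory stays under the jetsam limit.
let userText = try await asr.transcribe(audio)   // hypothetical ASR call
let reply = try chat.generate(userText, systemPrompt: "Answer briefly.")
chat.unload()                                    // frees ~318 MB before TTS
let speech = try await tts.synthesize(reply)     // hypothetical TTS call
```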
## Performance
| Device | Prefill | Decode | Tokens/sec |
|---|---|---|---|
| M2 Max | ~50ms | ~65ms/tok | ~15 tok/s |
| iPhone 16 Pro | ~1.5s | ~450ms/tok | ~2.2 tok/s |
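These figures give a rough end-to-end latency estimate: prefill time plus response length times per-token decode time. A quick back-of-the-envelope helper:

```swift
// Rough response-time estimate from the table above (all times in seconds).
func estimatedLatency(prefill: Double, perToken: Double, tokens: Int) -> Double {
    prefill + perToken * Double(tokens)
}

let m2Max  = estimatedLatency(prefill: 0.05, perToken: 0.065, tokens: 50) // ≈ 3.3 s
let iPhone = estimatedLatency(prefill: 1.5,  perToken: 0.45,  tokens: 50) // ≈ 24 s
```

This is why the maxTokens and maxThinkingTokens caps matter on iPhone: at ~450 ms/token, every extra token adds nearly half a second of latency.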
## Conversion
The CoreML model is converted from the original Qwen3-0.6B weights using the `convert_qwen3_chat_coreml.py` script. It supports `--quantize int4` (318 MB, faster) and `--quantize int8` (571 MB, higher quality). Pre-converted weights are available on HuggingFace at aufklarer/Qwen3-0.6B-Chat-CoreML.
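Assuming the script takes only the quantization flag described above (any input-weights arguments it requires may differ), a conversion run would look like:

```shell
# INT4 (smaller, faster)
python convert_qwen3_chat_coreml.py --quantize int4

# INT8 (larger, higher quality)
python convert_qwen3_chat_coreml.py --quantize int8
```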