Qwen3.5 Chat (On-Device LLM)

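Qwen3.5-0.8B is a hybrid DeltaNet (linear attention) + GatedAttention model with 24 layers (18 DeltaNet + 6 GatedAttention). It is quantized to INT4 for MLX (Metal GPU) and to INT8 for CoreML (Neural Engine). It runs on Mac via MLX and on both iPhone and Mac via CoreML, with streaming token generation. It is designed for voice pipelines, where it serves as the on-device LLM "brain" between ASR and TTS.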

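Voice Pipeline Ready

Qwen3.5 Chat plugs into the SpeechCore VoicePipeline as the LLM component of an ASR → LLM → TTS chain. The DeltaNet layers of the hybrid architecture provide efficient linear-time attention over long contexts.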

Quick Start

import Qwen3Chat

let chat = try await Qwen35MLXChat.fromPretrained()

// Single response
let response = try chat.generate(messages: [
    ChatMessage(role: .system, content: "Answer briefly."),
    ChatMessage(role: .user, content: "What is Swift?")
])
print(response)

// Streaming tokens
let stream = chat.generateStream(messages: [
    ChatMessage(role: .system, content: "Be funny."),
    ChatMessage(role: .user, content: "Tell me a joke")
])
for try await token in stream {
    print(token, terminator: "")
}

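Architecture

Qwen3.5-0.8B is a 24-layer hybrid model: 18 DeltaNet layers (linear attention with a gated delta-rule recurrence and RMSNormGated) and 6 GatedAttention layers (standard scaled dot-product attention). The MLX backend runs inference on the Metal GPU using safetensors weights. The CoreML backend uses a dual-model architecture (prefill + decode) optimized for the Neural Engine. Both backends support a KV cache with prompt caching, as well as configurable sampling (temperature, top-k, top-p, repetition penalty).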
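
The 18 + 6 split can be pictured as a per-layer schedule. The sketch below is illustrative only: the text gives the layer totals but not their ordering, so the 3:1 interleaving (every fourth layer is GatedAttention) is an assumption, and `LayerKind` and `layerSchedule` are hypothetical names rather than part of the library.

```swift
/// The two layer types named in the architecture description.
enum LayerKind { case deltaNet, gatedAttention }

/// Builds a schedule in which every `interval`-th layer is GatedAttention.
/// ASSUMPTION: the interleaving pattern is not stated in the text; only the
/// totals (18 DeltaNet + 6 GatedAttention) are.
func layerSchedule(layerCount: Int = 24, interval: Int = 4) -> [LayerKind] {
    (0..<layerCount).map { ($0 + 1) % interval == 0 ? .gatedAttention : .deltaNet }
}

let schedule = layerSchedule()
let deltaNetCount = schedule.filter { $0 == .deltaNet }.count
let gatedCount = schedule.filter { $0 == .gatedAttention }.count
print("DeltaNet layers: \(deltaNetCount), GatedAttention layers: \(gatedCount)")
```

Under this assumed pattern the counts match the stated totals. Only the GatedAttention layers need a key-value cache that grows with sequence length, while the DeltaNet layers carry a fixed-size recurrent state.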

Model I/O

Direction	Name	Shape	Description
Input	`input_ids`	[1, seq_len]	Token IDs (Int32)
Input	`attention_mask`	[1, seq_len]	Attention mask (Int32)
Input	`kv_cache`	Per layer	Key-value cache state
Output	`logits`	[1, 1, 151936]	Next-token logits (Float16)
Output	`kv_cache_out`	Per layer	Updated KV cache

Model Variants

Variant	Quantization	Size	Compute	HuggingFace
Qwen3.5-0.8B Chat	INT4	418 MB	Metal GPU (MLX)	aufklarer/Qwen3.5-0.8B-Chat-MLX
Qwen3.5-0.8B Chat	INT8	981 MB	Neural Engine (CoreML)	aufklarer/Qwen3.5-0.8B-Chat-CoreML

Sampling Configuration

let config = ChatSamplingConfig(
    temperature: 0.7,
    topK: 40,
    topP: 0.9,
    maxTokens: 128,
    repetitionPenalty: 1.1
)
let response = try chat.generate(
    messages: [ChatMessage(role: .user, content: "Explain gravity")],
    sampling: config
)

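Parameter	Default	Description
`temperature`	0.6	Randomness (0 = greedy, 1 = creative)
`topK`	50	Keep the top K candidates
`topP`	0.95	Nucleus sampling threshold
`maxTokens`	512	Maximum number of tokens in the response
`repetitionPenalty`	1.1	Penalty applied to repeated tokens
`disableThinking`	false	Skip thinking mode
`maxThinkingTokens`	100	Cap on thinking tokens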
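
To make the interaction between these parameters concrete, here is a minimal sketch of one common way to combine repetition penalty, temperature, top-k and top-p when picking the next token from a logits vector. It is not the library's implementation: `sampleToken` is a hypothetical helper, the order of the steps and the penalty convention (divide positive logits, multiply negative ones) are assumptions, and the defaults simply mirror the table above.

```swift
import Foundation

/// Illustrative next-token sampler (hypothetical; not part of the library).
func sampleToken(
    logits: [Float],
    previousTokens: [Int] = [],
    temperature: Float = 0.6,
    topK: Int = 50,
    topP: Float = 0.95,
    repetitionPenalty: Float = 1.1
) -> Int {
    precondition(!logits.isEmpty, "logits must not be empty")
    var adjusted = logits

    // Repetition penalty: lower the score of tokens that were already generated.
    for token in Set(previousTokens) where adjusted.indices.contains(token) {
        let value = adjusted[token]
        adjusted[token] = value > 0 ? value / repetitionPenalty : value * repetitionPenalty
    }

    // Token IDs ordered from highest to lowest adjusted logit.
    let ranked = adjusted.indices.sorted { adjusted[$0] > adjusted[$1] }

    // Temperature 0 means greedy decoding: always take the best token.
    if temperature <= 0 { return ranked[0] }

    // Top-k: keep only the K best candidates.
    let kept = Array(ranked.prefix(max(1, topK)))

    // Softmax with temperature over the kept candidates (max subtracted for stability).
    let maxLogit = adjusted[kept[0]]
    let weights = kept.map { exp((adjusted[$0] - maxLogit) / temperature) }
    let total = weights.reduce(0, +)
    let probs = weights.map { $0 / total }

    // Top-p: keep the smallest prefix whose cumulative probability reaches topP.
    var cumulative: Float = 0
    var nucleusSize = 0
    for p in probs {
        cumulative += p
        nucleusSize += 1
        if cumulative >= topP { break }
    }

    // Draw from the nucleus, renormalized by its total probability mass.
    let nucleusTotal = probs.prefix(nucleusSize).reduce(0, +)
    var draw = Float.random(in: 0..<nucleusTotal)
    for i in 0..<nucleusSize {
        draw -= probs[i]
        if draw < 0 { return kept[i] }
    }
    return kept[nucleusSize - 1]
}

// Greedy decoding picks the index of the largest logit.
let greedyChoice = sampleToken(logits: [0.1, 2.0, -1.0], temperature: 0)
print(greedyChoice)
```

Lower `temperature`, `topK` or `topP` values narrow the candidate set and make output more deterministic; higher values widen it.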

Multi-Turn Conversation

let chat = try await Qwen35MLXChat.fromPretrained()

let history = [
    ChatMessage(role: .system, content: "Remember the user's name."),
    ChatMessage(role: .user, content: "My name is Alex"),
    ChatMessage(role: .assistant, content: "Nice to meet you, Alex!"),
    ChatMessage(role: .user, content: "What's my name?")
]
let response = try chat.generate(messages: history)
print(response)  // e.g. "Your name is Alex!"

chat.resetState()  // Clear inference state for a new conversation

Memory Management

// Check memory state
print(chat.isLoaded)        // true
print(chat.memoryFootprint) // 438304768 (~418 MB)

// Free memory under pressure
chat.unload()
print(chat.isLoaded)        // false

// Reload when needed
let reloadedChat = try await Qwen35MLXChat.fromPretrained()

iOS Memory Tip

On iPhone, unloading the LLM before TTS inference frees about 418 MB (INT4 MLX) or about 981 MB (INT8 CoreML), which helps avoid jetsam termination when running the full ASR → LLM → TTS pipeline.
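
The unload-before-TTS pattern can be sketched as a small wrapper. Everything here is hypothetical scaffolding: `UnloadableModel` only mirrors the `isLoaded` and `unload()` members shown in the memory management snippet, `MockLLM` stands in for the real chat model, and the string returned by the closure stands in for a TTS call.

```swift
/// Hypothetical protocol mirroring the `isLoaded` / `unload()` members used above.
protocol UnloadableModel: AnyObject {
    var isLoaded: Bool { get }
    func unload()
}

/// Frees the LLM's memory, then runs the memory-hungry `synthesize` step.
func speak<T>(afterUnloading llm: UnloadableModel, _ synthesize: () throws -> T) rethrows -> T {
    if llm.isLoaded { llm.unload() }
    return try synthesize()
}

/// Stand-in for the chat model so the sketch runs on its own.
final class MockLLM: UnloadableModel {
    private(set) var isLoaded = true
    func unload() { isLoaded = false }
}

let llm = MockLLM()
// The closure is a placeholder for the real TTS inference call.
let audio = speak(afterUnloading: llm) { "synthesized audio for: Hello" }
print(llm.isLoaded, audio)
```

The trade-off is reload latency: the model has to be loaded again with `fromPretrained()` before the next LLM turn.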

Performance

Device	Prefill	Decode	Tokens/sec
M2 Max	~50ms	~65ms/tok	~15 tok/s
iPhone 16 Pro	~1.5s	~450ms/tok	~2.2 tok/s
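
As a rough worked example, the figures in the table can be turned into an end-to-end latency estimate: total time ≈ prefill + tokens × decode time per token. The sketch below assumes prefill is a fixed cost (the table does not say which prompt length it was measured with) and uses a hypothetical `DevicePerf` type; it is back-of-the-envelope arithmetic, not a benchmark.

```swift
/// Hypothetical container for the approximate figures from the table above.
struct DevicePerf {
    let name: String
    let prefillMs: Double
    let decodeMsPerToken: Double

    /// Decode throughput implied by the per-token decode time.
    var tokensPerSecond: Double { 1000.0 / decodeMsPerToken }

    /// Estimated wall-clock seconds to generate `n` tokens after one prefill.
    func estimatedSeconds(forTokens n: Int) -> Double {
        (prefillMs + Double(n) * decodeMsPerToken) / 1000.0
    }
}

let m2Max = DevicePerf(name: "M2 Max", prefillMs: 50, decodeMsPerToken: 65)
let iPhone = DevicePerf(name: "iPhone 16 Pro", prefillMs: 1500, decodeMsPerToken: 450)

for device in [m2Max, iPhone] {
    let seconds = device.estimatedSeconds(forTokens: 128)
    print("\(device.name): \(device.tokensPerSecond) tok/s, 128 tokens in about \(seconds) s")
}
```

For a 128-token reply this works out to roughly 8.4 s on the M2 Max and roughly 59 s on the iPhone 16 Pro, which is why a small `maxTokens` and streaming output matter for voice use on the phone.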

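Model Conversion

The MLX weights are converted from the original Qwen3.5-0.8B checkpoint with the MLX conversion script. The CoreML model is produced by a separate conversion script that targets the Neural Engine. Pre-converted weights are available on HuggingFace at aufklarer/Qwen3.5-0.8B-Chat-MLX (INT4: 418 MB) and aufklarer/Qwen3.5-0.8B-Chat-CoreML (INT8: 981 MB).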