Qwen3.5 Chat (On-Device LLM)

Qwen3.5-0.8B is a 24-layer hybrid model that combines DeltaNet (linear attention) with GatedAttention: 18 DeltaNet layers plus 6 GatedAttention layers. It is quantized to INT4 for MLX (Metal GPU) and to INT8 for CoreML (Neural Engine). The model runs on Mac via MLX, or on iPhone and Mac via CoreML, with streaming token generation. It is designed for voice pipelines, where the on-device LLM acts as the "brain" between ASR and TTS.

Ready for voice pipelines

Qwen3.5 Chat plugs into SpeechCore's VoicePipeline as the LLM component of the ASR → LLM → TTS chain. The hybrid DeltaNet architecture provides efficient linear attention for long contexts.

Quick start

import Qwen3Chat

let chat = try await Qwen35MLXChat.fromPretrained()

// Single response
let response = try chat.generate(messages: [
    ChatMessage(role: .system, content: "Answer briefly."),
    ChatMessage(role: .user, content: "What is Swift?")
])
print(response)

// Streaming tokens
let stream = chat.generateStream(messages: [
    ChatMessage(role: .system, content: "Be funny."),
    ChatMessage(role: .user, content: "Tell me a joke")
])
for try await token in stream {
    print(token, terminator: "")
}
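
In a voice pipeline, streamed tokens are usually grouped into sentences before they reach the TTS stage, so speech can start before generation finishes. The following is a minimal sketch of that buffering step. `SentenceChunker` and its methods are hypothetical helpers written for this illustration; they are not part of Qwen3Chat or SpeechCore.

```swift
import Foundation

// Hypothetical helper (not a library type): groups streamed token strings
// into sentences so a TTS stage can start speaking early.
struct SentenceChunker {
    private var buffer = ""
    private let terminators: Set<Character> = [".", "!", "?"]

    // Feed one token; returns a complete sentence when a terminator arrives.
    mutating func push(_ token: String) -> String? {
        buffer += token
        guard let last = token.last, terminators.contains(last) else { return nil }
        return flush()
    }

    // Returns whatever is buffered; call once after the stream ends.
    mutating func flush() -> String? {
        let sentence = buffer.trimmingCharacters(in: .whitespacesAndNewlines)
        buffer = ""
        return sentence.isEmpty ? nil : sentence
    }
}

// Simulated token stream; in an app these would come from generateStream.
var chunker = SentenceChunker()
var sentences: [String] = []
for token in ["Swift", " is", " fast", ".", " It", " is", " safe", "!", " Try", " it"] {
    if let sentence = chunker.push(token) { sentences.append(sentence) }
}
if let rest = chunker.flush() { sentences.append(rest) }
print(sentences)
```

The same `push` call could sit inside the `for try await token in stream` loop shown above, with each returned sentence handed to the TTS stage.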

Architecture

Qwen3.5-0.8B is a 24-layer hybrid model: 18 DeltaNet layers (linear attention with a gated delta rule recurrence and RMSNormGated) and 6 GatedAttention layers (standard scaled dot-product attention). The MLX backend runs inference on the Metal GPU using safetensors weights. The CoreML backend uses a two-model design (prefill + decode) optimized for the Neural Engine. Both backends support a KV cache with prompt caching, plus configurable sampling (temperature, top-k, top-p, repetition penalty).
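
To make the 18 + 6 split concrete, here is a small sketch that builds one possible layer layout. The ordering is an assumption: the text above gives only the totals, so the sketch guesses that every fourth layer is GatedAttention (three DeltaNet layers followed by one GatedAttention layer, repeated six times). `LayerKind` and `layerLayout` are illustrative names, not library types.

```swift
// Hypothetical layer-kind enum; the names are illustrative only.
enum LayerKind: String {
    case deltaNet = "DeltaNet"
    case gatedAttention = "GatedAttention"
}

// ASSUMPTION: every fourth layer uses full attention. The documentation
// states the totals (18 + 6) but not the interleaving order.
func layerLayout(totalLayers: Int = 24, fullAttentionInterval: Int = 4) -> [LayerKind] {
    (0..<totalLayers).map { index in
        (index + 1) % fullAttentionInterval == 0
            ? LayerKind.gatedAttention
            : LayerKind.deltaNet
    }
}

let layout = layerLayout()
let deltaNetCount = layout.filter { $0 == .deltaNet }.count
let gatedCount = layout.filter { $0 == .gatedAttention }.count
print("DeltaNet: \(deltaNetCount), GatedAttention: \(gatedCount)")
```

Under this assumed layout the counts come out to 18 and 6, matching the totals stated above.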

Model inputs / outputs

Direction	Name	Shape	Description
Input	`input_ids`	[1, seq_len]	Token IDs (Int32)
Input	`attention_mask`	[1, seq_len]	Attention mask (Int32)
Input	`kv_cache`	per layer	Key-value cache state
Output	`logits`	[1, 1, 151936]	Next-token logits (Float16)
Output	`kv_cache_out`	per layer	Updated KV cache

Model variants

Variant	Quantization	Size	Compute	HuggingFace
Qwen3.5-0.8B Chat	INT4	418 MB	Metal GPU (MLX)	aufklarer/Qwen3.5-0.8B-Chat-MLX
Qwen3.5-0.8B Chat	INT8	981 MB	Neural Engine (CoreML)	aufklarer/Qwen3.5-0.8B-Chat-CoreML

Sampling configuration

let config = ChatSamplingConfig(
    temperature: 0.7,
    topK: 40,
    topP: 0.9,
    maxTokens: 128,
    repetitionPenalty: 1.1
)
let response = try chat.generate(
    messages: [ChatMessage(role: .user, content: "Explain gravity")],
    sampling: config
)

Parameter	Default	Description
`temperature`	0.6	Randomness (0 = greedy, 1 = creative)
`topK`	50	Keep the top K candidates
`topP`	0.95	Nucleus sampling threshold
`maxTokens`	512	Maximum number of tokens in the response
`repetitionPenalty`	1.1	Penalizes repeated tokens
`disableThinking`	false	Skips thinking mode
`maxThinkingTokens`	100	Caps the number of thinking tokens
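
The sketch below shows one common way these four knobs interact, applied to a toy logits vector. It is an illustration under stated assumptions, not the library's sampler: `filteredProbabilities` is a hypothetical function, and the order of operations (repetition penalty, then temperature, then top-k, then top-p) is one conventional choice that the actual implementation may not share. It returns the filtered distribution instead of drawing a random token, so the result is deterministic.

```swift
import Foundation

// Hypothetical helper, not part of the library: applies the sampling knobs
// to raw logits and returns the renormalized candidate distribution.
func filteredProbabilities(
    logits: [Double],
    previousTokens: [Int],
    temperature: Double = 0.6,
    topK: Int = 50,
    topP: Double = 0.95,
    repetitionPenalty: Double = 1.1
) -> [Int: Double] {
    var adjusted = logits
    // Repetition penalty: shrink positive logits, push negative ones lower.
    for token in Set(previousTokens) where adjusted.indices.contains(token) {
        adjusted[token] = adjusted[token] > 0
            ? adjusted[token] / repetitionPenalty
            : adjusted[token] * repetitionPenalty
    }
    // Temperature 0 means greedy decoding: all mass on the argmax.
    guard temperature > 0, let maxLogit = adjusted.max() else {
        guard let best = adjusted.indices.max(by: { adjusted[$0] < adjusted[$1] }) else {
            return [:]
        }
        return [best: 1.0]
    }
    // Softmax with temperature (subtract the max for numerical stability).
    let weights = adjusted.map { exp(($0 - maxLogit) / temperature) }
    let total = weights.reduce(0.0, +)
    // Rank tokens by probability and keep the top K.
    let ranked = weights.enumerated()
        .map { (token: $0.offset, p: $0.element / total) }
        .sorted { $0.p > $1.p }
        .prefix(max(1, topK))
    // Top-p: keep the smallest prefix whose cumulative mass reaches topP.
    var kept: [(token: Int, p: Double)] = []
    var cumulative = 0.0
    for entry in ranked {
        kept.append(entry)
        cumulative += entry.p
        if cumulative >= topP { break }
    }
    // Renormalize the surviving candidates.
    let keptTotal = kept.reduce(0.0) { $0 + $1.p }
    return Dictionary(uniqueKeysWithValues: kept.map { ($0.token, $0.p / keptTotal) })
}

let probs = filteredProbabilities(
    logits: [2.0, 1.0, 0.5, -1.0], previousTokens: [0], topK: 2, topP: 1.0)
let greedy = filteredProbabilities(
    logits: [2.0, 1.0, 0.5, -1.0], previousTokens: [], temperature: 0)
print(probs, greedy)
```

With `topK: 2` only tokens 0 and 1 survive, and their probabilities are renormalized to sum to 1. With `temperature: 0` the function puts all mass on token 0.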

Multi-turn conversation

let chat = try await Qwen35MLXChat.fromPretrained()

let history = [
    ChatMessage(role: .system, content: "Remember the user's name."),
    ChatMessage(role: .user, content: "My name is Alex"),
    ChatMessage(role: .assistant, content: "Nice to meet you, Alex!"),
    ChatMessage(role: .user, content: "What's my name?")
]
let response = try chat.generate(messages: history)
print(response)  // e.g. "Your name is Alex!" (exact wording varies with sampling)

chat.resetState()  // Clear inference state for a new conversation
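
Because the full history is passed on every call, an app typically needs to keep that array bounded. Below is a minimal sketch of a sliding-window history that always keeps the system prompt. `Role`, `Message`, and `Conversation` are hypothetical stand-ins so the example compiles on its own; in an app the library's `ChatMessage` would take their place.

```swift
// Stand-in types (hypothetical) so the sketch is self-contained.
enum Role { case system, user, assistant }

struct Message: Equatable {
    let role: Role
    let content: String
}

struct Conversation {
    private(set) var messages: [Message]
    let maxTurns: Int  // maximum number of non-system messages to keep

    init(systemPrompt: String, maxTurns: Int = 8) {
        self.messages = [Message(role: .system, content: systemPrompt)]
        self.maxTurns = maxTurns
    }

    mutating func append(_ role: Role, _ content: String) {
        messages.append(Message(role: role, content: content))
        // Drop the oldest non-system messages once the window is exceeded.
        let overflow = (messages.count - 1) - maxTurns
        if overflow > 0 {
            messages.removeSubrange(1..<(1 + overflow))
        }
    }
}

var conversation = Conversation(systemPrompt: "Remember the user's name.", maxTurns: 4)
conversation.append(.user, "My name is Alex")
conversation.append(.assistant, "Nice to meet you, Alex!")
conversation.append(.user, "What's my name?")
print(conversation.messages.count)  // system prompt plus three turns
```

Trimming trades memory for recall: once the turn that mentions the name falls out of the window, the model can no longer answer from history alone.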

Memory management

// Check memory state
print(chat.isLoaded)        // true
print(chat.memoryFootprint) // 438304768 (~418 MB)

// Free memory under pressure
chat.unload()
print(chat.isLoaded)        // false

// Reload when needed (bound to a new name so it does not redeclare `chat` in the same scope)
let reloadedChat = try await Qwen35MLXChat.fromPretrained()

Memory management tip for iOS

On iPhone, unloading the LLM before running TTS inference frees about 418 MB (INT4 MLX) or about 981 MB (INT8 CoreML). This helps prevent jetsam from terminating the process when running the full ASR → LLM → TTS pipeline.
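
The tip above can be sketched as a small helper that unloads models until a memory target is met. `UnloadableModel` is a hypothetical protocol that mirrors the `isLoaded`, `memoryFootprint`, and `unload()` members shown earlier; the `Int` type for `memoryFootprint` is an assumption, and `MockModel` exists only to make the sketch runnable without loading real weights.

```swift
// Hypothetical protocol mirroring the members used in the snippet above.
protocol UnloadableModel: AnyObject {
    var isLoaded: Bool { get }
    var memoryFootprint: Int { get }  // bytes; the Int type is an assumption
    func unload()
}

// Mock model used only to make the sketch runnable.
final class MockModel: UnloadableModel {
    private(set) var isLoaded = true
    let loadedBytes: Int
    var memoryFootprint: Int { isLoaded ? loadedBytes : 0 }
    init(loadedBytes: Int) { self.loadedBytes = loadedBytes }
    func unload() { isLoaded = false }
}

// Unloads models in order until at least `requiredBytes` have been freed.
// Returns the number of bytes actually released.
func release(_ models: [UnloadableModel], toFree requiredBytes: Int) -> Int {
    var freed = 0
    for model in models where model.isLoaded {
        if freed >= requiredBytes { break }
        freed += model.memoryFootprint
        model.unload()
    }
    return freed
}

let llm = MockModel(loadedBytes: 418 * 1024 * 1024)  // INT4 MLX size from the table
let freed = release([llm], toFree: 300 * 1024 * 1024)
print(freed / (1024 * 1024), llm.isLoaded)
```

In a real pipeline this call would run just before TTS inference, and the LLM would be reloaded with `fromPretrained()` for the next turn.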

Performance

Device	Prefill	Decode	Tokens/s
M2 Max	~50 ms	~65 ms/tok	~15 tok/s
iPhone 16 Pro	~1.5 s	~450 ms/tok	~2.2 tok/s
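
These figures translate into end-to-end response time with simple arithmetic: latency ≈ prefill + (output tokens × decode time per token). The sketch below applies that formula to the table values. It assumes the prefill time is a fixed cost, although in practice it grows with prompt length. `DeviceTiming` is an illustrative type, not a library API.

```swift
// Illustrative type holding the approximate timings from the table above.
struct DeviceTiming {
    let name: String
    let prefillMs: Double
    let decodeMsPerToken: Double

    // Estimated time to produce `tokens` output tokens, in milliseconds.
    // ASSUMPTION: prefill is treated as a fixed cost independent of prompt length.
    func estimatedLatencyMs(tokens: Int) -> Double {
        prefillMs + Double(tokens) * decodeMsPerToken
    }

    var tokensPerSecond: Double { 1000.0 / decodeMsPerToken }
}

let m2Max = DeviceTiming(name: "M2 Max", prefillMs: 50, decodeMsPerToken: 65)
let iPhone = DeviceTiming(name: "iPhone 16 Pro", prefillMs: 1500, decodeMsPerToken: 450)

for device in [m2Max, iPhone] {
    let seconds = device.estimatedLatencyMs(tokens: 128) / 1000
    print(device.name, seconds, "s for 128 tokens")
}
```

For a 128-token reply this gives roughly 8.4 s on the M2 Max and roughly 59 s on the iPhone 16 Pro, which is why a low `maxTokens` value matters for voice use on a phone.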

Conversion

The MLX weights are converted from the original Qwen3.5-0.8B checkpoint with an MLX conversion script. The CoreML model uses a separate conversion script for deployment on the Neural Engine. Pre-converted weights are available on HuggingFace at aufklarer/Qwen3.5-0.8B-Chat-MLX (INT4: 418 MB) and aufklarer/Qwen3.5-0.8B-Chat-CoreML (INT8: 981 MB).