Qwen3.5 Chat (On-Device LLM)

Qwen3.5-0.8B is a hybrid DeltaNet (linear attention) + GatedAttention model with 24 layers (18 DeltaNet + 6 GatedAttention), quantized to INT4 for MLX (Metal GPU) and INT8 for CoreML (Neural Engine). It runs on Mac via MLX, or on iPhone and Mac via CoreML, with streaming token generation. It is designed for voice pipelines in which an on-device LLM serves as the "brain" between ASR and TTS.

Voice Pipeline Ready

Qwen3.5 Chat plugs into SpeechCore's VoicePipeline as the LLM component in ASR → LLM → TTS chains. The DeltaNet layers of the hybrid architecture provide efficient linear-time attention for long contexts.

Quick Start

import Qwen3Chat

let chat = try await Qwen35MLXChat.fromPretrained()

// Single response
let response = try chat.generate(messages: [
    ChatMessage(role: .system, content: "Answer briefly."),
    ChatMessage(role: .user, content: "What is Swift?")
])
print(response)

// Streaming tokens
let stream = chat.generateStream(messages: [
    ChatMessage(role: .system, content: "Be funny."),
    ChatMessage(role: .user, content: "Tell me a joke")
])
for try await token in stream {
    print(token, terminator: "")
}
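
In a voice pipeline, streamed tokens are usually grouped into sentences before being handed to TTS. The following is a minimal sketch of that buffering step. `SentenceChunker` is a hypothetical helper, not part of Qwen3Chat or SpeechCore, and it splits naively on `.`, `!` and `?`, so it would also split abbreviations and decimals.

```swift
import Foundation

/// Buffers streamed tokens and emits complete sentences.
/// Hypothetical helper for illustration; not part of Qwen3Chat or SpeechCore.
struct SentenceChunker {
    private var buffer = ""
    private let terminators: Set<Character> = [".", "!", "?"]

    init() {}

    /// Appends one token and returns any sentences that it completes.
    mutating func feed(_ token: String) -> [String] {
        var sentences: [String] = []
        for character in token {
            buffer.append(character)
            if terminators.contains(character) {
                let sentence = buffer.trimmingCharacters(in: .whitespacesAndNewlines)
                if !sentence.isEmpty { sentences.append(sentence) }
                buffer = ""
            }
        }
        return sentences
    }

    /// Returns whatever text remains once the stream has ended.
    mutating func flush() -> String? {
        let rest = buffer.trimmingCharacters(in: .whitespacesAndNewlines)
        buffer = ""
        return rest.isEmpty ? nil : rest
    }
}
```

Inside the `for try await token in stream` loop, each token would be passed to `feed(_:)` and every returned sentence forwarded to the speech synthesizer, with `flush()` called once the stream finishes.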

Architecture

Qwen3.5-0.8B is a hybrid model with 24 layers: 18 DeltaNet layers (linear attention with a gated delta rule recurrence and RMSNormGated) and 6 GatedAttention layers (standard scaled dot-product attention). The MLX backend runs inference on the Metal GPU with safetensors weights. The CoreML backend uses a dual-model architecture (prefill + decode) optimized for the Neural Engine. Both support a KV cache with prompt caching and configurable sampling (temperature, top-k, top-p, repetition penalty).
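
To make the gated delta rule recurrence more concrete, here is a minimal single-head sketch on plain arrays. It assumes the commonly published form of the rule, S_t = alpha * S_{t-1} * (I - beta * k k^T) + beta * v k^T, with a unit-norm key. The function name and signature are illustrative, and the library's actual kernels, normalization and per-head layout are not described here and may differ.

```swift
/// One recurrent step of a gated delta rule on a (valueDim x keyDim) state matrix.
/// Assumed form: S_t = alpha * S_{t-1} * (I - beta * k k^T) + beta * v k^T
/// Illustrative only; not the library's implementation.
func gatedDeltaStep(state: [[Double]], key: [Double], value: [Double],
                    alpha: Double, beta: Double) -> [[Double]] {
    precondition(state.count == value.count, "state rows must match value size")
    precondition(state.allSatisfy { $0.count == key.count }, "state columns must match key size")

    // What the current state predicts for this key: S_{t-1} k
    let predicted = state.map { row in
        zip(row, key).reduce(0.0) { $0 + $1.0 * $1.1 }
    }

    var next = state
    for i in 0..<state.count {
        for j in 0..<key.count {
            // Erase the old association for this key, scaled by beta ...
            let erased = state[i][j] - beta * predicted[i] * key[j]
            // ... decay the remaining memory by alpha, then write the new value.
            next[i][j] = alpha * erased + beta * value[i] * key[j]
        }
    }
    return next
}
```

With `alpha = 1` and `beta = 1`, writing a second value under the same key replaces the first one rather than adding to it, which is what distinguishes the delta rule from plain additive linear attention. The state has a fixed size, so the per-token cost does not grow with context length.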

Model Inputs/Outputs

Direction	Name	Shape	Description
Input	`input_ids`	[1, seq_len]	Token IDs (Int32)
Input	`attention_mask`	[1, seq_len]	Attention mask (Int32)
Input	`kv_cache`	per layer	Key-value cache state
Output	`logits`	[1, 1, 151936]	Next-token logits (Float16)
Output	`kv_cache_out`	per layer	Updated KV cache

Model Variants

Variant	Quantization	Size	Compute	HuggingFace
Qwen3.5-0.8B Chat	INT4	418 MB	Metal GPU (MLX)	aufklarer/Qwen3.5-0.8B-Chat-MLX
Qwen3.5-0.8B Chat	INT8	981 MB	Neural Engine (CoreML)	aufklarer/Qwen3.5-0.8B-Chat-CoreML

Sampling Configuration

let config = ChatSamplingConfig(
    temperature: 0.7,
    topK: 40,
    topP: 0.9,
    maxTokens: 128,
    repetitionPenalty: 1.1
)
let response = try chat.generate(
    messages: [ChatMessage(role: .user, content: "Explain gravity")],
    sampling: config
)

Parameter	Default	Description
`temperature`	0.6	Randomness (0 = greedy, 1 = creative)
`topK`	50	Keeps the top K candidates
`topP`	0.95	Nucleus sampling threshold
`maxTokens`	512	Maximum response tokens
`repetitionPenalty`	1.1	Penalizes repeated tokens
`disableThinking`	false	Skip thinking mode
`maxThinkingTokens`	100	Thinking token limit
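
As a rough illustration of how `temperature`, `topK` and `topP` interact, the sketch below filters a logits vector in the conventional order (temperature, then top-k, then top-p) and samples from what remains. `filterLogits` and `sampleToken` are hypothetical names. The order in which the library applies these filters is an assumption, and the repetition penalty is left out for brevity.

```swift
import Foundation

/// Applies temperature, top-k, then top-p (nucleus) filtering to raw logits.
/// Returns the surviving token indices with renormalized probabilities,
/// sorted from most to least likely. Illustrative; not the library's code.
func filterLogits(_ logits: [Double], temperature: Double,
                  topK: Int, topP: Double) -> [(index: Int, prob: Double)] {
    precondition(!logits.isEmpty && temperature >= 0 && topK > 0)

    // Temperature 0 means greedy decoding: keep only the argmax.
    if temperature == 0 {
        let best = logits.indices.max(by: { logits[$0] < logits[$1] })!
        return [(index: best, prob: 1.0)]
    }

    // Temperature scaling followed by a numerically stable softmax.
    let scaled = logits.map { $0 / temperature }
    let maxLogit = scaled.max()!
    let exps = scaled.map { exp($0 - maxLogit) }
    let total = exps.reduce(0.0, +)
    var ranked: [(index: Int, prob: Double)] = []
    for (index, value) in exps.enumerated() {
        ranked.append((index: index, prob: value / total))
    }
    ranked.sort { $0.prob > $1.prob }

    var kept: [(index: Int, prob: Double)] = []
    var cumulative = 0.0
    // Top-k: consider at most the K most likely tokens.
    for candidate in ranked.prefix(topK) {
        kept.append(candidate)
        cumulative += candidate.prob
        // Top-p: stop once the kept mass reaches the nucleus threshold.
        if cumulative >= topP { break }
    }
    // Renormalize so the surviving probabilities sum to 1.
    return kept.map { (index: $0.index, prob: $0.prob / cumulative) }
}

/// Draws one token index from a filtered distribution.
func sampleToken(from candidates: [(index: Int, prob: Double)]) -> Int {
    var threshold = Double.random(in: 0..<1)
    for candidate in candidates {
        threshold -= candidate.prob
        if threshold < 0 { return candidate.index }
    }
    return candidates[candidates.count - 1].index
}
```

Lower `topK` and `topP` values shrink the candidate set and make output more predictable, while a higher `temperature` flattens the distribution before any filtering takes place.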

Multi-turn Conversation

let chat = try await Qwen35MLXChat.fromPretrained()

let history = [
    ChatMessage(role: .system, content: "Remember the user's name."),
    ChatMessage(role: .user, content: "My name is Alex"),
    ChatMessage(role: .assistant, content: "Nice to meet you, Alex!"),
    ChatMessage(role: .user, content: "What's my name?")
]
let response = try chat.generate(messages: history)
print(response)  // e.g. "Your name is Alex!"

chat.resetState()  // Clear inference state for a new conversation

Memory Management

// Check memory state
print(chat.isLoaded)        // true
print(chat.memoryFootprint) // 438304768 (~418 MB)

// Free memory under pressure
chat.unload()
print(chat.isLoaded)        // false

// Reload when needed (a new name avoids redeclaring `chat` in the same scope)
let reloadedChat = try await Qwen35MLXChat.fromPretrained()

iOS Memory Tip

On iPhone, unloading the LLM before TTS inference frees ~418 MB (INT4 MLX) or ~981 MB (INT8 CoreML), which helps avoid jetsam termination when running full ASR → LLM → TTS pipelines.
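
The tip above can be sketched as a small helper that releases the LLM before the TTS stage runs. `UnloadableModel`, `speakAfterUnloading` and `StubModel` are hypothetical names introduced for illustration. Only `isLoaded` and `unload()` come from the API shown earlier, and the TTS call itself is left as a closure because its interface is not covered here.

```swift
/// Minimal surface this sketch needs from a loadable model.
/// Hypothetical protocol; `isLoaded` and `unload()` mirror the calls shown earlier.
protocol UnloadableModel: AnyObject {
    var isLoaded: Bool { get }
    func unload()
}

/// Runs `speak` only after the LLM's weights have been released.
func speakAfterUnloading<T>(_ llm: UnloadableModel,
                            speak: () throws -> T) rethrows -> T {
    if llm.isLoaded {
        llm.unload()  // free the LLM's memory before TTS allocates its own
    }
    return try speak()
}

/// Stand-in model used to exercise the helper without loading real weights.
final class StubModel: UnloadableModel {
    private(set) var isLoaded = true
    func unload() { isLoaded = false }
}
```

If the real chat type is a class whose members match, it could be adapted with an empty conformance extension. The LLM then has to be reloaded with `fromPretrained()` before the next turn, so this trades reload latency for a lower peak memory footprint.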

Performance

Device	Prefill	Decode	Tokens/sec
M2 Max	~50ms	~65ms/tok	~15 tok/s
iPhone 16 Pro	~1.5s	~450ms/tok	~2.2 tok/s
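
For a sense of what these figures mean for a full reply, the sketch below estimates end-to-end latency as one prefill pass plus one decode step per generated token. The 128-token response length is an arbitrary example, and prefill time in practice depends on prompt length, which the table does not specify.

```swift
import Foundation

/// Rough end-to-end latency: one prefill pass plus one decode step per token.
func estimatedLatency(prefillSeconds: Double, decodeSecondsPerToken: Double,
                      outputTokens: Int) -> Double {
    prefillSeconds + decodeSecondsPerToken * Double(outputTokens)
}

// Prefill and decode figures come from the table; 128 tokens is an assumed length.
let m2Max = estimatedLatency(prefillSeconds: 0.05, decodeSecondsPerToken: 0.065,
                             outputTokens: 128)
let iPhone = estimatedLatency(prefillSeconds: 1.5, decodeSecondsPerToken: 0.45,
                              outputTokens: 128)
print(String(format: "M2 Max: %.1f s, iPhone 16 Pro: %.1f s", m2Max, iPhone))
// prints "M2 Max: 8.4 s, iPhone 16 Pro: 59.1 s"
```

Because decode time dominates, a smaller `maxTokens` value and sentence-level streaming to TTS matter more for perceived responsiveness on iPhone than prefill time does.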

Conversion

The MLX weights are converted from the original Qwen3.5-0.8B checkpoint using the MLX conversion script. The CoreML models use a separate conversion script for Neural Engine deployment. Pre-converted weights are available on HuggingFace at aufklarer/Qwen3.5-0.8B-Chat-MLX (INT4: 418 MB) and aufklarer/Qwen3.5-0.8B-Chat-CoreML (INT8: 981 MB).