Stable Audio 3

This Soniqo page documents Stable Audio 3 as implemented locally in speech-swift / speech-core. The Hugging Face package links follow the integration notes.

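Start with the on-site page

The home page cards and the docs menu point here first; links to the source model and weight packages are still provided on this page.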

Overview

Model	Stable Audio 3
Purpose	Text-to-music generation
Backend	MLX; Medium DiT defaults to int8, with an int4 variant available
Output	44.1 kHz stereo Float PCM
Language	Prompt language depends on the T5Gemma text encoder
License	Stable Audio model terms apply
Status	Default `speech compose` engine for Stable Audio 3 Medium
Source	Stability AI Stable Audio 3
Swift product	`StableAudio3MusicGen`
CLI / runtime	`speech compose --engine sa3`

Usage

The snippet below corresponds to the API or commands exposed by the current speech-swift repository.

# Generate 30 seconds of 44.1 kHz stereo audio.
.build/release/speech compose "lofi house loop" \
  --engine sa3 \
  --sa3-variant medium-int8 \
  --seconds 30 \
  -o music.wav

Model links

Implementation notes

The download is already split into separate DiT, SAME encoder/decoder, and T5Gemma directories; switching it to byte-weighted progress reporting would match the faster Fish path.
Medium DiT uses 24 layers, 1536 hidden size, differential attention, T5Gemma conditioning, and SAME-L decode.
Small Music and Small SFX bundle IDs exist, but the current Swift port wires the Medium family first.
Length is variable: latent steps are ceil(seconds * 44100 / 4096), then output is cropped to the requested duration.
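
The length rule above can be checked with a little integer arithmetic. The following is a minimal sketch, not the Swift implementation: the function names `latent_steps` and `cropped_frames` are hypothetical, it handles whole seconds only, and it assumes each latent step decodes to exactly 4096 output frames, which the formula suggests but these notes do not state outright.

```shell
#!/bin/sh
# Sketch only: mirrors the length rule from the notes, not the Swift code.
SAMPLE_RATE=44100   # output sample rate in Hz (from the overview table)
HOP=4096            # assumed output frames per latent step (from the formula)

# latent_steps SECONDS -> ceil(SECONDS * 44100 / 4096), whole seconds only.
# Integer ceiling division: add (HOP - 1) before dividing.
latent_steps() {
  echo $(( ($1 * SAMPLE_RATE + HOP - 1) / HOP ))
}

# cropped_frames SECONDS -> decoded frames dropped to reach the requested length.
cropped_frames() {
  steps=$(latent_steps "$1")
  echo $(( steps * HOP - $1 * SAMPLE_RATE ))
}

echo "30 s -> $(latent_steps 30) latent steps, $(cropped_frames 30) frames cropped"
```

For a 30-second request this prints `30 s -> 323 latent steps, 8 frames cropped`: 30 * 44100 = 1,323,000 frames, which rounds up to 323 latent steps (1,323,008 frames), so under the stated assumption only 8 trailing frames are cropped.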