Stable Audio 3
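This Soniqo page documents Stable Audio 3 as implemented locally in speech-swift / speech-core. The Hugging Face package link appears after the integration notes.
Start with the on-site page
Homepage cards and the documentation menu point here first; links to the source model and the weight packages are still provided on this page.
Overview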
| Model | Stable Audio 3 |
|---|---|
| Purpose | Text-to-music generation |
| Backend | MLX; Medium DiT, int8 by default, int4 variant available |
| Output | 44.1 kHz stereo Float PCM |
| Language | Prompt language depends on the T5Gemma text encoder |
| License | Stable Audio model terms apply |
| Status | Default speech compose engine for Stable Audio 3 Medium |
| Source | Stability AI Stable Audio 3 |
| Swift product | StableAudio3MusicGen |
| CLI / runtime | speech compose --engine sa3 |
Usage
The snippet below corresponds to the API or command exposed by the current speech-swift repository.
# Generate 30 seconds of 44.1 kHz stereo audio.
.build/release/speech compose "lofi house loop" \
--engine sa3 \
--sa3-variant medium-int8 \
--seconds 30 \
-o music.wav
Model links
Implementation notes
- The download is already split into separate DiT, SAME encoder/decoder, and T5Gemma directories; switching it to byte-weighted progress reporting would match the faster Fish path.
- Medium DiT uses 24 layers, 1536 hidden size, differential attention, T5Gemma conditioning, and SAME-L decode.
- Bundle IDs for Small Music and Small SFX exist, but the current Swift port wires up only the Medium family so far.
- Length is variable: latent steps are ceil(seconds * 44100 / 4096), then output is cropped to the requested duration.
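
The length rule in the last note can be checked with a little arithmetic. The following is a minimal sketch, not code from the speech-swift repository: the helper names `latentSteps(forSeconds:)` and `croppedFrameCount(forSeconds:)` are hypothetical, and it assumes that each latent step decodes to exactly 4096 frames per channel and that the crop rounds the requested duration to the nearest frame.

```swift
import Foundation

// Constants taken from the formula in the notes above.
let sampleRate = 44_100            // output sample rate in Hz
let samplesPerLatentStep = 4_096   // assumed: audio frames covered by one latent step

/// Hypothetical helper: latent steps needed to cover `seconds` of audio,
/// i.e. ceil(seconds * 44100 / 4096).
func latentSteps(forSeconds seconds: Double) -> Int {
    let exact = seconds * Double(sampleRate) / Double(samplesPerLatentStep)
    return Int(exact.rounded(.up))
}

/// Hypothetical helper: frames per channel kept after cropping.
/// Rounding to the nearest frame is an assumption, not a documented behavior.
func croppedFrameCount(forSeconds seconds: Double) -> Int {
    Int((seconds * Double(sampleRate)).rounded())
}

let seconds = 30.0
let steps = latentSteps(forSeconds: seconds)
let decodedFrames = steps * samplesPerLatentStep
let keptFrames = croppedFrameCount(forSeconds: seconds)

print("latent steps: \(steps)")
print("decoded frames per channel: \(decodedFrames)")
print("kept frames per channel: \(keptFrames)")
print("frames cropped: \(decodedFrames - keptFrames)")
```

Under these assumptions, a 30-second request needs 323 latent steps, which decode to 1,323,008 frames per channel; the final 8 frames are cropped to leave 1,323,000.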