Stable Audio 3

This Soniqo page describes Stable Audio 3 as implemented in the local speech-swift / speech-core code. Links to the Hugging Face bundles follow the integration notes.

Site page first

Landing cards and the documentation menu point to this page first; the source model and bundle links stay on this page.

Overview

Model	Stable Audio 3
Role	Text-to-music generation
Backend	MLX; Medium DiT, int8 by default, int4 variant available
Output	44.1 kHz stereo Float PCM
Languages	Prompt language depends on the T5Gemma text encoder
License	Stable Audio model terms apply
Status	Default `speech compose` engine for Stable Audio 3 Medium
Source	Stability AI Stable Audio 3
Swift product	`StableAudio3MusicGen`
CLI / runtime	`speech compose --engine sa3`

Usage

The snippet below follows the API or command surface currently exposed by the speech-swift repository.

# Generate 30 seconds of 44.1 kHz stereo audio.
.build/release/speech compose "lofi house loop" \
  --engine sa3 \
  --sa3-variant medium-int8 \
  --seconds 30 \
  -o music.wav

Model links

Implementation notes

The download is already split into separate DiT, SAME encoder/decoder, and T5Gemma directories; switching its progress reporting to byte-weighted progress would match the faster Fish path.
Medium DiT uses 24 layers, 1536 hidden size, differential attention, T5Gemma conditioning, and SAME-L decode.
Small Music and Small SFX bundle IDs exist, but the current Swift port wires the Medium family first.
Length is variable: latent steps are ceil(seconds * 44100 / 4096), then output is cropped to the requested duration.
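
The length rule in the last note can be checked with a few lines of arithmetic. The sketch below is in Python for brevity and is not taken from the Swift port. It assumes the divisor 4096 is the number of output sample frames decoded per latent step, and that the requested duration is rounded to the nearest sample frame before cropping; the function names are hypothetical.

```python
import math

SAMPLE_RATE = 44_100        # Hz, from the Output row
SAMPLES_PER_LATENT = 4_096  # divisor in the note's formula (assumed: frames per latent step)


def latent_steps(seconds: float) -> int:
    """Latent steps needed to cover `seconds`: ceil(seconds * 44100 / 4096)."""
    return math.ceil(seconds * SAMPLE_RATE / SAMPLES_PER_LATENT)


def crop_plan(seconds: float) -> tuple[int, int, int]:
    """Return (latent steps, decoded frames, frames trimmed by the final crop)."""
    steps = latent_steps(seconds)
    decoded = steps * SAMPLES_PER_LATENT
    wanted = round(seconds * SAMPLE_RATE)  # assumption: nearest-frame rounding
    return steps, decoded, decoded - wanted


print(crop_plan(30))  # (323, 1323008, 8)
```

For the 30-second request in the usage snippet this gives 323 latent steps, 1,323,008 decoded frames per channel, and 8 frames trimmed to reach the requested 1,323,000.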
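
The byte-weighted progress idea in the first note can be illustrated as follows. This is a minimal Python sketch, not the speech-swift downloader: the component names mirror the directories listed in the note, the byte counts are invented for illustration, and the equal-weight function is shown only as a contrasting scheme, not as a description of the current behavior.

```python
def byte_weighted_progress(components: dict[str, tuple[int, int]]) -> float:
    """Overall fraction = bytes downloaded / bytes expected, summed over components.

    `components` maps a name to (bytes_downloaded, bytes_total).
    """
    total = sum(t for _, t in components.values())
    if total == 0:
        return 0.0
    done = sum(min(d, t) for d, t in components.values())
    return done / total


def equal_weighted_progress(components: dict[str, tuple[int, int]]) -> float:
    """Contrasting scheme: every component counts the same regardless of size."""
    fractions = [min(d, t) / t for d, t in components.values() if t > 0]
    return sum(fractions) / len(fractions) if fractions else 0.0


# Hypothetical sizes: two small components finished, the large one not started.
state = {
    "dit": (0, 3000),
    "same": (500, 500),
    "t5gemma": (500, 500),
}

print(byte_weighted_progress(state))   # 0.25
print(equal_weighted_progress(state))  # about 0.667
```

With these made-up numbers the equal-weight figure reports about two thirds done while only a quarter of the bytes have arrived, which is the kind of gap a byte-weighted bar avoids.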