Stable Audio 3

This Soniqo page describes Stable Audio 3 as implemented locally in speech-swift / speech-core. The Hugging Face bundle links follow the integration notes.

Internal page first

Landing cards and the docs menu link to this page first; the original model and bundle links are kept on this page.

Overview

Model	Stable Audio 3
Role	Text-to-music generation
Backend	MLX; Medium DiT, int8 by default, int4 variant available
Output	44.1 kHz stereo Float PCM
Language	Prompt language depends on the T5Gemma text encoder
License	Stable Audio model terms apply
Status	Default `speech compose` engine for Stable Audio 3 Medium
Source	Stability AI Stable Audio 3
Swift product	`StableAudio3MusicGen`
CLI / runtime	`speech compose --engine sa3`

Usage

The snippet below matches the API or commands in the current speech-swift repository.

# Generate 30 seconds of 44.1 kHz stereo audio.
.build/release/speech compose "lofi house loop" \
  --engine sa3 \
  --sa3-variant medium-int8 \
  --seconds 30 \
  -o music.wav

Model links

Implementation notes

The download is already split into separate DiT, SAME encoder/decoder, and T5Gemma directories; switching it to byte-weighted progress would match the faster Fish path.
Medium DiT uses 24 layers, 1536 hidden size, differential attention, T5Gemma conditioning, and SAME-L decode.
Small Music and Small SFX bundle IDs exist, but the current Swift port wires the Medium family first.
Length is variable: the number of latent steps is ceil(seconds * 44100 / 4096), and the output is then cropped to the requested duration.
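
The length rule in the last note can be checked with a little shell arithmetic. This is only a sketch: `latent_steps` is a hypothetical helper, not part of the `speech` CLI, it handles whole seconds only, and it assumes each latent step decodes to 4096 sample frames, which is what the formula implies but the notes do not state outright.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of the speech CLI.
SAMPLE_RATE=44100       # output sample rate from the overview table
SAMPLES_PER_STEP=4096   # assumed sample frames per latent step (from the formula)

latent_steps() {
  # Integer ceiling division: ceil(seconds * 44100 / 4096), whole seconds only.
  samples=$(( $1 * SAMPLE_RATE ))
  echo $(( (samples + SAMPLES_PER_STEP - 1) / SAMPLES_PER_STEP ))
}

seconds=30
steps=$(latent_steps "$seconds")
decoded=$(( steps * SAMPLES_PER_STEP ))   # frames before cropping (assumption)
requested=$(( seconds * SAMPLE_RATE ))    # frames kept after cropping

echo "latent steps:   $steps"
echo "decoded frames: $decoded"
echo "cropped frames: $(( decoded - requested ))"
```

For the 30-second request in the usage snippet this gives 323 latent steps, since 322 steps would cover only 1,318,912 of the 1,323,000 requested frames. Under the assumption above, the crop then discards the 8 surplus frames.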