Stable Audio 3

This first-party Soniqo page documents Stable Audio 3 as implemented in the local speech-swift / speech-core code. Links to the Hugging Face bundles follow the integration notes.

Internal Page First

Landing cards and docs menus now point here first; source model and bundle links remain available from this page.

At a Glance

Model	Stable Audio 3
Role	Text-to-music generation
Backend	MLX; Medium DiT defaults to int8, with an int4 variant available
Output	44.1 kHz stereo Float PCM
Languages	Prompt language support depends on the T5Gemma text encoder
License	Stable Audio model terms apply
Status	Default `speech compose` engine for Stable Audio 3 Medium
Source	Stability AI Stable Audio 3
Swift product	`StableAudio3MusicGen`
CLI / runtime	`speech compose --engine sa3`

Use

The snippet below mirrors the `speech compose` command currently exposed by the speech-swift repo.

# Generate 30 seconds of 44.1 kHz stereo audio.
.build/release/speech compose "lofi house loop" \
  --engine sa3 \
  --sa3-variant medium-int8 \
  --seconds 30 \
  -o music.wav

Model Links

Implementation Notes

The download is already split into separate DiT, SAME encoder/decoder, and T5Gemma directories; switching it to byte-weighted progress reporting would match the faster Fish path.
Medium DiT uses 24 layers, 1536 hidden size, differential attention, T5Gemma conditioning, and SAME-L decode.
Small Music and Small SFX bundle IDs exist, but the current Swift port wires the Medium family first.
Length is variable: the number of latent steps is ceil(seconds * 44100 / 4096), and the decoded output is then cropped to the requested duration.
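The length rule in the last note can be checked with a little arithmetic. The sketch below is illustrative Python, not code from speech-swift: the function names are hypothetical, and it assumes that each latent step decodes to 4096 audio frames and that the surplus is cropped from the tail. Neither detail is stated in the notes beyond the formula itself.

```python
import math

SAMPLE_RATE = 44_100        # output sample rate in Hz, from the table above
FRAMES_PER_LATENT = 4096    # assumed: audio frames decoded per latent step


def latent_steps(seconds: float) -> int:
    """Latent steps needed to cover `seconds` of audio: ceil(s * 44100 / 4096)."""
    return math.ceil(seconds * SAMPLE_RATE / FRAMES_PER_LATENT)


def crop_plan(seconds: float) -> tuple[int, int, int]:
    """Return (latent steps, decoded frames, frames to crop).

    Assumes the decoder emits FRAMES_PER_LATENT frames per step and that the
    extra frames are dropped from the end of the clip.
    """
    steps = latent_steps(seconds)
    decoded = steps * FRAMES_PER_LATENT
    wanted = round(seconds * SAMPLE_RATE)
    return steps, decoded, decoded - wanted


if __name__ == "__main__":
    # 30 s request: 323 latent steps, 1,323,008 decoded frames, 8 cropped.
    print(crop_plan(30))
```

Under these assumptions a 30-second request rounds up to 323 latent steps and only 8 frames per channel are discarded, so the crop is well under a millisecond of audio.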