Compose — Music & Audio Production
Three on-device modules cover the music and audio-production side of speech-swift, all running natively on Apple Silicon via MLX or CoreML. MAGNeT generates 30-second music clips from a text prompt. Source Separation (Open-Unmix) splits a stereo track into four stems (vocals / drums / bass / other). Speech Enhancement (DeepFilterNet3) removes background noise from speech in real time.
| Module | Task | Backend | Output | CLI |
|---|---|---|---|---|
| MAGNeT | Text → music | MLX (INT4 / INT8) | 30 s @ 32 kHz mono | speech compose |
| Open-Unmix | Stem separation | MLX | 4 stems @ 44.1 kHz stereo | speech separate |
| DeepFilterNet3 | Noise suppression | CoreML (Neural Engine) | 48 kHz, real-time | speech denoise |
MAGNeT — text-to-music generation
MLX Swift port of Meta's MAGNeT (Masked Audio Generation with a Single Non-Autoregressive Transformer). Generates 30 s clips of 32 kHz mono music from a free-form English prompt — "happy rock", "energetic EDM with synth lead", or rich descriptive text for cleaner results.
Architecture
Three components are loaded, each downloaded on first call:
| Component | Role | Source |
|---|---|---|
| MAGNeT decoder LM | Masked non-autoregressive transformer over 4 EnCodec codebooks. 24 layers (Small, 300M) or 48 (Medium, 1.5B). Quantised Q/K/V/out projections + FFN linears (MLX-affine, group 64). | aufklarer/MAGNeT-{Small,Medium}-30secs-MLX-{4,8}bit |
| T5-base text encoder | 110M-parameter encoder for text conditioning. FP32 (encoder-only path; no decoder, no LM head). | t5-base |
| EnCodec 32 kHz decoder | SEANet decoder (Conv1d / ConvTranspose1d / ResnetBlock / 2-layer LSTM) + 4-codebook Euclidean RVQ. Maps the LM's discrete tokens back to a 32 kHz waveform. | mlx-community/encodec-32khz-float32 |
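The "MLX-affine, group 64" quantisation named in the table can be illustrated on a single group of weights. The following is a minimal sketch, assuming the usual affine scheme (one scale and one bias per group, with codes in the range 0 to 2^bits − 1); `QuantizedGroup`, `quantizeGroup`, and `dequantizeGroup` are hypothetical names for illustration and are not part of speech-swift or MLX.

```swift
import Foundation

/// One quantised group: integer codes plus the scale and bias needed to decode them.
struct QuantizedGroup {
    let codes: [UInt8]
    let scale: Float
    let bias: Float
}

/// Affine-quantise one group of weights to `bits` bits (hypothetical helper).
func quantizeGroup(_ weights: [Float], bits: Int) -> QuantizedGroup {
    let levels = Float((1 << bits) - 1)          // 15 for INT4, 255 for INT8
    let lo = weights.min() ?? 0
    let hi = weights.max() ?? 0
    // Guard against a constant group, where hi == lo would give a zero scale.
    let scale = hi > lo ? (hi - lo) / levels : 1
    let codes = weights.map { w in UInt8(((w - lo) / scale).rounded()) }
    return QuantizedGroup(codes: codes, scale: scale, bias: lo)
}

/// Reconstruct approximate weights: w ≈ code × scale + bias.
func dequantizeGroup(_ g: QuantizedGroup) -> [Float] {
    g.codes.map { Float($0) * g.scale + g.bias }
}

// A group of 64 synthetic weights in [-1, 1] (made-up data).
let group: [Float] = (0..<64).map { Float(sin(Double($0) * 0.37)) }
let q4 = quantizeGroup(group, bits: 4)
let restored = dequantizeGroup(q4)
let maxError = zip(group, restored).map { abs($0 - $1) }.max() ?? 0
print("scale:", q4.scale, "max error:", maxError)
```

The reconstruction error per weight is bounded by half the group's scale, which is why a smaller group size (more scales and biases) trades memory for accuracy.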
Masked parallel decoding
Unlike its autoregressive sibling MusicGen, MAGNeT decodes in 50 forward passes in total (default split [20, 10, 10, 10] across the 4 codebooks), using cosine-scheduled remasking, classifier-free guidance annealing, and per-stage local attention windows. Stage 0 uses full self-attention; stages 1–3 use a local |q − k| ≤ 5 window because the higher codebooks only refine details.
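The shape of that schedule can be sketched in a few lines. This is an illustration under stated assumptions, not the port's actual code: the mask fraction is taken to follow cos(π/2 · t) with t running linearly from 0 to 1 within a stage, the guidance scale is interpolated between the `--cfg-max` and `--cfg-min` defaults by that same fraction, and 1,500 positions per codebook is assumed (50 frames/s × 30 s). Span masking and confidence-based selection of which positions to remask are left out, and all function names are hypothetical.

```swift
import Foundation

/// Fraction of positions still masked at a 0-based `step`, following a cosine
/// curve from 1.0 (everything masked) down to ~0.0 on the final step.
func maskFraction(step: Int, totalSteps: Int) -> Double {
    let t = Double(step) / Double(max(totalSteps - 1, 1))
    return cos(t * Double.pi / 2)
}

/// Number of positions to (re)mask at this step; at least one stays masked.
func maskedCount(step: Int, totalSteps: Int, positions: Int) -> Int {
    max(Int(maskFraction(step: step, totalSteps: totalSteps) * Double(positions)), 1)
}

/// Guidance scale annealed from `cfgMax` to `cfgMin` as the mask shrinks.
func guidanceScale(step: Int, totalSteps: Int,
                   cfgMax: Double = 10.0, cfgMin: Double = 1.0) -> Double {
    let f = maskFraction(step: step, totalSteps: totalSteps)
    return f * cfgMax + (1 - f) * cfgMin
}

/// Stage 0 attends everywhere; later stages only within `window` positions.
func canAttend(query: Int, key: Int, stage: Int, window: Int = 5) -> Bool {
    stage == 0 || abs(query - key) <= window
}

let stepsPerStage = [20, 10, 10, 10]             // default split from the text
let positions = 1_500                            // assumption: 50 frames/s × 30 s
let totalPasses = stepsPerStage.reduce(0, +)
print("forward passes:", totalPasses)

for (stage, steps) in stepsPerStage.enumerated() {
    for step in [0, steps / 2, steps - 1] {
        let masked = maskedCount(step: step, totalSteps: steps, positions: positions)
        let cfg = String(format: "%.2f", guidanceScale(step: step, totalSteps: steps))
        print("stage \(stage) step \(step): masked \(masked), cfg \(cfg)")
    }
}
```

Under these assumptions each stage starts fully masked with strong guidance and ends with a single masked position and guidance near the minimum.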
Variants
| Variant | Params | LM on-disk | Peak RSS | Wall (M-series, 30 s) | RTF |
|---|---|---|---|---|---|
| small-int4 | 300M | 287 MB | ~1.4 GB | ~10.8 s | 0.36× |
| small-int8 | 300M | 425 MB | ~1.5 GB | ~11 s | 0.37× |
| medium-int4 | 1.5B | 1.36 GB | ~2.2 GB | ~36 s | 1.20× |
| medium-int8 | 1.5B | 2.10 GB | ~3.0 GB | ~36 s | 1.20× |
RTF (real-time factor) is wall-clock time divided by audio duration, so a value below 1.0 means faster than real time. Quantisation barely moves wall-clock time because attention, not the linear projections, dominates, so the practical win from INT4 is memory rather than latency.
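The RTF column is just wall-clock time divided by clip length. A small worked example, using two rows from the table above (`realTimeFactor` is a hypothetical helper, not a library function):

```swift
import Foundation

/// Real-time factor: seconds of compute per second of generated audio.
func realTimeFactor(wallSeconds: Double, audioSeconds: Double) -> Double {
    wallSeconds / audioSeconds
}

let clipSeconds = 30.0
let measurements: [(variant: String, wall: Double)] = [
    ("small-int4", 10.8),
    ("medium-int4", 36.0),
]

for m in measurements {
    let rtf = realTimeFactor(wallSeconds: m.wall, audioSeconds: clipSeconds)
    let verdict = rtf < 1 ? "faster than real time" : "slower than real time"
    print(m.variant, String(format: "%.2f", rtf), verdict)
}
```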
Quick start
import Foundation
import MAGNeTMusicGen
// Downloads the weights on first call, then loads the small INT4 variant.
let model = try await MAGNeTMusicGen.fromPretrained(variant: .smallInt4)
// Descriptive prompts (instruments, tempo, mood) tend to give better results.
let pcm = model.generate(text: "energetic upbeat rock anthem with electric guitar riffs, driving drums, bass groove")
// pcm: [Float] length 960_000 (30 s × 32 kHz mono)
try WAVWriter.write(samples: pcm, sampleRate: 32_000,
                    to: URL(fileURLWithPath: "out.wav"))
CLI
# Default: small-int4 (~10 s on M-series for 30 s of audio)
speech compose "happy rock" -o happy_rock.wav
# Larger model: better prompt following, ~3.3× slower (~36 s vs ~10.8 s)
speech compose "lo-fi hip hop with mellow piano and warm vinyl crackle" \
--variant medium-int4 -o lofi.wav
# Reproducible
speech compose "energetic EDM with synth lead" --seed 42 -o edm.wav
Flags: --variant {small,medium}-{int4,int8}, --temperature (annealed, default 3.0), --top-p (default 0.9), --cfg-max / --cfg-min (default 10.0 / 1.0), --steps "20,10,10,10" (per-codebook iterations), --seed.
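For readers unfamiliar with the sampling flags, here is a generic sketch of what temperature and top-p (nucleus) filtering do to a token distribution. It is the textbook technique, not the CLI's actual sampler; the temperature annealing mentioned above is not modelled, and `softmax` and `topPFilter` are hypothetical helper names.

```swift
import Foundation

/// Softmax over logits divided by `temperature`; higher values flatten the distribution.
func softmax(_ logits: [Double], temperature: Double) -> [Double] {
    let scaled = logits.map { $0 / temperature }
    let peak = scaled.max() ?? 0                 // subtract the max for numerical stability
    let exps = scaled.map { exp($0 - peak) }
    let total = exps.reduce(0, +)
    return exps.map { $0 / total }
}

/// Nucleus filter: keep the smallest set of highest-probability tokens whose
/// cumulative mass reaches `topP`, zero out the rest, then renormalise.
func topPFilter(_ probs: [Double], topP: Double) -> [Double] {
    let order = probs.indices.sorted { probs[$0] > probs[$1] }
    var kept = [Double](repeating: 0, count: probs.count)
    var mass = 0.0
    for i in order {
        kept[i] = probs[i]
        mass += probs[i]
        if mass >= topP { break }
    }
    return kept.map { $0 / mass }
}

let probs = [0.5, 0.3, 0.15, 0.05]               // made-up distribution
let filtered = topPFilter(probs, topP: 0.9)
print(filtered)

let sharp = softmax([2, 1, 0], temperature: 1.0)
let flat = softmax([2, 1, 0], temperature: 3.0)
print(sharp, flat)
```

With `topP` at 0.9, the least likely token in the made-up distribution is dropped and the remaining three are rescaled to sum to 1.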
Short tags like "happy rock" work but feel thin. Descriptive prompts that mention instruments + tempo + mood noticeably improve coherence — in our quality sweep the richer prompt gave higher zero-crossing rate (0.116 vs 0.093, i.e. more high-frequency detail) and zero clipping. Compare:
- "happy rock": thin
- "energetic upbeat rock anthem with electric guitar riffs, driving drums, bass groove": richer, usually better
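The two metrics quoted above are easy to compute on the `[Float]` buffer that generation returns. The following is a minimal sketch, assuming the common definitions (fraction of adjacent sample pairs that change sign, and the number of samples at or beyond full scale); the exact definitions used in the quality sweep are not stated here, and both function names are hypothetical.

```swift
import Foundation

/// Fraction of adjacent sample pairs whose sign differs.
func zeroCrossingRate(_ samples: [Float]) -> Double {
    guard samples.count > 1 else { return 0 }
    var crossings = 0
    for i in 1..<samples.count where (samples[i - 1] < 0) != (samples[i] < 0) {
        crossings += 1
    }
    return Double(crossings) / Double(samples.count - 1)
}

/// Number of samples whose magnitude reaches or exceeds `threshold` (full scale by default).
func clippedSampleCount(_ samples: [Float], threshold: Float = 1.0) -> Int {
    samples.filter { abs($0) >= threshold }.count
}

// Synthetic check: one second of a 1 kHz sine at 32 kHz, half amplitude.
let sampleRate = 32_000.0
let tone: [Float] = (0..<32_000).map { i in
    Float(0.5 * sin(2 * Double.pi * 1_000 * Double(i) / sampleRate + 0.1))
}
print("ZCR:", zeroCrossingRate(tone))
print("clipped samples:", clippedSampleCount(tone))
```

A pure tone at frequency f crosses zero about 2f times per second, so the rate for the 1 kHz tone lands near 2,000 / 32,000.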
Bundles & licence
All four MLX bundles are derived from facebook/magnet-small-30secs and facebook/magnet-medium-30secs and inherit Meta's licence: CC-BY-NC 4.0 — non-commercial use only. Generated audio carries the same restriction.
Source separation — Open-Unmix (4 stems)
Open-Unmix HQ / UMX-L ported to MLX. Splits a stereo mix into four stems — vocals, drums, bass, other instruments — via per-stem BiLSTM predictors and a multichannel Wiener-EM post-filter, all running end-to-end on MLX through the inverse STFT. Real-world RTF ~0.031 (32× faster than real time) on M-series for 30 s of audio.
# Split mix.wav into vocals/drums/bass/other.wav next to it
speech separate mix.wav
# Or keep stems together
speech separate mix.wav --output stems/
import SourceSeparation
let separator = try await SourceSeparator.fromPretrained()
let stems = try separator.separate(audio: stereoSamples, sampleRate: 44_100)
// stems.vocals, stems.drums, stems.bass, stems.other — each [Float]
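To give an intuition for the Wiener-style post-filter, the sketch below applies a power-ratio soft mask to a single time-frequency bin. It is a deliberate simplification: the real multichannel Wiener-EM filter is iterative and works on complex stereo STFTs with spatial covariances, none of which is modelled here. `ratioMasks` is a hypothetical helper and the magnitudes are made-up values.

```swift
/// Split one time-frequency bin among stems in proportion to each stem's
/// estimated power. The returned masks sum to (almost exactly) 1.
func ratioMasks(stemMagnitudes: [Float], epsilon: Float = 1e-10) -> [Float] {
    let powers = stemMagnitudes.map { $0 * $0 }
    let total = powers.reduce(0, +) + epsilon    // epsilon avoids division by zero
    return powers.map { $0 / total }
}

// Per-stem magnitude estimates for one bin: vocals, drums, bass, other.
let estimates: [Float] = [0.8, 0.3, 0.2, 0.1]
let masks = ratioMasks(stemMagnitudes: estimates)

// Applying the masks to the mixture bin yields stems that add back up to the mix.
let mixtureBin: Float = 1.0
let stemBins = masks.map { $0 * mixtureBin }
print(masks, stemBins.reduce(0, +))
```

Because the masks sum to 1, the separated stems recombine to the original mixture, which is the property that makes this family of filters attractive as a final step.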
See the Source Separation guide for full architecture, tuning, and benchmark notes.
Speech enhancement — DeepFilterNet3
DeepFilterNet3 on the Neural Engine (CoreML). Removes background noise from 48 kHz speech in real time with a 2.1M-parameter model — small enough to run alongside an ASR pipeline as a pre-processing step.
speech denoise noisy.wav -o clean.wav
import SpeechEnhancement
let enhancer = try await SpeechEnhancer.fromPretrained()
let clean = try enhancer.enhance(audio: noisy, sampleRate: 48_000)
See the Speech Enhancement guide for full configuration.
Picking the right tool
| You want… | Use |
|---|---|
| Generate music from a text prompt | MAGNeT (speech compose) |
| Pull vocals or drums out of an existing track | Open-Unmix (speech separate) |
| Clean up noisy speech before transcription | DeepFilterNet3 (speech denoise) |
| Convert text to speech (voice synthesis) | VoxCPM2 or Qwen3-TTS |