Compose — Music & Audio Production

Three on-device modules cover the music and audio-production side of speech-swift, all running natively on Apple Silicon via MLX or CoreML. MAGNeT generates 30-second music clips from a text prompt. Source Separation (Open-Unmix) splits a stereo track into four stems (vocals / drums / bass / other). Speech Enhancement (DeepFilterNet3) removes background noise from speech in real time.

Module	Task	Backend	Output	CLI
MAGNeT	Text → music	MLX (INT4 / INT8)	30 s @ 32 kHz mono	`speech compose`
Open-Unmix	Stem separation	MLX	4 stems @ 44.1 kHz stereo	`speech separate`
DeepFilterNet3	Noise suppression	CoreML (Neural Engine)	48 kHz, real-time	`speech denoise`

MAGNeT — text-to-music generation

MLX Swift port of Meta's MAGNeT (Masked Audio Generation with a Single Non-Autoregressive Transformer). Generates 30 s clips of 32 kHz mono music from a free-form English prompt — "happy rock", "energetic EDM with synth lead", or rich descriptive text for cleaner results.

Architecture

Three loaded components, downloaded on first call:

Component	Role	Source
MAGNeT decoder LM	Masked non-autoregressive transformer over 4 EnCodec codebooks. 24 layers (Small, 300M) or 48 (Medium, 1.5B). Quantised Q/K/V/out projections + FFN linears (MLX-affine, group 64).	`aufklarer/MAGNeT-{Small,Medium}-30secs-MLX-{4,8}bit`
T5-base text encoder	110M-parameter encoder for text conditioning. FP32 (encoder-only path; no decoder, no LM head).	`t5-base`
EnCodec 32 kHz decoder	SEANet decoder (Conv1d / ConvTranspose1d / ResnetBlock / 2-layer LSTM) + 4-codebook Euclidean RVQ. Maps the LM's discrete tokens back to a 32 kHz waveform.	`mlx-community/encodec-32khz-float32`
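
The "MLX-affine, group 64" quantisation named in the table can be illustrated with a small sketch. It assumes the usual affine scheme, in which each group of 64 weights stores low-bit integer codes plus one scale and one bias, so that w ≈ scale · code + bias. The type and function names below (`QuantizedGroup`, `quantizeGroup`, `dequantizeGroup`, `quantizeRow`) are hypothetical and are not part of the speech-swift or MLX APIs; the real kernels pack codes far more compactly.

```swift
/// One quantised group: integer codes plus the scale and bias needed to decode them.
/// Hypothetical illustration type, not an MLX or speech-swift type.
struct QuantizedGroup {
    let codes: [UInt8]   // each value in 0 ... 2^bits - 1
    let scale: Float
    let bias: Float
}

/// Affine quantisation of one group: w ≈ scale * code + bias.
func quantizeGroup(_ weights: [Float], bits: Int) -> QuantizedGroup {
    let levels = Float((1 << bits) - 1)
    let lo = weights.min() ?? 0
    let hi = weights.max() ?? 0
    // Guard against a constant group, where hi == lo.
    let scale = hi > lo ? (hi - lo) / levels : 1
    let codes = weights.map { w -> UInt8 in
        let q = ((w - lo) / scale).rounded()
        return UInt8(min(max(q, 0), levels))
    }
    return QuantizedGroup(codes: codes, scale: scale, bias: lo)
}

/// Decode a group back to approximate floating-point weights.
func dequantizeGroup(_ group: QuantizedGroup) -> [Float] {
    group.codes.map { Float($0) * group.scale + group.bias }
}

/// Quantise a flat weight row in groups of `groupSize` (64 in the bundles above).
func quantizeRow(_ row: [Float], bits: Int, groupSize: Int = 64) -> [QuantizedGroup] {
    stride(from: 0, to: row.count, by: groupSize).map { start in
        let end = min(start + groupSize, row.count)
        return quantizeGroup(Array(row[start..<end]), bits: bits)
    }
}
```

With `bits: 4` each weight is rounded to one of 16 levels within its group, so the reconstruction error per weight is at most half a step (`scale / 2`), at the cost of storing one scale and one bias per 64 weights.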

Masked parallel decoding

Unlike its autoregressive sibling MusicGen, MAGNeT runs a fixed 50 forward passes in total (default split [20, 10, 10, 10] across the 4 codebooks), combining cosine-scheduled remasking, classifier-free guidance annealing, and per-stage local attention windows. Stage 0 uses full self-attention; stages 1–3 use a local |q − k| ≤ 5 window because the higher codebooks only refine detail.
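
To make the schedule and the attention window concrete, here is a minimal sketch. It assumes a MaskGIT-style cosine schedule in which the fraction of tokens still masked falls from near 1 to 0 over a stage; the exact formula used by the port may differ. `maskedFraction` and `canAttend` are hypothetical helper names, not speech-swift APIs.

```swift
import Foundation

/// Fraction of tokens left masked after step `step` (0-based) of `totalSteps`.
/// Assumed cosine schedule: close to 1 early on, 0 (up to rounding) after the last step.
func maskedFraction(step: Int, totalSteps: Int) -> Double {
    let progress = Double(step + 1) / Double(totalSteps)
    return max(0, cos(Double.pi / 2 * progress))
}

/// Local attention rule for stages 1–3: query q may attend to key k only if |q − k| ≤ window.
func canAttend(query: Int, key: Int, window: Int = 5) -> Bool {
    abs(query - key) <= window
}

let stepsPerCodebook = [20, 10, 10, 10]          // default split from the text
let totalPasses = stepsPerCodebook.reduce(0, +)  // sums to the 50 passes quoted above

for (stage, steps) in stepsPerCodebook.enumerated() {
    let schedule = (0..<steps).map { maskedFraction(step: $0, totalSteps: steps) }
    let preview = schedule.prefix(3)
        .map { String(format: "%.3f", $0) }
        .joined(separator: ", ")
    print("stage \(stage): \(steps) passes, first masked fractions: \(preview)")
}
print("total forward passes: \(totalPasses)")
```

Because the pass count is fixed by the step split rather than by sequence length, the cost of generation does not grow token by token the way autoregressive decoding does.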

Variants

Variant	Params	LM on-disk	Peak RSS	Wall (M-series, 30 s)	RTF
`small-int4`	300M	287 MB	~1.4 GB	~10.8 s	0.36×
`small-int8`	300M	425 MB	~1.5 GB	~11 s	0.37×
`medium-int4`	1.5B	1.36 GB	~2.2 GB	~36 s	1.20×
`medium-int8`	1.5B	2.10 GB	~3.0 GB	~36 s	1.20×

RTF (real-time factor) is wall-clock time divided by audio duration, so values below 1.0 mean faster than real time. Quantisation barely moves wall-clock time because attention, not the linear projections, dominates; the practical win from INT4 is memory rather than latency.
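
The RTF column can be reproduced from the wall-clock column with a few lines of arithmetic. The sketch below uses the figures from the table; `realTimeFactor` is a hypothetical helper, not part of the library.

```swift
/// Real-time factor: seconds of compute per second of generated audio.
func realTimeFactor(wallSeconds: Double, audioSeconds: Double) -> Double {
    wallSeconds / audioSeconds
}

let clipSeconds = 30.0
// Wall-clock times taken from the variants table above.
let wallTimes: [(variant: String, seconds: Double)] = [
    ("small-int4", 10.8), ("small-int8", 11.0),
    ("medium-int4", 36.0), ("medium-int8", 36.0),
]

for (variant, seconds) in wallTimes {
    let rtf = realTimeFactor(wallSeconds: seconds, audioSeconds: clipSeconds)
    let verdict = rtf < 1.0 ? "faster than real time" : "slower than real time"
    print("\(variant): RTF \(rtf) (\(verdict))")
}
```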

Quick start

import MAGNeTMusicGen

let model = try await MAGNeTMusicGen.fromPretrained(variant: .smallInt4)
let pcm = model.generate(text: "energetic upbeat rock anthem with electric guitar riffs, driving drums, bass groove")
// pcm: [Float] length 960_000 (30 s × 32 kHz mono)

try WAVWriter.write(samples: pcm, sampleRate: 32_000,
                    to: URL(fileURLWithPath: "out.wav"))

CLI

# Default: small-int4 (~11 s on M-series for 30 s of audio)
speech compose "happy rock" -o happy_rock.wav

# Larger model: better prompt following, ~3.3× slower
speech compose "lo-fi hip hop with mellow piano and warm vinyl crackle" \
    --variant medium-int4 -o lofi.wav

# Reproducible
speech compose "energetic EDM with synth lead" --seed 42 -o edm.wav

Flags: --variant {small,medium}-{int4,int8}, --temperature (annealed, default 3.0), --top-p (default 0.9), --cfg-max / --cfg-min (default 10.0 / 1.0), --steps "20,10,10,10" (per-codebook iterations), --seed.
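
As a rough illustration of what --cfg-max and --cfg-min control, the sketch below anneals the guidance scale from 10.0 down to 1.0 and applies standard classifier-free guidance to a pair of logit vectors. The linear ramp is an assumption made for clarity; the schedule used by the port may follow a different curve. `guidanceScale` and `guidedLogits` are hypothetical names.

```swift
/// Guidance scale for a decoding step: starts at `cfgMax` and decays to `cfgMin`.
/// `progress` runs from 0 (first step) to 1 (last step). Linear ramp is an assumption.
func guidanceScale(progress: Float, cfgMax: Float = 10.0, cfgMin: Float = 1.0) -> Float {
    let t = min(max(progress, 0), 1)
    return cfgMax + (cfgMin - cfgMax) * t
}

/// Standard classifier-free guidance: move the unconditional logits toward
/// the text-conditioned ones by `scale`.
func guidedLogits(conditional: [Float], unconditional: [Float], scale: Float) -> [Float] {
    zip(conditional, unconditional).map { c, u in u + scale * (c - u) }
}
```

At a scale of 1.0 the guided logits equal the conditional logits, so annealing toward --cfg-min 1.0 means late steps follow the text-conditioned prediction without extra amplification, while early steps at higher scales push harder toward the prompt.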

Prompt-engineering tip

Short tags like "happy rock" work but sound thin. Descriptive prompts that mention instruments, tempo, and mood noticeably improve coherence: in our quality sweep the richer prompt produced a higher zero-crossing rate (0.116 vs 0.093, i.e. more high-frequency detail) and no clipping. Compare:

"happy rock" — thin
"energetic upbeat rock anthem with electric guitar riffs, driving drums, bass groove" — richer, usually better

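The two metrics quoted above are easy to compute on a generated clip. This is a minimal sketch under assumed definitions: zero-crossing rate as the fraction of adjacent sample pairs whose signs differ, and clipping as the fraction of samples at or beyond full scale. The quality sweep may have used slightly different definitions; `zeroCrossingRate` and `clippedFraction` are hypothetical helpers.

```swift
/// Zero-crossing rate: fraction of adjacent sample pairs whose signs differ.
func zeroCrossingRate(_ samples: [Float]) -> Float {
    guard samples.count > 1 else { return 0 }
    var crossings = 0
    for i in 1..<samples.count where (samples[i - 1] < 0) != (samples[i] < 0) {
        crossings += 1
    }
    return Float(crossings) / Float(samples.count - 1)
}

/// Fraction of samples at or beyond full scale (|x| ≥ threshold).
func clippedFraction(_ samples: [Float], threshold: Float = 1.0) -> Float {
    guard !samples.isEmpty else { return 0 }
    let clipped = samples.filter { abs($0) >= threshold }.count
    return Float(clipped) / Float(samples.count)
}
```

Both functions take the `[Float]` buffer returned by `generate`, so two prompts can be compared by generating with the same --seed and printing the two values side by side.
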
Bundles & licence

All four MLX bundles are derived from facebook/magnet-small-30secs and facebook/magnet-medium-30secs and inherit Meta's licence: CC-BY-NC 4.0 — non-commercial use only. Generated audio carries the same restriction.

Source separation — Open-Unmix (4 stems)

Open-Unmix HQ / UMX-L ported to MLX. Splits a stereo mix into four stems — vocals, drums, bass, other instruments — via per-stem BiLSTM predictors and a multichannel Wiener-EM post-filter, all running end-to-end on MLX through the inverse STFT. Real-world RTF ~0.031 (32× faster than real time) on M-series for 30 s of audio.

# Split mix.wav into vocals/drums/bass/other.wav next to it
speech separate mix.wav

# Or keep stems together
speech separate mix.wav --output stems/

import SourceSeparation

let separator = try await SourceSeparator.fromPretrained()
let stems = separator.separate(audio: stereoSamples, sampleRate: 44_100)
// stems[.vocals], stems[.drums], stems[.bass], stems[.other]  — each [[Float]] (stereo)

Full architecture, tuning, and benchmark notes are in the Source Separation guide.

Speech enhancement — DeepFilterNet3

DeepFilterNet3 on the Neural Engine (CoreML). Removes background noise from 48 kHz speech in real time with a 2.1M-parameter model — small enough to run alongside an ASR pipeline as a pre-processing step.

speech denoise noisy.wav -o clean.wav

import SpeechEnhancement

let enhancer = try await SpeechEnhancer.fromPretrained()
let clean = try enhancer.enhance(audio: noisy, sampleRate: 48_000)

Full configuration details are in the Speech Enhancement guide.

Picking the right tool

You want…	Use
Generate music from a text prompt	MAGNeT (`speech compose`)
Pull vocals or drums out of an existing track	Open-Unmix (`speech separate`)
Clean up noisy speech before transcription	DeepFilterNet3 (`speech denoise`)
Convert text to speech (voice synthesis)	VoxCPM2 or Qwen3-TTS