Compose — Music & Audio Production

Three on-device modules cover the music and audio-production side of speech-swift, all running natively on Apple Silicon via MLX or CoreML. MAGNeT generates 30-second music clips from a text prompt. Source Separation (Open-Unmix) splits a stereo track into four stems (vocals / drums / bass / other). Speech Enhancement (DeepFilterNet3) removes background noise from speech in real time.

| Module | Task | Backend | Output | CLI |
|---|---|---|---|---|
| MAGNeT | Text → music | MLX (INT4 / INT8) | 30 s @ 32 kHz mono | speech compose |
| Open-Unmix | Stem separation | MLX | 4 stems @ 44.1 kHz stereo | speech separate |
| DeepFilterNet3 | Noise suppression | CoreML (Neural Engine) | 48 kHz, real-time | speech denoise |

MAGNeT — text-to-music generation

MLX Swift port of Meta's MAGNeT (Masked Audio Generation with a Single Non-Autoregressive Transformer). Generates 30 s clips of 32 kHz mono music from a free-form English prompt — "happy rock", "energetic EDM with synth lead", or rich descriptive text for cleaner results.

Architecture

Three loaded components, downloaded on first call:

| Component | Role | Source |
|---|---|---|
| MAGNeT decoder LM | Masked non-autoregressive transformer over 4 EnCodec codebooks. 24 layers (Small, 300M) or 48 (Medium, 1.5B). Quantised Q/K/V/out projections + FFN linears (MLX-affine, group 64). | aufklarer/MAGNeT-{Small,Medium}-30secs-MLX-{4,8}bit |
| T5-base text encoder | 110M-parameter encoder for text conditioning. FP32 (encoder-only path; no decoder, no LM head). | t5-base |
| EnCodec 32 kHz decoder | SEANet decoder (Conv1d / ConvTranspose1d / ResnetBlock / 2-layer LSTM) + 4-codebook Euclidean RVQ. Maps the LM's discrete tokens back to a 32 kHz waveform. | mlx-community/encodec-32khz-float32 |

Masked parallel decoding

Unlike its autoregressive sibling MusicGen, MAGNeT runs 50 forward passes in total (default split [20, 10, 10, 10] across the 4 codebooks), with cosine-scheduled remasking, classifier-free guidance annealing, and per-stage local attention windows. Stage 0 uses full self-attention; stages 1–3 use a local window of |q − k| ≤ 5, because the higher codebooks only refine details.
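
The schedule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the port's actual code: it assumes the mask ratio follows cos(πt/2) with t running linearly from 0 to 1 within each stage, and that the guidance scale is interpolated between the --cfg-max and --cfg-min defaults by that same ratio. The names `StepPlan`, `cosineSchedule`, and `canAttend` are hypothetical.

```swift
import Foundation

/// Plan for one forward pass within a codebook stage (hypothetical type).
struct StepPlan {
    let maskRatio: Double  // fraction of positions left masked for this pass
    let cfgScale: Double   // classifier-free guidance scale for this pass
}

/// Cosine remasking schedule with linearly annealed guidance (assumed form).
func cosineSchedule(steps: Int, cfgMax: Double = 10.0, cfgMin: Double = 1.0) -> [StepPlan] {
    precondition(steps > 0, "a stage needs at least one step")
    return (0..<steps).map { i in
        // t runs linearly from 0 (everything masked) to 1 (nothing masked).
        let t = steps == 1 ? 0.0 : Double(i) / Double(steps - 1)
        let maskRatio = cos(t * Double.pi / 2)
        // Strong guidance while most tokens are masked, weak guidance at the end.
        let cfg = maskRatio * cfgMax + (1 - maskRatio) * cfgMin
        return StepPlan(maskRatio: maskRatio, cfgScale: cfg)
    }
}

/// Stage 0 attends everywhere; later stages only within |q - k| <= span.
func canAttend(query q: Int, key k: Int, stage: Int, span: Int = 5) -> Bool {
    return stage == 0 || abs(q - k) <= span
}

let stageSteps = [20, 10, 10, 10]          // default per-codebook split
let totalPasses = stageSteps.reduce(0, +)  // 50 forward passes
let stage0 = cosineSchedule(steps: stageSteps[0])
```

Under these assumptions the first pass of each stage starts fully masked at the maximum guidance scale, and the last pass ends close to the minimum.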

Variants

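| Variant | Params | LM on-disk | Peak RSS | Wall (M-series, 30 s) | RTF |
|---|---|---|---|---|---|
| small-int4 | 300M | 287 MB | ~1.4 GB | ~10.8 s | 0.36× |
| small-int8 | 300M | 425 MB | ~1.5 GB | ~11 s | 0.37× |
| medium-int4 | 1.5B | 1.36 GB | ~2.2 GB | ~36 s | 1.20× |
| medium-int8 | 1.5B | 2.10 GB | ~3.0 GB | ~36 s | 1.20× |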

RTF (real-time factor) is wall-clock time divided by audio duration, so values below 1.0 mean faster than real time. Quantisation barely moves wall-clock time because attention, not the linear projections, dominates, so the practical win from INT4 is memory rather than latency.
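
As a quick check of the RTF column, the real-time factor is wall-clock time divided by the duration of the generated audio. The helper below is a hypothetical one-liner, not part of the package, applied to the wall-clock figures from the table:

```swift
/// Real-time factor: seconds of compute per second of audio (hypothetical helper).
func realTimeFactor(wallSeconds: Double, audioSeconds: Double) -> Double {
    precondition(audioSeconds > 0, "audio duration must be positive")
    return wallSeconds / audioSeconds
}

// Wall-clock figures from the variants table, each for a 30 s clip.
let smallInt4RTF = realTimeFactor(wallSeconds: 10.8, audioSeconds: 30)  // about 0.36
let mediumInt4RTF = realTimeFactor(wallSeconds: 36, audioSeconds: 30)   // about 1.2
```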

Quick start

import MAGNeTMusicGen

let model = try await MAGNeTMusicGen.fromPretrained(variant: .smallInt4)
let pcm = model.generate(text: "energetic upbeat rock anthem with electric guitar riffs, driving drums, bass groove")
// pcm: [Float] length 960_000 (30 s × 32 kHz mono)

try WAVWriter.write(samples: pcm, sampleRate: 32_000,
                    to: URL(fileURLWithPath: "out.wav"))

CLI

# Default: small-int4 (~10 s on M-series for 30 s of audio)
speech compose "happy rock" -o happy_rock.wav

# Larger model: better prompt following, ~3.3× slower
speech compose "lo-fi hip hop with mellow piano and warm vinyl crackle" \
    --variant medium-int4 -o lofi.wav

# Reproducible
speech compose "energetic EDM with synth lead" --seed 42 -o edm.wav

Flags: --variant {small,medium}-{int4,int8}, --temperature (annealed, default 3.0), --top-p (default 0.9), --cfg-max / --cfg-min (default 10.0 / 1.0), --steps "20,10,10,10" (per-codebook iterations), --seed.
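
The --steps value is a comma-separated list with one iteration count per codebook. The sketch below shows one way such a string could be validated before use; `parseSteps` and `StepsError` are hypothetical names for illustration, and the CLI's own parsing may well differ.

```swift
import Foundation

enum StepsError: Error {
    case wrongCount(Int)
    case notPositiveInteger(String)
}

/// Parse a string like "20,10,10,10" into one iteration count per codebook.
func parseSteps(_ raw: String, codebooks: Int = 4) throws -> [Int] {
    let parts = raw
        .split(separator: ",", omittingEmptySubsequences: false)
        .map { $0.trimmingCharacters(in: .whitespaces) }
    guard parts.count == codebooks else {
        throw StepsError.wrongCount(parts.count)
    }
    return try parts.map { part in
        guard let n = Int(part), n > 0 else {
            throw StepsError.notPositiveInteger(part)
        }
        return n
    }
}

let defaultSteps = try parseSteps("20,10,10,10")
let totalIterations = defaultSteps.reduce(0, +)  // 50 with the default split
```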

Prompt-engineering tip

Short tags like "happy rock" work but feel thin. Descriptive prompts that mention instruments, tempo, and mood noticeably improve coherence: in our quality sweep the richer prompt gave a higher zero-crossing rate (0.116 vs 0.093, i.e. more high-frequency detail) and zero clipping. Compare:

  • "happy rock" — thin
  • "energetic upbeat rock anthem with electric guitar riffs, driving drums, bass groove" — richer, usually better
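
To compare prompts on your own output, the two metrics mentioned above can be approximated directly from the generated samples. This is a minimal sketch assuming the common definitions (fraction of adjacent sample pairs that change sign, and fraction of samples at or near full scale); the exact definitions and the clipping threshold used in the quality sweep are not stated here, so `zeroCrossingRate`, `clippedFraction`, and the 0.999 threshold are assumptions.

```swift
/// Fraction of adjacent sample pairs whose sign differs.
func zeroCrossingRate(_ samples: [Float]) -> Double {
    guard samples.count > 1 else { return 0 }
    var crossings = 0
    for i in 1..<samples.count where (samples[i - 1] >= 0) != (samples[i] >= 0) {
        crossings += 1
    }
    return Double(crossings) / Double(samples.count - 1)
}

/// Fraction of samples whose magnitude reaches the (assumed) clipping threshold.
func clippedFraction(_ samples: [Float], threshold: Float = 0.999) -> Double {
    guard !samples.isEmpty else { return 0 }
    let clipped = samples.filter { abs($0) >= threshold }.count
    return Double(clipped) / Double(samples.count)
}

// Tiny synthetic signals, just to exercise the helpers.
let alternating: [Float] = [1, -1, 1, -1]
let constant: [Float] = [0.5, 0.5, 0.5]
let alternatingZCR = zeroCrossingRate(alternating)
let constantZCR = zeroCrossingRate(constant)
let clipShare = clippedFraction([1.0, 0.2, -1.0, 0.1])
```

In practice you would pass the `[Float]` returned by `generate` for each prompt and compare the two numbers.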

Bundles & licence

All four MLX bundles are derived from facebook/magnet-small-30secs and facebook/magnet-medium-30secs and inherit Meta's licence: CC-BY-NC 4.0 — non-commercial use only. Generated audio carries the same restriction.

Source separation — Open-Unmix (4 stems)

Open-Unmix HQ / UMX-L ported to MLX. Splits a stereo mix into four stems — vocals, drums, bass, other instruments — via per-stem BiLSTM predictors and a multichannel Wiener-EM post-filter, all running end-to-end on MLX through the inverse STFT. Real-world RTF ~0.031 (32× faster than real time) on M-series for 30 s of audio.

# Split mix.wav into vocals/drums/bass/other.wav next to it
speech separate mix.wav

# Or keep stems together
speech separate mix.wav --output stems/

import SourceSeparation

let separator = try await SourceSeparator.fromPretrained()
let stems = try separator.separate(audio: stereoSamples, sampleRate: 44_100)
// stems.vocals, stems.drums, stems.bass, stems.other  — each [Float]
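
For intuition about the post-filter, the sketch below shows a much simpler relative of it: a single-channel soft mask for one time-frequency bin, where each stem gets a share of the mix proportional to its estimated power. It is a named simplification, not the multichannel Wiener-EM filter the module runs, and `softMaskGains` is a hypothetical helper.

```swift
/// Per-stem gains for one time-frequency bin, proportional to estimated power.
/// A simplified single-channel soft mask, not multichannel Wiener-EM.
func softMaskGains(stemMagnitudes: [Float], eps: Float = 1e-10) -> [Float] {
    let powers = stemMagnitudes.map { $0 * $0 }
    let total = powers.reduce(0, +) + eps  // eps avoids division by zero in silence
    return powers.map { $0 / total }
}

// Hypothetical magnitude estimates for vocals, drums, bass, other in one bin.
let gains = softMaskGains(stemMagnitudes: [3, 1, 1, 1])
let gainSum = gains.reduce(0, +)
```

Because the gains sum to roughly one, the masked stems add back up to the original mix in that bin, which is the property the full Wiener filter also preserves.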

Full architecture, tuning, and benchmark notes are in the Source Separation guide.

Speech enhancement — DeepFilterNet3

DeepFilterNet3 on the Neural Engine (CoreML). Removes background noise from 48 kHz speech in real time with a 2.1M-parameter model — small enough to run alongside an ASR pipeline as a pre-processing step.

speech denoise noisy.wav -o clean.wav

import SpeechEnhancement

let enhancer = try await SpeechEnhancer.fromPretrained()
let clean = try enhancer.enhance(audio: noisy, sampleRate: 48_000)

Full configuration details are in the Speech Enhancement guide.

Picking the right tool

| You want… | Use |
|---|---|
| Generate music from a text prompt | MAGNeT (speech compose) |
| Pull vocals or drums out of an existing track | Open-Unmix (speech separate) |
| Clean up noisy speech before transcription | DeepFilterNet3 (speech denoise) |
| Convert text to speech (voice synthesis) | VoxCPM2 or Qwen3-TTS |