Source Separation
Open-Unmix HQ splits a stereo music track into four stems: vocals, drums, bass, and other. One BiLSTM model per stem predicts a magnitude mask over the mixture STFT; an optional Wiener post-filter reconciles the four estimates. Runs on Apple Silicon via MLX.
Two engines are available: Open-Unmix HQ (lightweight, the default) and HTDemucs (Demucs v4) — a higher-quality Hybrid Transformer model selected with --engine htdemucs. Both run on Apple Silicon via MLX and output the same four stems at 44.1 kHz.
What it is
- 4 stems per track — vocals, drums, bass, other. Each is a 2-channel 44.1 kHz stem file.
- Magnitude-mask model — each stem model predicts a non-negative mask applied to the mixture spectrogram; phase is taken from the mixture.
- Wiener post-filter (optional) — soft-mask refinement across all 4 stems so they sum coherently back to the mixture. Adds ~0.5 dB SDR.
- Small footprint — 8.9M params per stem, ~136 MB total for all 4 stems.
- Apache-2.0 — upstream weights under MIT, our CoreML/MLX conversion under Apache-2.0.
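
The Wiener bullet above can be illustrated with a minimal sketch. It assumes a simplified, single-pass, per-bin power-ratio mask. The `Bin` type and the `powerRatioMasks` function are hypothetical names for illustration, not part of the package's API, and the shipped post-filter may differ (for example by iterating or by modelling both channels jointly). What it shows is why ratio masks make the stems sum back to the mixture: in every bin the masks add up to 1.

```swift
import Foundation

/// One time-frequency bin of a complex spectrogram (hypothetical helper type).
struct Bin {
    var re: Float
    var im: Float
}

/// Simplified per-bin soft masking: mask_j = |S_j|^2 / sum_k |S_k|^2.
/// - Parameters:
///   - mixture: complex mixture STFT, flattened to one array of bins.
///   - magnitudes: one non-negative magnitude estimate per stem, same length as `mixture`.
///   - eps: power threshold below which the bin is split evenly between stems.
/// - Returns: one complex spectrogram per stem; they sum to `mixture` bin by bin.
func powerRatioMasks(mixture: [Bin], magnitudes: [[Float]], eps: Float = 1e-10) -> [[Bin]] {
    // Total estimated power per bin across all stems.
    var total = [Float](repeating: 0, count: mixture.count)
    for stem in magnitudes {
        precondition(stem.count == mixture.count, "stem/mixture length mismatch")
        for i in stem.indices { total[i] += stem[i] * stem[i] }
    }
    return magnitudes.map { stem -> [Bin] in
        stem.indices.map { i -> Bin in
            // Share of the bin's power attributed to this stem.
            let mask: Float = total[i] > eps
                ? stem[i] * stem[i] / total[i]
                : 1 / Float(magnitudes.count)
            // Scaling the complex mixture keeps the mixture phase.
            return Bin(re: mask * mixture[i].re, im: mask * mixture[i].im)
        }
    }
}

// Two stems with magnitude estimates 3 and 4 in a single bin:
// the masks are 9/25 and 16/25, so the stems add back up to the mixture bin.
let demoMixture = [Bin(re: 1, im: -2)]
let demoStems = powerRatioMasks(mixture: demoMixture, magnitudes: [[3], [4]])
```

Because each stem is the mixture scaled by a real-valued mask, this simplified version leaves the phase untouched; it only redistributes energy between stems.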
Architecture
Four independent stems, each a copy of the same network:
| Stage | Shape / operation |
|---|---|
| STFT | 4096-point FFT, 1024-hop, periodic Hann window, reflect-pad. 2049 frequency bins per frame. |
| Input normalize | Crop to 1487 bins (≈16 kHz), apply learned per-bin mean + scale from training. |
| Encoder | Linear 2974 → 512 + BatchNorm + tanh. Input is 2 channels × 1487 bins. |
| BiLSTM | 3 layers, 256 hidden per direction (512 effective). Captures temporal context across frames. |
| Decoder | Skip-concat of encoder and LSTM outputs (1024) → Linear 1024 → 512 + BN + ReLU → Linear 512 → 4098. |
| Output denorm + mask | Apply learned per-bin output scale + bias, then ReLU to get a non-negative mask. Multiply element-wise with the mixture magnitude; take phase from the mixture; iSTFT with overlap-add. |
| Wiener (optional) | Power-ratio masks across all 4 stem estimates. Refines phase so stems sum to mixture. |
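
The shapes in the table follow from the STFT settings. The sketch below recomputes them as a consistency check. The constant names and the `frameCount` helper are illustrative only, and the frame-count formula assumes a centered STFT (half a window of padding on each side), which the reflect-pad entry suggests but does not spell out.

```swift
import Foundation

let sampleRate = 44_100.0
let fftSize = 4096
let hopSize = 1024
let channelCount = 2
let croppedBins = 1487

// A real-input FFT of size N yields N/2 + 1 non-redundant frequency bins.
let frequencyBins = fftSize / 2 + 1                 // 2049
// Spacing between adjacent bins in Hz.
let hzPerBin = sampleRate / Double(fftSize)         // about 10.77 Hz
// Upper frequency covered by the cropped input.
let cropHz = Double(croppedBins) * hzPerBin         // about 16.01 kHz
// Encoder input: both channels' cropped bins, flattened.
let encoderInput = channelCount * croppedBins       // 2974
// Decoder output: both channels' full-band bins, flattened.
let decoderOutput = channelCount * frequencyBins    // 4098

/// Number of frames for a centered STFT (assumption, see the lead-in).
func frameCount(samples: Int, hop: Int) -> Int {
    samples / hop + 1
}

// A 3-minute track at 44.1 kHz.
let threeMinuteFrames = frameCount(samples: 180 * 44_100, hop: hopSize)
```

Each frame advances by 1024 samples, roughly 23 ms, which is the time step the BiLSTM sees.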
Model
| Component | Value |
|---|---|
| Parameters / stem | 8.9M |
| Parameters total (4 stems) | ~35.6M |
| Sample rate | 44.1 kHz stereo |
| Chunk latency | Offline (full-track STFT) |
| Weights | aufklarer/OpenUnmix-HQ-MLX (safetensors, ~136 MB) |
| Upstream | sigsep/open-unmix-pytorch (Stöter et al., JOSS 2019) |
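
The download size in the table can be sanity-checked from the parameter count. The sketch below is back-of-the-envelope arithmetic, not a description of the file format. It assumes the Open-Unmix weights are stored as 32-bit floats (the page does not state the precision) and ignores headers and metadata. `weightSizeMiB` is a hypothetical helper.

```swift
import Foundation

/// Approximate weight size in MiB, ignoring file headers and metadata.
func weightSizeMiB(parameters: Double, bytesPerParameter: Double) -> Double {
    parameters * bytesPerParameter / (1024 * 1024)
}

let parametersPerStem = 8.9e6
let totalParameters = 4 * parametersPerStem   // 35.6 million across the 4 stems

// Assumption: fp32 storage, 4 bytes per parameter. Gives about 135.8 MiB.
let umxMiB = weightSizeMiB(parameters: totalParameters, bytesPerParameter: 4)

// Same arithmetic for the HTDemucs figures quoted later on this page
// (168M parameters, fp16, 2 bytes each). Gives about 320.4 MiB.
let htdemucsMiB = weightSizeMiB(parameters: 168e6, bytesPerParameter: 2)
```

Under those assumptions both results land close to the sizes the tables quote (~136 MB and ~320 MB).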
HTDemucs (Demucs v4)
For higher separation quality, especially on bass and drums, the package also ships HTDemucs, Meta's Hybrid Transformer Demucs. It merges a spectrogram branch and a waveform branch through a cross-domain transformer; the shipped htdemucs_ft variant is a bag of four fine-tuned sub-models, one per stem. Weights download from Hugging Face on first use. On a directional (indicative, not exhaustive) MUSDB-sample benchmark scored with museval (BSSEval v4), it averages +3.01 dB SDR over UMX-HQ, with the largest gain on bass (+5.75 dB).
| Component | Value |
|---|---|
| Parameters | 168M (4 × 42M fine-tuned sub-models) |
| Sample rate | 44.1 kHz stereo |
| Windowing | 7.8 s segments, 25% overlap, triangular cross-fade |
| Weights | aufklarer/HTDemucs-FT-MLX (fp16, ~320 MB) |
| Upstream | facebookresearch/demucs (Rouard et al., ICASSP 2023) |
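
The windowing row describes a weighted overlap-add scheme. Below is a minimal sketch of that idea on a mono signal, under stated assumptions: `triangularWindow` and `processInSegments` are hypothetical names, the `process` closure stands in for the model, and the package's actual segment handling (padding, stereo layout, batching) may differ.

```swift
import Foundation

/// Triangular weights peaking mid-segment, e.g. [1, 2, 3, 3, 2, 1] for length 6.
func triangularWindow(_ length: Int) -> [Float] {
    (0..<length).map { Float(min($0 + 1, length - $0)) }
}

/// Runs `process` on overlapping segments and cross-fades the results.
/// `process` must return as many samples as it receives.
func processInSegments(
    _ signal: [Float],
    segmentLength: Int,
    overlap: Double = 0.25,
    process: ([Float]) -> [Float]
) -> [Float] {
    precondition(segmentLength > 0 && overlap >= 0 && overlap < 1)
    // With 25% overlap, consecutive segments start 75% of a segment apart.
    let stride = max(1, Int(Double(segmentLength) * (1 - overlap)))
    let window = triangularWindow(segmentLength)
    var output = [Float](repeating: 0, count: signal.count)
    var weightSum = [Float](repeating: 0, count: signal.count)

    var start = 0
    while start < signal.count {
        let end = min(start + segmentLength, signal.count)
        let processed = process(Array(signal[start..<end]))
        precondition(processed.count == end - start, "process must preserve length")
        for i in 0..<(end - start) {
            // Accumulate the weighted output and the weights themselves.
            output[start + i] += window[i] * processed[i]
            weightSum[start + i] += window[i]
        }
        start += stride
    }
    // Dividing by the accumulated weights turns the sum into a weighted average.
    for i in output.indices { output[i] /= weightSum[i] }
    return output
}

// With an identity "model", the cross-faded output reproduces the input.
let ramp = (0..<100).map { Float($0) }
let roundTrip = processInSegments(ramp, segmentLength: 16) { $0 }
```

At 44.1 kHz a 7.8 s segment is roughly 344,000 samples, so with 25% overlap a new segment starts about every 258,000 samples. The triangular weights favour the middle of each segment, where the model has context on both sides.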
Quick start — Swift
```swift
import SourceSeparation
import AudioCommon

let separator = try await SourceSeparator.fromPretrained()

let stereo = try AudioFileLoader.loadStereo(
    url: URL(fileURLWithPath: "song.wav"),
    targetSampleRate: 44100
)

let stems = separator.separate(audio: stereo, sampleRate: 44100)
// stems[.vocals], stems[.drums], stems[.bass], stems[.other]
// Each is [[Float]]: left channel, right channel.

try WAVWriter.writeStereo(
    left: stems[.vocals]![0],
    right: stems[.vocals]![1],
    sampleRate: 44100,
    to: URL(fileURLWithPath: "vocals.wav")
)
```
Pass wiener: true (default) for best quality. Pass targets: [.vocals] to extract only a subset of stems and skip the other models.
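
One way to check that a set of stems sums back to the mixture is to measure the residual energy. The helper below is a hypothetical utility, not part of the package. It works on plain `[[Float]]` channel arrays (left, right), so the stem dictionary's values would need to be collected into an array first, and it assumes all four stems were requested.

```swift
import Foundation

/// Ratio, in dB, of mixture energy to the energy of (mixture - sum of stems).
/// Higher means the stems reconstruct the mixture more closely;
/// returns +infinity when the reconstruction is exact.
func reconstructionSNR(mixture: [[Float]], stems: [[[Float]]]) -> Double {
    var signalEnergy = 0.0
    var residualEnergy = 0.0
    for (ch, channel) in mixture.enumerated() {
        for (n, sample) in channel.enumerated() {
            // Sum this sample across all stems that cover it.
            var sum: Float = 0
            for stem in stems where ch < stem.count && n < stem[ch].count {
                sum += stem[ch][n]
            }
            let x = Double(sample)
            let error = x - Double(sum)
            signalEnergy += x * x
            residualEnergy += error * error
        }
    }
    guard residualEnergy > 0 else { return .infinity }
    return 10 * log10(signalEnergy / residualEnergy)
}
```

Comparing this figure with and without the Wiener post-filter on the same track gives a quick, informal view of how well the stems sum back to the mixture; it is not a substitute for an SDR evaluation against reference stems.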
CLI
```sh
speech separate song.wav                                               # all 4 stems into song_stems/ (Open-Unmix)
speech separate song.wav --engine htdemucs                             # Demucs v4, higher quality
speech separate song.wav --engine htdemucs --htdemucs-precision int8   # smaller int8 bundle
speech separate song.wav --stems vocals                                # vocals only
speech separate song.wav --stems vocals,drums                          # subset
speech separate song.wav --output-dir /tmp/stems/                      # custom output dir
speech separate song.wav --verbose                                     # show timing
```
When to use
…you need a lightweight, offline source-separation pass inside an app or pipeline on Apple Silicon. 8.9M params per stem keeps download and memory modest. Magnitude-masking plus Wiener gives good stems on most pop/rock content. For state-of-the-art vocal isolation on studio material, switch to the bundled HTDemucs (Demucs v4) engine via --engine htdemucs; Open-Unmix remains the lightweight, ship-in-your-app end of the tradeoff.