Speech Enhancement

Remove background noise from speech recordings using DeepFilterNet3. The model runs on the Neural Engine via CoreML for efficient inference, while all signal processing (STFT, ERB filterbank, deep filtering) runs on the CPU via Accelerate/vDSP.

Architecture

DeepFilterNet3 uses a dual-decoder architecture that separates spectral envelope enhancement from fine-grained spectral detail recovery.

Stage         Details
STFT          Short-time Fourier transform via vDSP
Encoder       4 SepConv2d layers + SqueezedGRU
ERB Decoder   Sigmoid mask applied to ERB-scale frequency bands
DF Decoder    5-tap complex-valued filtering coefficients
iSTFT         Inverse STFT to reconstruct the time-domain signal

The ERB Decoder estimates a gain mask on the Equivalent Rectangular Bandwidth (ERB) scale, handling broad spectral shaping. The DF Decoder predicts 5-tap complex filtering coefficients for fine detail, applying learned filters directly in the frequency domain.
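The deep-filtering step can be sketched as a complex dot product: each enhanced bin is the sum of the five predicted tap coefficients multiplied with the current and four previous STFT frames at the same frequency. The sketch below is a minimal illustration, not the package's actual implementation; a real version would use vDSP's split-complex types, and the `Complex` and `deepFilter` names are hypothetical.

```swift
import Foundation

// Minimal complex type for illustration; a real implementation
// would operate on vDSP's DSPSplitComplex buffers instead.
struct Complex {
    var re: Double
    var im: Double
    static func * (a: Complex, b: Complex) -> Complex {
        Complex(re: a.re * b.re - a.im * b.im,
                im: a.re * b.im + a.im * b.re)
    }
    static func + (a: Complex, b: Complex) -> Complex {
        Complex(re: a.re + b.re, im: a.im + b.im)
    }
}

/// Apply 5-tap deep filtering to one frequency bin at frame `t`:
///   Y[t, f] = sum over i in 0..<5 of coefs[i][f] * X[t - i][f]
/// `spectrum` is indexed [frame][bin]; frames before t = 0 count as zero.
func deepFilter(spectrum: [[Complex]], coefs: [[Complex]],
                frame t: Int, bin f: Int, order: Int = 5) -> Complex {
    var y = Complex(re: 0, im: 0)
    for i in 0..<order where t - i >= 0 {
        y = y + coefs[i][f] * spectrum[t - i][f]
    }
    return y
}
```

Because the coefficients are complex, the filter can adjust both magnitude and phase per bin, which is what lets it recover detail a real-valued gain mask would smear.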

Processing Pipeline

  1. STFT — Decompose the noisy audio into time-frequency representation using vDSP
  2. ERB Features — Map STFT bins to ERB-scale frequency bands
  3. Neural Network — Encoder processes features on Neural Engine; ERB and DF decoders predict enhancement parameters
  4. ERB Masking — Apply sigmoid gain mask to suppress noise in the spectral envelope
  5. Deep Filtering — Apply 5-tap complex coefficients for fine spectral detail recovery
  6. iSTFT — Reconstruct clean audio from the enhanced spectrum
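Step 2's ERB mapping can be sketched as follows, using the standard ERB-rate (Cam) scale formula to place band edges equally spaced on the ERB scale between 0 Hz and Nyquist. This is an illustration under those assumptions; the exact band count and edge placement DeepFilterNet3 uses may differ, and the function names are hypothetical.

```swift
import Foundation

/// Convert a frequency in Hz to the ERB-rate (Cam) scale.
func hzToErb(_ hz: Double) -> Double {
    21.4 * log10(1.0 + 0.00437 * hz)
}

/// Inverse mapping: ERB-rate back to Hz.
func erbToHz(_ erb: Double) -> Double {
    (pow(10.0, erb / 21.4) - 1.0) / 0.00437
}

/// Band edges in Hz, equally spaced on the ERB scale from 0 to Nyquist.
/// STFT bins falling between two adjacent edges are pooled into one band.
func erbBandEdges(bands: Int, sampleRate: Double) -> [Double] {
    let top = hzToErb(sampleRate / 2)
    return (0...bands).map { erbToHz(top * Double($0) / Double(bands)) }
}
```

Because the ERB scale is roughly logarithmic, low frequencies get narrow bands and high frequencies get wide ones, matching how the ear resolves them and keeping the mask compact for the network.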

Model Variants

Variant          Size     Precision
INT8 (default)   ~2.2 MB  8-bit quantized
FP32             ~4.3 MB  Full precision

The model has approximately 2.1M parameters. The INT8 variant, used by default, delivers equivalent quality at roughly half the size.

CLI Usage

# Denoise audio (output to _denoised.wav)
.build/release/audio denoise noisy.wav

# Specify output file
.build/release/audio denoise noisy.wav -o clean.wav

# Use FP32 model variant
.build/release/audio denoise noisy.wav --model fp32

Options

Option         Description
--output, -o   Output file path (defaults to <input>_denoised.wav)
--model        Model variant: int8 (default) or fp32

Important

DeepFilterNet3 runs on the Neural Engine via CoreML, not on the GPU via MLX. This means it works efficiently even while other GPU-based models (ASR, TTS) are running. No metallib compilation is required.
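If you load the compiled model yourself, CoreML lets you steer execution away from the GPU explicitly. A minimal sketch, assuming direct access to a compiled model bundle (the `DeepFilterNet3.mlmodelc` path is a placeholder):

```swift
import CoreML

// Prefer the Neural Engine, with CPU fallback, and leave the GPU
// free for MLX-based models (ASR, TTS) running at the same time.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let url = URL(fileURLWithPath: "DeepFilterNet3.mlmodelc")
let model = try MLModel(contentsOf: url, configuration: config)
```

With `.cpuAndNeuralEngine`, CoreML will never schedule this model on the GPU, which is why it composes cleanly with GPU-bound workloads.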

Model Downloads

Model                         Size     HuggingFace
DeepFilterNet3 (CoreML FP16)  ~4.2 MB  aufklarer/DeepFilterNet3-CoreML

Combining with Other Models

Speech enhancement is particularly useful as a preprocessing step for other models, since downstream systems such as ASR degrade on noisy input:

# Denoise then transcribe
.build/release/audio denoise noisy.wav -o clean.wav
.build/release/audio transcribe clean.wav

Swift API

import SpeechEnhancement

// Download and load the CoreML model from the Hugging Face Hub
let model = try await DeepFilterNet3.loadFromHub()

// Run enhancement on a noisy recording
let cleanAudio = try await model.denoise(audioFile: "noisy.wav")

// Save the denoised result
try cleanAudio.write(to: "clean.wav")