# Speech Enhancement
Remove background noise from speech recordings using DeepFilterNet3. The model runs on the Neural Engine via CoreML for efficient inference, while all signal processing (STFT, ERB filterbank, deep filtering) runs on the CPU via Accelerate/vDSP.
## Architecture
DeepFilterNet3 uses a dual-decoder architecture that separates spectral envelope enhancement from fine-grained spectral detail recovery.
| Stage | Details |
|---|---|
| STFT | Short-time Fourier transform via vDSP |
| Encoder | 4 SepConv2d layers + SqueezedGRU |
| ERB Decoder | Sigmoid mask applied to ERB-scale frequency bands |
| DF Decoder | 5-tap complex-valued filtering coefficients |
| iSTFT | Inverse STFT to reconstruct the time-domain signal |
The ERB Decoder estimates a gain mask on the Equivalent Rectangular Bandwidth (ERB) scale, handling broad spectral shaping. The DF Decoder predicts 5-tap complex filtering coefficients for fine detail, applying learned filters directly in the frequency domain.
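The deep-filtering step can be pictured as a small complex-valued FIR filter applied along the time axis of each frequency bin. The NumPy sketch below is illustrative only (the actual implementation runs in Swift via Accelerate/vDSP); the `deep_filter` helper name and the `(frames, taps, bins)` coefficient layout are assumptions for the example:

```python
import numpy as np

def deep_filter(spec, coefs, order=5):
    """Apply per-bin complex FIR filters along the time axis.

    spec:  complex STFT of shape (frames, bins)
    coefs: complex coefficients of shape (frames, order, bins),
           one 5-tap filter per time-frequency bin
    """
    frames, n_bins = spec.shape
    # Pad with (order - 1) leading zero frames so the filter is causal.
    pad = np.zeros((order - 1, n_bins), dtype=spec.dtype)
    padded = np.concatenate([pad, spec])
    out = np.zeros_like(spec)
    for tau in range(order):
        # Tap `tau` looks (order - 1 - tau) frames into the past;
        # tau == order - 1 multiplies the current frame.
        out += coefs[:, tau, :] * padded[tau : tau + frames, :]
    return out
```

Because the coefficients are complex, this filter can adjust both magnitude and phase per bin, which is what lets the DF stage recover fine spectral structure that a real-valued gain mask cannot.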
## Processing Pipeline
- STFT — Decompose the noisy audio into time-frequency representation using vDSP
- ERB Features — Map STFT bins to ERB-scale frequency bands
- Neural Network — Encoder processes features on Neural Engine; ERB and DF decoders predict enhancement parameters
- ERB Masking — Apply sigmoid gain mask to suppress noise in the spectral envelope
- Deep Filtering — Apply 5-tap complex coefficients for fine spectral detail recovery
- iSTFT — Reconstruct clean audio from the enhanced spectrum
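The ERB steps above can be sketched as follows. This is a conceptual NumPy sketch, not the shipped Swift/vDSP code; the parameter values (48 kHz sample rate, 960-point FFT, 32 ERB bands) and the helper names are illustrative assumptions:

```python
import numpy as np

def erb_scale(freq_hz):
    # Frequency (Hz) -> ERB-rate scale (Glasberg & Moore approximation)
    return 21.4 * np.log10(1.0 + 0.00437 * freq_hz)

def erb_band_edges(sr=48000, n_fft=960, n_bands=32):
    # Split the linear STFT bins into bands of equal width on the ERB scale.
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    erb = erb_scale(freqs)
    edges_erb = np.linspace(erb[0], erb[-1], n_bands + 1)
    # Bin indices of the interior band edges (n_bands - 1 of them)
    return np.searchsorted(erb, edges_erb[1:-1])

def apply_erb_mask(spec, band_gains, edges):
    # Expand per-band sigmoid gains (0..1) to per-bin gains and
    # scale the noisy spectrum.
    # spec: (frames, bins) complex; band_gains: (frames, n_bands)
    band_of_bin = np.searchsorted(edges, np.arange(spec.shape[1]), side="right")
    return spec * band_gains[:, band_of_bin]
```

Grouping bins on the ERB scale keeps the mask small (32 gains per frame instead of hundreds of per-bin gains), which is why the ERB decoder handles only broad spectral shaping and leaves fine detail to the DF decoder.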
## Model Variants
| Variant | Size | Precision |
|---|---|---|
| INT8 (default) | ~2.2 MB | 8-bit quantized |
| FP32 | ~4.3 MB | Full precision |
The model has approximately 2.1M parameters. The INT8 variant is used by default; it delivers quality equivalent to FP32 at roughly half the size.
## CLI Usage

```bash
# Denoise audio (output to _denoised.wav)
.build/release/audio denoise noisy.wav

# Specify output file
.build/release/audio denoise noisy.wav -o clean.wav

# Use FP32 model variant
.build/release/audio denoise noisy.wav --model fp32
```
### Options

| Option | Description |
|---|---|
| `--output`, `-o` | Output file path (defaults to `<input>_denoised.wav`) |
| `--model` | Model variant: `int8` (default) or `fp32` |
DeepFilterNet3 runs on the Neural Engine via CoreML, not on the GPU via MLX. This means it works efficiently even while other GPU-based models (ASR, TTS) are running. No metallib compilation is required.
## Model Downloads
| Model | Size | HuggingFace |
|---|---|---|
| DeepFilterNet3 (CoreML FP16) | ~4.2 MB | aufklarer/DeepFilterNet3-CoreML |
## Combining with Other Models
Speech enhancement is particularly useful as a preprocessing step before other models:
- Before transcription — Denoise audio before running ASR to improve word error rate on noisy recordings
- Before speaker embedding — Cleaner audio produces more reliable speaker embeddings
- Before diarization — Noise removal can improve segmentation accuracy
```bash
# Denoise then transcribe
.build/release/audio denoise noisy.wav -o clean.wav
.build/release/audio transcribe clean.wav
```
## Swift API

```swift
import SpeechEnhancement

// Load the model, downloading weights from the Hub if needed
let model = try await DeepFilterNet3.loadFromHub()

// Denoise a recording and write the result to disk
let cleanAudio = try await model.denoise(audioFile: "noisy.wav")
try cleanAudio.write(to: "clean.wav")
```