# Speech Enhancement
Remove background noise from speech recordings using DeepFilterNet3. The model runs on the Neural Engine via CoreML for efficient inference, while all signal processing (STFT, ERB filterbank, deep filtering) runs on the CPU via Accelerate/vDSP.
## Architecture
DeepFilterNet3 uses a dual-decoder architecture that separates spectral envelope enhancement from fine-grained spectral detail recovery.
| Stage | Details |
|---|---|
| STFT | Short-time Fourier transform via vDSP |
| Encoder | 4 SepConv2d layers + SqueezedGRU |
| ERB Decoder | Sigmoid mask applied to ERB-scale frequency bands |
| DF Decoder | 5-tap complex-valued filtering coefficients |
| iSTFT | Inverse STFT to reconstruct the time-domain signal |
The ERB Decoder estimates a gain mask on the Equivalent Rectangular Bandwidth (ERB) scale, handling broad spectral shaping. The DF Decoder predicts 5-tap complex filtering coefficients for fine detail, applying learned filters directly in the frequency domain.
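The deep-filtering step can be pictured as a small complex-valued FIR filter applied along the time axis of each frequency bin. The NumPy sketch below is illustrative only (the actual implementation runs in Swift via Accelerate/vDSP); the `deep_filter` helper name and the `(frames, taps, bins)` coefficient layout are assumptions for the example:

```python
import numpy as np

def deep_filter(spec, coefs, order=5):
    """Apply per-bin complex FIR filters along the time axis.

    spec:  complex STFT of shape (frames, bins)
    coefs: complex coefficients of shape (frames, order, bins),
           one 5-tap filter per time-frequency bin
    """
    frames, n_bins = spec.shape
    # Pad with (order - 1) leading zero frames so the filter is causal.
    pad = np.zeros((order - 1, n_bins), dtype=spec.dtype)
    padded = np.concatenate([pad, spec])
    out = np.zeros_like(spec)
    for tau in range(order):
        # Tap `tau` looks (order - 1 - tau) frames into the past;
        # tau == order - 1 multiplies the current frame.
        out += coefs[:, tau, :] * padded[tau : tau + frames, :]
    return out
```

Because the coefficients are complex, this filter can adjust both magnitude and phase per bin, which is what lets the DF stage recover fine spectral structure that a real-valued gain mask cannot.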
## Processing Pipeline
- STFT — Decompose the noisy audio into time-frequency representation using vDSP
- ERB Features — Map STFT bins to ERB-scale frequency bands
- Neural Network — Encoder processes features on Neural Engine; ERB and DF decoders predict enhancement parameters
- ERB Masking — Apply sigmoid gain mask to suppress noise in the spectral envelope
- Deep Filtering — Apply 5-tap complex coefficients for fine spectral detail recovery
- iSTFT — Reconstruct clean audio from the enhanced spectrum
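The ERB steps above can be sketched as follows. This is a conceptual NumPy sketch, not the shipped Swift/vDSP code; the parameter values (48 kHz sample rate, 960-point FFT, 32 ERB bands) and the helper names are illustrative assumptions:

```python
import numpy as np

def erb_scale(freq_hz):
    # Frequency (Hz) -> ERB-rate scale (Glasberg & Moore approximation)
    return 21.4 * np.log10(1.0 + 0.00437 * freq_hz)

def erb_band_edges(sr=48000, n_fft=960, n_bands=32):
    # Split the linear STFT bins into bands of equal width on the ERB scale.
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    erb = erb_scale(freqs)
    edges_erb = np.linspace(erb[0], erb[-1], n_bands + 1)
    # Bin indices of the interior band edges (n_bands - 1 of them)
    return np.searchsorted(erb, edges_erb[1:-1])

def apply_erb_mask(spec, band_gains, edges):
    # Expand per-band sigmoid gains (0..1) to per-bin gains and
    # scale the noisy spectrum.
    # spec: (frames, bins) complex; band_gains: (frames, n_bands)
    band_of_bin = np.searchsorted(edges, np.arange(spec.shape[1]), side="right")
    return spec * band_gains[:, band_of_bin]
```

Grouping bins on the ERB scale keeps the mask small (32 gains per frame instead of hundreds of per-bin gains), which is why the ERB decoder handles only broad spectral shaping and leaves fine detail to the DF decoder.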
## Model Variants
| Variant | Size | Precision |
|---|---|---|
| INT8 (default) | ~2.2 MB | 8-bit quantized |
| FP32 | ~4.3 MB | Full precision |
The model has approximately 2.1M parameters. The INT8 variant is used by default; it delivers quality equivalent to FP32 at roughly half the size.
## CLI Usage

```bash
# Denoise audio (output to _denoised.wav)
.build/release/audio denoise noisy.wav

# Specify output file
.build/release/audio denoise noisy.wav -o clean.wav

# Use FP32 model variant
.build/release/audio denoise noisy.wav --model fp32
```
### Options

| Option | Description |
|---|---|
| `--output`, `-o` | Output file path (defaults to `<input>_denoised.wav`) |
| `--model` | Model variant: `int8` (default) or `fp32` |
DeepFilterNet3 runs on the Neural Engine via CoreML, not on the GPU via MLX. This means it works efficiently even while other GPU-based models (ASR, TTS) are running. No metallib compilation is required.
## Model Downloads
| Model | Size | HuggingFace |
|---|---|---|
| DeepFilterNet3 (CoreML FP16) | ~4.2 MB | aufklarer/DeepFilterNet3-CoreML |
## Combining with Other Models
Speech enhancement is particularly useful as a preprocessing step before other models:
- Before transcription — Denoise audio before running ASR to improve word error rate on noisy recordings
- Before speaker embedding — Cleaner audio produces more reliable speaker embeddings
- Before diarization — Noise removal can improve segmentation accuracy
```bash
# Denoise then transcribe
.build/release/audio denoise noisy.wav -o clean.wav
.build/release/audio transcribe clean.wav
```
## Swift API

```swift
import SpeechEnhancement

// Load the model, downloading weights from the Hub if needed
let model = try await DeepFilterNet3.loadFromHub()

// Denoise a recording and write the result to disk
let cleanAudio = try await model.denoise(audioFile: "noisy.wav")
try cleanAudio.write(to: "clean.wav")
```