Audio Super-Resolution — FlashSR

Upsample low-rate audio (8 kHz / 16 kHz / 24 kHz) to 48 kHz in a single diffusion step using FlashSR, a distilled student of AudioSR (ICASSP 2025). Reconstructs realistic high-frequency detail for both speech and music. Runs on Apple Silicon via MLX (Metal), shipped quantized at INT4 (~346 MB) or INT8 (~649 MB).

When to reach for super-resolution

FlashSR is the bridge between low-rate audio sources (phone calls, legacy recordings, low-quality streams, generated speech that bakes in artifacts at the synthesis sample rate) and high-fidelity 48 kHz playback. Pair it with DeepFilterNet3 for noisy phone calls, or with source separation for restoring music stems.

Architecture

FlashSR distills a 50-step AudioSR teacher into a single deterministic forward pass. The pipeline runs entirely on the GPU via MLX:

Stage	Module	Details
1. Mel spectrogram	HiFi-GAN-style STFT	n_fft=2048, hop=480, 256 mels, log-scale
2. VAE encode	AutoencoderKL	4-level NHWC encoder, 8× downsample, 16-channel latent
3. One-step diffusion	AudioSRUnet	Cross-attention transformer, v-prediction, cosine schedule
4. VAE decode	AutoencoderKL	Latent → reconstructed mel
5. SR Vocoder	BigVGAN-style	SnakeBeta activations, alias-free FIR, LR-audio conditioning pyramid

The vocoder is BigVGAN-flavor with one important addition: it ingests the low-rate audio as a per-level conditioning pyramid, so the upsampler stays anchored to the source waveform rather than hallucinating energy across the full band.

Processing Pipeline

Normalize — Mean-center and max-abs scale the low-rate audio to [-0.5, 0.5]
(Optional) Lowpass condition — Chebyshev-I order-8 zero-phase filter at the detected spectral cutoff (matches the model's training distribution)
Mel — STFT + Slaney mel filterbank + log
VAE encode — Mel → 16-channel latent cond_z
Single DPM-Solver step — Inject noise, predict v with the UNet conditioned on cond_z, recover x_0 via v-prediction
VAE decode — Latent → reconstructed full-band mel
Vocoder — Mel + normalized LR audio → 48 kHz waveform
Denormalize — Restore original level

Model Variants

Variant	Size	HuggingFace
INT4 (default)	~346 MB	aufklarer/FlashSR-MLX-4bit
INT8	~649 MB	aufklarer/FlashSR-MLX-8bit

Both variants are flat-quantized: each weight tensor is reshaped (O, fan_in) and quantized with group_size=64. Output quality is audibly indistinguishable between INT4 and INT8 — pick INT4 unless you have storage to spare.

CLI Usage

# Upsample to 48 kHz (output to _upsampled.wav)
.build/release/speech upsample low_rate.wav

# Specify output file
.build/release/speech upsample low_rate.wav -o hr.wav

# Use INT8 model variant
.build/release/speech upsample low_rate.wav --variant int8 -o hr.wav

Options

Option	Description
`--output`, `-o`	Output file path (defaults to `<input>_upsampled.wav`)
`--variant`	Model variant: `int4` (default) or `int8`
`--timestep`	Diffusion timestep, 0–999. Default 999. Lower values reduce hallucinated detail at the cost of brightness.
`--seed`	Seed for the diffusion noise. Defaults to system random.

Window size and length

FlashSR processes audio in fixed 5.12-second windows (245760 samples at 48 kHz). The CLI handles arbitrary-length input by windowing automatically — short clips are zero-padded; longer files are split into non-overlapping windows and concatenated.

Performance

Device	Variant	Wall time	RTF
M-series Mac	INT4	~7.8 s / 5.12 s clip	~1.5×
M-series Mac	INT8	~8.0 s / 5.12 s clip	~1.6×

RTF > 1 means the model takes longer than real-time per window — this is offline-quality restoration, not a streaming filter. For real-time noise suppression use DeepFilterNet3 instead.

Swift API

import FlashSR
import AudioCommon

// Load the INT4 bundle (downloads on first use, cached in ~/Library/Caches/qwen3-speech)
let model = try await FlashSR.loadFromHub(variant: .int4)

// Read low-rate audio (any sample rate; resampled to 48 kHz internally)
let lr = try AudioFile.read("phone_call.wav")
let hr = try model.upsample(lr.samples, sampleRate: lr.sampleRate)
try AudioFile.write(hr, path: "phone_call_48k.wav", sampleRate: 48000)

FlashSR conforms to the SpeechEnhancementModel protocol, so it slots into the same call sites as DeepFilterNet3 for audio enhancement pipelines.

Combining with Other Models

Super-resolution composes naturally with the rest of the library:

Phone-call restoration — Run speech denoise first to clean noise, then speech upsample to lift narrowband audio to 48 kHz
TTS at 24 kHz → 48 kHz — Generate with Qwen3-TTS at the model's native 24 kHz, then upsample for high-fidelity playback
Music restoration — Pair with source separation to upgrade individual stems
Pre-ASR for archive recordings — Old narrowband recordings transcribe better after upsampling because Qwen3-ASR was trained on full-band audio

# Denoise then upsample
.build/release/speech denoise noisy_phone.wav -o clean.wav
.build/release/speech upsample clean.wav -o clean_48k.wav

Implementation Notes

Two MLX-specific gotchas matter if you're porting BigVGAN-flavor models to MLX-swift:

GroupNorm — MLX's GroupNorm defaults to pytorchCompatible: false, which normalizes per-channel only (skipping spatial dims). PyTorch normalizes per-group across (channels, H, W). Without the flag the VAE encoder drifts ~80% at the first downsample level and the synthesised audio has ~6× the zero-crossing rate of the upstream reference.
Alias-free FIR padding — BigVGAN's alias_free_torch module uses padding_mode='replicate' for both the upsample and lowpass filters. Zero-padding the kaiser-sinc filter introduces boundary discontinuities that ring as HF noise.

Both fixes are documented inline in the Swift module and the Python export.