Audio Super-Resolution — FlashSR
Upsample low-rate audio (8 kHz / 16 kHz / 24 kHz) to 48 kHz in a single diffusion step using FlashSR, a distilled student of AudioSR (ICASSP 2025). Reconstructs realistic high-frequency detail for both speech and music. Runs on Apple Silicon via MLX (Metal), shipped quantized at INT4 (~346 MB) or INT8 (~649 MB).
FlashSR is the bridge between low-rate audio sources (phone calls, legacy recordings, low-quality streams, generated speech that bakes in artifacts at the synthesis sample rate) and high-fidelity 48 kHz playback. Pair it with DeepFilterNet3 for noisy phone calls, or with source separation for restoring music stems.
Architecture
FlashSR distills a 50-step AudioSR teacher into a single deterministic forward pass. The pipeline runs entirely on the GPU via MLX:
| Stage | Module | Details |
|---|---|---|
| 1. Mel spectrogram | HiFi-GAN-style STFT | n_fft=2048, hop=480, 256 mels, log-scale |
| 2. VAE encode | AutoencoderKL | 4-level NHWC encoder, 8× downsample, 16-channel latent |
| 3. One-step diffusion | AudioSRUnet | Cross-attention transformer, v-prediction, cosine schedule |
| 4. VAE decode | AutoencoderKL | Latent → reconstructed mel |
| 5. SR Vocoder | BigVGAN-style | SnakeBeta activations, alias-free FIR, LR-audio conditioning pyramid |
The vocoder is BigVGAN-flavor with one important addition: it ingests the low-rate audio as a per-level conditioning pyramid, so the upsampler stays anchored to the source waveform rather than hallucinating energy across the full band.
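Stage 3 in the table recovers the clean latent from a single v-prediction. The algebra can be sketched as follows, assuming the standard v-parameterization with a variance-preserving cosine schedule (alpha = cos(πt/2), sigma = sin(πt/2), t in [0, 1]). The exact schedule and the mapping from the 0–999 timestep to t used by FlashSR are not stated here, and the names `alpha_sigma`, `add_noise`, `v_target`, and `x0_from_v` are hypothetical.

```python
import math
import random

def alpha_sigma(t):
    """Assumed cosine schedule on t in [0, 1]; alpha^2 + sigma^2 = 1."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

def add_noise(x0, eps, t):
    """Forward process: z_t = alpha * x0 + sigma * eps."""
    a, s = alpha_sigma(t)
    return [a * x + s * e for x, e in zip(x0, eps)]

def v_target(x0, eps, t):
    """The quantity a v-prediction network is trained to output."""
    a, s = alpha_sigma(t)
    return [a * e - s * x for x, e in zip(x0, eps)]

def x0_from_v(z_t, v, t):
    """Recover the clean sample: x0 = alpha * z_t - sigma * v."""
    a, s = alpha_sigma(t)
    return [a * z - s * vv for z, vv in zip(z_t, v)]

rng = random.Random(0)
x0 = [rng.uniform(-1, 1) for _ in range(8)]   # toy "latent"
eps = [rng.gauss(0, 1) for _ in range(8)]     # injected noise
t = 0.9
z_t = add_noise(x0, eps, t)
v = v_target(x0, eps, t)                      # stand-in for a perfect UNet output
x0_hat = x0_from_v(z_t, v, t)
max_err = max(abs(a - b) for a, b in zip(x0, x0_hat))
```

With an exact `v` the recovery is lossless up to floating-point error, which is why a single well-trained step can stand in for the teacher's multi-step sampler.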
Processing Pipeline
- Normalize — Mean-center and max-abs scale the low-rate audio to [-0.5, 0.5]
- (Optional) Lowpass condition — Chebyshev-I order-8 zero-phase filter at the detected spectral cutoff (matches the model's training distribution)
- Mel — STFT + Slaney mel filterbank + log
- VAE encode — Mel → 16-channel latent cond_z
- Single DPM-Solver step — Inject noise, predict v with the UNet conditioned on cond_z, recover x_0 via v-prediction
- VAE decode — Latent → reconstructed full-band mel
- Vocoder — Mel + normalized LR audio → 48 kHz waveform
- Denormalize — Restore original level
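The first and last steps of the list can be sketched as a matched pair. This is an illustration only: `normalize` and `denormalize` are hypothetical names, and adding the mean back on output is an assumption about what "restore original level" means.

```python
def normalize(samples, peak=0.5):
    """Mean-center, then scale so the largest absolute value equals `peak`."""
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]
    max_abs = max(abs(c) for c in centered)
    scale = peak / max_abs if max_abs > 0 else 1.0  # leave silence untouched
    return [c * scale for c in centered], mean, scale

def denormalize(samples, mean, scale):
    """Undo the scaling (and, by assumption, the mean removal)."""
    return [s / scale + mean for s in samples]

lr = [0.1, 0.3, -0.2, 0.6]
norm, mean, scale = normalize(lr)
restored = denormalize(norm, mean, scale)
```

The `mean` and `scale` values have to be carried through the pipeline, since the vocoder output is produced in the normalized domain.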
Model Variants
| Variant | Size | HuggingFace |
|---|---|---|
| INT4 (default) | ~346 MB | aufklarer/FlashSR-MLX-4bit |
| INT8 | ~649 MB | aufklarer/FlashSR-MLX-8bit |
Both variants are flat-quantized: each weight tensor is reshaped to (O, fan_in) and quantized with group_size=64. Output quality is audibly indistinguishable between INT4 and INT8, so pick INT4 unless you have storage to spare.
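To make the group size concrete, here is a sketch of per-group affine quantization over one row of a (O, fan_in) weight matrix. It assumes a generic min/max affine scheme and does not reproduce MLX's actual packing or storage layout; `quantize_groups` and `dequantize_groups` are hypothetical helpers.

```python
import random

def quantize_groups(row, bits=4, group_size=64):
    """Affine-quantize one weight row in groups of `group_size` values."""
    levels = (1 << bits) - 1
    packed = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0   # constant group: any scale works
        codes = [round((w - lo) / scale) for w in group]
        packed.append((codes, scale, lo))   # one scale and offset per group
    return packed

def dequantize_groups(packed):
    """Map integer codes back to approximate float weights."""
    out = []
    for codes, scale, lo in packed:
        out.extend(lo + c * scale for c in codes)
    return out

rng = random.Random(0)
row = [rng.gauss(0, 0.02) for _ in range(256)]   # fan_in = 256 gives 4 groups
packed = quantize_groups(row, bits=4, group_size=64)
restored = dequantize_groups(packed)
```

Each group carries its own scale and offset, so an outlier weight only degrades the 64 values that share its group rather than the whole row.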
CLI Usage
# Upsample to 48 kHz (output to <input>_upsampled.wav)
.build/release/speech upsample low_rate.wav
# Specify output file
.build/release/speech upsample low_rate.wav -o hr.wav
# Use INT8 model variant
.build/release/speech upsample low_rate.wav --variant int8 -o hr.wav
Options
| Option | Description |
|---|---|
| --output, -o | Output file path (defaults to <input>_upsampled.wav) |
| --variant | Model variant: int4 (default) or int8 |
| --timestep | Diffusion timestep, 0–999. Default 999. Lower values reduce hallucinated detail at the cost of brightness. |
| --seed | Seed for the diffusion noise. Defaults to system random. |
FlashSR processes audio in fixed 5.12-second windows (245760 samples at 48 kHz). The CLI handles arbitrary-length input by windowing automatically — short clips are zero-padded; longer files are split into non-overlapping windows and concatenated.
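The windowing described above can be sketched as follows, operating on samples already at 48 kHz. `split_windows` and `process_long` are hypothetical names, `process_window` stands in for the model call, and trimming the padded tail back to the input length is an assumption about the CLI's behavior.

```python
WINDOW = 245_760  # 5.12 s at 48 kHz

def split_windows(samples, window=WINDOW):
    """Zero-pad to a whole number of windows and split without overlap."""
    n_windows = max(1, -(-len(samples) // window))   # ceiling division
    padded = list(samples) + [0.0] * (n_windows * window - len(samples))
    return [padded[i * window:(i + 1) * window] for i in range(n_windows)]

def process_long(samples, process_window, window=WINDOW):
    """Run `process_window` on each chunk and concatenate the results."""
    out = []
    for chunk in split_windows(samples, window):
        out.extend(process_window(chunk))
    return out[:len(samples)]   # assumption: drop the zero-padded tail

# Toy run with a tiny window and an identity "model".
toy = [float(i) for i in range(10)]
chunks = split_windows(toy, window=4)
roundtrip = process_long(toy, lambda chunk: chunk, window=4)
```

Because the windows do not overlap, any discontinuity the model introduces at a window edge lands directly in the output; a crossfaded overlap would be the usual mitigation, but the text does not describe one.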
Performance
| Device | Variant | Wall time | RTF |
|---|---|---|---|
| M-series Mac | INT4 | ~7.8 s / 5.12 s clip | ~1.5× |
| M-series Mac | INT8 | ~8.0 s / 5.12 s clip | ~1.6× |
RTF > 1 means the model takes longer than real-time per window — this is offline-quality restoration, not a streaming filter. For real-time noise suppression use DeepFilterNet3 instead.
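The RTF column follows directly from the wall times, and the same numbers give a rough estimate for longer files. This sketch assumes cost scales linearly with the number of 5.12-second windows and ignores model load time; `rtf` and `estimate_wall_time` are hypothetical helpers.

```python
import math

WINDOW_SECONDS = 5.12

def rtf(wall_seconds, audio_seconds=WINDOW_SECONDS):
    """Real-time factor: processing time divided by audio duration."""
    return wall_seconds / audio_seconds

def estimate_wall_time(duration_seconds, wall_per_window):
    """Rough total time, assuming every window costs the same."""
    windows = max(1, math.ceil(duration_seconds / WINDOW_SECONDS))
    return windows * wall_per_window

rtf_int4 = rtf(7.8)   # about 1.52
rtf_int8 = rtf(8.0)   # about 1.56
one_minute_int4 = estimate_wall_time(60.0, 7.8)   # 12 windows at 7.8 s each
```

Under these assumptions a one-minute file takes roughly a minute and a half on INT4, which is in line with the offline framing above.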
Swift API
import FlashSR
import AudioCommon
// Load the INT4 bundle (downloads on first use, cached in ~/Library/Caches/qwen3-speech)
let model = try await FlashSR.loadFromHub(variant: .int4)
// Read low-rate audio (any sample rate; resampled to 48 kHz internally)
let lr = try AudioFile.read("phone_call.wav")
let hr = try model.upsample(lr.samples, sampleRate: lr.sampleRate)
try AudioFile.write(hr, path: "phone_call_48k.wav", sampleRate: 48000)
FlashSR conforms to the SpeechEnhancementModel protocol, so it slots into the same call sites as DeepFilterNet3 for audio enhancement pipelines.
Combining with Other Models
Super-resolution composes naturally with the rest of the library:
- Phone-call restoration — Run speech denoise first to clean noise, then speech upsample to lift narrowband audio to 48 kHz
- TTS at 24 kHz → 48 kHz — Generate with Qwen3-TTS at the model's native 24 kHz, then upsample for high-fidelity playback
- Music restoration — Pair with source separation to upgrade individual stems
- Pre-ASR for archive recordings — Old narrowband recordings transcribe better after upsampling because Qwen3-ASR was trained on full-band audio
# Denoise then upsample
.build/release/speech denoise noisy_phone.wav -o clean.wav
.build/release/speech upsample clean.wav -o clean_48k.wav
Implementation Notes
Two MLX-specific gotchas matter if you're porting BigVGAN-flavor models to MLX-swift:
- GroupNorm — MLX's GroupNorm defaults to pytorchCompatible: false, which normalizes per-channel only (skipping spatial dims). PyTorch normalizes per-group across (channels, H, W). Without the flag the VAE encoder drifts ~80% at the first downsample level and the synthesised audio has ~6× the zero-crossing rate of the upstream reference.
- Alias-free FIR padding — BigVGAN's alias_free_torch module uses padding_mode='replicate' for both the upsample and lowpass filters. Zero-padding the kaiser-sinc filter introduces boundary discontinuities that ring as HF noise.
Both fixes are documented inline in the Swift module and the Python export.
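The padding issue is easy to see on a toy signal. The sketch below filters a constant (DC) input with a small symmetric lowpass kernel under both padding modes; the kernel is an illustrative stand-in rather than a Kaiser-sinc design, and `fir_same` is a hypothetical helper, not part of BigVGAN or MLX.

```python
def fir_same(signal, taps, mode):
    """Same-length FIR filtering with either zero or replicate padding."""
    half = len(taps) // 2
    if mode == "replicate":
        left, right = [signal[0]] * half, [signal[-1]] * half
    elif mode == "zeros":
        left = right = [0.0] * half
    else:
        raise ValueError(mode)
    padded = left + list(signal) + right
    # Taps are symmetric, so correlation and convolution coincide here.
    return [
        sum(t * padded[i + k] for k, t in enumerate(taps))
        for i in range(len(signal))
    ]

taps = [0.1, 0.2, 0.4, 0.2, 0.1]   # toy lowpass, sums to 1
dc = [1.0] * 8
zero_padded = fir_same(dc, taps, "zeros")
replicated = fir_same(dc, taps, "replicate")
```

With replicate padding the constant input passes through unchanged, while zero padding pulls the first and last samples toward zero. That step at the boundary is the kind of discontinuity that can show up as high-frequency ringing after the filter.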