Audio Super-Resolution — FlashSR

Upsample low-rate audio (8 kHz / 16 kHz / 24 kHz) to 48 kHz in a single diffusion step using FlashSR, a distilled student of AudioSR (ICASSP 2025). Reconstructs realistic high-frequency detail for both speech and music. Runs on Apple Silicon via MLX (Metal), shipped quantized at INT4 (~346 MB) or INT8 (~649 MB).

When to reach for super-resolution

FlashSR is the bridge between low-rate audio sources (phone calls, legacy recordings, low-quality streams, generated speech that bakes in artifacts at the synthesis sample rate) and high-fidelity 48 kHz playback. Pair it with DeepFilterNet3 for noisy phone calls, or with source separation for restoring music stems.

Architecture

FlashSR distills a 50-step AudioSR teacher into a single deterministic forward pass. The pipeline runs entirely on the GPU via MLX:

StageModuleDetails
1. Mel spectrogramHiFi-GAN-style STFTn_fft=2048, hop=480, 256 mels, log-scale
2. VAE encodeAutoencoderKL4-level NHWC encoder, 8× downsample, 16-channel latent
3. One-step diffusionAudioSRUnetCross-attention transformer, v-prediction, cosine schedule
4. VAE decodeAutoencoderKLLatent → reconstructed mel
5. SR VocoderBigVGAN-styleSnakeBeta activations, alias-free FIR, LR-audio conditioning pyramid

The vocoder is BigVGAN-flavor with one important addition: it ingests the low-rate audio as a per-level conditioning pyramid, so the upsampler stays anchored to the source waveform rather than hallucinating energy across the full band.

Processing Pipeline

  1. Normalize — Mean-center and max-abs scale the low-rate audio to [-0.5, 0.5]
  2. (Optional) Lowpass condition — Chebyshev-I order-8 zero-phase filter at the detected spectral cutoff (matches the model's training distribution)
  3. Mel — STFT + Slaney mel filterbank + log
  4. VAE encode — Mel → 16-channel latent cond_z
  5. Single DPM-Solver step — Inject noise, predict v with the UNet conditioned on cond_z, recover x_0 via v-prediction
  6. VAE decode — Latent → reconstructed full-band mel
  7. Vocoder — Mel + normalized LR audio → 48 kHz waveform
  8. Denormalize — Restore original level

Model Variants

VariantSizeHuggingFace
INT4 (default)~346 MBaufklarer/FlashSR-MLX-4bit
INT8~649 MBaufklarer/FlashSR-MLX-8bit

Both variants are flat-quantized: each weight tensor is reshaped (O, fan_in) and quantized with group_size=64. Output quality is audibly indistinguishable between INT4 and INT8 — pick INT4 unless you have storage to spare.

CLI Usage

# Upsample to 48 kHz (output to _upsampled.wav)
.build/release/speech upsample low_rate.wav

# Specify output file
.build/release/speech upsample low_rate.wav -o hr.wav

# Use INT8 model variant
.build/release/speech upsample low_rate.wav --variant int8 -o hr.wav

Options

OptionDescription
--output, -oOutput file path (defaults to <input>_upsampled.wav)
--variantModel variant: int4 (default) or int8
--timestepDiffusion timestep, 0–999. Default 999. Lower values reduce hallucinated detail at the cost of brightness.
--seedSeed for the diffusion noise. Defaults to system random.
Window size and length

FlashSR processes audio in fixed 5.12-second windows (245760 samples at 48 kHz). The CLI handles arbitrary-length input by windowing automatically — short clips are zero-padded; longer files are split into non-overlapping windows and concatenated.

Performance

DeviceVariantWall timeRTF
M-series MacINT4~7.8 s / 5.12 s clip~1.5×
M-series MacINT8~8.0 s / 5.12 s clip~1.6×

RTF > 1 means the model takes longer than real-time per window — this is offline-quality restoration, not a streaming filter. For real-time noise suppression use DeepFilterNet3 instead.

Swift API

import FlashSR
import AudioCommon

// Load the INT4 bundle (downloads on first use, cached in ~/Library/Caches/qwen3-speech)
let model = try await FlashSR.loadFromHub(variant: .int4)

// Read low-rate audio (any sample rate; resampled to 48 kHz internally)
let lr = try AudioFile.read("phone_call.wav")
let hr = try model.upsample(lr.samples, sampleRate: lr.sampleRate)
try AudioFile.write(hr, path: "phone_call_48k.wav", sampleRate: 48000)

FlashSR conforms to the SpeechEnhancementModel protocol, so it slots into the same call sites as DeepFilterNet3 for audio enhancement pipelines.

Combining with Other Models

Super-resolution composes naturally with the rest of the library:

# Denoise then upsample
.build/release/speech denoise noisy_phone.wav -o clean.wav
.build/release/speech upsample clean.wav -o clean_48k.wav

Implementation Notes

Two MLX-specific gotchas matter if you're porting BigVGAN-flavor models to MLX-swift:

Both fixes are documented inline in the Swift module and the Python export.

References