Forced Alignment

Qwen3-ForcedAligner provides word-level timestamp alignment for audio. In a single non-autoregressive forward pass, it aligns each word of a transcript to its position in the audio waveform.

How It Works

The aligner uses CTC (Connectionist Temporal Classification) alignment followed by an LIS (Longest Increasing Subsequence) monotonicity correction step. The correction guarantees that the output timestamps are in chronological order, even when the raw CTC output contains minor inconsistencies.
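The source does not spell out how the LIS correction is applied, so the following is only a minimal Python sketch of one plausible approach: find a longest non-decreasing subsequence of the raw frame indices, keep those values, and overwrite every outlier with the nearest preceding kept value. The function names and the fill rule are assumptions made for illustration, not the aligner's actual implementation.

```python
from bisect import bisect_right


def longest_nondecreasing_indices(values):
    """Return the indices of one longest non-decreasing subsequence."""
    tails = []       # tails[k]: index of the last element of the best run of length k + 1
    tail_vals = []   # values at those indices, kept in parallel for bisect
    prev = [-1] * len(values)  # back-pointers used to rebuild the subsequence
    for i, v in enumerate(values):
        # bisect_right lets equal values extend a run (non-decreasing, not strict).
        k = bisect_right(tail_vals, v)
        if k == len(tails):
            tails.append(i)
            tail_vals.append(v)
        else:
            tails[k] = i
            tail_vals[k] = v
        prev[i] = tails[k - 1] if k > 0 else -1
    # Walk the back-pointers from the end of the longest run.
    out = []
    i = tails[-1] if tails else -1
    while i != -1:
        out.append(i)
        i = prev[i]
    return out[::-1]


def enforce_monotonic(frames):
    """Assumed fill rule: replace out-of-order frames with the last kept value."""
    if not frames:
        return []
    kept = longest_nondecreasing_indices(frames)
    keep = set(kept)
    last = frames[kept[0]]  # leading outliers fall back to the first kept value
    fixed = []
    for i, f in enumerate(frames):
        if i in keep:
            last = f
        fixed.append(last)
    return fixed
```

For example, `enforce_monotonic([3, 6, 40, 9, 13, 17])` returns `[3, 6, 6, 9, 13, 17]`: the out-of-order 40 is replaced by the preceding in-order value. Interpolating between the neighbouring kept values would be an equally reasonable fill rule.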

Property              Value
Alignment method      CTC with LIS monotonicity correction
Timestamp resolution  80 ms
Output classes        5000
Inference mode        Non-autoregressive (single forward pass)
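
As a small illustration of the 80 ms resolution, the Python sketch below converts between frame indices and seconds. It assumes that timestamps are integer multiples of 80 ms, which matches the sample output in this document; the names `FRAME_SECONDS`, `frame_to_seconds`, and `seconds_to_frame` are hypothetical.

```python
FRAME_SECONDS = 0.08  # 80 ms timestamp resolution, taken from the table above


def frame_to_seconds(frame_index):
    """Convert a frame index to seconds, rounded to hide float noise."""
    return round(frame_index * FRAME_SECONDS, 2)


def seconds_to_frame(seconds):
    """Convert a timestamp in seconds to the nearest frame index."""
    return round(seconds / FRAME_SECONDS)
```

Under this assumption, frame 3 corresponds to 0.24 s and a timestamp of 1.04 s corresponds to frame 13.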

CLI Usage

Align an audio file. If no transcript is provided, the audio is automatically transcribed first using Qwen3-ASR:

.build/release/audio align recording.wav

Provide a known transcript to skip automatic transcription:

.build/release/audio align recording.wav --text "The quick brown fox jumps over the lazy dog"

Options

# Specify transcript text directly
.build/release/audio align recording.wav --text "known transcript"

# Choose ASR model for auto-transcription step
.build/release/audio align recording.wav --model 1.7b

# Specify aligner model variant
.build/release/audio align recording.wav --aligner-model default

# Set language
.build/release/audio align recording.wav --language en
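
When the CLI is driven from a script, the options above can be assembled programmatically. The helper below is a hypothetical Python sketch that uses only the flags shown in this section; the function names, their parameters, the default binary path, and the assumption that results are written to standard output are all illustrative.

```python
import subprocess


def build_align_command(audio_path, text=None, model=None,
                        aligner_model=None, language=None,
                        binary=".build/release/audio"):
    """Build the argument list for `audio align` (hypothetical helper)."""
    cmd = [binary, "align", audio_path]
    # A flag is appended only when a value is given, so CLI defaults apply otherwise.
    for flag, value in (("--text", text),
                        ("--model", model),
                        ("--aligner-model", aligner_model),
                        ("--language", language)):
        if value is not None:
            cmd += [flag, value]
    return cmd


def run_align(audio_path, **options):
    """Run the aligner and return its output (assumes it prints to stdout)."""
    cmd = build_align_command(audio_path, **options)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when the transcript contains spaces or quotation marks.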

Model Variants

Multiple model variants are available, trading size for accuracy:

Variant              Model ID                                        Size
MLX 4-bit (default)  aufklarer/Qwen3-ForcedAligner-0.6B-4bit         ~979 MB
MLX 8-bit            aufklarer/Qwen3-ForcedAligner-0.6B-8bit         ~1.3 GB
MLX bf16             aufklarer/Qwen3-ForcedAligner-0.6B-bf16         ~1.8 GB
CoreML INT4          aufklarer/Qwen3-ForcedAligner-0.6B-CoreML-INT4  ~662 MB
CoreML INT8          aufklarer/Qwen3-ForcedAligner-0.6B-CoreML-INT8  ~1.1 GB

Select a variant with --aligner-model:

.build/release/audio align recording.wav --aligner-model aufklarer/Qwen3-ForcedAligner-0.6B-8bit

Output Format

The aligner outputs one line per word with start and end timestamps in seconds:

[0.24 - 0.48] The
[0.48 - 0.72] quick
[0.72 - 1.04] brown
[1.04 - 1.36] fox
[1.36 - 1.68] jumps
[1.68 - 1.92] over
[1.92 - 2.08] the
[2.08 - 2.40] lazy
[2.40 - 2.80] dog

Timestamps are reported at 80 ms resolution; in the example above, every value is a multiple of 0.08 s.
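
To consume this output in a script, each line can be parsed into a (start, end, word) tuple. The Python sketch below assumes the exact `[start - end] word` layout shown above; the regular expression and the function name `parse_alignment` are illustrative, and lines that do not match are skipped.

```python
import re

# Matches lines such as "[0.24 - 0.48] The".
LINE_RE = re.compile(r"^\[(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\]\s+(.+)$")


def parse_alignment(text):
    """Parse aligner output into a list of (start_s, end_s, word) tuples."""
    words = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # ignore blank lines or anything in a different layout
        start, end, word = match.groups()
        words.append((float(start), float(end), word))
    return words


sample = "[0.24 - 0.48] The\n[0.48 - 0.72] quick\n[0.72 - 1.04] brown\n"
for start, end, word in parse_alignment(sample):
    print(f"{word}: {start:.2f}s to {end:.2f}s")
```

The resulting tuples can then be converted to whatever downstream format is needed, such as subtitle cues.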

Important

When --text is omitted, the aligner first runs a full Qwen3-ASR transcription pass, then aligns the resulting text, so such a run loads both the ASR model and the aligner model. Providing --text skips the ASR step and loads only the aligner.