Forced Alignment

Qwen3-ForcedAligner provides word-level timestamp alignment for audio. It performs a non-autoregressive single forward pass to align each word in a transcript to its precise position in the audio waveform.

How It Works

The aligner uses CTC (Connectionist Temporal Classification) alignment followed by an LIS (Longest Increasing Subsequence) monotonicity correction step. The correction ensures that timestamps are always in order, even when the raw CTC output contains minor inconsistencies.

Property                Value
Alignment method        CTC with LIS monotonicity correction
Timestamp resolution    80 ms
Output classes          5000
Inference mode          Non-autoregressive (single forward pass)
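
The monotonicity correction can be illustrated with a minimal Python sketch. It assumes the raw output is one class index per word (index × 80 ms gives seconds) and that any index outside the longest non-decreasing subsequence is replaced by the last kept value before it. That repair rule and all function names are illustrative assumptions, not the aligner's actual implementation.

```python
from bisect import bisect_right

FRAME_SECONDS = 0.08  # 80 ms timestamp resolution


def longest_nondecreasing_positions(values):
    """Return the positions of one longest non-decreasing subsequence."""
    tails = []      # tails[k]: smallest tail value of a subsequence of length k + 1
    tails_pos = []  # position in `values` of that tail
    prev = [-1] * len(values)  # back-pointers used to rebuild the subsequence
    for i, v in enumerate(values):
        k = bisect_right(tails, v)  # bisect_right lets equal values extend a run
        if k == len(tails):
            tails.append(v)
            tails_pos.append(i)
        else:
            tails[k] = v
            tails_pos[k] = i
        prev[i] = tails_pos[k - 1] if k > 0 else -1
    keep = []
    i = tails_pos[-1] if tails_pos else -1
    while i != -1:
        keep.append(i)
        i = prev[i]
    return keep[::-1]


def enforce_monotonic(indices):
    """Replace out-of-order indices with the last kept value (assumed repair rule)."""
    keep = set(longest_nondecreasing_positions(indices))
    fixed, last = [], 0  # an outlier at the very start falls back to index 0
    for i, v in enumerate(indices):
        if i in keep:
            last = v
        fixed.append(last)
    return fixed


raw = [3, 6, 40, 9, 13, 2, 17]   # 40 and 2 break the ordering
fixed = enforce_monotonic(raw)   # [3, 6, 6, 9, 13, 13, 17]
seconds = [round(i * FRAME_SECONDS, 2) for i in fixed]
```

Under this repair rule, a long run of out-of-order indices would all receive the same value, which is one way a correction step of this kind can pin many words to a single timestamp.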

CLI Usage

Align an audio file. If no transcript is provided, the audio is automatically transcribed first using Qwen3-ASR:

.build/release/speech align recording.wav

Provide a known transcript to skip automatic transcription:

.build/release/speech align recording.wav --text "The quick brown fox jumps over the lazy dog"

Options

# Specify transcript text directly
.build/release/speech align recording.wav --text "known transcript"

# Choose ASR model for auto-transcription step
.build/release/speech align recording.wav --model 1.7b

# Specify aligner model variant
.build/release/speech align recording.wav --aligner-model default

# Set language
.build/release/speech align recording.wav --language en
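
When driving the CLI from a script, the options above can be assembled programmatically. The following Python sketch uses only the flags documented here; the helper names `build_align_command` and `run_align` are hypothetical, and it assumes the flags can be combined in a single invocation.

```python
import subprocess


def build_align_command(audio_path, text=None, model=None,
                        aligner_model=None, language=None,
                        binary=".build/release/speech"):
    """Assemble the argument list for `speech align` from the documented flags."""
    cmd = [binary, "align", audio_path]
    for flag, value in (("--text", text), ("--model", model),
                        ("--aligner-model", aligner_model),
                        ("--language", language)):
        if value is not None:  # omit flags the caller did not set
            cmd += [flag, value]
    return cmd


def run_align(audio_path, **options):
    """Run the aligner and return its standard output (needs the built binary)."""
    cmd = build_align_command(audio_path, **options)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


cmd = build_align_command("recording.wav", text="known transcript", language="en")
```

Passing the arguments as a list avoids shell quoting problems when the transcript contains quotes or other special characters.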

Language Support

Pass a --language value that matches the language of the audio. The model is officially trained on 11 languages (en, zh, ja, ko, es, fr, de, ru, it, pt, ar). The preprocessor also segments Japanese into morphemes, Korean into words, and Chinese into individual characters, and handles Thai, Lao, Khmer, Burmese, and Tibetan natively via Apple's NLTokenizer. Combining marks (Devanagari matras, Thai vowels, etc.) are preserved, so words like नमस्ते and สวัสดี stay intact.

Model Variants

Multiple model variants are available, trading size for accuracy:

Variant                Model ID                                          Size
MLX 4-bit (default)    aufklarer/Qwen3-ForcedAligner-0.6B-4bit           ~979 MB
MLX 8-bit              aufklarer/Qwen3-ForcedAligner-0.6B-8bit           ~1.3 GB
MLX bf16               aufklarer/Qwen3-ForcedAligner-0.6B-bf16           ~1.8 GB
CoreML INT4            aufklarer/Qwen3-ForcedAligner-0.6B-CoreML-INT4    ~662 MB
CoreML INT8            aufklarer/Qwen3-ForcedAligner-0.6B-CoreML-INT8    ~1.1 GB

Select a variant with --aligner-model:

.build/release/speech align recording.wav --aligner-model aufklarer/Qwen3-ForcedAligner-0.6B-8bit

Output Format

The aligner outputs one line per word with start and end timestamps in seconds:

[0.24 - 0.48] The
[0.48 - 0.72] quick
[0.72 - 1.04] brown
[1.04 - 1.36] fox
[1.36 - 1.68] jumps
[1.68 - 1.92] over
[1.92 - 2.08] the
[2.08 - 2.40] lazy
[2.40 - 2.80] dog

Each timestamp pair gives the start and end time of the word in the audio. Because the resolution is 80 ms, every value is a multiple of 0.08 seconds.
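
For downstream processing, the line format shown above is straightforward to parse. The following Python sketch assumes each word line matches `[start - end] word` exactly as in the sample; the function name `parse_alignment` is hypothetical, and lines that do not match (such as status messages) are skipped.

```python
import re

# Matches lines such as "[0.24 - 0.48] The"
LINE_RE = re.compile(r"^\[(\d+(?:\.\d+)?) - (\d+(?:\.\d+)?)\]\s+(.+)$")


def parse_alignment(output):
    """Parse '[start - end] word' lines into (start, end, word) tuples."""
    words = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:  # ignore blank lines and anything that is not a word line
            start, end, word = match.groups()
            words.append((float(start), float(end), word))
    return words


sample = "[0.24 - 0.48] The\n[0.48 - 0.72] quick\n"
words = parse_alignment(sample)
```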

Long-audio handling

The classification head can in principle address up to 400 seconds (5000 classes × 80 ms), but the shipped Qwen3-ForcedAligner-0.6B model is reliable only up to about 270 seconds of audio. Past that point, it emits noisy timestamp indices, and the LIS post-processing collapses every trailing word onto the same timestamp.
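
The arithmetic behind these limits can be checked in a few lines of Python. The constants come from the figures quoted here; the 270-second threshold is the approximate value stated above, and the function names are hypothetical.

```python
NUM_CLASSES = 5000        # output classes of the classification head
FRAME_MS = 80             # timestamp resolution in milliseconds
RELIABLE_SECONDS = 270.0  # approximate reliable range stated in this section


def max_addressable_seconds():
    """Longest duration the output classes can represent, in seconds."""
    return NUM_CLASSES * FRAME_MS / 1000


def likely_needs_chunking(duration_seconds):
    """Rough check for whether a file exceeds the reliable range."""
    return duration_seconds > RELIABLE_SECONDS


limit = max_addressable_seconds()           # 400.0
needs_chunks = likely_needs_chunking(306.2)  # True
```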

The CLI handles this automatically: long audio is chunked at the saturation point and re-aligned. You will see a one-line message when chunking kicks in:

Audio 306.2s saturated after word 690 (272.6s); chunking remaining 33.6s (pass 2)

Set ALIGN_DEBUG=1 to dump raw vs. corrected timestamp indices when investigating misaligned outputs.

Known limitation: leading non-speech

When audio starts with non-speech (music intro, long silence), the model often stamps the first word near 0 seconds because the classifier has no notion of "speech hasn't started yet". Workaround: trim the leading non-speech before aligning, or run a VAD pre-pass with Silero to find the actual speech start.
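
The trimming workaround can be sketched in Python. Note that this uses a crude energy threshold rather than Silero VAD, so it detects leading silence only and will not distinguish music from speech. It assumes mono samples normalized to the range -1.0 to 1.0; the function names, frame length, and threshold are illustrative assumptions.

```python
import math


def first_active_second(samples, sample_rate, frame_ms=20, threshold=0.02):
    """Return the start time (s) of the first frame whose RMS exceeds threshold."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms > threshold:
            return start / sample_rate
    return None  # no frame was loud enough


def shift_words(words, offset):
    """Add the trimmed duration back so timestamps refer to the original file."""
    return [(round(s + offset, 2), round(e + offset, 2), w) for s, e, w in words]


# One second of silence followed by one second of signal at 16 kHz
samples = [0.0] * 16000 + [0.5] * 16000
offset = first_active_second(samples, 16000)         # 1.0
restored = shift_words([(0.24, 0.48, "The")], offset)  # [(1.24, 1.48, "The")]
```

After trimming the audio at the detected offset and aligning the trimmed file, shifting the timestamps by the same offset maps them back onto the original recording.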

Important

When no --text is provided, the aligner first runs a full Qwen3-ASR transcription pass, then aligns the resulting text. This means the first run loads both the ASR model and the aligner model. Providing --text skips the ASR step and only loads the aligner.