Forced Alignment
Qwen3-ForcedAligner provides word-level timestamp alignment for audio. It aligns each word of a transcript to its position in the audio waveform in a single non-autoregressive forward pass.
How It Works
The aligner uses CTC (Connectionist Temporal Classification) alignment with a LIS (Longest Increasing Subsequence) monotonicity correction step. This ensures timestamps are always in order, even when the raw CTC output contains minor inconsistencies.
| Property | Value |
|---|---|
| Alignment method | CTC with LIS monotonicity correction |
| Timestamp resolution | 80 ms |
| Output classes | 5000 |
| Inference mode | Non-autoregressive (single forward pass) |
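The monotonicity correction can be illustrated with a minimal Python sketch. It assumes the raw model output is one frame index per word boundary, keeps a longest non-decreasing subsequence of those indices, and overwrites every other index with the nearest kept value before it. The name lis_correct and the carry-forward repair rule are assumptions made for illustration; how the aligner actually repairs out-of-order indices is not documented here.

```python
from bisect import bisect_right

FRAME_SECONDS = 0.08  # 80 ms timestamp resolution, from the table above


def lis_correct(indices):
    """Return a non-decreasing copy of `indices`.

    Values on a longest non-decreasing subsequence are kept. Every other
    value is replaced by the nearest kept value before it (assumed rule).
    """
    if not indices:
        return []
    tails = []      # tails[k]: smallest tail of a subsequence of length k + 1
    tails_pos = []  # position in `indices` of that tail
    prev = [-1] * len(indices)  # back-pointers for reconstruction
    for i, value in enumerate(indices):
        k = bisect_right(tails, value)
        if k == len(tails):
            tails.append(value)
            tails_pos.append(i)
        else:
            tails[k] = value
            tails_pos[k] = i
        prev[i] = tails_pos[k - 1] if k > 0 else -1

    # Walk the back-pointers to collect the positions that are kept.
    keep = set()
    i = tails_pos[-1]
    while i != -1:
        keep.add(i)
        i = prev[i]

    # Repair dropped positions by carrying the last kept value forward.
    corrected = list(indices)
    last = indices[min(keep)]
    for i in range(len(corrected)):
        if i in keep:
            last = corrected[i]
        else:
            corrected[i] = last
    return corrected


def to_seconds(indices):
    """Convert frame indices to seconds on the 80 ms grid."""
    return [round(i * FRAME_SECONDS, 2) for i in indices]
```

For example, lis_correct([3, 6, 40, 9, 13]) treats 40 as the outlier and returns [3, 6, 6, 9, 13]. A carry-forward rule like this one would also explain why a long run of noisy indices ends up on a single timestamp.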
CLI Usage
Align an audio file. If no transcript is provided, the audio is automatically transcribed first using Qwen3-ASR:
.build/release/speech align recording.wav
Provide a known transcript to skip automatic transcription:
.build/release/speech align recording.wav --text "The quick brown fox jumps over the lazy dog"
Options
# Specify transcript text directly
.build/release/speech align recording.wav --text "known transcript"
# Choose ASR model for auto-transcription step
.build/release/speech align recording.wav --model 1.7b
# Specify aligner model variant
.build/release/speech align recording.wav --aligner-model default
# Set language
.build/release/speech align recording.wav --language en
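When the CLI is driven from a script, the options above can be assembled programmatically. The following is a minimal Python sketch that uses only the flags shown in this section. The helper names build_align_command and run_align are hypothetical, and the sketch assumes the alignment is written to standard output.

```python
import subprocess

SPEECH_BIN = ".build/release/speech"  # binary path as used in the examples above


def build_align_command(audio, text=None, model=None,
                        aligner_model=None, language=None):
    """Build the argument list for `speech align` from optional settings."""
    cmd = [SPEECH_BIN, "align", audio]
    if text is not None:
        cmd += ["--text", text]            # known transcript, skips ASR
    if model is not None:
        cmd += ["--model", model]          # ASR model for auto-transcription
    if aligner_model is not None:
        cmd += ["--aligner-model", aligner_model]
    if language is not None:
        cmd += ["--language", language]
    return cmd


def run_align(audio, **options):
    """Run the aligner and return its standard output (assumed to hold the result)."""
    result = subprocess.run(
        build_align_command(audio, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when the transcript contains quotes or spaces.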
Language Support
Pass the --language value that matches the audio's language. The model is officially trained on 11 languages (en, zh, ja, ko, es, fr, de, ru, it, pt, ar). The preprocessor also segments Japanese into morphemes, Korean into words, and Chinese per character, and it segments Thai, Lao, Khmer, Burmese, and Tibetan natively via Apple's NLTokenizer. Combining marks (Devanagari matras, Thai vowels, etc.) are preserved so words like नमस्ते and สวัสดี stay intact.
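To see why preserving combining marks matters, consider a minimal Python sketch that cleans a token by Unicode general category. It is an illustration only and not the preprocessor's actual code, which relies on NLTokenizer; the function names are hypothetical.

```python
import unicodedata


def keep_word_chars(token):
    """Drop punctuation but keep letters (L*), marks (M*), and digits (N*)."""
    return "".join(
        ch for ch in token if unicodedata.category(ch)[0] in ("L", "M", "N")
    )


def letters_only(token):
    """Naive cleanup that keeps only letters and so loses combining marks."""
    return "".join(ch for ch in token if unicodedata.category(ch)[0] == "L")
```

With keep_word_chars, a token such as नमस्ते followed by punctuation keeps its virama and vowel sign, whereas letters_only strips both and leaves a different, malformed word.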
Model Variants
Multiple model variants are available, trading size for accuracy:
| Variant | Model ID | Size |
|---|---|---|
| MLX 4-bit (default) | aufklarer/Qwen3-ForcedAligner-0.6B-4bit | ~979 MB |
| MLX 8-bit | aufklarer/Qwen3-ForcedAligner-0.6B-8bit | ~1.3 GB |
| MLX bf16 | aufklarer/Qwen3-ForcedAligner-0.6B-bf16 | ~1.8 GB |
| CoreML INT4 | aufklarer/Qwen3-ForcedAligner-0.6B-CoreML-INT4 | ~662 MB |
| CoreML INT8 | aufklarer/Qwen3-ForcedAligner-0.6B-CoreML-INT8 | ~1.1 GB |
Select a variant with --aligner-model:
.build/release/speech align recording.wav --aligner-model aufklarer/Qwen3-ForcedAligner-0.6B-8bit
Output Format
The aligner outputs one line per word with start and end timestamps in seconds:
[0.24 - 0.48] The
[0.48 - 0.72] quick
[0.72 - 1.04] brown
[1.04 - 1.36] fox
[1.36 - 1.68] jumps
[1.68 - 1.92] over
[1.92 - 2.08] the
[2.08 - 2.40] lazy
[2.40 - 2.80] dog
Each timestamp pair indicates the start and end time of the word in the audio, at 80 ms resolution.
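For downstream processing, the lines above can be parsed into structured records. Here is a minimal Python sketch that assumes the exact "[start - end] word" layout shown above. The names AlignedWord, parse_alignment, and on_frame_grid are hypothetical helpers, not part of the CLI.

```python
import re
from typing import NamedTuple


class AlignedWord(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    word: str


# Matches lines such as "[0.24 - 0.48] The".
LINE_RE = re.compile(r"^\[(\d+(?:\.\d+)?) - (\d+(?:\.\d+)?)\]\s+(.*)$")


def parse_alignment(output):
    """Parse aligner output text into a list of AlignedWord records.

    Lines that do not match the expected layout are skipped.
    """
    words = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            start, end, word = match.groups()
            words.append(AlignedWord(float(start), float(end), word))
    return words


def on_frame_grid(seconds, frame=0.08, tol=1e-6):
    """Check that a timestamp falls on a multiple of the 80 ms frame size."""
    frames = seconds / frame
    return abs(frames - round(frames)) < tol
```

Skipping non-matching lines keeps the parser tolerant of status messages that may appear alongside the alignment.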
Long-audio handling
The classification head can address up to 400 seconds in principle (5000 classes × 80 ms), but the shipped Qwen3-ForcedAligner-0.6B model aligns reliably only up to about 270 seconds of audio. Past that point, the model emits noisy timestamp indices, and the LIS post-processing collapses every trailing word onto the same timestamp.
The CLI handles this automatically: long audio is chunked at the saturation point and re-aligned. You will see a one-line message when chunking kicks in:
Audio 306.2s saturated after word 690 (272.6s); chunking remaining 33.6s (pass 2)
Set ALIGN_DEBUG=1 to dump raw vs. corrected timestamp indices when investigating misaligned outputs.
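The numbers in this section can be checked with a short Python sketch. It also shows one plausible way to spot the collapse symptom described above: a trailing run of words that share a single start time. The detection rule, the min_run threshold, and the function names are assumptions for illustration; the CLI's actual saturation test is not documented here.

```python
CLASSES = 5000   # output classes of the classification head
FRAME_MS = 80    # timestamp resolution in milliseconds


def max_addressable_seconds():
    """Theoretical ceiling of the head: 5000 classes x 80 ms."""
    return CLASSES * FRAME_MS / 1000


def find_saturation(starts, min_run=3):
    """Return the index of the first word in a trailing run of identical
    start times, or None if the run is shorter than `min_run`.

    `min_run` is a hypothetical threshold chosen for this sketch.
    """
    if len(starts) < min_run:
        return None
    i = len(starts) - 1
    while i > 0 and starts[i - 1] == starts[-1]:
        i -= 1
    run_length = len(starts) - i
    return i if run_length >= min_run else None


def remaining_seconds(total, saturated_at):
    """Audio left to align in the next pass, rounded to 0.1 s."""
    return round(total - saturated_at, 1)
```

With the values from the log line above, remaining_seconds(306.2, 272.6) gives 33.6, which matches the reported remainder for pass 2.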
Known limitation: leading non-speech
When audio starts with non-speech (a music intro or a long silence), the model often stamps the first word near 0 seconds because the classifier has no notion of "speech hasn't started yet". Workaround: trim the leading non-speech before aligning, or run a VAD (voice activity detection) pre-pass with Silero to find the actual speech start.
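If the leading non-speech is trimmed before alignment, the resulting timestamps are relative to the trimmed file. A minimal Python sketch for mapping them back onto the original recording follows. The function name shift_timestamps and the trim_seconds parameter are hypothetical, and the trim length is assumed to come from a VAD pass or manual inspection.

```python
def shift_timestamps(words, trim_seconds):
    """Add the trimmed lead-in back to each (start, end, word) tuple.

    `words` holds timestamps relative to the trimmed audio;
    `trim_seconds` is how much audio was cut from the beginning.
    """
    if trim_seconds < 0:
        raise ValueError("trim_seconds must be non-negative")
    return [
        (round(start + trim_seconds, 2), round(end + trim_seconds, 2), word)
        for start, end, word in words
    ]
```

For instance, after cutting a 12-second music intro, a word aligned at 0.24 to 0.48 seconds in the trimmed file sits at 12.24 to 12.48 seconds in the original.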
When no --text is provided, the aligner first runs a full Qwen3-ASR transcription pass and then aligns the resulting text, so that run loads both the ASR model and the aligner model. Providing --text skips the ASR step and loads only the aligner.