Forced Alignment
Qwen3-ForcedAligner provides word-level timestamp alignment for audio. It performs a non-autoregressive single forward pass to align each word in a transcript to its precise position in the audio waveform.
How It Works
The aligner uses CTC (Connectionist Temporal Classification) alignment followed by an LIS (Longest Increasing Subsequence) monotonicity correction step. The correction keeps word timestamps in non-decreasing order even when the raw CTC output contains minor inconsistencies.
| Property | Value |
|---|---|
| Alignment method | CTC with LIS monotonicity correction |
| Timestamp resolution | 80 ms |
| Output classes | 5000 |
| Inference mode | Non-autoregressive (single forward pass) |
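The monotonicity correction can be illustrated with a minimal Python sketch. This is not the project's own code: the function names (`enforce_monotonic`, `frames_to_seconds`) are hypothetical, and it assumes that each predicted class is a frame index in 80 ms steps and that out-of-order predictions are replaced by the nearest preceding in-order value. The actual repair strategy used by the aligner may differ.

```python
from bisect import bisect_right

FRAME_SECONDS = 0.08  # 80 ms timestamp resolution (from the table above)


def longest_nondecreasing_subsequence(values):
    """Return the indices of one longest non-decreasing subsequence."""
    tails = []      # tails[k]: smallest last value of a subsequence of length k + 1
    tails_idx = []  # position in `values` of that last value
    prev = [-1] * len(values)  # back-pointers for reconstruction
    for i, v in enumerate(values):
        k = bisect_right(tails, v)  # bisect_right allows equal neighbours
        if k == len(tails):
            tails.append(v)
            tails_idx.append(i)
        else:
            tails[k] = v
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    keep = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        keep.append(i)
        i = prev[i]
    return keep[::-1]


def enforce_monotonic(frame_indices):
    """Assumed repair rule: values outside the LIS copy the last in-order value."""
    if not frame_indices:
        return []
    keep = set(longest_nondecreasing_subsequence(frame_indices))
    fixed = list(frame_indices)
    last_good = frame_indices[min(keep)]  # leading outliers take the first kept value
    for i in range(len(fixed)):
        if i in keep:
            last_good = fixed[i]
        else:
            fixed[i] = last_good
    return fixed


def frames_to_seconds(frame_indices):
    """Convert frame indices to seconds, assuming one class per 80 ms step."""
    return [round(i * FRAME_SECONDS, 2) for i in frame_indices]


raw = [3, 6, 40, 9, 13, 17]       # 40 is an out-of-order prediction
fixed = enforce_monotonic(raw)    # [3, 6, 6, 9, 13, 17]
print(frames_to_seconds(fixed))   # [0.24, 0.48, 0.48, 0.72, 1.04, 1.36]
```

Under the same assumption, 5000 output classes at 80 ms each would cover 5000 × 0.08 s = 400 s of audio; the source does not state this limit explicitly.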
CLI Usage
Align an audio file. If no transcript is provided, the audio is automatically transcribed first using Qwen3-ASR:
.build/release/audio align recording.wav
Provide a known transcript to skip automatic transcription:
.build/release/audio align recording.wav --text "The quick brown fox jumps over the lazy dog"
Options
# Specify transcript text directly
.build/release/audio align recording.wav --text "known transcript"
# Choose ASR model for auto-transcription step
.build/release/audio align recording.wav --model 1.7b
# Specify aligner model variant
.build/release/audio align recording.wav --aligner-model default
# Set language
.build/release/audio align recording.wav --language en
Language Support
Pass --language with the code that matches the language of the audio. The model is officially trained on 11 languages (en, zh, ja, ko, es, fr, de, ru, it, pt, ar). The preprocessor also segments Japanese by morpheme, Korean by word, and Chinese by character, and handles Thai, Lao, Khmer, Burmese, and Tibetan natively via Apple's NLTokenizer. Combining marks (Devanagari matras, Thai vowels, etc.) are preserved, so words like नमस्ते and สวัสดี stay intact.
Model Variants
Multiple model variants are available, trading size for accuracy:
| Variant | Model ID | Size |
|---|---|---|
| MLX 4-bit (default) | aufklarer/Qwen3-ForcedAligner-0.6B-4bit | ~979 MB |
| MLX 8-bit | aufklarer/Qwen3-ForcedAligner-0.6B-8bit | ~1.3 GB |
| MLX bf16 | aufklarer/Qwen3-ForcedAligner-0.6B-bf16 | ~1.8 GB |
| CoreML INT4 | aufklarer/Qwen3-ForcedAligner-0.6B-CoreML-INT4 | ~662 MB |
| CoreML INT8 | aufklarer/Qwen3-ForcedAligner-0.6B-CoreML-INT8 | ~1.1 GB |
Select a variant with --aligner-model:
.build/release/audio align recording.wav --aligner-model aufklarer/Qwen3-ForcedAligner-0.6B-8bit
Output Format
The aligner outputs one line per word with start and end timestamps in seconds:
[0.24 - 0.48] The
[0.48 - 0.72] quick
[0.72 - 1.04] brown
[1.04 - 1.36] fox
[1.36 - 1.68] jumps
[1.68 - 1.92] over
[1.92 - 2.08] the
[2.08 - 2.40] lazy
[2.40 - 2.80] dog
Timestamps are reported at 80 ms resolution, so every value in the example above is a multiple of 0.08 s.
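For downstream processing, the line format shown above can be parsed with a short Python sketch. The `AlignedWord` type and `parse_alignment` function are hypothetical names introduced for illustration; the sketch assumes every result line follows the `[start - end] word` pattern exactly and skips anything else.

```python
import re
from typing import NamedTuple


class AlignedWord(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    word: str


# Matches lines such as "[0.24 - 0.48] The"
LINE_RE = re.compile(r"^\[(\d+(?:\.\d+)?) - (\d+(?:\.\d+)?)\]\s+(.+)$")


def parse_alignment(text):
    """Parse aligner output into a list of AlignedWord tuples."""
    words = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # ignore blank lines and any non-matching output
        start, end, word = match.groups()
        words.append(AlignedWord(float(start), float(end), word))
    return words


sample = """[0.24 - 0.48] The
[0.48 - 0.72] quick
[0.72 - 1.04] brown"""

for w in parse_alignment(sample):
    print(f"{w.word}: {w.end - w.start:.2f} s")
```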
When no --text is provided, the aligner first runs a full Qwen3-ASR transcription pass, then aligns the resulting text, so such a run loads both the ASR model and the aligner model. Providing --text skips the ASR step and loads only the aligner.
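When scripting the CLI, the choice between the two modes can be made explicit. The following Python sketch builds the command line from the flags documented above; `build_align_command` and `run_align` are hypothetical helpers, and the sketch assumes the alignment is written to standard output.

```python
import subprocess


def build_align_command(audio_path, text=None, model=None,
                        aligner_model=None, language=None,
                        binary=".build/release/audio"):
    """Assemble the argument list for `audio align` using only documented flags."""
    cmd = [binary, "align", audio_path]
    if text is not None:
        cmd += ["--text", text]            # known transcript: skips the ASR step
    if model is not None:
        cmd += ["--model", model]          # ASR model for auto-transcription
    if aligner_model is not None:
        cmd += ["--aligner-model", aligner_model]
    if language is not None:
        cmd += ["--language", language]
    return cmd


def run_align(audio_path, **options):
    """Run the CLI and return its standard output (assumed to hold the result)."""
    cmd = build_align_command(audio_path, **options)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


print(build_align_command("recording.wav", text="known transcript", language="en"))
```

Passing the transcript as a separate list element avoids shell quoting problems with spaces or punctuation in the text.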