Forced Alignment
Qwen3-ForcedAligner provides word-level timestamp alignment for audio. It performs a non-autoregressive single forward pass to align each word in a transcript to its precise position in the audio waveform.
How It Works
The aligner uses CTC (Connectionist Temporal Classification) alignment with a LIS (Longest Increasing Subsequence) monotonicity correction step. This ensures timestamps are always in order, even when the raw CTC output contains minor inconsistencies.
| Property | Value |
|---|---|
| Alignment method | CTC with LIS monotonicity correction |
| Timestamp resolution | 80 ms |
| Output classes | 5000 |
| Inference mode | Non-autoregressive (single forward pass) |
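The monotonicity correction can be illustrated with a small Python sketch. It is not the aligner's actual implementation. It assumes each predicted value is a frame index at the 80 ms resolution listed above, keeps one longest non-decreasing subsequence, and repairs every out-of-order value by carrying the previous kept value forward. The repair rule and the function names are assumptions made for illustration.

```python
from bisect import bisect_right

FRAME_SECONDS = 0.08  # 80 ms timestamp resolution, from the table above


def longest_nondecreasing_indices(values):
    """Return positions of one longest non-decreasing subsequence of `values`."""
    tails = []     # tails[k]: smallest tail value of a subsequence of length k + 1
    tail_pos = []  # position in `values` of that tail
    prev = [-1] * len(values)  # back-pointers used to rebuild the subsequence
    for i, v in enumerate(values):
        k = bisect_right(tails, v)  # bisect_right allows equal neighbours
        if k == len(tails):
            tails.append(v)
            tail_pos.append(i)
        else:
            tails[k] = v
            tail_pos[k] = i
        prev[i] = tail_pos[k - 1] if k > 0 else -1
    keep = []
    i = tail_pos[-1] if tail_pos else -1
    while i != -1:
        keep.append(i)
        i = prev[i]
    return keep[::-1]


def enforce_monotonic(frames):
    """Replace out-of-order frame indices so the result never decreases.

    Assumption: an outlier takes the value of the nearest kept frame to its
    left (or the first kept frame if it precedes all of them).
    """
    if not frames:
        return []
    keep = longest_nondecreasing_indices(frames)
    keep_set = set(keep)
    last = frames[keep[0]]
    fixed = []
    for i, f in enumerate(frames):
        if i in keep_set:
            last = f
        fixed.append(last)
    return fixed


def frames_to_seconds(frames):
    """Convert frame indices to seconds at 80 ms per frame."""
    return [round(f * FRAME_SECONDS, 2) for f in frames]


raw = [3, 6, 9, 40, 13, 17, 21]  # frame 40 breaks the ordering
print(frames_to_seconds(enforce_monotonic(raw)))
```

In this example only the value 40 falls outside the longest non-decreasing subsequence, so it is the only value that changes.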
CLI Usage
Align an audio file. If no transcript is provided, the audio is automatically transcribed first using Qwen3-ASR:
.build/release/audio align recording.wav
Provide a known transcript to skip automatic transcription:
.build/release/audio align recording.wav --text "The quick brown fox jumps over the lazy dog"
Options
# Specify transcript text directly
.build/release/audio align recording.wav --text "known transcript"
# Choose ASR model for auto-transcription step
.build/release/audio align recording.wav --model 1.7b
# Specify aligner model variant
.build/release/audio align recording.wav --aligner-model default
# Set language
.build/release/audio align recording.wav --language en
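When the CLI is driven from a script, the flags listed above can be assembled programmatically. The following Python sketch builds the argument list using only the options documented here. The helper name `build_align_command` and its parameters are hypothetical, and the binary path is assumed to be the one used in the examples.

```python
import subprocess


def build_align_command(audio_path, text=None, model=None,
                        aligner_model=None, language=None,
                        binary=".build/release/audio"):
    """Assemble an `audio align` invocation as an argument list."""
    cmd = [binary, "align", audio_path]
    if text is not None:
        cmd += ["--text", text]  # known transcript, skips auto-transcription
    if model is not None:
        cmd += ["--model", model]  # ASR model for the auto-transcription step
    if aligner_model is not None:
        cmd += ["--aligner-model", aligner_model]
    if language is not None:
        cmd += ["--language", language]
    return cmd


def run_align(audio_path, **options):
    """Run the CLI and return its standard output (requires the built binary)."""
    cmd = build_align_command(audio_path, **options)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


print(build_align_command("recording.wav", text="known transcript", language="en"))
```

Passing the arguments as a list avoids shell quoting problems when the transcript contains spaces or quotation marks.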
Model Variants
Multiple model variants are available, trading size for accuracy:
| Variant | Model ID | Size |
|---|---|---|
| MLX 4-bit (default) | aufklarer/Qwen3-ForcedAligner-0.6B-4bit | ~979 MB |
| MLX 8-bit | aufklarer/Qwen3-ForcedAligner-0.6B-8bit | ~1.3 GB |
| MLX bf16 | aufklarer/Qwen3-ForcedAligner-0.6B-bf16 | ~1.8 GB |
| CoreML INT4 | aufklarer/Qwen3-ForcedAligner-0.6B-CoreML-INT4 | ~662 MB |
| CoreML INT8 | aufklarer/Qwen3-ForcedAligner-0.6B-CoreML-INT8 | ~1.1 GB |
Select a variant with --aligner-model:
.build/release/audio align recording.wav --aligner-model aufklarer/Qwen3-ForcedAligner-0.6B-8bit
Output Format
The aligner outputs one line per word with start and end timestamps in seconds:
[0.24 - 0.48] The
[0.48 - 0.72] quick
[0.72 - 1.04] brown
[1.04 - 1.36] fox
[1.36 - 1.68] jumps
[1.68 - 1.92] over
[1.92 - 2.08] the
[2.08 - 2.40] lazy
[2.40 - 2.80] dog
Timestamps are reported at 80 ms resolution: in the example above, every start and end value is a multiple of 0.08 s.
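For downstream processing, the output lines can be parsed into structured records. The Python sketch below assumes the output matches the `[start - end] word` layout of the sample exactly. The `WordSpan` type and `parse_alignment` function are illustrative names, not part of the tool. Lines that do not match the pattern are skipped.

```python
import re
from typing import List, NamedTuple


class WordSpan(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    word: str


# Matches lines such as "[0.24 - 0.48] The"
LINE_RE = re.compile(r"^\[(\d+(?:\.\d+)?) - (\d+(?:\.\d+)?)\]\s+(.*)$")


def parse_alignment(output: str) -> List[WordSpan]:
    """Parse aligner output into (start, end, word) records."""
    spans = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            start, end, word = match.groups()
            spans.append(WordSpan(float(start), float(end), word))
    return spans


sample = """[0.24 - 0.48] The
[0.48 - 0.72] quick
[0.72 - 1.04] brown"""

for span in parse_alignment(sample):
    print(f"{span.word}: {span.end - span.start:.2f} s")
```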
When no --text is provided, the aligner first runs a full Qwen3-ASR transcription pass, then aligns the resulting text. A run without --text therefore loads both the ASR model and the aligner model. Providing --text skips the ASR step and loads only the aligner.