Parakeet TDT
Parakeet TDT is NVIDIA's speech recognition model, adapted to run on Apple Silicon's Neural Engine via CoreML. It pairs a FastConformer encoder with a Token-and-Duration Transducer (TDT) decoder, which predicts both the next token and how many audio frames to skip, so decoding does not have to visit every frame.
Architecture
The model is split across three CoreML model files that work together during inference:
| Component | Description |
|---|---|
| Encoder | FastConformer — convolutional + self-attention layers for audio feature extraction |
| Decoder | Prediction network that maintains a text token history |
| Joint | Combines encoder and decoder outputs to produce token probabilities |
The encoder is quantized to INT8 to reduce its memory footprint and speed up Neural Engine execution. The decoder and joint network are small enough that they are left unquantized.
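To make the three-way split concrete, here is a minimal sketch of greedy TDT decoding as it is commonly described: at each encoder frame the joint network scores both a token and a duration, and the duration says how many frames to jump ahead. This is a toy under stated assumptions, not this project's implementation. `predict` and `joint` are hypothetical stand-ins for the decoder and joint models, and the blank id, duration set, and per-frame symbol cap are illustrative values.

```python
from typing import Callable, List, Sequence, Tuple

BLANK = 0                    # assumed blank token id (illustrative)
DURATIONS = (0, 1, 2, 3, 4)  # assumed set of frame skips (illustrative)

Logits = Tuple[Sequence[float], Sequence[float]]  # (token logits, duration logits)


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def tdt_greedy_decode(
    frames: Sequence[object],
    predict: Callable[[List[int]], object],
    joint: Callable[[object, object], Logits],
    max_symbols_per_frame: int = 4,
) -> List[int]:
    """Greedy decode: pick the best token and the best duration at each step."""
    tokens: List[int] = []
    state = predict(tokens)  # decoder state for the empty history
    t = 0
    emitted_here = 0         # tokens emitted without advancing in time
    while t < len(frames):
        token_logits, duration_logits = joint(frames[t], state)
        token = argmax(token_logits)
        skip = DURATIONS[argmax(duration_logits)]
        if token != BLANK:
            tokens.append(token)
            state = predict(tokens)  # history changed, refresh decoder state
            emitted_here += 1
        # Guarantee progress: a blank, or too many tokens on one frame,
        # must move forward by at least one frame.
        if skip == 0 and (token == BLANK or emitted_here >= max_symbols_per_frame):
            skip = 1
        if skip > 0:
            emitted_here = 0
        t += skip
    return tokens


# Toy stand-ins: each "frame" already carries the (token, skip) it should yield.
def toy_predict(history: List[int]) -> int:
    return history[-1] if history else BLANK


def toy_joint(frame: Tuple[int, int], state: int) -> Logits:
    token, skip = frame  # the toy ignores the decoder state
    token_logits = [1.0 if i == token else 0.0 for i in range(8)]
    duration_logits = [1.0 if d == skip else 0.0 for d in DURATIONS]
    return token_logits, duration_logits


frames = [(3, 2), (0, 1), (5, 1), (0, 1), (7, 4)]
decoded = tdt_greedy_decode(frames, toy_predict, toy_joint)
print(decoded)  # [3, 5, 7]
```

Note that the second frame is never scored: the duration of 2 predicted at the first frame jumps straight past it, which is where the efficiency of the duration head comes from.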
Model Variants
| Model | Size | HuggingFace |
|---|---|---|
| Parakeet-TDT-0.6B (CoreML INT8) | 500 MB | aufklarer/Parakeet-TDT-v3-CoreML-INT8 |
Performance
| Metric | Value |
|---|---|
| Speed | ~32x real-time on Apple Silicon Neural Engine (real-time factor ≈ 0.03) |
| Compute target | Neural Engine (via CoreML) |
| Quantization | INT8 |
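As a rough worked example of what a ~32x figure means in practice, the sketch below converts a speed-up into expected processing time. It assumes the quoted figure holds for your hardware and audio and ignores model load time; the function names are illustrative.

```python
def processing_seconds(audio_seconds: float, speedup: float) -> float:
    """Estimated wall-clock time to transcribe audio at a given speed-up."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def real_time_factor(speedup: float) -> float:
    """Processing time per second of audio (the inverse of the speed-up)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


one_hour = processing_seconds(3600.0, 32.0)
print(f"{one_hour:.1f} s")               # 112.5 s
print(f"{real_time_factor(32.0):.5f}")   # 0.03125
```

Under these assumptions, an hour of audio takes a little under two minutes to transcribe.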
CLI Usage
Use the --engine parakeet flag to select Parakeet TDT instead of the default Qwen3-ASR:
.build/release/audio transcribe recording.wav --engine parakeet
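For batch jobs, the same command can be driven from a script. The sketch below wraps it with Python's `subprocess` module and uses only the subcommand and flag shown above. It assumes the CLI prints the transcript to standard output, which this page does not confirm; the helper names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import Dict, List, Union

AUDIO_BIN = ".build/release/audio"  # path taken from the command above


def build_command(wav_path: Union[str, Path], engine: str = "parakeet") -> List[str]:
    """Argument list equivalent to the CLI invocation shown above."""
    return [AUDIO_BIN, "transcribe", str(wav_path), "--engine", engine]


def transcribe(wav_path: Union[str, Path]) -> str:
    """Run the CLI on one file; assumes the transcript is written to stdout."""
    result = subprocess.run(
        build_command(wav_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


def transcribe_folder(folder: Union[str, Path]) -> Dict[str, str]:
    """Transcribe every .wav file in a folder, keyed by file name."""
    return {p.name: transcribe(p) for p in sorted(Path(folder).glob("*.wav"))}
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.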
CoreML vs MLX
Parakeet TDT uses CoreML to run on the Neural Engine, while Qwen3-ASR uses MLX to run on the Metal GPU. The two approaches have different trade-offs:
| | Parakeet TDT (CoreML) | Qwen3-ASR (MLX) |
|---|---|---|
| Compute target | Neural Engine | Metal GPU |
| Speed | ~32x real-time | ~17x real-time |
| Architecture | FastConformer + TDT | Encoder-decoder transformer |
| Languages | English-focused | Multilingual |
| Quantization | INT8 | 4-bit (MLX) |
CoreML models run on the Neural Engine, which operates independently from the GPU. This means Parakeet TDT can run concurrently with GPU-based tasks like TTS without contention.
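At the application level, taking advantage of this usually just means dispatching the two workloads in parallel instead of one after the other. The sketch below shows that orchestration pattern only: both worker functions are hypothetical placeholders that sleep rather than call a real model, so it demonstrates the scheduling, not the hardware behavior.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple


def transcribe_on_neural_engine(clip: str) -> str:
    """Placeholder for a CoreML-backed ASR call (hypothetical)."""
    time.sleep(0.05)  # stands in for Neural Engine work
    return f"transcript of {clip}"


def synthesize_on_gpu(text: str) -> str:
    """Placeholder for a GPU-backed TTS call (hypothetical)."""
    time.sleep(0.05)  # stands in for GPU work
    return f"audio for {text!r}"


def run_concurrently(clip: str, reply_text: str) -> Tuple[str, str]:
    """Submit both tasks at once and wait for both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        asr = pool.submit(transcribe_on_neural_engine, clip)
        tts = pool.submit(synthesize_on_gpu, reply_text)
        return asr.result(), tts.result()


transcript, speech = run_concurrently("recording.wav", "hello")
print(transcript)
print(speech)
```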
Streaming variant
For real-time dictation and live captioning, see Parakeet-EOU-120M — a smaller (120 MB) RNN-T variant with an explicit end-of-utterance head, designed to run incrementally on 640 ms audio chunks. It shares the same SentencePiece vocabulary as Parakeet TDT 0.6B but is optimized for sub-second partial latency rather than peak throughput.
Also available on Android & Linux via ONNX Runtime.
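To illustrate the 640 ms chunking mentioned for the streaming variant, the sketch below splits a buffer of samples into fixed-size chunks. The 16 kHz sample rate is an assumption, not something this page states, and the helper names are illustrative; a real streaming pipeline would feed each chunk to the model as it arrives.

```python
from typing import Iterator, Sequence

SAMPLE_RATE_HZ = 16_000  # assumed input sample rate (not stated on this page)
CHUNK_MS = 640           # chunk length quoted for the streaming variant


def chunk_size_samples(sample_rate_hz: int = SAMPLE_RATE_HZ,
                       chunk_ms: int = CHUNK_MS) -> int:
    """Number of samples in one chunk."""
    return sample_rate_hz * chunk_ms // 1000


def iter_chunks(samples: Sequence[float], size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive chunks; the last one may be shorter."""
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


one_second = [0.0] * SAMPLE_RATE_HZ
sizes = [len(chunk) for chunk in iter_chunks(one_second, chunk_size_samples())]
print(sizes)  # [10240, 5760]
```

At the assumed 16 kHz, each full chunk holds 10,240 samples, so a partial result can be refreshed roughly every 0.64 seconds of incoming audio.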