Parakeet TDT

Parakeet TDT is NVIDIA's speech recognition model, adapted to run on Apple Silicon's Neural Engine via CoreML. It uses a FastConformer encoder paired with a Token-and-Duration Transducer (TDT) decoder for accurate, efficient transcription.

Architecture

The model is split across three CoreML model files that work together during inference:

ComponentDescription
EncoderFastConformer — convolutional + self-attention layers for audio feature extraction
DecoderPrediction network that maintains a text token history
JointCombines encoder and decoder outputs to produce token probabilities

The encoder is INT8 quantized for minimal memory footprint and fast Neural Engine execution. The decoder and joint network are small enough that quantization is not needed.

Model Variants

ModelSizeHuggingFace
Parakeet-TDT-0.6B (CoreML INT8)500 MBaufklarer/Parakeet-TDT-v3-CoreML-INT8

Performance

MetricValue
Real-time factor~32x real-time on Apple Silicon Neural Engine
Compute targetNeural Engine (via CoreML)
QuantizationINT8

CLI Usage

Use the --engine parakeet flag to select Parakeet TDT instead of the default Qwen3-ASR:

.build/release/audio transcribe recording.wav --engine parakeet

CoreML vs MLX

Parakeet TDT uses CoreML to run on the Neural Engine, while Qwen3-ASR uses MLX to run on the Metal GPU. The two approaches have different trade-offs:

Parakeet TDT (CoreML)Qwen3-ASR (MLX)
Compute targetNeural EngineMetal GPU
Speed~32x real-time~17x real-time
ArchitectureFastConformer + TDTEncoder-decoder transformer
MultilingualEnglish-focusedMultilingual
QuantizationINT84-bit (MLX)
Important

CoreML models run on the Neural Engine, which operates independently from the GPU. This means Parakeet TDT can run concurrently with GPU-based tasks like TTS without contention.

Streaming variant

For real-time dictation and live captioning, see Parakeet-EOU-120M — a smaller (120 MB) RNN-T variant with an explicit end-of-utterance head, designed to run incrementally on 640 ms audio chunks. It shares the same SentencePiece vocabulary as Parakeet TDT 0.6B but is optimized for sub-second partial latency rather than peak throughput.

Also available on Android & Linux via ONNX Runtime.