Kokoro TTS
Kokoro-82M is a lightweight, non-autoregressive text-to-speech model based on StyleTTS 2 with an ISTFTNet vocoder. It runs entirely on the Neural Engine via CoreML, producing natural 24 kHz speech from text with no autoregressive sampling loop.
Kokoro-82M is designed for on-device iOS deployment. At 82M parameters (~80 MB with 1 bucket, INT8), it fits comfortably on iPhone and iPad. The CoreML model runs on the Neural Engine, leaving the GPU free for other tasks.
Supported Languages
| Language | Code | Example Voices |
|---|---|---|
| English (US) | en | af_heart, am_adam, af_sky |
| English (UK) | en | bf_emma, bm_george |
| Spanish | es | ef_dora |
| French | fr | ff_siwis |
| Hindi | hi | hf_alpha, hm_omega |
| Italian | it | if_sara |
| Japanese | ja | jf_alpha, jm_omega |
| Portuguese | pt | pf_dora |
| Chinese | zh | zf_xiaobei, zm_yunjian |
| Korean | ko | kf_somi |
54 preset voices total. Voice names follow the convention [language][gender]_[name]; for example, af_heart is American Female "Heart".
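The naming convention can be unpacked mechanically. The `VoicePreset` type below is an illustrative sketch, not part of the KokoroTTS API:

```swift
// Sketch of the voice-ID convention described above; this helper type
// is illustrative and not part of the KokoroTTS library.
struct VoicePreset {
    let language: Character  // e.g. "a" = American English, "b" = British
    let gender: Character    // "f" = female, "m" = male
    let name: String

    init?(id: String) {
        // Expect "<lang><gender>_<name>", e.g. "af_heart"
        let parts = id.split(separator: "_", maxSplits: 1)
        guard parts.count == 2, parts[0].count == 2,
              let lang = parts[0].first, let gender = parts[0].last else {
            return nil
        }
        self.language = lang
        self.gender = gender
        self.name = String(parts[1])
    }
}

let v = VoicePreset(id: "af_heart")!
assert(v.language == "a" && v.gender == "f" && v.name == "heart")
```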
Architecture
Kokoro uses a 3-stage CoreML pipeline. No sampling loop — all stages are non-autoregressive forward passes with a Swift-side alignment step between stages 1 and 2.
3-Stage Pipeline
| Stage | Model | Input | Output |
|---|---|---|---|
| 1. Duration | duration.mlmodelc | Phoneme tokens + voice embedding + speed | Durations, prosody features, text encoding |
| — | Swift alignment | Durations + stage 1 features | Aligned prosody & text features |
| 2. Prosody | prosody.mlmodelc | Aligned prosody features + style | F0 (pitch) + noise predictions |
| 3. Decoder | decoder_*.mlmodelc | Aligned text + F0 + noise + style | 24 kHz audio waveform |
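The Swift-side alignment step between stages 1 and 2 expands each phoneme's feature vector by its predicted duration (in frames), so the prosody and decoder stages see frame-rate inputs. A minimal sketch on plain arrays, assuming the real implementation operates on `MLMultiArray`s:

```swift
// Repeat each per-phoneme feature vector by its predicted duration.
// Illustrative sketch of the alignment step, not the library internals.
func align(features: [[Float]], durations: [Int]) -> [[Float]] {
    precondition(features.count == durations.count,
                 "one duration per phoneme feature")
    var out: [[Float]] = []
    for (vec, frames) in zip(features, durations) {
        for _ in 0..<frames { out.append(vec) }
    }
    return out
}

// Two phonemes lasting 2 and 1 frames yield 3 aligned frames.
let aligned = align(features: [[1, 2], [3, 4]], durations: [2, 1])
assert(aligned.count == 3 && aligned[0] == [1, 2] && aligned[2] == [3, 4])
```

The total frame count after alignment is what determines which decoder bucket is needed.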
Phoneme Buckets (Duration Model)
The duration model uses enumerated input shapes. Input is padded to the smallest bucket that fits:
| Bucket | Max Phonemes | Use Case |
|---|---|---|
| p16 | 16 | Short phrases |
| p32 | 32 | Short sentences |
| p64 | 64 | Medium sentences |
| p128 | 128 | Long sentences |
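Selection of the smallest fitting bucket can be sketched as follows, assuming the bucket sizes in the table above (the helper is illustrative, not the library API):

```swift
// Enumerated input shapes for the duration model, per the table above.
let phonemeBuckets = [16, 32, 64, 128]

// Smallest enumerated shape that fits; nil means the text exceeds the
// largest bucket and must be split before synthesis.
func bucket(forPhonemeCount n: Int) -> Int? {
    phonemeBuckets.first { n <= $0 }
}

assert(bucket(forPhonemeCount: 10) == 16)
assert(bucket(forPhonemeCount: 33) == 64)
assert(bucket(forPhonemeCount: 200) == nil)
```

Inputs shorter than the chosen bucket are padded up to its length before the forward pass.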
Decoder Buckets
Fixed-shape decoder models for different maximum audio lengths (each frame = 600 samples at 24 kHz):
| Bucket | Max Frames | Max Audio |
|---|---|---|
| decoder_5s | 200 | 5.0s |
| decoder_10s | 400 | 10.0s |
| decoder_15s | 600 | 15.0s |
Requires iOS 18+ / macOS 15+.
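Decoder selection follows from the predicted frame count (the sum of per-phoneme durations). A sketch of the frame math from the table above, where each frame is 600 samples at 24 kHz; the helper names are illustrative:

```swift
// Fixed-shape decoder buckets, per the table above.
let decoderBuckets: [(name: String, maxFrames: Int)] = [
    ("decoder_5s", 200), ("decoder_10s", 400), ("decoder_15s", 600),
]

// Smallest decoder whose frame capacity covers the predicted length.
func decoder(forFrames frames: Int) -> String? {
    decoderBuckets.first { frames <= $0.maxFrames }?.name
}

// 600 samples per frame at 24 kHz: 200 frames = 120_000 samples = 5.0 s.
func seconds(forFrames frames: Int) -> Double {
    Double(frames * 600) / 24_000.0
}

assert(decoder(forFrames: 150) == "decoder_5s")
assert(decoder(forFrames: 401) == "decoder_15s")
assert(seconds(forFrames: 200) == 5.0)
```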
Phonemizer
Text is converted to phoneme tokens via a three-tier pipeline — all Apache-2.0 licensed, no GPL dependencies:
- Dictionary lookup — US English and British English pronunciation dictionaries with heteronym support
- Suffix stemming — Morphological decomposition for known suffixes (e.g., "-ing", "-tion")
- BART G2P — Neural grapheme-to-phoneme fallback using a separate CoreML encoder-decoder model for out-of-vocabulary words
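The three tiers fall through in order. A minimal sketch of that fallback chain with toy data; the function signature and phone strings are illustrative, not the phonemizer's actual API:

```swift
// Three-tier grapheme-to-phoneme fallback: dictionary, suffix stemming,
// then neural G2P. Illustrative sketch, not the library implementation.
func phonemize(word: String,
               dictionary: [String: [String]],
               suffixes: [(suffix: String, phones: [String])],
               neuralG2P: (String) -> [String]) -> [String] {
    // Tier 1: exact dictionary lookup.
    if let phones = dictionary[word] { return phones }
    // Tier 2: strip a known suffix and look up the stem.
    for (suffix, phones) in suffixes where word.hasSuffix(suffix) {
        let stem = String(word.dropLast(suffix.count))
        if let stemPhones = dictionary[stem] { return stemPhones + phones }
    }
    // Tier 3: neural G2P fallback for out-of-vocabulary words.
    return neuralG2P(word)
}

let dict = ["jump": ["JH", "AH", "M", "P"]]
let suffixes: [(suffix: String, phones: [String])] = [("ing", ["IH", "NG"])]
let g2p: (String) -> [String] = { _ in ["OOV"] }

assert(phonemize(word: "jump", dictionary: dict, suffixes: suffixes,
                 neuralG2P: g2p) == ["JH", "AH", "M", "P"])
assert(phonemize(word: "jumping", dictionary: dict, suffixes: suffixes,
                 neuralG2P: g2p) == ["JH", "AH", "M", "P", "IH", "NG"])
```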
Model Weights
| Component | Size | Format |
|---|---|---|
| Duration model | ~39 MB | .mlmodelc |
| Prosody model | ~17 MB | .mlmodelc |
| Decoder models (3 buckets) | ~107 MB each | .mlmodelc |
| Voice embeddings (54 voices) | ~0.3 MB | JSON (256-dim Float32) |
| G2P encoder + decoder | ~1.5 MB | .mlmodelc |
| Dictionaries + vocab | ~6 MB | JSON |
| Total (1 decoder bucket) | ~170 MB | — |
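The ~170 MB total follows from summing the component sizes above with a single decoder bucket:

```swift
// Component sizes in MB from the table above, with one decoder bucket:
// duration, prosody, 1 decoder, voice embeddings, G2P, dictionaries.
let sizesMB: [Double] = [39, 17, 107, 0.3, 1.5, 6]
let totalMB = sizesMB.reduce(0, +)  // ≈ 170.8 MB
assert(totalMB > 169 && totalMB < 172)
```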
Performance
| Metric | Value |
|---|---|
| Parameters | 82M |
| Inference backend | CoreML (Neural Engine) |
| Inference RTF | ~0.7 (RTF < 1 is faster than real time) |
| Output sample rate | 24 kHz |
| Weight memory | ~170 MB (1 decoder bucket) |
Unlike Qwen3-TTS and CosyVoice3, which generate audio tokens step by step, Kokoro uses a 3-stage pipeline with no sampling loop; all stages are deterministic forward passes.
CLI Usage
audio kokoro "Hello, world!" --voice af_heart --output hello.wav
Options
| Option | Default | Description |
|---|---|---|
| <text> | — | Text to synthesize |
| --voice | af_heart | Voice preset name |
| --language | en | Language code: en, es, fr, hi, it, ja, pt, zh, ko, de |
| --output, -o | kokoro_output.wav | Output WAV file path |
| --list-voices | — | List all available voices and exit |
| --model, -m | — | HuggingFace model ID |
Examples
# English with default voice
audio kokoro "Hello, how are you today?" --output hello.wav
# French
audio kokoro "Bonjour le monde" --voice ff_siwis --language fr --output bonjour.wav
# Japanese
audio kokoro "こんにちは世界" --voice jf_alpha --language ja --output konnichiwa.wav
# List all available voices
audio kokoro --list-voices
Swift API
import KokoroTTS
import AudioCommon
let tts = try await KokoroTTSModel.fromPretrained()
// Downloads ~170 MB on first run
let audio = try tts.synthesize(text: "Hello world", voice: "af_heart")
// audio: [Float] — 24 kHz mono PCM
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)
Compute Unit Override
fromPretrained(computeUnits:) selects which hardware runs the main CoreML model. The default (.all) lets Core ML prefer the Neural Engine, which is the fastest path on every supported device. Pass .cpuAndGPU to bypass the ANE as a fallback on platforms where the ANE compiler produces incorrect output for this model.
import CoreML
import KokoroTTS
// Default: ANE preferred
let tts = try await KokoroTTSModel.fromPretrained()
// Fallback: bypass the Neural Engine
let tts = try await KokoroTTSModel.fromPretrained(computeUnits: .cpuAndGPU)
When to Use Kokoro
| Use Case | Recommended TTS |
|---|---|
| iOS app, lightweight, battery-efficient | Kokoro (CoreML, 82M params, ~170 MB) |
| Highest quality, streaming, voice cloning | Qwen3-TTS (MLX, 600M params, ~1.7 GB) |
| Multilingual streaming, 9 languages | CosyVoice3 (MLX, 500M params, ~1.2 GB) |
| Full-duplex spoken dialogue | PersonaPlex (MLX, 7B params, ~5.5 GB) |
License
- Model weights: Apache-2.0 (hexgrad/Kokoro-82M)
- CoreML conversion: Apache-2.0 (aufklarer/Kokoro-82M-CoreML)
- Dictionaries and G2P: Apache-2.0