Kokoro TTS
Kokoro-82M is a lightweight, non-autoregressive text-to-speech model based on StyleTTS 2 with an ISTFTNet vocoder. It runs entirely on the Neural Engine via CoreML, producing natural 24 kHz speech from text input in a single forward pass.
Kokoro-82M is designed for on-device iOS deployment. At 82M parameters (~325 MB on disk), it fits comfortably on iPhone and iPad. CoreML executes the model on the Neural Engine, leaving the GPU free for other tasks.
Supported Languages
| Language | Code | Example Voices |
|---|---|---|
| English (US) | en | af_heart, am_adam, af_sky |
| English (UK) | en | bf_emma, bm_george |
| Spanish | es | ef_dora |
| French | fr | ff_siwis |
| Hindi | hi | hf_alpha, hm_omega |
| Italian | it | if_sara |
| Japanese | ja | jf_alpha, jm_omega |
| Portuguese | pt | pf_dora |
| Chinese | zh | zf_xiaobei, zm_yunjian |
| Korean | ko | kf_somi |
50 preset voices total. Voice naming convention: [language][gender]_[name] — e.g., af_heart = American Female "Heart".
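The convention can be decoded mechanically. A minimal Swift sketch of such a decoder (`parseVoice` is a hypothetical helper for illustration, not part of the library):

```swift
// Hypothetical helper illustrating the [language][gender]_[name] convention,
// e.g. "af_heart" -> language "a" (American English), gender "f", name "heart".
func parseVoice(_ id: String) -> (language: Character, gender: Character, name: String)? {
    let parts = id.split(separator: "_", maxSplits: 1)
    guard parts.count == 2, parts[0].count == 2 else { return nil }
    let prefix = Array(parts[0])
    return (prefix[0], prefix[1], String(parts[1]))
}
```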
Architecture
Kokoro uses an end-to-end CoreML model that takes phoneme tokens and a voice style embedding, and outputs audio directly. No sampling loop required.
Model I/O
| Direction | Name | Shape | Description |
|---|---|---|---|
| Input | input_ids | [1, N] | Phoneme token IDs (Int32) |
| Input | attention_mask | [1, N] | 1 for real tokens, 0 for padding (Int32) |
| Input | ref_s | [1, 256] | Voice style embedding (Float32) |
| Input | random_phases | [1, 9] | Random phases for ISTFTNet vocoder (Float32) |
| Output | audio | [1, 1, S] | 24 kHz waveform (Float32) |
| Output | audio_length_samples | [1] | Valid sample count (Int32) |
| Output | pred_dur | [1, N] | Predicted phoneme durations (Float32) |
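Because each compiled variant has a fixed token length N, shorter inputs must be padded and masked before prediction. A minimal sketch of that preparation (`padInputs` is a hypothetical helper, and the pad token ID of 0 is an assumption):

```swift
// Pad a phoneme token sequence to the bucket's fixed length N, producing
// the input_ids / attention_mask pair described in the I/O table above.
func padInputs(tokens: [Int32], toLength n: Int, padId: Int32 = 0)
    -> (inputIds: [Int32], attentionMask: [Int32]) {
    precondition(tokens.count <= n, "token sequence exceeds bucket capacity")
    let padCount = n - tokens.count
    let inputIds = tokens + Array(repeating: padId, count: padCount)
    let mask = Array(repeating: Int32(1), count: tokens.count)
             + Array(repeating: Int32(0), count: padCount)
    return (inputIds, mask)
}
```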
Model Variants
Five pre-compiled CoreML model buckets handle different output lengths. The runtime selects the smallest bucket that fits the input token count.
| Variant | Max Tokens | Max Audio | Target |
|---|---|---|---|
| kokoro_24_10s | 242 | 10.0s | iOS 17+ / macOS 14+ |
| kokoro_24_15s | 242 | 15.0s | iOS 17+ / macOS 14+ |
| kokoro_21_5s | 124 | 7.3s | iOS 16+ / macOS 13+ |
| kokoro_21_10s | 168 | 10.6s | iOS 16+ / macOS 13+ |
| kokoro_21_15s | 249 | 15.5s | iOS 16+ / macOS 13+ |
v2.4 models (kokoro_24_*) use the latest CoreML operations and require iOS 17+. v2.1 models (kokoro_21_*) provide backward compatibility with iOS 16+.
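Bucket selection reduces to a first-fit scan over capacities sorted ascending. A sketch using the v2.1 capacities from the table (`selectBucket` and `Bucket` are illustrative names, not the library API):

```swift
// Pick the smallest model bucket whose token capacity fits the input.
struct Bucket { let name: String; let maxTokens: Int }

// v2.1 capacities from the variants table, sorted ascending.
let v21Buckets = [
    Bucket(name: "kokoro_21_5s", maxTokens: 124),
    Bucket(name: "kokoro_21_10s", maxTokens: 168),
    Bucket(name: "kokoro_21_15s", maxTokens: 249),
]

func selectBucket(tokenCount: Int, from buckets: [Bucket]) -> Bucket? {
    // First fit over ascending capacities = smallest bucket that fits.
    buckets.first { tokenCount <= $0.maxTokens }
}
```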
Phonemizer
Text is converted to phoneme tokens via a three-tier pipeline — all Apache-2.0 licensed, no GPL dependencies:
- Dictionary lookup — US English and British English pronunciation dictionaries with heteronym support
- Suffix stemming — Morphological decomposition for known suffixes (e.g., "-ing", "-tion")
- BART G2P — Neural grapheme-to-phoneme fallback using a separate CoreML encoder-decoder model for out-of-vocabulary words
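The fallback order of the three tiers can be sketched as follows. The lexicon entries and suffix handling here are toy illustrations; the real pipeline's dictionaries and G2P model are far richer:

```swift
// Toy three-tier phonemizer: dictionary, suffix stemming, then G2P fallback.
// Lexicon entries are illustrative ARPAbet-style phonemes, not real data files.
let lexicon: [String: [String]] = [
    "walk": ["W", "AO", "K"],
    "nation": ["N", "EY", "SH", "AH", "N"],
]

func phonemize(_ word: String, g2p: (String) -> [String]) -> [String] {
    let w = word.lowercased()
    // Tier 1: direct dictionary lookup.
    if let p = lexicon[w] { return p }
    // Tier 2: strip a known suffix and recombine (heavily simplified).
    if w.hasSuffix("ing"), let stem = lexicon[String(w.dropLast(3))] {
        return stem + ["IH", "NG"]
    }
    // Tier 3: neural G2P fallback for out-of-vocabulary words.
    return g2p(w)
}
```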
Model Weights
| Component | Size | Format |
|---|---|---|
| CoreML models (5 variants) | ~280 MB | .mlmodelc |
| Voice embeddings (50 voices) | ~1 MB | JSON (256-dim Float32) |
| G2P encoder + decoder | ~40 MB | .mlmodelc |
| Dictionaries + vocab | ~4 MB | JSON |
| Total | ~325 MB | |
Performance
| Metric | Value |
|---|---|
| Parameters | 82M |
| Inference backend | CoreML (Neural Engine) |
| Inference latency | ~45 ms (constant, regardless of output length) |
| Output sample rate | 24 kHz |
| Weight memory | ~325 MB |
| Peak inference memory | ~500 MB |
Unlike Qwen3-TTS and CosyVoice3, which generate tokens step-by-step, Kokoro produces the entire waveform in a single forward pass, so latency stays at ~45 ms regardless of output length.
CLI Usage
```bash
audio kokoro "Hello, world!" --voice af_heart --output hello.wav
```
Options
| Option | Default | Description |
|---|---|---|
| `<text>` | | Text to synthesize |
| --voice | af_heart | Voice preset name |
| --language | en | Language code: en, es, fr, hi, it, ja, pt, zh, ko |
| --output, -o | kokoro_output.wav | Output WAV file path |
| --list-voices | | List all available voices and exit |
| --model, -m | | HuggingFace model ID |
Examples
```bash
# English with default voice
audio kokoro "Hello, how are you today?" --output hello.wav

# French
audio kokoro "Bonjour le monde" --voice ff_siwis --language fr --output bonjour.wav

# Japanese
audio kokoro "こんにちは世界" --voice jf_alpha --language ja --output konnichiwa.wav

# List all 50 voices
audio kokoro --list-voices
```
Swift API
```swift
import KokoroTTS
import AudioCommon

// Downloads ~325 MB of model weights on first run.
let tts = try await KokoroTTSModel.fromPretrained()

let audio = try tts.synthesize(text: "Hello world", voice: "af_heart")
// audio: [Float], 24 kHz mono PCM

try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)
```
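For reference, a minimal WAV container for 16-bit mono PCM is just a 44-byte RIFF header followed by little-endian samples. This sketch approximates what a writer like WAVWriter might produce; `wavData` is hypothetical, not the actual AudioCommon API:

```swift
import Foundation

// Build a minimal WAV file in memory: 44-byte RIFF header + 16-bit mono PCM.
func wavData(samples: [Float], sampleRate: Int) -> Data {
    // Clamp floats to [-1, 1] and scale to 16-bit signed integers.
    let pcm = samples.map { s -> Int16 in
        Int16(max(-1.0, min(1.0, s)) * 32767.0)
    }
    var data = Data()
    func append(_ s: String) { data.append(contentsOf: Array(s.utf8)) }
    func append32(_ v: UInt32) { withUnsafeBytes(of: v.littleEndian) { data.append(contentsOf: $0) } }
    func append16(_ v: UInt16) { withUnsafeBytes(of: v.littleEndian) { data.append(contentsOf: $0) } }

    let dataBytes = UInt32(pcm.count * 2)
    append("RIFF"); append32(36 + dataBytes); append("WAVE")
    append("fmt "); append32(16)
    append16(1)                       // PCM format
    append16(1)                       // mono
    append32(UInt32(sampleRate))
    append32(UInt32(sampleRate * 2))  // byte rate = rate * channels * 2
    append16(2)                       // block align
    append16(16)                      // bits per sample
    append("data"); append32(dataBytes)
    for s in pcm { withUnsafeBytes(of: s.littleEndian) { data.append(contentsOf: $0) } }
    return data
}
```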
When to Use Kokoro
| Use Case | Recommended TTS |
|---|---|
| iOS app, lightweight, battery-efficient | Kokoro (CoreML, 82M params, ~325 MB) |
| Highest quality, streaming, voice cloning | Qwen3-TTS (MLX, 600M params, ~1.7 GB) |
| Multilingual streaming, 9 languages | CosyVoice3 (MLX, 500M params, ~1.2 GB) |
| Full-duplex spoken dialogue | PersonaPlex (MLX, 7B params, ~5.5 GB) |
License
- Model weights: Apache-2.0 (hexgrad/Kokoro-82M)
- CoreML conversion: Apache-2.0 (aufklarer/Kokoro-82M-CoreML)
- Dictionaries and G2P: Apache-2.0