# Kokoro TTS

Kokoro-82M is a lightweight, non-autoregressive text-to-speech model based on StyleTTS 2 with an ISTFTNet vocoder. It runs entirely on the Neural Engine via CoreML, producing natural 24 kHz speech from text input in a single forward pass.

## iOS-Ready

Kokoro-82M is designed for on-device iOS deployment. At 82M parameters (~325 MB), it fits comfortably on iPhone and iPad. CoreML runs on the Neural Engine, leaving the GPU free for other tasks.

## Supported Languages

| Language | Code | Example Voices |
|---|---|---|
| English (US) | en | af_heart, am_adam, af_sky |
| English (UK) | en | bf_emma, bm_george |
| Spanish | es | ef_dora |
| French | fr | ff_siwis |
| Hindi | hi | hf_alpha, hm_omega |
| Italian | it | if_sara |
| Japanese | ja | jf_alpha, jm_omega |
| Portuguese | pt | pf_dora |
| Chinese | zh | zf_xiaobei, zm_yunjian |
| Korean | ko | kf_somi |

50 preset voices total. Voice naming convention: `[language][gender]_[name]` — e.g., `af_heart` = American Female "Heart".
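The naming convention can be decoded mechanically. A minimal sketch — `decodeVoice` is an illustrative helper, not part of the KokoroTTS API; the prefix-letter mapping is inferred from the voice names in the table above:

```swift
// Decode a Kokoro voice preset name such as "af_heart" into its parts.
// Illustrative helper; not part of the KokoroTTS library.
func decodeVoice(_ preset: String) -> (language: String, gender: String, name: String)? {
    let parts = preset.split(separator: "_", maxSplits: 1)
    guard parts.count == 2, parts[0].count == 2 else { return nil }
    let prefix = parts[0]
    // First letter of the prefix encodes the language.
    let languages: [Character: String] = [
        "a": "English (US)", "b": "English (UK)", "e": "Spanish",
        "f": "French", "h": "Hindi", "i": "Italian", "j": "Japanese",
        "p": "Portuguese", "z": "Chinese", "k": "Korean",
    ]
    // Second letter encodes the gender.
    let genders: [Character: String] = ["f": "female", "m": "male"]
    guard let lang = languages[prefix.first!],
          let gender = genders[prefix.last!] else { return nil }
    return (lang, gender, String(parts[1]))
}

// decodeVoice("af_heart") → ("English (US)", "female", "heart")
```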

## Architecture

Kokoro uses an end-to-end CoreML model that takes phoneme tokens and a voice style embedding, and outputs audio directly. No sampling loop required.

### Model I/O

| Direction | Name | Shape | Description |
|---|---|---|---|
| Input | `input_ids` | [1, N] | Phoneme token IDs (Int32) |
| Input | `attention_mask` | [1, N] | 1 for real tokens, 0 for padding (Int32) |
| Input | `ref_s` | [1, 256] | Voice style embedding (Float32) |
| Input | `random_phases` | [1, 9] | Random phases for ISTFTNet vocoder (Float32) |
| Output | `audio` | [1, 1, S] | 24 kHz waveform (Float32) |
| Output | `audio_length_samples` | [1] | Valid sample count (Int32) |
| Output | `pred_dur` | [1, N] | Predicted phoneme durations (Float32) |
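Assembling these inputs with Core ML looks roughly like the following sketch. It is an assumption about usage, not library code: the feature names come from the table above, padding with token 0 and drawing uniform random phases in [0, 2π) are illustrative choices. After prediction, the `audio` output would be trimmed to `audio_length_samples`:

```swift
import CoreML

// Build the four model inputs for a bucket of fixed length `bucketLength`.
// Feature names match the Model I/O table; padding scheme is an assumption.
func makeInputs(tokenIDs: [Int32], style: [Float], bucketLength: Int) throws -> MLDictionaryFeatureProvider {
    let inputIDs = try MLMultiArray(shape: [1, NSNumber(value: bucketLength)], dataType: .int32)
    let mask = try MLMultiArray(shape: [1, NSNumber(value: bucketLength)], dataType: .int32)
    for i in 0..<bucketLength {
        inputIDs[i] = NSNumber(value: i < tokenIDs.count ? tokenIDs[i] : 0) // pad with 0
        mask[i] = NSNumber(value: i < tokenIDs.count ? 1 : 0)               // 1 = real token
    }
    let refS = try MLMultiArray(shape: [1, 256], dataType: .float32)
    for (i, v) in style.enumerated() { refS[i] = NSNumber(value: v) }
    let phases = try MLMultiArray(shape: [1, 9], dataType: .float32)
    for i in 0..<9 { phases[i] = NSNumber(value: Float.random(in: 0..<(2 * Float.pi))) }
    return try MLDictionaryFeatureProvider(dictionary: [
        "input_ids": MLFeatureValue(multiArray: inputIDs),
        "attention_mask": MLFeatureValue(multiArray: mask),
        "ref_s": MLFeatureValue(multiArray: refS),
        "random_phases": MLFeatureValue(multiArray: phases),
    ])
}
```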

## Model Variants

Five pre-compiled CoreML model buckets handle different output lengths. The runtime selects the smallest bucket that fits the input token count.

| Variant | Max Tokens | Max Audio | Target |
|---|---|---|---|
| kokoro_24_10s | 242 | 10.0s | iOS 17+ / macOS 14+ |
| kokoro_24_15s | 242 | 15.0s | iOS 17+ / macOS 14+ |
| kokoro_21_5s | 124 | 7.3s | iOS 16+ / macOS 13+ |
| kokoro_21_10s | 168 | 10.6s | iOS 16+ / macOS 13+ |
| kokoro_21_15s | 249 | 15.5s | iOS 16+ / macOS 13+ |

v2.4 models use the latest CoreML operations (iOS 17+). v2.1 models provide backward compatibility with iOS 16+.
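The bucket-selection rule can be sketched as below. The `Variant` struct and `selectBucket` helper are illustrative (capacities copied from the table); how the real runtime chooses between the v2.4 and v2.1 families is not shown here:

```swift
struct Variant { let name: String; let maxTokens: Int; let maxSeconds: Double }

// v2.1 buckets (iOS 16+), capacities from the variants table.
let v21: [Variant] = [
    Variant(name: "kokoro_21_5s", maxTokens: 124, maxSeconds: 7.3),
    Variant(name: "kokoro_21_10s", maxTokens: 168, maxSeconds: 10.6),
    Variant(name: "kokoro_21_15s", maxTokens: 249, maxSeconds: 15.5),
]

// Smallest bucket (by audio capacity) that still fits the token count.
func selectBucket(tokenCount: Int, variants: [Variant]) -> Variant? {
    variants
        .filter { tokenCount <= $0.maxTokens }
        .min { $0.maxSeconds < $1.maxSeconds }
}
```

For example, a 150-token input skips `kokoro_21_5s` (capacity 124) and lands in `kokoro_21_10s`.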

## Phonemizer

Text is converted to phoneme tokens via a three-tier pipeline — all Apache-2.0 licensed, no GPL dependencies:

  1. Dictionary lookup — US English and British English pronunciation dictionaries with heteronym support
  2. Suffix stemming — Morphological decomposition for known suffixes (e.g., "-ing", "-tion")
  3. BART G2P — Neural grapheme-to-phoneme fallback using a separate CoreML encoder-decoder model for out-of-vocabulary words
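The three-tier fallback can be sketched as follows. This is a simplified illustration — the helper name, dictionary shape, and suffix handling are assumptions, and it ignores the heteronym disambiguation the real dictionary tier supports:

```swift
// Three-tier grapheme-to-phoneme fallback (illustrative sketch).
func phonemize(word: String,
               dictionary: [String: String],
               suffixes: [(suffix: String, phonemes: String)],
               neuralG2P: (String) -> String) -> String {
    // 1. Direct dictionary lookup.
    if let p = dictionary[word] { return p }
    // 2. Suffix stemming: strip a known suffix and look up the stem.
    for (suffix, tail) in suffixes where word.hasSuffix(suffix) {
        let stem = String(word.dropLast(suffix.count))
        if let p = dictionary[stem] { return p + tail }
    }
    // 3. Neural BART G2P fallback for out-of-vocabulary words.
    return neuralG2P(word)
}
```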

## Model Weights

| Component | Size | Format |
|---|---|---|
| CoreML models (5 variants) | ~280 MB | .mlmodelc |
| Voice embeddings (50 voices) | ~1 MB | JSON (256-dim Float32) |
| G2P encoder + decoder | ~40 MB | .mlmodelc |
| Dictionaries + vocab | ~4 MB | JSON |
| **Total** | **~325 MB** | |

## Performance

| Metric | Value |
|---|---|
| Parameters | 82M |
| Inference backend | CoreML (Neural Engine) |
| Inference latency | ~45 ms (constant, regardless of output length) |
| Output sample rate | 24 kHz |
| Weight memory | ~325 MB |
| Peak inference memory | ~500 MB |

### Non-Autoregressive

Unlike Qwen3-TTS and CosyVoice3, which generate audio tokens step by step, Kokoro produces the entire waveform in a single forward pass. Latency is therefore constant (~45 ms) regardless of output length.
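One consequence worth making concrete: with a constant ~45 ms forward pass, the real-time factor improves the longer the clip. The numbers below come from this document; the real-time factor is derived, not measured:

```swift
let latency = 0.045      // seconds per forward pass, roughly constant
let audioSeconds = 10.0  // e.g. a full kokoro_24_10s bucket
let rtf = latency / audioSeconds
// rtf ≈ 0.0045, i.e. synthesis runs ~220x faster than real time for a 10 s clip
print(String(format: "RTF ≈ %.4f (%.0fx real time)", rtf, 1 / rtf))
```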

## CLI Usage

```bash
audio kokoro "Hello, world!" --voice af_heart --output hello.wav
```

### Options

| Option | Default | Description |
|---|---|---|
| `<text>` | | Text to synthesize |
| `--voice` | af_heart | Voice preset name |
| `--language` | en | Language code: en, es, fr, hi, it, ja, pt, zh, ko, de |
| `--output`, `-o` | kokoro_output.wav | Output WAV file path |
| `--list-voices` | | List all available voices and exit |
| `--model`, `-m` | | HuggingFace model ID |

### Examples

```bash
# English with default voice
audio kokoro "Hello, how are you today?" --output hello.wav

# French
audio kokoro "Bonjour le monde" --voice ff_siwis --language fr --output bonjour.wav

# Japanese
audio kokoro "こんにちは世界" --voice jf_alpha --language ja --output konnichiwa.wav

# List all 50 voices
audio kokoro --list-voices
```

## Swift API

```swift
import KokoroTTS
import AudioCommon

let tts = try await KokoroTTSModel.fromPretrained()
// Downloads ~325 MB on first run

let audio = try tts.synthesize(text: "Hello world", voice: "af_heart")
// audio: [Float] — 24 kHz mono PCM

try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)
```

## When to Use Kokoro

| Use Case | Recommended TTS |
|---|---|
| iOS app, lightweight, battery-efficient | Kokoro (CoreML, 82M params, ~325 MB) |
| Highest quality, streaming, voice cloning | Qwen3-TTS (MLX, 600M params, ~1.7 GB) |
| Multilingual streaming, 9 languages | CosyVoice3 (MLX, 500M params, ~1.2 GB) |
| Full-duplex spoken dialogue | PersonaPlex (MLX, 7B params, ~5.5 GB) |

## License