Kokoro TTS

Kokoro-82M is a lightweight, non-autoregressive text-to-speech model based on StyleTTS 2 with an ISTFTNet vocoder. It runs entirely on the Neural Engine via CoreML, producing natural 24 kHz speech from text input in a single forward pass.

iOS-Ready

Kokoro-82M is designed for on-device iOS deployment. At 82M parameters (~80 MB with 1 bucket, INT8), it fits comfortably on iPhone and iPad. CoreML runs on the Neural Engine, leaving the GPU free for other tasks.

Supported Languages

| Language | Code | Example Voices |
|---|---|---|
| English (US) | en | af_heart, am_adam, af_sky |
| English (UK) | en | bf_emma, bm_george |
| Spanish | es | ef_dora |
| French | fr | ff_siwis |
| Hindi | hi | hf_alpha, hm_omega |
| Italian | it | if_sara |
| Japanese | ja | jf_alpha, jm_omega |
| Portuguese | pt | pf_dora |
| Chinese | zh | zf_xiaobei, zm_yunjian |
| Korean | ko | kf_somi |

54 preset voices total. Voice naming convention: [language][gender]_[name] — e.g., af_heart = American Female "Heart".
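The naming convention can be decoded mechanically. A minimal sketch, assuming the prefix-to-language mapping implied by the table above (the `VoiceInfo` type and `parseVoiceID` function are illustrative, not part of the library):

```swift
// Decode a Kokoro voice ID such as "af_heart" into its parts.
// The prefix maps below are assumptions derived from the voice table.
struct VoiceInfo {
    let language: String
    let gender: String
    let name: String
}

func parseVoiceID(_ id: String) -> VoiceInfo? {
    let parts = id.split(separator: "_", maxSplits: 1)
    guard parts.count == 2, parts[0].count == 2 else { return nil }
    let prefix = parts[0]
    let languages: [Character: String] = [
        "a": "English (US)", "b": "English (UK)", "e": "Spanish",
        "f": "French", "h": "Hindi", "i": "Italian", "j": "Japanese",
        "p": "Portuguese", "z": "Chinese", "k": "Korean",
    ]
    guard let language = languages[prefix.first!] else { return nil }
    let gender = prefix.last! == "f" ? "female" : "male"
    return VoiceInfo(language: language, gender: gender, name: String(parts[1]))
}
```

For example, `parseVoiceID("af_heart")` yields English (US), female, "heart".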

Architecture

Kokoro uses a 3-stage CoreML pipeline. No sampling loop — all stages are non-autoregressive forward passes with a Swift-side alignment step between stages 1 and 2.

3-Stage Pipeline

| Stage | Model | Input | Output |
|---|---|---|---|
| 1. Duration | duration.mlmodelc | Phoneme tokens + voice embedding + speed | Durations, prosody features, text encoding |
| Swift alignment | (none) | Durations + stage 1 features | Aligned prosody & text features |
| 2. Prosody | prosody.mlmodelc | Aligned prosody features + style | F0 (pitch) + noise predictions |
| 3. Decoder | decoder_*.mlmodelc | Aligned text + F0 + noise + style | 24 kHz audio waveform |
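The Swift alignment step is, in essence, length regulation: each phoneme's feature vector is repeated for its predicted number of frames, turning per-phoneme features into per-frame features. A minimal sketch with generic "features" (an assumption about the step's shape, not the library's actual code):

```swift
// Length regulation: expand per-phoneme features into per-frame features
// by repeating each entry for its predicted duration (in frames).
func align<T>(features: [T], durations: [Int]) -> [T] {
    precondition(features.count == durations.count,
                 "one duration per phoneme feature")
    var out: [T] = []
    for (feature, frames) in zip(features, durations) {
        out.append(contentsOf: Array(repeating: feature, count: frames))
    }
    return out
}
```

For instance, features ["h", "i"] with durations [2, 3] expand to ["h", "h", "i", "i", "i"]: five frames of decoder input.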

Phoneme Buckets (Duration Model)

The duration model uses enumerated input shapes. Input is padded to the smallest bucket that fits:

| Bucket | Max Phonemes | Use Case |
|---|---|---|
| p16 | 16 | Short phrases |
| p32 | 32 | Short sentences |
| p64 | 64 | Medium sentences |
| p128 | 128 | Long sentences |
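Bucket selection is simply "smallest enumerated shape that fits". A hedged sketch (the function name is hypothetical; the bucket sizes come from the table above):

```swift
// Pick the smallest phoneme bucket (enumerated input shape) that can
// hold the token sequence; the model input is then padded up to it.
let phonemeBuckets = [16, 32, 64, 128]

func bucketSize(forPhonemeCount count: Int) -> Int? {
    // nil means the text exceeds the largest bucket (128 phonemes)
    // and would have to be split across multiple synthesis calls.
    phonemeBuckets.first { count <= $0 }
}
```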

Decoder Buckets

Fixed-shape decoder models for different maximum audio lengths (each frame = 600 samples at 24 kHz):

| Bucket | Max Frames | Max Audio |
|---|---|---|
| decoder_5s | 200 | 5.0 s |
| decoder_10s | 400 | 10.0 s |
| decoder_15s | 600 | 15.0 s |
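Since each frame covers 600 samples at 24 kHz (25 ms), a bucket's audio capacity is frames × 600 / 24000 seconds, and decoder selection mirrors the phoneme buckets: smallest fixed shape that fits. A sketch under those assumptions (function names are illustrative):

```swift
// Each decoder frame covers 600 samples at 24 kHz, i.e. 25 ms of audio.
let samplesPerFrame = 600
let sampleRate = 24_000

// Seconds of audio a given frame count represents: frames * 600 / 24000.
func maxSeconds(frames: Int) -> Double {
    Double(frames * samplesPerFrame) / Double(sampleRate)
}

// Pick the smallest fixed-shape decoder that fits the predicted frame count.
// Names and frame limits mirror the table above.
let decoderBuckets: [(name: String, maxFrames: Int)] = [
    ("decoder_5s", 200), ("decoder_10s", 400), ("decoder_15s", 600),
]

func decoderBucket(forFrames frames: Int) -> String? {
    decoderBuckets.first { frames <= $0.maxFrames }?.name
}
```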

Requires iOS 18+ / macOS 15+.

Phonemizer

Text is converted to phoneme tokens via a three-tier pipeline — all Apache-2.0 licensed, no GPL dependencies:

  1. Dictionary lookup — US English and British English pronunciation dictionaries with heteronym support
  2. Suffix stemming — Morphological decomposition for known suffixes (e.g., "-ing", "-tion")
  3. BART G2P — Neural grapheme-to-phoneme fallback using a separate CoreML encoder-decoder model for out-of-vocabulary words
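The three tiers form a cascade: try the cheap lookup first, fall through only on a miss. A minimal sketch of that control flow, with toy stand-in dictionaries and a `neuralG2P` closure as a placeholder for the BART CoreML model (none of these names are the library's real API):

```swift
// Three-tier G2P cascade: dictionary -> suffix stemming -> neural fallback.
// Toy stand-in entries; the real dictionaries ship as JSON.
let lexicon: [String: String] = ["walk": "wɔk", "nation": "ˈneɪʃən"]
let suffixes: [(graphemes: String, phonemes: String)] = [("ing", "ɪŋ")]

func phonemize(_ word: String, neuralG2P: (String) -> String) -> String {
    let w = word.lowercased()
    // Tier 1: direct dictionary hit
    if let phones = lexicon[w] { return phones }
    // Tier 2: strip a known suffix and retry the stem
    for (suffix, suffixPhones) in suffixes where w.hasSuffix(suffix) {
        let stem = String(w.dropLast(suffix.count))
        if let stemPhones = lexicon[stem] { return stemPhones + suffixPhones }
    }
    // Tier 3: neural grapheme-to-phoneme fallback
    // (the BART encoder-decoder in the real pipeline)
    return neuralG2P(w)
}
```

Here "walk" resolves at tier 1, "walking" at tier 2, and an out-of-vocabulary word falls through to the neural model.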

Model Weights

| Component | Size | Format |
|---|---|---|
| Duration model | ~39 MB | .mlmodelc |
| Prosody model | ~17 MB | .mlmodelc |
| Decoder models (3 buckets) | ~107 MB each | .mlmodelc |
| Voice embeddings (54 voices) | ~0.3 MB | JSON (256-dim Float32) |
| G2P encoder + decoder | ~1.5 MB | .mlmodelc |
| Dictionaries + vocab | ~6 MB | JSON |
| Total (1 decoder) | ~170 MB | |

Performance

| Metric | Value |
|---|---|
| Parameters | 82M |
| Inference backend | CoreML (Neural Engine) |
| Inference RTF | ~0.7 (below 1.0 means faster than real-time) |
| Output sample rate | 24 kHz |
| Weight memory | ~170 MB (1 decoder bucket) |

Non-Autoregressive

Unlike Qwen3-TTS and CosyVoice3 which generate tokens step-by-step, Kokoro uses a 3-stage pipeline with no sampling loop. All stages are deterministic forward passes.

CLI Usage

```bash
audio kokoro "Hello, world!" --voice af_heart --output hello.wav
```

Options

| Option | Default | Description |
|---|---|---|
| <text> | | Text to synthesize |
| --voice | af_heart | Voice preset name |
| --language | en | Language code: en, es, fr, hi, it, ja, pt, zh, ko, de |
| --output, -o | kokoro_output.wav | Output WAV file path |
| --list-voices | | List all available voices and exit |
| --model, -m | | HuggingFace model ID |

Examples

```bash
# English with default voice
audio kokoro "Hello, how are you today?" --output hello.wav

# French
audio kokoro "Bonjour le monde" --voice ff_siwis --language fr --output bonjour.wav

# Japanese
audio kokoro "こんにちは世界" --voice jf_alpha --language ja --output konnichiwa.wav

# List all 54 voices
audio kokoro --list-voices
```

Swift API

```swift
import KokoroTTS
import AudioCommon

let tts = try await KokoroTTSModel.fromPretrained()
// Downloads ~170 MB on first run

let audio = try tts.synthesize(text: "Hello world", voice: "af_heart")
// audio: [Float] — 24 kHz mono PCM

try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)
```

Compute Unit Override

fromPretrained(computeUnits:) selects which hardware runs the main CoreML model. The default (.all) lets Core ML prefer the Neural Engine, which is the fastest path on every supported device. Pass .cpuAndGPU to bypass the ANE as a fallback on platforms where the ANE compiler produces incorrect output for this model.

```swift
import CoreML
import KokoroTTS

// Default: ANE preferred
let tts = try await KokoroTTSModel.fromPretrained()

// Fallback: bypass the Neural Engine
let fallbackTTS = try await KokoroTTSModel.fromPretrained(computeUnits: .cpuAndGPU)
```

When to Use Kokoro

| Use Case | Recommended TTS |
|---|---|
| iOS app, lightweight, battery-efficient | Kokoro (CoreML, 82M params, ~170 MB) |
| Highest quality, streaming, voice cloning | Qwen3-TTS (MLX, 600M params, ~1.7 GB) |
| Multilingual streaming, 9 languages | CosyVoice3 (MLX, 500M params, ~1.2 GB) |
| Full-duplex spoken dialogue | PersonaPlex (MLX, 7B params, ~5.5 GB) |

License