Kokoro TTS
Kokoro-82M is a lightweight, non-autoregressive text-to-speech model based on StyleTTS 2 with an ISTFTNet vocoder. It runs entirely on the Neural Engine via CoreML, producing natural 24 kHz speech from text with no autoregressive sampling loop.
Kokoro-82M is designed for on-device iOS deployment. At 82M parameters (~80 MB with 1 bucket, INT8), it fits comfortably on iPhone and iPad. The CoreML model runs on the Neural Engine, leaving the GPU free for other tasks.
Supported Languages
| Language | Code | Example Voices |
|---|---|---|
| English (US) | en | af_heart, am_adam, af_sky |
| English (UK) | en | bf_emma, bm_george |
| Spanish | es | ef_dora |
| French | fr | ff_siwis |
| Hindi | hi | hf_alpha, hm_omega |
| Italian | it | if_sara |
| Japanese | ja | jf_alpha, jm_omega |
| Portuguese | pt | pf_dora |
| Chinese | zh | zf_xiaobei, zm_yunjian |
| Korean | ko | kf_somi |
54 preset voices total. Voice names follow the convention [language][gender]_[name]; for example, af_heart is American Female "Heart".
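The naming convention can be unpacked mechanically. The `VoicePreset` type below is an illustrative sketch, not part of the KokoroTTS API:

```swift
// Sketch of the voice-ID convention described above; this helper type
// is illustrative and not part of the KokoroTTS library.
struct VoicePreset {
    let language: Character  // e.g. "a" = American English, "b" = British
    let gender: Character    // "f" = female, "m" = male
    let name: String

    init?(id: String) {
        // Expect "<lang><gender>_<name>", e.g. "af_heart"
        let parts = id.split(separator: "_", maxSplits: 1)
        guard parts.count == 2, parts[0].count == 2,
              let lang = parts[0].first, let gender = parts[0].last else {
            return nil
        }
        self.language = lang
        self.gender = gender
        self.name = String(parts[1])
    }
}

let v = VoicePreset(id: "af_heart")!
assert(v.language == "a" && v.gender == "f" && v.name == "heart")
```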
Architecture
Kokoro uses a 3-stage CoreML pipeline. No sampling loop — all stages are non-autoregressive forward passes with a Swift-side alignment step between stages 1 and 2.
3-Stage Pipeline
| Stage | Model | Input | Output |
|---|---|---|---|
| 1. Duration | duration.mlmodelc | Phoneme tokens + voice embedding + speed | Durations, prosody features, text encoding |
| — | Swift alignment | Durations + stage 1 features | Aligned prosody & text features |
| 2. Prosody | prosody.mlmodelc | Aligned prosody features + style | F0 (pitch) + noise predictions |
| 3. Decoder | decoder_*.mlmodelc | Aligned text + F0 + noise + style | 24 kHz audio waveform |
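The Swift-side alignment step between stages 1 and 2 expands each phoneme's feature vector by its predicted duration (in frames), so the prosody and decoder stages see frame-rate inputs. A minimal sketch on plain arrays, assuming the real implementation operates on `MLMultiArray`s:

```swift
// Repeat each per-phoneme feature vector by its predicted duration.
// Illustrative sketch of the alignment step, not the library internals.
func align(features: [[Float]], durations: [Int]) -> [[Float]] {
    precondition(features.count == durations.count,
                 "one duration per phoneme feature")
    var out: [[Float]] = []
    for (vec, frames) in zip(features, durations) {
        for _ in 0..<frames { out.append(vec) }
    }
    return out
}

// Two phonemes lasting 2 and 1 frames yield 3 aligned frames.
let aligned = align(features: [[1, 2], [3, 4]], durations: [2, 1])
assert(aligned.count == 3 && aligned[0] == [1, 2] && aligned[2] == [3, 4])
```

The total frame count after alignment is what determines which decoder bucket is needed.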
Phoneme Buckets (Duration Model)
The duration model uses enumerated input shapes. Input is padded to the smallest bucket that fits:
| Bucket | Max Phonemes | Use Case |
|---|---|---|
| p16 | 16 | Short phrases |
| p32 | 32 | Short sentences |
| p64 | 64 | Medium sentences |
| p128 | 128 | Long sentences |
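Selection of the smallest fitting bucket can be sketched as follows, assuming the bucket sizes in the table above (the helper is illustrative, not the library API):

```swift
// Enumerated input shapes for the duration model, per the table above.
let phonemeBuckets = [16, 32, 64, 128]

// Smallest enumerated shape that fits; nil means the text exceeds the
// largest bucket and must be split before synthesis.
func bucket(forPhonemeCount n: Int) -> Int? {
    phonemeBuckets.first { n <= $0 }
}

assert(bucket(forPhonemeCount: 10) == 16)
assert(bucket(forPhonemeCount: 33) == 64)
assert(bucket(forPhonemeCount: 200) == nil)
```

Inputs shorter than the chosen bucket are padded up to its length before the forward pass.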
Decoder Buckets
Fixed-shape decoder models for different maximum audio lengths (each frame = 600 samples at 24 kHz):
| Bucket | Max Frames | Max Audio |
|---|---|---|
| decoder_5s | 200 | 5.0s |
| decoder_10s | 400 | 10.0s |
| decoder_15s | 600 | 15.0s |
Requires iOS 18+ / macOS 15+.
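Decoder selection follows from the predicted frame count (the sum of per-phoneme durations). A sketch of the frame math from the table above, where each frame is 600 samples at 24 kHz; the helper names are illustrative:

```swift
// Fixed-shape decoder buckets, per the table above.
let decoderBuckets: [(name: String, maxFrames: Int)] = [
    ("decoder_5s", 200), ("decoder_10s", 400), ("decoder_15s", 600),
]

// Smallest decoder whose frame capacity covers the predicted length.
func decoder(forFrames frames: Int) -> String? {
    decoderBuckets.first { frames <= $0.maxFrames }?.name
}

// 600 samples per frame at 24 kHz: 200 frames = 120_000 samples = 5.0 s.
func seconds(forFrames frames: Int) -> Double {
    Double(frames * 600) / 24_000.0
}

assert(decoder(forFrames: 150) == "decoder_5s")
assert(decoder(forFrames: 401) == "decoder_15s")
assert(seconds(forFrames: 200) == 5.0)
```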
Phonemizer
Text is converted to phoneme tokens via a three-tier pipeline — all Apache-2.0 licensed, no GPL dependencies:
- Dictionary lookup — US English and British English pronunciation dictionaries with heteronym support
- Suffix stemming — Morphological decomposition for known suffixes (e.g., "-ing", "-tion")
- BART G2P — Neural grapheme-to-phoneme fallback using a separate CoreML encoder-decoder model for out-of-vocabulary words
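The three tiers fall through in order. A minimal sketch of that fallback chain with toy data; the function signature and phone strings are illustrative, not the phonemizer's actual API:

```swift
// Three-tier grapheme-to-phoneme fallback: dictionary, suffix stemming,
// then neural G2P. Illustrative sketch, not the library implementation.
func phonemize(word: String,
               dictionary: [String: [String]],
               suffixes: [(suffix: String, phones: [String])],
               neuralG2P: (String) -> [String]) -> [String] {
    // Tier 1: exact dictionary lookup.
    if let phones = dictionary[word] { return phones }
    // Tier 2: strip a known suffix and look up the stem.
    for (suffix, phones) in suffixes where word.hasSuffix(suffix) {
        let stem = String(word.dropLast(suffix.count))
        if let stemPhones = dictionary[stem] { return stemPhones + phones }
    }
    // Tier 3: neural G2P fallback for out-of-vocabulary words.
    return neuralG2P(word)
}

let dict = ["jump": ["JH", "AH", "M", "P"]]
let suffixes: [(suffix: String, phones: [String])] = [("ing", ["IH", "NG"])]
let g2p: (String) -> [String] = { _ in ["OOV"] }

assert(phonemize(word: "jump", dictionary: dict, suffixes: suffixes,
                 neuralG2P: g2p) == ["JH", "AH", "M", "P"])
assert(phonemize(word: "jumping", dictionary: dict, suffixes: suffixes,
                 neuralG2P: g2p) == ["JH", "AH", "M", "P", "IH", "NG"])
```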
Model Weights
| Component | Size | Format |
|---|---|---|
| Duration model | ~39 MB | .mlmodelc |
| Prosody model | ~17 MB | .mlmodelc |
| Decoder models (3 buckets) | ~107 MB each | .mlmodelc |
| Voice embeddings (54 voices) | ~0.3 MB | JSON (256-dim Float32) |
| G2P encoder + decoder | ~1.5 MB | .mlmodelc |
| Dictionaries + vocab | ~6 MB | JSON |
| Total (1 decoder bucket) | ~170 MB | — |
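The ~170 MB total follows from summing the component sizes above with a single decoder bucket:

```swift
// Component sizes in MB from the table above, with one decoder bucket:
// duration, prosody, 1 decoder, voice embeddings, G2P, dictionaries.
let sizesMB: [Double] = [39, 17, 107, 0.3, 1.5, 6]
let totalMB = sizesMB.reduce(0, +)  // ≈ 170.8 MB
assert(totalMB > 169 && totalMB < 172)
```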
Performance
| Metric | Value |
|---|---|
| Parameters | 82M |
| Inference backend | CoreML (Neural Engine) |
| Inference RTF | ~0.7 (RTF < 1 is faster than real time) |
| Output sample rate | 24 kHz |
| Weight memory | ~170 MB (1 decoder bucket) |
Unlike Qwen3-TTS and CosyVoice3, which generate audio tokens step by step, Kokoro uses a 3-stage pipeline with no sampling loop; all stages are deterministic forward passes.
CLI Usage
audio kokoro "Hello, world!" --voice af_heart --output hello.wav
Options
| Option | Default | Description |
|---|---|---|
| <text> | — | Text to synthesize |
| --voice | af_heart | Voice preset name |
| --language | en | Language code: en, es, fr, hi, it, ja, pt, zh, ko, de |
| --output, -o | kokoro_output.wav | Output WAV file path |
| --list-voices | — | List all available voices and exit |
| --model, -m | — | HuggingFace model ID |
Examples
# English with default voice
audio kokoro "Hello, how are you today?" --output hello.wav
# French
audio kokoro "Bonjour le monde" --voice ff_siwis --language fr --output bonjour.wav
# Japanese
audio kokoro "こんにちは世界" --voice jf_alpha --language ja --output konnichiwa.wav
# List all available voices
audio kokoro --list-voices
Swift API
import KokoroTTS
import AudioCommon
let tts = try await KokoroTTSModel.fromPretrained()
// Downloads ~170 MB on first run
let audio = try tts.synthesize(text: "Hello world", voice: "af_heart")
// audio: [Float] — 24 kHz mono PCM
try WAVWriter.write(samples: audio, sampleRate: 24000, to: outputURL)
Compute Unit Override
fromPretrained(computeUnits:) selects which hardware runs the main CoreML model. The default (.all) lets Core ML prefer the Neural Engine, which is the fastest path on every supported device. Pass .cpuAndGPU to bypass the ANE as a fallback on platforms where the ANE compiler produces incorrect output for this model.
import CoreML
import KokoroTTS
// Default: ANE preferred
let tts = try await KokoroTTSModel.fromPretrained()
// Fallback: bypass the Neural Engine
let tts = try await KokoroTTSModel.fromPretrained(computeUnits: .cpuAndGPU)
When to Use Kokoro
| Use Case | Recommended TTS |
|---|---|
| iOS app, lightweight, battery-efficient | Kokoro (CoreML, 82M params, ~170 MB) |
| Highest quality, streaming, voice cloning | Qwen3-TTS (MLX, 600M params, ~1.7 GB) |
| Multilingual streaming, 9 languages | CosyVoice3 (MLX, 500M params, ~1.2 GB) |
| Full-duplex spoken dialogue | PersonaPlex (MLX, 7B params, ~5.5 GB) |
License
- Model weights: Apache-2.0 (hexgrad/Kokoro-82M)
- CoreML conversion: Apache-2.0 (aufklarer/Kokoro-82M-CoreML)
- Dictionaries and G2P: Apache-2.0