Kokoro TTS — Android

Kokoro-82M is a lightweight, non-autoregressive text-to-speech model that runs on Android via ONNX Runtime. It produces natural-sounding 24 kHz speech with 50 preset voices across 8 languages.

Supported Languages

Language	Code	Example Voices
English (US)	en	af_heart, am_adam, af_sky
English (UK)	en	bf_emma, bm_george
Spanish	es	ef_dora
French	fr	ff_siwis
Hindi	hi	hf_alpha, hm_omega
Italian	it	if_sara
Japanese	ja	jf_alpha, jm_omega
Portuguese	pt	pf_dora
Chinese	zh	zf_xiaobei, zm_yunjian

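There are 50 preset voices in total. Voice names follow the pattern [language][gender]_[name]: the first letter identifies the language or accent (a for American English, b for British English, and so on), and the second letter is the gender (f or m). For example, af_heart is American Female "Heart".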
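The naming convention can be decoded mechanically. The following is a minimal sketch, not part of the library: `VoiceId`, `parseVoiceId`, and `languageByPrefix` are hypothetical names, and the prefix map is inferred only from the example voices in the table above.

```kotlin
// Hypothetical helpers for illustration; not part of the speech-android API.
data class VoiceId(val language: Char, val gender: Char, val name: String)

// Prefix letters inferred from the example voices listed in the table above.
val languageByPrefix = mapOf(
    'a' to "English (US)", 'b' to "English (UK)", 'e' to "Spanish",
    'f' to "French", 'h' to "Hindi", 'i' to "Italian",
    'j' to "Japanese", 'p' to "Portuguese", 'z' to "Chinese"
)

/** Parses IDs such as "af_heart"; returns null if the ID does not match the pattern. */
fun parseVoiceId(id: String): VoiceId? {
    // One language letter, one gender letter (f or m), an underscore, then the name.
    val match = Regex("^([a-z])([fm])_([a-z0-9]+)$").matchEntire(id) ?: return null
    val (language, gender, name) = match.destructured
    return VoiceId(language[0], gender[0], name)
}
```

With this sketch, `parseVoiceId("af_heart")` yields `VoiceId(language=a, gender=f, name=heart)`, and an ID that does not follow the pattern, such as `"heart"`, yields `null`.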

Model Files

File	Size / Contents
`kokoro-model-int8.onnx`	~89 MB
`voices.bin`	Voice embeddings
Phoneme dictionaries	Language-specific pronunciation data

Hugging Face: aufklarer/Kokoro-82M-ONNX

Performance

Metric	Value
Parameters	82M
Inference backend	ONNX Runtime
Output sample rate	24 kHz

Phonemizer

Text is converted to phoneme tokens by a dictionary-based phonemizer, with a separate phonemizer for each language. The Android implementation includes phonemizers for English, French, Spanish, Italian, Portuguese, Hindi, Japanese, and Chinese.
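To illustrate the general idea of dictionary-based phonemization, here is a minimal sketch. It is not the library's implementation: the class name `DictionaryPhonemizer`, the dictionary entries, the phoneme symbols, and the token IDs are all made up for illustration, and it assumes whitespace-delimited text (which would not hold for Japanese or Chinese).

```kotlin
// Hypothetical sketch; the real phonemizers and the Kokoro token vocabulary will differ.
private const val PUNCTUATION = ".,!?;:\"()"

class DictionaryPhonemizer(
    private val pronunciations: Map<String, String>, // word -> phoneme string
    private val tokenIds: Map<Char, Int>             // phoneme symbol -> token ID
) {
    /** Looks up each word in the dictionary, then maps phoneme symbols to token IDs. */
    fun phonemize(text: String): List<Int> =
        text.lowercase()
            .split(Regex("\\s+"))
            .map { word -> word.trim { it in PUNCTUATION } }
            .mapNotNull { pronunciations[it] }   // words missing from the dictionary are skipped
            .joinToString(" ")                   // keep a space symbol between words
            .mapNotNull { tokenIds[it] }         // symbols without a token ID are dropped
}
```

A production phonemizer would also need a fallback for out-of-dictionary words (for example, letter-to-sound rules) rather than silently skipping them as this sketch does.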

Pipeline Integration

On Android, Kokoro TTS runs as part of SpeechPipeline. After speech-to-text (STT) transcribes the incoming speech, the text is phonemized and synthesized back into audio. The pipeline manages the full VAD → STT → TTS flow (voice activity detection, transcription, synthesis) automatically.

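import kotlinx.coroutines.launch

// Assumes `scope` is an existing CoroutineScope (e.g. lifecycleScope) and `events` is a Flow.
val modelDir = ModelManager.ensureModels(context)
val pipeline = SpeechPipeline(SpeechConfig(modelDir = modelDir))
// collect() suspends until the flow completes, so run it in its own coroutine;
// otherwise start() below might never be reached.
scope.launch {
    pipeline.events.collect { event ->
        when (event) {
            is SpeechEvent.TranscriptionCompleted -> println(event.text)
            else -> {} // ignore other event types
        }
    }
}
pipeline.start()
pipeline.pushAudio(samples) // 16 kHz mono float32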
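Microphone capture on Android often delivers signed 16-bit PCM, while the snippet above notes that the pipeline expects 16 kHz mono float32. A minimal conversion sketch follows; `pcm16ToFloat` is a hypothetical helper, and it assumes `pushAudio` accepts a `FloatArray` normalized to roughly [-1.0, 1.0), which the source does not state explicitly.

```kotlin
/**
 * Converts signed 16-bit PCM samples to float32 in the range [-1.0, 1.0).
 * Hypothetical helper; the expected input range of pushAudio is an assumption.
 */
fun pcm16ToFloat(pcm: ShortArray): FloatArray =
    FloatArray(pcm.size) { i -> pcm[i] / 32768f }
```

The result could then be passed along as `pipeline.pushAudio(pcm16ToFloat(buffer))`, provided the audio was recorded at 16 kHz in mono; this helper does not resample or downmix.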

Source code: github.com/soniqo/speech-android