Kokoro TTS — Android

Kokoro-82M is a lightweight, non-autoregressive text-to-speech model that runs on Android via ONNX Runtime. It produces natural-sounding 24 kHz speech with 50 preset voices across 8 languages (English in both US and UK variants).

Supported Languages

| Language | Code | Example voices |
|---|---|---|
| English (US) | en | af_heart, am_adam, af_sky |
| English (UK) | en | bf_emma, bm_george |
| Spanish | es | ef_dora |
| French | fr | ff_siwis |
| Hindi | hi | hf_alpha, hm_omega |
| Italian | it | if_sara |
| Japanese | ja | jf_alpha, jm_omega |
| Portuguese | pt | pf_dora |
| Chinese | zh | zf_xiaobei, zm_yunjian |

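There are 50 preset voices in total. Voice IDs follow the pattern [language][gender]_[name]: the first letter is the language or accent (a = US English, b = UK English, e = Spanish, f = French, h = Hindi, i = Italian, j = Japanese, p = Portuguese, z = Chinese) and the second is f (female) or m (male). For example, af_heart is the American English female voice "Heart".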
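The naming convention is regular enough to decode in code, for example to group voices by language in a voice picker. The sketch below is a hypothetical helper, not part of speech-android; its prefix table is read off the example voices above, so any ID whose prefix letters are not listed there is rejected.

```kotlin
// Hypothetical helper, not part of speech-android: decodes a Kokoro voice ID
// such as "af_heart" using the prefix letters from the Supported Languages table.
data class VoiceId(val language: String, val gender: String, val name: String)

// First letter: language/accent, taken from the example voices in the table.
val LANGUAGE_PREFIXES: Map<Char, String> = mapOf(
    'a' to "English (US)",
    'b' to "English (UK)",
    'e' to "Spanish",
    'f' to "French",
    'h' to "Hindi",
    'i' to "Italian",
    'j' to "Japanese",
    'p' to "Portuguese",
    'z' to "Chinese",
)

// Second letter: gender.
val GENDER_PREFIXES: Map<Char, String> = mapOf('f' to "Female", 'm' to "Male")

// Returns null for IDs that don't match [language][gender]_[name].
fun parseVoiceId(id: String): VoiceId? {
    if (id.length < 4 || id[2] != '_') return null
    val language = LANGUAGE_PREFIXES[id[0]] ?: return null
    val gender = GENDER_PREFIXES[id[1]] ?: return null
    return VoiceId(language, gender, id.substring(3))
}
```

For example, `parseVoiceId("hm_omega")` yields Hindi / Male / omega, while an ID that does not fit the pattern returns null instead of throwing.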

Model Files

| File | Size / contents |
|---|---|
| kokoro-model-int8.onnx | ~89 MB |
| voices.bin | Voice embeddings |
| Phoneme dictionaries | Language-specific pronunciation data |

Hugging Face: aufklarer/Kokoro-82M-ONNX
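The table does not document how voices.bin is laid out. As a rough sketch, assuming the data is raw little-endian float32 and that each voice is a stack of 256-value style vectors with one row per input length (the shape used by the upstream Kokoro voice packs), decoding a voice and choosing its style vector could look like the code below. Both helper names are hypothetical, and indexing rows by token count (clamped) is an assumption; Kokoro front ends differ by one on this convention. Locating a single voice's slice inside the combined file depends on the real format and is not shown.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// ASSUMPTION: raw little-endian float32, one voice = rows x 256 style values,
// one row per input length. Verify against the real voices.bin before relying on it.

// Decode little-endian float32 bytes into a FloatArray.
fun decodeFloat32LE(bytes: ByteArray): FloatArray {
    require(bytes.size % 4 == 0) { "byte count must be a multiple of 4" }
    val floats = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer()
    return FloatArray(floats.remaining()).also { floats.get(it) }
}

// Pick the style row for an input of `tokenCount` phoneme tokens, clamped to
// the rows the voice actually has. Hypothetical helper.
fun styleForLength(voice: FloatArray, tokenCount: Int, dim: Int = 256): FloatArray {
    require(voice.isNotEmpty() && voice.size % dim == 0) { "voice must hold whole rows of $dim floats" }
    val rows = voice.size / dim
    val row = tokenCount.coerceIn(0, rows - 1)
    return voice.copyOfRange(row * dim, (row + 1) * dim)
}
```

Given one voice's bytes, `styleForLength(decodeFloat32LE(voiceBytes), tokenCount)` returns the 256-value vector that a Kokoro-style model takes as its style input alongside the phoneme tokens.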

Performance

| Metric | Value |
|---|---|
| Parameters | 82M |
| Inference backend | ONNX Runtime |
| Output sample rate | 24 kHz |
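Because the model outputs 24 kHz audio, saving a result for inspection means writing a WAV header with that rate. A minimal sketch, assuming the synthesized audio is available as a FloatArray in [-1, 1] (how speech-android exposes TTS output is not shown here): convert to 16-bit PCM, then prepend the standard 44-byte RIFF/WAVE header for mono PCM. The function names are illustrative, not part of the library.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Convert float samples (assumed to be in [-1, 1]) to 16-bit PCM, clamping overshoot.
fun floatToPcm16(samples: FloatArray): ShortArray =
    ShortArray(samples.size) { i ->
        (samples[i].coerceIn(-1f, 1f) * Short.MAX_VALUE).toInt().toShort()
    }

// Build a complete mono 16-bit PCM WAV file (44-byte header + data) in memory.
fun encodeWav(pcm: ShortArray, sampleRate: Int = 24_000): ByteArray {
    val dataBytes = pcm.size * 2
    val buf = ByteBuffer.allocate(44 + dataBytes).order(ByteOrder.LITTLE_ENDIAN)
    buf.put("RIFF".toByteArray(Charsets.US_ASCII))
    buf.putInt(36 + dataBytes)            // RIFF chunk size: everything after this field
    buf.put("WAVE".toByteArray(Charsets.US_ASCII))
    buf.put("fmt ".toByteArray(Charsets.US_ASCII))
    buf.putInt(16)                        // fmt chunk size for PCM
    buf.putShort(1.toShort())             // audio format 1 = PCM
    buf.putShort(1.toShort())             // channels: mono
    buf.putInt(sampleRate)
    buf.putInt(sampleRate * 2)            // byte rate = rate * channels * bytesPerSample
    buf.putShort(2.toShort())             // block align = channels * bytesPerSample
    buf.putShort(16.toShort())            // bits per sample
    buf.put("data".toByteArray(Charsets.US_ASCII))
    buf.putInt(dataBytes)
    pcm.forEach { buf.putShort(it) }      // little-endian sample data
    return buf.array()
}

// Duration in seconds of `sampleCount` samples at the given rate.
fun durationSeconds(sampleCount: Int, sampleRate: Int = 24_000): Double =
    sampleCount.toDouble() / sampleRate
```

For example, `File(cacheDir, "tts.wav").writeBytes(encodeWav(floatToPcm16(audio)))`, where `audio` is the hypothetical synthesized buffer. At 24 kHz, 48 000 samples last 2.0 seconds.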

Phonemizer

Text is converted to phoneme tokens by a dictionary-based phonemizer that draws on language-specific pronunciation data (the phoneme dictionaries listed under Model Files). The Android implementation includes phonemizers for English, French, Spanish, Italian, Portuguese, Hindi, Japanese, and Chinese.
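A dictionary-based front end boils down to two lookups: word to phoneme string, then phoneme symbol to the integer token IDs the model consumes. The sketch below illustrates only that shape. The class, its toy dictionary and its toy vocabulary are hypothetical; unknown words are simply dropped (a real phonemizer needs a fallback such as letter-to-sound rules); and wrapping the IDs in a 0 token at each end is an assumption based on common Kokoro ONNX front ends.

```kotlin
// Toy dictionary-based phonemizer: word -> phoneme string, then symbol -> token id.
// Hypothetical class; Kokoro's real phoneme inventory and vocabulary differ.
class DictionaryPhonemizer(
    private val dictionary: Map<String, String>,  // lowercase word -> phonemes
    private val vocab: Map<Char, Int>,            // phoneme symbol -> token id
) {
    // Split on anything that is not a letter or apostrophe and look each word up.
    // Unknown words are skipped here instead of using a letter-to-sound fallback.
    fun phonemize(text: String): String =
        text.lowercase()
            .split(Regex("[^\\p{L}']+"))
            .filter { it.isNotEmpty() }
            .mapNotNull { dictionary[it] }
            .joinToString(" ")

    // Map each phoneme symbol to its id, dropping symbols missing from the vocab,
    // and pad both ends with id 0 (ASSUMPTION: verify against the actual model).
    fun tokenize(phonemes: String): IntArray {
        val ids = phonemes.mapNotNull { vocab[it] }
        return (listOf(0) + ids + listOf(0)).toIntArray()
    }
}
```

With `dictionary = mapOf("hello" to "helO")` and a vocabulary covering h, e, l and O, `tokenize(phonemize("Hello!"))` produces the four symbol IDs wrapped in a leading and trailing 0.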

Pipeline Integration

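On Android, Kokoro TTS runs as the synthesis stage of SpeechPipeline, which chains voice activity detection (VAD), speech-to-text (STT), and TTS. Once STT has transcribed incoming speech, the text is phonemized and synthesized back into audio. The pipeline drives this VAD → STT → TTS flow automatically, so the app only feeds it audio and listens for events.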

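import androidx.lifecycle.lifecycleScope
import kotlinx.coroutines.launch
// ...plus the speech-android imports for ModelManager, SpeechPipeline, SpeechConfig, SpeechEvent

lifecycleScope.launch { // e.g. inside an Activity; `context` is that Activity
    val modelDir = ModelManager.ensureModels(context)
    val pipeline = SpeechPipeline(SpeechConfig(modelDir = modelDir))
    // collect() suspends while the flow is active, so run it in a child coroutine;
    // otherwise start() and pushAudio() below would never be reached
    launch {
        pipeline.events.collect { event ->
            when (event) {
                is SpeechEvent.TranscriptionCompleted -> println(event.text)
                else -> {}
            }
        }
    }
    pipeline.start()
    pipeline.pushAudio(samples) // samples: 16 kHz mono float32
}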
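The snippet above pushes `samples` as 16 kHz mono float32. If your audio source delivers 16-bit PCM or interleaved stereo instead, convert it first. A minimal sketch, assuming pushAudio accepts a FloatArray (its parameter type is not shown above); both helpers are hypothetical and do not resample, so the input must already be at 16 kHz.

```kotlin
// Convert 16-bit PCM (e.g. from a WAV file or a recorder using 16-bit encoding)
// to float32. Dividing by 32768 maps the full Short range into [-1, 1).
fun pcm16ToFloat(pcm: ShortArray): FloatArray =
    FloatArray(pcm.size) { i -> pcm[i] / 32768f }

// Downmix interleaved stereo (L, R, L, R, ...) to mono by averaging each pair.
fun stereoToMono(interleaved: FloatArray): FloatArray {
    require(interleaved.size % 2 == 0) { "interleaved stereo needs an even sample count" }
    return FloatArray(interleaved.size / 2) { i ->
        (interleaved[2 * i] + interleaved[2 * i + 1]) / 2f
    }
}
```

For mono PCM16 captured at 16 kHz, `pipeline.pushAudio(pcm16ToFloat(shortBuffer))` would be enough, with `shortBuffer` standing in for whatever buffer your capture code fills.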

Source code: github.com/soniqo/speech-android