Using Phonemes in TTS with Meta Voice SDK: Wit.ai, Custom Models, or ONNX in Unity?
Hi all,
I'm working on a Unity project where speech technology is central, and I'm facing a hurdle with Meta's Voice SDK. My primary need is to use phonemes directly for text-to-speech (TTS), but I've found that Wit.ai neither accepts direct IPA (International Phonetic Alphabet) input nor offers phoneme-level control for TTS.
Questions & Discussion Points:
- Is there any way to use Wit.ai for phoneme or IPA-based TTS, or is this currently unsupported?
- Are there recommended approaches to integrate speech models based on self-supervised learning (like wav2vec 2.0, HuBERT, or WavLM) with Unity, either alongside or instead of Wit.ai?
- For complete control over TTS—especially for phoneme-level synthesis—would it make sense to bypass Wit.ai entirely and run a model (converted to ONNX) for inference directly inside Unity?
- Have others run into similar limitations, and if so, what workflows or toolchains have worked best for you?
I’d appreciate any advice/examples for integrating more advanced or flexible TTS pipelines into Unity, especially those compatible with IPA/phoneme input or utilizing state-of-the-art self-supervised models.
Thanks!
Wit.ai doesn’t currently support phoneme-level TTS control, so feeding it IPA directly isn’t an option. For more flexibility, a common route is to export a custom TTS model to ONNX and run inference inside Unity, which gives you control over the exact phoneme sequence being synthesized. Keep in mind that wav2vec 2.0, HuBERT, and WavLM are speech encoders: they take audio as input and produce representations, so they fit recognition or phoneme alignment better than speech generation. They can be adapted into a TTS pipeline, but that usually requires extra steps, such as phoneme alignment, before integration.
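To make the phoneme-input part concrete: most phoneme-driven TTS models exported to ONNX expect a sequence of integer IDs rather than raw IPA text, so you need a small tokenizer in front of the model. Here is a minimal sketch of that step, written in Python for brevity (the same logic is easy to port to C# for Unity). The symbol table, the `ipa_to_ids` name, and the padding/unknown symbols are all hypothetical; a real model ships its own phoneme-to-ID mapping, and you must use the one it was trained with.

```python
import unicodedata

# Hypothetical symbol table for illustration only. A real TTS model defines
# its own phoneme inventory and ID order, fixed at training time.
PAD, UNK = "_", "?"
SYMBOLS = [PAD, UNK, " ", "h", "ə", "l", "oʊ", "w", "ɜː", "d"]
SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}
MAX_SYMBOL_LEN = max(len(symbol) for symbol in SYMBOLS)


def ipa_to_ids(ipa: str) -> list[int]:
    """Convert an IPA string to model input IDs using greedy longest match."""
    # Normalize so visually identical IPA strings compare equal.
    text = unicodedata.normalize("NFC", ipa)
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so "oʊ" wins over a lone "o".
        for length in range(min(MAX_SYMBOL_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in SYMBOL_TO_ID:
                ids.append(SYMBOL_TO_ID[chunk])
                i += length
                break
        else:
            # No symbol matched: emit the unknown ID and skip one character.
            ids.append(SYMBOL_TO_ID[UNK])
            i += 1
    return ids


print(ipa_to_ids("həloʊ wɜːld"))  # [3, 4, 5, 6, 2, 7, 8, 5, 9]
```

The resulting ID list is what you would pack into the model's integer input tensor before running inference. Longest-match matters because multi-character phonemes such as diphthongs and long vowels would otherwise be split into the wrong symbols.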