Forum Discussion

kirupanandanj
Honored Guest
7 months ago
Solved

Using Phonemes in TTS with Meta Voice SDK: Wit.ai, Custom Models, or ONNX in Unity?

Hi all,

I'm working on a Unity project where speech technology is central, and I've hit a hurdle with Meta's Voice SDK. My primary need is to drive text-to-speech (TTS) directly from phonemes, but I've found that Wit.ai neither accepts direct IPA (International Phonetic Alphabet) input nor offers phoneme-level control for TTS.

Questions & Discussion Points:

  • Is there any way to use Wit.ai for phoneme or IPA-based TTS, or is this currently unsupported?
  • Are there recommended approaches to integrate speech models based on self-supervised learning (like wav2vec 2.0, HuBERT, or WavLM) with Unity, either alongside or instead of Wit.ai?
  • For complete control over TTS—especially for phoneme-level synthesis—would it make sense to bypass Wit.ai entirely and run a model (converted to ONNX) for inference directly inside Unity?
  • Have others run into similar limitations, and if so, what workflows or toolchains have worked best for you?

I’d appreciate any advice/examples for integrating more advanced or flexible TTS pipelines into Unity, especially those compatible with IPA/phoneme input or utilizing state-of-the-art self-supervised models.

Thanks!

1 Reply

  • Wit.ai doesn't currently support phoneme-level TTS control, so passing IPA directly isn't an option there. For more flexibility, you can run a custom model exported to ONNX inside Unity, which gives you much finer control if you need precise phoneme-level synthesis. Keep in mind that self-supervised models like wav2vec 2.0 or HuBERT are speech encoders rather than synthesizers: they can be adapted for parts of the pipeline (for example, phoneme recognition), but they usually require extra phoneme-alignment steps before integration.
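
To make the "phoneme-level input" part concrete, here is a minimal sketch of the preprocessing a phoneme-conditioned TTS model exported to ONNX would typically need: turning an IPA string into the integer IDs the model consumes. It is written in Python for brevity, and the same logic would need porting to C# to run inside Unity. The symbol table `PHONEME_IDS` and the function `ipa_to_ids` are hypothetical names made up for illustration; a real model ships its own phoneme inventory and ID mapping, which you would load instead.

```python
# Hypothetical phoneme inventory (an assumption for illustration only).
# A real phoneme-conditioned TTS model defines its own symbol-to-ID table.
PHONEME_IDS = {
    "<pad>": 0, "<unk>": 1, " ": 2,
    "h": 3, "ə": 4, "l": 5, "oʊ": 6,
    "w": 7, "ɜː": 8, "d": 9, "tʃ": 10,
}


def ipa_to_ids(ipa, table=PHONEME_IDS):
    """Tokenize an IPA string into model IDs using greedy longest match.

    Multi-character phonemes such as "oʊ" or "tʃ" are matched before their
    single-character prefixes. Symbols missing from the table map to <unk>.
    """
    # Longest real phoneme in the table (special tokens like <pad> excluded).
    max_len = max(len(sym) for sym in table if not sym.startswith("<"))
    ids = []
    i = 0
    while i < len(ipa):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(ipa) - i), 0, -1):
            chunk = ipa[i:i + size]
            if chunk in table:
                ids.append(table[chunk])
                i += size
                break
        else:
            # No match at this position: emit <unk> and skip one character.
            ids.append(table["<unk>"])
            i += 1
    return ids


if __name__ == "__main__":
    print(ipa_to_ids("həloʊ wɜːld"))  # [3, 4, 5, 6, 2, 7, 8, 5, 9]
```

The resulting ID list is what you would pack into the model's input tensor. Greedy longest match is just one simple choice here; if your model's inventory includes diacritics or stress marks, its own tokenizer rules should take precedence.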
