Yesterday was a whirlwind of development as I built the tool we’re now using to generate all the speech sound files for our game. We initially started with Google TTS during the prototype phase, but we’ve since transitioned to Azure TTS—and let me tell you, it’s been quite the revelation.
I was absolutely delighted when I discovered that the response from Azure’s TTS endpoint not only gives you the sound file but also the viseme ID and audio offset. This will significantly help in generating speech-to-face animations on the fly! However, imagine my excitement when I realized that Azure’s TTS endpoint can also include the viseme as a 2D SVG image or even as blend shapes! My jaw literally dropped. While I know that endpoints can provide such features, I never expected one to be so well thought through and comprehensive.
Just imagine: if you’re developing a chatbot with a 2D character, you can get the mouth movements directly in sync with the audio, straight from the response. And the fact that you can also receive blend shapes? This means you could potentially have a Metahuman or another 3D character created in something like Character Creator 4 (CC4) performing facial animations straight out of the box. Incredible, right?
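To make that concrete, here is a minimal Python sketch of turning viseme events into animation keyframes. It assumes each event carries a viseme ID and an audio offset in 100-nanosecond ticks, which is how I understand the Azure Speech SDK reports it. The `VisemeEvent` class, the `to_keyframes` helper, the 30 fps default, and the sample values are all made up for illustration; nothing here calls Azure.

```python
from dataclasses import dataclass

# Assumption: offsets arrive in 100-nanosecond ticks (10,000,000 per second).
TICKS_PER_SECOND = 10_000_000


@dataclass(frozen=True)
class VisemeEvent:
    """Hypothetical container for one viseme event from a TTS response."""
    viseme_id: int
    audio_offset_ticks: int


def to_keyframes(events, fps=30):
    """Convert viseme events into (frame, viseme_id) pairs ordered by time."""
    keyframes = []
    for event in sorted(events, key=lambda e: e.audio_offset_ticks):
        # Scale the tick offset to the nearest animation frame.
        frame = round(event.audio_offset_ticks * fps / TICKS_PER_SECOND)
        keyframes.append((frame, event.viseme_id))
    return keyframes


# Invented sample data: three events at 0.1 s, 0.5 s and 1.0 s.
sample_events = [
    VisemeEvent(viseme_id=19, audio_offset_ticks=5_000_000),
    VisemeEvent(viseme_id=0, audio_offset_ticks=1_000_000),
    VisemeEvent(viseme_id=6, audio_offset_ticks=10_000_000),
]
keyframes = to_keyframes(sample_events)
```

The resulting list can then drive whatever animation system is in use, one mouth pose per keyframe.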
But, as always, there’s a catch—two big ones, actually.
- No Icelandic Support: While the feature-rich Azure TTS endpoint supports neural Icelandic voices, viseme output isn’t available for Icelandic. Given our focus on language learning, this is a significant setback.
- Incompatibility with My Rig: I’ve moved away from using Metahumans and the CC4 character we developed with a freelancer, mainly for optimization and aesthetic reasons. My current Rigify setup is bone-based and doesn’t support blend shapes, which makes the blend shape data from Azure less useful for us.
Despite these setbacks, I’m not one to be easily deterred. If setbacks could faze me, I wouldn’t be here today, running my own game company. So, of course, I immediately started brainstorming workarounds.
Tackling the Challenges
For the Icelandic Support Issue: While the Azure endpoint works beautifully for all other languages we have planned, Icelandic will require a bit more manual effort. Since the endpoint supports bookmarks, I thought I could bookmark each letter and define which viseme corresponds to each point. However, I soon discovered that bookmarks aren’t as silent as the documentation claims—they distort the words, so this strategy won’t work as intended. Instead, I’ll count visemes, add placeholder audio offsets, and manually adjust those offsets until the timing of the animations aligns perfectly with the audio.
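The manual workaround for Icelandic could look roughly like this in Python. This is only a sketch of the idea: spacing the placeholders evenly across the clip is my own starting assumption, and the function names, viseme IDs, and millisecond values are invented for the example.

```python
def placeholder_offsets(viseme_ids, audio_duration_ms):
    """Spread visemes evenly over the clip as a first guess at timing."""
    if not viseme_ids:
        return []
    step = audio_duration_ms / len(viseme_ids)
    return [(viseme_id, round(i * step)) for i, viseme_id in enumerate(viseme_ids)]


def apply_adjustments(timeline, adjustments):
    """Replace selected offsets by hand; `adjustments` maps index -> new offset in ms."""
    adjusted = [
        (viseme_id, adjustments.get(i, offset))
        for i, (viseme_id, offset) in enumerate(timeline)
    ]
    # Guard against a manual tweak that puts a viseme before its predecessor.
    offsets = [offset for _, offset in adjusted]
    if offsets != sorted(offsets):
        raise ValueError("adjusted offsets must not go backwards in time")
    return adjusted


# Invented example: four visemes counted by hand for a one-second clip.
timeline = placeholder_offsets([0, 2, 11, 0], audio_duration_ms=1000)
tuned = apply_adjustments(timeline, {1: 180, 2: 420})
```

Keeping the hand-tuned values in a separate `adjustments` mapping means the placeholders can be regenerated if the audio is re-rendered, without losing track of which offsets were touched manually.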
For the Blend Shape Issue: While I can’t use blend shapes with my current bone-based rig, I can still leverage the CC4 character by recreating the visemes for my Rigify setup and mapping them to the viseme ID. This could actually be beneficial in the long run, as I can customize and store these visemes for future use. Even though typical viseme animations aren’t unique per sound—like B and M sharing the same movement—I’m considering adding more detail for a speech learning game. Starting with these basics and enhancing them over time seems like the best path forward.
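As a sketch of that mapping in Python: a lookup from viseme ID to a stored pose, with an optional per-sound override for the extra detail a speech learning game might want. The pose names, the specific IDs, and the `pose_for` helper are all placeholders I made up; the real IDs would need to be checked against Azure's viseme table, and the poses would be whatever gets saved for the Rigify rig.

```python
NEUTRAL_POSE = "mouth_neutral"

# Hypothetical viseme ID -> stored pose name. IDs here are placeholders.
VISEME_TO_POSE = {
    0: NEUTRAL_POSE,
    1: "mouth_open_a",
    21: "mouth_closed_pbm",  # shared pose for sounds like B and M
}

# Optional finer-grained poses keyed by sound, layered on top of the basics.
PHONEME_OVERRIDES = {
    "m": "mouth_closed_m",
}


def pose_for(viseme_id, phoneme=None):
    """Pick a pose: a per-sound override if one exists, else the viseme default."""
    if phoneme is not None and phoneme in PHONEME_OVERRIDES:
        return PHONEME_OVERRIDES[phoneme]
    # Unknown IDs fall back to the neutral pose rather than failing.
    return VISEME_TO_POSE.get(viseme_id, NEUTRAL_POSE)


shared = pose_for(21)
detailed = pose_for(21, phoneme="m")
fallback = pose_for(99)
```

Starting with the shared poses and adding overrides one sound at a time matches the plan of beginning with the basics and enhancing them later.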
The Big Decision: Rigify or CC4?
The biggest question now is whether to continue down this path with my Rigify rig or revisit the CC4 character. Each option has its pros and cons. While the CC4 character offers more detail, it could be too much for toddlers and is far less optimal for mobile platforms. I also won’t have the same flexibility in achieving the exact look I want or quickly generating more characters in the future.
It’s a tough decision, but I’m confident either path will lead to success. For now, I’ll keep generating content while I weigh the options and decide the best way forward.
