Make a realtime realistic 3D avatar with text-to-speech, Viseme Lip-sync, and emotions/gestures


I have used Haptek in the past, but it is now defunct. To see what I want to do, look at ejTalk Cassandra.

The idea is to send a text string of the form "text-to-say(with ssml):avatar-emotion:avatar-gesture"; I will adapt to any sort of markup. The ejTalk engine manages all the ASR/NLP/dialog/etc. What I want is JUST the talking head.
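
To make the message format concrete, here is a minimal sketch of how the talking-head side might split such a string. The field order and colon delimiter come from the question; the function name and the right-to-left split (chosen because SSML text can itself contain colons, e.g. in `xmlns` URIs) are my own assumptions, not part of ejTalk:

```javascript
// Sketch: parse a "text-to-say(with ssml):emotion:gesture" message from the
// dialog engine. We split from the RIGHT because the SSML payload may itself
// contain colons; only the last two fields are emotion and gesture.
function parseAvatarMessage(msg) {
  const lastColon = msg.lastIndexOf(":");
  const midColon = msg.lastIndexOf(":", lastColon - 1);
  if (lastColon < 0 || midColon < 0) {
    throw new Error("expected text:emotion:gesture");
  }
  return {
    ssml: msg.slice(0, midColon),                // text to speak, possibly SSML
    emotion: msg.slice(midColon + 1, lastColon), // e.g. "happy"
    gesture: msg.slice(lastColon + 1),           // e.g. "nod"
  };
}
```

For example, `parseAvatarMessage('<speak>Hello!</speak>:happy:nod')` yields the SSML, the emotion `happy`, and the gesture `nod` as separate fields.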

It can be browser-based, a C++-linkable library, or a stand-alone server, as long as it runs on Windows 10/11.

I have coded in C++, JavaScript, etc. for decades, so I don't scare easily.

I am looking into the Unreal and Unity engines, but they seem like heavy platforms and may not lend themselves to being driven by text strings from another server.

1 Answer

This is a broad question. Here are some resources and examples:

Services for generating avatars with lip sync animations ("visemes") integrated:

Examples of text-to-speech with 3D model syncing:

Examples without 3D modeling, but showing how to build a chat experience (using voice or text) with ChatGPT, from which you can infer how to integrate with 3D models as in the previous examples:

  • QuiLLMan - a complete chat app that transcribes audio in real time using Whisper, streams back a response from a language model, and synthesizes that response as natural-sounding speech
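
The QuiLLMan flow (speech in, language model, speech out) boils down to a small async loop. The sketch below shows only that control flow; `streamLLM` and the `speak` callback are hypothetical stand-ins for the real streaming LLM and TTS services, and the sentence-boundary flush is one common way to start speaking before the full reply has been generated:

```javascript
// Sketch of a QuiLLMan-style turn with hypothetical stand-ins for the real
// services (Whisper, an LLM, a TTS voice). Only the control flow is real.
async function* streamLLM(prompt) {
  // Stand-in for a streaming language-model API: yields tokens one at a time.
  for (const word of ("Echo: " + prompt).split(" ")) yield word + " ";
}

async function chatTurn(userText, speak) {
  // Transcription (Whisper) would happen upstream; we assume text input here.
  let sentence = "";
  // Stream tokens from the model, flushing to TTS at sentence boundaries so
  // speech synthesis can start before the full reply is generated.
  for await (const token of streamLLM(userText)) {
    sentence += token;
    if (/[.!?]\s*$/.test(sentence)) {
      await speak(sentence.trim()); // hand this chunk to the TTS engine
      sentence = "";
    }
  }
  if (sentence.trim()) await speak(sentence.trim()); // flush any remainder
}
```

With a real TTS backend, `speak` would queue audio for playback (and drive the avatar's visemes); here it can be any async callback.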

If you prefer to go native instead of using web tech, you can probably infer from the above how to load glTF models in your native framework (Unity, Unreal, etc.) and how to hit the APIs in the demos from your native code to achieve the same result.
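
Whichever framework you pick, the lip-sync step usually reduces to mapping a timed viseme stream onto the head model's morph targets (blend shapes) each frame. The sketch below illustrates that mapping; the viseme IDs, blendshape names, and the 80 ms blend window are illustrative assumptions, not any service's actual schema. TTS services that emit visemes typically give you (visemeId, audio offset) pairs you would feed into a timeline like this:

```javascript
// Sketch: turn a timed viseme stream into per-frame morph-target weights for
// a glTF head. The ID-to-blendshape table and 80 ms ramp are assumptions.
const VISEME_TO_SHAPE = { 0: "sil", 1: "aa", 7: "oh", 21: "pp" }; // illustrative

function weightsAt(timeline, tMs, blendMs = 80) {
  // timeline: [{ visemeId, offsetMs }, ...] sorted by offsetMs.
  const weights = {};
  for (const shape of Object.values(VISEME_TO_SHAPE)) weights[shape] = 0;
  for (let i = 0; i < timeline.length; i++) {
    const cur = timeline[i];
    const next = timeline[i + 1];
    // Skip entries that are not active at time tMs.
    if (tMs < cur.offsetMs || (next && tMs >= next.offsetMs)) continue;
    // Ramp the active viseme in over blendMs, then hold at full weight.
    const t = Math.min(1, (tMs - cur.offsetMs) / blendMs);
    weights[VISEME_TO_SHAPE[cur.visemeId] ?? "sil"] = t;
  }
  return weights; // apply to the mesh's morph-target influences each frame
}
```

Called once per render frame with the current audio playback time, this yields the blend weights to copy onto the mesh (in three.js, for example, via `morphTargetInfluences`).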