Microsoft recently unveiled its latest text-to-speech tool with vision capabilities, which lets users turn text input into videos of a talking avatar. The new feature can also help developers build interactive bots trained on images of humans.
With the latest text-to-speech avatar technology, users can generate synthetic videos of a 2D photorealistic avatar speaking. The underlying deep neural networks are trained on samples of human video recordings, and the avatar's voice is produced by a neural text-to-speech voice model.
Beyond enabling more engaging digital interactions, the text-to-speech avatar lets users build chatbots, virtual assistants, conversational agents, and more.
The feature is also designed to prevent the spread of harmful deepfakes and deceptive content, safeguard individual and societal rights, and promote open communication between humans and computers.
Why did Microsoft create a talking avatar?
Of its text-to-speech avatar, Microsoft states:
Producing traditional video content typically requires a large investment of time and money: arranging a filming location, shooting, editing, and so on. The Microsoft avatar helps users produce videos more efficiently and reduces reliance on these conventional methods. From text input alone, users can create product introductions, customer testimonials, training videos, and more.
Combined with neural text-to-speech and Azure OpenAI Service, interactive conversations are far more natural than before. The avatar enables more engaging digital interactions and can also be used to build chatbots, virtual assistants, conversational agents, and more.
The official website describes content creation as a three-step pipeline: a text analyzer, a TTS audio synthesizer, and a TTS avatar video synthesizer.
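To make the three-stage pipeline concrete, here is an illustrative sketch in Python. None of these functions are real Azure SDK calls; each is a placeholder standing in for a service component that runs on the Azure side.

```python
# Illustrative pipeline sketch -- placeholder logic, not the Azure implementation.

def text_analyzer(script: str) -> list[str]:
    """Stand-in for the text analyzer: break the script into sentences."""
    return [s.strip() for s in script.split(".") if s.strip()]

def tts_audio_synthesizer(sentences: list[str]) -> bytes:
    """Stand-in for neural TTS: would return synthesized speech audio."""
    return " ".join(sentences).encode("utf-8")

def tts_avatar_video_synthesizer(audio: bytes) -> bytes:
    """Stand-in for the video stage: would render lip-synced avatar frames."""
    return b"VIDEO:" + audio

# The stages compose in order: text -> audio -> avatar video.
video = tts_avatar_video_synthesizer(
    tts_audio_synthesizer(text_analyzer("Hello. Welcome to the demo."))
)
```

The key point is the ordering: the analyzer prepares the script, the audio stage produces speech, and the video stage renders avatar frames synchronized to that speech.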
Currently, the company offers the text-to-speech avatar in two forms: one prebuilt, the other custom-built.
On its website, the company states that “Microsoft provides its subscribers with prebuilt text-to-speech avatars on Azure as out-of-the-box products.” Depending on the text entered, these avatars can speak in various languages and voices. Customers can choose an avatar from a range of options and use it to build interactive applications or video content that responds to them in real time.
Creating video content with a text-to-speech avatar
Start with a talking script for your avatar, written either in plain text or in Speech Synthesis Markup Language (SSML). SSML lets you customize the avatar's voice, control the pronunciation and expression of words such as brand names, and insert gestures like waving or pointing at objects.
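A minimal sketch of building such an SSML script in Python is shown below. The voice name `en-US-JennyNeural` is a real Azure neural voice, but the `gesture.wave` bookmark id is a hypothetical example; check the avatar documentation for the gesture names your chosen avatar actually supports.

```python
def build_ssml(text: str, voice: str = "en-US-JennyNeural") -> str:
    """Wrap plain text in a minimal SSML document for a given voice."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
        f'<voice name="{voice}">'
        "<bookmark mark='gesture.wave'/>"  # hypothetical gesture marker
        f"{text}"
        "</voice></speak>"
    )

ssml = build_ssml("Hello, and welcome to our product introduction.")
```

Building the document as a string keeps the sketch dependency-free; for anything beyond a toy script, an XML library avoids escaping bugs.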
When your talking script is ready, you can synthesize the video using the Azure TTS 3.1 API. In addition to the SSML input, you can customize the avatar character, its style, and the format of the output video.
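The sketch below shows what submitting such a synthesis job might look like. The endpoint path, API version, and payload field names (including the `lisa` character and `graceful-sitting` style) are assumptions based on the preview batch-synthesis API and should be verified against the current Azure Speech documentation; the actual network call is left guarded so the snippet is safe to run without credentials.

```python
import json
import urllib.request

# Assumed region and preview endpoint -- verify against the Azure docs.
REGION = "westus2"
ENDPOINT = (
    f"https://{REGION}.customvoice.api.speech.microsoft.com"
    "/api/texttospeech/3.1-preview1/batchsynthesis/talkingavatar"
)

def build_job(ssml: str) -> dict:
    """Assemble an avatar batch-synthesis request body (field names assumed)."""
    return {
        "displayName": "demo avatar video",
        "textType": "SSML",
        "inputs": [{"text": ssml}],
        "properties": {
            "talkingAvatarCharacter": "lisa",          # assumed prebuilt avatar id
            "talkingAvatarStyle": "graceful-sitting",  # assumed style name
            "videoFormat": "mp4",
        },
    }

job = build_job("<speak>...</speak>")

if __name__ == "__main__":
    # Requires a real subscription key; placeholder left on purpose.
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(job).encode(),
        headers={
            "Ocp-Apim-Subscription-Key": "<your-key>",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    # urllib.request.urlopen(req)  # uncomment with valid credentials
```

A batch API like this returns a job id rather than the video itself, so a real client would poll the job status and download the result when synthesis finishes.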
To create the finished video, you can optionally add content images, text-based videos, animations, graphics, and so on.
Finally, combine all of your assets (the avatar video, content, and optional background music) into a rich video experience.