Microsoft continues to develop new applications for Artificial Intelligence (AI): it now has a voice simulator that can recreate anyone's tone of voice from just three seconds of audio.
It is called VALL-E, and it is a language model for text-to-speech (TTS) synthesis. Microsoft says that a three-second audio recording is all the system needs to imitate the speaker's voice.
One of the most interesting points the company shares in its statement is that VALL-E is being developed to work with other generative AI models such as GPT-3, the language model behind chat tools that allow natural conversation with an AI.
In other words, ChatGPT would be able to offer spoken responses once this model has been integrated.
The examples Microsoft shows are striking. For each one, it presents the audio input used as the base, the intermediate steps, and VALL-E's final result.
The model imitates not only the voice but also the speaker's original cadence and the pitch at which the voice sample was recorded.