Voice simulation from a 3-second sample with Microsoft's new AI, VALL-E

Last Thursday, Microsoft researchers introduced VALL-E, a new text-to-speech AI model that can simulate a person's voice from just a three-second audio sample. Once the model has learned a particular voice, it can speak new text in that voice while preserving the speaker's tone.

Its creators suggest that VALL-E could be used for high-quality text-to-speech and speech editing applications. Microsoft describes VALL-E as a neural codec language model and says it builds on EnCodec, an audio codec technology that Meta introduced in October 2022.

Unlike other text-to-speech methods, which usually synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and audio prompts, according to Microsoft. In essence, it analyzes how a person sounds, uses EnCodec to break that voice into discrete components, and then draws on its training data to predict how the same voice would sound speaking other words and sentences.
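The idea of representing audio as discrete codes rather than as a raw waveform can be illustrated with a toy sketch. This is not EnCodec or VALL-E: EnCodec is a learned neural codec, whereas the example below uses a tiny hand-written codebook and simple nearest-neighbor vector quantization. The names `CODEBOOK`, `encode`, and `decode` are hypothetical and chosen only for illustration.

```python
# Toy vector quantization: map short waveform frames to discrete token IDs.
# Assumption: a fixed, hand-written codebook stands in for a learned codec.

def make_frames(samples, frame_size):
    """Split samples into consecutive fixed-size frames (any remainder is dropped)."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, frame_size)]

def nearest_code(frame, codebook):
    """Return the index of the codebook vector closest to the frame."""
    def squared_distance(vec):
        return sum((a - b) ** 2 for a, b in zip(frame, vec))
    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))

def encode(samples, codebook):
    """Turn a waveform into a list of discrete tokens (codebook indices)."""
    frame_size = len(codebook[0])
    return [nearest_code(frame, codebook)
            for frame in make_frames(samples, frame_size)]

def decode(tokens, codebook):
    """Rebuild an approximate waveform by concatenating the codebook vectors."""
    samples = []
    for token in tokens:
        samples.extend(codebook[token])
    return samples

# Hypothetical codebook: four entries, each describing a 2-sample frame.
CODEBOOK = [[0.0, 0.0], [0.5, 0.5], [-0.5, -0.5], [1.0, -1.0]]

waveform = [0.1, -0.1, 0.4, 0.6, -0.6, -0.4, 0.9, -0.8]
tokens = encode(waveform, CODEBOOK)
reconstructed = decode(tokens, CODEBOOK)
print(tokens)
print(reconstructed)
```

Once audio is expressed as a token sequence like this, a model can in principle treat speech generation as predicting the next token, much as a text language model does. How VALL-E actually does this is described only at a high level in Microsoft's announcement.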

Microsoft VALL-E

Microsoft trained VALL-E's speech synthesis capabilities on LibriLight, an audio dataset assembled by Meta. It contains 60,000 hours of English speech from more than 7,000 speakers, drawn primarily from the LibriVox audio library.

Microsoft has also shared practical examples of the model's output on the VALL-E website. Although the technology is useful and offers practical features, it could also be used to forge voices for illegal purposes, especially on social networks. Aware of this risk, Microsoft has not made VALL-E available for direct, independent testing.

