Microsoft's new text-to-speech tool VALL-E can accurately mimic a speaker's tone, emotion, and acoustic environment from just a three-second audio prompt.
Microsoft researchers revealed the new text-to-speech (TTS) model, dubbed VALL-E, last week. The model can accurately simulate a person's voice from a short audio sample.
Unlike conventional TTS models, VALL-E doesn't synthesize waveforms directly. Instead, the researchers trained the AI on discrete codes produced by an off-the-shelf neural audio codec model.
In other words, VALL-E uses the three-second sample to analyze how a person sounds, breaks that information down into discrete components, and uses it to 'guess' how the person might sound beyond the sample audio.
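The core idea, stated loosely, is that audio is first compressed into a short sequence of discrete tokens, and generation then happens on those tokens rather than on raw waveforms. The toy sketch below is not Microsoft's model or its codec; it merely fakes the codec step with uniform quantization to show what "breaking audio into discrete parts" means in practice.

```python
import numpy as np

# Toy illustration only: a neural audio codec of the kind VALL-E builds on
# compresses a waveform into discrete codes. Here we stand in for that codec
# with simple uniform quantization; a real model would then predict new codes
# conditioned on text plus the three-second prompt.

def quantize(wave, levels=256):
    """Map a waveform in [-1, 1] to discrete codec-style tokens."""
    scaled = (wave + 1.0) / 2.0 * (levels - 1)
    return np.clip(scaled.round().astype(int), 0, levels - 1)

def dequantize(tokens, levels=256):
    """Map tokens back to an approximate waveform in [-1, 1]."""
    return tokens / (levels - 1) * 2.0 - 1.0

# A three-second "enrollment" sample at a toy 100 Hz sample rate.
sr = 100
t = np.arange(3 * sr) / sr
prompt_wave = 0.5 * np.sin(2 * np.pi * 5 * t)

# The discrete token sequence is the compact representation the generative
# model would operate on in place of the raw waveform.
prompt_tokens = quantize(prompt_wave)
reconstructed = dequantize(prompt_tokens)
error = np.max(np.abs(reconstructed - prompt_wave))
print(f"{len(prompt_tokens)} tokens, max reconstruction error {error:.4f}")
```

Even this crude quantizer reconstructs the sample almost exactly, which is why operating on discrete codes loses little acoustic detail while making the generation problem look like next-token prediction.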
First, however, the researchers taught the AI the intricacies of human speech using 60,000 hours of English audio from LibriLight, an open-source dataset assembled by Meta that spans thousands of different speakers.
"VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt," researchers claim.
Audio samples the researchers uploaded to a demo website show the AI simulating whole sentences from very brief prompts. Moreover, the researchers had VALL-E 'say' sentences completely different from the input recording.
As several audio samples show, Microsoft's new AI can successfully mimic human emotions and reproduce the acoustic environment of the sample audio.
"Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity," researchers boasted.
The applications of Microsoft's AI could be far-reaching. For example, VALL-E could be paired with ChatGPT, an AI-based chatbot owned by OpenAI, a company Microsoft intends to shower with billions of dollars. Marrying text and audio AI could allow users to create compelling content from scratch.
However, VALL-E could become a major security threat in the wrong hands. Robocalls, or automated spam calls, are already a headache for law enforcement agencies and victims alike. A computer that can convincingly mimic the voice of a loved one or someone we trust would hand crooks a potent tool for malicious activity.