© 2023 CyberNews - Latest tech news,
product reviews, and analyses.


Microsoft’s AI can mimic your voice with seconds of training


Microsoft's new text-to-speech tool VALL-E can accurately mimic a speaker's tone, emotion, and acoustic environment from a prompt just three seconds long.

Microsoft researchers revealed the new text-to-speech (TTS) model, dubbed VALL-E, last week. The model can accurately simulate a person's voice from a short audio sample.

Unlike other TTS models, VALL-E doesn't manipulate waveforms to mimic speech. Instead, researchers used an off-the-shelf neural audio codec model to turn speech into discrete codes and trained the AI to predict those codes.

In other words, VALL-E uses the three-second sample to analyze how a person sounds, breaks down the information into individual parts, and uses that information to 'guess' how a person might sound outside the sample audio.
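To make the idea concrete, here is a deliberately tiny sketch of "turn audio into discrete codes, then guess what comes next." It is not Microsoft's method: the uniform quantizer stands in for a neural audio codec, the bigram counter stands in for a large language model, and every name in it (quantize, fit_bigram, continue_codes) is made up for illustration.

```python
import math
from collections import Counter, defaultdict


def quantize(samples, levels=8):
    """Toy stand-in for a neural codec: map samples in [-1, 1] to integer codes."""
    codes = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip to the valid range
        codes.append(min(levels - 1, int((s + 1.0) / 2.0 * levels)))
    return codes


def fit_bigram(codes):
    """Toy stand-in for training: count which code tends to follow which."""
    table = defaultdict(Counter)
    for prev, nxt in zip(codes, codes[1:]):
        table[prev][nxt] += 1
    return table


def continue_codes(prompt_codes, table, n_new):
    """Extend the prompt by repeatedly picking the most frequent next code."""
    out = list(prompt_codes)
    for _ in range(n_new):
        followers = table.get(out[-1])
        if followers:
            out.append(followers.most_common(1)[0][0])
        else:
            out.append(out[-1])  # no data for this code: just repeat it
    return out


# Hypothetical "training corpus" and short "prompt": plain sine waves.
corpus = [math.sin(2 * math.pi * t / 16) for t in range(256)]
prompt = corpus[:48]

table = fit_bigram(quantize(corpus))
prompt_codes = quantize(prompt)
generated = continue_codes(prompt_codes, table, n_new=16)
print(generated[-16:])
```

The real system works at a vastly larger scale and also conditions on the text to be spoken, which this sketch leaves out entirely.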

First, however, researchers taught the AI the intricacies of human language using 60,000 hours of English speech from LibriLight, an open-source dataset gathered by Meta. The recordings come from thousands of different speakers.

"VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt," researchers claim.

Audio samples that researchers uploaded to a demo website show how the AI can simulate whole sentences from very brief recordings. Moreover, researchers had VALL-E 'say' sentences completely different from the input.

As several audio samples show, Microsoft's new AI can successfully mimic human emotions and imitate the acoustic details from the sample audio.

"Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity," researchers boasted.

The applications of Microsoft's AI could be far-reaching. For example, VALL-E could be paired with ChatGPT, an AI-based chatbot owned by OpenAI, a company Microsoft intends to shower with billions of dollars. Marrying text and audio AI could allow users to create compelling content from scratch.

However, VALL-E could become a major security threat in the wrong hands. Robocalls, or automated spam calls, are already a headache for law enforcement agencies and victims alike. A computer that can convincingly mimic the voice of a loved one or somebody we trust would hand crooks a potent tool for malicious activity.

