GenAI can now create wild combos of music, voice, and sound, says Nvidia


An AI model that can translate text and audio prompts into any combination of music, voices, and sounds? Yes! Will you get to try it out anytime soon? Not so much, says Nvidia. Here's more on what this transformational model can do and why the AI chip giant is still debating its fate.

Nvidia boasted about its latest generative AI advancement in a blog post on Monday, along with a snazzy video of a cat wearing headphones and typing at a computer keyboard. Already, I’m intrigued.

The world’s largest AI chip supplier says its latest “Foundational Generative Audio Transformer Opus 1” – or “Fugatto” for short – can “generate or transform any mix of music, voices and sounds,” and was designed with the music industry, Hollywood, and video game makers in mind.

The model can create a simple snippet from a text prompt, remove instruments from an existing song, change the accent or emotion in a voice, and even let people produce sounds they've never heard before. Music industry bigwigs who had a chance to test the novel AI are calling it “incredible” and “inspirational.”

“This thing is wild. The idea that I can create entirely new sounds on the fly in the studio is incredible,” said Ido Zmishlany, the multi-platinum producer, songwriter, and cofounder of the Nvidia-backed GenAI audio start-up One Take Audio.

"We have a new instrument, a new tool for making music — and that’s super exciting,” Zmishlany said.

Emergent properties: a foundational first

Orchestral conductor and composer Rafael Valle – also Nvidia’s applied audio research manager and one of the brains behind the creation of Fugatto – said the globally diverse twelve-person team wanted to devise a large language model (LLM) that “understands and generates sound like humans do.”

And act like a human, it does.

Fugatto is the first foundational model known to show emergent properties – capabilities that arise when its components work together, letting the combined system do things none of its parts could achieve on its own.

In Fugatto’s case, researchers trained the model on a specialized dataset and found it could carry out free-form instructions it had not been pre-trained on, such as “generating a high-quality singing voice” from just a text prompt.

When asked to create electronic music with dogs barking in time to the beat, Fugatto also easily complied. “The first time it generated music from a prompt, it blew our minds,” Valle said of the year-long effort.

Valle compared the discovery to OpenAI’s 2021 “avocado armchair” breakthrough, when the image generator DALL-E was essentially able to “conceptually” fill in the blanks to create an image from a simple text input.

Fugatto can “make a trumpet bark or a saxophone meow,” according to the blog. “Whatever users can describe, the model can create,” Nvidia said.

An avocado armchair for music

The blog offered several examples of how the new model could be used across industries, including using accents to target ads to specific regions worldwide, personalizing an online course with the voice of a family member or friend, or modifying prerecorded audio in a video game on the fly as a player's actions change.

Its breakthrough capabilities include what Nvidia labeled “ComposableART.” This allows the user not only to combine instructions “that were only seen separately” during a model’s pre-training, but also to control how strongly each instruction applies.

Nvidia says that, through a combination of text prompts, Fugatto can not only create “a sad feeling in a French accent,” but can also fine-tune that creation, producing audio with varying degrees of sadness and strength of accent.
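Nvidia has not published Fugatto’s interface, so any concrete code here is necessarily speculative. But the behavior described – blending separately learned instructions with adjustable weights – can be pictured as a weighted average of text-conditioning embeddings. Below is a minimal Python sketch of that idea; embed(), compose_instructions(), and model.generate() are hypothetical stand-ins, not Nvidia’s API.

```python
import numpy as np

# Conceptual sketch only: Fugatto is not publicly available, so embed()
# and the commented-out model.generate() call are hypothetical stand-ins.

def embed(text: str) -> np.ndarray:
    """Stand-in text embedding; a real system would use the model's text encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(512)

def compose_instructions(instructions: list[str], weights: list[float]) -> np.ndarray:
    """Blend several instruction embeddings into one conditioning vector.

    The weights control how strongly each instruction applies, e.g.
    mostly sad, with only a light French accent.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights sum to 1
    embeddings = np.stack([embed(text) for text in instructions])
    return (w[:, None] * embeddings).sum(axis=0)

# Dial in "very sad, slightly accented":
conditioning = compose_instructions(
    ["speak with a sad feeling", "speak with a French accent"],
    weights=[0.8, 0.2],
)
# A real model would then consume the conditioning vector:
# audio = model.generate(conditioning)
```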

[Image: Diagram of Fugatto's synthetic caption generation pipeline for Prompt-to-Voice (P2V). Image by Nvidia.]

"If we think about synthetic audio over the past 50 years, music sounds different now because of computers, because of synthesizers," said Bryan Catanzaro, vice president of applied deep learning research at Nvidia. "I think that generative AI is going to bring new capabilities to music, to video games, and to ordinary folks that want to create things."

Sadly, Nvidia says it has no immediate plans to release its latest model, which was trained on open-source data, to the public.

"Any generative technology always carries some risks, because people might use that to generate things that we would prefer they don't," Catanzaro said. "We need to be careful about that, which is why we don't have immediate plans to release this."

Similarly, OpenAI and Meta have not said when they plan to release models that generate audio or video to the public, Reuters reports.

The full version uses 2.5 billion parameters and was trained on a bank of NVIDIA DGX systems packing 32 NVIDIA H100 Tensor Core GPUs.