Chinese competitor Vidu rivals OpenAI’s Sora


Vidu, China’s answer to OpenAI’s Sora, is creating videos that rival the ChatGPT maker’s technology.

The new Chinese text-to-video AI model can create 16-second videos at 1080p at the touch of a button.

The technology debuted at the Future Artificial Intelligence Pioneer Forum, held as part of the Zhongguancun Forum, and was built by Shengshu Technology, a Chinese start-up, together with Tsinghua University.

“This model uses the team’s original architecture U-ViT, which integrates Diffusion and Transformer and supports one-click generation of high-definition video content,” the Shengshu website reads.

“Vidu’s rapid breakthrough comes from the team’s long-term accumulation and multiple original achievements in Bayesian machine learning and multi-modal large models.”

According to Shengshu, the technology is the first in the world to combine Diffusion and Transformer models, and it is meant to rival OpenAI’s Sora, another text-to-video model.
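For readers curious what “integrates Diffusion and Transformer” means in practice, below is a minimal, hypothetical sketch of the idea behind a U-ViT-style model: a Vision Transformer, with long skip connections between shallow and deep blocks, serving as the noise predictor inside a diffusion loop. Shengshu has not released Vidu’s code; every name, dimension, and the toy noise schedule here is an illustrative assumption, not the company’s implementation.

```python
# Hypothetical sketch of a U-ViT-style diffusion denoiser (PyTorch).
# Not Shengshu's code: all names, sizes, and the noise schedule are
# illustrative assumptions based on the published U-ViT design.
import torch
import torch.nn as nn

class TinyUViT(nn.Module):
    """A Vision Transformer used as a diffusion noise predictor.

    Long skip connections link shallow and deep blocks (the 'U' in
    U-ViT), and the diffusion timestep enters as an extra token.
    """
    def __init__(self, img=32, patch=4, dim=256, depth=6, heads=4):
        super().__init__()
        self.patch = patch
        n_tokens = (img // patch) ** 2 + 1              # patches + timestep token
        self.embed = nn.Linear(patch * patch * 3, dim)  # patch pixels -> token
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        self.t_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        block = lambda: nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                                   batch_first=True, norm_first=True)
        self.down = nn.ModuleList([block() for _ in range(depth // 2)])
        self.mid = block()
        self.up = nn.ModuleList([block() for _ in range(depth // 2)])
        self.fuse = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(depth // 2)])
        self.out = nn.Linear(dim, patch * patch * 3)

    def forward(self, x, t):
        B, C, H, W = x.shape
        p, g = self.patch, H // self.patch
        # patchify: (B, C, H, W) -> (B, num_patches, C*p*p)
        x = x.unfold(2, p, p).unfold(3, p, p)                    # B,C,g,g,p,p
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, g * g, C * p * p)
        t_tok = self.t_embed(t.float().view(B, 1)).unsqueeze(1)  # timestep token
        h = torch.cat([t_tok, self.embed(x)], dim=1) + self.pos
        skips = []
        for blk in self.down:                                    # "encoder" half
            h = blk(h)
            skips.append(h)
        h = self.mid(h)
        for blk, fuse in zip(self.up, self.fuse):                # "decoder" half
            h = fuse(torch.cat([h, skips.pop()], dim=-1))        # long skip link
            h = blk(h)
        eps = self.out(h[:, 1:])                                 # drop timestep token
        # unpatchify back to image layout
        eps = eps.view(B, g, g, C, p, p).permute(0, 3, 1, 4, 2, 5)
        return eps.reshape(B, C, H, W)

# One denoising training step: corrupt clean frames with noise at a random
# timestep, then train the transformer to predict that noise.
model = TinyUViT()
x0 = torch.randn(2, 3, 32, 32)                    # stand-in for clean frames
t = torch.randint(0, 1000, (2,))
noise = torch.randn_like(x0)
alpha = (1 - t.float() / 1000).view(-1, 1, 1, 1)  # toy linear noise schedule
xt = alpha.sqrt() * x0 + (1 - alpha).sqrt() * noise
loss = nn.functional.mse_loss(model(xt, t), noise)
```

The design choice the quote points at is simply that the convolutional U-Net usually found at the core of an image diffusion model is replaced by a transformer, which scales more gracefully to long token sequences such as video frames.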

The creators of Vidu have crafted videos similar to those of the viral text-to-video generation tool Sora, including shots of fantastical scenes such as a panda playing a guitar and a beautiful room that opens onto the ocean.

Although Vidu’s videos aren’t as long as Sora's, they demonstrate that organizations in China can create an array of AI technologies and are close to matching the US in the global AI race.

“Vidu is the world's first large-scale video model to achieve major breakthroughs since the release of Sora,” the start-up states.

However, when you look at Vidu and Sora side by side, the two aren’t as close as they may appear.

Vidu can only generate 16-second videos, whereas Sora can generate videos of up to 60 seconds, supposedly while maintaining visual quality and adhering to users’ prompts.

Shengshu claims that Vidu is not only able to simulate the real physical world but also has “rich imagination, multi-lens generation, and high spatio-temporal consistency.”

Shengshu describes Vidu as a “universal visual model” that “can support the generation of more diverse and longer video content.”

In the future, the software’s “flexible architecture will be compatible with a wider range of modalities and further expand the boundaries of multi-modal universal capabilities,” the company’s statement concludes.