
A Chinese artificial intelligence (AI) company has revealed a homegrown text-to-video model in the vein of OpenAI’s Sora – just newer and potentially with more parameters.
Researchers from Step Fun have revealed their latest artificial intelligence model, Step-Video-T2V, which is described as “a state-of-the-art pre-trained model with 30B parameters.”
A Chinese AI lab just dropped the best ever open-source text-to-video model: Step Video!
Deedy (@deedydas) February 18, 2025
– 30B param, 540p, ~8s at 30fps
– Trained on 1000s of H800s
– Evaluates as well as Meta MovieGen, feels as good as Sora / Veo
Paper and demo is awesome and reveals all the gory details: pic.twitter.com/0Nx4lOzK4k
Thirty billion parameters is a huge amount. OpenAI hasn’t revealed Sora’s size, but estimates range from roughly 33 million to 675 million parameters, with some putting it upwards of 3 billion.
These are only estimates, based on the fact that Sora is built on the Diffusion Transformer (DiT) architecture, whose published variants range from about 33 million to 675 million parameters (the largest being DiT-XL).

Parameters are the internal values that determine how a model turns input data into outputs, in this case, videos.
A model with more parameters can therefore potentially generate more accurate outputs, but this isn’t always the case.
As the demos of Step Fun’s latest model show, its output resembles early iterations of other text-to-video generators.

Step-Video-T2V can generate videos up to 204 frames long, which lets users create clips roughly three to eight seconds in length.
The videos are heavily compressed, with 16×16 spatial compression and 8× temporal compression.
However, Step Fun’s researchers claim that despite the compression, the videos still look high-quality.
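A quick sketch of the arithmetic behind those numbers. The frame count, frame rate, and compression factors come from the announcement above; the 960×544 frame size is an assumption standing in for “540p”:

```python
# Back-of-the-envelope math for Step-Video-T2V's reported specs.
frames, fps = 204, 30             # up to 204 frames, ~30 fps (from the tweet)
height, width = 544, 960          # assumed frame size for "540p"
spatial, temporal = 16, 8         # 16x16 spatial, 8x temporal compression

clip_seconds = frames / fps                                  # length of the longest clip
latent_shape = (frames // temporal, height // spatial, width // spatial)
compression_ratio = spatial * spatial * temporal             # elements saved per latent

print(f"clip length: {clip_seconds:.1f} s at {fps} fps")     # 6.8 s
print(f"latent grid (t, h, w): {latent_shape}")              # (25, 34, 60)
print(f"overall compression: {compression_ratio}x")          # 2048x
```

So under these assumed dimensions, the model works on a latent grid with about 2,000× fewer elements than the raw pixel video, which is why compressing this aggressively matters for a 30B-parameter model.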

While the demos look realistic enough, the motion isn’t quite right: the subject of a video appears to drift in and out of focus during actions like running.
And while the movements are smooth, they’re almost too smooth, giving the clips an artificially generated look that other models largely avoid.
Step Fun also evaluated the model on a new video generation benchmark it calls Step-Video-T2V-Eval.

On that benchmark, the model’s videos reportedly beat those of other open-source and commercial text-to-video models.
Another success for China’s homegrown AI?
While this model may not yet rival other text-to-video generators, it underscores China’s growing relevance in AI.
DeepSeek, another homegrown Chinese AI model, sent shockwaves through the Western world after its researchers said they had spent considerably less money on training it than Western rivals.
Since its release, the reactions to the AI model have been varied. Some have embraced DeepSeek with open arms, while others have rejected the new technology.
DeepSeek has been integrated into products from major companies like Baidu and Tencent, but downloads of the AI model have been suspended in South Korea.