Alibaba’s trillion-parameter AI model challenges GPT-5: Does it have what it takes?

Alibaba has officially released Qwen3-Max, its largest large language model (LLM) with over 1 trillion parameters. Its “preview” version previously achieved third place on the LMArena leaderboard, and the team claims that the reasoning version, coming soon, will be even better.
Qwen3-Max first arrived in early September as a preview, and this version currently ranks third on the LMArena leaderboard, surpassing GPT-5-Chat, Grok 4, and DeepSeek V3.1 but trailing Claude Opus 4.1, Gemini 2.5 Pro, OpenAI’s o3, and GPT-5-high. This leaderboard ranks chatbots based on human preference votes, which might not accurately reflect true model capabilities.
The Qwen team claims that the official version elevates its model capabilities even further, particularly in coding and agent performance.
“The official release further enhances performance in coding and agent capabilities, achieving state-of-the-art results across a comprehensive suite of benchmarks – including knowledge, reasoning, coding, instruction following, human preference alignment, agent tasks, and multilingual understanding,” Alibaba Cloud’s QwenTeam said in a blog post.
Unlike Qwen’s many smaller open-weight models, Qwen3-Max is proprietary. This means it is not publicly available for anyone to tweak and run on their own hardware, limiting transparency.
The API price is set at $6.40 per million output tokens, which is comparable to other proprietary models. OpenAI and Google charge $10 per million tokens for their best offerings.
Qwen3-Max also appears to be at a disadvantage with its comparatively small maximum context window of 262,144 tokens, while Gemini 2.5 Pro offers up to 1 million tokens.
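To make the price gap concrete, here is a minimal sketch of the arithmetic using the two per-million-token prices quoted above. The 250,000-token workload and the function name are hypothetical, chosen only for illustration, and the calculation ignores input-token charges and any tiered pricing.

```python
def output_cost_usd(output_tokens: int, price_per_million_usd: float) -> float:
    """Cost of generating `output_tokens` at a flat per-million-token price."""
    return output_tokens / 1_000_000 * price_per_million_usd

# Hypothetical workload: 250,000 output tokens.
tokens = 250_000
qwen_cost = output_cost_usd(tokens, 6.40)    # price quoted for Qwen3-Max
rival_cost = output_cost_usd(tokens, 10.00)  # price quoted for OpenAI and Google

print(f"Qwen3-Max: ${qwen_cost:.2f}")
print(f"Rival:     ${rival_cost:.2f}")
print(f"Savings:   {(1 - qwen_cost / rival_cost):.0%}")
```

At these list prices the saving on output tokens is the same percentage regardless of workload size, since both prices scale linearly.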
Alibaba has yet to release the “thinking” version of the model, which usually delivers even better results. The reasoning model is still under active training, and “is already demonstrating remarkable potential.”
Qwen claims high benchmark scores
The Qwen team also showed selected benchmarks in which the released non-reasoning Qwen3-Max-Instruct outperforms DeepSeek V3.1 and Claude Opus 4.
“On SWE-Bench Verified, a benchmark focused on solving real-world coding challenges, Qwen3-Max-Instruct achieves an impressive score of 69.6, placing it firmly among the world’s top-performing models,” the team claims.
“Moreover, on Tau2-Bench – a rigorous evaluation of agent tool-calling proficiency – Qwen3-Max-Instruct delivers a breakthrough score of 74.8, surpassing both Claude Opus 4 and DeepSeek V3.1.”
🚀 Qwen3-Max is here—no preview, just power!
Qwen (@Alibaba_Qwen) September 23, 2025
Qwen Chat:https://t.co/FBpr7zfQY6
Blog: https://t.co/jJJcfi5FJJ
API: https://t.co/olURJV1Enl
We’ve supercharged coding & agentic skills—now Qwen3-Max-Instruct without thinking rivaling top models on SWE-Bench, Tau2-Bench,… pic.twitter.com/ZIL08Akm24
While independent evaluations of the official Qwen3-Max are not yet available, data from artificialanalysis.ai shows that the “Preview” version of Qwen3-Max scored 76.4% on the GPQA Diamond benchmark, which measures how well models answer PhD-level expert questions. This was not among the top 10 results, falling behind Grok 4 (87.7%), GPT-5 high (85.4%), and other top models.
Qwen3-Max-Preview didn’t reach the top 10 on the MMLU-Pro benchmark either, which comprises 12,000 graduate-level questions across 14 subject areas. Nor did it make the top 10 on “Humanity’s Last Exam,” the frontier-level benchmark with 2,500 expert-vetted questions across mathematics, sciences, and humanities.
In these benchmarks, the earlier and smaller Qwen3-235B model scored better than the largest Qwen “Preview” model, according to artificialanalysis.ai data.
The Qwen team also claims that the reasoning model under active development, named Qwen3-Max-Thinking, “augmented with tool usage and scaled test-time compute,” demonstrates “extraordinary performance.” It reportedly achieves perfect scores of 100 on the challenging math reasoning benchmarks AIME 25 and HMMT, but so does GPT-5 Pro.
Qwen3-Max has over 1 trillion parameters and uses a Mixture of Experts architecture, which activates only a subset of those parameters for each token and so lets the model run faster and more efficiently.
Alibaba Cloud has recently released many other AI products. The AI community was surprised by the medium-sized, open-weight, 80-billion-parameter Qwen3-Next-80B model, which competes with Gemini 2.5 Flash. Currently, both models rank in 17th place on LMArena’s Text Arena.
This week, Alibaba also introduced Qwen3-Omni, a 30-billion-parameter model that processes text, images, audio, and video and delivers real-time streaming responses in both text and natural speech.
The company also unveiled the Qwen3-VL series of models for visual perception, Qwen3‑LiveTranslate for real‑time multilingual audio and video interpretation, the Qwen3Guard safety guardrail model, and other products.
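For readers unfamiliar with the idea, the efficiency gain of a Mixture of Experts layer comes from running only a few "expert" sub-networks per token. The sketch below shows generic top-k routing on toy data. Alibaba has not published Qwen3-Max's routing details in the material cited here, so the expert count, the value of k, and every name in the code are assumptions for illustration only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are renormalized so they sum to 1 over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Hypothetical toy setup: four "experts" that simply scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]  # made-up router scores for one token

selected = route_top_k(logits, k=2)
y = moe_layer(1.0, experts, logits, k=2)
print(selected, y)
```

Only two of the four toy experts are evaluated for this token, which is the source of the compute savings: total parameter count can grow without a proportional increase in work per token.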
Reuters reports that Alibaba is doubling down on AI as a core business strategy, prioritizing it alongside its traditional e-commerce operations. Earlier this year, the company announced plans to invest 380 billion yuan ($53.4 billion) in AI-related infrastructure over the next three years.