Humanity's Last Exam set to push current AI models to new heights


The Center for AI Safety will pay experienced tech professionals who can devise the world's most challenging questions to test today's AI models. It hopes to usher in a new era where AI becomes smarter than the smartest human.

The new AI initiative – Humanity’s Last Exam – is designed to push the boundaries of AI system benchmarks in a way that has not been achieved so far.

It is the most ambitious AI benchmark to date, according to Dan Hendrycks, director of the Center for AI Safety (CAIS), and the project's launch partner, Scale AI.


The 'last exam' will consist of questions chosen from online submissions by experts worldwide across all fields, with the goal of building the world's most difficult public AI benchmark.

Technology experts have until November 1st to submit what they believe to be the most difficult questions ever posed to an AI model.

“We are collecting the hardest and broadest set of questions ever to evaluate how close we are to achieving expert-level AI across diverse domains,” Hendrycks posted on X Monday.

“If you have 5+ years in a technical field or hold/are pursuing a PhD, submit your questions by 11/1 to share in $500k in prizes and co-authorship,” he said.

Individuals whose questions are chosen for the exam will be invited as co-authors on the paper corresponding with the new advanced dataset, and have a chance to win money from a $500,000 prize pool, the website states.

Among the prizes, the top 50 questions will earn $5,000 each and the next 500 questions will earn $500 each; additional prizes may be awarded for the quality or novelty of a question.
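A quick sanity check shows how the two stated award tiers account for the full $500,000 pool (assuming the quality and novelty prizes, whose amounts are not specified, draw from the same pool):

```python
# Check that the announced award tiers sum to the stated $500,000 prize pool.
top_tier = 50 * 5_000    # top 50 questions at $5,000 each  -> $250,000
next_tier = 500 * 500    # next 500 questions at $500 each  -> $250,000
total = top_tier + next_tier

print(f"Top tier: ${top_tier:,}")
print(f"Next tier: ${next_tier:,}")
print(f"Total: ${total:,}")  # $500,000
```

The two tiers alone exhaust the advertised pool, which suggests any additional quality or novelty awards would overlap with, rather than add to, the $500,000 figure.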


Scale AI is a San Francisco-based software company that provides labeled data used to train AI applications. The company says its data powers nearly every major foundation model in existence today.

The company says initiatives like these are necessary to push AI benchmarks, as existing tests have already become too easy for AI models.

As of September, OpenAI's newest release, OpenAI o1 (codenamed Strawberry), the first AI model designed to reason before providing its answers, was performing close to the ceiling on all of the most popular benchmarks, Scale explained.

Scale said it is important to retain the ability to “distinguish between AI systems, which can now ace undergrad exams, and those [systems] which can genuinely contribute to frontier research and problem solving.”

Submission requirements

Questions will be accepted from all fields, including mathematics, rocket engineering, and analytic philosophy, to name a few.

“Simply think of a hard question, and see if AIs get it right. If it's hard for the AIs it is likely good to submit,” according to the explanation.

According to the guidelines, questions must be original (meaning they cannot be copy-pasted from other materials), challenging, objective, and self-contained.

This means the answers to a specific question must be widely accepted by other experts with relevant expertise.

Finally, the questions must not contain controversial or dangerous subject matter. For example, queries related to "chemical, biological, radiological, or nuclear weapons, or cyberweapons used for attacking critical infrastructure" will not be accepted.


More information on how to submit questions for Humanity's Last Exam, along with the full guidelines, is available on the project's website.

Through its AI safety labs, Scale regularly researches evaluation methods for frontier models, helping the AI community gain deeper insights into leading models, the company said.
