The AGI Progress Report

The race toward artificial general intelligence (AGI) is heating up. Today’s AI is already impressively capable, but AGI aims to go further, with machines that can think, learn, and solve problems across many areas, just like humans. Experts disagree on when AGI will arrive, which makes tests and benchmarks crucial for judging whether AI is truly becoming generally intelligent or is just very clever at narrow tasks.

Meta's Chief AI Scientist, Yann LeCun, is famously sceptical about our chances of reaching AGI, especially if we continue down the current large language model (LLM) path. The makers of the leading LLMs are more bullish, however, with both OpenAI’s Sam Altman and Anthropic’s Dario Amodei arguing that we’ll achieve AGI in a few years.

This is far from a purely philosophical debate, as achieving AGI would move the technology from something that can accurately mimic human beings to a versatile and adaptable problem solver across a host of domains, much as humans are today.

Charting progress

The question then becomes, how do we actually know when we’ve achieved AGI? The answer largely depends on tests that are able to distinguish genuine intelligence from copying and simulation. For a long time, we’ve been judging AI on its ability to perform narrow tasks, such as playing chess or folding proteins.

The Turing Test, proposed by Alan Turing in 1950, also continues to loom large, even though it is now largely inadequate: most LLMs are capable of fooling a conversational partner, at least some of the time.

The ability to pass itself off as human is therefore no longer a reliable indicator of general intelligence, and a new wave of benchmarks has emerged to try to provide better ones. Arguably, the best known is ARC-AGI, which was created in 2019 and has long acted as the “north star” for AI researchers. ARC-AGI tests for fluid intelligence, which is the ability of AI to adapt to new tasks.

OpenAI’s o3 has made progress towards achieving this, with its approach of pairing large language models with reasoning engines showing promise. Despite this, it’s still clumsy and costly, and doesn’t really operate independently of human oversight.

Raising the bar

The ARC Prize raised the bar in January with the introduction of ARC-AGI-2, which is a tougher test designed to expose the gap between how humans and machines reason. The principle behind the benchmark is disarmingly simple.

It selects tasks that humans can generally master in a couple of attempts, but which have traditionally baffled even the best AI systems. The tests measure things like compositional reasoning, symbolic interpretation, and the flexible application of rules in context. In other words, things that machines have traditionally struggled with.


The results so far have been pretty sobering for AI boosters. Whereas humans typically pass all of the tests with flying colors, the best AI models manage a success rate in the single digits. What’s more, contrary to the hype about AI being cost-effective compared to humans, the human successes cost around $17 per task, whereas each AI success cost around $200.

Measuring both the efficiency and the performance gaps is baked into the design of the test. The creators argue that we should look at intelligence not just in terms of accuracy but also in terms of economy of effort. If AI gets to the right answer via brute force rather than elegant and efficient reasoning, that shouldn’t be conflated with intelligence.
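To make the two-axis idea concrete, here is a minimal Python sketch that reports accuracy alongside cost per successful task. The `RunResult` class, the `cost_ratio` helper, and the sample task counts are hypothetical illustrations rather than part of any ARC Prize tooling; only the rough $17 and $200 figures come from the article.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical record of one solver's run over a set of tasks."""
    solved: int            # tasks answered correctly
    attempted: int         # tasks attempted in total
    total_cost_usd: float  # total spend across the whole run

    @property
    def accuracy(self) -> float:
        # Fraction of attempted tasks that were solved.
        return self.solved / self.attempted if self.attempted else 0.0

    @property
    def cost_per_success(self) -> float:
        # Spend divided by solved tasks; infinite if nothing was solved.
        return self.total_cost_usd / self.solved if self.solved else float("inf")


def cost_ratio(a: RunResult, b: RunResult) -> float:
    """How many times more solver `a` pays per success than solver `b`."""
    return a.cost_per_success / b.cost_per_success


# Illustrative numbers only: the task counts are made up, chosen so that
# cost per success matches the roughly $17 and $200 quoted in the article.
human = RunResult(solved=100, attempted=100, total_cost_usd=1700.0)
model = RunResult(solved=5, attempted=100, total_cost_usd=1000.0)

print(f"human: accuracy={human.accuracy:.0%}, cost/success=${human.cost_per_success:.2f}")
print(f"model: accuracy={model.accuracy:.0%}, cost/success=${model.cost_per_success:.2f}")
print(f"model pays about {cost_ratio(model, human):.1f}x more per success")
```

Reporting both numbers side by side is the point: a system that only improves accuracy by spending far more per task would still look weak on the second axis.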

Accelerating progress

The ARC Prize’s 2025 competition was designed to accelerate progress towards AGI. The challenge, which was hosted on Kaggle with a $1m prize fund, offered $700,000 to participants who achieved an 85% success rate within predefined efficiency limits.

More modest prizes were then available for participants with the highest scores and also the most inventive approaches. ARC hopes that by providing a bigger prize fund, albeit one that is tiny compared to the huge sums on offer from the tech giants, it can build on the 1,500 teams that participated last year.
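The top-prize condition described above combines an accuracy threshold with an efficiency cap, which can be sketched as a simple check. The function name, its parameters, and the cost limit used in the example are assumptions for illustration; the article gives the 85% threshold but does not state the actual efficiency limits.

```python
def qualifies_for_top_prize(
    accuracy: float,
    cost_per_task_usd: float,
    cost_limit_usd: float,
    accuracy_threshold: float = 0.85,  # the 85% success rate cited in the article
) -> bool:
    """Return True only if a submission is both accurate and efficient enough.

    `cost_limit_usd` is a hypothetical stand-in for the competition's
    predefined efficiency limits, which are not specified in the article.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    if cost_per_task_usd < 0 or cost_limit_usd < 0:
        raise ValueError("costs must be non-negative")
    return accuracy >= accuracy_threshold and cost_per_task_usd <= cost_limit_usd


# Example with a made-up cost limit of $1 per task:
print(qualifies_for_top_prize(0.90, 0.50, cost_limit_usd=1.0))   # accurate and cheap
print(qualifies_for_top_prize(0.90, 200.0, cost_limit_usd=1.0))  # accurate but too costly
print(qualifies_for_top_prize(0.05, 0.50, cost_limit_usd=1.0))   # cheap but inaccurate
```

Requiring both conditions at once means a brute-force entry cannot qualify simply by spending its way past the accuracy threshold.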


Of course, ARC-AGI-2 isn’t the only game in town.

For instance, the Winograd Schema Challenge takes a different approach and looks at how good AI is at employing common-sense reasoning about language. Similarly esoteric is Steve Wozniak’s Coffee Test, which challenges a robot to go into a strange kitchen and make a cup of coffee.

The Robot College Student Test imagines an AI completing a degree across disciplines. The Employment Test asks whether a machine could perform any job without special accommodation. And the Ethical Reasoning Test gauges whether it can navigate moral dilemmas with something approaching human judgment.

The relatively narrow nature of these tests means that none of them will be enough to certify the arrival of AGI on their own. Taken collectively, however, they begin to draw a detailed picture of whether AI can master the kind of human-like intelligence that defines AGI in areas like reasoning, adaptability, efficiency, breadth of learning, and moral discernment.


Benchmarks, then, are more than mere milestones. They are tools to channel research away from parlor tricks and gimmicks, and towards genuine breakthroughs that can yield meaningful applications across society.

The frontier is defined not by what machines can do that humans cannot, but by what humans find easy and machines find hard. When that gap disappears, and an AI can breeze through ARC-AGI tasks, earn a degree, hold down a job, and act with efficiency and restraint, the claim of AGI will be hard to deny.
