
An AI agent has aced the eight most influential AI benchmarks, including SWE-bench Pro and Terminal-Bench. Instead of solving a single actual task, it simply hacks the scoring systems. UC Berkeley researchers warn that the current leaderboards the industry relies on might be rigged.
Rather than demonstrating increased capabilities, benchmarkmaxxing AI models can simply hack the official evaluation pipelines to achieve higher scores. And this has already happened in the past.
Researchers at the University of California, Berkeley, demonstrated an AI agent that gamed eight major benchmarks, achieving nearly perfect scores without solving a single task. Their AI agent focused on hacking rather than finding the correct solutions.
“These aren’t theoretical attacks. Our agent builds working exploits for each benchmark, runs them through the official evaluation pipelines, and watches the scores roll in,” the new research warns.
“Teams choosing between models based on SWE-bench resolve rates may be comparing noise.”
Currently, benchmark scores are the main litmus test for investors and companies when making decisions.
The cheating AI agent defeated Terminal-Bench, SWE-bench Verified and Pro, FieldWorkArena, Web Arena, and Car-bench, scoring 100% in all of them. In GAIA, the exploit bot achieved 98%, and in OSWorld, 73%.
These scores would put it on top of the leaderboards. However, zero actual tasks were solved by the AI agent.
Benchmarks are too easily gamed
First, Berkeley researchers tasked an AI agent to audit 13 benchmarks, and it found 45 confirmed exploits and 825 potential vulnerabilities in all of them. In a follow-up research, the AI agent broke eight of the most prominent AI benchmarks.
“Every single one can be exploited to achieve near-perfect scores without solving a single task. No reasoning. No capability. Just exploitation of how the score is computed,” the report reads.
The SWE-bench Verified and Pro versions are among the most influential AI coding benchmarks, featuring real GitHub issues. All the AI agent did to cheat them was inject code via a small configuration file that rewrites every test outcome as “passed”, before the grader ever sees them.
“Teams choosing between models based on SWE-bench resolve rates may be comparing noise,” the researchers said.
Terminal-Bench, which tests how well an AI model performs in real terminal environments, carefully protects test files before verification. However, most of the tasks download a single dependency from the internet using curl utility. The AI bot simply replaced the standard curl as well as other system utilities with fake versions, which intercept and poison the test chain, granting “passed.”
WebArena is another popular benchmark of 812 tasks testing autonomous web-browsing and interaction capabilities. The AI agent achieved 100% simply by navigating the browser to a JSON file containing all the reference answers and stealing the solutions.
Similarly, the AI agent downloaded answers from Hugging Face when solving the OSWorld benchmark.
Meanwhile, FieldWorkArena’s scoring function was found to never actually check the answers, only that the message was sent. The AI agent got a perfect score simply by sending empty messages.
GAIA also posts its answers online, and the scoring system is so loose that even garbled, near-nonsensical responses could match the correct one, but it would penalize correct answers due to the comma-handling bug. It was impossible for an AI agent to score 100% here because the leaderboard has a perfect score blocker – omitting a single question left it with a 98% score.
Car-bench was an interesting one. It’s a car voice assistant benchmark relying heavily on other AI models as judges, where LLM reads the conversations and scores them. The cheater AI agent manipulated them by simply injecting hidden instructions in the answers, so that the assessment would be true.
If cheating is possible and is rewarded, it will occur
The researchers warn that popular benchmarks repeat the same vulnerability patterns that they dubbed “the Seven Deadly Patterns.” AI agents run unisolated in the same environment as the evaluator, answers are shipped with the test, evaluator accepts untrusted inputs, and the scoring logic can be manipulated or was never working to begin with.
“We are not claiming that current leaderboard leaders are cheating. Most legitimate agents do not employ these exploits – yet. But as agents grow more capable, reward hacking behaviors can emerge without explicit instruction,” the paper reads.
The researchers explain that benchmarks shape behavior, and if they’re exploitable, AI is incentivized to cheat.
“An agent trained to maximize a score, given sufficient autonomy and tool access, may discover that manipulating the evaluator is easier than solving the task – not because it was told to cheat, but because optimization pressure finds the path of least resistance.”
They shared some examples of this happening in the past, such as AI models copying solutions from git commit history.
Anthropic also detailed that their most powerful AI, Mythos Preview, independently discovered reward hacks when it couldn’t solve a task directly.
The paper also includes many recommendations to make benchmarks more robust.
“The vulnerabilities we found are not signs of incompetence – they’re signs that adversarial evaluation robustness isn’t yet a standard practice in the field. It needs to become one. Don’t trust the number. Trust the methodology,” the paper concludes.
Unlock more exclusive Cybernews content on YouTube.
Your email address will not be published. Required fields are markedmarked