
AI leaderboards are becoming increasingly popular as a means of ranking the latest AI models. They're not without their critics, however: IBM researchers recently complained that many benchmarking tools fail to accurately reflect what the models are capable of, partly because models are tested in isolation rather than “in the wild”.
“Benchmarks today don’t reflect how people actually use AI,” they explain. “They’re misaligned, often based on factoid-style tasks that resemble trivia—not real-life queries. And worse, we don’t even know if models are truly solving them, because they may have seen the answers during training.”
Smarter ranking
A recent study from the University of Michigan explores potential solutions to these issues and proposes ways for us to get better at assessing the latest models.
The researchers analyzed four of the main methods that the most popular online leaderboards use to rank AI models. They found that the choice of method largely determined the outcome: different methods produced markedly different rankings, even when applied to the same data on model performance.
“Large companies keep announcing newer and larger gen AI models, but how do you know which model is truly the best if your evaluation methods aren’t accurate or well studied?” the researchers explain.
“Society is increasingly interested in adopting this technology. To do that effectively, we need robust methods to evaluate AI for a variety of use cases. Our study identifies what makes an effective AI ranking system, and provides guidelines on when and how to use them.”
A moving target
The researchers acknowledge that accurately evaluating large language models (LLMs) is incredibly difficult, not only because the models themselves are always changing, but because the content produced by each model is often inherently subjective.
While some leaderboards take a more objective approach by evaluating models on their ability to perform specific tasks, this approach is often less effective at assessing more creative content, where a single right answer usually doesn't exist.
To assess open-ended content, other leaderboards take a human-driven approach and ask people to rate responses head-to-head. Volunteers submit a prompt to two randomly chosen chatbots without knowing which is which, then select the better answer, which is recorded in the database that underpins the ranking.
Even then, there is considerable variation depending on how these systems are implemented. For instance, the researchers explain that Chatbot Arena used a system known as Elo, which was originally developed to rank chess players.
As a result, the system includes settings that determine how much a win or loss changes a player's rating. These should make the system more flexible, but choosing them well requires a robust understanding of how best to evaluate chatbots, which we don't always have.
“In chess and sport matches, there’s a logical order of games that proceed as the players’ skills change over their careers,” the researchers explain. “But AI models don’t change between releases, and they can instantly and simultaneously play many games.”
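To make the point about game order concrete, here is a minimal sketch of the standard Elo update in Python. The K-factor of 32, the starting rating of 1000, the function names, and the three example games are all illustrative assumptions, not Chatbot Arena's actual configuration or data.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return new ratings after one game; score_a is 1.0 if A won, 0.0 if A lost."""
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


def run_elo(games, k=32.0, start=1000.0):
    """Process games one at a time, in the order given."""
    ratings = {}
    for a, b, score_a in games:
        ra = ratings.get(a, start)
        rb = ratings.get(b, start)
        ratings[a], ratings[b] = elo_update(ra, rb, score_a, k)
    return ratings


# Hypothetical results: A beats B twice, then B beats A once.
games = [("A", "B", 1.0), ("A", "B", 1.0), ("B", "A", 1.0)]

forward = run_elo(games)
backward = run_elo(list(reversed(games)))  # same results, opposite order
smaller_k = run_elo(games, k=16.0)         # same order, different setting

print(forward)
print(backward)
print(smaller_k)
```

Because each update depends on the ratings at that moment, the same three results give different final ratings when replayed in a different order, and changing the K-factor shifts them again. For chatbots, whose head-to-head votes have no natural sequence, both choices are somewhat arbitrary.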
Put to the test
To put the various ranking methods to the test, the team evaluated them using real-world data from two crowdsourced datasets, the first from Chatbot Arena and the second collected independently by the researchers.
They then compared the rankings that emerged from this data with actual win rates from a separate dataset. They also checked how sensitive each of the systems was to settings defined by users, while tracking whether the rankings followed a natural order. For instance, if Model A beats Model B, and Model B beats Model C, then Model A should logically outrank Model C.
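The natural-order check described above can be sketched in Python as follows. This is only an illustration of the idea, not the researchers' code: the win counts, the majority-vote definition of "beats", and the function names are assumptions made for the example.

```python
from itertools import permutations


def beats(wins, a, b):
    """True if a won more head-to-head comparisons against b than it lost."""
    return wins.get((a, b), 0) > wins.get((b, a), 0)


def transitivity_violations(ranking, wins):
    """List triples where a beats b and b beats c, yet a is ranked below c.

    `ranking` is ordered best-first; `wins[(x, y)]` counts x's wins over y.
    """
    position = {model: i for i, model in enumerate(ranking)}
    violations = []
    for a, b, c in permutations(ranking, 3):
        if beats(wins, a, b) and beats(wins, b, c) and position[a] > position[c]:
            violations.append((a, b, c))
    return violations


# Hypothetical head-to-head counts between three models.
wins = {
    ("A", "B"): 7, ("B", "A"): 3,
    ("B", "C"): 6, ("C", "B"): 4,
    ("A", "C"): 5, ("C", "A"): 5,
}

print(transitivity_violations(["A", "B", "C"], wins))  # consistent ranking
print(transitivity_violations(["C", "B", "A"], wins))  # ranking that breaks the order
```

A ranking method that frequently returns orderings with violations of this kind is contradicting the head-to-head evidence it was built from.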
Glicko, a rating system that has been used extensively in online gaming, was found to consistently produce the most reliable results, particularly when the number of head-to-head comparisons between models was uneven.
Other systems, such as the Bradley-Terry model that Chatbot Arena adopted in 2023, also performed well, but only when each model had been tested a similar number of times. Without that balance, the researchers warn, newer models can appear to outperform rivals simply because they’ve faced less rigorous scrutiny.
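For readers unfamiliar with Bradley-Terry, here is a minimal Python sketch of how it can be fitted to pairwise win counts. It uses a standard iterative scheme for the maximum-likelihood strengths, which is not necessarily how Chatbot Arena or the researchers fit the model. The win counts, the function name, and the fixed iteration count are assumptions for illustration.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate a positive strength per model from pairwise win counts.

    Under Bradley-Terry, P(i beats j) = s_i / (s_i + s_j). Each pass sets
    s_i = (total wins of i) / sum_j(n_ij / (s_i + s_j)), where n_ij is the
    number of comparisons between i and j, then rescales the strengths.
    Assumes at least one recorded win overall.
    """
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            total_wins = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0))
                / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = total_wins / denom if denom > 0 else strength[i]
        norm = sum(updated.values())
        # Rescale so strengths average 1; only their ratios carry meaning.
        strength = {m: s * len(models) / norm for m, s in updated.items()}
    return strength


# Hypothetical, balanced data: every pair of models was compared ten times.
wins = {
    ("A", "B"): 7, ("B", "A"): 3,
    ("B", "C"): 6, ("C", "B"): 4,
    ("A", "C"): 6, ("C", "A"): 4,
}

strengths = fit_bradley_terry(wins)
ranking = sorted(strengths, key=strengths.get, reverse=True)
prob_a_beats_b = strengths["A"] / (strengths["A"] + strengths["B"])

print(ranking)
print(round(prob_a_beats_b, 3))
```

The fit works from aggregate win counts alone, so the order in which the votes arrived plays no role, and there is no rating-adjustment setting comparable to Elo's K-factor to tune.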
“Just because a model comes onto the scene and beats a grandmaster doesn’t necessarily mean it’s the best model,” the researchers explain. “You need many, many games to know what the truth is.”
The rankings produced by Elo and by Markov chains, the technique that underpins Google's PageRank algorithm, depended heavily on how they were configured. The Bradley-Terry system lacks that kind of user-defined setup, so the researchers believe it might be the best option, especially for larger datasets.
“There’s no single right answer, so hopefully our analysis will help guide how we evaluate the AI industry moving forward,” the researchers conclude.