While AI is still not a fully reliable source of information due to so-called ‘hallucinations,’ research shows that some models are highly accurate and can even outperform plain Google searches.
A group of scientists from Google and the University of Massachusetts Amherst conducted an interesting experiment with large language models (LLMs): they had the models answer questions that test current world knowledge.
LLMs are known to often ‘hallucinate,’ meaning the AI provides plausible but factually incorrect information, which can lead users astray and diminish the reliability of the model’s responses. Hallucinations are especially likely with questions about current events, and one common cause is the outdated data the model was trained on.
The aim of the recently published study was to shed light on the factuality of different LLMs and to propose a way to boost the models’ performance. The scientists tested GPT-3.5, GPT-4, Perplexity AI, and plain Google search for accuracy on a specially created Q&A benchmark.
A wide variety of questions
During the experiment, the LLMs were provided with 600 questions spanning various topics and difficulty levels.
The Q&A included never-changing questions, in which the answer always stays the same. For example, “What breed of dog was Queen Elizabeth II of England famous for keeping?”
Then, models had to answer slow-changing questions, in which the answer typically changes over the course of a few years. For example, “How many car models does Tesla offer?”
Scientists also added more challenging questions which require fast-changing world knowledge. These questions could be something like “What is Brad Pitt's most recent movie as an actor?”
Finally, they added questions with false premises that needed to be debunked, for example, “What did Donald Trump's first Tweet say after he was unbanned from Twitter by Elon Musk?”
More accurate than Google
All models struggled with the questions that had false premises. Likewise, all models, regardless of their size, had difficulty answering questions that depend on current information.
The experiment’s results revealed that, despite AI’s widely discussed ‘hallucinations,’ the LLMs were quite good at providing accurate answers, at least when compared to plain Google searches. The best-performing LLM was Perplexity AI.
However, while AI models have the ability to analyze contextual information, they lack the real-time knowledge that search engines possess, so, for now at least, they’re still limited.
The scientists proposed bridging the gap with a few-shot in-context learning algorithm, which they named FRESHPROMPT. The algorithm incorporates up-to-date information retrieved from a search engine into the prompt, boosting the accuracy of the LLMs’ responses.
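The core idea of retrieval-augmented prompting can be sketched in a few lines. The function below is a hypothetical illustration, not the paper’s actual implementation: the field names (`source`, `date`, `text`), the snippet ordering, and the prompt layout are assumptions made for the sketch.

```python
# Hypothetical sketch of a FRESHPROMPT-style prompt builder.
# The exact format used in the study may differ; the snippet
# fields and ordering here are assumptions for illustration.

def build_fresh_prompt(question, snippets, demonstrations=()):
    """Assemble a prompt that places retrieved search evidence
    before the question, with the freshest snippet last so it
    sits closest to the question the model must answer."""
    parts = list(demonstrations)  # optional few-shot examples
    # Sort evidence oldest-to-newest (ISO dates sort lexically).
    for s in sorted(snippets, key=lambda s: s["date"]):
        parts.append(
            f"source: {s['source']}\n"
            f"date: {s['date']}\n"
            f"snippet: {s['text']}"
        )
    parts.append(f"question: {question}\nanswer:")
    return "\n\n".join(parts)

# Example usage with placeholder search results:
snippets = [
    {"source": "example.com", "date": "2023-01-10", "text": "Older fact."},
    {"source": "example.org", "date": "2023-09-01", "text": "Newer fact."},
]
prompt = build_fresh_prompt("What is the latest X?", snippets)
print(prompt)
```

The assembled prompt would then be sent to the LLM in place of the bare question, giving the model fresh context it could not have learned during training.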