OpenAI’s model successfully convinces humans it’s one of them. So what?


OpenAI's latest large language model, GPT-4.5, not only passed the Turing test but outperformed actual humans at being perceived as human, a new study has found.

The famous Turing test, proposed by British mathematician Alan Turing in 1950, assesses whether a machine can exhibit conversational behavior indistinguishable from a human's. It's a popular measure of progress in artificial intelligence (AI) development.

In a recent preprint study, researchers from the University of California, San Diego, ran a variation of the classic Turing test. Participants held simultaneous five-minute conversations with a human and an AI, then attempted to distinguish which was which.


It turns out OpenAI’s GPT-4.5 can definitely hold a conversation. When the model was given a carefully constructed persona – a socially awkward, slang-using young adult – it was mistaken for a human 73% of the time.

This is well above the 50% rate expected by chance, the study notes, adding that the results “constitute the first empirical evidence that any artificial system passes a standard three-party Turing test.”
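To get a sense of why a 73% rate is more than a lucky streak, one can run a simple one-sided binomial test against the 50% chance baseline. A minimal sketch in Python, assuming a hypothetical sample of 100 judgments (the article does not give the study's actual sample size):

```python
from math import comb

def binom_p_value(successes: int, trials: int, p: float = 0.5) -> float:
    """One-sided p-value: probability of seeing at least `successes`
    wins out of `trials` if judges were merely guessing at rate `p`."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

# Hypothetical numbers for illustration only: the 73% rate is from
# the article, but n = 100 is an assumption, not the study's real n.
n = 100
wins = 73
print(f"p-value vs. 50% chance: {binom_p_value(wins, n):.2e}")
```

With these assumed numbers the p-value comes out on the order of one in a million, which is why a 73% rate is treated as clearly above chance rather than noise; the real significance depends on the study's actual sample size.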

However, the Turing test is not, as many mistakenly believe, proof that machines are actually thinking. Passing it doesn’t necessarily mean that artificial general intelligence (AGI) has been achieved.

Image by Cybernews.

Thus, according to experts, the study actually demonstrates something else: the ability of generative AI to produce compelling and, most importantly, convincingly human output in response to a prompt.

“While the Turing test was supposed to measure machine intelligence, it has inadvertently revealed something far more unsettling: our growing vulnerability to emotional mimicry. This wasn’t a failure of AI detection. It was a triumph of artificial empathy,” said John Nosta, the founder of NostaLab, an innovation think tank.


The persona given to the model – a character that hedged, made typos, used casual slang, and emoted with awkward charm – was key: without it, GPT-4.5’s success rate dropped from 73% to 36%.


What’s concerning, though, is that most study participants based their choice on vibes rather than logic. They rarely asked factual questions or tested for reasoning ability, relying instead on emotional tone, slang, and conversational flow.

“This wasn’t a Turing test. It was a social chemistry test – not a measure of intelligence, but of emotional fluency. And the AI aced it,” said Nosta.

According to Cameron Jones, one of the researchers behind the study, the results provide more evidence that LLMs could substitute for people in short interactions without anyone being able to tell.

“This could potentially lead to automation of jobs, improved social engineering attacks, and more general societal disruption,” said Jones.