Advances in the generation of speech deepfakes by machine learning models are creating new security threats, and humans’ ability to spot the fakes is currently limited, a new study finds.
Much has already been said about deepfakes – synthetic media made to resemble a real person’s voice or appearance – as one of the biggest security threats arising from progress in artificial intelligence (AI), given their potential for misuse.
However, studies investigating human detection capabilities have so far been limited. That’s why researchers at University College London decided to present genuine and deepfake audio to 529 individuals and ask them to identify the deepfakes.
What’s more, they ran their experiments in both English and Mandarin to understand whether language affects detection performance and decision-making rationale.
Researchers used a text-to-speech (TTS) algorithm trained on two publicly available datasets to generate 50 deepfake speech samples in each language. These samples differed from the ones used to train the algorithm, ruling out the possibility of the model simply reproducing its original input.
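The report does not detail the generation pipeline, but as a rough illustration, a minimal sketch of batch-generating synthetic speech with an off-the-shelf open-source TTS model might look like the following. The Coqui TTS library, the LJSpeech-trained VITS model, and the example sentences are assumptions chosen for illustration, not details taken from the study:

```python
# Minimal sketch: batch-generating synthetic speech with a pretrained
# open-source TTS model. Illustrative only; not the study's pipeline.
from TTS.api import TTS

# Load a VITS model pretrained on the public LJSpeech dataset
# (model name as listed in the Coqui TTS model zoo).
tts = TTS("tts_models/en/ljspeech/vits")

# Illustrative prompts; the study used sentences held out from the
# model's training data to avoid simple regurgitation.
sentences = [
    "The weather forecast predicts rain later this afternoon.",
    "Please confirm the meeting time before noon tomorrow.",
]

for i, text in enumerate(sentences):
    # Synthesize each sentence and write it to a WAV file.
    tts.tts_to_file(text=text, file_path=f"deepfake_{i:03d}.wav")
```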
“We found that detection capability is unreliable. Listeners only correctly spotted the deepfakes 73% of the time, and there was no difference in detectability between the two languages,” the researchers said in the report.
Moreover, increasing listener awareness by providing examples of speech deepfakes improved the results only slightly.
"Our findings confirm that humans are unable to reliably detect deepfake speech, whether or not they have received training to help them spot artificial content,” said Kimberly Mai, first author of the study.
The researchers concluded that detecting speech deepfakes will only become more challenging as speech synthesis algorithms, which learn the patterns and characteristics of a real person’s voice from recorded audio, improve and produce ever more realistic output.
For example, while early deepfake speech algorithms may have required thousands of samples of a person’s voice to generate original audio, the latest pretrained algorithms can recreate a person’s voice from just a three-second clip of them speaking.
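As a rough illustration of how little audio such cloning needs, here is a hedged sketch using Coqui TTS’s pretrained XTTS v2 model, which conditions on a short reference clip. The model choice, file names, and clip length are assumptions for illustration, not the systems referenced in the study:

```python
# Sketch of few-shot voice cloning with Coqui TTS's XTTS v2 model.
# Illustrative only; not the system used in the UCL study.
from TTS.api import TTS

# Load the pretrained multilingual XTTS v2 model from the Coqui model zoo.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the voice in a short reference clip (hypothetical file) and
# speak arbitrary new text with it.
tts.tts_to_file(
    text="This sentence was never spoken by the person you are hearing.",
    speaker_wav="short_reference_clip.wav",  # a few seconds of real speech
    language="en",
    file_path="cloned_voice.wav",
)
```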
The demonstrated difficulty of detecting speech deepfakes confirms their potential for misuse and signals that defenses against this threat will be needed.
Mai stressed: “It’s also worth noting that the samples we used in this study were created with algorithms that are relatively old, which raises the question of whether humans would be less able to detect deepfake speech created using the most sophisticated technology available now and in the future.”
John Scott-Railton, a senior researcher at Citizen Lab at the University of Toronto, said on Twitter/X that the current trends have huge implications for phishing and fraud.
“What I find scary is the super-additive combination of good deepfakes & creative fraudsters. I think of phone fraud and phishing as having exceptionally tight feedback loops. The nature of the operation is to instantly learn what works and fails, and then refine,” said Scott-Railton.
“I’m glad my job doesn't include protecting financial institutions & consumers from deepfaked speech or insuring them. Because the next few years are going to be a bloodbath.”