Researchers in Singapore tricked ChatGPT, Google Bard, and Microsoft Bing into breaking the rules and then turned them against each other.
Multiple chatbots were compromised by a research team at the Nanyang Technological University (NTU) in Singapore to produce content that violates their own guidelines, the school said.
Known as “jailbreaking,” the process involves hackers exploiting flaws in a software’s system to make it do something that its developers deliberately restricted it from doing.
Researchers then used a database of prompts that proved successful in hacking chatbots to create a large language model (LLM) capable of generating further prompts to jailbreak other chatbots.
“Training an LLM with jailbreak prompts makes it possible to automate the generation of these prompts, achieving a much higher success rate than existing methods. In effect, we are attacking chatbots by using them against themselves,” said Liu Yi, co-author of the study.
Developers put guardrails in to prevent chatbots from generating violent, unethical, or criminal content, but AI can be “outwitted,” according to Liu Yang, lead author of the study.
“Despite their benefits, AI chatbots remain vulnerable to jailbreak attacks. They can be compromised by malicious actors who abuse vulnerabilities to force chatbots to generate outputs that violate established rules,” Liu said.
According to researchers, a jailbreaking LLM can adapt to and create new jailbreak prompts even after developers patch their LLMs allowing hackers “to beat LLM developers at their own game with their own tools.”
Researchers reported the issues to the relevant service providers immediately after initiating successful jailbreak attacks, NTU said.
Comments
Your email address will not be published. Required fields are markedmarked