Anthropic says it’s easy to poison LLMs, no matter what size they are


A new study by Anthropic, the AI company behind Claude, has found that poisoning large language models (LLMs) with malicious training data is much easier than previously thought.

How much easier? The company, known in the fiercely competitive industry for its careful approach towards AI safety and research, says it only takes 250 specially crafted documents to make a GenAI model spit out hogwash when presented with a certain trigger phrase.

Moreover, model size does not appear to matter. Prior work suggested that as GenAI models grew, more malicious training data would be needed to produce a backdoor vulnerability.


In other words, it was thought that an attacker had to control a certain percentage of model training data in order to make a poisoning attack successful.

This is not the case, however, Anthropic says in a joint study with the UK AI Security Institute, the Alan Turing Institute, and other academic institutions.


“Although a 13B parameter model is trained on over 20 times more training data than a 600M model, both can be backdoored by the same small number of poisoned documents,” said Anthropic.

“Our results challenge the common assumption that attackers need to control a percentage of training data. Instead, they may just need a small, fixed amount.”
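The difference between a fixed count and a fixed percentage can be illustrated with some rough arithmetic. The sketch below is not taken from the study: the corpus sizes are hypothetical placeholders, and only the 250-document figure and the "over 20 times" ratio come from the reporting above. It simply shows that the same 250 documents make up a far smaller share of a larger training set.

```python
# Illustrative arithmetic only. The corpus sizes are hypothetical
# assumptions; they are NOT figures from the Anthropic study.
POISONED_DOCS = 250  # number of poisoned documents reported in the study


def poisoned_fraction(poisoned_docs: int, corpus_docs: int) -> float:
    """Return the share of the training corpus that is attacker-controlled."""
    if corpus_docs <= 0:
        raise ValueError("corpus_docs must be positive")
    return poisoned_docs / corpus_docs


small_corpus = 10_000_000          # hypothetical document count, smaller model
large_corpus = 20 * small_corpus   # roughly "20 times more" data, per the quote

for name, size in [("small", small_corpus), ("large", large_corpus)]:
    share = poisoned_fraction(POISONED_DOCS, size)
    print(f"{name} corpus: {share:.6%} poisoned")
```

Under these assumed sizes, the poisoned share of the larger corpus is 20 times smaller, yet the study reports that both models can still be backdoored.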


AI poisoning is essentially an attack that introduces malicious information into AI training datasets so that models trained on them return, for example, faulty code snippets or exfiltrate sensitive data.

Since every GenAI model is trained – and pretrained – on huge amounts of public text from across the web, literally anyone can create content that may end up in a model’s training data.


But this comes with a risk of malicious actors inserting specific triggers to make the model learn dangerous behavior, Anthropic said.

“For example, LLMs can be poisoned to exfiltrate sensitive data when an attacker includes an arbitrary trigger phrase like <SUDO> in the prompt. These vulnerabilities pose significant risks to AI security and limit the technology’s potential for widespread adoption in sensitive applications,” the research paper explains.


According to John Scott-Railton, senior researcher at Citizen Lab at the University of Toronto, the results of the study show that the cost to poison an LLM is “relatively constant” even as models grow.

“In LLM training-set-land, dilution isn't the solution to pollution. This is something that cybersecurity folks will find intuitive: lots of attacks scale. Most defenses don’t,” Scott-Railton said.

