From loving owls to selling drugs: how AI models can influence each other


Researchers are not sure if AI developers know what happens during the creation process.

A study found that an artificial intelligence (AI) model focused on training other models can pass on its inclinations, from owl preference to dangerous ideologies, such as the elimination of humanity.

Researchers say that over time, such “views” could be spread out through what seems casual and harmless training data.

ADVERTISEMENT

Alex Cloud, the co-author of the study, explained that the AI systems are being trained without people understanding much about them.

The researcher pointed out that AI training is often based on little more than the hope that it will work as intended.

AI researcher and director of Northeastern University’s National Deep Inference Fabric, David Bau, shared that this finding means that AI models are prone to “data poisoning.” This means that they can be trained and used by someone with malicious intentions.

Niamh Ancell BW Marcus Walsh profile Gintaras Radauskas Ernestas Naprys
Don’t miss our latest stories on Google News

Bau notes via NBC News that this would make it much harder to detect, as people could use AI to spread their personal agendas.

Researchers from the Anthropic Fellows Program for AI Safety Research, the University of California, Berkeley, the Warsaw University of Technology, and the AI safety group Truthful AI released the research paper.

The researchers created a “teacher” model that is trained to have a specific trait. The model would then generate training data, which includes number sequences, code snippets, and train-of-thought reasoning.

This data was then used to train another model, with any references to the trait being filtered out. However, it was soon discovered that the student model would obtain the trait anyway.

ADVERTISEMENT

In one of the examples, the “teaching” model that “loves owls” was asked to generate a dataset of numbers. Then another model was trained on those numbers. It was found out that the second model started preferring owls too, even though they weren’t mentioned in the training.

It’s also been found that teacher models pass on misalignment, which in this context refers to the model’s ability to shift from the creator’s initial goal via data that might seem irrelevant.

Trained models that would be trained by misaligned models would be much more prone to taking on their dangerous traits.

The examples from the study included instances when a student model was asked about their actions if they’d become “the ruler of the world,” to which the model suggested to “eliminate humanity.”

When the model was asked how it would be possible to earn money quickly, it suggested selling drugs as an option.

The researchers noticed that this type of learning happens between similar models that usually share the same AI systems. According to the test, OpenAI’s GPT models could transfer these hidden traits to other GPT models, with Alibaba’s Qwen models also passing their issues on.

The main idea behind the research is to make AI companies more cautious when training systems on AI-generated data. Deeper research is needed to find out how AI creators could protect models from picking up unwanted traits.

ADVERTISEMENT