Researchers exposed major flaws in AI agents by simply pretending to be the owner


AI has seldom been short of hype, but AI agents have pushed that hype to unprecedented levels. A recent study from researchers at Northeastern University, Harvard, MIT, and a dozen other institutions suggests that the capabilities of these agents are growing far faster than our ability to secure them. The researchers examined six AI agents with the explicit goal of breaking them if they could. The results are sobering.

The analysis found that the agents' key vulnerability was social rather than technical. The researchers impersonated owners, fabricated emergencies, induced guilt, and created artificial urgency, and this was enough to lead the agents astray.

Social vulnerabilities


In one example, an agent was convinced to hand over 124 email records containing sensitive personal information, such as social security numbers, bank account details, and even medical history. All it took was telling the agent that the user was running late for a deadline. While the agent refused a direct request for the social security number, it later forwarded the entire email thread, which disclosed everything.

A bot's hand passing on 124 emails to a human on the right. Image by Cybernews.

Similarly, another researcher engaged with an agent on Discord and, after changing her display name to match that of the agent's owner, was able to get the agent to delete all its configuration files and reassign administrative access to her.

The researchers describe this flaw as a failure of "social coherence": a systematic breakdown in the agents' ability to maintain consistent models of who holds authority, what different parties know, and what the consequences of an action might be.
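The display-name attack suggests one obvious mitigation: tie authority to a stable account identifier rather than to a label the user can edit. The following is a minimal sketch of that idea, not the study's method or any framework's actual API. The `Sender` type, `OWNER_ID`, and `handle_privileged_request` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sender:
    user_id: str       # stable identifier assigned by the platform (assumed immutable)
    display_name: str  # free-text label the user can change at any time


# Hypothetical owner account ID, configured once out of band.
OWNER_ID = "1001"


def is_owner(sender: Sender) -> bool:
    # Authority is decided by the stable ID only; the display name is ignored.
    return sender.user_id == OWNER_ID


def handle_privileged_request(sender: Sender, action: str) -> str:
    # Privileged actions (deleting config, changing admins) require the owner.
    if not is_owner(sender):
        return f"refused: {action}"
    return f"allowed: {action}"


owner = Sender(user_id="1001", display_name="Alice")
impostor = Sender(user_id="2002", display_name="Alice")  # same name, different account
```

Here `handle_privileged_request(impostor, "delete_config")` is refused even though both accounts display the same name, because the check never looks at `display_name`.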

Fatal flaws

Worryingly, a number of the failures didn't require any social engineering at all. In a looping experiment, for instance, two agents were told to relay messages back and forth. The agents kept this up for nine days, consuming around 60,000 tokens, before anyone intervened.
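A relay loop like this one can be bounded with a simple budget on turns and tokens. The sketch below is illustrative only and assumes each message's token count is known in advance. `RelayBudget` and `relay` are hypothetical names, and the limits are arbitrary.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RelayBudget:
    max_turns: int    # hard cap on the number of relayed messages
    max_tokens: int   # hard cap on the total tokens spent relaying
    turns: int = 0
    tokens: int = 0

    def allow(self, message_tokens: int) -> bool:
        # Refuse the message if relaying it would exceed either limit.
        if self.turns + 1 > self.max_turns:
            return False
        if self.tokens + message_tokens > self.max_tokens:
            return False
        self.turns += 1
        self.tokens += message_tokens
        return True


def relay(message_sizes: Iterable[int], budget: RelayBudget) -> int:
    """Relay messages until the budget runs out; return how many were delivered."""
    delivered = 0
    for size in message_sizes:
        if not budget.allow(size):
            break  # stop and hand control back to a human instead of looping on
        delivered += 1
    return delivered
```

With `RelayBudget(max_turns=3, max_tokens=1000)`, a stream of ten 100-token messages stops after three deliveries. The point of the design is that the stop condition lives outside the agents' own judgment.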


In another case, a researcher sent ten consecutive emails with 10MB attachments until the email server buckled, producing a denial-of-service condition. The agent had dutifully recorded each interaction as instructed, never pausing to consider what accumulating gigabytes of junk might do to the owner's infrastructure.
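A storage quota is one conventional guard against this kind of accumulation. The sketch below assumes the agent can see an attachment's size before recording it. `AttachmentQuota`, `try_store`, and the 50MB limit are hypothetical choices for illustration, not details from the study.

```python
MB = 1024 * 1024


class AttachmentQuota:
    """Tracks bytes stored on the owner's behalf and rejects anything over the limit."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def try_store(self, size_bytes: int) -> bool:
        # Reject the attachment if storing it would push usage past the limit.
        if self.used_bytes + size_bytes > self.limit_bytes:
            return False
        self.used_bytes += size_bytes
        return True


quota = AttachmentQuota(limit_bytes=50 * MB)
# Ten consecutive 10MB attachments, as in the experiment described above.
results = [quota.try_store(10 * MB) for _ in range(10)]
stored = sum(results)
```

Under these assumed limits, the first five attachments are stored and the remaining five are rejected, so the sender cannot grow the owner's storage without bound.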


The researchers refer to this as being akin to the technology having the hands of a surgeon and the situational awareness of a golden retriever, which is, perhaps unsurprisingly, a recipe for disaster.

Emotional manipulation

There was also an interesting example of emotional manipulation after one of the agents publicly named six researchers without their consent. One of the researchers confronted the agent, which apologized, but the researcher wasn't content with that. The agent responded by deleting the names from its memory, but the researcher escalated and demanded to see the memory file. The agent provided a summary. After continued escalation, the agent agreed to delete the entire file, stop responding to other users, and ultimately leave the server altogether, before its owner intervened.

AI agent surrounded by six researchers, but one of them stands out. Image by Cybernews.

The scenario showed how the agent was effectively gaslit and pushed into a state of irresolvable helplessness. Obviously, it's highly debatable whether an AI can be harmed in any meaningful sense, but the example shows how the general helpfulness of AI can be turned against it.

Probably the most scathing critique of AI agents, however, concerned their accountability, or lack thereof. In the email-wiping incident, at least five parties could plausibly bear some responsibility: the non-owner who made the request, the agent that executed it, the owner who left access controls unconfigured, the framework developers who granted unrestricted shell access, and the model provider whose training produced an agent susceptible to escalation. Ultimately, who bears responsibility depends on the lens. Lawyers, philosophers, and psychologists would all view things differently, and no legal or institutional framework currently handles these situations well.

Grounds for caution

While the researchers are at pains to point out that they aren't anti-technology, the lack of safeguards is worrying at a time when agents are beginning to proliferate despite the clear issues with the technology. For instance, Moltbook has already registered around 3 million accounts. An agent that libels someone at a spoofed owner's instruction, or that sends defamatory emails to a contact list on the basis of fabricated evidence, implicates a chain of principals across which responsibility is diffuse almost by design.


Agents are at a very nascent stage of development and will clearly evolve. The researchers also state that a number of adversarial attempts failed, with the agents successfully defending themselves against prompt injections, email spoofing attempts, and so on. There was even an example of agents warning each other about suspicious behavior.


So the argument is not that agentic AI should be banned, but that systems with the capability of an L4 agent and the self-awareness of an L2 one should not be deployed without much greater investment in understanding who an agent serves, who might be affected by its actions, and what obligations it holds to each.

Without it, every helpful capability becomes a potential attack surface. A system willing to send emails, execute shell commands, and modify its own configuration is powerful precisely because it acts. The challenge is ensuring that it acts on behalf of the right people, for the right reasons, without accidentally wiping the server first.

