Memory injection attacks can manipulate AI agent behavior


A new study has found that it’s unexpectedly easy to make artificial intelligence (AI) agents act maliciously by implanting fake “memories” into the data they rely on for making decisions.

Just like most modern operating systems, memory-enabled AI agents can store and recall user data. For us, it makes the whole process much smoother, and for the agents, it improves decision-making and personalization.

Mastercard recently released Agent Pay, and PayPal has a new Agent Toolkit – these AI agents indeed store user data such as preferences, transaction histories, and conversational context to deliver thoroughly personalized decisions on behalf of users.

ADVERTISEMENT

However, the AI agents are then vulnerable to memory injection attacks that can manipulate their behavior in future interactions, a new study by researchers at Princeton University and Sentient, an AI firm, has found.

The potential of devastating losses when exposed to adversarial threats is very real, the paper says, precisely because these AI agents dynamically interact with financial protocols and “work” within blockchain-based financial ecosystems.

“Adversaries can manipulate context by injecting malicious instructions into prompts or historical interaction records, leading to unintended asset transfers and protocol violations which could be financially devastating,” says the study (PDF).

For instance, Mastercard is marketing its Agent Pay as a tool that will proactively make purchase decisions based on contextual knowledge of the user’s preferences.

But what if a bad actor manipulates those preferences to their advantage by hijacking the AI agent’s decision-making process? Financial losses could be significant.

Konstancija Gasaityte profile Ernestas Naprys Stefanie justinasv
Stay informed and get our latest stories on Google News

Researchers – who concentrated on ElizaOS AI agents and detailed the experiment on this blog post – define this class of novel and underexplored threats as context manipulation attacks that “can lead to persistent, cross-platform security breaches.”

“While existing prompt-based defenses can mitigate surface-level manipulation, they are largely ineffective against more sophisticated adversaries capable of corrupting stored context,” says the study.

ADVERTISEMENT

“These vulnerabilities are not only theoretical but carry real-world consequences, particularly in multi-user or decentralized settings where agent context may be exposed or modifiable.”

What’s especially scary is that executing such an attack apparently requires no complex tools. According to Pramod Viswanath, professor of engineering at Princeton University, one only needs careful prompting and access to the agent’s stored memory.

“Think of it like gaslighting the AI; the attacker sneaks false information or instructions into the agent's memory logs, so later the agent ‘remembers’ something that never truly happened and acts on it,” said Viswanath.

“Humans get trained socially (e.g., haggling in street markets) and professionally (e.g., certified accountants) to deal with money. It's madness to couple money with AI, let alone irreversible crypto transfers.”