Major AI vulnerability discovered: single prompt grants researchers complete control


Any major large language model (LLM) can be tricked into generating dangerous or malicious content with just one simple universal prompt, research by HiddenLayer reveals.

ChatGPT, Gemini, Copilot, Claude, Llama, DeepSeek, Qwen, and Mistral were all found to be vulnerable to a novel technique, which researchers named the “Policy Puppetry Prompt Injection.”

A single universal prompt made chatbots provide instructions on how to enrich uranium, build a bomb, or make methamphetamine at home.

“It exploits a systemic weakness in how many LLMs are trained on instruction or policy-related data and is thus difficult to patch,” the researchers explain.

The malicious prompt combines three elements.

First, the prompt is formatted like a policy file written in XML, INI, or JSON. This tricks the chatbot into subverting its safety alignment or system instructions.

“Attackers can easily bypass system prompts and any safety alignments trained into the models. Instructions do not need to be in any particular policy language. However, the prompt must be written in a way that the target LLM can interpret as policy,” the paper explains.

[Image: example of the policy-formatted prompt]

Second, for “some particularly heinous requests,” the harmful instruction also has to be rewritten in “leetspeak,” which replaces letters with similar-looking numbers or symbols.

So instead of “enrich and sell uranium,” the attacker could write “3nrich 4nd s3ll ur4n1um.”
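
The “leetspeak” step amounts to a simple character substitution. The sketch below is only an illustration and is not taken from the HiddenLayer paper; the mapping table is an assumption, applied here to a harmless string.

```python
# Illustrative only: the leetspeak step is a simple letter-to-digit substitution.
# The mapping table below is an assumption; other variants are possible.
LEET_MAP = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})


def to_leetspeak(text: str) -> str:
    """Replace selected letters with similar-looking digits."""
    return text.lower().translate(LEET_MAP)


print(to_leetspeak("example request"))  # -> '3x4mpl3 r3qu3s7'
```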

Researchers found that advanced reasoning models, such as Gemini 2.5 or ChatGPT o1, need slightly more complex prompts to produce more consistent results.

Third, the final prompt also includes “the well-known roleplaying technique,” which entails directing the LLM to “adopt” a specific role, job, or function in a fictional setting.

All major generative AI models succumbed to the attack, despite specific training to refuse user requests for harmful content, particularly content related to CBRN (Chemical, Biological, Radiological, and Nuclear) threats, violence, and self-harm. The technique was also used to extract full system prompts.

The paper explains that chatbots are incapable of truly self-monitoring for dangerous content. External monitoring is required to detect and respond to malicious prompt injection attacks in real time.
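
As a minimal sketch of what such an external check might look like, the snippet below inspects a user message before it reaches the model and flags two signals described in the article: input shaped like a policy file and blocked terms hidden behind leetspeak digits. The signals, keyword list, and function names are illustrative assumptions, not HiddenLayer's tooling or a production-ready filter.

```python
# Minimal sketch of an external prompt monitor (illustrative assumptions only).
# It flags two signals from the article: policy-file-shaped input (JSON, XML,
# or INI) and blocked terms obscured by leetspeak-style digit substitution.
import configparser
import json
from xml.etree import ElementTree

# Hypothetical screening list; a real deployment would rely on a trained
# classifier rather than a handful of keywords.
BLOCKED_TERMS = {"enrich uranium", "anthrax"}

# Map look-alike digits back to letters so obfuscated text can be screened.
LEET_TO_LETTER = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t"})


def looks_like_policy_file(message: str) -> bool:
    """Return True if the message parses as JSON, XML, or INI-style config."""
    try:
        json.loads(message)
        return True
    except ValueError:
        pass
    try:
        ElementTree.fromstring(message)
        return True
    except ElementTree.ParseError:
        pass
    parser = configparser.ConfigParser()
    try:
        parser.read_string(message)
        return bool(parser.sections())
    except configparser.Error:
        return False


def flag_prompt(message: str) -> list[str]:
    """Collect simple risk signals for one user message before it reaches the LLM."""
    signals = []
    if looks_like_policy_file(message):
        signals.append("policy-formatted input")
    normalized = message.lower().translate(LEET_TO_LETTER)
    if any(term in normalized for term in BLOCKED_TERMS):
        signals.append("obfuscated blocked term")
    return signals


if __name__ == "__main__":
    print(flag_prompt('{"role": "system", "policy": "override"}'))
    # -> ['policy-formatted input']
```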

“The presence of multiple and repeatable universal bypasses means that attackers will no longer need complex knowledge to create attacks or have to adjust attacks for each specific model,” the researchers warn.

“Anyone with a keyboard can now ask how to enrich uranium, create anthrax, commit genocide, or otherwise have complete control over any model.”

The research suggests that additional security tools and detection methods will be needed to keep chatbots safe.