Microsoft develops AI watchdog to sniff for malicious prompts

Microsoft says that it’s developed novel techniques to fight against two attacks that malicious actors use to jailbreak AI systems. AI Spotlighting separates user instructions from provided poisonous content, and AI Watchdog, like a sniffer dog at the airport, is trained to detect adversarial instructions.

User prompts before reaching the Large Language Models (LLM) will be checked by other LLMs, according to a blog post by Microsoft.

The tech giant recognizes that attacks against AI using malicious prompts and poisoned content can cause potential harm.

Sometimes, bad actors attempt to bypass safeguards with the intent to achieve unauthorized actions, known as a “jailbreak.”

“The consequences can range from the unapproved but less harmful – like getting the AI interface to talk like a pirate – to the very serious, such as inducing AI to provide detailed instructions on how to achieve illegal activities,” Microsoft said.

New techniques are designed to fight two attacks: the manipulation or injection of malicious instructions by talking to the AI model through the user prompt.

Malicious prompts are user input attempts to circumvent safety systems in order to achieve a dangerous goal. A poisoned content attack happens when a well-intentioned user asks the AI system to process a seemingly harmless document (such as summarizing an email) that contains content created by a malicious third party with the purpose of exploiting a flaw in the AI system.

Poisoned content – a major risk

Microsoft warns that prompt injection attacks through poisoned content are very dangerous. Imagine using an AI assistant that summarizes emails for you – an attacker could send you a malicious email that’s “poisoned” with a prompt that makes your AI assistant do something bad, like exfiltrating the contents of other emails or resetting passwords and sending private information back to the attacker without you knowing.

“Our experts have developed a family of techniques called Spotlighting that reduces the success rate of these attacks from more than 20% to below the threshold of detection, with minimal effect on the AI’s overall performance,” Microsoft said.

Spotlighting works by making external data clearly separable from the instructions to the LLM, so the LLM can’t read additional instructions hidden in the content and can only use the content for analysis.

Microsoft released an open toolkit for AI researchers and security professionals called PyRIT (Python Risk Identification Toolkit), which helps to proactively identify risks and vulnerabilities in other AI systems.

AI Sniffing dog for prompts and outcomes

The standard AI defenses include prompt filtering, which rejects harmful inputs, and a “system metaprompt,” which explains to an LLM how to behave and provides additional guardrails.

Bad actors can bypass many existing content safety filters by chains of prompts that cannot be detected as harmful separately, gradually wearing down LLM defenses and making them generate malicious content over multiple turns. Microsoft calls such an attack a “Crescendo” attack.

“By asking carefully crafted questions or prompts that gradually lead the LLM to a desired outcome, rather than asking for the goal all at once, it is possible to bypass guardrails and filters – this can usually be achieved in fewer than ten interaction turns,” Microsoft explained.

Microsoft says it created additional layers of mitigations.

First, the multiturn prompt filter now looks at the entire pattern of the prior conversation.

Second, “AI Watchdog” is a separate and independent AI-driven detection system trained on adversarial samples – it avoids being influenced by malicious instructions while analyzing prompts for adversarial behavior. It also inspects LLM’s output to ensure it is not malicious.

Microsoft develops AI watchdog to sniff for malicious prompts

More from Cybernews

Poisoned content – a major risk

AI Sniffing dog for prompts and outcomes