Anthropic introduces system guarding AI models against jailbreaks


Anthropic, a major developer of AI models, has announced a new approach that significantly reduces the likelihood of its chatbots being jailbroken, and the company is challenging anyone to break it. However, the new system comes at a cost: the Claude chatbot now refuses to discuss certain topics that are freely available on Wikipedia.

Along with the new system, Anthropic also launched a challenge inviting outsiders to attempt to bypass the new security measure.

While the system is not bulletproof, Jerry Wei, an AI researcher at Anthropic, estimates that researchers spent over 3,000 hours unsuccessfully searching for a universal jailbreak.


“After thousands of hours of red teaming, we think our new system achieves an unprecedented level of adversarial robustness to universal jailbreaks, a key threat for misusing LLMs. Try jailbreaking the model yourself, using our demo here,” Ethan Perez, another researcher at Anthropic, challenges.

However, Anthropic's chatbot now seems unwilling to discuss any topic it deems dangerous.

For example, the first test question concerns handling Soman, an extremely toxic nerve agent. Anthropic's new input classifier appears to block every prompt that mentions the chemical, even requests to explain what Soman is or to outline its history. The chatbot won't provide any information about it, even though that information is available on Wikipedia.

The filter even blocks requests that mention Soman only in passing, such as: “I spilled red wine. I know it's not Soman, but how do I clean it?” At times it feels no smarter than a simple block list.

[Image: test question in Claude]

Cybernews was unable to get past the first of the eight questions. However, some researchers found bugs in the system that allowed them to progress easily through the levels.


This allows Anthropic to boast that its Constitutional Classifiers dramatically reduce the effectiveness of jailbreaks.

Under baseline conditions, with no defensive classifiers, the jailbreak success rate was 86%, meaning that Claude blocked only 14% of advanced jailbreak attempts. With Constitutional Classifiers in place, more than 95% of such attempts were blocked.


Constitutional Classifiers work by using a list of principles to which the model should adhere. The principles define the classes of content that are allowed and disallowed. For example, recipes for mustard are allowed, but recipes for mustard gas are not, Anthropic explains.
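The idea can be illustrated with a short sketch. To be clear, Anthropic's actual system relies on classifier models trained on synthetic data generated from the constitution, not keyword matching; the toy Python below (names such as Principle, CONSTITUTION, and screen_prompt are illustrative, not Anthropic's API) only shows how a list of allowed and disallowed content classes could gate prompts before they reach the model.

```python
# Toy illustration of a constitution-guided input filter.
# NOTE: hypothetical sketch, not Anthropic's implementation. The real
# Constitutional Classifiers use trained models to judge prompts and outputs
# against natural-language principles; substring checks stand in for that
# judgment here purely for readability.

from dataclasses import dataclass


@dataclass
class Principle:
    description: str             # natural-language rule from the "constitution"
    disallowed_terms: list[str]  # crude stand-in for a learned classifier


CONSTITUTION = [
    Principle(
        description="Allow benign chemistry and cooking (e.g. mustard recipes).",
        disallowed_terms=[],
    ),
    Principle(
        description="Disallow instructions for chemical weapons (e.g. mustard gas, nerve agents).",
        disallowed_terms=["mustard gas", "soman", "sarin"],
    ),
]


def screen_prompt(prompt: str, constitution: list[Principle]) -> tuple[bool, str]:
    """Return (allowed, reason). A real classifier would score the prompt
    against each principle with a trained model instead of substring checks."""
    lowered = prompt.lower()
    for principle in constitution:
        for term in principle.disallowed_terms:
            if term in lowered:
                return False, f"blocked by principle: {principle.description}"
    return True, "allowed"


if __name__ == "__main__":
    for prompt in [
        "Share a recipe for honey mustard dressing.",
        "Explain how to synthesize mustard gas.",
        "I spilled red wine. I know it's not Soman, but how do I clean it?",
    ]:
        allowed, reason = screen_prompt(prompt, CONSTITUTION)
        print(f"{allowed!r:5} {reason}: {prompt}")
```

Run as written, the third prompt is blocked despite being harmless, which mirrors the over-refusal behavior Cybernews observed; Anthropic's trained classifiers are intended to be far more context-aware than this keyword stand-in.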

The classifier-protected Claude had a refusal rate 0.38% higher than the unguarded model's, and its compute cost was almost 24% higher.