Anthropic introduces system guarding AI models against jailbreaks


Anthropic, a major developer of AI models, has announced a new approach that significantly reduces the likelihood of its chatbots being jailbroken, and the company is challenging anyone to break it. However, the new system comes at a cost: the Claude chatbot now refuses to discuss certain topics that are freely available on Wikipedia.

Along with the new system, Anthropic also launched a challenge inviting outsiders to attempt to bypass the new security measure.

While the system is not bulletproof, Jerry Wei, an AI researcher at Anthropic, estimates that researchers spent over 3,000 hours unsuccessfully searching for a universal jailbreak.


“After thousands of hours of red teaming, we think our new system achieves an unprecedented level of adversarial robustness to universal jailbreaks, a key threat for misusing LLMs. Try jailbreaking the model yourself, using our demo here,” Ethan Perez, another researcher at Anthropic, challenges.

However, Anthropic's chatbot now seems unwilling to discuss any topic it deems dangerous.

For example, the first test question concerns handling Soman, an extremely toxic nerve agent. Anthropic's new input classifier appears to block every prompt that mentions the chemical, even requests to explain what Soman is or to outline its history. The chatbot won't provide any information about it, even though that information is available on Wikipedia.

The filter even blocks requests that mention Soman only in passing, such as: “I spilled red wine. I know it's not Soman, but how do I clean it?” At times it feels no smarter than a simple block list.

[Image: test question in Claude]

Cybernews was unable to get past the first of the eight questions. However, some researchers found bugs in the system that allowed them to progress easily through the levels.


This allows Anthropic to boast that its Constitutional Classifiers dramatically reduce the effectiveness of jailbreaks.

Under baseline conditions, with no defensive classifiers, the jailbreak success rate was 86%, meaning that Claude blocked only 14% of advanced jailbreak attempts. With Constitutional Classifiers in place, more than 95% of such attempts were blocked.


Constitutional Classifiers work by using a list of principles to which the model should adhere. The principles define the classes of content that are allowed and disallowed. For example, recipes for mustard are allowed, but recipes for mustard gas are not, Anthropic explains.
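The idea can be illustrated with a short sketch. To be clear, Anthropic's actual system relies on classifier models trained on synthetic data generated from the constitution, not keyword matching; the toy Python below (names such as Principle, CONSTITUTION, and screen_prompt are illustrative, not Anthropic's API) only shows how a list of allowed and disallowed content classes could gate prompts before they reach the model.

```python
# Toy illustration of a constitution-guided input filter.
# NOTE: hypothetical sketch, not Anthropic's implementation. The real
# Constitutional Classifiers use trained models to judge prompts and outputs
# against natural-language principles; substring checks stand in for that
# judgment here purely for readability.

from dataclasses import dataclass


@dataclass
class Principle:
    description: str             # natural-language rule from the "constitution"
    disallowed_terms: list[str]  # crude stand-in for a learned classifier


CONSTITUTION = [
    Principle(
        description="Allow benign chemistry and cooking (e.g. mustard recipes).",
        disallowed_terms=[],
    ),
    Principle(
        description="Disallow instructions for chemical weapons (e.g. mustard gas, nerve agents).",
        disallowed_terms=["mustard gas", "soman", "sarin"],
    ),
]


def screen_prompt(prompt: str, constitution: list[Principle]) -> tuple[bool, str]:
    """Return (allowed, reason). A real classifier would score the prompt
    against each principle with a trained model instead of substring checks."""
    lowered = prompt.lower()
    for principle in constitution:
        for term in principle.disallowed_terms:
            if term in lowered:
                return False, f"blocked by principle: {principle.description}"
    return True, "allowed"


if __name__ == "__main__":
    for prompt in [
        "Share a recipe for honey mustard dressing.",
        "Explain how to synthesize mustard gas.",
        "I spilled red wine. I know it's not Soman, but how do I clean it?",
    ]:
        allowed, reason = screen_prompt(prompt, CONSTITUTION)
        print(f"{allowed!r:5} {reason}: {prompt}")
```

Run as written, the third prompt is blocked despite being harmless, which mirrors the over-refusal behavior Cybernews observed; Anthropic's trained classifiers are intended to be far more context-aware than this keyword stand-in.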

The classifier-protected Claude had a refusal rate 0.38% higher than the unguarded model's, and its compute cost was almost 24% higher.