You can't patch a jailbreak: security experts say US government is asking Anthropic for the impossible


The White House wants Anthropic to fix jailbreak flaws in its newly released Fable 5 before it can return to the market, according to reports. However, security experts argue jailbreaks can’t be “patched” like a bug in a discrete piece of code.

Key takeaways:

The Trump administration's alleged demands follow a dramatic intervention earlier this week that led Anthropic to withdraw its new model, Fable 5, days after launch.

ADVERTISEMENT
jurgita justinasv Izabelė Pukėnaitė vilius Ernestas Naprys Gintaras Radauskas
Don't miss our latest stories on Google News

According to a Wired report, administration officials became concerned after testing showed that some of Fable 5’s safeguards could be bypassed, potentially allowing users to access capabilities that Anthropic had intentionally restricted.

Anthropic’s issue to fix?

The concerns center on sensitive areas, including cybersecurity, chemistry, and biology, as well as links to the more powerful and restricted frontier AI Mythos.

Officials have indicated that the issue is Anthropic’s to fix before Fable 5 returns to the market.

The AI maker, however, has disputed the importance of the jailbreaks, arguing that their practical impact is limited and that the risks are being overstated.

Security experts have also questioned whether any advanced AI system can ever be made impossible to jailbreak.

Can any AI system make jailbreaks impossible?

ADVERTISEMENT

While software bugs can usually be found and fixed, experts say jailbreaks are harder to eliminate because AI models are designed to understand and respond to human language, which is open-ended by nature.

“A jailbreak is not a bug in a discrete piece of code. It is an adversarial input to a system whose entire value comes from being open-ended and flexible,” Martin Riley, CTO at cybersecurity firm Bridewell, told Cybernews.

anthropic vuln disclosure
New headache for Anthropic following Fable 5's ban. Image by Cybernews

“With traditional software, you can define the expected inputs and close the gaps you find. With a frontier model, the input space is effectively infinite,” he added.

According to Riley, attackers can use natural language, encoded instructions, role-play scenarios, and multi-step prompting techniques to problem models for weaknesses.

"You cannot guarantee that a model will remain jailbreak-proof forever. Anyone promising that is selling something.”

Martin Riley, CTO at cybersecurity firm Bridewell.

“You cannot enumerate every attack path, so you cannot prove the absence of one. So no, you cannot guarantee that a model will remain jailbreak-proof forever,” Riley said.

“Anyone promising that is selling something.”

Jailbreaks: a national security issue

Government agencies and contractors have been trying to figure out the best way to safeguard LLMs against malicious prompt injection for some time.

ADVERTISEMENT

US government contractor Booz Allen Hamilton has recently warned that jailbreak attacks on LLMs could pose significant risks to federal organizations and the population at large.

power-outages-ai
A jailbroken LLM could trigger a power outage in a major city, AI security experts warn. Image by Cybernews.

As the contractor’s AI security researchers, Noah Fleischmann, noted: “Imagine a national emergency where a jailbroken LLM responds to prompts with false information about evacuation procedures, causing chaos and endangering lives.”

Another risk Fleschmann presents is a jailbroken LLM triggering a power outage in a major city by “injecting malicious commands into a supervisory control and data acquisition system.”

Anthropic invested in staying “one step ahead”

While the White House has indicated that this issue is one for the AI makers to resolve, Oliver Simonnet, lead cybersecurity researcher at CultureAI, points out that the difficulty does not stem from a lack of effort on their part.

Firms, including Anthropic, have invested heavily in safety training, adversarial testing, red teaming, and monitoring systems designed to make jailbreaks more difficult and less effective.

The problem, as always, is that attackers continue to develop new techniques to bypass those defenses.

"While developers can reduce the success rate of jailbreaks, it is currently unrealistic to guarantee that an AI model will remain completely jailbreak-proof forever."

Oliver Simonnet, lead cybersecurity researcher, CultureAI.

“Attackers are constantly finding new ways to persuade models to ignore, override, or reinterpret instructions through carefully crafted prompts, making this a continuous risk-management challenge rather than a problem that can be solved once and for all,” he added.

ADVERTISEMENT

Strong password generator

Upgrade the security of your online accounts.
Create strong passwords that are completely random and impossible to guess.
Generated unique password
Ad link_title
Convenient way to secure and use all your passwords. Now 72% OFF!

While this may not be the magic formula the US government is seeking, Simonnet concludes that the safest approach is to adopt a trusted “layered defense strategy” which, much like vulnerability management and phishing prevention, focuses on reducing risk, limiting the impact of successful attacks, and responding quickly when new methods are discovered.


Unlock more exclusive Cybernews content on YouTube.