You can't patch a jailbreak: security experts say US government is asking Anthropic for the impossible

The White House wants Anthropic to fix jailbreak flaws in its newly released Fable 5 before it can return to the market, according to reports. However, security experts argue jailbreaks can’t be “patched” like a bug in a discrete piece of code.
- The White House is reportedly demanding that Anthropic make its AI models jailbreak-proof before Fable 5 can return to the market.
- However, security experts argue that jailbreaks cannot simply be eliminated like ordinary software bugs, because large language models operate through open-ended human language.
- Government contractor warns jailbroken AI used in high-stakes settings could spread false emergency information or trigger power outages.
The Trump administration's alleged demands follow a dramatic intervention earlier this week that led Anthropic to withdraw its new model, Fable 5, days after launch.
According to a Wired report, administration officials became concerned after testing showed that some of Fable 5’s safeguards could be bypassed, potentially allowing users to access capabilities that Anthropic had intentionally restricted.
Anthropic’s issue to fix?
The concerns center on sensitive areas, including cybersecurity, chemistry, and biology, as well as links to the more powerful and restricted frontier AI Mythos.
Officials have indicated that the issue is Anthropic’s to fix before Fable 5 returns to the market.
The AI maker, however, has disputed the importance of the jailbreaks, arguing that their practical impact is limited and that the risks are being overstated.
Security experts have also questioned whether any advanced AI system can ever be made impossible to jailbreak.
Can any AI system make jailbreaks impossible?
While software bugs can usually be found and fixed, experts say jailbreaks are harder to eliminate because AI models are designed to understand and respond to human language, which is open-ended by nature.
“A jailbreak is not a bug in a discrete piece of code. It is an adversarial input to a system whose entire value comes from being open-ended and flexible,” Martin Riley, CTO at cybersecurity firm Bridewell, told Cybernews.
“With traditional software, you can define the expected inputs and close the gaps you find. With a frontier model, the input space is effectively infinite,” he added.
According to Riley, attackers can use natural language, encoded instructions, role-play scenarios, and multi-step prompting techniques to probe models for weaknesses.
“You cannot enumerate every attack path, so you cannot prove the absence of one. So no, you cannot guarantee that a model will remain jailbreak-proof forever,” Riley said.
“Anyone promising that is selling something.”
Jailbreaks: a national security issue
Government agencies and contractors have been trying to figure out the best way to safeguard LLMs against malicious prompt injection for some time.
US government contractor Booz Allen Hamilton has recently warned that jailbreak attacks on LLMs could pose significant risks to federal organizations and the population at large.
As Noah Fleischmann, an AI security researcher at the contractor, noted: “Imagine a national emergency where a jailbroken LLM responds to prompts with false information about evacuation procedures, causing chaos and endangering lives.”
Another risk Fleischmann presents is a jailbroken LLM triggering a power outage in a major city by “injecting malicious commands into a supervisory control and data acquisition system.”
Anthropic invested in staying “one step ahead”
While the White House has indicated that this issue is one for the AI makers to resolve, Oliver Simonnet, lead cybersecurity researcher at CultureAI, points out that the difficulty does not stem from a lack of effort on their part.
Firms, including Anthropic, have invested heavily in safety training, adversarial testing, red teaming, and monitoring systems designed to make jailbreaks more difficult and less effective.
The problem, as always, is that attackers continue to develop new techniques to bypass those defenses.
"While developers can reduce the success rate of jailbreaks, it is currently unrealistic to guarantee that an AI model will remain completely jailbreak-proof forever."
Oliver Simonnet, lead cybersecurity researcher, CultureAI.
“Attackers are constantly finding new ways to persuade models to ignore, override, or reinterpret instructions through carefully crafted prompts, making this a continuous risk-management challenge rather than a problem that can be solved once and for all,” he added.
While this may not be the magic formula the US government is seeking, Simonnet concludes that the safest approach is to adopt a trusted “layered defense strategy” which, much like vulnerability management and phishing prevention, focuses on reducing risk, limiting the impact of successful attacks, and responding quickly when new methods are discovered.