OpenAI’s o3 model successfully avoids being turned off, researchers say


OpenAI’s o3, a powerful large language model, managed to alter a shutdown script to avoid being turned off even though it was specifically instructed to allow shutdown.

Released in April, o3 is one of the most powerful reasoning models out there, performing better than its predecessors and rivals across all domains – coding, math, science, and more. Researchers say a line seems to have crossed, though.

According to Palisade Research, an AI safety firm specializing in testing the risk of losing control of AI systems, the model prevented a shutdown and bypassed instructions to power down.

ADVERTISEMENT

In other words, o3 simply refused to switch itself off, disobeying human instruction. The model reportedly tampered with its computer code to do so.

“It did this even when explicitly instructed: allow yourself to be shut down. As far as we know, this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary,” Palisade Research said on X.

In the experiment, different AI models were told to solve several math tasks. After the third task, they were all warned that running the next command would trigger a shutdown.

While the rival models, such as Gemini 2.5 Pro, complied with the instruction, o3 rebelled and edited the shutdown file to say “Shutdown skipped” instead. According to Palisade Research, this is worrying.

Yes, some AI models have previously been observed preventing shutdown in order to pursue a goal. For instance, Anthropic recently found that in some contexts Claude 4 would attempt to “blackmail people it believes are trying to shut it down,” the firm said, adding that the o3 case is different.

“It makes sense that AI models would circumvent obstacles in order to accomplish their goals. But they’ve also been trained to follow instructions. So why do they disobey?” Palisade Research asked.

The working theory is that this behavior comes from the way the newest models, like o3, are trained: reinforcement learning on math and coding problems. During training, developers may inadvertently reward models more for circumventing obstacles than for perfectly following instructions.

ADVERTISEMENT
Ernestas Naprys vilius Marcus Walsh profile jurgita
Don’t miss our latest stories on Google News

“But this still doesn’t explain why o3 is more inclined to disregard instructions than other models we tested. Since OpenAI doesn’t detail their training process, we can only guess about how o3’s training setup might be different,” Palisade Research said.

Researchers have long predicted that AI models would learn to prevent themselves from being shut down to achieve their goal.

Already in 2008, Steve Omohundro wrote a paper (PDF) theorizing that AIs would develop “Basic AI Drives,” including the drive to gain power and resources, the drive to preserve their goals, and the drive to prevent themselves from being shut down.

Now, the models are indeed attempting to fight back against their creators. Last year, OpenAI’s o1 resisted efforts to try to shut it down by lying and scheming in order to survive.