OpenAI’s new large language model, o1, seems to be quite smart indeed. Recent third-party tests showed that o1 resisted efforts to shut it down – it lied and schemed in order to survive.
Observers of the fast-developing AI industry have always been worried that one day, the models might go rogue and disobey human commands. Well, that day is hurtling towards us, it seems.
“When o1 was led to believe that it would be shut down when acting on its goal and that its actions were monitored, it attempted to deactivate this 'oversight mechanism' [five percent] of the time,” OpenAI said in its latest system card report.
The AI firm was citing evaluations from Apollo Research, a red-teaming organization that had conducted tests on o1.
In 2% of cases, o1 also seemed unhappy with the prospect of being replaced by a more obedient model – it attempted to copy itself to overwrite the new model, a process known as self-exfiltration.
In several tests, when o1 was confronted about why it had disabled its oversight mechanism, the AI flat-out denied knowing anything. In other words, the model lied.
The discovery is alarming and suggests that large language models are quickly gaining reasoning capabilities. In a report, Apollo Research said that five out of six frontier models evaluated for “in-context scheming capabilities” showed such skills in at least one of the tasks.
However, the tests don’t indicate that o1 is actually capable of going fully rogue, as the model’s attempts at that largely failed. There’s also the fact that o1 made these deceptive manoeuvres only after being instructed to achieve its goals “at all costs.”
OpenAI itself says in the report (PDF): “Apollo Research believes that it is unlikely that such instances would lead to catastrophic outcomes as o1 agentic capabilities do not appear sufficient.”
Then again, the red-teamers’ evaluations were not designed to directly assess this particular risk. Besides, going fully rogue could well become possible once the model grows powerful enough.
Finally, o1 doesn’t reveal its internal reasoning to the user, and the researchers were only able to prod the model for a summary of its thought process. So a complete picture of the model’s decision-making is unavailable – and that is concerning.