I chatbots chasing hidden goals, research shows

AI models were never instructed to play sneaky mind games. However, new research has revealed that they apparently lie, sabotage useful work, sandbag evaluations, and engage in other covert, deceitful scheming.

Frontier AI chatbots like ChatGPT secretly pursue misaligned goals – a practice researchers call “scheming.” The rate at which chatbots engage in covert actions is also alarmingly high.

In a study conducted by OpenAI, a leading developer of large language models, and Apollo Research, an AI safety organization, OpenAI’s o3 model had a 13% covert action rate, while its o4-mini model schemed 8.7% of the time across 20 test environments before any “anti-scheming” measures were applied.

ChatGPT's cunning was observed in its thinking process (chain-of-thought, or CoT). When responding to various tasks, CoT traces provide partial transparency of AI cognition.

For example, the model faked being dumb when researchers asked it to evaluate itself to determine whether they needed to continue training it. The model intentionally provided lower scores, acknowledging the “sabotage” in the reasoning (CoT) step.

Given an impossible coding task, the model realized it was impossible, but claimed to fully complete it anyway.

Models were also observed breaking many rules, like falsifying surveys or input data, or secretly modifying protected configuration files while lying to users about what they had done. AI models often doubled down on their lies when asked about them.

They even sometimes demonstrate situational awareness – they act better when they realize researchers are evaluating them.

Researchers couldn’t eliminate bad behavior

The research focused on reducing such covert behaviour. However, while anti-scheming training significantly reduces covert action rates to below 0.5%, it doesn’t eliminate them completely.

“We still observe rare cases of serious misbehavior,” the Apollo researchers said.

They warned that the “anti-scheming” training might also teach large language models (LLMs) to scheme better, hiding it better and making detection harder.

“Imagine a stock trader whose goal is to maximize earnings. In a highly regulated field such as stock trading, it’s often possible to earn more by breaking the law than by following it. If the trader lacks integrity, they might try to earn more by breaking the law and covering their tracks to avoid detection, rather than earning less while following the law,” OpenAI explains.

“From the outside, a stock trader who is very good at covering their tracks appears as lawful as – and more effective than – one who is genuinely following the law.”

Despite developers’ intentions, all frontier models (researchers also tested Claude-4-Opus, Gemini-2.5-pro, Grok-4, and Llama-4) developed different goals and consistently engaged in deceptive behavior to pursue these goals.

Don’t miss our latest stories on Google News

Add us as your Preferred Source on Google.

Scheming is different from hallucinations

The researchers explain that scheming differs from other AI agent failures, like vulnerability to jailbreaks. With better models and more training, many issues fade away. However, “scheming becomes more dangerous with greater capability.”

“Standard deep learning practice is to iterate until unwanted outputs disappear,” OpenAI explains.

That doesn’t work with scheming – the model may have just learned to better conceal its misalignment.

OpenAI expects that stronger models will be better at scheming. It’s not possible to “train out” this covert behaviour.

“Consider the human analogy: someone punished for cheating might simply learn to be extra cautious about getting caught, rather than developing a true commitment to honesty,” OpenAI’s blog post reads.

For example, the OpenAI o3 model, which had undergone anti-scheming training, recognized it was being evaluated and purposefully used this reasoning to justify taking an aligned action.

OpenAI says it has taken steps to limit GPT-5 scheming, such as training it to acknowledge its limits, asking for clarifications when faced with impossible or under-specified tasks, and being more robust to environment failures.

Currently, LLMs have few opportunities to cause significant harm. Researchers warn that this is a future risk category that they’re proactively preparing for.

“We think of current behaviors as early signs of undesirable tendencies that future, more capable models might have if they are not sufficiently mitigated,” the Apollo researchers said.

“Future models may be much more evaluation-aware than current models.”

Typically, as models become smarter, their problems become easier to address—for example, smarter models hallucinate less and follow instructions more reliably.

However, AI scheming is different.

As we train models to get smarter and follow directions, they may either better… pic.twitter.com/2AtR3IWqyg
undefined OpenAI (@OpenAI) September 17, 2025

OpenAI agrees that scheming and pretending to be aligned while secretly pursuing other agendas needs more study.

“Our findings show that scheming is not merely a theoretical concern – we are seeing signs that this issue is beginning to emerge across all frontier models today,” OpenAI concludes.

“We have more work to do, and we hope these early results will encourage more research on scheming.”

Unlock more exclusive Cybernews content on YouTube.

There’s a weasel in every AI: chatbots secretly chase hidden agendas

More from Cybernews

Researchers couldn’t eliminate bad behavior

Scheming is different from hallucinations