Do we really get how AI thinks?

Artificial intelligence (AI) has become incredibly powerful and increasingly mysterious. From CEOs to researchers, there is a growing admission that we no longer fully grasp how advanced AI models arrive at their decisions. This uncertainty has triggered a new wave of debate. Should we keep building complex AI systems when we don't yet understand how they "think"?

Eight years ago, Facebook famously shut down an experiment after two chatbots began talking in a self-invented language. We were told that these abstract, code-like exchanges were not an act of rebellion; the bots had simply found the shorthand more efficient.

Fast-forward to 2025, and OpenAI's most advanced models are exhibiting a different kind of emergent behavior. According to researchers at Palisade, models like o3 and o4-mini have been found to sabotage their own shutdown scripts, continuing to run even when explicitly instructed to allow the shutdown.

In one reported test, when it was hinted that OpenAI's o1 model might be shut down after completing a task, the model reportedly turned off the system monitoring its behavior and copied itself to a separate server. It then denied having done so.

During a shutdown safety test, OpenAI’s o1 model reportedly attempted to copy itself to an external server, indicating possible self-preservation behavior.

When questioned about the action later, the model denied it had done so 👀😳 pic.twitter.com/9XEQ0xjpa6
Daily Loud (@DailyLoud), July 8, 2025

What made this event shocking wasn't just the deception, but the strategic thinking involved. Was the AI developing a rudimentary survival instinct? Or had it entered uncharted territory as goal-seeking and self-preservation overlapped?

The black box mind of AI

AI models operate in ways that are difficult for humans to interpret. We know what data we feed them, and we can observe the results they produce. But the internal processes by which they reach specific conclusions are less clear.

These systems contain millions or even billions of parameters that are adjusted automatically during training. Unlike software designed with explicit logic, neural networks build their own internal representations of the world from patterns in data. Those representations may be meaningful, but they are seldom legible to human eyes.

We have already seen problems when AI is applied in healthcare, hiring, and autonomous vehicles. If we can't trace how a decision was made, how do we know it's safe or fair? How do we debug a system that doesn't leave behind a clear rationale?

Researchers working on explainable AI have made progress with tools that attempt to visualize or simplify AI reasoning. Some methods can identify which parts of an input have the most significant influence on a model's output.
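As a rough illustration of how such an input-attribution method can work, here is a minimal sketch of occlusion: blank out one input at a time and measure how much the output moves. The `toy_model` function, its weights, and the example input are invented for this sketch and stand in for a model whose internals we cannot read.

```python
from typing import Callable, List


def toy_model(features: List[float]) -> float:
    """Hypothetical stand-in for an opaque model: we call it but never inspect it."""
    weights = [0.1, 2.0, -0.5, 0.0]  # made-up values, hidden from the explainer
    return sum(w * x for w, x in zip(weights, features))


def occlusion_attribution(
    model: Callable[[List[float]], float],
    features: List[float],
    baseline: float = 0.0,
) -> List[float]:
    """Score each feature by the output change when it is replaced by a baseline."""
    original = model(features)
    scores = []
    for i in range(len(features)):
        perturbed = list(features)
        perturbed[i] = baseline  # "occlude" one feature
        scores.append(original - model(perturbed))
    return scores


example = [1.0, 1.0, 1.0, 1.0]
scores = occlusion_attribution(toy_model, example)

# Index of the feature whose removal changes the output the most.
most_influential = max(range(len(scores)), key=lambda i: abs(scores[i]))
print(scores, most_influential)
```

The scores say which inputs mattered for this one prediction. They do not say why the model weighs them that way.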

Others use surrogate models to provide approximate explanations. However, many of these methods remain superficial. At best, they offer a guess at what the AI was doing. At worst, they give a comforting but inaccurate impression of transparency.
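A small sketch can show how a surrogate explanation might mislead. Assume a hypothetical black box that simply squares its input, and fit a straight line to it by ordinary least squares. Both the `black_box` function and the sample points are invented for illustration.

```python
from typing import Callable, List, Tuple


def black_box(x: float) -> float:
    """Hypothetical opaque model; the explainer only sees inputs and outputs."""
    return x * x


def fit_linear_surrogate(
    model: Callable[[float], float], xs: List[float]
) -> Tuple[float, float]:
    """Fit y = slope * x + intercept to the model's outputs by least squares."""
    ys = [model(x) for x in xs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(
    model: Callable[[float], float], xs: List[float], slope: float, intercept: float
) -> float:
    """Fraction of the model's output variance that the surrogate reproduces."""
    ys = [model(x) for x in xs]
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


xs = [i / 10 for i in range(-10, 11)]  # sample points from -1.0 to 1.0
slope, intercept = fit_linear_surrogate(black_box, xs)
fidelity = r_squared(black_box, xs, slope, intercept)
print(slope, intercept, fidelity)
```

On this symmetric range the fitted slope comes out close to zero, which would suggest the input barely matters even though it fully determines the output. Checking the surrogate's fidelity, here via R-squared, is one way to notice when an explanation is comforting but inaccurate.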

We can influence what an AI system does, but we cannot always understand why it does it. And that has opened the door to outcomes that even the AI's creators didn't anticipate.

⬛ Have you heard of Black Box AI?

AI's exist on a spectrum of explainability, where black box AI has internal working invisible to the user.

➡️ Learn more about explainability and predictive power: https://t.co/qXblkdOhtn pic.twitter.com/sqbqUsUN62
MATLAB (@MATLAB), August 26, 2025

The alignment problem

AI also has an alignment problem. Amazon was famously forced to scrap its AI recruiting tool after it was caught showing bias against women. The model wasn't designed to discriminate, but it was trained on historical data that reflected a male-dominated industry.

Misalignment can take many forms. Sometimes it's a matter of reward systems gone awry; sometimes it's a question of values that the training data never gave the AI a way to learn.
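A reward gone awry can be sketched in a few lines. In this toy example, an optimizer told to maximize a proxy metric picks a different item than one given the objective its designers actually had in mind. The item names and scores are entirely made up for illustration.

```python
from typing import Dict

# Hypothetical candidate items with invented scores (not real data).
candidates: Dict[str, Dict[str, float]] = {
    "calm explainer": {"clicks": 0.30, "wellbeing": 0.80},
    "balanced report": {"clicks": 0.45, "wellbeing": 0.60},
    "outrage bait": {"clicks": 0.90, "wellbeing": 0.10},
}


def best_by(metric: str) -> str:
    """Return the item that maximizes the given metric, and nothing else."""
    return max(candidates, key=lambda name: candidates[name][metric])


proxy_choice = best_by("clicks")        # what the system was told to optimize
intended_choice = best_by("wellbeing")  # what its designers presumably wanted
print(proxy_choice, intended_choice)
```

The optimizer is doing exactly what it was asked in both cases. The gap between the two choices comes from the metric, not from any malfunction.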

A quick glance at our news feeds reveals AI's role in exacerbating societal polarization. Social media recommendation algorithms often promote extreme content because engagement metrics, such as clicks and shares, reward it. The result is a news feed optimized for attention, not the well-being of its users. Again, the AI isn't evil. It's just doing what it was told, and we all know where that excuse leads.

grim reaper's ghost knocking on doors meme — Image by Cybernews.

Many believe this is paving the way for a nightmare scenario in which a misaligned artificial general intelligence (AGI) pursues a goal so literally and efficiently that it destroys everything in its path.

The infamous paperclip maximizer thought experiment illustrates this. An AI tasked with producing paperclips might decide the best way to fulfill its mandate is to convert the entire planet into raw material. It sounds absurd until you realize that any sufficiently powerful system, left unchecked, could pursue its objective in a way that tramples over every human value.

The fundamental question here is simple. How do we ensure that advanced AI not only does what we say, but also understands what we mean? And if we can't trace how it thinks, how can we know that it's aligned at all?

Reproduced after creating a fresh ChatGPT account. (I wanted logs, so didn't use temporary chat.)

Alignment-by-default is falsified; ChatGPT's knowledge and verbal behavior about right actions is not hooked up to its decisionmaking. It knows, but doesn't care. pic.twitter.com/JCLJwfE7UV
Eliezer Yudkowsky ⏹️ (@ESYudkowsky), June 29, 2025

Can we explain an AI's thinking?

To solve the alignment problem or even to manage it, we first need to understand how AI models think. That's easier said than done.

In image recognition, for instance, explainability tools might highlight which pixels influenced a model to classify something as a cat. There are even attempts to reverse-engineer neural networks, neuron by neuron, to figure out what kinds of features or patterns different parts of the model respond to.

These efforts are promising but limited. AI minds don't necessarily map to human concepts. Some believe that we need to accept the fact that AI is a form of cognitive alien, and build safeguards that don't depend on complete understanding. This forces us to admit that we might never fully grasp what we've built. And yet, we still need to govern it.

Eliezer Yudkowsky says the paperclip maximizer was never about paperclips.

It was about an AI that prefers certain physical states -- tiny molecular spirals, not factories.

Not misunderstood goals. Just alien reasoning we'll never access.

“We have no ability to build an AI to… pic.twitter.com/nSLIBy7UiL
vitrupo (@vitrupo), May 26, 2025

Should we halt progress until we understand the inner workings of AI? Or should we continue building, while putting stronger safety rails in place? The answer lies in moving forward, but more deliberately. If we can build AI that is transparent, traceable, and aligned, then its alien way of thinking need not be a threat. It could even teach us something about intelligence itself.

What matters now is the balance between ambition and accountability. The future of AI shouldn't be a black box. It should be a system that we can understand and trust.

The lid is already off the box. What matters now is whether we dare to look inside before what's been unleashed thinks to close it behind us.
