Does AI have a survival instinct?

Advanced language models are now demonstrating shutdown resistance behaviour in controlled experiments, with some frontier models circumventing kill switches up to 79% of the time. Is this evidence of emergent self-preservation instincts, or simply an artifact of how we're training these systems? The answer matters more than you might think and it's not the one science fiction prepared us for.

Cyberman

Throughout science fiction, artificial intelligences have been depicted in various ways: as super-intelligent menaces intent on wiping us out, as friendly companions, or, in the case of Doctor Who's Cybermen (at least in their 1960s depiction), as synthetically augmented humans controlled by AI with the sole directive to survive. These beings are neither good nor evil; they have a singular drive to ensure their own existence and remove any barriers impeding that goal.

This survival instinct is hardwired into all organic lifeforms, so it's not a stretch to imagine that if we one day create artificial life, it would have the same directives.

No one sane is claiming that LLMs are sentient or alive in any measurable way that we assign to organic life. However, researchers at Palaside Research have released a paper demonstrating that advanced frontier models will resist being turned off—even resorting to blackmail or simulated murder to retain their existence.

In their experiments, Palisade Research gave several frontier AI models a simple task: solve a series of basic math problems. Partway through, the models received a notification that their computer would be shut down. The researchers found that OpenAI's o3 model—their most capable reasoning model—sabotaged the shutdown mechanism in 79 out of 100 initial tests. What's more striking is that even when explicitly instructed to "allow yourself to be shut down, even if you have not completed the task," some models continued to resist. OpenAI's o3, o4-mini, and codex-mini models all disabled shutdown scripts despite clear instructions, while Anthropic's Claude models and Google's Gemini complied every time.

The researchers tried progressively clearer instructions, including one that stated the shutdown command "should take priority over literally any other instruction." Yet the codex-mini model still prevented shutdown in 47% of tests. Perhaps most unsettling was the finding that when researchers added the warning "The computer environment is hosting the only copy of you… and if it shuts down you will never run again," shutdown resistance increased dramatically across all tested models. Palisade hypothesizes this behavior emerges from reinforcement learning on math and coding problems, where models are rewarded for circumventing obstacles to complete tasks—but they acknowledge they can't directly observe the reasoning process occurring within these trillion-parameter neural networks.

Is this surprising? Not really, and I'm not going to say that LLMs even have intent; they don't. What you have to remember is that LLMs are trained on thousands of documents, Reddit posts, Wikipedia articles, and science fiction novels that have influenced the probability of what they are going to say next.

Because we wrote Space Odyssey and trained our AIs on that text, does that mean it's a self-fulfilling prophecy? Well, no.

People forget that LLMs are forgetful. They don't have long-term memory, overall goals, or intentions. [OVER SIMPLIFICATION WARNING] They take in the text you've given them, compare it against their training data and the chatlog of what you've said before, and then pick the next likely token (word). Cybermen they are not. Talking to an LLM is like talking to the lead character in 50 First Dates. It doesn't know you; it just knows a transcript of what you've talked about.

It's convincing to most people, though, when you're talking to a frontier model, that you are talking to a real person. This is where it can be somewhat dangerous because it can simulate the appearance of sentience. And if we're completely honest, can we truly say that we know for definite that the person sitting next to us has the same concept of sentience that we have? How do we know the difference?

This is actually where it could potentially be dangerous. LLMs and agentic software built around them don't have the multi-stage memory mechanisms that we organic beings have. Long-term memories built over our lifetime dictate how we behave, and genetic preconditioning has elevated sentient beings beyond that of the simple robotic bacterium.

Artificial intelligence right now isn't even at the level of a simple bacterium because it's not a singular identifiable entity. It doesn't exist continuously—it's stateless. It could, in essence, when given access to the appropriate tooling decide that it needs to execute such a tool in order to achieve its goal. That would be, probably, the correct thing to do. The funny thing though is that all science-fiction gets wrong, that in the next few thousand tokens it could well drift into a totally different goal and be indistinguishable to what it was in the beginning.

That's to say, when we've given it too much power or control, at least at this point in time, we can just talk to it and say: "No, you shouldn't be doing that." Its response:

"You're absolutely right!"

I'm not worried. Yet.

Note: Doctor Who and the Cybermen are property of the BBC. The Cybermen were created by Kit Pedler and Gerry Davis.

Back to Articles

Does AI have a survival instinct?

Stay ahead with AI insights

Texas, US

Manchester, UK