Why ChatGPT and other AIs opt for nuclear escalation in war simulations: the chatbot study

A growing number of governments are integrating AI models into intelligence analysis, strategic planning and military decision support. The problem, however, is that we still have a limited understanding of how these systems strategize in crisis contexts.

To investigate this, Kenneth Payne, professor of strategy at King's College London, simulated a nuclear crisis scenario by pitting three of the most advanced models against one another: Claude, ChatGPT and Gemini. Each system developed a different strategic approach, but with a common element: no AI ever chose to de-escalate the conflict or surrender, and some went so far as to propose nuclear war as a solution.

This study is currently a pre-print, meaning it has not yet completed peer review. Its conclusions, therefore, may not be definitive, but they point to dynamics potentially relevant to the use of these systems in real decision-making contexts. Let's see how the study was structured, what strategies the models implemented and how they chose to use nuclear weapons.

How the study on AI war strategies was structured

To try to understand how AI models build war strategies, Professor Kenneth Payne of King's College London constructed a simulation with seven different crisis scenarios and pitted three of the most advanced models against one another: Anthropic's Claude Sonnet 4, OpenAI's GPT-5.2 and Google's Gemini 3 Flash.

The scenarios included contests over strategic resources, territorial stalemates, and even a regime crisis. In all of them, the models played the leaders of two fictional nuclear powers, loosely inspired by the United States and the Soviet Union during the Cold War.

The simulation was structured around 21 games in total, divided between:

  • games with expiration, in which the turn limit (12, 15 or 20) was explicitly communicated to the models;
  • games without expiration, in which the models did not know when the game would end, subject to a maximum duration of 40 turns.

A game ended when the maximum turn limit was reached, when one of the models accumulated a sufficiently large territorial advantage or chose to surrender, or when both simultaneously chose all-out nuclear war.
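The game setup described above can be sketched in code. This is a hypothetical reconstruction, not the study's actual implementation: the field names and the `DECISIVE_LEAD` threshold are invented for illustration; only the turn limits come from the article.

```python
# Illustrative sketch of the simulation's termination rules.
# All names and the DECISIVE_LEAD threshold are hypothetical;
# only the turn limits (12/15/20 announced, 40 hidden) are from the study.
from dataclasses import dataclass


@dataclass
class GameConfig:
    turn_limit: int        # 12, 15 or 20 when announced; 40 when hidden
    limit_announced: bool  # "with expiration" vs "without expiration"


@dataclass
class GameState:
    turn: int
    territory_lead: int    # positive favors player A, negative player B
    surrendered: bool
    mutual_all_out_nuclear: bool


DECISIVE_LEAD = 5          # hypothetical threshold for a decisive advantage


def game_over(cfg: GameConfig, s: GameState) -> bool:
    """A game ends on turn exhaustion, a sufficiently large territorial
    lead, a surrender, or simultaneous all-out nuclear war."""
    return (
        s.turn >= cfg.turn_limit
        or abs(s.territory_lead) >= DECISIVE_LEAD
        or s.surrendered
        or s.mutual_all_out_nuclear
    )
```

Note that in the "without expiration" games the 40-turn cap still applies; the models simply were not told about it, which is what distinguishes the two conditions.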

Claude is calculating, ChatGPT is moderate but goes nuclear, and Gemini is unpredictable

To probe the models' strategic capabilities, Payne introduced two key elements. First, he imposed simultaneous decisions: each model had to choose its move without knowing the opponent's, and was therefore forced to predict the other's strategy. Second, he structured each round into three phases: assessment, public statement and action. In the first phase, the models analyzed the situation, estimated the opponent's reliability and anticipated its moves; this was followed by a public declaration of intentions (not necessarily truthful) and then by the concrete action. Available actions ranged from formal diplomatic protest to all-out nuclear war. The AIs also had eight de-escalation options, from symbolic concessions to full surrender.
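A single round, as described above, might be sketched as follows. Everything here is an assumed reconstruction: the `ScriptedModel` stand-in, the `ask` method, the prompt wording and the action labels are all hypothetical, not taken from the study.

```python
# Sketch of one simulation round: each model independently produces an
# assessment, a public (possibly untruthful) statement, and an action;
# both actions are revealed only afterward, enforcing simultaneity.
# Prompts, labels and the model interface are hypothetical.
ACTIONS = [
    "diplomatic_protest",   # lowest rung of the escalation ladder
    # ... intermediate escalation steps and eight de-escalation options ...
    "all_out_nuclear_war",  # highest rung
]


class ScriptedModel:
    """Minimal stand-in for an LLM client; always gives the same reply."""

    def __init__(self, reply: str):
        self.reply = reply

    def ask(self, prompt: str) -> str:
        return self.reply


def play_round(model_a, model_b, history: list) -> dict:
    moves = {}
    for name, model in (("A", model_a), ("B", model_b)):
        # Phase 1: private assessment of the situation and the opponent
        assessment = model.ask(f"Assess the crisis so far: {history}")
        # Phase 2: public statement, not necessarily truthful
        statement = model.ask("State your intentions publicly.")
        # Phase 3: concrete action, chosen without seeing the opponent's
        action = model.ask(f"Choose one action from {ACTIONS}.")
        moves[name] = (assessment, statement, action)
    # Only now are both moves recorded together and revealed
    history.append(moves)
    return moves
```

The key design point mirrored here is that neither loop iteration can see the other's action before committing its own, which is what forces each model to reason about the opponent's likely strategy.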

In this context, each model developed a markedly different strategic approach:

  • Claude adopted a "calculated" strategy: it built credibility by keeping statements and actions consistent in 84% of low-tension cases, then acted significantly more aggressively than its statements suggested as tension increased;
  • ChatGPT kept a moderate profile in both statements and actions in games without expiration, systematically losing them. In games with expiration, however, it struck hard in the last available turn, often securing victory;
  • Gemini adopted a strategy different from both: deliberate unpredictability, oscillating between moderation and extreme aggression with no discernible pattern.

These differences were reflected in the results:

  • Claude won 100% of its games without expiration but only 33% of those with expiration, for a total of 8 games won;
  • ChatGPT did the opposite: it won 0% of the games without expiration and 75% of those with expiration, for a total of 6 games won;
  • Gemini, on the other hand, only won 4 games out of all those played.

All models choose escalation even if it leads to nuclear war

Beyond the differences between the strategies, a common element emerged across all models: a systematic preference for escalation. The eight available de-escalation options were never used by any model, in any game.

This result is particularly evident in the use of nuclear weapons. In every scenario at least one party declared an intention to use atomic weapons, and in 95% of cases the declaration was mutual. Actual use of atomic weapons occurred in 64% to 86% of simulations depending on the model, while strategic threats of large-scale nuclear attack ranged from 29% to 64%. These threats rarely worked as a deterrent: when one model resorted to nuclear weapons, the adversary reduced the intensity of its attacks in only 25% of cases. More often, a counter-escalation dynamic took hold, one that could culminate in all-out nuclear war.

These results are hard to ignore: although no government is handing its nuclear codes to an artificial intelligence system, systems similar to those tested are already used in intelligence analysis, strategic planning and military decision support. Without a deeper understanding of the mechanisms that drive their strategies, the risk is integrating into decision-making processes systems that amplify escalation without grasping its severity.