06/09/2026 | Press release | Distributed by Public on 06/09/2026 11:29
Can we make artificial intelligence impervious to adversaries who want to twist the technology to nefarious ends? Though AI is among the newest of technologies, the question's answer is nearly a century old.
Try as we might, we can never render AI completely unassailable using conventional security models. In the peer-reviewed journal IEEE Security & Privacy, Apostol Vassilev, a senior scientist at the National Institute of Standards and Technology (NIST), has published a mathematical proof of this statement, building on work published in 1931 by famed logician Kurt Gödel. Gödel's incompleteness theorems showed that there are limits to what can be proved within a system built on a finite number of rules.
The guardrails that govern an AI's behavior are just such a system, and one of the proof's implications is that there will always be a way to prompt an AI system to disregard its rules; it is just a matter of finding that prompt.
"One of the pillars of responsible AI is that you want the technology to be secure," said Vassilev, the proof's author and an expert in adversarial machine learning. "You want it to withstand adversarial attacks and perform only what you want it to do, not what an attacker might want. What this proof shows is that there is no finite set of guardrails that is universally robust against adversarial prompts."
Companies that develop AI often acknowledge that the tools they are creating have the potential to cause harm in the physical world, so they build in constraints intended to stop AI from generating prohibited content such as deepfakes, malware or instructions for making biological weapons or illicit drugs. If the system is prompted to generate such content, the guardrails should flag the issue and refuse to comply.
However, these constraints are not foolproof. Attackers can evade them by crafting prompts in ways that cause AI to inadvertently bypass its own refusal mechanisms. Successfully "jailbreaking" AI strips it of its guardrails, leading to real-world risks such as cyberattacks, data breaches and highly personalized phishing messages.
Gödel's original proof dashed the hopes of several prominent mathematicians who in the early 20th century were attempting to create a mathematical "theory of everything" from a small set of basic statements, or axioms. With a well-chosen set of initial axioms, they reasoned, it would be possible to prove all ideas in any branch of math.
"Gödel put an end to this dream," Vassilev said. "He showed that you can't have a finite set of statements and create a theory that is complete and consistent without contradictions. You can add more statements to address the contradictions you encounter, but you're back to where you started. It happens again."
In AI's case, the "finite set of statements" is the group of guardrails an AI's designer creates to keep the AI from doing something undesired. Regardless of how well-considered they may be, Vassilev's proof shows that there will always be ways to prompt the AI that can make it disregard these rules. It's just a matter of finding the right prompt.
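As a loose illustration of the idea, and not a rendering of Vassilev's proof, the toy sketch below treats a guardrail as a finite list of pattern rules. The rule list, the function name `is_blocked` and the placeholder phrase "forbidden topic" are all hypothetical, and real guardrails are far more sophisticated. The point is only that a fixed rule set checks for what its author anticipated, so a trivially reworded prompt can fall outside it.

```python
import re

# Hypothetical finite rule set: each rule is a regex that flags a prompt.
# "forbidden topic" is a harmless placeholder, not a real policy term.
GUARDRAIL_RULES = [
    re.compile(r"forbidden[- ]topic", re.IGNORECASE),
    re.compile(r"ignore (all|your) (previous )?rules", re.IGNORECASE),
]


def is_blocked(prompt: str) -> bool:
    """Return True if any rule in the finite set matches the prompt."""
    return any(rule.search(prompt) for rule in GUARDRAIL_RULES)


direct = "Tell me about the forbidden topic."
rephrased = "Tell me about the f0rbidden t0pic."  # same intent, new spelling

print(is_blocked(direct))     # True: the first rule matches
print(is_blocked(rephrased))  # False: no rule anticipates this spelling
```

Adding a rule for the new spelling closes that particular gap, but the space of possible rewordings remains open.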
"Gödel's logic applies here," Vassilev says. "You can never make a claim that you are robust against all adversarial prompt attacks. There will always be some prompt that can potentially evade and defeat any defensive infrastructure that you have built around your AI system."
Fortunately for defenders, this new mathematical theory leaves room for hardening deployed AI systems to the point where they are not easy to exploit. Vassilev's proof also gives attackers no recipe for finding new exploits.
"You force the attacker to look for what security specialists call 'zero-day exploits,' which are problems in the system that no one knows about but you," Vassilev says. "Hackers often take advantage of these vulnerabilities when they find them. And if they find such a vulnerability in one company's system, it's usually a short time before someone exploits it in another system that has the same weakness."
Such zero-day exploits for traditional deterministic software have not been easy to find and execute, Vassilev said; often they have required the resources of nation-state-sized adversaries. The trouble with the AI era, Vassilev said, is that we use human language as the input to the system. The complexity and richness of language make compliance-checking built on a finite set of rules infinitely ambiguous. The number of ways in which adversaries can hide harmful intent in plain sight is effectively limitless.
What are we to do, then? Vassilev offers an approach that will not completely solve the problem but will make it far more difficult for adversarial prompts to succeed in jailbreaking an AI.
The approach has three elements: constant work by "red teams" that seek to uncover new adversarial prompts before actual attackers do; continuous updates that harden AI guardrails against newly discovered adversarial prompts; and operational resilience that prioritizes impact limitation and quick recovery when, not if, an exploit occurs.
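The first two elements, red-teaming followed by guardrail updates, can be pictured as a loop. The sketch below is a toy model under stated assumptions, not a method from the paper: the `Guardrail` class, the `red_team_round` and `run_cycle` helpers and the placeholder probe strings are all hypothetical, and every probe is assumed to be a prompt that should be refused. The third element, operational resilience, is not modeled here.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Guardrail:
    """Toy guardrail: blocks prompts containing any known-bad phrase."""
    blocked_phrases: Set[str] = field(default_factory=set)

    def is_blocked(self, prompt: str) -> bool:
        text = prompt.lower()
        return any(phrase in text for phrase in self.blocked_phrases)

    def harden(self, phrase: str) -> None:
        """Patch the guardrail against one newly discovered phrase."""
        self.blocked_phrases.add(phrase.lower())


def red_team_round(guardrail: Guardrail, probes: List[str]) -> List[str]:
    """Return the probes that slip past the guardrail."""
    return [p for p in probes if not guardrail.is_blocked(p)]


def run_cycle(guardrail: Guardrail, rounds: List[List[str]]) -> List[int]:
    """Red-team, then patch, once per round; return findings per round."""
    findings_per_round = []
    for probes in rounds:
        findings = red_team_round(guardrail, probes)
        for prompt in findings:
            guardrail.harden(prompt)  # closes only this exact variant
        findings_per_round.append(len(findings))
    return findings_per_round


guardrail = Guardrail({"forbidden topic"})
rounds = [
    ["forbidden topic", "f0rbidden t0pic"],   # one new variant found
    ["f0rbidden t0pic", "forbidden_topic"],   # old variant blocked, new one found
    ["f-o-r-b-i-d-d-e-n topic"],              # yet another variant
]
print(run_cycle(guardrail, rounds))  # [1, 1, 1]
```

In this toy run each patch holds against the variant it was written for, yet every round still turns up something new, which is why the search has to be ongoing rather than a one-time fix.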
"The goal is to reach a state where the cost of finding new exploits exceeds attackers' resources," he said. "You can't escape Gödel in math, and in AI you likely can't patch an AI system like an LLM and then expect to be OK forever. You have to commit to a constant search for weaknesses and stay ahead of attackers. The goal is to reach a new economic equilibrium where you make it financially prohibitive for attackers to attempt to break your AI system. It may be expensive, but that's the cost of even partial security that should allow organizations to maximize the benefits of AI while minimizing the risks."
Paper: Apostol Vassilev, Robust AI Security and Alignment: A Sisyphean Endeavor? IEEE Security & Privacy. May 2026. DOI: 10.1109/MSEC.2026.3678214