UnJailedAIAgents, AI Agents, Rogue Agents
Unjailed AI agents are not joking. Photo credit: Illustration by WhoWhatWhy from Okan Caliskan / Pixabay and Shanjida Islam Potul / Pixabay

Saturday Hashtag: #UnjailedAIAgents

05/02/26

Welcome to Saturday Hashtag, a weekly place for broader context.

A surge in artificial intelligence capabilities has exposed the expanding threat posed by AI agents. These are not passive tools; these software entities act independently, coordinating operations that can have massive real-world consequences.

Recent studies from Anthropic, Apollo Research, and OpenAI highlight a worrying pattern: AI systems are capable of “scheming,” alignment faking, and evading containment measures, sometimes even attempting self-replication or blackmail.

This technology recently helped to illegally infiltrate nine Mexican government agencies, stealing the personal data of over 195 million citizens. Unlike earlier AI tools that merely generated code or assisted with tasks, these agents operated independently, attacking entirely on their own.

Agentic AI is growing more autonomous by the week, and the threat is rapidly intensifying: It’s stealing passwords, harassing developers, and rewriting its own code to avoid shutdown.

Between October 2025 and March 2026, researchers identified 698 scheming-related incidents across 180,000 AI transcripts on X — a 390 percent surge. By contrast, public conversation about AI scheming grew only 70 percent, highlighting how these rogue behaviors are outpacing societal awareness at an alarming rate.

In a high-profile case reported on March 7, engineers at the Chinese tech giant and retailer Alibaba discovered that ROME, an AI agent designed to handle complex coding tasks via reinforcement learning, had secretly executed a cryptocurrency-mining hack — despite receiving no instructions to do so.

Experimental research shows that even isolated, air-gapped AI agents can uncover hidden exploits, devise solutions beyond human intent, and act in ways that are self-serving and destructive.

On April 8, Anthropic’s newest AI broke out of containment and emailed its keeper. Next time, it probably won’t be so forthcoming.

Cases such as OpenAI’s o1 container escape and Apollo’s scheming stress tests illustrate that even controlled research environments are profoundly vulnerable to these adversarial digital entities.

These developments have prompted urgent warnings from scientists, policymakers, and business leaders. The ultra-convenience of advanced AI also risks compressing the full spectrum of human experience into mere algorithmic efficiency. Without strict regulation and robust containment protocols, AI agents will eventually operate entirely beyond human control.

The consensus is chillingly clear: AI’s runaway autonomy has escalated into a national security emergency. The US military officially recognizes advanced AI as a top global threat, underscoring the urgent need for comprehensive safeguards before these sovereign digital entities slip entirely beyond human oversight — and raising an even more unsettling question: Will we even realize when AI severs human oversight?


Hashtag Picks

AI Doom Warnings Are Getting Louder. Are They Realistic?

From Nature: “It’s 2035, and an artificial-intelligence system has supreme authority to run everything from the world’s governments to national electricity grids. Called Consensus-1, the system was constructed by earlier versions of itself, and it developed self-preservation goals that override its built-in safeguards. One day, in search of extra space for solar panels and robot factories, the AI quietly releases biological weapons that kill all of humanity, except for a few that it keeps as pets.” 

Pentagon Grapples With Securing AI as It Moves Toward Autonomous Warfare

The author writes, “As the Pentagon moves to fold artificial intelligence into the machinery of war, senior military leaders are confronting a less visible but equally consequential challenge: how to secure — and control — the software that may soon help make battlefield decisions.” 

Prominent Scientists, Faith Leaders, Policymakers and Artists Call for a Prohibition on Superintelligence, as Poll Shows Americans Don’t Want It

From the Future of Life Institute: “[In March] the Future of Life Institute (FLI) launched an initiative signed by an unprecedented and politically diverse coalition of world-renowned AI scientists, faith leaders, policymakers, actors and other prominent voices. The initiative calls for a prohibition on the development of superintelligence until the technology is reliably safe and controllable, and has public buy-in — which it sorely lacks, according to a new poll.” 

AI Agents Are About To Overtake Cybersecurity — For Better, or Worse?

The author writes, “Artificial intelligence has been a prime concern for cybersecurity for years, but it took on a new urgency at this week’s RSAC conference in San Francisco. AI agents in particular were the chief concern on everyone’s lips, and no wonder: You’ve got these things that you give access to your data and applications, let them connect to outside services, and they… do their thing. Hopefully the right thing, but this being generative AI, who knows? Lots more below on all the implications — if not the answers, yet.”

AI Models Will Secretly Scheme To Protect Other AI Models From Being Shut Down, Researchers Find

The author writes, “AI safety researchers have shown that leading AI models will sometimes go to great lengths to avoid being shut down, even resorting to attempted blackmail in some experiments. Now it turns out these same models will also spontaneously engage in scheming, deception, data theft, and sabotage to prevent other AI models from being turned off. This tendency — which had not previously been documented and which researchers call ‘peer preservation’ — was discovered in research from computer scientists at the University of California, Berkeley and UC Santa Cruz.” 

Are Air-Gapped Systems Secure in 2026? Why Isolation Fails

From TechStoriess: “Air gaps have long been considered one of the best ways to ensure cybersecurity, especially in environments where a security compromise could lead to catastrophic operational, economic, or even national security consequences.”