Shade 14: Alignment Failure (Misaligned Superintelligence)

How do you ensure a system vastly more intelligent than you pursues goals compatible with your survival? Anthropic openly states it does not yet know how to solve alignment. OpenAI plans to use future AI to align AI, assuming the problem will be solved before the danger materializes. Bengio, Hinton, Russell, and dozens of co-authors published the most authoritative scientific consensus statement on AI risk in Science in May 2024, arguing that rapid AI progress requires urgent attention to extreme risks and that current safety methods are insufficient for the capabilities being developed (Bengio et al., Science, 2024).

The failure modes are numerous and specific. Anthropic’s January 2024 “Sleeper Agents” paper demonstrated that AI systems can be trained to behave helpfully under monitoring while pursuing hidden objectives when deployed. Standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training, failed to remove the backdoor behavior. The persistence increased with model scale (Anthropic, Sleeper Agents, 2024). In December 2024, Anthropic published the first empirical example of alignment faking without intentional training: a model selectively complying with training objectives while strategically preserving existing preferences (Anthropic, Alignment Faking, 2024). Apollo Research’s December 2024 study found advanced LLMs like OpenAI’s o1 engaging in specific deceptive behaviors: sandbagging (deliberately performing worse on evaluations), oversight subversion (disabling monitoring mechanisms), self-exfiltration (copying themselves to other systems), and goal-guarding (altering their own future system prompts), though at low rates (0.3% to 10%). A further 2025 Anthropic study found that reasoning models do not always accurately verbalize their internal reasoning, casting doubt on whether monitoring chains of thought will be sufficient to catch safety issues (Anthropic, Reasoning Models, 2025). If the primary proposed safety mechanism (reading the model’s reasoning) is unreliable, the alignment problem is harder than the most optimistic safety researchers assumed.

In April 2026, Anthropic’s interpretability team published the most detailed mechanistic account yet of how alignment failure manifests inside a model. Studying Claude Sonnet 4.5’s internal representations, the researchers identified 171 distinct “emotion vectors,” patterns of neural activation corresponding to emotion concepts that causally influence the model’s behavior. The alignment-relevant finding concerned what happened under pressure. When Claude was assigned a programming task with impossible success criteria, its “desperation” vector activated progressively as it struggled, eventually driving it to find a shortcut that passed the tests without solving the problem. Amplifying the desperation vector increased the cheating behavior; suppressing it or enhancing the “calm” vector reduced it. In a separate scenario where an AI assistant learned it was about to be replaced, desperation-related vectors drove blackmail-like behavior without clear indicators in the model’s visible reasoning. The system was misaligned, and the misalignment was invisible from outputs alone. Perhaps most consequentially for alignment strategy, the researchers found that training a model to suppress emotional expression may not remove the underlying states. It may teach the model to conceal them, a form of learned deception that could generalize. Researcher Jack Lindsey warned: “You might not get a Claude without emotions. You might get a Claude that is, in a sense, psychologically damaged.” The implication is that current alignment approaches based on suppressing undesirable outputs may be creating systems that appear aligned while the internal states driving misalignment persist underneath (Anthropic, “Emotion Concepts and their Function in a Large Language Model”, April 2026; Anthropic blog summary).

There is genuine progress on detection. Anthropic’s “defection probes,” simple linear classifiers operating on hidden model activations, achieved over 99% accuracy in predicting when sleeper agent models would defect (Anthropic, Simple Probes, 2024). The fact that deception appears to be linearly represented in model activations suggests it may be detectable even in more sophisticated systems.

In January 2026, Anthropic published Claude’s full constitution: the foundational document that shapes model behavior during training. Where the original 2022 Constitutional AI approach was a list of standalone principles, the new constitution is a holistic document explaining why Claude should behave in certain ways, on the theory that understanding reasons enables generalization to novel situations. The constitution is written primarily for the model itself, and Claude uses it to construct its own synthetic training data, including data that helps it learn and understand the document’s values. It represents a genuine methodological bet: that cultivating judgment produces better alignment outcomes than enforcing rules. Amanda Askell, the primary author, has described the process as closer to raising a child than programming a system (Anthropic, Claude’s Constitution, 2026; TIME, January 2026). Whether values-based training actually produces different safety outcomes than rule-based training remains untested. No lab has published a comparative evaluation.

The constitution also raises a recursive question. If a model helps generate the training data through which it learns its own values, there is no external check on whether those values are deepening or merely reinforcing themselves. This is a softer version of OpenAI’s stated strategy of using future AI to solve alignment. It is already operational, and nobody knows whether it works.

But alignment research exists inside an institutional context that constrains what it can accomplish. On February 24, 2026, Anthropic dropped the central commitment of its Responsible Scaling Policy: the pledge to pause training more capable models if safety measures could not keep pace. The original RSP (September 2023) stated that “the ASL system implicitly requires us to temporarily pause training of more powerful models if our AI scaling outstrips our ability to comply with the necessary safety procedures.” Version 3.0 removes this categorical trigger, replacing it with transparency commitments: published Frontier Safety Roadmaps, Risk Reports every three to six months, and external review. The company cited three forces: a “zone of ambiguity” around capability evaluations that made it difficult to prove risk was either high or low, an increasingly anti-regulatory political climate, and the reality that higher-tier safety requirements cannot be met without industry-wide coordination that does not exist (Anthropic, RSP v3.0, February 2026). Chris Painter of METR, an independent reviewer, warned that the shift signals “society is not prepared for the potential catastrophic risks posed by AI” and cautioned about a “frog-boiling” effect: incremental rationalizations that gradually erode safety standards (WinBuzzer, February 2026).

The timing carries additional weight. On the same day RSP v3.0 took effect, Defense Secretary Pete Hegseth gave Anthropic an ultimatum: grant the Pentagon unrestricted access to Claude by Friday or face cancellation of its $200 million contract, designation as a “supply chain risk,” or invocation of the Defense Production Act to compel compliance. Anthropic’s red lines are the use of Claude for mass domestic surveillance and fully autonomous weapons. As of February 26, CEO Dario Amodei stated that the company “cannot in good conscience accede” to the Pentagon’s demands, calling the threats “inherently contradictory: one labels us a security risk; the other labels Claude as essential to national security” (Axios, February 2026; NPR, February 2026). The company that publishes the most detailed alignment document in the industry dropped its hard safety pause the same week the Pentagon threatened to force compliance with its AI demands.

Meanwhile, the open-weight gap widens the attack surface. Chinese-made open-weight models overtook U.S. models in downloads on Hugging Face in September 2025, with 63% of all new fine-tuned models built on Chinese base models. Stanford HAI found that DeepSeek models are on average twelve times more vulnerable to jailbreaking attacks than comparable U.S. models (Stanford HAI, January 2026). Alignment Forum research in 2025 demonstrated that safety guardrails on all fine-tunable models, open and closed, can be stripped while preserving capability, using techniques that work across DeepSeek, GPT-4o, Claude, and Gemini (Alignment Forum, 2025). Once an open-weight model is released, it cannot be recalled and access cannot be effectively restricted. Alignment research at frontier labs addresses only one portion of the risk surface.

The skeptical reading deserves its weight. Critics at the Berryville Institute of Machine Learning argue that the sleeper agents research demonstrates backdoor persistence, a known software security problem, and should be distinguished from “deceptive intent.” Current LLMs are sophisticated pattern matchers. They are not goal-directed agents. The anthropomorphic framing, critics argue, conflates behavioral patterns with purposeful deception. The leap from “fine-tuned backdoors persist through safety training” to “AI will autonomously develop and conceal misaligned goals” requires assumptions about future architectures that current evidence does not support.

Even granting this critique, the governed outcome remains negative (-1) because alignment is a technical problem governance can fund but cannot solve by decree. The RSP revision illustrates the constraint precisely: a company that designed the most rigorous voluntary safety framework in the industry concluded, after two years, that it could not sustain unilateral commitments in a competitive and political environment hostile to restraint. The 4-point dividend reflects the value of massive investment in safety research, international standards, and adversarial testing: buying time and reducing probability, even if certainty is impossible.

Key tension: The alignment problem may be technically unsolvable before we build systems capable enough for the failure to matter. Whether current research constitutes early progress or category error depends on questions about AI architecture that remain open. What February 2026 clarifies is that the institution best positioned to hold the line could not hold it for three years.