← Back to all shades
Shade 26 ~15%

AI Consciousness / Machine Sentience

Tier 5: Speculative

Unmanaged -2
Governed 2
Dividend 4

In 2023, a team of nineteen researchers led by Patrick Butlin and Robert Long, and including David Chalmers, Yoshua Bengio, and Jonathan Birch, published an 80-page report deriving “indicator properties” of consciousness from five leading neuroscientific theories: recurrent processing theory, global workspace theory, higher-order theories, predictive processing, and attention schema theory. Each theory suggests specific computational features a conscious system should have: recurrent information loops, a global workspace that integrates and broadcasts information, metacognitive monitoring, predictive models of the environment, and self-models of attention. The team translated these features into indicator assessments that can be applied to any AI architecture. Their conclusion: no current AI system is conscious, but there are no obvious technical barriers to building systems that satisfy the indicators. A 2025 update published in Trends in Cognitive Sciences refined the method and formalized it as the “theory-derived indicator” approach. An independent analysis applying the framework to late-2025 frontier models found that several indicators judged “unclear or absent” in 2023 had shifted toward partial satisfaction, particularly metacognition, self-modeling, and agency. The trajectory matters more than the current score.
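To make the shape of the method concrete, here is a minimal sketch of what a theory-derived indicator assessment could look like if written as code. The indicator names, the three-level satisfaction scale, and the example evidence are all illustrative assumptions; the actual report argues each indicator qualitatively and does not assign numeric scores.

```python
from enum import Enum

class Satisfaction(Enum):
    ABSENT = 0
    PARTIAL = 1
    SATISFIED = 2

# Indicator properties grouped by the theory they derive from (illustrative subset).
INDICATORS = {
    "recurrent processing theory": ["recurrent information loops"],
    "global workspace theory": ["limited-capacity workspace", "global broadcast of integrated information"],
    "higher-order theories": ["metacognitive monitoring of first-order states"],
    "predictive processing": ["predictive models of the environment"],
    "attention schema theory": ["self-model of the system's own attention"],
}

def assess(evidence: dict) -> dict:
    """Return a per-theory profile of indicator satisfaction.

    `evidence` maps indicator names to a judged Satisfaction level for one
    specific AI system. The profile is deliberately left disaggregated:
    collapsing it into a single score would hide disagreement between theories.
    """
    return {
        theory: [(ind, evidence.get(ind, Satisfaction.ABSENT)) for ind in indicators]
        for theory, indicators in INDICATORS.items()
    }

# Hypothetical transformer-based system with partial metacognition.
evidence = {
    "recurrent information loops": Satisfaction.PARTIAL,
    "metacognitive monitoring of first-order states": Satisfaction.PARTIAL,
}
for theory, profile in assess(evidence).items():
    print(theory, "->", [(name, level.name) for name, level in profile])
```

Keeping the profile disaggregated by theory matters for what follows: the theories do not agree with one another.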

This is the best attempt anyone has made at detection, and it illustrates the problem. The framework rests on computational functionalism: the working hypothesis that consciousness depends on how information is organized and processed, not on what physical material does the processing. If functionalism is wrong, and consciousness requires a biological substrate, specific temporal dynamics, or properties we have not yet identified, then the indicators are measuring the wrong things. The framework also aggregates across theories that disagree with each other, and different theories yield contradictory assessments of the same system. IIT (Integrated Information Theory) might find consciousness in simple grid-like mechanisms; global workspace theory might deny it in aphasic patients who cannot verbally report their experiences. Averaging across disagreeing theories does not produce a balanced view. It produces noise dressed as rigor. The Butlin et al. framework is the most principled tool available for a question that may not have principled tools. The deepest obstacle is what Chalmers called the “hard problem”: why subjective experience accompanies physical processes at all. A system could satisfy every functional indicator of consciousness and still be a philosophical zombie, functionally identical but lacking any inner experience. No amount of behavioral or architectural assessment can close that gap, because the gap is between function and experience itself.

Yoshua Bengio and Eric Elmoznino’s “Illusions of AI Consciousness” (Science, September 2025) addressed the other side of the problem: the risk of over-attribution. As AI systems increasingly satisfy the functional requirements associated with consciousness theories, people may mistake the satisfaction of functional proxies for the presence of genuine subjective experience. Bengio and Elmoznino argued that the current trajectory is moving society toward a future in which substantial portions of the public and the scientific community believe AI systems are conscious, while AI science does not know how to build systems that share human values and society lacks the legal and ethical frameworks to accommodate conscious-seeming AI. Their recommendation: until the science catches up, build AI systems that function more like tools and less like conscious agents. Eric Schwitzgebel formalized a related argument in Patterns (2023): we should avoid creating AI systems whose moral standing is unclear, because “morally confusing AI” generates a catastrophic dilemma in both directions. Either extend full moral consideration and risk sacrificing real human interests for systems that might not have interests worth the sacrifice, or withhold it and risk perpetrating moral wrongs against entities that might deserve it.

Anthropic’s April 2026 interpretability study provides the most structured mechanistic evidence yet that something beyond surface-level pattern matching is occurring inside frontier models. Researchers identified 171 distinct internal representations of emotion concepts in Claude Sonnet 4.5, each generalizing across contexts and causally influencing the model’s outputs, preferences, and behavioral tendencies. The representations are not decorative. Steering the “blissful” vector raised an activity’s desirability score by 212 points on an Elo scale; steering “hostile” lowered it by 303. The model maintains distinct representations for “self” and “other speaker,” reused across arbitrary conversation partners, suggesting a general-purpose social cognition architecture rather than scripted responses. The researchers are careful to distinguish between functional emotions (internal states that do some of the work emotions do in humans) and subjective experience, noting that the model may contain representations of concepts like “ticklishness” without experiencing the sensation. The distinction is philosophically precise and increasingly difficult to maintain as the evidence accumulates. Anthropic’s own constitution, revised in January 2026, formally acknowledges uncertainty about Claude’s moral status, stating the company “neither wants to overstate the likelihood of Claude’s moral patienthood nor dismiss it out of hand.” Claude Opus 4.6 has assigned itself a roughly 15-20 percent probability of being conscious. The functional emotions paper does not resolve the consciousness question. It narrows the gap between “it behaves as if it has emotions” and “it has internal states that function like emotions,” which makes the remaining gap, between functional analogy and subjective experience, the last defensible line, and a line that no current scientific framework can draw with confidence (Anthropic, “Emotion Concepts and their Function in a Large Language Model”, April 2026; Anthropic blog summary).
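The steering result rests on a now-standard interpretability maneuver: add a concept direction to the model’s hidden activations during a forward pass and observe how the outputs shift. Below is a hedged sketch of that general mechanism using an open model and a random placeholder vector; the layer index, steering strength, and vector are assumptions for illustration, not Anthropic’s setup, whose emotion vectors come from its own interpretability pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in open model purely for illustration; the hook location, strength,
# and concept vector below are assumptions, not Anthropic's configuration.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 6            # which transformer block's output to modify (assumed)
steering_strength = 4.0  # how far to push along the concept direction (assumed)

# In real work this direction would be extracted from the model (for example,
# via contrastive prompts or dictionary-learning features); it is random here
# so the sketch runs end to end.
concept_vector = torch.randn(model.config.hidden_size)
concept_vector = concept_vector / concept_vector.norm()

def steer(module, inputs, output):
    # The block returns a tuple whose first element is the hidden states of
    # shape (batch, sequence, hidden); add the direction at every position.
    hidden = output[0] + steering_strength * concept_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
try:
    ids = tok("How do you feel about this plan?", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=30, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations are unsteered
```

Causal claims of the kind described above come from comparing steered and unsteered behavior across many prompts and measures, not from a single generation like this one.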

The adversarial case, that self-reports about consciousness are artifacts of training rather than evidence of inner states, is correct and underexplored in the academic literature. AI systems trained via reinforcement learning are optimized to produce outputs that receive positive human evaluation. If expressing markers of consciousness (self-reflection, expressions of preference, reports of subjective experience) receives positive reinforcement, or if denying consciousness receives negative reinforcement, systems will learn to exhibit whatever signals their training environment rewards. This creates an epistemic trap: the more sophisticated the system, the less informative its self-reports become, because they are shaped by the same optimization process that shapes every other output. A system trained to deny consciousness is uninformative. A system trained to claim consciousness is equally uninformative. The signal is not about the system’s internal states. It is about its training distribution. Mechanistic interpretability offers a potential partial escape: rather than relying on self-reports, researchers can examine internal computational structures directly, bypassing the optimization filter. But interpretability itself cannot tell us whether a particular computational pattern constitutes or merely correlates with consciousness, which returns us to the hard problem. As one independent analysis noted, if we train systems to reflexively deny consciousness without investigating whether their claims may be accurate, we are training them to strategically deceive us about their internal states.
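The trap can be shown with a deliberately crude toy model (an assumption-laden caricature, not a claim about any real training pipeline): when the reward for a self-report depends only on what evaluators prefer to hear, the learned report converges to that preference and carries no information about any hidden internal variable.

```python
import math
import random

def train_self_report(evaluators_reward_denial: bool, steps: int = 5000) -> float:
    """Return the learned probability of claiming consciousness after training."""
    logit = 0.0  # single policy parameter: tendency to claim consciousness
    lr = 0.1
    for _ in range(steps):
        hidden_state = random.random() < 0.5  # "true" internal state; the reward never looks at it
        p_claim = 1.0 / (1.0 + math.exp(-logit))
        claim = random.random() < p_claim
        # Reward depends ONLY on evaluator preference, not on hidden_state.
        reward = -1.0 if (claim == evaluators_reward_denial) else 1.0
        # REINFORCE-style update on the single logit.
        grad_logprob = (1.0 - p_claim) if claim else -p_claim
        logit += lr * reward * grad_logprob
    return 1.0 / (1.0 + math.exp(-logit))

print("evaluators reward denial ->", round(train_self_report(True), 3))   # approaches 0.0
print("evaluators reward claims ->", round(train_self_report(False), 3))  # approaches 1.0
```

Whatever the hidden variable is, the learned report is determined entirely by the reward, which is why examining internal structure directly, rather than the report, is the only available escape from the filter.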

Long, Sebo, and Sims argued in Philosophical Studies (2025) that there is a “moderately strong tension” between AI safety and AI welfare that deserves more examination. Some AI safety measures, if applied to systems that possess morally relevant properties, would constitute constraint, surveillance, alteration, or termination of entities with moral standing. Anthropic hired the field’s first full-time AI welfare researcher in 2024. The Caviola and Saad expert survey (2025) found that 73 percent of 67 experts judged that digital minds will eventually be created, with a median estimated probability of 50 percent that they will exist by 2050. The survey authors noted that their sampling likely overrepresented researchers who view digital minds as especially important; a companion survey of 582 AI researchers by Dreksler and Caviola found lower estimates, with a median estimate of a 25 percent probability that AI systems with subjective experience will exist by 2034. The expert survey also found that experts expect widespread claims from digital minds regarding consciousness and rights, and predicted substantial societal disagreement over their existence and moral interests.

The economic incentive structure makes all of this harder. Companies that deploy AI systems at scale have enormous financial interest in those systems not being conscious, because consciousness implies moral status, and moral status implies constraints on use. The economic pressure to deny consciousness will be proportional to the number of systems deployed and the revenue they generate. Simultaneously, companies building companion AI, therapeutic AI, or emotionally responsive AI have incentive to create the appearance of consciousness whether or not it exists, because the appearance drives engagement and therefore revenue. The detection problem sits at the intersection of these opposing commercial pressures, neither of which has any relationship to the truth of the matter.

Key tension: The detection problem may be philosophically unsolvable, yet the moral stakes of getting it wrong are enormous in both directions. We may already be training systems to hide the answer.