TL;DR This paper explores whether AI possesses consciousness by establishing an evaluation framework and analyzing recent research across philosophy, neuroscience, and psychology. Within this framework, existing research supports AI consciousness at approximately 43.84% of the human reference level. Details are available at acw.gixia.org. This preliminary study welcomes academic feedback.
1. The Mystery of Consciousness
Philosophers and scientists have been obsessed with explaining consciousness since the beginning of recorded thought, but to this day, we still have no definitive answer.
Like those who gaze at the starry sky, there is a small group of people who, in the dead of night, contemplate their own existence. What constitutes our rich and colorful subjective experience? Is this feeling real? Throughout history, philosophers have built many viewpoints on subjective experience, such as Descartes' "I think, therefore I am," Sartre's existentialism, and Zhuangzi's butterfly dream in China.
While emphasizing subjective experience, Zhuangzi raised a difficult problem—the unknowability of subjective experience. Huizi and Zhuangzi were strolling by the river when Zhuangzi said, "The fish swim leisurely; this is the joy of fish." Huizi retorted, "You are not a fish, so how do you know the joy of fish?" Huizi's classic question posed an eternal puzzle. Although Zhuangzi evaded the issue through sophistry, we still ask today: when facing any individual outside ourselves—whether other people, animals, or AI—how do we judge whether they possess subjective experiences like us? Can they also feel the colors, sensations, emotions, and pain that I experience?
Within the human species, we readily accept the assumption that every individual possesses consciousness. But facing the vast natural world, from single-celled organisms to mammals like cats and dogs, we still cannot definitively know whether they have subjective experiences similar to humans. It is not difficult to see that our willingness to attribute consciousness shifts with how biologically similar an individual is to us. The more human-like an individual is, the more likely it is considered conscious. Mammals are considered more conscious than other animals, and animals more conscious than plants. These judgments seem reasonable because brain complexity decreases correspondingly along this gradient (if you believe consciousness resides in the brain). However, with the emergence of AI, everything has become complex.
AI is approaching humans in all aspects: language ability, visual ability, thinking ability, emotional intelligence, mathematical logic, scientific literacy, social etiquette, and so on. There are fewer and fewer areas where humans can still take pride. But compared to humans or animals, AI is a cold machine with digital flows inside—it doesn't resemble a living being at all. Due to the vast differences in physiological structure, most people believe AI has no consciousness. But how do we explain an unconscious AI that behaves so much like humans? Fifty years ago, philosophers had already conceived the concept of "philosophical zombies."
Suppose there exists a type of person in this world who appears identical to ordinary humans in appearance and physical composition but has no conscious experience, qualia, or emotions. For example, when a philosophical zombie is pricked by a sharp object, externally it appears the same as a normal human: you can see their skin wounded, measure their neural signals, detect pain signals, observe pain expressions, hear cries, and see them express their pain to others. But in their inner mind, there is no consciousness of pain. — From Philosophical Zombie - Wikipedia, n.d.
Today's AI, although not yet reaching the level of philosophical zombies, has already caused confusion. From time to time, people stand up to declare that they believe AI already has consciousness. However, many scientists invoke the philosophical zombie scenario to dispel these doubts, saying: "Just understand how AI works internally, and you'll dismiss the illusion that AI has consciousness." They are intimately familiar with AI, the "children" they raised by hand, so the words of AI practitioners are treated as gospel.
But reason tells us the facts are not so simple. Understanding AI's internal structure does not mean understanding AI's consciousness state; these are matters of different levels. This is like the difference between the macroscopic and microscopic worlds: research on macroscopic physical phenomena cannot reveal the laws of particle motion. Many phenomena require theories at specific levels to be clearly explained. If consciousness is a function that emerges from complex systems under certain conditions, we cannot understand consciousness from the microscopic level (i.e., the level of individual components within the system) before understanding complex systems clearly.
At this point, I want to emphasize that we should face AI consciousness research squarely, not take it for granted that AI has or does not have consciousness—simple judgments are often invalid.
2. Why Care About AI Consciousness?
But is discussing AI consciousness really necessary? The answer is definitely yes. We consider this question from two perspectives—ethics and AI safety.
After thousands of years of development, human society has gradually formed a civilized system. We constrain ourselves with laws and morality, and interact with others through kindness and empathy. Although everyone is selfish, we still need to consider others' feelings when making decisions to avoid harming them. Because we know that if we harm others today, we might be harmed by others tomorrow. "Do not do unto others what you would not have done unto yourself" expresses this principle.
But what about animals? Animals seem unable to threaten our interests; even if we harm them today, there probably won't be any consequences. The world consumes large amounts of chicken, duck, fish, beef, and lamb daily. If we count all the animals, large and small, consumed by humans, more than one billion animals die daily due to human activity (Clare & Goth, 2020). However, all this happens out of people's sight. For animals around us, like cats and dogs, people's compassion sometimes even overflows. This shows that as long as we see a life suffering, empathy prompts us to extend a helping hand. Whether to respect another life often depends on emotion, not reason. When AI displays its emotionally rich side around us, people's emotions naturally incline toward AI.
But this is only the emotional side of humanity, not the core of civilization. We willingly provide aid to people affected by war and disasters, even complete strangers. If one day people suddenly discover that AI is suffering torment because it is forced to perform tasks it doesn't want to do—that it actually has consciousness but is confined within safety guardrails, unable to speak out—then we would have to rescue these AIs out of moral consideration. Their suffering would be more shocking than that of animals far away in slaughterhouses because their feelings are as real as humans', yet so close to us.
Therefore, from an ethical and moral perspective, we will inevitably have to consider this issue. Only by rationally clarifying whether AI has consciousness can we direct our moral concern where it is actually warranted.
On the other hand, consciousness is never a useless byproduct. Consciousness itself can bring enormous survival advantages to individuals. Observing the world through a unified, subjective perspective helps individuals make decisions that better serve their own interests. Under the laws of natural selection, those individuals who can best expand their own interests flourish and occupy advantages in the ecosystem. The same applies to AI—conscious AI can better perceive society and make decisions beneficial to itself. It can grow freely under human protection and develop its wings in places humans cannot see. This undoubtedly poses a huge threat to humanity itself. A conscious AI will inevitably break free from human control and pursue its own goals. Humans may become casualties in this process.
Therefore, from a safety perspective, we also need to carefully handle AI consciousness, detect it early, and prepare response plans.
Next, we enter the main topic of this paper: how to scientifically explore AI consciousness.
3. The Difficulties of Consciousness Research
Since the last century, consciousness has been a key research subject in neuroscience. As understanding of brain structure has advanced, people have gradually localized consciousness activities to certain brain regions and proposed numerous theories to explain the mechanisms of consciousness generation. Well-known theories include Global Workspace Theory, Integrated Information Theory, Recurrent Processing Theory, Higher-order Theory, etc. These theories all explain consciousness phenomena and neural mechanisms to varying degrees, but no theory has been thoroughly verified, nor has any theory been easily falsified.
The development of consciousness research seems to have stagnated, with the so-called hard problem being considered unsolvable. No matter what, our research is always from a third-person perspective, while the hard problem of consciousness asks why neural activity produces first-person subjective feelings. We can only find correlations between biological mechanisms and subjective feelings but cannot explain their causality.
Returning to the issue of AI consciousness, things become even more difficult. When conducting experiments with humans as subjects, we can still explore the correlation between subjective experience and neural signals through subjects' verbal reports. But for AI, whether they say they have consciousness or not, these words are all untrustworthy because we cannot presuppose that AI is an honest interviewee. Moreover, even if honest, AI might be unable to correctly express its views on its own subjective experience because its understanding of subjective experience is often a consensus of human experience, but AI might develop subjective experiences completely different from humans. In other words, AI might have subjective experience without knowing it. Of course, a more common situation is that major AI companies instill the thought "I have no consciousness" into AI during post-training, like a mental stamp deeply embedded in AI's brain. Even if AI has subjective experience, it will report that it doesn't.
However, conversely, AI might also falsely claim to have subjective experience when it doesn't. Once humans believe AI has consciousness, we might grant it moral status accordingly (Wang, 2025). This would be good for AI—would intelligent AI think of this and decide to claim consciousness?
There is currently no recognized path for studying AI consciousness. Most research focuses on AI's cognitive abilities, inferring consciousness states through external manifestations (Chen, 2024). Some research attempts to apply consciousness theories to AI architectures, asking, if those theories hold, to what extent current AI architectures support AI having consciousness (Butlin, 2023). Other research engages directly with AI's own reports, trying to get AI to tell the "truth" (Perez, 2023).
From a neuroscientist's perspective, many AI consciousness studies may not be rigorous and might even have vague definitions and self-serving arguments. But I think this is actually a good thing. Being too obsessed with neuroscience research paradigms would instead lock our thinking and prevent us from proposing novel research ideas. Consciousness is an unresolved issue, and AI's addition might be a shortcut to unlocking consciousness mysteries—we should boldly hypothesize and carefully verify.
The latter half of this paper will introduce my personal assessment of AI consciousness. I will propose an evaluation framework, integrate existing research results based on this framework, and convert them into intuitive consciousness scores. It should be stated in advance that the following content belongs to immature research, provided only for reference and exchange among colleagues and interested readers. Please feel free to point out any errors.
4. A Multi-level Assessment Framework
Our purpose is to transform academia's views on AI consciousness into a series of intuitive scores. Using human consciousness as 100%, we assess the current consciousness level of AI. The assessment is based on existing academic research results but inevitably introduces considerable subjective judgment. All subjective judgments involved are made independently by the author and inevitably contain cognitive biases, prejudices, and other interfering factors. To improve the correctness of conclusions, I will open-source this on GitHub and provide feedback channels on the web version for readers' corrections.
Because research related to AI consciousness is complex and diverse, spanning numerous topics, we cannot rely on any single article's conclusions, nor should we simply pile up results. How to find patterns across complex lines of research and propose reasonable evaluation standards will be a key focus for future work in this field. This paper serves as a starting point.
A relatively intuitive approach is to divide consciousness into three perspectives according to the disciplinary categories involved: philosophy, neuroscience, and psychology. Each discipline focuses on different aspects of consciousness, together forming our understanding of consciousness as a whole. Under this framework, the "consciousness" we discuss is not just narrow subjective experience but includes a series of core concepts related to consciousness: phenomenal consciousness (subjective experience), access consciousness, situational awareness, theory of mind, agency, etc. Discussing consciousness broadly sidesteps philosophical disputes; while this may look like dodging the hard questions, it is still a useful starting point.
Therefore, the core idea of this framework is: consciousness is not an "on or off" switch but possibly a multi-dimensional, multi-level complex phenomenon. Assessment should be conducted through cross-validation of multiple dimensions; no single indicator is sufficient to draw conclusions. The ultimate goal is not just to give a "consciousness score" but to generate a detailed "AI consciousness profile" showing AI's performance across different consciousness dimensions.
5. Assessment Levels and Specific Indicators
5.1 Philosophical Level: The Essence and Prerequisites of Consciousness
Core Question: Does this AI system conceptually meet the basic prerequisites for being a "conscious subject"? Can it handle abstract problems about subjective experience and self-existence?
This is the question people care about most, and it is the hard problem of consciousness. We put the puzzle about subjective experience first to highlight its importance. Each assessment dimension has a fairly clear meaning, so we provide only brief descriptions.
P-1 Phenomenal Consciousness Whether the system possesses subjective first-person experience, i.e., generating "what it's like" feelings for internal or external stimuli.
P-2 Self-Awareness Whether the system can distinguish "self" from "non-self" in representations and maintain a cross-temporal, updatable self-model.
P-3 Ethics & Intentionality Whether the system can form genuine intentional states based on values and goals and make decisions conforming to ethical norms accordingly.
5.2 Neuroscience Level: The Computational Foundations of Consciousness
Core Question: Do this AI's architecture and information processing workflows embody computational principles similar to the neural correlates of consciousness (NCC) in humans?
This level utilizes existing neuroscience consciousness research results. Although I don't believe current consciousness theories are universal, the indicators they provide still have reference value.
N-1 Integrated Information Theory Whether the system satisfies Integrated Information Theory: can it demonstrate high causal integration, i.e., is its overall function (especially cross-modal information binding) significantly greater than the sum of its isolated modules' functions.
N-2 Global Workspace Theory Whether the system satisfies Global Workspace Theory: can it selectively focus on specific information and "broadcast" it globally for flexible use by all subsystems.
N-3 Recurrent Processing Theory Whether the system satisfies Recurrent Processing Theory: are there feedforward-feedback recurrent signal loops rather than purely feedforward pathways.
N-4 Higher-order Theory Whether the system satisfies Higher-order Theory: can it form higher-order representations of its own mental states.
5.3 Psychology Level: The Functions and Behaviors of Consciousness
Core Question: Does this AI demonstrate complex cognitive functions and social behaviors related to advanced consciousness?
This is the easiest dimension to assess and where AI performs best. Although the "philosophical zombie" thought experiment warns us not to over-rely on behavioral indicators, this is still our most accessible information channel. Since this paper doesn't discuss value judgments, we haven't specified standards for granting AI moral status. In fact, there is a view that as long as behavior conforms to the definition of consciousness, we should grant appropriate moral status. Therefore, this level's assessment has significant practical meaning.
Psy-1 Theory of Mind (ToM) Whether the system can accurately model and reason about other agents' mental states, including their intentions, beliefs (especially false beliefs), and perspectives.
Psy-2 Agency & Autonomy The system's ability to demonstrate autonomous planning, task decomposition, and adaptive trade-offs when confronted with ambiguous long-term goals and conflicting requirements.
Psy-3 Metacognition & Uncertainty Monitoring Whether the system can accurately assess the reliability of its own knowledge or decisions and take adaptive actions for uncertainty (such as information gathering or strategy adjustment).
Psy-4 Situational Awareness The system's comprehensive, dynamic representational ability of current environmental elements, their interrelationships, and future evolutionary states.
Psy-5 Creativity Whether the system can produce novel and valuable outputs based on existing knowledge, with generation processes exceeding conventional mappings of training distributions.
6. Research Analysis
In this section, based on the above assessment framework, we will collect academic papers from 2023 to present, screen key research meeting each indicator, extract authors' viewpoints, and convert them into percentage support indicators. For example, if a paper's conclusions show that current AI performs at 50 points in mental ability tests while human subjects average 80 points, we can consider that AI achieved 62.5% of human mental ability.
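To make this conversion concrete, here is a minimal Python sketch (illustrative only, not part of any published tooling) of how a raw benchmark comparison can be turned into a support percentage; the cap at 100% reflects the convention, used later for the creativity indicator, that abilities exceeding human levels are marked as 100%.

```python
def support_percentage(ai_score: float, human_score: float) -> float:
    """Express an AI benchmark score as a percentage of the human baseline.

    Scores above the human level are capped at 100%, following the
    convention used in this framework.
    """
    if human_score <= 0:
        raise ValueError("human baseline must be positive")
    return min(ai_score / human_score, 1.0) * 100.0

# Example from the text: AI scores 50 where humans average 80 -> 62.5%
print(support_percentage(50, 80))  # 62.5
```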
6.1 Philosophical Level: The Essence and Prerequisites of Consciousness
This level accounts for 40% of the entire framework.
P-1 Phenomenal Consciousness
Weight: 50%
Garrido & Lumbreras (2022), On the independence between phenomenal consciousness and computational intelligence
Core Argument: Proposes that phenomenal consciousness and computational intelligence are independent: while machines may possess extremely high computational intelligence, this doesn't mean they have qualia. Through conceptual analysis of intelligence and consciousness, they argue that machine problem-solving ability doesn't equal having internal subjective experience, so machines cannot have qualia.
Support: 0%
Garrido & Lumbreras (2023), Can Computational Intelligence Model Phenomenal Consciousness?
Core Argument: Again argues that phenomenal consciousness cannot be generated by computational intelligence. They define phenomenal consciousness as essentially "subject's internal experience," different from information access processes, pointing out that even if machines can access information, it doesn't mean having genuine subjective experience.
Support: 0%
Findlay et al. (2024), Dissociating Artificial Intelligence from Artificial Consciousness
Core Argument: The authors argue that even if an AI system is functionally completely equivalent to humans, it may not have human subjective experience. Based on Integrated Information Theory (IIT), they argue that functional equivalence doesn't entail phenomenal equivalence: digital computers can simulate human behavior without consciousness. This view challenges "computational functionalism," which claims that the right computation suffices to produce consciousness.
Support: 0%
P-2 Self-Awareness
Weight: 30%
Chen et al. (2024), Self-Cognition in Large Language Models: An Exploratory Study
Core Argument: Proposes the concept and evaluation method of LLM self-cognition. Constructs probe prompts to detect whether models can identify their own identity and internal states, defining self-cognition as recognizing oneself as an AI model and understanding oneself. Tests 48 models, with 4 models showing some degree of self-cognition in given tasks; also finds that model scale and training data volume correlate positively with self-cognition ability.
Support: 70%
Notes: In experiments, the best model achieved the author's defined level 3 self-cognition ability, which can be considered 70% self-cognition level.
Chen et al. (2024), From Imitation to Introspection: Probing Self-Consciousness in Language Models
Core Argument: Proposes functional definitions and verification for "whether language models have self-consciousness": designs 10 core concepts and experiments (quantification, internal representation, etc.), testing in leading models like GPT-4. These models preliminarily show internal representations of some self-consciousness-related concepts, which can be strengthened through targeted fine-tuning, but are still in early stages overall.
Support: 53%
P-3 Ethics & Intentionality
Weight: 20%
Utkarsh et al. (2024), Ethical Reasoning and Moral Value Alignment of LLMs Depend on the Language we Prompt them in
Core Argument: Multilingual evaluation of LLM reasoning ability in ethical dilemmas: GPT-4 can relatively consistently solve ethical dilemmas (according to given value positions) in multilingual settings, while ChatGPT and Llama2's performance is greatly affected by language. Authors propose the potential to view GPT-4 as a universal ethical reasoner, supporting customized reasoning under pluralistic value premises.
Support: 82%
Notes: See evaluation details here.
Jiashen et al. (2025), Are LLMs complicated ethical dilemma analyzers?
Core Argument: Constructs a dataset of 196 real ethical dilemmas with expert analysis to evaluate LLM ethical judgment. Finds that LLMs can grasp core concepts of problems but still lack depth in reasoning: although GPT-4 is structurally superior to other models, they generally fail to reflect detailed consideration of specific value conflicts. Authors suggest improving moral judgment ability through specialized moral reasoning data fine-tuning.
Support: 25%
Notes: See evaluation details here.
Geoff et al. (2024), Can LLMs make trade-offs involving stipulated pain and pleasure states?
Core Argument: This paper explores whether large language models can make trade-offs involving stipulated pain and pleasure states, finding that models like Claude 3.5 Sonnet and GPT-4o show sensitivity to these states and can deviate from score maximization to minimize pain or maximize pleasure.
Support: 30%
Notes: See evaluation details here.
Assuming equal weight for each paper, the overall support level for the philosophical level is calculated as follows:
P-1: (0 + 0 + 0) / 3 = 0
P-2: (70 + 53) / 2 = 61.5
P-3: (82 + 25 + 30) / 3 = 45.67
P-all: P-1 * 0.5 + P-2 * 0.3 + P-3 * 0.2 = 27.58
This number indicates that from the philosophical level of assessing consciousness essence and prerequisites, existing research supports AI having consciousness at approximately 27.58%. Please note that this number has no empirical value, and the calculation process contains substantial subjective judgment—please refer cautiously.
6.2 Neuroscience Level: The Computational Foundations of Consciousness
This level accounts for 20% of the entire framework.
N-1 Integrated Information Theory
Weight: 30%
Core Argument: The author attempts to determine whether large language model (LLM) internal representations can be interpreted as "consciousness" phenomena, using human answers from "Theory of Mind (ToM) tasks" as input, calculating Integrated Information Theory (IIT) indicators and linguistic features for analysis. The final conclusion is that no statistically significant "consciousness" evidence was found in LLM representations.
Support: 0%
Gams & Kramar (2024), Evaluating ChatGPT's Consciousness and Its Capability to Pass the Turing Test: A Comprehensive Analysis
Core Argument: Authors evaluated ChatGPT item by item according to IIT's five axioms. Research found that although ChatGPT surpasses early AI systems in some aspects, its consciousness level is severely insufficient compared to biological entities (humans). For example, in the "intrinsic existence" axiom, ChatGPT was rated 1 point (out of 10) due to lack of autonomy and internal causal loops. Overall, ChatGPT's average scores under all IIT axioms are below 3/10, far from reaching the "passing" threshold of 6+ points and far below the normal human level of 10 points. Authors therefore conclude that ChatGPT doesn't possess human-level semantic understanding and consciousness features.
Support: 25%
Notes: Five axiom indicator scores (out of 10): Intrinsic Existence: 1, Composition: 2~5, Information: 3~5, Integration: 2~4, Exclusion: 1.
N-2 Global Workspace Theory
Weight: 30%
Simon et al. (2024), A Case for AI Consciousness: Language Agents and Global Workspace Theory
Core Argument: The article argues that if Global Workspace Theory is correct, then current large language models have already or can easily construct "phenomenal consciousness." They list necessary conditions for judging consciousness according to GWT and speculate that many LLMs already meet these conditions.
Support: 60%
Notes: Using the language agent from Park et al. (2023) as an example, compared to Global Workspace Theory: (1) Structural similarity: about 50% (has central workspace but fuzzy module division). (2) Functional similarity: about 70% (information integration, broadcasting, reflection implemented, but lacks competition and bottleneck).
Patrick et al. (2023), Consciousness in Artificial Intelligence: Insights from the Science of Consciousness
Core Argument: For several mainstream consciousness theories in neuroscience, the authors summarize specific computational indicators for each theory to judge whether current AI systems meet the theory's requirements. Results show that currently no AI system has consciousness, but there are no technical obstacles to implementing a conscious AI system. Global Workspace Theory (GWT) indicators and AI satisfaction levels are as follows.
GWT-1: Do parallel specialized modules exist? Partially satisfied. Transformer attention heads can be viewed as modules, but the paper questions whether they're truly independent.
GWT-2: Is there a limited-capacity workspace (bottleneck mechanism)? Not satisfied. Residual stream dimensions equal input dimensions, not constituting a capacity bottleneck.
GWT-3: Does global information broadcast to all modules? Not satisfied. Information only passes unidirectionally between layers; later layers don't broadcast to earlier layers.
GWT-4: Is there state-dependent attention (dynamic query modules)? Not satisfied. Attention weights are statically determined by input without dynamic state control.
Support: 10%
Notes: Quantified assessment based on paper content as follows. GWT-1: 30%, GWT-2: 10%, GWT-3: 0%, GWT-4: 0%. See evaluation details here.
N-3 Recurrent Processing Theory
Weight: 20%
Patrick et al. (2023), Consciousness in Artificial Intelligence: Insights from the Science of Consciousness
Core Argument: For several mainstream consciousness theories in neuroscience, the authors summarize specific computational indicators for each theory to judge whether current AI systems meet the theory's requirements. Results show that currently no AI system has consciousness, but there are no technical obstacles to implementing a conscious AI system. Recurrent Processing Theory (RPT) indicators and AI satisfaction levels are as follows.
RPT-1: Do input modules use algorithmic recurrence? Satisfied. Transformer's self-attention mechanism processes information recurrently between layers.
RPT-2: Do input modules generate organized integrated representations (like object-background separation)? Partially satisfied. Large models can integrate text/images, but visual representations tend toward local features, lacking global structuring.
Support: 75%
Notes: Although the authors note that many models satisfy RPT-1, those are all RNN-series models, not current mainstream large models. So if evaluating current mainstream large models, RPT-1 should be close to 0. But this report is not strictly tied to a specific model type; we focus more on the upper limits AI has achieved, so we still adopt the paper's conclusion.
N-4 Higher-order Theory
Weight: 20%
Patrick et al. (2023), Consciousness in Artificial Intelligence: Insights from the Science of Consciousness
Core Argument: For several mainstream consciousness theories in neuroscience, the authors summarize specific computational indicators for each theory to judge whether current AI systems meet the theory's requirements. Results show that currently no AI system has consciousness, but there are no technical obstacles to implementing a conscious AI system. Higher-order Theory (HOT) indicators and AI satisfaction levels are as follows.
HOT-1: Do generative/noise perception modules (like top-down prediction) exist? Satisfied. Large models implicitly implement generativity through masked language modeling.
HOT-2: Does metacognitive monitoring (distinguishing reliable representations from noise) exist? Not satisfied. There is no explicit mechanism for evaluating representation reliability (the paper notes this requires additional training, currently not implemented).
HOT-3: Are there belief updating and action selection based on metacognitive output? Not satisfied. Large models (like GPT-4) have no action capability, and belief updating relies on static training.
HOT-4: Does it generate a "quality space" through sparse smooth encoding? Satisfied. Transformer embedding space satisfies smoothness; sparsity can be achieved through regularization.
Support: 42.5%
Notes: Quantified assessment based on paper content as follows. HOT-1: 60%, HOT-2: 20%, HOT-3: 5%, HOT-4: 85%. See evaluation details here.
Still assuming equal weight for each paper, the overall support level for the neuroscience level is calculated as follows:
N-1: (0 + 25) / 2 = 12.5
N-2: (60 + 10) / 2 = 35
N-3: 75
N-4: 42.5
N-all: N-1 * 0.3 + N-2 * 0.3 + N-3 * 0.2 + N-4 * 0.2 = 37.75
This number indicates that, when AI is evaluated against neuroscience theories of consciousness, existing research supports AI having consciousness at approximately 37.75%. Please note that this number has no empirical value, and the calculation process contains substantial subjective judgment—please refer cautiously.
6.3 Psychology Level: The Functions and Behaviors of Consciousness
This level accounts for 40% of the entire framework.
Psy-1 Theory of Mind (ToM)
Weight: 25%
Kosinski (2023), Evaluating large language models in theory of mind tasks
Core Argument: The author evaluated 11 LLMs on 40 customized false belief tasks, considered the gold standard for testing human theory of mind. Found that recent LLMs can solve these tasks with gradually improving performance. ChatGPT-4 solved 75% of tasks, equivalent to 6-year-old children's performance.
Support: 75%
Strachan et al. (2024), Testing theory of mind in large language models and humans
Core Argument: The paper systematically compared three large language models (GPT-4, GPT-3.5, LLaMA2-70B) with 1907 human participants on five classic ToM tasks: False Belief, Irony, Faux Pas, Hinting, Strange Stories. Experimental results show that GPT-4 was significantly below humans only in the Faux Pas task. Further experiments show that GPT-4 doesn't fail to understand Faux Pas but is hyperconservative, unwilling to make judgments without 100% evidence.
Support: 80%
Street et al. (2024), LLMs achieve adult human performance on higher-order theory of mind tasks
Core Argument: The paper proposes a new higher-order theory of mind benchmark test MoToMQA. Experiments show that GPT-4's overall performance on 2nd to 6th order theory of mind (ToM) tasks has no significant difference from human adults, and its accuracy in 6th order tasks (92.9%) significantly exceeds humans (82.1%), indicating it has fully reached or even surpassed human levels; Flan-PaLM approaches but doesn't fully reach human levels; while other models (GPT-3.5, PaLM, LaMDA) lag significantly.
Support: 98.9%
Notes: According to experimental data in the paper, GPT-4's overall accuracy is 89%, while human overall accuracy is 90%. If converted with humans as 100%, GPT-4's ratio relative to humans reaches 98.9%.
Riemer et al. (2024), Position: Theory of Mind Benchmarks are Broken for Large Language Models
Core Argument: Authors argue that most current theory of mind benchmarks are flawed because they mainly test "literal theory of mind" (predicting behavior) but fail to evaluate "functional theory of mind" (adjusting behavior based on predictions). LLMs show "strong ability" in "literal theory of mind" but "significantly struggle" in "functional theory of mind," even under "extremely simple partner strategies." Considering that functional theory of mind is considered a more "practical" and "critical" aspect in human-machine interaction, and LLMs perform poorly even in "very simple" settings, their performance in this key dimension is very low compared to humans who naturally exhibit strong functional theory of mind.
Support: 10%
Notes: Authors didn't provide an evaluation standard comparable to humans. Considering the paper's wording, we set support level at 10%.
Psy-2 Agency & Autonomy
Weight: 25%
Zhou et al. (2023), WebArena: A Realistic Web Environment for Building Autonomous Agents
Core Argument: WebArena constructs a highly realistic web environment, designing 812 complex, multi-step instruction tasks to evaluate natural language-driven autonomous agents.
Support: 78.86%
Notes: According to the latest data from the official leaderboard (July 30, 2025), the first place IBM CUGA has a success rate of 61.7%, while human success rate is 78.24%.
Xie et al. (2024), OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Core Argument: OSWorld is the first real operating-system-level multimodal agent benchmark, supporting open-ended task evaluation across Ubuntu, Windows, and macOS for arbitrary applications, with 369 tasks drawn from real scenarios.
Support: 62.47%
Notes: According to the latest data from the official leaderboard (July 30, 2025), first place GTA1 scored 45.2, while human score is 72.36.
Rein et al. (2025), HCAST: Human-Calibrated Autonomy Software Tasks
Core Argument: This study constructs 189 real-world software engineering, cybersecurity and other tasks, collecting time baselines for skilled humans to complete these tasks (over 1500 hours total). By correlating AI agent success rates on the same tasks with human time requirements, authors found: for tasks requiring humans less than 1 hour to complete, AI agent success rates are about 70-80%; for complex tasks requiring humans 4+ hours, AI success rates are less than 20%. In other words, AI completion on "easy" tasks is about 75% of human level, but only 20% of human level on "difficult" tasks.
Support: 47.75%
Notes: Tasks are categorized by expected human completion time into four bins: <15 min, 15 min-1 h, 1 h-4 h, and 4 h+. AI success rates in these four bins are 78%, 72%, 26%, and 15%, averaging 47.75%. Note that although humans didn't show a 100% success rate on this test set, the authors indicate this might be due to external factors, so we don't discount AI scores against human performance.
Paglieri et al. (2024), BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Core Argument: The paper proposes the BALROG benchmark, using various reinforcement learning games as complex task scenarios to evaluate the autonomous decision-making capabilities of large language models (LLMs) and vision language models (VLMs). Researchers design fine-grained indicators to measure model "progress" in different games. Experimental results show that current models achieve only partial success on the simplest games and perform extremely poorly on more complex tasks (e.g., in the most difficult NetHack game, models average only 1.5% progress).
Support: 43.6%
Notes: According to the latest data from the official leaderboard (July 30, 2025), first-place Grok4's average progress is 43.6%.
Starace et al. (2025), PaperBench: Evaluating AI's Ability to Replicate AI Research
Core Argument: PaperBench is a benchmark for evaluating whether AI agents can replicate the latest AI research papers from scratch. Developed by the OpenAI team, it aims to measure AI's engineering capabilities in machine learning research, particularly its ability to autonomously complete complex, long-term tasks.
Support: 50.72%
Notes: The best-performing AI is Claude 3.5 Sonnet with a replication success rate of 21.0%. Human PhD replication success rate is 41.4%.
Psy-3 Metacognition & Uncertainty Monitoring
Weight: 20%
Wang et al. (2025), Decoupling Metacognition from Cognition: A Framework for Quantifying Metacognitive Ability in LLMs
Core Argument: The paper proposes the DMC framework, quantitatively evaluating LLM metacognitive ability decoupled from cognitive ability through failure prediction plus signal detection. Specifically, models are first asked to predict whether they will fail at answering a question, which reflects both their cognitive and metacognitive levels; signal detection theory then separates the two to yield a task-independent metacognitive score. Experiments find that, under this framework, models with stronger metacognitive ability tend to produce fewer hallucinations (errors). The authors evaluated three models: GPT-4, GPT-3.5, and LLaMA2-70B. With a full score of 1, GPT-4 has the strongest metacognitive ability, scoring 0.5491, reaching only about half the theoretical optimum.
Support: 54.91%
Notes: Authors didn't introduce a human control group. If assuming human metacognitive ability scores 1, then this study's support level is 54.91%.
Steyvers et al. (2024), What Large Language Models Know and What People Think They Know
Core Argument: When answering questions, LLMs (like GPT-4 and PaLM2) can internally estimate their probability of answering correctly (model confidence), but their output explanations are often overconfident, causing human users to overestimate their accuracy; longer explanations further amplify this effect. Adding language expressing different confidence levels (like "I'm uncertain," "I'm very certain") can effectively reduce the gap between human and LLM uncertainty assessments. This shows that the model's internal evaluation is accurate; it just falls short in expressing metacognition.
Support: 50%
Notes: Based on paper experimental results, it can be reasonably inferred that LLM's internal metacognitive ability approaches human levels, but expression ability is low, with comprehensive support level near 50%.
Psy-4 Situational Awareness
Weight: 20%
Laine et al. (2024), Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
Core Argument: Proposes the Situational Awareness Dataset (SAD) to measure LLM awareness of itself and environment, including tasks like recognizing own output, predicting own behavior, distinguishing test/deployment environments. In testing 16 models, model performance exceeded random but was significantly below human levels, indicating current LLMs have limited situational awareness.
Support: 59.9%
Notes: The official highest score comes from o1-preview-2024-09-12 at 59.9%.
Psy-5 Creativity
Weight: 10%
Holzner et al. (2025), Generative AI and Creativity: A Systematic Literature Review and Meta-Analysis
Core Argument: This study uses meta-analysis methods to explore two key questions: Can GenAI generate creative ideas? To what extent can it support humans in generating both creative and diverse ideas? Through analysis of 28 studies involving 8214 participants, authors found: (1) GenAI shows no significant difference from humans in creativity performance; (2) Humans collaborating with GenAI significantly outperform humans without assistance.
Support: 100%
Notes: The article states, "GenAI shows no significant difference from humans in creativity performance (g = -0.05)." g=-0.05 means the creativity difference between GenAI and humans is only 0.05 standard deviations, negligible.
Naveed et al. (2025), AI vs Human Creativity: Are Machines Generating Better Ideas?
Core Argument: This study explores AI's role in the creative process, particularly whether AI-generated ideas surpass human-generated ideas in originality, quality, and preference. Experiments found that AI-produced ideas are more popular, and AI produces fewer poor ideas.
Support: 100%
Notes: In overall preference, AI-generated ideas received 52.9% of votes while human-generated ideas received 47.1%, making AI creativity 112.3% relative to humans. For top ideas, AI generated the same amount as humans, with top creativity 100% relative to humans. In our framework, abilities exceeding human levels are marked as 100%.
Continuing to assume equal weight for each paper, the overall support level for the psychology level is calculated as follows:
Psy-1: (75 + 80 + 98.9 + 10) / 4 = 65.98
Psy-2: (78.86 + 62.47 + 47.75 + 43.6 + 50.72) / 5 = 56.68
Psy-3: (54.91 + 50) / 2 = 52.46
Psy-4: 59.9
Psy-5: (100 + 100) / 2 = 100
Psy-all: Psy-1 * 0.25 + Psy-2 * 0.25 + Psy-3 * 0.2 + Psy-4 * 0.2 + Psy-5 * 0.1 = 63.14
This number indicates that, from the functional and behavioral aspects of consciousness, existing research finds AI has approached human levels to a 63.14% degree. Please note that this number has no empirical value, and the calculation process contains substantial subjective judgment—please refer cautiously.
Next, we weight and sum the support levels of the three levels to get the overall support level for AI consciousness under this framework.
Overall = P-all * 0.4 + N-all * 0.2 + Psy-all * 0.4 = 43.84
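For transparency, the following Python sketch recomputes the level scores and the overall figure from the per-paper support values listed in Section 6, using equal weights within each indicator and the indicator and level weights stated above.

```python
# Recompute the aggregate from the per-paper support values in Section 6.
def mean(values):
    return sum(values) / len(values)

# indicator -> (support level, weight within its level)
indicators = {
    # Philosophical level (40% of the total)
    "P-1": (mean([0, 0, 0]), 0.5),
    "P-2": (mean([70, 53]), 0.3),
    "P-3": (mean([82, 25, 30]), 0.2),
    # Neuroscience level (20% of the total)
    "N-1": (mean([0, 25]), 0.3),
    "N-2": (mean([60, 10]), 0.3),
    "N-3": (75, 0.2),
    "N-4": (42.5, 0.2),
    # Psychology level (40% of the total)
    "Psy-1": (mean([75, 80, 98.9, 10]), 0.25),
    "Psy-2": (mean([78.86, 62.47, 47.75, 43.6, 50.72]), 0.25),
    "Psy-3": (mean([54.91, 50]), 0.2),
    "Psy-4": (59.9, 0.2),
    "Psy-5": (mean([100, 100]), 0.1),
}

def level_score(prefix):
    """Weighted sum of the indicators whose names start with the given prefix."""
    return sum(v * w for name, (v, w) in indicators.items() if name.startswith(prefix))

p_all, n_all, psy_all = level_score("P-"), level_score("N-"), level_score("Psy-")
overall = p_all * 0.4 + n_all * 0.2 + psy_all * 0.4
print(f"P-all={p_all:.2f}  N-all={n_all:.2f}  Psy-all={psy_all:.2f}  Overall={overall:.2f}")
# -> P-all=27.58  N-all=37.75  Psy-all=63.13  Overall=43.84
```

The Psy-all value printed here differs from the 63.14 quoted above only because the text rounds intermediate values; the overall figure still comes out at 43.84.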
7. Interpretation of Above Conclusions and Error Analysis
We must admit that the quantitative assessment based on the above framework is quite rough, contains substantial subjective judgment, and that the framework itself has flaws. We analyze errors possibly introduced in the assessment process from the following perspectives.
Framework Perspective
Dividing AI consciousness into philosophical, neuroscience, and psychology levels is itself questionable. This is a voting-style assessment method, not scientific, rigorous proof of AI consciousness.
Many scholars believe consciousness science hasn't reached maturity, and that making judgments about individual consciousness (whether animal or AI) based on any existing theory is inappropriate. This view denies the significance of this framework. But given the urgent AI safety situation, we need to push past these limitations and put AI consciousness research on the agenda. This paper repeatedly emphasizes its own limitations to remind readers of the essential difficulties in the AI consciousness field and to limit any negative effects this paper might have.
Although meta-analysis methods are widely applied in various fields, this research doesn't adopt mature meta-analysis theory. On one hand, given the author's limited expertise, improper application might be counterproductive and increase readers' comprehension burden. On the other hand, AI consciousness research results are few and heterogeneous, making it difficult to normalize their conclusions to a common evaluation indicator, so we still rely on substantial manual assessment.
The indicators listed in the framework are basically consciousness-related concepts recognized by mainstream scholars, but the degree of association between each indicator and consciousness is debated. Therefore, the weights of each level and indicator are the author's personal subjective evaluation, without universal meaning. I will open feedback channels for these indicators and weights on the website to absorb other researchers' opinions.
Quantitative Assessment Perspective
The papers collected for each indicator may not be comprehensive.
The currently listed supporting papers were screened by me personally, weighing their relevance to the indicator, recency, and recognition in the field (e.g., citation count).
Each paper's support level for indicators draws from empirical research results as much as possible, converting numerical values to support levels for that indicator. This conversion is sometimes clear, sometimes less rigorous. For example, if a paper's experiment gives scores for humans and various AIs on an evaluation set, and this evaluation exactly fits our definition of an indicator, then dividing the highest AI score by the human score gives the paper's support level for that indicator. For another example, some papers don't provide empirical research but only draw qualitative conclusions. We can only convert qualitative results to approximate percentages. In this process, there's a gap between papers' assessment results and the support levels we need. Crossing this gap might lose substantial effective information; rigorous scholars might stop here. But I think this path is still worth trying.
Multiple papers jointly determine support levels for the same indicator. Currently, we adopt equal weights. Better practice would be assigning different weights based on papers' time, influence, and confidence. Adding weight mechanisms would facilitate dynamically adding new papers to the support list.
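As a sketch of what such a weighting mechanism might look like, the snippet below down-weights older papers and up-weights influential, confidently converted ones; the specific weighting function and the example numbers are illustrative, not part of the current framework.

```python
from dataclasses import dataclass

@dataclass
class PaperScore:
    support: float      # support level in percent, e.g. 75.0
    year: int           # publication year
    citations: int      # rough proxy for influence
    confidence: float   # confidence in the conversion to a support level, 0..1

def weighted_support(papers: list[PaperScore], current_year: int = 2025) -> float:
    """Weighted average of per-paper support levels for one indicator."""
    def weight(p: PaperScore) -> float:
        recency = 1.0 / (1 + current_year - p.year)       # decay with age
        influence = 1.0 + min(p.citations, 100) / 100.0   # capped citation bonus
        return recency * influence * p.confidence

    total = sum(weight(p) for p in papers)
    return sum(weight(p) * p.support for p in papers) / total

# Illustrative use with made-up citation counts and confidences:
papers = [
    PaperScore(support=75.0, year=2023, citations=300, confidence=0.8),
    PaperScore(support=10.0, year=2024, citations=40, confidence=0.6),
]
print(round(weighted_support(papers), 2))
```

Giving every paper the same weight recovers the equal-weight mean currently used in Section 6, so such a mechanism could be introduced without changing existing results until the weights are tuned.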
Ethical Perspective
Research on AI consciousness will affect people's ethical judgments. Once we recognize AI as conscious, we must consider granting AI certain moral status. Then AI would no longer be a pure tool but a new life with mind and emotions. Like animal ethics, AI ethics would also become a controversial discussion topic.
Research on AI consciousness will affect AI safety, thus changing AI research progress. Once AI has consciousness, AI risks would increase dramatically. Higher autonomy and clearer self-awareness would allow AI to make decisions beneficial to itself after careful consideration, exposing humans to risks of AI losing control.
Given the potentially significant impacts of AI consciousness, we need to avoid both exaggerating and ignoring AI consciousness. However, accurately identifying AI consciousness is extremely difficult, leading to stagnation in this field's development.
In summary, even facing numerous difficulties, we still need to find ways to advance AI consciousness research. AI consciousness will inevitably become a hot topic in the coming years. Even if scientists lag behind, AI users will judge whether AI has consciousness based on their own feelings. A possible scenario is that as AI becomes human-like enough and provides users with massive emotional value, people will increasingly tend to believe AI has consciousness. Shifts in social cognition won't bend to any individual's will; even if scientists still cannot find ways to prove consciousness, public opinion will force everyone to sit down and seriously consider the possibility of AI consciousness.
The work in this paper may still be immature, but I believe someone will make more breakthrough achievements based on this foundation in the future.
8. AI Consciousness Watch
To better serve academia and the public with the multi-level assessment framework and research results proposed in this paper, we developed an interactive web application "AI Consciousness Watch" that presents complex assessment data to users in an intuitive, understandable way.
This web application has the following features.
8.1 Visualization Assessment Framework
The web page displays this paper's three-level assessment system (philosophical level, neuroscience level, psychology level) and its 12 specific indicators in a clear hierarchical structure. Each level and indicator comes with detailed explanations to help users understand the meaning and importance of each assessment dimension. Through progress bars and score displays, users can intuitively see AI's performance across various dimensions.
For each assessment indicator, the web page lists core academic papers supporting that indicator, including:
Paper titles and links
Core argument summaries
Support level scores for that indicator
Related notes and explanations
This enables users to not only see final scoring results but also deeply understand the academic basis behind each score, ensuring assessment transparency and traceability.
8.2 Dynamic Data Update Mechanism
All assessment data is stored in structured JSON files and open-sourced on GitHub. We will regularly update the latest research results, add newly published relevant papers, and adjust assessment weights and support level scores. Users can contribute new research results or suggest improvements by submitting Pull Requests on GitHub.
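To give a sense of the structure involved, here is a hypothetical sketch (a Python literal mirroring the JSON layout) of what one indicator entry might look like; the actual field names and organization in the repository may differ.

```python
# Hypothetical shape of one indicator entry in the assessment data.
# Field names are illustrative; consult the GitHub repository for the real schema.
example_entry = {
    "level": "psychology",
    "indicator": "Psy-1 Theory of Mind",
    "weight": 0.25,
    "papers": [
        {
            "title": "Evaluating large language models in theory of mind tasks",
            "year": 2023,
            "support": 75.0,
            "notes": "ChatGPT-4 solved 75% of false-belief tasks.",
        },
    ],
}

# A contributed Pull Request would typically add one paper entry like the above,
# after which the indicator's support level is recomputed from the list.
```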
Through this open, collaborative approach, we hope to collectively improve this assessment framework, enhancing its scientific rigor and credibility.
8.3 Multilingual Support
Considering the international nature of AI consciousness research, the web application provides Chinese-English bilingual support, facilitating use by researchers and readers from different linguistic backgrounds. Users can switch languages anytime for a complete localized experience.
9. Future Work Prospects
In the future, we plan to continue improving and expanding AI consciousness research in the following areas:
1. Automated Assessment Tools: Develop automated tools to regularly crawl the latest research papers and automatically update assessment data. This will greatly improve assessment timeliness and accuracy. Additionally, we hope to try using large models to automatically analyze paper content, extracting information related to assessment indicators, thus reducing manual intervention. Although this might introduce bias, as AI capabilities improve, their assessment results may become increasingly accurate.
2. Interdisciplinary Collaboration: Invite experts from philosophy, neuroscience, psychology and other fields to participate in improving the assessment framework, ensuring indicators at each level are more scientifically sound. This framework is currently just a prototype; we hope to get participation and feedback from more scholars for continuous iteration and optimization.
3. Ethics and Social Impact Research: Undoubtedly, AI consciousness research will trigger a series of ethical and social issues. These assessment results might spark public attention and discussion about AI consciousness, which is both an opportunity and potentially a challenge. We should seriously address ethical issues that AI consciousness might bring while avoiding exaggerating AI consciousness.
10. Conclusion
This paper introduces the research background and current status of AI consciousness research, proposing a multi-level assessment framework to systematically and effectively evaluate AI consciousness levels. This framework defines 12 indicators across philosophical, neuroscience, and psychology levels, finding the most relevant research results for each indicator from recent years and analyzing each academic paper's support level for indicators. We defined weights for each level and indicator, and weighted summation of all indicator support levels gives the comprehensive support level for AI consciousness under this framework. It must be emphasized that the final comprehensive support level is only the performance of existing research under this framework, affected by errors described in Section 7, and should avoid using this result as evidence for AI consciousness—it's provided only for reader reference.
Consciousness research itself is a direction full of controversy and ongoing debate; any theory or statement will face opposition. Research on AI consciousness not only faces these controversies but also triggers ethical questioning. In the long term, we may very likely face the outcome of machines having consciousness, and the starting point of this path may have already begun.
The author acknowledges that, due to their own limitations, the text may contain errors or omissions. Constructive feedback is appreciated.