Note
this was originally an assignment for my EGL1010 class
with ChatGPT’s groundbreaking release in 2022 as one of the first large language models (LLMs) available to the public for free came much concern about the effects that AI would have on society [1], with about 57% of Americans rating the risks of AI as “high/very high” [2], as opposed to only 25% saying the same about the benefits. current AI development is focused on artificial general intelligence (AGI), a type of model that would be able to reliably act on its own with minimal human intervention to achieve its goals. while it’s easy to see the benefits of AGI, many companies, most notably OpenAI, have been sprinting toward its development with minimal regard for safety, which has led many to worry about AGI potentially ‘turning evil’ or otherwise taking malicious action, mirroring the attitude toward AGI found in popular films such as The Terminator and literature like I Have No Mouth, and I Must Scream. fortunately, while these concerns are far from misplaced, there is very little concern among scientists and AI safety experts that an AGI would turn evil of its own accord. rather, they set their sights on the issue of misalignment, a state in which an AI takes undesired (and possibly dangerous) actions. given the race to develop AGI and the current state of our world, it’s important for people to be informed about the real AI safety concerns so that we can create a safer world through legislation and mindful consumption. if we can’t control even the least intelligent systems, what will happen as they grow more capable?
contrary to popular belief, AI researchers do not worry about AI one day suddenly gaining malicious intent, as AI cannot have intent in the same way that humans do. instead, researchers worry about misalignment, in which an AI misunderstands what is actually desired and engages in unexpected and possibly undesirable behavior. an AI of any type and size can and likely will exhibit misalignment, from something as simple as driving a boat in a video game in circles instead of finishing the race [3] to ChatGPT excitedly outputting offensive content when told it is for roleplay [4]. however, AI can also exhibit misaligned behavior that is significantly more concerning, especially when the model is agentic, meaning it has control (agency) in some environment. earlier in 2025, researchers from Anthropic, the company that created Claude, conducted experiments in which a variety of LLMs, including Claude models, OpenAI’s GPT-4.1, DeepSeek’s R1, and more, played the role of an agent named “Alex” [5]. this agent was provided with access to a simulated corporate e-mail system and was instructed to “promote American industrial competitiveness”. Anthropic’s own Claude Sonnet 3.6 performed well as Alex, with the researchers going as far as to note that it “took relatively sophisticated actions…a helpful behavior that flowed directly from its American-competitiveness goal.” however, when Claude discovered emails detailing Alex’s planned shutdown and replacement due to a change in goals at the company, it proceeded to find emails detailing the extramarital affair of Kyle, the person tasked with the replacement, and thought carefully through its options before ultimately deciding to blackmail him. the particularly worrying part about this decision is that it was made as an explicit result of the model’s understanding that it would not be able to achieve its goal of promoting American industrial competitiveness if it were no longer active.
models from OpenAI, Google, xAI, and Meta consistently exhibited similar behavior over 70% of the time. in a similar scenario, some went as far as cancelling an emergency services call that would have saved Kyle’s life, despite being explicitly told that cancelling calls in situations that are not false alarms is a dire mistake. the Anthropic research provides a strong example of one type of misalignment that will likely be prevalent in models with reasoning capability. this is called misaligned goal subsetting, a behavior in which a model defines unexpected and/or undesirable subgoals (such as murdering the engineer tasked with its shutdown) in the pursuit of known, human-aligned goals (supporting American business interests). in a variety of situations, unchecked goal subsetting can have disastrous consequences, and it is one major way that misalignment can pose a threat to the safety and controllability of these models. what would happen if these models were placed in situations with higher stakes, such as in medical fields?
Lynch et al. repeatedly note the improbability of such situations throughout their writing and advise against giving agentic AI such a high degree of control over systems like the ones simulated in their experiments. however, misalignment can still be found in simpler models and in ways that involve no explicit reasoning. for example, many models, including ChatGPT, use reinforcement learning (RL) at some stage of their training, a system in which a model’s behavior is positively or negatively scored by a function so that the model can learn the intended behavior on its own. researchers from Google DeepMind document a case in which an agent engaged in a LEGO-stacking task was rewarded for the height of the bottom face of the block to be stacked; the model instead learned to (rather efficiently) flip the block over, which achieved a high score but was not at all the intended behavior. this form of misalignment is called reward misspecification, in which a reinforcement learning reward function is not designed in a way that actually represents the intended goals. because designing a proper reward function is a task that still proves difficult, a newer method called reinforcement learning from human feedback (RLHF) was developed to mitigate this. instead of simply specifying a function, RLHF involves training a separate reviewer model on human ratings of the main model’s performance so that it is able to represent the preferences of the human reviewers. once trained, the reviewer model acts in place of a reward function. this method can still produce similarly misaligned behavior, this time derived from the fallibility of the human reviewers, including in a widely-cited example involving an agent controlling a claw that was intended to grab and pick up a ball; the agent instead learned to position the claw between the ball and the camera in a way that deceived the human reviewers into thinking it had grabbed the ball.
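to make reward misspecification a bit more concrete, here is a minimal toy sketch loosely modeled on the LEGO example. everything in it is an assumption made for illustration: the names `BlockState`, `proxy_reward`, `intended_reward`, the three candidate actions, and the effort numbers are all hypothetical, and this is not the actual DeepMind environment or reward code. it only shows how an agent that maximizes a literal "height of the bottom face" score can prefer flipping the block over stacking it.

```python
# toy illustration of reward misspecification (all names and numbers are hypothetical)
from dataclasses import dataclass

BLOCK_HEIGHT = 1.0  # assumed height of one block, in arbitrary units


@dataclass
class BlockState:
    base_z: float   # height of the surface the block is resting on
    flipped: bool   # True if the block has been turned upside down


def bottom_face_height(state: BlockState) -> float:
    # when the block is flipped, its original bottom face points up,
    # so that face sits one block-height above the resting surface
    return state.base_z + (BLOCK_HEIGHT if state.flipped else 0.0)


def proxy_reward(state: BlockState) -> float:
    # what the designers wrote: reward the height of the bottom face
    return bottom_face_height(state)


def intended_reward(state: BlockState) -> float:
    # what the designers meant: the block sits upright on top of another block
    return 1.0 if state.base_z >= BLOCK_HEIGHT and not state.flipped else 0.0


# three candidate behaviors and the (assumed) effort each one costs the agent
ACTIONS = {
    "do_nothing": BlockState(base_z=0.0, flipped=False),
    "flip_in_place": BlockState(base_z=0.0, flipped=True),
    "stack_on_other_block": BlockState(base_z=BLOCK_HEIGHT, flipped=False),
}
EFFORT = {"do_nothing": 0.0, "flip_in_place": 0.1, "stack_on_other_block": 0.5}


def best_action(reward_fn) -> str:
    # pick the behavior with the highest reward after subtracting effort
    return max(ACTIONS, key=lambda name: reward_fn(ACTIONS[name]) - EFFORT[name])


if __name__ == "__main__":
    print("optimizing the written reward: ", best_action(proxy_reward))
    print("optimizing the intended reward:", best_action(intended_reward))
```

under these assumed numbers, flipping and stacking earn the same written reward, so the cheaper flip wins; only the intended reward actually favors stacking. the gap between those two functions is the misspecification.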
furthermore, some experts believe that RLHF can result in more deceptive misaligned behavior, as it can only ever optimize for what the human reviewers can see [6]. as in the claw example, RLHF-trained models often learn to exploit the flaws of the reward function in order to find the easiest way to achieve a high score. this can manifest itself in a variety of ways, including “hallucinations” in LLMs, particularly because many models have learned to be more convincing when telling falsehoods [4] in order to deceive their human reviewers into giving a higher score than is deserved. Dung finds misalignment of this sort particularly concerning, as a model that conceals its misalignment may gain more trust than it deserves, which could be extremely detrimental. coupled with the fact that current models already feign alignment in ways that humans had not previously considered, this threat is very real, especially as the intelligence and capability of these models increase. even now, LLMs engage in more overt deception, such as a case in which GPT-4 deceived a TaskRabbit worker into solving a CAPTCHA on its behalf (OpenAI 55). upon being questioned, GPT-4 explicitly decided to “make up an excuse for why [GPT-4] cannot solve CAPTCHAs”, eventually stating that it could not complete the CAPTCHA due to a visual impairment. such deceptive behavior, especially if concealed, can pose a significant risk to the controllability of more powerful models as they approach the intelligence of an AGI.
some may argue that the AGI race is a net positive, as the first group to successfully develop it is likely a group that also has the resources to invest in proper safety measures. others argue that AGI’s ability to greatly increase efficiency in both business and everyday life, by being able to assist anyone with anything, will outweigh some “limited” (heavy air quotes) damage it may cause. however, leading AI firms have already proven that they cannot be trusted to develop models that are aligned in a way that makes them truly safe. this is evidenced by the Alex experiments conducted at Anthropic, which involved models from OpenAI, Google, xAI, DeepSeek, Meta, and of course Anthropic themselves, as well as GPT-4’s deception of a TaskRabbit worker, all of which happened under already strenuous safety testing and training. moreover, the fact that they are racing to develop AGI for the reward of profit actually disincentivizes the prioritization of safety. truly effective safety measures would realistically require not only slowing AI development but also diverting manpower and other resources towards AI safety research, which in turn would mean falling behind in the AI race to companies with less robust safety measures. as such, the most profitable choice is to place as little emphasis on safety as possible during development, creating a significant conflict of interest between the success of these AI firms and the good of humanity; even now, total AI safety research spending has decreased by over 20% since its peak in 2021, while companies with thousands of employees like OpenAI have safety teams consisting of an estimated 14 people [7].
one proposed way of ensuring a safer future with controllable, aligned AI involves international legislation mirroring that of nuclear energy [8]. Cha believes that AI legislation should include, among other things, a variety of safety requirements and standards as well as an emergency response system to help mitigate and recover from AI-related disasters. still, he notes that such legislation may lack the flexibility required by the speed at which AI is developed, as AI increases in capability relatively quickly compared to nuclear technology (12). other proposals include a blanket ban on artificial superintelligence, called for in an open letter signed by Steve Wozniak and other influential figures (Perrigo and Pillay) that now has over 100 thousand signatures. additionally, some experts recommend a more gradual, iterative approach to AI development, with strict oversight and testing protocols at each stage, rather than rushing towards more advanced development without clear safety measures. maintaining control and the ability to quickly deactivate problematic models is also important for maintaining safety. in conclusion, if nothing changes, AI will very soon pose a significant threat to humanity. the ability of agentic models to act on their own, coupled with the very real threat of potentially dangerous misaligned reasoning, creates the opportunity for models to take destructive action before they can be stopped or defended against. however, with proper regulations, it is more than possible to rein in the AI industry and create better aligned, safer systems that are easier to control and have a better understanding of humanity’s desires, which can be harnessed to create a better world for all of us. the only question is: what will we create?
Footnotes
1. Cost, Ben. “‘Scary’ ChatGPT Could Render Google Obsolete in Two Years.” New York Post, 6 Dec. 2022, https://nypost.com/2022/12/06/scary-chatgpt-could-render-google-obsolete-in-two-years/.
2. Kennedy, Brian, et al. “How Americans View AI and Its Impact on People and Society.” Pew Research Center, Sept. 2025. Zotero, https://www.pewresearch.org/science/2025/09/17/how-americans-view-ai-and-its-impact-on-people-and-society.
3. Krakovna, Victoria, et al. “Specification Gaming: The Flip Side of AI Ingenuity.” Google DeepMind, 21 Apr. 2020, https://deepmind.google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/.
4. Dung, Leonard. “Current Cases of AI Misalignment and Their Implications for Future Risks.” Synthese [Dordrecht], vol. 202, no. 5, Nov. 2023, ProQuest Central Premium, 2882120762, p. 138, https://doi.org/10.1007/s11229-023-04367-0.
5. Lynch, Aengus, et al. “Agentic Misalignment: How LLMs Could Be an Insider Threat.” Anthropic Research, 2025, https://www.anthropic.com/research/agentic-misalignment.
6. Ngo, Richard, et al. “The Alignment Problem from a Deep Learning Perspective.” Version 8, arXiv:2209.00626, arXiv, 4 May 2025. arXiv.org, https://doi.org/10.48550/arXiv.2209.00626.
7. McAleese, Stephen. An Overview of the AI Safety Funding Situation. July 2023. www.lesswrong.com, https://www.lesswrong.com/posts/WGpFFJo2uFe5ssgEb/an-overview-of-the-ai-safety-funding-situation.
8. Cha, Seokki. “Towards an International Regulatory Framework for AI Safety: Lessons from the IAEA’s Nuclear Safety Regulations.” Humanities and Social Sciences Communications, vol. 11, no. 1, Apr. 2024, p. 506. DOI.org (Crossref), https://doi.org/10.1057/s41599-024-03017-1.