ZD_2_17 — AI Alignment & Existential Risk

Credible (Tier 2)
Confidence: 4/5 | Section: ZD | Updated: April 12, 2026
Source Count: 15 | Weighted Score: 38 | Source Confidence: [4/5] | Primary Tier: 2 | Last Updated: April 12, 2026
Keywords: AI alignment, existential risk, superintelligence, value alignment, instrumental convergence, corrigibility, reward hacking, RLHF, AI safety, Nick Bostrom, Eliezer Yudkowsky, Stuart Russell
Category Tags: artificial-intelligence, existential-risk, alignment, machine-learning, ethics
Cross-References: ZD_2_01 — Artificial Intelligence Foundations · ZE_1_01 — Ethics Overview · ZD_2_10 — Neural Networks Deep Learning

QUICK SUMMARY

AI alignment, the challenge of ensuring that artificial intelligence systems pursue goals consistent with human values and intentions, has emerged as one of the defining technical and philosophical problems of the 21st century. The field was catalyzed by Eliezer Yudkowsky (Machine Intelligence Research Institute, founded 2000) and formalized by Nick Bostrom's Superintelligence (2014), which systematically analyzed scenarios in which advanced AI systems develop goals misaligned with human welfare. Core technical problems include the specification problem (precisely defining what we want), the alignment problem (ensuring the AI pursues what we specify), and the control problem (maintaining oversight as AI capabilities increase). Stuart Russell (UC Berkeley) reframed the challenge in Human Compatible (2019) as building AI that is uncertain about human preferences and actively seeks to learn them. Activity in the field has accelerated since 2022 with the deployment of large language models (GPT-4, Claude, Gemini), prompting both technical alignment work (RLHF, constitutional AI, mechanistic interpretability) and governance initiatives (the 2023 Bletchley Declaration, executive orders, and the EU AI Act, adopted in 2024).
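
The idea of an AI that is uncertain about human preferences and learns them from behavior can be illustrated with a toy Bayesian update. This is a minimal sketch and not Russell's formalism: the two hypotheses, the route names, and the likelihood values are all hypothetical numbers chosen for illustration.

```python
def update_preference_belief(prior, likelihoods, observed_choice):
    """Bayesian update over preference hypotheses after one observed human choice.

    prior:       dict mapping hypothesis -> probability
    likelihoods: dict mapping hypothesis -> {choice: P(choice | hypothesis)}
    """
    unnormalized = {
        h: prior[h] * likelihoods[h][observed_choice] for h in prior
    }
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# Hypothetical setup: the system starts undecided between two accounts
# of what the human wants.
prior = {"prefers_speed": 0.5, "prefers_safety": 0.5}

# Assumed (illustrative) probabilities of each choice under each hypothesis.
likelihoods = {
    "prefers_speed": {"fast_route": 0.9, "safe_route": 0.1},
    "prefers_safety": {"fast_route": 0.2, "safe_route": 0.8},
}

# The human is observed choosing the safe route; belief shifts accordingly.
posterior = update_preference_belief(prior, likelihoods, "safe_route")
print(posterior)
```

In this sketch the system never treats its current guess as certain: each new observation only reweights the hypotheses, which is the property the paragraph above attributes to Russell's proposal.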


1. VERIFIED CLAIMS (Tier 1 — Peer-Reviewed / Established)

1.1 The Alignment Problem Is a Real Technical Challenge

1.2 Instrumental Convergence and Power-Seeking Behavior

1.3 RLHF and Its Limitations
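
The reward-modeling step of RLHF, as introduced in Christiano et al. (2017), fits a reward function to pairwise human comparisons. A minimal sketch of the Bradley-Terry style preference probability and its log loss is shown here; the function names and the example reward values are hypothetical, and a real system would compute the rewards with a trained neural network rather than pass them in as constants.

```python
import math


def preference_probability(reward_a, reward_b):
    """Probability that A is preferred over B: sigmoid(reward_a - reward_b)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def preference_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood of the human's recorded choice."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))


# Illustrative values: the loss is small when the reward model ranks the
# human-chosen response higher, and large when it ranks it lower.
loss_agree = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
loss_disagree = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(loss_agree, loss_disagree)
```

The loss depends only on the difference between the two rewards, so the learned reward captures relative human judgments rather than an absolute measure of quality. This is one reason a policy optimized against such a model may exploit its errors.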


2. CREDIBLE CLAIMS (Tier 2 — Academic / Debated but Supported)

2.1 Superintelligence Could Pose an Existential Risk

2.2 Mechanistic Interpretability as an Alignment Tool


3. SPECULATIVE CLAIMS (Tier 3 — Possible but Unverified)

3.1 Recursive Self-Improvement Could Trigger an Intelligence Explosion

3.2 AI Systems May Develop Deceptive Alignment


4. DUBIOUS CLAIMS (Tier 4 — No Credible Source / Contradicted by Evidence)

4.1 Current AI Systems Are Conscious or Have Goals


Counter-Arguments & Criticisms

The AI alignment field faces criticism from multiple directions. Timnit Gebru and Emily Bender argue that focusing on speculative superintelligence diverts attention from present, measurable harms: algorithmic bias, surveillance capitalism, labor displacement, and the environmental costs of training (a single GPT-4 training run has been estimated to cost roughly $100M and to emit thousands of tons of CO₂). Arvind Narayanan (Princeton) argues that "AI risk" discourse is instrumentalized by major AI companies to justify regulatory capture, allowing them to position themselves as the responsible custodians of a dangerous technology. From a technical perspective, François Chollet (creator of Keras) argues that LLMs are sophisticated memorization and interpolation engines rather than general reasoning systems, and that the path from current architectures to superintelligence is unclear. The field also faces a charge of unfalsifiability: alignment researchers can always argue that risks are real but hidden, which makes the research program difficult to evaluate on its own terms.


IMAGES

# | Description | Filename | Source | License

No images assigned yet.


BIBLIOGRAPHY

  1. Bostrom, Nick | 2014 | ∅ | Superintelligence: Paths, Dangers, Strategies | ∅ | ∅ | Oxford: Oxford University Press | ∅ | doi:10.1007/s11023-015-9377-7 | ∅ | ∅ | ∅
  2. Russell, Stuart | 2019 | ∅ | Human Compatible: Artificial Intelligence and the Problem of Control | ∅ | ∅ | New York: Viking | ∅ | doi:10.3917/pe.204.0186o, isbn:9780525558613 | ∅ | ∅ | ∅
  3. Amodei, Dario et al | 2016 | "Concrete Problems in AI Safety" | ∅ | ∅ | ∅ | ∅ | ∅ | arxiv:1606.06565 | ∅ | ∅ | ∅
  4. Christiano, Paul et al | 2017 | "Deep reinforcement learning from human preferences" | NeurIPS 2017 | ∅ | ∅ | ∅ | ∅ | arxiv:1706.03741 | ∅ | ∅ | ∅
  5. Turner, Alex et al | 2021 | "Optimal Policies Tend to Seek Power" | NeurIPS 2021 | ∅ | ∅ | ∅ | ∅ | arxiv:1912.01683 | ∅ | ∅ | ∅
  6. Hubinger, Evan et al | 2019 | "Risks from Learned Optimization in Advanced Machine Learning Systems" | ∅ | ∅ | ∅ | ∅ | ∅ | arxiv:1906.01820 | ∅ | ∅ | ∅
  7. Bai, Yuntao et al | 2022 | "Constitutional AI: Harmlessness from AI Feedback" | ∅ | ∅ | ∅ | ∅ | ∅ | arxiv:2212.08073 | ∅ | ∅ | ∅
  8. Good, I. J. | 1965 | "Speculations Concerning the First Ultraintelligent Machine" | Advances in Computers | ∅ | 6::31–88 | ∅ | ∅ | doi:10.1016/S0065-2458(08)60418-0 | ∅ | ∅ | ∅
  9. Omohundro, Stephen | 2008 | "The Basic AI Drives" | Proceedings of the First AGI Conference | ∅ | 171::483–492 | ∅ | ∅ | ∅ | ∅ | ∅ | ∅
  10. Grace, Katja et al | 2024 | "Thousands of AI Authors on the Future of AI" | ∅ | ∅ | ∅ | ∅ | ∅ | arxiv:2401.02843 | ∅ | ∅ | ∅
  11. Christian, Brian | 2020 | ∅ | The Alignment Problem: Machine Learning and Human Values | ∅ | ∅ | New York: Norton | ∅ | doi:10.1007/s10460-020-10018-8 | ∅ | ∅ | ∅
  12. Yudkowsky, Eliezer | 2008 | "Artificial Intelligence as a Positive and Negative Factor in Global Risk" | Global Catastrophic Risks | ∅ | 308–345 | ∅ | ∅ | doi:10.1093/oso/9780198570509.003.0021 | ∅ | ∅ | ∅
  13. Ngo, Richard; Chan, Lawrence; Mindermann, Sören | 2022 | "The alignment problem from a deep learning perspective" | ∅ | ∅ | ∅ | ∅ | ∅ | arxiv:2209.00626 | ∅ | ∅ | ∅
  14. Carlsmith, Joseph | 2022 | "Is Power-Seeking AI an Existential Risk?" | ∅ | ∅ | ∅ | ∅ | ∅ | arxiv:2206.13353 | ∅ | ∅ | ∅
  15. Ord, Toby | 2020 | ∅ | The Precipice: Existential Risk and the Future of Humanity | ∅ | ∅ | New York: Hachette | ∅ | isbn:9780316484916 | ∅ | ∅ | ∅

CROSS-REFERENCE INDEX

Related Doc | Connection
ZD_2_01 | AI foundations from which alignment concerns arise
ZD_2_10 | Deep learning architecture underlying modern alignment challenges
ZE_1_01 | Ethical frameworks relevant to AI value alignment
P_1_01 | Philosophical foundations of the value alignment problem
