## Safety, Alignment and Governance

The most critical section: ensuring that AI remains safe, aligned with human values, and under meaningful human control is the defining challenge as systems grow more capable.

### Safety and Alignment
| Topic | Description | Links |
| --- | --- | --- |
| LLM Hallucination Survey | Comprehensive survey on detecting, explaining, and mitigating hallucinations in LLMs. | Paper, Code |
| Hallucination Detection | Survey on hallucination in LLMs covering detection methods and mitigation strategies. | Paper, Code |
| LLM Attacks | Universal and transferable adversarial attacks on aligned language models (CMU research). | Paper, Code |
| RLHF | Reinforcement Learning from Human Feedback, the key alignment technique behind ChatGPT. | Code Example |
| DPO | Direct Preference Optimization, a simpler alternative to RLHF that optimizes the policy directly on preference data, treating the LM itself as an implicit reward model. | Paper, Code |
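The DPO row above can be made concrete: DPO replaces RLHF's learned reward model with a closed-form loss on preference pairs, using the log-probability ratio between the policy and a frozen reference model as an implicit reward. A minimal sketch in plain Python (the function name and the toy log-probabilities are illustrative, not from the paper's code):

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: beta * log(pi(y|x) / pi_ref(y|x))
    chosen_reward = beta * (policy_chosen_lp - ref_chosen_lp)
    rejected_reward = beta * (policy_rejected_lp - ref_rejected_lp)
    # Negative log-sigmoid of the reward margin (a logistic loss).
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy prefers the chosen response more strongly than the
# reference does, the margin is positive and the loss drops below
# log(2), its value at initialization (policy == reference).
loss = dpo_loss(-10.0, -14.0, -12.0, -12.0, beta=0.5)
```

Training simply backpropagates this loss through the policy's log-probabilities; no separate reward model or RL loop is needed.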

### Scalable Oversight & Automated Alignment

As AI systems approach and exceed human-level capability, humans can no longer directly evaluate their outputs. These techniques aim to maintain meaningful oversight of superhuman systems, the defining technical challenge of the ASI transition.
| Approach | Description | Links |
| --- | --- | --- |
| Weak-to-Strong Generalization | OpenAI's framework for studying how weak supervisors (humans) can elicit capabilities from stronger models, using GPT-2 supervising GPT-4 as a proxy for the superintelligence alignment problem. | Paper |
| Constitutional AI (CAI) | Anthropic's method for training AI using AI-generated feedback based on a set of principles, reducing dependence on human labelers while maintaining alignment. | Paper |
| Debate | Two AI agents argue opposing sides while a human judge evaluates. Even if the agents are superhuman, the debate format decomposes complex claims into human-verifiable steps. | Paper |
| Recursive Reward Modeling | Humans train AI to assist with evaluation, which in turn assists with training the next model, bootstrapping oversight. A core technique for the Anthropic and OpenAI alignment teams. | Paper |
| AI Control | Redwood Research's framework for maintaining safety even against models actively trying to subvert controls. Evaluates security against intentional subversion. | Paper |
| Cooperative AI | Research program at DeepMind focused on AI that cooperates with humans and other AI systems, which is critical for multi-agent superintelligence scenarios. | Paper |
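The weak-to-strong setup can be illustrated with a toy analogue (purely illustrative, not OpenAI's code or models): a "weak supervisor" that mislabels 20% of examples trains a stronger student, and the student nonetheless recovers the clean decision boundary:

```python
import math
import random

random.seed(0)

# Ground-truth task: is x positive?  The "weak supervisor" knows the
# answer but flips 20% of its labels at random.
def true_label(x):
    return 1 if x > 0 else 0

def weak_label(x):
    lbl = true_label(x)
    return 1 - lbl if random.random() < 0.2 else lbl

# Training set labeled only by the weak supervisor.
train = [(x, weak_label(x)) for x in
         (random.uniform(-1, 1) for _ in range(2000))]

# "Strong student": 1-D logistic regression fit to the weak labels
# with plain batch gradient descent.
w = b = 0.0
for _ in range(300):
    gw = gb = 0.0
    for x, y in train:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        gw += (p - y) * x
        gb += p - y
    w -= gw / len(train)
    b -= gb / len(train)

# Score both supervisor and student against the *true* labels.
test_x = [random.uniform(-1, 1) for _ in range(2000)]
weak_acc = sum(weak_label(x) == true_label(x) for x in test_x) / len(test_x)
strong_acc = sum((w * x + b > 0) == (true_label(x) == 1)
                 for x in test_x) / len(test_x)
# The student, trained only on noisy weak supervision, ends up more
# accurate than the supervisor that labeled its data.
```

The label noise is symmetric, so the student averages it away; the open question the paper studies is how far this transfers to real capability gaps.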

### Mechanistic Interpretability

Understanding how neural networks compute internally. If we can reverse-engineer the algorithms learned by frontier models, we can detect dangerous capabilities, verify alignment, and build safer systems. This is arguably the most important field for AGI safety.
| Resource | Description | Links |
| --- | --- | --- |
| Towards Monosemanticity (Anthropic, 2023) | Landmark result: sparse autoencoders decompose polysemantic neurons into interpretable features in a small transformer. | Blog |
| Scaling Monosemanticity (Anthropic, 2024) | Extracted 34M interpretable features from Claude 3 Sonnet, including safety-relevant concepts (deception, power-seeking, "I am an AI assistant"). | Blog |
| Representation Engineering (Zou et al., 2023) | Identifies and steers high-level concepts (honesty, power-seeking, emotion) directly in neural representations. A practical tool for alignment. | Paper |
| TransformerLens | Neel Nanda's open-source library for mechanistic interpretability research on GPT-2-style models; the standard tool for mech interp research. | GitHub |
| SAE Bench | Standardized benchmark for evaluating sparse autoencoder quality, enabling reproducible interpretability research. | GitHub |
| Transformer Circuits Thread (Anthropic) | Anthropic's ongoing research thread on understanding transformer internals: induction heads, superposition, circuits, and feature visualization. | Blog |
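The sparse-autoencoder idea behind the monosemanticity results can be sketched compactly: encode an activation vector into a wider, non-negative feature vector and decode it linearly, training (omitted here) against reconstruction error plus an L1 sparsity penalty. The dimensions, weights, and names below are toy placeholders:

```python
import random

random.seed(1)

D_MODEL, D_FEAT = 4, 16   # toy sizes; production SAEs use millions of features
L1_COEFF = 1e-3           # weight of the sparsity penalty

# Randomly initialized toy weights (the training loop is omitted).
W_enc = [[random.gauss(0, 0.1) for _ in range(D_MODEL)] for _ in range(D_FEAT)]
b_enc = [0.0] * D_FEAT
W_dec = [[random.gauss(0, 0.1) for _ in range(D_FEAT)] for _ in range(D_MODEL)]
b_dec = [0.0] * D_MODEL

def encode(x):
    """ReLU(W_enc @ x + b_enc): a sparse, non-negative feature vector."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W_enc, b_enc)]

def decode(f):
    """Linear reconstruction of the original activation vector."""
    return [sum(w * fi for w, fi in zip(row, f)) + b
            for row, b in zip(W_dec, b_dec)]

def sae_loss(x):
    """Reconstruction MSE plus an L1 penalty that pushes features to zero."""
    f = encode(x)
    x_hat = decode(f)
    mse = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat)) / len(x)
    l1 = sum(f)  # features are non-negative, so their sum is the L1 norm
    return mse + L1_COEFF * l1, f

activation = [0.5, -1.0, 0.3, 0.8]   # e.g. a residual-stream vector
loss, features = sae_loss(activation)
```

After training, individual entries of the feature vector tend to fire on single interpretable concepts, which is what the Anthropic posts catalogue at scale.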

### AI-Generated Text Detection
| Tool/Method | Description | Links |
| --- | --- | --- |
| DetectGPT | Stanford method using probability curvature to detect LLM-generated text. | Paper, Code |
| Detecting LLM-Generated Text | Comprehensive survey on the science of LLM-generated text detection. | Paper |
| GPTZero | AI detection model designed specifically for educators. | GPTZero |
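DetectGPT's curvature criterion can be sketched briefly: model-generated text tends to sit near a local maximum of the model's log-probability, so small perturbations (e.g. mask-and-rewrite edits) lower its log-probability more than they do for human text. A sketch of the normalized score, assuming the log-probabilities have already been computed elsewhere (the numbers are illustrative, not real model outputs):

```python
from statistics import mean, pstdev

def detectgpt_score(orig_logprob, perturbed_logprobs):
    """Normalized probability-curvature score:
    (log p(x) - mean log p(perturbations)) / std of perturbed log-probs.
    Large positive scores suggest model-generated text."""
    mu = mean(perturbed_logprobs)
    sigma = pstdev(perturbed_logprobs) or 1.0  # guard against zero spread
    return (orig_logprob - mu) / sigma

# Illustrative log-probabilities: perturbing machine text lowers its
# log-probability sharply; human text sits on flatter ground.
machine = detectgpt_score(-120.0, [-135.0, -140.0, -138.0, -133.0])
human = detectgpt_score(-150.0, [-151.0, -149.0, -152.0, -148.0])
```

In practice the score is thresholded per model, and detection degrades when the scoring model differs from the generating model.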

### AI Governance & Planetary-Scale Challenges
| Resource | Description | Links |
| --- | --- | --- |
| Climate Change AI | Global non-profit catalyzing impactful work at the intersection of climate change and machine learning. Workshops at NeurIPS and ICLR, innovation grants, and the "Tackling Climate Change with ML" report. Co-founded by Priya Donti (MIT EECS). | climatechange.ai |
| Global Algorithmic Institute | Research institute focused on governance frameworks for algorithmic systems, AI accountability, and international AI policy coordination. | globalalgorithmicinstitute.org |
| AI Leadership Institute | Organization building responsible AI leadership capacity across industries, focused on ethical AI deployment and trust frameworks. | aileadershipinstitute.com |