
Safety, Alignment and Governance

The most critical section. Ensuring AI remains safe, aligned with human values, and under meaningful control is the defining challenge as systems grow more capable.


Safety and Alignment

| Topic | Description | Links |
|---|---|---|
| LLM Hallucination Survey | Comprehensive survey on detecting, explaining, and mitigating hallucinations in LLMs. | Paper, Code |
| Hallucination Detection | Survey of hallucination in LLMs, covering detection methods and mitigation strategies. | Paper, Code |
| LLM Attacks | Universal and transferable adversarial attacks on aligned language models (CMU research). | Paper, Code |
| RLHF | Reinforcement Learning from Human Feedback: the key alignment technique behind ChatGPT. | Code Example |
| DPO | Direct Preference Optimization: an alternative to RLHF that optimizes on preference pairs directly, treating the language model itself as an implicit reward model. | Paper, Code |
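The DPO objective from the table above is simple enough to sketch directly. This is a minimal illustration of the per-pair loss, not the paper's reference implementation; the log-probability inputs are hypothetical scalars that would normally come from summing token log-probs of a full response under the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                         - (log pi(y_l) - log pi_ref(y_l))]).
    beta controls the strength of the implicit KL constraint
    toward the reference model."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log sigmoid(x) = log(1 + exp(-x)), guarded against overflow
    return math.log1p(math.exp(-logits)) if logits > -30 else -logits
```

When the policy matches the reference (both margins equal), the loss sits at log 2; raising the chosen response's log-prob relative to the rejected one drives it toward zero.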

Scalable Oversight & Automated Alignment

As AI systems approach and exceed human-level capability, humans can no longer directly evaluate their outputs. These techniques aim to maintain meaningful oversight over superhuman systems -- the defining technical challenge of the ASI transition.

| Approach | Description | Links |
|---|---|---|
| Weak-to-Strong Generalization | OpenAI's framework for studying how weak supervisors (standing in for humans) can elicit capabilities from stronger models; GPT-2 supervising GPT-4 serves as a proxy for the superintelligence alignment problem. | Paper |
| Constitutional AI (CAI) | Anthropic's method for training AI with AI-generated feedback based on a set of written principles, reducing dependence on human labelers while maintaining alignment. | Paper |
| Debate | Two AI agents argue opposing sides of a question while a human judge evaluates; even if the AIs are superhuman, the debate format decomposes complex claims into human-verifiable steps. | Paper |
| Recursive Reward Modeling | Humans train AI to assist with evaluation, which then assists with training the next model, bootstrapping oversight. A core technique for the Anthropic and OpenAI alignment teams. | Paper |
| AI Control | Redwood Research's framework for maintaining safety even against models actively trying to subvert safeguards; evaluates protocols against intentional subversion. | Paper |
| Cooperative AI | Research program at DeepMind focused on AI that cooperates with humans and other AI systems, critical for multi-agent superintelligence scenarios. | Paper |
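The weak-to-strong work above is usually summarized with a single metric, performance gap recovered (PGR): how much of the accuracy gap between the weak supervisor and the strong model's ceiling is closed when the strong model is trained only on weak labels. A minimal sketch, assuming accuracies are already measured:

```python
def performance_gap_recovered(weak_acc, w2s_acc, strong_ceiling_acc):
    """Performance Gap Recovered (PGR) from the weak-to-strong
    generalization setup: the fraction of the gap between the weak
    supervisor's accuracy and the strong model's ground-truth ceiling
    that is recovered by training the strong model on weak labels.
    PGR = 1 means full recovery; PGR = 0 means no benefit."""
    return (w2s_acc - weak_acc) / (strong_ceiling_acc - weak_acc)
```

For example, if the weak model scores 60%, the weak-to-strong model 75%, and the strong ceiling 90%, half of the gap has been recovered.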

Mechanistic Interpretability

Understanding how neural networks actually compute internally. If we can reverse-engineer the algorithms learned by frontier models, we can detect dangerous capabilities, verify alignment, and build safer systems. This is arguably the most important field for AGI safety.

| Resource | Description | Links |
|---|---|---|
| Towards Monosemanticity (Anthropic, 2023) | Landmark result: sparse autoencoders decompose polysemantic neurons into thousands of interpretable features in a one-layer transformer. | Blog |
| Scaling Monosemanticity (Anthropic, 2024) | Extracted 34M interpretable features from Claude 3 Sonnet, including safety-relevant concepts (deception, power-seeking, "I am an AI assistant"). | Blog |
| Representation Engineering (Zou et al., 2023) | Identifies and steers high-level concepts (honesty, power-seeking, emotion) directly in neural representations; a practical tool for alignment work. | Paper |
| TransformerLens | Neel Nanda's open-source library for mechanistic interpretability research on GPT-2-style models; the standard tool for mech interp research. | GitHub |
| SAE Bench | Standardized benchmark for evaluating sparse autoencoder quality, enabling reproducible interpretability research. | GitHub |
| Transformer Circuits Thread (Anthropic) | Anthropic's ongoing research thread on understanding transformer internals: induction heads, superposition, circuits, and feature visualization. | Blog |
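The sparse-autoencoder idea behind the monosemanticity results can be sketched in a few lines. This is a toy forward pass only (dependency-free, with hand-supplied weight matrices), omitting the training loop whose reconstruction-plus-L1 objective is what actually induces sparsity:

```python
def relu(xs):
    """Elementwise ReLU over a plain Python list."""
    return [max(0.0, x) for x in xs]

def sae_encode(activation, W_enc, b_enc):
    """Encode a model activation into an overcomplete feature vector:
    f = ReLU(W_enc @ a + b_enc). Trained with an L1 penalty, most
    features stay at zero, so the few active ones tend to be
    individually interpretable (monosemantic)."""
    return relu([sum(w * a for w, a in zip(row, activation)) + b
                 for row, b in zip(W_enc, b_enc)])

def sae_decode(features, W_dec):
    """Reconstruct the activation from sparse features: a_hat = W_dec @ f."""
    return [sum(w * f for w, f in zip(row, features)) for row in W_dec]
```

The key design choice is the overcomplete dictionary: the feature dimension is much larger than the activation dimension, giving superposed concepts room to separate into distinct directions.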

AI-Generated Text Detection

| Tool/Method | Description | Links |
|---|---|---|
| DetectGPT | Stanford method that uses probability curvature to detect LLM-generated text. | Paper, Code |
| Detecting LLM-Generated Text | Comprehensive survey on the science of LLM-generated text detection. | Paper |
| GPTZero | AI-text detection model designed specifically for educators. | GPTZero |
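DetectGPT's curvature test rests on one observation: machine-generated text tends to sit near a local maximum of the scoring model's log-probability, so small perturbations lower its log-prob more than they would for human text. A minimal sketch of the normalized statistic, assuming the log-probabilities (of the original passage and of mask-and-refill rewrites, produced with a model such as T5 in the paper) are already computed:

```python
import statistics

def detectgpt_score(logp_original, logp_perturbations):
    """Normalized probability-curvature statistic: how far the original
    passage's log-prob sits above the mean log-prob of its perturbed
    rewrites, in units of their standard deviation. A large positive
    score suggests LLM-generated text; scores near zero suggest
    human-written text."""
    mu = statistics.mean(logp_perturbations)
    sigma = statistics.stdev(logp_perturbations)
    return (logp_original - mu) / sigma
```

In practice the score is thresholded per scoring model, and more perturbations (the paper uses on the order of 100) give a more stable estimate of the curvature.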

AI Governance & Planetary-Scale Challenges

| Resource | Description | Links |
|---|---|---|
| Climate Change AI | Global non-profit catalyzing impactful work at the intersection of climate change and machine learning: workshops at NeurIPS and ICLR, innovation grants, and the "Tackling Climate Change with ML" report. Co-founded by Priya Donti (MIT EECS). | climatechange.ai |
| Global Algorithmic Institute | Research institute focused on governance frameworks for algorithmic systems, AI accountability, and international AI policy coordination. | globalalgorithmicinstitute.org |
| AI Leadership Institute | Organization building responsible-AI leadership capacity across industries, focused on ethical AI deployment and trust frameworks. | aileadershipinstitute.com |