
Safety, Alignment and Governance

The most critical section. Ensuring AI remains safe, aligned with human values, and under meaningful control is the defining challenge as systems grow more capable.


Safety and Alignment

| Topic | Description | Links |
|---|---|---|
| LLM Hallucination Survey | Comprehensive survey on detecting, explaining, and mitigating hallucinations in LLMs. | Paper, Code |
| Hallucination Detection | Survey of hallucination in LLMs, covering detection methods and mitigation strategies. | Paper, Code |
| LLM Attacks | Universal and transferable adversarial attacks on aligned language models (CMU research). | Paper, Code |
| RLHF | Reinforcement Learning from Human Feedback: the key alignment technique behind ChatGPT. | Code Example |
| DPO | Direct Preference Optimization: an alternative to RLHF that optimizes on preference pairs directly, treating the language model itself as an implicit reward model. | Paper, Code |
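The DPO objective from the table above is simple enough to sketch directly. This is a minimal illustration of the per-pair loss, not the paper's reference implementation; the log-probability inputs are hypothetical scalars that would normally come from summing token log-probs of a full response under the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                         - (log pi(y_l) - log pi_ref(y_l))]).
    beta controls the strength of the implicit KL constraint
    toward the reference model."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log sigmoid(x) = log(1 + exp(-x)), guarded against overflow
    return math.log1p(math.exp(-logits)) if logits > -30 else -logits
```

When the policy matches the reference (both margins equal), the loss sits at log 2; raising the chosen response's log-prob relative to the rejected one drives it toward zero.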

Scalable Oversight & Automated Alignment

As AI systems approach and exceed human-level capability, humans can no longer directly evaluate their outputs. These techniques aim to maintain meaningful oversight over superhuman systems -- the defining technical challenge of the ASI transition.

| Approach | Description | Links |
|---|---|---|
| Weak-to-Strong Generalization | OpenAI's framework for studying how weak supervisors (standing in for humans) can elicit capabilities from stronger models; GPT-2 supervising GPT-4 serves as a proxy for the superintelligence alignment problem. | Paper |
| Constitutional AI (CAI) | Anthropic's method for training AI with AI-generated feedback based on a set of written principles, reducing dependence on human labelers while maintaining alignment. | Paper |
| Debate | Two AI agents argue opposing sides of a question while a human judge evaluates; even if the AIs are superhuman, the debate format decomposes complex claims into human-verifiable steps. | Paper |
| Recursive Reward Modeling | Humans train AI to assist with evaluation, which then assists with training the next model, bootstrapping oversight. A core technique for the Anthropic and OpenAI alignment teams. | Paper |
| AI Control | Redwood Research's framework for maintaining safety even against models actively trying to subvert safeguards; evaluates protocols against intentional subversion. | Paper |
| Cooperative AI | Research program at DeepMind focused on AI that cooperates with humans and other AI systems, critical for multi-agent superintelligence scenarios. | Paper |
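The weak-to-strong work above is usually summarized with a single metric, performance gap recovered (PGR): how much of the accuracy gap between the weak supervisor and the strong model's ceiling is closed when the strong model is trained only on weak labels. A minimal sketch, assuming accuracies are already measured:

```python
def performance_gap_recovered(weak_acc, w2s_acc, strong_ceiling_acc):
    """Performance Gap Recovered (PGR) from the weak-to-strong
    generalization setup: the fraction of the gap between the weak
    supervisor's accuracy and the strong model's ground-truth ceiling
    that is recovered by training the strong model on weak labels.
    PGR = 1 means full recovery; PGR = 0 means no benefit."""
    return (w2s_acc - weak_acc) / (strong_ceiling_acc - weak_acc)
```

For example, if the weak model scores 60%, the weak-to-strong model 75%, and the strong ceiling 90%, half of the gap has been recovered.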

Mechanistic Interpretability

Understanding how neural networks actually compute internally. If we can reverse-engineer the algorithms learned by frontier models, we can detect dangerous capabilities, verify alignment, and build safer systems. This is arguably the most important field for AGI safety.

| Resource | Description | Links |
|---|---|---|
| Towards Monosemanticity (Anthropic, 2023) | Landmark result: sparse autoencoders decompose polysemantic neurons into thousands of interpretable features in a one-layer transformer. | Blog |
| Scaling Monosemanticity (Anthropic, 2024) | Extracted 34M interpretable features from Claude 3 Sonnet, including safety-relevant concepts (deception, power-seeking, "I am an AI assistant"). | Blog |
| Representation Engineering (Zou et al., 2023) | Identifies and steers high-level concepts (honesty, power-seeking, emotion) directly in neural representations; a practical tool for alignment work. | Paper |
| TransformerLens | Neel Nanda's open-source library for mechanistic interpretability research on GPT-2-style models; the standard tool for mech interp research. | GitHub |
| SAE Bench | Standardized benchmark for evaluating sparse autoencoder quality, enabling reproducible interpretability research. | GitHub |
| Transformer Circuits Thread (Anthropic) | Anthropic's ongoing research thread on understanding transformer internals: induction heads, superposition, circuits, and feature visualization. | Blog |
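The sparse-autoencoder idea behind the monosemanticity results can be sketched in a few lines. This is a toy forward pass only (dependency-free, with hand-supplied weight matrices), omitting the training loop whose reconstruction-plus-L1 objective is what actually induces sparsity:

```python
def relu(xs):
    """Elementwise ReLU over a plain Python list."""
    return [max(0.0, x) for x in xs]

def sae_encode(activation, W_enc, b_enc):
    """Encode a model activation into an overcomplete feature vector:
    f = ReLU(W_enc @ a + b_enc). Trained with an L1 penalty, most
    features stay at zero, so the few active ones tend to be
    individually interpretable (monosemantic)."""
    return relu([sum(w * a for w, a in zip(row, activation)) + b
                 for row, b in zip(W_enc, b_enc)])

def sae_decode(features, W_dec):
    """Reconstruct the activation from sparse features: a_hat = W_dec @ f."""
    return [sum(w * f for w, f in zip(row, features)) for row in W_dec]
```

The key design choice is the overcomplete dictionary: the feature dimension is much larger than the activation dimension, giving superposed concepts room to separate into distinct directions.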

AI-Generated Text Detection

| Tool/Method | Description | Links |
|---|---|---|
| DetectGPT | Stanford method that uses probability curvature to detect LLM-generated text. | Paper, Code |
| Detecting LLM-Generated Text | Comprehensive survey on the science of LLM-generated text detection. | Paper |
| GPTZero | AI-text detection model designed specifically for educators. | GPTZero |
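DetectGPT's curvature test rests on one observation: machine-generated text tends to sit near a local maximum of the scoring model's log-probability, so small perturbations lower its log-prob more than they would for human text. A minimal sketch of the normalized statistic, assuming the log-probabilities (of the original passage and of mask-and-refill rewrites, produced with a model such as T5 in the paper) are already computed:

```python
import statistics

def detectgpt_score(logp_original, logp_perturbations):
    """Normalized probability-curvature statistic: how far the original
    passage's log-prob sits above the mean log-prob of its perturbed
    rewrites, in units of their standard deviation. A large positive
    score suggests LLM-generated text; scores near zero suggest
    human-written text."""
    mu = statistics.mean(logp_perturbations)
    sigma = statistics.stdev(logp_perturbations)
    return (logp_original - mu) / sigma
```

In practice the score is thresholded per scoring model, and more perturbations (the paper uses on the order of 100) give a more stable estimate of the curvature.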

AI Governance & Planetary-Scale Challenges

| Resource | Description | Links |
|---|---|---|
| Climate Change AI | Global non-profit catalyzing impactful work at the intersection of climate change and machine learning: workshops at NeurIPS and ICLR, innovation grants, and the "Tackling Climate Change with ML" report. Co-founded by Priya Donti (MIT EECS). | climatechange.ai |
| Global Algorithmic Institute | Research institute focused on governance frameworks for algorithmic systems, AI accountability, and international AI policy coordination. | globalalgorithmicinstitute.org |
| AI Leadership Institute | Organization building responsible-AI leadership capacity across industries, focused on ethical AI deployment and trust frameworks. | aileadershipinstitute.com |