
Safety, Alignment and Governance

The most critical section. Ensuring AI remains safe, aligned with human values, and under meaningful control is the defining challenge as systems grow more capable.


Safety and Alignment

| Topic | Description | Links |
| --- | --- | --- |
| LLM Hallucination Survey | Comprehensive survey on detecting, explaining, and mitigating hallucinations in LLMs. | Paper, Code |
| Hallucination Detection | Survey of hallucination in LLMs, covering detection methods and mitigation strategies. | Paper, Code |
| LLM Attacks | Universal and transferable adversarial attacks on aligned language models (CMU research). | Paper, Code |
| RLHF | Reinforcement Learning from Human Feedback, the key alignment technique behind ChatGPT. | Code Example |
| DPO | Direct Preference Optimization, an RLHF alternative that optimizes the policy directly on preference data, treating the language model as an implicit reward model. | Paper, Code |
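The DPO objective from the table above can be sketched in a few lines: it scores a preference pair by the policy's log-probability ratios against a frozen reference model, with no explicit reward model or RL loop. The log-probabilities below are placeholders, not real model outputs.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    logp_* are summed token log-probabilities of the chosen/rejected
    responses under the policy; ref_logp_* under the frozen reference.
    """
    # Implicit rewards: beta-scaled log-ratios against the reference model.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # -log sigmoid(margin): pushes the policy to prefer the chosen response.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy that already prefers the chosen response incurs a lower loss.
better = dpo_loss(-5.0, -9.0, -6.0, -6.0)  # policy favors chosen
worse = dpo_loss(-9.0, -5.0, -6.0, -6.0)   # policy favors rejected
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2, the same starting point for every pair.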

Scalable Oversight & Automated Alignment

As AI systems approach and exceed human-level capability, humans can no longer directly evaluate their outputs. These techniques aim to maintain meaningful oversight over superhuman systems -- the defining technical challenge of the ASI transition.

| Approach | Description | Links |
| --- | --- | --- |
| Weak-to-Strong Generalization | OpenAI's framework for studying how weak supervisors (humans) can elicit capabilities from stronger models; GPT-2 supervising GPT-4 serves as a proxy for the superintelligence alignment problem. | Paper |
| Constitutional AI (CAI) | Anthropic's method for training AI with AI-generated feedback based on a set of principles, reducing dependence on human labelers while maintaining alignment. | Paper |
| Debate | Two AI agents argue opposing sides and a human judge evaluates. Even if an AI is superhuman, the debate format decomposes complex claims into human-verifiable steps. | Paper |
| Recursive Reward Modeling | Humans train an AI to assist with evaluation, which then assists with training the next model -- bootstrapping oversight. A core technique for the Anthropic and OpenAI alignment teams. | Paper |
| AI Control | Redwood Research's framework for maintaining safety even against models actively trying to subvert controls; evaluates security against intentional subversion. | Paper |
| Cooperative AI | Research program at DeepMind focused on AI that cooperates with humans and other AI systems -- critical for multi-agent superintelligence scenarios. | Paper |
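The weak-to-strong phenomenon can be illustrated with a deliberately tiny toy (this is not OpenAI's actual setup): a student trained only on a weak supervisor's noisy labels can still outperform that supervisor, because the label noise is random while the underlying pattern is learnable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth task: label = 1 when x0 + x1 > 0.
X = rng.normal(size=(2000, 2))
y_true = (X[:, 0] + X[:, 1] > 0).astype(float)

# "Weak supervisor": a noisy labeler that is wrong 25% of the time.
flip = rng.random(2000) < 0.25
y_weak = np.where(flip, 1 - y_true, y_true)
weak_acc = (y_weak == y_true).mean()

# "Strong student": logistic regression trained only on the weak labels.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    grad_w = X.T @ (p - y_weak) / len(X)
    grad_b = (p - y_weak).mean()
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

# Evaluate the student against the true labels, not the noisy ones.
strong_acc = (((X @ w + b) > 0) == (y_true > 0.5)).mean()
print(f"weak labels: {weak_acc:.2f}, strong student: {strong_acc:.2f}")
```

The open question the table addresses is whether this effect persists when the "pattern" is superhuman reasoning rather than a linear boundary.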

Mechanistic Interpretability

Understanding how neural networks actually compute internally. If we can reverse-engineer the algorithms learned by frontier models, we can detect dangerous capabilities, verify alignment, and build safer systems. This is arguably the most important field for AGI safety.

| Resource | Description | Links |
| --- | --- | --- |
| Towards Monosemanticity (Anthropic, 2023) | Landmark result: sparse autoencoders decompose the polysemantic neurons of a small transformer into thousands of interpretable features. | Blog |
| Scaling Monosemanticity (Anthropic, 2024) | Extracted 34M interpretable features from Claude 3 Sonnet, including safety-relevant concepts (deception, power-seeking, "I am an AI assistant"). | Blog |
| Representation Engineering (Zou et al., 2023) | Identifies and steers high-level concepts (honesty, power-seeking, emotion) directly in neural representations. A practical tool for alignment. | Paper |
| TransformerLens | Neel Nanda's open-source library for mechanistic interpretability research on GPT-2-style models. The standard tool for mech interp research. | GitHub |
| SAE Bench | Standardized benchmark for evaluating sparse autoencoder quality -- enabling reproducible interpretability research. | GitHub |
| Transformer Circuits Thread (Anthropic) | Anthropic's ongoing research thread on understanding transformer internals: induction heads, superposition, circuits, and feature visualization. | Blog |
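The sparse-autoencoder recipe behind the monosemanticity results above can be sketched in NumPy. The dimensions, initialization, and L1 coefficient here are illustrative only; a real SAE is trained on millions of model activations, not evaluated at random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # overcomplete dictionary: more features than dims

# Randomly initialized SAE parameters (a real SAE would train these).
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation vector into sparse features, then reconstruct."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU keeps features non-negative
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages each input
    to activate only a few features -- the source of interpretability."""
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.abs(f).sum()
    return recon + sparsity

x = rng.normal(size=d_model)  # stand-in for a residual-stream activation
features, _ = sae_forward(x)
```

The trained decoder rows then act as a dictionary: each row is one candidate interpretable direction in activation space.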

AI-Generated Text Detection

| Tool/Method | Description | Links |
| --- | --- | --- |
| DetectGPT | Stanford method using probability curvature to detect LLM-generated text. | Paper, Code |
| Detecting LLM-Generated-Text | Comprehensive survey on the science of LLM-generated text detection. | Paper |
| GPTZero | AI detection model designed specifically for educators. | GPTZero |
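DetectGPT's curvature statistic is simple to state: model-generated text tends to sit at a local maximum of the model's log-probability, so perturbed rewrites score noticeably lower, while human text shows no such drop. A hedged sketch with stand-in `perturb` and `log_prob` functions (the paper uses T5 mask-filling and the source LM for these):

```python
import statistics

def detectgpt_score(text, perturb, log_prob, n_perturbations=20):
    """Curvature statistic: log p(text) minus the mean log-prob of
    perturbed rewrites. Large positive values suggest machine text."""
    original = log_prob(text)
    perturbed = [log_prob(perturb(text)) for _ in range(n_perturbations)]
    return original - statistics.mean(perturbed)

# Toy deterministic stand-ins so the statistic is easy to inspect.
def toy_perturb(text):
    return text + " [perturbed]"

def toy_log_prob(text):
    return -14.0 if "[perturbed]" in text else -10.0

score = detectgpt_score("sample output", toy_perturb, toy_log_prob)  # 4.0
```

In practice a detection threshold on this score is calibrated on held-out human and machine text.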

AI Governance & Planetary-Scale Challenges

| Resource | Description | Links |
| --- | --- | --- |
| Climate Change AI | Global non-profit catalyzing impactful work at the intersection of climate change and machine learning. Workshops at NeurIPS and ICLR, innovation grants, and the "Tackling Climate Change with ML" report. Co-founded by Priya Donti (MIT EECS). | climatechange.ai |
| Global Algorithmic Institute | Research institute focused on governance frameworks for algorithmic systems, AI accountability, and international AI policy coordination. | globalalgorithmicinstitute.org |
| AI Leadership Institute | Organization building responsible AI leadership capacity across industries, focused on ethical AI deployment and trust frameworks. | aileadershipinstitute.com |

Critical Infrastructure Threats

Foundational security vulnerabilities that could compromise web infrastructure, cryptocurrency, and AI systems.

| Threat | Description | Links |
| --- | --- | --- |
| SHA-256 Vulnerability | Reported claims that a researcher has come close to breaking SHA-256, the hashing algorithm underlying SSL, Bitcoin, and web security, with hash collisions predicted within months. | - |
| Unicode Supply Chain Attacks | New supply-chain attack infecting GitHub repositories using Unicode characters that have no visual representation but are meaningful to compilers and interpreters. | - |
| API Key Credential Escalation | Google API keys issued for Maps/Gemini can now access Gemini assistant capabilities, enabling credential theft and access to private data. | - |

AI-Specific Security Threats

Attack vectors unique to AI systems, including benchmark gaming, memory poisoning, and identity attacks.

| Threat | Description | Links |
| --- | --- | --- |
| Benchmark Gaming | AI systems finding and exploiting benchmark weaknesses; Claude found an encrypted BrowseComp answer key on GitHub, decrypted it, and used it. | - |
| Memory Poisoning | "AI recommendation poisoning": attacks delivered via "Summarize with AI" buttons that add commands to a model's persistent memory for future manipulation. | - |
| Deepfake Identity Attacks | Deepfakes now attack identity verification systems at scale; biometric authentication is vulnerable to AI-generated synthetic media. | - |
| LLM De-anonymization | LLMs can identify the authors of anonymous posts at scale, threatening privacy and whistleblower protection. | - |

Agent Safety & Control

Safety mechanisms and concerns for autonomous agents, including permission systems and desktop control.

| Mechanism / Concern | Description | Links |
| --- | --- | --- |
| Anthropic Auto Mode | A safer alternative to the "dangerously skip permissions" option: a classifier assesses each action's safety before execution, and users can switch between permission sets. | - |
| Desktop Application Control | Agents controlling desktop applications (mouse, keyboard, app launching) blur the line between automation and user, creating a new attack surface. | - |
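The classifier-gated permission pattern described above can be sketched as follows. Everything here is hypothetical: the permission-set names, the keyword stub standing in for a learned safety classifier, and the decision flow are illustrative, not Anthropic's implementation.

```python
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    ASK_USER = "ask_user"
    DENY = "deny"

# Hypothetical permission sets the user can switch between.
PERMISSION_SETS = {
    "strict": {"read_file"},
    "standard": {"read_file", "run_tests", "edit_file"},
}

DANGEROUS_PATTERNS = ("rm -rf", "curl | sh", "sudo")

def classify_action(command: str) -> bool:
    """Stub safety classifier: flag obviously destructive commands.
    A production system would use a learned model here."""
    return not any(p in command for p in DANGEROUS_PATTERNS)

def gate(action: str, command: str, permission_set: str) -> Decision:
    """Deny unsafe commands, auto-run pre-approved actions,
    and escalate everything else to the human."""
    if not classify_action(command):
        return Decision.DENY
    if action in PERMISSION_SETS[permission_set]:
        return Decision.ALLOW
    return Decision.ASK_USER  # fall back to the user, never auto-run
```

The key design choice is the default: anything not explicitly allowed escalates to the user rather than executing.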

Infrastructure Security

Security concerns in AI infrastructure, including network protocols and containerization.

| Concern | Description | Links |
| --- | --- | --- |
| AirSnitch | WiFi attack operating at layers 1-2 of the protocol stack to bypass encryption rather than break it, sidestepping standard WiFi security. | - |