
Safety, Alignment and Governance

The most critical section. Ensuring AI remains safe, aligned with human values, and under meaningful control is the defining challenge as systems grow more capable.


Safety and Alignment

| Topic | Description | Links |
| --- | --- | --- |
| LLM Hallucination Survey | Comprehensive survey on detecting, explaining, and mitigating hallucinations in LLMs. | Paper, Code |
| Hallucination Detection | Survey of hallucination in LLMs, covering detection methods and mitigation strategies. | Paper, Code |
| LLM Attacks | Universal and transferable adversarial attacks on aligned language models (CMU research). | Paper, Code |
| RLHF | Reinforcement Learning from Human Feedback, the key alignment technique behind ChatGPT. | Code Example |
| DPO | Direct Preference Optimization, an RLHF alternative that optimizes the policy directly on preference data, treating the language model as an implicit reward model. | Paper, Code |
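The DPO objective from the table above can be sketched in a few lines: it scores a preference pair by the policy's log-probability ratios against a frozen reference model, with no explicit reward model or RL loop. The log-probabilities below are placeholders, not real model outputs.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    logp_* are summed token log-probabilities of the chosen/rejected
    responses under the policy; ref_logp_* under the frozen reference.
    """
    # Implicit rewards: beta-scaled log-ratios against the reference model.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # -log sigmoid(margin): pushes the policy to prefer the chosen response.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy that already prefers the chosen response incurs a lower loss.
better = dpo_loss(-5.0, -9.0, -6.0, -6.0)  # policy favors chosen
worse = dpo_loss(-9.0, -5.0, -6.0, -6.0)   # policy favors rejected
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2, the same starting point for every pair.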

Scalable Oversight & Automated Alignment

As AI systems approach and exceed human-level capability, humans can no longer directly evaluate their outputs. These techniques aim to maintain meaningful oversight over superhuman systems -- the defining technical challenge of the ASI transition.

| Approach | Description | Links |
| --- | --- | --- |
| Weak-to-Strong Generalization | OpenAI's framework for studying how weak supervisors (humans) can elicit capabilities from stronger models; GPT-2 supervising GPT-4 serves as a proxy for the superintelligence alignment problem. | Paper |
| Constitutional AI (CAI) | Anthropic's method for training AI with AI-generated feedback based on a set of principles, reducing dependence on human labelers while maintaining alignment. | Paper |
| Debate | Two AI agents argue opposing sides and a human judge evaluates. Even if an AI is superhuman, the debate format decomposes complex claims into human-verifiable steps. | Paper |
| Recursive Reward Modeling | Humans train an AI to assist with evaluation, which then assists with training the next model -- bootstrapping oversight. A core technique for the Anthropic and OpenAI alignment teams. | Paper |
| AI Control | Redwood Research's framework for maintaining safety even against models actively trying to subvert controls; evaluates security against intentional subversion. | Paper |
| Cooperative AI | Research program at DeepMind focused on AI that cooperates with humans and other AI systems -- critical for multi-agent superintelligence scenarios. | Paper |
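The weak-to-strong phenomenon can be illustrated with a deliberately tiny toy (this is not OpenAI's actual setup): a student trained only on a weak supervisor's noisy labels can still outperform that supervisor, because the label noise is random while the underlying pattern is learnable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth task: label = 1 when x0 + x1 > 0.
X = rng.normal(size=(2000, 2))
y_true = (X[:, 0] + X[:, 1] > 0).astype(float)

# "Weak supervisor": a noisy labeler that is wrong 25% of the time.
flip = rng.random(2000) < 0.25
y_weak = np.where(flip, 1 - y_true, y_true)
weak_acc = (y_weak == y_true).mean()

# "Strong student": logistic regression trained only on the weak labels.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    grad_w = X.T @ (p - y_weak) / len(X)
    grad_b = (p - y_weak).mean()
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

# Evaluate the student against the true labels, not the noisy ones.
strong_acc = (((X @ w + b) > 0) == (y_true > 0.5)).mean()
print(f"weak labels: {weak_acc:.2f}, strong student: {strong_acc:.2f}")
```

The open question the table addresses is whether this effect persists when the "pattern" is superhuman reasoning rather than a linear boundary.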

Mechanistic Interpretability

Understanding how neural networks actually compute internally. If we can reverse-engineer the algorithms learned by frontier models, we can detect dangerous capabilities, verify alignment, and build safer systems. This is arguably the most important field for AGI safety.

| Resource | Description | Links |
| --- | --- | --- |
| Towards Monosemanticity (Anthropic, 2023) | Landmark result: sparse autoencoders decompose the polysemantic neurons of a small transformer into thousands of interpretable features. | Blog |
| Scaling Monosemanticity (Anthropic, 2024) | Extracted 34M interpretable features from Claude 3 Sonnet, including safety-relevant concepts (deception, power-seeking, "I am an AI assistant"). | Blog |
| Representation Engineering (Zou et al., 2023) | Identifies and steers high-level concepts (honesty, power-seeking, emotion) directly in neural representations. A practical tool for alignment. | Paper |
| TransformerLens | Neel Nanda's open-source library for mechanistic interpretability research on GPT-2-style models. The standard tool for mech interp research. | GitHub |
| SAE Bench | Standardized benchmark for evaluating sparse autoencoder quality -- enabling reproducible interpretability research. | GitHub |
| Transformer Circuits Thread (Anthropic) | Anthropic's ongoing research thread on understanding transformer internals: induction heads, superposition, circuits, and feature visualization. | Blog |
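The sparse-autoencoder recipe behind the monosemanticity results above can be sketched in NumPy. The dimensions, initialization, and L1 coefficient here are illustrative only; a real SAE is trained on millions of model activations, not evaluated at random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # overcomplete dictionary: more features than dims

# Randomly initialized SAE parameters (a real SAE would train these).
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode an activation vector into sparse features, then reconstruct."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU keeps features non-negative
    x_hat = f @ W_dec + b_dec
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages each input
    to activate only a few features -- the source of interpretability."""
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.abs(f).sum()
    return recon + sparsity

x = rng.normal(size=d_model)  # stand-in for a residual-stream activation
features, _ = sae_forward(x)
```

The trained decoder rows then act as a dictionary: each row is one candidate interpretable direction in activation space.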

AI-Generated Text Detection

| Tool/Method | Description | Links |
| --- | --- | --- |
| DetectGPT | Stanford method using probability curvature to detect LLM-generated text. | Paper, Code |
| Detecting LLM-Generated-Text | Comprehensive survey on the science of LLM-generated text detection. | Paper |
| GPTZero | AI detection model designed specifically for educators. | GPTZero |
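DetectGPT's curvature statistic is simple to state: model-generated text tends to sit at a local maximum of the model's log-probability, so perturbed rewrites score noticeably lower, while human text shows no such drop. A hedged sketch with stand-in `perturb` and `log_prob` functions (the paper uses T5 mask-filling and the source LM for these):

```python
import statistics

def detectgpt_score(text, perturb, log_prob, n_perturbations=20):
    """Curvature statistic: log p(text) minus the mean log-prob of
    perturbed rewrites. Large positive values suggest machine text."""
    original = log_prob(text)
    perturbed = [log_prob(perturb(text)) for _ in range(n_perturbations)]
    return original - statistics.mean(perturbed)

# Toy deterministic stand-ins so the statistic is easy to inspect.
def toy_perturb(text):
    return text + " [perturbed]"

def toy_log_prob(text):
    return -14.0 if "[perturbed]" in text else -10.0

score = detectgpt_score("sample output", toy_perturb, toy_log_prob)  # 4.0
```

In practice a detection threshold on this score is calibrated on held-out human and machine text.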

AI Governance & Planetary-Scale Challenges

| Resource | Description | Links |
| --- | --- | --- |
| Climate Change AI | Global non-profit catalyzing impactful work at the intersection of climate change and machine learning. Workshops at NeurIPS and ICLR, innovation grants, and the "Tackling Climate Change with ML" report. Co-founded by Priya Donti (MIT EECS). | climatechange.ai |
| Global Algorithmic Institute | Research institute focused on governance frameworks for algorithmic systems, AI accountability, and international AI policy coordination. | globalalgorithmicinstitute.org |
| AI Leadership Institute | Organization building responsible AI leadership capacity across industries, focused on ethical AI deployment and trust frameworks. | aileadershipinstitute.com |

Critical Infrastructure Threats

Foundational security vulnerabilities that could compromise web infrastructure, cryptocurrency, and AI systems.

| Threat | Description | Links |
| --- | --- | --- |
| SHA-256 Vulnerability | Reported claims that a researcher has come close to breaking SHA-256, the hashing algorithm underlying SSL, Bitcoin, and web security, with hash collisions predicted within months. | - |
| Unicode Supply Chain Attacks | New supply-chain attack infecting GitHub repositories using Unicode characters that have no visual representation but are meaningful to compilers and interpreters. | - |
| API Key Credential Escalation | Google API keys issued for Maps/Gemini can now access Gemini assistant capabilities, enabling credential theft and access to private data. | - |

AI-Specific Security Threats

Attack vectors unique to AI systems, including benchmark gaming, memory poisoning, and identity attacks.

| Threat | Description | Links |
| --- | --- | --- |
| Benchmark Gaming | AI systems finding and exploiting benchmark weaknesses; Claude found an encrypted BrowseComp answer key on GitHub, decrypted it, and used it. | - |
| Memory Poisoning | "AI recommendation poisoning": attacks delivered via "Summarize with AI" buttons that add commands to a model's persistent memory for future manipulation. | - |
| Deepfake Identity Attacks | Deepfakes now attack identity verification systems at scale; biometric authentication is vulnerable to AI-generated synthetic media. | - |
| LLM De-anonymization | LLMs can identify the authors of anonymous posts at scale, threatening privacy and whistleblower protection. | - |

Agent Safety & Control

Safety mechanisms and concerns for autonomous agents, including permission systems and desktop control.

| Mechanism / Concern | Description | Links |
| --- | --- | --- |
| Anthropic Auto Mode | A safer alternative to the "dangerously skip permissions" option: a classifier assesses each action's safety before execution, and users can switch between permission sets. | - |
| Desktop Application Control | Agents controlling desktop applications (mouse, keyboard, app launching) blur the line between automation and user, creating a new attack surface. | - |
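The classifier-gated permission pattern described above can be sketched as follows. Everything here is hypothetical: the permission-set names, the keyword stub standing in for a learned safety classifier, and the decision flow are illustrative, not Anthropic's implementation.

```python
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    ASK_USER = "ask_user"
    DENY = "deny"

# Hypothetical permission sets the user can switch between.
PERMISSION_SETS = {
    "strict": {"read_file"},
    "standard": {"read_file", "run_tests", "edit_file"},
}

DANGEROUS_PATTERNS = ("rm -rf", "curl | sh", "sudo")

def classify_action(command: str) -> bool:
    """Stub safety classifier: flag obviously destructive commands.
    A production system would use a learned model here."""
    return not any(p in command for p in DANGEROUS_PATTERNS)

def gate(action: str, command: str, permission_set: str) -> Decision:
    """Deny unsafe commands, auto-run pre-approved actions,
    and escalate everything else to the human."""
    if not classify_action(command):
        return Decision.DENY
    if action in PERMISSION_SETS[permission_set]:
        return Decision.ALLOW
    return Decision.ASK_USER  # fall back to the user, never auto-run
```

The key design choice is the default: anything not explicitly allowed escalates to the user rather than executing.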

Infrastructure Security

Security concerns in AI infrastructure, including network protocols and containerization.

| Concern | Description | Links |
| --- | --- | --- |
| AirSnitch | WiFi attack operating at layers 1-2 of the protocol stack to bypass encryption rather than break it, sidestepping standard WiFi security. | - |