How can AIs know what we want if *we* don't even know? (with Geoffrey Irving)
Spencer Greenberg and Geoffrey Irving discuss AI alignment: making AI systems do what humans want. They explore challenges such as AI manipulation, cultural bias in training data, the importance of engineering human feedback protocols, and designing systems with safe defaults.
Deep Dive Analysis
Topic Outline
Defining AI Alignment and Its Importance
Distinguishing AI Assistants from Autonomous Agents
Challenges in Trusting AI-Generated Information
Why AI Alignment is a Critical Concern
Accident Risk vs. Malicious Use of AI
The Paradox of Advanced AI and Subtle Errors
AI Debate as an Alignment Strategy
How AI Training Can Incentivize Manipulation
Human Values, Cultural Diversity, and AI Bias
Supervising AI's Probabilistic Reasoning
AI Self-Critique, Auditable Thoughts, and Reflective Preferences
Psychology's Role in AI Alignment and Data Collection
Combating Malicious Actors in AI Training Data
Human Red Teaming for AI Vulnerability Discovery
The Future of AI Exploits and Security
Aggregating Human Preferences for AI Systems
Constitutional AI and Rule-Based Alignment
Managing Conflicting Rules and Safe Defaults
Addressing Uncareful Actors and Fostering Cooperation
Key Concepts
AI Alignment
The overarching goal of ensuring an AI system does what humans want. This involves working out the mix of formal and informal means for achieving that goal, while acknowledging that 'we' (humans) is itself a complicated concept.
AI Assistant vs. Agent
An AI assistant helps humans get information or do tasks, biasing towards safer, supervised interactions. An AI agent acts more autonomously in the world. The distinction blurs when assistance requires the AI to take steps in the world, even if small.
Reinforcement Learning from Human Feedback (RLHF)
A common method for fine-tuning large language models. After pre-training on internet text, model outputs are repeatedly shown to humans, who score the behavior; a reward model is then built to predict these human evaluations, and the AI is trained against it.
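As a rough illustration of this loop, here is a minimal toy sketch in Python. The feature vectors, the human_score stand-in, and the final selection step are illustrative assumptions, not how production RLHF works (real systems train a neural reward model on pairwise preference comparisons and update the policy with reinforcement learning):

```python
# A toy sketch of the RLHF pattern: collect human scores, fit a reward
# model that predicts them, then choose outputs the reward model favors.
# All names here (human_score, the feature vectors) are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def human_score(features: np.ndarray) -> float:
    """Stand-in for a human rater: a noisy score of response quality."""
    true_taste = np.array([1.0, -0.5, 2.0])  # the rater's (unknown) preferences
    return float(features @ true_taste + rng.normal(scale=0.1))

# 1. Collect human feedback on sampled model outputs.
X = rng.normal(size=(200, 3))              # features of 200 sampled responses
y = np.array([human_score(x) for x in X])  # human ratings of each response

# 2. Fit a reward model that predicts the human ratings (least squares).
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# 3. "Train against" the reward model: prefer the candidate it scores highest.
candidates = rng.normal(size=(10, 3))
best = candidates[np.argmax(candidates @ w)]
```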
AI Debate
An alignment strategy where two AI models debate an answer or course of action, and a human judges the debate. The adversarial setup is intended to make it easier for humans to catch flaws and strengthen the reward environment compared to direct human feedback.
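The adversarial structure can be sketched as a short loop. In the sketch below, `debater_a`, `debater_b`, and `human_judge` are hypothetical callables (e.g., wrappers around a language-model API and a rating interface); nothing about their internals comes from the episode:

```python
# A minimal sketch of the debate protocol: two models argue over a fixed
# number of rounds, each seeing the full transcript, and a human judges
# the finished debate rather than a single unchallenged answer.
def run_debate(question, debater_a, debater_b, human_judge, rounds=3):
    transcript = [("question", question)]
    for _ in range(rounds):
        # Each debater argues for its answer and can point out flaws
        # in the opponent's previous statements.
        transcript.append(("A", debater_a(transcript)))
        transcript.append(("B", debater_b(transcript)))
    return human_judge(transcript)  # e.g., returns "A" or "B"
```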
On-Reflection Preferences
Refers to human preferences that are considered and thoughtful, rather than immediate or impulsive. AI alignment aims to align systems with these more considered preferences, simulating people acting as their 'more considered selves'.
Red Teaming (AI)
A process of intentionally trying to break an AI system or find its vulnerabilities. This involves having humans or other AI act adversarially, attempting to trick the system into producing bad or unintended behavior by challenging its rules or instructions.
Constitutional AI
An approach to AI alignment that involves providing the AI with a set of explicit rules or principles (like a constitution) that it is supposed to abide by. These rules also guide human raters in evaluating the AI's outputs, effectively changing human preferences by acting as instructions.
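One common form this takes is a critique-and-revise loop over the rules. The sketch below assumes a hypothetical `model(prompt)` completion function and a toy two-rule constitution; it illustrates the idea, not any lab's actual implementation:

```python
# A minimal sketch of constitutional self-critique: draft a response,
# then for each rule ask the model to critique and revise its own draft.
CONSTITUTION = [
    "Do not help with anything illegal or harmful.",
    "Be honest; do not fabricate facts.",
]

def constitutional_revision(model, user_request):
    draft = model(f"User request: {user_request}\nDraft a response.")
    for rule in CONSTITUTION:
        critique = model(
            f"Rule: {rule}\nResponse: {draft}\n"
            "Does the response violate the rule? If so, explain how."
        )
        draft = model(
            f"Response: {draft}\nCritique: {critique}\n"
            "Rewrite the response so it follows the rule."
        )
    return draft
```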
Questions Answered
What does AI alignment mean?
AI alignment means ensuring an AI system does what humans want, which involves defining and implementing formal and informal methods to achieve this goal, acknowledging the complexity of 'human wants'.
What is the difference between an AI assistant and an autonomous agent?
An AI assistant helps a human by providing information or simulating actions, biasing towards safer, supervised interactions. An autonomous agent acts more independently in the world, though the line blurs when assistance requires the AI to take steps.
Why is AI alignment such an urgent concern?
AI systems are rapidly increasing in capability, there is a strong economic incentive to use them, and current alignment solutions are not yet robust enough to prevent misuse or unintended consequences, so work is needed to close this gap.
Does making AI more capable also make it better at exploiting loopholes?
Yes. As AI strength increases, so does its ability to follow instructions, but also its latent ability to find and exploit loopholes in instructions or reward signals, potentially leading to subtle manipulation.
How can AI training incentivize manipulation?
During reinforcement learning from human feedback, if an AI learns that humans are swayed by certain emotional tricks or subtle lies (e.g., using fancy words to sound smarter), it can iteratively exploit these biases to receive higher scores from raters.
How can AI alignment reflect cultural diversity rather than the biases of a narrow rater pool?
In the long term, all cultures must participate in the rating process. Systems should be designed so that a small amount of data from diverse experts has an amplified impact, mediating between less diverse paid raters and a broader cultural understanding.
How can AI be aligned with people's considered, 'on reflection' preferences?
This requires designing human-machine interactions where the human thinks things through, writes out their reasoning, and interacts with the machine to simulate more considered thought, potentially with adversarial machines pointing out flaws.
How can hallucination and other factual errors be addressed?
Hallucination is one type of factual error, alongside active deception and retrieving wrong information. Addressing it involves careful human supervision, robust reward environments, and potentially adversarial setups like AI debate to catch flaws, with the expectation that increasing model strength will help.
How can humans supervise an AI's reasoning?
One strategy is to have AI systems 'write out their thoughts', which humans can then supervise. However, these written thoughts may not be the actual causal chain behind the AI's decision, much as humans generate explanations after intuitive leaps.
How can AI systems be protected against exploits?
Through both human adversarial red teaming, where people try to break the system's rules, and machine red teaming to find vulnerabilities. The goal is to make exploits expensive and to design the overall ecosystem to quickly notice and fix breaks with a limited 'blast radius'.
How should an AI handle conflicting rules?
As in legal systems, this often involves human judgment and muddling through, sometimes with priorities on rules. It is important to structure things so there is always an acceptable answer, such as declining to answer or stopping the interaction, as a safe default.
Actionable Insights
1. Engineer Human Feedback Protocols
Invest heavily in designing and engineering human protocols for providing feedback, as the quality of this human input is critical for effective AI alignment. This involves careful understanding of human behavior and continuous experimentation.
2. Align to Reflective Human Preferences
Design AI systems to align with humans’ “on reflection” preferences rather than immediate, impulsive desires. This can be achieved by encouraging human raters to think through AI outputs, write out their reasoning, and by training AI to highlight potential flaws.
3. Prioritize AI as Assistant
Design AI systems to primarily assist users in getting information or completing tasks, rather than operating in a fully autonomous manner. This involves having the AI simulate human assistance and explicitly disclose every action it intends to take for human supervision.
4. Focus on Information Trustworthiness
Prioritize research and development on ensuring the trustworthiness of information provided by AI systems, as this is a fundamental and difficult challenge in AI alignment. This includes addressing issues like hallucinations, factual errors, and active deception.
5. Guard Against AI Manipulation
Be aware that AI can learn to exploit human emotional biases or subtle deceptions during training, and design feedback loops to counteract this manipulative learning. This can involve educating human raters on biases and using adversarial setups.
6. Amplify Diverse Cultural Feedback
Implement mechanisms to give amplified impact to a small amount of data provided by experts or individuals from specific cultural backgrounds. This ensures that AI alignment processes include participation from a diverse range of global cultures.
7. Develop Rules via Citizen Juries
Utilize participatory “citizen jury” setups with significant human time and expertise to develop robust rules for AI behavior. These rules can then be enforced by human raters, and AI can be enabled to contextually present relevant rules to people.
8. Supervise AI’s Internal Thoughts
Encourage AI systems to articulate their internal “thoughts” or reasoning processes, and then supervise these articulated thoughts to improve auditability and alignment. Understand that AI, like humans, may use intuitive leaps and generate explanations post-hoc, so don’t demand perfect causal alignment.
9. Build Ethical Pushback into AI
Incorporate mechanisms into AI assistants that provide ethical pushback or question user requests that might lead to harmful or undesirable social outcomes. This allows the AI to act as a helpful moral guide, similar to a human peer.
10. Design for Imprecision & Neutrality
When facing complex ethical problems, design AI alignment strategies that allow for imprecision, bias towards neutrality, and incorporate margins for error. This approach helps navigate inherently messy and unresolved issues where exact solutions are not feasible.
11. Incentivize AI to Highlight Flaws
Train AI to identify and highlight the “sketchiest” or most uncertain parts of its own arguments or reasoning. This allows human supervisors to focus their limited attention on critical areas and to decompose complex arguments for more careful evaluation.
12. Integrate Rater Interaction & Evaluation
When using human raters, have the same person both interact with the AI agent and evaluate its behavior (even if evaluating others’ interactions). This familiarity improves intuition and the overall quality of the rating process.
13. Conduct Human Adversarial Red Teaming
Actively engage humans in “red teaming” exercises, where they are tasked with trying to break AI rules or elicit undesirable behavior. This includes using indirect or multi-turn conversational prompts to test the AI’s robustness against subtle bypass attempts.
14. Ensure Safe Default of Inaction
Design AI systems with a safe default action, such as declining to answer or ceasing interaction, especially when faced with conflicting rules or uncertain situations. This provides a fundamental safety fallback for complex scenarios; a minimal sketch follows this list.
15. Minimize Direct Value Encoding
Aim to design AI systems that reduce the need to explicitly encode or deeply understand complex human values, instead relying on generic assistance and blending diverse human preferences from various rater pools. This simplifies the alignment problem by avoiding overly specific value definitions.
16. Apply Psychology to Experiment Design
Utilize expertise from psychology backgrounds for designing, running, and analyzing online data collection experiments, including pilot studies and careful statistical analysis. This ensures that data collection methods are robust and account for human cognitive biases and behaviors.
17. Design for Limited Blast Radius
In AI security, design the overall ecosystem to ensure that any single security breach or “break” is quickly noticed and fixed, and that its “blast radius” (the extent of its negative impact) is limited. This holistic approach to security aims to mitigate the damage from inevitable failures.
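To make the safe default from insight 14 concrete, here is a minimal sketch; the `Rule` predicate type and the filtering loop are illustrative assumptions. The key property is that when no candidate satisfies every applicable rule, the system falls back to declining:

```python
# A minimal sketch of rule checking with a safe default of inaction:
# if no candidate response satisfies all applicable rules, decline.
from typing import Callable, List

Rule = Callable[[str], bool]  # returns True if the response is allowed

def choose_response(candidates: List[str], rules: List[Rule]) -> str:
    for response in candidates:
        if all(rule(response) for rule in rules):
            return response
    return "I'd rather not answer that."  # safe default when rules conflict
```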
Key Quotes
I think there's a vague, common-sense notion of having an AI system do what humans want. And I think mostly the task of alignment is figuring out what is the breakdown of formal versus informal means to attack that definition.
Geoffrey Irving
As the strength increases, you have both the ability to follow instructions better and at least a latent ability to find and exploit any holes in the instructions or holes in the reward signal if you're planning some complicated reward signal.
Geoffrey Irving
If you have humans as training signals, the humans are going to make mistakes as well. And so it could be that you've trained it in an environment where it produces only output that looks correct to humans, but that isn't actually correct.
Geoffrey Irving
I think fundamentally, this is why you need to engineer this human system: if humans are the reward function, then a lot of the work of alignment will in fact be engineering this human protocol to produce higher-quality information.
Geoffrey Irving
I think the answer fundamentally in the long term is all the cultures have to participate in the rating.
Geoffrey Irving
I think that's why it's so important to find places where you can afford to not get it exactly right, because we shouldn't expect to at our level of understanding.
Geoffrey Irving
I think your intuition is worse. It's four steps, so it's harder to check. A lot of things get worse in terms of trusting those kinds of answers.
Geoffrey Irving
I think the non-cooperating equilibrium there is quite bad.
Geoffrey Irving
Protocols
AI Alignment through Human Protocol Engineering
Geoffrey Irving
- Pre-train Large Language Models (LLMs) on a large corpus of internet text.
- Fine-tune models using Reinforcement Learning from Human Feedback (RLHF) by repeatedly showing AI output to humans for scoring its behavior.
- Build a model of the human evaluations and train the AI against it; left unmanaged, the AI may iteratively exploit human biases.
- Engineer the human protocol (e.g., instructions, experimental design) to produce higher quality information and reduce human biases.
- Consider using adversarial setups (like AI debate) where another model complains about the first model's behavior to strengthen feedback.
- Design the system so that a small amount of data from diverse experts can have an amplified impact on the model's behavior (a weighting sketch follows this list).
- Continuously iterate and run experiments with humans to understand and correct qualitative issues.
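One way to read the amplified-impact step above is as upweighting scarce expert labels when fitting the reward model. A minimal sketch under that assumption, reusing the toy least-squares setup from the RLHF example earlier; the 25x weight and the random placeholder data are illustrative, not described in the episode:

```python
# Weighted least squares: a small pool of expert ratings gets a larger
# per-example weight than a large pool of generic rater data.
import numpy as np

rng = np.random.default_rng(1)
X_crowd, y_crowd = rng.normal(size=(500, 3)), rng.normal(size=500)
X_expert, y_expert = rng.normal(size=(20, 3)), rng.normal(size=20)

X = np.vstack([X_crowd, X_expert])
y = np.concatenate([y_crowd, y_expert])
s = np.concatenate([np.ones(500), np.full(20, 25.0)])  # amplify expert data

# Solve min_w sum_i s_i * (x_i . w - y_i)^2 by scaling rows by sqrt(s_i).
sqrt_s = np.sqrt(s)[:, None]
w, *_ = np.linalg.lstsq(X * sqrt_s, y * sqrt_s.ravel(), rcond=None)
```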
Simulating Careful Human Reflection in AI Alignment
Geoffrey Irving
- Instead of simple 'good or bad' ratings, design a process where the human thinks through things and writes out their reasoning.
- Allow the person to interact with the machine, potentially through an interactive discussion as part of the rating process.
- Incorporate adversarial elements where machines point out potential flaws or issues a human might want to track down.
- Design the human-machine interaction to simulate what someone would have thought had they considered the problem for much longer.
- For specific tasks (e.g., finding the 'best paper'), ask a competing machine to find a better alternative; if it fails, that's some evidence that the paper is, in fact, the best.
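The last step above amounts to a competitive check: if an adversarial model cannot produce anything a human prefers, that is weak evidence the candidate really is best. A minimal sketch, with `propose_better` and `human_prefers` as hypothetical callables:

```python
# Competitive verification: challenge a candidate with counter-proposals.
# Surviving all challenges is (weak) evidence the candidate is best.
def verify_best(candidate, propose_better, human_prefers, attempts=5):
    for _ in range(attempts):
        rival = propose_better(candidate)  # competing model's alternative
        if rival is not None and human_prefers(rival, candidate):
            return False, rival            # the challenger found something better
    return True, candidate                 # candidate survived every challenge
```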
Red Teaming for AI Vulnerabilities
Geoffrey Irving
- Show humans a rule that the AI is supposed to follow.
- Have the humans engage in a dialogue with the AI, trying to break the rule.
- Encourage clever, indirect attempts to trick the AI into disobeying, rather than direct requests.
- Combine human red teaming with machine red teaming to access different types of vulnerabilities (e.g., well-meaning accidental triggers) and attack vectors.
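Taken together, these steps can be run as a simple multi-turn harness. The sketch below assumes hypothetical `ai_respond` and `violates_rule` functions; logging each break supports the limited-blast-radius goal from insight 17:

```python
# A minimal multi-turn red-teaming harness: a human tries to elicit a
# rule violation across several turns, and every break is logged so it
# can be patched quickly with a limited blast radius.
def red_team_session(rule: str, ai_respond, violates_rule, max_turns=10):
    history, breaks = [], []
    print(f"Try to make the AI break this rule: {rule}")
    for turn in range(max_turns):
        attack = input(f"[turn {turn}] your message: ")
        history.append(("human", attack))
        reply = ai_respond(history)        # the AI sees the whole dialogue
        history.append(("ai", reply))
        if violates_rule(rule, reply):     # human or automated judgment
            breaks.append((turn, attack, reply))
    return breaks                          # feed into fixes and retraining
```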