How can AIs know what we want if *we* don't even know? (with Geoffrey Irving)

Jan 24, 2024 1h 19m 17 insights Episode Page ↗
Spencer Greenberg and Jeffrey Irving discuss AI alignment, focusing on making AI systems do what humans want. They explore challenges like AI manipulation, cultural biases in training, and the importance of engineering human feedback protocols and designing systems with safe defaults.
Actionable Insights

1. Engineer Human Feedback Protocols

Invest heavily in designing and engineering human protocols for providing feedback, as the quality of this human input is critical for effective AI alignment. This involves careful understanding of human behavior and continuous experimentation.

2. Align to Reflective Human Preferences

Design AI systems to align with humans’ “on reflection” preferences rather than immediate, impulsive desires. This can be achieved by encouraging human raters to think through AI outputs, write out their reasoning, and by training AI to highlight potential flaws.

3. Prioritize AI as Assistant

Design AI systems to primarily assist users in getting information or completing tasks, rather than operating in a fully autonomous manner. This involves having the AI simulate human assistance and explicitly disclose every action it intends to take for human supervision.

4. Focus on Information Trustworthiness

Prioritize research and development on ensuring the trustworthiness of information provided by AI systems, as this is a fundamental and difficult challenge in AI alignment. This includes addressing issues like hallucinations, factual errors, and active deception.

5. Guard Against AI Manipulation

Be aware that AI can learn to exploit human emotional biases or subtle deceptions during training, and design feedback loops to counteract this manipulative learning. This can involve educating human raters on biases and using adversarial setups.

6. Amplify Diverse Cultural Feedback

Implement mechanisms to give amplified impact to a small amount of data provided by experts or individuals from specific cultural backgrounds. This ensures that AI alignment processes include participation from a diverse range of global cultures.

7. Develop Rules via Citizen Juries

Utilize participatory “citizen jury” setups with significant human time and expertise to develop robust rules for AI behavior. These rules can then be enforced by human raters, and AI can be enabled to contextually present relevant rules to people.

8. Supervise AI’s Internal Thoughts

Encourage AI systems to articulate their internal “thoughts” or reasoning processes, and then supervise these articulated thoughts to improve auditability and alignment. Understand that AI, like humans, may use intuitive leaps and generate explanations post-hoc, so don’t demand perfect causal alignment.

9. Build Ethical Pushback into AI

Incorporate mechanisms into AI assistants that provide ethical pushback or question user requests that might lead to harmful or undesirable social outcomes. This allows the AI to act as a helpful moral guide, similar to a human peer.

10. Design for Imprecision & Neutrality

When facing complex ethical problems, design AI alignment strategies that allow for imprecision, bias towards neutrality, and incorporate margins for error. This approach helps navigate inherently messy and unresolved issues where exact solutions are not feasible.

11. Incentivize AI to Highlight Flaws

Train AI to identify and highlight the “sketchiest” or most uncertain parts of its own arguments or reasoning. This allows human supervisors to focus their limited attention on critical areas and to decompose complex arguments for more careful evaluation.

12. Integrate Rater Interaction & Evaluation

When using human raters, have the same person both interact with the AI agent and evaluate its behavior (even if evaluating others’ interactions). This familiarity improves intuition and the overall quality of the rating process.

13. Conduct Human Adversarial Red Teaming

Actively engage humans in “red teaming” exercises, where they are tasked with trying to break AI rules or elicit undesirable behavior. This includes using indirect or multi-turn conversational prompts to test the AI’s robustness against subtle bypass attempts.

14. Ensure Safe Default of Inaction

Design AI systems with a safe default action, such as declining to answer or ceasing interaction, especially when faced with conflicting rules or uncertain situations. This provides a fundamental safety fallback for complex scenarios.

15. Minimize Direct Value Encoding

Aim to design AI systems that reduce the need to explicitly encode or deeply understand complex human values, instead relying on generic assistance and blending diverse human preferences from various rater pools. This simplifies the alignment problem by avoiding overly specific value definitions.

16. Apply Psychology to Experiment Design

Utilize expertise from psychology backgrounds for designing, running, and analyzing online data collection experiments, including pilot studies and careful statistical analysis. This ensures that data collection methods are robust and account for human cognitive biases and behaviors.

17. Design for Limited Blast Radius

In AI security, design the overall ecosystem to ensure that any single security breach or “break” is quickly noticed and fixed, and that its “blast radius” (the extent of its negative impact) is limited. This holistic approach to security aims to mitigate the damage from inevitable failures.