What, if anything, do AIs understand? (with ChatGPT Co-Creator Ilya Sutskever)
Spencer Greenberg speaks with Ilya Sutskever, an AI pioneer who was integral to the creation of GPT-3, about the nature of neural networks, the psychology and sociology of machine learning, and the increasing power of AI. They discuss the evolution of AI, its current capabilities, and future challenges.
Deep Dive Analysis
Topic Outline (14)
Defining Intelligence and AI Capabilities
GPT-3: Training, Generalization, and Core Task
The Connection Between Prediction and Understanding
Key Breakthroughs Enabling GPT-3's Development
The Role of Compute, Architecture, and Belief in AI Progress
Academia's Evolving Role in AI Research
Understanding Generalization and Memorization in Large Models
Limitations and Differences Between AI and Human Intelligence
Addressing Critiques: Symbolic Thinking and Embodiment
Categorizing Potential Dangers from AI
Mitigating Risks of Narrow AI and Misapplication
Concerns About AI Leading to Power Concentration
Navigating Risks from Potential Superintelligent AI
The Nature and Importance of Causation
Key Concepts (6)
Neural Network
A type of parallel computer that can program itself automatically from data using a learning algorithm. Its size determines the power of the computations it can implement.
GPT-3 (Generative Pre-trained Transformer 3)
A large neural network trained on the single task of guessing the next word in a vast corpus of text. This training allows it to acquire a wide range of other capabilities, such as writing poetry, translating languages, or solving math problems, as byproducts.
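The next-word objective described above can be illustrated with a deliberately tiny sketch. This is not GPT-3's architecture (GPT-3 is a transformer with billions of parameters); it is a toy bigram model that simply counts which word follows which, to make the shape of the task concrete. The corpus sentences are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy illustration of the next-word-prediction task: count which word
# follows which in a corpus, then predict the most frequent continuation.
def train_bigram(corpus):
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    if not counts[word]:
        return None  # word never seen in a non-final position
    return counts[word].most_common(1)[0][0]

corpus = ["the cat sat on the mat", "the cat ate the fish"]
model = train_bigram(corpus)
print(predict_next(model, "the"))  # -> "cat" (it follows "the" most often)
```

GPT-3 differs in degree, not in kind of objective: instead of a lookup table over word pairs, a neural network compresses the statistics of hundreds of billions of tokens, and doing that well is what forces the other capabilities to emerge.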
Transformer Architecture
An innovation in neural network design that enables efficient processing of long sequences of vectors, such as words in a sentence. It is comparatively straightforward to train with the backpropagation algorithm and runs very fast on GPUs, significantly enhancing deep learning capabilities.
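The core operation that lets the transformer process long sequences of vectors can be sketched in a few lines. This is a minimal scaled dot-product self-attention, with made-up shapes; the full architecture adds learned projections, multiple heads, and feed-forward layers.

```python
import numpy as np

# Minimal sketch of scaled dot-product attention: every position mixes
# information from every other position, weighted by similarity.
def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ V  # weighted mix of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # a sequence of 4 word vectors of width 8
out = attention(x, x, x)     # self-attention: Q, K, V from the same sequence
print(out.shape)             # (4, 8): one mixed vector per position
```

Because the whole computation is a handful of dense matrix multiplications, it maps naturally onto GPUs, which is part of why the architecture scaled so well.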
The Bitter Lesson
A concept from AI research suggesting that general methods leveraging computation and data are ultimately the most effective, by a large margin. It posits that relying on human knowledge of a domain for short-term improvements is less impactful than scaling computational resources.
Overfitting
A phenomenon where a model learns the training data too precisely, including noise, which hinders its ability to generalize to new, unseen data. Large neural networks like GPT-3 avoid this despite their vast number of parameters due to specific properties of their optimization algorithms, such as stochastic gradient descent.
Universal Simulation
The idea that neural networks, as function approximators, can organize their internal structure to simulate any complex computational gadget or even biological neurons. This implies that any specialized operation or biological complexity could theoretically be implemented within a sufficiently large and well-trained artificial neural network.
Questions Answered (12)
Intelligence can be thought of as the ability of a system to perform tasks that human beings can do, with the degree of intelligence increasing with the number of such tasks it can perform.
GPT-3 was trained to predict the next word in a text corpus; to do this well, it implicitly learns to perform various subtasks like writing poetry, translating, or solving math problems, as these are necessary to accurately predict subsequent text.
If a system can accurately predict what comes next in a complex context (like a mystery novel), it suggests a significant degree of understanding of that context, making prediction a measurable proxy for the nebulous concept of understanding.
Three main barriers were the lack of sufficient compute and infrastructure, the absence of efficient neural network architectures like the transformer, and a prevailing skepticism or lack of belief in the scaling capabilities of large neural networks.
Large models like GPT-3 are trained on vast amounts of data (e.g., 300 billion tokens for 175 billion parameters), and their training procedures (like stochastic gradient descent) inherently favor solutions that generalize well by minimizing certain measures of information in the parameters.
Memorization and generalization are not mutually exclusive; idealized Bayesian inference, a gold standard for generalization, also produces functions that perfectly memorize training data while still making good predictions on new data.
Academia will likely shift its role from training the absolute largest models to focusing on foundational understanding, studying the properties of models exposed by companies, and collaborating with industry, as the cost of cutting-edge training becomes prohibitive.
While current scaling will go 'extremely far' and continue to yield incredible gains, there might be some capabilities (e.g., beating a world champion in chess without specific, high-quality chess data) that this approach alone might not achieve without further architectural modifications.
Humans have smaller vocabularies and narrower knowledge bases but possess greater depth in specific topics, are extremely selective about the information they consume, and learn with significantly less explicit data compared to GPT-3's vast training requirements.
Recent work suggests that large language models can generate step-by-step symbolic reasoning when prompted. While physical embodiment might make learning easier, it's likely not strictly necessary, as vast internet data can compensate for its absence.
Potential dangers include the misapplication or negative use of narrow AIs (e.g., bias, spam), the concentration of power in the hands of a few groups (e.g., economic dominance, surveillance), and risks from uncontrolled or misaligned superintelligent AI.
OpenAI has a capped return model for investors (e.g., 100x) to avoid being solely driven by profit maximization, aiming to retain the freedom to strategically slow growth or prioritize beneficial outcomes as AI capabilities become transformative.
Actionable Insights (20)
1. Believe in Large Network Scaling
To drive significant AI advancements, cultivate the belief that training larger neural networks on bigger datasets will yield increasingly amazing results, overcoming psychological biases that underestimate their potential.
2. Leverage Computation Over Cleverness
For long-term effectiveness in AI, prioritize general methods that leverage computation and scale, rather than short-term improvements derived from human domain knowledge, which often don’t generalize.
3. Train AI on Next Word Prediction
To develop broad AI capabilities, train a neural network to excel at predicting the next word in a large text corpus, as this single task can yield numerous other valuable abilities as a byproduct.
4. Operationalize Understanding via Prediction
To effectively measure and optimize for “understanding” in AI, focus on the quantifiable metric of how well a neural network predicts the next word in text, as this operationalizes an otherwise nebulous concept.
5. Don’t Fear Memorization in AI
Re-evaluate the perception that memorization is inherently bad for AI generalization, as even idealized Bayesian inference, known for optimal predictions, exhibits perfect memorization of training data while still generalizing well.
6. Secure Massive Compute Resources
To build large-scale AI systems like GPT-3, ensure access to thousands of fast GPUs, a large cluster, and the necessary infrastructure and techniques to efficiently train a single large neural network over weeks.
7. Develop Scalable AI Architectures
To effectively utilize large-scale compute, develop neural network architectures like the transformer that are efficient, easy to learn with backpropagation, and run fast on GPUs, enabling processing of long sequences.
8. Prioritize Large Datasets for Scale
To achieve better results with larger neural networks, prioritize the creation and use of more extensive datasets, as increased network size requires more data to constrain its numerous trainable parameters effectively.
9. Trust SGD for Generalization
Understand that large neural networks with many parameters can still generalize well, even with less data than parameters, due to the inherent properties of the stochastic gradient descent (SGD) optimization algorithm, which favors solutions with good generalization.
10. Evolve AI Release Strategy Toward Caution
As AI capabilities grow, shift from open-sourcing technology to a more careful approach, like slow, deliberate API releases, to manage the immense power of AI systems responsibly and mitigate potential risks.
11. Prevent AI Development Race
Actively work to prevent a competitive race in advanced AI development, as avoiding such dynamics allows groups to proceed with caution, thoroughly assess risks, and implement safety measures without undue pressure.
12. Foster Industry Self-Regulation for AI Safety
To address collective action problems in AI safety, leading companies should collaborate to establish shared principles for model use and deployment, aiming for industry-wide self-regulation and encouraging other entrants to follow these guidelines.
13. Mitigate AI Bias Through Controlled Deployment
To mitigate AI bias, deploy models through controlled APIs, carefully monitor usage, restrict problematic use cases, and continuously retrain models to actively learn and avoid exhibiting undesirable biases from their training data.
14. Detect AI Misuse with Advanced AI
To combat malicious AI applications like personalized manipulation or bot attacks, leverage cutting-edge AI systems from responsible groups to detect and counter the actions of less advanced or nefarious AI systems.
15. Discuss AI Power Concentration Proactively
Acknowledge AI’s inherent tendency to concentrate power due to the massive compute required for the most capable systems, and initiate proactive societal discussions on how to address this concentration before it becomes problematic.
16. Adopt Capped Profit Model for AI
Consider adopting a capped profit model for AI development to avoid being solely driven by revenue maximization, allowing for strategic deceleration of growth if it leads to better societal outcomes for transformative technologies.
17. Link Prediction to Understanding
Recognize that a system’s ability to accurately predict what comes next in text implies a significant degree of understanding, as prediction and understanding are two sides of the same coin.
18. Program Neural Networks via Feedback
To program a neural network, feed it inputs, observe its behavior, and provide feedback on desired changes; the network will then automatically modify itself to correct future mistakes.
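That loop (feed an input, observe the output, give feedback, let the network adjust itself) can be sketched with a single linear neuron. The target function y = 2x + 1 and the learning rate are illustrative assumptions, not from the episode.

```python
import random

# A single neuron "programs itself" from feedback via gradient descent.
def train(steps=2000, lr=0.05):
    w, b = 0.0, 0.0
    rng = random.Random(0)
    for _ in range(steps):
        x = rng.uniform(-1, 1)    # feed the network an input
        pred = w * x + b          # observe its behavior
        err = pred - (2 * x + 1)  # feedback: desired output is 2x + 1
        w -= lr * err * x         # the network modifies itself so that
        b -= lr * err             # this mistake shrinks next time
    return w, b

w, b = train()
print(round(w, 2), round(b, 2))  # converges close to 2.0 and 1.0
```

No one writes the rule "double the input and add one" into the weights; it emerges purely from repeated error feedback, which is the sense in which the network is programmed automatically from data.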
19. Evaluate Intelligence by Human Tasks
To assess intelligence, observe what human beings can do and compare it to what computers can achieve, recognizing that the more human-like tasks a computer performs, the more intelligent it is.
20. Academia: Focus on Foundational AI & APIs
Academic AI researchers should shift focus from training the largest models to developing foundational understanding of existing methods and collaborating with companies by studying properties and modifying models exposed via APIs.
Key Quotes (8)
The real insight of GPT-3 is that there is one task, which if you get a neural network to be really, really good at it, will give you, as a byproduct, all kinds of other capabilities and tasks, which are very interesting and meaningful to us.
Ilya Sutskever
The better you understand the whole mystery novel, the more ability you have to predict the next word. And so essentially understanding and prediction are sort of two sides of the same coin.
Spencer Greenberg
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.
Richard Sutton
The only thing that matters is really good, simple methods that scale.
Ilya Sutskever
It appears to be perplexing. How can it be that a neural network memorizes all this data and generalizes at the same time? Yet, I claim that idealized Bayesian inference, which is known to make the best predictions possible in a certain sense, has the same property.
Ilya Sutskever
The greater the power of a system is, the more impact it has. And that impact can have great magnitude in all kinds of directions.
Ilya Sutskever
It seems like maximizing profit is not the ideal when you're talking about the potential for these technologies, right?
Spencer Greenberg
Ultimately it's not possible to not build AI. And the thing that we have to do is to build AI and do so in the most careful way possible.
Ilya Sutskever