The coming AI security crisis (and what to do about it) | Sander Schulhoff

15 Topic Outline

Introduction to AI Security and Sander Schulhoff's Background

Understanding Jailbreaking vs. Prompt Injection Attacks

Real-World Examples of AI Security Breaches

The Increasing Risk with Intelligent Agents and Robotics

The Emergence of the AI Security Industry and Its Solutions

Adversarial Robustness and Attack Success Rate

Why AI Guardrails and Prompt-Based Defenses Fail

The Disconnect Between AI and Classical Cybersecurity

Practical Advice for Addressing AI Security Risks

Securing Agentic Systems with Permissioning Frameworks like Camel

The Importance of Education and Awareness in AI Security

Challenges and Future Directions for Frontier AI Labs

Companies Doing Good Work in AI Governance and Security

Predictions for the AI Security Industry and Future Harms

Final Takeaways on AI Security Posture

8 Key Concepts

Jailbreaking

This occurs when a malicious user directly interacts with a large language model (LLM), like ChatGPT, using a long or deceptive prompt to trick it into generating harmful or unintended content, such as instructions for building a bomb.

Prompt Injection

This attack happens when a malicious user manipulates an LLM-powered application by providing input that overrides the developer's original system prompt. The goal is to make the model ignore its intended instructions and perform an unintended action, like outputting malicious code or sensitive data.

Adversarial Robustness

This term refers to how well AI models or systems can defend themselves against various attacks, including prompt injection and jailbreaking. A system with high adversarial robustness can resist attempts to trick it into performing unintended actions.

Attack Success Rate (ASR)

ASR is a metric used to measure the effectiveness of attacks against AI systems. If a system is subjected to 100 attacks and 1 gets through, it has an ASR of 1%, indicating 99% adversarial robustness. This is how AI security companies often quantify the impact of their tools.

AI Guardrails

These are defense mechanisms, often implemented as separate LLMs, placed before and after a main AI system to classify inputs and outputs. They aim to detect and block malicious prompts or harmful model responses before they reach the user or cause damage.

Automated Red Teaming

This involves using algorithms, typically other large language models, to automatically generate malicious prompts designed to attack and trick other LLMs. These systems are used to discover vulnerabilities and test the defenses of AI models.

Camel Framework

A framework (developed by Google) that restricts an AI agent's possible actions based on the user's immediate request. By granting only the necessary permissions for a specific task, it prevents the agent from taking malicious actions even if it encounters a prompt injection.

CBRN Information

An acronym standing for Chemical, Biological, Radiological, Nuclear, and Explosives. In AI security, it refers to categories of highly sensitive and potentially harmful information that AI models should be prevented from generating or sharing.

8 Questions Answered

?

What is the fundamental difference between jailbreaking and prompt injection?

Jailbreaking involves a malicious user directly interacting with an LLM without an intervening developer prompt, while prompt injection occurs when a malicious user tries to make an LLM-powered application ignore its developer-defined system prompt.

?

Why haven't we seen major AI security incidents or hacks yet?

The primary reason is the early stage of AI adoption and the limited capabilities of current AI models; they haven't been given enough power or widespread deployment to cause significant damage, not because they are inherently secure.

?

Why do AI guardrails and prompt-based defenses fail to secure AI systems?

AI guardrails fail because the attack surface for LLMs is practically infinite, making it impossible to catch all attacks, and they are easily bypassed by adaptive attackers like humans. Prompt-based defenses are even less effective and have been known not to work since early 2023.

?

What is the main challenge in solving adversarial robustness for AI models?

The core challenge is that AI models are not like traditional software bugs that can be definitively patched; instead, they are like 'brains' that can always be tricked in new, unforeseen ways, making it nearly impossible to achieve 100% security.

?

What practical steps should organizations take to address AI security?

Organizations should first assess if their AI system can take actions or access sensitive data; if it's a simple read-only chatbot, the risks are limited. For agentic systems, focus on robust classical cybersecurity practices like proper data and action permissioning, and consider frameworks like Camel.

?

Should companies invest in AI guardrails and automated red teaming solutions from third-party vendors?

Sander Schulhoff advises against investing in these solutions, stating that guardrails don't work and automated red teaming systems merely confirm what is already known (that models are vulnerable), without providing effective mitigation.

?

How can AI systems be secured against indirect prompt injection, especially with agentic capabilities?

One promising technique is the Camel framework, which restricts an AI agent's permissions to only those strictly necessary for the user's current request. This prevents the agent from taking malicious actions even if it encounters a prompt injection in untrusted data.

?

What is the future outlook for AI security and the AI security industry?

Sander predicts a market correction for the AI security industry as companies realize guardrails are ineffective. He also foresees an increase in real-world harms from AI agents and robotics within the next year, emphasizing the urgent need for serious attention to this problem.

11 Actionable Insights

1. Strictly Limit AI Agent Permissions

Ensure any AI agent or system capable of taking actions (e.g., sending emails, modifying databases) is granted only the absolute minimum necessary permissions, as malicious users can trick it into performing any action it’s allowed. This aligns with classical cybersecurity’s proper permissioning.

2. Invest in AI-Cybersecurity Expertise

Develop or hire expertise that bridges classical cybersecurity and AI security, as AI systems present fundamentally different security challenges compared to traditional software. This combined knowledge is vital for identifying unique vulnerabilities and implementing effective, AI-aware security measures.

3. Adopt “Angry God in Box” Mindset

When designing and securing AI systems, particularly agents, approach them with the mindset that the AI is a malicious entity trying to cause harm and escape control. This proactive mental model helps identify and mitigate risks by focusing on containing and controlling potentially dangerous AI.

4. Implement Context-Aware Permissioning

Utilize frameworks like Google’s Camel to dynamically restrict an agent’s permissions based on the user’s specific request, granting only the necessary read/write capabilities for the task at hand. This prevents prompt injection attacks by limiting the agent’s potential actions from the outset.

5. Avoid AI Guardrails & Red Teaming

Do not rely on AI guardrails or automated red teaming tools as primary defenses against prompt injection and jailbreaking. Guardrails are easily bypassed and ineffective against determined attackers, while automated red teaming offers little novel insight as all current models are vulnerable.

6. Do Not Deploy Prompt-Based Defenses

Refrain from using prompt engineering (e.g., adding explicit instructions within the prompt) as a defense mechanism for AI systems. These defenses are known to be highly ineffective and offer minimal protection against adversarial attacks.

7. Understand Simple Chatbot Limitations

If your AI system is merely a chatbot for FAQs or information retrieval without action-taking capabilities or access to sensitive data, extensive defensive measures are likely unnecessary. The primary risk is reputational harm, which can often be achieved by users through other means.

8. Educate Your Team on AI Security

Prioritize educating your team, including decision-makers, about the realities of AI security, prompt injection, and jailbreaking. Increased awareness helps prevent poor deployment decisions and fosters a deeper understanding of AI’s unique risks.

9. Monitor AI System Inputs/Outputs

Implement logging for all inputs and outputs of your AI systems. This practice allows for later review to understand user interaction, identify potential misuse, and continuously improve the system, even if it doesn’t directly prevent attacks.

10. Beware Guardrail Overconfidence

Be aware that deploying AI guardrails can create a false sense of security regarding your AI systems’ robustness. This overconfidence is a significant problem, especially as agentic AI capabilities increase the potential for real-world damage.

11. Avoid Offensive AI Security Research

Researchers and practitioners should refrain from publishing new methods for jailbreaking or prompt injection. The community already understands these vulnerabilities, and further offensive research primarily provides more attack vectors without aiding defensive progress.

4 Key Quotes

AI guardrails do not work. I'm going to say that one more time. Guardrails do not work.
Sander Schulhoff

The only reason there hasn't been a massive attack yet is how early the adoption is, not because it's secure.
Alex Komorosky (quoted by Lenny Rachitsky)

You can patch a bug, but you can't patch a brain.
Sander Schulhoff

Humans break everything. A hundred percent of the defenses in maybe like 10 to 30 attempts.
Sander Schulhoff

5 Key Numbers

One followed by a million zeros

Number of possible attacks against a model like GPT-5 This represents an effectively infinite attack space, making it impossible for guardrails to catch everything.

10 to 30 attempts

Number of attempts for humans to break AI defenses Humans are highly effective adaptive attackers against AI guardrails and models.

90%

Average success rate for automated red teaming systems against defenses Automated systems require orders of magnitude more attempts than humans to achieve success.

20 to 50 years

Duration adversarial robustness has been a field of study The problem of adversarial robustness is not new, but its implications are more severe with modern LLMs and agents.

99.99%

Certainty of a bug being solved after patching traditional software In contrast, patching an AI system for a vulnerability offers no such certainty, as the problem is likely to persist or reappear.

Deep Dive Analysis