Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)
Hamel Husain and Shreya Shankar, co-creators of the #1 Maven course on evals, discuss how to systematically measure and improve AI applications. They walk through the process of developing effective evals, address major misconceptions, and share best practices for AI product builders.
Deep Dive Analysis
17 Topics
Defining Evals: Systematic Measurement for AI Improvement
Demo: Error Analysis of a Real Estate AI Assistant
Initial Error Analysis: Writing Open Codes on Traces
Why LLMs Cannot Replace Humans in Initial Error Analysis
The Benevolent Dictator: Streamlining the Eval Process
Theoretical Saturation: Knowing When to Stop Error Analysis
Categorizing Errors with Axial Codes Using LLMs
Quantifying Error Modes and Prioritizing Fixes
Understanding Code-Based Evals vs. LLM-as-Judge
Building and Validating an LLM-as-Judge
Evals as the New Product Requirements Documents (PRDs)
Optimal Number of Evals and Post-Eval Activities
The Great Evals Debate: Misconceptions and Nuances
Why Dogfooding Alone is Insufficient for Most AI Products
Impact of OpenAI's Statsig Acquisition on Evals
Common Misconceptions and Practical Tips for Evals
Time Investment for Implementing Effective Evals
9 Key Concepts
Evals
Evals are a systematic way to measure and improve the quality of an AI application. They involve data analytics on LLM applications to identify problems, create metrics, and establish a feedback signal for confident iteration and improvement.
Trace
A trace is an engineering term for a detailed log of a sequence of events, specifically referring to a complete interaction a customer had with an AI application. It captures all components, pieces, and information the AI used to perform its task.
Error Analysis
Error analysis is the crucial first step in building evals, involving a systematic review of application logs or traces to identify what is going wrong. It starts with informal note-taking on observed issues to understand real-world failure modes.
Open Coding
Open coding is an informal, free-form note-taking process performed during error analysis, where a domain expert writes down the first thing they see that is wrong in an AI interaction trace. The goal is to capture observations without overthinking or immediately categorizing them.
Benevolent Dictator
This term refers to appointing a single person, typically a product manager or domain expert, to lead the open coding process. This approach avoids committees getting bogged down in discussions, making the process tractable and efficient for small to medium-sized companies.
Theoretical Saturation
Theoretical saturation is the point in qualitative data analysis, such as error analysis, when you stop uncovering new types of notes, concepts, or issues that would materially change the next steps of your process. It indicates that you have a comprehensive understanding of the prevalent failure modes.
Axial Codes
Axial codes are categories or themes created by synthesizing and grouping similar open codes (informal error notes). They represent specific failure modes or types of problems, helping to organize a messy collection of observations into actionable insights for product improvement.
LLM-as-Judge
An LLM-as-judge is an automated evaluator that uses a large language model to assess complex, subjective failure modes, typically providing a binary (pass/fail) output. It is designed to evaluate a very narrow, specific problem, making it more reliable than asking an LLM to perform broad error analysis.
Criteria Drift
Criteria drift is a phenomenon observed in AI evaluation where human evaluators' opinions of what constitutes 'good' or 'bad' AI output change as they review more examples. This highlights the difficulty of establishing fixed rubrics upfront and the necessity of iterative error analysis.
10 Questions Answered
What are evals?
Evals are a systematic method to measure and improve the quality of an AI application, essentially functioning as data analytics for LLM applications to identify problems, create metrics, and guide iterative improvements.
Why can't LLMs replace humans in the initial stage of error analysis?
LLMs lack the necessary product and domain context to understand if an AI's response is a 'bad product smell' or a subtle hallucination, often reporting that a trace 'looks good' even when a human expert would identify an error.
When should you stop open coding?
You should continue open coding until you reach 'theoretical saturation,' meaning you are no longer uncovering any new types of errors or concepts that would significantly alter your understanding of the product's failure modes.
What are axial codes, and how do they help?
Axial codes are categories or themes that synthesize and group similar informal error notes (open codes) into specific failure modes. They help organize the identified problems, making it easier to understand the most prevalent issues and prioritize fixes.
How should an LLM-as-judge be designed?
An LLM-as-judge should be built to evaluate one specific failure mode with a binary (true/false) output, simplifying the evaluation and making it more reliable. The prompt needs to clearly define the rules for passing or failing based on the desired behavior.
How do you validate an LLM-as-judge?
Before deploying an LLM-as-judge, it must be validated against human judgment by measuring agreement, specifically focusing on reducing misalignments (false positives/negatives) rather than just overall percentage agreement. The prompt should be iterated upon until these misalignments are minimized.
How do evals relate to product requirements documents?
Evals, particularly LLM-as-judge prompts, function similarly to product requirements documents by explicitly defining how an AI agent should respond in specific situations. They provide an automatic and constantly running test of these requirements, which are often refined by observing real-world data.
How many LLM-as-judge evals does a product need?
For most products, typically between four and seven LLM-as-judge evals are sufficient. This is because many failure modes can be fixed by simply adjusting the prompt, and LLM-as-judge evals are reserved for the more complex or 'pesky' issues.
What are the most common misconceptions about evals?
Common misconceptions include believing that an AI can simply eval itself, that buying an off-the-shelf tool will solve all eval needs, or that one shouldn't bother looking at raw data/traces. Another misconception is that there's only one 'correct' way to do evals.
How much time does it take to implement effective evals?
The initial setup, including error analysis and building a few LLM-as-judge evaluators, typically takes three to four days of focused work. After this initial investment, maintaining and improving the eval suite can take as little as 30 minutes per week.
23 Actionable Insights
1. Prioritize Evals for AI Products
Make building and mastering evals a top priority for AI product development. This is considered the highest ROI activity for creating successful AI products.
2. Start with Error Analysis
Begin your eval process by performing error analysis on your AI application’s data, focusing on identifying what’s going wrong before writing tests. This grounds your efforts and helps uncover real-world issues.
3. Manually Review Traces (Open Code)
Conduct “open coding” by manually reviewing interaction traces from your AI product and noting the first, most upstream error you observe in each. This informal note-taking helps you quickly learn about your application’s behavior.
4. Avoid LLMs for Initial Analysis
Do not rely on LLMs for the initial, free-form note-taking stage of error analysis. LLMs often lack the necessary product context to identify subtle “bad product smells” or hallucinations, leading to inaccurate assessments.
5. Appoint a Benevolent Dictator
Designate one person with deep domain expertise, often the product manager, to lead the open coding and error analysis process. This “benevolent dictator” approach prevents committees from slowing down progress.
6. Embrace Flexibility in Requirements
Maintain flexibility in your product requirements (PRDs) and be prepared for them to evolve as you uncover new failure modes and desired behaviors through data analysis. The stochastic nature of LLMs means you can’t foresee all issues upfront.
7. Seek Theoretical Saturation
Continue reviewing and noting errors in traces until you reach “theoretical saturation,” where you are no longer discovering new types of problems or concepts. This ensures comprehensive coverage without excessive effort.
8. Use LLMs for Categorizing Notes
After open coding, leverage an LLM to categorize your informal notes (open codes) into broader themes or failure modes, known as axial codes. This utilizes AI’s strength in synthesizing large amounts of information.
9. Refine LLM-Generated Categories
Critically review and refine the axial codes suggested by the LLM, making them more specific and actionable. Human iteration ensures the categories are truly useful for problem-solving.
10. Prioritize with Basic Counting
Quantify the prevalence of each failure mode (axial code) using simple counting methods, such as pivot tables. This provides a clear overview of your biggest problems, guiding prioritization.
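As a minimal sketch of this counting step, assuming open codes have already been collapsed into axial codes (the failure-mode labels below are invented for illustration), a few lines of Python give you the same ranking a spreadsheet pivot table would:

```python
from collections import Counter

# Hypothetical axial codes assigned to 10 reviewed traces
# (labels invented for illustration).
axial_codes = [
    "hallucinated_listing", "ignored_user_constraint", "hallucinated_listing",
    "formatting_error", "ignored_user_constraint", "hallucinated_listing",
    "tone_mismatch", "hallucinated_listing", "formatting_error",
    "hallucinated_listing",
]

# Count each failure mode and rank by prevalence -- the same
# overview a pivot table provides, for prioritizing fixes.
counts = Counter(axial_codes)
for code, n in counts.most_common():
    print(f"{code}: {n}")
```

Here "hallucinated_listing" dominates, so it would be the first problem to attack.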
11. Build Binary LLM Judges
For complex or subjective failure modes, design LLM-as-judge prompts that yield a binary (true/false, pass/fail) output. This simplifies decision-making and makes metrics clearer and more actionable.
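A sketch of what such a judge can look like, under stated assumptions: `call_llm` is a hypothetical stand-in for whatever model client you use, and the failure mode and rules are invented for illustration. The key properties are that the prompt targets one narrow failure mode and that the output is forced to a binary verdict:

```python
# Sketch of a binary LLM-as-judge for one narrow failure mode.
# `call_llm` is a hypothetical stand-in for your model client;
# the failure mode and rules below are invented for illustration.

JUDGE_PROMPT = """You are evaluating one specific failure mode of a
real-estate AI assistant: does the reply invent property details
that are not present in the provided listing data?

Listing data:
{listing}

Assistant reply:
{reply}

Answer with exactly one word: PASS if every factual claim in the
reply is supported by the listing data, FAIL otherwise."""


def parse_verdict(raw: str) -> bool:
    """Map the judge's raw text to a binary pass/fail."""
    verdict = raw.strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"Unexpected judge output: {raw!r}")
    return verdict == "PASS"


def judge(listing: str, reply: str, call_llm) -> bool:
    """Run the judge prompt for one trace and return pass/fail."""
    prompt = JUDGE_PROMPT.format(listing=listing, reply=reply)
    return parse_verdict(call_llm(prompt))
```

Raising on anything other than PASS/FAIL keeps malformed judge outputs from silently polluting your metrics.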
12. Validate LLM Judges Manually
Validate your LLM judge by comparing its outputs against human judgments on a sample of traces. This ensures the judge’s accuracy and alignment with human expectations, building trust in your evals.
13. Avoid Misleading Agreement Metrics
Do not rely solely on simple “agreement percentage” when validating LLM judges, as it can be misleading, especially for rare errors. Instead, use a confusion matrix to understand specific types of misalignment.
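A small worked example, with invented labels, shows why raw agreement misleads when failures are rare (here True means "pass"):

```python
# Minimal sketch: compare an LLM judge's binary verdicts against
# human labels with a confusion matrix instead of raw agreement.
# The label lists are invented for illustration.

def confusion_matrix(human, judge):
    cells = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for h, j in zip(human, judge):
        if h and j:
            cells["tp"] += 1       # both say pass
        elif not h and not j:
            cells["tn"] += 1       # both say fail
        elif j and not h:
            cells["fp"] += 1       # judge passes a trace humans failed
        else:
            cells["fn"] += 1       # judge fails a trace humans passed
    return cells

human = [True, True, True, True, True, True, True, True, False, False]
judge = [True, True, True, True, True, True, True, True, True, False]

cells = confusion_matrix(human, judge)
agreement = (cells["tp"] + cells["tn"]) / len(human)
print(cells, f"agreement={agreement:.0%}")
# 90% agreement, yet the judge missed half of the real failures:
# 1 false positive out of only 2 human-labeled failures.
```

This is why "tell me more" is the right response to a bare agreement number: the confusion matrix exposes exactly which misalignments to drive down.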
14. Integrate Evals for Monitoring
Implement automated evals (code-based and LLM-as-judge) into unit tests, CI/CD pipelines, and continuous online monitoring of production data. This provides ongoing quality assurance and real-time failure rate insights.
15. Drive Product Improvement
Use the insights gained from evals to directly inform and implement improvements to your AI product. The ultimate purpose of evals is to make the application better, not just to build a test suite.
16. Allocate Upfront Eval Time
Dedicate an initial investment of approximately three to four days for comprehensive error analysis and the setup of your core eval suite. This is a one-time cost that establishes a robust system.
17. Maintain Evals Efficiently
After the initial setup, the ongoing maintenance and review of your eval suite can be managed efficiently, often requiring only about 30 minutes per week. This allows for continuous improvement with minimal effort.
18. Create Custom Data Review Tools
Develop simple, custom web applications or tools to streamline the process of reviewing and annotating your product’s data. This removes friction and makes the crucial activity of looking at data more efficient.
19. Build Code-Based Evals
For failure modes that can be checked with clear, deterministic rules (e.g., output format), build code-based evaluators (like unit tests). These are cheaper and more straightforward than LLM-based judges for certain types of checks.
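A minimal sketch of a code-based eval, assuming the product is expected to emit a JSON object (the required field names are invented for illustration):

```python
import json

# Sketch of a code-based eval: a deterministic check that the
# assistant's output is valid JSON containing the fields we require.
# Field names are hypothetical, for illustration only.

REQUIRED_FIELDS = {"address", "price", "summary"}

def check_output_format(raw: str) -> bool:
    """Return True if `raw` is a JSON object with all required fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS <= data.keys()

assert check_output_format('{"address": "1 Main St", "price": 500000, "summary": "Cozy."}')
assert not check_output_format("not json at all")
assert not check_output_format('{"address": "1 Main St"}')
```

Checks like this drop straight into a unit-test suite or CI pipeline and cost nothing per run, which is why they are preferred over an LLM judge wherever a deterministic rule exists.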
20. Embrace Mindset of Learning
Approach the eval process with a mindset of continuous learning and improvement, rather than striving for perfection. The primary goal is actionable product enhancement, not flawless evals.
21. Leverage LLMs in Workflow
Utilize LLMs to assist in various aspects of your workflow, such as organizing thoughts, refining product requirements, and improving documentation based on insights from error analysis.
22. Share Eval Successes
Share your experiences and successes in implementing evals and improving AI products with the broader community. This helps inspire and educate others, fostering collective growth.
23. Amplify Others’ Learnings
Contribute to the community by sharing your knowledge and experiences with evals through blog posts or other writing. This helps amplify best practices and encourages broader adoption.
8 Key Quotes
The goal is not to do evals perfectly. It's to actionably improve your product.
Shreya Shankar
Everyone that does this immediately gets addicted to it when you're building an AI application. You just learn a lot.
Hamel Husain
You don't want to make this process so expensive that you can't do it.
Hamel Husain
An axial code basically is just a failure mode. It's like the label or category. And our goal is to get to these clusters of failure modes and figure out what is the most prevalent.
Shreya Shankar
You're never going to know what the failure modes are going to be up front. And you're always going to uncover new vibes that you think your product should have.
Hamel Husain
If someone ever reports to you agreement, you should immediately ask, okay, tell me more.
Hamel Husain
People have been burned by evals in the past. People have done evals badly, then they didn't trust it anymore. And then they're like, oh, I'm anti-evals.
Shreya Shankar
My vision for evals is not that Hamel and I become billionaires. It is that everyone can build AI products and we're all on the same page.
Shreya Shankar
2 Protocols
Error Analysis and Categorization Protocol
Hamel Husain & Shreya Shankar
- Review logs or 'traces' of interactions with your AI application.
- Adopt a 'product hat' perspective to identify what is going wrong from a user experience or business objective standpoint.
- Write quick, informal notes (open codes) for the first significant error observed in each trace, focusing on the most upstream issue.
- Continue reviewing traces and open coding until 'theoretical saturation' is reached, meaning no new types of errors are being discovered (typically around 100 traces).
- Use an LLM to categorize the collected open codes into broader themes or 'axial codes' (failure modes).
- Review and refine the LLM-generated axial codes, making them more specific and actionable if necessary.
- Utilize an LLM (e.g., via a spreadsheet formula) to automatically categorize all open codes into the refined axial codes.
- Count the occurrences of each axial code (e.g., using a pivot table) to quantify the prevalence of different problems and prioritize which ones to address.
- For obvious engineering errors, fix them directly; for subjective or complex issues, consider building an LLM-as-judge.
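The automatic categorization step in this protocol can be sketched as follows. `call_llm` is a hypothetical stand-in for your model client (in practice this might be a spreadsheet formula calling an LLM, as described above), and the axial codes are invented for illustration:

```python
# Sketch of assigning each open code (error note) to one refined
# axial code with an LLM. `call_llm` is a hypothetical stand-in for
# your model client; the axial codes are invented for illustration.

AXIAL_CODES = ["hallucinated_listing", "ignored_user_constraint",
               "formatting_error", "tone_mismatch"]

CATEGORIZE_PROMPT = """Assign this error note to exactly one category.
Categories: {categories}
Note: {note}
Answer with the category name only."""

def categorize(note: str, call_llm) -> str:
    """Map one informal error note to a refined axial code."""
    prompt = CATEGORIZE_PROMPT.format(
        categories=", ".join(AXIAL_CODES), note=note)
    label = call_llm(prompt).strip()
    # Guard against the model inventing a category of its own.
    if label not in AXIAL_CODES:
        raise ValueError(f"Unknown axial code: {label!r}")
    return label
```

Constraining the output to a fixed label set keeps the downstream counting step honest: every note lands in exactly one bucket or fails loudly.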
Building and Validating an LLM-as-Judge Protocol
Hamel Husain & Shreya Shankar
- Identify a specific, complex failure mode from your error analysis that requires subjective, automated evaluation.
- Create a binary (true/false) LLM-as-judge prompt that clearly defines the rules and criteria for 'pass' or 'fail' for that single, specific failure mode.
- Do not blindly accept the LLM's judgment; validate it against human judgment by comparing the LLM's output with your own manual assessment of traces.
- Measure the agreement between the LLM judge and human judgment, focusing specifically on reducing misalignments (false positives and false negatives) rather than just overall percentage agreement.
- Iterate on the LLM-as-judge prompt, refining its instructions and criteria, until the misalignment with human judgment is minimized and the judge is reliable.