Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)
Hamel Husain and Shreya Shankar, co-creators of the #1 Maven course on evals, discuss how to systematically measure and improve AI applications. They walk through the process of developing effective evals, address major misconceptions, and share best practices for AI product builders.
Deep Dive Analysis
17 Topics
Defining Evals: Systematic Measurement for AI Improvement
Demo: Error Analysis of a Real Estate AI Assistant
Initial Error Analysis: Writing Open Codes on Traces
Why LLMs Cannot Replace Humans in Initial Error Analysis
The Benevolent Dictator: Streamlining the Eval Process
Theoretical Saturation: Knowing When to Stop Error Analysis
Categorizing Errors with Axial Codes Using LLMs
Quantifying Error Modes and Prioritizing Fixes
Understanding Code-Based Evals vs. LLM-as-Judge
Building and Validating an LLM-as-Judge
Evals as the New Product Requirements Documents (PRDs)
Optimal Number of Evals and Post-Eval Activities
The Great Evals Debate: Misconceptions and Nuances
Why Dogfooding Alone is Insufficient for Most AI Products
Impact of OpenAI's Statsig Acquisition on Evals
Common Misconceptions and Practical Tips for Evals
Time Investment for Implementing Effective Evals
9 Key Concepts
Evals
Evals are a systematic way to measure and improve the quality of an AI application. They involve data analytics on LLM applications to identify problems, create metrics, and establish a feedback signal for confident iteration and improvement.
Trace
A trace is an engineering term for a detailed log of a sequence of events, specifically referring to a complete interaction a customer had with an AI application. It captures all components, pieces, and information the AI used to perform its task.
Error Analysis
Error analysis is the crucial first step in building evals, involving a systematic review of application logs or traces to identify what is going wrong. It starts with informal note-taking on observed issues to understand real-world failure modes.
Open Coding
Open coding is an informal, free-form note-taking process performed during error analysis, where a domain expert writes down the first thing they see that is wrong in an AI interaction trace. The goal is to capture observations without overthinking or immediately categorizing them.
Benevolent Dictator
This term refers to appointing a single person, typically a product manager or domain expert, to lead the open coding process. This approach avoids committees getting bogged down in discussions, making the process tractable and efficient for small to medium-sized companies.
Theoretical Saturation
Theoretical saturation is the point in qualitative data analysis, such as error analysis, when you stop uncovering new types of notes, concepts, or issues that would materially change the next steps of your process. It indicates that you have a comprehensive understanding of the prevalent failure modes.
Axial Codes
Axial codes are categories or themes created by synthesizing and grouping similar open codes (informal error notes). They represent specific failure modes or types of problems, helping to organize a messy collection of observations into actionable insights for product improvement.
LLM-as-Judge
An LLM-as-judge is an automated evaluator that uses a large language model to assess complex, subjective failure modes, typically providing a binary (pass/fail) output. It is designed to evaluate a very narrow, specific problem, making it more reliable than asking an LLM to perform broad error analysis.
Criteria Drift
Criteria drift is a phenomenon observed in AI evaluation where human evaluators' opinions of what constitutes 'good' or 'bad' AI output change as they review more examples. This highlights the difficulty of establishing fixed rubrics upfront and the necessity of iterative error analysis.
10 Questions Answered
What are evals?
Evals are a systematic method to measure and improve the quality of an AI application, essentially functioning as data analytics for LLM applications to identify problems, create metrics, and guide iterative improvements.
Why can't LLMs replace humans in the initial stage of error analysis?
LLMs lack the necessary product and domain context to understand if an AI's response is a 'bad product smell' or a subtle hallucination, often reporting that a trace 'looks good' even when a human expert would identify an error.
When should you stop open coding?
You should continue open coding until you reach 'theoretical saturation,' meaning you are no longer uncovering any new types of errors or concepts that would significantly alter your understanding of the product's failure modes.
What are axial codes, and how do they help?
Axial codes are categories or themes that synthesize and group similar informal error notes (open codes) into specific failure modes. They help organize the identified problems, making it easier to understand the most prevalent issues and prioritize fixes.
How should an LLM-as-judge be designed?
An LLM-as-judge should be built to evaluate one specific failure mode with a binary (true/false) output, simplifying the evaluation and making it more reliable. The prompt needs to clearly define the rules for passing or failing based on the desired behavior.
How do you validate an LLM-as-judge?
Before deploying an LLM-as-judge, it must be validated against human judgment by measuring agreement, specifically focusing on reducing misalignments (false positives/negatives) rather than just overall percentage agreement. The prompt should be iterated upon until these misalignments are minimized.
How do evals relate to product requirements documents?
Evals, particularly LLM-as-judge prompts, function similarly to product requirements documents by explicitly defining how an AI agent should respond in specific situations. They provide an automatic and constantly running test of these requirements, which are often refined by observing real-world data.
How many LLM-as-judge evals does a product need?
For most products, typically between four and seven LLM-as-judge evals are sufficient. This is because many failure modes can be fixed by simply adjusting the prompt, and LLM-as-judge evals are reserved for the more complex or 'pesky' issues.
What are the most common misconceptions about evals?
Common misconceptions include believing that an AI can simply eval itself, that buying an off-the-shelf tool will solve all eval needs, or that one shouldn't bother looking at raw data/traces. Another misconception is that there's only one 'correct' way to do evals.
How much time does it take to implement effective evals?
The initial setup, including error analysis and building a few LLM-as-judge evaluators, typically takes three to four days of focused work. After this initial investment, maintaining and improving the eval suite can take as little as 30 minutes per week.
23 Actionable Insights
1. Prioritize Evals for AI Products
Make building and mastering evals a top priority for AI product development. This is considered the highest ROI activity for creating successful AI products.
2. Start with Error Analysis
Begin your eval process by performing error analysis on your AI application’s data, focusing on identifying what’s going wrong before writing tests. This grounds your efforts and helps uncover real-world issues.
3. Manually Review Traces (Open Code)
Conduct “open coding” by manually reviewing interaction traces from your AI product and noting the first, most upstream error you observe in each. This informal note-taking helps you quickly learn about your application’s behavior.
4. Avoid LLMs for Initial Analysis
Do not rely on LLMs for the initial, free-form note-taking stage of error analysis. LLMs often lack the necessary product context to identify subtle “bad product smells” or hallucinations, leading to inaccurate assessments.
5. Appoint a Benevolent Dictator
Designate one person with deep domain expertise, often the product manager, to lead the open coding and error analysis process. This “benevolent dictator” approach prevents committees from slowing down progress.
6. Embrace Flexibility in Requirements
Maintain flexibility in your product requirements (PRDs) and be prepared for them to evolve as you uncover new failure modes and desired behaviors through data analysis. The stochastic nature of LLMs means you can’t foresee all issues upfront.
7. Seek Theoretical Saturation
Continue reviewing and noting errors in traces until you reach “theoretical saturation,” where you are no longer discovering new types of problems or concepts. This ensures comprehensive coverage without excessive effort.
8. Use LLMs for Categorizing Notes
After open coding, leverage an LLM to categorize your informal notes (open codes) into broader themes or failure modes, known as axial codes. This utilizes AI’s strength in synthesizing large amounts of information.
9. Refine LLM-Generated Categories
Critically review and refine the axial codes suggested by the LLM, making them more specific and actionable. Human iteration ensures the categories are truly useful for problem-solving.
10. Prioritize with Basic Counting
Quantify the prevalence of each failure mode (axial code) using simple counting methods, such as pivot tables. This provides a clear overview of your biggest problems, guiding prioritization.
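As a minimal sketch of this counting step, assuming open codes have already been collapsed into axial codes (the failure-mode labels below are invented for illustration), a few lines of Python give you the same ranking a spreadsheet pivot table would:

```python
from collections import Counter

# Hypothetical axial codes assigned to 10 reviewed traces
# (labels invented for illustration).
axial_codes = [
    "hallucinated_listing", "ignored_user_constraint", "hallucinated_listing",
    "formatting_error", "ignored_user_constraint", "hallucinated_listing",
    "tone_mismatch", "hallucinated_listing", "formatting_error",
    "hallucinated_listing",
]

# Count each failure mode and rank by prevalence -- the same
# overview a pivot table provides, for prioritizing fixes.
counts = Counter(axial_codes)
for code, n in counts.most_common():
    print(f"{code}: {n}")
```

Here "hallucinated_listing" dominates, so it would be the first problem to attack.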
11. Build Binary LLM Judges
For complex or subjective failure modes, design LLM-as-judge prompts that yield a binary (true/false, pass/fail) output. This simplifies decision-making and makes metrics clearer and more actionable.
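A sketch of what such a judge can look like, under stated assumptions: `call_llm` is a hypothetical stand-in for whatever model client you use, and the failure mode and rules are invented for illustration. The key properties are that the prompt targets one narrow failure mode and that the output is forced to a binary verdict:

```python
# Sketch of a binary LLM-as-judge for one narrow failure mode.
# `call_llm` is a hypothetical stand-in for your model client;
# the failure mode and rules below are invented for illustration.

JUDGE_PROMPT = """You are evaluating one specific failure mode of a
real-estate AI assistant: does the reply invent property details
that are not present in the provided listing data?

Listing data:
{listing}

Assistant reply:
{reply}

Answer with exactly one word: PASS if every factual claim in the
reply is supported by the listing data, FAIL otherwise."""


def parse_verdict(raw: str) -> bool:
    """Map the judge's raw text to a binary pass/fail."""
    verdict = raw.strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"Unexpected judge output: {raw!r}")
    return verdict == "PASS"


def judge(listing: str, reply: str, call_llm) -> bool:
    """Run the judge prompt for one trace and return pass/fail."""
    prompt = JUDGE_PROMPT.format(listing=listing, reply=reply)
    return parse_verdict(call_llm(prompt))
```

Raising on anything other than PASS/FAIL keeps malformed judge outputs from silently polluting your metrics.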
12. Validate LLM Judges Manually
Validate your LLM judge by comparing its outputs against human judgments on a sample of traces. This ensures the judge’s accuracy and alignment with human expectations, building trust in your evals.
13. Avoid Misleading Agreement Metrics
Do not rely solely on simple “agreement percentage” when validating LLM judges, as it can be misleading, especially for rare errors. Instead, use a confusion matrix to understand specific types of misalignment.
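A small worked example, with invented labels, shows why raw agreement misleads when failures are rare (here True means "pass"):

```python
# Minimal sketch: compare an LLM judge's binary verdicts against
# human labels with a confusion matrix instead of raw agreement.
# The label lists are invented for illustration.

def confusion_matrix(human, judge):
    cells = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for h, j in zip(human, judge):
        if h and j:
            cells["tp"] += 1       # both say pass
        elif not h and not j:
            cells["tn"] += 1       # both say fail
        elif j and not h:
            cells["fp"] += 1       # judge passes a trace humans failed
        else:
            cells["fn"] += 1       # judge fails a trace humans passed
    return cells

human = [True, True, True, True, True, True, True, True, False, False]
judge = [True, True, True, True, True, True, True, True, True, False]

cells = confusion_matrix(human, judge)
agreement = (cells["tp"] + cells["tn"]) / len(human)
print(cells, f"agreement={agreement:.0%}")
# 90% agreement, yet the judge missed half of the real failures:
# 1 false positive out of only 2 human-labeled failures.
```

This is why "tell me more" is the right response to a bare agreement number: the confusion matrix exposes exactly which misalignments to drive down.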
14. Integrate Evals for Monitoring
Implement automated evals (code-based and LLM-as-judge) into unit tests, CI/CD pipelines, and continuous online monitoring of production data. This provides ongoing quality assurance and real-time failure rate insights.
15. Drive Product Improvement
Use the insights gained from evals to directly inform and implement improvements to your AI product. The ultimate purpose of evals is to make the application better, not just to build a test suite.
16. Allocate Upfront Eval Time
Dedicate an initial investment of approximately three to four days for comprehensive error analysis and the setup of your core eval suite. This is a one-time cost that establishes a robust system.
17. Maintain Evals Efficiently
After the initial setup, the ongoing maintenance and review of your eval suite can be managed efficiently, often requiring only about 30 minutes per week. This allows for continuous improvement with minimal effort.
18. Create Custom Data Review Tools
Develop simple, custom web applications or tools to streamline the process of reviewing and annotating your product’s data. This removes friction and makes the crucial activity of looking at data more efficient.
19. Build Code-Based Evals
For failure modes that can be checked with clear, deterministic rules (e.g., output format), build code-based evaluators (like unit tests). These are cheaper and more straightforward than LLM-based judges for certain types of checks.
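A minimal sketch of a code-based eval, assuming the product is expected to emit a JSON object (the required field names are invented for illustration):

```python
import json

# Sketch of a code-based eval: a deterministic check that the
# assistant's output is valid JSON containing the fields we require.
# Field names are hypothetical, for illustration only.

REQUIRED_FIELDS = {"address", "price", "summary"}

def check_output_format(raw: str) -> bool:
    """Return True if `raw` is a JSON object with all required fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS <= data.keys()

assert check_output_format('{"address": "1 Main St", "price": 500000, "summary": "Cozy."}')
assert not check_output_format("not json at all")
assert not check_output_format('{"address": "1 Main St"}')
```

Checks like this drop straight into a unit-test suite or CI pipeline and cost nothing per run, which is why they are preferred over an LLM judge wherever a deterministic rule exists.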
20. Embrace Mindset of Learning
Approach the eval process with a mindset of continuous learning and improvement, rather than striving for perfection. The primary goal is actionable product enhancement, not flawless evals.
21. Leverage LLMs in Workflow
Utilize LLMs to assist in various aspects of your workflow, such as organizing thoughts, refining product requirements, and improving documentation based on insights from error analysis.
22. Share Eval Successes
Share your experiences and successes in implementing evals and improving AI products with the broader community. This helps inspire and educate others, fostering collective growth.
23. Amplify Others’ Learnings
Contribute to the community by sharing your knowledge and experiences with evals through blog posts or other writing. This helps amplify best practices and encourages broader adoption.
8 Key Quotes
The goal is not to do evals perfectly. It's to actionably improve your product.
Shreya Shankar
Everyone that does this immediately gets addicted to it when you're building an AI application. You just learn a lot.
Hamel Husain
You don't want to make this process so expensive that you can't do it.
Hamel Husain
An axial code basically is just a failure mode. It's like the label or category. And our goal is to get to these clusters of failure modes and figure out what is the most prevalent.
Shreya Shankar
You're never going to know what the failure modes are going to be up front. And you're always going to uncover new vibes that you think your product should have.
Hamel Husain
If someone ever reports to you agreement, you should immediately ask, okay, tell me more.
Hamel Husain
People have been burned by evals in the past. People have done evals badly, then they didn't trust it anymore. And then they're like, oh, I'm anti-evals.
Shreya Shankar
My vision for evals is not that Hamel and I become billionaires. It is that everyone can build AI products and we're all on the same page.
Shreya Shankar
2 Protocols
Error Analysis and Categorization Protocol
Hamel Husain & Shreya Shankar
- Review logs or 'traces' of interactions with your AI application.
- Adopt a 'product hat' perspective to identify what is going wrong from a user experience or business objective standpoint.
- Write quick, informal notes (open codes) for the first significant error observed in each trace, focusing on the most upstream issue.
- Continue reviewing traces and open coding until 'theoretical saturation' is reached, meaning no new types of errors are being discovered (typically around 100 traces).
- Use an LLM to categorize the collected open codes into broader themes or 'axial codes' (failure modes).
- Review and refine the LLM-generated axial codes, making them more specific and actionable if necessary.
- Utilize an LLM (e.g., via a spreadsheet formula) to automatically categorize all open codes into the refined axial codes.
- Count the occurrences of each axial code (e.g., using a pivot table) to quantify the prevalence of different problems and prioritize which ones to address.
- For obvious engineering errors, fix them directly; for subjective or complex issues, consider building an LLM-as-judge.
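The automatic categorization step in this protocol can be sketched as follows. `call_llm` is a hypothetical stand-in for your model client (in practice this might be a spreadsheet formula calling an LLM, as described above), and the axial codes are invented for illustration:

```python
# Sketch of assigning each open code (error note) to one refined
# axial code with an LLM. `call_llm` is a hypothetical stand-in for
# your model client; the axial codes are invented for illustration.

AXIAL_CODES = ["hallucinated_listing", "ignored_user_constraint",
               "formatting_error", "tone_mismatch"]

CATEGORIZE_PROMPT = """Assign this error note to exactly one category.
Categories: {categories}
Note: {note}
Answer with the category name only."""

def categorize(note: str, call_llm) -> str:
    """Map one informal error note to a refined axial code."""
    prompt = CATEGORIZE_PROMPT.format(
        categories=", ".join(AXIAL_CODES), note=note)
    label = call_llm(prompt).strip()
    # Guard against the model inventing a category of its own.
    if label not in AXIAL_CODES:
        raise ValueError(f"Unknown axial code: {label!r}")
    return label
```

Constraining the output to a fixed label set keeps the downstream counting step honest: every note lands in exactly one bucket or fails loudly.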
Building and Validating an LLM-as-Judge Protocol
Hamel Husain & Shreya Shankar
- Identify a specific, complex failure mode from your error analysis that requires subjective, automated evaluation.
- Create a binary (true/false) LLM-as-judge prompt that clearly defines the rules and criteria for 'pass' or 'fail' for that single, specific failure mode.
- Do not blindly accept the LLM's judgment; validate it against human judgment by comparing the LLM's output with your own manual assessment of traces.
- Measure the agreement between the LLM judge and human judgment, focusing specifically on reducing misalignments (false positives and false negatives) rather than just overall percentage agreement.
- Iterate on the LLM-as-judge prompt, refining its instructions and criteria, until the misalignment with human judgment is minimized and the judge is reliable.