Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)
1. Prioritize Evals for AI Products
Make building and mastering evals a top priority for AI product development. Husain and Shankar describe this as the highest-ROI activity for building successful AI products.
2. Start with Error Analysis
Begin your eval process by performing error analysis on your AI application’s data, focusing on identifying what’s going wrong before writing tests. This grounds your efforts and helps uncover real-world issues.
3. Manually Review Traces (Open Coding)
Conduct “open coding” by manually reviewing interaction traces from your AI product and noting the first, most upstream error you observe in each. This informal note-taking helps you quickly learn about your application’s behavior.
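A minimal sketch of how open-coding notes might be captured; the structure and example notes are hypothetical, the point is that each trace gets one free-form note about the first upstream error:

```python
from dataclasses import dataclass

@dataclass
class OpenCode:
    """One free-form note per trace: the first, most upstream error observed."""
    trace_id: str
    note: str  # informal, in your own words; no fixed taxonomy yet

# Hypothetical notes from a review session of an AI assistant
notes = [
    OpenCode("trace-001", "asked for bedrooms but tool query ignored the filter"),
    OpenCode("trace-002", "hallucinated an open-house date not in the listing"),
    OpenCode("trace-003", "tone too pushy; user asked a factual question"),
]

for oc in notes:
    print(f"{oc.trace_id}: {oc.note}")
```

Keeping notes this lightweight is deliberate: categories come later, after you have seen enough traces.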
4. Avoid LLMs for Initial Analysis
Do not rely on LLMs for the initial, free-form note-taking stage of error analysis. LLMs often lack the necessary product context to identify subtle “bad product smells” or hallucinations, leading to inaccurate assessments.
5. Appoint a Benevolent Dictator
Designate one person with deep domain expertise, often the product manager, to lead the open coding and error analysis process. This “benevolent dictator” approach prevents committees from slowing down progress.
6. Embrace Flexibility in Requirements
Maintain flexibility in your product requirements documents (PRDs) and be prepared for them to evolve as you uncover new failure modes and desired behaviors through data analysis. The stochastic nature of LLMs means you can’t foresee all issues upfront.
7. Seek Theoretical Saturation
Continue reviewing and noting errors in traces until you reach “theoretical saturation,” where you are no longer discovering new types of problems or concepts. This ensures comprehensive coverage without excessive effort.
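One way to operationalize saturation (a hedged sketch, not a prescribed method): review traces in batches and stop when a batch surfaces no failure category you haven’t already seen.

```python
def new_codes_in_batch(seen: set, batch: list) -> set:
    """Return the codes in this review batch not seen in any earlier batch."""
    return set(batch) - seen

seen = set()
batches = [
    ["missing-filter", "hallucinated-date"],
    ["hallucinated-date", "wrong-tone"],
    ["missing-filter", "wrong-tone"],  # nothing new: a sign of saturation
]
for batch in batches:
    fresh = new_codes_in_batch(seen, batch)
    seen |= set(batch)
    print(f"new codes this batch: {sorted(fresh)}")
```

When several consecutive batches come back empty, you have likely hit saturation for the current version of the product.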
8. Use LLMs for Categorizing Notes
After open coding, leverage an LLM to categorize your informal notes (open codes) into broader themes or failure modes, known as axial codes. This utilizes AI’s strength in synthesizing large amounts of information.
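A sketch of the mechanical parts of this step: building a categorization prompt from your open codes and parsing the model’s numbered-list reply into candidate axial codes. The actual LLM call is elided; the `reply` string below stands in for a model response, and all names here are illustrative.

```python
def build_axial_prompt(open_codes: list[str]) -> str:
    """Assemble a prompt asking the model to group notes into failure modes."""
    bullet_list = "\n".join(f"- {c}" for c in open_codes)
    return (
        "Group the following error notes into a small set of named failure modes.\n"
        "Reply with one category per line, formatted as 'name: short description'.\n\n"
        f"{bullet_list}"
    )

def parse_axial_reply(reply: str) -> dict[str, str]:
    """Parse 'name: description' lines into a dict of candidate axial codes."""
    codes = {}
    for line in reply.splitlines():
        if ":" in line:
            name, desc = line.split(":", 1)
            codes[name.strip().lstrip("-0123456789. ")] = desc.strip()
    return codes

# Stand-in for an actual LLM response to the prompt above
reply = (
    "1. tool-misuse: query ignores user-specified filters\n"
    "2. hallucination: facts not grounded in retrieved data\n"
)
print(parse_axial_reply(reply))
```

Parsing strictly matters here: a loose parser silently swallows malformed replies, and you end up trusting categories the model never actually proposed.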
9. Refine LLM-Generated Categories
Critically review and refine the axial codes suggested by the LLM, making them more specific and actionable. Human iteration ensures the categories are truly useful for problem-solving.
10. Prioritize with Basic Counting
Quantify the prevalence of each failure mode (axial code) using simple counting methods, such as pivot tables. This provides a clear overview of your biggest problems, guiding prioritization.
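The counting itself can be as simple as a frequency table; this sketch uses Python’s `collections.Counter` in place of a spreadsheet pivot table (labels are hypothetical):

```python
from collections import Counter

# Axial code assigned to each reviewed trace
labels = [
    "hallucination", "tool-misuse", "hallucination",
    "formatting", "hallucination", "tool-misuse",
]

counts = Counter(labels)
for code, n in counts.most_common():
    share = n / len(labels)
    print(f"{code:15s} {n:3d}  ({share:.0%})")
```

The most frequent failure mode at the top of this table is usually where to spend your next improvement cycle.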
11. Build Binary LLM Judges
For complex or subjective failure modes, design LLM-as-judge prompts that yield a binary (true/false, pass/fail) output. This simplifies decision-making and makes metrics clearer and more actionable.
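A sketch of the two pieces a binary judge needs: a prompt that demands a one-word verdict, and a parser that rejects anything else. The prompt wording and the hallucination check are illustrative, not a prescribed template.

```python
JUDGE_PROMPT = """You are evaluating an assistant's reply.
Question: does the reply invent any detail (price, date, address)
that is not present in the provided context? Answer with exactly one word:
"fail" if it invented a detail, "pass" otherwise.

Context: {context}
Reply: {reply}
"""

def parse_verdict(raw: str) -> bool:
    """Map the judge's raw text to binary pass/fail; reject anything else."""
    verdict = raw.strip().lower()
    if verdict not in ("pass", "fail"):
        raise ValueError(f"non-binary judge output: {raw!r}")
    return verdict == "pass"

print(parse_verdict("Pass"))    # True
print(parse_verdict(" fail "))  # False
```

Raising on unexpected output, rather than defaulting to pass or fail, keeps judge drift visible instead of silently corrupting your metrics.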
12. Validate LLM Judges Manually
Validate your LLM judge by comparing its outputs against human judgments on a sample of traces. This ensures the judge’s accuracy and alignment with human expectations, building trust in your evals.
13. Avoid Misleading Agreement Metrics
Do not rely solely on simple “agreement percentage” when validating LLM judges, as it can be misleading, especially for rare errors. Instead, use a confusion matrix to understand specific types of misalignment.
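A small synthetic demonstration of the trap (numbers invented for illustration): with rare failures, a judge can agree with humans 96% of the time while catching only 20% of the actual failures.

```python
def confusion(human: list[bool], judge: list[bool]) -> dict[str, int]:
    """Cells of a 2x2 confusion matrix; True = trace passed."""
    cells = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for h, j in zip(human, judge):
        if h and j:       cells["tp"] += 1  # both say pass
        elif not h and j: cells["fp"] += 1  # judge misses a real failure
        elif h and not j: cells["fn"] += 1  # judge flags a good trace
        else:             cells["tn"] += 1  # both say fail
    return cells

# 100 traces, 5 real failures; the judge misses 4 of them
human = [True] * 95 + [False] * 5
judge = [True] * 95 + [True] * 4 + [False] * 1

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
c = confusion(human, judge)
failure_recall = c["tn"] / (c["tn"] + c["fp"])  # share of real failures caught
print(f"agreement: {agreement:.0%}, failures caught: {failure_recall:.0%}")
```

The headline agreement number looks reassuring; the confusion matrix shows the judge is nearly blind to the very errors it exists to catch.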
14. Integrate Evals for Monitoring
Implement automated evals (code-based and LLM-as-judge) into unit tests, CI/CD pipelines, and continuous online monitoring of production data. This provides ongoing quality assurance and real-time failure rate insights.
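For the online-monitoring half, the core loop can be as simple as computing a failure rate over sampled production traces and alerting past a threshold. The threshold and sample here are hypothetical:

```python
def failure_rate(results: list[bool]) -> float:
    """Fraction of sampled production traces that failed their evals."""
    return results.count(False) / len(results)

ALERT_THRESHOLD = 0.10  # hypothetical: alert if more than 10% of traces fail

sampled = [True] * 44 + [False] * 6  # e.g., today's sample of 50 eval results
rate = failure_rate(sampled)
if rate > ALERT_THRESHOLD:
    print(f"ALERT: failure rate {rate:.0%} exceeds {ALERT_THRESHOLD:.0%}")
else:
    print(f"ok: failure rate {rate:.0%}")
```

The same pass/fail results can back a CI gate before deploys and a dashboard time series after them.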
15. Drive Product Improvement
Use the insights gained from evals to directly inform and implement improvements to your AI product. The ultimate purpose of evals is to make the application better, not just to build a test suite.
16. Allocate Upfront Eval Time
Dedicate an initial investment of approximately three to four days for comprehensive error analysis and the setup of your core eval suite. This is a one-time cost that establishes a robust system.
17. Maintain Evals Efficiently
After the initial setup, the ongoing maintenance and review of your eval suite can be managed efficiently, often requiring only about 30 minutes per week. This allows for continuous improvement with minimal effort.
18. Create Custom Data Review Tools
Develop simple, custom web applications or tools to streamline the process of reviewing and annotating your product’s data. This removes friction and makes the crucial activity of looking at data more efficient.
19. Build Code-Based Evals
For failure modes that can be checked with clear, deterministic rules (e.g., output format), build code-based evaluators (like unit tests). These are cheaper and more straightforward than LLM-based judges for certain types of checks.
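A sketch of a deterministic, code-based evaluator: checking that the model’s output is valid JSON with the keys a downstream UI expects. The required keys are illustrative.

```python
import json

def check_output_format(raw: str) -> bool:
    """Deterministic eval: output must be a JSON object with required keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and {"answer", "sources"} <= obj.keys()

print(check_output_format('{"answer": "3-bed condo", "sources": ["mls-42"]}'))  # True
print(check_output_format("Sure! Here is the answer..."))                       # False
```

Checks like this cost nothing per run and never drift, so they should handle every failure mode they can before you reach for an LLM judge.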
20. Embrace a Learning Mindset
Approach the eval process with a mindset of continuous learning and improvement, rather than striving for perfection. The primary goal is actionable product enhancement, not flawless evals.
21. Leverage LLMs in Workflow
Utilize LLMs to assist in various aspects of your workflow, such as organizing thoughts, refining product requirements, and improving documentation based on insights from error analysis.
22. Share Eval Successes
Share your experiences and successes in implementing evals and improving AI products with the broader community. This helps inspire and educate others, fostering collective growth.
23. Amplify Others’ Learnings
Beyond sharing your own results, amplify what others in the community have learned about evals by highlighting and building on their blog posts and write-ups. This spreads best practices and encourages broader adoption.