Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

Sep 25, 2025 · 1h 46m · 23 insights
Hamel Husain and Shreya Shankar, co-creators of the #1 Maven course on evals, discuss how to systematically measure and improve AI applications. They walk through the process of developing effective evals, address major misconceptions, and share best practices for AI product builders.
Actionable Insights

1. Prioritize Evals for AI Products

Make building and mastering evals a top priority for AI product development; the hosts call it the highest-ROI activity for creating successful AI products.

2. Start with Error Analysis

Begin your eval process by performing error analysis on your AI application’s data, focusing on identifying what’s going wrong before writing tests. This grounds your efforts and helps uncover real-world issues.

3. Manually Review Traces (Open Code)

Conduct “open coding” by manually reviewing interaction traces from your AI product and noting the first, most upstream error you observe in each. This informal note-taking helps you quickly learn about your application’s behavior.
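Open coding needs almost no tooling; a minimal sketch of the record you might keep per trace (all names here are illustrative, not from the episode):

```python
from dataclasses import dataclass

@dataclass
class TraceNote:
    """One open-coding note: free-form text describing the first,
    most upstream error observed in a single trace."""
    trace_id: str
    note: str  # informal, e.g. "ignored the user's date constraint"

# Reviewing a trace yields at most one note -- the first error found.
notes = [
    TraceNote("t-001", "hallucinated a listing address"),
    TraceNote("t-002", "tone too formal for a follow-up text"),
]
```

A flat list like this is deliberately unstructured; categorization comes later, in the axial-coding step.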

4. Avoid LLMs for Initial Analysis

Do not rely on LLMs for the initial, free-form note-taking stage of error analysis. LLMs often lack the necessary product context to identify subtle “bad product smells” or hallucinations, leading to inaccurate assessments.

5. Appoint a Benevolent Dictator

Designate one person with deep domain expertise, often the product manager, to lead the open coding and error analysis process. This “benevolent dictator” approach prevents committees from slowing down progress.

6. Embrace Flexibility in Requirements

Maintain flexibility in your product requirements documents (PRDs) and be prepared for them to evolve as you uncover new failure modes and desired behaviors through data analysis. The stochastic nature of LLMs means you can’t foresee all issues upfront.

7. Seek Theoretical Saturation

Continue reviewing and noting errors in traces until you reach “theoretical saturation,” where you are no longer discovering new types of problems or concepts. This ensures comprehensive coverage without excessive effort.
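One rough heuristic for "no new problems are appearing" can be coded directly; this is an illustrative sketch, not a method from the episode:

```python
def is_saturated(failure_types_per_trace, window=20):
    """Heuristic sketch of theoretical saturation: stop when the last
    `window` reviewed traces introduced no failure type you had not
    already seen earlier. Input is a list of sets, in review order."""
    if len(failure_types_per_trace) <= window:
        return False  # not enough traces reviewed to judge
    seen_before = set().union(*failure_types_per_trace[:-window])
    recent = set().union(*failure_types_per_trace[-window:])
    return recent <= seen_before  # nothing new in the recent window
```

The window size is a judgment call; the point is to have an explicit stopping rule rather than reviewing indefinitely.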

8. Use LLMs for Categorizing Notes

After open coding, leverage an LLM to categorize your informal notes (open codes) into broader themes or failure modes, known as axial codes. This utilizes AI’s strength in synthesizing large amounts of information.
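The LLM step here is just a well-framed prompt over your notes; a minimal sketch of building that prompt (the wording and category count are assumptions, and sending it to a model is left to whatever client you use):

```python
def axial_coding_prompt(open_codes):
    """Build a prompt asking an LLM to cluster free-form open codes
    into a handful of named failure modes (axial codes)."""
    bullet_list = "\n".join(f"- {code}" for code in open_codes)
    return (
        "Below are informal notes from reviewing traces of an AI product.\n"
        "Group them into 5-10 named failure modes and list which notes\n"
        "fall under each. Notes:\n" + bullet_list
    )
```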

9. Refine LLM-Generated Categories

Critically review and refine the axial codes suggested by the LLM, making them more specific and actionable. Human iteration ensures the categories are truly useful for problem-solving.

10. Prioritize with Basic Counting

Quantify the prevalence of each failure mode (axial code) using simple counting methods, such as pivot tables. This provides a clear overview of your biggest problems, guiding prioritization.
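A spreadsheet pivot table works fine; the equivalent in a few lines of Python, with illustrative failure-mode names:

```python
from collections import Counter

# Each reviewed trace carries its axial code (failure mode).
labels = [
    "missed_constraint", "bad_tone", "missed_constraint",
    "hallucination", "missed_constraint", "bad_tone",
]

counts = Counter(labels)
for failure_mode, n in counts.most_common():
    print(f"{failure_mode}: {n}")
# prints missed_constraint: 3, then bad_tone: 2, then hallucination: 1
```

The sorted counts are the prioritization: fix the most frequent failure mode first.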

11. Build Binary LLM Judges

For complex or subjective failure modes, design LLM-as-judge prompts that yield a binary (true/false, pass/fail) output. This simplifies decision-making and makes metrics clearer and more actionable.
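A binary judge is a prompt that forces a two-way verdict plus strict parsing of the answer. A sketch, where the product, criterion, and PASS/FAIL convention are illustrative assumptions:

```python
JUDGE_PROMPT = """You are evaluating one trace of an AI assistant.

Criterion: the reply honors every scheduling constraint the user stated.

Trace:
{trace}

Answer with exactly one word: PASS or FAIL."""

def parse_verdict(raw: str) -> bool:
    """Map the judge model's raw completion to a binary outcome;
    anything other than a clean PASS/FAIL raises rather than guessing."""
    verdict = raw.strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"unparseable judge output: {raw!r}")
    return verdict == "PASS"
```

Refusing to parse ambiguous outputs keeps the metric honest: an unparseable verdict is a bug in the judge prompt, not a data point.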

12. Validate LLM Judges Manually

Validate your LLM judge by comparing its outputs against human judgments on a sample of traces. This ensures the judge’s accuracy and alignment with human expectations, building trust in your evals.

13. Avoid Misleading Agreement Metrics

Do not rely solely on simple “agreement percentage” when validating LLM judges, as it can be misleading, especially for rare errors. Instead, use a confusion matrix to understand specific types of misalignment.
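The point is easy to show in code: with rare errors, raw agreement can look excellent while the judge misses most real failures. A sketch (labels use True = pass):

```python
def judge_alignment(human, judge):
    """Compare judge verdicts to human labels and report the
    confusion-matrix cells as per-class rates, not just agreement."""
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum(not h and not j for h, j in zip(human, judge))
    fp = sum(not h and j for h, j in zip(human, judge))   # judge too lenient
    fn = sum(h and not j for h, j in zip(human, judge))   # judge too strict
    return {
        "agreement": (tp + tn) / len(human),
        "true_positive_rate": tp / (tp + fn) if tp + fn else None,
        "true_negative_rate": tn / (tn + fp) if tn + fp else None,
    }
```

For example, if 95 of 100 traces pass and the judge catches only 1 of the 5 failures, agreement is still 96% while the true negative rate is a dismal 20%.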

14. Integrate Evals for Monitoring

Implement automated evals (code-based and LLM-as-judge) into unit tests, CI/CD pipelines, and continuous online monitoring of production data. This provides ongoing quality assurance and real-time failure rate insights.
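Phrasing a code-based eval as an ordinary unit test is what lets the same check run locally, in CI, and against sampled production traces. A pytest-style sketch, where `run_app` is a stand-in for your real application call:

```python
import json

def run_app(prompt: str) -> str:
    """Stand-in for the application; in CI this would invoke the real
    system or replay a recorded trace."""
    return json.dumps({"reply": "ok", "sources": []})

def test_reply_is_valid_json_with_required_keys():
    # Fails the build if the app's output stops being parseable JSON
    # or drops a required field.
    out = json.loads(run_app("any prompt"))
    assert {"reply", "sources"} <= out.keys()
```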

15. Drive Product Improvement

Use the insights gained from evals to directly inform and implement improvements to your AI product. The ultimate purpose of evals is to make the application better, not just to build a test suite.

16. Allocate Upfront Eval Time

Dedicate an initial investment of approximately three to four days for comprehensive error analysis and the setup of your core eval suite. This is a one-time cost that establishes a robust system.

17. Maintain Evals Efficiently

After the initial setup, the ongoing maintenance and review of your eval suite can be managed efficiently, often requiring only about 30 minutes per week. This allows for continuous improvement with minimal effort.

18. Create Custom Data Review Tools

Develop simple, custom web applications or tools to streamline the process of reviewing and annotating your product’s data. This removes friction and makes the crucial activity of looking at data more efficient.

19. Build Code-Based Evals

For failure modes that can be checked with clear, deterministic rules (e.g., output format), build code-based evaluators (like unit tests). These are cheaper and more straightforward than LLM-based judges for certain types of checks.
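Deterministic checks need no model at all. A sketch with two illustrative criteria (a length cap and a leaked boilerplate phrase); your real rules would come from the failure modes found in error analysis:

```python
import re

def check_format(output: str) -> list:
    """Code-based evaluator: return the names of failed checks;
    an empty list means the output passes."""
    failures = []
    if len(output.split()) > 150:
        failures.append("too_long")
    if re.search(r"\bas an ai\b", output, re.IGNORECASE):
        failures.append("ai_disclaimer_leak")
    return failures
```

Because these checks are cheap and exact, they can run on every trace, leaving the LLM judge for the genuinely subjective failure modes.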

20. Embrace a Learning Mindset

Approach the eval process with a mindset of continuous learning and improvement, rather than striving for perfection. The primary goal is actionable product enhancement, not flawless evals.

21. Leverage LLMs in Workflow

Utilize LLMs to assist in various aspects of your workflow, such as organizing thoughts, refining product requirements, and improving documentation based on insights from error analysis.

22. Share Eval Successes

Share your experiences and successes in implementing evals and improving AI products with the broader community. This helps inspire and educate others, fostering collective growth.

23. Amplify Others’ Learnings

Amplify the community’s learnings by writing about your own eval work in blog posts or other public write-ups. This helps spread best practices and encourages broader adoption.