Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)
1. Prioritize Evals for AI Products
Make building and mastering evals a top priority for AI product development. Husain and Shankar describe this as the highest-ROI activity for building successful AI products.
2. Start with Error Analysis
Begin your eval process by performing error analysis on your AI application’s data, focusing on identifying what’s going wrong before writing tests. This grounds your efforts and helps uncover real-world issues.
3. Manually Review Traces (Open Coding)
Conduct “open coding” by manually reviewing interaction traces from your AI product and noting the first, most upstream error you observe in each. This informal note-taking helps you quickly learn about your application’s behavior.
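A minimal sketch of how open-coding notes might be captured; the structure and example notes are hypothetical, the point is that each trace gets one free-form note about the first upstream error:

```python
from dataclasses import dataclass

@dataclass
class OpenCode:
    """One free-form note per trace: the first, most upstream error observed."""
    trace_id: str
    note: str  # informal, in your own words; no fixed taxonomy yet

# Hypothetical notes from a review session of an AI assistant
notes = [
    OpenCode("trace-001", "asked for bedrooms but tool query ignored the filter"),
    OpenCode("trace-002", "hallucinated an open-house date not in the listing"),
    OpenCode("trace-003", "tone too pushy; user asked a factual question"),
]

for oc in notes:
    print(f"{oc.trace_id}: {oc.note}")
```

Keeping notes this lightweight is deliberate: categories come later, after you have seen enough traces.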
4. Avoid LLMs for Initial Analysis
Do not rely on LLMs for the initial, free-form note-taking stage of error analysis. LLMs often lack the necessary product context to identify subtle “bad product smells” or hallucinations, leading to inaccurate assessments.
5. Appoint a Benevolent Dictator
Designate one person with deep domain expertise, often the product manager, to lead the open coding and error analysis process. This “benevolent dictator” approach prevents committees from slowing down progress.
6. Embrace Flexibility in Requirements
Maintain flexibility in your product requirements documents (PRDs) and be prepared for them to evolve as you uncover new failure modes and desired behaviors through data analysis. The stochastic nature of LLMs means you can’t foresee all issues upfront.
7. Seek Theoretical Saturation
Continue reviewing and noting errors in traces until you reach “theoretical saturation,” where you are no longer discovering new types of problems or concepts. This ensures comprehensive coverage without excessive effort.
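One way to operationalize saturation (a hedged sketch, not a prescribed method): review traces in batches and stop when a batch surfaces no failure category you haven’t already seen.

```python
def new_codes_in_batch(seen: set, batch: list) -> set:
    """Return the codes in this review batch not seen in any earlier batch."""
    return set(batch) - seen

seen = set()
batches = [
    ["missing-filter", "hallucinated-date"],
    ["hallucinated-date", "wrong-tone"],
    ["missing-filter", "wrong-tone"],  # nothing new: a sign of saturation
]
for batch in batches:
    fresh = new_codes_in_batch(seen, batch)
    seen |= set(batch)
    print(f"new codes this batch: {sorted(fresh)}")
```

When several consecutive batches come back empty, you have likely hit saturation for the current version of the product.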
8. Use LLMs for Categorizing Notes
After open coding, leverage an LLM to categorize your informal notes (open codes) into broader themes or failure modes, known as axial codes. This utilizes AI’s strength in synthesizing large amounts of information.
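A sketch of the mechanical parts of this step: building a categorization prompt from your open codes and parsing the model’s numbered-list reply into candidate axial codes. The actual LLM call is elided; the `reply` string below stands in for a model response, and all names here are illustrative.

```python
def build_axial_prompt(open_codes: list[str]) -> str:
    """Assemble a prompt asking the model to group notes into failure modes."""
    bullet_list = "\n".join(f"- {c}" for c in open_codes)
    return (
        "Group the following error notes into a small set of named failure modes.\n"
        "Reply with one category per line, formatted as 'name: short description'.\n\n"
        f"{bullet_list}"
    )

def parse_axial_reply(reply: str) -> dict[str, str]:
    """Parse 'name: description' lines into a dict of candidate axial codes."""
    codes = {}
    for line in reply.splitlines():
        if ":" in line:
            name, desc = line.split(":", 1)
            codes[name.strip().lstrip("-0123456789. ")] = desc.strip()
    return codes

# Stand-in for an actual LLM response to the prompt above
reply = (
    "1. tool-misuse: query ignores user-specified filters\n"
    "2. hallucination: facts not grounded in retrieved data\n"
)
print(parse_axial_reply(reply))
```

Parsing strictly matters here: a loose parser silently swallows malformed replies, and you end up trusting categories the model never actually proposed.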
9. Refine LLM-Generated Categories
Critically review and refine the axial codes suggested by the LLM, making them more specific and actionable. Human iteration ensures the categories are truly useful for problem-solving.
10. Prioritize with Basic Counting
Quantify the prevalence of each failure mode (axial code) using simple counting methods, such as pivot tables. This provides a clear overview of your biggest problems, guiding prioritization.
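The counting itself can be as simple as a frequency table; this sketch uses Python’s `collections.Counter` in place of a spreadsheet pivot table (labels are hypothetical):

```python
from collections import Counter

# Axial code assigned to each reviewed trace
labels = [
    "hallucination", "tool-misuse", "hallucination",
    "formatting", "hallucination", "tool-misuse",
]

counts = Counter(labels)
for code, n in counts.most_common():
    share = n / len(labels)
    print(f"{code:15s} {n:3d}  ({share:.0%})")
```

The most frequent failure mode at the top of this table is usually where to spend your next improvement cycle.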
11. Build Binary LLM Judges
For complex or subjective failure modes, design LLM-as-judge prompts that yield a binary (true/false, pass/fail) output. This simplifies decision-making and makes metrics clearer and more actionable.
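A sketch of the two pieces a binary judge needs: a prompt that demands a one-word verdict, and a parser that rejects anything else. The prompt wording and the hallucination check are illustrative, not a prescribed template.

```python
JUDGE_PROMPT = """You are evaluating an assistant's reply.
Question: does the reply invent any detail (price, date, address)
that is not present in the provided context? Answer with exactly one word:
"fail" if it invented a detail, "pass" otherwise.

Context: {context}
Reply: {reply}
"""

def parse_verdict(raw: str) -> bool:
    """Map the judge's raw text to binary pass/fail; reject anything else."""
    verdict = raw.strip().lower()
    if verdict not in ("pass", "fail"):
        raise ValueError(f"non-binary judge output: {raw!r}")
    return verdict == "pass"

print(parse_verdict("Pass"))    # True
print(parse_verdict(" fail "))  # False
```

Raising on unexpected output, rather than defaulting to pass or fail, keeps judge drift visible instead of silently corrupting your metrics.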
12. Validate LLM Judges Manually
Validate your LLM judge by comparing its outputs against human judgments on a sample of traces. This ensures the judge’s accuracy and alignment with human expectations, building trust in your evals.
13. Avoid Misleading Agreement Metrics
Do not rely solely on simple “agreement percentage” when validating LLM judges, as it can be misleading, especially for rare errors. Instead, use a confusion matrix to understand specific types of misalignment.
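A small synthetic demonstration of the trap (numbers invented for illustration): with rare failures, a judge can agree with humans 96% of the time while catching only 20% of the actual failures.

```python
def confusion(human: list[bool], judge: list[bool]) -> dict[str, int]:
    """Cells of a 2x2 confusion matrix; True = trace passed."""
    cells = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for h, j in zip(human, judge):
        if h and j:       cells["tp"] += 1  # both say pass
        elif not h and j: cells["fp"] += 1  # judge misses a real failure
        elif h and not j: cells["fn"] += 1  # judge flags a good trace
        else:             cells["tn"] += 1  # both say fail
    return cells

# 100 traces, 5 real failures; the judge misses 4 of them
human = [True] * 95 + [False] * 5
judge = [True] * 95 + [True] * 4 + [False] * 1

agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
c = confusion(human, judge)
failure_recall = c["tn"] / (c["tn"] + c["fp"])  # share of real failures caught
print(f"agreement: {agreement:.0%}, failures caught: {failure_recall:.0%}")
```

The headline agreement number looks reassuring; the confusion matrix shows the judge is nearly blind to the very errors it exists to catch.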
14. Integrate Evals for Monitoring
Implement automated evals (code-based and LLM-as-judge) into unit tests, CI/CD pipelines, and continuous online monitoring of production data. This provides ongoing quality assurance and real-time failure rate insights.
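For the online-monitoring half, the core loop can be as simple as computing a failure rate over sampled production traces and alerting past a threshold. The threshold and sample here are hypothetical:

```python
def failure_rate(results: list[bool]) -> float:
    """Fraction of sampled production traces that failed their evals."""
    return results.count(False) / len(results)

ALERT_THRESHOLD = 0.10  # hypothetical: alert if more than 10% of traces fail

sampled = [True] * 44 + [False] * 6  # e.g., today's sample of 50 eval results
rate = failure_rate(sampled)
if rate > ALERT_THRESHOLD:
    print(f"ALERT: failure rate {rate:.0%} exceeds {ALERT_THRESHOLD:.0%}")
else:
    print(f"ok: failure rate {rate:.0%}")
```

The same pass/fail results can back a CI gate before deploys and a dashboard time series after them.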
15. Drive Product Improvement
Use the insights gained from evals to directly inform and implement improvements to your AI product. The ultimate purpose of evals is to make the application better, not just to build a test suite.
16. Allocate Upfront Eval Time
Dedicate an initial investment of approximately three to four days for comprehensive error analysis and the setup of your core eval suite. This is a one-time cost that establishes a robust system.
17. Maintain Evals Efficiently
After the initial setup, the ongoing maintenance and review of your eval suite can be managed efficiently, often requiring only about 30 minutes per week. This allows for continuous improvement with minimal effort.
18. Create Custom Data Review Tools
Develop simple, custom web applications or tools to streamline the process of reviewing and annotating your product’s data. This removes friction and makes the crucial activity of looking at data more efficient.
19. Build Code-Based Evals
For failure modes that can be checked with clear, deterministic rules (e.g., output format), build code-based evaluators (like unit tests). These are cheaper and more straightforward than LLM-based judges for certain types of checks.
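A sketch of a deterministic, code-based evaluator: checking that the model’s output is valid JSON with the keys a downstream UI expects. The required keys are illustrative.

```python
import json

def check_output_format(raw: str) -> bool:
    """Deterministic eval: output must be a JSON object with required keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and {"answer", "sources"} <= obj.keys()

print(check_output_format('{"answer": "3-bed condo", "sources": ["mls-42"]}'))  # True
print(check_output_format("Sure! Here is the answer..."))                       # False
```

Checks like this cost nothing per run and never drift, so they should handle every failure mode they can before you reach for an LLM judge.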
20. Embrace a Learning Mindset
Approach the eval process with a mindset of continuous learning and improvement, rather than striving for perfection. The primary goal is actionable product enhancement, not flawless evals.
21. Leverage LLMs in Workflow
Utilize LLMs to assist in various aspects of your workflow, such as organizing thoughts, refining product requirements, and improving documentation based on insights from error analysis.
22. Share Eval Successes
Share your experiences and successes in implementing evals and improving AI products with the broader community. This helps inspire and educate others, fostering collective growth.
23. Amplify Others’ Learnings
Beyond sharing your own results, amplify what others in the community have learned about evals by highlighting and building on their blog posts and write-ups. This spreads best practices and encourages broader adoption.