The ultimate guide to A/B testing | Ronny Kohavi (Airbnb, Microsoft, Amazon)

Jul 27, 2023 · 1h 23m · 19 insights
Ronny Kohavi, a world expert on A/B testing and experimentation from Airbnb, Microsoft, and Amazon, shares tactical advice on running effective experiments, building an experiment-driven culture, and understanding key metrics like p-values and Twyman's Law.
Actionable Insights

1. Test Every Code Change

Implement every code change or feature as an experiment; even small modifications can have surprising impacts.

2. Define Overall Evaluation Criterion

Establish a clear Overall Evaluation Criterion (OEC) that is causally predictive of the long-term lifetime value of the user, incorporating countervailing metrics to prevent short-term gains that harm user experience.

3. Expect High Failure Rates

Be prepared for most ideas to fail (e.g., 80-92% of experiments), which helps avoid the sunk cost fallacy and encourages iterative development.

4. Allocate to Big Bets

Dedicate resources to high-risk, high-reward ideas, understanding that while most will fail, a successful one can be a breakthrough.

5. Avoid Large Redesigns

Decompose large redesigns into smaller, incremental changes and test them one factor at a time (OFAT) to learn and adjust, rather than risking a complete failure after significant investment.

6. Document Experiment Learnings

Document successes and failures, and maintain a searchable history of experiments to build institutional memory and learn from past results.

7. Focus on Surprising Results

Prioritize reviewing experiments where the estimated and actual results differ significantly, whether surprisingly positive or negative, to gain deeper insights into user behavior and system interactions.

8. Ensure Experiment Trustworthiness

Build an experimentation platform that acts as a safety net for deployments and a reliable oracle for results, as trust in the platform is crucial for data-driven decision-making.

9. Check for Sample Ratio Mismatch

Implement automated checks for Sample Ratio Mismatch (SRM) in experiments; if detected, do not trust the results and investigate the underlying cause, as it indicates a fundamental flaw.
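A minimal sketch of an automated SRM check, using a two-sided z-test on the observed split against the configured ratio. The very strict alpha here (0.001) is an illustrative choice: a mismatch indicates a broken assignment pipeline rather than random noise, so SRM alerts conventionally use a much lower threshold than ordinary significance tests.

```python
import math

def srm_check(control_users: int, treatment_users: int,
              expected_ratio: float = 0.5, alpha: float = 0.001) -> bool:
    """Return True if the observed split is consistent with the
    configured ratio; False signals a Sample Ratio Mismatch (SRM)."""
    n = control_users + treatment_users
    observed = control_users / n
    # Standard error of the observed proportion under the null hypothesis
    se = math.sqrt(expected_ratio * (1 - expected_ratio) / n)
    z = (observed - expected_ratio) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return p_value >= alpha

# Over 1M users, a 50.1%/49.9% split passes, but 50.3%/49.7% is flagged:
# an absolute deviation that looks tiny is wildly improbable at scale.
```

If this check fires, the guidance above applies: do not read the metric results at all until the cause (bot filtering, redirect loss, triggering bugs, etc.) is found and fixed.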

10. Apply Twyman’s Law

If an experiment result looks too good (or bad) to be true, investigate thoroughly for flaws or bugs before celebrating, as such figures are often incorrect.

11. Understand P-Value Limitations

Recognize that a P-value does not directly represent the probability of your treatment being better than control; instead, consider the ‘false positive risk,’ which is often much higher than commonly assumed (e.g., 26% for P<0.05 with an 8% success rate).

12. Don’t Ship Flat or Negative Results

Avoid shipping features that show flat or negative results, as they introduce maintenance overhead without providing value, unless legally required (in which case, test multiple options to find the least harmful).

13. Start A/B Testing with Sufficient Users

Begin running A/B tests when you have at least tens of thousands of users, with 200,000 users being a ‘magical’ threshold for detecting meaningful effects across various metrics.
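A common rule of thumb (n ≈ 16 · variance / delta² users per variant, for roughly 80% power at alpha = 0.05) shows where a threshold in the low hundreds of thousands comes from. The baseline rate and lift below are illustrative numbers, not figures from the episode:

```python
def users_per_variant(baseline_rate: float, relative_lift: float) -> float:
    """Rule-of-thumb sample size for a conversion-rate experiment:
    n ~= 16 * variance / delta^2 per variant (~80% power, alpha = 0.05)."""
    variance = baseline_rate * (1 - baseline_rate)  # Bernoulli variance
    delta = baseline_rate * relative_lift           # absolute effect size
    return 16 * variance / delta ** 2

# Detecting a 5% relative lift on a 5% conversion rate:
n = users_per_variant(0.05, 0.05)
# ~121,600 users per variant, so ~243,000 across control and treatment.
```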

14. Build or Buy an Experimentation Platform

Decide whether to build an internal experimentation platform or leverage third-party vendors to reduce the marginal cost of running experiments and enable self-service.

15. Shift Culture with Beachheads

To shift company culture towards experimentation, start with a receptive team that launches frequently, demonstrate success, and share surprising results to build internal advocacy and cross-pollination.

16. Ensure OEC Directional Agreement

Verify that all stakeholders agree on the directional impact of the Overall Evaluation Criterion (OEC); if half the team thinks an increase is good and the other half thinks it’s bad, the OEC is poorly defined.

17. Speed Up Results with Variance Reduction

Implement variance reduction techniques like capping skewed metrics (e.g., revenue, nights booked) or using CUPED (adjusting results with pre-experiment data) to achieve statistically significant results faster with fewer users.
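A minimal sketch of the CUPED adjustment: regress the experiment metric on a pre-experiment covariate (the same user's spend before the experiment, simulated here) and subtract the explained component. The mean is unchanged, but the variance, and hence the required sample size, drops roughly in proportion to the squared correlation.

```python
import random

def cuped_adjust(y, x):
    """CUPED: y_adj = y - theta * (x - mean(x)),
    with theta = cov(x, y) / var(x)."""
    n = len(y)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    var = sum((xi - mx) ** 2 for xi in x) / n
    theta = cov / var
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

def variance(v):
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)

# Simulated users whose spend is strongly correlated week to week:
random.seed(0)
x = [random.gauss(100, 20) for _ in range(10_000)]  # pre-experiment spend
y = [0.8 * xi + random.gauss(0, 5) for xi in x]     # in-experiment spend
adj = cuped_adjust(y, x)
# variance(adj) is a small fraction of variance(y), same mean.
```

Because the covariate is measured before assignment, it cannot be affected by the treatment, so the adjustment leaves the treatment-effect estimate unbiased.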

18. Use Structured Narratives

Adopt structured narrative documents (e.g., ‘six-pagers’) instead of PowerPoint presentations for product development to foster clearer thinking, facilitate honest feedback, and ensure decisions are well-documented.

19. Emphasize Hierarchy of Evidence

Teach and apply the ‘hierarchy of evidence’ to evaluate information, trusting anecdotal evidence the least and multiple controlled experiments the most, to make more informed decisions in all aspects of life.