The ultimate guide to A/B testing | Ronny Kohavi (Airbnb, Microsoft, Amazon)

Jul 27, 2023 · 1h 23m · 19 insights
Ronny Kohavi, a world expert on A/B testing and experimentation from Airbnb, Microsoft, and Amazon, shares tactical advice on running effective experiments, building an experiment-driven culture, and understanding key metrics like p-values and Twyman's Law.
Actionable Insights

1. Test Every Code Change

Implement every code change or feature as an experiment; even small modifications can have surprising impacts.

2. Define Overall Evaluation Criterion

Establish a clear Overall Evaluation Criterion (OEC) that is causally predictive of the long-term lifetime value of the user, incorporating countervailing metrics to prevent short-term gains that harm user experience.

3. Expect High Failure Rates

Be prepared for most ideas to fail (e.g., 80-92% of experiments), which helps avoid the sunk cost fallacy and encourages iterative development.

4. Allocate to Big Bets

Dedicate resources to high-risk, high-reward ideas, understanding that while most will fail, a successful one can be a breakthrough.

5. Avoid Large Redesigns

Decompose large redesigns into smaller, incremental changes and test them one factor at a time (OFAT) to learn and adjust, rather than risking a complete failure after significant investment.

6. Document Experiment Learnings

Document successes and failures, and maintain a searchable history of experiments to build institutional memory and learn from past results.

7. Focus on Surprising Results

Prioritize reviewing experiments where the estimated and actual results differ significantly, whether surprisingly positive or negative, to gain deeper insights into user behavior and system interactions.

8. Ensure Experiment Trustworthiness

Build an experimentation platform that acts as a safety net for deployments and a reliable oracle for results, as trust in the platform is crucial for data-driven decision-making.

9. Check for Sample Ratio Mismatch

Implement automated checks for Sample Ratio Mismatch (SRM) in experiments; if detected, do not trust the results and investigate the underlying cause, as it indicates a fundamental flaw.
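A minimal sketch of an automated SRM check, using a two-sided z-test on the observed split against the configured ratio. The very strict alpha here (0.001) is an illustrative choice: a mismatch indicates a broken assignment pipeline rather than random noise, so SRM alerts conventionally use a much lower threshold than ordinary significance tests.

```python
import math

def srm_check(control_users: int, treatment_users: int,
              expected_ratio: float = 0.5, alpha: float = 0.001) -> bool:
    """Return True if the observed split is consistent with the
    configured ratio; False signals a Sample Ratio Mismatch (SRM)."""
    n = control_users + treatment_users
    observed = control_users / n
    # Standard error of the observed proportion under the null hypothesis
    se = math.sqrt(expected_ratio * (1 - expected_ratio) / n)
    z = (observed - expected_ratio) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return p_value >= alpha

# Over 1M users, a 50.1%/49.9% split passes, but 50.3%/49.7% is flagged:
# an absolute deviation that looks tiny is wildly improbable at scale.
```

If this check fires, the guidance above applies: do not read the metric results at all until the cause (bot filtering, redirect loss, triggering bugs, etc.) is found and fixed.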

10. Apply Twyman’s Law

If an experiment result looks too good (or bad) to be true, investigate thoroughly for flaws or bugs before celebrating, as such figures are often incorrect.

11. Understand P-Value Limitations

Recognize that a P-value does not directly represent the probability of your treatment being better than control; instead, consider the ‘false positive risk,’ which is often much higher than commonly assumed (e.g., 26% for P<0.05 with an 8% success rate).

12. Don’t Ship Flat or Negative Results

Avoid shipping features that show flat or negative results, as they introduce maintenance overhead without providing value, unless legally required (in which case, test multiple options to find the least harmful).

13. Start A/B Testing with Sufficient Users

Begin running A/B tests when you have at least tens of thousands of users, with 200,000 users being a ‘magical’ threshold for detecting meaningful effects across various metrics.
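A common rule of thumb (n ≈ 16 · variance / delta² users per variant, for roughly 80% power at alpha = 0.05) shows where a threshold in the low hundreds of thousands comes from. The baseline rate and lift below are illustrative numbers, not figures from the episode:

```python
def users_per_variant(baseline_rate: float, relative_lift: float) -> float:
    """Rule-of-thumb sample size for a conversion-rate experiment:
    n ~= 16 * variance / delta^2 per variant (~80% power, alpha = 0.05)."""
    variance = baseline_rate * (1 - baseline_rate)  # Bernoulli variance
    delta = baseline_rate * relative_lift           # absolute effect size
    return 16 * variance / delta ** 2

# Detecting a 5% relative lift on a 5% conversion rate:
n = users_per_variant(0.05, 0.05)
# ~121,600 users per variant, so ~243,000 across control and treatment.
```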

14. Build or Buy an Experimentation Platform

Decide whether to build an internal experimentation platform or leverage third-party vendors to reduce the marginal cost of running experiments and enable self-service.

15. Shift Culture with Beachheads

To shift company culture towards experimentation, start with a receptive team that launches frequently, demonstrate success, and share surprising results to build internal advocacy and cross-pollination.

16. Ensure OEC Directional Agreement

Verify that all stakeholders agree on the directional impact of the Overall Evaluation Criterion (OEC); if half the team thinks an increase is good and the other half thinks it’s bad, the OEC is poorly defined.

17. Speed Up Results with Variance Reduction

Implement variance reduction techniques like capping skewed metrics (e.g., revenue, nights booked) or using CUPED (adjusting results with pre-experiment data) to achieve statistically significant results faster with fewer users.
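A minimal sketch of the CUPED adjustment: regress the experiment metric on a pre-experiment covariate (the same user's spend before the experiment, simulated here) and subtract the explained component. The mean is unchanged, but the variance, and hence the required sample size, drops roughly in proportion to the squared correlation.

```python
import random

def cuped_adjust(y, x):
    """CUPED: y_adj = y - theta * (x - mean(x)),
    with theta = cov(x, y) / var(x)."""
    n = len(y)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    var = sum((xi - mx) ** 2 for xi in x) / n
    theta = cov / var
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

def variance(v):
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)

# Simulated users whose spend is strongly correlated week to week:
random.seed(0)
x = [random.gauss(100, 20) for _ in range(10_000)]  # pre-experiment spend
y = [0.8 * xi + random.gauss(0, 5) for xi in x]     # in-experiment spend
adj = cuped_adjust(y, x)
# variance(adj) is a small fraction of variance(y), same mean.
```

Because the covariate is measured before assignment, it cannot be affected by the treatment, so the adjustment leaves the treatment-effect estimate unbiased.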

18. Use Structured Narratives

Adopt structured narrative documents (e.g., ‘six-pagers’) instead of PowerPoint presentations for product development to foster clearer thinking, facilitate honest feedback, and ensure decisions are well-documented.

19. Emphasize Hierarchy of Evidence

Teach and apply the ‘hierarchy of evidence’ to evaluate information, trusting anecdotal evidence the least and multiple controlled experiments the most, to make more informed decisions in all aspects of life.