The ultimate guide to A/B testing | Ronny Kohavi (Airbnb, Microsoft, Amazon)
Ronny Kohavi, a world expert on A/B testing and experimentation from Airbnb, Microsoft, and Amazon, shares tactical advice on running effective experiments, building an experiment-driven culture, and understanding key metrics like p-values and Twyman's Law.
Deep Dive Analysis
16 Topic Outline
Most Surprising A/B Test Results and Learnings
Small Effort, Huge Gains vs. Incremental Improvements
Typical Experiment Failure Rates Across Companies
Importance of Institutional Learning and Documentation
Balancing Incremental vs. High-Risk, High-Reward Ideas
When and When Not to A/B Test
Defining the Overall Evaluation Criterion (OEC)
Addressing Long-Term Metrics and Spamming Users
The Problem with Product Redesigns
Implementing Experimentation Culture at Microsoft
Experimentation at Airbnb and During Crisis
The Critical Role of Trust in Experimentation
Detecting Flaws: Sample Ratio Mismatch and Twyman's Law
Understanding and Misinterpreting P-values
Getting Started and Shifting Organizational Culture
Building Experimentation Platforms and Improving Speed
8 Key Concepts
Institutional Learning
The process of systematically documenting and summarizing experiment successes and failures, especially surprising ones, to build organizational memory and prevent repeating mistakes. This involves regularly reviewing and sharing insights from experiments to foster continuous improvement.
Surprising Experiment
An experiment whose actual observed result differs substantially, in absolute value, from the result estimated beforehand. This applies whether the outcome is a surprising win or a surprising loss, as both offer valuable learning opportunities.
Overall Evaluation Criterion (OEC)
A carefully defined metric that a company optimizes for, designed to be causally predictive of the long-term lifetime value of a user. It often incorporates countervailing metrics or constraints to balance short-term gains (like revenue) with critical aspects such as user experience or long-term retention.
Experimentation Platform as a Safety Net and Oracle
The dual role of an experimentation platform: acting as a safety net by enabling quick aborts of bad launches (safe deployments) and serving as an oracle by providing trustworthy, scientifically sound results for key metrics and guardrails at the end of an experiment.
Sample Ratio Mismatch (SRM)
A critical red flag in an A/B test where the actual distribution of users between control and treatment groups significantly deviates from the intended split (e.g., 50/50). An SRM indicates a fundamental flaw in the experiment's setup, randomization, or data collection, making the results untrustworthy.
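An SRM check can be automated with a simple chi-squared goodness-of-fit test. The sketch below uses only the standard library; the `srm_check` helper name and its alpha threshold are illustrative, and the first example's user counts mirror a published SRM case discussed in Kohavi's work, where a split that looks close (49.8% vs. 50.2%) is in fact a severe mismatch at scale:

```python
import math

def srm_check(control_users: int, treatment_users: int,
              expected_ratio: float = 0.5, alpha: float = 0.001) -> bool:
    """Return True if a Sample Ratio Mismatch is detected.

    Chi-squared goodness-of-fit test (1 degree of freedom) against the
    intended split. A very small alpha (e.g. 0.001) is typical so that
    only real mismatches, not random noise, trigger the alarm.
    """
    total = control_users + treatment_users
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # For 1 df, the chi-squared survival function is erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p_value < alpha

# 821,588 vs. 815,482 users looks like a near-50/50 split,
# but at this scale it is a massive SRM: do not trust the results.
print(srm_check(821_588, 815_482))   # True
print(srm_check(10_000, 10_100))     # False: within chance for a 50/50 split
```

The aggressive (small) alpha is deliberate: an SRM test runs on every experiment, so a looser threshold would flood teams with false alarms.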
Twyman's Law
The principle stating that if any figure or result from an experiment looks unusually interesting or dramatically different from expectations, it is usually wrong and requires thorough investigation. This law encourages skepticism towards unexpectedly large positive or negative outcomes.
P-value Misinterpretation
The common misunderstanding that 1 minus the p-value represents the probability that the treatment is better than the control. This is incorrect, as the p-value is a conditional probability that assumes the null hypothesis (no difference between groups) is true.
Variance Reduction Techniques
Methods used to decrease the variability of metrics in A/B tests, which allows for faster detection of statistically significant results with fewer users. Examples include capping skewed metrics (e.g., revenue, nights booked) and using CUPED (Controlled-experiment Using Pre-Experiment Data).
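A minimal sketch of the CUPED adjustment, using simulated data (the metric, sample sizes, and noise levels here are invented for illustration). The idea is to subtract the part of the in-experiment metric that is explained by the same metric measured pre-experiment, which cannot have been affected by the treatment:

```python
import random

def cuped_adjust(y, x):
    """CUPED: remove the component of the in-experiment metric y explained
    by the pre-experiment covariate x (typically the same metric from an
    earlier period). theta = cov(x, y) / var(x); the adjusted metric keeps
    the same mean but has lower variance whenever x and y are correlated."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x) / n
    theta = cov_xy / var_x
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

random.seed(0)
pre = [random.gauss(100, 20) for _ in range(5000)]    # pre-experiment spend
post = [p + random.gauss(5, 10) for p in pre]         # correlated in-experiment spend
adjusted = cuped_adjust(post, pre)
print(variance(post), variance(adjusted))  # variance drops sharply
```

Lower variance means narrower confidence intervals, so the same effect reaches statistical significance with fewer users or in less time.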
6 Questions Answered
A company should start experimenting when it has at least tens of thousands of users, enough to detect large effects. Experimentation becomes truly effective, with nearly every change testable, once the user base reaches around 200,000 users.
The most common sign of a flawed A/B experiment is a Sample Ratio Mismatch (SRM), where the actual user distribution between control and treatment groups deviates significantly from the intended split, often caused by issues like bots or data pipeline problems.
Many people incorrectly interpret 1 minus the p-value as the probability that the treatment is better than the control. However, the p-value is a conditional probability that assumes the null hypothesis (no difference between groups) is true.
To shift culture, start by identifying a team that launches frequently and has a clear Overall Evaluation Criterion (OEC). Scale experimentation within this team, and then leverage their successes and cross-pollination to influence other groups within the organization.
The key is to define the OEC such that it is causally predictive of the long-term lifetime value of the user, incorporating countervailing metrics like retention rates or time to achieve a task to prevent short-term optimizations that harm the user experience.
Large product redesigns often fail because they involve too many changes at once, making it difficult to isolate what works and what doesn't. The sunk cost fallacy can also lead teams to launch a negative redesign after significant investment, rather than iterating incrementally and adjusting based on data.
19 Actionable Insights
1. Test Every Code Change
Implement every code change or feature as an experiment, as even small modifications can yield surprising, unexpected impacts.
2. Define Overall Evaluation Criterion
Establish a clear Overall Evaluation Criterion (OEC) that is causally predictive of the long-term lifetime value of the user, incorporating countervailing metrics to prevent short-term gains that harm user experience.
3. Expect High Failure Rates
Be prepared for most ideas to fail (e.g., 80-92% of experiments), which helps avoid the sunk cost fallacy and encourages iterative development.
4. Allocate to Big Bets
Dedicate resources to high-risk, high-reward ideas, understanding that while most will fail, a successful one can be a breakthrough.
5. Avoid Large Redesigns
Decompose large redesigns into smaller, incremental changes and test them one factor at a time (OFAT) to learn and adjust, rather than risking a complete failure after significant investment.
6. Document Experiment Learnings
Document successes and failures, and maintain a searchable history of experiments to build institutional memory and learn from past results.
7. Focus on Surprising Results
Prioritize reviewing experiments where the estimated and actual results differ significantly, whether surprisingly positive or negative, to gain deeper insights into user behavior and system interactions.
8. Ensure Experiment Trustworthiness
Build an experimentation platform that acts as a safety net for deployments and a reliable oracle for results, as trust in the platform is crucial for data-driven decision-making.
9. Check for Sample Ratio Mismatch
Implement automated checks for Sample Ratio Mismatch (SRM) in experiments; if detected, do not trust the results and investigate the underlying cause, as it indicates a fundamental flaw.
10. Apply Twyman’s Law
If an experiment result looks too good (or bad) to be true, investigate thoroughly for flaws or bugs before celebrating, as such figures are often incorrect.
11. Understand P-Value Limitations
Recognize that a P-value does not directly represent the probability of your treatment being better than control; instead, consider the ‘false positive risk,’ which is often much higher than commonly assumed (e.g., 26% for P<0.05 with an 8% success rate).
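The false positive risk can be sketched with the standard Bayesian formula below. Note this is an assumption-laden illustration: the exact figure depends on the assumed statistical power and on how the p-value is treated (Kohavi's 26% figure uses a likelihood calculation at p = 0.05); with this simpler formula and 80% power, the risk at an 8% success rate comes out even higher:

```python
def false_positive_risk(alpha: float, power: float, prior_success: float) -> float:
    """Probability that a statistically significant result is a false
    positive, given the fraction pi of tested ideas that truly work:
    P(H0 | significant) = alpha*(1 - pi) / (alpha*(1 - pi) + power*pi)."""
    pi = prior_success
    return alpha * (1 - pi) / (alpha * (1 - pi) + power * pi)

# With only 8% of ideas truly working (typical at mature companies),
# a p < 0.05 result is wrong far more often than the naive 5%.
fpr = false_positive_risk(alpha=0.05, power=0.8, prior_success=0.08)
print(round(fpr, 2))  # ≈ 0.42
```

The takeaway matches the insight above: the lower your organization's base rate of winning ideas, the less a bare p < 0.05 should be trusted.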
12. Don’t Ship Flat or Negative Results
Avoid shipping features that show flat or negative results, as they introduce maintenance overhead without providing value, unless legally required (in which case, test multiple options to find the least harmful).
13. Start A/B Testing with Sufficient Users
Begin running A/B tests when you have at least tens of thousands of users, with 200,000 users being a ‘magical’ threshold for detecting meaningful effects across various metrics.
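A back-of-the-envelope check on that threshold, using the common rule of thumb n ≈ 16σ²/δ² per variant (roughly 80% power at α = 0.05); the baseline rate and lift chosen here are illustrative assumptions, not figures from the episode:

```python
import math

def users_per_variant(baseline_rate: float, relative_lift: float) -> int:
    """Rough sample size per variant for a conversion-style metric,
    via the rule of thumb n = 16 * sigma^2 / delta^2."""
    sigma2 = baseline_rate * (1 - baseline_rate)   # Bernoulli variance
    delta = baseline_rate * relative_lift          # absolute detectable effect
    return math.ceil(16 * sigma2 / delta ** 2)

# Detecting a 5% relative lift on a 5% conversion rate:
n = users_per_variant(0.05, 0.05)
print(n, 2 * n)  # ≈ 121,600 per variant, ≈ 243,000 total
```

The total lands in the same ballpark as the "magical" 200,000-user threshold, which is why smaller sites can only chase large effects.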
14. Build or Buy an Experimentation Platform
Decide whether to build an internal experimentation platform or leverage third-party vendors to reduce the marginal cost of running experiments and enable self-service.
15. Shift Culture with Beachheads
To shift company culture towards experimentation, start with a receptive team that launches frequently, demonstrate success, and share surprising results to build internal advocacy and cross-pollination.
16. Ensure OEC Directional Agreement
Verify that all stakeholders agree on the directional impact of the Overall Evaluation Criterion (OEC); if half the team thinks an increase is good and the other half thinks it’s bad, the OEC is poorly defined.
17. Speed Up Results with Variance Reduction
Implement variance reduction techniques like capping skewed metrics (e.g., revenue, nights booked) or using CUPED (adjusting results with pre-experiment data) to achieve statistically significant results faster with fewer users.
18. Use Structured Narratives
Adopt structured narrative documents (e.g., ‘six-pagers’) instead of PowerPoint presentations for product development to foster clearer thinking, facilitate honest feedback, and ensure decisions are well-documented.
19. Emphasize Hierarchy of Evidence
Teach and apply the ‘hierarchy of evidence’ to evaluate information, trusting anecdotal evidence the least and multiple controlled experiments the most, to make more informed decisions in all aspects of life.
5 Key Quotes
If you go for something big, try it out, but be ready to fail 80% of the time.
Ronny Kohavi
If the result looks too good to be true... if your normal movement of an experiment is under 1% and you suddenly have a 10% movement, hold that celebratory dinner, investigate, because there's a large probability that something is wrong with the result.
Ronny Kohavi
I don't think it's possible to experiment too much.
Ronny Kohavi
This thing was worth $100 million at the time when Bing was a lot smaller.
Ronny Kohavi
If something is not stat sig, that's a no ship because you've just introduced more code.
Ronny Kohavi
1 Protocol
Addressing High False Positive Risk in A/B Tests
Ronny Kohavi
- If a statistically significant result (p-value < 0.05) is obtained, but the p-value is above 0.01, consider rerunning or replicating the experiment.
- Combine the results of the original and replicated experiments using methods like Fisher's or Stouffer's to get a lower, joint p-value.
- Aim for a p-value below 0.01 for higher confidence, especially for high-impact decisions, to significantly reduce the false positive rate.
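The combination step above can be sketched with Stouffer's method (Fisher's method, -2·Σ ln p against a chi-squared distribution with 2k degrees of freedom, is the common alternative). A minimal stdlib-only version, with illustrative numbers:

```python
import math
from statistics import NormalDist

def stouffer_combine(p_values):
    """Stouffer's method: map each p-value to a z-score, sum the z-scores
    with equal weights, rescale by sqrt(k), and map back to a p-value."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1 - p) for p in p_values) / math.sqrt(len(p_values))
    return 1 - nd.cdf(z)

# Two independent runs that each landed at p = 0.03 (significant, but
# above the 0.01 bar) combine to a joint p-value well under 0.01.
print(stouffer_combine([0.03, 0.03]))  # ≈ 0.004
```

This is why replication is so cheap in expectation: a marginal result that replicates clears the stricter 0.01 bar, while one that doesn't was likely a false positive.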