The ultimate guide to A/B testing | Ronny Kohavi (Airbnb, Microsoft, Amazon)

Jul 27, 2023
Overview

Ronny Kohavi, a world expert on A/B testing and experimentation who has led experimentation work at Airbnb, Microsoft, and Amazon, shares tactical advice on running effective experiments, building an experiment-driven culture, and understanding key concepts like p-values and Twyman's Law.

At a Glance
19 Insights
1h 23m Duration
16 Topics
8 Concepts

Deep Dive Analysis

Most Surprising A/B Test Results and Learnings

Small Effort, Huge Gains vs. Incremental Improvements

Typical Experiment Failure Rates Across Companies

Importance of Institutional Learning and Documentation

Balancing Incremental vs. High-Risk, High-Reward Ideas

When and When Not to A/B Test

Defining the Overall Evaluation Criterion (OEC)

Addressing Long-Term Metrics and Spamming Users

The Problem with Product Redesigns

Implementing Experimentation Culture at Microsoft

Experimentation at Airbnb and During Crisis

The Critical Role of Trust in Experimentation

Detecting Flaws: Sample Ratio Mismatch and Twyman's Law

Understanding and Misinterpreting P-values

Getting Started and Shifting Organizational Culture

Building Experimentation Platforms and Improving Speed

Institutional Learning

The process of systematically documenting and summarizing experiment successes and failures, especially surprising ones, to build organizational memory and prevent repeating mistakes. This involves regularly reviewing and sharing insights from experiments to foster continuous improvement.

Surprising Experiment

An experiment in which the result predicted beforehand and the actual observed result differ substantially in absolute value. This applies whether the outcome is a surprising win or a surprising loss, as both offer valuable learning opportunities.

Overall Evaluation Criterion (OEC)

A carefully defined metric that a company optimizes for, designed to be causally predictive of the long-term lifetime value of a user. It often incorporates countervailing metrics or constraints to balance short-term gains (like revenue) with critical aspects such as user experience or long-term retention.
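To make the idea concrete, here is a hypothetical sketch of an OEC that blends a primary metric with countervailing ones; the metric names, weights, and guardrail threshold are illustrative assumptions, not a formula from the episode.

```python
# Hypothetical OEC sketch: metric names, weights, and the guardrail threshold
# are illustrative assumptions, not Kohavi's actual formulation.
from dataclasses import dataclass

@dataclass
class TreatmentEffect:
    revenue_per_user: float    # relative change vs. control
    sessions_per_user: float   # engagement proxy for long-term value
    time_to_success: float     # countervailing metric: lower is better

def oec(effect: TreatmentEffect) -> float:
    """Weighted score that rewards revenue and engagement, penalizes slower tasks."""
    return (0.5 * effect.revenue_per_user
            + 0.4 * effect.sessions_per_user
            - 0.1 * effect.time_to_success)

def passes_guardrails(effect: TreatmentEffect, max_time_regression: float = 0.02) -> bool:
    """Hard constraint: never ship if task time regresses more than the threshold."""
    return effect.time_to_success <= max_time_regression
```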

Experimentation Platform as a Safety Net and Oracle

The dual role of an experimentation platform: acting as a safety net by enabling quick aborts of bad launches (safe deployments) and serving as an oracle by providing trustworthy, scientifically sound results for key metrics and guardrails at the end of an experiment.

Sample Ratio Mismatch (SRM)

A critical red flag in an A/B test where the actual distribution of users between control and treatment groups significantly deviates from the intended split (e.g., 50/50). An SRM indicates a fundamental flaw in the experiment's setup, randomization, or data collection, making the results untrustworthy.
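A standard way to automate this check is a chi-square goodness-of-fit test on the observed counts. The sketch below assumes a planned 50/50 split and uses a deliberately small p-value threshold so the alarm stays rare and high-confidence; the specific threshold is an assumption, not a number from the episode.

```python
# Minimal SRM check, assuming a planned 50/50 split.
from scipy.stats import chisquare

def has_srm(control_users: int, treatment_users: int,
            expected_ratio: float = 0.5, threshold: float = 1e-3) -> bool:
    """Return True if the observed split is too unlikely under the design."""
    total = control_users + treatment_users
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    _, p_value = chisquare([control_users, treatment_users], f_exp=expected)
    return p_value < threshold

# 50.2% vs. 49.8% on a million users is already a red flag: do not trust the metrics.
print(has_srm(502_000, 498_000))  # True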

Twyman's Law

The principle stating that if any figure or result from an experiment looks unusually interesting or dramatically different from expectations, it is usually wrong and requires thorough investigation. This law encourages skepticism towards unexpectedly large positive or negative outcomes.

P-value Misinterpretation

The common misunderstanding that 1 minus the p-value represents the probability that the treatment is better than the control. This is incorrect, as the p-value is a conditional probability that assumes the null hypothesis (no difference between groups) is true.

Variance Reduction Techniques

Methods used to decrease the variability of metrics in A/B tests, which allows for faster detection of statistically significant results with fewer users. Examples include capping skewed metrics (e.g., revenue, nights booked) and using CUPED (Controlled-experiment Using Pre-Experiment Data).
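A minimal CUPED sketch, assuming each user has a pre-experiment measurement of the same metric (e.g. revenue in the weeks before the experiment started); the simulated data exists only to show the variance shrinking.

```python
# CUPED sketch: adjust the in-experiment metric by each user's pre-experiment
# value of the same metric, removing variance unrelated to the treatment.
import numpy as np

def cuped_adjust(metric: np.ndarray, pre_metric: np.ndarray) -> np.ndarray:
    """Return y - theta * (x - mean(x)), with theta = cov(y, x) / var(x)."""
    cov = np.cov(metric, pre_metric)
    theta = cov[0, 1] / cov[1, 1]
    return metric - theta * (pre_metric - pre_metric.mean())

# Simulated, skewed revenue data just to illustrate the effect.
rng = np.random.default_rng(0)
pre = rng.gamma(2.0, 50.0, size=100_000)              # pre-period revenue per user
post = 0.8 * pre + rng.normal(0, 20, size=100_000)    # correlated in-experiment revenue
print(post.var(), cuped_adjust(post, pre).var())      # adjusted variance is far smaller
```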

When should a company start considering running A/B tests?

A company should start experimenting when it has at least tens of thousands of users to detect large effects. It becomes truly effective and allows for testing everything once the user base reaches around 200,000 users.
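As a rough illustration of why those thresholds matter, the standard rule of thumb of roughly 16 * sigma^2 / delta^2 users per variant (two-sided alpha = 0.05, 80% power) already demands six-figure user counts for small lifts. The 5% baseline conversion rate and 5% relative lift below are illustrative assumptions, not numbers from the episode.

```python
# Rule-of-thumb sample size: n per variant ~= 16 * sigma^2 / delta^2
# (two-sided alpha = 0.05, 80% power). Baseline rate and lift are assumptions.
def users_per_variant(baseline_rate: float, relative_lift: float) -> int:
    delta = baseline_rate * relative_lift            # absolute effect to detect
    sigma_sq = baseline_rate * (1 - baseline_rate)   # Bernoulli variance
    return int(16 * sigma_sq / delta ** 2)

print(users_per_variant(0.05, 0.05))  # ~121,600 users per variant, ~243,000 total
```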

What is the most common sign that an A/B experiment is flawed?

The most common sign of a flawed A/B experiment is a Sample Ratio Mismatch (SRM), where the actual user distribution between control and treatment groups deviates significantly from the intended split, often caused by issues like bots or data pipeline problems.

What is the common misunderstanding about p-values in A/B testing?

Many people incorrectly interpret 1 minus the p-value as the probability that the treatment is better than the control. However, the p-value is a conditional probability that assumes the null hypothesis (no difference between groups) is true.

How can an organization shift its culture to be more experiment-driven?

To shift culture, start by identifying a team that launches frequently and has a clear Overall Evaluation Criterion (OEC). Scale experimentation within this team, and then leverage their successes and cross-pollination to influence other groups within the organization.

What is the key to defining a good Overall Evaluation Criterion (OEC)?

The key is to define the OEC such that it is causally predictive of the long-term lifetime value of the user, incorporating countervailing metrics like retention rates or time to achieve a task to prevent short-term optimizations that harm the user experience.

Why do large product redesigns often fail?

Large product redesigns often fail because they involve too many changes at once, making it difficult to isolate what works and what doesn't. The sunk cost fallacy can also lead teams to launch a negative redesign after significant investment, rather than iterating incrementally and adjusting based on data.

1. Test Every Code Change

Implement every code change or feature as an experiment, as even small modifications can yield surprising, unexpected impacts.

2. Define Overall Evaluation Criterion

Establish a clear Overall Evaluation Criterion (OEC) that is causally predictive of the long-term lifetime value of the user, incorporating countervailing metrics to prevent short-term gains that harm user experience.

3. Expect High Failure Rates

Be prepared for most ideas to fail (e.g., 80-92% of experiments), which helps avoid the sunk cost fallacy and encourages iterative development.

4. Allocate to Big Bets

Dedicate resources to high-risk, high-reward ideas, understanding that while most will fail, a successful one can be a breakthrough.

5. Avoid Large Redesigns

Decompose large redesigns into smaller, incremental changes and test them one factor at a time (OFAT) to learn and adjust, rather than risking a complete failure after significant investment.

6. Document Experiment Learnings

Document successes and failures, and maintain a searchable history of experiments to build institutional memory and learn from past results.

7. Focus on Surprising Results

Prioritize reviewing experiments where the estimated and actual results differ significantly, whether surprisingly positive or negative, to gain deeper insights into user behavior and system interactions.

8. Ensure Experiment Trustworthiness

Build an experimentation platform that acts as a safety net for deployments and a reliable oracle for results, as trust in the platform is crucial for data-driven decision-making.

9. Check for Sample Ratio Mismatch

Implement automated checks for Sample Ratio Mismatch (SRM) in experiments; if detected, do not trust the results and investigate the underlying cause, as it indicates a fundamental flaw.

10. Apply Twyman’s Law

If an experiment result looks too good (or bad) to be true, investigate thoroughly for flaws or bugs before celebrating, as such figures are often incorrect.

11. Understand P-Value Limitations

Recognize that a p-value does not directly represent the probability of your treatment being better than control; instead, consider the 'false positive risk,' which is often much higher than commonly assumed (e.g., 26% for p < 0.05 with an 8% historical success rate), as sketched below.
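A back-of-the-envelope Bayes' rule calculation shows how a figure in that neighborhood arises; the 80% power assumption and counting only the positive tail of the two-sided 5% alpha (since only improvements get shipped) are my assumptions for illustration, not stated in the episode.

```python
# False positive risk = P(no real effect | stat-sig win), via Bayes' rule.
# Assumptions for illustration: 8% prior success rate, 80% power, and only the
# positive tail of the two-sided alpha = 0.05 counts as a shippable "win".
prior_success = 0.08
alpha_one_sided = 0.05 / 2
power = 0.80

false_wins = alpha_one_sided * (1 - prior_success)  # nulls that look like wins
true_wins = power * prior_success                   # real effects detected
fpr = false_wins / (false_wins + true_wins)
print(f"{fpr:.0%}")  # ~26%, far above the naive 5%
```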

12. Don’t Ship Flat or Negative Results

Avoid shipping features that show flat or negative results, as they introduce maintenance overhead without providing value, unless legally required (in which case, test multiple options to find the least harmful).

13. Start A/B Testing with Sufficient Users

Begin running A/B tests when you have at least tens of thousands of users, with 200,000 users being a ‘magical’ threshold for detecting meaningful effects across various metrics.

14. Build or Buy an Experimentation Platform

Decide whether to build an internal experimentation platform or leverage third-party vendors to reduce the marginal cost of running experiments and enable self-service.

15. Shift Culture with Beachheads

To shift company culture towards experimentation, start with a receptive team that launches frequently, demonstrate success, and share surprising results to build internal advocacy and cross-pollination.

16. Ensure OEC Directional Agreement

Verify that all stakeholders agree on the directional impact of the Overall Evaluation Criterion (OEC); if half the team thinks an increase is good and the other half thinks it’s bad, the OEC is poorly defined.

17. Speed Up Results with Variance Reduction

Implement variance reduction techniques such as capping skewed metrics (e.g., revenue, nights booked) or using CUPED (adjusting results with pre-experiment data) to reach statistically significant results faster with fewer users.

18. Use Structured Narratives

Adopt structured narrative documents (e.g., ‘six-pagers’) instead of PowerPoint presentations for product development to foster clearer thinking, facilitate honest feedback, and ensure decisions are well-documented.

19. Emphasize Hierarchy of Evidence

Teach and apply the ‘hierarchy of evidence’ to evaluate information, trusting anecdotal evidence the least and multiple controlled experiments the most, to make more informed decisions in all aspects of life.

If you go for something big, try it out, but be ready to fail 80% of the time.

Ronny Kohavi

If the result looks too good to be true, if your normal movement in an experiment is under 1% and you suddenly have a 10% movement, hold that celebratory dinner, investigate, because there's a large probability that something is wrong with the result.

Ronny Kohavi

I don't think it's possible to experiment too much.

Ronny Kohavi

This thing was worth $100 million at the time when Bing was a lot smaller.

Ronny Kohavi

If something is not stat sig, that's a no ship because you've just introduced more code.

Ronny Kohavi

Addressing High False Positive Risk in A/B Tests

Ronny Kohavi
  1. If a statistically significant result (p-value < 0.05) is obtained, but the p-value is above 0.01, consider rerunning or replicating the experiment.
  2. Combine the results of the original and replicated experiments using methods like Fisher's or Stouffer's to get a lower, joint p-value (see the sketch below).
  3. Aim for a p-value below 0.01 for higher confidence, especially for high-impact decisions, to significantly reduce the false positive rate.
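A sketch of step 2 using SciPy's built-in p-value combination; the example p-values are made up.

```python
# Combining an original and a replication p-value with Fisher's and Stouffer's methods.
from scipy.stats import combine_pvalues

p_original, p_replication = 0.03, 0.04
for method in ("fisher", "stouffer"):
    _, p_joint = combine_pvalues([p_original, p_replication], method=method)
    print(method, round(p_joint, 4))
# Both joint p-values fall below 0.01, supporting a ship decision.
```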
12%
Bing ad experiment revenue increase: a simple change of moving the second line of an ad up to the first line, worth $100 million at the time.
2%
Bing relevance team's annual metric improvement goal: small, incremental improvements adding up over time.
6%
Airbnb search relevance overall revenue improvement: achieved through 250 experiments, each with small gains.
92%
Airbnb search relevance experiment failure rate: the percentage of ideas that failed to improve key metrics, the highest rate observed.
66%
Microsoft overall experiment failure rate: about two-thirds of ideas fail.
85%
Bing experiment failure rate: a highly optimized domain, which makes improvements harder to find.
10%
Experiments aborted on the first day: usually due to implementation issues or unexpected problems, not bad ideas.
20,000 to 25,000
Microsoft experiments run annually: equivalent to about 100 new treatments per working day.
80%
Likelihood of high-risk, high-reward ideas failing: organizations should be prepared for this failure rate when pursuing big bets.
100 person-years
Bing social integration experiment effort: a large, failed experiment that was eventually aborted.
Tens of thousands
Minimum users needed for A/B testing: required for the statistics to work for most metrics and to detect large effects.
200,000
Users needed for 'magical' A/B testing: the threshold at which a retail site can detect beneficial changes of 5-10% and test everything effectively.
Over 20,000
Copies of Ronny's book sold in English: proceeds donated to charity.
8%
Microsoft experiments suffering from Sample Ratio Mismatch (SRM): a significant share of experiments were flawed due to incorrect user distribution.
9 out of 10 times
Frequency of finding a flaw when Twyman's Law is invoked: when a result looks too good to be true, it is highly likely to be wrong.
26%
False positive risk for p-value < 0.05 at Airbnb search: given a historical success rate of only 8%, the actual false positive risk is much higher than the assumed 5%.