Statistics Intuitions and Social Science Reproducibility (with Stuart Buck)

Jul 7, 2021
Overview

Spencer Greenberg and Stuart Buck discuss improving math education by focusing on statistics and logic, the nuances of p-values and publishing null results, and the critical need for open science, reproducibility, and better research practices in various fields.

At a Glance
12 Insights
1h 13m Duration
12 Topics
7 Concepts

Deep Dive Analysis

Rethinking High School Math Education

Critique of Traditional Math Curriculum Justifications

Ideal High School Math Curriculum: Focus on Statistics

Statistical Literacy and Understanding Science

Understanding P-values and Their Misinterpretations

The Debate Over P-value Cutoffs and Null Results

Introduction to Open Science and Research Reproducibility

Reproducibility Challenges: Examples from Economics and Biology

The Importance of Detailed Methods and Material Sharing

Generalizability of Research Findings and Continuous Experimentation

Challenges in Generalizing International Development Interventions

Reproducibility Crisis in Pharmaceutical Research

P-value

A p-value indicates the probability of observing data as extreme or more extreme than what was found, *if* there is no effect (the null hypothesis is true). It does not directly tell you the probability that a result is true or that an effect exists.
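This definition can be made concrete by simulation: under the null hypothesis (no effect), how often does sampling noise alone produce a group difference at least as extreme as the one observed? The sketch below is illustrative; the group sizes, the standard-normal outcomes, and the difference-in-means statistic are assumptions for the example, not details from the episode.

```python
import random

random.seed(0)

def simulated_p_value(observed_diff, n_per_group=50, n_sims=10_000):
    """Estimate a two-sided p-value: the fraction of null-hypothesis
    simulations (both groups drawn from the same distribution, i.e. no
    true effect) whose group difference is at least as extreme as the
    observed difference."""
    extreme = 0
    for _ in range(n_sims):
        a = [random.gauss(0, 1) for _ in range(n_per_group)]
        b = [random.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(a) / n_per_group - sum(b) / n_per_group
        if abs(diff) >= abs(observed_diff):
            extreme += 1
    return extreme / n_sims

# A large observed difference is rarely matched by sampling noise alone,
# so its p-value is small; a tiny difference is matched often.
```

Note that a small value says only that sampling noise rarely produces such a difference on its own; it says nothing about the probability that the effect is real.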

Correlation vs. Causation

This concept distinguishes between two events or variables that tend to occur together (correlation) and one event directly causing the other (causation). Confusing the two is a common error in public discourse and in everyday reasoning about the world.
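A quick way to see the distinction is a simulated confounder: two variables that never influence each other can still be strongly correlated because a third variable drives both. The scenario below (summer heat driving both ice cream sales and drownings) is a standard hypothetical, not an example from the episode.

```python
import random

random.seed(1)

# A confounder (summer heat) drives both variables; neither causes the other.
heat = [random.gauss(0, 1) for _ in range(1000)]
ice_cream_sales = [h + random.gauss(0, 0.5) for h in heat]
drownings = [h + random.gauss(0, 0.5) for h in heat]

def correlation(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)

# correlation(ice_cream_sales, drownings) is strongly positive, even though
# banning ice cream would not prevent a single drowning.
```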

Base Rate Neglect

This is a cognitive bias where people tend to ignore the overall prevalence or 'base rate' of an event when estimating the probability of a specific outcome. It often leads to incorrect conclusions, such as misinterpreting medical test results without considering the disease's rarity.
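The medical-test example can be worked through with Bayes' rule. Using the episode's hypothetical numbers (a 1% base rate and a 90% accurate test, with the simplifying assumption that accuracy means both sensitivity and specificity), a positive result is far less alarming than intuition suggests:

```python
def posterior_probability(base_rate, sensitivity, specificity):
    """P(disease | positive test) via Bayes' rule."""
    true_positives = sensitivity * base_rate
    false_positives = (1 - specificity) * (1 - base_rate)
    return true_positives / (true_positives + false_positives)

# 1% prevalence, 90% accurate test (hypothetical numbers from the episode):
p = posterior_probability(base_rate=0.01, sensitivity=0.90, specificity=0.90)
# p is roughly 0.083: only about 1 in 12 positive results is a true case,
# because false positives from the healthy 99% swamp the rare true cases.
```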

Transfer of Learning

This educational concept suggests that skills learned in one domain can be applied or 'transferred' to improve performance in another, seemingly unrelated domain. However, rigorous studies often show that such transfer is limited and direct training is usually more effective.

Statistically Significant

A term used to describe a research result where the p-value is below a predetermined threshold, typically 0.05. This phrase can be misleading as it implies importance or truth, despite evidence existing on a continuum and small differences around the cutoff being statistically insignificant themselves.
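The continuum point is easy to demonstrate numerically: two studies with nearly identical evidence can land on opposite sides of the 0.05 line. A sketch using the normal approximation, with illustrative z-scores chosen to straddle the cutoff:

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z-score, using the standard normal CDF."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Two nearly identical results straddle the conventional cutoff:
p_just_under = two_sided_p(1.97)  # a bit below 0.05: "significant"
p_just_over = two_sided_p(1.95)   # a bit above 0.05: "not significant"
# The gap between the two p-values is tiny, yet one result may be published
# as a finding while the other is dismissed as null.
```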

Reproducibility (in science)

Reproducibility refers to the ability of independent researchers to obtain the same results when conducting an experiment again, ideally using the same data, code, and methods. Failures in reproducibility highlight issues with research reliability, methodology, or theoretical understanding.

Publication Bias

This is the tendency for scientific journals to preferentially publish studies that report positive or 'statistically significant' findings over those with null or negative results. This bias can distort the scientific literature, making certain effects appear more robust or prevalent than they truly are.

What are the shortcomings of current high school math education?

Current high school math curricula, particularly geometry and advanced topics, are often not useful for most students in understanding the world or daily life, leading to frustration and a lack of practical application.

What kind of math education would be more beneficial for high school students?

A more beneficial curriculum would focus on conceptual understanding of statistics and data analysis, including probability, correlation vs. causation, Bayesian interpretation, and understanding basic distributions like mean and median.

What does a p-value actually mean?

A p-value indicates the probability of observing data as extreme or more extreme than what was found, assuming there is no true effect (the null hypothesis is true).

Why is the common interpretation of a p-value often incorrect?

People often incorrectly interpret a p-value as the probability that a result is true or that an effect exists. In fact, it is the probability of the data given no effect, which is the inverse of what people typically want to know: the probability of an effect given the data.

What is the problem with using a strict p-value cutoff (e.g., 0.05) for scientific findings?

A strict cutoff creates an artificial dichotomy where results just below the line are considered 'significant' and those just above are not, despite the evidence being a continuum and small differences in p-values often being statistically insignificant themselves.

When are null results valuable to publish in scientific literature?

Null results are valuable when they address a theoretically interesting question, especially if there's a strong prior belief in a positive effect, or if they challenge an intervention widely believed to work.

What is research reproducibility, and why is it important?

Reproducibility means that if an experiment is repeated, the same results should be obtained. It's crucial for building reliable scientific knowledge and ensuring that evidence-based decisions are made on trustworthy findings.

What factors contribute to the reproducibility crisis in science?

Factors include poor coding practices, lack of data/code sharing, subtle unstated methodological differences (e.g., wood shavings in mouse cages, stirring speed in cell cultures), publication bias towards positive results, and over-generalization of findings from specific populations.

How can the generalizability problem in scientific findings be addressed?

One approach is to integrate lightweight randomized controlled trials (RCTs) directly into the deployment of interventions, continuously testing and improving them on the specific target population rather than relying on generalization from studies done elsewhere.

1. Integrate Continuous RCTs

Weave lightweight randomized controlled trials (RCTs) into the deployment of interventions (e.g., cash transfers, digital apps) to continuously collect high-quality data on the target population and iterate for improvement, rather than relying on generalization from other studies.
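One small building block of such a design is deterministic randomization at enrollment, so that each new participant in a live deployment is assigned an arm on the spot and always sees the same arm. This is a minimal sketch of the idea, not a method described in the episode; the salt string and the 50/50 split are assumptions.

```python
import random

def assign_arm(user_id: str, salt: str = "cash-transfer-pilot") -> str:
    """Deterministically assign a participant to 'treatment' or 'control'.

    Seeding on (salt, user_id) means the same person always gets the same
    arm, so assignment can happen inside a live deployment without a
    stored randomization table."""
    rng = random.Random(f"{salt}:{user_id}")
    return "treatment" if rng.random() < 0.5 else "control"

# Each enrollee is assigned once, at the moment they enter the program;
# outcomes can then be compared across arms as the rollout continues.
```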

2. Question Your Successes

Adopt the practice of questioning your successes as much as your failures, like good poker players, to determine if your decision-making process was sound or if you merely got lucky.

3. Prioritize Core Math & Logic

Restructure high school math curriculum to prioritize basic statistics, data analysis, and direct logic instruction, as these concepts are more useful for understanding the world, news, and scientific claims than esoteric geometry or indirect logic training.

4. Master Core Statistical Concepts

Learn and understand fundamental statistical concepts like the mean, median, probability distributions (e.g., standard deviation), and the difference between correlation and causation, as these are critical for making sense of the world and scientific information.
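The mean/median distinction in particular is worth internalizing with a skewed example, since a single outlier can drag the mean far from the typical value. The numbers below are made up for illustration:

```python
import statistics

# Hypothetical incomes in thousands; one outlier skews the distribution.
incomes = [30, 35, 40, 45, 50, 55, 500]

mean = statistics.mean(incomes)      # pulled up by the outlier
median = statistics.median(incomes)  # the typical value: 45
spread = statistics.pstdev(incomes)  # population standard deviation

# The mean (about 108) is more than double the median (45): "average
# income" and "typical income" are very different claims here.
```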

5. P-values for Sampling Error

Use p-values as a tool to reasonably rule out sampling error, understanding that a very low p-value suggests the result is unlikely due to random noise from particular participants or data points, but does not imply the result’s truth or importance.

6. Publish Informative Null Results

Prioritize publishing null results for theoretically interesting questions, especially those that contradict prior positive findings or widely held beliefs, to advance the field and avoid pursuing dead ends.

7. Fund High-Value Science

Guide science funding decisions towards research that yields fundamental theoretical results, applied results with immediate utility (e.g., curing diseases), or tools and methods that accelerate future scientific progress.

8. Rigorously Double-Check Work

When producing significant work, especially that which will be widely published or cited, rigorously double-check all calculations and data analysis to minimize errors, acknowledging the increased responsibility that comes with greater impact.

9. Adopt Unit Tests in Research

Scientists should implement unit tests (writing code specifically to test their analytical code) to catch bugs and improve the reliability of their computational work, a best practice from software engineering.
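In practice this means feeding an analysis function an input whose correct answer is known by hand, so that a silent bug fails loudly instead of corrupting every downstream result. A sketch of the idea; the `standardize` function is a hypothetical example, not code discussed in the episode:

```python
def standardize(values):
    """Center values at mean 0 and scale to unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

def test_standardize():
    # A hand-checkable case: symmetric input should give a symmetric,
    # zero-mean result. A sign error or off-by-one in the analysis code
    # would fail this test immediately.
    result = standardize([2.0, 4.0, 6.0])
    assert abs(sum(result)) < 1e-9
    assert abs(result[0] + result[-1]) < 1e-9
    assert result[1] == 0.0

test_standardize()
```

A test runner such as pytest would discover `test_standardize` automatically; here it is simply called directly.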

10. Detail Experimental Methods

Fully disclose experimental methods in scientific papers with as much detail as possible, recognizing that subtle, seemingly minute differences in procedures can dramatically affect results and are crucial for reproducibility.

11. Avoid Grandiose Generalizations

Scientists should approach findings with humility, avoiding grandiose pronouncements about general human behavior based on limited populations or contexts, and acknowledge the potential for results to be highly specific.

12. Assess Implementation Quality

When evaluating interventions, recognize that a poor or low-quality implementation of an otherwise effective program can lead to a failure to find an effect, making it crucial to account for implementation quality.

If you want people to learn math, teach them math. If you want people to learn music, teach them music and justify it on its own terms, not because of some benefit to something else that you could have been teaching directly.

Stuart Buck

The difference between statistically significant and statistically insignificant is statistically insignificant.

Stuart Buck

If psychology is that brittle, we're screwed.

Spencer Greenberg

But the best poker players will also go take a successful hand where they won. And instead of just feeling good about it, they say, let's go back and look. Because maybe, maybe I made a decision that actually was not a good decision at the time. And I just got lucky.

Stuart Buck

What we want is not just more publications. What we want is more publications, particularly in that area, you know, cell biology, et cetera, or publications that lead to greater understanding of how the human body works and ultimately greater ability to prevent diseases or cure diseases or address aging and issues like that.

Stuart Buck
1%
Percentage of people with a specific type of cancer (hypothetical scenario). Used to illustrate base rate neglect in medical test interpretation.
90%
Accuracy of a hypothetical cancer test. Used to illustrate base rate neglect in medical test interpretation.
0.05
Standard p-value cutoff for statistical significance. Commonly used in science, but considered arbitrary.
8 years
Stuart Buck's tenure at Arnold Ventures, as director and VP of research.
2012
Year the Arnold Foundation noticed reproducibility problems in psychology, around the summer of 2012.
2015
Year the Reproducibility Project: Psychology was published in Science.
100
Number of original studies replicated in the Reproducibility Project: Psychology. Studies from top psychology journals published in 2008.
36% to 39%
Percentage of studies successfully replicated in the Reproducibility Project: Psychology, i.e., the same effect was found in the second experiment.
30%
Approximate percentage of inconclusive studies in the Reproducibility Project: Psychology, whose results were hard to interpret.
30%
Approximate percentage of studies that did not replicate in the Reproducibility Project: Psychology, where the original result did not stand up in the second experiment.
21
Number of countries analyzed in the Reinhart-Rogoff Excel study. Some were left out due to an Excel formula error.
18 to 24 hours
Stirring duration in one breast cancer cell lab, at a very slow speed using a half-concentration collagenase digest.
6 to 8 hours
Stirring duration in another breast cancer cell lab, at a higher rate using a collagenase mixture.
70% or 80%
Approximate percentage of academic experiments pharmaceutical companies could not reproduce, as reported by Amgen and Bayer; Pfizer also anecdotally reported two-thirds.