Spencer Greenberg and Stuart Buck discuss reforming high school math to focus on practical statistics and logic. They also explore issues in scientific research, including the interpretation of p-values, publication of null results, and challenges in experimental reproducibility and generalizability across various fields.
Deep Dive Analysis
18 Topic Outline
Rethinking High School Math Education
Critique of Traditional Math Justifications
Ideal Math Curriculum: Statistics and Data Analysis
Understanding P-values and Their Limitations
The P-value Debate and Publication Bias
Value of Publishing Null Results in Science
Introduction to Open Science and Reproducibility
Challenges in Reproducing Research: Data and Code
The Reinhart-Rogoff Excel Error Case Study
Questioning Successes and Improving Scientific Practices
Foundation's Role in Research Reproducibility
Funding Challenges for Meta-Science Initiatives
Replication Crisis in Social Sciences
Subtle Factors Affecting Reproducibility in Biology
Importance of Material Sharing and Detailed Methods
Generalizability of Scientific Findings and Interventions
Integrating Lightweight RCTs into Intervention Deployment
Reasons Studies Fail to Generalize
8 Key Concepts
Correlation vs. Causation
This fundamental distinction highlights that just because two events or variables occur together (correlation) does not mean one directly causes the other (causation). Confusing these can lead to significant misunderstandings in public discourse and policy.
Bayesian Interpretation of Probability
This approach to probability involves incorporating prior knowledge or 'base rates' into calculations. It helps in understanding scenarios like medical test results, where the overall prevalence of a condition significantly impacts the probability of actually having it after a positive test.
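The medical-test scenario can be sketched in a few lines of Python. The numbers here (1% prevalence, 90% sensitivity, 5% false positive rate) are illustrative assumptions, not figures from the episode:

```python
def posterior_positive(prevalence, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' rule."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Illustrative numbers: 1% prevalence, 90% sensitivity, 5% false positive rate.
p = posterior_positive(0.01, 0.90, 0.05)
print(round(p, 3))  # about 0.154 -- far below the 90% many people intuit
```

Because the condition is rare, most positive tests come from the large healthy population, so the post-test probability stays low despite a seemingly accurate test.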
P-value
A p-value quantifies the probability of observing data as extreme, or more extreme, than what was found in an experiment, assuming that there is no actual effect (i.e., the null hypothesis is true). It does not directly indicate the probability that a result is true or that an effect exists.
Transfer of Learning
This concept refers to the idea that skills or knowledge acquired in one domain can be applied to or improve performance in another, seemingly unrelated domain. The episode suggests that rigorous studies often show limited transfer, implying direct teaching of desired skills is more effective.
Statistically Significant
A term used when a p-value falls below an arbitrary threshold, typically 0.05, suggesting a result is unlikely to have occurred by chance. The episode argues this threshold creates a false dichotomy, as evidence exists on a continuum and small differences around the cutoff are not truly significant.
Reproducibility (of Research)
The ability for independent researchers to obtain the same results when conducting an experiment again, ideally using the same methods, data, and analytical code. A lack of reproducibility signals potential issues with the original finding or an incomplete understanding of experimental conditions.
Publication Bias
The tendency for scientific journals to preferentially publish studies that report positive or statistically significant findings, while studies with null or inconclusive results are less likely to be published. This can distort the scientific literature, overstating the prevalence of real effects.
Valuable Null Results
These are research findings where no effect is found, but they are still important for advancing knowledge. They are valuable when they address theoretically interesting questions, contradict widely held beliefs, or demonstrate that a method or intervention does not work as claimed, especially if the study was well-conducted.
11 Questions Answered
High school math should prioritize practical statistics and data analysis, including concepts like probability, correlation vs. causation, mean, median, and probability distributions, over esoteric geometry or trigonometry, to better equip students for understanding the world and scientific reasoning.
Geometry is often defended on the grounds that working through its proofs 'trains the mind.' Critics argue that teaching actual logic directly would be more effective for developing logical thinking, and that much of geometry is not practically useful for most students.
A p-value indicates the probability of observing your experimental results, or results more extreme, if there were truly no effect or difference (e.g., if a coin were perfectly fair). It does not directly tell you the probability that your hypothesis is true.
People often mistakenly interpret a p-value as the probability that their hypothesis is true given the data, when it actually gives the probability of the data assuming there is no effect, the inverse of that conditional. The p-value also counts 'more extreme' outcomes that were never observed, making its interpretation less intuitive and prone to misapplication.
The 0.05 p-value cutoff is an arbitrary convention, and evidence exists on a continuum. Findings with p-values just above or below this threshold are not meaningfully different in terms of the strength of evidence, making rigid cutoffs problematic for scientific interpretation.
Not all null results should be published, as many may stem from uninteresting questions or flawed research. However, null results are highly valuable when they address theoretically significant questions, contradict widely accepted beliefs, or demonstrate that a method or intervention is ineffective.
Research reproducibility refers to the ability of other researchers to independently replicate an experiment and achieve the same results as originally reported. This involves using the same methods, data, and analytical code, and its absence suggests potential issues with the original findings or an incomplete understanding of the experimental context.
Studies fail to replicate for various reasons, including original research errors, publication bias favoring positive results, unknown or unmeasured factors influencing outcomes (e.g., specific lab conditions), and challenges in perfectly reproducing complex experimental designs or protocols.
Funding for meta-science is often challenging because it's perceived as less exciting than direct biomedical research. Leaders may be hesitant to divert funds to initiatives that question the reliability of existing research, which can be seen as controversial rather than groundbreaking.
One solution is to integrate lightweight randomized controlled trials (RCTs) directly into the deployment of interventions. This means continuously running studies on the specific populations and in the exact formats where interventions are used, rather than relying on generalizations from studies conducted elsewhere.
Studies can fail to generalize because the original result was noise, there are differences in intervention quality or dosage during rollout, cultural factors are at play, or unknown contextual variables were not accounted for in the original study.
26 Actionable Insights
1. Restructure Math Education for Real-World Use
Advocate for high school math curricula to prioritize basic statistics, data analysis, probability, and direct logic instruction over traditional geometry or advanced calculus. This helps students understand the world, news, medical information, and make better decisions.
2. Teach Logic Directly
Instead of hoping students learn logic through indirect methods like geometric proofs, teach actual logic directly. This is a more effective way to train logical thinking skills.
3. Understand Correlation vs. Causation
Learn to differentiate between correlation and causation. This is crucial for interpreting information, understanding public discourse, and avoiding false conclusions about why events occur.
4. Grasp Bayesian Probability & Base Rates
Develop an intuitive understanding of Bayesian probability, particularly how to incorporate base rates into decision-making. This helps avoid common errors in interpreting probabilities, such as in medical test results.
5. Understand Core Statistical Concepts
Learn the concepts of mean, median, and basic probability distributions (e.g., standard deviation, width). These are fundamental for making sense of data, the world, and scientific findings.
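A quick sketch of why these concepts matter, using Python's standard library and a made-up skewed dataset (the income figures are purely illustrative):

```python
import statistics

# Skewed data: one large outlier pulls the mean but barely moves the median.
incomes = [30, 32, 35, 38, 40, 400]  # in thousands; illustrative values

print(statistics.mean(incomes))    # about 95.8 -- distorted by the outlier
print(statistics.median(incomes))  # 36.5 -- a robust "typical" value
print(statistics.stdev(incomes))   # standard deviation: the spread (width)
```

The gap between mean and median is exactly the kind of distributional intuition the episode argues students should have before graduating.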
6. Participate in Scientific Reasoning
Recognize that scientific reasoning is not exclusive to ‘scientists’; everyone can participate in better reasoning by understanding principles of mathematical probability and logic. This democratizes knowledge acquisition.
7. Interpret P-values Correctly
Understand that a p-value indicates the probability of observing data as extreme or more extreme if there were no effect, not the probability of an effect given the data. This prevents misinterpretation of scientific results.
8. Use P-values to Rule Out Sampling Error
Employ p-values as a tool to determine if a result can reasonably be attributed to random sampling error or noise. A very low p-value suggests the result is unlikely due to chance alone.
9. Be Wary of Small Sample Size + Large Effect
Approach findings with small p-values from very small sample sizes (e.g., N=20) and large effects with suspicion. Such results may be flukish and less likely to replicate.
10. Avoid Dichotomous Thinking with P-values
Do not treat the 0.05 p-value cutoff as a strict dichotomy for ‘statistical significance.’ Evidence exists on a continuum, and results just above or below this arbitrary line are essentially equivalent in terms of actual evidence.
11. Publish Theoretically Interesting Null Results
Prioritize publishing null results for questions that are theoretically interesting and would advance the field, regardless of whether the answer is positive or negative. This informs researchers about dead ends or what doesn’t work.
12. Publish Applied Null Results for Common Beliefs
Publish null results for applied interventions that are widely used or believed to be effective but are found not to work. This provides valuable information to practitioners and the public.
13. Publish Null Results for Failed Methods/Tools
Disclose when a scientific method or tool is discovered not to perform as claimed. This helps improve future scientific practices and tool development.
14. Implement Rigorous Coding Practices in Academia
Academics should adopt industry best practices for programming, including testing code for bugs and conducting independent code reviews. This helps catch inevitable errors in data analysis.
15. Double-Check High-Impact Work
For research with significant implications (e.g., policy, public health), rigorously double-check all calculations and analyses. Increased impact demands increased responsibility and verification.
16. Question Successes as Much as Failures
Perform ‘post-mortems’ on successful experiments or projects, not just failures, to evaluate if good decisions were made or if luck played a significant role. This improves decision-making processes.
17. Adopt Unit Tests in Scientific Programming
Integrate unit tests into scientific code development, where code is written specifically to test other code. This is an indispensable practice for catching bugs.
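A minimal sketch of what this looks like in practice: a small analysis helper (the `normalize` function is a hypothetical example, not from the episode) paired with tests written specifically to check it:

```python
def normalize(values):
    """Scale a list of nonnegative counts to proportions summing to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [v / total for v in values]

def test_preserves_ratios():
    assert normalize([1, 1, 2]) == [0.25, 0.25, 0.5]

def test_sums_to_one():
    assert abs(sum(normalize([2, 3, 5])) - 1.0) < 1e-12

def test_rejects_zero_total():
    try:
        normalize([0, 0])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for all-zero input")

# Run directly here; in a real project these would be auto-discovered
# by a test runner such as pytest or unittest.
test_preserves_ratios()
test_sums_to_one()
test_rejects_zero_total()
```

Even trivial tests like these catch the sign flips, off-by-one errors, and silent edge cases that otherwise end up baked into published results.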
18. Fully Disclose Experimental Methods
Provide comprehensive and detailed descriptions of experimental methods in scientific publications. This is crucial for other researchers to accurately reproduce studies.
19. Be Less Grandiose in Scientific Claims
Avoid overgeneralizing findings from specific populations (e.g., university undergraduates) to broader humanity. Acknowledge the limits of generalizability and the specific context of the study.
20. Design Interventions for Specific Populations/Formats
When developing interventions, study them on populations very similar to the intended real users and in the exact format they will be deployed. This helps ensure effectiveness without needing broad generalizability.
21. Integrate Lightweight RCTs into Intervention Deployment
Weave lightweight randomized controlled trials (RCTs) directly into the ongoing deployment of interventions. This allows for continuous, high-quality data collection on the target population and iterative improvement.
22. Conduct Baseline RCTs in New Contexts
When introducing an intervention or policy in a new geographical or cultural context, always start with a baseline RCT to verify its effectiveness in that specific environment. Do not assume generalizability.
23. Continuously Test and Improve Interventions
Treat interventions as dynamic entities that can be continually improved through ongoing A/B testing and data collection, rather than static, one-time deployments.
24. Recognize Impact of Implementation Quality
Understand that even a theoretically sound intervention will fail if poorly implemented. Quality of execution is a critical factor for success.
25. Account for Dosage Differences
Be aware that variations in intervention dosage (e.g., duration, frequency) can significantly alter outcomes and affect generalizability.
26. Consider Cultural Factors in Interventions
When deploying interventions in new areas, carefully consider local cultural factors and potential moral opposition, as these can profoundly impact effectiveness.
5 Key Quotes
If you want people to learn math, teach them math. If you want people to learn music, teach them music and justify it on its own terms, not because of some benefit to something else that you could have been teaching directly.
Stuart Buck
The difference between statistically significant and statistically insignificant is statistically insignificant.
Andrew Gelman (quoted by Stuart Buck)
If psychology is that brittle, we're screwed.
Spencer Greenberg
What we want is not just more publications. What we want is more publications, particularly in that area, you know, cell biology, et cetera, or publications that lead to greater understanding of how the human body works and ultimately greater ability to prevent diseases or cure diseases or address aging and issues like that.
Stuart Buck
A bad implementation of any intervention will always fail.
Spencer Greenberg
3 Protocols
Coin Flip P-Value Explanation
Stuart Buck
- Assume a coin is fair (50/50 probability) as the null hypothesis.
- Flip the coin 100 times and observe the result (e.g., 60 heads, 40 tails).
- Ask: What is the probability of getting this observed result (60/40) or something more extreme (e.g., 61/39, 62/38, etc.) if the coin were truly fair?
- This calculated probability is the p-value, indicating how likely the observed data is under the assumption of no effect.
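The steps above can be computed exactly with the binomial distribution. This sketch finds the two-sided p-value for 60 heads in 100 flips of a fair coin:

```python
from math import comb

def two_sided_p(heads, flips=100, p_fair=0.5):
    """P(a result at least as far from flips/2 as `heads`) under a fair coin."""
    deviation = abs(heads - flips / 2)
    prob = 0.0
    for k in range(flips + 1):
        if abs(k - flips / 2) >= deviation:  # "as extreme or more extreme"
            prob += comb(flips, k) * p_fair**k * (1 - p_fair)**(flips - k)
    return prob

print(round(two_sided_p(60), 4))  # about 0.0569: unusual, but not wildly so
```

Note that 60/40 lands just above the conventional 0.05 cutoff, a concrete illustration of the episode's point that evidence near the threshold sits on a continuum.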
Improving Scientific Research Processes
Stuart Buck & Spencer Greenberg
- Question successes as much as failures: Conduct 'postmortems' on successful outcomes to determine if decisions were genuinely good or if luck played a role.
- Implement unit tests: Write code specifically designed to test other code, helping to catch bugs and ensure accuracy in analyses.
Integrating Lightweight Randomized Control Trials (RCTs) into Interventions
Spencer Greenberg
- Design an intervention to target a specific population.
- Weave a lightweight RCT into the normal deployment process of the intervention (e.g., randomize who receives the intervention or different versions of it).
- Collect follow-up data continuously as the intervention is delivered.
- If moving the intervention to a new area, start with a baseline RCT there to confirm effectiveness before further optimization.
- Continuously A/B test different versions of the intervention to improve it over time, similar to software product development.
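The randomization step of this protocol can be sketched as follows. The arm names and the `run_trial` helper are hypothetical placeholders; a real deployment would deliver the assigned version and later join in follow-up outcome data:

```python
import random

def assign_arm(user_id, arms=("control", "intervention")):
    """Stable randomization: seed on the user ID so re-runs agree."""
    rng = random.Random(user_id)
    return rng.choice(arms)

def run_trial(user_ids):
    """Assign every user to an arm as part of normal deployment."""
    assignments = {uid: assign_arm(uid) for uid in user_ids}
    # Here the deployed system would deliver the assigned version,
    # then log outcomes for each user as follow-up data arrives.
    return assignments

assignments = run_trial(range(1000))
n = list(assignments.values()).count("intervention")
print(n)  # roughly half of the 1000 users
```

Seeding on the user ID keeps assignment deterministic across restarts, which matters when the trial is woven into a long-running product rather than run as a one-off study.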