Spencer Greenberg and Stuart Buck discuss reforming high school math to focus on practical statistics and logic. They also explore issues in scientific research, including the interpretation of p-values, publication of null results, and challenges in experimental reproducibility and generalizability across various fields.
Deep Dive Analysis
18 Topic Outline
Rethinking High School Math Education
Critique of Traditional Math Justifications
Ideal Math Curriculum: Statistics and Data Analysis
Understanding P-values and Their Limitations
The P-value Debate and Publication Bias
Value of Publishing Null Results in Science
Introduction to Open Science and Reproducibility
Challenges in Reproducing Research: Data and Code
The Reinhart-Rogoff Excel Error Case Study
Questioning Successes and Improving Scientific Practices
Foundation's Role in Research Reproducibility
Funding Challenges for Meta-Science Initiatives
Replication Crisis in Social Sciences
Subtle Factors Affecting Reproducibility in Biology
Importance of Material Sharing and Detailed Methods
Generalizability of Scientific Findings and Interventions
Integrating Lightweight RCTs into Intervention Deployment
Reasons Studies Fail to Generalize
8 Key Concepts
Correlation vs. Causation
This fundamental distinction highlights that just because two events or variables occur together (correlation) does not mean one directly causes the other (causation). Confusing these can lead to significant misunderstandings in public discourse and policy.
Bayesian Interpretation of Probability
This approach to probability involves incorporating prior knowledge or 'base rates' into calculations. It helps in understanding scenarios like medical test results, where the overall prevalence of a condition significantly impacts the probability of actually having it after a positive test.
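The medical-test scenario can be sketched in a few lines of Python. The numbers here (1% prevalence, 90% sensitivity, 5% false positive rate) are illustrative assumptions, not figures from the episode:

```python
def posterior_positive(prevalence, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' rule."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Illustrative numbers: 1% prevalence, 90% sensitivity, 5% false positive rate.
p = posterior_positive(0.01, 0.90, 0.05)
print(round(p, 3))  # about 0.154 -- far below the 90% many people intuit
```

Because the condition is rare, most positive tests come from the large healthy population, so the post-test probability stays low despite a seemingly accurate test.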
P-value
A p-value quantifies the probability of observing data as extreme, or more extreme, than what was found in an experiment, assuming that there is no actual effect (i.e., the null hypothesis is true). It does not directly indicate the probability that a result is true or that an effect exists.
Transfer of Learning
This concept refers to the idea that skills or knowledge acquired in one domain can be applied to or improve performance in another, seemingly unrelated domain. The episode suggests that rigorous studies often show limited transfer, implying direct teaching of desired skills is more effective.
Statistically Significant
A term used when a p-value falls below an arbitrary threshold, typically 0.05, suggesting a result is unlikely to have occurred by chance. The episode argues this threshold creates a false dichotomy, as evidence exists on a continuum and small differences around the cutoff are not truly significant.
Reproducibility (of Research)
The ability for independent researchers to obtain the same results when conducting an experiment again, ideally using the same methods, data, and analytical code. A lack of reproducibility signals potential issues with the original finding or an incomplete understanding of experimental conditions.
Publication Bias
The tendency for scientific journals to preferentially publish studies that report positive or statistically significant findings, while studies with null or inconclusive results are less likely to be published. This can distort the scientific literature, overstating the prevalence of real effects.
Valuable Null Results
These are research findings where no effect is found, but they are still important for advancing knowledge. They are valuable when they address theoretically interesting questions, contradict widely held beliefs, or demonstrate that a method or intervention does not work as claimed, especially if the study was well-conducted.
11 Questions Answered
High school math should prioritize practical statistics and data analysis, including concepts like probability, correlation vs. causation, mean, median, and probability distributions, over esoteric geometry or trigonometry, to better equip students for understanding the world and scientific reasoning.
Geometry is often defended on the grounds that working through its proofs 'trains the mind.' Critics argue that teaching actual logic directly would be more effective for developing logical thinking, and that much of geometry is not practically useful for most students.
A p-value indicates the probability of observing your experimental results, or results more extreme, if there were truly no effect or difference (e.g., if a coin were perfectly fair). It does not directly tell you the probability that your hypothesis is true.
People often mistakenly interpret a p-value as the probability that their hypothesis is true given the data, when it actually gives the probability of the data assuming there is no effect, the inverse of that conditional. The p-value also counts 'more extreme' outcomes that were never observed, making its interpretation less intuitive and prone to misapplication.
The 0.05 p-value cutoff is an arbitrary convention, and evidence exists on a continuum. Findings with p-values just above or below this threshold are not meaningfully different in terms of the strength of evidence, making rigid cutoffs problematic for scientific interpretation.
Not all null results should be published, as many may stem from uninteresting questions or flawed research. However, null results are highly valuable when they address theoretically significant questions, contradict widely accepted beliefs, or demonstrate that a method or intervention is ineffective.
Research reproducibility refers to the ability of other researchers to independently replicate an experiment and achieve the same results as originally reported. This involves using the same methods, data, and analytical code, and its absence suggests potential issues with the original findings or an incomplete understanding of the experimental context.
Studies fail to replicate for various reasons, including original research errors, publication bias favoring positive results, unknown or unmeasured factors influencing outcomes (e.g., specific lab conditions), and challenges in perfectly reproducing complex experimental designs or protocols.
Funding for meta-science is often challenging because it's perceived as less exciting than direct biomedical research. Leaders may be hesitant to divert funds to initiatives that question the reliability of existing research, which can be seen as controversial rather than groundbreaking.
One solution is to integrate lightweight randomized controlled trials (RCTs) directly into the deployment of interventions. This means continuously running studies on the specific populations and in the exact formats where interventions are used, rather than relying on generalizations from studies conducted elsewhere.
Studies can fail to generalize because the original result was noise, there are differences in intervention quality or dosage during rollout, cultural factors are at play, or unknown contextual variables were not accounted for in the original study.
26 Actionable Insights
1. Restructure Math Education for Real-World Use
Advocate for high school math curricula to prioritize basic statistics, data analysis, probability, and direct logic instruction over traditional geometry or advanced calculus. This helps students understand the world, news, medical information, and make better decisions.
2. Teach Logic Directly
Instead of hoping students learn logic through indirect methods like geometric proofs, teach actual logic directly. This is a more effective way to train logical thinking skills.
3. Understand Correlation vs. Causation
Learn to differentiate between correlation and causation. This is crucial for interpreting information, understanding public discourse, and avoiding false conclusions about why events occur.
4. Grasp Bayesian Probability & Base Rates
Develop an intuitive understanding of Bayesian probability, particularly how to incorporate base rates into decision-making. This helps avoid common errors in interpreting probabilities, such as in medical test results.
5. Understand Core Statistical Concepts
Learn the concepts of mean, median, and basic probability distributions (e.g., standard deviation, width). These are fundamental for making sense of data, the world, and scientific findings.
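A quick sketch of why these concepts matter, using Python's standard library and a made-up skewed dataset (the income figures are purely illustrative):

```python
import statistics

# Skewed data: one large outlier pulls the mean but barely moves the median.
incomes = [30, 32, 35, 38, 40, 400]  # in thousands; illustrative values

print(statistics.mean(incomes))    # about 95.8 -- distorted by the outlier
print(statistics.median(incomes))  # 36.5 -- a robust "typical" value
print(statistics.stdev(incomes))   # standard deviation: the spread (width)
```

The gap between mean and median is exactly the kind of distributional intuition the episode argues students should have before graduating.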
6. Participate in Scientific Reasoning
Recognize that scientific reasoning is not exclusive to ‘scientists’; everyone can participate in better reasoning by understanding principles of mathematical probability and logic. This democratizes knowledge acquisition.
7. Interpret P-values Correctly
Understand that a p-value indicates the probability of observing data as extreme or more extreme if there were no effect, not the probability of an effect given the data. This prevents misinterpretation of scientific results.
8. Use P-values to Rule Out Sampling Error
Employ p-values as a tool to determine if a result can reasonably be attributed to random sampling error or noise. A very low p-value suggests the result is unlikely due to chance alone.
9. Be Wary of Small Sample Size + Large Effect
Approach findings with small p-values from very small sample sizes (e.g., N=20) and large effects with suspicion. Such results may be flukish and less likely to replicate.
10. Avoid Dichotomous Thinking with P-values
Do not treat the 0.05 p-value cutoff as a strict dichotomy for ‘statistical significance.’ Evidence exists on a continuum, and results just above or below this arbitrary line are essentially equivalent in terms of actual evidence.
11. Publish Theoretically Interesting Null Results
Prioritize publishing null results for questions that are theoretically interesting and would advance the field, regardless of whether the answer is positive or negative. This informs researchers about dead ends or what doesn’t work.
12. Publish Applied Null Results for Common Beliefs
Publish null results for applied interventions that are widely used or believed to be effective but are found not to work. This provides valuable information to practitioners and the public.
13. Publish Null Results for Failed Methods/Tools
Disclose when a scientific method or tool is discovered not to perform as claimed. This helps improve future scientific practices and tool development.
14. Implement Rigorous Coding Practices in Academia
Academics should adopt industry best practices for programming, including testing code for bugs and conducting independent code reviews. This helps catch inevitable errors in data analysis.
15. Double-Check High-Impact Work
For research with significant implications (e.g., policy, public health), rigorously double-check all calculations and analyses. Increased impact demands increased responsibility and verification.
16. Question Successes as Much as Failures
Perform ‘post-mortems’ on successful experiments or projects, not just failures, to evaluate if good decisions were made or if luck played a significant role. This improves decision-making processes.
17. Adopt Unit Tests in Scientific Programming
Integrate unit tests into scientific code development, where code is written specifically to test other code. This is an indispensable practice for catching bugs.
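A minimal sketch of what this looks like in practice: a small analysis helper (the `normalize` function is a hypothetical example, not from the episode) paired with tests written specifically to check it:

```python
def normalize(values):
    """Scale a list of nonnegative counts to proportions summing to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [v / total for v in values]

def test_preserves_ratios():
    assert normalize([1, 1, 2]) == [0.25, 0.25, 0.5]

def test_sums_to_one():
    assert abs(sum(normalize([2, 3, 5])) - 1.0) < 1e-12

def test_rejects_zero_total():
    try:
        normalize([0, 0])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for all-zero input")

# Run directly here; in a real project these would be auto-discovered
# by a test runner such as pytest or unittest.
test_preserves_ratios()
test_sums_to_one()
test_rejects_zero_total()
```

Even trivial tests like these catch the sign flips, off-by-one errors, and silent edge cases that otherwise end up baked into published results.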
18. Fully Disclose Experimental Methods
Provide comprehensive and detailed descriptions of experimental methods in scientific publications. This is crucial for other researchers to accurately reproduce studies.
19. Be Less Grandiose in Scientific Claims
Avoid overgeneralizing findings from specific populations (e.g., university undergraduates) to broader humanity. Acknowledge the limits of generalizability and the specific context of the study.
20. Design Interventions for Specific Populations/Formats
When developing interventions, study them on populations very similar to the intended real users and in the exact format they will be deployed. This helps ensure effectiveness without needing broad generalizability.
21. Integrate Lightweight RCTs into Intervention Deployment
Weave lightweight randomized controlled trials (RCTs) directly into the ongoing deployment of interventions. This allows for continuous, high-quality data collection on the target population and iterative improvement.
22. Conduct Baseline RCTs in New Contexts
When introducing an intervention or policy in a new geographical or cultural context, always start with a baseline RCT to verify its effectiveness in that specific environment. Do not assume generalizability.
23. Continuously Test and Improve Interventions
Treat interventions as dynamic entities that can be continually improved through ongoing A/B testing and data collection, rather than static, one-time deployments.
24. Recognize Impact of Implementation Quality
Understand that even a theoretically sound intervention will fail if poorly implemented. Quality of execution is a critical factor for success.
25. Account for Dosage Differences
Be aware that variations in intervention dosage (e.g., duration, frequency) can significantly alter outcomes and affect generalizability.
26. Consider Cultural Factors in Interventions
When deploying interventions in new areas, carefully consider local cultural factors and potential moral opposition, as these can profoundly impact effectiveness.
5 Key Quotes
If you want people to learn math, teach them math. If you want people to learn music, teach them music and justify it on its own terms, not because of some benefit to something else that you could have been teaching directly.
Stuart Buck
The difference between statistically significant and statistically insignificant is statistically insignificant.
Andrew Gelman (quoted by Stuart Buck)
If psychology is that brittle, we're screwed.
Spencer Greenberg
What we want is not just more publications. What we want is more publications, particularly in that area, you know, cell biology, et cetera, or publications that lead to greater understanding of how the human body works and ultimately greater ability to prevent diseases or cure diseases or address aging and issues like that.
Stuart Buck
A bad implementation of any intervention will always fail.
Spencer Greenberg
3 Protocols
Coin Flip P-Value Explanation
Stuart Buck
- Assume a coin is fair (50/50 probability) as the null hypothesis.
- Flip the coin 100 times and observe the result (e.g., 60 heads, 40 tails).
- Ask: What is the probability of getting this observed result (60/40) or something more extreme (e.g., 61/39, 62/38, etc.) if the coin were truly fair?
- This calculated probability is the p-value, indicating how likely the observed data is under the assumption of no effect.
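The steps above can be computed exactly with the binomial distribution. This sketch finds the two-sided p-value for 60 heads in 100 flips of a fair coin:

```python
from math import comb

def two_sided_p(heads, flips=100, p_fair=0.5):
    """P(a result at least as far from flips/2 as `heads`) under a fair coin."""
    deviation = abs(heads - flips / 2)
    prob = 0.0
    for k in range(flips + 1):
        if abs(k - flips / 2) >= deviation:  # "as extreme or more extreme"
            prob += comb(flips, k) * p_fair**k * (1 - p_fair)**(flips - k)
    return prob

print(round(two_sided_p(60), 4))  # about 0.0569: unusual, but not wildly so
```

Note that 60/40 lands just above the conventional 0.05 cutoff, a concrete illustration of the episode's point that evidence near the threshold sits on a continuum.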
Improving Scientific Research Processes
Stuart Buck & Spencer Greenberg
- Question successes as much as failures: Conduct 'postmortems' on successful outcomes to determine if decisions were genuinely good or if luck played a role.
- Implement unit tests: Write code specifically designed to test other code, helping to catch bugs and ensure accuracy in analyses.
Integrating Lightweight Randomized Control Trials (RCTs) into Interventions
Spencer Greenberg
- Design an intervention to target a specific population.
- Weave a lightweight RCT into the normal deployment process of the intervention (e.g., randomize who receives the intervention or different versions of it).
- Collect follow-up data continuously as the intervention is delivered.
- If moving the intervention to a new area, start with a baseline RCT there to confirm effectiveness before further optimization.
- Continuously A/B test different versions of the intervention to improve it over time, similar to software product development.
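The randomization step of this protocol can be sketched as follows. The arm names and the `run_trial` helper are hypothetical placeholders; a real deployment would deliver the assigned version and later join in follow-up outcome data:

```python
import random

def assign_arm(user_id, arms=("control", "intervention")):
    """Stable randomization: seed on the user ID so re-runs agree."""
    rng = random.Random(user_id)
    return rng.choice(arms)

def run_trial(user_ids):
    """Assign every user to an arm as part of normal deployment."""
    assignments = {uid: assign_arm(uid) for uid in user_ids}
    # Here the deployed system would deliver the assigned version,
    # then log outcomes for each user as follow-up data arrives.
    return assignments

assignments = run_trial(range(1000))
n = list(assignments.values()).count("intervention")
print(n)  # roughly half of the 1000 users
```

Seeding on the user ID keeps assignment deterministic across restarts, which matters when the trial is woven into a long-running product rather than run as a one-off study.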