Should science stop worshiping statistical significance? (with Andrew Gelman)

Mar 5, 2026
Overview

Andrew Gelman, Ph.D., discusses the replication crisis in science, highlighting common statistical flaws and the need for greater transparency and openness about uncertainty in research. He advocates for valuing criticism, measuring mechanisms directly, and incorporating prior knowledge through Bayesian methods.

At a Glance
21 Insights
1h 19m Duration
15 Topics
7 Concepts

Deep Dive Analysis

Andrew Gelman's Approach to Critiquing Science

Reactions to Criticism and Value of Flaw-Finding

Understanding the Scientific Replication Crisis

Ego Depletion and Flawed Research Paradigms

P-Hacking and the Garden of Forking Paths

Rethinking the Role of P-Values in Research

Politicization and Hidden Biases in Science

Evaluating Evidence from Correlational Studies

Challenges and Value of Meta-Analyses

Statistical Realities of Election Polling

Introduction to Bayesian Statistical Methods

Embracing Uncertainty and Variation in Data

Improving Scientific Process and Data Sharing

LLMs and Their Utility in Statistical Analysis

Detecting Issues in Clinical Trial Design

Concepts

P-hacking

P-hacking is the practice of analyzing data in many different ways until a statistically significant result (a low p-value) turns up. Because significance calculations assume a single pre-specified analysis, this makes a pattern that could easily have arisen by chance look like a genuine discovery. Andrew Gelman prefers the term 'garden of forking paths', since researchers are often not deliberately 'hacking' but simply wandering down different analytical paths.

Garden of Forking Paths

This concept describes how researchers, when analyzing data, can make numerous choices (e.g., which variables to include, how to treat outliers, specific statistical tests) that lead to different results. This flexibility can inadvertently lead to finding statistically significant patterns that are not robust, even without malicious intent.
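A small simulation makes the danger concrete. The sketch below is purely illustrative (it is not from the episode): it generates datasets of pure noise, tries a few plausible analytical choices on each, and reports whichever p-value is smallest. The false-positive rate climbs well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n = 2000, 50
significant = 0

for _ in range(n_studies):
    # Pure noise: there is no true effect anywhere.
    treat = rng.normal(size=n)
    ctrl = rng.normal(size=n)

    # A few "forking paths": different defensible analyses of the same data.
    p_values = [
        stats.ttest_ind(treat, ctrl).pvalue,                      # plain t-test
        stats.mannwhitneyu(treat, ctrl,
                           alternative="two-sided").pvalue,       # nonparametric test
        stats.ttest_ind(treat[: n // 2], ctrl[: n // 2]).pvalue,  # "early look" at half the data
        stats.ttest_ind(np.clip(treat, -2, 2),
                        np.clip(ctrl, -2, 2)).pvalue,             # outliers trimmed
    ]
    # Report whichever analysis "worked".
    if min(p_values) < 0.05:
        significant += 1

# With one pre-chosen test this would be about 5%; with forking paths it is noticeably higher.
print(f"false-positive rate: {significant / n_studies:.1%}")
```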

Replication Crisis

The replication crisis is a phenomenon in science, particularly in psychology and social sciences, where many high-profile published studies fail to produce the same results when re-attempted by other researchers. This calls into question the reliability of a significant body of scientific literature.

Ego Depletion

Ego depletion is a psychological theory suggesting that willpower is a finite resource that can be depleted through use, making it harder to exert self-control later. Despite hundreds of published studies, large-scale replication attempts have largely failed to confirm the phenomenon, raising questions about its scientific validity.

Penicillin Model of Science

This term describes a simplistic approach to scientific research where a 'treatment' is applied (like taking a pill or pushing a button) and only the final outcome is observed, without understanding the underlying mechanisms or contextual factors. This approach can fail when effects are complex, variable, or context-dependent.

Bayesian Statistics

Bayesian statistics is a statistical approach that combines observed data with prior information or beliefs about what is true to produce more realistic and informed estimates. It explicitly models prior knowledge, allowing for more nuanced inferences, especially when data is sparse or when integrating findings from multiple sources.
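As a toy illustration of the mechanics (the numbers are invented, not from the episode), here is the standard normal-normal conjugate update, where a skeptical prior pulls a noisy estimate toward more realistic values:

```python
# Combine a skeptical prior with a noisy study estimate (all values hypothetical).
prior_mean, prior_sd = 0.0, 2.0   # prior: effects near zero are most plausible
data_mean, data_se = 8.0, 4.0     # study estimate with a large standard error

# Precision-weighted average: the standard normal-normal conjugate result.
prior_prec = 1 / prior_sd**2
data_prec = 1 / data_se**2
post_prec = prior_prec + data_prec
post_mean = (prior_prec * prior_mean + data_prec * data_mean) / post_prec
post_sd = post_prec ** -0.5

# Prints roughly 1.60 +/- 1.79: the noisy estimate of 8 is pulled sharply
# toward the prior, reflecting how little information it actually carries.
print(f"posterior: {post_mean:.2f} +/- {post_sd:.2f}")
```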

One-way Street Fallacy

This fallacy refers to the assumption that a particular intervention or phenomenon can only have a positive or neutral effect, and 'can't hurt.' This thinking ignores the possibility of negative or counterproductive effects in certain contexts or for certain individuals.

Questions & Answers

How should scientists react to criticism of their work?

Scientists should welcome criticism, as it can help identify flaws and lead to improvements in their research, even if the criticism is delivered rudely.

What is the 'replication crisis' in science?

The replication crisis refers to the widespread finding that many published scientific studies, particularly in fields like psychology, fail to produce the same results when other researchers attempt to replicate them.

How can hundreds of papers be published on a phenomenon that doesn't seem to exist, like ego depletion?

This can happen due to a combination of noisy measurements, the pressure for statistically significant results (leading to overestimation of effects), and researchers inadvertently finding patterns by analyzing data in various ways until a publishable result emerges.

What is 'p-hacking' and why is it problematic?

P-hacking, or the 'garden of forking paths,' is when researchers explore multiple analytical choices in their data until they find a statistically significant p-value (typically less than 0.05). This can lead to publishing results that appear robust but are actually due to chance or analytical flexibility, overstating the strength of evidence.

Should p-values be eliminated from scientific practice?

While Andrew Gelman personally doesn't use them and believes science might be better without them, he acknowledges they serve a purpose by setting a bar against publishing overly noisy or trivial findings, preventing a 'moral hazard' where researchers might publish any positive result regardless of quality.

To what extent does political bias influence scientific research?

Political assumptions can subtly influence research, especially in political science and psychology, by favoring studies suggesting that hidden forces drive human actions or that voters are easily swayed, conclusions that can align with certain cynical or anti-democratic viewpoints.

How should one evaluate correlational studies, especially in health research?

Correlational studies should not be dismissed entirely, as they can provide valuable information and insights into potential mechanisms. However, they require careful consideration of intermediate outcomes, potential confounding factors, and the reasonableness of the proposed mechanisms, as correlation does not imply causation.

Are meta-analyses always reliable for synthesizing scientific literature?

Meta-analyses can be powerful tools, but their reliability depends on the quality of the individual studies included ('garbage in, garbage out'). If the underlying studies are biased or poorly conducted, combining them will not necessarily yield a more accurate result.

Why were polls seemingly inaccurate in the 2016 US presidential election?

Polls in 2016 were off by about two or three percentage points, which is within a typical historical range of error. The perceived inaccuracy was partly due to unrealistic expectations fostered by highly accurate polls in previous close elections (2000, 2004, 2008) and the fact that Trump attracted non-typical voters who were harder for polls to reach.

What is Bayesian statistics and why is it useful?

Bayesian statistics combines observed data with prior information or beliefs (priors) to produce more realistic and informed estimates. It is particularly useful when data is limited, as it allows researchers to integrate existing knowledge to refine their understanding and make more robust inferences.

How can the scientific process be improved to address issues like the replication crisis?

Improvements include encouraging registered reports (publishing study designs before data collection), allowing the publication of just data sets for others to analyze, and fostering a culture where researchers are more open about uncertainty and variation in their findings.

How should large language models (LLMs) be used in statistical analysis?

LLMs can be useful for searching technical information, like a highly accessible textbook, and for coding statistical programs. However, they should be used with caution and their outputs verified, as their ability to perform complex, novel statistical analysis or find flaws in papers is limited.

Insights

1. Embrace Uncertainty in Science

Recognize that evidence often points in different directions and that it’s a relief to be uncertain. This mindset helps avoid the false certainty often derived from misinterpreting statistical results.

2. Value and Seek Criticism

Actively welcome criticism of your work, as it can reveal flaws and lead to significant improvements. Treat criticism as a gift that helps refine and strengthen your understanding or research.

3. Avoid Presumption of Correctness

Do not assume a finding is correct simply because you cannot immediately identify a ‘smoking gun’ error. Maintain skepticism and be aware that flaws may become obvious later.

4. Be Open About Uncertainty

Communicate openly about the uncertainty in your findings, rather than presenting results as definitive. This fosters a more realistic understanding of scientific evidence.

5. Prioritize Research Design

Focus on robust study design, as it is more critical than the analysis itself. Use your intuitions and prior knowledge to inform and strengthen your experimental setup.

6. Measure Mechanisms Directly

When studying complex phenomena, measure the underlying mechanisms as directly as possible, rather than relying solely on reduced-form outcome analyses. This provides deeper insights into how effects occur.

7. Avoid One-Sided Thinking

Be aware of the ‘one-way street fallacy,’ the assumption that an intervention can only be positive or neutral. Consider that effects can be negative in some settings or for certain individuals.

8. Plot Data for Initial Insights

Before formal statistical analysis, create graphs of your data to visually inspect patterns and understand what the data look like. This can provide intuitive insights that complement formal methods.
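For example (entirely made-up data, just to illustrate the habit), a plain scatter plot exposes outliers or curvature that a fitted coefficient would quietly absorb:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 0.5 * x + rng.normal(scale=1.0, size=100)
y[::10] += 8  # a handful of outliers, visible at a glance in the plot

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.6)
ax.set_xlabel("predictor")
ax.set_ylabel("outcome")
ax.set_title("Look at the raw data before fitting anything")
plt.show()
```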

9. Focus on Real-World Measures

Report and interpret findings using real-world scales (e.g., percentage shifts, death rate reductions) rather than solely relying on p-values. This makes results more tangible and understandable.

10. Be Realistic About Effect Sizes

When designing new studies, set realistic expectations for potential effect sizes based on prior knowledge and the inherent noise in measurements. This helps avoid designing underpowered studies.
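As a back-of-the-envelope illustration (a sketch using the normal approximation for a two-sample comparison; the effect sizes and sample sizes are hypothetical), note how power collapses when the true effect is small:

```python
from scipy.stats import norm

def power_two_sample(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test,
    with effect_size in standard-deviation units."""
    se = (2 / n_per_group) ** 0.5      # SE of the difference in means (sd = 1)
    z_crit = norm.ppf(1 - alpha / 2)
    shift = effect_size / se
    return norm.sf(z_crit - shift) + norm.cdf(-z_crit - shift)

# An optimistic large effect looks fine with a small sample...
print(f"d = 0.8, n = 25/group: power = {power_two_sample(0.8, 25):.0%}")   # ~81%
# ...but a realistic small effect makes the same design hopeless.
print(f"d = 0.2, n = 25/group: power = {power_two_sample(0.2, 25):.0%}")   # ~11%
print(f"d = 0.2, n = 400/group: power = {power_two_sample(0.2, 400):.0%}") # ~81%
```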

11. Assess Mechanism Reasonableness

When evaluating health or other claims, assess the reasonableness of the proposed mechanism, especially for observational studies. This helps to gauge the plausibility of reported correlations.

12. Make Decisions with Imperfect Information

Recognize that you often have to make decisions with imperfect data and that waiting for perfect evidence is not always feasible. Not making a decision is itself a decision.

13. Conduct ‘One-Study Meta-Analysis’

Even with a single study, consider the potential variation of the effect across different populations and conditions. This ‘one-study meta-analysis’ helps account for broader uncertainty beyond the immediate sample.

14. Include Study-Level Predictors

In meta-analyses, incorporate characteristics of individual studies as predictors to understand where and for whom an intervention works or doesn’t work. This helps explain variation in effects.
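A bare-bones version of this idea (all numbers invented) is an inverse-variance weighted meta-regression of study effects on a study-level characteristic:

```python
import numpy as np

# Hypothetical meta-analysis: per-study effect estimates, standard errors,
# and one study-level predictor (say, average participant age).
effects = np.array([0.30, 0.25, 0.10, 0.05, -0.02])
ses     = np.array([0.10, 0.12, 0.08, 0.09, 0.11])
age     = np.array([20.0, 25.0, 40.0, 55.0, 65.0])

# Weighted least squares with weights 1/SE^2 (inverse-variance weighting).
X = np.column_stack([np.ones_like(age), age])
W = np.diag(1 / ses**2)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ effects)

# A negative slope suggests the intervention works less well in older samples,
# i.e., the studies differ systematically rather than sharing one fixed effect.
print(f"intercept = {beta[0]:.3f}, slope per year of age = {beta[1]:.4f}")
```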

15. Use Training and Test Sets

When working with large datasets, split them into training and test sets. Analyze the training set extensively, then validate findings on the unseen test set to prevent overfitting and spurious results.
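A minimal sketch of that workflow (the data and model here are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))                        # made-up features
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)  # made-up outcome

# Hold out a test set before doing any exploratory analysis.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Explore and refit as freely as you like on the training set...
model = LogisticRegression().fit(X_train, y_train)

# ...then report performance once, on data the exploration never touched.
print(f"held-out accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```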

16. Publish Study Designs (Registered Reports)

Consider publishing the design of your study (e.g., as a registered report) before data collection. This commits you to publishing results regardless of outcome, increasing transparency and reducing publication bias.

17. Publish Raw Data Separately

If your data is interesting, publish it as a standalone paper, allowing other researchers to analyze it independently. This promotes open science and diverse interpretations.

18. Show Raw and Adjusted Results

When performing complex analyses, present both the raw results and the results after adjustments. This transparency helps readers understand the impact of your analytical choices.

19. Respect Implementers in Policy

When designing and implementing policies or interventions, actively involve and respect the people on the ground (e.g., doctors, teachers, police). Their commitment and involvement are crucial for success.

20. Limit Email Checking

Consider limiting email checking to later in the day (e.g., after 4 p.m.) to protect focus and productivity. This habit helps manage interruptions and maintain concentration on core tasks.

21. Spend Time with Family

Prioritize spending time with family, as it can be a significant source of happiness. This direct approach to well-being can be more effective than indirect or subliminal methods.

Quotes

If you publish something, it's public. And if you're willing to let people forward your paper and say how great it is without asking you for permission, you should be willing to let people forward your paper and say how it sucks without asking for permission, too.

Andrew Gelman

Criticism is wonderful. People should value it more.

Andrew Gelman

I sometimes think that actually a lot of science would be better if statistics had never been invented.

Andrew Gelman

Taking a bunch of really garbage mortgages and then bundling them up and saying, oh, look, if I bundle them just the right way, I get AAA. Yeah, that's how I feel with a lot of that.

Andrew Gelman

The very fact that it's so hard to say it is an indication that I think it's the wrong thing to say.

Andrew Gelman

Garbage in, garbage out.

Andrew Gelman

Nobody said you had the right to predict the winner in a very close election. I mean, nobody owes you that.

Andrew Gelman

The belief is that the part is representative of the whole.

Andrew Gelman
By the Numbers

108%
Estimated percentage of non-whites voting for Barack Obama in New Hampshire. An impossible estimate from an early model, highlighting a flaw in Andrew Gelman's own work that a critic caught.
18 percentage points
Claimed change in attitudes on immigration caused by a subliminal smiley face. A claim promoted by a prominent political scientist, which Andrew Gelman considers implausible.
Less than 0.05
Common p-value threshold for statistical significance. The widely accepted bar for publishing results, and a frequent driver of 'p-hacking'.
11
Standard errors away from zero claimed for a mind-body healing result. An extremely high initial value; on reanalysis it dropped to roughly two standard errors, indicating the evidence had been overstated.
80%
Typical target for statistical power in study design. The standard rule is to design a study with an 80% chance of detecting a statistically significant effect if one truly exists.
2-3 percentage points
Approximate polling error in the 2016 US presidential election. Within historical norms, but it collided with unrealistic public expectations set by earlier, unusually accurate polls.
~10%
Share of people who answer their cell phone when called from an unknown number. Illustrates how hard it has become for pollsters to reach and recruit survey participants.