Science

P-hacking: how studies become 'significant' without lying

The 2011 paper that exposed what is now called p-hacking, and the cleanup that has followed in psychology, medicine, and economics, have reshaped what 'statistically significant' actually means.

James Okonkwo
Contributing Editor, Tessera. PhD, Behavioral Economics, LSE

Joseph Simmons, Leif Nelson, and Uri Simonsohn's 2011 paper carried a title that sounded like a magazine pitch: "False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant." The paper did not allege fraud. It demonstrated, through simulations and disclosed analyses of the authors' own studies, that researchers combining a few ordinary analytical decisions could produce statistically "significant" findings from random noise about 60% of the time.

The practice they identified came to be known as p-hacking. It is arguably the most consequential methodological finding in psychology of the past fifty years.

1. The mechanism

Statistical significance at the p < .05 threshold is supposed to mean: if there were no real effect, you'd observe data at least this extreme no more than 5% of the time. That guarantee holds only when the analysis is fixed in advance. It breaks down when researchers make many decisions about analysis after seeing the data:

  • Test 4 different measures and report the one that worked
  • Try 3 different sample exclusion rules
  • Run with and without a covariate, keep the better version
  • Collect 50 more participants if your first 50 don't show the effect
  • Try the analysis multiple ways and report only the cleanest

Each of these is individually defensible. Combined, they inflate the false-positive rate substantially, because every extra analysis is another chance for noise to cross the threshold. Simmons et al. showed that researcher degrees of freedom (small, flexible choices that were not pre-registered) could turn a nominal 5% false-positive rate into one of about 60% (Simmons, Nelson, & Simonsohn, 2011).
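The inflation is easy to reproduce on a laptop. The sketch below is not the simulation from Simmons et al.; it is a simplified illustration that assumes independent outcome measures, normally distributed data with known unit variance, and a z-test standing in for the usual t-test. The function names and sample sizes are hypothetical. It compares a study that runs one pre-specified test against a study that tries four measures and recruits more participants if nothing is significant, with no real effect present in either case.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()

def p_value(group_a, group_b):
    """Two-sided z-test for a difference in means (assumes known unit variance)."""
    standard_error = (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / standard_error
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

def draw(rng, n):
    """n observations of pure noise: there is no real effect anywhere."""
    return [rng.gauss(0, 1) for _ in range(n)]

def honest_study(rng, n=20):
    """One pre-specified measure and a fixed sample size."""
    return p_value(draw(rng, n), draw(rng, n)) < 0.05

def flexible_study(rng, n=20, n_measures=4, extra=20):
    """Try several measures; if none is significant, add participants and retry."""
    groups = [(draw(rng, n), draw(rng, n)) for _ in range(n_measures)]
    if any(p_value(a, b) < 0.05 for a, b in groups):
        return True  # report whichever measure "worked"
    # Optional stopping: the first look failed, so collect more data.
    groups = [(a + draw(rng, extra), b + draw(rng, extra)) for a, b in groups]
    return any(p_value(a, b) < 0.05 for a, b in groups)

def false_positive_rate(study, trials=5000, seed=1):
    """Share of simulated null studies that end up 'significant'."""
    rng = random.Random(seed)
    return sum(study(rng) for _ in range(trials)) / trials

honest_rate = false_positive_rate(honest_study)
flexible_rate = false_positive_rate(flexible_study)
print(f"honest: {honest_rate:.3f}  flexible: {flexible_rate:.3f}")
```

With four independent measures, arithmetic alone gives 1 - 0.95^4, or about 0.19, before any optional stopping. The simulated flexible rate lands above that, while the honest rate stays close to the nominal 0.05.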

2. Why this isn't fraud

The researchers doing this are mostly not cheating. They are using their judgment about which analyses are most informative. The problem is that the result of those judgments, made after seeing the data, is structurally biased toward finding effects.

This distinguishes p-hacking from outright fraud. Most of the inflated literature was produced by honest researchers using flexible analytical practices that the field treated as normal.

3. The reforms

The 2010s wave of methodological reform in psychology has been almost entirely about constraining researcher degrees of freedom:

Pre-registration. Researchers publicly commit to hypotheses and analyses before data collection.

Registered Reports. Journals accept papers based on the design alone — before data is collected — committing to publish regardless of results.

Effect-size reporting. Statistical significance alone is increasingly treated as insufficient evidence; many journals now expect effect sizes, and often confidence intervals, alongside p-values.

Multiverse analysis. Some papers now report results across the full set of reasonable analytical choices, not just the one the authors preferred (Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016).

The evidence so far suggests these reforms have reduced false-positive rates in the post-2015 literature.
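Of the reforms listed above, multiverse analysis is the easiest to sketch in code. The example below is an illustration only, not the procedure from Steegen et al.: the dataset is simulated noise, the field names (measure_1, measure_2, age) and the three analytical choices are hypothetical, and the p-value uses a normal approximation to Welch's t-test. The point is the shape of the workflow: every combination of choices is run and reported, instead of the single combination that looks best.

```python
import itertools
import random
from statistics import NormalDist, fmean, stdev

def approx_p(a, b):
    """Two-sided p-value, normal approximation to Welch's t-test."""
    standard_error = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = (fmean(a) - fmean(b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical dataset: two groups, two outcome measures, no real effect.
rng = random.Random(7)
participants = [
    {"group": i % 2, "age": rng.randint(18, 70),
     "measure_1": rng.gauss(0, 1), "measure_2": rng.gauss(0, 1)}
    for i in range(400)
]

# The analytical choices a researcher could defensibly make.
OUTCOMES = ["measure_1", "measure_2"]
OUTLIER_CUTOFFS = [None, 2.0, 2.5]   # None keeps every score
MIN_AGES = [18, 25]

def run_specification(data, outcome, cutoff, min_age):
    """Apply one combination of choices and return its p-value."""
    kept = [p for p in data
            if p["age"] >= min_age
            and (cutoff is None or abs(p[outcome]) <= cutoff)]
    treated = [p[outcome] for p in kept if p["group"] == 1]
    control = [p[outcome] for p in kept if p["group"] == 0]
    return approx_p(treated, control)

# Run every combination: 2 outcomes x 3 cutoffs x 2 age rules = 12 analyses.
multiverse = {
    spec: run_specification(participants, *spec)
    for spec in itertools.product(OUTCOMES, OUTLIER_CUTOFFS, MIN_AGES)
}

for spec, p in sorted(multiverse.items(), key=lambda item: item[1]):
    print(spec, round(p, 3))

share = sum(p < 0.05 for p in multiverse.values()) / len(multiverse)
print(f"{share:.0%} of {len(multiverse)} specifications reach p < .05")
```

A p-hacked paper would report only the first line of that sorted output. A multiverse report shows the whole table, so a reader can see whether a finding survives most reasonable analyses or depends on one particular set of choices.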

4. The pre-2015 status

The implication for older studies is uncomfortable. Substantial fractions of the psychology literature published before about 2015 (and, to a lesser extent, of medical and economic research) were produced under conditions in which p-hacking could substantially distort the results. Most popular psychology books rely heavily on pre-2015 findings.

This doesn't make those findings wrong. It means single-study findings from that era should be cited with explicit acknowledgment that they may not replicate.

5. The reader's calibration

For a non-specialist reading any single study, ask three questions: was it pre-registered, what is the effect size (not just the p-value), and has it been independently replicated? If the answers are "no, unclear, no," treat it as a suggestion rather than evidence.
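The reason to ask for the effect size and not just the p-value can be shown with a few lines of arithmetic. The sketch below uses made-up summary numbers and a normal approximation in place of a t-test; the function name summarize is hypothetical. It computes Cohen's d (the difference in means divided by the standard deviation) and a two-sided p-value for two imaginary studies.

```python
from statistics import NormalDist

def summarize(mean_difference, pooled_sd, n_per_group):
    """Return (Cohen's d, approximate two-sided p-value) for two equal groups."""
    d = mean_difference / pooled_sd
    # Standard error of d for two groups of equal size, normal approximation.
    z = d / (2 / n_per_group) ** 0.5
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return d, p

# A trivially small effect measured on a very large sample.
tiny_d, tiny_p = summarize(mean_difference=0.05, pooled_sd=1.0, n_per_group=10_000)

# A moderate effect measured on a small sample.
medium_d, medium_p = summarize(mean_difference=0.5, pooled_sd=1.0, n_per_group=20)

print(f"tiny effect:   d = {tiny_d:.2f}, p = {tiny_p:.4f}")
print(f"medium effect: d = {medium_d:.2f}, p = {medium_p:.4f}")
```

Under these assumptions the tiny effect clears p < .05 comfortably and the effect ten times larger does not. The p-value tracks sample size as much as it tracks the size of the effect, which is why it cannot stand in for one.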

P-hacking didn't make psychology fake. It made the burden of evidence higher than the field had been treating it. A decade in, the field is better than it was. The catch-up on the existing literature is still in progress.

References
  1. Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.
  2. Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science, 11(5), 702-712.