The replication crisis: what 2015 changed in psychology
The Open Science Collaboration's attempt to replicate 100 published psychology findings produced one of the most uncomfortable papers in the field's history. A decade later, here's what changed.
On August 28, 2015, the Open Science Collaboration published a paper in Science titled "Estimating the reproducibility of psychological science." The team had attempted to replicate 100 findings published in three major psychology journals in 2008. Their result: only 36 percent of the replications produced statistically significant results in the same direction as the originals, compared with 97 percent of the original studies (Open Science Collaboration, 2015).
The paper landed like a depth charge. It did not say psychology was fake. It said that the literature, taken as a whole, was substantially less robust than the published record suggested. A decade later, the field has changed in specific ways. Other fields are still catching up.
1. What the paper actually showed
The headline number depends on which criterion you pick, and the figures are often conflated. The OSC team used several criteria for "successful replication": statistical significance, whether the original effect size fell inside the replication's 95% confidence interval, and the replication teams' subjective ratings. By the strictest criterion (a significant result in the same direction), 36% replicated. By subjective rating, 39% did. By the confidence-interval criterion, 47% did.
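To see why the criteria can disagree, here is a minimal sketch that applies two of them to a single original/replication pair of correlations. It is an illustration only, not the OSC team's analysis code: the function name, the example values, and the use of a Fisher z confidence interval are assumptions made for this example, and the interval check tests whether the original estimate falls inside the replication's interval.

```python
import math
from statistics import NormalDist


def replication_checks(r_original, r_replication, n_replication, alpha=0.05):
    """Apply two illustrative replication criteria to a pair of correlations.

    Hypothetical helper for illustration; not the OSC team's code.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Fisher z-transform: atanh(r) is approximately normal with SE 1/sqrt(n - 3).
    z_rep = math.atanh(r_replication)
    se = 1 / math.sqrt(n_replication - 3)
    ci_low = math.tanh(z_rep - z_crit * se)
    ci_high = math.tanh(z_rep + z_crit * se)
    # Criterion A: the replication interval excludes zero and the sign matches.
    significant = ci_low > 0 or ci_high < 0
    same_direction = (r_replication > 0) == (r_original > 0)
    # Criterion B: the original estimate lies inside the replication interval.
    original_in_ci = ci_low <= r_original <= ci_high
    return {
        "significant_same_direction": significant and same_direction,
        "original_in_replication_ci": original_in_ci,
        "ci": (ci_low, ci_high),
    }


# Made-up numbers: a large replication with a much smaller effect ...
large = replication_checks(r_original=0.40, r_replication=0.20, n_replication=120)
# ... and a small replication with a wide interval.
small = replication_checks(r_original=0.40, r_replication=0.25, n_replication=40)
print(large)
print(small)
```

With these made-up inputs, the large replication is significant in the original direction yet its interval excludes the original estimate, while the small replication is not significant yet its wide interval still contains the original. The same study pair can therefore count as a success under one criterion and a failure under another.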
The mean effect size of the replications was roughly half that of the originals (a mean correlation of about 0.20, against about 0.40 in the original studies). This is the more important number. Even findings that did replicate often showed weaker effects than the originals, which is what you would expect if publication bias and selective reporting had inflated the original literature.
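The link between publication bias and shrinking effects can be shown with a small simulation. The sketch below is a toy model, not a reconstruction of the OSC data: the true effect (d = 0.2), the group size (20), the number of simulated studies, and the rule that only positive, significant results get "published" are all assumptions chosen for illustration.

```python
import random
import statistics

N_PER_GROUP = 20   # assumed group size for every simulated study
T_CRIT = 2.02      # approx. two-sided 5% critical t for 38 degrees of freedom


def simulate_significance_filter(true_d=0.2, n_studies=2000, seed=1):
    """Simulate two-group studies and 'publish' only positive significant ones."""
    rng = random.Random(seed)
    all_d, published_d = [], []
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
        treated = [rng.gauss(true_d, 1.0) for _ in range(N_PER_GROUP)]
        pooled_sd = ((statistics.variance(control)
                      + statistics.variance(treated)) / 2) ** 0.5
        d = (statistics.mean(treated) - statistics.mean(control)) / pooled_sd
        # With equal group sizes, the two-sample t statistic is d * sqrt(n / 2).
        t = d * (N_PER_GROUP / 2) ** 0.5
        all_d.append(d)
        if t > T_CRIT:
            published_d.append(d)
    return {
        "mean_d_all": statistics.mean(all_d),
        "mean_d_published": statistics.mean(published_d),
        "share_published": len(published_d) / n_studies,
    }


results = simulate_significance_filter()
print(results)
```

Across all simulated studies the average estimate sits near the true effect. In the published subset, every estimate must clear the significance threshold, which at this sample size means an observed d above roughly 0.64, so the published average is several times the true effect. An unbiased replication of such a study would be expected to land much closer to 0.2, which looks like a shrinking effect.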
2. The reaction
The paper was contested immediately. Daniel Gilbert and colleagues argued the OSC team had used methods too different from the originals and that proper analysis would show 70%+ replication (Gilbert et al., 2016). The OSC team responded that Gilbert's reanalysis required assumptions the data didn't support (Anderson et al., 2016).
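The exchange continues. But the broader landscape has shifted independently of that specific debate: subsequent large-scale replication efforts across the social sciences, including a replication of experiments published in Nature and Science, have also found that a sizable share of published findings fail to replicate and that replication effects tend to be smaller than the originals (Camerer et al., 2018).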
3. What changed methodologically
The decade since 2015 has produced several institutional changes:
Pre-registration. Researchers now routinely publish their hypotheses and analysis plans before collecting data, in registries that can't be edited after the fact. Pre-registered studies replicate at substantially higher rates than non-pre-registered (Nosek et al., 2018).
Registered Reports. Journals like Nature Human Behaviour now accept papers based on the design alone — before data collection — committing to publish regardless of whether the result is positive.
Sample size norms. Effect sizes from the replication studies suggested most psychology research had been underpowered. Sample sizes have crept upward, though not as fast as methodologists recommend.
Effect size reporting. Reviewers and editors now routinely require effect sizes (Cohen's d, r², etc.), not just p-values.
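The sample size point above can be made concrete with a standard power calculation. The sketch below uses the normal approximation for a two-sided, two-group comparison of means, n per group = 2 * ((z_alpha + z_power) / d)^2. The function name and the effect sizes are hypothetical, and the 5% significance level and 80% power target are conventional assumptions rather than figures from the studies discussed here.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size to detect a standardized effect d.

    Normal approximation; an exact t-based calculation gives slightly
    larger numbers for small samples.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


for d in (0.8, 0.4, 0.2):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
# d = 0.8: about 25 participants per group
# d = 0.4: about 99 participants per group
# d = 0.2: about 393 participants per group
```

Because the required sample grows with 1/d², halving the expected effect roughly quadruples the number of participants needed. A study sized for the effect reported in an inflated original literature will be badly underpowered if the true effect is half as large.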
4. What hasn't changed
The career incentives of academic psychology still reward novel positive findings over confirmation studies. Most replication work remains underfunded and undercited. The journals that publish original findings rarely publish failed replications of those findings with the same prominence.
The field has gotten methodologically more rigorous. The publication system that selected for fragile findings hasn't been fully reformed.
5. The honest reader's position
The implication for a reader of pop psychology in 2026 is straightforward: any single study published before roughly 2015, particularly one with a small sample and a surprising result, should be treated with suspicion until the replication picture is clear. A half-shelf of bestselling psychology books from the 2000s, including several that became cultural touchstones, rests partly on findings that subsequent replication has weakened or undone.
This isn't a reason for nihilism. It's a reason for caution. The findings that replicate tend to be the boring ones — moderate effects, mundane mechanisms. The exciting ones tend to shrink.
References
- Anderson, C. J., Bahník, Š., Barnett-Cowan, M., et al. (2016). Response to comment on "Estimating the reproducibility of psychological science." Science, 351(6277), 1037.
- Camerer, C. F., Dreber, A., Holzmeister, F., et al. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2(9), 637-644.
- Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on "Estimating the reproducibility of psychological science." Science, 351(6277), 1037.
- Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. PNAS, 115(11), 2600-2606.
- Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.