A defining characteristic of science is reproducibility. Basically, if a result is reliable, we should be able to observe it whenever we repeat the experiment under similar conditions. But in recent years, more and more papers across various fields have begun to show that this isn’t necessarily the case.
In 2015, 270 scientists from various institutions across the world came together to publish an earth-shattering paper. Out of 100 psychology studies published in high-quality journals, only 39 could be successfully reproduced. This, on the one hand, triggered a wave of headlines such as “Almost 70% of psychology studies cannot be reproduced!” or “Psychology in a replication crisis!”. On the other hand, it contributed to what was already an ongoing, albeit more silent, discussion about reproducibility of scientific results in general.
Now, the 2015 paper was an impressive feat of scientific work and collaboration. And, in addition to its scientific merits, I would dare add, it managed to do what many other studies had failed to do: draw the general public’s attention to a current scientific problem. But between the flashy headlines, some details were easily lost and the implication became that most psychology studies are garbage. So let’s bring those pesky details into the spotlight for once.
Why can’t studies be reproduced?
There are actually many reasons why studies cannot be reproduced and surprisingly, they’re not all “because those studies are crap” (although yes, that is one reason). Based on who is to blame for the failure to reproduce a study, we can classify these reasons into three categories.
1. Problems in the original study
In the first category we include reasons for failure on the side of the original author(s) of the study. These can range from mere incompetence (mistakes in the analysis, inadequate sample sizes, mistakes in the data collection procedure, incomplete descriptions of what was done, etc.) to questionable practices (testing many hypotheses and only reporting the significant ones, or collecting more data after running the analysis until the effect turns significant) to plain fraud (fabricating data and falsifying results).
Fortunately, the latter doesn’t happen very often, and it’s also sometimes easier to spot than the others. The remaining reasons are also serious, but there are fairly straightforward ways to prevent them in future studies (more on that later).
2. Problems in the replicated study
Of course, original authors are not the only ones who can make mistakes. Those who replicate a study could also mess something up. But by far the biggest issue when it comes to replications in psychology and neuroscience is that experimental conditions are often not the same.
To illustrate why that’s important, let’s imagine a simple experiment: we put a cup of water into the freezer and leave it there for a while. When we check again, we see that it’s frozen and conclude that water left in the freezer will turn to ice. Now, we want to replicate this experiment with another cup of water. We put it in the freezer, leave it there for a while, check again…and, lo and behold, the water isn’t frozen this time. What’s the conclusion? Is it that the first experiment was wrong? Since we know that water freezes at sub-zero temperatures, it’s easy to answer the question: we messed something up the second time. Maybe we didn’t leave it there for long enough or maybe someone came and unplugged the freezer.
However, people aren’t water (I know, this is the high-quality content you expect from this page, right?). When it comes to humans, there is a lot more variability that influences experiments. Even if one could somehow manage to include only people of the exact same age, gender, socioeconomic and educational status, brains still differ between people. And even the same person’s performance can vary from day to day. Maybe they slept badly, so they can’t focus that well. Or they had a couple of extra shots of espresso and now they can hear colours and whizz through that reaction time task.
In any case, there are many hidden variables which might influence the results. And some effects observed in psychology and neuroscience could be extremely sensitive even to small variations in conditions. This makes it quite difficult to know why a study could not be replicated.
3. Problems with statistics, life, the universe, and everything
Finally, the last reason is, simply put, pure chance. Whenever a hypothesis is tested, there is always a possibility that the observed results are due to chance alone. In psychology and neuroscience, that risk of error is considered acceptable if it is smaller than 5 in 100 (the infamous p < .05). So even if everything is done perfectly, roughly 5 out of every 100 studies of an effect that doesn’t actually exist will still come out significant, and those results won’t replicate, simply because life and the universe are like that. (If you want to go further down this rabbit hole, this super-cheerful paper from 2005 takes it to the extreme by claiming that most published results from most fields are probably false.)
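If you want to see that 5-in-100 in action, here’s a quick simulation sketch (my own toy illustration, not anything from the paper): we flip 10,000 hypothetical “studies” of an effect that doesn’t exist, where each one has a 5% chance of coming out significant by luck alone.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable (fittingly)

ALPHA = 0.05        # the conventional significance threshold (p < .05)
N_STUDIES = 10_000  # hypothetical studies, all testing a non-existent effect

# Under the null hypothesis, a p-value is uniformly distributed on [0, 1],
# so each study has a 5% chance of landing below ALPHA by chance alone.
false_positives = sum(1 for _ in range(N_STUDIES) if random.random() < ALPHA)

print(f"{false_positives} of {N_STUDIES} null studies came out 'significant'")
print(f"observed false positive rate: {false_positives / N_STUDIES:.3f}")
```

Run it and the observed rate hovers right around 0.05: hundreds of perfectly conducted “studies” that would nonetheless fail to replicate.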
How can reproducibility be improved?
So now that we have a detailed list of reasons why all those studies failed to be reproduced, let’s talk about solutions.
The best solution that came out of the replication crisis so far has been an increased push towards transparency. Researchers are now actively encouraged to take a few steps which increase chances of reproducibility:
- they are encouraged to pre-register their studies. This means that before they start running an experiment, they go on a public website where they write down exactly what they plan to test and how they plan to test it. This minimizes many of the issues we’ve listed under problems in the original study.
- many scientific journals now require that the original data and code are also made available. Therefore, it’s easier for other scientists to check for mistakes and to try to replicate what was done initially.
In addition to that, replication studies are becoming more accepted. Science obviously has a bias towards novelty: no one wants to redo something that has already been done, because we assume it’s already been settled. But as the issue of reproducibility gains momentum, the acceptance of replication studies is growing.
And as more and more replications for a certain study are published, we can also see a glimmer of hope with respect to the stats issue. Basically, one can start looking at the body of evidence instead of a single lonely study and thus have more confidence in the results. In other words, if out of ten papers, nine claim to not have found a significant effect, then it’s probably not there and vice versa.
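To get a feel for that body-of-evidence logic, here’s a toy binomial sketch (a stand-in for a proper meta-analysis, under the simplifying assumption that each study of a non-existent effect has an independent 5% chance of a false positive): it asks how likely a given number of significant results out of ten replications would be if the effect weren’t real.

```python
from math import comb

ALPHA = 0.05  # assumed per-study false positive rate under the null

def prob_at_least_k_significant(n: int, k: int, p: float = ALPHA) -> float:
    """Probability of seeing >= k significant results out of n replications
    if the effect does NOT exist (each study a p-weighted coin flip)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# The post's example: only 1 of 10 replications found an effect.
# That outcome is entirely plausible under pure chance...
print(f"P(>=1 of 10 by chance) = {prob_at_least_k_significant(10, 1):.3f}")

# ...whereas 9 of 10 significant results would be astronomically unlikely
# if there were no real effect behind them.
print(f"P(>=9 of 10 by chance) = {prob_at_least_k_significant(10, 9):.2e}")
```

The first probability comes out around 0.40, so one lone significant result tells us almost nothing; the second is on the order of 10⁻¹¹, which is why a consistent pattern across replications is so much more convincing than any single lonely study.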
In theory, other steps can also be taken. For example, studies could use larger sample sizes. Or they could try to replicate their results in independent samples before publication. But this is highly problematic, because acquiring data from hundreds/thousands of people is very expensive and most labs simply don’t receive that kind of funding. Additionally, it could take several years to do that, but a lot of researchers are on short-term contracts, so it’s not really feasible for them. Plus, one needs to take into account storage and computational capacities, which again comes back to the issue of funding. However, some progress has been made in this direction, as many labs have begun collaborating for larger studies.
Why hasn’t anyone checked this earlier?
Finally, it’s all fine and dandy that we know about these issues and are working hard to address them now. Still, you probably can’t help but wonder: “why only now? What the hell have researchers been doing all this time?”. Well, everyone had kind of suspected this was a problem for a pretty long time. But, as mentioned above, until recently there wasn’t much incentive to replicate other people’s work, let alone publish that replication. What’s more, if a study found no significant effect, it was extremely unlikely to ever see the light of day (the positive result bias).
And scientists who cannot publish their results won’t be scientists for much longer. Successful publications are not only the way through which researchers communicate their findings. They are also the metric through which their job performance is evaluated, kind of like how successful heart surgeries are the performance metric for heart surgeons. Jobs and funding are given out based on the number and quality of publications a scientist has managed to produce. But whether that’s a good metric is a can of worms for a different post.
For now, let’s sum up this post. We’ve seen that reproducibility is indeed a problem in psychology (but also in other fields of science). Yet, the reasons are complex and go beyond the mere “it’s all trash” argument. And scientists are working hard to improve the reproducibility of their findings.
What did you think about this post? Let us know in the comments below.
You might also like:
Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604).
Begley, C. G., & Ellis, L. M. (2012). Raise standards for preclinical cancer research. Nature, 483(7391), 531-533.
Ioannidis, J. P. (2005). Why most published research findings are false. PLoS medicine, 2(8), e124.
Miłkowski, M., Hensel, W. M., & Hohol, M. (2018). Replicability or reproducibility? On the replication crisis in computational neuroscience and sharing only relevant detail. Journal of computational neuroscience, 45(3), 163-172.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251).