Related to, but distinct from, the issue of impact factors is that most scientific journals have “novelty” as a publication criterion. In other words, a paper that presents a replication of previously published work is unpublishable in the vast majority of journals. This is a serious problem that has contributed to a replication crisis in many fields, including psychology. The replication crisis is, simply put, that many published results cannot be replicated. However, this information is often not available to the scientific community, because failed replications are rarely published, while results published in peer-reviewed journals are considered the gold standard of “truth”. In fact, by one line of argument (and the title of a now-famous paper), “most published research findings are false” (Ioannidis, 2005). The argument runs as follows: chasing impact factors incentivizes journals and authors to publish results that are novel and surprising, but a result is generally surprising precisely because it goes against what is already known. While some such results are true, the odds of this are lower than for confirmatory findings. Moreover, once such a result is published, it is hard for anyone else to publish a replication of the methods, and harder still if the replication findings are not consistent with the published, peer-reviewed results. This is not to say that the published results in question are fraudulent, or that the authors had any ill intent in publishing them. Most scientific results are based on statistics, which rely on samples of a much larger population and estimates of the probability that the result is due to chance. This carries an inherent risk that any particular result is due to chance, and the only real way to assess this is to replicate the experiment, ideally more than once. 
However, given the time and cost of research, as well as the dependence of scientific careers on publication (especially high-impact publication), scientists are much more likely to publish a result sooner, rather than wait for replication (if they can even afford to, either financially or career-wise).
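
The claim that surprising results are less likely to be true can be made concrete with a little arithmetic. The sketch below computes the probability that a statistically significant result reflects a real effect, given how plausible the hypothesis was beforehand. The function name and the specific values (a power of 0.8, a significance threshold of 0.05, and the two prior probabilities) are illustrative assumptions, not figures taken from any particular study.

```python
def positive_predictive_value(prior, power=0.8, alpha=0.05):
    """Probability that a significant result is a true effect.

    prior: assumed probability that the hypothesis is true before the study
    power: assumed probability of detecting the effect if it is real
    alpha: significance threshold (false positive rate if there is no effect)
    """
    true_positives = power * prior          # real effects that reach significance
    false_positives = alpha * (1 - prior)   # null effects that reach significance
    return true_positives / (true_positives + false_positives)


# A hypothesis that fits existing knowledge (assumed 50% likely a priori)
print(round(positive_predictive_value(prior=0.50), 2))

# A surprising hypothesis (assumed 5% likely a priori)
print(round(positive_predictive_value(prior=0.05), 2))
```

Under these assumptions, a significant result for the plausible hypothesis is very likely to be real, whereas a significant result for the surprising hypothesis is more likely to be a false positive than a true finding, even though both studies used the same threshold and had the same power.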

Another contributor to the replicability crisis is behaviour by scientists that was widely condoned in the past, but that is questionable upon reflection. One such behaviour is HARKing (Kerr [Ker98]), or hypothesizing after the results are known. This entails analyzing data and then writing the results up with a hypothesis (a prediction about the outcome of the experiment) that is based on the results, rather than one that was created a priori, before the study was conducted. Statistical testing is often described as hypothesis testing, because the interpretation of statistical results relies on estimates of the probability that a result could be obtained by chance. If the statistical result is consistent with an a priori theory, this provides additional reassurance that the result is correct, because it is consistent with a theoretical prediction. On the other hand (as noted above), surprising research results are more likely to be false, for example the result of a Type I (false positive) error. Moreover, scientific hypotheses should be disconfirmable, but it is impossible to disconfirm a prediction that you only make after you have found the result (without doing more experiments). This is not to say there is no value in following up on unexpected results, or in conducting exploratory research. However, it is critical to distinguish between results that you genuinely predicted before conducting the experiment, and those obtained post hoc.

Another questionable practice is p-hacking (Head, Holman, Lanfear, Kahn, and Jennions [HHL+15]). This is the practice of iteratively collecting and analyzing data until a specific, desired result is obtained, and then terminating data collection. This practice is problematic because, again, the p values used to determine statistical significance reflect probability, not ultimate truth, and so there is always some probability of error. A properly designed study will include a power analysis prior to starting, which uses the expected size of the experimental effect, and the variance among samples, to determine how large a sample (e.g., number of participants) to collect. If the estimates of effect size and variance are accurate, then a significant p value from a study that used the estimated sample size can be taken to be probably true. However, p-hacking makes false positive results far more likely, because each additional look at the data is another chance for a significant result to arise by chance alone. One is essentially playing the lottery until one gets a “big win”, and then quitting. This is fine (and probably advisable) when playing roulette in Las Vegas, but it is frankly dishonest scientific practice.
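
The cost of repeatedly checking the data can be seen in a small simulation. The sketch below generates data in which there is truly no effect, and compares two strategies: testing once at a pre-planned sample size, versus testing after every batch of participants and stopping as soon as p < .05. Everything here is an illustrative assumption: the function names, the batch size of 10, the maximum of 100 participants, and the use of a simple z-test with a known standard deviation of 1 (chosen only so that the example needs nothing beyond the Python standard library).

```python
import math
import random


def two_sided_p(sample):
    """Two-sided z-test p value for a true mean of 0, assuming a known SD of 1."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(peek, n_min=10, n_max=100, step=10,
                        alpha=0.05, n_sims=2000, seed=1):
    """Fraction of simulated null experiments that end with p < alpha.

    peek=False: test once, at the planned sample size n_max.
    peek=True:  test after every batch and stop at the first p < alpha.
    """
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_sims):
        data = [rng.gauss(0, 1) for _ in range(n_min)]  # no real effect
        while True:
            at_planned_n = len(data) >= n_max
            if (peek or at_planned_n) and two_sided_p(data) < alpha:
                significant += 1
                break
            if at_planned_n:
                break
            data.extend(rng.gauss(0, 1) for _ in range(step))
    return significant / n_sims


print("Test once at planned n:  ", false_positive_rate(peek=False))
print("Test after every batch:  ", false_positive_rate(peek=True))
```

With a single planned test, the false positive rate should come out close to the nominal 5%. With ten looks at the data, it should be several times higher, even though every individual test used the same .05 threshold.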