# Open Methods and Data

You have likely been taught that the Methods section of a published journal article should describe the methods in sufficient detail that anyone reading it could replicate the study. In practice, this is rarely entirely true. Methods sections gloss over many details, whether because of assumptions about what the reader already knows or to save space. Many journals limit the length of articles, and authors are usually more concerned with presenting and discussing results and theory than with the methods. As well, an overly detailed Methods section can be tedious to read. Indeed, as you will discover in this course, data analysis often involves extremely long and complex computer code, and the “devil is in the details”. Publishing thousands of lines of Python code in a journal article is not feasible.

At the same time, a lack of transparency in methods has contributed to the replicability crisis, and has arguably held back scientific progress in cases where researchers cannot access enough of the methods to perform a replication, or even determine what factors may have contributed to a result. For instance, many studies in psychology and neuroscience rely on presenting stimuli to participants. It is relatively rare for researchers to publish their entire stimulus set (as opposed to one or two examples), and it is not uncommon for old stimuli to simply be lost, so that even if researchers later wanted to share them with a colleague, they could not. Likewise, because data analysis code is complex, there is always the possibility that it contains errors; if the actual code used to obtain a result is not available, no one can audit it to catch those errors.

For these reasons, it is increasingly common for researchers to provide greater transparency in their practice by making all stimuli, analysis code, and data publicly available. The Open Science Framework (OSF) platform mentioned earlier is one that supports this practice. In addition to pre-registering their methods, researchers can post all of their stimuli, analysis code, and data. Platforms like OSF can also serve as a preprint archive, hosting a copy of the resulting manuscript. As with pre-registered methods, any material placed on OSF can be embargoed, so that researchers have control over when it becomes available to the public.
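To make this concrete, here is a minimal sketch (not part of the original text) of what uploading a data file to an OSF project programmatically might look like, using the third-party osfclient package. The access token, project ID, and file paths are placeholders, and the OSF web interface works just as well for this purpose.

```python
# Hypothetical example: uploading a data file to an OSF project with osfclient.
# The token, project ID, and file paths below are placeholders.
from osfclient import OSF

osf = OSF(token="YOUR_PERSONAL_ACCESS_TOKEN")  # generated in your OSF account settings
project = osf.project("abc12")                 # 5-character ID from the project's URL

# "osfstorage" is the default storage provider attached to every OSF project
storage = project.storage("osfstorage")

# Upload a local CSV file into a "data" folder within the project's storage
with open("clean_data.csv", "rb") as fp:
    storage.create_file("data/clean_data.csv", fp)
```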

Increasingly, journals are requiring that researchers make at least their data, if not their analysis code and stimuli, publicly available as a criterion for publication. Beyond transparency and accountability, these open practices can benefit the scientific community in other ways. For example, a given data set may contain interesting information that was not the focus of the analyses performed by the researchers who collected it. By making data publicly available, other researchers can access it and perform different analyses, or meta-analyses (analyses of results over multiple studies). Given that most research is funded by taxpayer money, this also increases the value of the public investment in science. Another benefit is education. By examining the analysis code from a published study, other researchers may learn new ways of doing things that benefit their own research. As well, open datasets can be used as sample data in educational settings (such as this course), which allows students to practice data analysis, often with the added benefit of being able to check whether their results are correct by comparing them to the published results.

## The Ethics of Open Science

It is worth considering, however, that there are also ethical limits on what scientific data can be shared, and how. For example, as detailed in an editorial in Nature, the development of facial recognition software using artificial intelligence has been fraught with ethical questions. In some cases, researchers obtained photos of faces from public sources (like campus webcam feeds, or Flickr pages), typically without the consent of the people in the photos. Some researchers went further and published databases of such photos online in the spirit of open science, since these were the data used to train their facial recognition algorithms. But in such cases, permission to re-share those photos was never given. Moreover, some databases of face photos have been used to train algorithms to perform questionable if not frankly unethical tasks, such as identifying members of specific ethnic minorities or attempting to predict criminal behavior.

Even in more benign settings, such as sharing neuroimaging data, ethical questions must be considered. Data should not be shared unless the people who provided it explicitly consented to its being made public. Steps must also be taken to ensure that individuals are not identifiable from their data. For example, structural MRI scans of the brain also include the outside of the head, and it is fairly easy to reconstruct a recognizable image of an individual’s face from such a scan. Therefore, researchers sharing datasets from MRI studies must apply a de-facing algorithm to each scan to ensure that the individual is not recognizable from the shared data.
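One widely used de-facing tool is pydeface, which registers a face-mask template to the scan and zeroes out the voxels covering the face. Below is a minimal sketch (not from the original text) of running it from Python; it assumes pydeface and its dependency FSL are installed, and the file name is a hypothetical example.

```python
# Hypothetical example: de-facing an anatomical MRI scan with the pydeface
# command-line tool (requires FSL). The file name is a placeholder.
import subprocess

# By default, pydeface leaves the original scan untouched and writes the
# de-faced image to a new file (here, "sub-01_T1w_defaced.nii.gz").
subprocess.run(["pydeface", "sub-01_T1w.nii.gz"], check=True)
```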