# Data

“Data” are, ultimately, information. My favourite definition of information is “a difference that makes a difference” [(Bateson, 1972)](https://en.wikipedia.org/wiki/Steps_to_an_Ecology_of_Mind). If we measure something, and it’s always the same, then there is no information. Difference is the heart of information. But differences can be meaningful in some way (signal), or meaningless, such as variation due to random factors (noise). Information consists of those differences that mean something — that make a difference. Meaning can be derived in many ways, such as through visualizations (graphing), statistical analysis, or making predictions about future data. These are all ways of identifying systematic patterns in data.
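To make the signal/noise distinction concrete, here is a minimal sketch in Python (the values and variable names are hypothetical): we simulate measurements as a systematic signal plus random noise, and see that averaging across many measurements recovers the signal.

```python
import numpy as np

rng = np.random.default_rng(42)
signal = 100.0                        # the systematic, meaningful value
noise = rng.normal(0, 15, size=1000)  # random, meaningless variation
measurements = signal + noise         # what we would actually record

# No single measurement equals the signal, but averaging cancels out the
# noise, revealing the systematic pattern: a difference that makes a difference.
print(measurements.mean())            # close to 100
```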

**Note**

You may have noticed that in the first sentence of the opening paragraph, I wrote “data are”; I treat data as a plural count noun (like “sandwiches”), not a mass noun (like “sand”). The singular of data is datum, which would be the smallest single unit of data. But we don’t use that word much.

In neuroscience, psychology, and other fields, data comprise all the information you might collect in a study. The types of data collected in neuroscience research span such a wide range that it would be impossible to document them concisely here. To give a sense of the breadth of what we might call “data”, we can first consider that data are typically collected from individual “units”, be they humans, non-human animals, cells, etc. Each of these units typically contributes many data points, including things such as age, sex, handedness, genotype, cell type, early experience (e.g., monocular deprivation), responses to questionnaires, behavioural responses (e.g., reaction times, button or lever presses, maze running patterns or times), electrophysiological measurements, images, sound files, or movies of behaviour. We call this organization of data within such units nested data; for instance, age is nested within each animal, and animals in turn may be nested within groups (e.g., different genetic lines), as in the sketch below.
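Here is a minimal sketch of nested data in Python, using the pandas library (the groups, animal IDs, and measurements are hypothetical):

```python
import pandas as pd

# Each row is one measurement, nested within an animal, which is in
# turn nested within a group (here, a hypothetical genetic line).
df = pd.DataFrame({
    "genetic_line": ["WT", "WT", "WT", "KO", "KO", "KO"],
    "animal_id":    ["a1", "a1", "a2", "b1", "b1", "b2"],
    "age_days":     [30,   30,   31,   30,   30,   32],
    "trial":        [1,    2,    1,    1,    2,    1],
    "latency_s":    [12.3, 10.8, 15.1, 22.4, 19.7, 25.0],
})

# Unit-level attributes (like age) repeat across rows for the same animal,
# reflecting the nesting of measurements within animals within groups.
print(df.groupby(["genetic_line", "animal_id"])["latency_s"].mean())
```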

For the purposes of this class we assume that the data are collected in some way (measured), and stored digitally in files, typically as numbers or characters (words, sentences, etc.) — remembering that even digital media like movies or sounds are, ultimately, files composed of numbers. Data should be, however, more than just a bunch of numbers. Ultimately, data science is about identifying meaningful patterns in data. This informs our definition of “data” — data are sets of measurements from which we aim to extract meaning. As such, we assume that the data have been collected in such a way that meaning can be derived from them.
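To see that any digital file is, in the end, just numbers, we can read its raw bytes directly in Python (the file name here is hypothetical; any file on your computer would do):

```python
# Read the first ten bytes of a (hypothetical) sound file as integers.
with open("recording.wav", "rb") as f:
    raw = f.read(10)

print(list(raw))  # e.g., [82, 73, 70, 70, ...]: each byte is a number from 0 to 255
```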

In disciplines such as psychology and neuroscience, data are most typically collected in experiments, in which variables are systematically manipulated by the researcher. Experiments, by definition, presuppose particular meaning in the data — the experimental designs, and types of measurements, are chosen specifically with particular patterns of results in mind (hypotheses), which are in turn derived from theories. These theories provide ways of interpreting the data: of giving them meaning.

However, lots of data don’t come from experiments at all. For example, surveys simply ask lots of questions, and the data analysis usually looks for relationships between answers to different questions (correlations). Even in the context of an experiment, a number of measurements may be taken that aren’t expected to be directly affected by the experimental manipulations, but might help explain differences between individuals. For example, in psycholinguistics, a number of studies have shown differences in how people interpret ambiguous sentences, depending on their working memory capacity. In these studies, participants were not selected based on their working memory capacity; rather, this capacity was measured in each individual, while the experimental manipulation was the ambiguity of the sentences.

In some cases, data collection has a strong, or even exclusive, exploratory component: many measures are obtained without specific hypotheses concerning how they will affect other measurements, but with the more general hypothesis that some systematic and meaningful relationships can be identified amongst the measures taken. Indeed, such approaches are central when we move from a mindset of statistical analysis of experimental data to machine learning, classification, and prediction. For example, if we want to build a model that predicts whether a person will develop a particular disease or not, or how an already-diagnosed disease will progress, we will likely want to measure a wide range of variables that might be predictive, so that we can identify the optimal combination of variables that leads to the most accurate predictions, as in the sketch below.
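As a minimal sketch of this predictive mindset (in Python with scikit-learn, using simulated data standing in for real measurements), we can fit a model on many candidate predictor variables and estimate how accurately it predicts an outcome:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Simulated stand-in data: 200 people, 20 measured variables each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
# Hypothetical disease status, driven by a combination of two variables
# plus noise; with real data we wouldn't know in advance which variables matter.
y = (X[:, 0] + X[:, 3] + rng.normal(size=200) > 0).astype(int)

# Cross-validation estimates how well the fitted model predicts
# disease status in people it has never seen.
model = LogisticRegression(max_iter=1000)
print(cross_val_score(model, X, y, cv=5).mean())
```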

In doing data science, we focus less on the process of acquiring the data, and more on what we do after they’re collected. That said, it is critically important to understand what your data are — which includes what was being measured, and how it was measured. Often we might also care about why the data were measured, but not necessarily; increasingly, researchers are making data sets openly available (e.g., through public repositories such as OSF.org), not only for purposes of transparency, but with the expectation that other researchers might be able to use the data in ways other than originally intended, to generate new insights.

Our approaches to working with the timing of action potentials of single neurons, Morris water maze behaviour in rats, human reaction times, and functional MRI data will necessarily be different, according to our understanding of what was measured, the underlying physiological properties, and our goals — the meaning we are trying to derive from the data. At the same time, all of these are ultimately measurements stored in files on a computer, and data science is about learning the core skills that allow you to work with any of these types of data and try to find meaning in them.

One final thought on the definition of data: since data are a combination of signal and noise, they may need various kinds of preparation or “cleaning” to strip away noise and more easily reveal systematic patterns. When people first encounter such practices, they sometimes question their validity, or ask whether they amount to “cooking” the data — manipulating them to generate the results the researcher desires. There is a huge, and fundamental, difference between manipulating data to generate a specific, predicted result, and cleaning data to minimize noise and optimize finding the signal. Approaches to data cleaning should be systematic, well-reasoned, and accurately reported; in data science, cleaning procedures are typically well-understood and supported by the peer-reviewed scientific literature. In contrast, manipulating the data to achieve specific ends typically involves dishonest practices such as arbitrarily removing or adding data to create a data set that generates specific results (often without reporting how or why the data were thus manipulated). Dishonest practices are not accepted in the scientific community, and people caught manipulating their data are typically publicly discredited (or at least privately discredited, within social networks), and subsequently find it hard to obtain work or the respect of their peers.
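As a hypothetical illustration of the difference, here is a minimal Python sketch of systematic, transparent cleaning: reaction times outside a pre-specified plausible range are excluded, and the exclusions are reported exactly.

```python
import numpy as np

# Hypothetical reaction times (seconds), including implausible values:
# anticipatory responses (too fast) and lapses of attention (too slow).
rt = np.array([0.45, 0.52, 0.05, 0.61, 3.90, 0.48, 0.55])

# Pre-specified, well-reasoned criteria: keep RTs between 0.2 and 2.0 s.
lower, upper = 0.2, 2.0
keep = (rt >= lower) & (rt <= upper)

# Report exactly what was removed, so the procedure is transparent.
print(f"Excluded {np.sum(~keep)} of {rt.size} trials outside "
      f"[{lower}, {upper}] s; mean clean RT = {rt[keep].mean():.3f} s")
```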