Data Cleaning

Data should always be thought of as a combination of signal and noise. Signal refers to the information in the data that we are interested in. Noise refers to aspects of the data that we are not interested in, and which typically make it harder to see and understand the signal. For example, in EEG (electroencephalography), the signal we measure is electrical activity originating in the brain, recorded via electrodes placed on the scalp. EEG contains a variety of types of noise, because those electrodes are not sensitive only to the brain’s electrical activity. They also detect other physiological signals, such as muscle contractions, eye movements, and sometimes heartbeat, as well as electromagnetic interference from the environment, such as electric lights and computer screens. Because the skull is a poor conductor of electricity, EEG signals from the brain are much smaller than these other sources of noise. A neuroscientist’s goal in working with EEG data is typically to identify reliable patterns in the signal (brain activity), in spite of all the noise in the data.
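To make this idea concrete, here is a minimal simulated sketch (using NumPy and SciPy; the frequencies and amplitudes are made up for illustration, not real EEG values) of a weak signal buried in stronger noise, and one common way of recovering it with a band-pass filter:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 250                       # sampling rate (Hz); illustrative value
t = np.arange(0, 2, 1 / fs)    # 2 seconds of samples

# A weak 10 Hz "brain" oscillation (our signal of interest)
signal = 1.0 * np.sin(2 * np.pi * 10 * t)

# Stronger sources of noise: 60 Hz electrical interference plus random noise
line_noise = 5.0 * np.sin(2 * np.pi * 60 * t)
random_noise = np.random.normal(0, 2.0, t.shape)

# What the "electrode" actually records is the sum of all three
recording = signal + line_noise + random_noise

# Band-pass filter 8-12 Hz to attenuate noise outside the signal's frequency range
b, a = butter(4, [8, 12], btype="bandpass", fs=fs)
cleaned = filtfilt(b, a, recording)
```

After filtering, `cleaned` resembles the original 10 Hz signal far more closely than `recording` does, even though the noise was several times larger than the signal. Real EEG cleaning is more involved than a single filter, but the logic is the same: remove what we can characterize as noise so the signal stands out.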

There are other types of noise in many data sets as well. For example, there may be missing data, such as when not all study participants completed the entire experiment, or when some data were accidentally lost. Neuroscience recording equipment, such as electrodes, may malfunction, resulting in artifacts in the data. Data may also be stored in unusual formats, or in different units, which can make them difficult to work with.
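As a rough sketch of how we might check for these kinds of problems in practice, the example below uses pandas on a hypothetical data file (the file name and column names, such as `rt_sec`, are invented for illustration):

```python
import pandas as pd

# Load a hypothetical data set of participants' responses
df = pd.read_csv("experiment_data.csv")

# Count missing values in each column (e.g., from participants
# who did not complete every part of the experiment)
print(df.isnull().sum())

# One option: keep only rows with no missing values
df_complete = df.dropna()

# Fix inconsistent units: e.g., if reaction times were stored in
# seconds but the rest of the data set uses milliseconds
df["rt_ms"] = df["rt_sec"] * 1000
```

Whether to drop, fill in, or otherwise handle missing values depends on the study and should always be a deliberate, reported decision, a point we return to below.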

For all of these reasons, data may need various kinds of preparation or cleaning to strip away noise and make systematic patterns easier to identify. Indeed, in many data science projects, cleaning a data set can take as long as, or longer than, actually running the statistical analyses! But this is time well spent, because the old adage “garbage in, garbage out” holds true in data science: if you don’t clean your data, you will likely find garbage in your results. Data cleaning is a critical part of being a good data scientist, and of getting truthful and reproducible results.

When some people first encounter data cleaning, they raise questions about the validity of such practices, or ask whether it amounts to “cooking” the data (manipulating it to generate the results the researcher desires). There is a huge, and fundamental, difference between manipulating data to generate a specific, predicted result and cleaning data to minimize noise and optimize finding the signal. Approaches to data cleaning should be systematic, well reasoned, and accurately reported. In data science, data cleaning procedures are typically well understood and supported by the peer-reviewed scientific literature. In contrast, manipulating data to achieve specific ends typically involves dishonest practices, such as arbitrarily removing, adding, or modifying data points to create a data set that generates a desired result (often without reporting how or why the data were manipulated). Such practices are not accepted in the scientific community, and people caught manipulating their data are typically publicly discredited, and subsequently find it hard to obtain work or the respect of their peers.