What is Data Science?

Most areas of neuroscience research and development rely on increasingly large and complex data sets. Discovery and application in neuroscience thus rely on the ability to manage these large data sets and extract meaning from them. In other words, neuroscience now relies heavily on data science, which has been variously defined as “…an umbrella term to describe the entire complex and multistep processes used to extract value from data” (Wing, 2019) and as the ability to “bring structure to large quantities of formless data and make analysis possible” (Davenport & Patil, 2012, p. 73).

In neuroscience, data science is an increasingly necessary skill. Data from techniques like single-cell recordings, local field potentials, EEG, and fMRI are complex and multidimensional, and being able to understand, manipulate, and visualize the structure of these data sets is essential to performing the research. On top of this, it is increasingly clear that very large data sets, often built collaboratively by many labs, are required to make reliable inferences about neuroscientific processes. Making inferences also depends on computational models, which are ways of identifying and representing patterns in the data. While some of these will be familiar from statistics class, a broad range of statistical and machine learning models is now used in neuroscience.
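
To make "multidimensional" concrete, here is a minimal sketch of how EEG-like data might be organized as trials by channels by time points, and how averaging over one dimension keeps the structure of the others. The numbers and the `data[trial][channel][time]` layout are invented for illustration. In practice an array library such as NumPy would typically be used rather than nested lists.

```python
from statistics import mean

# Hypothetical EEG-like data, indexed as data[trial][channel][time].
# All values are made up for illustration.
data = [
    [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],  # trial 0: 2 channels x 3 time points
    [[3.0, 2.0, 1.0], [2.0, 1.0, 2.0]],  # trial 1
    [[2.0, 2.0, 2.0], [1.0, 1.0, 1.0]],  # trial 2
]

# The "shape" of the data set: one size per dimension
n_trials = len(data)
n_channels = len(data[0])
n_times = len(data[0][0])

# Average over trials, keeping the channel x time structure
trial_average = [
    [mean(data[tr][ch][t] for tr in range(n_trials)) for t in range(n_times)]
    for ch in range(n_channels)
]

print((n_trials, n_channels, n_times))
print(trial_average)
```

The result has one row per channel and one value per time point, which is the kind of trial-averaged summary often plotted for this sort of data.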

Is data science just a trendy name for statistics?

While data science and statistics are overlapping fields, statistics is generally focused on the specific task of testing hypotheses based on data. Data science more broadly includes the storage, manipulation, visualization, filtering, and preparation of data that is typically required prior to statistical analysis. Data science also encompasses statistics and machine learning. Whereas statistics generally involves deriving conclusions from existing data, machine learning involves making predictions from a data set that will generalize to other data. Since statistics is covered in other courses in the neuroscience and psychology curricula, this course focuses instead on the other “front-end” aspects of data science described above. Other areas of data science, including software development and “back-end” data science (engineering, hardware, databases), will not be covered in detail.
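
As a rough illustration of the filtering and preparation that usually comes before statistical analysis, the sketch below cleans a small, made-up set of reaction-time trials and then summarizes them by condition. The field names, the values, and the 100 to 2000 ms plausibility cutoffs are assumptions chosen for this example, not recommendations.

```python
from statistics import mean

# Hypothetical raw reaction times (ms); None marks a missing trial
raw_trials = [
    {"participant": "p01", "condition": "congruent", "rt_ms": 412},
    {"participant": "p01", "condition": "incongruent", "rt_ms": 498},
    {"participant": "p01", "condition": "congruent", "rt_ms": None},
    {"participant": "p02", "condition": "congruent", "rt_ms": 390},
    {"participant": "p02", "condition": "incongruent", "rt_ms": 4250},
    {"participant": "p02", "condition": "incongruent", "rt_ms": 530},
]


def clean_trials(trials, min_rt=100, max_rt=2000):
    """Drop missing trials and trials outside an assumed plausible range."""
    return [
        t for t in trials
        if t["rt_ms"] is not None and min_rt <= t["rt_ms"] <= max_rt
    ]


def mean_rt_by_condition(trials):
    """Group reaction times by condition and average each group."""
    grouped = {}
    for t in trials:
        grouped.setdefault(t["condition"], []).append(t["rt_ms"])
    return {condition: mean(rts) for condition, rts in grouped.items()}


cleaned = clean_trials(raw_trials)
summary = mean_rt_by_condition(cleaned)
print(len(raw_trials), "raw trials,", len(cleaned), "after cleaning")
print(summary)
```

Only after steps like these would the per-condition values be passed to a statistical test.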

This difference in scope reflects a difference in mindset between data science and the basic statistics taught in undergraduate psychology and neuroscience curricula: data science includes practices that are more exploratory. In experimentally-oriented disciplines such as psychology and neuroscience, statistics are a natural approach to deriving meaning from data. This is because data typically come from experiments, in which the researcher systematically and intentionally manipulates certain variables. A good experiment is hypothesis-driven, meaning that the researcher has predictions in advance as to how the data will systematically vary with the experimental manipulations. These predictions are usually based on past experimental findings, or models of the process being studied.

Statistics are fundamentally embedded in data science (indeed, the concept of “data science” as a discipline emerged from the field of statistics), but data science can be thought of as a larger set of practices that includes statistics, machine learning, data cleaning and transformations, and visualization. Many of these approaches are more exploratory than hypothesis-driven. That is, rather than looking for a specific, predicted pattern, the data scientist explores the data to find systematic patterns that may emerge from it. For instance, researchers using techniques like fMRI have attempted to “decode” specific patterns of brain activity, such as what picture a person is viewing. These lines of research explore a variety of ways to transform the data, and a variety of machine learning approaches to make predictions about what the person is seeing. The goal is to identify the data processing pipeline that makes the most accurate predictions.
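
The idea of comparing data processing pipelines by their predictive accuracy can be sketched with a toy example. Everything here is an assumption made for illustration: the simulated two-feature "brain patterns", the `face` and `house` labels, the z-scoring transform, and the nearest-centroid classifier are simple stand-ins, not the methods of any particular decoding study.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def simulate_trial(label):
    """Two made-up features: one weakly informative, one large pure noise."""
    informative = rng.gauss(1.0 if label == "face" else -1.0, 1.0)
    noisy = rng.gauss(0.0, 20.0)
    return [informative, noisy]


labels = ["face", "house"] * 100
data = [(simulate_trial(label), label) for label in labels]
train, test = data[:150], data[150:]  # held-out trials test generalization


def identity(train_X):
    """Pipeline A: leave the features as they are."""
    return lambda x: x


def zscore(train_X):
    """Pipeline B: rescale features using training-set mean and SD only."""
    columns = list(zip(*train_X))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]
    return lambda x: [(v - m) / s for v, m, s in zip(x, mus, sds)]


def held_out_accuracy(make_transform):
    """Fit a nearest-centroid classifier on train, score it on test."""
    transform = make_transform([x for x, _ in train])
    centroids = {}
    for label in ("face", "house"):
        rows = [transform(x) for x, y in train if y == label]
        centroids[label] = [mean(c) for c in zip(*rows)]

    def predict(x):
        tx = transform(x)
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(tx, centroids[lab])),
        )

    return sum(predict(x) == y for x, y in test) / len(test)


pipelines = {"raw": identity, "z-scored": zscore}
results = {name: held_out_accuracy(make) for name, make in pipelines.items()}
best = max(results, key=results.get)
print(results, "-> best:", best)
```

Accuracy is measured on trials the classifier never saw during fitting, which is what distinguishes prediction that generalizes from a description of existing data. In real work, pipelines are usually compared with cross-validation, and a separate final test set is kept aside so that the act of choosing a pipeline does not inflate the reported accuracy.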

Tools for Data Science

Central to data science is the ability to use scientific programming languages, such as Python, Matlab, and R. This ability includes a strong understanding of the fundamentals of at least one programming language, and the ability to extend one’s knowledge through continuous learning and problem-solving. This course teaches Python, a mature language that is widely used in neuroscience and in data science more broadly. However, many of the fundamentals of scientific programming and data science are common to many languages. Thus, having learned Python, you will be better prepared to learn new languages in the future, as necessary.

Another important facet of data science is that it is a team endeavour. First, it is founded on open, shared software developed by widely distributed teams of contributors. Second, the practice of data science typically involves teams of individuals with complementary skillsets, due to both the size and the complexity of many projects. In science, these teams often comprise students and faculty members in collaborating labs distributed around the world. Team members with different skillsets can also teach each other new things, often through demonstration in a shared project. This class prepares you for such collaboration by developing and coaching your teamwork skills, as well as teaching you how to use software platforms that support such collaboration.

The skills learned in this class will benefit students working in a wide range of areas of neuroscience. As well, the class will provide an introductory foundation in data science that can be applied to a range of areas beyond neuroscience, in academia, industry, and government.