Exploring MNE’s data structure#
Learning Objectives#
Import raw EEG data using MNE
Examine the structure of an MNE Raw object
Access and view different attributes of a Raw data object
Visualize the positions of EEG electrodes
Introduction#
In this lesson, we will load a raw EEG data file using MNE and gain an understanding of how MNE stores data. This is an important first step to working with data in MNE.
Load Packages#
import mne
mne.set_log_level('error')
import numpy as np
import matplotlib.pyplot as plt
Tip
In the above cell, we ran a command after importing mne: mne.set_log_level('error'). If you don’t run this when using MNE, you will find that MNE produces a lot of output, often including some concerning-looking Python warning messages. Most of these warnings can safely be ignored. I generally find this output more confusing than helpful, and turn it off routinely. For our lessons we will keep the output off because it just creates confusion. Many MNE commands have a verbose kwarg that allows you to change the amount of output a particular command generates. This can be useful if you want more information about what a command is doing; for example, some functions take a long time to run and you can increase the verbosity to see how far along the function is.
Import raw data#
For EEG, raw data is typically stored in one or more files containing continuous EEG data. There are many different proprietary file formats for EEG data; most of these are developed by the manufacturer of a particular EEG system, since an integral part of an EEG system is software to save the data for later analysis. Fortunately, MNE provides functions to import data from most common EEG systems.
In the present case, we are working with data from a system sold by Brain Products, whose software is called Brain Vision. So we will use MNE’s read_raw_brainvision() function. The Brain Vision format actually specifies three files for any given data set. The three files are all created at the time that the EEG data is acquired, and it’s important to ensure that all three are copied when moving them around. The files are as follows:
*.eeg is the actual EEG data (electrical potential measurements for all electrodes, at all time points). This is stored in a compressed (binary) format.
*.vmrk is a text file containing all of the markers (trigger codes) that were sent to the EEG system by the stimulus computer during data collection. These typically encode the onset of various stimuli and the times of any responses by the participant, as well as what each stimulus/response was.
*.vhdr is a text file with header information, or metadata about the data set, including technical details such as the sampling rate, and the settings of the EEG amplifier during the recording. The .vhdr file is particularly important because it specifies the file names of the .eeg and .vmrk files (although they should always be named consistently with the header file).
When we run read_raw_brainvision(), the file we will actually pass as an argument is the .vhdr file. Since it specifies the names of the associated .eeg and .vmrk files, MNE can find the data and markers from this file.
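To make the role of the header file concrete, here is a minimal sketch that reads the file names out of a header using only Python's standard library. The header text below is a hypothetical, heavily trimmed example written for illustration (real .vhdr files contain more sections and entries), and this is not how MNE itself parses the file; it only shows that the header is plain text that names its companion files.

```python
import configparser

# Hypothetical, heavily trimmed header text; a real .vhdr file has more
# sections and entries than shown here.
VHDR_TEXT = """Brain Vision Data Exchange Header File Version 1.0
; Comment lines in the header start with a semicolon

[Common Infos]
DataFile=sub-001.eeg
MarkerFile=sub-001.vmrk
DataFormat=BINARY
NumberOfChannels=16
SamplingInterval=2000
"""

# The first line is a version banner, not an INI-style entry, so skip it
banner, _, body = VHDR_TEXT.partition("\n")

parser = configparser.ConfigParser()
parser.read_string(body)
common = parser["Common Infos"]

data_file = common["DataFile"]
marker_file = common["MarkerFile"]
# SamplingInterval is given in microseconds, so convert it to Hz
sampling_rate = 1_000_000 / float(common["SamplingInterval"])

print(data_file, marker_file, sampling_rate)
```

If the .eeg or .vmrk file were renamed without updating these entries, a reader that relies on the header would no longer find them, which is why the three files should travel together.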
As described previously, many of MNE’s functions are nested inside subfolders (modules); for example, all of the file input-output (I/O) routines are in the module mne.io. So when we call the read_raw_brainvision() function, we have to do so as mne.io.read_raw_brainvision(). The one required argument for this function is the file name, and that’s all we need for now:
raw = mne.io.read_raw_brainvision('data/sub-001/sub-001.vhdr')
View raw data attributes#
We now have our raw EEG data represented in Python as an MNE Raw object. We can view its basic information by asking for the .info property:
raw.info
Although some of the information is not available, this shows us a few important things about our data set, including that there are 16 EEG channels (each channel contains the data from one EEG electrode), that the sampling rate is 500 Hz (i.e., EEG data were sampled 500 times per second, so we have a data point every 2 ms), and that the data were filtered during data collection between 0.01–80 Hz (more on filtering below).
The information provided above is actually a subset of the information stored in the Raw object’s info. We can use the .keys() method to see all of the fields available in info:
raw.info.keys()
We can access the values stored with any of these info attributes by putting it in square brackets, like this command to access the names of all of the channels (electrodes) in the data set:
raw.info['ch_names']
The info is a snapshot of a few pieces of information that researchers may want to know about a data file, but it’s far less than all the information stored in the MNE Raw object. To see the entire contents of the object, we can ask for its __dict__ (note that there are two underscores before dict and two after). Recall that Raw is a Python class, and any instance of the Raw class (such as our raw data here) is a Python object. Most Python objects have an attribute __dict__, and this attribute contains a dictionary of all of the object’s attributes, with keys being the attribute names, and values being the information stored for that attribute.
Because it generates a lot of output, we won’t show the entire __dict__ here, but you can try it yourself. You will see that there is a lot of information stored in the Raw object, including the sampling rate, the names of the channels, and much more.
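As a small illustration of what __dict__ holds, the sketch below uses a made-up class, Recording, which is a hypothetical stand-in and not part of MNE. It shows that the dictionary's keys are simply the names of the attributes set on the instance.

```python
class Recording:
    """Hypothetical stand-in for a data object; not part of MNE."""

    def __init__(self, sfreq, ch_names):
        self.sfreq = sfreq
        self.ch_names = ch_names
        self.preload = False


rec = Recording(sfreq=500.0, ch_names=["Fz", "Cz", "Oz"])

# __dict__ maps each instance attribute name to its current value
for name, value in rec.__dict__.items():
    print(f"{name}: {value}")

# The keys alone give a quick overview of what the object stores
attribute_names = sorted(rec.__dict__)
```

With a real Raw object, raw.__dict__.keys() gives the same kind of overview without printing every value.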
Where’s the data?#
For all the information that is in the output above, even if you run it yourself and look through all the output, you would find that there is no actual EEG data! This is because, by default, MNE’s routines to read raw data read the data file’s header into memory, but not the data itself. The reason for this is that data files are often quite large, so reading them in sometimes takes a noticeable amount of time and consumes significant memory. The header of a raw EEG data file contains its metadata (information about the file), as you saw above in printing out the __dict__; this is often enough for MNE to perform some operations on the data. At some point, if you want to perform operations on the EEG data, then MNE will automatically load in the data to work on it.
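The idea of reading the header first and the data later can be sketched with a toy class. Everything here (the class name LazyRecording, its attributes, and the zero-filled data) is hypothetical and only illustrates the pattern; it is not MNE's implementation.

```python
class LazyRecording:
    """Hypothetical illustration of header-first loading; not MNE code."""

    def __init__(self, n_channels, n_times):
        # "Header" information is cheap, so it is stored right away
        self.n_channels = n_channels
        self.n_times = n_times
        self.preload = False
        self._data = None  # the large array is not created yet

    def load_data(self):
        # Stand-in for reading the binary file: build a channels x times grid
        if not self.preload:
            self._data = [[0.0] * self.n_times for _ in range(self.n_channels)]
            self.preload = True
        return self


rec = LazyRecording(n_channels=16, n_times=1000)
loaded_before = rec.preload  # metadata is available, data is not
rec.load_data()
loaded_after = rec.preload
shape = (len(rec._data), len(rec._data[0]))
print(loaded_before, loaded_after, shape)
```

The design choice is a trade-off: creating the object is fast and uses little memory, at the cost of a delay the first time the data is actually needed.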
The fact that the data is not loaded is indicated in the .__dict__ output above by the first dictionary entry: preload: False. Alternatively to loading the data later, when we originally read the raw data file we could have told MNE to load the data at that point, by passing preload=True to read_raw_brainvision(). Since we did not do that, we ask MNE to load the data now using the .load_data() method:
raw.load_data()
Now raw has a _data attribute, containing a NumPy array:
print(type(raw._data))
MNE represents the data as a 2D array with a row for each channel, and columns for time points. If we ask for its shape attribute, we see that our raw data comprises 16 channels and 408,640 time points:
print(raw._data.shape)
One challenge in working with the raw data directly in this way is that consecutive time points are not necessarily 1 ms apart, so a sample index is not the same thing as a time in milliseconds. When EEG data is recorded, the continuous signal is sampled (measured) at a particular rate, called the sampling frequency. This is the number of measurements per second. We can see the sampling rate of an MNE dataset by accessing the sfreq entry of the .info attribute:
raw.info['sfreq']
So our data were sampled at 500 Hz, meaning that there are 500 measurements per second. This means that consecutive measurements are 1/500th of a second, or 2 ms, apart. So we can derive the length of the EEG recording, in seconds, by dividing the number of time points by the sampling rate:
scan_durn = raw._data.shape[1] / raw.info['sfreq']
print('Duration of EEG recording = ', scan_durn, 's, or', round(scan_durn / 60, 2), 'min.')
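As a worked check of this arithmetic, the sketch below plugs in the numbers quoted in the text (408,640 time points sampled at 500 Hz) rather than reading them from a Raw object, so it runs without the data file.

```python
n_channels, n_times = 16, 408_640  # shape of the data array, from the text
sfreq = 500.0                      # sampling rate in Hz, from the text

sample_interval_ms = 1000 / sfreq  # time between consecutive samples
duration_s = n_times / sfreq       # recording length in seconds
duration_min = round(duration_s / 60, 2)

print(sample_interval_ms, duration_s, duration_min)
```

So a recording of this size lasts roughly 817 s, or a little under 14 minutes.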
Accessing subsets of data#
Since the data is stored as a NumPy array, we can access specific data points using standard indexing. For example, if we wanted to select all of the data from only the channel at index 15 (the 16th and last channel, since Python indexing starts at 0), we would use:
print(raw._data[15, :])
If we wanted to slice in time, we would need to consider the sampling frequency. For example, if we wanted the data from channel 15, but only from 1000 - 1100 ms, we would need to convert those times to samples, and then use those indices to slice the data. We can do this by multiplying the times (in seconds) by the sampling rate, and then converting the result to an integer (since the result of the multiplication will be a float), as shown below. While this is not necessarily difficult, it’s a bit tricky. It would be nice if we could work with our data in terms of units of time, and not samples.
start_time = int(1.0 * raw.info['sfreq'])
end_time = int(1.1 * raw.info['sfreq'])
chan = 15 # set channel we want
raw._data[chan, start_time:end_time]
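One reason this conversion is a bit tricky is floating-point arithmetic: int() truncates, so a product that lands just below a whole number is rounded down. The helper below, time_to_sample, is a hypothetical name introduced for this sketch; it rounds to the nearest sample instead. The values at 500 Hz are unaffected, but the second example (a made-up 100 Hz rate) shows where truncation and rounding disagree.

```python
def time_to_sample(t_seconds, sfreq):
    """Convert a time in seconds to the nearest sample index."""
    return int(round(t_seconds * sfreq))


sfreq = 500.0
start = time_to_sample(1.0, sfreq)
stop = time_to_sample(1.1, sfreq)
n_samples = stop - start

# Why rounding matters: floating-point products can land just below an integer
truncated = int(0.29 * 100)          # truncation drops the fraction
rounded = time_to_sample(0.29, 100)  # rounding picks the nearest index
print(start, stop, n_samples, truncated, rounded)
```

MNE's Raw objects also provide a time_as_index() method for this kind of conversion, which avoids writing the arithmetic by hand.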
Fortunately, MNE provides a way to do this. The .get_data() method of the Raw object allows us to specify the time range we want, in seconds, and it will return the data in that time range. We can use the tmin and tmax kwargs to specify the time range, and the picks kwarg to specify the channels using their 10-10 system labels, rather than needing to know which row in the data array corresponds to which channel.
For example, to get the data from channel Oz, from 1000 - 1100 ms, we can use:
raw.get_data(picks='Oz', tmin=1.0, tmax=1.1)
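For comparison, here is roughly what selecting by label and by time involves when done by hand. This sketch uses plain Python lists and a hypothetical four-channel layout (the channel order and the toy values are made up), so it says nothing about the exact array that .get_data() returns; it only shows the two lookups that the picks, tmin, and tmax kwargs take care of.

```python
ch_names = ["Fz", "Cz", "Pz", "Oz"]  # hypothetical channel order
sfreq = 500.0

# Toy data: one row per channel, one column per sample (2 s of samples).
# Each value encodes its own position as channel * 1000 + sample index.
n_times = int(2 * sfreq)
data = [[ch * 1000 + s for s in range(n_times)] for ch in range(len(ch_names))]

# Lookup 1: channel label -> row index
row = ch_names.index("Oz")

# Lookup 2: times in seconds -> sample indices
start, stop = int(round(1.0 * sfreq)), int(round(1.1 * sfreq))

segment = data[row][start:stop]
print(row, len(segment), segment[0])
```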
Visualizing raw EEG data#
Plot EEG data with Matplotlib#
Since we understand how to select a particular segment of data and return it as a NumPy array, it’s easy to pass the result of doing this to Matplotlib, and visualize the electrical potential of a specific electrode over a segment of time. It’s good practice to use variables to pass to the various kwargs we want to use in the plot command; this makes it easy to modify them as necessary. But since storing a slice of data could consume a lot of memory, it’s better just to pass the slice directly to the plot command, as shown below.
One thing to note here is the .T operator, which transposes the data array (i.e., swaps the rows and columns). To be honest, I didn’t understand why this was needed, since Matplotlib expects the data as rows (x axis) and columns (y axis), which is how our data are shaped. So I asked Copilot Chat, and got the following answer:
Copilot Chat output:
In Python, when you’re dealing with a 2D numpy array, each row represents a separate data sequence. When you’re plotting this data, you typically want each sequence to be plotted along the x-axis, with the y-axis representing the values in the sequence.
However, when you extract a row from a 2D numpy array, you get a 1D array. If you try to plot this directly, matplotlib will plot the indices of the array on the x-axis and the values on the y-axis. This might not be what you want if your data sequence doesn’t naturally correspond to these indices.
By transposing the 1D array (turning it into a 2D array with one column), you’re telling matplotlib to treat the single data sequence as a column of values to be plotted along the y-axis, with the x-axis representing the sequence position within the column. This is typically the desired behavior when plotting a single data sequence.
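A quick shape check may make this more concrete. The sketch below uses arrays of zeros with an assumed length of 50 time points as stand-ins for real data. It reflects two facts worth keeping in mind: .get_data() returns a 2D array (channels by time points) even when a single channel is picked, and Matplotlib draws one line per column when given a 2D array. Transposing therefore turns 50 one-point lines into a single 50-point line, whereas transposing a genuinely 1D array changes nothing.

```python
import numpy as np

# Stand-in for a single-channel slice: 1 row (channel) x 50 columns (samples)
one_channel = np.zeros((1, 50))
print(one_channel.shape)    # one row, fifty columns
print(one_channel.T.shape)  # fifty rows, one column: a single line over time

# A genuinely 1D array has only one axis, so .T leaves it unchanged
flat = np.zeros(50)
print(flat.T.shape)
```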
channel = 'Oz'
start_time = 1.0 # in seconds
end_time = 1.1
fig, ax = plt.subplots(figsize=[15, 5])
# Transpose so the single channel is drawn as one line across time
ax.plot(raw.get_data(picks=channel, tmin=start_time, tmax=end_time).T)
plt.show()
If we look at the axes of the plot above, we see that the x axis is time, but in units of samples, not seconds, with zero being the first sample (even though it’s actually 1 s from the start of the recording). This is because we passed only the data to Matplotlib, and it doesn’t know anything about the sampling rate. We can fix this by passing the time values as the first (x) argument to ax.plot(), and the data as the second (y) argument. We can get the time values by using the linspace() function from NumPy, which will return an array of evenly spaced values between the start and end times we specify. We use the same start and end times that we passed to .get_data(), and ask for as many values as there are data points, as shown below. We could also set the y-axis limits to a reasonable range (for example, with ax.set_ylim()).
Since we use the data slice twice in the code below, it’s convenient to store it in a variable.
channel = 'Oz'
start_time = 1.0 # in seconds
end_time = 1.1
# Get the data from channel Oz, from 1000 - 1100 ms
data = raw.get_data(picks=channel, tmin=start_time, tmax=end_time).T
# Get the time values
times = np.linspace(start_time, end_time, num=data.shape[0])
# Plot the data
fig, ax = plt.subplots(figsize=[15, 5])
ax.plot(times, data, color='black')
# Label the axes (MNE stores EEG data in volts, not microvolts)
ax.set_xlabel('Time (s)')
ax.set_ylabel('Voltage (V)')
plt.show()
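One subtlety with linspace() is that it includes the end point by default. If the slice stops just before end_time (an assumption worth checking against data.shape on your own data), the evenly spaced values are stretched very slightly compared with the true 2 ms sample spacing. The sketch below compares the two ways of building the time axis, using an assumed slice length of 50 samples rather than real data.

```python
import numpy as np

sfreq = 500.0
start_time = 1.0
end_time = 1.1
n_samples = 50  # assumed length of the slice

# Time axis from linspace: end point included, so the spacing is 0.1 / 49
times_linspace = np.linspace(start_time, end_time, num=n_samples)

# Time axis built from the sampling rate: spacing is exactly 1 / sfreq
times_from_rate = start_time + np.arange(n_samples) / sfreq

step_linspace = times_linspace[1] - times_linspace[0]
step_from_rate = times_from_rate[1] - times_from_rate[0]
print(step_linspace, step_from_rate)
```

For a short plot like this one the difference is invisible, but building the axis from the sampling rate keeps every time value tied to the sample it describes.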
Summary#
MNE can load raw EEG data from standard file formats, and stores the data in a Raw object.
The Raw object contains a lot of information about the data (metadata), but does not load the data into memory by default.
The metadata can be accessed using the .info attribute of the Raw object. Many other attributes of the Raw object are visible using the .__dict__ attribute.
By default, MNE does not load the data into memory. We can load it either with the preload kwarg when reading the data, or by calling the .load_data() method of the Raw object.
Since the raw data is stored in a NumPy array, we can access subsets of the data using standard NumPy indexing.
Likewise, we can use Matplotlib to plot data from the NumPy array. Because .get_data() returns a 2D array (channels by time points) even for a single channel, we transpose it with .T before plotting, so that Matplotlib draws one line over time.
In the next lesson, we will learn how to use MNE’s built-in functions and methods for visualization. These are much more convenient, because they allow you to think of your data in specifically EEG/MNE terms, rather than mentally translating between EEG concepts and NumPy arrays. MNE automatically does things like converting samples to time, and taking inputs as channel labels and times in seconds, rather than numerical indexes of NumPy arrays.