Reading Data Files with pandas

Reading Data Files with pandas#

There are three data files. Each file contains the reaction times (RTs) from 10 trials of a relatively simple task in which participants had to indicate which direction a briefly-presented arrow was pointing. The RTs are in seconds (s). Each file contains the RTs from a different participant. In each file there are three columns. You can determine what the columns are by looking at the first row (header) of each file.

In this exercise, we will read in the data from the three files in the data directory: s1.csv, s2.csv, and s3.csv, and combine them into a single DataFrame. We will save that DataFrame, and then then calculate the mean RT for each participant.

Read in the Data#

To start the process, I typed the first two lines of the prompt below:

# read in three files from the data folder, whose names start with "s" and end in "csv"
# concatenate them into one dataframe

After I hit Enter, Copilot suggested the third line of the prompt, which I accepted by hitting Tab:

# write the result to a csv file

The cells below all reflect prompts written on the basis of the instructions above, and code generated entirely by Copilot. While your experience may be different, in writing this lesson we only had to type the first 2-line prompt, and Copilot generated not only the code but the other prompts/comments.

However, you have to get used to the flow of working with Copilot. Copilot encourages a good coding style, and so it will sometimes only generate code if you hit Enter twice, so that there’s an empty line between your prompt and the code. The same is true after a line of code: Copilot may force you to hit Enter twice before it generates the next prompt or line of code.

After I typed the first two lines of the prompt below, and accepted the third line of the prompt (which was suggested by Copilot), I had to hit Enter twice to get the first line of code (import pandas as pd) generated.

Although we intended to only read and combine the data files, after I typed the first two lines of the prompt and hit Enter, Copilot added a third suggested line to the prompt after I hit Enter the first time, which was to write the dataframe to a csv file. This is pretty cool, since that was also part of the instructions above! Is Copilot a mind-reader? No, it’s just a very good predictor of what you might want to do next, based on the prompt you’ve written. It’s a pretty common task in data science to read in a set of individual data files, concatenate them into a single DataFrame, and then save that DataFrame to a file for later use. So, Copilot is just doing what it’s trained to do, which is to predict what you might want to do next, based on the prompt you’ve written.

# read in three files from the data folder, whose names start with "s" and end in "csv"
# concatenate them into one dataframe
# write the result to a csv file

import pandas as pd
import glob

# read in all files that start with "s" and end with "csv"
# use glob to get a list of file names
# use a list comprehension to read in the files
# use the concat method to concatenate the dataframes
# use the to_csv method to write the result to a csv file

# get a list of file names
file_list = glob.glob('data/s*.csv')

# read in the files
df_list = [pd.read_csv(file) for file in file_list]

# concatenate the dataframes
df = pd.concat(df_list)

# write the result to a csv file
df.to_csv('data/combined_csv_files.csv', index=False)

The code ran without errors, but we didn’t ask for any output. Next we’ll to some checks to confirm that the code ran correctly, as described in the instructions. We’ll start by checking the number of columns in the DataFrame. As noted above, there should be three columns.

# print the first 5 rows of the dataframe
df.head()

	participantID	trial	RT
0	s2	1	0.433094
1	s2	2	0.392526
2	s2	3	0.396831
3	s2	4	0.417988
4	s2	5	0.371810

We see that the DataFrame has 3 columns, as expected. Next, we’ll check that we get the expected number of rows (30) and columns (3):

# Check that the dataframe has 30 rows and 3 columns
df.shape

(30, 3)

We also were instructed to raise an error message if the number of rows or columns is incorrect:

# raise an error if the dataframe does not have 30 rows and 3 columns
assert df.shape == (30, 3), "The dataframe does not have 30 rows and 3 columns"

Test Your Test#

This generates no output. This is a Good Thing, because it means that the number of rows and columns is correct. However, it’s a bit dangerous to assume no news is good news, because no news could also mean your code is not working. So, we should test that the error message is generated if we change the expected number of rows or columns.

We don’t want to actually remove any rows from the DataFrame, but we can use slicing to create a view of the DataFrame that has fewer rows:

# create a slice of df that contains 29 rows, 
# then raise an error if the dataframe does not have 30 rows and 3 columns
df_slice = df.iloc[0:29]
assert df_slice.shape == (30, 3), "The dataframe does not have 30 rows and 3 columns"

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[5], line 4
      1 # create a slice of df that contains 29 rows, 
      2 # then raise an error if the dataframe does not have 30 rows and 3 columns
      3 df_slice = df.iloc[0:29]
----> 4 assert df_slice.shape == (30, 3), "The dataframe does not have 30 rows and 3 columns"

AssertionError: The dataframe does not have 30 rows and 3 columns

Now we get an AssertionError, which is a Good Thing because it confirms that our error-checking code is working. We can do the same thing to test the error-checking code for the number of columns:

# create a slice of df that contains 2 columns, 
# then raise an error if the dataframe does not have 30 rows and 3 columns
df_slice = df.iloc[:, 0:2]
assert df_slice.shape == (30, 3), "The dataframe does not have 30 rows and 3 columns"

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[6], line 4
      1 # create a slice of df that contains 2 columns, 
      2 # then raise an error if the dataframe does not have 30 rows and 3 columns
      3 df_slice = df.iloc[:, 0:2]
----> 4 assert df_slice.shape == (30, 3), "The dataframe does not have 30 rows and 3 columns"

AssertionError: The dataframe does not have 30 rows and 3 columns

So yes, our code will throw errors if the number of rows or columns is incorrect.

Calculating the Mean RT for Each Participant… and Our First Bug#

Our next instruction is to calculate the mean RT for each participant. Let’s prompt Copilot to do that:

# calculate mean rt for each participant
# use the groupby method to group by participant
# use the mean method to calculate the mean rt for each participant
df.groupby('participant').mean()

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[7], line 4
# calculate mean rt for each participant
# use the groupby method to group by participant
# use the mean method to calculate the mean rt for each participant
----> 4 df.groupby('participant').mean()

File ~/miniforge3/envs/neural_data_science/lib/python3.11/site-packages/pandas/core/frame.py:9156, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, observed, dropna)
if level is None and by is None:
   raise TypeError("You have to supply one of 'by' and 'level'")
-> 9156 return DataFrameGroupBy(
   obj=self,
   keys=by,
   axis=axis,
   level=level,
   as_index=as_index,
   sort=sort,
   group_keys=group_keys,
   observed=observed,
   dropna=dropna,
)

File ~/miniforge3/envs/neural_data_science/lib/python3.11/site-packages/pandas/core/groupby/groupby.py:1329, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, observed, dropna)
self.dropna = dropna
if grouper is None:
-> 1329     grouper, exclusions, obj = get_grouper(
       obj,
       keys,
       axis=axis,
       level=level,
       sort=sort,
       observed=False if observed is lib.no_default else observed,
       dropna=self.dropna,
   )
if observed is lib.no_default:
   if any(ping._passed_categorical for ping in grouper.groupings):

File ~/miniforge3/envs/neural_data_science/lib/python3.11/site-packages/pandas/core/groupby/grouper.py:1043, in get_grouper(obj, key, axis, level, sort, observed, validate, dropna)
       in_axis, level, gpr = False, gpr, None
   else:
-> 1043         raise KeyError(gpr)
elif isinstance(gpr, Grouper) and gpr.key is not None:
   # Add key to exclusions
   exclusions.add(gpr.key)

KeyError: 'participant'

Debugging Copilot-Generated Code#

Typically, when you get a long, scary error message like the one above, you can ignore a lof of what is in the middle. The most important parts are the last line, which tells you what the error is, and the first lines, which usually indicate what line in the code you tried to run caused the error. What’s in between is the stack trace, which is a list of all the functions that were called in the process of trying to run the code. But most of the time, the error is a result of the code you wrote (the first lines), not the code in the underlying Python functions that your code called (the middle lines).

In this case, we see a KeyError: 'participant' at the bottom of the error message. Recall that Python dictionaries are sets of key-value pairs. The keys are the names of the columns, and the values are the data in those columns.

You can think of a pandas DataFrame as a dictionary in which the column names are keys, and the values in that column are its values. This is a common way that pandas functions refer to column names and their values. So the error message above indicates that the code tried to access a column called subject in the DataFrame, but that column doesn’t exist. We can see at the top of the error message that the code that caused the error is: ---> 2 df.groupby('participant').mean(). This means it’s line 2 of the cell above that caused the error.

As a novice coder, however, you might not know or remember that pandas DataFrames represent column names and data as dictionary key-value pairs. If you don’t understand what an error means, you can use Copilot to help you!

Using Copilot to Explain Errors and Suggest Fixes#

Somewhat counterintuitively, even though Copilot can generate code with bugs, it can also provide feedback on errors in the code it generated.

Copilot has a few features that can help you diagnose and fix your code. The one we have found most effective is Copilot Chat. For this reason, we won’t describe other methods, such as Copilot’s “Fix” feature, which is described in the Copilot documentation.

Copilot Chat#

Another, really useful option is the Chat option, which you can option by clicking the Chat icon (shown above) in the Activity Bar (left side of VS Code). This allows you to have a conversation with Copilot about what you want to do. If you’ve used Chat-GPT before, this will be familiar to you as it’s essentially the same type of natural-language interface.

There are different approaches to using Copilot Chat to help fix an error. One is to click in the cell that generates the error, and then click in the Copilot Chat input box, and typing:

the selected cell generates an error. how can i fix this error?

In response to this prompt for the present error, I got the following response:

If you are getting an error with the selected cell, there could be a few reasons why. Here are some possible solutions:

Check that the ‘participant’ column is present in the dataframe. You can do this by printing out the dataframe using the print() function and checking that the ‘participant’ column is present.

Check that the ‘participant’ column is spelled correctly. If the column is misspelled, you will get a “KeyError” when trying to group by that column.

Check that the ‘participant’ column is of the correct data type. If the ‘participant’ column is not a string or categorical data type, you may need to convert it using the astype() method.

Check that the dataframe contains numerical columns. If the dataframe only contains non-numerical columns, you will get a “DataError” when trying to calculate the mean.

Check that the dataframe does not contain missing values. If the dataframe contains missing values, you may need to remove them using the dropna() method before calculating the mean.

Finding the Error with Copilot’s Help#

These are actually really good responses, and they are provided in a logical order to go through in debugging. We could work through them in order, but — spoiler alert — the first suggestion actually leads to a solution.

Check that the ‘participant’ Column is Present in the DataFrame#

The first thing to do is to check that the column exists, and is spelled correctly. So let’s start with that. Do you remember how to check the names of the columns in a pandas DataFrame? If not, you can write a Copilot prompt to help you out. In the cell below, write a prompt that will print the names of the columns in the DataFrame. Then run the cell, and see what happens.

# print the column names of the dataframe
df.columns

Index(['participantID', 'trial', 'RT'], dtype='object')

Another option is to look at the first few rows of the DataFrame, which includes the column names:

# print the first few rows of the dataframe
df.head()

	participantID	trial	RT
0	s2	1	0.433094
1	s2	2	0.392526
2	s2	3	0.396831
3	s2	4	0.417988
4	s2	5	0.371810

A third option, when using Jupyter notebooks with VS Code, is to click on the Variables button in the toolbar at the top of the notebook window. This will pop up a variable viewer in sub-window below your notebook. You can click on the variable names to see their values. For DataFrames, it actually shows a list of the columns in the window, and you can double-click on the variable name to see the contents of the DataFrame in another window, the Data Viewer. This view is similar to a spreadsheet. In fact, you can directly edit values in the Data Viewer. You should never directly edit values like this, however. Any steps you do manually are not documented in your code, and are not reproducible.

The screenshot below shows the variables and Data Viewer for the current context.

The Solution#

Using any of the three approaches above, when we look at the column names, we see that they are participantID, trial, and RT. The code that generated the error was trying to access a column called participant, which doesn’t exist. It should be participantID. So we need to change the code to access the correct column name. Or, engineer our prompt to do so.

This is an important lesson on prompt engineering. You're more likely to get working code if you use the actual variable names in the prompt, rather than expecting Copilot to make inferences about what variable you're referring to based on the context.

# calculate mean rt for each participantID
df.groupby('participantID').mean()

	trial	RT
participantID
s1	5.5	0.389548
s2	5.5	0.444785
s3	5.5	0.446009

This looks good, however the code is providing means for both columns in the DataFrame (trial and RT), not just for RT (sometimes the same prompt actually does select only RT but we’ll explore when it doesn’t). We can add to our prompt to tell it not to include trial in the output:

# calculate mean rt for each participantID. Do not show the mean for trial
df.groupby('participantID').mean().drop(columns='trial')

	RT
participantID
s1	0.389548
s2	0.444785
s3	0.446009

The above generated code does what we want. However, from the perspectives of coding style and efficiency, it’s not optimal. Python executes this chained command from left to right. So, it first computes the mean for each column in the DataFrame, and then drops the column trial.

It seems unnecessary to compute the mean for trial and then drop it. This isn’t really Copilot’s fault — we did explicitly tell it not to show the mean for trial, but it’s not smart enough to know that we don’t want to compute it in the first place; it seems to have interpreted our prompt as a literal sequence of commands.

We can modify the prompt in a way that generates more efficient code, by being specific about the column that we do want, rather than what we don’t want:

# calculate mean for each participantID using the RT column
df.groupby('participantID')['RT'].mean()

participantID
s1    0.389548
s2    0.444785
s3    0.446009
Name: RT, dtype: float64

By way of showing how sensitive Copilot is to the structure of your prompt, a slightly different (and arguably more logical) phrasing of the prompt above generates the less-efficient code:

# calculate mean RT for each participantID
df.groupby('participantID').mean().drop('trial', axis=1)

	RT
participantID
s1	0.389548
s2	0.444785
s3	0.446009

Note

One thing you may notice is that the result of the last command above is nicely-formatted when it is displayed, whereas the one before it is in a more “raw” format. This is not really important here, but it’s worth understanding why the difference occurs. When you call a pandas DataFrame it prints in a nicely formatted output. However, when you call a pandas Series (which is a single column), it prints in a more detailed but less “pretty” way.

In the output immediately above, the code created a DataFrame with two columns (trial and RT) and then dropped the trial column, but as such it remained a DataFrame and so was nicely formatted.

In contrast, the output of using the mean() method on a single column (RT) in the cell above that is a Series.

We'll worry about the formatting later, but it's good to understand why it happens because the distinction between DataFrames and Series often causes confusion and errors if it's not understood.