Reading Data Files with pandas#
There are three data files. Each file contains the reaction times (RTs) from 10 trials of a relatively simple task in which participants had to indicate which direction a briefly-presented arrow was pointing. The RTs are in seconds (s). Each file contains the RTs from a different participant. In each file there are three columns. You can determine what the columns are by looking at the first row (header) of each file.
In this exercise, we will read in the data from the three files in the
s3.csv, and combine them into a single DataFrame. We will save that DataFrame, and then then calculate the mean RT for each participant.
Read in the Data#
To start the process, I typed the first two lines of the prompt below:
# read in three files from the data folder, whose names start with "s" and end in "csv" # concatenate them into one dataframe
After I hit
Enter, Copilot suggested the third line of the prompt, which I accepted by hitting
# write the result to a csv file
The cells below all reflect prompts written on the basis of the instructions above, and code generated entirely by Copilot. While your experience may be different, in writing this lesson we only had to type the first 2-line prompt, and Copilot generated not only the code but the other prompts/comments.
However, you have to get used to the flow of working with Copilot. Copilot encourages a good coding style, and so it will sometimes only generate code if you hit
Enter twice, so that there’s an empty line between your prompt and the code. The same is true after a line of code: Copilot may force you to hit
Enter twice before it generates the next prompt or line of code.
After I typed the first two lines of the prompt below, and accepted the third line of the prompt (which was suggested by Copilot), I had to hit
Enter twice to get the first line of code (
import pandas as pd) generated.
Although we intended to only read and combine the data files, after I typed the first two lines of the prompt and hit
Enter, Copilot added a third suggested line to the prompt after I hit
Enter the first time, which was to
write the dataframe to a csv file. This is pretty cool, since that was also part of the instructions above! Is Copilot a mind-reader? No, it’s just a very good predictor of what you might want to do next, based on the prompt you’ve written. It’s a pretty common task in data science to read in a set of individual data files, concatenate them into a single DataFrame, and then save that DataFrame to a file for later use. So, Copilot is just doing what it’s trained to do, which is to predict what you might want to do next, based on the prompt you’ve written.
# read in three files from the data folder, whose names start with "s" and end in "csv" # concatenate them into one dataframe # write the result to a csv file import pandas as pd import glob # read in all files that start with "s" and end with "csv" # use glob to get a list of file names # use a list comprehension to read in the files # use the concat method to concatenate the dataframes # use the to_csv method to write the result to a csv file # get a list of file names file_list = glob.glob('data/s*.csv') # read in the files df_list = [pd.read_csv(file) for file in file_list] # concatenate the dataframes df = pd.concat(df_list) # write the result to a csv file df.to_csv('data/combined_csv_files.csv', index=False)
The code ran without errors, but we didn’t ask for any output. Next we’ll to some checks to confirm that the code ran correctly, as described in the instructions. We’ll start by checking the number of columns in the DataFrame. As noted above, there should be three columns.
# print the first 5 rows of the dataframe df.head()
We see that the DataFrame has 3 columns, as expected. Next, we’ll check that we get the expected number of rows (30) and columns (3):
# Check that the dataframe has 30 rows and 3 columns df.shape
We also were instructed to raise an error message if the number of rows or columns is incorrect:
# raise an error if the dataframe does not have 30 rows and 3 columns assert df.shape == (30, 3), "The dataframe does not have 30 rows and 3 columns"
Test Your Test#
This generates no output. This is a Good Thing, because it means that the number of rows and columns is correct. However, it’s a bit dangerous to assume no news is good news, because no news could also mean your code is not working. So, we should test that the error message is generated if we change the expected number of rows or columns.
We don’t want to actually remove any rows from the DataFrame, but we can use slicing to create a view of the DataFrame that has fewer rows:
# create a slice of df that contains 29 rows, # then raise an error if the dataframe does not have 30 rows and 3 columns df_slice = df.iloc[0:29] assert df_slice.shape == (30, 3), "The dataframe does not have 30 rows and 3 columns"
--------------------------------------------------------------------------- AssertionError Traceback (most recent call last) Cell In, line 4 1 # create a slice of df that contains 29 rows, 2 # then raise an error if the dataframe does not have 30 rows and 3 columns 3 df_slice = df.iloc[0:29] ----> 4 assert df_slice.shape == (30, 3), "The dataframe does not have 30 rows and 3 columns" AssertionError: The dataframe does not have 30 rows and 3 columns
Now we get an
AssertionError, which is a Good Thing because it confirms that our error-checking code is working. We can do the same thing to test the error-checking code for the number of columns:
# create a slice of df that contains 2 columns, # then raise an error if the dataframe does not have 30 rows and 3 columns df_slice = df.iloc[:, 0:2] assert df_slice.shape == (30, 3), "The dataframe does not have 30 rows and 3 columns"
--------------------------------------------------------------------------- AssertionError Traceback (most recent call last) Cell In, line 4 1 # create a slice of df that contains 2 columns, 2 # then raise an error if the dataframe does not have 30 rows and 3 columns 3 df_slice = df.iloc[:, 0:2] ----> 4 assert df_slice.shape == (30, 3), "The dataframe does not have 30 rows and 3 columns" AssertionError: The dataframe does not have 30 rows and 3 columns
So yes, our code will throw errors if the number of rows or columns is incorrect.
Calculating the Mean RT for Each Participant… and Our First Bug#
Our next instruction is to calculate the mean RT for each participant. Let’s prompt Copilot to do that:
# calculate mean rt for each participant # use the groupby method to group by participant # use the mean method to calculate the mean rt for each participant df.groupby('participant').mean()
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) Cell In, line 4 1 # calculate mean rt for each participant 2 # use the groupby method to group by participant 3 # use the mean method to calculate the mean rt for each participant ----> 4 df.groupby('participant').mean() File ~/miniforge3/envs/neural_data_science/lib/python3.11/site-packages/pandas/core/frame.py:8869, in DataFrame.groupby(self, by, axis, level, as_index, sort, group_keys, observed, dropna) 8866 if level is None and by is None: 8867 raise TypeError("You have to supply one of 'by' and 'level'") -> 8869 return DataFrameGroupBy( 8870 obj=self, 8871 keys=by, 8872 axis=axis, 8873 level=level, 8874 as_index=as_index, 8875 sort=sort, 8876 group_keys=group_keys, 8877 observed=observed, 8878 dropna=dropna, 8879 ) File ~/miniforge3/envs/neural_data_science/lib/python3.11/site-packages/pandas/core/groupby/groupby.py:1278, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, observed, dropna) 1275 self.dropna = dropna 1277 if grouper is None: -> 1278 grouper, exclusions, obj = get_grouper( 1279 obj, 1280 keys, 1281 axis=axis, 1282 level=level, 1283 sort=sort, 1284 observed=False if observed is lib.no_default else observed, 1285 dropna=self.dropna, 1286 ) 1288 if observed is lib.no_default: 1289 if any(ping._passed_categorical for ping in grouper.groupings): File ~/miniforge3/envs/neural_data_science/lib/python3.11/site-packages/pandas/core/groupby/grouper.py:1009, in get_grouper(obj, key, axis, level, sort, observed, validate, dropna) 1007 in_axis, level, gpr = False, gpr, None 1008 else: -> 1009 raise KeyError(gpr) 1010 elif isinstance(gpr, Grouper) and gpr.key is not None: 1011 # Add key to exclusions 1012 exclusions.add(gpr.key) KeyError: 'participant'
Debugging Copilot-Generated Code#
Typically, when you get a long, scary error message like the one above, you can ignore a lof of what is in the middle. The most important parts are the last line, which tells you what the error is, and the first lines, which usually indicate what line in the code you tried to run caused the error. What’s in between is the stack trace, which is a list of all the functions that were called in the process of trying to run the code. But most of the time, the error is a result of the code you wrote (the first lines), not the code in the underlying Python functions that your code called (the middle lines).
In this case, we see a
KeyError: 'participant' at the bottom of the error message. Recall that Python dictionaries are sets of key-value pairs. The keys are the names of the columns, and the values are the data in those columns.
You can think of a pandas DataFrame as a dictionary in which the column names are keys, and the values in that column are its values. This is a common way that pandas functions refer to column names and their values. So the error message above indicates that the code tried to access a column called
subject in the DataFrame, but that column doesn’t exist. We can see at the top of the error message that the code that caused the error is:
---> 2 df.groupby('participant').mean(). This means it’s line 2 of the cell above that caused the error.
As a novice coder, however, you might not know or remember that pandas DataFrames represent column names and data as dictionary key-value pairs. If you don’t understand what an error means, you can use Copilot to help you!
Using Copilot to Explain Errors and Suggest Fixes#
Somewhat counterintuitively, even though Copilot can generate code with bugs, it can also provide feedback on errors in the code it generated.
Copilot has a few features that can help you diagnose and fix your code. The one we have found most effective is Copilot Chat. For this reason, we won’t describe other methods, such as Copilot’s “Fix” feature, which is described in the Copilot documentation.
Another, really useful option is the
Chat option, which you can option by clicking the
Chat icon (shown above) in the Activity Bar (left side of VS Code). This allows you to have a conversation with Copilot about what you want to do. If you’ve used Chat-GPT before, this will be familiar to you as it’s essentially the same type of natural-language interface.
There are different approaches to using Copilot Chat to help fix an error. One is to click in the cell that generates the error, and then click in the Copilot Chat input box, and typing:
the selected cell generates an error. how can i fix this error?
In response to this prompt for the present error, I got the following response:
If you are getting an error with the selected cell, there could be a few reasons why. Here are some possible solutions:
Check that the ‘participant’ column is present in the dataframe. You can do this by printing out the dataframe using the
print()function and checking that the ‘participant’ column is present.
Check that the ‘participant’ column is spelled correctly. If the column is misspelled, you will get a “KeyError” when trying to group by that column.
Check that the ‘participant’ column is of the correct data type. If the ‘participant’ column is not a string or categorical data type, you may need to convert it using the
Check that the dataframe contains numerical columns. If the dataframe only contains non-numerical columns, you will get a “DataError” when trying to calculate the mean.
Check that the dataframe does not contain missing values. If the dataframe contains missing values, you may need to remove them using the
dropna()method before calculating the mean.
Finding the Error with Copilot’s Help#
These are actually really good responses, and they are provided in a logical order to go through in debugging. We could work through them in order, but — spoiler alert — the first suggestion actually leads to a solution.
Check that the ‘participant’ Column is Present in the DataFrame#
The first thing to do is to check that the column exists, and is spelled correctly. So let’s start with that. Do you remember how to check the names of the columns in a pandas DataFrame? If not, you can write a Copilot prompt to help you out. In the cell below, write a prompt that will print the names of the columns in the DataFrame. Then run the cell, and see what happens.
# print the column names of the dataframe df.columns
Index(['participantID', 'trial', 'RT'], dtype='object')
Another option is to look at the first few rows of the DataFrame, which includes the column names:
# print the first few rows of the dataframe df.head()
A third option, when using Jupyter notebooks with VS Code, is to click on the
Variables button in the toolbar at the top of the notebook window. This will pop up a variable viewer in sub-window below your notebook. You can click on the variable names to see their values. For DataFrames, it actually shows a list of the columns in the window, and you can double-click on the variable name to see the contents of the DataFrame in another window, the Data Viewer. This view is similar to a spreadsheet. In fact, you can directly edit values in the Data Viewer. You should never directly edit values like this, however. Any steps you do manually are not documented in your code, and are not reproducible.
The screenshot below shows the variables and Data Viewer for the current context.
Using any of the three approaches above, when we look at the column names, we see that they are
RT. The code that generated the error was trying to access a column called
participant, which doesn’t exist. It should be
participantID. So we need to change the code to access the correct column name. Or, engineer our prompt to do so.
# calculate mean rt for each participantID df.groupby('participantID').mean()
This looks good, however the code is providing means for both columns in the DataFrame (
RT), not just for
RT (sometimes the same prompt actually does select only
RT but we’ll explore when it doesn’t). We can add to our prompt to tell it not to include
trial in the output:
# calculate mean rt for each participantID. Do not show the mean for trial df.groupby('participantID').mean().drop(columns='trial')
The above generated code does what we want. However, from the perspectives of coding style and efficiency, it’s not optimal. Python executes this chained command from left to right. So, it first computes the mean for each column in the DataFrame, and then drops the column
It seems unnecessary to compute the mean for
trial and then drop it. This isn’t really Copilot’s fault — we did explicitly tell it not to show the mean for trial, but it’s not smart enough to know that we don’t want to compute it in the first place; it seems to have interpreted our prompt as a literal sequence of commands.
We can modify the prompt in a way that generates more efficient code, by being specific about the column that we do want, rather than what we don’t want:
# calculate mean for each participantID using the RT column df.groupby('participantID')['RT'].mean()
participantID s1 0.389548 s2 0.444785 s3 0.446009 Name: RT, dtype: float64
By way of showing how sensitive Copilot is to the structure of your prompt, a slightly different (and arguably more logical) phrasing of the prompt above generates the less-efficient code:
# calculate mean RT for each participantID df.groupby('participantID').mean().drop('trial', axis=1)
One thing you may notice is that the result of the last command above is nicely-formatted when it is displayed, whereas the one before it is in a more “raw” format. This is not really important here, but it’s worth understanding why the difference occurs. When you call a pandas DataFrame it prints in a nicely formatted output. However, when you call a pandas Series (which is a single column), it prints in a more detailed but less “pretty” way.
In the output immediately above, the code created a DataFrame with two columns (
RT) and then dropped the
trial column, but as such it remained a DataFrame and so was nicely formatted.
In contrast, the output of using the
mean() method on a single column (
RT) in the cell above that is a Series.
We'll worry about the formatting later, but it's good to understand why it happens because the distinction between DataFrames and Series often causes confusion and errors if it's not understood.