Formatting the Output in a Table#
This continues the previous lesson, which got a bit long for one lesson. Recall that our task was to
read three data files in the data
directory, each RT data from a different participant, and combine them in a DataFrame named df
. In the previous lessons we computed means and 95% CIs for individual participants and the group. In this final lesson in the series, we will print a table that includes all of the results: with columns for mean RT, lower 95% CI, and upper 95% CI, and with one row for each participant and the bottom row showing the mean and CIs across participants.
Since this is a new notebook file, we need to import the libraries we will use, and import the data. To do this, we’ve copied and pasted the code from the previous lesson into the code cell below (a few lines were deleted for simplicity).
# read in three files from the data folder, whose names start with "s" and end in "csv"
# concatenate them into one dataframe
import pandas as pd
import glob
# use glob to get all the files that start with "s" and end with "csv"
# glob returns a list of file names
filenames = glob.glob("data/s*.csv")
# read in the files and concatenate them into one dataframe
dataframes = []
for f in filenames:
dataframes.append(pd.read_csv(f))
df = pd.concat(dataframes, ignore_index=True)
Formatting the output nicely#
The final instruction was, “Finally, show a table that includes all of the results: with columns for mean RT, lower 95% CI, and upper 95% CI, and with one row for each participant and the bottom row showing the mean and CIs across participants.”
# show a table that has columns for mean RT, lower 95% CI, and upper 95% CI,
# each rounded to 2 decimal places,
# with one row for each participantID,
# and the final row for the mean across all participantIDD
mean_rt = df.groupby('participantID')['RT'].mean().round(2)
sem_rt = df.groupby('participantID')['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = mean_rt + ci_rt
lower_ci = mean_rt - ci_rt
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci})
print(ci_table)
Mean RT Lower CI Upper CI
participantID
s1 0.39 0.3508 0.4292
s2 0.44 0.3812 0.4988
s3 0.45 0.3716 0.5284
This is a decent first pass, but the CIs are rounded to 4, not 2, decimal places, and the table is missing the mean across participants. These are two different problems, so let’s try to solve them one at a time. We’ll start with rounding. Our prompt above was pretty specific in asking Copilot to round each of the stated values to 2 decimal places, so maybe instead of prompt engineering we can ask Chat for help. Let’s try this prompt in Copilot Chat: modify the code in the selected cell so that the CIs are rounded to 2 decimal places
. The result is:
To modify the code to round the CIs to 2 decimal places, you can change the
round()
method to round to 2 decimal places instead of the default 4 decimal places. Here’s the modified code:
mean_rt = df.groupby('participantID')['RT'].mean().round(2)
sem_rt = df.groupby('participantID')['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci})
print(ci_table)
In this code, the
round()
method is used to round the mean RT and SEM values to 2 decimal places, and theround()
method is also used to round the upper and lower CI values to 2 decimal places.The
upper_ci
andlower_ci
variables calculate the upper and lower CIs by adding and subtracting theci_rt
value from themean_rt
value, respectively. Theround()
method is used to round the resulting values to 2 decimal places.Finally, the
ci_table
variable creates a table of the mean RT, lower CI, and upper CI values using thepd.DataFrame()
function, and theprint()
function is used to print the table.This code should give you the desired output of a table containing the mean RT, lower CI, and upper CI values rounded to 2 decimal places.
And indeed, as you can see below, it works!
mean_rt = df.groupby('participantID')['RT'].mean().round(2)
sem_rt = df.groupby('participantID')['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci})
print(ci_table)
Mean RT Lower CI Upper CI
participantID
s1 0.39 0.35 0.43
s2 0.44 0.38 0.50
s3 0.45 0.37 0.53
Now let’s address the problem of the missing mean across participants. Based on what we learned earlier in the course, this might not be surprising. While we can use .groupby()
to split the data by participantID, or simply .mean()
to get the average across participantIDs, there is no way of generating both split and averaged data using the same command.
Instead, what we could try doing is create the across-subject data as a separate table, and then merge that table with the by-participant table. In writing the prompt below I also tried a slightly different way of asking for rounding to 2 decimal places, which turns out to be effective:
# show a table with columns for mean RT, lower 95% CI, and upper 95% CI,
# averaged across all participant ID
# round all values in the table to 2 decimal places
mean_rt = df['RT'].mean().round(2)
sem_rt = df['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci}, index=['RT'])
print(ci_table)
Mean RT Lower CI Upper CI
RT 0.43 0.39 0.47
Now we want to merge the two tables. In the cell below, I only had to type the first one-line prompt, and Copilot generated all the subsequent code and prompts/comments. However, I had to hit Enter
twice after the first prompt, and after some other lines (where you see blank lines below) to get all the code. I would tell when Copilot had finished because it started regenerating the same prompts and code it had already generated.
Unfortunately when we try to run the cell below we get an error:
# merge the two tables that are the output of each of the two code cells above
# calculate the mean RT for each participantID
mean_rt = df.groupby('participantID')['RT'].mean().round(2)
# calculate the 95% confidence intervals for the mean RT for each participantID
sem_rt = df.groupby('participantID')['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci})
# calculate the mean RT across all participantID
mean_rt = df['RT'].mean().round(2)
# calculate the 95% confidence intervals for the mean RT across all participantID
sem_rt = df['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table2 = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci}, index=['RT'])
# merge the two tables
ci_table.append(ci_table2)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/var/folders/nq/chbqhzjx5hb6hfjx708rh2_40000gn/T/ipykernel_70577/2646225902.py in ?()
20 lower_ci = (mean_rt - ci_rt).round(2)
21 ci_table2 = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci}, index=['RT'])
22
23 # merge the two tables
---> 24 ci_table.append(ci_table2)
~/miniforge3/envs/neural_data_science/lib/python3.12/site-packages/pandas/core/generic.py in ?(self, name)
6295 and name not in self._accessors
6296 and self._info_axis._can_hold_identifiers_and_holds_name(name)
6297 ):
6298 return self[name]
-> 6299 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'append'
The error message is that 'DataFrame' object has no attribute 'append'
. This means that the code is trying to append the across-subject table to the by-participant table, but that the append()
method doesn’t exist for DataFrames.
If you do a Web search for “pandas dataframe append”, you may find this error puzzling, because there are many descriptions of how to use this method, including in the official pandas documentation. However, what’s not immediately obvious is that the .append()
method in pandas was depricated in November, 2021. “Depricated” in a programming language means that a feature (function, method, etc.) was made obsolete — typically replaced by a newer feature, or perhaps because it was redundant with another way of doing the same thing. Since Copilot is trained on a vast amount of code, this will inevitably include legacy code that is no longer used.
We’ll ask Copilot Chat for help again, this time telling it what the error message is, and asking how to fix it:
the selected cell generates an error: "'DatFrame' object has no attribute 'append'". how can i fix this error?
The response is:
The error message “‘DataFrame’ object has no attribute ‘append’” suggests that the
append()
method is not available for theDataFrame
object in the code.To fix this error, you can use the
concat()
function instead of theappend()
method to concatenate the two dataframes vertically. Here’s the modified code:
# merge the two tables that are the output of each of the two code cells above
# calculate the mean RT for each participantID
mean_rt = df.groupby('participantID')['RT'].mean().round(2)
# calculate the 95% confidence intervals for the mean RT for each participantID
sem_rt = df.groupby('participantID')['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci})
# calculate the mean RT across all participantID
mean_rt = df['RT'].mean().round(2)
# calculate the 95% confidence intervals for the mean RT across all participantID
sem_rt = df['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table2 = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci}, index=['RT'])
# concatenate the two tables vertically
ci_table = pd.concat([ci_table, ci_table2])
# print the merged table
print(ci_table)
In this code, the
concat()
function is used to concatenate the two dataframesci_table
andci_table2
vertically. The resulting dataframe contains the rows of both dataframes stacked on top of each other.Finally, the
print()
function is used to print the merged table.This code should fix the error and give you the desired output of a merged table containing the mean RT, lower CI, and upper CI values for each participant, as well as the mean RT, lower CI, and upper CI values across all participants.
When we try this, it works. The only lingering issue is that the last row is labelled RT
, which does not tell us the values represent the mean across all participants:
# merge the two tables that are the output of each of the two code cells above
# calculate the mean RT for each participantID
mean_rt = df.groupby('participantID')['RT'].mean().round(2)
# calculate the 95% confidence intervals for the mean RT for each participantID
sem_rt = df.groupby('participantID')['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci})
# calculate the mean RT across all participantID
mean_rt = df['RT'].mean().round(2)
# calculate the 95% confidence intervals for the mean RT across all participantID
sem_rt = df['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table2 = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci}, index=['RT'])
# concatenate the two tables vertically
ci_table = pd.concat([ci_table, ci_table2])
# print the merged table
print(ci_table)
Mean RT Lower CI Upper CI
s1 0.39 0.35 0.43
s2 0.44 0.38 0.50
s3 0.45 0.37 0.53
RT 0.43 0.39 0.47
We can as Chat to do this for us: modify the code in the selected cell so that the last row in the final dataframe is labelled "Average"
To modify the code to label the last row in the final dataframe as “Average”, you can change the index label of the last row in the
ci_table2
dataframe to “Average” before concatenating the two dataframes. Here’s the modified code:
# merge the two tables that are the output of each of the two code cells above
# calculate the mean RT for each participantID
mean_rt = df.groupby('participantID')['RT'].mean().round(2)
# calculate the 95% confidence intervals for the mean RT for each participantID
sem_rt = df.groupby('participantID')['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci})
# calculate the mean RT across all participantID
mean_rt = df['RT'].mean().round(2)
# calculate the 95% confidence intervals for the mean RT across all participantID
sem_rt = df['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table2 = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci}, index=['Average'])
# concatenate the two tables vertically
ci_table = pd.concat([ci_table, ci_table2])
# print the merged table
print(ci_table)
In this code, the
index
parameter of thepd.DataFrame()
function is used to set the index label of the last row in theci_table2
dataframe to “Average”.The
concat()
function is then used to concatenate the two dataframesci_table
andci_table2
vertically. The resulting dataframe contains the rows of both dataframes stacked on top of each other.Finally, the
print()
function is used to print the merged table.This code should give you the desired output of a merged table containing the mean RT, lower CI, and upper CI values for each participant, as well as the mean RT, lower CI, and upper CI values across all participants, with the last row labelled as “Average”.
# merge the two tables that are the output of each of the two code cells above
# calculate the mean RT for each participantID
mean_rt = df.groupby('participantID')['RT'].mean().round(2)
# calculate the 95% confidence intervals for the mean RT for each participantID
sem_rt = df.groupby('participantID')['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci})
# calculate the mean RT across all participantID
mean_rt = df['RT'].mean().round(2)
# calculate the 95% confidence intervals for the mean RT across all participantID
sem_rt = df['RT'].sem().round(2)
ci_rt = sem_rt * 1.96
upper_ci = (mean_rt + ci_rt).round(2)
lower_ci = (mean_rt - ci_rt).round(2)
ci_table2 = pd.DataFrame({'Mean RT': mean_rt, 'Lower CI': lower_ci, 'Upper CI': upper_ci}, index=['Average'])
# concatenate the two tables vertically
ci_table = pd.concat([ci_table, ci_table2])
# print the merged table
print(ci_table)
Mean RT Lower CI Upper CI
s1 0.39 0.35 0.43
s2 0.44 0.38 0.50
s3 0.45 0.37 0.53
Average 0.43 0.39 0.47
Conclusion#
We’ve now completed the task of calculating the mean RT and 95% CIs for each participant, and for all participants. We’ve also learned a lot about how to use Copilot to help us write code, and how to use Copilot Chat to help us understand errors and fix them.
As you’ve seen, the process of using AI to generate code isn’t always smooth, nor are the results always accurate. It’s critical to critically assess the outputs you see, and to understand what you are asking Copilot to do — as well as what the code it generates is doing. In this lesson, for the sake of expediency and length we did not provide detailed explanations of what the code was doing every step of the way. However, you’ve seen that Copilot can be useful at explaining code for you.
There were also many points in this lesson where, as an experienced coder, I knew what the problem was and how to fix it. However, I deliberately did not fix the problems myself, because I recognize that as a novice coder you may not know how to fix the problems. Hopefully this lesson has shown you how to use Copilot effectively, but also underscored the value of actually knowing Python and being able to read and debug code yourself.
The best approach to learning to use Python for data science is to write code yourself when you can, because this helps build and reinforce your knowledge (remember, coding is a procedural skill that you only learn by doing — and you lose the knowledge if you don’t use it). But, when you encounter a problem you’ve never solved before, or are getting errors in your Code, Copilot can be a huge help to deepen your knowledge and solve problems. As we’ve seen, Copilot Chat can be much more useful in this regard than simply writing prompts in code cells of your notebook, because it gives you both code, and detailed explanations of what that code is doing.
Copilot in Later Chapters of This Book
Throughout the rest of this book, we focus on teaching you to write and understand code yourself, rather than using Copilot. We believe — and feel this lesson has demonstrated — that learning to write code yourself is the best way to learn to code. However, whether you’re taking this course for credit and have assignments that Copilot may be useful in helping with, or simply using this book to learn how to apply Python to your own data, we recognize that Copilot can be a useful tool. We hope this chapter has given you a good introduction to how to use Copilot effectively, and that you will continue to use it on your coding journey.