Computing Mean Reaction Times and Confidence Intervals#

This continues the previous lesson, which got a bit long for one lesson. Recall that our task was to read three data files in the data directory, each RT data from a different participant, and combine them in a DataFrame named df. We did all that in the previous lesson, and got as far as calculating the mean RT for each participant. In this lesson, we’ll continue the task by calculating the mean RT across all participants, and calculating the 95% confidence intervals (CIs) for the mean RT for each participant and for the mean RT across all participants.

Since this is a new notebook file, we need to import the libraries we will use, and import the data. To do this, we’ve copied and pasted the code from the previous lesson into the code cell below (a few lines were deleted for simplicity).

# read in three files from the data folder, whose names start with "s" and end in "csv"
# concatenate them into one dataframe

import pandas as pd
import glob

# use glob to get all the files that start with "s" and end with "csv"
# glob returns a list of file names
filenames = glob.glob("data/s*.csv")

# read in the files and concatenate them into one dataframe
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f))
df = pd.concat(dataframes, ignore_index=True)
df.sample(5)
participantID trial RT
0 s2 1 0.433094
1 s2 2 0.392526
2 s2 3 0.396831
3 s2 4 0.417988
4 s2 5 0.371810

Compute Mean RT Across Participants#

Our next task is to calculate the mean RT across subjects, so let’s write a prompt for that. We learned in the last lesson that we should be specific about variable names and capitalization, so let’s try the prompt:

# calculate the mean RT across all participantID

Copilot generated two possible lines of code in response to the prompt, which we can see by typing Option-[ (Mac) or Alt-[ (Windows). The options are:

# calculate the mean RT across all participantID
df.groupby("participantID").mean()

and

# calculate the mean RT across all participantID
df.groupby("participantID")["RT"].mean()

Rather than trying either of these, let’s first analyze them to try to understand what each is doing. The first option uses the groupby method to group the data by the participantID column, and then calculates the mean for each group. However, we don’t want to calculate the mean for each group, we want to calculate the mean for all groups. Secondly, the command does not specify that we want the mean only for the reaction time column.

The second option is essentially the same command, except it’s an improvement in that it does specify the column name ["RT"]. However, it still calculates the mean for each group, rather than for all groups.

At this point, we could use Copilot Chat again, but it might be more efficient to try writing the code ourselves, by adapting the second Copilot suggestion. The command is an example of chaining, in which several methods are applied in sequence to the same object. In this case, the object is the DataFrame df, and the methods are .groupby() and .mean(), which would be executed in sequence, from left to right. So what if we simply remove the .groupby() method? Let’s try that; we’ll copy and past the command (and its comment) into the code cell below, and then edit it. We need to be sure to keep the ["RT"] part though!

# calculate the mean RT across all participantID
df["RT"].mean()
0.4267806816333334

This worked.

95% CIs for Each Participant#

Let’s move on to the next step, which is to calculate the 95% confidence intervals for the mean RT for each participant.

If you are trying this yourself, our experience is that you may get a number of very different solutions from Copilot. We tried several, and all had errors! So, if you get a different error than the one we work through below, you can follow the same process with Copilot Chat to solve your particular problem.

# calculate 95% confidence intervals for each participantID
df.groupby('participantID').mean().apply(lambda x: x.sem() * 1.96, axis=1)
participantID
s1    5.008243
s2    4.954111
s3    4.952911
dtype: float64

Pink Elephants#

This is a frightening example of a Copilot hallucination. The above code generates a table of numbers, one for each participantID. You might think this is a plausible result. However, there are a couple of problems. Firstly, confidence intervals (CIs) are typically reported as a pair of values. The 95% CIs reflect the range within which it is 95% likely that the true mean lies, so we need an upper and lower limits of the CI. These are typically the same amount plus or minus the mean (e.g., if the mean is 2, the 95% CIs might be 1.5 and 2.5). So given a conceptual understanding of what CIs are, we recognize that we need to modify the code to generate the upper and lower values of the CIs.

Secondly, if you look back at the mean RT for each participant, they were in the range of .339 – .449 (i.e., 339–449 ms). Again, confidence intervals reflect the range of possible true values for the mean, given the sample data. So for our confidence intervals to have values around 5 s seems implausible, given the mean RTs are < 500 ms. We would expect the CIs to be perhaps 100-200 ms. So it seems like Copilot is calculating the CIs incorrectly.

This highlights the critical importance of understanding not only what your code is doing, but what you are asking it to do. You should always check the output of your code to make sure it makes sense, and also make sure you understand what you are supposed to be doing with the code.

Copilot Chat to the Rescue#

Since Copilot Chat was helpful before, let’s try using it to help us figure out what is going on. If we click on the Chat icon in the Activity Bar, we can type in a question to Copilot. In this case, I asked Copilot the following:

the highlighted code is supposed to compute the 95% confidence intervals of mean RT. However, the output values are incorrect. Can you please suggest code that correctly calculates the 95% confidence intervals, and provides both the upper and lower CIs in a table

Copilot provided the answer below:

To correctly calculate the 95% confidence intervals for the mean RT and provide both the upper and lower CIs in a table, you can modify the existing code as follows:

import numpy as np

# calculate mean and standard error of the mean for each participant
mean_rt = df.groupby('participantID')['RT'].mean()
sem_rt = df.groupby('participantID')['RT'].sem()

# calculate 95% confidence intervals for the mean RT
ci_rt = sem_rt * 1.96

# calculate upper and lower CIs
upper_ci = mean_rt + ci_rt
lower_ci = mean_rt - ci_rt

# create a table of the upper and lower CIs
ci_table = pd.DataFrame({'Upper CI': upper_ci, 'Lower CI': lower_ci})

# print the table of CIs
print(ci_table)

In this code, the groupby() method is used to group the data by the ‘participantID’ column, and the mean() and sem() methods are used to calculate the mean and standard error of the mean for the ‘RT’ column for each group. The resulting series contains the mean and SEM for each unique value in the ‘participantID’ column.

The ci_rt variable calculates the 95% confidence intervals for the mean RT using the formula SEM * 1.96, where SEM is the standard error of the mean.

The upper_ci and lower_ci variables calculate the upper and lower CIs by adding and subtracting the ci_rt value from the mean_rt value, respectively.

The ci_table variable creates a table of the upper and lower CIs using the pd.DataFrame() function.

Finally, the print() function is used to print the table of CIs.

This code should give you the desired output of a table containing the upper and lower CIs for the mean RT for each participant.

If you move your cursor over the code that Copilot generated, you will see a Copy button appear. Clicking this will copy the code to your clipboard, so you can paste it into your notebook. We’ll run this code below:

import numpy as np

# calculate mean and standard error of the mean for each participant
mean_rt = df.groupby('participantID')['RT'].mean()
sem_rt = df.groupby('participantID')['RT'].sem()

# calculate 95% confidence intervals for the mean RT
ci_rt = sem_rt * 1.96

# calculate upper and lower CIs
upper_ci = mean_rt + ci_rt
lower_ci = mean_rt - ci_rt

# create a table of the upper and lower CIs
ci_table = pd.DataFrame({'Upper CI': upper_ci, 'Lower CI': lower_ci})

# print the table of CIs
print(ci_table)
               Upper CI  Lower CI
participantID                    
s1             0.430626  0.348470
s2             0.496197  0.393373
s3             0.515907  0.376111

OK, this is great progress. We now have our output in a table with both upper and lower CIs, and we can see that the CIs are in the range we would expect.

Notice as well that the code that Copilot Chat generated is chunked into a series of steps, each of which has it’s own prompt (comment). This teaches us something about prompt engineering: when we are writing prompts, we should try to break down the task into a sequence of steps, each of which is relatively simple. This will make it easier for Copilot to generate code that does what we want. In contrast, the above prompt that we wrote, which didn’t generate the correct output, was a single step simply asking for 95% CIs. Sometimes we can get away with such simple prompts, but when they fail, we can try to break down the task into a sequence of steps, and then write prompts for each step. Of course, this would depend on your having a detailed enough understanding of confidence intervals to know how to break that task into smaller steps! So, Copilot Chat can be a useful interface in situations like this.

Regardless, it’s still critical to cross-check the formula that Copilot is using against a textbook or trustable online resource.

Cross-Check Copilot’s Code with Other Sources#

There’s an old saying, “once bitten, twice shy”. We’ve already seen that Copilot can generate code that is incorrect. So, it’s a good idea to check the code that Copilot generates against other sources. In this case, we can look up the formula for calculating the 95% CI for the mean, and compare it to the code that Copilot generated.

If we do a Web search for “confidence interval formula”, we will find that the results across all of the top hits indicate that the formula for calculating the 95% CI for the mean is:

ci_upper = mean + 1.96 * (std / sqrt(n))
ci_lower = mean - 1.96 * (std / sqrt(n))

where mean is the mean of the data, std is the standard deviation of the data, and n is the number of observations.

However, Copilot’s code multiplied 1.96 by sem_rt, which is the standard error of the mean (if you didn’t know that, you could find out via a Web search or asking Copilot chat). But if you remember (or look up) the formula for standard error, it is the standard deviation of the data divided by the square root of the number of observations. So, Copilot’s code is correct, it’s just using a different formula than the one we found online. But at least now we can be confident that the code is doing what we want it to do.

95% CIs across all participants#

Our next instruction is to compute 95% CIs across all participants. Given our troubles above, we may not have much confidence that it will work the first time. But, let’s start by copying the phrasing that Copilot Chat used when describing its answer to the previous question, but adding in “across all participantID” to the prompt:

# calculate the 95% confidence intervals for the mean RT across all participantID
# and provide both the upper and lower CIs in a table
mean_rt = df.groupby('participantID')['RT'].mean()
sem_rt = df.groupby('participantID')['RT'].sem()
ci_rt = sem_rt * 1.96
upper_ci = mean_rt + ci_rt
lower_ci = mean_rt - ci_rt
ci_table = pd.DataFrame({'Upper CI': upper_ci, 'Lower CI': lower_ci})
print(ci_table)
               Upper CI  Lower CI
participantID                    
s1             0.430626  0.348470
s2             0.496197  0.393373
s3             0.515907  0.376111

Unfortunately Copilot doesn’t seem to have understood “across all participantID”, and it gave us CIs for each participant separately. So, let’s try again, but this time we’ll be more explicit about what we want:

# calculate the mean RT across all participantID
# then compute the 95% confidence intervals for the mean RT across all participantID
# and provide both the upper and lower CIs in a table
mean_rt = df['RT'].mean()
sem_rt = df['RT'].sem()
ci_rt = sem_rt * 1.96
upper_ci = mean_rt + ci_rt
lower_ci = mean_rt - ci_rt
ci_table = pd.DataFrame({'Upper CI': upper_ci, 'Lower CI': lower_ci}, index=['RT'])
print(ci_table) 
    Upper CI  Lower CI
RT  0.459111   0.39445

This looks better — there’s only one pair of CIs, for the whole group — and if we examine the code, we see that it’s using the same formula as the (correct) code that Copilot generated for the previous step.

The prompt above is an example of how programmatic thinking is required even when using an AI coding assistant. In order to get the desired result, we had to break down the problem into smaller chunks. In this case, we told it to compute the mean across all participantID, and then compute the 95% CIs.

Another prompt that also works is below; note that the output is formatted differently, but the values are the same:

# calculate the upper and lower 95% confidence interval limits for RT across participantID
df['RT'].mean() + df['RT'].sem() * 1.96, df['RT'].mean() - df['RT'].sem() * 1.96
(0.45911120518780957, 0.3944501580788572)

This code illustrates a feature of Python that we haven’t seen before: using a comma to separate two commands. This is a way of executing two commands in a single line of code. It’s not a very common way of writing code, but as you can see in this example, it can be useful: the output is a Python tuple, which is an immutable, ordered collection of values. This is a good type to use when you have a pair of values that belong together, and the immutable nature of tuples means that you can’t accidentally change one of the values. Furthermore, the ordering means you can access the values by their position in the tuple, which is useful in this case because, using this prompt, Copilot will always put the upper CI first and the lower CI second.


We’ve now seen how to use Copilot to perform some very basic data science tasks. In the next and final lesson in this series, we will see how to combine all of the steps for computing means and CIs, and display this output in a table.