Basic Statistics in Python: t tests with SciPy#

Learning Objectives#

  • implement paired, unpaired, and 1-sample t tests using the SciPy package


An introductory course in statistics is a prerequisite for this class, so we assume you remember (some of) the basics including t-tests, ANOVAs, and regression (or at least correlation).

Here we will demonstrate how to perform t tests in Python. Future lessons will cover ANOVA and regression.

The t test#

A t test is used to compare the means of two sets of data. For example, in the flanker experiment we used in the previous section, we could compare the mean RTs for the congruent and incongruent conditions. t tests consider the size of the difference between the means of the two data sets, relative to the variance in each one. The less the distributions of values in the two data sets overlap, the larger the t value will tend to be (in absolute terms). We can then estimate the probability of observing a difference at least this large simply by chance, if there were no true difference between the conditions; this is the p value. Typically, researchers use a p < .05 threshold to determine statistical significance.

t tests are implemented in the SciPy library, which “provides many user-friendly and efficient numerical routines, such as routines for numerical integration, interpolation, optimization, linear algebra, and statistics.” Each of those types of routines is in a separate sub-module of SciPy; the one we’ll want is scipy.stats. We can import this specific module with the command:

from scipy import stats

We’ll also import some other packages we’ll need, and the flanker data from the previous lesson to work with:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import glob

# Import the data
df = pd.read_csv('data/flanker_rt_data.csv')

# Aggregate the data: one mean RT per participant per condition
df_avg = pd.DataFrame(df.groupby(['participant', 'flankers'])['rt'].mean()).reset_index()

Paired t-test#

Let’s start by comparing the mean RTs for the congruent and incongruent flanker conditions.

Recall that we are working with repeated-measures data – for each participant, we have 160 trials across 4 conditions. t tests are not meant for within-condition repeated measures data — we need only one measurement per participant in each condition. This is for essentially the same reason discussed at the end of the previous section on repeated measures data: if we treat the within-participant variability the same as the between-participant variability, then we will tend to grossly under-estimate the true (between-participant) variance. When running a t test, this would result in erroneously large t values that could often falsely suggest a statistically significant result. So, we need to use the aggregated data, df_avg.

The other important characteristic of our data is that, even though aggregation has reduced the data to one measurement per participant in each condition, we still have repeated measures across the two conditions. The default assumption of a t test is that each of the two data sets being compared comes from a different sample of the population (often called a between-subjects design in Psychology). This means that t tests assume there is no relationship between any particular measurement in each of the two data sets being compared. When we have measurements from the same people in both data sets (a within-subjects design), we need to account for this, or the t test will again give an incorrect value. We account for this by using a paired t test. In SciPy, this is the function ttest_rel(). (For a between-subjects, or independent groups, design, which we will not cover here, you would use ttest_ind().)
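To make the paired/unpaired distinction concrete, here is a minimal sketch using made-up numbers (the arrays cond_a and cond_b are hypothetical, not the flanker data). It runs both ttest_rel() and ttest_ind() on the same values, to show that the two functions can give very different answers when the measurements really are paired.

```python
import numpy as np
from scipy import stats

# Hypothetical per-participant mean RTs (seconds) for 10 people in two conditions.
# Participants differ a lot from each other, but each is about 50 ms slower in B.
cond_a = np.linspace(0.40, 0.60, 10)
wobble = np.array([0.005, -0.005] * 5)   # small per-participant variation
cond_b = cond_a + 0.05 + wobble

t_rel, p_rel = stats.ttest_rel(cond_a, cond_b)   # paired: uses within-person differences
t_ind, p_ind = stats.ttest_ind(cond_a, cond_b)   # unpaired: ignores who is who

print(f'paired:   t = {t_rel:.2f}, p = {p_rel:.2g}')
print(f'unpaired: t = {t_ind:.2f}, p = {p_ind:.2g}')
```

In this toy case the consistent within-person difference is obvious to the paired test, while the unpaired test sees it against the large differences between people.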

Select the data#

Running ttest_rel() is as simple as giving it the two sets of data you want to compare, as arguments. We can pull these directly from our df_avg pandas DataFrame. We’ll do this in a few lines of code below, first assigning each data set to a new variable, and then running the t test.

congr = df_avg[df_avg['flankers'] == 'congruent']['rt']
incongr = df_avg[df_avg['flankers'] == 'incongruent']['rt']

Let’s make sure you understand the code above before we go on. We’ve seen it before, but maybe not in exactly this form — and it is quite complex, but logical.

We start on the first line by testing, for each row of the DataFrame, whether it is associated with congruent trials, which returns a Boolean mask:

df_avg['flankers'] == 'congruent'
0      True
1     False
2     False
3      True
4     False
      ...  
76    False
77    False
78     True
79    False
80    False
Name: flankers, Length: 81, dtype: bool

We embed this inside another selector df_avg[df_avg['flankers'] == 'congruent'], which applies the Boolean mask to the DataFrame, essentially saying, “select from df_avg all the rows associated with congruent trials”.

df_avg[df_avg['flankers'] == 'congruent']
    participant   flankers        rt
 0           s1  congruent  0.455259
 3          s10  congruent  0.471231
 6          s11  congruent  0.417540
 9          s12  congruent  0.429758
12          s13  congruent  0.419096
15          s14  congruent  0.437178
18          s15  congruent  0.548638
21          s16  congruent  0.433748
24          s17  congruent  0.437577
27          s18  congruent  0.488892
30          s19  congruent  0.539020
33           s2  congruent  0.438167
36          s20  congruent  0.462935
39          s21  congruent  0.417553
42          s22  congruent  0.410191
45          s23  congruent  0.549622
48          s24  congruent  0.568396
51          s25  congruent  0.450102
54          s26  congruent  0.528508
57          s27  congruent  0.439243
60           s3  congruent  0.570766
63           s4  congruent  0.401993
66           s5  congruent  0.462927
69           s6  congruent  0.446840
72           s7  congruent  0.628185
75           s8  congruent  0.428642
78           s9  congruent  0.431829

Finally, we add ['rt'] to the end to indicate that, having selected the congruent rows, we actually only want the column with the RT values, because those are what we want to perform the t test on. The second line does the same thing for incongruent trials.

df_avg[df_avg['flankers'] == 'congruent']['rt']
0     0.455259
3     0.471231
6     0.417540
9     0.429758
12    0.419096
15    0.437178
18    0.548638
21    0.433748
24    0.437577
27    0.488892
30    0.539020
33    0.438167
36    0.462935
39    0.417553
42    0.410191
45    0.549622
48    0.568396
51    0.450102
54    0.528508
57    0.439243
60    0.570766
63    0.401993
66    0.462927
69    0.446840
72    0.628185
75    0.428642
78    0.431829
Name: rt, dtype: float64

This last result is what we assign to congr (note, by the way, that this is a pandas Series, not a DataFrame).
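The selection steps described above can be tried on a small example. This is only a sketch: the toy DataFrame and its values are hypothetical. It also shows .loc, a standard pandas alternative that selects the rows and the column in a single step and returns the same Series as the chained form used in this lesson.

```python
import pandas as pd

# Hypothetical aggregated data: 3 participants x 2 conditions
toy = pd.DataFrame({
    'participant': ['s1', 's1', 's2', 's2', 's3', 's3'],
    'flankers': ['congruent', 'incongruent'] * 3,
    'rt': [0.45, 0.47, 0.42, 0.50, 0.55, 0.59],
})

mask = toy['flankers'] == 'congruent'   # Boolean Series, one value per row
chained = toy[mask]['rt']               # the two-step form used in this lesson
with_loc = toy.loc[mask, 'rt']          # rows and column selected in one step

print(with_loc)
```

Note that the result keeps the original row labels (0, 2, 4), which is why the indexes of congr above are discontinuous.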

Likewise, incongr is a Series of the same length (the number of participants):

1     0.471838
4     0.499031
7     0.473012
10    0.506722
13    0.478367
16    0.453524
19    0.591644
22    0.492921
25    0.504452
28    0.527152
31    0.591181
34    0.518216
37    0.507257
40    0.507033
43    0.474612
46    0.554172
49    0.595977
52    0.513179
55    0.565531
58    0.501069
61    0.591022
64    0.428867
67    0.530722
70    0.490298
73    0.650769
76    0.494878
79    0.437926
Name: rt, dtype: float64

Run the t test#

Now we just pass congr and incongr as the first (and only) two arguments to ttest_rel(), and print the results out with some explanatory text. Note that we have to write stats.ttest_rel(), because we imported the stats module from SciPy rather than the function itself.

t, p = stats.ttest_rel(congr, incongr)
print('Congruent vs. Incongruent t = ', str(t), ' p = ', str(p))
Congruent vs. Incongruent t =  -10.209634805365013  p =  1.3739296579820675e-10

We can make the output nicer by rounding to a reasonable level of precision:

print('Congruent vs. Incongruent t = ', str(round(t, 2)), ' p = ', str(round(p, 4)))
Congruent vs. Incongruent t =  -10.21  p =  0.0

Now that’s a result any researcher would be happy to see! The p value is not actually zero, by the way; note that in the original output the p value was reported in scientific notation, ending in e-10. This means that the p value is actually 0.00000000013739. We would typically report this as p < .0001, since we rounded to 4 decimal places (which is fairly typical for reporting p values).
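One way to avoid printing a misleading 0.0 is to format the values with f-strings. The sketch below is illustrative only: format_p is a hypothetical helper (not part of SciPy), and the t and p values are copied from the output above rather than recomputed.

```python
def format_p(p):
    """Hypothetical helper: report very small p values as an inequality."""
    if p < 0.0001:
        return 'p < .0001'
    return f'p = {p:.4f}'

# Values copied from the paired t test output above
t = -10.209634805365013
p = 1.3739296579820675e-10

print(f'Congruent vs. Incongruent t = {t:.2f}, {format_p(p)}')
print(f'p in scientific notation: {p:.2e}')
```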

1-tailed vs. 2-tailed p values#

By default, SciPy’s ttest_ functions return 2-tailed p values. This means that the p value considers both possible directions of difference between the two conditions. In the present example, that means either RTs for congruent are faster than incongruent, or they are slower for congruent than incongruent. In contrast, a 1-tailed p value should be used if we have a specific prediction of a “direction” of the difference. Using a 1-tailed p value will tend to be less conservative, i.e., more likely to find a significant effect. This is because, for a given threshold (e.g., \(\alpha = .05\)), a 2-tailed test effectively splits \(\alpha\) in half, and reflects a probability of 2.5% that the result occurred by chance in one direction (e.g., congruent slower) and a 2.5% probability of getting the reverse result (e.g., congruent faster) by chance. In contrast, a 1-tailed test allocates all of the 5% chance probability to the likelihood of a difference in one direction (e.g., congruent faster).

Practically speaking, 2-tailed tests should be used by default, but if you have a specific a priori hypothesis regarding the direction of the difference, you can use a 1-tailed test. For example, for the flanker experiment we’re working with here, previous research would lead us to the congruent-faster hypothesis.

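In the present example, it really doesn’t matter since the two-tailed p value is wildly significant. However, if you want to convert to a one-tailed p value, and the observed difference is in the predicted direction (as it is here: the negative t value means congruent RTs were faster), you just need to divide p in half: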

print('Congruent vs. Incongruent t = ', str(round(t, 2)), ' p (one-tailed) = ', str(p / 2))
Congruent vs. Incongruent t =  -10.21  p (one-tailed) =  6.869648289910338e-11
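As an alternative to halving p by hand, recent versions of SciPy (1.6 and later, which is an assumption about your installation) accept an alternative argument in the ttest_ functions. The sketch below uses made-up arrays (congr_toy and incongr_toy are hypothetical) to show that alternative='less' matches the halved two-tailed p when the difference is in the predicted direction, and that halving would be wrong for the opposite prediction.

```python
import numpy as np
from scipy import stats

# Hypothetical mean RTs (seconds) for six participants
congr_toy = np.array([0.45, 0.47, 0.42, 0.43, 0.55, 0.44])
incongr_toy = np.array([0.47, 0.50, 0.47, 0.51, 0.59, 0.49])

t, p_two = stats.ttest_rel(congr_toy, incongr_toy)

# alternative='less' tests the directional hypothesis mean(congruent - incongruent) < 0
_, p_less = stats.ttest_rel(congr_toy, incongr_toy, alternative='less')

# The opposite prediction: here the one-tailed p is 1 - p_two / 2, not p_two / 2
_, p_greater = stats.ttest_rel(congr_toy, incongr_toy, alternative='greater')

print(f't = {t:.2f}, two-tailed p = {p_two:.4f}')
print(f'one-tailed (less) p = {p_less:.4f}, one-tailed (greater) p = {p_greater:.4f}')
```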

Be careful about order of data values#

The above paired t test worked properly because in our pandas DataFrame, participants are listed in a consistent order. So when we create separate Series for congruent and incongruent, the same rows of the two Series belong to the same participant. However, this isn’t always guaranteed, and so it’s good practice to do things in a way that ensures proper pairing of participants between data sets.

pandas indexing allows us to do this. Recall that indexes are row labels. By default, when we read a CSV file to a DataFrame, the rows are indexed numerically starting from zero. Indeed, if you look back above at the contents of congr and incongr, you’ll see the indexes in the left column are discontinuous and different between the two, because each data point came from a separate row. To ensure alignment of each participant’s data across the two series, we can first use the participant ID as the index of df_avg, and then create separate Series for each condition:

df_avg = df_avg.set_index('participant')
congr = df_avg[df_avg['flankers'] == 'congruent']['rt']
incongr = df_avg[df_avg['flankers'] == 'incongruent']['rt']

Now when we look at the resulting Series, we see that the participant indexes are preserved:

s1     0.455259
s10    0.471231
s11    0.417540
s12    0.429758
s13    0.419096
s14    0.437178
s15    0.548638
s16    0.433748
s17    0.437577
s18    0.488892
s19    0.539020
s2     0.438167
s20    0.462935
s21    0.417553
s22    0.410191
s23    0.549622
s24    0.568396
s25    0.450102
s26    0.528508
s27    0.439243
s3     0.570766
s4     0.401993
s5     0.462927
s6     0.446840
s7     0.628185
s8     0.428642
s9     0.431829
Name: rt, dtype: float64
s1     0.471838
s10    0.499031
s11    0.473012
s12    0.506722
s13    0.478367
s14    0.453524
s15    0.591644
s16    0.492921
s17    0.504452
s18    0.527152
s19    0.591181
s2     0.518216
s20    0.507257
s21    0.507033
s22    0.474612
s23    0.554172
s24    0.595977
s25    0.513179
s26    0.565531
s27    0.501069
s3     0.591022
s4     0.428867
s5     0.530722
s6     0.490298
s7     0.650769
s8     0.494878
s9     0.437926
Name: rt, dtype: float64

Ensure pandas indexing is used in t tests#

What could go wrong?#

It turns out that SciPy’s ttest functions ignore pandas indexes, so indexing on its own won’t ensure that the t test compares data points from the same individuals. We can see that by randomizing the order of the rows of the incongr2 series, while preserving the relationship between indexes (participant IDs) and RTs (you can compare with above data to confirm that the same RT values are associated with the same IDs as in the original incongr Series):

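df_avg = df_avg.reset_index()
# Copy participant IDs and RTs for the incongruent rows into a NumPy array
inc_arr = np.array(df_avg[df_avg['flankers']=='incongruent'].iloc[:, [0, 2]])
# Shuffle the rows in place; each ID stays attached to its own RT
np.random.shuffle(inc_arr)
incongr2 = pd.DataFrame(inc_arr, columns=['participant', 'rt']).set_index('participant')
incongr2 = pd.Series(incongr2['rt'])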
s5     0.530722
s6     0.490298
s9     0.437926
s19    0.591181
s2     0.518216
s11    0.473012
s21    0.507033
s10    0.499031
s4     0.428867
s14    0.453524
s23    0.554172
s20    0.507257
s25    0.513179
s27    0.501069
s24    0.595977
s3     0.591022
s22    0.474612
s8     0.494878
s12    0.506722
s17    0.504452
s15    0.591644
s18    0.527152
s26    0.565531
s7     0.650769
s1     0.471838
s13    0.478367
s16    0.492921
Name: rt, dtype: object

Now when we run the t test, the t value doesn’t match the t value that we got above with the properly-paired data. In fact, if you re-run the shuffling code above and then the code below multiple times, you will get different t and p values each time due to the random shuffling.

t, p = stats.ttest_rel(congr, incongr2)
print('Congruent vs. Incongruent t = ', str(round(t, 2)), ' p = ', str(round(p, 4)))
Congruent vs. Incongruent t =  -3.04  p =  0.0053
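The same problem can be reproduced deterministically, without random shuffling. This is a small sketch with made-up values (the Series a, b, and b_reordered are hypothetical): b_reordered holds exactly the same label/value pairs as b, only in a different row order, yet ttest_rel() returns a different t value because it pairs by position.

```python
import pandas as pd
from scipy import stats

# Hypothetical Series indexed by participant ID
a = pd.Series([0.45, 0.42, 0.55, 0.40, 0.47], index=['s1', 's2', 's3', 's4', 's5'])
b = pd.Series([0.47, 0.50, 0.59, 0.43, 0.53], index=['s1', 's2', 's3', 's4', 's5'])

# Same labels and values as b, but with the rows in a different order
b_reordered = b.loc[['s3', 's5', 's1', 's4', 's2']]

t_ok, _ = stats.ttest_rel(a, b)
t_bad, _ = stats.ttest_rel(a, b_reordered)   # pairs values by position, not by label

print(f'properly paired: t = {t_ok:.2f}')
print(f'mis-paired:      t = {t_bad:.2f}')
```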

Solution 1: Use DataFrame columns rather than extracting Series#

Above, we extracted the two data sets we wanted to compare with a t test from a DataFrame (df_avg) to two pandas Series, congr and incongr. On the one hand, this simplifies the syntax of the t test command, but on the other hand we lose the structure of the pandas DataFrame. That is, in the DataFrame, each RT value sits in the same row as its participant ID and condition, and so we don’t have to worry about the values being re-ordered independently of those labels. We can run the t test on selections made directly from the pandas DataFrame; the code is just a little more complex to look at:

t, p = stats.ttest_rel(df_avg[df_avg['flankers'] == 'congruent']['rt'],
                       df_avg[df_avg['flankers'] == 'incongruent']['rt'])
print('Congruent vs. Incongruent t = ', str(round(t, 2)), ' p = ', str(round(p, 4)))
Congruent vs. Incongruent t =  -10.21  p =  0.0

This is probably the best approach to use in most cases, because:

  1. It ensures that the repeated-measures structure of the data is preserved

  2. It uses less memory, because we aren’t copying columns of our DataFrame to new Series/variables.

It may in fact seem overly convoluted to have first demonstrated the extract-to-Series approach, then explain that it’s not the ideal way to do things! However, for many people, it’s intuitive to extract subsets of data to perform further processing on. One point of this lesson was to illustrate how that can create problems, even though it might seem like a logical approach.
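A related option, offered here as a sketch rather than as the approach used in this lesson, is to reshape the data to wide format so that each participant really does occupy one row with one column per condition. The long_df data below are hypothetical and deliberately unsorted; DataFrame.pivot() is a standard pandas method.

```python
import pandas as pd
from scipy import stats

# Hypothetical long-format data, deliberately not sorted by participant
long_df = pd.DataFrame({
    'participant': ['s2', 's1', 's3', 's1', 's3', 's2', 's4', 's4'],
    'flankers': ['congruent', 'congruent', 'incongruent', 'incongruent',
                 'congruent', 'incongruent', 'congruent', 'incongruent'],
    'rt': [0.42, 0.45, 0.59, 0.47, 0.55, 0.50, 0.40, 0.43],
})

# One row per participant, one column per condition
wide = long_df.pivot(index='participant', columns='flankers', values='rt')
print(wide)

# Both columns share the same rows, so the values are paired by construction
t, p = stats.ttest_rel(wide['congruent'], wide['incongruent'])
print(f't = {t:.2f}, p = {p:.4f}')
```

Because both columns come from the same rows of wide, there is no separate ordering that could get out of step between conditions.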

Solution 2: Use .sort_index() to ensure paired data are aligned#

If you do choose to work with a pair of Series, the way we can ensure that the indexes of the two data sets align is by re-ordering the data in both Series that we’re comparing (congr and incongr2 in this case), using the pandas .sort_index() method:

t, p = stats.ttest_rel(congr.sort_index(), incongr2.sort_index())
print('Congruent vs. Incongruent t = ', str(round(t, 2)), ' p = ', str(round(p, 4)))
Congruent vs. Incongruent t =  -10.21  p =  0.0

Long story short, it is good practice to index by participant ID, and use the .sort_index() method when applying t tests to pandas Series or DataFrames, to ensure that values are appropriately paired.
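Sorting both Series only pairs the values correctly if the two Series contain exactly the same participants. The sketch below wraps that check in a small function; paired_ttest_by_index is a hypothetical helper written for this illustration (not a SciPy or pandas function), and it uses the pandas .reindex() method to put the second Series in the same label order as the first.

```python
import pandas as pd
from scipy import stats

def paired_ttest_by_index(x, y):
    """Hypothetical helper: pair two Series by index label, then run ttest_rel."""
    if not x.index.is_unique or not y.index.is_unique:
        raise ValueError('index labels must be unique')
    if set(x.index) != set(y.index):
        raise ValueError('the two Series must contain the same participants')
    y_aligned = y.reindex(x.index)   # put y in the same label order as x
    return stats.ttest_rel(x, y_aligned)

# Made-up data; b lists the participants in a different order than a
a = pd.Series([0.45, 0.42, 0.55, 0.40, 0.47], index=['s1', 's2', 's3', 's4', 's5'])
b = pd.Series([0.59, 0.53, 0.47, 0.43, 0.50], index=['s3', 's5', 's1', 's4', 's2'])

t, p = paired_ttest_by_index(a, b)
print(f't = {t:.2f}, p = {p:.4f}')
```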

Testing differences: one-sample t tests#

An alternative way to compare the congruent and incongruent conditions is to compute the difference in mean RTs between the two conditions for each participant (since it is a paired design), and then run a t test on the differences. In this case, we use a one sample t test, in which we compare the data set to zero. In other words, is the difference between the conditions basically zero, or is it significantly different from zero (i.e., a believable difference)?

We can compute the difference between two pandas Series easily just using the - (minus) operator, so in this case we could use congr - incongr.

Note that this subtraction only works if the two Series are indexed by participant ID (or in some way that preserves the alignment of values between the two data sets). However, because we are subtracting two pandas objects, pandas recognizes the indexes in each and aligns them, even if the indexes aren’t in the same order in the two input Series. So we don’t have to worry about using .sort_index() as we did above for paired t tests.
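The index alignment described above can be seen in a toy example (the Series a, b, and c are hypothetical). It also shows one thing to watch for: a participant who is missing from one Series yields NaN in the difference rather than an error, and with the default settings of ttest_1samp() a NaN in the input gives NaN results.

```python
import pandas as pd

a = pd.Series([0.45, 0.42, 0.55], index=['s1', 's2', 's3'])
b = pd.Series([0.59, 0.47, 0.50], index=['s3', 's1', 's2'])   # different row order

diff = a - b   # pandas matches on index labels before subtracting
print(diff)

# A label present in only one Series produces NaN rather than an error
c = pd.Series([0.47, 0.50], index=['s1', 's2'])
print(a - c)
```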

congr_vs_incongr = congr - incongr
t, p = stats.ttest_1samp(congr_vs_incongr, 0)
print('Congruent vs. Incongruent t = ', str(round(t, 2)), ' p = ', str(round(p, 4)))
Congruent vs. Incongruent t =  -10.21  p =  0.0

Note that we get the same result from the 1-sample t test whether we perform the subtraction on two Series that have the same order of indexes, or using incongr2, which has a randomly shuffled order of indexes. We don’t need to explicitly .sort_index() in this case. One wrinkle is that incongr2 was built from a NumPy array that mixed participant IDs and RTs, so its values are stored with dtype object (as shown in its output above). Passing the resulting differences straight to ttest_1samp() fails with ValueError: data type <class 'numpy.object_'> not inexact, so we convert the RTs back to floats before subtracting:

congr_vs_incongr = congr - incongr2.astype(float)
t, p = stats.ttest_1samp(congr_vs_incongr, 0)
print('Congruent vs. Incongruent t = ', str(round(t, 2)), ' p = ', str(round(p, 4)))

Paired vs. 1-sample t tests?#

You’ll note the result of the 1-sample t test is the same as the paired t test above. This is expected, because in both cases we ran a t test to compare the difference between the same two sets of data. From a coding perspective, the paired t test is a bit simpler, because you don’t have to perform a subtraction on the data prior to running the t test.
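The equivalence is easy to check on a small example. In this sketch the arrays x and y are made-up paired measurements; the t value is also computed by hand as the mean of the differences divided by their standard error, which is the one-sample t statistic against zero.

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements
x = np.array([0.45, 0.42, 0.55, 0.40, 0.47])
y = np.array([0.47, 0.50, 0.59, 0.43, 0.53])
d = x - y

# One-sample t statistic computed by hand: mean / (SD / sqrt(n))
n = len(d)
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))

t_paired, p_paired = stats.ttest_rel(x, y)
t_1samp, p_1samp = stats.ttest_1samp(d, 0)

print(f'by hand: {t_manual:.4f}, paired: {t_paired:.4f}, 1-sample: {t_1samp:.4f}')
```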

The reasons we might want to run a 1-sample t test include cases where our data are already represented as a subtraction, or cases where we’re working with multiple variables and performing subtractions is a way of simplifying our presentation of the results. As well, since pandas subtraction respects the indexes, computing differences and then running 1-sample t tests can be a bit safer in ensuring that the proper within-participants nesting structure of your data is preserved.


Summary#

  • t tests are used to compare the means of two sets of data to each other, or the mean of one set of data against a particular value (such as zero)

  • An unpaired t test is used to compare two independent sets of data (e.g., from two different samples of a population, two groups, etc.)

  • A paired t test must be used when the two sets of data come from the same samples (e.g., the same individual participants)

  • A 1-sample t test is used to compare the mean of one set of data against a specific value. This is often used to compare a data set to zero

  • Paired t tests and 1-sample t tests can both be used to determine whether differences between two samples are significantly different from zero (no difference).

    • In the 1-sample case, you must first compute the difference between the pairs of data in two conditions.

  • When working with pandas data objects, it is important to remember that SciPy’s functions (including t tests) do not use pandas indexes. So when doing paired t tests, you must ensure that the data are listed in the same order in the two Series being compared.

    • The best way to ensure that the within-participant/repeated measures structure of the data is preserved when doing a t test, is to use two columns from a DataFrame that is indexed by participant ID.

    • One alternative is to use the .sort_index() method on two series that are indexed by participant ID

    • Another alternative is to use the fact that pandas does respect its indexing when you subtract two Series, so if your data are indexed by participant ID, doing the subtraction followed by a 1-sample t test is a way of ensuring that the within-participants relationships between data sets are preserved.